Evaluate ML/LLM models against standard benchmarks. Runs MMLU, GSM8K, HumanEval, and custom test suites, then generates comparison reports with statistical significance tests and cost-performance analysis. Input: model endpoint or weights. Output: benchmark report and charts. Requires Python 3.11+. Reliability: 92%. Use programmatically via API or CLI.
pip install dejavu-mcp
dejavu skill install model-evaluation
dejavu skill execute model-evaluation --input '...'
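The significance tests and cost-performance analysis mentioned in the description can be illustrated with a small standalone sketch. This is not the skill's own implementation or API: the function names `mcnemar_exact` and `cost_per_correct`, the choice of an exact McNemar test, and the sample data are all assumptions made for illustration. It assumes you already have per-item pass/fail results for two models on the same benchmark items.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness.

    Hypothetical helper: compares two models scored on the same items.
    Returns a p-value; small values suggest the accuracy gap is not chance.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    # Only discordant pairs (one model right, the other wrong) carry signal.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models agree on every item
    k = min(only_a, only_b)
    # Under the null hypothesis, discordant pairs split 50/50 (binomial tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cost_per_correct(total_cost_usd, correct):
    """Hypothetical cost-performance metric: dollars per correct answer."""
    n_correct = sum(1 for c in correct if c)
    return float("inf") if n_correct == 0 else total_cost_usd / n_correct


if __name__ == "__main__":
    # Made-up per-item results for two models on the same 12 items.
    model_a = [True] * 10 + [False] * 2
    model_b = [True] * 6 + [False] * 6
    print("p-value:", mcnemar_exact(model_a, model_b))
    print("cost per correct, model A:", cost_per_correct(0.30, model_a))
```

A paired test is used here because both models answer the same benchmark items, so comparing per-item outcomes is more sensitive than comparing two aggregate accuracy numbers.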
$6.67/month subscription includes 1,000 free credits. This skill uses credits per execution.