Model Evaluation & Benchmarking

Evaluate ML and LLM models against standard benchmarks. The skill runs MMLU, GSM8K, HumanEval, and custom test suites, then generates comparison reports with statistical significance tests and cost-performance analysis.

Input: model endpoint or weights
Output: benchmark report and charts
Requirements: Python 3.11+
Reliability: 92%
Usage: programmatically via API or CLI
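To make "statistical significance tests" concrete, here is a minimal sketch of one common choice for comparing two models scored on the same benchmark items: an exact McNemar test on per-item correctness. This listing does not say which test the skill actually runs, so treat the function name `mcnemar_exact` and the choice of test as illustrative assumptions.

```python
import math


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for two models on the same items.

    Hypothetical helper for illustration; not part of the Dejavu skill.
    correct_a[i] / correct_b[i] say whether each model answered item i correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")

    # Only items where the models disagree carry information.
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference

    # Under the null hypothesis each disagreement favours either model
    # with probability 1/2, so the smaller count follows Binomial(n, 0.5).
    k = min(a_only, b_only)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy data: the models disagree on 10 items, model A wins 8 of them.
    model_a = [True] * 8 + [False] * 2
    model_b = [False] * 8 + [True] * 2
    print(f"p-value: {mcnemar_exact(model_a, model_b):.4f}")
```

A paired test is used because both models see the same questions; comparing the two accuracy figures as if they came from independent samples would ignore that pairing.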

Install via Dejavu

1. Install the Dejavu MCP package:

pip install dejavu-mcp

2. Subscribe at keepingtrack.biz/skills-landing ($6.67/month).

3. Install this skill:

dejavu skill install model-evaluation

4. Your AI agent can now use it:

dejavu skill execute model-evaluation --input '...'
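For programmatic use, the CLI call above can be wrapped from Python with the standard `subprocess` module. The sketch below is an assumption-laden illustration: the helper names `build_command` and `run_skill` and the timeout value are hypothetical. The expected format of the `--input` payload and the shape of the output are not documented here, so the payload is passed through unchanged and stdout is returned as raw text.

```python
import subprocess

SKILL_NAME = "model-evaluation"


def build_command(payload: str) -> list[str]:
    """Build the argv list for the documented `dejavu skill execute` call.

    The payload is forwarded verbatim; its format is not specified here.
    """
    return ["dejavu", "skill", "execute", SKILL_NAME, "--input", payload]


def run_skill(payload: str, timeout_s: float = 600.0) -> str:
    """Run the skill and return its raw stdout (hypothetical wrapper).

    Raises subprocess.CalledProcessError on a non-zero exit code and
    subprocess.TimeoutExpired if the run exceeds timeout_s (an assumed limit).
    """
    result = subprocess.run(
        build_command(payload),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the payload contains spaces or quotes.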

Pricing

The $6.67/month subscription includes 1,000 free credits. This skill consumes credits on each execution.
