Coding Performance with 10 Evaluators (Evaluation Run)
Evaluation of 2 language models on 1 system prompt and 4 coding test prompts, scored by 10 evaluator models via blind comparison.
| Metric | Value |
|---|---|
| Top score | 7.18 (mistral-large-2512) |
| Average score | 5.00 |
| Spread | 4.36 pts |
| Response time | — |
| Total evaluations | 80 (8 total responses) |
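These headline numbers are simple aggregates of the per-model averages. A minimal sketch of that arithmetic in Python (the helper name and data shape are illustrative; the runner-up score of about 2.82 is implied by the reported top score and spread, not listed directly in the report):

```python
def summarize(model_averages: dict[str, float]) -> dict[str, float | str]:
    """Derive the headline metrics from per-model average scores."""
    scores = list(model_averages.values())
    return {
        "top_model": max(model_averages, key=model_averages.get),
        "top_score": max(scores),
        "average_score": round(sum(scores) / len(scores), 2),
        "spread": round(max(scores) - min(scores), 2),
    }

# The report lists a top score of 7.18 and a 4.36-pt spread, which implies
# a runner-up average near 2.82 and an overall mean of 5.00.
print(summarize({
    "mistral-large-2512": 7.18,
    "mistral-small-3.2-24b-instruct": 2.82,  # implied, not shown directly
}))
```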
Executive Insights
Key takeaways from this evaluation
Top Performer: mistral-large-2512 (score 7.18, 4.36 pts ahead of #2)
Model Rankings
Ranked by overall performance score
| Model | Model ID | Rating | Responses | Avg Latency | Cost |
|---|---|---|---|---|---|
| mistral-large-2512 | mistralai/mistral-large-2512 | Good | 4 | — | $0.0014 |
| mistral-small-3.2-24b-instruct | mistralai/mistral-small-3.2-24b-instruct | Needs Improvement | 4 | — | $0.0002 |
Evaluator Consensus
How 10 evaluator models ranked the candidates via blind comparison
Majority agreement: 9 of 10 evaluators agree on the top model.
| Model | Avg Rank | Rank Range | #1 Votes | Latency |
|---|---|---|---|---|
| mistral-large-2512 | 1.1 | #1–2 | 9/10 | — |
| mistral-small-3.2-24b-instruct | 1.9 | #1–2 | 1/10 | — |
Evaluator models: gpt-5.4-mini, gemini-3.1-flash-lite-preview, minimax-m2.7, kimi-k2.5, claude-sonnet-4.6, grok-4.1-fast, deepseek-v3.2, mistral-small-2603, qwen3.5-27b, nova-2-lite-v1
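The consensus figures above can be reproduced from the raw blind-comparison ballots. A minimal sketch, assuming each evaluator submits a complete ranking of both candidates (the function name and ballot layout are illustrative, not the platform's actual API):

```python
def consensus_stats(ballots: list[dict[str, int]]) -> dict[str, dict]:
    """Aggregate blind-comparison ballots: each ballot maps a candidate
    model name to the rank (1 = best) that one evaluator assigned it."""
    stats: dict[str, dict] = {}
    for model in ballots[0]:
        ranks = [ballot[model] for ballot in ballots]
        stats[model] = {
            "avg_rank": sum(ranks) / len(ranks),
            "rank_range": (min(ranks), max(ranks)),
            "first_place_votes": sum(1 for r in ranks if r == 1),
        }
    return stats

# With the report's numbers: 9 evaluators put mistral-large-2512 first and
# 1 puts it second, so its average rank is (9*1 + 1*2) / 10 = 1.1.
ballots = 9 * [{"mistral-large-2512": 1, "mistral-small-3.2-24b-instruct": 2}] \
        + 1 * [{"mistral-large-2512": 2, "mistral-small-3.2-24b-instruct": 1}]
print(consensus_stats(ballots))
```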
Score Comparison
Visual comparison of all model scores
Performance by System Prompt
How each model performs across different evaluation contexts
Top Performer: mistral-large-2512 (7.18)
Performance by Test Prompt
Model results broken down by individual test prompts
| Test Prompt | Responses | Avg Score |
|---|---|---|
| Javascript Function | 2 | 5.00 |
| Write an Interval Merge Function | 2 | 5.00 |
| Debug Python | 2 | 5.00 |
| Refactor Javascript | 2 | 5.00 |
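For a sense of what one of these coding tasks involves, here is a hypothetical reference solution to an interval-merge prompt in Python; the report does not include the actual prompt text or expected language, so this is purely illustrative:

```python
def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping [start, end] intervals, the classic task behind
    an 'interval merge' coding prompt."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_intervals([(1, 3), (2, 6), (8, 10), (15, 18)]))
# -> [(1, 6), (8, 10), (15, 18)]
```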
About This Evaluation
Methodology, criteria weights, and evaluation confidence
| Metric | Value |
|---|---|
| Total Responses | 8 (2 models × 4 test prompts) |
| Total Evaluations | 80 (8 responses × 10 evaluator models) |