Coding Performance with 10 Evaluators
Comprehensive evaluation of 3 language models across 1 system prompt with rigorous benchmarking and scoring criteria.
- Top score: 5.90 (claude-sonnet-4.6)
- Average score: 5.00 (spread: 2.31 pts)
- Average response time: 1614ms
- 120 evaluations across 12 total responses
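The headline numbers are internally consistent. A minimal sketch of how the average and spread follow from the three per-model scores; note that 5.90 is reported directly, while the 5.51 and 3.59 scores are inferred here from the stated "0.39 pts ahead of #2" lead and 2.31-pt spread, not published in the report:

```python
# Per-model average scores; only 5.90 is reported directly.
# 5.51 and 3.59 are inferred from the lead (0.39 pts) and spread (2.31 pts).
scores = {
    "claude-sonnet-4.6": 5.90,
    "gpt-5.3-codex": 5.51,
    "deepseek-v3.2": 3.59,
}

average = sum(scores.values()) / len(scores)          # overall average score
spread = max(scores.values()) - min(scores.values())  # best minus worst
print(f"average={average:.2f} spread={spread:.2f}")
```

With these inferred values, the computed average (5.00) and spread (2.31) match the summary cards above.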
Executive Insights
Key takeaways from this evaluation
Top Performer: claude-sonnet-4.6 with a score of 5.90, 0.39 pts ahead of #2.
Model Rankings
Ranked by overall performance score
| Rank | Model | Rating | Responses | Avg Latency | Cost |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4.6 (anthropic/claude-sonnet-4.6) | Average | 4 | 469ms | $0.0142 |
| 2 | gpt-5.3-codex (openai/gpt-5.3-codex) | Average | 4 | 2745ms | $0.0141 |
| 3 | deepseek-v3.2 (deepseek/deepseek-v3.2) | Needs Improvement | 4 | 1629ms | $0.0004 |
Evaluator Consensus
How 10 evaluator models ranked the candidates via blind comparison
Majority agreement: 7 of 10 evaluators agree on the top model.
| Model | Avg Rank | Range | #1 Votes | Latency |
|---|---|---|---|---|
| claude-sonnet-4.6 | 1.3 | #1–2 | 7/10 | 469ms |
| gpt-5.3-codex | 1.9 | #1–3 | 3/10 | 2745ms |
| deepseek-v3.2 | 2.8 | #2–3 | 0/10 | 1629ms |
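The consensus statistics reduce each evaluator's blind ranking to an average rank, a rank range, and a first-place vote count. A minimal sketch of that aggregation; the ballots below are hypothetical (the report does not publish individual evaluator rankings), chosen only to be consistent with the stats in the table above:

```python
# Hypothetical per-evaluator ballots (best-to-worst); the real ballots are
# not published. These reproduce the reported avg rank / range / #1 votes.
ballots = (
    [["claude-sonnet-4.6", "gpt-5.3-codex", "deepseek-v3.2"]] * 5
    + [["claude-sonnet-4.6", "deepseek-v3.2", "gpt-5.3-codex"]] * 2
    + [["gpt-5.3-codex", "claude-sonnet-4.6", "deepseek-v3.2"]] * 3
)

def consensus(ballots):
    """Aggregate blind rankings into avg rank, rank range, and #1 votes."""
    stats = {}
    for model in ballots[0]:
        ranks = [b.index(model) + 1 for b in ballots]  # 1-based rank per ballot
        stats[model] = {
            "avg_rank": sum(ranks) / len(ranks),
            "range": (min(ranks), max(ranks)),
            "first_votes": ranks.count(1),
        }
    return stats

stats = consensus(ballots)
```

Running this on the hypothetical ballots yields avg rank 1.3, range (1, 2), and 7 first-place votes for claude-sonnet-4.6, matching the table.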
Evaluator models:

- gemini-3.1-flash-lite-preview
- claude-sonnet-4.6
- minimax-m2.7
- kimi-k2.5
- deepseek-v3.2
- grok-4.1-fast
- mistral-small-2603
- qwen3.5-27b
- nova-2-lite-v1
- gpt-5.4-mini
Score Comparison
Visual comparison of all model scores
Performance by System Prompt
How each model performs across different evaluation contexts
Top Performer: claude-sonnet-4.6 (5.90)
Performance by Test Prompt
Model results broken down by individual test prompts
| Test Prompt | Responses | Avg Score |
|---|---|---|
| Javascript Function | 3 | 5.00 |
| Write an Interval Merge Function | 3 | 5.00 |
| Debug Python | 3 | 5.00 |
| Refactor Javascript | 3 | 5.00 |
About This Evaluation
Methodology, criteria weights, and evaluation confidence
- Total responses: 12
- Total evaluations: 120
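The methodology mentions criteria weights, but the criteria and their weights are not listed in this report. As a purely hypothetical illustration (criteria names and weights are invented, not taken from the evaluation), a weighted overall score is typically a weight-normalized sum of per-criterion scores:

```python
# Hypothetical criteria and weights; the report does not publish either.
criteria_scores = {"correctness": 6.5, "readability": 5.5, "efficiency": 5.0}
weights = {"correctness": 0.5, "readability": 0.3, "efficiency": 0.2}

# Weights must sum to 1.0 so the overall score stays on the same scale.
assert abs(sum(weights.values()) - 1.0) < 1e-9

overall = sum(criteria_scores[c] * weights[c] for c in weights)
```

Changing the weights shifts which model ranks first, which is why reports like this one should publish them alongside the scores.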