Coding Performance with 10 Evaluators
An evaluation of 2 language models on coding tasks under 1 system prompt: 6 responses, each blind-scored by a panel of 10 evaluator models.
| Metric | Value |
|---|---|
| Top score | 6.84 (gpt-5.5) |
| Average score | 6.58 |
| Spread | 0.52 pts |
| Avg response time | 2689 ms |
| Total responses | 6 |
| Total evaluations | 60 |
Executive Insights
Key takeaways from this evaluation

- Top Performer: gpt-5.5 (6.84), 0.52 pts ahead of #2
- Best Value: deepseek-v4-pro (6.32), best score-to-cost ratio; see the sketch below
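The "best value" call is easy to reproduce with a score-per-dollar ratio. A minimal sketch in Python, using the figures from the rankings table below; the ratio definition itself is an assumption, since the report does not state how it computes "value":

```python
# Score-to-cost ratio: points of overall score per dollar of run cost.
# Figures come from the Model Rankings table; the formula is an
# assumption -- the report does not publish its exact "value" metric.
models = {
    "gpt-5.5": {"score": 6.84, "cost_usd": 0.0308},
    "deepseek-v4-pro": {"score": 6.32, "cost_usd": 0.0010},
}

for name, m in models.items():
    ratio = m["score"] / m["cost_usd"]
    print(f"{name}: {ratio:,.0f} points per dollar")

# gpt-5.5: 222 points per dollar
# deepseek-v4-pro: 6,320 points per dollar  -> "Best Value"
```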
Model Rankings
Ranked by overall performance score

| Rank | Model | Provider ID | Score | Tier | Responses | Avg Latency | Cost |
|---|---|---|---|---|---|---|---|
| 1 | gpt-5.5 | openai/gpt-5.5 | 6.84 | Above Average | 4 | 1215 ms | $0.0308 |
| 2 | deepseek-v4-pro | deepseek/deepseek-v4-pro | 6.32 | Above Average | 2 | 4162 ms | $0.0010 |
Evaluator Consensus
How 10 evaluator models ranked the candidates via blind comparison

Agreement: split (evaluators disagree on which model is best).

| Model | Avg Rank | Range | #1 Votes | Avg Latency |
|---|---|---|---|---|
| gpt-5.5 | 1.5 | #1–2 | 5/10 | 1215 ms |
| deepseek-v4-pro | 1.5 | #1–2 | 5/10 | 4162 ms |
Evaluator panel: kimi-k2.5, gpt-5.4-mini, gemini-3.1-flash-lite-preview, claude-sonnet-4.6, minimax-m2.7, deepseek-v3.2, grok-4.1-fast, mistral-small-2603, qwen3.5-27b, nova-2-lite-v1
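The consensus stats follow mechanically from the per-evaluator rankings. A minimal sketch of the aggregation, assuming each evaluator submits a full ranking of the two candidates; the report does not publish the raw ballots, so the 5/5 split below is illustrative, chosen to reproduce the published numbers:

```python
from collections import defaultdict

# Illustrative ballots: 10 evaluators each rank the two candidates 1..2.
# The real per-evaluator rankings are not published; a 5/5 split
# reproduces the reported "Avg Rank 1.5, #1 Votes 5/10" consensus.
ballots = (
    [{"gpt-5.5": 1, "deepseek-v4-pro": 2}] * 5
    + [{"gpt-5.5": 2, "deepseek-v4-pro": 1}] * 5
)

ranks = defaultdict(list)
for ballot in ballots:
    for model, rank in ballot.items():
        ranks[model].append(rank)

for model, rs in ranks.items():
    print(f"{model}: avg rank {sum(rs) / len(rs)}, "
          f"range #{min(rs)}-{max(rs)}, #1 votes {rs.count(1)}/{len(rs)}")
# gpt-5.5: avg rank 1.5, range #1-2, #1 votes 5/10
# deepseek-v4-pro: avg rank 1.5, range #1-2, #1 votes 5/10
```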
Score Comparison
[Chart: visual comparison of all model scores; gpt-5.5 at 6.84, deepseek-v4-pro at 6.32]
Performance by System Prompt
How each model performs across different evaluation contexts

This run used a single system prompt, so the breakdown mirrors the overall ranking: gpt-5.5 leads at 6.84.
Performance by Test Prompt
Model results broken down by individual test prompts

| Test Prompt | Responses | Avg Score |
|---|---|---|
| Javascript Function | 2 | 5.00 |
| Write an Interval Merge Function | 1 | 10.00 |
| Debug Python | 2 | 5.00 |
| Refactor Javascript | 1 | 10.00 |
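As a sanity check, the per-prompt averages can be rolled back up into a run-level mean. A minimal sketch; the response-weighted mean is an assumption, since the report's overall scores are per-model and may use a different (e.g. criteria-weighted) aggregation:

```python
# Figures from the "Performance by Test Prompt" table above.
prompts = [
    ("Javascript Function", 2, 5.00),
    ("Write an Interval Merge Function", 1, 10.00),
    ("Debug Python", 2, 5.00),
    ("Refactor Javascript", 1, 10.00),
]

total = sum(n for _, n, _ in prompts)
mean = sum(n * score for _, n, score in prompts) / total
print(f"{total} responses, response-weighted mean {mean:.2f}")
# 6 responses, response-weighted mean 6.67
```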
About This Evaluation
Methodology, criteria weights, and evaluation confidence

Total Responses: 6
Total Evaluations: 60 (each of the 6 responses was blind-scored by all 10 evaluators, 6 × 10 = 60)