PeerLM
Evaluation Report

LLM Evaluation • Comparative Evaluation

Coding Performance with 10 Evaluators

A comparative evaluation of 2 language models on 1 system prompt, scored by 10 evaluator models against weighted criteria (60 evaluations in total).

Top Score: 6.84 (gpt-5.5)
Average Score: 6.58 (spread: 0.52 pts)
Avg Latency: 2689 ms (response time)
Evaluations: 60 (6 total responses)

Executive Insights

Key takeaways from this evaluation

Top Performer: gpt-5.5, score 6.84 (0.52 pts ahead of #2)
Best Value: deepseek-v4-pro, score 6.32 (best score-to-cost ratio)
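
How "Best Value" is computed is not spelled out in this report; a plausible reading, using only the scores and costs listed under Model Rankings below, is a simple score-to-cost ratio (this is an assumption, not a documented PeerLM formula):

    # Hypothetical score-to-cost calculation; the figures come from the
    # Model Rankings section of this report, the formula itself is assumed.
    models = {
        "gpt-5.5":         {"score": 6.84, "cost_usd": 0.0308},
        "deepseek-v4-pro": {"score": 6.32, "cost_usd": 0.0010},
    }

    for name, m in models.items():
        ratio = m["score"] / m["cost_usd"]  # score points per dollar spent
        print(f"{name}: {ratio:,.0f} points per $1")

    # gpt-5.5: 222 points per $1
    # deepseek-v4-pro: 6,320 points per $1  -> "Best Value"

Under that reading, deepseek-v4-pro delivers roughly 28x more score per dollar than gpt-5.5, which is consistent with the "Best Value" call above.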

Model Rankings

Ranked by overall performance score

1. gpt-5.5 (openai/gpt-5.5) • Winner
   Performance Score: 6.84/10 (Above Average)
   Responses: 4 • Avg Latency: 1215 ms • Cost: $0.0308

2. deepseek-v4-pro (deepseek/deepseek-v4-pro)
   Performance Score: 6.32/10 (Above Average)
   Responses: 2 • Avg Latency: 4162 ms • Cost: $0.0010

Evaluator Consensus

How 10 evaluator models ranked the candidates via blind comparison

Split Agreement

Evaluators disagree on which model is best

1. gpt-5.5 • Avg Rank: 1.5 • Range: #1–2 • #1 Votes: 5/10 • Latency: 1215 ms

2. deepseek-v4-pro • Avg Rank: 1.5 • Range: #1–2 • #1 Votes: 5/10 • Latency: 4162 ms

Per-Evaluator Rankings
How each evaluator model individually ranked the candidates

kimi-k2.5 (5 evals): 1. gpt-5.5: 6.67 • 2. deepseek-v4-pro: 5.00

gpt-5.4-mini (6 evals): 1. gpt-5.5: 7.50 • 2. deepseek-v4-pro: 5.00

gemini-3.1-flash-lite-preview (6 evals): 1. gpt-5.5: 10.00 • 2. deepseek-v4-pro: 0.00

claude-sonnet-4.6 (6 evals): 1. gpt-5.5: 10.00 • 2. deepseek-v4-pro: 0.00

minimax-m2.7 (6 evals): 1. deepseek-v4-pro: 10.00 • 2. gpt-5.5: 5.00

deepseek-v3.2 (6 evals): 1. deepseek-v4-pro: 10.00 • 2. gpt-5.5: 5.00

grok-4.1-fast (6 evals): 1. deepseek-v4-pro: 10.00 • 2. gpt-5.5: 5.00

mistral-small-2603 (6 evals): 1. gpt-5.5: 7.50 • 2. deepseek-v4-pro: 5.00

qwen3.5-27b (6 evals): 1. deepseek-v4-pro: 10.00 • 2. gpt-5.5: 5.00

nova-2-lite-v1 (4 evals): 1. deepseek-v4-pro: 10.00 • 2. gpt-5.5: 6.67
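
The consensus figures above (average rank 1.5 and 5/10 first-place votes for each model) follow directly from these per-evaluator rankings. A minimal aggregation sketch, assuming a plain unweighted average of evaluator ranks rather than anything PeerLM-specific:

    from statistics import mean

    # Rank each evaluator gave each candidate (1 = ranked first), transcribed
    # from the Per-Evaluator Rankings section above.
    rankings = {
        "kimi-k2.5":                     {"gpt-5.5": 1, "deepseek-v4-pro": 2},
        "gpt-5.4-mini":                  {"gpt-5.5": 1, "deepseek-v4-pro": 2},
        "gemini-3.1-flash-lite-preview": {"gpt-5.5": 1, "deepseek-v4-pro": 2},
        "claude-sonnet-4.6":             {"gpt-5.5": 1, "deepseek-v4-pro": 2},
        "mistral-small-2603":            {"gpt-5.5": 1, "deepseek-v4-pro": 2},
        "minimax-m2.7":                  {"gpt-5.5": 2, "deepseek-v4-pro": 1},
        "deepseek-v3.2":                 {"gpt-5.5": 2, "deepseek-v4-pro": 1},
        "grok-4.1-fast":                 {"gpt-5.5": 2, "deepseek-v4-pro": 1},
        "qwen3.5-27b":                   {"gpt-5.5": 2, "deepseek-v4-pro": 1},
        "nova-2-lite-v1":                {"gpt-5.5": 2, "deepseek-v4-pro": 1},
    }

    for model in ("gpt-5.5", "deepseek-v4-pro"):
        ranks = [r[model] for r in rankings.values()]
        avg_rank = mean(ranks)        # 1.5 for both candidates
        first_votes = ranks.count(1)  # 5 of 10 for both candidates
        print(f"{model}: avg rank {avg_rank}, #1 votes {first_votes}/{len(ranks)}")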

Score Comparison

Visual comparison of all model scores


Performance by System Prompt

How each model performs across different evaluation contexts

Coding Agent: 6 responses • avg score 6.58
Top Performer: gpt-5.5 (6.84)

1. gpt-5.5: 6.84
2. deepseek-v4-pro: 6.32

Performance by Test Prompt

Model results broken down by individual test prompts

Javascript Function: 2 responses • avg score 5.00
Write an Interval Merge Function: 1 response • avg score 10.00
Debug Python: 2 responses • avg score 5.00
Refactor Javascript: 1 response • avg score 10.00

About This Evaluation

Methodology, criteria weights, and evaluation confidence

Evaluation Criteria
Method: Comparative
Accuracy: 50%
Instruction Following: 50%
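
The exact scoring formula is not published in this report; a reasonable reading of the two equally weighted criteria is a weighted mean of per-criterion scores on the 0 to 10 scale. The sketch below is an assumption for illustration, with made-up criterion scores rather than values from this run:

    # Assumed weighted scoring; the 50/50 weights come from the criteria listed
    # above, but the per-criterion scores here are placeholder values.
    WEIGHTS = {"accuracy": 0.5, "instruction_following": 0.5}

    def overall_score(criterion_scores: dict[str, float]) -> float:
        """Weighted mean of 0-10 criterion scores."""
        return sum(WEIGHTS[c] * s for c, s in criterion_scores.items())

    score = overall_score({"accuracy": 7.0, "instruction_following": 6.7})
    print(f"{score:.2f}")  # 6.85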

Total Responses: 6
Total Evaluations: 60

This report was generated by PeerLM.