PeerLM
Evaluation Report

LLM Evaluation
Comparative Evaluation

Coding Performance with 10 Evaluators

Comparative evaluation of 3 language models on 1 system prompt and 4 test prompts, scored against weighted criteria by 10 evaluator models.

Top Score: 5.90 (claude-sonnet-4.6)
Average Score: 5.00 (spread: 2.31 pts)
Avg Latency: 1614ms
Evaluations: 120 (12 total responses)
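The headline figures are consistent with simple aggregates of the three per-model scores reported under Model Rankings below. The sketch that follows is illustrative only, not PeerLM's implementation, and assumes the spread is the gap between the highest and lowest model score (5.90 − 3.59 = 2.31).

    # Illustrative only: reproduces the headline numbers from the per-model
    # performance scores listed under "Model Rankings" below.
    # Assumes "Spread" means the gap between the best and worst model score.
    scores = {
        "claude-sonnet-4.6": 5.90,
        "gpt-5.3-codex": 5.51,
        "deepseek-v3.2": 3.59,
    }

    top_score = max(scores.values())                    # 5.90
    average_score = sum(scores.values()) / len(scores)  # 15.00 / 3 = 5.00
    spread = top_score - min(scores.values())           # 5.90 - 3.59 = 2.31

    print(f"Top: {top_score:.2f}, Avg: {average_score:.2f}, Spread: {spread:.2f}")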

Executive Insights

Key takeaways from this evaluation

Top Performer: claude-sonnet-4.6 (5.90, 0.39 pts ahead of #2)

Model Rankings

Ranked by overall performance score

1. claude-sonnet-4.6 (anthropic/claude-sonnet-4.6), Winner
   Performance Score: 5.90 / 10 (Average) | Responses: 4 | Avg Latency: 469ms | Cost: $0.0142

2. gpt-5.3-codex (openai/gpt-5.3-codex)
   Performance Score: 5.51 / 10 (Average) | Responses: 4 | Avg Latency: 2745ms | Cost: $0.0141

3. deepseek-v3.2 (deepseek/deepseek-v3.2)
   Performance Score: 3.59 / 10 (Needs Improvement) | Responses: 4 | Avg Latency: 1629ms | Cost: $0.0004

Evaluator Consensus

How 10 evaluator models ranked the candidates via blind comparison

Majority Agreement: 7 of 10 evaluators agree on the top model

1. claude-sonnet-4.6: Avg Rank 1.3 | Range #1–2 | #1 Votes 7/10 | Latency 469ms
2. gpt-5.3-codex: Avg Rank 1.9 | Range #1–3 | #1 Votes 3/10 | Latency 2745ms
3. deepseek-v3.2: Avg Rank 2.8 | Range #2–3 | #1 Votes 0/10 | Latency 1629ms
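The consensus metrics above follow directly from the per-evaluator orderings listed in the next section. The sketch below is a hypothetical aggregation (the example orderings are placeholders, not the report's actual data): average rank, rank range, and #1-vote counts are computed per model, and the majority-agreement figure is simply the largest #1-vote share.

    # Hypothetical aggregation of per-evaluator rankings into consensus metrics.
    # Each inner list is one evaluator's ordering, best candidate first; the two
    # orderings shown are placeholders, not the report's actual data.
    from collections import defaultdict

    rankings = [
        ["claude-sonnet-4.6", "gpt-5.3-codex", "deepseek-v3.2"],
        ["gpt-5.3-codex", "claude-sonnet-4.6", "deepseek-v3.2"],
    ]

    positions = defaultdict(list)
    for ordering in rankings:
        for rank, model in enumerate(ordering, start=1):
            positions[model].append(rank)

    for model, ranks in positions.items():
        avg_rank = sum(ranks) / len(ranks)
        first_votes = ranks.count(1)
        print(f"{model}: avg rank {avg_rank:.1f}, "
              f"range #{min(ranks)}-{max(ranks)}, "
              f"#1 votes {first_votes}/{len(rankings)}")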

Per-Evaluator Rankings
How each evaluator model individually ranked the candidates

gemini-3.1-flash-lite-preview (12 evals): 1. claude-sonnet-4.6 (6.25), 2. gpt-5.3-codex (6.25), 3. deepseek-v3.2 (2.50)
claude-sonnet-4.6 (12 evals): 1. claude-sonnet-4.6 (6.25), 2. gpt-5.3-codex (5.00), 3. deepseek-v3.2 (3.75)
minimax-m2.7 (12 evals): 1. claude-sonnet-4.6 (6.25), 2. gpt-5.3-codex (5.00), 3. deepseek-v3.2 (3.75)
kimi-k2.5 (12 evals): 1. gpt-5.3-codex (8.75), 2. claude-sonnet-4.6 (3.75), 3. deepseek-v3.2 (2.50)
deepseek-v3.2 (12 evals): 1. claude-sonnet-4.6 (7.50), 2. deepseek-v3.2 (5.00), 3. gpt-5.3-codex (2.50)
grok-4.1-fast (12 evals): 1. gpt-5.3-codex (7.50), 2. claude-sonnet-4.6 (3.75), 3. deepseek-v3.2 (3.75)
mistral-small-2603 (12 evals): 1. claude-sonnet-4.6 (7.50), 2. deepseek-v3.2 (5.00), 3. gpt-5.3-codex (2.50)
qwen3.5-27b (12 evals): 1. gpt-5.3-codex (6.25), 2. claude-sonnet-4.6 (5.00), 3. deepseek-v3.2 (3.75)
nova-2-lite-v1 (9 evals): 1. claude-sonnet-4.6 (6.67), 2. gpt-5.3-codex (5.00), 3. deepseek-v3.2 (3.33)
gpt-5.4-mini (12 evals): 1. claude-sonnet-4.6 (6.25), 2. gpt-5.3-codex (6.25), 3. deepseek-v3.2 (2.50)

Score Comparison

Visual comparison of all model scores


Performance by System Prompt

How each model performs across different evaluation contexts

Coding Agent (12 responses • avg score 5.00)

Top Performer: claude-sonnet-4.6 (5.90)

1. claude-sonnet-4.6: 5.90
2. gpt-5.3-codex: 5.51
3. deepseek-v3.2: 3.59

Performance by Test Prompt

Model results broken down by individual test prompts

Javascript Function: 3 responses, avg score 5.00
Write an Interval Merge Function: 3 responses, avg score 5.00
Debug Python: 3 responses, avg score 5.00
Refactor Javascript: 3 responses, avg score 5.00

About This Evaluation

Methodology, criteria weights, and evaluation confidence

Evaluation Criteria
Method: comparative
Accuracy: 50%
Instruction Following: 50%

Total Responses: 12
Total Evaluations: 120
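With two criteria weighted equally, each response's overall score is presumably a weighted average of its per-criterion scores. The sketch below is an assumption about how such a composite could be computed, not PeerLM's actual scoring code; the example criterion scores are made up.

    # Hypothetical composite scoring under the 50/50 criteria weights above.
    # Not PeerLM's implementation; the input scores below are placeholders.
    WEIGHTS = {"Accuracy": 0.5, "Instruction Following": 0.5}

    def composite_score(criterion_scores: dict) -> float:
        """Weighted average of per-criterion scores on a 0-10 scale."""
        return sum(WEIGHTS[name] * score for name, score in criterion_scores.items())

    print(composite_score({"Accuracy": 6.0, "Instruction Following": 5.8}))  # 5.9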

This report was generated by PeerLM.