
Understanding Your Results

Read the leaderboard, recommendations, and response explorer.

After a run completes, the results page provides several views to help you decide which model to use.

Recommendation

An AI-generated summary that identifies the top-performing model and explains why. Includes a statistical significance warning when top models score within a narrow margin.
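
The exact threshold behind that warning isn't stated here, but conceptually it is a closeness check on the top two overall scores. A minimal Python sketch, with made-up scores and an assumed 5% margin:

```python
# Hypothetical overall scores (out of 10) from a finished run.
scores = {"model-a": 8.7, "model-b": 8.5, "model-c": 7.1}

# Sort best-first and compare the top two models.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
(best_name, best_score), (runner_name, runner_score) = ranked[0], ranked[1]

# Assumed rule: warn when the runner-up is within 5% of the leader.
MARGIN = 0.05
if best_score > 0 and (best_score - runner_score) / best_score < MARGIN:
    print(f"Note: {best_name} and {runner_name} score within {MARGIN:.0%} "
          "of each other; the ranking may not be statistically significant.")
```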

Model Tradeoff Analysis

A table comparing each model across score, latency, and cost per response. Three "Pick" badges highlight different optimization strategies (a selection sketch follows the list):

  • Top Quality — highest overall score
  • Fast — lowest latency among models scoring 90%+ of the top score
  • Value — best score-to-cost ratio among models scoring 80%+ of the top score
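
The badge rules above can be expressed as a short selection routine. The sketch below is illustrative only, not the product's actual code; the example stats are invented, and only the 90% and 80% cutoffs come from the list above:

```python
# Hypothetical per-model stats: overall score (0-10), average latency in
# seconds, and cost per response in dollars.
models = [
    {"name": "model-a", "score": 8.7, "latency_s": 2.4, "cost": 0.012},
    {"name": "model-b", "score": 8.5, "latency_s": 0.9, "cost": 0.004},
    {"name": "model-c", "score": 7.3, "latency_s": 0.6, "cost": 0.001},
]

top_score = max(m["score"] for m in models)

# Top Quality: highest overall score.
top_quality = max(models, key=lambda m: m["score"])

# Fast: lowest latency among models scoring at least 90% of the top score.
fast_pool = [m for m in models if m["score"] >= 0.9 * top_score]
fast = min(fast_pool, key=lambda m: m["latency_s"])

# Value: best score-to-cost ratio among models scoring at least 80% of the top score.
value_pool = [m for m in models if m["score"] >= 0.8 * top_score]
value = max(value_pool, key=lambda m: m["score"] / m["cost"])

print("Top Quality:", top_quality["name"])
print("Fast:", fast["name"])
print("Value:", value["name"])
```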

Overall Model Rankings

Ranked cards showing each model's overall score (out of 10), response count, and average latency. Click a model to see its per-criterion breakdown.
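
The page doesn't spell out how the overall score is aggregated; one plausible reading is an average of the per-criterion averages. A hypothetical sketch with invented criterion scores:

```python
from statistics import mean

# Hypothetical per-criterion scores (0-10) for one model's responses.
criterion_scores = {
    "accuracy":    [9, 8, 9, 7],
    "clarity":     [8, 8, 7, 9],
    "conciseness": [6, 7, 8, 7],
}

# Assumed aggregation: average each criterion, then average the criteria
# to get the overall score shown on the ranking card.
per_criterion = {name: mean(vals) for name, vals in criterion_scores.items()}
overall = mean(per_criterion.values())

print(per_criterion)               # the per-criterion breakdown
print(f"Overall: {overall:.1f} / 10")
```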

Per-Evaluator Breakdown

Expandable sections for each evaluator model, showing how that evaluator ranked the generators. Useful for spotting evaluator-specific biases.
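
One way to use this view is to compare how each evaluator ranks the same generators. The sketch below uses hypothetical evaluator and generator names with invented scores; it only illustrates the kind of comparison the breakdown supports:

```python
from statistics import mean

# Hypothetical scores keyed by (evaluator, generator); in practice these
# come from the run's evaluations.
scores = {
    ("eval-x", "gen-a"): [8, 9, 8],
    ("eval-x", "gen-b"): [7, 7, 8],
    ("eval-y", "gen-a"): [7, 7, 6],
    ("eval-y", "gen-b"): [8, 8, 9],
}

# Average each generator's score under each evaluator.
by_evaluator = {}
for (evaluator, generator), vals in scores.items():
    by_evaluator.setdefault(evaluator, {})[generator] = mean(vals)

# If evaluators order the generators differently, that disagreement is the
# kind of evaluator-specific bias this view helps you spot.
for evaluator, per_gen in by_evaluator.items():
    ranking = sorted(per_gen, key=per_gen.get, reverse=True)
    print(evaluator, "prefers", " > ".join(ranking))
```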

Response Explorer

Filter by model and system prompt to see the raw responses, the rendered prompt, and the associated evaluation scores. The latency of each response is also shown.

Actions

  • Set as Baseline — mark this run as the reference point for future comparisons
  • Recompute — re-aggregate scores without calling any APIs (free)
  • Share — create a public link (see Sharing & Exporting)
  • Export — download results as JSON or CSV
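
The export schema isn't documented on this page, so the column names in the sketch below (model, score, latency_s) are assumptions made for illustration; it only shows how a CSV export could feed a quick downstream analysis:

```python
import csv
import io

# Stand-in for a downloaded export; replace the StringIO with
# open("your_export.csv", newline="") to read a real file.
export = io.StringIO(
    "model,score,latency_s\n"
    "model-a,8.7,2.4\n"
    "model-b,8.5,0.9\n"
    "model-b,8.1,1.1\n"
)

# Example downstream use: average score per model from the exported rows.
totals = {}
for row in csv.DictReader(export):
    totals.setdefault(row["model"], []).append(float(row["score"]))

for model, vals in sorted(totals.items()):
    print(model, round(sum(vals) / len(vals), 2))
```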