Understanding Your Results
Read the leaderboard, recommendations, and response explorer.
After a run completes, the results page provides several views to help you make a decision.
Recommendation
An AI-generated summary that identifies the top-performing model and explains why. A statistical significance warning appears when the top models score within a narrow margin of each other.
Model Tradeoff Analysis
A table comparing each model across score, latency, and cost per response. Three "Pick" badges highlight different optimization strategies (a sketch of the selection logic follows the list):
- Top Quality — highest overall score
- Fast — lowest latency among models scoring 90%+ of the top score
- Value — best score-to-cost ratio among models scoring 80%+ of the top score
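For concreteness, here is a minimal sketch of how the badge selection above could be implemented. The `ModelResult` fields, example values, and function name are illustrative; only the 90% and 80% cutoffs come from the rules listed here.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    score: float              # overall score for the model
    latency_ms: float         # average latency per response
    cost_per_response: float  # cost per response, in dollars

def assign_picks(results: list[ModelResult]) -> dict[str, str]:
    """Assign the Top Quality, Fast, and Value badges from a list of results."""
    top_score = max(r.score for r in results)
    top_quality = max(results, key=lambda r: r.score)
    # Fast: lowest latency among models scoring at least 90% of the top score
    fast = min((r for r in results if r.score >= 0.9 * top_score),
               key=lambda r: r.latency_ms)
    # Value: best score-to-cost ratio among models scoring at least 80% of the top score
    value = max((r for r in results if r.score >= 0.8 * top_score),
                key=lambda r: r.score / r.cost_per_response)
    return {"Top Quality": top_quality.name, "Fast": fast.name, "Value": value.name}

# Example (hypothetical numbers):
picks = assign_picks([
    ModelResult("model-a", 9.1, 820.0, 0.004),
    ModelResult("model-b", 8.6, 310.0, 0.001),
])
# -> {"Top Quality": "model-a", "Fast": "model-b", "Value": "model-b"}
```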
Overall Model Rankings
Ranked cards showing each model's overall score (out of 10), response count, and average latency. Click a model to see its per-criterion breakdown.
Per-Evaluator Breakdown
Expandable sections for each evaluator model, showing how that evaluator ranked the generators. Useful for spotting evaluator-specific biases.
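As a rough illustration of what an evaluator-specific bias can look like, the sketch below ranks generators separately for each evaluator and flags models whose rank swings widely between evaluators. The nested score layout, example values, and disagreement threshold are assumptions for illustration, not the tool's data format.

```python
# Illustrative only: layout, scores, and the threshold of 2 are assumptions.
scores = {
    "evaluator-a": {"model-x": 8.4, "model-y": 7.9, "model-z": 6.2},
    "evaluator-b": {"model-x": 6.5, "model-y": 8.8, "model-z": 7.0},
}

def rank(per_evaluator: dict[str, float]) -> dict[str, int]:
    """Return 1-based ranks, highest score first."""
    ordered = sorted(per_evaluator, key=per_evaluator.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}

ranks = {evaluator: rank(model_scores) for evaluator, model_scores in scores.items()}
for model in scores["evaluator-a"]:
    positions = [ranks[evaluator][model] for evaluator in ranks]
    if max(positions) - min(positions) >= 2:
        print(f"{model}: ranked {min(positions)} by one evaluator but {max(positions)} by another")
```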
Response Explorer
Filter by model and system prompt to see the raw responses, the rendered prompt, and the associated evaluation scores, along with latency for each response.
Actions
- Set as Baseline — mark this run as the reference point for future comparisons
- Recompute — re-aggregate scores without calling any APIs (free)
- Share — create a public link (see Sharing & Exporting)
- Export — download results as JSON or CSV
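If you export a run for offline analysis, loading it back is straightforward. The snippet below assumes a JSON export containing a list of per-model records with `model` and `overall_score` fields; these field names are assumptions about the schema, not a documented guarantee.

```python
import json
from pathlib import Path

# Assumed structure: a list of per-model records with hypothetical field names.
records = json.loads(Path("results.json").read_text())
by_score = sorted(records, key=lambda r: r["overall_score"], reverse=True)
for record in by_score:
    print(record["model"], record["overall_score"])
```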