
Understanding Your Results

Read the leaderboard, recommendations, and response explorer.

After a run completes, the results page provides several views to help you decide which model to use.

Recommendation

An AI-generated summary that identifies the top-performing model and explains why. Includes a statistical significance warning when top models score within a narrow margin.
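
The exact threshold behind that warning isn't stated here, but conceptually it is a closeness check on the top two overall scores. A minimal Python sketch, with made-up scores and an assumed 5% margin:

```python
# Hypothetical overall scores (out of 10) from a finished run.
scores = {"model-a": 8.7, "model-b": 8.5, "model-c": 7.1}

# Sort best-first and compare the top two models.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
(best_name, best_score), (runner_name, runner_score) = ranked[0], ranked[1]

# Assumed rule: warn when the runner-up is within 5% of the leader.
MARGIN = 0.05
if best_score > 0 and (best_score - runner_score) / best_score < MARGIN:
    print(f"Note: {best_name} and {runner_name} score within {MARGIN:.0%} "
          "of each other; the ranking may not be statistically significant.")
```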

Model Tradeoff Analysis

A table comparing each model across score, latency, and cost per response. Three "Pick" badges highlight different optimization strategies (a selection sketch follows the list):

  • Top Quality — highest overall score
  • Fast — lowest latency among models scoring 90%+ of the top score
  • Value — best score-to-cost ratio among models scoring 80%+ of the top score
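
The badge rules above can be expressed as a short selection routine. The sketch below is illustrative only, not the product's actual code; the example stats are invented, and only the 90% and 80% cutoffs come from the list above:

```python
# Hypothetical per-model stats: overall score (0-10), average latency in
# seconds, and cost per response in dollars.
models = [
    {"name": "model-a", "score": 8.7, "latency_s": 2.4, "cost": 0.012},
    {"name": "model-b", "score": 8.5, "latency_s": 0.9, "cost": 0.004},
    {"name": "model-c", "score": 7.3, "latency_s": 0.6, "cost": 0.001},
]

top_score = max(m["score"] for m in models)

# Top Quality: highest overall score.
top_quality = max(models, key=lambda m: m["score"])

# Fast: lowest latency among models scoring at least 90% of the top score.
fast_pool = [m for m in models if m["score"] >= 0.9 * top_score]
fast = min(fast_pool, key=lambda m: m["latency_s"])

# Value: best score-to-cost ratio among models scoring at least 80% of the top score.
value_pool = [m for m in models if m["score"] >= 0.8 * top_score]
value = max(value_pool, key=lambda m: m["score"] / m["cost"])

print("Top Quality:", top_quality["name"])
print("Fast:", fast["name"])
print("Value:", value["name"])
```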

Overall Model Rankings

Ranked cards showing each model's overall score (out of 10), response count, and average latency. Click a model to see its per-criterion breakdown.
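
The page doesn't spell out how the overall score is aggregated; one plausible reading is an average of the per-criterion averages. A hypothetical sketch with invented criterion scores:

```python
from statistics import mean

# Hypothetical per-criterion scores (0-10) for one model's responses.
criterion_scores = {
    "accuracy":    [9, 8, 9, 7],
    "clarity":     [8, 8, 7, 9],
    "conciseness": [6, 7, 8, 7],
}

# Assumed aggregation: average each criterion, then average the criteria
# to get the overall score shown on the ranking card.
per_criterion = {name: mean(vals) for name, vals in criterion_scores.items()}
overall = mean(per_criterion.values())

print(per_criterion)               # the per-criterion breakdown
print(f"Overall: {overall:.1f} / 10")
```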

Per-Evaluator Breakdown

Expandable sections for each evaluator model, showing how that evaluator ranked the generators. Useful for spotting evaluator-specific biases.
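
One way to use this view is to compare how each evaluator ranks the same generators. The sketch below uses hypothetical evaluator and generator names with invented scores; it only illustrates the kind of comparison the breakdown supports:

```python
from statistics import mean

# Hypothetical scores keyed by (evaluator, generator); in practice these
# come from the run's evaluations.
scores = {
    ("eval-x", "gen-a"): [8, 9, 8],
    ("eval-x", "gen-b"): [7, 7, 8],
    ("eval-y", "gen-a"): [7, 7, 6],
    ("eval-y", "gen-b"): [8, 8, 9],
}

# Average each generator's score under each evaluator.
by_evaluator = {}
for (evaluator, generator), vals in scores.items():
    by_evaluator.setdefault(evaluator, {})[generator] = mean(vals)

# If evaluators order the generators differently, that disagreement is the
# kind of evaluator-specific bias this view helps you spot.
for evaluator, per_gen in by_evaluator.items():
    ranking = sorted(per_gen, key=per_gen.get, reverse=True)
    print(evaluator, "prefers", " > ".join(ranking))
```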

Response Explorer

Filter by model and system prompt to see the raw responses, the rendered prompt, and the associated evaluation scores. The latency of each response is also shown.

Actions

  • Set as Baseline — mark this run as the reference point for future comparisons
  • Recompute — re-aggregate scores without calling any APIs (free)
  • Share — create a public link (see Sharing & Exporting)
  • Export — download results as JSON or CSV
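
The export schema isn't documented on this page, so the column names in the sketch below (model, score, latency_s) are assumptions made for illustration; it only shows how a CSV export could feed a quick downstream analysis:

```python
import csv
import io

# Stand-in for a downloaded export; replace the StringIO with
# open("your_export.csv", newline="") to read a real file.
export = io.StringIO(
    "model,score,latency_s\n"
    "model-a,8.7,2.4\n"
    "model-b,8.5,0.9\n"
    "model-b,8.1,1.1\n"
)

# Example downstream use: average score per model from the exported rows.
totals = {}
for row in csv.DictReader(export):
    totals.setdefault(row["model"], []).append(float(row["score"]))

for model, vals in sorted(totals.items()):
    print(model, round(sum(vals) / len(vals), 2))
```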