Baselines & Run Comparison
Set baseline runs and compare results side by side.
Setting a Baseline
Mark any completed run as the baseline for its suite using the Set as Baseline button. The baseline run appears with a "Baseline" badge in the run history.
When a new run completes for the same suite, its results are automatically compared against the baseline. The recommendation summary highlights regressions or improvements relative to the baseline.
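The comparison logic can be pictured roughly as follows. This is an illustrative sketch only, not the product's actual API: the score layout (model name to mean score) and the regression threshold are assumptions made for the example.

```python
# Hypothetical sketch of comparing a new run against the baseline.
# Data shapes and the threshold are illustrative assumptions.

BASELINE = {"model-a": 0.82, "model-b": 0.78}  # model -> mean score (assumed)
NEW_RUN = {"model-a": 0.79, "model-b": 0.81}

THRESHOLD = 0.02  # smallest delta worth flagging (assumed)


def summarize(baseline, new_run, threshold=THRESHOLD):
    """Label each model as a regression, improvement, or unchanged."""
    summary = {}
    for model, base_score in baseline.items():
        delta = new_run.get(model, base_score) - base_score
        if delta <= -threshold:
            summary[model] = ("regression", delta)
        elif delta >= threshold:
            summary[model] = ("improvement", delta)
        else:
            summary[model] = ("unchanged", delta)
    return summary


print(summarize(BASELINE, NEW_RUN))
```

A summary like this is what drives the regression/improvement highlights: each model is bucketed by how far its new score drifts from the baseline.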
Run Comparison
Compare any two completed runs side by side from the Runs page. Selecting two runs shows:
- Score differences per model and per criterion
- Latency changes
- Cost differences
- Which models improved or regressed
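Conceptually, the per-model and per-criterion score differences above amount to a nested diff between two result sets. The sketch below illustrates the idea with an assumed layout (model name mapped to criterion scores); it is not the product's export format.

```python
# Illustrative sketch of a side-by-side diff between two runs.
# The nested dict layout (model -> criterion -> score) is an assumption.

run_a = {
    "model-1": {"accuracy": 0.90, "tone": 0.70},
    "model-2": {"accuracy": 0.85, "tone": 0.80},
}
run_b = {
    "model-1": {"accuracy": 0.88, "tone": 0.75},
    "model-2": {"accuracy": 0.87, "tone": 0.80},
}


def diff_runs(a, b):
    """Per-model, per-criterion score deltas (b minus a)."""
    return {
        model: {
            criterion: round(b[model][criterion] - score, 4)
            for criterion, score in criteria.items()
        }
        for model, criteria in a.items()
    }


print(diff_runs(run_a, run_b))
```

A positive delta means the second run scored higher on that criterion; the same subtraction applied to latency or cost yields the other comparison columns.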
When to Use
- After a model update — compare the auto-run result against your baseline to detect regressions
- After editing prompts — see how prompt changes affect scores
- Across configurations — compare different criteria weights or evaluator selections