Baselines & Run Comparison
Set baseline runs and compare results side by side.
Setting a Baseline
Mark any completed run as the baseline for its suite using the Set as Baseline button. The baseline run appears with a "Baseline" badge in the run history.
When a new run completes for the same suite, its results are automatically compared against the baseline. The recommendation summary highlights regressions or improvements relative to the baseline.
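The comparison logic can be pictured roughly as follows. This is an illustrative sketch only, not the product's actual API: the score layout (model name to mean score) and the regression threshold are assumptions made for the example.

```python
# Hypothetical sketch of comparing a new run against the baseline.
# Data shapes and the threshold are illustrative assumptions.

BASELINE = {"model-a": 0.82, "model-b": 0.78}  # model -> mean score (assumed)
NEW_RUN = {"model-a": 0.79, "model-b": 0.81}

THRESHOLD = 0.02  # smallest delta worth flagging (assumed)


def summarize(baseline, new_run, threshold=THRESHOLD):
    """Label each model as a regression, improvement, or unchanged."""
    summary = {}
    for model, base_score in baseline.items():
        delta = new_run.get(model, base_score) - base_score
        if delta <= -threshold:
            summary[model] = ("regression", delta)
        elif delta >= threshold:
            summary[model] = ("improvement", delta)
        else:
            summary[model] = ("unchanged", delta)
    return summary


print(summarize(BASELINE, NEW_RUN))
```

A summary like this is what drives the regression/improvement highlights: each model is bucketed by how far its new score drifts from the baseline.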
Run Comparison
Compare any two completed runs side by side from the Runs page. Selecting two runs shows:
- Score differences per model and per criterion
- Latency changes
- Cost differences
- Which models improved or regressed
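Conceptually, the per-model and per-criterion score differences above amount to a nested diff between two result sets. The sketch below illustrates the idea with an assumed layout (model name mapped to criterion scores); it is not the product's export format.

```python
# Illustrative sketch of a side-by-side diff between two runs.
# The nested dict layout (model -> criterion -> score) is an assumption.

run_a = {
    "model-1": {"accuracy": 0.90, "tone": 0.70},
    "model-2": {"accuracy": 0.85, "tone": 0.80},
}
run_b = {
    "model-1": {"accuracy": 0.88, "tone": 0.75},
    "model-2": {"accuracy": 0.87, "tone": 0.80},
}


def diff_runs(a, b):
    """Per-model, per-criterion score deltas (b minus a)."""
    return {
        model: {
            criterion: round(b[model][criterion] - score, 4)
            for criterion, score in criteria.items()
        }
        for model, criteria in a.items()
    }


print(diff_runs(run_a, run_b))
```

A positive delta means the second run scored higher on that criterion; the same subtraction applied to latency or cost yields the other comparison columns.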
When to Use
- After a model update — compare the auto-run result against your baseline to detect regressions
- After editing prompts — see how prompt changes affect scores
- Across configurations — compare different criteria weights or evaluator selections