How Evaluations Work
The three-phase pipeline: Generate, Evaluate, Aggregate.
Every evaluation run goes through three sequential phases. Understanding this pipeline helps you interpret results and debug issues.
Phase 1: Generate
PeerLM sends each combination of system prompt + test prompt to every generator model. Before calling the provider API, it checks the response cache. If a cached response exists (same model version, same prompt content), it's reused at zero cost.
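The cache lookup can be sketched as follows. This is a minimal illustration, not PeerLM's actual implementation: the `cache_key` and `generate` helpers, and the `model.call` interface, are hypothetical names chosen for the example. The key idea is that the cache key is derived from the model version and the full prompt content, so any change to either produces a fresh API call.

```python
import hashlib

def cache_key(model_id: str, model_version: str,
              system_prompt: str, test_prompt: str) -> str:
    """Derive a deterministic key from model version and prompt content."""
    payload = "\x1f".join([model_id, model_version, system_prompt, test_prompt])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def generate(cache: dict, model, system_prompt: str, test_prompt: str) -> str:
    """Return a cached response if one exists; otherwise call the provider."""
    key = cache_key(model.id, model.version, system_prompt, test_prompt)
    if key in cache:
        return cache[key]          # cache hit: reused at zero cost
    response = model.call(system_prompt, test_prompt)  # provider API call
    cache[key] = response
    return response
```

Because the key includes the model version, upgrading a model naturally invalidates its cached responses while leaving other models' entries intact.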
For each API call, PeerLM checks the model's capabilities and only sends supported parameters (temperature, seed, etc.). The fully-rendered prompt is stored for audit purposes.
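Capability-aware parameter filtering might look like the sketch below. The function name and the shape of the capability set are assumptions for illustration; the point is that unsupported parameters are silently dropped rather than sent to a provider that would reject them.

```python
def filter_params(requested: dict, supported: set) -> dict:
    """Keep only the request parameters the target model supports."""
    return {k: v for k, v in requested.items() if k in supported}

# A model that accepts temperature but not seed simply never sees the seed.
params = filter_params(
    {"temperature": 0.2, "seed": 42},
    supported={"temperature", "max_tokens"},
)
```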
Phase 2: Evaluate
Each evaluator model scores every response against your criteria on a 1-10 scale. When multiple evaluators are configured, each one scores every response independently, which reduces single-model bias.
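The scoring fan-out is essentially a triple loop over responses, evaluators, and criteria. The sketch below is a simplified model of that structure (the `score_matrix` helper and the evaluator's `score` method are hypothetical), showing how each (response, evaluator, criterion) triple gets its own independent 1-10 score.

```python
def score_matrix(responses: dict, evaluators: list, criteria: list) -> dict:
    """Score every response with every evaluator on every criterion.

    Returns a dict keyed by (response_id, evaluator_name, criterion),
    so each evaluator's judgment is kept separate for later aggregation.
    """
    scores = {}
    for r_id, response in responses.items():
        for ev in evaluators:
            for criterion in criteria:
                scores[(r_id, ev.name, criterion)] = ev.score(response, criterion)
    return scores
```

Keeping scores keyed per evaluator, rather than averaging immediately, is what makes partial retries and per-evaluator analysis possible later.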
If an evaluator fails on a specific response (timeout, rate limit, etc.), the error is logged with a category. You can retry individual evaluators later without re-running the entire evaluation.
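Categorized error logging could be sketched like this. The category names and the string-matching heuristic are illustrative assumptions, not PeerLM's real taxonomy; what matters is that each failure is tagged so that a later retry can target just the failed (response, evaluator) pairs.

```python
from enum import Enum

class ErrorCategory(Enum):
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
    OTHER = "other"

def categorize(exc: Exception) -> ErrorCategory:
    """Bucket an evaluator failure into a coarse category (illustrative)."""
    msg = str(exc).lower()
    if "timeout" in msg:
        return ErrorCategory.TIMEOUT
    if "rate limit" in msg or "429" in msg:
        return ErrorCategory.RATE_LIMIT
    return ErrorCategory.OTHER

def safe_score(evaluator, response, criterion, failures: list):
    """Score one response; on failure, record a categorized entry and move on."""
    try:
        return evaluator.score(response, criterion)
    except Exception as exc:
        failures.append((evaluator.name, categorize(exc)))
        return None
```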
Phase 3: Aggregate
Scores are combined using your criteria weights to produce a weighted overall score per model. The system then:
- Ranks models by overall score
- Evaluates pass/fail thresholds (if configured)
- Computes the tradeoff analysis (quality/speed/value picks)
- Generates the AI recommendation summary
- Compares against the baseline run (if one is set)
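The weighting and ranking steps above reduce to a weighted average followed by a sort. A minimal sketch, with hypothetical helper names (weights need not sum to 1, since they are normalized by their total):

```python
def overall_score(criterion_scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores, normalized by total weight."""
    total_weight = sum(weights.values())
    return sum(criterion_scores[c] * w for c, w in weights.items()) / total_weight

def rank(model_scores: dict) -> list:
    """Rank model names by overall score, best first."""
    return sorted(model_scores, key=model_scores.get, reverse=True)
```

For example, criterion scores of 8 (accuracy) and 6 (style) with weights 2 and 1 yield (8·2 + 6·1) / 3 ≈ 7.33.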
Idempotency
All phases are idempotent. If a run is retried, existing responses and scores are skipped rather than duplicated. This makes retries safe and efficient.
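The skip-if-done pattern behind this can be sketched in a few lines (the `run_phase` helper is hypothetical). Each phase walks its work items, skips anything that already has a result, and processes only the remainder, so a retried run never duplicates work.

```python
def run_phase(work_items: list, completed: set, process) -> None:
    """Process only items without an existing result; skip the rest."""
    for item in work_items:
        if item in completed:
            continue              # already done in a previous attempt
        process(item)
        completed.add(item)
```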