Configuring Evaluation Criteria
Define custom criteria with weights to shape scoring.
Criteria define what evaluator models score on. Each criterion contributes to the overall model score based on its weight.
Default Criteria
New suites start with four default criteria:
- Accuracy (weight 3) — factual correctness
- Clarity (weight 2) — clear, well-structured response
- Completeness (weight 2) — covers all aspects of the prompt
- Tone (weight 1) — appropriate style and voice
Custom Criteria
On Pro and Enterprise plans, you can create custom criteria. Each criterion needs:
- Name — short label (e.g., "Code Quality")
- Description — explains what to evaluate. This is included in the evaluator's prompt, so be specific.
- Weight — a numeric multiplier; higher weights give the criterion more influence on the overall score
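The three fields can be sketched as a simple record. This is an illustrative structure only, not the product's actual API; the field names mirror the list above:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str         # short label, e.g. "Code Quality"
    description: str  # included verbatim in the evaluator's prompt, so be specific
    weight: int       # relative influence on the overall score

# Hypothetical custom criterion for illustration
code_quality = Criterion(
    name="Code Quality",
    description="Is the code idiomatic, correct, and well-commented?",
    weight=2,
)
```

Because the description is injected into the evaluator's prompt, treat it as prompt text: concrete, answerable, and scoped to one quality.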
How Weights Work
The overall score is a weighted average: each criterion's score is multiplied by its weight, summed, and divided by the total weight. If you have Accuracy (weight 3) and Tone (weight 1), the total weight is 4, so Accuracy counts for 3/4 = 75% of the overall score and Tone for 1/4 = 25%.
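The weighted average above can be sketched in a few lines (a minimal illustration, assuming per-criterion scores are already available; function and variable names are our own, not the product's):

```python
def overall_score(scores: dict, weights: dict) -> float:
    """Weighted average: sum(score * weight) / sum(weights)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

weights = {"Accuracy": 3, "Tone": 1}   # Accuracy: 3/4 = 75%, Tone: 1/4 = 25%
scores = {"Accuracy": 4.0, "Tone": 2.0}
print(overall_score(scores, weights))  # (4.0*3 + 2.0*1) / 4 = 3.5
```

Note that weights are relative: doubling every weight leaves the overall score unchanged, since only the ratios matter.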
Tip: Write clear, specific criterion descriptions. The evaluator model sees this text when scoring, so vague descriptions lead to inconsistent scores. Instead of "Is it good?", try "Does the response provide accurate, verifiable information without hallucination?"