Configuring Evaluation Criteria
Define custom criteria with weights to shape scoring.
Criteria define what evaluator models score on. Each criterion contributes to the overall model score based on its weight.
Default Criteria
New suites start with four default criteria:
- Accuracy (weight 3) — factual correctness
- Clarity (weight 2) — clear, well-structured response
- Completeness (weight 2) — covers all aspects of the prompt
- Tone (weight 1) — appropriate style and voice
Custom Criteria
On Pro and Enterprise plans, you can create custom criteria. Each criterion needs:
- Name — short label (e.g., "Code Quality")
- Description — explains what to evaluate. This is included in the evaluator's prompt, so be specific.
- Weight — a numeric multiplier; higher weights give the criterion more influence on the overall score
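The three fields can be sketched as a simple record. This is an illustrative structure only, not the product's actual API; the field names mirror the list above:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str         # short label, e.g. "Code Quality"
    description: str  # included verbatim in the evaluator's prompt, so be specific
    weight: int       # relative influence on the overall score

# Hypothetical custom criterion for illustration
code_quality = Criterion(
    name="Code Quality",
    description="Is the code idiomatic, correct, and well-commented?",
    weight=2,
)
```

Because the description is injected into the evaluator's prompt, treat it as prompt text: concrete, answerable, and scoped to one quality.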
How Weights Work
The overall score is a weighted average: each criterion's score is multiplied by its weight, summed, and divided by the total weight. If you have Accuracy (weight 3) and Tone (weight 1), the total weight is 4, so Accuracy counts for 3/4 = 75% of the overall score and Tone for 1/4 = 25%.
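The weighted average above can be sketched in a few lines (a minimal illustration, assuming per-criterion scores are already available; function and variable names are our own, not the product's):

```python
def overall_score(scores: dict, weights: dict) -> float:
    """Weighted average: sum(score * weight) / sum(weights)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

weights = {"Accuracy": 3, "Tone": 1}   # Accuracy: 3/4 = 75%, Tone: 1/4 = 25%
scores = {"Accuracy": 4.0, "Tone": 2.0}
print(overall_score(scores, weights))  # (4.0*3 + 2.0*1) / 4 = 3.5
```

Note that weights are relative: doubling every weight leaves the overall score unchanged, since only the ratios matter.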
Tip: Write clear, specific criterion descriptions. The evaluator model sees this text when scoring, so vague descriptions lead to inconsistent scores. Instead of "Is it good?", try "Does the response provide accurate, verifiable information without hallucination?"