Creating Your First Suite
Configure generators, evaluators, criteria, and run settings.
An evaluation suite is a saved configuration that defines what gets evaluated and how. You can run the same suite repeatedly to track changes over time.
Generator Models
Generator models are the models being evaluated. They receive your system and test prompts and produce responses. You can select models from any available provider; your plan determines which model tiers are accessible.
Evaluator Models
Evaluator models judge the quality of generator responses. Using multiple evaluators from different providers reduces scoring bias. Each evaluator scores every response against your criteria.
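To illustrate why several evaluators help, here is a minimal sketch that combines the scores different evaluators give to one response on one criterion. The evaluator names, the scores, and the use of a plain average are assumptions for illustration; the product's actual aggregation method is not documented here.

```python
from statistics import mean

# Hypothetical scores from three evaluators for one response on one criterion.
evaluator_scores = {
    "evaluator-a": 8.0,
    "evaluator-b": 6.5,
    "evaluator-c": 7.5,
}

values = list(evaluator_scores.values())

# Assumed aggregation: a simple mean across evaluators.
consensus = mean(values)

# The spread shows how much individual evaluators disagree.
spread = max(values) - min(values)

print(f"consensus={consensus:.2f} spread={spread:.2f}")
```

A large spread suggests that a single evaluator's score would have depended heavily on which evaluator was chosen.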
System Prompts & Test Prompts
Select prompts from your Library. Every combination of system prompt x test prompt x generator model produces one response per sample (see Responses per Combination under Configuration). Adding prompts yields more data points but increases the credit cost.
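The combination count grows multiplicatively, which the following sketch makes concrete. All prompt and model names are hypothetical placeholders, not items from a real Library.

```python
from itertools import product

# Hypothetical selections, for illustration only.
system_prompts = ["concise-assistant", "detailed-assistant"]
test_prompts = ["refund-policy", "password-reset", "shipping-delay"]
generator_models = ["model-a", "model-b"]

# One entry per (system prompt, test prompt, generator model) combination.
combinations = list(product(system_prompts, test_prompts, generator_models))

print(len(combinations))  # 2 * 3 * 2 = 12
```

Adding one more test prompt here would add four combinations (2 system prompts x 2 models), not one.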
Criteria
Criteria define the dimensions that evaluators score. Each criterion has a name, a description, and a weight. A criterion with a higher weight contributes more to the overall score.
Default criteria include Accuracy, Clarity, Completeness, and Tone. Custom criteria are available on Pro and Enterprise plans.
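One common way to combine weighted criteria is a weighted mean. The sketch below assumes that approach; the exact formula the platform uses is not stated here, and the weights and scores are made-up values.

```python
def overall_score(scores, weights):
    """Weighted mean of per-criterion scores (assumed aggregation)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical weights: Accuracy counts twice as much as the others.
weights = {"Accuracy": 2.0, "Clarity": 1.0, "Completeness": 1.0, "Tone": 1.0}
# Hypothetical scores for a single response.
scores = {"Accuracy": 8.0, "Clarity": 6.0, "Completeness": 7.0, "Tone": 9.0}

result = overall_score(scores, weights)
print(result)  # (16 + 6 + 7 + 9) / 5 = 7.6
```

With equal weights the same scores would average 7.5, so doubling the weight on Accuracy pulls the overall score toward the Accuracy value.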
Configuration
- Deterministic Mode: attempts to use temperature=0 and a fixed seed for reproducible outputs (model support varies)
- Responses per Combination: how many times each prompt and model combination is sampled (more samples give more reliable scores but cost more credits)
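The link between sample count and reliability can be sketched with the standard error of the mean, which shrinks as the number of samples grows. The scores below are hypothetical, and the platform may summarize samples differently.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(sample_scores):
    """Return (mean, standard error) for repeated samples of one combination."""
    n = len(sample_scores)
    avg = mean(sample_scores)
    # Standard error needs at least two samples; otherwise it is undefined.
    stderr = stdev(sample_scores) / sqrt(n) if n > 1 else float("nan")
    return avg, stderr


# Hypothetical scores from three samples of the same prompt and model combination.
avg, stderr = summarize([7.0, 8.0, 9.0])
print(f"mean={avg:.2f} stderr={stderr:.3f}")
```

With a single sample there is no way to estimate the variability, which is why one response per combination gives the least reliable score.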
Estimated Credits
The suite builder shows a real-time credit estimate based on the formula: (generator weights + evaluator weights) x system prompts x test prompts x samples. See Credits & Pricing for details.
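The formula above can be worked through as a short sketch. It assumes that "generator weights" and "evaluator weights" mean the sum of a per-model credit weight over the selected models; that reading, the function name, and the numbers are illustrative, so check Credits & Pricing for the actual values.

```python
def estimate_credits(generator_weights, evaluator_weights,
                     n_system_prompts, n_test_prompts, samples):
    """Apply the documented formula under the stated assumption about weights."""
    model_weight = sum(generator_weights) + sum(evaluator_weights)
    return model_weight * n_system_prompts * n_test_prompts * samples


# Hypothetical suite: two generators, two evaluators, 2 system prompts,
# 3 test prompts, 2 responses per combination.
credits = estimate_credits(
    generator_weights=[1, 2],
    evaluator_weights=[1, 1],
    n_system_prompts=2,
    n_test_prompts=3,
    samples=2,
)
print(credits)  # (3 + 2) * 2 * 3 * 2 = 60
```

Because every factor multiplies the others, doubling the samples or the number of test prompts doubles the estimate.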