Creating Your First Suite

Configure generators, evaluators, criteria, and run settings.

An evaluation suite is a saved configuration that defines what gets evaluated and how. You can run the same suite repeatedly to track changes over time.

Generator Models

These are the models being evaluated. They receive your system and test prompts and produce responses. Select models from any available provider. Your plan determines which model tiers are accessible.

Evaluator Models

Evaluator models judge the quality of generator responses. Using multiple evaluators from different providers reduces scoring bias. Each evaluator scores every response against your criteria.

System Prompts & Test Prompts

Select prompts from your Library. Every combination of system prompt × test prompt × generator model produces one response (or more, if Responses per Combination is set above 1). Adding prompts yields more data points but raises the credit cost.
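To see how quickly combinations multiply, here is a minimal Python sketch that enumerates them. The prompt and model names are hypothetical placeholders, not items from any real Library.

```python
from itertools import product

# Hypothetical names, used only for illustration.
system_prompts = ["concise-assistant", "formal-assistant"]
test_prompts = ["refund-request", "bug-report", "pricing-question"]
generator_models = ["model-a", "model-b"]


def list_combinations(system_prompts, test_prompts, generator_models):
    """Return every (system prompt, test prompt, generator model) triple."""
    return list(product(system_prompts, test_prompts, generator_models))


combinations = list_combinations(system_prompts, test_prompts, generator_models)
print(len(combinations))  # 2 * 3 * 2 = 12
```

Adding one more test prompt to this example raises the count from 12 to 16, since every new prompt is paired with each system prompt and each generator model.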

Criteria

Criteria define what evaluators score on. Each criterion has a name, a description, and a weight. A criterion with a higher weight contributes more to the overall score.

Default criteria include Accuracy, Clarity, Completeness, and Tone. Custom criteria are available on Pro and Enterprise plans.
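The exact aggregation formula is not stated here. As an illustration only, the sketch below assumes the overall score is a weighted mean of per-criterion scores. The weights, the scores, and the 0 to 10 scale are hypothetical.

```python
def overall_score(scores, weights):
    """Weighted mean of per-criterion scores (assumed aggregation, not confirmed)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical weights and scores for the default criteria.
weights = {"Accuracy": 2.0, "Clarity": 1.0, "Completeness": 1.0, "Tone": 1.0}
scores = {"Accuracy": 9.0, "Clarity": 6.0, "Completeness": 7.0, "Tone": 8.0}

print(overall_score(scores, weights))  # (18 + 6 + 7 + 8) / 5 = 7.8
```

With equal weights the same scores would average 7.5, so doubling the weight on Accuracy pulls the overall score toward the Accuracy score.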

Configuration

Deterministic Mode: attempts to use temperature=0 and a fixed seed for reproducible outputs. Support varies by model.
Responses per Combination: how many times each prompt and model pair is sampled. More samples give more reliable scores but cost more credits.
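
One way to see why more samples give more reliable scores is the standard error of the mean, which shrinks as the sample count grows. The sketch below is a generic statistical illustration, not a description of how the platform computes anything. The sample scores are hypothetical.

```python
import math
import statistics


def mean_and_standard_error(sample_scores):
    """Return the mean score and its standard error (None for fewer than 2 samples)."""
    mean = statistics.mean(sample_scores)
    if len(sample_scores) < 2:
        return mean, None
    std_error = statistics.stdev(sample_scores) / math.sqrt(len(sample_scores))
    return mean, std_error


# Hypothetical scores from sampling one prompt and model pair four times.
mean, std_error = mean_and_standard_error([6.0, 8.0, 7.0, 7.0])
print(mean, std_error)
```

For a fixed spread of scores, quadrupling the number of samples halves the standard error, which is the trade-off against the extra credit cost.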

Estimated Credits

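The suite builder shows a real-time credit estimate based on the formula: (generator weights + evaluator weights) × system prompts × test prompts × samples. Here, system prompts and test prompts are the numbers of prompts selected, and samples is the Responses per Combination setting. See Credits & Pricing for details.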
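
The formula can be sketched in Python as follows. This assumes that "generator weights" and "evaluator weights" mean the sum of a per-model credit weight over the selected models. The weight values are hypothetical, and actual figures come from Credits & Pricing.

```python
def estimate_credits(generator_weights, evaluator_weights,
                     n_system_prompts, n_test_prompts, samples):
    """Apply (generator weights + evaluator weights) x prompts x samples.

    Assumes each list holds one credit weight per selected model.
    """
    model_weight = sum(generator_weights) + sum(evaluator_weights)
    return model_weight * n_system_prompts * n_test_prompts * samples


# Hypothetical suite: two generators, two evaluators, 2 system prompts,
# 3 test prompts, 2 responses per combination.
print(estimate_credits([1.0, 2.0], [1.0, 1.0], 2, 3, 2))  # (3 + 2) * 2 * 3 * 2 = 60.0
```

Because every factor multiplies the others, doubling the samples or the number of test prompts doubles the estimate.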