
What is PeerLM?

Overview of the platform and core workflow.

PeerLM is an LLM evaluation platform that lets you compare language models side by side using structured, reproducible evaluations. Instead of manually testing prompts in different chat interfaces, PeerLM automates the process and produces ranked leaderboards.

Core Workflow

  1. Build your Library — Create system prompts (personas) and test prompts (tasks) that define what you want to evaluate.
  2. Configure a Suite — Select which models generate responses, which models evaluate them, and what criteria matter (sketched conceptually after this list).
  3. Run an Evaluation — PeerLM sends every prompt to every model, then has evaluator models score the responses.
  4. Review Results — See a ranked leaderboard with scores, latency, cost, and an AI-generated recommendation.
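
Conceptually, a suite pairs prompts from your Library with generator models, evaluator models, and scoring criteria, and each run produces one leaderboard row per model. The TypeScript sketch below is purely illustrative; the type and field names are hypothetical and do not reflect PeerLM's actual schema or API.

    // Hypothetical shape of a suite configuration (illustrative only;
    // not PeerLM's actual schema or API).
    interface SuiteConfig {
      name: string;
      systemPrompts: string[];    // personas from your Library
      testPrompts: string[];      // tasks from your Library
      generatorModels: string[];  // models that produce responses
      evaluatorModels: string[];  // models that score the responses
      criteria: string[];         // e.g. "accuracy", "tone", "conciseness"
    }

    // Hypothetical shape of one row in the results leaderboard.
    interface LeaderboardRow {
      model: string;
      averageScore: number;   // mean of all evaluator scores for this model
      latencyMs: number;      // average response time
      costUsd: number;        // total spend for this model's responses
    }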

Who is it for?

  • Product teams choosing which LLM to integrate.
  • AI engineers testing prompt changes across models before deploying.
  • Procurement teams comparing vendors with objective, shareable reports.

Key Capabilities

  • 400+ models across providers (OpenAI, Anthropic, xAI, and more)
  • Multi-evaluator scoring to reduce bias
  • Auto-Run for continuous regression testing
  • Shareable public reports for stakeholder alignment
  • Response caching to save credits on repeat runs