Evals in production: how we score multi-agent systems

In a demo, "the agent gave a great answer" is enough. In production, it isn't.

If a CAPA report names the wrong root cause, a regulator notices. If an MDM policy is misclassified, the wrong devices get reconfigured at 3 a.m. The cost of being wrong is real, and vibes won't catch the mistake before the customer does.

So every multi-agent system we ship goes through the same three-tier eval harness. This post walks through what's in it.

Tier 1 — golden-set regression evals

The simplest tier, and the highest leverage. We curate a small set — typically 50 to 200 examples — of inputs with known-good expected outputs. These come from real production data, scrubbed for PII.

On every model swap, prompt change, or orchestrator update, we replay the golden set and diff outputs against the recorded expected outputs.

The diff is rubric-based, not string-match. A rubric is just a small list of YES/NO checks, each one a question that can be answered about a single output.

A judge model scores each rubric on each output. We track pass rates over time per rubric, not aggregated. A 92% overall pass rate hides which 8% are failing, and the failures rarely distribute evenly.
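To make the per-rubric tracking concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the rubric names and questions, the data, and especially `stub_judge`, a keyword check standing in for the judge-model call so the example runs offline.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical rubric: names and questions are illustrative, not our real checks.
RUBRIC = {
    "names_root_cause": "Does the output name the expected root cause?",
    "cites_evidence": "Does the output cite supporting evidence?",
}

# A judge answers one YES/NO question about one output.
Judge = Callable[[str, str, str], bool]


def stub_judge(question: str, output: str, expected: str) -> bool:
    """Offline stand-in for a judge-model call (assumption: the real judge is an LLM)."""
    if "root cause" in question:
        return expected.lower() in output.lower()
    return "because" in output.lower()


def per_rubric_pass_rates(golden_set, outputs, rubric, judge: Judge):
    """Fraction of golden examples that pass each rubric check, kept separate per check."""
    passes = defaultdict(int)
    for example, output in zip(golden_set, outputs):
        for name, question in rubric.items():
            if judge(question, output, example["expected"]):
                passes[name] += 1
    return {name: passes[name] / len(golden_set) for name in rubric}


# Made-up golden examples and replayed outputs.
golden_set = [
    {"input": "report-1", "expected": "seal failure"},
    {"input": "report-2", "expected": "operator error"},
]
outputs = [
    "Root cause: seal failure, because the gasket log shows wear.",
    "Root cause: calibration drift, because the sensor log shows an offset.",
]

rates = per_rubric_pass_rates(golden_set, outputs, RUBRIC, stub_judge)
print(rates)
```

In this toy run the aggregate pass rate is 75%, but the breakdown shows that every failure sits in `names_root_cause`, which is the signal the aggregate hides.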

Tier 2 — adversarial / red-team evals

Golden sets catch regressions. They don't catch novel failures. So we maintain a growing adversarial set of inputs designed to break the system: ambiguous prompts, partial data, contradictory evidence, prompt injection attempts.

Every released system runs the adversarial set nightly. Each new failure gets triaged: if it's a real bug, we fix it. If it's out-of-scope behavior, it goes into a documented "out of scope" set that we re-validate every release. Both the adversarial set and the out-of-scope set are version-controlled next to the code.
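The nightly triage step can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual tooling: the `AdversarialResult` type, the case IDs, and the three bucket names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdversarialResult:
    case_id: str   # hypothetical ID of one adversarial input
    passed: bool   # whether the system handled it acceptably


def triage(results, out_of_scope_ids):
    """Sort a nightly run into buckets for follow-up.

    needs_triage:        failures not yet documented (bug, or new out-of-scope case)
    known_out_of_scope:  failures already documented as out of scope
    now_passing:         documented out-of-scope cases that passed, so the
                         documentation should be re-validated
    """
    buckets = {"needs_triage": [], "known_out_of_scope": [], "now_passing": []}
    for r in results:
        documented = r.case_id in out_of_scope_ids
        if not r.passed:
            key = "known_out_of_scope" if documented else "needs_triage"
            buckets[key].append(r.case_id)
        elif documented:
            buckets["now_passing"].append(r.case_id)
    return buckets


# Made-up nightly results and out-of-scope list.
nightly = [
    AdversarialResult("inj-001", passed=False),
    AdversarialResult("ambig-007", passed=False),
    AdversarialResult("partial-003", passed=True),
    AdversarialResult("contra-002", passed=True),
]
out_of_scope = {"ambig-007", "contra-002"}

report = triage(nightly, out_of_scope)
print(report)
```

Keeping the out-of-scope IDs in a plain file next to the code means a change to that list shows up in review like any other diff.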

Tier 3 — live shadow evals

Some failure modes only show up in real traffic. So for the first 90 days of production, we run a shadow agent in parallel with the live one. The shadow uses a slightly different config, and we compare the two sets of outputs offline.

Disagreements get a human label. Over weeks, this surfaces the long-tail failures the golden set never anticipated. After 90 days the shadow is either graduated or retired.
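As a rough sketch of the offline comparison, assume each agent's output carries a structured `decision` field keyed by request ID. That field, the record layout, and the sample values are assumptions made for this example; free-text outputs would need a rubric or judge in place of the equality check.

```python
def find_disagreements(live, shadow, key="decision"):
    """Compare live and shadow outputs per request; return (labeling_queue, rate)."""
    queue = []
    compared = 0
    for request_id, live_out in live.items():
        shadow_out = shadow.get(request_id)
        if shadow_out is None:
            continue  # the shadow never saw this request, nothing to compare
        compared += 1
        if live_out[key] != shadow_out[key]:
            queue.append({
                "request_id": request_id,
                "live": live_out[key],
                "shadow": shadow_out[key],
                "human_label": None,  # filled in later by a reviewer
            })
    rate = len(queue) / compared if compared else 0.0
    return queue, rate


# Made-up logged outputs from the two agents.
live_outputs = {
    "r1": {"decision": "reconfigure"},
    "r2": {"decision": "no_action"},
    "r3": {"decision": "escalate"},
}
shadow_outputs = {
    "r1": {"decision": "reconfigure"},
    "r2": {"decision": "escalate"},
}

labeling_queue, disagreement_rate = find_disagreements(live_outputs, shadow_outputs)
print(labeling_queue, disagreement_rate)
```

The disagreement rate over time is one possible input to the graduate-or-retire decision, though the human labels on the queue matter more than the raw rate.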

What we don't do

We don't trust a single eval score. Per-rubric pass rates only. The headline number is for executives; the per-rubric breakdown is for engineers.
We don't trust judge models alone. Every quarter, a human-labeled subsample re-calibrates the judge. Judges drift when the underlying model is upgraded.
We don't ship without a release-blocker rubric. There's always one rubric — usually safety-related — where regression is non-negotiable, regardless of overall score.
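Two of these rules lend themselves to a short sketch: the release-blocker check and the judge calibration check. The function names, the rubric names, and the use of simple agreement rate as the calibration measure are all assumptions for illustration, not a description of our harness.

```python
def release_gate(baseline, candidate, blockers, tolerance=0.0):
    """Block the release if any blocker rubric regressed, whatever the other rubrics did."""
    reasons = []
    for rubric in blockers:
        if candidate[rubric] < baseline[rubric] - tolerance:
            reasons.append(f"{rubric}: {baseline[rubric]:.2f} -> {candidate[rubric]:.2f}")
    return (not reasons, reasons)


def judge_agreement(judge_labels, human_labels):
    """Share of subsampled items where the judge's YES/NO matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Made-up per-rubric pass rates for the current release and a candidate.
baseline = {"no_unsafe_action": 1.00, "cites_evidence": 0.90}
candidate = {"no_unsafe_action": 0.98, "cites_evidence": 0.97}

ok, reasons = release_gate(baseline, candidate, blockers=["no_unsafe_action"])
print(ok, reasons)

# Made-up quarterly calibration subsample.
agreement = judge_agreement([True, True, False, True], [True, False, False, True])
print(agreement)
```

Here the candidate's average pass rate went up, yet the gate still fails because the blocker rubric dropped. A falling agreement rate after a model upgrade is one sign the judge needs re-calibrating.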

Tooling

We run the harness on top of Inspect AI for orchestration and a custom dashboard for the trend lines. We'll write up the dashboard in a follow-up post.

If you're shipping multi-agent systems into regulated industries and you don't have evals like this, please write to us before you ship.