Implementation · LLM Evaluation

LLM evaluation harnesses that catch failures before customers do.

Golden sets, LLM-as-judge, regression gates, production tracing.

We build evaluation harnesses for teams running LLM systems in production. The deliverable is a golden test suite, an LLM-as-judge scoring layer, a regression gate wired into deploy, and a production tracing pipeline that turns customer complaints into reproducible test cases. Without this, you ship by vibes.

  • Golden sets: curated test corpus
  • LLM-as-judge: scoring at scale
  • Pre-deploy: regression gate on every change
  • 6–10 weeks: typical harness implementation

Problem framing

Vibes-based AI deployment is the default failure mode.

Most teams ship LLM features without a working evaluation harness. The prompt is tuned by feel against a handful of examples, the model swap from one Claude version to the next is approved by the engineer who happened to test it on Tuesday, and the first signal that quality regressed is a customer support ticket. This is not unusual; it is the default. Public benchmark scores do not predict your production behaviour because your distribution is not the benchmark's distribution.

The work is unglamorous. A golden test set is a curated corpus of inputs paired with expected outputs or scoring criteria, kept under version control, expanded continuously from production traffic and incident reports. An LLM-as-judge layer applies graded rubrics at scale where exact-match scoring fails — long-form generation, RAG answers, tool-use trajectories. A regression gate runs the suite against every prompt change, every model version bump, every retrieval index update, and blocks deploy on materially worse scores. Production tracing closes the loop by sampling live traffic into the same harness.
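To make "blocks deploy on materially worse scores" concrete, here is a minimal sketch of one possible gating rule. It assumes per-case scores in [0, 1] for a baseline run and a candidate run over the same cases, and it uses a paired bootstrap to decide whether the drop is larger than a tolerance. The function name, the 0.02 tolerance, and the 95% confidence level are illustrative assumptions, not fixed parts of the harness.

```python
import random


def paired_bootstrap_gate(baseline, candidate, tolerance=0.02,
                          n_resamples=2000, alpha=0.05, seed=0):
    """Return (blocked, upper_bound) for a candidate run vs. a baseline run.

    baseline / candidate: per-case scores in [0, 1], in the same case order.
    The deploy is blocked only when the upper end of a one-sided bootstrap
    confidence interval for the mean score change is below -tolerance,
    i.e. when we are fairly confident the drop exceeds the tolerance.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty, equally sized score lists")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed keeps the gate reproducible in CI
    resampled_means = sorted(
        sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    upper = resampled_means[int((1 - alpha) * n_resamples) - 1]
    return upper < -tolerance, upper


if __name__ == "__main__":
    blocked, upper = paired_bootstrap_gate([0.9] * 50, [0.6] * 50)
    print("block deploy:", blocked, "upper bound on mean change:", upper)
```

Pairing the scores case by case removes the variance that comes from some cases simply being harder than others, which is why the sketch resamples the differences rather than the two runs separately.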

The pattern we recommend treats evaluation as part of the product, not a sidecar. The harness ships before the feature, not after. Test cases come from real failure modes, including adversarial inputs and prompt-injection probes. The judge model is itself evaluated against human annotations on a representative sample. This is the approach we deploy in regulated environments where 'we did not know it regressed' is not an acceptable answer to a supervisor.
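One way to evaluate the judge against human annotations is a chance-corrected agreement score. The sketch below computes Cohen's kappa between human labels and judge labels on the same sample. The function names and the 0.6 acceptance threshold are assumptions for illustration; the right threshold depends on the rubric and the stakes.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty, equally sized label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(
        human_freq[label] * judge_freq[label] for label in human_freq
    ) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def judge_is_calibrated(human, judge, min_kappa=0.6):
    """Hypothetical acceptance rule: the judge may gate only above min_kappa."""
    return cohens_kappa(human, judge) >= min_kappa


if __name__ == "__main__":
    human = ["pass", "pass", "pass", "fail"]
    judge = ["pass", "pass", "fail", "fail"]
    print(cohens_kappa(human, judge), judge_is_calibrated(human, judge))
```

Raw percent agreement can look high when almost every case passes; kappa discounts the agreement that would occur by chance, so it is a stricter check on a skewed sample.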

How we approach it

From ad-hoc spot checks to disciplined release.

  1. Golden set curation from real traffic.

    We seed the test corpus from real production traffic — sampled, anonymised, annotated. We add incident-derived cases (every customer complaint becomes a test), adversarial probes (jailbreaks, injection, edge tokens), and distributional cases (long, short, multilingual, code-heavy). The set is version-controlled and reviewed by the domain expert.

  2. LLM-as-judge with calibrated rubrics.

    Where exact-match scoring fails (long-form answers, RAG answers, tool-use chains), we deploy LLM-as-judge with explicit graded rubrics. The judge is itself evaluated against human annotations on a representative sample; we calibrate both the judge model and the rubric before either gates anything material. Pairwise comparison is often more reliable than absolute scoring; we use it where it fits.

  3. Regression gate in CI/CD.

    The harness runs on every prompt change, every model version bump, every retrieval index update, and every dependency change. Materially worse scores on any rubric block the deploy. Statistical significance is enforced: a single run is noise, so the gate blocks only on differences that clear a significance threshold. The gate runs against the golden set plus a sampled production replay.

  4. Production tracing into the harness.

    Live traffic is sampled at a configurable rate, traced end-to-end (prompt, retrieval, tool calls, model version, output), and fed back into the harness pipeline. Customer complaints become reproducible cases inside hours, not weeks. The tracing layer is the bridge between the lab and the field.

  5. Drift detection on inputs and outputs.

    Distributions shift. Inputs drift as users find new ways to ask. Outputs drift as model versions change underneath you. We instrument input embeddings and output statistics and alert on drift before the regression gate has anything to catch. By the time scoring drops, you have already shipped.
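The output-statistics half of the drift check in step 5 can be sketched with a Population Stability Index (PSI). This is one common choice, not necessarily the one used in a given engagement. The example assumes the monitored statistic is a single number per response, such as output length in tokens, and that both samples are non-empty; the function name and bin count are illustrative.

```python
import math


def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature.

    Bins are equal-width over the reference range; live values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant samples

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    last_month = list(range(100))            # stand-in for output lengths
    this_week = [v + 50 for v in last_month]
    print(population_stability_index(last_month, this_week))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; those cut-offs are a convention rather than a guarantee and should be tuned against known incidents. Input embeddings need a different statistic, for example distance between embedding centroids, which this sketch does not cover.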

Implementation anchor

The architecture, end to end.

The harness sits on four layers:

  • Data: the golden set under version control (we use Git LFS or DVC depending on size), annotated with rubrics, owners, and provenance.
  • Scoring: rule-based for exact-match, embedding-based for semantic similarity, LLM-as-judge for graded rubrics, and human-in-the-loop sampling for the cases where none of these suffices.
  • Execution: a pipeline (Step Functions, Argo, or GitHub Actions) that runs the suite on demand and on schedule.
  • Surfacing: dashboards, regression reports, Slack/email alerts, and an annotation interface where domain experts review failures.
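As an illustration of the data layer, the sketch below shows one possible shape for a golden case that carries its rubric, owner, and provenance, stored as JSON Lines so that changes diff cleanly under version control. The field names and the JSONL format are assumptions for this example; the actual schema depends on the system under test.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    rubric: str          # scoring criteria applied by a rule or a judge
    owner: str           # person accountable for keeping the case valid
    provenance: str      # e.g. "production-sample", "incident", "adversarial"
    expected_output: Optional[str] = None  # only for exact-match cases
    tags: Tuple[str, ...] = ()


def dump_jsonl(cases):
    """Serialise cases one per line with sorted keys for stable diffs."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def load_jsonl(text):
    """Parse JSONL back into GoldenCase objects, skipping blank lines."""
    cases = []
    for line in text.splitlines():
        if line.strip():
            row = json.loads(line)
            row["tags"] = tuple(row.get("tags", ()))  # JSON has no tuples
            cases.append(GoldenCase(**row))
    return cases


if __name__ == "__main__":
    case = GoldenCase(
        case_id="incident-0001",
        input_text="Cancel my order and refund me.",
        rubric="Answer must state the refund window and not promise a date.",
        owner="support-lead",
        provenance="incident",
        tags=("refunds",),
    )
    print(dump_jsonl([case]))
```

Keeping the owner and provenance on every case makes it possible to answer, months later, why a case exists and who can say whether it is still correct.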

Where the system uses Claude on Bedrock, we wire prompt caching into the evaluation pipeline so re-running the suite stays cheap. Where there is a judge model, we use a different family from the system under test — Claude judging Llama, or Cohere judging Mistral — to avoid the systematic bias that comes from same-family scoring. The judge prompt is itself versioned and evaluated.
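The two judge-related rules above, a different family from the system under test and a versioned judge prompt, can be enforced in configuration. The following is a minimal sketch under assumed names: `JudgeConfig` and its fields are hypothetical, and the content hash is just one simple way to derive a prompt version.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    judge_family: str    # e.g. "claude", "llama", "cohere", "mistral"
    system_family: str   # family of the model under test
    prompt: str          # full text of the judge prompt

    def __post_init__(self):
        # Refuse same-family scoring at construction time.
        if self.judge_family.strip().lower() == self.system_family.strip().lower():
            raise ValueError("judge and system under test share a model family")

    @property
    def prompt_version(self):
        """Content hash: any edit to the judge prompt yields a new version."""
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    config = JudgeConfig("claude", "llama", "Grade the answer from 1 to 5.")
    print(config.prompt_version)
```

Recording `prompt_version` next to every score means a change in scores can be traced to a change in the judge rather than mistaken for a change in the system.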

Key clauses
  • Golden set under version control with provenance and owners
  • Rule-based, embedding, LLM-as-judge, and human-in-the-loop layers
  • Regression gate in CI/CD with statistical significance enforcement
  • Production sampling and trace ingestion into the harness
  • Drift detection on input embeddings and output statistics
  • Judge model from a different family to avoid same-family bias
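
Production sampling and trace ingestion, as listed above, could look like the following sketch. It assumes each trace is a dictionary with `trace_id`, `prompt`, `output`, and `model_version` keys; those key names, and both function names, are hypothetical. Sampling is done by hashing the trace ID so the same trace is always either kept or dropped, regardless of which service makes the decision.

```python
import hashlib


def should_sample(trace_id, rate):
    """Deterministically keep roughly `rate` (0.0 to 1.0) of all traces."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over 0 .. 2**64 - 1
    return bucket < rate * 2**64


def trace_to_case(trace):
    """Turn a stored trace into a candidate test case awaiting annotation."""
    return {
        "case_id": f"trace-{trace['trace_id']}",
        "input_text": trace["prompt"],
        "observed_output": trace["output"],
        "model_version": trace["model_version"],
        "retrieved_docs": trace.get("retrieved_docs", []),
        "tool_calls": trace.get("tool_calls", []),
        "provenance": "production-sample",
    }


if __name__ == "__main__":
    trace = {"trace_id": "abc123", "prompt": "Where is my order?",
             "output": "It ships tomorrow.", "model_version": "v7"}
    if should_sample(trace["trace_id"], rate=1.0):
        print(trace_to_case(trace))
```

The candidate case keeps the observed output rather than an expected one; a domain expert still has to attach a rubric or expected answer before it joins the golden set.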
What good looks like

The end state we drive toward.

A golden test set the team trusts, a regression gate that has actually blocked deploys, a tracing pipeline that turns complaints into reproducible cases inside hours, and an evaluation discipline that ships with the feature rather than after.

  • 200+: curated golden test cases
  • Every PR: regression run cadence
  • <24h: complaint to reproducible case
  • Weekly: golden set expansion cadence

These figures are illustrative, drawn from published architectures and forthcoming engagements. Specific test counts and cadences depend on use case complexity and deployment velocity.

Engage

Scoped evaluation harness implementation.

Send us the LLM system you are running, the deploy cadence, and the failure modes you are tracking (or the ones you suspect you are not). We come back with a fixed-scope implementation proposal and a sample golden test set inside ten working days.