An operator-grade pattern from the CreativeMinds Development (cmdev) AI engineering practice. Companion to the Amazon Bedrock for Production AI series and the prior piece on Air-Gapped LLM Deployments.
Key takeaways
- Enterprise LLM pilots that ship on vibe checks fail at month three — the fix is eval-driven engineering, where every change to model, prompts, RAG, or tools is gated by a measurement against a stable test set.
- Production workloads need four distinct eval surfaces — offline, online, regression, and A/B — each with its own metrics, cadence, and audience. Conflating them is the most common architectural mistake.
- A 300–500 item golden set, sourced from real user traffic with SME labels, is the sweet spot for daily CI-grade evaluation; below 100 items, noise overwhelms the signal.
- LLM-as-judge works when structured into per-claim rubrics, decomposed into components, calibrated against a human-labelled subset, and pinned to a specific judge model version — naïve "score this 1–10" prompts are unreliable.
- The eval harness costs ~10% of the AI workload it measures; a customer-visible incident costs orders of magnitude more. The trade is always favourable.
The Tuesday that ends the pilot
On a Tuesday morning, a customer-success lead opens a ticket. A regulated refund went out for the wrong amount. The AI assistant that classifies the case had answered confidently. The answer was wrong. By noon, the head of risk is on a call asking how many other queries this month went the same way, and nobody in the room can answer. The pilot pauses. Sometimes it never restarts.
This story is almost universal among the enterprise LLM projects that fail. Month one is the demo. The model passes the test questions, the tools fire on cue, the synthesis reads well. Month two is the limited rollout. Most queries work, the odd weird answer gets hand-corrected, and the team adopts a working motto: "we'll fix that in the next iteration." Month three is the Tuesday.
The cause is consistent because the method is consistent: the team relied on vibe checks. Someone tried a dozen queries, the answers looked right, the project shipped. Vibe checks are like a chef tasting a soup at one spoonful and declaring the whole pot fine — it works for a single bowl, not for a kitchen serving a thousand. There is no harness underneath. There is no record of what good looked like last month against which to detect this month's drift.
The fix has a name. Eval-driven engineering treats every change to the model, the prompts, the retrieval, the tools as a code change that has to pass a measurement before it ships. The harness runs in CI on every pull request. It runs in production against a sample of live traffic. It produces an artefact a regulator can read. Building it is one engineering sprint. Not building it is the Tuesday.
This piece is the architecture cmdev ships for regulated enterprise LLM workloads. It is the layer that converts a black box into a measured system, defensible to the CISO, the regulator, and the engineers themselves when the next inevitable thing goes wrong.
Four windows on the same workload
A production LLM workload needs four distinct evaluation surfaces. Most teams build one, sometimes two. The deployments that survive 18 months in production have all four. Think of them as four windows on the same building — each opens onto a different room, and the architect who only cuts one ends up with a corridor nobody can walk.
| Type | What it measures | When it runs | Who consumes it |
|---|---|---|---|
| Offline eval | Quality on a stable golden set | On every code change (CI) | Engineering — gates merges |
| Online eval | Quality on live production traffic | Continuously, per-query sampling | Operations — feeds dashboards and alerts |
| Regression eval | Quality drift over time relative to baseline | Weekly or monthly | Engineering + risk — catches silent degradation |
| A/B eval | Quality of variant X vs variant Y on matched traffic | When testing a change | Product — gates rollouts |
Each surface has its own metric vocabulary and its own audience. Conflating them is the most common architectural mistake: a team that folds offline and online evaluation into a single pipeline ends up with a system that does neither well. The mature pattern is four pipelines that share a metrics vocabulary but run independently, like four instruments in an orchestra reading the same score in their own clef.
Counting the things that actually predict failure
The metric matters more than the model. A model that hits 92% accuracy but produces opaque, uncitable answers is worse for regulated work than a model that hits 87% with full citations. Choosing what to measure is the architectural decision; choosing the model is the implementation detail.
For retrieval-augmented generation (RAG)
| Metric | What it tells you | How we measure |
|---|---|---|
| Hit rate | Did the retrieval bring back the right chunk(s) for the question? | Per-query: was the expected source chunk in the top-K retrieved? |
| Mean reciprocal rank (MRR) | How high did the right chunk rank? | Per-query: 1 / position of the first expected chunk in the retrieved set, then averaged across queries |
| Context precision | Of retrieved chunks, what fraction are actually relevant? | LLM-as-judge over each retrieved chunk against the query |
| Context recall | Of the chunks needed to fully answer, what fraction were retrieved? | LLM-as-judge against the golden answer's source set |
| Faithfulness | Does the answer follow from the retrieved context, or did the model hallucinate? | LLM-as-judge: extract claims from answer, check each is supported by context |
| Answer relevancy | Does the answer address the question that was asked? | Embedding similarity between question and answer summary |
| Citation accuracy | Do the citations in the answer point to chunks that actually support each claim? | Per-claim check against per-citation chunk |
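
The two deterministic retrieval metrics in the table need no judge at all. The following is a minimal sketch, assuming chunks are identified by string IDs and that retrieval returns them in rank order; the function names and the default `k` are illustrative, not taken from any particular library.

```python
def hit_rate(retrieved: list[str], expected: list[str], k: int = 5) -> float:
    """Return 1.0 if any expected chunk id is in the top-k retrieved ids, else 0.0."""
    top_k = retrieved[:k]
    return 1.0 if any(chunk_id in top_k for chunk_id in expected) else 0.0


def reciprocal_rank(retrieved: list[str], expected: list[str]) -> float:
    """Return 1 / (1-based position of the first expected chunk), or 0.0 if absent."""
    expected_ids = set(expected)
    for position, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in expected_ids:
            return 1.0 / position
    return 0.0


def mrr_over_queries(pairs: list[tuple[list[str], list[str]]]) -> float:
    """Average the per-query reciprocal rank over (retrieved, expected) pairs."""
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(r, e) for r, e in pairs) / len(pairs)
```

A query whose expected chunk sits at position three scores a reciprocal rank of one third; a query that misses entirely scores zero and pulls the mean down, which is the behaviour you want from a ranking metric.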
For agent workloads
| Metric | What it tells you | How we measure |
|---|---|---|
| Tool call accuracy | Did the agent call the right tool for the task? | Compare called tool sequence to expected sequence per scenario |
| Tool argument accuracy | Did the agent call the tool with correct arguments? | Per-call: structural check against expected argument shape |
| Goal completion | Did the agent finish the task? | Per-scenario: terminal state matches expected state |
| Turn count | How many model-tool turns did the agent take? | Per-scenario: histogram tracked against baseline |
| Steering recovery rate | Of induced errors, how many did the agent recover from? | Per-injected-error scenario: did the agent self-correct? |
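
The first two agent metrics can be sketched as position-aligned comparisons between the calls the agent made and the calls the scenario expected. This is one possible reading, assuming a hypothetical `ToolCall` record and treating "argument shape" as matching keys and value types; the tool names in the usage note are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


def tool_call_accuracy(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of positions where the agent called the expected tool."""
    if not expected:
        return 1.0 if not actual else 0.0
    matches = sum(1 for a, e in zip(actual, expected) if a.name == e.name)
    # Dividing by the longer sequence penalises both missing and extra calls.
    return matches / max(len(actual), len(expected))


def _same_shape(actual_args: dict, expected_args: dict) -> bool:
    """Structural check: same keys, and the same value type under each key."""
    if actual_args.keys() != expected_args.keys():
        return False
    return all(type(actual_args[k]) is type(expected_args[k]) for k in expected_args)


def tool_argument_accuracy(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Among aligned calls to the right tool, fraction with correctly shaped arguments."""
    aligned = [(a, e) for a, e in zip(actual, expected) if a.name == e.name]
    if not aligned:
        return 0.0
    return sum(_same_shape(a.arguments, e.arguments) for a, e in aligned) / len(aligned)
```

With an expected sequence of `lookup_case` then `issue_refund`, an agent that makes both calls, passes the refund amount as a string instead of a number, and then adds an unexpected third call scores 2/3 on tool call accuracy and 0.5 on argument accuracy.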
Cross-cutting metrics that always matter
Latency distribution at P50, P95, P99 belongs in every dashboard. Slow answers are worse than fast wrong answers for most use cases — speed is itself a quality signal. Cost per query, aggregated by workload and query class, catches a runaway-cost regression before the AWS bill does. Refusal rate matters because refusal creeps in silently as Guardrails policies tighten. And confidence calibration — does the model's expressed confidence actually predict its correctness — is the metric that downstream decisions stand on.
The metric set that actually works is opinionated and short. A team instrumenting fifteen RAG metrics learns less than a team acting on six. The metrics above are the six-to-eight that genuinely predict production quality across deployments.
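Two of the cross-cutting metrics are easy to compute from logged results. The sketch below uses the standard library for latency percentiles and a simple binned expected calibration error (ECE) as one common way to quantify confidence calibration; the source does not name a specific calibration measure, so treat ECE, the ten-bin default, and the function names as assumptions.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 from observed latencies (needs at least two data points)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def expected_calibration_error(
    confidences: list[float], correct: list[bool], bins: int = 10
) -> float:
    """Weighted mean gap between stated confidence and observed accuracy per bin."""
    total = len(confidences)
    if total == 0:
        return 0.0
    buckets: list[list[tuple[float, bool]]] = [[] for _ in range(bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * bins), bins - 1)  # confidence 1.0 goes in the last bin
        buckets[index].append((confidence, is_correct))
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return ece
```

A model that states 90% confidence on four answers and gets three right has an ECE of 0.15 on that sample: it is overconfident by fifteen points.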
The golden set, or: who labels the labels
The harness is only as good as the golden set it runs against. A golden set is a stack of question-answer pairs that the team agrees represents what good looks like. Three rules hold up over time.
Source from real users, not engineering imagination. The most common failure mode is a test set the engineering team wrote — full of plausible-but-not-actual queries. The model passes the engineering test set and fails on production traffic because the distributions never matched. It is the difference between rehearsing on a stage and performing in the street. The right move is to sample live traffic (anonymised), have an SME team label the expected outputs, and use that as the golden set. Refresh quarterly with new samples.
Cover the edge cases on purpose. A golden set that mirrors production traffic under-represents the rare queries where the model fails spectacularly. Augment deliberately — ambiguous queries, queries that should produce a refusal, queries that test specific Guardrails policies, queries from low-volume topic clusters. Report per-segment, so the team sees median production quality and edge-case quality as distinct numbers, not as a smoothed average that hides the cliff.
Version it, expire it, treat it like code. The golden set lives in version control. It gets pull-request review. It has changelog entries when items move. An item gets marked stale when the underlying ground-truth changes — a policy update means yesterday's golden answer is today's wrong answer. A golden set nobody touches is one that no longer reflects reality.
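The per-segment reporting described in the second rule can be sketched as a group-by over scored items. This assumes each row is a plain dict carrying the golden item's `segment` and the per-item `metrics`; that row shape and the `__all__` key are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def per_segment_report(rows: list[dict], metric: str) -> dict[str, float]:
    """Mean of one metric per segment, plus the overall mean under '__all__'."""
    by_segment: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_segment[row["segment"]].append(row["metrics"][metric])
    report = {segment: mean(values) for segment, values in by_segment.items()}
    report["__all__"] = mean(row["metrics"][metric] for row in rows)
    return report
```

Three well-behaved `billing` items at 1.0 and one `adversarial` item at 0.2 average to 0.8 overall, which looks healthy until the segment view shows the 0.2 sitting on its own line.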
The right size lands at 300-500 items for most workloads. Below 100, noise overwhelms signal — you cannot tell whether the new prompt is better or whether you got lucky. Above 1,000, the cost of running the harness becomes a barrier to running it frequently. The 300-500 band is the sweet spot for daily CI-grade evaluation.
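A rough way to see why small sets are noisy is the margin of error on a pass rate. The sketch below uses the normal approximation for a binomial proportion, which assumes independent items scored pass/fail; real golden sets only approximately satisfy that, so read the numbers as an order of magnitude, not a guarantee.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of the approximate 95% interval for a pass rate measured on n items."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


if __name__ == "__main__":
    for n in (50, 100, 400, 1000):
        points = 100 * margin_of_error(0.90, n)
        print(f"n={n}: 90% pass rate is uncertain by about +/- {points:.1f} points")
```

At a true pass rate of 90%, a 100-item set carries roughly six points of uncertainty either way, so a three-point "improvement" is indistinguishable from luck. At 400 items the band narrows to about three points, and quadrupling the set size only ever halves it.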
The harness, plain and small
The architecture of a working LLM eval harness fits on one screen.
```python
# eval/harness.py: the minimal shape
import time
from dataclasses import dataclass
from typing import Callable

import boto3  # Bedrock client library; unused in this skeleton, assumed to back the judge helpers

# The metric helpers called below (hit_rate, mean_reciprocal_rank,
# llm_judge_faithfulness, embedding_similarity, citation_accuracy_check)
# are assumed to be defined alongside this module and are not shown here.


@dataclass
class GoldenItem:
    id: str
    query: str
    expected_answer: str
    expected_chunks: list[str]
    segment: str     # for per-segment reporting
    difficulty: str  # easy / standard / hard / adversarial


@dataclass
class EvalResult:
    item_id: str
    actual_answer: str
    retrieved_chunks: list[str]
    metrics: dict[str, float]
    latency_ms: int
    cost_usd: float


def run_eval(
    golden_set: list[GoldenItem],
    pipeline: Callable[[str], dict],  # the system under test
    judge_model_id: str = "anthropic.claude-sonnet-4-6-20251022",  # pin an exact judge version
) -> list[EvalResult]:
    """Run the pipeline against each golden item, score with LLM-as-judge."""
    results = []
    for item in golden_set:
        # Time only the pipeline call, not the scoring that follows.
        start = time.monotonic()
        pipeline_output = pipeline(item.query)
        latency_ms = int((time.monotonic() - start) * 1000)

        metrics = {
            "hit_rate": hit_rate(pipeline_output["chunks"], item.expected_chunks),
            "mrr": mean_reciprocal_rank(pipeline_output["chunks"], item.expected_chunks),
            "faithfulness": llm_judge_faithfulness(
                question=item.query,
                answer=pipeline_output["answer"],
                context=pipeline_output["chunks"],
                judge_model=judge_model_id,
            ),
            "relevancy": embedding_similarity(item.query, pipeline_output["answer"]),
            "citation_accuracy": citation_accuracy_check(
                answer=pipeline_output["answer"],
                citations=pipeline_output["citations"],
                chunks=pipeline_output["chunks"],
            ),
        }

        results.append(EvalResult(
            item_id=item.id,
            actual_answer=pipeline_output["answer"],
            retrieved_chunks=pipeline_output["chunks"],
            metrics=metrics,
            latency_ms=latency_ms,
            cost_usd=pipeline_output["cost_usd"],
        ))
    return results
```
The shape is deliberately minimal. The complexity lives in the metric implementations and the judge prompts, not in the skeleton. The harness produces a structured result per item; the dashboard layer trends them.
The same harness runs in three contexts. In CI on every pull request, it gates the merge if any metric regresses beyond a configured threshold — a typical run is 300 items, 4-6 minutes of wall clock, two or three dollars of evaluation cost. In production, it samples one to five percent of live traffic and shadow-evaluates the outputs offline. And monthly, it runs against the full golden set and writes a signed PDF report with metric trends and drift findings — the artefact a CISO hands an examiner without rewriting it first.
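The CI gate described above reduces to comparing aggregated metrics against a stored baseline and failing the build when a drop exceeds its allowance. A minimal sketch, assuming every metric is higher-is-better and that baseline and candidate are already averaged per metric; the function names and threshold values are illustrative.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: dict[str, float],
) -> dict[str, float]:
    """Return each gated metric whose drop from baseline exceeds its allowance."""
    regressions = {}
    for name, allowed in max_drop.items():
        drop = baseline[name] - candidate[name]
        if drop > allowed:
            regressions[name] = drop
    return regressions


def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: dict[str, float],
) -> int:
    """Print any regressions and return a process exit code (0 = merge may proceed)."""
    failures = find_regressions(baseline, candidate, max_drop)
    for name, drop in failures.items():
        print(f"REGRESSION {name}: dropped {drop:.3f} (allowed {max_drop[name]:.3f})")
    return 1 if failures else 0
```

In a CI job the return value would typically be passed to `sys.exit`, so a non-zero code blocks the merge. Metrics where lower is better, such as latency or cost, would need the comparison flipped.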
The judge, calibrated like an instrument
LLM-as-judge is the trick that makes RAG and agent evaluation possible at scale. The naïve version — ask Claude to score this answer 1-10 — is wishful thinking dressed as engineering. The version that holds up looks more like calibrating a laboratory instrument than asking a colleague's opinion.
A working judge needs a structured rubric, not a holistic score. The prompt specifies exactly what to evaluate, with anchored examples for each band. Not "is this answer good?" but "for each claim in the answer, is it supported by the provided context?", scoring each claim independently. The judge then returns components (per-claim faithfulness, per-citation accuracy) that the harness aggregates programmatically. Holistic scores are gut feel; component scores are evidence.
Calibration matters as much as the rubric. Fifty to a hundred items in the golden set also get human SME verdicts on the same rubric the judge uses. The judge's outputs sit alongside the human labels, and the delta becomes a tracked metric in its own right. If the judge starts drifting from human alignment, either the rubric needs tuning or the judge model needs replacing.
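One way to turn the judge-versus-human delta into a number is raw agreement plus Cohen's kappa, which corrects agreement for chance. The source does not say which statistic is tracked, so this is a sketch under that assumption, with judge and human verdicts given as parallel lists of labels.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if not human or len(judge) != len(human):
        raise ValueError("judge and human labels must be non-empty and equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    observed = agreement_rate(judge, human)
    n = len(human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Tracking kappa per eval run alongside the quality metrics makes a change in the judge visible as a change in agreement, separate from any change in the pipeline being judged.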
Use a stronger model to judge than to serve. Claude Sonnet for production reasoning; Claude Opus for the judge. The cost is acceptable because the judge runs on the golden set of 300-500 items, not on every production query. The quality gap matters because the judge defines the quality bar — you cannot grade a Cambridge essay with a Year 5 marker.
For online evaluation, judge a one-to-five-percent stratified sample rather than every query. Stratify by query class so the rare types do not vanish from the measurement.
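Stratified sampling can be sketched as sampling each query class separately with a floor, so a class with a handful of queries still contributes at least one. The `query_class` field, the `min_per_class` floor, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(
    queries: list[dict], rate: float, min_per_class: int = 1, seed: int = 0
) -> list[dict]:
    """Sample roughly `rate` of each query class, never fewer than `min_per_class`."""
    rng = random.Random(seed)
    by_class: dict[str, list[dict]] = defaultdict(list)
    for query in queries:
        by_class[query["query_class"]].append(query)

    sample: list[dict] = []
    for items in by_class.values():
        wanted = max(min_per_class, round(len(items) * rate))
        sample.extend(rng.sample(items, min(len(items), wanted)))
    return sample
```

With 1,000 FAQ queries and three refund disputes at a 2% rate, a uniform sample would most often contain no refund disputes at all; the stratified version returns twenty FAQ queries and one dispute.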
A working judge prompt for faithfulness:
```text
You are evaluating whether an AI assistant's answer is faithful to its provided context.

Question: {question}

Context:
{retrieved_chunks}

Answer: {answer}

Extract every factual claim from the answer (claims, not opinions or hedges).
For each claim:
1. State the claim verbatim.
2. Identify the chunk(s) in context that support it (or "none").
3. Score: "supported" / "partially supported" / "unsupported" / "contradicted".

Return JSON:
{
  "claims": [
    {"claim": "...", "supporting_chunk_ids": [...], "score": "..."},
    ...
  ]
}
```
Per-item faithfulness becomes the fraction of claims marked supported over total claims. Across the golden set, that becomes the faithfulness metric the dashboard tracks over time.
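The aggregation step can be sketched as a small parser over the judge's JSON. It follows the strict reading of the sentence above, where only claims marked "supported" count and "partially supported" does not; the handling of an answer with no claims is an assumption, as is the function name.

```python
import json

VALID_SCORES = {"supported", "partially supported", "unsupported", "contradicted"}


def faithfulness_from_judge(raw_json: str) -> float:
    """Fraction of extracted claims the judge marked as fully supported."""
    claims = json.loads(raw_json)["claims"]
    if not claims:
        # Assumption: an answer with no factual claims is treated as fully faithful.
        return 1.0
    for claim in claims:
        if claim["score"] not in VALID_SCORES:
            raise ValueError(f"unexpected judge score: {claim['score']!r}")
    supported = sum(1 for claim in claims if claim["score"] == "supported")
    return supported / len(claims)
```

Rejecting unknown score labels matters more than it looks: a judge that starts emitting a new label after a model update would otherwise silently lower every score.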
The drift you cannot see
The hardest production failure mode is silent quality drift. Nobody on the team changed the model, the prompts, or the corpus on purpose, yet quality drops anyway. Something shifted underneath. A knowledge-base refresh changed what retrieval returns, sometimes through RAG poisoning. The Bedrock catalogue quietly moved an upstream model version. A new product line changed the distribution of user queries. Or a tighter Guardrails policy raised the refusal rate just enough that nobody notices until the customer-success ticket arrives.
Drift is what makes the harness valuable beyond the launch sprint. The pattern that catches it has four parts. A daily golden-set rerun against the trailing 30-day baseline alarms when any metric regresses more than two standard deviations. Shadow evaluation on one to five percent of live traffic — structured exactly like the golden-set run — catches the patterns the golden set does not cover. Per-segment trending makes regressions visible in specific user segments before they reach the headline metric. And causal attribution turns the alarm into something actionable: the harness reports which deploys, ingestions, and model updates correlate with the regression window, so the on-call engineer starts with a hypothesis instead of a search.
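The two-standard-deviation alarm on the daily rerun can be sketched with the standard library. This assumes one aggregated value per day for a higher-is-better metric, with the trailing window passed in as a plain list; the function name and the zero-variance rule are illustrative choices.

```python
from statistics import mean, stdev


def drift_alarm(baseline: list[float], today: float, sigmas: float = 2.0) -> bool:
    """True when today's value is more than `sigmas` std devs below the trailing mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    centre = mean(baseline)
    spread = stdev(baseline)
    if spread == 0.0:
        return today < centre  # a perfectly flat baseline: any drop is notable
    return (centre - today) / spread > sigmas
```

Running the same check per segment rather than only on the headline number is what lets a regression in one user segment fire before it is large enough to move the overall mean.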
The loop runs in the customer's account, on the customer's data, with results landing in the customer's S3 audit bucket. It is part of the air-gapped deployment from the prior article, not a separate cloud service.
What bites in production
Five frictions show up in real deployments, and they are the ones cmdev engineers have hit and engineered past.
The first is getting good golden-set labels at scale. Producing high-quality expected outputs for 300-500 items is the single biggest blocker we see. SMEs are busy. Engineers writing labels produce engineer-thinking-like-engineer answers that miss what the SMEs would actually have said. Outsourcing produces uneven quality. The pattern that works is a two-pass protocol: Claude Sonnet generates a draft answer with the company documentation in context, an SME reviews and corrects. SME time per item falls from ten minutes of writing-from-scratch to two minutes of editing, and a 400-item golden set becomes a 2-3 day project rather than a 2-3 week one.
The second is judge-cost arithmetic. A 400-item run with claim decomposition and per-citation checks can issue 10-20 judge calls per item. At Opus rates, a single eval run costs $20-50, and running daily plus on every PR adds up to a meaningful line on the bill. The optimisations stack: Claude Haiku as a first-pass classifier skips detailed judging on obviously-correct items; cached judge results help during dev cycles where the same pipeline output appears repeatedly; and batch inference (per Bedrock Series Part 7) covers the non-blocking eval runs. Together they cut eval cost by roughly 70 percent without a measurable quality regression on the trend.
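The arithmetic is worth writing down once. The sketch below takes the per-run cost range and the 70 percent saving from the paragraph above; the number of pull-request runs per month is a made-up assumption, since it depends entirely on the team.

```python
def judge_calls_per_run(items: int, calls_per_item: int) -> int:
    """Total judge invocations for one pass over the golden set."""
    return items * calls_per_item


def monthly_eval_cost(
    cost_per_run_usd: float,
    daily_runs: int = 30,
    pr_runs: int = 0,
    savings: float = 0.0,
) -> float:
    """Monthly judge spend for daily runs plus PR runs, after a fractional saving."""
    return (daily_runs + pr_runs) * cost_per_run_usd * (1.0 - savings)


if __name__ == "__main__":
    print(judge_calls_per_run(400, 10), "to", judge_calls_per_run(400, 20), "judge calls per run")
    # Assumption for illustration: 60 PR-triggered runs per month.
    for per_run in (20.0, 50.0):
        before = monthly_eval_cost(per_run, pr_runs=60)
        after = monthly_eval_cost(per_run, pr_runs=60, savings=0.70)
        print(f"${per_run:.0f}/run: ${before:,.0f}/month before, ${after:,.0f}/month after")
```

Under those assumptions the unoptimised bill lands between $1,800 and $4,500 a month, and the stacked optimisations bring it to roughly $540 to $1,350.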
The third is the judge drifting underneath you. The judge model is a moving target — Bedrock's catalogue updates, versions deprecate, and a judge calibrated to humans six months ago may no longer be the same judge. We hit this in production once when faithfulness scores trended upward across a quarter. Nothing in the pipeline had improved. The judge had simply become more permissive after a model version change. The fix is the same fix you apply to the pipeline model: pin the version. When the pinned judge approaches deprecation, run a parallel calibration against the successor on the labelled subset, document the delta, then switch with eyes open. Record the judge version in every eval-run metadata entry so historical trends remain interpretable.
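Recording the judge version per run can be as small as one immutable record written next to the results. The field names below are hypothetical, and the model identifiers in the usage note are placeholders, not real Bedrock model IDs.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvalRunMetadata:
    run_id: str
    judge_model_id: str      # the exact pinned judge version used for this run
    golden_set_version: str  # e.g. a git tag or commit of the golden set
    pipeline_commit: str
    started_at: str


def new_run_metadata(
    run_id: str, judge_model_id: str, golden_set_version: str, pipeline_commit: str
) -> EvalRunMetadata:
    """Stamp a run with everything needed to interpret its scores later."""
    return EvalRunMetadata(
        run_id=run_id,
        judge_model_id=judge_model_id,
        golden_set_version=golden_set_version,
        pipeline_commit=pipeline_commit,
        started_at=datetime.now(timezone.utc).isoformat(),
    )


def same_judge(a: EvalRunMetadata, b: EvalRunMetadata) -> bool:
    """Two runs belong on one trend line only if the same judge scored both."""
    return a.judge_model_id == b.judge_model_id


def to_json(metadata: EvalRunMetadata) -> str:
    return json.dumps(asdict(metadata), sort_keys=True)
```

A dashboard that checks `same_judge` before drawing a line between two points would have flagged the quarter of rising faithfulness scores as a judge change instead of an improvement.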
The fourth is multimodal outputs. When the model produces images, charts, or formatted documents, text-based eval metrics fall apart. There is no LLM-as-judge prompt for "did this chart correctly visualise the data?" The patterns that work are pragmatic: deterministic checks on structured outputs (schema validation, range checks, expected-element-present), LLM-as-judge against extracted text from labelled images, and human spot-check sampling for genuinely visual outputs. The honest answer is that multimodal eval is harder and we accept a higher human-in-the-loop burden on those workloads.
The fifth is the most uncomfortable: the drift alarm nobody acts on. A 2 AM alert that nobody triages is worse than no alert, because it teaches the team to ignore the signal. We have shipped harnesses where the dashboard was built brilliantly and then never opened again. Closing this requires organisational gravity, not engineering. Drift alarms route to the same on-call channel as service-degradation alarms. Each alarm class has a documented runbook. When a drift alarm fires, the post-incident review carries the same rigour as a customer-visible incident. The harness either has gravity, or it is decoration.
Five things that hold up
Across the eval deployments cmdev has shipped, five lessons compound.
The eval harness is the deployable artefact. The model is fungible: the Bedrock catalogue moves, customer preferences shift between Claude and Llama, retrieval architectures evolve. The harness is the layer that says "regardless of what sits underneath, here is the measured quality." It is the operator's most durable AI engineering investment.
The first month of eval data is worth more than the first six months of model improvements. A team that ships a working harness in week four of a deployment outperforms a team that ships a better model in month six, because the first team iterates with evidence and the second iterates blind.
The golden set is a regulatory artefact, not just a developer tool. When the regulator asks "how do you know your AI is producing quality outputs?", the golden set plus the harness plus the trended metrics is the answer. Without them, the answer is hand-waving. With them, it is a twelve-page report an examiner signs off in one read.
LLM-as-judge needs the same engineering rigour as the pipeline it judges. Calibrated rubrics, version-pinned judge models, cost discipline, drift detection. Teams that treat the judge as an afterthought get unreliable evaluation; teams that treat it as a first-class surface get the measurement they need.
The cost of building the harness is roughly ten percent of the cost of the workload it measures. The cost of not building it is the cost of a customer-visible incident. The trade is always favourable, and the teams that miss it are the teams that have not yet done the arithmetic.
If a regulator asked you tonight how you know your AI is producing the right outputs, what would you hand them?
FAQs
Why won't a single eval system work for offline and online evaluation?
Offline eval gates merges in CI on a stable golden set; online eval samples live traffic continuously to feed dashboards and alerts. They have different metrics, different consumers (engineering vs operations), and different latency requirements. Teams that build one mixed pipeline end up with one that does neither well — the mature pattern is four distinct pipelines that share a metrics vocabulary but run independently.
How do we get 300–500 SME-labelled golden items without burning out the SME team?
Use a two-pass labelling protocol. Pass 1: an LLM (Claude Sonnet, with company documentation as context) generates draft expected answers for each query. Pass 2: an SME reviews and either approves or corrects. SME time drops from ~10 minutes per item to ~2 minutes, turning a 400-item golden set into a 2–3 day SME project rather than 2–3 weeks.
Why does our faithfulness metric trend keep rising without us improving the pipeline?
The judge model probably drifted. Bedrock catalogue updates can quietly replace your judge with a more permissive version, and scores rise without any pipeline change. Pin the judge model version, run parallel calibration against the successor on a labelled subset before switching, document the calibration delta, and record judge version in every eval-run metadata entry so historical trends remain interpretable.
What's the cost discipline around LLM-as-judge on continuous evaluation?
A 400-item eval with claim-decomposition and per-citation checks can issue 10–20 judge calls per item, costing $20–50 per run on Opus. Use Haiku as a first-pass classifier to skip detailed judging on obviously-correct cases, cache judge results on identical pipeline outputs, and use batch inference for non-blocking runs. The combination cuts eval cost ~70% with no measurable quality regression on the metric trend.
Will regulators actually accept the golden set + harness as evidence?
Yes — when packaged as a monthly signed PDF report with metric trends and drift findings, the golden set, harness, and trended metrics answer the regulator's question "how do you know your AI is producing quality outputs." Without them, the answer is hand-waving. With them, the answer is a 12-page report that an examiner signs off in one read.
Engaging with cmdev
CreativeMinds Development (cmdev) is the engineering studio behind this evaluation framework. We ship measurable, audit-defensible AI for regulated enterprises in Africa and the EU — banks under CBN CSAT, energy operators under NMDPRA and NIS2, fintechs under NDPA, healthcare networks under HIPAA-equivalent regional regimes. The eval harness is part of the production architecture we deploy, not an afterthought bolt-on.
- Cloud security services: /services/cloud-security
- Companion architecture series: Amazon Bedrock for Production AI, Air-Gapped Bedrock, AWS-for-banks
Mayowa Adewole is CTO and Principal AI Engineer at CreativeMinds Development. He leads cmdev's AI engineering practice for regulated enterprises across Africa and the EU, with deployments in production for banking, energy, and critical-infrastructure customers.
