
Mitigating Non-Deterministic AI Failures: The Day 2 Problem for Enterprise LLMs

Mayowa A. · 14 min read

An operator-grade pattern from the CreativeMinds Development (cmdev) AI engineering practice. Companion to Custom Evaluation Frameworks for Enterprise LLMs and the Amazon Bedrock for Production AI series.

The failure mode that costs enterprises the most

Enterprise LLM deployments rarely die in a single catastrophic event. They die slowly, by a thousand drifts, in the months after launch when the team's attention has moved to the next initiative. The failure mode looks like this: the workload ran beautifully for ten weeks; in week eleven the customer-support team starts hearing complaints; in week twelve the compliance officer notices that the audit-evidence pack from the same monthly query now reads differently; in week thirteen the on-call gets a page because P99 latency is now 22 seconds when it used to be 4. Nothing broke. Everything drifted.

This is the Day 2 problem. Nobody on the team deliberately changed the model, the prompts, or the corpus, yet the system's behaviour changed. The day-one architecture from the air-gapped Bedrock article is what makes the workload deployable. The day-two monitoring architecture is what keeps it deployable for the eighteen months that justify the initial investment. Most teams build the first and not the second.

This piece documents what we ship for Day 2: the three classes of non-deterministic failure, the detection patterns for each, the recovery playbooks, and the alarm-fatigue trap that turns a working monitoring system into expensive decoration.

Three classes of non-deterministic failures

The drift surfaces are not equivalent. Each has its own detection signal, its own root cause, and its own recovery path.

Class 1 — Structural hallucination drift

The model starts producing answers that are confidently wrong in a slightly different way each week. These are not the same hallucination repeating but new ones, often plausible enough to pass a casual review. The cause is usually upstream and silent: a Bedrock model-version update, a Knowledge Base ingestion that pulled in stale or contradictory content, a Guardrails policy tightening that pushed the model toward less-anchored answers, or a shift in the model's calibration when the provider (Anthropic, Meta, Cohere) revises the model between minor versions.

The detection signal is the faithfulness metric trend from the eval harness in the prior article. Hallucination drift shows up as a slow downward trend on faithfulness scored against the golden set, often well before any individual customer complaint reaches the team. The signal is statistical: a faithfulness score that was 0.94 in week one might be 0.91 in week eight and 0.87 in week twelve. Each individual week's drop is within noise; the trend is the alarm.
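To make "the trend is the alarm" concrete, here is a minimal sketch that fits a least-squares slope to the three example scores above and flags the series when the weekly decline exceeds a tolerance. The function names and the 0.003-per-week tolerance are illustrative assumptions, not values from the eval harness.

```python
from statistics import mean


def weekly_slope(scores: list[tuple[int, float]]) -> float:
    """Ordinary least-squares slope of score against week number."""
    weeks = [w for w, _ in scores]
    values = [s for _, s in scores]
    if len(set(weeks)) < 2:
        raise ValueError("need observations from at least two distinct weeks")
    w_bar, s_bar = mean(weeks), mean(values)
    num = sum((w - w_bar) * (s - s_bar) for w, s in scores)
    den = sum((w - w_bar) ** 2 for w in weeks)
    return num / den


def is_drifting(scores: list[tuple[int, float]], max_weekly_drop: float = 0.003) -> bool:
    """True when the fitted trend falls faster than the tolerated weekly drop.

    max_weekly_drop is an assumed tolerance; tune it per metric.
    """
    return weekly_slope(scores) < -max_weekly_drop


# (week, faithfulness) pairs taken from the example in the text.
observations = [(1, 0.94), (8, 0.91), (12, 0.87)]

if __name__ == "__main__":
    print(f"slope per week: {weekly_slope(observations):.4f}")
    print(f"drifting: {is_drifting(observations)}")
```

With more weekly data points the same fit becomes less sensitive to a single noisy week, which is the property the trend alarm relies on.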

Class 2 — Semantic variance

Two customers ask the same question on the same day. They get different answers. Both answers are technically correct. But the inconsistency erodes trust faster than a single wrong answer ever could, because it suggests the system is non-deterministic in a way the user cannot reason about.

The cause is usually the combination of temperature settings (even at temperature = 0 there is residual non-determinism in transformer inference at scale) and the retrieval layer — slightly different top-K chunks come back for semantically-equivalent queries, and the model synthesises slightly different summaries from them. Multi-model routing can compound this: a query that classifies as "simple" on Monday and "standard" on Tuesday routes through different model tiers and produces visibly different answers.

The detection signal is answer consistency, measured by running the same canonical golden-set queries through the production pipeline twice (or N times) per day and comparing the outputs. The metric is semantic_similarity(answer_1, answer_2), computed as the embedding similarity between the two answers. Similarity below a configured threshold (equivalently, variance above it) triggers an alarm.

Class 3 — Latency degradation

P95 and P99 latencies drift upward over weeks. P50 typically stays flat — which is why median-based dashboards miss this entirely. The cause is some combination of: KB vector index growing past its tuned size, Bedrock regional load shifts, prompt-template growth as the team adds more context, retry storms from upstream rate limits, AgentCore session-memory accumulation, or the model's response-token count creeping upward as the agent's tool descriptions grow.

The detection signal is P95 / P99 latency tracked per query class (per Bedrock Part 6 observability). Per-class is important because aggregate P95 can hide a specific query-type degradation behind volume from other classes. CloudWatch metrics with a query_class dimension surface this cleanly.
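The point about aggregates hiding a single class can be shown with a few lines of arithmetic. The sketch below uses a nearest-rank percentile and invented traffic (970 fast "faq" requests, 30 slow "policy" requests); the class names and latencies are assumptions for illustration only.

```python
from collections import defaultdict


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def p95_by_class(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Group (query_class, latency_ms) samples by class and compute P95 per class."""
    by_class: dict[str, list[float]] = defaultdict(list)
    for query_class, latency_ms in samples:
        by_class[query_class].append(latency_ms)
    return {cls: percentile(vals, 95) for cls, vals in by_class.items()}


# Invented traffic: the slow class is only 3% of volume, so it sits above the aggregate P95.
samples = [("faq", 800.0)] * 970 + [("policy", 9000.0)] * 30

if __name__ == "__main__":
    print("aggregate P95:", percentile([ms for _, ms in samples], 95))
    print("per-class P95:", p95_by_class(samples))
```

Here the aggregate P95 stays at the fast class's latency while the per-class view exposes the degraded class, which is why the dimension is worth its cost.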

The Day 2 monitoring architecture

Day 2 monitoring architecture: Production traffic flows through the Bedrock-backed agent. Three parallel detection paths run continuously. (1) Faithfulness eval pipeline — sample 1-5% of live traffic, run LLM-as-judge against the response and retrieved context, emit faithfulness score per query class to CloudWatch. (2) Consistency eval pipeline — daily golden-set rerun produces N answers per query, embedding-distance compared, semantic_variance metric emitted per class. (3) Latency profile — every invocation tagged with query_class dimension, P50/P95/P99 tracked per dimension. Each metric stream feeds CloudWatch alarms with 2-sigma thresholds against a 30-day rolling baseline. Alarm fires include causal attribution context: recent deploys, KB ingestions, model version updates, Guardrails changes, query-distribution shifts. Recovery playbook routes by signal class: rollback model version, replay KB ingestion, tune Guardrails, re-tune retrieval. All findings land in an incident ticket with the trended chart as evidence.

The architecture has three parallel detection pipelines feeding a unified alarm-and-recovery layer. Each pipeline is its own deployable, monitorable subsystem, but they share the metric vocabulary so the on-call sees one coherent picture.

Detection — what we instrument

Continuous evaluation pipeline (faithfulness, drift)

A scheduled job runs the golden set from the evaluation article against the production pipeline daily. The judge model — pinned Claude Opus, calibrated against human SME labels — scores each item on the metrics that matter for the workload. The harness emits per-metric CloudWatch metrics tagged with query_class, judge_model_version, pipeline_version, kb_snapshot_id. Alarms fire on:

  • 2σ regressions on any metric against the 30-day rolling baseline
  • Sustained 1σ regressions over 7-day windows (catches slow drift that doesn't trip the 2σ gate)
  • Calibration delta between judge model version N and N+1 when a judge upgrade is in flight
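The first two alarm conditions can be sketched as follows, assuming one aggregated score per day and a metric where higher is better. The function name and the input shape are hypothetical; the harness's real alarm evaluation may differ.

```python
from statistics import mean, stdev


def regression_alarms(baseline: list[float], recent: list[float], sustained_days: int = 7) -> dict[str, bool]:
    """Evaluate the 2-sigma gate and the sustained 1-sigma gate.

    baseline: daily scores from the 30-day rolling window.
    recent:   the most recent daily scores, newest last.
    Assumes a higher-is-better metric such as faithfulness.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    window = recent[-sustained_days:]
    return {
        # The latest day sits more than two standard deviations below the baseline mean.
        "two_sigma": recent[-1] < mu - 2 * sigma,
        # Every day in a full window sits more than one standard deviation below the mean.
        "sustained_one_sigma": len(window) == sustained_days
        and all(score < mu - sigma for score in window),
    }


if __name__ == "__main__":
    baseline = [0.94, 0.95, 0.93] * 10
    print(regression_alarms(baseline, [0.928] * 7))           # slow drift
    print(regression_alarms(baseline, [0.94] * 6 + [0.90]))   # sudden drop
```

The two gates are complementary: a one-day collapse trips the first, while a week of small shortfalls trips only the second.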

The same harness runs against a 1-5% sample of live production traffic for the online eval pipeline. Stratified sampling ensures rare query classes are not absent from the measurement. Cost runs ~$3-8 per day for typical workloads.
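One way to implement the stratified sample is a per-class rate with a floor, so rare classes always contribute some items. This is a sketch under assumptions: the request shape (a dict with a query_class key), the 2% rate, and the floor of five are illustrative, not the harness's actual settings.

```python
import random
from collections import defaultdict


def stratified_sample(requests: list[dict], rate: float = 0.02, min_per_class: int = 5, seed=None) -> list[dict]:
    """Sample roughly `rate` of each query class, but never fewer than `min_per_class`
    items (or the whole class, if it is smaller than the floor)."""
    rng = random.Random(seed)
    by_class: dict[str, list[dict]] = defaultdict(list)
    for req in requests:
        by_class[req["query_class"]].append(req)

    sample: list[dict] = []
    for items in by_class.values():
        k = min(len(items), max(min_per_class, round(rate * len(items))))
        sample.extend(rng.sample(items, k))
    return sample


if __name__ == "__main__":
    traffic = [{"query_class": "faq", "id": i} for i in range(1000)]
    traffic += [{"query_class": "rare", "id": i} for i in range(3)]
    picked = stratified_sample(traffic, seed=7)
    print(len(picked), "items sampled")
```

Because the floor oversamples rare classes, per-class metrics stay meaningful, but any aggregate computed from the sample should be reweighted by class volume.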

Consistency / variance pipeline

A standing job picks a 50-item subset of the golden set and runs each query through production N times (typically N=5, spread across the day to capture intra-day variance). Pairwise embedding similarity is computed for each pair of answers; the minimum similarity per query is the per-query consistency score. Aggregated, this becomes the semantic_consistency_score metric tracked per query class.
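The per-query score described above reduces to a minimum over pairwise cosine similarities. The sketch below assumes the N answers have already been embedded by whichever embedding model the pipeline uses; the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consistency_score(embeddings: list[list[float]]) -> float:
    """Minimum pairwise cosine similarity across the N answers to one query."""
    if len(embeddings) < 2:
        raise ValueError("need at least two answers to measure consistency")
    return min(cosine_similarity(a, b) for a, b in combinations(embeddings, 2))


if __name__ == "__main__":
    # Toy 2-d vectors standing in for real answer embeddings.
    print(consistency_score([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]))
```

Taking the minimum rather than the mean means a single outlier answer is enough to lower the score, which matches the user-facing failure: one inconsistent answer is what gets noticed.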

For multi-model routing pipelines, the consistency pipeline runs per tier — Haiku consistency, Sonnet consistency, and cross-tier consistency tracked separately. Cross-tier inconsistency is the signal that the router is the variance source, not the underlying models.

Latency profile

Every Bedrock invocation emits a CloudWatch metric tagged with query_class. The CloudWatch dashboards track P50, P95, P99 per dimension. The alarm condition is sustained P99 above the 30-day P99 baseline by 50% for any single query class — sustained meaning over a six-hour window so transient spikes don't page the on-call at 2 AM.
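The "sustained" condition is what keeps a single spike from paging anyone. A minimal sketch, assuming one P99 datapoint per hour per query class (the hourly granularity and the function name are assumptions):

```python
def sustained_p99_breach(
    hourly_p99_ms: list[float],
    baseline_p99_ms: float,
    factor: float = 1.5,
    window_hours: int = 6,
) -> bool:
    """True only if every hourly P99 in the trailing window exceeds baseline * factor.

    hourly_p99_ms is ordered oldest to newest. A factor of 1.5 encodes
    "50% above the 30-day P99 baseline".
    """
    if len(hourly_p99_ms) < window_hours:
        return False  # not enough data to call it sustained
    limit = baseline_p99_ms * factor
    return all(value > limit for value in hourly_p99_ms[-window_hours:])


if __name__ == "__main__":
    baseline = 4000.0
    print(sustained_p99_breach([7000.0] * 6, baseline))             # six bad hours in a row
    print(sustained_p99_breach([4000.0] * 5 + [22000.0], baseline))  # one transient spike
```

This is roughly what a CloudWatch alarm evaluates when it is configured with six one-hour evaluation periods and requires all six datapoints to breach.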

For agent workloads, latency is also profiled per tool — tool_call_latency tagged with tool_name. A specific tool's degradation (often a downstream service's degradation) gets flagged independently of the agent's overall response time.

The causal-attribution layer — what makes alarms actionable

A drift alarm without causal attribution is an alarm that gets snoozed. The pattern that closes the loop: when an alarm fires, the harness reports which recent system changes correlate with the regression window. The attribution surface:

  • Recent deploys in the workload account (CodeDeploy, ECS service updates, Lambda version updates)
  • KB ingestion events (Bedrock Knowledge Base ingestion job completions)
  • Model version updates (Bedrock catalogue changes published during the window)
  • Guardrails policy changes (new guardrail version, threshold tuning)
  • Query distribution shifts (new high-volume query class appearing or an existing class dropping out)
  • Downstream service health (CloudTrail Lake / RDS / DynamoDB metrics that correlate with the agent's tool calls)

The alarm body reads not just "faithfulness regressed by 0.07 on the 'compliance-policy-query' class" but "faithfulness regressed by 0.07 on the 'compliance-policy-query' class — correlated with KB ingestion job ing-3af2 completed 2026-06-02 14:30 UTC and Guardrails version 4 deployed 2026-06-03 08:15 UTC." The on-call has a starting point, not a search problem.

The causal-attribution table is produced by a CloudWatch Logs Insights query that runs when the alarm fires, and its result is embedded in the PagerDuty incident at creation. No human is correlating timelines at 2 AM.
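The correlation step itself is simple once change events are collected in one place. The sketch below is plain Python over an in-memory list rather than a Logs Insights query; the ChangeEvent type, the 48-hour lookback, and the function names are assumptions, and the two events reuse the example identifiers from the alarm body above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChangeEvent:
    kind: str      # e.g. "KB ingestion job", "Guardrails version"
    ref: str       # identifier of the change
    at: datetime   # completion or deploy time, UTC


def attribute(events, regression_start, alarm_time, lookback=timedelta(hours=48)):
    """Return change events that landed between (regression_start - lookback) and alarm_time."""
    earliest = regression_start - lookback
    hits = [e for e in events if earliest <= e.at <= alarm_time]
    return sorted(hits, key=lambda e: e.at)


def alarm_body(summary: str, hits) -> str:
    """Append the correlated changes to the alarm summary."""
    if not hits:
        return f"{summary}; no recent changes found in the attribution window"
    causes = " and ".join(f"{e.kind} {e.ref} at {e.at:%Y-%m-%d %H:%M} UTC" for e in hits)
    return f"{summary}; correlated with {causes}"


UTC = timezone.utc
events = [
    ChangeEvent("deploy", "release-old", datetime(2026, 5, 10, 9, 0, tzinfo=UTC)),
    ChangeEvent("Guardrails version", "4", datetime(2026, 6, 3, 8, 15, tzinfo=UTC)),
    ChangeEvent("KB ingestion job", "ing-3af2", datetime(2026, 6, 2, 14, 30, tzinfo=UTC)),
]

if __name__ == "__main__":
    start = datetime(2026, 6, 3, 12, 0, tzinfo=UTC)
    fired = datetime(2026, 6, 4, 9, 0, tzinfo=UTC)
    hits = attribute(events, start, fired)
    print(alarm_body("faithfulness regressed by 0.07 on 'compliance-policy-query'", hits))
```

Correlation in time is only a starting point for the on-call; the sketch lists candidates and does not claim causation.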

Recovery patterns — the four playbooks

Recovery routes by detected failure class. The playbooks:

Playbook A — Model version rollback

When the causal attribution points to a Bedrock model version update as the trigger, the recovery is to re-pin the previous version. The agent's foundation model ID is held as a CloudFormation / Terraform parameter; rollback is a single parameter change and a redeploy. Time to recover: 5-10 minutes from the parameter change.

The defensible discipline: always run at least one minor model version behind the catalogue's current release. Today's pinned version has been tested; the version Bedrock just published has not. When the catalogue advances, the team runs the eval harness against the new version in shadow, confirms equivalence or better, then advances the pin. Skipping this discipline is how teams get caught by Bedrock catalogue updates.

Playbook B — Knowledge Base rollback

When the causal attribution points to a KB ingestion as the trigger, the recovery is to roll back the KB to its prior snapshot. Bedrock Knowledge Bases support data-source version management; the rollback is a single API call.

The discipline: snapshot before every ingestion job. The KB version graph becomes the rollback target list. Ingest, run the eval against the new snapshot, confirm equivalence, then advance the production pointer. If the eval regresses, the production pointer never advances.

Playbook C — Guardrails tuning

When the causal attribution points to a Guardrails policy change as the trigger, the recovery is to roll back the guardrail version. Bedrock Guardrails are versioned by design; the agent references the guardrail by ARN with version, and rollback is a version reference change.

The discipline: stage Guardrails policy changes in DRAFT, evaluate, then publish. The published version becomes the production reference; the DRAFT is mutable. A bad publish gets rolled back to the prior published version while the team iterates on a new DRAFT.

Playbook D — Retrieval re-tuning

When the causal attribution points to retrieval-layer drift (KB hit rate dropping, chunk relevance scoring lower), the recovery is to re-tune the retrieval parameters: top-K, chunking strategy, hybrid-search weights, reranking thresholds. This is the slowest of the four playbooks, because re-ingestion of the corpus may be required. Retrieval drift is also the most likely actual cause of the slow-drift failure mode, as opposed to the sudden-regression mode.

The discipline: the retrieval harness is independent of the synthesis harness. Retrieval-quality regressions surface against retrieval-specific metrics (hit rate, MRR, context precision) before they show up as faithfulness regressions in synthesis. Catching retrieval drift early is cheaper than catching it via the downstream synthesis signal.

The alarm-fatigue trap

The hardest problem in production AI monitoring is not building the system. It is keeping the team responsive to the alarms over months. Five patterns we have seen erode response discipline, and the engineering moves that counter each:

1. The 2 AM noise alarm. A drift alarm fires at 2 AM on a Saturday. The on-call snoozes it. The pattern repeats. Within six weeks, drift alarms are routinely snoozed and the system might as well not exist. Counter: route drift alarms to a daytime queue, not PagerDuty Sev-1. The signal is slow-drift; the response can be slow too. Reserve Sev-1 paging for hard failures (the agent is throwing 5xx, the eval harness is producing nothing) and customer-impacting incidents.

2. The alarm with no runbook. An alarm fires, the on-call has no documented recovery path, the on-call improvises, the response is uneven across engineers. Counter: each alarm class has a linked runbook with the playbook above, the queries to run, the parameters to check. The runbook lives in the same repo as the agent and is version-controlled.

3. The metric that nobody owns. Five engineers built the harness; ten months in, three of them have moved teams; the remaining two are uncertain whether a faithfulness regression of 0.04 is a real problem. Counter: assign a named metric owner per signal. The owner reviews the metric trend weekly, makes calls on threshold changes, signs off on alarm acknowledgements.

4. The judge-model drift problem. The judge model used in the eval harness itself drifts (covered in detail in the eval article). When the judge becomes more lenient, faithfulness metrics rise without the pipeline improving. Counter: version-pin the judge, run human-SME calibration quarterly, treat judge drift as its own monitoring class.

5. The "we tuned the threshold so the alarm stopped" pattern. A team's response to a noisy alarm is to raise the threshold until the alarm stops, not to fix the underlying signal noise. Counter: threshold changes require sign-off from the metric owner and a rationale entered in the change log. Cosmetic threshold tuning becomes visible.

Friction points — what bites in real deployments

Five frictions we have engineered past:

1. Judge cost on continuous evaluation

Running the full golden-set harness daily with claim-decomposition on Claude Opus costs $20-50 per run. That is $140-350 per week, or $560-1,400 over four weeks. This is real money for an ongoing measurement, but well below the cost of one undetected production incident.

The mitigation we ship: tiered judging. Haiku acts as a first-pass classifier so that detailed judging is skipped on obviously correct cases (high similarity to the golden answer); Opus runs only on the ambiguous middle band. This drops judge cost by roughly 70%.
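A sketch of the routing decision, assuming the first pass yields a score between 0 and 1 for each item. The band edges (0.30 and 0.90), the function names, and the choice to flag the low band without detailed judging are all assumptions made for illustration.

```python
def route_judgement(first_pass_score: float, low: float = 0.30, high: float = 0.90) -> str:
    """Decide what happens to an item after the cheap first-pass judge.

    "accept"   : clearly correct, skip detailed judging
    "reject"   : clearly wrong, flag without detailed judging (assumed behaviour)
    "escalate" : ambiguous middle band, send to the expensive judge
    """
    if not 0.0 <= first_pass_score <= 1.0:
        raise ValueError("first_pass_score must be within [0, 1]")
    if first_pass_score >= high:
        return "accept"
    if first_pass_score <= low:
        return "reject"
    return "escalate"


def escalation_rate(scores: list[float], low: float = 0.30, high: float = 0.90) -> float:
    """Fraction of items that still need the expensive judge."""
    routed = [route_judgement(s, low, high) for s in scores]
    return routed.count("escalate") / len(routed)


if __name__ == "__main__":
    scores = [0.97, 0.95, 0.92, 0.60, 0.10]
    print([route_judgement(s) for s in scores])
    print("escalation rate:", escalation_rate(scores))
```

The escalation rate is the number to watch: the cost saving is only as large as the share of items that never reach the expensive judge.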

2. Drift detection latency

A 30-day rolling baseline means a gradual regression can take 14-21 days to clearly trip the 2σ alarm, and slower drift takes correspondingly longer. For workloads where regulatory exposure is high, this lag is operationally significant.

The mitigation: add a 7-day baseline alongside the 30-day, with a tighter threshold. Catches faster drifts at the cost of more false positives on legitimate week-over-week noise.

3. The eval harness becomes the production pipeline's flakiest dependency

The harness runs against Bedrock; Bedrock throttles; the harness fails; the team disables the harness to "fix it tomorrow"; it never gets re-enabled. We have seen this pattern multiple times.

The mitigation: the harness has its own retry policy with exponential backoff, jitter, and Bedrock-aware error classification (per the Step Functions article). The harness is treated as production code, with its own on-call and its own runbook for when the harness itself is failing.
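A minimal sketch of such a retry policy follows. BedrockCallError is a stand-in exception defined here so the example is self-contained; with boto3 the error code would instead be read from the ClientError response. The retryable set and the delay parameters are assumptions to adapt, not the policy from the Step Functions article.

```python
import random
import time

# Error codes treated as transient (assumed classification; adjust to the errors you observe).
RETRYABLE = {
    "ThrottlingException",
    "ServiceUnavailableException",
    "ModelTimeoutException",
    "InternalServerException",
}


class BedrockCallError(Exception):
    """Stand-in for a service error carrying an error code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def call_with_retry(fn, max_attempts=5, base_delay=0.5, max_delay=20.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(), retrying transient errors with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except BedrockCallError as err:
            is_last = attempt == max_attempts - 1
            if err.code not in RETRYABLE or is_last:
                raise  # non-retryable error, or retries exhausted
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform delay in [0, cap)


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise BedrockCallError("ThrottlingException")
        return "ok"

    print(call_with_retry(flaky, sleep=lambda s: None))
```

Injecting sleep and rng keeps the policy testable without real waiting, which matters if the harness is to be treated as production code.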

4. Per-tenant drift is hidden by aggregate metrics

A multi-tenant deployment may have a single tenant whose retrieval has drifted while the aggregate metric stays healthy. The aggregate hides the per-tenant signal.

The mitigation: per-tenant metric dimensioning, with alarms per high-value tenant. For workloads where tenants are heterogeneous in scale, the per-tenant signal is the actionable one; the aggregate is decoration.

5. The "we have monitoring but we don't have postmortems" pattern

A team builds the monitoring, the alarms fire, the on-call acts, the system recovers — and nobody writes a postmortem. The institutional learning never compounds. Six months later the team faces the same failure mode again.

The mitigation: make the postmortem a mandatory step in the alarm closure workflow. The alarm is not "closed" until a brief postmortem (cause, response, prevention) is filed in the runbook directory. The body of postmortems becomes the institutional memory.

What this taught us about enterprise scaling

Five things hold up across the Day 2 deployments we ship:

1. The architecture that makes the workload deployable is not the same as the architecture that makes it operable. Day 1 architecture is what the reference series documents. Day 2 architecture is what this article documents. Teams that ship one without the other are shipping an experiment, not a system.

2. Drift is the dominant failure mode at 12-month timescales. Catastrophic failures get attention precisely because they are catastrophic. Slow drift is the failure mode that does the actual damage to enterprise AI's reputation in the year after deployment.

3. The eval harness becomes the operations team's source of truth. Once the harness exists and produces trusted metrics, the operational conversation shifts from "I think the model got worse" to "faithfulness is down 0.07 on the policy-query class since the KB refresh." The shift in conversation quality is itself the return on the eval-harness investment.

4. The causal-attribution layer is the difference between an alarm system and a working alarm system. Alarms without attribution get ignored; attribution turns the on-call's job from a search problem into a verification problem. This is the highest-leverage piece to invest in for monitoring quality.

5. The team that maintains the day-2 architecture is the team that has earned trust to deploy more AI workloads. Customers, executives, and regulators all read this signal correctly. Teams that show up with a working day-2 story get the next greenlight; teams that show up with only day-1 do not.


Engaging with cmdev

CreativeMinds Development (cmdev) ships the day-2 monitoring architecture as a standing part of every production AI engagement. Our pattern works across banking under CBN CSAT, energy operators under NMDPRA and NIS2, fintechs under NDPA, and healthcare networks under HIPAA-equivalent regional regimes. The eval harness, the drift detection, and the recovery playbooks compose as the operability layer the day-1 architecture assumes.

Mayowa A. is CTO of CreativeMinds Development. He leads cmdev's AI engineering practice for regulated enterprises across Africa and the EU, with deployments in production for banking, energy, and critical-infrastructure customers.

