Knowledge Base Hygiene for Contact Center AI: What Connect + Bedrock Assumes You Already Did

A retail bank in the Midlands, month four of the rollout. The Connect AI Agent is live on a quarter of inbound calls. The pilot dashboards are green. Then a regulator sends a routine letter asking why three customers in the same week received refund guidance that contradicted the bank's current policy. The post-mortem takes a fortnight. The model is fine. The contact flow is fine. The retrieval call is doing exactly what it was asked to do — return the most semantically similar chunk to the caller's question, regardless of which year the policy was authored or whether it was ever current.

In the same SharePoint folder where the 2026 fee schedule lives, the 2022 version is still there, unmarked, indistinguishable to the embedding model, scoring a hair higher on cosine similarity because the older language matched the caller's older vocabulary. The agent did not lie. It retrieved a true statement from a real document and reasoned over it fluently. The document was simply no longer the bank's policy.

AWS will sell you Amazon Connect AI Agent in a thirty-minute demo. A Bedrock Knowledge Base wired into a contact flow, a managed agent loop, hybrid retrieval, citations in the response. The slide that does not appear is the one that matters most in month three: the assumption that the corpus on the other end of the retrieval call is clean. In every real contact centre we have walked into, it isn't. The KB is five to ten years of policy PDFs, internal wikis, agent macros, scanned forms, conflicting product versions, and a SharePoint folder structure designed by someone who left in 2021.

This piece picks up where the companion article on what the managed service ships and what you still build leaves off: the work the managed service does not do, in the layer most teams underestimate by an order of magnitude.

Key takeaways

The KB is the model. Connect AI Agent's behaviour is almost entirely determined by the corpus and the retrieval configuration. The model layer you get for free; the corpus is the part you own.
Conversational retrieval needs shorter chunks, higher overlap, and sentence-aware splits. The Bedrock 300/60 default is wrong for voice and chat; 150-200 tokens with 30-40% overlap is the working band.
Conflicting policy versions are the dominant failure mode in regulated contact centres. Fix corpus-side: archive prefix excluded from ingestion, effective-date metadata, single source of truth per topic.
Sensitivity classification belongs at the chunk, not the document. Document-level labels either over-restrict or over-share. Chunk classification enforced at the retrieval gateway is the only pattern that survives a compliance read.
KB hygiene is 60-70% of the engineering effort on a 20-week Connect AI rollout. The vendor slide costs it at 10-15%. The inversion is where most rollouts go sideways in month four.

Figure 1: The eight stages of KB hygiene, from source to production.

1. Source audit: walks every SharePoint site, Confluence space, and file server; typically discovers 2-3 times the expected document count and produces an ownership map.
2. De-duplication: collapses near-duplicates into a single canonical document per topic.
3. Version conflict resolution: moves deprecated policies to an archive prefix excluded from ingestion and enforces effective-date metadata at the retrieval gateway.
4. Sensitivity classification: applied at the chunk level rather than the document level, enforced as a retrieval-gateway filter before nearest-neighbour search.
5. Conversational chunking: 150-200 tokens with 30-40% overlap and a sentence-aware splitter, replacing the Bedrock default of 300 tokens with 60 overlap, which is wrong for voice.
6. Embedding refresh: driven by SharePoint and Confluence webhooks through EventBridge, with differential ingest re-embedding only changed chunks, within a 4-hour budget for routine changes and 1 hour for compliance-critical changes.
7. Golden Q&A eval gate: 200-500 real call transcripts paired with senior-agent answers, run against a no-regression criterion per segment; a fail blocks staging-to-production promotion.
8. Drift monitoring: watches the production retrieval distribution, the "I don't know" rate, the handoff rate, and the citation source mix to catch what the eval gate did not.

The corpus is your engineering surface, not AWS's. Stages 1-2 are editorial discipline with the biggest payback per hour; stage 3 and stages 7-8 are the gates that decide whether the agent ever states the wrong policy.

The corpus is the engine; the model is the steering wheel

Most teams arrive thinking the LLM is doing the work and the knowledge base is the input. This is backwards for contact-centre AI. Once you have chosen the Bedrock-managed model Connect AI Agent wraps for you, reasoning quality is fixed. The single variable that determines whether the caller hears the right answer is what came back from the retrieval call.

The model cannot save a bad retrieval. Think of the agent as a brilliant new hire on their first day, handed whatever paperwork is on the desk and asked to answer the phone. The hire is articulate, calm, and quick. They will state the contents of the document in front of them with complete confidence. If the document is wrong, the answer is wrong. Citations help the auditor afterwards. They do nothing for the caller in the moment.

The corpus is your engineering surface, not AWS's. Bedrock will ingest whatever you point at the S3 bucket. The managed service will not tell you the bucket contains 14 versions of the refund policy across three folders, or that the customer-facing FAQ and the internal call-handling guide use different terms for the same product.

What is actually in the cupboard

Four passes, starting with about half a week of inventory work, that prevent months of confused tuning.

The first is a document inventory. Walk every source the agent will read from — SharePoint sites, Confluence spaces, the customer-portal CMS, the internal wiki, call-handling guides on the file server, the policy archive nobody has opened since the last regulator visit. Most first-time efforts discover two to three times the document count they expected. This is the moment teams realise the corpus is less a library and more an attic.

The second is a version-skew count. For each policy area (refunds, claims, account changes, KYC), count the documents on that topic. Anything above one is a version-skew candidate. In a regulated contact centre we audited recently, "refund policy" had nine documents: three customer-facing versions, two internal call-handling versions, two compliance versions, a deprecated 2024 policy, and a draft 2026 update in a SharePoint folder named "WIP do not use" that the agent could read perfectly well.
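The version-skew count is simple enough to script once the inventory exists. A minimal sketch, assuming the inventory can be reduced to (document, topic) pairs; the file names and topics below are hypothetical:

```python
from collections import Counter

# Hypothetical inventory rows: (document name, policy topic).
inventory = [
    ("refund-policy-customer-v3.pdf", "refunds"),
    ("refund-call-handling.docx", "refunds"),
    ("refund-policy-2024.pdf", "refunds"),
    ("kyc-checklist.pdf", "kyc"),
]

def version_skew_candidates(rows):
    """Return topics that have more than one document, mapped to their counts."""
    counts = Counter(topic for _, topic in rows)
    return {topic: n for topic, n in counts.items() if n > 1}

skew = version_skew_candidates(inventory)
print(skew)
```

Every topic in the result needs a human decision about which document is current.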

The third is a last-updated distribution. Plot documents by last-modified date. A healthy corpus has a tail that drops sharply past 18 months. An unhealthy corpus has a flat plateau going back five years, with no metadata distinguishing "still current" from "operationally dead". The shape tells you how much deprecation work is required.
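Before plotting anything, a rough number for the stale share already shows which shape the corpus has. A sketch under the assumption that last-modified dates are available as `date` objects; the 18-month cutoff comes from the paragraph above, and the sample dates are invented:

```python
from datetime import date

def months_old(modified, today):
    """Whole calendar months between a last-modified date and today."""
    return (today.year - modified.year) * 12 + (today.month - modified.month)

def age_profile(modified_dates, today, cutoff_months=18):
    """Split documents into recent and stale around the cutoff and report the stale share."""
    stale = sum(1 for d in modified_dates if months_old(d, today) > cutoff_months)
    total = len(modified_dates)
    return {
        "recent": total - stale,
        "stale": stale,
        "stale_share": stale / total if total else 0.0,
    }

# Hypothetical sample: two recently edited documents, two long-untouched ones.
sample = [date(2026, 1, 10), date(2025, 3, 1), date(2024, 11, 1), date(2021, 5, 1)]
profile = age_profile(sample, today=date(2026, 6, 1))
print(profile)
```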

The fourth is an ownership map. For each policy area, who owns the source of truth? In the worst contact centres, nobody knows. The ownership map is the upstream end of the freshness pipeline; without it, change-detection has nowhere to land.

The audit output is a single inventory table — document, source, topic, version, last-modified, owner, classification, deprecation decision. Every subsequent stage hangs on it.
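One way to pin the inventory table down is a typed row that every later stage can read. This is an illustrative sketch: the field names mirror the columns listed above, while the example values and the CSV output format are assumptions.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional

@dataclass
class InventoryRow:
    document: str
    source: str
    topic: str
    version: str
    last_modified: str          # ISO date string
    owner: Optional[str]        # None means nobody has claimed the document yet
    classification: str
    deprecation_decision: str   # illustrative values: "keep" or "archive"

def to_csv(rows):
    """Serialise inventory rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(InventoryRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()

rows = [InventoryRow("refund-policy-2024.pdf", "sharepoint", "refunds", "2024",
                     "2024-01-15", None, "public", "archive")]
print(to_csv(rows))
```

An empty owner column is itself a finding: it marks the documents where change detection has nowhere to land.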

A conversation is not an essay

Bedrock's defaults — 300 tokens, 60 overlap — are designed for document Q&A and they are wrong for voice and chat.

A voice caller does not ask one self-contained question. They ask a fragment, the agent clarifies, they answer, retrieval fires again. The shape resembles a tennis rally, not a lecture. Each retrieval call sees a short query, not a paragraph, and short queries match better against short chunks. Conversational retrieval also has a lower latency budget than document Q&A, and the answer is a sentence or two, so stuffing 300-token chunks into the context dilutes attention with content the answer will not use.

The working configuration:

Parameter	Document Q&A default	Conversational working band
Chunk size	300 tokens	150-200 tokens
Overlap	60 tokens (20%)	50-80 tokens (30-40%)
Splitter	Token boundary	Sentence-aware
Strategy	Fixed-size	Hierarchical for policy text; fixed-size for FAQs

The sentence-aware split is the underrated lever. A splitter that cuts on raw token boundaries slices sentences in half, producing chunks that read as nonsense to the embedder and the model: the equivalent of tearing a page diagonally and asking someone to read the result aloud. A sentence-aware splitter, set in the advanced chunking configuration or applied via a Lambda transform on the ingest path, makes every chunk a coherent unit.
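To make the idea concrete, here is a minimal sentence-aware chunker with overlap. It is a sketch, not the Bedrock implementation: tokens are approximated by whitespace-separated words, the sentence splitter is a naive regex, and the defaults (180 tokens, 35% overlap) are simply picked from inside the working band above.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def token_count(text):
    """Approximate token count by whitespace-separated words (an assumption)."""
    return len(text.split())

def chunk_sentences(text, max_tokens=180, overlap_ratio=0.35):
    """Pack whole sentences into chunks, carrying trailing sentences forward as overlap."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        size = sum(token_count(s) for s in current)
        if current and size + token_count(sentence) > max_tokens:
            chunks.append(" ".join(current))
            # Carry whole trailing sentences forward until the overlap budget is used up.
            budget, carried = max_tokens * overlap_ratio, []
            for prev in reversed(current):
                if sum(token_count(s) for s in carried) + token_count(prev) > budget:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_sentences(demo, max_tokens=6, overlap_ratio=0.5):
    print(chunk)
```

Because the overlap is carried in whole sentences, a chunk never starts or ends mid-sentence. A single sentence longer than `max_tokens` still becomes its own oversized chunk, which a production splitter would need to handle.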

Hierarchical chunking is right for policy text where the answer depends on the section it sits in. The child chunk drives retrieval precision; the parent chunk returns to the model with surrounding context. For FAQ corpora, fixed-size aligned to one or two FAQs per chunk is right.
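The parent and child relationship can be sketched in a few lines. This is a generic illustration of the pattern rather than the managed service's internals; sections are assumed to arrive as a mapping from heading to sentences, and the names are hypothetical.

```python
def build_children(sections, sentences_per_child=2):
    """Split each section into small child chunks that remember their parent heading."""
    children = []
    for heading, sentences in sections.items():
        for i in range(0, len(sentences), sentences_per_child):
            children.append({
                "parent_id": heading,
                "text": " ".join(sentences[i:i + sentences_per_child]),
            })
    return children

def parents_for(matched_children, sections):
    """Swap matched child chunks for their parent sections, de-duplicated in match order."""
    seen, context = set(), []
    for child in matched_children:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parent_id + "\n" + " ".join(sections[parent_id]))
    return context

sections = {
    "Refund eligibility": ["A.", "B.", "C."],
    "Exceptions": ["D."],
}
children = build_children(sections)
print(parents_for([children[1], children[0]], sections))
```

Search runs over the small `children` records; the model receives the parent text, so two hits in the same section cost one parent rather than two overlapping fragments.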

The problem of two truths

This is the failure mode that bites every regulated contact centre, and the one the managed service does nothing about.

The audit surfaced the version skew. The retrieval layer cannot tell which is current. Cosine similarity returns whichever document has the closest semantic match, irrespective of whether it was deprecated in 2023. The model reasons fluently over whatever came back. The agent states an old policy with confidence.

The fix has three layers, all corpus-side.

The first is deprecation, not deletion. Old policies cannot simply be deleted; the regulator may require their retention. The pattern: a parallel archive/ prefix in the S3 source bucket. Deprecated policies move there the moment they are superseded; the inclusion prefix excludes archive/. The archive remains accessible to humans who need it but is invisible to retrieval. Deprecation is a deliberate, dated, owned act — not a slow drift of files going stale in place.
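The archive convention is easy to check in code before any sync runs. A small sketch, assuming object keys are plain strings and that deprecated documents are filed under a dated `archive/` path; the dated sub-folder is an illustrative choice, not something the pattern requires.

```python
from datetime import date

ARCHIVE_PREFIX = "archive/"  # assumed layout: deprecated policies live under this prefix

def ingestible_keys(keys, archive_prefix=ARCHIVE_PREFIX):
    """Keep only object keys that sit outside the archive prefix."""
    return [k for k in keys if not k.startswith(archive_prefix)]

def archive_key(key, superseded_on, archive_prefix=ARCHIVE_PREFIX):
    """Build the destination key for a deprecated document, dated so the move is auditable."""
    return f"{archive_prefix}{superseded_on.isoformat()}/{key}"

old_key = "policies/refunds-2024.pdf"
moved = archive_key(old_key, date(2026, 3, 31))
print(moved)
print(ingestible_keys(["policies/refunds-2026.pdf", moved]))
```

Running `ingestible_keys` over a bucket listing as a pre-sync assertion catches the case where someone restores an archived file into a live folder.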

The second is effective-date metadata on every chunk. Each chunk carries effective_from and effective_until. The 2024 refund chunk carries effective_from: 2024-01-01, effective_until: 2026-03-31; the 2026 update carries effective_from: 2026-04-01, effective_until: null. Retrieval filters by effective_from <= today AND (effective_until is null OR effective_until >= today), so a future-dated update stays out until it takes effect and a superseded policy remains valid through its last day. The filter is mandatory and applied at the retrieval gateway, not in the prompt. Think of it as a passport check at the door, not a polite request from a host at the dinner table. Prompt-layer filtering is bypassable; retrieval-layer filtering is the trust boundary.
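A sketch of the effective-date check, using the two refund chunks from the example. It treats both dates as inclusive and also checks `effective_from`, so a chunk dated in the future is held back until its start date; both choices are assumptions about how the dates are meant to be read.

```python
from datetime import date

def is_effective(chunk, today):
    """True if today falls inside the chunk's effective window (both ends inclusive)."""
    starts = date.fromisoformat(chunk["effective_from"])
    ends = chunk["effective_until"]
    if starts > today:
        return False
    return ends is None or date.fromisoformat(ends) >= today

def current_chunks(chunks, today):
    """Apply the effective-date filter to a candidate set."""
    return [c for c in chunks if is_effective(c, today)]

chunks = [
    {"id": "refund-2024", "effective_from": "2024-01-01", "effective_until": "2026-03-31"},
    {"id": "refund-2026", "effective_from": "2026-04-01", "effective_until": None},
]
print([c["id"] for c in current_chunks(chunks, date(2026, 3, 31))])
print([c["id"] for c in current_chunks(chunks, date(2026, 4, 1))])
```

On any given day exactly one of the two refund chunks passes, which is the property the retrieval gateway needs.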

The third is a single source of truth per topic. The ownership map feeds a curation discipline: per policy area, exactly one current document, one named owner. Parallel "current" documents — one customer-facing, one internal — get reconciled into a single source with a sensitivity-classified internal section, not maintained as two files that will drift within six months. Editorial work, not engineering work, and the highest retrieval-quality payback per hour.

All three together eliminate the contradictory-policy failure at the architectural layer. Without all three, retrieval-quality tuning keeps producing the wrong answer for the right reason.

A redacted page, not a locked filing cabinet

A single policy document is rarely uniformly classified. The customer-facing refund procedure is public. The internal exceptions matrix in the appendix is restricted. The PII-handling addendum cross-references NDPA Section 39. Document-level classification either over-restricts (the public section becomes unreachable) or over-shares (the restricted appendix leaks).

The right mental model is a hospital chart. The cover sheet with the patient's name and admission date is one classification. The clinician's notes are another. The mental-health appendix is a third. You do not lock the whole file because of the appendix, and you do not hand the appendix to anyone who needs the cover sheet. The chunk, not the document, is the unit of classification.

The working pattern, consistent with the five-gate model in DSPM meets RAG:

Each chunk carries an explicit classification field — public, internal, confidential, restricted — set at write time by a classifier over the chunk text, or by section-level annotations the chunker respects. For policy text, labelled section headings the chunker propagates work well.

The retrieval gateway enforces classification as a mandatory filter. An authenticated customer gets public plus their own confidential records. An unauthenticated IVR caller gets only public. An agent-assist surface where a human agent is logged in gets public and internal. The filter applies before nearest-neighbour, not after.

Restricted-tier content is not embedded at all. It remains in the source system, retrievable through a lookup tool the agent can call only when the principal's classification permits. The vector store is not the system of record for restricted content; the source is.
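The gateway rule above can be written as a filter over the candidate set, applied before any similarity ranking. A minimal sketch: the principal type names and the `customer_id` field are hypothetical, and restricted content does not appear because it is never embedded.

```python
# Classifications each principal type may see (names are illustrative).
ALLOWED = {
    "unauthenticated_caller": {"public"},
    "authenticated_customer": {"public", "confidential"},
    "agent_assist": {"public", "internal"},
}

def visible_chunks(chunks, principal_type, customer_id=None):
    """Filter chunks by classification before nearest-neighbour ranking runs."""
    allowed = ALLOWED.get(principal_type, set())  # unknown principals see nothing
    visible = []
    for chunk in chunks:
        label = chunk["classification"]
        if label not in allowed:
            continue
        # Confidential records are visible only to the customer they belong to.
        if label == "confidential" and (
            customer_id is None or chunk.get("customer_id") != customer_id
        ):
            continue
        visible.append(chunk)
    return visible

chunks = [
    {"id": "faq-1", "classification": "public"},
    {"id": "matrix-1", "classification": "internal"},
    {"id": "acct-7", "classification": "confidential", "customer_id": "c-7"},
    {"id": "acct-9", "classification": "confidential", "customer_id": "c-9"},
]
print([c["id"] for c in visible_chunks(chunks, "authenticated_customer", "c-7")])
```

The default for an unrecognised principal is an empty result, so a misconfigured surface fails closed rather than open.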

A KB stale for a week is wrong by Friday

A KB that ingested correctly six months ago and has not been refreshed is now wrong. The slide says "weekly re-ingestion". Production reality is hours-to-low-single-digit-days for regulated workloads.

Three components.

Change detection at the source. SharePoint, Confluence, and Notion all expose change notifications, whether as webhooks or as Microsoft Graph change notifications. A small Lambda listens, identifies affected documents, and pushes a delta event to an EventBridge bus. This replaces "re-scan on a cron", which works for small corpora and falls over at scale.
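The Lambda's core job is to turn a source notification into an event on the bus. A sketch of that step: the bus name, source naming, and detail fields are assumptions, and the EventBridge client is injected (in production it would be `boto3.client("events")`) so the logic runs without AWS access.

```python
import json

def build_delta_entry(source_system, document_id, change_type, bus_name="kb-freshness"):
    """Shape one change notification as an EventBridge PutEvents entry."""
    return {
        "Source": f"kb.{source_system}",
        "DetailType": "DocumentChanged",
        "Detail": json.dumps({"document_id": document_id, "change_type": change_type}),
        "EventBusName": bus_name,
    }

def publish(events_client, entries, batch_size=10):
    """Send entries in batches; PutEvents accepts at most 10 entries per call."""
    for i in range(0, len(entries), batch_size):
        events_client.put_events(Entries=entries[i:i + batch_size])

entry = build_delta_entry("sharepoint", "doc-1", "updated")
print(entry["Detail"])
```

Keeping the entry builder separate from the publish call makes the event shape testable on its own.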

Differential ingestion. When a delta event arrives, the pipeline re-fetches, re-classifies, re-chunks, and re-embeds only the changed chunks. The Bedrock Knowledge Base ingestion API supports this; cost stays proportional to change volume rather than corpus size.
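Deciding which chunks changed is a pipeline-side comparison of content hashes. The sketch below shows that logic only; it is not the Bedrock ingestion API, and it assumes the previous run's chunk hashes were stored keyed by a stable chunk id.

```python
import hashlib

def fingerprint(text):
    """Content hash used to detect whether a chunk's text changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_chunks(previous_hashes, current_texts):
    """Compare stored hashes with freshly chunked text.

    Returns (ids to embed, ids to delete, new hash map to store).
    """
    current_hashes = {cid: fingerprint(text) for cid, text in current_texts.items()}
    to_embed = sorted(
        cid for cid, digest in current_hashes.items() if previous_hashes.get(cid) != digest
    )
    to_delete = sorted(cid for cid in previous_hashes if cid not in current_hashes)
    return to_embed, to_delete, current_hashes

previous = {"a": fingerprint("x"), "b": fingerprint("y"), "c": fingerprint("z")}
current = {"a": "x", "b": "y changed", "d": "new"}
to_embed, to_delete, stored = diff_chunks(previous, current)
print(to_embed, to_delete)
```

Unchanged chunks never reach the embedding call, which is what keeps cost proportional to the size of the edit.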

Eval gate before publication. A changed document does not go live until the eval suite passes. Changed chunks land in a staging index; the regression eval runs against staging; on pass, staging chunks promote to production atomically. On fail, the change is held and the owner is notified. This stops the case where someone edits the refund policy badly, the bad version ingests, and the agent confidently states it for three days before anyone notices.

End-to-end budget under four hours for routine changes, under one hour for compliance-critical changes. Weekly re-ingestion is a delayed-failure pipeline, not a freshness pipeline.

What real callers actually sound like

The retrieval harness from RAG with Bedrock Knowledge Bases is the foundation. For contact-centre AI it specialises in four ways.

Real call transcripts, not synthetic queries. 200 to 500 historical caller queries, each paired with the answer a senior agent would have given and the source document it came from. Engineer-built synthetic queries are too clean — real callers are vague, mid-sentence, code-switching, asking three things at once. Reflect that, or the scores overstate production quality.

Contact-centre-specific signals beyond hit rate, rank, faithfulness, and citation accuracy. No-answer correctness — when the corpus does not contain the answer, did the agent say so? Handoff appropriateness — did the agent escalate when it should have, and not when it should not have? Policy-version accuracy — did the answer come from the current policy?

Regression on every update. The eval gate runs against the staging index for each change event. The criterion is not "absolute scores above a threshold" but "no regression against the previous production baseline on any segment".
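The no-regression criterion reduces to a per-segment comparison. A minimal sketch, assuming each eval run yields one score per segment; the segment names, scores, and the optional tolerance are illustrative.

```python
def regressions(baseline, candidate, tolerance=0.0):
    """Return segments where the candidate scores below baseline by more than tolerance."""
    failed = {}
    for segment, base_score in baseline.items():
        new_score = candidate.get(segment)
        # A segment missing from the candidate run counts as a failure.
        if new_score is None or new_score < base_score - tolerance:
            failed[segment] = (base_score, new_score)
    return failed

def may_promote(baseline, candidate, tolerance=0.0):
    """Staging promotes to production only when no segment regressed."""
    return not regressions(baseline, candidate, tolerance)

baseline = {"refunds": 0.90, "kyc": 0.80}
candidate = {"refunds": 0.92, "kyc": 0.75}
print(regressions(baseline, candidate))
print(may_promote(baseline, candidate))
```

Here the overall average barely moves, yet the gate blocks promotion because the KYC segment dropped. An absolute threshold would miss that.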

Drift detection on the production response stream. Distribution shift in retrieval scores, the "I don't know" rate, the escalation rate, citations to particular source documents. Drift is the early signal that something changed which the regression gate did not catch.
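A first drift check can be as plain as comparing production rates against a baseline window. This sketch assumes the signals are already aggregated into rates; the 25% relative-change threshold is an arbitrary placeholder to tune, not a recommendation.

```python
def drifted_signals(baseline_rates, current_rates, max_relative_change=0.25):
    """Flag signals whose rate moved more than max_relative_change against baseline."""
    flagged = {}
    for name, base in baseline_rates.items():
        now = current_rates.get(name, 0.0)
        if base:
            change = abs(now - base) / base
        else:
            # A signal that was zero and is now non-zero always counts as drift.
            change = float("inf") if now else 0.0
        if change > max_relative_change:
            flagged[name] = change
    return flagged

baseline_rates = {"idk_rate": 0.08, "handoff_rate": 0.20}
current_rates = {"idk_rate": 0.12, "handoff_rate": 0.21}
print(drifted_signals(baseline_rates, current_rates))
```

In this example the "I don't know" rate rose by half while the handoff rate barely moved, so only the first is flagged for investigation.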

The cheapest hallucination reduction is admitting the gap

The hardest design problem is the negative case. When retrieval returns nothing the model can confidently answer from, the agent has to do three things, and none of them is "guess".

Admit the gap. The system prompt instructs the model: if retrieved context does not support an answer, say so. Necessary but not sufficient — models still hallucinate under pressure from a polite caller asking again. The guardrail is the response-time grounding check (Bedrock Guardrails' contextual grounding, or an equivalent) that rejects responses whose claims do not trace to retrieved chunks.

Take the right next action. The handoff policy is a small piece of business logic — not in the model, not in the prompt — that decides what happens. Route to a human, take a callback, offer self-service, escalate to a specialist queue. Per-intent and per-classification; data, not code; the operations team owns it.
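"Data, not code" can be taken literally: the policy lives in a document the operations team edits, and the code only looks it up. A sketch with a hypothetical JSON layout; the intents, classifications, and action names are invented for illustration.

```python
import json

# Hypothetical policy document owned by the operations team.
POLICY_JSON = """
{
  "default": "route_to_human",
  "rules": {
    "refunds": {"public": "offer_self_service", "confidential": "route_to_human"},
    "kyc": {"confidential": "specialist_queue"},
    "card_dispute": {"public": "take_callback"}
  }
}
"""

def load_policy(raw):
    """Parse the handoff policy from its JSON form."""
    return json.loads(raw)

def next_action(policy, intent, classification):
    """Look up the handoff action; fall back to the default when no rule matches."""
    return policy["rules"].get(intent, {}).get(classification, policy["default"])

policy = load_policy(POLICY_JSON)
print(next_action(policy, "kyc", "confidential"))
print(next_action(policy, "mortgages", "public"))
```

Changing where a KYC gap lands is then an edit to the policy document, with no model, prompt, or deployment change.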

Leave a clean trail. The no-answer event is logged with the query, retrieval results, gap classification, and handoff action. The log feeds the KB-gap report that drives content creation, and the audit trail that proves the agent did not invent an answer.
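The no-answer record itself can be a small structured log line. A sketch of one possible shape: the four fields come from the paragraph above, while the field names, the gap labels, and the choice to log chunk ids and scores rather than chunk text are assumptions.

```python
import json
from datetime import datetime, timezone

def no_answer_event(query, retrieval_results, gap_classification, handoff_action, now=None):
    """Build one JSON log line for a no-answer event."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "event": "no_answer",
        "timestamp": now.isoformat(),
        "query": query,
        # Log ids and scores only; chunk text stays in the index.
        "retrieval_results": [
            {"chunk_id": r["chunk_id"], "score": r["score"]} for r in retrieval_results
        ],
        "gap_classification": gap_classification,
        "handoff_action": handoff_action,
    }, sort_keys=True)

line = no_answer_event(
    "can I get a refund on a closed account",
    [{"chunk_id": "r-1", "score": 0.41, "text": "..."}],
    "no_matching_policy",
    "route_to_human",
    now=datetime(2026, 1, 5, tzinfo=timezone.utc),
)
print(line)
```

Grouping these lines by `gap_classification` gives the KB-gap report; reading them in sequence gives the audit trail.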

The cheapest hallucination reduction available: every policy area has at least one chunk that explicitly says "for cases outside this policy, contact a human agent". The answer becomes "I can't help with this case directly — let me hand you to an agent" instead of an improvisation.

What the vendor slide costs wrong

A 20-week Connect AI Agent rollout, costed honestly:

Workstream	Vendor slide	Actual for a regulated buyer
Connect provisioning, contact flow, agent configuration	35-40%	10-15%
KB hygiene (audit, deprecation, chunking, classification, freshness)	10-15%	60-70%
Eval suite (golden set, regression harness, drift detection)	5-10%	8-12%
Integration with CRM, ticketing, identity	15-20%	5-10%
Compliance audit log layer	5%	5-8%
Pilot, tuning, go-live	15-20%	5-8%

The misread is consistent. Connect provisioning is genuinely fast — the managed service does what the slide says. The KB work is genuinely slow — and it is the part that determines whether the agent works for callers.

Two practical consequences. The team composition is wrong if the rollout is staffed with Connect specialists and AI engineers but no content owners, no document curators, and no policy specialists from the regulated business. Those are the people who do the audit, own deprecation calls, and maintain the freshness pipeline. They are not bonus headcount; they are the critical path.

The timeline is wrong if KB work runs parallel with Connect integration. A defensible plan front-loads weeks 1-8 on the corpus (at minimum the audit and deprecation pass) and spends weeks 9-20 on Connect-side integration, eval tuning, and go-live. The teams we see succeed do this; the teams we see slip in month four did not.

The Connect AI Agent is a confident new hire. Whether you handed them the right paperwork on day one decides everything that follows.

FAQs

How long does the corpus audit actually take?

For a single-line-of-business contact centre with 500-2,000 documents, half a week to a week for the inventory, plus one to two weeks for version-skew and ownership-map work. For multi-product or multi-region, two to four weeks total. The audit is the part most teams cut to hit a kick-off date and the part they pay back at 5-10x cost during pilot tuning.

Why are shorter chunks better for voice than for document Q&A?

Voice queries are fragments, not paragraphs. Short queries match better against short chunks. Voice also has a lower retrieval latency budget and produces short answers, so 300-token chunks dilute attention against content the answer will not use. 150-200 tokens with 30-40% overlap and sentence-aware splits is the working band.

How do we handle two policies that say opposite things?

Corpus-side, not model-side. Move deprecated policies to an archive prefix the ingestion excludes. Carry effective-date metadata on every chunk and filter retrieval by current effective date. Curate to a single source of truth per policy area. The combination eliminates the conflict at the retrieval layer rather than asking the model to reason about which version is current — a job models do badly.

Where does sensitivity classification belong — document or chunk?

Chunk. A single policy frequently mixes public, internal, and restricted content across sections. Document-level labels either over-restrict or over-share. Chunk-level classification enforced at the retrieval gateway as a mandatory filter against the caller's principal is the only pattern that lets the agent answer the public question without exposing the restricted one.

What is the realistic share of KB hygiene work in a 20-week rollout?

60-70% of the engineering effort on a regulated-buyer rollout. The vendor slide costs it at 10-15%. Team composition and timeline both have to reflect the real distribution — content owners and policy specialists are critical path, and the corpus work has to front-load before Connect-side integration has anything stable to measure against.

How to engage

We design and run knowledge base hygiene programmes for Connect AI Agent and custom Bedrock-backed contact-centre rollouts — the audit, the deprecation pass, the conversational chunking discipline, the chunk-level sensitivity classification, the freshness pipeline, and the eval harness that survives a compliance review. Talk to us at creativeminds.dev/contact.