Key takeaways
- The production architecture is a five-stage pipeline — Connect contact flow, Lex for intent and slot capture, Bedrock Knowledge Bases for retrieval, a Bedrock Agent for action-taking, Polly Neural or a Bedrock voice model for response — with a deterministic supervisor handoff at the end of every path.
- End-to-end latency target is 1.2-1.6 seconds from end-of-user-utterance to first audible response token. Streaming TTS, prompt caching, and warm Lambda paths are what make the target reachable; buffered TTS makes it unusable.
- Lex and Bedrock compose, not compete. Lex captures structured slots deterministically; Bedrock reasons over the unstructured remainder. Slot validation against business rules happens before Bedrock is invoked, not after.
- A 4-minute fully-automated call lands around $0.18-0.32 fully loaded; the same call escalated to a human at minute three is 3-4x that figure once agent time is priced in. Automation rate is the unit-economics lever.
- The pattern breaks on long calls (context bloat past 12 turns), on dialect coverage outside North American English (Nigerian English, Indian English, and Latin American Spanish all need explicit ASR tuning), and on supervisor handoff if transcript, intent stack, and KB citations aren't passed across cleanly.
The contact-centre AI conversation has changed shape
Eighteen months ago, a generative-AI contact-centre brief looked like a chatbot proof-of-concept. Today the same brief looks like a production architecture problem. The model is no longer the interesting question — Claude Sonnet, Nova Pro, or Mistral Large will all hold an adequate conversation about a refund policy. The interesting questions sit around the model: how fast the first word reaches the caller, how reliably the agent identifies when it cannot help, how the supervisor picks up a call mid-stream with full context, and what it costs per minute at the volumes a real contact centre runs.
The architecture that wins is reasonably stable now. Connect handles telephony, IVR, queueing, and routing. Lex handles intent detection and structured slot capture. Bedrock Knowledge Bases handle grounded retrieval. A Bedrock Agent handles action-taking under explicit permissions. Polly Neural (or, increasingly, Bedrock-hosted voice models) handles text-to-speech. Each piece has a job; the integration between them is where the engineering work lives.
This piece describes the working pattern we ship for that integration, including the failure modes that only surface in production traffic.
The architecture overview
The shape of a production Connect-plus-Bedrock voice pipeline:
- Caller dials in. Connect picks up; the contact flow runs.
- Contact flow invokes Lex for intent detection and slot capture.
- Slot validation against business rules — captured slots (account number, claim ID, date range) are validated against the system of record through Lambda before Bedrock is invoked. Wrong account number means re-prompt, not Bedrock.
- Bedrock Knowledge Base retrieval — with validated slots in hand, the KB returns top-K chunks with citations and metadata.
- Bedrock Agent reasoning and action — a Bedrock Agent reasons over retrieved chunks plus conversation state and decides: respond, invoke an action group (process refund, open a case, book an appointment), or escalate.
- Response synthesis — agent text streams through Polly Neural or a Bedrock voice model back to the caller.
- Escalation gate — every turn checks an escalation predicate. If it fires, the call hands off to a human supervisor with full transcript, intent stack, KB citations, and reasoning trace.
The composition is what matters. Lex without Bedrock cannot handle the unstructured request ("I changed jobs last month and I'm not sure my dependants are still covered"). Bedrock without Lex cannot reliably capture the structured part ("my member ID is 8847-2261-0034") at the precision downstream systems require. Together they cover the request surface a competent human agent would, with a deterministic escalation path for the rest.
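The escalation gate in the flow above can be sketched as a plain, deterministic predicate evaluated on every turn. This is a minimal illustration rather than the shipped implementation: `TurnState`, `should_escalate`, the reason strings, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class TurnState:
    """Hypothetical per-call state carried from turn to turn."""
    failed_validations: int = 0
    caller_asked_for_human: bool = False
    agent_confidence: float = 1.0  # assumed to be produced by the reasoning step
    intent_stack: list = field(default_factory=list)


def should_escalate(state: TurnState,
                    max_failed_validations: int = 2,
                    min_confidence: float = 0.6) -> Tuple[bool, str]:
    """Return (escalate?, reason). No model call is involved in the decision."""
    if state.caller_asked_for_human:
        return True, "caller_requested_human"
    if state.failed_validations > max_failed_validations:
        return True, "repeated_slot_validation_failure"
    if state.agent_confidence < min_confidence:
        return True, "low_agent_confidence"
    return False, ""
```

Keeping the predicate outside the model means the reason string can be logged and passed to the supervisor unchanged.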
Bedrock Knowledge Bases for contact-centre corpora
Retrieval separates a contact-centre agent that grounds its answers from one that hallucinates policy. Four things matter.
Corpus hygiene matters more than chunking strategy. A contact-centre corpus is policy documents, procedure manuals, FAQs, scripts, call-handling guides — half out of date on any given day. The first move on every engagement is a corpus audit. If the 2024 and 2026 refund policies both sit in the bucket with no distinguishing metadata, the agent quotes whichever ranks higher on cosine similarity. That is a corpus problem, not a model problem.
Chunking for conversational retrieval is tighter than for document Q&A. Smaller chunks (200-300 tokens) with tighter top-K (3-5 after re-ranking) keep retrieval and synthesis inside the 1.2-1.6 second end-to-end envelope. Hierarchical chunking is right for long policy documents: small leaf chunks for precision, parent chunks returned for context.
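The leaf-and-parent idea can be illustrated in a few lines. This sketch is independent of any managed chunking configuration and makes two loud assumptions: parents are separated by blank lines, and whitespace-separated words stand in for tokens. `Leaf`, `hierarchical_chunks`, and `context_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    parent_id: int  # index of the parent section this leaf was cut from
    text: str


def hierarchical_chunks(document: str, leaf_words: int = 200):
    """Split on blank lines into parents, then cut each parent into small leaves."""
    parents = [p.strip() for p in document.split("\n\n") if p.strip()]
    leaves = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), leaf_words):
            leaves.append(Leaf(pid, " ".join(words[start:start + leaf_words])))
    return parents, leaves


def context_for(leaf: Leaf, parents: list) -> str:
    """Search matches on the leaf; the model is handed the whole parent."""
    return parents[leaf.parent_id]
```

The leaf is what gets embedded and ranked, so precision stays high, while the parent restores the surrounding clauses a policy answer usually depends on.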
Sensitivity classification at ingestion, not retrieval. Each document carries a metadata sidecar with classification (public, customer-visible, internal-only) and residency (EU, Nigeria, UK, US). Retrieval calls are issued with mandatory filters — classification IN ('public','customer-visible') AND residency = caller_residency — enforced at the API layer, not the prompt layer. NDPA and GDPR compliance ride on this being non-negotiable.
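One way to make the filter mandatory is to route every retrieval through a wrapper that builds it, so callers cannot omit it. The sketch below is an assumption-laden illustration: the dict shape is modelled on the Knowledge Bases metadata filter operators (`andAll`, `in`, `equals`) and should be checked against the current API reference, and `build_retrieval_filter`, `guarded_retrieve`, and `retrieve_fn` are hypothetical names.

```python
ALLOWED_CLASSIFICATIONS = ("public", "customer-visible")
KNOWN_RESIDENCIES = {"EU", "Nigeria", "UK", "US"}


def build_retrieval_filter(caller_residency: str) -> dict:
    """Build the classification + residency filter; refuse unknown residencies."""
    if caller_residency not in KNOWN_RESIDENCIES:
        raise ValueError(f"unknown residency: {caller_residency!r}")
    return {
        "andAll": [
            {"in": {"key": "classification",
                    "value": list(ALLOWED_CLASSIFICATIONS)}},
            {"equals": {"key": "residency", "value": caller_residency}},
        ]
    }


def guarded_retrieve(retrieve_fn, query: str, caller_residency: str):
    """retrieve_fn is a hypothetical callable wrapping the real retrieval client.

    The filter is built here, never by the caller and never in the prompt.
    """
    return retrieve_fn(query=query, filter=build_retrieval_filter(caller_residency))
```

Because the wrapper raises on an unknown residency, a missing or malformed caller region fails closed instead of falling back to an unfiltered search.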
Re-ranking is on for production. Cohere Rerank's 10-30% precision gain on hard queries is worth the 80-150ms latency wherever wrong answers carry real consequence. The exception is FAQ-lookup where retrieval is already near-perfect.
The deeper RAG playbook lives in RAG with Bedrock Knowledge Bases; the contact-centre additions are the residency filter, the smaller chunk size, and the corpus audit as a precondition.
Voice integration — the part most teams get wrong
Voice is where the pattern stops being a chat application and becomes a real-time systems problem. AWS publishes a reference implementation at aws-samples/sample-amazon-connect-bedrock-agent-voice-integration — Connect media streaming into a Lambda that calls a Bedrock Agent and streams the response through Polly. Read the code, then engineer around its assumptions.
Natural conversational pacing is 200-400ms between turns; beyond 800ms callers ask "hello?"; beyond 1.5 seconds they assume the system has failed. End-to-end budget from end-of-user-utterance to first audible response token:
| Stage | Latency | Notes |
|---|---|---|
| ASR transcription (Connect → Lex) | 150-300ms | Streaming partial results; Lex native |
| Slot validation Lambda | 30-80ms warm / 800-1500ms cold | Provisioned concurrency mandatory |
| KB retrieval (embedding + search + rerank) | 250-450ms | Re-ranking is the biggest contributor within this stage |
| Bedrock Agent reasoning (first token) | 400-800ms | Prompt caching cuts this 30-50% on later turns |
| TTS synthesis (first audible, streaming) | 100-200ms | Polly Neural streaming |
| End-to-end target | 1.2-1.6 seconds | Without streaming: 2.5-3.5 seconds, which is unusable |
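A per-turn budget check makes the table operational. The sketch below is illustrative only: the stage names and the sample timings are made up (chosen to sit inside the ranges above), and `check_budget` is a hypothetical helper.

```python
BUDGET_MS = 1600  # upper end of the end-to-end target


def check_budget(stage_ms: dict, budget_ms: int = BUDGET_MS) -> dict:
    """Sum per-stage latencies for one turn and name the largest contributor."""
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "within_budget": total <= budget_ms,
        "largest_stage": max(stage_ms, key=stage_ms.get),
    }


# Illustrative timings for a warm turn, in milliseconds.
warm_turn = {
    "asr": 200,
    "slot_validation": 50,
    "kb_retrieval": 350,
    "agent_first_token": 600,
    "tts_first_audio": 150,
}
# The same turn with a cold-started validation Lambda.
cold_turn = dict(warm_turn, slot_validation=1200)

print(check_budget(warm_turn))
print(check_budget(cold_turn))
```

Run against real traces, the `largest_stage` field shows at a glance whether a slow turn came from a cold start, retrieval, or the agent.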
The single most important architectural decision is streaming vs buffered TTS. Buffered synthesises the whole response then plays it. Streaming receives tokens as they arrive and plays the first segment while later ones generate. Polly Neural supports streaming through Connect's media streaming API; Bedrock voice models support it through InvokeModelWithResponseStream. Use streaming. The difference between 600ms first-audible-token and 1.8 seconds is the difference between a usable voice agent and an unusable one.
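The streaming side can be sketched without any AWS call: buffer model tokens only until a sentence boundary, then hand that sentence to TTS while the rest is still generating. This is a simplified illustration under stated assumptions; `speakable_segments` is a hypothetical helper, and the sentence-boundary regex is deliberately naive (it will split after abbreviations such as "Dr.").

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def speakable_segments(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


stream = ["Your claim ", "is approved. ", "The refund ", "arrives Friday."]
for segment in speakable_segments(stream):
    print(segment)
```

The first sentence is available for synthesis after the second token, not after the whole response, which is where the first-audible-token saving comes from.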
Polly Neural is the default — AWS-native, lowest latency, strong English (US, UK, AU) coverage, well-priced at $16/1M characters. Right wherever conversational pacing matters more than vocal warmth.
A Bedrock voice model wins where vocal warmth and prosody carry signal — premium experiences, healthcare, wealth management — or where Polly's language coverage is weak. Cost is 2-4x per character; first-token latency is 50-100ms longer.
ASR is where dialect issues bite hardest. Connect ships strong North American English and serviceable Standard British English. Nigerian English, Indian English, Caribbean English, and the various Spanish dialects need explicit ASR profile tuning, or a third-party layer (Deepgram, Speechmatics) substituted in front of Lex. The failure mode is silent — transcription returns plausible-but-wrong text, slot capture fires on the wrong value, the agent confidently misunderstands. Catching this requires sampled audio review against the transcript.
Lex vs direct Bedrock — and the hybrid that wins
The framing "Lex or Bedrock" misreads what each does. Lex converts an utterance into a structured intent with deterministic slot capture against a defined schema; it is not good at open-ended reasoning. Bedrock reasons over retrieved context; it is not good at deterministic slot capture against a strict schema.
The hybrid is the production pattern: Lex captures slots through deterministic NLU; slots are validated against business rules and the system of record; only then does the Bedrock Agent run, with both the slot values and the unstructured remainder as inputs.
A claim-status call illustrates. Caller: "Check claim 884720, and tell me if my deductible has been met for the year." Lex extracts check_claim_status with claim_id: 884720; the deductible question is the unstructured remainder. A validation Lambda confirms the claim. KB retrieval runs with residency and product-line filters. The Bedrock Agent answers both in one turn.
Bedrock-only fails the slot capture — "884720" gets coerced wrong, the policy system rejects, three extra turns happen, the caller hangs up. Lex-only captures the ID cleanly but has no path to the deductible question; everything escalates. The hybrid runs both, and slot validation before Bedrock is invoked is the single most important rule.
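The ordering rule can be expressed as a small turn handler. This is a sketch, not the production code: `validate_slots` and `run_agent` are hypothetical callables standing in for the validation Lambda and the Bedrock Agent invocation, and the `remainder` field is an assumed place to carry the unstructured part of the utterance.

```python
def handle_turn(lex_result: dict, validate_slots, run_agent) -> dict:
    """Validate Lex-captured slots first; only then invoke the agent."""
    intent = lex_result["intent"]
    slots = lex_result.get("slots", {})

    ok, reason = validate_slots(intent, slots)
    if not ok:
        # Re-prompt through Lex. The agent never sees unvalidated input.
        return {"action": "reprompt", "reason": reason}

    answer = run_agent(
        intent=intent,
        slots=slots,
        remainder=lex_result.get("remainder", ""),
    )
    return {"action": "respond", "text": answer}
```

With the claim-status call above, a mistranscribed claim ID returns `reprompt` without spending a single model token.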
Evaluation — what voice AI quality actually means
Voice evaluation is harder than chat evaluation because the audio adds a layer the model never sees. Four metrics matter, tracked separately:
- Transcript quality — Word Error Rate against sampled human re-transcription. Sub-8% for North American English, sub-12% for Indian English, sub-15% for Nigerian English are realistic working bars. Bad transcripts produce confident-but-wrong intent capture; fixing this upstream is cheaper than every downstream fix.
- Intent and slot accuracy — Above 92% intent and 95% slot is achievable on a well-tuned bot; below 85%, the agent feels broken regardless of downstream reasoning quality.
- Escalation correctness — escalating when the agent could have handled it inflates per-call cost; failing to escalate when it cannot help produces a repeat call. Precision and recall both above 90% against a sampled-review rubric.
- Audit trail completeness — transcript, intent stack, slot values, KB citations, reasoning trace, action group invocations, escalation decision and reason. The regulatory artefact for NDPA, GDPR, HIPAA, or PCI workloads; designing it as an afterthought is the most expensive mistake we see.
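Two of these metrics reduce to short, checkable arithmetic. The sketch below computes Word Error Rate as word-level edit distance divided by reference length, and escalation precision and recall from paired booleans. The function names are ours, and the normalisation (lowercasing and whitespace splitting) is a simplifying assumption; production scoring usually normalises numbers and punctuation as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def precision_recall(predicted: list, expected: list):
    """predicted: agent escalated?; expected: reviewer says it should have."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if e and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Low precision here means unnecessary escalations (cost); low recall means missed escalations (repeat calls), which is why the two are tracked separately.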
The harness runs continuously — a 1-3% sample of production calls routes through human review, trendlines publish, regressions alert. Without this loop, every model update, KB ingestion, or Lex retrain ships blind.
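A stable way to pick the review sample is to hash the contact ID, so the same call always gets the same decision regardless of which service asks. This is a minimal sketch; `in_review_sample` is a hypothetical helper and the 2% default simply sits inside the 1-3% range above.

```python
import hashlib


def in_review_sample(contact_id: str, rate: float = 0.02) -> bool:
    """Deterministic sampling: hash the ID to a value in [0, 1) and compare."""
    digest = hashlib.sha256(contact_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the ID, the sample is reproducible for audits and does not need shared state between Lambdas.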
Cost shape — a worked example
The cost surface combines per-minute Connect and Lex pricing with per-token Bedrock pricing. Modelling it as "Bedrock plus small Connect overhead" understates the real bill.
A 4-minute fully-automated call in US East at current public pricing:
| Component | Cost |
|---|---|
| Connect voice ($0.018/min × 4 min) | $0.072 |
| Lex voice requests ($0.0040 × 8 turns) | $0.032 |
| KB retrieval + Cohere Rerank (8 queries) | $0.017 |
| Bedrock Agent input — Claude Sonnet (~22K tokens, with prompt caching) | $0.005 |
| Bedrock Agent output (~3.5K tokens) | $0.053 |
| Polly Neural TTS streaming (~3,200 chars) | $0.051 |
| Connect logging + CloudWatch | $0.010 |
| Total | ~$0.24 |
The same call escalated at minute three adds Connect agent time (1 minute at a fully-loaded $35/hr ≈ $0.58) plus platform fee (~$0.05). Total: roughly $0.87, with agent time dominating.
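The arithmetic is easy to re-run when prices or call shape change. The sketch below reproduces the table's line items and the escalation add-on exactly as stated in the text; the dictionary keys are our own labels, and the figures are the article's, not live pricing.

```python
# Line items for the 4-minute fully-automated call, in USD.
automated = {
    "connect_voice": 0.018 * 4,   # per-minute rate x minutes
    "lex_voice": 0.0040 * 8,      # per-request rate x turns
    "kb_plus_rerank": 0.017,
    "agent_input": 0.005,
    "agent_output": 0.053,
    "polly_neural": 0.051,
    "logging": 0.010,
}
automated_total = sum(automated.values())

# Escalation adds one minute of human time plus the platform fee.
agent_minute = 35 / 60
platform_fee = 0.05
escalated_total = automated_total + agent_minute + platform_fee

print(f"automated: ${automated_total:.2f}")
print(f"escalated: ${escalated_total:.2f}")
print(f"multiple:  {escalated_total / automated_total:.1f}x")
```

Changing the turn count or the agent's hourly rate in one place shows immediately how sensitive the per-call figure is to each input.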
Every percentage point of automation rate the agent can hold without sacrificing CSAT translates to a meaningful unit-economics improvement at contact-centre volumes. The structural point: Bedrock is rarely the dominant line on a fully-automated call (Connect, TTS, and Lex usually are); human time dominates the moment a call escalates.
Where the pattern breaks
Five failure modes worth knowing before they bite.
Long calls and context bloat. Past about 12 turns, conversation history competes with retrieved context for the model's attention. Fix: explicit summarisation at turn 8 — replace verbatim history with a structured summary of intent stack, slots, prior responses, and outstanding questions; reason over the summary plus the last two turns.
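The compaction step might look like the following. It is a sketch under assumptions: each list entry is treated as one turn, `summarise` is a hypothetical callable (in practice a model call that returns the structured summary), and the `summary` role label is ours.

```python
def compact_history(turns: list, summarise,
                    summary_at: int = 8, keep_last: int = 2) -> list:
    """Replace older verbatim turns with one summary entry once the call is long.

    turns: list of {"role": ..., "text": ...} dicts, oldest first.
    """
    if len(turns) < summary_at:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return [{"role": "summary", "text": summarise(older)}] + recent
```

The model then reasons over one summary entry plus the last two turns, so the prompt stops growing with call length.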
Multi-turn intent switches. "Actually, forget the claim, I need to update my address." Lex tracks the current intent but not the implicit switch; Bedrock keeps reasoning about the claim. Fix: a per-turn intent-switch detector on Haiku or Mistral that classifies whether the latest utterance maintains, refines, or replaces the active intent.
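A detector of this kind is mostly prompt construction plus strict parsing of the answer. In the sketch below, `complete` is a hypothetical callable wrapping whichever small model is used, the prompt wording is illustrative, and defaulting to `maintain` on an unparseable answer is one possible fail-safe choice, not the only defensible one.

```python
LABELS = ("maintain", "refine", "replace")


def classify_intent_switch(active_intent: str, utterance: str, complete) -> str:
    """Ask a small model whether the utterance keeps or replaces the active intent."""
    prompt = (
        f"Active intent: {active_intent}\n"
        f"Latest caller utterance: {utterance}\n"
        "Does the utterance maintain, refine, or replace the active intent? "
        "Answer with one word."
    )
    raw = complete(prompt).strip().lower()
    # Anything outside the three labels is treated as 'maintain'.
    return raw if raw in LABELS else "maintain"
```

On `replace`, the orchestrator would clear the active intent and hand the utterance back to Lex rather than letting the agent keep reasoning about the old one.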
Dialect and accent coverage. The Nigerian English caller whose claim ID transcribes with a trailing zero elided produces a valid-looking wrong number; the Mumbai English caller's "1A8" comes through as 1 plus a stray letter. Fix: per-dialect ASR profiles where Connect supports them, sampled audio review, and slot validation that rejects implausible values rather than passing them downstream.
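A plausibility check on the slot value catches part of this before it reaches downstream systems. The sketch assumes, purely for illustration, that claim IDs are exactly six digits and that `known_claims` stands in for a system-of-record lookup; `validate_claim_id` is a hypothetical name.

```python
import re
from typing import Tuple

CLAIM_ID = re.compile(r"\d{6}")  # assumed format: exactly six digits


def validate_claim_id(raw: str, known_claims: set) -> Tuple[bool, str]:
    """Return (True, normalised_id) or (False, reason)."""
    candidate = re.sub(r"[\s-]", "", raw)  # tolerate spoken grouping like "884 720"
    if not CLAIM_ID.fullmatch(candidate):
        return False, "malformed"
    if candidate not in known_claims:
        return False, "not_found"
    return True, candidate
```

A fixed-length format rejects an elided digit outright; where IDs vary in length, only the system-of-record lookup (or a check digit) can catch it.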
Supervisor handoff. Caller escalated, human picks up, caller asked to repeat everything — the single most expensive UX failure in any contact-centre AI deployment. Fix: engineer the handoff packet — transcript, intent stack, slots, KB citations, reasoning trace, escalation reason — rendered on the supervisor's screen at call pickup, so the supervisor opens the conversation with full context already in hand.
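The packet itself can be a small typed structure that refuses to serialise when the essentials are missing. This is a sketch: the field names follow the list in the text, but `HandoffPacket`, `contact_id`, and the choice of which fields are mandatory are our assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffPacket:
    contact_id: str
    escalation_reason: str
    transcript: list
    intent_stack: list
    slots: dict
    kb_citations: list = field(default_factory=list)
    reasoning_trace: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise for the supervisor screen; fail loudly if context is missing."""
        required = ("transcript", "intent_stack", "escalation_reason")
        missing = [name for name in required if not getattr(self, name)]
        if missing:
            raise ValueError(f"handoff packet incomplete: {missing}")
        return json.dumps(asdict(self), ensure_ascii=False)
```

Failing at serialisation time surfaces an empty transcript as an engineering error instead of as a supervisor asking the caller to start again.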
Knowledge Base drift. Policy update lands in the bucket; KB hasn't re-ingested; the agent quotes the old policy. Fix: scheduled re-ingestion (daily for active corpora, hourly for high-change content), change-detection sync, freshness metadata as a default filter — in place before the first policy update, not after the first complaint.
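Change detection reduces to comparing source modification times with the last ingestion time per document. The sketch below assumes both are available as timezone-aware datetimes keyed by document path; how they are collected (bucket listings, ingestion job logs) is left open, and `stale_documents` is a hypothetical helper.

```python
from datetime import datetime, timezone


def stale_documents(source_modified: dict, last_ingested: dict) -> list:
    """Return keys whose source changed after ingestion, or was never ingested."""
    never = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(
        key for key, modified in source_modified.items()
        if modified > last_ingested.get(key, never)
    )
```

A non-empty result is the trigger for re-ingestion, and it doubles as an alertable freshness signal between scheduled runs.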
What separates Day 90 survivors
The pattern works because each service does the job it is good at. The engineering is in the integration: slot validation before Bedrock, mandatory metadata filters at retrieval, streaming TTS end to end, the evaluation harness running continuously, and a supervisor handoff engineered for the moment the agent reaches its limit.
That last point separates the deployments that survive Day 90 from those that don't. An agent that escalates cleanly with full context preserved is the agent customers tolerate when it cannot help. One that doesn't erodes trust call by call until the project gets shut down.
FAQs
Why use Lex at all if Bedrock can hold a conversation?
Different jobs. Lex captures structured slots (account numbers, claim IDs, dates) with deterministic confidence scoring against a defined schema — exactly where Bedrock is weakest. Bedrock reasons over unstructured context — exactly where Lex has no path. The hybrid runs Lex first to capture and validate slots against business rules, then runs Bedrock with the validated structured inputs plus the unstructured remainder. Slot validation before Bedrock is invoked prevents Bedrock burning tokens on inputs the downstream systems will reject.
What end-to-end latency is achievable?
1.2-1.6 seconds from end-of-user-utterance to first audible response token in a well-engineered deployment — streaming TTS, prompt caching, warm Lambda paths, re-ranking on. Without streaming TTS, the number is 2.5-3.5 seconds and the agent feels broken. The biggest contributor is Bedrock Agent first-token time (400-800ms); prompt caching cuts this 30-50% on multi-turn conversations.
Polly Neural or a Bedrock voice model?
Polly Neural is the default — AWS-native, lowest latency, strong English coverage, well-priced. A Bedrock voice model wins where vocal warmth and prosody carry signal (premium experiences, healthcare, wealth management) or where the language isn't well-supported by Polly. The trade-off is 2-4x per-character cost and 50-100ms additional first-token latency.
How do we keep data inside the right residency boundary?
Mandatory metadata filters at retrieval, enforced at the API layer not the prompt layer. Each KB document carries a residency tag at ingestion. Every retrieval call is issued with `residency = caller_residency` plus a classification filter; the retrieval API rejects calls missing these filters. NDPA and GDPR compliance ride on this.
What does the supervisor handoff actually contain?
A structured packet (transcript, intent stack, captured slot values, KB citations, reasoning trace, escalation reason) rendered on the supervisor's screen at call pickup. The supervisor opens with full context. A handoff done badly, with the caller asked to repeat everything, is the single most expensive UX failure in any contact-centre AI deployment.
Companion content
- RAG with Bedrock Knowledge Bases: From S3 to Vector Retrieval
- Multi-Model AI on Amazon Bedrock: How We Deploy the Right Model for Every Task
- Cost Optimisation on Amazon Bedrock
- Cold-Start Latency and Cost for Multimodal RAG Pipelines
- Security Guardrails and Observability for Bedrock
How to engage
We design, ship, and operate AI-powered contact-centre architectures on Amazon Connect and Bedrock — latency-engineered, residency-aware, evaluation-instrumented, and built around supervisor handoff as a first-class capability. Talk to us at creativeminds.dev/contact.
