Voice Integration on Amazon Connect with Bedrock: Polly Neural vs Bedrock Voice, Streaming, and the Latency Budget

An operator-grade pattern from the CreativeMinds Development (cmdev) AI engineering practice. The narrower, more technical companion to Amazon Connect AI Agent — what ships, what you build and Building AI Agents on Amazon Bedrock Foundations.

A retail bank in Lagos, week three of a pilot. The agent is technically correct on every test we throw at it: knowledge base hygiene clean, intents wired, transcripts logging. Then we put a real caller on the line. The caller asks about her account balance. The agent waits. The caller waits. After about 1.4 seconds, the caller says "hello? are you there?" The agent then answers, accurately and politely, a question that is now muddled by the caller's interjection. The accuracy was perfect. The voice agent had still failed.

Every voice agent we ship is rated against the same buyer question: does it sound like a person? The engineering translation: can the agent respond inside the human turn-taking window without the caller noticing the gap? Conversational research puts the median gap between human speakers at around 200 milliseconds. Comfortable turn-taking sits at 500 milliseconds. Beyond 800 milliseconds it reads as hesitation. Beyond 1.2 seconds the caller assumes the line is broken.

Every decision — TTS engine, streaming pattern, model tier, STT configuration — collapses to whether you hit the budget on the median, and where the 95th percentile sits. This piece is the arithmetic.

Key takeaways

The bar for natural voice AI is end-to-end response under 800ms; conversational quality lives under 500ms. Above 1.2s the caller treats the system as broken. The budget is the architecture.
Polly Neural with streaming over Kinesis Video Streams is the production default. Polly Generative sounds materially better but adds 150–300ms of synthesis latency and is not yet streaming-native on Connect — keep it for IVR prompts and non-conversational responses.
The streaming-assembly pattern hides the 250–600ms Bedrock generation time. Start TTS synthesis on the first complete clause from the model stream, not on the full response. This is what makes a Sonnet-tier agent feel as responsive as a scripted IVR.
The dialect problem is the silent killer on non-North-American English. Transcribe defaults to en-US; Nigerian, Indian, Scottish, and Singaporean callers see intent classification drop 8–15 points without the regional variant, custom vocabulary, and silence-detection retuning.
The cost shape on a 10K-minutes-per-month workload is ~$90–160 per thousand minutes all-in. Bedrock is the variable line and the prompt cache is the lever that decides whether reasoning runs at $0.06/min or $0.12/min.

Figure 1: The end-to-end latency waterfall. Six stages stack from caller speech end to first audio back to the caller: Transcribe Streaming final transcript (100ms median, 220ms 95th), Transcribe to Bedrock routing hop (30ms, 60ms), Bedrock Sonnet first token with prompt-cache hit (250ms, 450ms), first clause assembled from the streaming tokens (150ms, 300ms), Polly Neural streaming synthesis of the first clause (200ms, 350ms), and KVS to Connect audio path back to the caller (50ms, 100ms). The bands mark under 500ms as conversational ideal, 500-800ms as acceptable, 800-1200ms as awkward, and over 1200ms as broken. The 550ms median sits inside the conversational band; the 95th at 1000ms is a brief pause, not a broken system. Every TTS choice, every streaming pattern, every model tier reduces to which band the median lands in.
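The bands in Figure 1 can be written down as a small classifier, which is handy for labelling measured turns in a dashboard. This is a minimal sketch: the function name is hypothetical, and treating each upper boundary as inclusive is an assumption, since the figure does not say which side 800ms or 1200ms falls on. Read strictly against these bands, a 550ms median lands just past the 500ms ideal line, in the 500-800ms band.

```python
def latency_band(first_audio_ms: float) -> str:
    """Map an end-to-end first-audio latency to the band names used in Figure 1.

    Assumption: upper boundaries (800ms, 1200ms) are inclusive.
    """
    if first_audio_ms < 500:
        return "conversational ideal"
    if first_audio_ms <= 800:
        return "acceptable"
    if first_audio_ms <= 1200:
        return "awkward"
    return "broken"


# Label the two headline figures from the budget.
for label, ms in [("median", 550), ("95th percentile", 1000)]:
    print(f"{label}: {ms}ms -> {latency_band(ms)}")
```

Tracking the share of turns per band, rather than a single average, is one way to keep the 95th percentile visible alongside the median.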

The 200-millisecond window humans speak inside

Voice agents do not fail because they say the wrong thing. They fail because they say the right thing too late. Conversation is a duet, not a relay. The gap between one speaker finishing and the next beginning is, on average, around 200 milliseconds. The same gap in a tennis rally is roughly the time between racket strike and the other player's first step. Open that window even slightly wider and the rhythm collapses. The caller asks again. The agent answers the new question and the old one at the same time. Coherence is lost. Trust is lost faster.

Every TTS choice, every streaming pattern, every model tier reduces to one question: did we hit the budget on the median, and where does the 95th sit?

Four voices, three of which you will not use for live turns

Connect ships Amazon Polly as the default, in four tiers that are not equivalent for real-time voice.

Polly Standard is concatenative — audibly synthetic. Fine for IVR menus, unacceptable for a conversational agent in 2026. It sounds like the early sat-nav of 2008.

Polly Neural is the production default. Naturalistic prosody, streaming synthesis via Connect KVS, SSML control. Joanna, Matthew, Kajal, and Ayanda cover most deployments (Kajal is the Neural Indian English voice; Aditi is Standard-only). A Nigerian English voice is the obvious gap; the workaround is Ayanda (South African English) with a custom lexicon for Nigerian terms.

Polly Generative is materially more human than Neural at the cost of 150-300 milliseconds additional synthesis latency. Not yet streaming-native on Connect KVS. Neural for live turns, Generative for non-real-time content. Think of Generative as a pre-recorded studio voice — extraordinary for scripted disclosures, wrong for live conversation because the studio cannot match the tempo of the room.

Bedrock voice models — Nova Sonic — collapse STT, LLM, and TTS into one invocation. The latency win is real (200-350 milliseconds end-to-end) but you lose the intermediate transcript surface that compliance logging, intent classification, and handoff context depend on. Anthropic Claude voice via Bedrock is the same trade-off. Regulated buyers keep the pipeline pattern. Consumer workloads with light audit weight take the speech-to-speech unlock.

SSML control is the unsung lever. Polly accepts <prosody>, <break>, <say-as>. It matters more on Neural than on Generative, which interprets prosody from the text and ignores some directives. For a banking agent reading out an account balance correctly the first time, SSML on Neural is the reliable path.
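As an illustration of the three tags named above, here is a minimal sketch that builds an SSML string for a balance read-out. The function name, the 300ms pause, and the 90% rate are illustrative assumptions, not tuned values from a deployment. The amount is passed in already spelled out, so the sketch does not depend on how the engine normalises currency.

```python
from xml.sax.saxutils import escape


def balance_ssml(name: str, last4: str, amount_text: str) -> str:
    """Build SSML that reads the account digits one by one, pauses,
    then reads the balance slightly slower than the default rate."""
    return (
        "<speak>"
        f"{escape(name)}, the balance on your account ending "
        f'<say-as interpret-as="digits">{escape(last4)}</say-as>'
        '<break time="300ms"/> is '
        f'<prosody rate="90%">{escape(amount_text)}</prosody>.'
        "</speak>"
    )


print(balance_ssml("Ada", "4321", "twelve thousand four hundred naira"))
```

The resulting string would be sent to Polly as SSML rather than plain text, for example with boto3's `synthesize_speech` using `TextType="ssml"` and `Engine="neural"`. Escaping caller-derived values matters because a stray `&` or `<` makes the SSML invalid and the synthesis request fails.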

The default decision rubric:

Use case	TTS choice	Why
Live conversational turns	Polly Neural with streaming	Latency, SSML control, KVS-native
IVR prompts, scripted disclosures	Polly Generative (pre-synthesised)	Quality over latency for static content
Voicemail summaries, callback messages	Polly Generative or Long-form	Async, batch synthesis is fine
Consumer agent, no audit weight	Nova Sonic speech-to-speech	200-350ms end-to-end, no transcripts
Regulated voice agent	Polly Neural + Bedrock pipeline	Transcripts for the audit log

Figure 2 — Workload type against TTS engine. Polly Neural is still the production default in 2026 — the only path that combines streaming synthesis, SSML control, and intermediate transcripts for the audit log.

Speaking before you have finished thinking

The biggest perceived-latency lever is whether TTS starts speaking before the LLM finishes. Batch waits for the full response — on a 200-token Sonnet response at around 60 tokens per second, that's 3.3 seconds of generation before TTS starts. Add 200 milliseconds of synthesis and the caller has heard 3.5 seconds of silence. Unshippable.
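The arithmetic in that paragraph, as a small sketch. The batch figures (200 tokens, 60 tokens per second, 200ms synthesis) come from the text; the 9-token first clause in the streaming case is an assumption chosen for illustration. Time to first token is left out of both cases because it is the same either way.

```python
def silence_before_audio_s(tokens_needed: int, tokens_per_s: float, synthesis_s: float) -> float:
    """Seconds of silence after the first token: time to generate the tokens
    TTS is waiting for, plus time to synthesise the first audio."""
    return tokens_needed / tokens_per_s + synthesis_s


# Batch: TTS waits for the whole 200-token response.
batch = silence_before_audio_s(tokens_needed=200, tokens_per_s=60, synthesis_s=0.2)

# Streaming: TTS waits only for the first clause (assumed 9 tokens here).
streaming = silence_before_audio_s(tokens_needed=9, tokens_per_s=60, synthesis_s=0.2)

print(f"batch:     {batch:.2f}s of silence")
print(f"streaming: {streaming:.2f}s of silence")
```

The gap between the two grows with response length, since batch silence scales with the full response while streaming silence scales only with the first clause.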

The streaming-assembly pattern: open a streaming Bedrock invocation; a clause-boundary detector watches the stream for sentence-final punctuation, comma-bounded clauses of more than five tokens, or a 200-millisecond gap. On the first complete clause, issue a Polly streaming SynthesizeSpeech, which streams audio bytes back over a KVS-fed Connect channel. The agent starts speaking within around 200-350 milliseconds of first token, with remaining clauses synthesising in parallel.

This is how an experienced interpreter at a UN session works. She does not wait for the speaker to finish a paragraph. She begins translating the moment a complete clause lands, holding the rest in working memory while her mouth is still moving on the previous one. Done well, the listener never notices the seam.

The trick is the clause-boundary detector. Too aggressive and the agent produces unnatural mid-sentence pauses. Too conservative and the streaming advantage collapses. We have converged on sentence-final punctuation as the primary signal with a 1.2-second hard cap. Empirical median first-audio latency, against Sonnet with Polly Neural streaming, is 380-520 milliseconds.
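A clause-boundary detector along these lines can be sketched as a generator over timestamped tokens. This is an illustration under stated assumptions, not the production detector: the function and constant names are hypothetical, tokens are assumed to arrive as (arrival time in seconds, text) pairs, the 200-millisecond gap rule is omitted, and the hard cap is only checked when a token arrives (a real implementation would need a timer so a stalled stream still flushes).

```python
from typing import Iterable, Iterator, List, Optional, Tuple

SENTENCE_END = (".", "?", "!")
MIN_COMMA_CLAUSE_TOKENS = 5  # comma clauses must be longer than five tokens
HARD_CAP_S = 1.2             # flush whatever is buffered after 1.2 seconds


def clauses(tokens: Iterable[Tuple[float, str]]) -> Iterator[str]:
    """Yield speakable clauses from a stream of (arrival_time_s, text) tokens."""
    buf: List[str] = []
    started_at: Optional[float] = None
    for arrived_at, text in tokens:
        if started_at is None:
            started_at = arrived_at
        buf.append(text)
        stripped = text.rstrip()
        sentence_end = stripped.endswith(SENTENCE_END)
        comma_clause = stripped.endswith(",") and len(buf) > MIN_COMMA_CLAUSE_TOKENS
        capped = arrived_at - started_at >= HARD_CAP_S
        if sentence_end or comma_clause or capped:
            yield "".join(buf).strip()
            buf, started_at = [], None
    if buf:  # flush the tail when the stream ends without punctuation
        yield "".join(buf).strip()


stream = [(0.00, "Your"), (0.02, " balance"), (0.04, " is"), (0.06, " ready"),
          (0.08, "."), (0.10, " Anything"), (0.12, " else"), (0.14, "?")]
for clause in clauses(stream):
    print(repr(clause))
```

Each yielded clause would be handed to TTS while later tokens keep arriving. One known weakness of a punctuation-only rule is a token that ends in a period without ending a sentence, such as an abbreviation or a number split across tokens; amounts and dates are worth special-casing.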

The pattern collapses under Polly Generative because synthesis is not yet streaming-native — the reason we keep Generative for non-conversational use.

The accent problem nobody designs for

The inbound budget is smaller than people expect. Transcribe Streaming on en-US returns partial transcripts within 200-400 milliseconds of speech onset and a stable final 100-300 milliseconds after speech end. The Connect-to-Transcribe pipeline adds 50-100 milliseconds. Inbound STT contributes 350-800 milliseconds to end-to-end depending on configuration and accent — the part of the pipeline most teams underinvest in.

The dialect problem is the silent killer. Transcribe ships regional variants — en-GB, en-IN, en-AU, en-NG, en-ZA — but Connect does not select them automatically. Default en-US against a strong Nigerian or Indian accent shows word-error rates of 18-28 per cent versus 6-10 per cent on the regional variant. Intent classification on the downstream Bedrock call drops 8-15 points, translating directly to a higher handoff rate.

Configuration moves: set the regional code at the contact-flow level — detect from inbound number, IVR selection, or a brief language-identification call. Build a custom vocabulary for domain terms. Retune silence detection from the 1-second default to 1.5-1.8 seconds for non-North-American callers. Enable automatic language identification for code-switching speakers.
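The first of those moves, detecting from the inbound number, can be sketched as a prefix lookup that returns a per-call speech profile. Everything here is illustrative: the function, dictionary, and field names are hypothetical and are not Connect or Transcribe parameter names, the vocabulary name is a placeholder, and the locale codes are simply the ones listed in the text above, so each should be checked against the current Transcribe supported-language list before use.

```python
DEFAULT_LOCALE = "en-US"

# Country calling code -> locale code as listed in the text (verify before use).
LOCALE_BY_CALLING_CODE = {
    "+234": "en-NG",
    "+91": "en-IN",
    "+44": "en-GB",
    "+61": "en-AU",
    "+27": "en-ZA",
}


def speech_profile(caller_number: str) -> dict:
    """Pick a locale from the caller's number and attach the tuning that goes with it."""
    locale = DEFAULT_LOCALE
    # Check longer prefixes first so "+234" is never shadowed by a shorter code.
    for prefix in sorted(LOCALE_BY_CALLING_CODE, key=len, reverse=True):
        if caller_number.startswith(prefix):
            locale = LOCALE_BY_CALLING_CODE[prefix]
            break
    return {
        "locale": locale,
        "custom_vocabulary": "retail-banking-terms",  # placeholder name
        # 1s default, retuned into the 1.5-1.8s range for non-North-American callers.
        "silence_timeout_s": 1.0 if locale == DEFAULT_LOCALE else 1.6,
    }


print(speech_profile("+2348031234567"))
print(speech_profile("+12065550100"))
```

The calling code is only a first guess. A caller roaming or dialling from abroad will be misrouted, which is why the text pairs it with IVR selection or a brief language-identification call.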

The hardest case is the multi-dialect contact centre. A Lagos-based centre takes calls from Lagos, Abuja, Port Harcourt, Kano, and Accra in a single shift, and variance within "Nigerian English" is itself wide. We ship dialect detection on the first three seconds with a hand-off to the right Transcribe profile — around 400 milliseconds of one-off setup, not per-turn cost.

When two voices arrive at the same time

A conversational agent must let the caller interrupt. The pattern is barge-in, and it is harder than it looks.

Half-duplex is the default. The agent listens for voice energy on the inbound stream while speaking. On detected speech, TTS stops and Transcribe opens. Works on clean audio. Fails on noisy lines and echo bleed — the false-positive rate on noisy mobile lines is a tuning problem on every Nigerian and Indian deployment.

Full-duplex runs Transcribe and Polly in parallel and lets the model decide on meaning, not voice energy, whether to keep talking or yield. More natural and dramatically harder to implement. Nova Sonic supports it natively; the pipeline path generally does not, and half-duplex is what most teams ship. It is the difference between a polite dinner guest who pauses the moment anyone takes a breath, and a friend who hears your tone shift and knows you are about to say something important.

Half-duplex tuning: VAD threshold (too low and breath triggers it, too high and the caller has to shout); echo cancellation on outbound audio so TTS does not trigger inbound VAD; minimum speech duration before barge-in fires (200-300 milliseconds is the sweet spot).
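The minimum-speech-duration gate is the simplest of these three knobs to show in code. The sketch below assumes per-frame energy values that have already been through echo cancellation, 20ms frames, and a 250ms minimum inside the 200-300 millisecond range above. The function name and the energy scale are hypothetical.

```python
from typing import Optional, Sequence


def barge_in_frame(energies: Sequence[float], threshold: float,
                   frame_ms: int = 20, min_speech_ms: int = 250) -> Optional[int]:
    """Return the index of the frame at which barge-in fires, or None.

    Barge-in fires only after enough consecutive frames exceed the VAD
    threshold to cover min_speech_ms, so a cough or click does not stop TTS.
    """
    frames_needed = -(-min_speech_ms // frame_ms)  # ceiling division
    run = 0
    for i, energy in enumerate(energies):
        run = run + 1 if energy >= threshold else 0
        if run >= frames_needed:
            return i
    return None


cough = [0.0] * 3 + [0.9] * 5 + [0.0] * 3    # 100ms burst: ignored
speech = [0.0] * 2 + [0.9] * 13              # 260ms of speech: fires
print(barge_in_frame(cough, threshold=0.5))
print(barge_in_frame(speech, threshold=0.5))
```

The gate adds its own delay to every genuine interruption, so raising `min_speech_ms` trades fewer false stops for a slower yield.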

The 550-millisecond budget, line by line

The budget we underwrite for a regulated voice agent on Connect-plus-Bedrock-plus-Polly, target 550 milliseconds median:

Stage	Median	95th
Inbound speech end → Transcribe final transcript	100ms	220ms
Transcribe → Bedrock invocation	30ms	60ms
Bedrock first token (Sonnet, prompt-cache hit)	250ms	450ms
First clause assembled from streaming tokens	150ms	300ms
Polly Neural streaming synthesis, first clause	200ms	350ms
KVS → Connect audio path → caller	50ms	100ms
End-to-end first-audio latency	~550ms	~1000ms

550 milliseconds median feels conversational. 1000 milliseconds at the 95th percentile reads as a brief pause, not a broken system. Push the 95th above 1.2 seconds and the system reads as broken even when the median is fine.
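One thing to keep in view when reading the table: added serially, the six stage figures come to 780ms at the median and 1480ms at the 95th, which is more than the end-to-end row. The table does not say where the difference comes from. Overlap between stages under streaming is one plausible reading, and percentile figures do not add exactly in any case, but both are assumptions here. The sketch below makes the sums explicit so the gap is tracked rather than lost.

```python
# (median_ms, p95_ms) per stage, copied from the budget table.
BUDGET_MS = {
    "Transcribe final transcript": (100, 220),
    "Transcribe -> Bedrock invocation": (30, 60),
    "Bedrock first token (cache hit)": (250, 450),
    "First clause assembled": (150, 300),
    "Polly Neural first clause": (200, 350),
    "KVS -> Connect -> caller": (50, 100),
}
QUOTED_END_TO_END_MS = (550, 1000)  # the table's end-to-end row

serial_median = sum(median for median, _ in BUDGET_MS.values())
serial_p95 = sum(p95 for _, p95 in BUDGET_MS.values())
largest_stage = max(BUDGET_MS, key=lambda stage: BUDGET_MS[stage][0])

print(f"serial sum:    {serial_median}ms median, {serial_p95}ms 95th")
print(f"quoted target: {QUOTED_END_TO_END_MS[0]}ms median, {QUOTED_END_TO_END_MS[1]}ms 95th")
print(f"gap to close:  {serial_median - QUOTED_END_TO_END_MS[0]}ms median, "
      f"{serial_p95 - QUOTED_END_TO_END_MS[1]}ms 95th")
print(f"largest stage at the median: {largest_stage}")
```

When instrumenting a deployment, it is safer to measure end-to-end first-audio latency directly and use the per-stage figures to attribute it, rather than deriving one from the other.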

The levers: the prompt cache cuts first-token latency by 100-250 milliseconds (highest-leverage single move). Streaming clause assembly hides the rest of generation. The regional Transcribe variant keeps inbound honest. Provisioned throughput eliminates 95th-percentile queue variance. Miss on any of these and the caller experiences the system as awkward — not broken, just off — and satisfaction scores drop quietly over the month.

Where the money actually goes

Mid-2026 rates against a 10,000-minute monthly workload (around 5,000 calls at 2 minutes average handle time):

Component	Rate	Monthly
Connect telephony (inbound, US)	$0.018/min	$180
Transcribe Streaming	$0.024/min	$240
Polly Neural TTS	$16/1M chars (~$0.024/min at 60% talk)	$144
Bedrock Sonnet (800 in + 200 out per turn, 5 turns/call)	~$0.06/min with cache hit	$600
Bedrock KB retrieval	~$0.005/min	$50
Total		~$1,214

That is $90 to $160 per thousand minutes all-in, depending on tier; the rates above work out to about $121. Bedrock dominates the variable portion and the prompt cache keeps it at $0.06 per minute rather than $0.12-0.15 per minute. Polly Generative on live turns would add around $126 per month: the same 9 million characters at $30 per million versus $16 per million for Neural.
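The table's arithmetic as a small sketch, so the rates can be swapped for a different workload. The per-minute rates and the 60% talk share are the ones in the table. The function name is hypothetical, and the $0.12 figure is the lower end of the no-cache range quoted above.

```python
def monthly_cost(minutes: int, bedrock_per_min: float) -> dict:
    """Monthly line items in dollars, using the per-minute rates from the table."""
    return {
        "Connect telephony": 0.018 * minutes,
        "Transcribe Streaming": 0.024 * minutes,
        "Polly Neural": 0.024 * minutes * 0.60,  # agent speaks ~60% of the call
        "Bedrock Sonnet": bedrock_per_min * minutes,
        "KB retrieval": 0.005 * minutes,
    }


MINUTES = 10_000
for label, rate in [("cache hit", 0.06), ("no cache", 0.12)]:
    items = monthly_cost(MINUTES, bedrock_per_min=rate)
    total = sum(items.values())
    per_thousand = total / (MINUTES / 1000)
    print(f"{label}: ${total:,.0f}/month, ${per_thousand:,.2f} per thousand minutes")
```

Only the Bedrock line moves between the two runs, which is the point: every other line is fixed per minute, and the cache decides the total.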

Production cost surprises. Long-tail loops push per-call Bedrock cost three to five times above the median — set a per-call cost-anomaly alarm. KB re-retrieval on every turn multiplies retrieval cost by turn count — cache for the conversation. Streaming responses bill the same as non-streaming on input tokens; streaming changes perceived latency, not cost shape.
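A per-call cost-anomaly check can be as simple as comparing each call's Bedrock cost with a multiple of the median. This is a minimal sketch, assuming per-call costs are already aggregated by contact ID. The function name and sample data are hypothetical, and the 3x multiple is the lower end of the three-to-five-times range above.

```python
from statistics import median
from typing import Dict, List


def cost_anomalies(call_costs: Dict[str, float], multiple: float = 3.0) -> List[str]:
    """Return contact IDs whose cost exceeds `multiple` times the median call cost."""
    baseline = median(call_costs.values())
    return sorted(cid for cid, cost in call_costs.items() if cost > multiple * baseline)


# Illustrative per-call Bedrock costs in dollars.
costs = {"c-001": 0.10, "c-002": 0.12, "c-003": 0.11, "c-004": 0.55}
print(cost_anomalies(costs))
```

Using the median rather than the mean as the baseline keeps a handful of runaway calls from raising the threshold that is supposed to catch them.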

The friction points production teaches you

The model occasionally emits a long internal pause. Claude or Nova sometimes drifts mid-response. The streaming-clause pattern hits the 1.2-second cap and synthesises an incomplete clause; the caller hears awkward truncation. Mitigation: a clause buffer that holds until the next punctuation or sentence-final token arrives.

Multi-language switching mid-call. Transcribe auto-identification handles the inbound. Sonnet or Opus reasons on multilingual content. Polly cannot switch language mid-utterance. Pragmatic pattern: respond in English with an acknowledgement — "I understood you, let me respond in English".

Regional accents not in the training set. Nigerian Pidgin, Ghanaian English, Sierra Leonean Krio fall through to en-US. Custom vocabularies help on terminology, not phonetics. Handoff rate will be higher. Instrument the trigger by detected accent.

Background noise and mobile-line quality. The cmdev test corpus includes Lagos commute audio — traffic, generators, market noise — because that is production reality for Nigerian retail banking. Mitigation: noise suppression on inbound audio before Transcribe — adds 50-100 milliseconds, recovers most of the accuracy loss.

Compliance disclosures that must be delivered verbatim. KYC consent prompts, recording notifications, T&C read-outs drift if the model paraphrases. Pattern: hard-coded SSML disclosure blocks pre-synthesised at deployment, rendered as-is, with the model forbidden from regenerating the text.
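One way to keep the model from regenerating disclosure text is to have it emit only a placeholder, then resolve that placeholder to pre-synthesised audio when planning the turn. The sketch below assumes a `[[disclosure:id]]` placeholder convention. The convention, the function name, and the file paths are all hypothetical.

```python
import re
from typing import List, Tuple

PLACEHOLDER = re.compile(r"\[\[disclosure:([a-z_]+)\]\]")

# Disclosure ID -> audio synthesised once at deployment (hypothetical paths).
PRESYNTHESISED = {
    "recording_notice": "audio/recording_notice.pcm",
    "kyc_consent": "audio/kyc_consent.pcm",
}


def plan_turn(model_text: str) -> List[Tuple[str, str]]:
    """Split a model response into ("tts", text) and ("audio", path) segments.

    Model-written text goes to live TTS; disclosures are played from the
    pre-synthesised files, so their wording never passes through the model.
    """
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in PLACEHOLDER.finditer(model_text):
        before = model_text[pos:match.start()].strip()
        if before:
            segments.append(("tts", before))
        disclosure_id = match.group(1)
        if disclosure_id not in PRESYNTHESISED:
            raise KeyError(f"unknown disclosure: {disclosure_id}")
        segments.append(("audio", PRESYNTHESISED[disclosure_id]))
        pos = match.end()
    tail = model_text[pos:].strip()
    if tail:
        segments.append(("tts", tail))
    return segments


print(plan_turn("Before we continue. [[disclosure:recording_notice]] How can I help?"))
```

Raising on an unknown ID is deliberate: a misspelt placeholder should fail loudly in testing rather than be read out to a caller.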

The architecture is the budget, not the tuning step

Five things hold up across the voice deployments we ship.

The latency budget is the architecture. Every TTS choice, every streaming pattern, every model tier reduces to whether the median turn hits the budget. It is a first-class constraint, not a tuning step at the end.

Polly Neural streaming is still the production default in 2026. Generative sounds better and Nova Sonic is faster, but Neural is the only path that combines streaming synthesis, SSML control, and intermediate transcripts — what a regulated voice agent needs together.

The dialect problem is solvable and almost always under-invested. The regional Transcribe variant plus custom vocabulary plus retuned silence thresholds closes most of the gap. Teams that skip the work ship a system the buyer's customers experience as condescending.

The cost shape is dominated by Bedrock and tamed by the prompt cache. $0.06 per minute versus $0.12 per minute on reasoning is the difference between a workload that pencils and one that does not. The cache is not optional at production volume.

The voice-quality bar is rising every quarter. Callers compare contact-centre agents to consumer voice assistants. The architecture that shipped in 2025 will not hold in 2027 without active retuning. The caller in Lagos who asked "hello? are you there?" will not call back if the agent makes her ask twice.

FAQs

Polly Generative or Neural for live conversational turns?

Neural with streaming. Generative sounds better but is not yet streaming-native on the Connect KVS path, and the 150-300ms additional synthesis latency pushes a tight budget into the awkward zone. Keep Generative for IVR prompts, voicemail summaries, and scripted disclosures.

When does Nova Sonic or Claude voice beat the Polly-plus-Bedrock pipeline?

On consumer workloads with light audit weight, where the 200-350ms end-to-end win is decisive and you can live without the intermediate transcripts that compliance logging, intent classification, and supervisor handoff context depend on. Regulated buyers in banking, healthcare, and insurance keep the pipeline-with-transcripts pattern.

What end-to-end latency should we underwrite?

550ms median, 1000ms at the 95th, on Transcribe Streaming with the correct regional variant, Bedrock Sonnet with cache hits, streaming clause assembly into Polly Neural, and the Connect KVS audio path. Below 500ms is conversational quality; above 800ms feels hesitant; above 1.2s the caller treats the system as broken. The biggest single lever is the prompt cache.

How do we handle Nigerian, Indian, and other non-North-American accents?

Set the correct regional Transcribe code at the contact-flow level rather than defaulting to en-US. Build a custom vocabulary for domain terms. Retune silence-detection thresholds upwards to 1.5-1.8 seconds. For code-switching callers, enable automatic language identification on the first three seconds and route to the matching profile. Together these close most of the 8-15 point gap in intent classification.

What does the workload cost at 10,000 minutes per month?

Roughly $1,200 all-in — $180 Connect telephony, $240 Transcribe, $144 Polly Neural, $600 Bedrock Sonnet with cache hits, $50 KB retrieval. Call it $90-160 per thousand minutes depending on tier. Bedrock dominates the variable portion and the cache keeps it at $0.06/min rather than $0.12-0.15/min. Long-tail loops push per-call cost 3-5x above the median.

How to engage

We design and ship Connect-plus-Bedrock voice agent rollouts for regulated buyers — with the latency budget, the dialect tuning, the streaming-assembly pattern, and the cost discipline that survive a compliance review and a production call volume. Talk to us at creativeminds.dev/contact.

Mayowa A. is CTO of CreativeMinds Development. He leads cmdev's AI engineering practice for regulated enterprises across Africa and the EU.