
Voice Integration on Amazon Connect with Bedrock: Polly Neural vs Bedrock Voice, Streaming, and the Latency Budget

Mayowa A. · 13 min read

An operator-grade pattern from the CreativeMinds Development (cmdev) AI engineering practice. The narrower, more technical companion to Amazon Connect AI Agent — what ships, what you build and Building AI Agents on Amazon Bedrock Foundations.

Key takeaways

  • The bar for natural voice AI is end-to-end response under 800ms; conversational quality lives under 500ms. Above 1.2s the caller treats the system as broken. The budget is the architecture.
  • Polly Neural with streaming over Kinesis Video Streams is the production default. Polly Generative sounds materially better but adds 150–300ms of synthesis latency and is not yet streaming-native on Connect — keep it for IVR prompts and non-conversational responses.
  • The streaming-assembly pattern hides the 250–600ms Bedrock generation time. Start TTS synthesis on the first complete clause from the model stream, not on the full response. This is what makes a Sonnet-tier agent feel as responsive as a scripted IVR.
  • The dialect problem is the silent killer on non-North-American English. Transcribe defaults to en-US; Nigerian, Indian, Scottish, and Singaporean callers see intent classification drop 8–15 points without the regional variant, custom vocabulary, and silence-detection retuning.
  • The cost shape on a 10K-minutes-per-month workload is ~$90–160 per thousand minutes all-in. Bedrock is the variable line and the prompt cache is the lever that decides whether reasoning runs at $0.06/min or $0.12/min.
Voice AI end-to-end latency waterfall — six stages stacking from caller speech end to first audio back to the caller: Transcribe Streaming final transcript (100ms median, 220ms 95th), Transcribe to Bedrock routing hop (30ms, 60ms), Bedrock Sonnet first token with prompt-cache hit (250ms, 450ms), first clause assembled from the streaming tokens (150ms, 300ms), Polly Neural streaming synthesis of the first clause (200ms, 350ms), and KVS to Connect audio path back to the caller (50ms, 100ms); the bands mark less-than-500ms conversational ideal, 500-800ms acceptable, 800-1200ms awkward, and greater-than-1200ms broken; median end-to-end first audio is approximately 550ms within the conversational threshold, with 95th percentile at approximately 1000ms reading as a brief pause not a broken system.
Figure 1 — The end-to-end latency waterfall. 550ms median sits inside the conversational band; the 95th at 1000ms is a brief pause, not a broken system. Every TTS choice, every streaming pattern, every model tier reduces to which band the median lands in.

What "natural" voice AI actually demands

Every voice agent we ship is rated against the same buyer question: "Does it sound like a person?" The engineering translation: can the agent respond inside the human turn-taking window without the caller noticing the gap? Conversational research puts the median gap between human speakers at ~200ms and comfortable turn-taking at up to 500ms. Beyond 800ms the gap reads as hesitation; beyond 1.2 seconds the caller assumes the line is broken.

Every decision (TTS engine, streaming pattern, model tier, STT configuration) collapses to two questions: did we hit the budget on the median, and where does the 95th percentile sit? This piece is the arithmetic.
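The thresholds above can be written down as a small classifier, which is handy as a dashboard label or an alarm condition. This is a minimal sketch: the band names follow Figure 1, while the function name and the choice of which band owns each exact boundary value are assumptions.

```python
def latency_band(first_audio_ms: float) -> str:
    """Map an end-to-end first-audio latency to the bands used in this piece."""
    if first_audio_ms < 500:
        return "conversational"
    if first_audio_ms <= 800:
        return "acceptable"
    if first_audio_ms <= 1200:
        return "awkward"
    return "broken"


if __name__ == "__main__":
    for sample_ms in (380, 650, 1000, 1300):
        print(sample_ms, latency_band(sample_ms))
```

Tracking the band of the median and of the 95th percentile separately keeps both questions visible on one chart.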

The TTS decision — Polly Neural, Polly Generative, Bedrock voice

Connect ships Amazon Polly as the default. Polly has four engines (Standard, Neural, Long-form, and Generative), and they are not equivalent for real-time voice.

Polly Standard — concatenative voices, audibly synthetic. Fine for IVR menus, unacceptable for a conversational agent in 2026.

Polly Neural (NTTS) — the production default. Naturalistic prosody, streaming synthesis via Connect KVS, SSML control. Joanna, Matthew, Aditi, and Ayanda cover most deployments; the Nigerian English voice is the obvious gap and the workaround is Ayanda (South African English) with a custom lexicon for Nigerian terms.

Polly Generative — materially more human than Neural at the cost of 150-300ms additional synthesis latency. Not yet streaming-native on Connect KVS. Neural for live turns, Generative for non-real-time content.

Bedrock voice models — Nova Sonic collapses STT-LLM-TTS into one invocation. Latency win is real (200-350ms end-to-end) but you lose the intermediate transcript surface that compliance logging, intent classification, and handoff context depend on. Anthropic Claude voice via Bedrock is the same trade-off. Regulated buyers keep the pipeline pattern; consumer workloads with light audit weight take the speech-to-speech unlock.

SSML control — Polly accepts <prosody>, <break>, <say-as>. Matters more on Neural than on Generative, which interprets prosody from the text and ignores some directives. For a banking agent reading out an account balance correctly the first time, SSML on Neural is the reliable path.
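As an illustration of the banking case, the sketch below builds an SSML string for a balance read-out using the three tags named above. The function name, the wording of the prompt, and the 300ms pause are hypothetical; `digits` is one of the documented `interpret-as` values for `<say-as>`, and the amount is passed in already spelled out so nothing depends on currency parsing.

```python
from xml.sax.saxutils import escape


def balance_ssml(caller_name: str, last_four: str, amount_spoken: str) -> str:
    """Build SSML for a balance read-out (illustrative wording only)."""
    return (
        "<speak>"
        f"{escape(caller_name)}, the balance on the account ending "
        # Read the last four digits one by one rather than as a number.
        f'<say-as interpret-as="digits">{escape(last_four)}</say-as>'
        # Short pause before the figure the caller is waiting for.
        ' <break time="300ms"/> is '
        # Slow down on the amount so it lands the first time.
        f'<prosody rate="slow">{escape(amount_spoken)}</prosody>.'
        "</speak>"
    )


if __name__ == "__main__":
    print(balance_ssml("Ada", "4821", "two thousand four hundred naira"))
```

Escaping the interpolated values matters in practice: an unescaped ampersand in a business name is enough to make the SSML invalid.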

The default decision rubric:

| Use case | TTS choice | Why |
|---|---|---|
| Live conversational turns | Polly Neural with streaming | Latency, SSML control, KVS-native |
| IVR prompts, scripted disclosures | Polly Generative (pre-synthesised) | Quality over latency for static content |
| Voicemail summaries, callback messages | Polly Generative or Long-form | Async, batch synthesis is fine |
| Consumer agent, no audit weight | Nova Sonic speech-to-speech | 200-350ms end-to-end, no transcripts |
| Regulated voice agent | Polly Neural + Bedrock pipeline | Transcripts for the audit log |
TTS engine decision matrix — four workload types against four voice engines; rows are chat-style assistant, transactional voice for banking and KYC and claims, long-form narration for voicemail summaries and scripted disclosures, and multi-language and dialect workloads with code-switching callers; columns are Polly Neural with streaming and SSML and KVS-native, Polly Generative and Long-form with plus-150-to-300ms latency overhead, Bedrock Nova Sonic with speech-to-speech 200-350ms end-to-end but no transcript surface, and Claude voice on Bedrock as the premium light-audit edge case; best-fit cells highlighted include Polly Neural for chat-style, transactional regulated voice, and multi-language workloads, Polly Generative for long-form narration, and Nova Sonic as a strong alternative for consumer chat-style with no audit requirement; avoid cells flag Nova Sonic and Claude voice for transactional regulated workloads because they break the audit-log layer.
Figure 2 — Workload type against TTS engine. Polly Neural is still the production default in 2026 — the only path that combines streaming synthesis, SSML control, and intermediate transcripts for the audit log.
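The rubric reduces to three questions asked in order. The sketch below encodes that order; the function and parameter names are hypothetical, and the returned strings simply repeat the rubric's wording.

```python
def choose_tts(live_turns: bool, audit_required: bool, speech_to_speech_ok: bool = False) -> str:
    """Pick a TTS path following the decision rubric (illustrative only)."""
    if not live_turns:
        # Static or asynchronous content: quality beats latency.
        return "Polly Generative or Long-form (pre-synthesised or batch)"
    if audit_required:
        # Regulated workloads need the intermediate transcript surface.
        return "Polly Neural + Bedrock pipeline"
    if speech_to_speech_ok:
        # Consumer workloads with light audit weight can take the latency win.
        return "Nova Sonic speech-to-speech"
    return "Polly Neural with streaming"


if __name__ == "__main__":
    print(choose_tts(live_turns=True, audit_required=True))
    print(choose_tts(live_turns=True, audit_required=False, speech_to_speech_ok=True))
    print(choose_tts(live_turns=False, audit_required=False))
```

Checking the audit requirement before the speech-to-speech option is deliberate: a regulated workload should never reach the Nova Sonic branch.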

Streaming versus batch synthesis

The biggest perceived-latency lever is whether TTS starts speaking before the LLM finishes. Batch waits for the full response — on a 200-token Sonnet response at ~60 tokens/second, that's 3.3 seconds of generation before TTS starts. Add 200ms of synthesis and the caller has heard 3.5 seconds of silence. Unshippable.
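The batch arithmetic is simple enough to check in code. The sketch below reproduces the 3.5-second figure and, for contrast, applies the same formula to a first clause only. The 10-token clause length is an assumption for illustration, and both figures are measured from the first token, as in the paragraph above.

```python
def silence_before_audio_s(tokens_to_wait_for: int, tokens_per_s: float, synth_s: float) -> float:
    """Seconds of silence while waiting for N tokens and then synthesising them."""
    return tokens_to_wait_for / tokens_per_s + synth_s


if __name__ == "__main__":
    # Batch: wait for the whole 200-token response.
    batch = silence_before_audio_s(200, tokens_per_s=60, synth_s=0.2)
    # Streaming: wait only for a first clause (10 tokens is an assumed length).
    streaming = silence_before_audio_s(10, tokens_per_s=60, synth_s=0.2)
    print(f"batch: {batch:.2f}s, streaming first clause: {streaming:.2f}s")
```

The response length drops out of the streaming figure entirely, which is why a long answer and a short one can start speaking at the same moment.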

The streaming-assembly pattern: open a streaming Bedrock invocation; a clause-boundary detector watches the stream for sentence-final punctuation, comma-bounded clauses of more than five tokens, or a 200ms gap; on the first complete clause, issue a Polly streaming SynthesizeSpeech, which streams audio bytes back over a KVS-fed Connect channel; the agent starts speaking inside ~200-350ms of first token, with remaining clauses synthesising in parallel.

The trick is the clause-boundary detector. Too aggressive and the agent produces unnatural mid-sentence pauses; too conservative and the streaming advantage collapses. We have converged on sentence-final punctuation as the primary signal with a 1.2-second hard cap. Empirical median first-audio latency, against Sonnet with Polly Neural streaming, is 380-520ms.
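A clause-boundary detector along these lines can be sketched as a generator over the token stream. This is not the production detector: it assumes each token arrives as a `(text, arrival_time_s)` pair, treats the hard cap as time since the clause started, and leaves out the 200ms-gap signal to stay short. All names are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

SENTENCE_END = (".", "!", "?")


def clauses(
    tokens: Iterable[Tuple[str, float]],
    min_comma_tokens: int = 5,
    hard_cap_s: float = 1.2,
) -> Iterator[str]:
    """Yield speakable clauses from a stream of (text, arrival_time_s) tokens."""
    buffer: List[str] = []
    started_at: Optional[float] = None
    for text, arrived_at in tokens:
        if started_at is None:
            started_at = arrived_at
        buffer.append(text)
        tail = text.rstrip()
        sentence_done = tail.endswith(SENTENCE_END)
        # A comma only counts once the clause is longer than five tokens.
        comma_done = tail.endswith(",") and len(buffer) > min_comma_tokens
        capped = (arrived_at - started_at) >= hard_cap_s
        if sentence_done or comma_done or capped:
            yield "".join(buffer).strip()
            buffer, started_at = [], None
    if buffer:
        # Flush whatever is left when the model stream closes.
        yield "".join(buffer).strip()


if __name__ == "__main__":
    stream = [("Your", 0.00), (" balance", 0.02), (" is", 0.04), (" ready", 0.06),
              (".", 0.08), (" Anything", 0.10), (" else", 0.12), ("?", 0.14)]
    for clause in clauses(stream):
        print(repr(clause))
```

Each yielded clause would be handed to the TTS call while the generator keeps consuming the stream, so later clauses synthesise in parallel with playback of the first.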

The pattern collapses under Polly Generative because synthesis is not yet streaming-native — the reason we keep Generative for non-conversational use.

The STT side — Transcribe, dialect, and the accent problem

The inbound budget is smaller than people expect. Transcribe Streaming on en-US returns partial transcripts within 200-400ms of speech onset and a stable final 100-300ms after speech end. The Connect-to-Transcribe pipeline adds 50-100ms. Inbound STT contributes 350-800ms to end-to-end depending on configuration and accent — the part of the pipeline most teams underinvest in.

The dialect problem is the silent killer. Transcribe ships regional variants — en-GB, en-IN, en-AU, en-NG, en-ZA — but Connect does not select them automatically. Default en-US against a strong Nigerian or Indian accent shows word-error rates of 18-28% versus 6-10% on the regional variant. Intent classification on the downstream Bedrock call drops 8-15 points, translating directly to a higher handoff rate.

Configuration moves: set the regional code at the contact-flow level (detect from inbound number, IVR selection, or a brief language-identification call); build a custom vocabulary for domain terms; retune silence detection from the 1-second default to 1.5-1.8 seconds for non-North-American callers; enable automatic language identification for code-switching speakers.
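The first and third moves can be combined into a lookup keyed on the inbound number. The sketch below is a minimal illustration: the prefix table, the names, and the fallback are hypothetical, the language codes are the ones listed above and should be checked against the current Transcribe supported-language list before use, and 1.5 seconds is the low end of the retuned range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttProfile:
    language_code: str
    silence_threshold_s: float


# Hypothetical mapping from dialling prefix to Transcribe profile.
DEFAULT_PROFILE = SttProfile("en-US", 1.0)
PROFILE_BY_PREFIX = {
    "+1": SttProfile("en-US", 1.0),
    "+44": SttProfile("en-GB", 1.5),
    "+91": SttProfile("en-IN", 1.5),
    "+61": SttProfile("en-AU", 1.5),
    "+27": SttProfile("en-ZA", 1.5),
    "+234": SttProfile("en-NG", 1.5),  # code as listed in the text; verify availability
}


def profile_for(caller_number: str) -> SttProfile:
    """Pick the profile with the longest matching dialling prefix."""
    for prefix in sorted(PROFILE_BY_PREFIX, key=len, reverse=True):
        if caller_number.startswith(prefix):
            return PROFILE_BY_PREFIX[prefix]
    return DEFAULT_PROFILE


if __name__ == "__main__":
    print(profile_for("+2348012345678"))
    print(profile_for("+14155550100"))
```

The inbound number is only a first guess; an IVR selection or a language-identification result, where available, should override it.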

The hardest case is the multi-dialect contact centre. A Lagos-based centre takes calls from Lagos, Abuja, Port Harcourt, Kano, and Accra in a single shift, and variance within "Nigerian English" is itself wide. We ship dialect detection on the first three seconds with a hand-off to the right Transcribe profile — ~400ms of one-off setup, not per-turn cost.

The barge-in pattern — half-duplex versus full-duplex

A conversational agent must let the caller interrupt. The pattern is barge-in and it is harder than it looks.

Half-duplex. Agent listens for voice energy on the inbound stream while speaking; on detected speech, TTS stops and Transcribe opens. Default Connect AI Agent behaviour. Works on clean audio, fails on noisy lines and echo bleed — false-positive rate on noisy mobile lines is a tuning problem on every Nigerian and Indian deployment.

Full-duplex. Transcribe and Polly run in parallel and the model decides on meaning, not voice energy, whether to keep talking or yield. More natural and dramatically harder to implement. Nova Sonic supports it natively; the pipeline path generally does not, and half-duplex is what most teams ship.

Half-duplex tuning: VAD threshold (too low and breath triggers it, too high and the caller has to shout); echo cancellation on outbound audio so TTS doesn't trigger inbound VAD; minimum speech duration before barge-in fires (200-300ms is the sweet spot).
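The minimum-speech-duration gate is the easiest of the three knobs to show in code. The sketch below assumes the inbound audio has already been reduced to one energy value per 20ms frame; the frame size, the 250ms default, and the names are illustrative assumptions rather than Connect settings.

```python
import math
from typing import Optional, Sequence


def barge_in_frame(
    frame_energies: Sequence[float],
    vad_threshold: float,
    frame_ms: int = 20,
    min_speech_ms: int = 250,
) -> Optional[int]:
    """Return the index of the frame where barge-in fires, or None."""
    frames_needed = math.ceil(min_speech_ms / frame_ms)
    run = 0
    for index, energy in enumerate(frame_energies):
        # Count consecutive frames above the VAD threshold; any quiet frame resets.
        run = run + 1 if energy >= vad_threshold else 0
        if run >= frames_needed:
            return index
    return None


if __name__ == "__main__":
    cough = [0.9] * 5 + [0.0] * 20          # 100ms burst: ignored
    speech = [0.0] * 3 + [0.9] * 15         # 300ms of speech: fires
    print(barge_in_frame(cough, vad_threshold=0.5))
    print(barge_in_frame(speech, vad_threshold=0.5))
```

Raising `min_speech_ms` trades false positives on noisy lines for a slower yield to the caller, so it is worth tuning per deployment rather than globally.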

The end-to-end latency budget — a worked example

The budget we underwrite for a regulated voice agent on Connect-plus-Bedrock-plus-Polly, target 550ms median:

| Stage | Median | 95th |
|---|---|---|
| Inbound speech end → Transcribe final transcript | 100ms | 220ms |
| Transcribe → Bedrock invocation | 30ms | 60ms |
| Bedrock first token (Sonnet, prompt-cache hit) | 250ms | 450ms |
| First clause assembled from streaming tokens | 150ms | 300ms |
| Polly Neural streaming synthesis, first clause | 200ms | 350ms |
| KVS → Connect audio path → caller | 50ms | 100ms |
| End-to-end first-audio latency | ~550ms | ~1000ms |

550ms median feels conversational. 1000ms at the 95th percentile reads as a brief pause, not a broken system. Push the 95th above 1.2s and the system reads as broken even when the median is fine.
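As a sanity check, the stage rows can be added up in a few lines. A plain serial sum of the six stages comes to 780ms at the median and 1480ms at the 95th, which is above the ~550ms and ~1000ms end-to-end figures. One plausible reading, and it is an assumption here since the budget does not say how the rows combine, is that adjacent stages overlap under streaming and that per-stage 95th percentiles rarely coincide on the same turn. On that reading the serial sum is a worst case, not the expected figure.

```python
# (stage, median_ms, p95_ms) copied from the budget rows above.
STAGES = [
    ("Transcribe final transcript", 100, 220),
    ("Transcribe to Bedrock hop", 30, 60),
    ("Bedrock first token (cache hit)", 250, 450),
    ("First clause assembled", 150, 300),
    ("Polly Neural first clause", 200, 350),
    ("KVS to Connect to caller", 50, 100),
]

median_sum = sum(median for _, median, _ in STAGES)
p95_sum = sum(p95 for _, _, p95 in STAGES)
largest_stage = max(STAGES, key=lambda stage: stage[1])[0]

if __name__ == "__main__":
    print(f"serial sum: median {median_sum}ms, 95th {p95_sum}ms")
    print(f"largest median contributor: {largest_stage}")
```

Whichever way the rows combine, the largest single contributor is the Bedrock first token, which is why the prompt cache is the first lever to pull.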

Levers: prompt cache cuts first-token latency by 100-250ms (highest-leverage single move); streaming clause assembly hides the rest of generation; the regional Transcribe variant keeps inbound honest; provisioned throughput eliminates 95th-percentile queue variance. Miss on any of these and the caller experiences the system as awkward — not broken, just off — and satisfaction scores drop quietly over the month.

The cost shape — worked example at 10K minutes per month

Mid-2026 rates against a 10,000-minute monthly workload (~5,000 calls at 2 minutes AHT):

| Component | Rate | Monthly |
|---|---|---|
| Connect telephony (inbound, US) | $0.018/min | $180 |
| Transcribe Streaming | $0.024/min | $240 |
| Polly Neural TTS | $16/1M chars (~$0.024 per spoken minute, agent talking 60% of the call) | $144 |
| Bedrock Sonnet (800 in + 200 out per turn, 5 turns/call) | ~$0.06/min with cache hit | $600 |
| Bedrock KB retrieval | ~$0.005/min | $50 |
| Total | | ~$1,214 |

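That works out to ~$121 per thousand minutes for this configuration, inside the $90-160 per thousand minutes we see all-in. Bedrock dominates the variable portion and the prompt cache keeps it at $0.06/min rather than $0.12-0.15/min. Polly Generative on live turns would add ~$126/month at $30/1M characters versus $16/1M for Neural: the same ~9M characters priced at $270 instead of $144.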
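The monthly total is easy to reproduce, which makes it easy to re-run with your own rates. The sketch below uses the rates from the cost breakdown; reading the Polly line as $0.024 per spoken minute at a 60% talk share is an inference from the $144 monthly figure, and the variable names are hypothetical.

```python
MINUTES_PER_MONTH = 10_000
AGENT_TALK_SHARE = 0.60  # share of each call minute the agent spends speaking

monthly = {
    "Connect telephony": 0.018 * MINUTES_PER_MONTH,
    "Transcribe Streaming": 0.024 * MINUTES_PER_MONTH,
    # Polly is billed on characters; $0.024 is the per-spoken-minute equivalent.
    "Polly Neural TTS": 0.024 * AGENT_TALK_SHARE * MINUTES_PER_MONTH,
    "Bedrock Sonnet (cache hit)": 0.06 * MINUTES_PER_MONTH,
    "Bedrock KB retrieval": 0.005 * MINUTES_PER_MONTH,
}

total = round(sum(monthly.values()), 2)
per_thousand_minutes = round(total / (MINUTES_PER_MONTH / 1000), 2)

if __name__ == "__main__":
    for component, cost in monthly.items():
        print(f"{component:28s} ${cost:,.0f}")
    print(f"{'Total':28s} ${total:,.0f}  (${per_thousand_minutes:.0f} per thousand minutes)")
```

Swapping the Bedrock rate for the uncached figure is the quickest way to see how much of the total the prompt cache is carrying.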

Production cost surprises: long-tail loops push per-call Bedrock cost 3-5x above the median (per-call cost-anomaly alarm); KB re-retrieval on every turn multiplies retrieval cost by turn count (cache for the conversation); streaming responses bill the same as non-streaming on input tokens — streaming changes perceived latency, not cost shape.
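A per-call cost-anomaly alarm can start as a comparison against the median. The sketch below flags calls whose Bedrock cost exceeds a multiple of the median; the 3x multiple matches the low end of the 3-5x range above, while the function name and the input shape (call ID to cost in dollars) are assumptions.

```python
from statistics import median
from typing import Dict, List


def cost_anomalies(cost_by_call: Dict[str, float], multiple: float = 3.0) -> List[str]:
    """Return IDs of calls costing more than `multiple` times the median call."""
    if not cost_by_call:
        return []
    baseline = median(cost_by_call.values())
    return sorted(call_id for call_id, cost in cost_by_call.items() if cost > multiple * baseline)


if __name__ == "__main__":
    costs = {"call-a": 0.10, "call-b": 0.12, "call-c": 0.11, "call-d": 0.55}
    print(cost_anomalies(costs))
```

In production the baseline would come from a rolling window, not the batch being checked, so a burst of looping calls cannot drag the median up and hide itself.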

Where it breaks — the real-world friction points

Long pauses from the model mid-response. Claude or Nova occasionally emits a long internal pause. The streaming-clause pattern hits the 1.2s cap and synthesises an incomplete clause; the caller hears awkward truncation. Mitigation: a clause buffer that holds until the next punctuation or sentence-final token arrives.

Multi-language switching mid-call. Transcribe auto-identification handles the inbound; Sonnet or Opus reasons on multilingual content; Polly cannot switch language mid-utterance. Pragmatic pattern: respond in English with an acknowledgement ("I understood you, let me respond in English").

Regional accents not in the training set. Nigerian Pidgin, Ghanaian English, Sierra Leonean Krio fall through to en-US. Custom vocabularies help on terminology, not phonetics. Handoff rate will be higher; instrument the trigger by detected accent.

Background noise and mobile-line quality. The cmdev test corpus includes Lagos commute audio (traffic, generators, market noise) because that's production reality for Nigerian retail banking. Mitigation: noise suppression on inbound audio before Transcribe — adds 50-100ms, recovers most of the accuracy loss.

Compliance disclosures that must be delivered verbatim. KYC consent prompts, recording notifications, T&C read-outs drift if the model paraphrases. Pattern: hard-coded SSML disclosure blocks pre-synthesised at deployment, rendered as-is, with the model forbidden from regenerating the text.
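One way to enforce the verbatim rule is to let the model emit only a marker and have the renderer substitute the approved block. The sketch below is an illustration of that idea, not the deployed mechanism: the marker syntax, the registry, and the disclosure wording are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical registry of approved disclosure blocks, fixed at deployment.
DISCLOSURES = {
    "recording_notice": "<speak>This call is recorded for quality and compliance purposes.</speak>",
}

MARKER = re.compile(r"\[\[DISCLOSURE:([a-z_]+)\]\]")


def render_segments(model_text: str) -> List[Tuple[str, str]]:
    """Split model output into free text and verbatim disclosure segments.

    An unknown disclosure ID raises KeyError so a typo cannot fall back to
    model-generated wording.
    """
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in MARKER.finditer(model_text):
        free_text = model_text[position:match.start()].strip()
        if free_text:
            segments.append(("model", free_text))
        segments.append(("disclosure", DISCLOSURES[match.group(1)]))
        position = match.end()
    tail = model_text[position:].strip()
    if tail:
        segments.append(("model", tail))
    return segments


if __name__ == "__main__":
    reply = "Before we continue. [[DISCLOSURE:recording_notice]] How can I help?"
    for kind, content in render_segments(reply):
        print(kind, "->", content)
```

Disclosure segments would play from the pre-synthesised audio, and only the `model` segments go through live synthesis.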

What this taught us about voice agent engineering

1. The latency budget is the architecture. Every TTS choice, every streaming pattern, every model tier reduces to whether the median turn hits the budget. First-class constraint, not a tuning step.

2. Polly Neural streaming is still the production default in 2026. Generative sounds better and Nova Sonic is faster, but Neural is the only path that combines streaming synthesis, SSML control, and intermediate transcripts — what a regulated voice agent needs together.

3. The dialect problem is solvable and almost always under-invested. Regional Transcribe variant plus custom vocabulary plus retuned silence thresholds closes most of the gap. Teams that skip the work ship a system the buyer's customers experience as condescending.

4. The cost shape is dominated by Bedrock and tamed by the prompt cache. $0.06/min versus $0.12/min on reasoning is the difference between a workload that pencils and one that doesn't. The cache is not optional at production volume.

5. The voice-quality bar is rising every quarter. Callers compare contact-centre agents to consumer voice assistants. The architecture that shipped in 2025 will not hold in 2027 without active retuning.

FAQs

Polly Generative or Neural for live conversational turns?

Neural with streaming. Generative sounds better but is not yet streaming-native on the Connect KVS path, and the 150-300ms additional synthesis latency pushes a tight budget into the awkward zone. Keep Generative for IVR prompts, voicemail summaries, and scripted disclosures.

When does Nova Sonic or Claude voice beat the Polly-plus-Bedrock pipeline?

On consumer workloads with light audit weight, where the 200-350ms end-to-end win is decisive and you can live without the intermediate transcripts that compliance logging, intent classification, and supervisor handoff context depend on. Regulated buyers in banking, healthcare, and insurance keep the pipeline-with-transcripts pattern.

What end-to-end latency should we underwrite?

550ms median, 1000ms at the 95th, on Transcribe Streaming with the correct regional variant, Bedrock Sonnet with cache hits, streaming clause assembly into Polly Neural, and the Connect KVS audio path. Below 500ms is conversational quality; above 800ms feels hesitant; above 1.2s the caller treats the system as broken. The biggest single lever is the prompt cache.

How do we handle Nigerian, Indian, and other non-North-American accents?

Set the correct regional Transcribe code at the contact-flow level rather than defaulting to en-US. Build a custom vocabulary for domain terms. Retune silence-detection thresholds upwards to 1.5-1.8 seconds. For code-switching callers, enable automatic language identification on the first three seconds and route to the matching profile. Closes most of the 8-15 point accuracy gap.

What does the workload cost at 10,000 minutes per month?

Roughly $1,200 all-in — $180 Connect telephony, $240 Transcribe, $144 Polly Neural, $600 Bedrock Sonnet with cache hits, $50 KB retrieval. Call it $90-160 per thousand minutes depending on tier. Bedrock dominates the variable portion and the cache keeps it at $0.06/min rather than $0.12-0.15/min. Long-tail loops push per-call cost 3-5x above the median.

How to engage

We design and ship Connect-plus-Bedrock voice agent rollouts for regulated buyers — with the latency budget, the dialect tuning, the streaming-assembly pattern, and the cost discipline that survive a compliance review and a production call volume. Talk to us at creativeminds.dev/contact.

Mayowa A. is CTO of CreativeMinds Development. He leads cmdev's AI engineering practice for regulated enterprises across Africa and the EU.
