Series · Amazon Bedrock for Production AI · Part 2 of 8 · RAG with Bedrock Knowledge Bases
A compliance officer asks her newly deployed copilot about the company's refund policy for a 2024 purchase. The agent returns a fluent, confident answer. It is from a 2019 policy archive that should have been retired three years ago. Nobody in the room notices for two weeks.
This is what production RAG failure looks like, almost always. The model is articulate. The retrieval is wrong. The pipeline upstream of the language model handed it the wrong context, indexed at the wrong granularity, retrieved by the wrong method — and the answer reads competent because Claude reads competent. The fluency disguises the failure.
Key takeaways
- Production RAG has five gates: source storage, chunking, embedding, vector storage, retrieval. Most production failures we see are at one of these five — not in the model.
- The pipeline is multi-model by design. Cohere Embed v3 or Titan v2 produces vectors; Claude reasons over the retrieved chunks. The cost ratio between embedding and reasoning is roughly 20-50× per token.
- Vector store choice has the longest reversal cost in the pipeline. OpenSearch Serverless is the defensible AWS-native default; pgvector on Aurora is the cheaper alternative for moderate scale.
- Hybrid search (BM25 + vector) plus Cohere Rerank improves retrieval precision 10-30% on hard queries. Top-K = 3-5 after re-ranking beats stuffing 50 chunks into the context window.
- You cannot tune any of this without a retrieval harness — 100-500 golden query/answer pairs with hit rate, rank, faithfulness, and citation accuracy tracked over time. Without it, every decision is guesswork.
Four Out Of Five Bedrock Workloads
Build five production Bedrock workloads in 2026 and four will be retrieval-augmented generation in some shape. The agent answering customer questions from a policy archive. The internal copilot reasoning over the company's documentation. The compliance assistant retrieving regulatory text. The data-room agent for due-diligence work. The technical-support agent that grounds answers in product documentation. Documents go in one end; an answer with citations comes out the other.
The pipeline that does that work has five gates: source storage, chunking, embedding, vector storage, retrieval. Each is a real engineering decision with real cost and quality consequences. And the inconvenient truth, observed across deployments, is that production RAG rarely fails at the model. It fails at one of these five gates upstream.
This piece is the working playbook for those five gates on Amazon Bedrock, and for the multi-model topology that makes the whole pipeline cost-defensible.
Five Gates From Bucket To Answer
The diagram makes two things explicit. The first: the pipeline is multi-model by design. Think of it like a translation team. Cohere Embed v3 or Amazon Titan Embeddings v2 are the bilingual clerks who convert every document into the numeric language the search index understands. Claude is the senior translator who reads what the clerks retrieved and writes the actual answer. The clerks cost a fraction of what the senior translator costs, roughly twenty to fifty times less per token, which is exactly why using Claude for embeddings would be economically indefensible even if Claude exposed an embeddings API on Bedrock. It does not.
The second: the vector store is the architectural decision with the longest tail. Swapping embedding models means re-embedding the corpus, which is real but manageable cost. Swapping vector stores means re-ingesting everything, re-indexing, re-tuning hybrid search, re-validating the retrieval harness — often months of work. Choose this gate the way you would choose a database, not the way you would choose a library.
Gate one — what goes into the bucket
Bedrock Knowledge Bases ingest from Amazon S3 by default, with increasing support for Web Crawlers, Confluence, Salesforce, SharePoint, and custom data sources. The bucket is the front door.
What lives behind that door matters as much as what is inside the documents themselves. The defensible posture has four pieces. Keep one bucket per domain rather than per agent — mixing customer-facing FAQs with internal engineering runbooks is the simplest way to surface internal documents to external users. Apply KMS Customer Managed Keys, Object Lock, and versioning, because the corpus is sensitive by default and the next bad ingestion will need a rollback. Discipline the source format: PDFs are the worst input for RAG (poor layout reconstruction, OCR errors on scans, mangled tables), and where they are unavoidable, pre-process through Amazon Textract or Unstructured.io before landing in the source bucket. And write a metadata sidecar for every document — <filename>.metadata.json declaring source, published_date, classification, language, owner_team. Without metadata you cannot do filtered retrieval, and without filtered retrieval you cannot do per-tenant or per-language or per-recency queries, which means cross-tenant leaks become a question of when, not if.
The work at this gate is one-time per corpus. It also determines the ceiling on every gate downstream.
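As a minimal sketch of the sidecar step, the snippet below writes a `<filename>.metadata.json` file next to a document. It assumes the Bedrock Knowledge Bases convention of wrapping attributes in a `metadataAttributes` object; the helper name, the file name, and every attribute value are hypothetical, and storing `published_date` as a numeric YYYYMMDD value is an assumption made so that range filters can compare it.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_sidecar(document_path: Path, attributes: dict) -> Path:
    """Write <filename>.metadata.json next to the document and return its path."""
    sidecar_path = document_path.with_name(document_path.name + ".metadata.json")
    # Filterable attributes live under the "metadataAttributes" key.
    sidecar_path.write_text(json.dumps({"metadataAttributes": attributes}, indent=2))
    return sidecar_path


# Hypothetical document and attribute values, for illustration only.
workdir = Path(tempfile.mkdtemp())
doc = workdir / "refund-policy-2024.pdf"
doc.write_bytes(b"%PDF-placeholder")

sidecar = write_metadata_sidecar(doc, {
    "source": "policy-portal",
    "published_date": 20240115,  # assumption: numeric YYYYMMDD for range filters
    "classification": "public",
    "language": "en",
    "owner_team": "customer-operations",
})
print(sidecar.name)
```

In practice the document and its sidecar would be uploaded to the source bucket together, so the next ingestion picks up both.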
Gate two — chunking, or what a page actually means
A 200-page document is not a unit of retrieval. The chunking strategy splits documents into the units the embedding model can vectorise and the retrieval layer can return. Bedrock Knowledge Bases support four strategies.
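Fixed-size is the default for most homogeneous text corpora. It splits documents into chunks of N tokens (default 300) with an overlap of M tokens between neighbours (60 here, which Bedrock configures as an overlap percentage: 20 per cent of 300). It is predictable, fast, and well-understood: the workhorse setting.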
Hierarchical chunking uses two levels: small chunks for retrieval precision, large parent chunks returned to the model for context. Imagine borrowing a single paragraph from a contract but quoting the whole clause it sits inside. This is the right move for legal contracts, regulatory text, and long-form documentation where the parent's context disambiguates the child's meaning.
Semantic chunking splits at semantic boundaries detected by an embedding model. Higher ingestion cost; better answer quality. It earns its place in heterogeneous corpora where fixed-size obviously fragments meaning — a knowledge base of cooking recipes where ingredients and instructions blur together is the textbook case.
No chunking — treating each document as one chunk — only works for tiny documents where the whole thing fits in the model's context. Rare in production.
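Of the four strategies, hierarchical is the hardest to picture, so here is a rough sketch of the idea: small child chunks are what retrieval matches on, and each child points back to the larger parent that is handed to the model. This is an illustration, not the Bedrock implementation. The 1500/300 sizes mirror the configuration shown later in this piece, tokens are modelled as a plain list, and the function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    parent_id: int  # index of the parent chunk this child was cut from
    tokens: list


def hierarchical_chunks(tokens, parent_size=1500, child_size=300):
    """Split tokens into parent chunks, then split each parent into child chunks."""
    parents, children = [], []
    for start in range(0, len(tokens), parent_size):
        parent = tokens[start:start + parent_size]
        parent_id = len(parents)
        parents.append(parent)
        for child_start in range(0, len(parent), child_size):
            child_tokens = parent[child_start:child_start + child_size]
            children.append(ChildChunk(parent_id, child_tokens))
    return parents, children


def parent_for(child, parents):
    """Return the parent chunk that would be passed to the model for this child."""
    return parents[child.parent_id]


# Stand-in tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(3200)]
parents, children = hierarchical_chunks(tokens)
print(len(parents), len(children))
```

Retrieval would embed and search the children only; when a child matches, `parent_for` supplies the surrounding clause.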
The defensible default for unfamiliar corpora is fixed-size 300/60. Tune from there based on the retrieval harness. Two operational rules sit beneath this. Chunk size is not "set it and forget it": review the harness on roughly a quarterly cadence, and re-chunk and re-embed when the hit rate falls below the threshold the use case requires. Overlap matters for boundary continuity; zero overlap loses meaning at chunk boundaries, and excessive overlap inflates corpus size roughly in proportion to the overlap. The working band is 15 to 25 per cent.
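The 300/60 setting can be sketched as a sliding window. This is a simplified illustration rather than Bedrock's own chunker: tokens are modelled as a plain list, and `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(tokens, max_tokens=300, overlap=60):
    """Split tokens into windows of max_tokens that share `overlap` tokens with the next window."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Stand-in for a tokenised 1000-token document.
chunks = fixed_size_chunks(list(range(1000)))
print(len(chunks), [len(c) for c in chunks])
```

With a 60-token overlap the window advances 240 tokens at a time, so a corpus chunked this way stores roughly 300/240 = 1.25 times its original token count.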
Gate three — choosing the clerk
The embedding model converts each chunk into a vector. Three things change with the choice: vector dimension (which drives storage cost), per-token embedding cost (which drives ingestion cost), and retrieval quality (which drives whether the right chunks come back).
Three credible options live on Bedrock. Cohere Embed English v3 produces 1024-dimension vectors, holds the strongest English retrieval quality in the catalogue, and supports the query-vs-document distinction that lets you embed queries and documents differently. Cohere's companion Rerank model, a separate model rather than part of the embedder, is a quiet advantage. The trade-off is its English-first design: the multilingual variant exists but lags. Cohere Embed Multilingual v3 also runs at 1024 dimensions and is the right default for Nigerian, pan-African, or EU multilingual corpora; expect a slight quality dip on English-only work compared to its English-only sibling. Amazon Titan Embeddings v2 is the AWS-native option, configurable to 256, 512, or 1024 dimensions for cost tuning, multilingual, and competitive on most benchmarks, with a marginal English-retrieval gap behind Cohere.
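To make the link between dimension and storage cost concrete, here is a back-of-the-envelope estimate. It assumes 4-byte float32 values and counts raw vector data only; index structures, replicas, and metadata add to this. The ten-million-vector corpus is a hypothetical figure.

```python
def vector_storage_gib(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in GiB (index overhead not included)."""
    return num_vectors * dimensions * bytes_per_value / 2**30


# Hypothetical corpus of ten million chunks at the three Titan v2 dimension settings.
for dims in (256, 512, 1024):
    print(dims, round(vector_storage_gib(10_000_000, dims), 2))
```

Dropping from 1024 to 256 dimensions cuts raw vector storage by a factor of four; whether retrieval quality survives that cut is a question for the harness.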
The defensible choice is Cohere Embed v3 for production retrieval quality. Titan Embeddings v2 wins where dimension flexibility justifies the marginal quality trade. Where Nigerian, African, or European multilingual support matters, Cohere Multilingual v3 is the option.
One restated point. Claude does not provide an embeddings API on Bedrock. The embedding step is Cohere or Titan. The reasoning step is Claude. The separation is the reason the pipeline survives a finance review.
Gate four — the store, and the cost of being wrong
Vector store choice has the longest reversal cost of any decision in the pipeline. Five options are credible.
Amazon OpenSearch Serverless is the default for AWS-native RAG. It is managed, integrates natively with Bedrock KB, has hybrid search (BM25 plus k-NN) built in, and scales search independently of indexing. The price is a higher per-OCU cost than self-managed, plus a minimum OCU floor that bills before you ingest anything: roughly $700 a month at four OCUs, or about half that where a two-OCU minimum applies. Check current regional pricing before budgeting.
pgvector on Aurora or RDS Postgres is right when the application already speaks Postgres, when SQL filtering of vectors matters, and when the corpus is moderate scale (under about 10M vectors). The cost is more operational overhead than serverless: manual tuning of HNSW or IVFFlat indexes, and query latency that climbs above roughly 5M vectors.
Pinecone is the choice when the team has Pinecone expertise; the serverless tier is cost-effective for small-to-medium workloads and the managed UX is strong. The price is being outside AWS — data egress for sensitive workloads, vendor lock-in to manage.
MongoDB Atlas Vector Search earns its place where the application already uses MongoDB and the document data lives natively alongside the vectors. Atlas-only; per-cluster pricing.
Redis Enterprise Cloud is the right answer when sub-millisecond retrieval matters — real-time recommendation, conversational latency-critical apps. Per-vector cost is higher than alternatives, so it is not the default for typical RAG.
The defensible AWS-native default is OpenSearch Serverless, with pgvector on Aurora as the cheaper alternative for moderate scale on Postgres-native applications. Pinecone is right when the team has existing investment and accepts the cross-cloud data flow.
The honest rule cuts across all five: choose the store the team can actually operate. A perfectly tuned Pinecone deployment beats a misconfigured OpenSearch every time, and a well-operated OpenSearch Serverless beats a mismanaged pgvector cluster every time. Operational competence dominates theoretical fit.
Gate five — retrieval mechanics
Retrieval is more than nearest-neighbour vector search. Three layers stack on top of each other.
Hybrid search runs lexical (BM25) and vector search together, weighted, and merges the results. Pure vector search misses queries where the answer hinges on a specific named entity, a code identifier, or a product SKU. Pure lexical search misses queries where the answer is phrased semantically differently from the question. Bedrock Knowledge Bases on OpenSearch Serverless do hybrid out of the box. The default weight is fifty-fifty. Corpora rich in named entities — technical docs, legal text — skew toward 60/40 in favour of lexical. Conversational corpora skew the other way. Tune via the retrieval harness.
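One common way to merge the two result lists is a weighted sum over normalised scores. The sketch below illustrates that idea only; it is not a description of how OpenSearch or Bedrock fuse results internally. The function names, document IDs, and scores are hypothetical.

```python
def min_max_normalise(scores: dict) -> dict:
    """Rescale scores to [0, 1] so BM25 and cosine scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - lo) / (hi - lo) for doc_id, score in scores.items()}


def hybrid_merge(bm25_scores: dict, vector_scores: dict, bm25_weight: float = 0.5, top_k: int = 5):
    """Weighted sum of normalised lexical and vector scores, best first."""
    lexical = min_max_normalise(bm25_scores)
    semantic = min_max_normalise(vector_scores)
    combined = {
        doc_id: bm25_weight * lexical.get(doc_id, 0.0)
        + (1 - bm25_weight) * semantic.get(doc_id, 0.0)
        for doc_id in set(lexical) | set(semantic)
    }
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical first-pass scores for one query that names a product SKU.
bm25 = {"sku-doc": 12.0, "policy": 6.0, "faq": 3.0}
vector = {"policy": 0.82, "faq": 0.80, "blog": 0.40}

balanced = hybrid_merge(bm25, vector, bm25_weight=0.5)
lexical_heavy = hybrid_merge(bm25, vector, bm25_weight=0.7)
print(balanced[0][0], lexical_heavy[0][0])
```

Shifting the weight toward lexical promotes the exact-match SKU document above the semantically similar one, which is the behaviour the weighting guidance above is tuning for.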
Re-ranking is the second layer. After the first-pass retrieval returns top-k chunks, a re-ranker scores each candidate against the query more precisely. Cohere Rerank is the production-grade option on Bedrock; it runs over the top 25 to 100 candidates and returns the top 3 to 10 in a refined order. The re-ranker is a different model from the embedder — it sees the query and document together and produces a relevance score that beats raw cosine similarity. The price is a call costing roughly the same as 25 to 100 embedding calls. The benefit is retrieval precision often jumping 10 to 30 per cent on hard queries. For most production RAG the trade is worth making; the workloads where it is not are those where retrieval is already near-perfect or the latency budget will not absorb the extra call.
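The control flow of a re-ranking pass is simple enough to sketch: score every first-pass candidate against the query, then keep the best few. In the sketch below the scorer is a deliberately crude token-overlap function standing in for a real re-ranker such as Cohere Rerank; the function names and candidate texts are hypothetical.

```python
def token_overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a re-ranker: fraction of query tokens found in the document."""
    query_tokens = set(query.lower().split())
    document_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & document_tokens) / len(query_tokens)


def rerank(query: str, candidates: list, score_fn=token_overlap_score, top_n: int = 3) -> list:
    """Score each candidate against the query and return the top_n texts, best first."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


# Hypothetical first-pass results, in the order the vector store returned them.
first_pass = [
    "shipping times for 2024 orders",
    "refund policy archive 2019",
    "refund policy for purchases made in 2024",
    "careers page",
]
top = rerank("refund policy 2024 purchases", first_pass, top_n=2)
print(top)
```

Swapping `score_fn` for a call to a production re-ranker leaves the rest of the flow unchanged, which is what makes re-ranking easy to switch on and off in the harness.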
Filtered retrieval is the third layer, and where Gate One pays off. The metadata sidecars become filter inputs. A query like "what's our refund policy for 2024 purchases" benefits from published_date >= 2024-01-01 AND classification = 'public'. Filters apply before the vector search, reducing the candidate set the vector store has to score, lowering cost and improving precision at the same time. Three filters are nearly always worth having: classification (public/internal/confidential, drives per-tenant access control at retrieval time), language (cuts the multilingual corpus to the user's language), and freshness (last-modified date for time-sensitive queries).
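A sketch of how the refund-policy filter might be expressed for the Bedrock `Retrieve` API follows. It builds the request as a plain dictionary and does not call AWS. The `andAll`, `equals`, and `greaterThanOrEquals` operators follow the Knowledge Bases retrieval filter shape as I understand it; storing `published_date` as a numeric YYYYMMDD value, the helper names, and the knowledge base ID are assumptions.

```python
def build_retrieval_filter(min_published_date: int, classification: str, language: str) -> dict:
    """Combine the freshness, classification, and language filters with a logical AND."""
    return {
        "andAll": [
            {"greaterThanOrEquals": {"key": "published_date", "value": min_published_date}},
            {"equals": {"key": "classification", "value": classification}},
            {"equals": {"key": "language", "value": language}},
        ]
    }


def build_retrieve_request(knowledge_base_id: str, query: str, metadata_filter: dict, top_k: int = 25) -> dict:
    """Assemble keyword arguments for the bedrock-agent-runtime retrieve call."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "filter": metadata_filter,
            }
        },
    }


request = build_retrieve_request(
    "KB_ID_PLACEHOLDER",  # hypothetical knowledge base ID
    "what's our refund policy for 2024 purchases",
    build_retrieval_filter(20240101, "public", "en"),
)
# With credentials configured, this would be sent roughly as:
#   boto3.client("bedrock-agent-runtime").retrieve(**request)
print(request["retrievalConfiguration"]["vectorSearchConfiguration"]["filter"])
```

Requesting 25 results here leaves room for the re-ranker to cut the set down to the final top 3 to 5.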
Where Claude Lives
After retrieval, the top-K chunks are stitched into the context window along with the user's query and the system prompt. This is where Claude does the actual reasoning. Three operational notes.
Citations are mandatory. The prompt instructs Claude to return each factual claim with the identifier of the chunk it came from. The Knowledge Base API returns retrievalResults[].location, the source location, for each chunk; give each retrieved chunk a reference number, pass the numbers and locations into the prompt, and require them in the response. Users see citations; auditors see traceability; hallucinations become detectable because they carry no citation.
Context window discipline matters more than people expect. Claude 4.x has very long context windows — 200K-plus tokens. The temptation is to stuff 50 chunks into the prompt. The right move is the smallest defensible context: top 3-5 chunks after re-ranking. More context dilutes attention and inflates per-query cost. Use the long context for the single hard document, not for piles of mediocre ones.
The synthesis prompt has a durable structure. A brief system prompt with the role. Explicit instruction to cite per claim. Instruction to say "I don't know" when the retrieved context does not support an answer. Retrieved chunks as numbered references. User query last. The exact prompt is tunable; the structure survives.
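The structure described above can be sketched as a small prompt builder. The wording of the system prompt, the function name, and the chunk contents are all hypothetical; only the ordering (role, citation rule, "I don't know" rule, numbered references, query last) comes from the text.

```python
def build_synthesis_prompt(chunks: list, user_query: str):
    """Return (system_prompt, user_message) with numbered references and the query last."""
    system_prompt = (
        "You are a policy assistant. "
        "Cite the reference number, such as [1], after every factual claim. "
        "If the references do not support an answer, say \"I don't know\"."
    )
    references = "\n".join(
        f"[{number}] (source: {chunk['location']}) {chunk['text']}"
        for number, chunk in enumerate(chunks, start=1)
    )
    user_message = f"References:\n{references}\n\nQuestion: {user_query}"
    return system_prompt, user_message


# Hypothetical re-ranked chunks; locations would come from the retrieval results.
chunks = [
    {"location": "s3://example-bucket/policies/refunds-2024.md",
     "text": "Purchases made in 2024 can be refunded within 30 days."},
    {"location": "s3://example-bucket/faqs/refunds.md",
     "text": "Refunds are issued to the original payment method."},
]
system_prompt, user_message = build_synthesis_prompt(chunks, "What is the refund window for 2024 purchases?")
print(user_message)
```

Keeping the reference numbers stable between the prompt and the response is what lets a later step check each citation against the chunk it names.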
The Multi-Model Topology, In Motion
Per query, the pipeline runs four steps, three of which invoke a model. Query embedding via Cohere Embed v3 or Titan v2 converts the user's query into a vector for retrieval: cheap and fast. Document retrieval is internal to the vector store: no model invocation, just vector arithmetic. Re-ranking via Cohere Rerank scores the top-N candidates at mid cost. Synthesis via Claude 4.x Sonnet (the default) reasons over the retrieved chunks and produces the answer: the expensive step.
For most production RAG, Sonnet is the right tier. Haiku is sufficient for simple factual lookups — FAQ retrieval, single-fact answers. Opus is overkill except for legal, financial, or scientific synthesis where the stakes justify the cost. A cascade pattern routes simple factual lookups to Haiku with a single retrieval call and no re-ranking, standard knowledge queries to Sonnet with re-ranking and top 3 chunks, and hard synthesis (multi-document, comparative analysis) to Opus with re-ranking, top 5 chunks, and extended thinking enabled. A router function — often a small Haiku call itself — classifies the incoming query and routes accordingly. The savings over running every query through Opus are an order of magnitude on per-query cost, without measurable quality regression on the simple-query class. Part 7 unpacks the cascade in depth.
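The cascade can be sketched as a lookup table plus a classifier. Here the classifier is a keyword heuristic standing in for the small Haiku call the text describes, so treat it as a placeholder. The tier labels are shorthand rather than Bedrock model IDs, and the `top_k` value for the simple tier is an assumption, since the text only specifies it for the other two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str
    rerank: bool
    top_k: int
    extended_thinking: bool


ROUTES = {
    "simple": Route(tier="haiku", rerank=False, top_k=3, extended_thinking=False),  # top_k assumed
    "standard": Route(tier="sonnet", rerank=True, top_k=3, extended_thinking=False),
    "hard": Route(tier="opus", rerank=True, top_k=5, extended_thinking=True),
}

COMPARATIVE_MARKERS = ("compare", "versus", "across", "reconcile")


def classify_query(query: str) -> str:
    """Placeholder classifier; a production router would more likely be a small model call."""
    text = query.lower()
    if any(marker in text for marker in COMPARATIVE_MARKERS):
        return "hard"
    if len(text.split()) <= 8:
        return "simple"
    return "standard"


def route_for(query: str) -> Route:
    return ROUTES[classify_query(query)]


print(route_for("What is the refund window?").tier)
print(route_for("Compare the 2019 and 2024 refund policies").tier)
```

Because the routing table is data, the harness can evaluate each query class separately and confirm that the cheaper tiers are not regressing.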
Without The Harness, You Are Guessing
You cannot tune any of the above without a retrieval-quality harness. A golden set of 100 to 500 query/answer pairs — drawn from real users or curated to look like them — each with the question, the expected answer, and the source documents the answer should come from.
Per query, four metrics. Hit rate asks whether the right document appeared in the top-K retrieval. Rank asks where in the top-K it landed. Faithfulness asks whether the model's answer matches the source documents, measured by an LLM-as-judge over a sample. Citation accuracy asks whether the citations point to chunks that actually support each claim. Aggregate weekly during the tuning phase and monthly in production steady state. The trendline matters more than any single number — retrieval quality should improve as the harness drives the tuning, and degradations should trigger alerts.
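The two retrieval metrics are cheap to compute once the golden set exists. The sketch below reports hit rate at K and mean reciprocal rank, which is one common way to summarise "where in the top-K it landed"; the text does not prescribe a specific rank statistic, so that choice is an assumption. The document IDs are hypothetical.

```python
def hit_rate_and_mrr(results: list, k: int = 5):
    """results holds (expected_doc_id, retrieved_doc_ids) pairs, one per golden query."""
    if not results:
        return 0.0, 0.0
    hits = 0
    reciprocal_rank_sum = 0.0
    for expected_doc, retrieved in results:
        top_k = retrieved[:k]
        if expected_doc in top_k:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top_k.index(expected_doc) + 1)
    return hits / len(results), reciprocal_rank_sum / len(results)


# Hypothetical harness run over four golden queries.
run = [
    ("a", ["a", "b", "c"]),            # hit at rank 1
    ("b", ["x", "b", "y"]),            # hit at rank 2
    ("c", ["x", "y", "z"]),            # miss
    ("d", ["p", "q", "r", "s", "d"]),  # hit at rank 5
]
hit_rate, mrr = hit_rate_and_mrr(run, k=5)
print(hit_rate, round(mrr, 3))
```

Faithfulness and citation accuracy need a judge model or human review, so they sit outside this sketch; these two numbers are the ones to log on every tuning run.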
Without this harness, every RAG decision — chunking, embedding model, vector store, weights, top-K, re-ranking on or off — is guesswork. With it, decisions are evidence-driven and the model behind the agent is doing the job it is good at.
The Five Production Failures You Will Meet
Five patterns recur often enough to call out.
The "looks right but isn't" answer. Retrieval brought back chunks that are topically adjacent but factually wrong for the specific question. The answer reads fluently because Claude is fluent; the answer is wrong because the chunk was. The fix is re-ranking on, citations required, faithfulness eval in the harness.
The empty corpus on a fresh query. A new product or new policy lands but the embeddings have not been refreshed. The agent confidently retrieves the old policy and answers from it. The fix is scheduled re-ingestion plus change-detection sync from the source system, with freshness metadata as a default filter.
The cross-tenant leak. Customer A's data appears in Customer B's retrieved chunks because the metadata filter was not applied. The fix is classification metadata as a mandatory filter at retrieval time, enforced at the API layer rather than the prompt layer.
The long-document fragmentation. A 50-page contract chunked at fixed 300-token boundaries; retrieval brings back a chunk from page 17 with no context. The fix is hierarchical chunking with parent-chunk return.
The PDF-OCR garbage. A scanned PDF's text is OCR'd badly, the embeddings are nonsense, retrieval returns nothing relevant. The fix is pre-processing PDFs through Textract or Unstructured.io before ingestion, with metadata-flagging scanned documents for manual review.
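For the cross-tenant leak in particular, "enforced at the API layer" can be sketched as a wrapper that always injects the mandatory filter before any caller-supplied one. The filter shape follows the Bedrock retrieval filter operators (`equals`, `andAll`) as I understand them; the function name is hypothetical, and the allowed classification is assumed to come from the authenticated session, never from the prompt.

```python
from typing import Optional


def with_mandatory_filter(allowed_classification: str, caller_filter: Optional[dict] = None) -> dict:
    """Return a filter that always constrains classification, whatever the caller asked for."""
    mandatory = {"equals": {"key": "classification", "value": allowed_classification}}
    if caller_filter is None:
        return mandatory
    # AND the mandatory clause with the caller's filter so it cannot be bypassed.
    return {"andAll": [mandatory, caller_filter]}


# Hypothetical caller filter narrowing results to English documents.
language_only = {"equals": {"key": "language", "value": "en"}}
print(with_mandatory_filter("public"))
print(with_mandatory_filter("public", language_only))
```

Routing every retrieval call through a wrapper like this means a forgotten filter in application code degrades to "too restrictive" rather than "leaks data".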
A Working Knowledge Base Configuration
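The shape of a defensible production KB, written as illustrative YAML rather than literal Terraform or CDK syntax (the key names are shorthand, not a provider schema):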
knowledgeBase:
  name: company-policy-kb
  embeddingModel: cohere.embed-english-v3
  vectorStore:
    type: opensearch-serverless
    collectionName: policy-vectors
    encryptionPolicy: aws/kms-cmk-policy-vectors
    networkPolicy: aws/private-vpc-only
  dataSource:
    s3Configuration:
      bucketArn: arn:aws:s3:::company-policy-source-prod
      inclusionPrefixes: ["policies/", "faqs/"]
    chunkingConfiguration:
      strategy: HIERARCHICAL
      hierarchical:
        levelConfigurations:
          - level: 1
            maxTokens: 1500
          - level: 2
            maxTokens: 300
    ingestionSchedule:
      type: scheduled
      cron: "cron(0 2 * * ? *)"  # 02:00 UTC daily
  retrieval:
    hybridSearch: true
    bm25Weight: 0.4
    vectorWeight: 0.6
    rerankingModel: cohere.rerank-v3
    defaultTopK: 5
Cohere Embed v3 for embeddings (production-quality English) with hierarchical chunking (right for policy text). OpenSearch Serverless with KMS CMK and private-VPC-only network policy. Hybrid search with re-ranking on, top-K = 5 after re-rank. Daily ingestion schedule for freshness, manual re-ingestion on demand for change events. This KB wires into a Bedrock Agent (per Part 1) by referenced ID, and the agent gets the retrieval capability without separate retrieval code. The KB can be updated independently of the agent.
The pipeline is not the model. The pipeline is the model's hands and eyes — and if the hands grab the wrong document, the model cannot save you. What does your harness say about the last hundred queries?
FAQs
Why not use Claude for embeddings?
Claude does not provide an embeddings API on Bedrock. Even if it did, the economics would not work — the cost ratio between embedding-grade models (Cohere Embed v3, Titan v2) and reasoning models (Claude) is roughly 20-50× per token. The separation is what makes RAG cost-defensible. The embedding step is cheap and fast; the reasoning step is expensive and selective.
When should I choose hierarchical over fixed-size chunking?
Hierarchical chunking is right for documents with deep structure where context disambiguates meaning — legal contracts, regulatory text, long-form documentation. Small chunks drive retrieval precision; the larger parent chunks return to the model with surrounding context. Fixed-size 300/60 is the defensible default for homogeneous corpora. Semantic chunking is worth the ingestion cost when fixed-size obviously fragments meaning.
Is re-ranking always worth the cost?
For most production RAG, yes — re-ranking improves precision 10-30% on hard queries at the cost of roughly 25-100 embedding calls per query. It is not worth it when retrieval is already near-perfect (rare) or when latency budget makes the extra call infeasible (real-time conversational UX with sub-second targets).
What goes in a metadata sidecar?
A {filename}.metadata.json file alongside each document declares attributes the Knowledge Base ingestion respects as filterable retrieval metadata: source, published_date, classification, language, owner_team. The three filters that nearly always matter: classification (public/internal/confidential, drives per-tenant access at retrieval time), language (cuts multilingual corpora to the user's language), and freshness (last-modified date for time-sensitive queries).
What are the production failure modes worth knowing in advance?
Five we see often. "Looks right but isn't" answers when retrieval returns topically adjacent but factually wrong chunks (fix: re-ranking, citations, faithfulness eval). Stale answers when embeddings aren't refreshed after a policy update (fix: scheduled re-ingestion + freshness filter). Cross-tenant leaks when the classification filter isn't applied (fix: mandatory filter at the API layer). Long-document fragmentation (fix: hierarchical chunking). PDF-OCR garbage (fix: pre-process through Textract before ingestion).
What's next
Part 2 documented the retrieval layer. Part 3 picks up the orchestration layer: when Bedrock Agents' managed loop is the right answer, when AgentCore + Strands is, and when LangChain or LlamaIndex earns its complexity. Each path has its own RAG integration pattern; the KB built here works against all of them.
The full series:
- Part 1 — Foundations: Building AI Agents on Amazon Bedrock
- Part 2 — RAG with Bedrock Knowledge Bases (this piece)
- Part 3 — Open-source Agent Frameworks on Bedrock
- Part 4 — Model Customization on Amazon Bedrock
- Part 5 — Multi-step AI Workflows with Step Functions and Bedrock
- Part 6 — Security Guardrails and Observability for Bedrock
- Part 7 — Cost Optimization on Bedrock (deepest treatment of multi-model routing)
- Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage
The Amazon Bedrock series accompanies the Hardening-before-AWS series and the AWS-for-banks architecture series. Both substrates assume the security and identity foundations are in place; this series builds the AI workload on top.
