Production AI on Amazon Bedrock.
Architected for regulated enterprises, not pilots.
We architect, build, and operate production AI systems on Amazon Bedrock for clients who cannot ship pilots that leak. Claude is the default reasoning model. The work covers private-network access, customer-managed KMS, fine-grained IAM, guardrails, evaluation harnesses, and the cost discipline that keeps the bill for a multi-model architecture from running away.
Bedrock is the easy part. Production is the work.
Amazon Bedrock removed the model-serving problem. You no longer choose between hosting a transformer yourself, calling a third-party API outside your VPC boundary, or accepting whatever the cloud's own model can do. Bedrock gives you Claude (Anthropic), Llama (Meta), Mistral, Cohere, Titan, and several others through a single API, inside your AWS account, with PrivateLink and customer-managed KMS available as first-class controls. For regulated enterprises this changes the architecture conversation.
What Bedrock does not solve is the production stack around the model: identity and entitlement enforcement at the call site, prompt-injection defence at the input boundary, RAG retrieval that survives adversarial documents, evaluation harnesses that catch quality regressions before customers do, observability that lets you trace a bad answer back to a prompt and a context window, and the cost discipline that prevents an Opus-only architecture from handing you a six-figure surprise.
The pattern we deploy is Claude-first by default for reasoning: Sonnet for most production traffic, Haiku for cheap classification and routing, and Opus for the few classes of task where the quality gap is worth the cost. Prompt caching is configured aggressively, and a cascade pattern routes requests by intent. Embeddings run on Cohere or Titan because that is where we find the best price-to-quality trade-off. Llama and Mistral pick up bulk batch work where cost dominates. This is the architecture we recommend.
From pilot anxiety to production discipline.
- 01
Network and key isolation as the foundation.
Bedrock invocation routes through VPC interface endpoints (PrivateLink) so call traffic never leaves the AWS backbone. Customer-managed KMS keys encrypt model invocation logs, knowledge base contents, and any custom-model artefacts. IAM policies restrict invocation to specific roles, models, and regions. This is the substrate; we set it before we write a prompt.
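As a rough illustration of the IAM side of this substrate, the sketch below builds a policy document that allows invocation of an approved model list and denies Bedrock calls that do not arrive through a named VPC endpoint. The function name, the example model ARN, and the endpoint ID are placeholders we made up for illustration. A blanket deny like this also blocks console access, so treat it as a starting point under those assumptions, not a finished policy.

```python
import json


def bedrock_invoke_policy(model_arns, vpc_endpoint_id):
    """Build an IAM policy document (as a dict) for least-privilege Bedrock invocation.

    model_arns: foundation-model ARNs the role may invoke; the region in each
                ARN is what scopes the permission to a region.
    vpc_endpoint_id: the interface endpoint all calls must come through.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": list(model_arns),
            },
            {
                # Deny any Bedrock call that did not traverse the endpoint.
                "Sid": "DenyOutsideVpcEndpoint",
                "Effect": "Deny",
                "Action": "bedrock:*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute the ARNs and endpoint from your own account.
    policy = bedrock_invoke_policy(
        ["arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"],
        "vpce-0123456789abcdef0",
    )
    print(json.dumps(policy, indent=2))
```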
- 02
Multi-model routing with Claude as the default.
Claude is the default reasoning model. We layer a cascade pattern — Haiku handles classification and intent routing, Sonnet executes most reasoning, Opus is reserved for the narrow class of tasks where the quality gap justifies the cost. Cohere or Titan handle embeddings; Llama and Mistral pick up bulk batch jobs. The cascade is the cost lever.
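A minimal sketch of the cascade idea, assuming intents have already been labelled by the classification step. The intent names, the tier labels, and the `route_request` helper are hypothetical. In a real system the tiers would map to whichever Bedrock model IDs are enabled in your account.

```python
from dataclasses import dataclass

# Hypothetical intent-to-tier table; tune it against your own traffic.
TIER_BY_INTENT = {
    "classify": "haiku",
    "route": "haiku",
    "summarise": "sonnet",
    "answer": "sonnet",
    "multi_step_analysis": "opus",
}
DEFAULT_TIER = "sonnet"  # unknown intents fall back to the default reasoning tier


@dataclass(frozen=True)
class Route:
    intent: str
    tier: str


def route_request(intent: str, escalate: bool = False) -> Route:
    """Pick a model tier for an intent.

    escalate=True bumps a Sonnet-tier request to Opus, for example after a
    failed quality check; Haiku-tier and Opus-tier requests are left alone.
    """
    tier = TIER_BY_INTENT.get(intent, DEFAULT_TIER)
    if escalate and tier == "sonnet":
        tier = "opus"
    return Route(intent=intent, tier=tier)


if __name__ == "__main__":
    for intent in ("classify", "answer", "something_new"):
        print(route_request(intent))
```

Keeping the table as plain data makes the routing decision auditable and lets the cost impact of a change be estimated before it ships.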
- 03
Prompt caching and cost engineering.
Prompt caching on Claude through Bedrock can cut the input cost of repeated context by up to an order of magnitude in well-architected systems, because cached tokens are billed at a fraction of the standard input rate. We design caching against your real traffic profile (system prompts, tool definitions, RAG context) and instrument the cache hit rate as a first-class metric. Cost discipline is an engineering discipline, not a procurement one.
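To make the arithmetic concrete, here is a back-of-envelope sketch of expected input cost per request as a function of cache hit rate. The read and write multipliers (0.1x and 1.25x of the standard input price) are assumptions for illustration, as is the function name. Check the current Bedrock pricing for the model you use before relying on any figure.

```python
def blended_input_cost(
    tokens_cacheable: int,
    tokens_fresh: int,
    hit_rate: float,
    price_per_1k: float,
    read_multiplier: float = 0.1,    # assumed price of a cache read vs. standard input
    write_multiplier: float = 1.25,  # assumed price of a cache write vs. standard input
) -> float:
    """Expected input cost of one request.

    tokens_cacheable: tokens in the repeated prefix (system prompt, tools, shared context).
    tokens_fresh:     tokens that change on every request and are never cached.
    hit_rate:         fraction of requests whose prefix is served from cache.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # A hit pays the read price; a miss pays the write price to repopulate the cache.
    unit = hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier
    cacheable_cost = tokens_cacheable / 1000 * price_per_1k * unit
    fresh_cost = tokens_fresh / 1000 * price_per_1k
    return cacheable_cost + fresh_cost


if __name__ == "__main__":
    for rate in (0.0, 0.6, 0.8, 1.0):
        print(rate, blended_input_cost(10_000, 500, rate, price_per_1k=1.0))
```

Under these assumed multipliers, an 80% hit rate brings the cacheable portion down to roughly a third of its uncached price, and a 0% hit rate costs more than not caching at all. This is why the hit rate is worth instrumenting.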
- 04
Guardrails, evaluation, and observability.
Bedrock Guardrails handle the obvious — PII redaction, denied topics, prompt-injection patterns. The real work is the custom evaluation harness — golden test sets, LLM-as-judge configurations, and the regression pipeline that runs before every prompt change. Observability traces the bad answer back to the prompt, the retrieval, and the model version.
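The regression step can be pictured as a gate over a golden test set. The sketch below is deliberately simple: it scores a candidate by substring match against expected answers and blocks the change if the pass rate falls more than a tolerance below the baseline. The names `pass_rate` and `regression_gate`, the scoring rule, and the tolerance are all illustrative assumptions. A production harness would typically swap in an LLM-as-judge scorer.

```python
from typing import Callable, List, Tuple

GoldenCase = Tuple[str, str]  # (prompt, substring the answer must contain)


def pass_rate(generate: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Fraction of golden cases whose answer contains the expected substring."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in generate(prompt).lower()
    )
    return passed / len(cases)


def regression_gate(
    candidate: Callable[[str], str],
    baseline_rate: float,
    cases: List[GoldenCase],
    tolerance: float = 0.02,
) -> bool:
    """Return True if the candidate may ship, i.e. it has not regressed beyond tolerance."""
    return pass_rate(candidate, cases) >= baseline_rate - tolerance


if __name__ == "__main__":
    golden = [("What is the capital of France?", "Paris"), ("What is 2 + 2?", "4")]

    def stub_model(prompt: str) -> str:
        # Stand-in for a real model call behind the prompt under test.
        return "Paris is the capital." if "France" in prompt else "I am not sure."

    print(pass_rate(stub_model, golden), regression_gate(stub_model, 1.0, golden))
```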
- 05
RAG with Bedrock Knowledge Bases — or without.
Knowledge Bases simplify the retrieval pipeline when the embedding model and the vector store fit. Where they do not, we deploy OpenSearch or pgvector with custom chunking, hybrid retrieval, and a reranker. The choice depends on document characteristics, latency budget, and the adversarial profile of the corpus.
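One common way to combine keyword and vector results in a hybrid retriever is reciprocal rank fusion. The sketch below is a generic version of that technique, not a description of any specific Bedrock or OpenSearch feature. The function name and the smoothing constant `k=60` are conventional choices we assume here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in (rank starts at 1);
    the scores are summed and documents are returned best first. Ties break by ID.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]   # e.g. from a BM25 query
    vector_hits = ["b", "c", "a"]    # e.g. from a k-NN query
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

The fused list would then go to the reranker, which only has to order a short candidate set.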
The patterns we deploy in production.
The reference architecture sits on three layers. Identity and network — IAM Identity Center for human access, IAM roles for service access, VPC interface endpoints for Bedrock invocation, customer-managed KMS for encryption, CloudTrail data events for invocation auditing. Model and orchestration — Bedrock with multi-model routing, Step Functions for multi-step workflows, EventBridge for event-driven invocation, Lambda or Fargate for the application layer. Data and evaluation — Knowledge Bases or self-managed retrieval, S3 with lifecycle policies for artefact storage, evaluation harness on Step Functions running pre-deploy and on schedule.
Cost discipline runs through every layer. Cascade routing on the model layer. Prompt caching on Claude. Lifecycle policies on S3. Reserved capacity (Bedrock Provisioned Throughput) where traffic is predictable. We instrument cost per request, per user, and per use case, and we surface the metric in the same dashboard as latency and quality.
- VPC interface endpoints (PrivateLink) for Bedrock invocation
- Customer-managed KMS keys for invocation logs and KB content
- IAM least-privilege scoped to model, region, and action
- Claude-first cascade — Haiku, Sonnet, Opus by intent
- Prompt caching as a first-class cost lever
- Bedrock Guardrails plus custom evaluation harness
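Cost per request, per user, and per use case can be derived from the token counts each invocation already returns. The sketch below shows one way to roll those up. The `Invocation` record, the `cost_by` helper, and the prices are hypothetical, and a real pipeline would use per-model prices and emit the totals to your metrics backend.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Invocation:
    use_case: str
    user: str
    input_tokens: int
    output_tokens: int


def request_cost(inv: Invocation, in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Cost of a single invocation from its token counts."""
    return (inv.input_tokens / 1000 * in_price_per_1k
            + inv.output_tokens / 1000 * out_price_per_1k)


def cost_by(
    records: Iterable[Invocation],
    dimension: str,               # "use_case" or "user"
    in_price_per_1k: float,
    out_price_per_1k: float,
) -> Dict[str, float]:
    """Sum invocation cost grouped by the chosen dimension."""
    totals: Dict[str, float] = defaultdict(float)
    for inv in records:
        totals[getattr(inv, dimension)] += request_cost(inv, in_price_per_1k, out_price_per_1k)
    return dict(totals)


if __name__ == "__main__":
    log = [
        Invocation("search", "u1", 1000, 500),
        Invocation("search", "u2", 2000, 0),
        Invocation("triage", "u1", 0, 1000),
    ]
    print(cost_by(log, "use_case", in_price_per_1k=1.0, out_price_per_1k=2.0))
    print(cost_by(log, "user", in_price_per_1k=1.0, out_price_per_1k=2.0))
```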
The end state we drive toward.
Bedrock invocation inside your VPC, models routed by intent with Claude as the default, prompt caching cutting repeat-context cost, an evaluation harness that catches regressions before deploy, and observability that traces every answer back to the prompt and the retrieval.
- 100%: Invocation through PrivateLink
- 60–80%: Prompt cache hit rate, well-tuned
- <2s p95: Production answer latency
- Pre-deploy: Evaluation harness on every change
Illustrative, drawn from published architectures and forthcoming engagements. Specific metrics are conditioned on traffic shape, use case, and the maturity of the existing AWS landing zone.
Scoped Bedrock implementation assessment.
Send us the use case, the AWS account structure, and the data classification involved. We come back with a fixed-scope implementation proposal, a reference architecture diagram, and a sample evaluation harness within ten working days.