Model Customization on Amazon Bedrock: When Prompt Engineering Stops Being Enough

Series · Amazon Bedrock for Production AI · Part 4 of 8 ← Part 3: Open-source Agent Frameworks · Model Customization · Part 5: Step Functions Orchestration →

The first message arrives on a Wednesday afternoon. The agent that classifies expense receipts is misreading a particular vendor's format. By Thursday morning the engineering lead has booked a sprint to fine-tune the underlying model. It is the wrong sprint. The receipts in question share a single layout quirk that a six-line prompt change would have caught — but the team has reached, by reflex, for the heaviest tool in the drawer.

We see this every quarter. The reflex is to assume the model needs surgery. The reality is that prompts, retrieval, and tool descriptions absorb most of what looks like a model problem, and customisation enters honestly only when those three have been exhausted.

Key takeaways

Most "the model is bad" diagnoses are actually prompt, RAG, or tool-design problems — try those three first before customizing anything.
Claude is not customizable on Bedrock. The customization paths apply to Llama, Titan, Cohere Command, Mistral, and bring-your-own via Custom Model Import.
The four paths are Continued Pre-Training (unlabeled domain corpus), Supervised Fine-Tuning (input/output pairs), Distillation (Claude teacher → smaller student), and Custom Model Import (your model behind the Bedrock API).
The right architectural conclusion is rarely "customize one model" — it's the combined-models pattern: pinned Claude for hard reasoning, custom-tuned smaller model for the narrow recurring task, with a Haiku-class router deciding which gets the call.
A held-out evaluation harness with quality + latency + cost metrics is non-optional. Without it, every customization decision is faith-based; with it, it's engineering.

The Sprint That Should Not Have Been Booked

Three things almost always come before customisation, and almost always solve the problem on their own.

Better prompts come first. Tighten the system prompt. Add few-shot examples. Restructure the task into smaller steps the model can take in sequence. Where reasoning depth matters, switch on Claude's extended-thinking mode and let the model spend more compute before it answers. Prompt iteration is days of work; fine-tuning is weeks of work. The ratio matters.

Better RAG comes second. Re-evaluate chunking. Try a different embedding model. Consider whether the vector store is the right one. Re-weight hybrid search. Add a re-ranker. Most of what the retrieval piece catalogues as failure modes shows up first as "the model is bad" and resolves on the retrieval side.

Better tool design comes third. Tools are the agent's hands. If the OpenAPI descriptions are vague, the agent calls the wrong tool — and the model gets blamed for the choice. Narrower tools with sharper schemas almost always lift quality more than fine-tuning would.

After those three, the customisation conversation is honest. The question is no longer "do we customise" but "which path, on which model, for which task" — and the answer is often still none of them.

Anthropic Holds The Door Shut

Worth stating early because it shapes everything that follows. Anthropic does not expose fine-tuning of Claude on Bedrock. Claude is available for inference. It is the production-default reasoning model recommended in Part 1. It is not a customisation target.

The Bedrock customisation menu, then, is everything except Claude. Llama supports supervised fine-tuning and continued pre-training through Meta's path. Amazon Titan adds distillation to that menu. Cohere Command and Mistral both accept supervised fine-tuning. And Custom Model Import accepts any compatible architecture you bring yourself — Llama, Mistral, Falcon, Mixtral variants.

This is not a bug. It is a constraint that shapes the architecture this piece builds toward. Claude for the hard reasoning step. A customised smaller model for the narrow recurring task that does not need Claude's depth. Each does what it is good at, and the production deployment is the composition of both.

Four Doors, Each Opening Onto Different Terrain

Bedrock model customization decision tree — no labelled data routes to Continued Pre-Training (CPT, unlabeled); hundreds-to-thousands of input/output pairs for a specific task routes to Supervised Fine-Tuning; Claude-quality on a narrow task at smaller-model cost routes to Distillation (Claude teacher to Llama or Titan student); an external model outside the Bedrock catalogue routes to Custom Model Import; each path includes an evaluation harness and a rollback gate. — Figure 1 — Most teams reach for fine-tuning two phases too early — walk the tree first.

Door one — Continued Pre-Training

Continued pre-training extends a foundation model's training on a large corpus of unlabelled domain text. Think of it as moving the model to a new country for a year. It picks up the dialect, the conventions, the rhythm of how the locals talk — without being trained for any specific job once it arrives.

It is right when you have a large domain corpus, no labelled task data, and the foundation model sounds like a tourist. It is wrong when you have specific task-completion problems, because immersion teaches the dialect but not the work. Runs land in the low-thousands of US dollars for Titan-class models and climb with size.

In production, it is rare. Most teams that think they need it actually need better RAG — the corpus belongs in a Knowledge Base, not baked into weights. It earns its place only in closed domains where retrieval cannot reach: deeply specialised scientific or medical workflows, niche engineering vocabularies where the language itself is the asset.

Door two — Supervised Fine-Tuning

Supervised fine-tuning trains a model on input/output pairs to produce specific responses. If continued pre-training is immersion, this is apprenticeship — the model watches a thousand examples of exactly the work it will be asked to do, then does it.

Use it when you have hundreds to thousands of high-quality input/output pairs, the task is narrow and well-defined, and prompt engineering has hit a ceiling. Do not use it for broad jobs like general reasoning or multi-step planning across many domains. Apprenticeship makes a specialist, not a generalist.

The training is hours to days. The expensive part is rarely the training. It is the dataset. A thousand clean pairs for a narrow task is harder than it sounds, and is where most SFT projects actually spend their money.

The production sweet spots are familiar: document classification, structured extraction (resume parsing, invoice line items, log-format normalisation), style transfer, narrow generation in a specific format. The input format on Bedrock is JSONL:

{"prompt": "Extract invoice fields from: ACME LTD #INV-2026-0042 ...", "completion": "{\"vendor\": \"ACME LTD\", \"invoice_id\": \"INV-2026-0042\", ...}"}
{"prompt": "Extract invoice fields from: Beta Co. Invoice 2026-0091 ...", "completion": "{\"vendor\": \"Beta Co.\", \"invoice_id\": \"2026-0091\", ...}"}

Train against Llama 4 Maverick or Titan Text Premier. Deploy the resulting custom model via on-demand or provisioned throughput. Invoke through the same Bedrock API as any foundation model.

Door three — Distillation

Distillation trains a smaller student model to mimic the behaviour of a larger teacher model on a specific task distribution. The teacher is Claude. The student is something like Llama 4 Scout or Titan Text Lite. The student emerges able to approach the teacher's quality on that one task while running at the student's price point.

The mechanics: run Claude over the task corpus to generate high-quality input/output pairs, then use those pairs as the training set for SFT on the smaller model. You are paying Claude rates once, to produce a teaching set, so that a cheaper model can carry the work afterwards. The savings show up at deployment — every production call goes to the student, never the teacher.

This is the most operationally impactful customisation path for cmdev work in 2026. It is the rare case where customisation pays for itself in months rather than years. Per-message classification across millions of daily messages. Real-time content tagging. Latency-critical extraction where Claude's latency budget is the bottleneck. These are the workloads where distillation earns its keep.

Door four — Custom Model Import

Custom Model Import takes a model you have trained or obtained elsewhere — a fine-tuned Llama, a custom Mistral variant, your own pre-trained model — and makes it callable through the standard Bedrock API. It is the diplomatic passport: the model itself is foreign, but it travels under Bedrock's credentials once it arrives.

Use it when you have a model outside the Bedrock catalogue and you want it consumed through the same API surface as the native ones. The arguments are compliance, billing consolidation, and operational simplicity. Putting every model call behind one API is worth a lot to a security architect.

The constraint is architectural compatibility. Llama, Mistral, Falcon, Mixtral variants and their derivatives are in. Novel transformer architectures are not, yet. Costs run on Custom Model Units rather than on-demand pricing, which suits steady-state workloads where the reservation is amortised by sustained usage.

In practice, this is what teams reach for when they have trained domain-specific Llama variants externally — on HuggingFace, on SageMaker training jobs, on-prem — and want them in the Bedrock invocation path for compliance consistency.

The Combined-Models Pattern

The architectural conclusion of all of the above is rarely "customise one model." It is to combine customised and frontier models, each carrying the load it is good at.

Combined-models topology — incoming query enters a small Haiku-class router that decides task type; hard reasoning, complex synthesis, and multi-step planning route to pinned Claude Sonnet or Opus; narrow recurring tasks (invoice extraction, content tagging, log classification) route to a custom-tuned Llama or Titan via Bedrock Custom Model; both invocations pass through Guardrails and emit model-invocation logs to the same S3 audit bucket; a cost-tag per model identifies per-tier spend. — Figure 2 — Pin Claude for the hard reasoning step; custom-tune a smaller model for the narrow recurring task.

Think of it as a hospital. Claude is the consultant who sees the hard cases — complex synthesis across retrieved context, novel multi-step planning, ambiguous tool selection, anything that needs depth. Pin a specific Claude version and use it without apology.

The custom-tuned smaller model is the specialist clinic. The invoice extractor that runs ten thousand times a day does not need a consultant. It needs to do exactly that one task, reliably, at a fraction of the per-call cost. A distilled or SFT'd Llama or Titan is the right tool for that work.

The router is triage. Often Claude Haiku — cheap, fast, good enough at classification — reads the incoming query and decides which clinic it belongs in. The routing decision itself costs roughly $0.001 per query. The savings on every query routed to the cheap model pay for that decision tens of thousands of times over per day.

Both paths share the same observability layer. Model-invocation logs land in the same S3 bucket. Guardrails wrap both. Cost tags identify both. The operational layer is unified even when the model layer is split. A working Strands pattern (per Part 3):

from strands import Agent, tool

@tool
def extract_invoice(text: str) -> dict:
    """Extract structured fields from an invoice."""
    custom_extractor = Agent(
        model="arn:aws:bedrock:us-east-1:123:custom-model/invoice-extractor-v3",
        instruction="Extract: vendor, invoice_id, date, line_items, total. Return JSON.",
    )
    return parse_json(custom_extractor(text).output)

reasoning_agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",  # Claude for reasoning
    tools=[extract_invoice, ...],
    instruction="You are a finance ops agent. Analyse invoices and answer questions.",
)

reasoning_agent("What's our total spend with ACME this quarter, with anomalies highlighted?")

The reasoning agent runs Claude. The extraction tool runs a custom-tuned Bedrock model. Both invocations are observable, cost-attributable, and Guardrails-wrapped.

The Harness That Tells The Truth

You cannot ship a customised model without a regression-testing harness. Customisation without evaluation is theology. Customisation with evaluation is engineering.

The harness has a held-out test set — 100 to 500 examples never used in training — and scores the model on quality, latency, and cost per example. Quality looks different per task type. Classification tasks need accuracy, precision and recall per class, a confusion matrix. Structured extraction needs field-level F1 and schema-compliance rate. Generation tasks need BLEU or ROUGE alongside LLM-as-judge scoring against a rubric. Claude Sonnet makes the right judge model — using a frontier model to grade outputs against criteria is honest because the judge has the capability to discriminate.

The baseline is the foundation model without customisation. The customised model has to beat that baseline on the quality metric and on the cost-or-latency dimension that motivated the customisation in the first place. If it does not beat both, ship the foundation model and call the customisation what it was: an experiment.

Drift detection sits on top. Production performance degrades as input distributions shift — new vendors in the invoice extractor, new log formats in the classifier, new languages in the content tagger. Run the harness against the production model periodically. The drift signal is the trigger to retrain.

What Bites In Production

Five things bite teams who ship customised models without thinking about the operating model.

Training data IP is the first. The corpus has to be data you have the right to train on. Customer-derived data needs explicit consent or contractual basis. Public-web data has uncertain legal status in some jurisdictions. The cleaner the provenance, the cleaner the deployment.

PII in training data is the second. Customised models can memorise specific examples and regurgitate them under certain prompts. PII in the training set is therefore PII at inference time, which is an NDPA, GDPR, and NIS2 problem at once. Redact or synthesise before training. NDPA Section 39 alone is enough to make this a board-level concern in the Nigerian context.

Evaluation cost is the third. A thousand-example harness run against three candidate models is not free. Budget the LLM-as-judge bill.

Deployment cost structure is the fourth. On-demand custom-model pricing assumes burst usage. Provisioned throughput is cheaper per-call for sustained workloads but commits to capacity. Calculate the break-even against actual traffic before you choose.

Versioning is the fifth. Treat custom-model deployment the way you treat database migrations: forward-only is a trap. Always keep two versions deployable so a regression can be rolled back without retraining.

When To Keep Walking Past The Customisation Door

The most underused move in 2026 production AI is to not customise the model at all. The signals are usually visible if you look.

The task quality is improving with each prompt iteration. Keep iterating. The RAG harness is improving with each retrieval tweak. Keep tuning. The variance is dominated by edge cases that look idiosyncratic rather than systematic — more training data will not fix idiosyncrasy. The volume does not justify the customisation cost; a task with a thousand daily invocations on Claude Sonnet costs about what a junior analyst's morning coffee bill is. And the model catalogue moves faster than you do — by the time a customised model ships, the next-generation foundation model may match it without the customisation overhead.

Customisation is a real lever. It is also one that many teams pull when the actual problem is upstream. The combined-models pattern works because each model does what it is good at, and the discipline starts with being honest about what the foundation model is already good at — and whether you have actually exhausted that envelope.

If you cannot say, with evidence, that prompts, retrieval, and tool design have been exhausted, what is the customisation actually for?

FAQs

Can I fine-tune Claude on Bedrock?

No. Anthropic does not expose fine-tuning of Claude on Bedrock. Claude is available for inference and is the production-default reasoning model, but customization paths apply only to Llama, Titan, Cohere Command, Mistral, and bring-your-own architectures via Custom Model Import.

When does distillation actually pay back?

When Claude Sonnet or Opus is the validated quality bar, the task is high-volume enough that per-call cost matters, and you can spend Claude rates once to generate training data that powers a cheaper student model in production. Per-message classification across millions of daily messages and latency-critical extraction are the production sweet spots.

How much labelled data do I need for supervised fine-tuning?

Hundreds to thousands of high-quality input/output pairs. The dataset is often the bigger cost than the training run itself — getting 1,000 clean pairs for a narrow task is non-trivial. SFT excels at narrow specialisation, not at making a model more capable in general.

Why not just bake the domain corpus into the model with Continued Pre-Training?

Most of the time, the corpus belongs in a Knowledge Base, not in the weights. CPT is the right call only for closed domains where retrieval-augmented patterns don't fit — specialised scientific, medical, or deeply technical engineering workflows. For everything else, better RAG beats CPT at a fraction of the operational cost.

What does the evaluation harness actually measure?

A held-out test set of 100–500 examples never used in training, scored on quality (accuracy, F1, BLEU/ROUGE, or LLM-as-judge against a rubric), latency, and cost per example. The customised model has to beat the foundation-model baseline on the quality metric and on the cost-or-latency dimension that motivated the customization. If it doesn't, ship the foundation model.

What's next

Part 5 picks up the orchestration layer one level higher: when a single Bedrock Agent isn't the right shape and the work needs to span multiple model invocations, multiple AWS services, and conditional branches — the AWS Step Functions story. The combined-models pattern from this piece composes naturally into Step Functions workflows; Part 5 unpacks how.

The full series:

Part 1 — Foundations: Building AI Agents on Amazon Bedrock
Part 2 — RAG with Bedrock Knowledge Bases
Part 3 — Open-source Agent Frameworks on Bedrock
Part 4 — Model Customization on Amazon Bedrock (this piece)
Part 5 — Multi-step AI Workflows with Step Functions and Bedrock
Part 6 — Security Guardrails and Observability for Bedrock
Part 7 — Cost Optimization on Bedrock (deepest multi-model routing)
Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage

The Amazon Bedrock series. Customised models meet production-grade observability in Part 6 and cost-tier routing in Part 7.