Hardened Containers for AI Workloads: The Tier Decision Most Teams Get Wrong

The platform lead lands on the question late on a Thursday evening, six weeks before the launch date. The agent runs untrusted code — code the user prompted into existence three seconds ago, code the model itself decided to write, code nobody on the team has read. The sandbox underneath the agent has to keep the rest of the platform safe from whatever it just ran. He has a whiteboard with four boxes on it. runc. gVisor. Firecracker. Kata. The vendor documentation he has read all week reduces to two answers — use ours, useful only if the choice has already been made, and it depends, true but unhelpful when the platform is going to start taking traffic in six weeks.

This piece is the workload-axis answer he could not find. Four production tiers, honestly compared. Three axes that determine the right tier per workload. Two questions that change the decision. And the hybrid pattern most regulated AI platforms actually ship, because no single tier wins across every workload, and the operational reality is that production platforms run two or three tiers, routed by workload class.

The companion piece, "Agent code execution: microVM vs container", goes deeper on the microVM tier specifically. This piece is the broader tier-selection framework.

Key takeaways

The four production tiers are standard containers (runc), hardened containers (Wolfi, Chainguard, distroless), userspace sandboxes (gVisor), and microVMs (Firecracker, Kata) — each has a defensible workload class and none wins across the board.
Three axes decide the tier: trust boundary of the code (internal-trusted vs user-supplied), workload latency and cost profile (50ms tool call vs 30-second code execution), and regulatory/audit requirements (FedRAMP High, EU AI Act, financial services often force VM-grade isolation).
The tier numbers from production: cold-start ranges from 5-20ms (runc) to 125-300ms (Firecracker) to 1-3s (Kata); microVMs cost 50-128MB memory minimum versus ~5MB for containers; gVisor adds 2-5x syscall latency overhead.
Most regulated AI platforms ship a hybrid — hardened containers for the inference and orchestration runtime, gVisor for tool-call execution where appropriate, microVMs for code execution sandboxes and cross-tenant isolation.
The hardened-image tier is the most under-deployed of the four — near-zero marginal cost over standard containers, meaningful security improvement, negligible latency impact — and most teams skip it because it doesn't feel like a Big Architectural Decision.

Figure 1: The four tiers, the numbers from production, and the hybrid routing pattern. Four-tier container isolation matrix covering standard containers (runc), hardened containers (Wolfi, Chainguard), userspace sandboxes (gVisor), and microVMs (Firecracker, Kata), with cold-start, memory, density, and best-fit workload per tier, plus the hybrid pattern most regulated AI platforms ship.

Four Rooms, Four Doors

Think of the four tiers as four rooms in the same building, each with a different kind of door.

The first room has a thin partition wall. Standard containers running on runc and containerd. Linux namespaces and cgroups separate the workloads. This is the lowest-overhead tier, and it offers no security boundary worth defending against a determined adversary who already has code execution inside, because every container shares the host kernel. Appropriate for trusted code, not for untrusted code. The default Docker, Kubernetes, ECS, and most CI runner setups land here.

The second room has the same partition wall, but most of the dangerous items inside the room have been removed. Hardened containers: Wolfi, Chainguard, and distroless images, often paired with a minimal node OS such as Bottlerocket. Standard container runtime, but the image inside has no shell, no package manager, no unnecessary binaries. Significantly fewer CVEs because there is significantly less surface to be vulnerable. The isolation boundary is the same (namespaces and cgroups), but an attacker who lands inside has fewer primitives to work with, so an escape is harder to assemble. Cheap latency-wise; meaningful security improvement; widely adopted in production AI platforms over the last twelve months as the default image tier.

The third room has a translator at the door who reads every message before the room hears it. gVisor, Google's application-kernel approach. It intercepts syscalls in userspace and re-implements them safely, so the host kernel is never directly exposed to the container's syscalls. Strong isolation against kernel exploits without the full overhead of a VM. Higher syscall latency than runc, typically two to five times, which matters for syscall-heavy workloads but is invisible to most AI inference paths. Used by Google internally and in Google Cloud products such as GKE Sandbox, Cloud Run, and App Engine.

The fourth room is a separate building. MicroVMs — Firecracker, Kata Containers, Cloud Hypervisor. Lightweight virtual machines with their own kernels. Hardware-virtualisation isolation boundary, the strongest tier. Higher startup latency (typically 125-300ms for Firecracker cold-start, longer for full Kata) and higher memory overhead per instance (50-128MB minimum). Used by AWS Lambda, AWS Fargate, Anthropic's Sandbox skill, and most production AI agent platforms running untrusted user code.

A fifth tier exists in research — Confidential Containers using AMD SEV-SNP or Intel TDX for memory-encryption isolation — but it is not yet a mainstream production choice for AI workloads, and we will treat it as out-of-scope for this piece.

Three Questions the Workload Actually Asks You

The tier decision turns on three axes. Not on architectural preference. Not on what the cloud vendor recommends. Not on what is installed by default.

The first axis is the trust boundary of the code being run. Internal-trusted code — the company's own services, vetted by code review, deployed via the same CI/CD as the rest of the platform — needs hygiene, not isolation. Tier 1 with good image hygiene is fine. Tier 2 is better. The marginal isolation gain from going to gVisor or microVM is paid for in latency and complexity without earning meaningful security improvement, because the code was not going to attack the host. By contrast, user-supplied code — an agent executing whatever a user prompts it to execute, a generative model running arbitrary scripts it composed itself — is untrusted by construction. Tier 4 microVM is the default. Tier 3 gVisor is the floor.

The second axis is workload latency and cost profile. A workload that runs for thirty seconds and emits one result can amortise 200ms of microVM cold-start cost with no user-perceived impact. A workload that runs for 50ms — a per-request tool call inside a chat agent — cannot. The right isolation tier scales with the workload duration. Short, frequent, latency-sensitive workloads gravitate toward Tier 2 or Tier 3. Long, bursty, isolation-critical workloads can absorb Tier 4. The same architectural primitive — Tier 4 isolation — that is a fine fit for code execution in a research agent is a bad fit for per-query tool dispatch in a customer-facing chat assistant.
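The amortisation argument is simple arithmetic, and a minimal Python sketch makes it concrete. The function name `cold_start_share` is ours, and the 200ms figure is the illustrative microVM cold start used above, not a measurement.

```python
def cold_start_share(cold_start_ms: float, run_ms: float) -> float:
    """Return the fraction of total wall-clock time spent on cold start."""
    if cold_start_ms < 0 or run_ms <= 0:
        raise ValueError("cold_start_ms must be >= 0 and run_ms must be > 0")
    return cold_start_ms / (cold_start_ms + run_ms)


# 200 ms microVM cold start against the two workloads from the text.
long_job = cold_start_share(200, 30_000)  # 30-second code execution
tool_call = cold_start_share(200, 50)     # 50 ms per-request tool call

print(f"30 s job:   {long_job:.1%} of wall-clock time is cold start")
print(f"50 ms call: {tool_call:.1%} of wall-clock time is cold start")
```

The 30-second job spends under 1% of its wall-clock time on cold start; the 50ms tool call spends 80%.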

The third axis is regulatory and audit requirements. Some regulated workloads have explicit isolation requirements written into the policy. FedRAMP High. EU AI Act high-risk categorisation for systems handling sensitive personal data. Financial-services workloads handling card-holder data. These often require demonstrable VM-grade isolation between tenants, between workloads, between the model and the surrounding infrastructure. Where the audit requirement is binary, the tier choice is forced. Tier 4 microVM is the only tier that survives a strict tenant-isolation audit; the others can be justified for specific workloads but not as default isolation between tenants.

The three axes do not align neatly. A workload can be untrusted code (forcing high isolation), short-latency (forcing low isolation), and regulated (forcing high isolation again). Two of the three pull toward Tier 4; one pulls toward Tier 2. The honest answer is usually a hybrid.
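One way to make the interaction between the axes explicit is to write the decision down as a function. The sketch below illustrates the reasoning in this section and is not a policy engine: the `Workload` fields, the `choose_tier` name, and the 500ms `SHORT_RUN_MS` cut-off are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    RUNC = 1
    HARDENED = 2
    GVISOR = 3
    MICROVM = 4


@dataclass(frozen=True)
class Workload:
    untrusted_code: bool       # axis 1: runs user- or model-written code?
    typical_run_ms: float      # axis 2: typical duration of one invocation
    strict_tenant_audit: bool  # axis 3: binary VM-grade isolation requirement


# Hypothetical cut-off: below this, a microVM cold start dominates the request.
SHORT_RUN_MS = 500.0


def choose_tier(w: Workload) -> Tier:
    # A binary audit requirement forces the choice regardless of latency.
    if w.strict_tenant_audit:
        return Tier.MICROVM
    if w.untrusted_code:
        # Tier 4 is the default; Tier 3 is the floor for short, frequent calls.
        return Tier.GVISOR if w.typical_run_ms < SHORT_RUN_MS else Tier.MICROVM
    # Internal-trusted code needs hygiene, not isolation.
    return Tier.HARDENED


print(choose_tier(Workload(untrusted_code=True, typical_run_ms=50, strict_tenant_audit=True)).name)
```

The conflicting case from the paragraph above (untrusted, short-latency, regulated) resolves to the microVM tier here only because the audit check comes first. A real platform would more likely split that workload across tiers than force one answer.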

The Two Questions That Move the Decision

Two questions sit on top of the workload-axis decision and can shift it materially.

The first: are you running the agent's code execution sandbox, or the agent's normal runtime? Code execution sandboxes — the Anthropic Sandbox skill model, OpenAI's Code Interpreter, Cursor's terminal sandbox — are the canonical Tier 4 use case. Untrusted code, bursty workload, isolation-critical. The agent's normal runtime — the LLM inference, the orchestration loop, the tool-call dispatcher — is a different beast. The LLM does not execute user code. It processes user input. The orchestration loop runs the platform's own code. Hardened containers (Tier 2) are typically the right answer for the runtime, and microVMs (Tier 4) for the code execution sandbox sitting underneath the runtime. Putting everything in microVMs because AI workloads need isolation is a category error that pays the latency cost for no security gain.

The second: what is the cardinality of tenants per host? Single-tenant deployments, where one customer's workload runs on one host, tolerate weaker between-workload isolation because there is no other tenant on the host to be exposed. Multi-tenant deployments, where one customer's workload shares a host with another customer's, require strong between-workload isolation by default. The cloud vendor's hardware isolation protects you from other cloud customers, but the workload runtime is the attack surface between your own tenants. Multi-tenant SaaS AI platforms ship with microVM isolation between tenants almost universally now. Single-tenant deployments (VPC-isolated, dedicated capacity) can run hardened containers between workloads without the same exposure.

The Numbers That Justify the Choice

Concrete numbers from production deployments we run, with the usual caveat that workload mix shifts these significantly.

Cold-start latency runs from runc at 5-20ms typical, through hardened images at 15-50ms, gVisor at 30-100ms (depending heavily on what the workload does early), Firecracker at 125-300ms, and Kata Containers at one to three seconds.

Memory overhead per instance is around 5MB for runc and 5-10MB for hardened images (the image is smaller but the runtime overhead is the same). gVisor lands at 10-15MB. Firecracker requires 50-128MB minimum for the kernel plus VM overhead. Kata starts at 128MB and goes up.

Per-syscall latency overhead for syscall-heavy workloads tracks the abstraction depth. runc and hardened images sit at baseline. gVisor adds two to five times. Firecracker stays under 1.1x thanks to hardware acceleration. Kata behaves similarly to Firecracker.
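A rough way to see why the same multiplier is invisible on one path and painful on another is to multiply it out. The sketch below assumes a hypothetical 1 microsecond baseline syscall cost and made-up syscall counts per request; only the two-to-five-times multiplier comes from the text. Measure your own workload before relying on any of it.

```python
def added_latency_ms(syscalls: int, base_syscall_us: float, multiplier: float) -> float:
    """Extra latency versus baseline when each syscall costs `multiplier` x baseline."""
    if syscalls < 0 or base_syscall_us < 0 or multiplier < 1.0:
        raise ValueError("syscalls and base cost must be >= 0, multiplier >= 1")
    return syscalls * base_syscall_us * (multiplier - 1.0) / 1000.0


BASE_SYSCALL_US = 1.0        # assumption, not a measurement
INFERENCE_SYSCALLS = 2_000   # hypothetical: mostly memory- and network-bound
BUILD_SYSCALLS = 2_000_000   # hypothetical: compiler or filesystem-heavy tool

for label, count in (("inference request", INFERENCE_SYSCALLS),
                     ("build-style tool", BUILD_SYSCALLS)):
    low = added_latency_ms(count, BASE_SYSCALL_US, 2.0)
    high = added_latency_ms(count, BASE_SYSCALL_US, 5.0)
    print(f"{label}: +{low:.0f} to +{high:.0f} ms")
```

Under these assumptions the inference request gains single-digit milliseconds while the syscall-heavy tool gains seconds, which is the shape of the trade-off rather than a prediction for any particular workload.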

Density per host runs hundreds to thousands for runc and hardened images, hundreds for gVisor, low hundreds for Firecracker (memory-bounded), and tens to low hundreds for Kata.
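Because microVM density is memory-bounded, a back-of-envelope estimate falls out of the per-instance overheads above. The sketch assumes a hypothetical 64 GiB host, 4 GiB reserved for the host itself, and a 128MB working set per workload; only the overhead figures are taken from the text, using the upper end of each range. Memory is just one bound, so treat the results as ceilings.

```python
def max_instances(host_mem_mb: int, reserved_mb: int,
                  overhead_mb: int, workload_mb: int) -> int:
    """Memory-bounded instance count for one host."""
    usable = host_mem_mb - reserved_mb
    per_instance = overhead_mb + workload_mb
    if usable <= 0 or per_instance <= 0:
        return 0
    return usable // per_instance


# Per-instance isolation overhead in MB (upper end of the ranges in the text).
OVERHEAD_MB = {"runc": 5, "hardened": 10, "gvisor": 15, "firecracker": 128}

HOST_MEM_MB = 64 * 1024  # hypothetical 64 GiB host
RESERVED_MB = 4 * 1024   # hypothetical reservation for the host OS and agents
WORKLOAD_MB = 128        # hypothetical working set per workload

density = {
    tier: max_instances(HOST_MEM_MB, RESERVED_MB, overhead, WORKLOAD_MB)
    for tier, overhead in OVERHEAD_MB.items()
}
for tier, count in density.items():
    print(f"{tier}: up to {count} instances")
```

With small working sets the isolation overhead is a large share of each instance, so the gap between tiers is wide; as the working set grows, the overhead matters less and the tiers converge.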

CVE attack surface (a rough indicator, not a number to over-index on) varies widely. runc with a standard image is large. A hardened image carries 50-90% fewer CVEs depending on the base. gVisor exposes less of the host kernel than runc, though the gVisor codebase itself has had CVEs. Firecracker has the smallest exposed surface, but VM escape CVEs in the underlying hypervisor still exist and require host patching.

The economics shape the decision. If your workload is high-volume and short-duration — per-query tool dispatch, real-time chat — Tier 4 is hard to justify. If your workload is rare but high-isolation — untrusted code execution, multi-tenant data processing — Tier 4 is hard to avoid.

The Hybrid Pattern, Drawn From Production

A pattern recurs across the AI platforms we have shipped or reviewed in the last twelve months. It is not invented; it is what survives contact with real workloads.

Inference and orchestration runtime sits in hardened containers (Tier 2). The model server, the agent loop, the tool-call dispatcher, the retrieval pipeline. Internal-trusted code, performance-sensitive, audit-clean with proper image hygiene. Wolfi or Chainguard images on Kubernetes, often with Bottlerocket on the node side as an additional layer.

Tool-call execution sits in Tier 2 or Tier 3 depending on tool trust. Tools that hit known internal APIs run in the same hardened containers as the runtime. Tools that execute scripts, run shell commands, or otherwise issue arbitrary system calls run in gVisor (Tier 3) for the marginal kernel-exploit isolation.

The code execution sandbox sits in a microVM (Tier 4). Firecracker for AWS-native, Kata for self-hosted. This is the layer that runs the agent's generated code, the user's Python in a Jupyter cell, the script the model decided to execute. Strong isolation, audit-defensible, multi-tenant safe.

Cross-tenant isolation sits in a microVM as well. Where regulatory or trust requirements demand strict between-tenant isolation, the tenant boundary lives at the microVM layer. Tier 2 inside the microVM, with the VM as the tenant separator.

The platform engineer who picks one tier for everything is either over-paying for isolation on workloads that do not need it, or under-paying on workloads that do. The hybrid pattern routes the workload class to the appropriate tier and lets the cost-and-isolation curves balance.
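On Kubernetes, routing by workload class can be as small as a lookup that sets the Pod's `runtimeClassName` field. The sketch below builds a Pod manifest as a plain dictionary. The workload class names, the RuntimeClass names `gvisor` and `kata`, and the image reference are hypothetical, and the RuntimeClass objects themselves would need to exist on the cluster.

```python
from enum import Enum
from typing import Optional


class WorkloadClass(Enum):
    INFERENCE_RUNTIME = "inference_runtime"
    INTERNAL_API_TOOL = "internal_api_tool"
    SCRIPT_TOOL = "script_tool"
    CODE_SANDBOX = "code_sandbox"


# None means the cluster default runtime (runc) with a hardened image (Tier 2).
# The RuntimeClass names are assumptions; use whatever your cluster defines.
RUNTIME_CLASS: dict[WorkloadClass, Optional[str]] = {
    WorkloadClass.INFERENCE_RUNTIME: None,
    WorkloadClass.INTERNAL_API_TOOL: None,
    WorkloadClass.SCRIPT_TOOL: "gvisor",   # Tier 3
    WorkloadClass.CODE_SANDBOX: "kata",    # Tier 4
}


def pod_manifest(name: str, image: str, workload_class: WorkloadClass) -> dict:
    """Build a minimal Pod manifest routed to the tier for its workload class."""
    spec: dict = {"containers": [{"name": name, "image": image}]}
    runtime_class = RUNTIME_CLASS[workload_class]
    if runtime_class is not None:
        spec["runtimeClassName"] = runtime_class
    return {"apiVersion": "v1", "kind": "Pod",
            "metadata": {"name": name}, "spec": spec}


sandbox = pod_manifest("py-sandbox", "registry.example.com/sandbox:1.0",
                       WorkloadClass.CODE_SANDBOX)
print(sandbox["spec"]["runtimeClassName"])
```

Keeping the mapping in one table means the tier decision is reviewed in one place, and a workload class that has no entry fails loudly with a `KeyError` instead of silently landing on the default runtime.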

The Tier Most Platforms Forget to Adopt

The hardened-image tier deserves specific attention as the most under-deployed of the four. The marginal cost over standard containers is near zero — a different base image in the Dockerfile, a different node AMI on the cluster. The security improvement is meaningful. The latency impact is negligible. For most platforms still running standard containers on the runtime tier, the upgrade to hardened images is the highest-leverage isolation work available, and most do not ship it because it does not feel like a Big Architectural Decision in the way that picking microVMs does.
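A cheap way to check whether an image actually meets the "no shell, no package manager" bar is to scan its unpacked root filesystem. The following is a minimal sketch, assuming you already have the image extracted to a directory. The `RISKY_NAMES` list and the function name are illustrative choices, not a standard, and a real audit would use a proper image scanner.

```python
import tempfile
from pathlib import Path

# Illustrative list of shells, package managers, and download tools.
RISKY_NAMES = {"sh", "bash", "ash", "dash", "apt", "apt-get", "dpkg",
               "apk", "yum", "dnf", "rpm", "curl", "wget"}
BIN_DIRS = ("bin", "sbin", "usr/bin", "usr/sbin", "usr/local/bin")


def find_risky_binaries(rootfs: Path) -> list[str]:
    """List risky binaries found in the usual bin directories of a rootfs."""
    found = []
    for rel_dir in BIN_DIRS:
        directory = rootfs / rel_dir
        if not directory.is_dir():
            continue
        for entry in directory.iterdir():
            if entry.name in RISKY_NAMES:
                found.append(entry.relative_to(rootfs).as_posix())
    return sorted(found)


# Demo on a throwaway directory standing in for an extracted image.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "usr/bin").mkdir(parents=True)
    (root / "usr/bin/python3").touch()
    (root / "bin").mkdir()
    (root / "bin/sh").touch()
    demo_findings = find_risky_binaries(root)

print(demo_findings)
```

An empty result does not prove an image is hardened, but a non-empty one on the runtime tier is a quick signal that the base image is still a general-purpose distribution.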

Hardware isolation is the strongest. Hardened images are the cheapest meaningful improvement. The right architectural posture uses both, in different places, for different workloads. The platforms that articulate this trade-off explicitly — in their threat models, their architecture documents, and their procurement conversations — are the ones operating at the standard the next decade of regulated AI deployment will require.

Isolation is not free, and pretending it is leads either to platforms that pay too much for it or platforms that pay too little. Which mistake is your current architecture making?

FAQs

Do I really need microVMs for AI workloads?

Only for the layers running untrusted code — agent code execution sandboxes, user-supplied Python in a Jupyter cell, scripts the model decided to execute. The LLM inference, the agent orchestration loop, and the tool-call dispatcher run the platform's own code and don't need VM-grade isolation. Putting everything in microVMs because "AI workloads need isolation" is a category error that pays latency cost for no security gain.

What's the difference between hardened containers and standard containers in security terms?

Same isolation boundary — Linux namespaces and cgroups — but a much smaller attack surface inside the container. No shell, no package manager, no unnecessary binaries, 50-90% fewer CVEs depending on the base image. An attacker who gets code execution inside has fewer primitives to work with, so container escape paths are materially harder to assemble. Cheap to adopt and meaningful as a defence-in-depth layer.

How much does gVisor cost in latency?

Cold-start runs 30-100ms depending on what the workload does early. Per-syscall latency is 2-5x runc — invisible for AI inference paths (few syscalls, mostly memory-bound and network-bound) but painful for syscall-heavy workloads like compilers or filesystem-intensive tools. Density per host stays in the hundreds, so it's not a meaningful cost on capacity.

When does Firecracker's 125-300ms cold-start actually matter?

When the workload duration is short enough that 200ms of startup is a visible fraction of total response time. A workload that runs for 30 seconds amortises microVM cold-start with no user-perceived impact. A 50ms per-request tool call inside a chat agent cannot — the cold-start dominates. Short, frequent, latency-sensitive workloads gravitate toward Tier 2 or Tier 3; long, bursty, isolation-critical workloads can absorb Tier 4.

What's the hybrid pattern that most production AI platforms ship?

Hardened containers (Tier 2) for the inference and orchestration runtime — model server, agent loop, retrieval pipeline. Tier 2 or gVisor (Tier 3) for tool-call execution depending on tool trust. MicroVM (Tier 4) for the code execution sandbox — Firecracker on AWS-native, Kata for self-hosted. MicroVM again as the cross-tenant boundary where regulatory or trust requirements demand strict between-tenant isolation. The hybrid lets the cost-and-isolation curves balance per workload class rather than forcing a single tier across everything.

How to engage

If your AI platform is running runc-everywhere and the question of tier selection is on the horizon, we can help you map the workload classes to the right tiers and ship the hybrid pattern. Talk to us at creativeminds.dev/contact.