Hardened Containers for AI Workloads: The Tier Decision Most Teams Get Wrong

cmdev, 11 min read

Every AI platform engineer eventually has to answer the same question. The agent runs untrusted code. The model executes whatever the user asks it to execute. The sandboxing layer has to keep the rest of the system safe from the workload it just ran. Which runtime do you use: runc, gVisor, Firecracker, Kata, or runc with one of the newer hardened images?

Vendor documentation gives two answers. Either "use ours" — useful only if you have already decided. Or "it depends" — true but unhelpful to a platform engineer with a deployment date.

This piece is the workload-axis answer. Four production tiers, honestly compared. Three axes that determine the right tier per workload. Two questions that change the decision. And the hybrid pattern most regulated AI platforms actually ship — because no single tier wins across every workload, and the operational reality is that production platforms run two or three tiers, routed by workload class.

The companion piece, "Agent code execution: microVM vs container", goes deeper on the microVM tier specifically. This piece is the broader tier-selection framework.

Key takeaways

  • The four production tiers are standard containers (runc), hardened containers (Wolfi, Chainguard, distroless), userspace sandboxes (gVisor), and microVMs (Firecracker, Kata) — each has a defensible workload class and none wins across the board.
  • Three axes decide the tier: trust boundary of the code (internal-trusted vs user-supplied), workload latency and cost profile (50ms tool call vs 30-second code execution), and regulatory/audit requirements (FedRAMP High, EU AI Act, financial services often force VM-grade isolation).
  • The tier numbers from production: cold-start ranges from 5-20ms (runc) to 125-300ms (Firecracker) to 1-3s (Kata); microVMs cost 50-128MB memory minimum versus ~5MB for containers; gVisor adds 2-5x syscall latency overhead.
  • Most regulated AI platforms ship a hybrid — hardened containers for the inference and orchestration runtime, gVisor for tool-call execution where appropriate, microVMs for code execution sandboxes and cross-tenant isolation.
  • The hardened-image tier is the most under-deployed of the four — near-zero marginal cost over standard containers, meaningful security improvement, negligible latency impact — and most teams skip it because it doesn't feel like a Big Architectural Decision.

Figure 1: The four tiers, the numbers from production, and the hybrid routing pattern. The figure is a four-tier container isolation matrix (standard containers, hardened containers, userspace sandboxes, microVMs) showing cold-start, memory, density, and best-fit workload per tier.

The four tiers

Tier 1 — standard containers (runc, containerd). Linux namespaces and cgroups. Lowest overhead. No security boundary worth defending against a determined adversary running code inside the container. Appropriate for trusted code, not for untrusted code. The default Docker, Kubernetes, ECS, and most CI runner setups land here.

Tier 2 — hardened containers (Wolfi, Chainguard, distroless). Standard container runtime, but with minimal-attack-surface images. No shell, no package manager, no unnecessary binaries. Significantly fewer CVEs because there is significantly less software in the image to be vulnerable. The isolation boundary is the same as runc (namespaces and cgroups), but an attacker who gets code execution inside the container has fewer primitives to build an escape from. A hardened node OS such as Bottlerocket applies the same idea to the host rather than to the image. Cheap latency-wise; meaningful security improvement; widely adopted in production AI platforms over the last twelve months as the default image tier.

Tier 3 — userspace sandboxes (gVisor). Application-kernel approach. gVisor intercepts the workload's syscalls and services them in its own userspace kernel, which itself makes only a restricted set of calls to the host kernel. Strong isolation against kernel exploits without the full overhead of a VM. Higher syscall latency than runc (typically 2-5x), which matters for syscall-heavy workloads but is invisible to most AI inference paths. Used by Google internally and in products such as GKE Sandbox and Cloud Run.

Tier 4 — microVMs (Firecracker, Kata Containers, Cloud Hypervisor). Lightweight virtual machines with their own kernels. Hardware-virtualisation isolation boundary — the strongest tier. Higher startup latency (typically 125-300ms for Firecracker cold-start, longer for full Kata) and higher memory overhead per instance (50-128MB minimum). Used by AWS Lambda, AWS Fargate, Anthropic's Sandbox skill, and most production AI agent platforms running untrusted user code.

A fifth tier exists in research — Confidential Containers using AMD SEV-SNP or Intel TDX for memory-encryption isolation — but it is not yet a mainstream production choice for AI workloads, and we will treat it as out-of-scope for this piece.

The three workload axes

The tier decision turns on three axes. Not on architectural preference, not on what the cloud vendor recommends, not on what's installed by default.

Axis one — trust boundary of the code being run. Internal-trusted code (the company's own services, vetted by code review, deployed via the same CI/CD as the rest of the platform) needs hygiene, not isolation. Tier 1 with good image hygiene is fine. Tier 2 is better. The marginal isolation gain from going to gVisor or microVM is paid for in latency and complexity without earning meaningful security improvement, because the code wasn't going to attack the host. By contrast, user-supplied code (an agent executing whatever a user prompts it to execute, a generative model running arbitrary scripts it composed itself) is untrusted by construction. Tier 4 microVM is the default; Tier 3 gVisor is the floor.

Axis two — workload latency and cost profile. A workload that runs for 30 seconds and emits one result can amortise 200ms of microVM cold-start cost with no user-perceived impact. A workload that runs for 50ms (a per-request tool call inside a chat agent) cannot. The right isolation tier scales with the workload duration. Short, frequent, latency-sensitive workloads gravitate toward Tier 2 or Tier 3. Long, bursty, isolation-critical workloads can absorb Tier 4. The same architectural primitive — Tier 4 isolation — that's a fine fit for code execution in a research agent is a bad fit for per-query tool dispatch in a customer-facing chat assistant.
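The amortisation argument is simple arithmetic, and it can help to see it written down. The sketch below uses the 200ms, 30-second, and 50ms figures from the paragraph above; the function name is ours and purely illustrative.

```python
def cold_start_share(cold_start_ms: float, run_ms: float) -> float:
    """Return the fraction of total wall-clock time spent on cold start."""
    if cold_start_ms < 0 or run_ms <= 0:
        raise ValueError("cold_start_ms must be >= 0 and run_ms must be > 0")
    return cold_start_ms / (cold_start_ms + run_ms)


# The two workloads from the paragraph above, each paying a 200ms cold start.
long_job = cold_start_share(cold_start_ms=200, run_ms=30_000)
tool_call = cold_start_share(cold_start_ms=200, run_ms=50)

print(f"30s code execution: {long_job:.1%} of wall-clock time is cold start")
print(f"50ms tool call:     {tool_call:.1%} of wall-clock time is cold start")
```

Under these figures the long job spends under 1% of its time on cold start, while the tool call spends 80% of it there.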

Axis three — regulatory and audit requirements. Some regulated workloads have explicit isolation requirements written into the policy. FedRAMP High, EU AI Act high-risk categorisation for systems handling sensitive personal data, and financial-services workloads handling cardholder data often require demonstrable VM-grade isolation: between tenants, between workloads, and between the model and the surrounding infrastructure. Where the audit requirement is binary, the tier choice is forced. Tier 4 microVM is the only tier that survives a strict tenant-isolation audit; the others can be justified for specific workloads but not as default isolation between tenants.

The three axes do not align neatly. A workload can be untrusted code (pulling toward high isolation), latency-sensitive (pulling toward a lighter tier), and regulated (pulling toward high isolation again). Two of the three pull toward Tier 4; one pulls toward Tier 2. The honest answer is usually a hybrid.
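One way to make the three axes concrete is a small decision function. This is a minimal sketch, not a policy engine. The `Workload` fields, the `SHORT_RUN_MS` threshold, and the order of precedence (audit first, then trust, then latency) are our assumptions for illustration; the tier outcomes follow the defaults described above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    RUNC = 1
    HARDENED = 2
    GVISOR = 3
    MICROVM = 4


@dataclass(frozen=True)
class Workload:
    untrusted_code: bool      # axis one: user-supplied or model-composed code
    typical_run_ms: float     # axis two: latency and cost profile
    audit_requires_vm: bool   # axis three: binary VM-grade isolation requirement


# Assumed cut-off between "short, frequent" and "long, bursty" workloads.
SHORT_RUN_MS = 1_000


def choose_tier(w: Workload) -> Tier:
    """Pick a tier; a forced audit requirement overrides the other two axes."""
    if w.audit_requires_vm:
        return Tier.MICROVM
    if w.untrusted_code:
        # Tier 4 is the default for untrusted code; Tier 3 is the floor,
        # used here only when the run is too short to absorb a VM cold start.
        return Tier.GVISOR if w.typical_run_ms < SHORT_RUN_MS else Tier.MICROVM
    # Internal-trusted code needs hygiene, not isolation.
    return Tier.HARDENED


print(choose_tier(Workload(untrusted_code=True, typical_run_ms=50, audit_requires_vm=True)).name)
```

The last line is the conflicting case from the paragraph above: untrusted, short-running, and regulated. With this precedence the audit requirement wins, and the latency cost has to be handled some other way.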

The two questions that change the answer

Two questions sit on top of the workload-axis decision and can shift it materially.

One — are you running the agent's code execution sandbox, or the agent's normal runtime? Code execution sandboxes (the Anthropic Sandbox skill model, OpenAI's Code Interpreter, Cursor's terminal sandbox) are the canonical Tier 4 use case. Untrusted code, bursty workload, isolation-critical. The agent's normal runtime — the LLM inference, the orchestration loop, the tool-call dispatcher — is a different beast. The LLM doesn't execute user code; it processes user input. The orchestration loop runs the platform's own code. Hardened containers (Tier 2) are typically the right answer for the runtime, and microVMs (Tier 4) for the code execution sandbox sitting underneath the runtime. Putting everything in microVMs because "AI workloads need isolation" is a category error and pays the latency cost for no security gain.

Two — what is the cardinality of tenants per host? Single-tenant deployments (one customer's workload on one host) tolerate weaker between-workload isolation because there's no other tenant on the host to be exposed. Multi-tenant deployments (the customer's workload sharing a host with another customer's workload) require strong between-workload isolation by default. The cloud vendor's virtualisation separates your hosts from other cloud customers, but separating your own tenants on a shared host is the job of your workload runtime. Multi-tenant SaaS AI platforms now ship with microVM isolation between tenants almost universally. Single-tenant deployments (VPC-isolated, dedicated capacity) can run hardened containers between workloads without the same exposure.

The numbers that justify the choice

Concrete numbers from production deployments we run, with the usual caveat that workload mix shifts these significantly:

Cold-start latency. runc: 5-20ms typical. Hardened image (Wolfi, distroless): 15-50ms. gVisor: 30-100ms (depends heavily on what the workload does early). Firecracker: 125-300ms. Kata Containers: 1-3s.

Memory overhead per instance. runc: ~5MB. Hardened image: ~5-10MB (the image is smaller but the runtime overhead is the same). gVisor: ~10-15MB. Firecracker: 50-128MB minimum (kernel + VM overhead). Kata: 128MB+.

Per-syscall latency overhead (for syscall-heavy workloads). runc: ~baseline. Hardened image: ~baseline. gVisor: 2-5x. Firecracker: <1.1x (hardware acceleration). Kata: similar to Firecracker.

Density per host. runc: hundreds to thousands. Hardened image: same. gVisor: hundreds. Firecracker: low hundreds (memory-bounded). Kata: tens to low hundreds.

CVE attack surface (rough indicator, not a number to over-index on). runc with a standard image: large. Hardened image: 50-90% smaller depending on the base. gVisor: smaller than runc (less host kernel exposure) but the gVisor codebase itself has had CVEs. Firecracker: smallest exposed surface, but VM escape CVEs in the underlying hypervisor still exist and require host patching.

The economics shape the decision. If your workload is high-volume and short-duration (per-query tool dispatch, real-time chat), Tier 4 is hard to justify. If your workload is rare but high-isolation (untrusted code execution, multi-tenant data processing), Tier 4 is hard to avoid.
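To see why microVM density is memory-bounded, a rough capacity estimate helps. The sketch below takes the lower bound of each per-instance overhead range quoted above and treats MB as MiB for simplicity. The 64 GiB host, the 4 GiB reserved for the host itself, and the 64 MiB per-workload footprint are illustrative assumptions, and the estimate ignores CPU, I/O, and scheduler limits entirely.

```python
MIB_PER_GIB = 1024

# Per-instance runtime overhead in MiB: lower bounds of the ranges quoted above.
OVERHEAD_MIB = {
    "runc": 5,
    "hardened": 5,
    "gvisor": 10,
    "firecracker": 50,
    "kata": 128,
}


def instances_per_host(host_gib: int, workload_mib: int, overhead_mib: int,
                       reserved_gib: int = 4) -> int:
    """Memory-only upper bound on instances that fit on one host."""
    usable_mib = (host_gib - reserved_gib) * MIB_PER_GIB
    if usable_mib <= 0:
        raise ValueError("reserved memory must be smaller than host memory")
    return usable_mib // (workload_mib + overhead_mib)


# Hypothetical host: 64 GiB, each workload needing 64 MiB of its own memory.
for tier, overhead in OVERHEAD_MIB.items():
    print(f"{tier:12s} {instances_per_host(64, 64, overhead)}")
```

The absolute counts depend entirely on the assumed workload footprint. What carries over is the ratio between tiers: the smaller the workload, the more the fixed per-instance overhead dominates.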

The hybrid pattern most platforms ship

A pattern that recurs across the AI platforms we have shipped or reviewed in the last twelve months:

  • Inference and orchestration runtime — hardened containers (Tier 2). The model server, the agent loop, the tool-call dispatcher, the retrieval pipeline. Internal-trusted code, performance-sensitive, audit-clean with proper image hygiene. Wolfi or Chainguard images on Kubernetes (often via EKS with Bottlerocket on the node side as an additional layer).
  • Tool-call execution — Tier 2 or Tier 3 depending on tool trust. Tools that hit known internal APIs run in the same hardened containers as the runtime. Tools that execute scripts, run shell commands, or compose system calls run in gVisor (Tier 3) for the marginal kernel-exploit isolation.
  • Code execution sandbox — microVM (Tier 4). Firecracker for AWS-native, Kata for self-hosted. This is the layer that runs the agent's generated code, the user's Python in a Jupyter cell, the script the model decided to execute. Strong isolation, audit-defensible, multi-tenant safe.
  • Cross-tenant isolation — microVM as well. Where regulatory or trust requirements demand strict between-tenant isolation, the tenant boundary lives at the microVM layer. Tier 2 inside the microVM, with the VM as the tenant separator.

The platform engineer who picks "one tier for everything" is either over-paying for isolation on workloads that don't need it, or under-paying on workloads that do. The hybrid pattern routes the workload class to the appropriate tier and lets the cost-and-isolation curves balance.
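On Kubernetes, one way to express this routing is the Pod spec's `runtimeClassName` field, which selects a RuntimeClass configured on the cluster. The sketch below builds Pod manifests as plain dictionaries. The workload class names, the `gvisor` and `kata` RuntimeClass names, and the image references are hypothetical; real names depend on how the cluster's RuntimeClasses were installed.

```python
from enum import Enum
from typing import Optional


class WorkloadClass(Enum):
    RUNTIME = "inference-and-orchestration"
    TRUSTED_TOOL = "tool-call-internal-api"
    SCRIPTED_TOOL = "tool-call-scripts-or-shell"
    CODE_SANDBOX = "code-execution-sandbox"


# Hypothetical RuntimeClass names. None means the cluster default runtime
# (runc), where the hardening comes from the image rather than the runtime.
RUNTIME_CLASS: dict[WorkloadClass, Optional[str]] = {
    WorkloadClass.RUNTIME: None,
    WorkloadClass.TRUSTED_TOOL: None,
    WorkloadClass.SCRIPTED_TOOL: "gvisor",
    WorkloadClass.CODE_SANDBOX: "kata",
}


def pod_manifest(name: str, image: str, workload_class: WorkloadClass) -> dict:
    """Build a Pod manifest routed to the tier for its workload class."""
    spec: dict = {"containers": [{"name": name, "image": image}]}
    runtime_class = RUNTIME_CLASS[workload_class]
    if runtime_class is not None:
        spec["runtimeClassName"] = runtime_class
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"workload-class": workload_class.value}},
        "spec": spec,
    }


sandbox = pod_manifest("py-sandbox", "registry.example.com/sandbox:1", WorkloadClass.CODE_SANDBOX)
print(sandbox["spec"]["runtimeClassName"])
```

Keeping the mapping in one table makes the routing auditable: a reviewer can read which workload class lands on which tier without tracing deployment code.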

What this teaches us about enterprise scaling

The principle underneath all of this is that isolation is not free, and pretending it is leads to platforms that either pay too much for it (microVMs everywhere, slow chat, expensive bills) or pay too little (runc everywhere, one CVE away from a tenant-bleed). The platforms that scale gracefully match isolation tier to workload class, audit the routing, and revisit it as workload mix changes.

The hardened-image tier deserves specific attention as the most under-deployed of the four. The marginal cost over standard containers is near zero — a different base image in the Dockerfile, a different node AMI on the cluster. The security improvement is meaningful. The latency impact is negligible. For most platforms still running standard containers on the runtime tier, the upgrade to hardened images is the highest-leverage isolation work available, and most don't ship it because it does not feel like a Big Architectural Decision in the way that picking microVMs does.
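A quick way to check whether an image actually delivers the "no shell, no package manager" property is to inspect its unpacked root filesystem, for example one produced by `docker export` piped into `tar`. The sketch below is an illustration only: the list of paths is our assumption and far from exhaustive, and it is not a substitute for a CVE scanner.

```python
import tempfile
from pathlib import Path

# Illustrative, non-exhaustive list of binaries an attacker would like to find.
SUSPECT_PATHS = (
    "bin/sh", "bin/bash", "bin/busybox",
    "usr/bin/apt", "usr/bin/apt-get", "usr/bin/dpkg",
    "sbin/apk", "usr/bin/yum", "usr/bin/dnf",
    "usr/bin/curl", "usr/bin/wget",
)


def find_attack_primitives(rootfs: Path) -> list[str]:
    """Return the suspect paths present in an unpacked image root filesystem."""
    found = []
    for rel in SUSPECT_PATHS:
        candidate = rootfs / rel
        # is_symlink() also catches links whose target does not resolve
        # outside the original container (for example absolute symlinks).
        if candidate.exists() or candidate.is_symlink():
            found.append(rel)
    return found


def demo() -> list[str]:
    """Build a tiny fake rootfs containing a shell and an app binary."""
    with tempfile.TemporaryDirectory() as tmp:
        rootfs = Path(tmp)
        (rootfs / "bin").mkdir()
        (rootfs / "bin" / "sh").touch()
        (rootfs / "app").mkdir()
        (rootfs / "app" / "server").touch()
        return find_attack_primitives(rootfs)


print(demo())
```

An empty result from a real image is the outcome to aim for on the runtime tier. A non-empty one is a list of things to justify or remove.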

Hardware isolation is the strongest. Hardened images are the cheapest meaningful improvement. The right architectural posture uses both, in different places, for different workloads. The platforms that articulate this trade-off explicitly — in their threat models, their architecture documents, and their procurement conversations — are the ones operating at the standard the next decade of regulated AI deployment will require.

FAQs

Do I really need microVMs for AI workloads?

Only for the layers running untrusted code — agent code execution sandboxes, user-supplied Python in a Jupyter cell, scripts the model decided to execute. The LLM inference, the agent orchestration loop, and the tool-call dispatcher run the platform's own code and don't need VM-grade isolation. Putting everything in microVMs because "AI workloads need isolation" is a category error that pays latency cost for no security gain.

What's the difference between hardened containers and standard containers in security terms?

Same isolation boundary — Linux namespaces and cgroups — but a much smaller attack surface inside the container. No shell, no package manager, no unnecessary binaries, 50-90% fewer CVEs depending on the base image. An attacker who gets code execution inside has fewer primitives to work with, so container escape paths are materially harder to assemble. Cheap to adopt and meaningful as a defence-in-depth layer.

How much does gVisor cost in latency?

Cold-start runs 30-100ms depending on what the workload does early. Per-syscall latency is 2-5x runc — invisible for AI inference paths (few syscalls, mostly memory-bound and network-bound) but painful for syscall-heavy workloads like compilers or filesystem-intensive tools. Density per host stays in the hundreds, so it's not a meaningful cost on capacity.
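A back-of-the-envelope estimate shows why the same multiplier is invisible on one workload and painful on another. Only the 2-5x multiplier comes from the numbers above. The 1 microsecond baseline syscall cost and the syscall counts are illustrative assumptions, not measurements.

```python
def added_latency_ms(syscalls: int, base_us_per_syscall: float, multiplier: float) -> float:
    """Extra time spent in syscalls when each one costs `multiplier` times the baseline."""
    if multiplier < 1:
        raise ValueError("multiplier must be >= 1")
    extra_us = syscalls * base_us_per_syscall * (multiplier - 1)
    return extra_us / 1000


# Assumed workloads: an inference request making a few thousand syscalls,
# and a compile job making a few million.
for label, syscalls in (("inference request", 2_000), ("compile job", 2_000_000)):
    low = added_latency_ms(syscalls, base_us_per_syscall=1.0, multiplier=2.0)
    high = added_latency_ms(syscalls, base_us_per_syscall=1.0, multiplier=5.0)
    print(f"{label}: +{low:g}ms to +{high:g}ms")
```

Under these assumptions the inference request gains a few milliseconds while the compile job gains seconds, which is the shape of the trade-off even if the real counts differ.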

When does Firecracker's 125-300ms cold-start actually matter?

When the workload duration is short enough that 200ms of startup is a visible fraction of total response time. A workload that runs for 30 seconds amortises microVM cold-start with no user-perceived impact. A 50ms per-request tool call inside a chat agent cannot — the cold-start dominates. Short, frequent, latency-sensitive workloads gravitate toward Tier 2 or Tier 3; long, bursty, isolation-critical workloads can absorb Tier 4.

What's the hybrid pattern that most production AI platforms ship?

Hardened containers (Tier 2) for the inference and orchestration runtime — model server, agent loop, retrieval pipeline. Tier 2 or gVisor (Tier 3) for tool-call execution depending on tool trust. MicroVM (Tier 4) for the code execution sandbox — Firecracker on AWS-native, Kata for self-hosted. MicroVM again as the cross-tenant boundary where regulatory or trust requirements demand strict between-tenant isolation. The hybrid lets the cost-and-isolation curves balance per workload class rather than forcing a single tier across everything.

How to engage

If your AI platform is running runc-everywhere and the question of tier selection is on the horizon, we can help you map the workload classes to the right tiers and ship the hybrid pattern. Talk to us at creativeminds.dev/contact.
