When Your Agent Needs to Run Code: Three Real Paths, the runc Reckoning, and the Trade-Offs Nobody Surfaces

The post crossed every AI-engineering feed last Wednesday morning. Sharp claim, sharp prose, three thousand reactions before lunchtime. The claim: if your AI agent executes code inside a standard Docker container, you are one prompt away from a breached host. The argument: containers share the host kernel, agents run untrusted code by definition, and a kernel exploit reaches straight through container isolation. The recommended fix: ditch Docker for hardware-virtualised microVMs like LangChain's newly released LangSmith Sandboxes.

The post is rhetorically warm and architecturally correct in shape. It also lands at the exact moment the receipts arrived.

Three weeks before it went viral, in November 2025, the runc maintainers disclosed three separate container-escape CVEs in a coordinated release. CVE-2025-31133. CVE-2025-52565. CVE-2025-52881. All three allow a malicious container to break out of runc-based isolation and write to host files it should never reach, including entries in the host's procfs. They affect Docker, containerd, Kubernetes, and effectively every managed Kubernetes service offered by the major cloud providers. The fixes shipped in runc 1.2.8, 1.3.3, and 1.4.0-rc.3.

If you needed a concrete proof point that the threat model is not theoretical, the proof point landed on its own.

This piece reads the architectural shift honestly. What the threat model is. What LangSmith Sandboxes, E2B, Modal, Daytona, and the Firecracker-DIY path each offer. The trade-offs the marketing leaves out. And the four paths a team should be choosing between, matched to engineering capacity and regulatory posture.

Key takeaways

Three runc CVEs disclosed in November 2025 (CVE-2025-31133, CVE-2025-52565, CVE-2025-52881) proved the container-escape threat model is operationalised, not theoretical. They affect Docker, containerd, Kubernetes, and every managed K8s service.
Containers were designed to isolate vetted application code, not untrusted LLM-generated code. The defensive posture is "do not share a kernel with the untrusted code in the first place": hardware-virtualised microVMs such as Firecracker on KVM, or, as a lighter and weaker alternative, a userspace kernel such as gVisor.
The market has consolidated around five real options: LangSmith Sandboxes (ecosystem-native), E2B (largest community, Firecracker), Modal (only path with GPU support, gVisor), Daytona (hardened Docker, fastest cold start but weakest isolation), and Firecracker direct (DIY).
Path 0 is the cheapest sandbox: do not execute code at all. Most production agents need retrieval, classification, and tool-calling — not a Python REPL. Audit your agents and remove the capability surface.
The decision logic is three questions: does the agent actually need to execute arbitrary code, does compliance require in-VPC operation, and are you already on LangChain? Each answer maps cleanly onto one of the four paths.

Figure 1. Head-to-head comparison of container and microVM isolation for agent code execution. The microVM premium buys a hardware boundary the container cannot; the rubric decides when the premium is worth paying.

Attribute	Container (runc, gVisor, Daytona-hardened Docker)	MicroVM (Firecracker, Kata, Cloud Hypervisor)
Isolation	Shared host kernel	Hardware virtualisation
Cold start	5-100ms	125-300ms
Memory	5-15MB	50-128MB+
Cost	30-50% cheaper per second	Baseline
Failure mode	Kernel CVE leads to host breach	Guest kernel crash does not reach host
Best for	Constrained REPLs and inference runtimes	Untrusted LLM-generated code, multi-tenant SaaS, regulated workloads
Used by	Daytona, Modal, GKE Sandbox, AWS App Runner	AWS Lambda, Fargate, LangSmith Sandboxes, E2B, Anthropic Sandbox skill

Decision rubric: is the code untrusted, is the blast radius multi-tenant, and is the workload regulated or in-VPC? MicroVM is the honest default for arbitrary-code agents.

The Model Is Not Magic — It Is Syscalls

Containers were designed to isolate known, vetted application code — your services, written by your engineers, packaged for deployment. They were never designed to isolate untrusted code arriving over an API. The isolation surface is process namespaces, cgroups, capability dropping, and seccomp filters, all running on top of a single shared host kernel. Think of it as a curtain rather than a wall.

For known, vetted application code, that surface is adequate. For untrusted code generated by an LLM, executing inside the container with the same kernel, the surface is approximately what you would expect — hard enough to resist incidental probing, soft enough to fall to a determined kernel exploit.

The runc CVEs from November are the operationalisation of that softness. CVE-2025-31133 abuses a symlink replacement during container init to bind-mount an attacker-controlled target read-write into the container. CVE-2025-52565 exploits a race on the /dev/console bind mount before security protections are fully in place. CVE-2025-52881 lets an attacker circumvent LSM relabel protections and turn ordinary runc writes into arbitrary writes against the host procfs. None of these are theoretical. All three landed with proof-of-concept exploits in coordinated disclosure.
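A first triage step is checking whether a host's runc is at or past one of the fixed releases this article names (1.2.8, 1.3.3, 1.4.0-rc.3). The sketch below is a minimal version comparison, not an official tool. The function name is hypothetical, and it assumes older release lines received no upstream fix and later lines include one. Distribution packages often backport fixes without changing the upstream version number, so treat a False result as a prompt to check your vendor's advisory rather than as proof of exposure.

```python
import re

# Matches "1.2.8" or "1.4.0-rc.3" anywhere in the text, e.g. in `runc --version` output.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-rc\.?(\d+))?")


def is_runc_patched(version_text: str) -> bool:
    """Return True if the upstream version is at or past a fixed release."""
    match = _VERSION_RE.search(version_text)
    if match is None:
        raise ValueError(f"no runc version found in {version_text!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    # A final release sorts after every release candidate of the same version.
    rc = int(match.group(4)) if match.group(4) else float("inf")

    if (major, minor) < (1, 2):
        return False  # assumption: older release lines got no upstream fix
    if (major, minor) == (1, 2):
        return patch >= 8
    if (major, minor) == (1, 3):
        return patch >= 3
    if (major, minor) == (1, 4):
        return (patch, rc) >= (0, 3)
    return True  # assumption: later release lines include the fixes
```

Feeding it the first line of `runc --version` output, for example `is_runc_patched("runc version 1.2.7")`, is enough for a fleet-wide inventory script.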

The viral post's framing, that a simple Python script generated by an LLM could root your entire infrastructure, is rhetorically punchy and architecturally honest. A Python script written by an LLM is not magic. It is a chain of system calls. If those calls reach a kernel or container-runtime vulnerability that has been disclosed but not patched on your hosts, the script roots the box. The LLM does not have to know it is exploiting anything. The exploit was already there.
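To make "a chain of system calls" concrete, here is a harmless sketch that round-trips a few bytes through a temporary file using Python's thin `os` wrappers. The function name is illustrative, and the syscall names in the comments are the typical Linux mappings rather than a guarantee for every platform or libc.

```python
import os
import tempfile


def write_then_read(message: bytes) -> bytes:
    """Round-trip bytes through a temp file; each step is a kernel request."""
    directory = tempfile.mkdtemp()                        # mkdir(2)
    path = os.path.join(directory, "note.txt")

    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)   # open(2) / openat(2)
    try:
        os.write(fd, message)                             # write(2)
    finally:
        os.close(fd)                                      # close(2)

    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, len(message))                  # read(2)
    finally:
        os.close(fd)

    os.unlink(path)                                       # unlink(2) / unlinkat(2)
    os.rmdir(directory)                                   # rmdir(2)
    return data
```

Nothing here is an exploit. The point is that every line ends up as a request to the kernel, and in a container that is the host's kernel. On Linux, running a script like this under `strace` shows the calls as they happen.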

The right defensive posture for a system that runs untrusted code is not "patch runc and hope." It is "do not share a kernel with the untrusted code in the first place."

The Hypervisor Is the Wall the Container Was Not

A microVM gives the untrusted code its own Linux kernel, running on top of the host's hypervisor — typically KVM on Linux. The agent's code can crash that kernel. Can compromise that kernel. Can install rootkits inside that kernel. None of that reaches the host, because the host is running a different kernel behind the hypervisor's hardware-enforced boundary. CPU-level virtualisation extensions — Intel VT-x, AMD-V — make the boundary architecturally hard rather than software-enforced. If the container is a curtain, the microVM is a fire door.

Two pieces of technology you should know by name.

Firecracker is AWS's open-source microVM technology. It boots a stripped-down Linux kernel in roughly 125 milliseconds, on commodity hardware, with memory overhead in the single-digit-megabytes range. Firecracker powers AWS Lambda and Fargate, where it has executed tens of trillions of customer functions in production. It is the substrate most modern agent sandboxes are built on.
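For a sense of how small a microVM definition is, the sketch below builds a Firecracker JSON configuration with one kernel, one root drive, and a CPU and memory budget. The field names follow Firecracker's config-file format as we understand it. The helper name, the default sizes, and the read-only root default are assumptions, and the kernel and rootfs paths you pass in are yours to supply.

```python
import json


def firecracker_config(kernel_path: str, rootfs_path: str,
                       vcpus: int = 1, mem_mib: int = 128,
                       read_only_root: bool = True) -> str:
    """Return a JSON document describing a single-drive microVM."""
    config = {
        "boot-source": {
            "kernel_image_path": kernel_path,
            # Serial console, reboot via keyboard controller, no PCI bus.
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [
            {
                "drive_id": "rootfs",
                "path_on_host": rootfs_path,
                "is_root_device": True,
                # Assumption: the guest image tolerates a read-only root.
                "is_read_only": read_only_root,
            }
        ],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }
    return json.dumps(config, indent=2)
```

Written to a file, a config like this is typically passed to the `firecracker` binary with `--api-sock` and `--config-file`. Networking, jailing, and lifecycle management are deliberately left out, and they are where most of the DIY effort goes.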

gVisor is Google's userspace-kernel reimplementation. It is not a microVM. Instead, it intercepts system calls from the guest and re-implements them in a userspace Go process, so the guest's syscalls never reach the host kernel directly. gVisor is lighter than a full microVM (no separate guest kernel), but its isolation guarantee is weaker (the boundary is a Go process, not hardware virtualisation). Modal uses it. Google uses it for GKE Sandbox.

Both are legitimately stronger than container isolation. Firecracker is the stronger of the two, with the trade-off being slightly higher cold-start cost and slightly higher memory overhead per sandbox.

Five Options, One Honest Pass Each

The agent-sandbox market has consolidated around five real options. Each one occupies a specific niche.

Platform	Isolation	Cold start	Notes
LangSmith Sandboxes	Hardware-virtualised microVM	Warm-pool backed	LangChain's GA product. Native integration with Deep Agents and Open SWE. Templates, auth proxy, persistent state over WebSockets.
E2B	Firecracker microVMs	~150ms	Largest community in the AI sandbox space. $35M funded. Battle-tested SDK. Purpose-built for untrusted code execution.
Modal	gVisor	Sub-second	The outlier — only platform where a sandbox can hold a GPU. Sits inside a broader compute platform covering inference, training, batch jobs.
Daytona	Docker containers (hardened)	Sub-90ms	Fastest cold start on the list. NOT microVM. $24M Series A. Trades isolation strength for boot latency. Honest about this.
Firecracker direct	Firecracker microVM	~125ms	The open-source DIY path. You operate the orchestration, the warm pool, the network plane, the auth surface. Powers AWS Lambda and Fargate.

A few things worth noting that the marketing tends to leave out.

Daytona is not a microVM. The viral post's framing would put Daytona on the wrong side of the line, despite Daytona being the fastest sandbox on the list and the chosen substrate for many production agent platforms. Daytona has chosen to optimise for boot latency over isolation strength, with a hardened Docker container as the runtime. For low-stakes workloads where the agent's code surface is constrained — a Python REPL with no shell access, no network — this is a defensible choice. For untrusted code with shell access, it is not.
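What "constrained" can look like in practice is sketched below: a helper that assembles a `docker run` command line with no network, no Linux capabilities, a read-only root filesystem, and resource limits. This is an illustration built from standard Docker CLI flags, not a description of how Daytona hardens its containers. The function name and the specific limits are assumptions. The optional `runtime` argument lets you swap in gVisor's `runsc` if it is installed on the host.

```python
from typing import List, Optional


def hardened_docker_cmd(image: str, command: List[str],
                        runtime: Optional[str] = None) -> List[str]:
    """Build a docker run argv that narrows the container's attack surface."""
    args = [
        "docker", "run", "--rm",
        "--network=none",                    # no network access at all
        "--cap-drop=ALL",                    # drop every Linux capability
        "--security-opt=no-new-privileges",  # block setuid privilege gain
        "--read-only",                       # immutable root filesystem
        "--pids-limit=64",                   # cap process count (fork bombs)
        "--memory=256m",                     # cap memory
        "--user=65534:65534",                # run as an unprivileged uid/gid
    ]
    if runtime:
        args.append(f"--runtime={runtime}")  # e.g. "runsc" for gVisor
    return args + [image] + command
```

Even with every flag set, the code still runs against the host kernel (or, with `runsc`, against gVisor's userspace kernel). The flags shrink the surface; they do not add a hardware boundary.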

Modal is the only sandbox you can put a GPU in. This matters for any agent doing ML inference, training, or GPU-accelerated data work. The trade-off is gVisor's weaker isolation. For agent workloads where the GPU is the point — a research assistant running CUDA kernels, an inference agent fine-tuning small models — Modal is the only credible option in the public market today.

Cold start hides a longer story. Firecracker boots in 125ms, but provisioning the runtime, restoring filesystem state, installing the agent's dependencies, and warming the network plane easily takes another two to ten seconds the first time. Warm pools — pre-provisioned sandboxes sitting idle and ready — are how every serious vendor closes this gap. LangSmith Sandboxes, E2B, and Modal all run warm pools. DIY Firecracker means you build the warm pool yourself.
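The warm-pool idea is simple enough to sketch. The class below is a minimal, single-process illustration, assuming a caller-supplied `boot` function that performs the slow provisioning. The class and method names are hypothetical and do not reflect any vendor's implementation.

```python
import queue
from typing import Callable, Generic, Tuple, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keep up to `size` pre-booted sandboxes ready to hand out."""

    def __init__(self, boot: Callable[[], T], size: int) -> None:
        self._boot = boot
        self._size = size
        self._ready: "queue.Queue[T]" = queue.Queue()
        self.refill()

    def refill(self) -> int:
        """Boot sandboxes until the pool is full; return how many were booted."""
        booted = 0
        while self._ready.qsize() < self._size:
            self._ready.put(self._boot())
            booted += 1
        return booted

    def acquire(self) -> Tuple[T, bool]:
        """Return (sandbox, was_warm). Falls back to a cold boot if drained."""
        try:
            return self._ready.get_nowait(), True
        except queue.Empty:
            return self._boot(), False
```

Two design choices are worth noting. Sandboxes are never returned to the pool, because anything that has run untrusted code should be destroyed rather than reused. And `refill` is separate from `acquire` so that a background worker can pay the boot cost off the request path.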

Per-second cost matters at agent scale. Container compute costs roughly 30-50% less per second than microVM compute at equivalent CPU/memory specs. For high-volume agent traffic — thousands of code executions per hour — the cost difference compounds into a material line item. Worth modelling against the security posture decision, not assumed away.
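A rough model makes the line item visible. The sketch below multiplies traffic by duration by a per-second price, then applies the 30-50% container discount quoted above. The function names, the 730-hour month, and any price you plug in are assumptions, not vendor figures.

```python
def monthly_cost(executions_per_hour: float, seconds_per_execution: float,
                 price_per_second: float, hours_per_month: float = 730.0) -> float:
    """Compute spend for one month of sandbox execution time."""
    busy_seconds = executions_per_hour * seconds_per_execution * hours_per_month
    return busy_seconds * price_per_second


def container_saving(microvm_monthly: float, discount: float) -> float:
    """Monthly saving if containers cost `discount` (0.30-0.50) less per second."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be a fraction between 0 and 1")
    return microvm_monthly * discount
```

With a hypothetical microVM price of $0.00005 per second, 5,000 executions per hour at 10 seconds each comes to roughly $1,825 per month, so the container discount is worth somewhere around $550 to $910 per month. Whether that is material depends on what a host breach would cost you.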

The Cheapest Sandbox Is the One You Do Not Run

The viral post identified three implementation paths. We would add a fourth, which most enterprise teams should be on before any of the other three.

Path 0 — Do not execute code

The cheapest sandbox is the one you do not run.

Most production agents do not need code execution. They need retrieval, classification, summarisation, structured extraction, and tool-calling against well-defined APIs. A documentation assistant does not need a Python REPL. A customer-support agent does not need a shell. An evidence-pack generator does not need to install npm packages.

Before reaching for any sandbox technology, the question to answer is: what specifically must this agent do that requires executing arbitrary code? If the answer is "nothing specific, we thought it might be useful," the right architectural move is to remove the code-execution path entirely. Code execution is a capability surface. Capability surfaces are attack surfaces. The agent that does not need the surface should not have it.

For most enterprise AI workloads we ship into regulated environments, Path 0 is the right path.
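Putting Path 0 into practice starts with an inventory. The sketch below assumes you can export a mapping of agent names to the tool names each one is configured with. Both the inventory shape and the set of tool names treated as code execution are hypothetical and should be adapted to your framework.

```python
from typing import Dict, List, Set

# Assumption: tool names that imply arbitrary code execution in your stack.
CODE_EXEC_TOOLS: Set[str] = {"python_repl", "shell", "bash", "code_interpreter"}


def agents_with_code_execution(
    inventory: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return each agent that exposes a code-execution tool, with the tools found."""
    flagged: Dict[str, List[str]] = {}
    for agent, tools in inventory.items():
        risky = sorted(set(tools) & CODE_EXEC_TOOLS)
        if risky:
            flagged[agent] = risky
    return flagged
```

Every agent this returns needs a written answer to the question above. If no one can name the specific task that requires the tool, remove it.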

Path 1 — Ecosystem-Native (LangSmith Sandboxes)

If you are already building on LangChain or LangGraph, LangSmith Sandboxes removes the infrastructure burden. Hardware-virtualised microVMs, native integration with Deep Agents and Open SWE, sandbox templates with reusable image and resource configurations, warm pools, an auth proxy for injecting credentials without hardcoding secrets, persistent state across sessions, long-running session support over WebSockets. Generally available since 2026.

This is the path for teams whose value-add is the agent itself, not the sandbox infrastructure underneath it. You write the agent, LangChain runs the safety surface. The trade-off is platform lock-in — your sandbox lifecycle is now bound to LangSmith's product roadmap and pricing.

Path 2 — API-First (E2B, Modal, Daytona)

If you are not in the LangChain ecosystem, or you want to keep the sandbox choice independent of the agent framework, the API-first sandbox providers are the right path. E2B for general-purpose secure code execution with the strongest isolation among the API-first set. Modal for anything involving a GPU. Daytona for low-latency, low-stakes workloads where Docker isolation is acceptable.

The trade-off is the same as any managed platform: you give up some control over the runtime surface in exchange for not having to operate it. For most mid-market teams, this is the right trade-off.

Path 3 — DIY (Firecracker, Kata Containers)

If your security or compliance posture requires the sandbox to run inside your own VPC, on your own hardware, or on-prem — and FSI, defence-adjacent, and certain healthcare workloads do — you build it. Firecracker is the open-source substrate. Kata Containers provides a container-API-compatible runtime backed by microVMs. The orchestration, the warm pool, the auth proxy, the persistent-state layer, the audit trail, the cost telemetry — all of that, you build.

This is meaningful engineering work. Two to three engineer-quarters at minimum to get a production-grade Firecracker-based sandbox surface stood up with the surrounding operational infrastructure. The reason to take this path is not engineering preference. It is the regulator's posture, the data residency requirement, or a Cybersecurity-Assurance-Standard requirement that the sandbox must not egress to a third-party API. We covered the architectural pattern for this kind of isolation, applied to LLM endpoints rather than sandbox compute, in The Blueprint for Air-Gapped LLM Deployments on AWS Bedrock.

Three Questions, Four Paths

Three questions, in order. Answer them honestly.

Does this agent specifically need to execute arbitrary code, or just call well-defined tools? If the answer is tools, Path 0. If the answer is code, continue.

Does your security or compliance posture require the sandbox to run inside your own infrastructure? If yes, Path 3. If no, continue.

Are you already operating on LangChain or LangGraph, and want the sandbox bundled? If yes, Path 1. If no, Path 2.
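The three questions reduce to a short function. The sketch below encodes them in the order given, with the four paths as return values. The function and parameter names are ours, chosen for illustration.

```python
def choose_path(needs_arbitrary_code: bool, must_self_host: bool,
                on_langchain: bool) -> str:
    """Map the three questions, asked in order, onto one of the four paths."""
    if not needs_arbitrary_code:
        return "Path 0: do not execute code"
    if must_self_host:
        return "Path 3: DIY (Firecracker, Kata Containers)"
    if on_langchain:
        return "Path 1: ecosystem-native (LangSmith Sandboxes)"
    return "Path 2: API-first provider (E2B, Modal, Daytona)"
```

The order matters. A team on LangChain with an in-VPC requirement lands on Path 3, not Path 1, because the compliance question is asked first.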

This framework gets approximately the right answer for approximately every enterprise team. The teams it fails for — frontier-research labs running custom agent harnesses, sovereign-cloud deployments with bespoke compliance, on-device agent deployments at the edge — already know they are exceptions and will engineer the bespoke answer regardless.

The Conversation That Loses the Deal

Three things to leave the meeting with.

The architectural shift toward hardware-virtualised microVM sandboxes is real, the runc CVEs from November are the receipt, and any production deployment of agent code execution that is still running on a hardened Docker container should treat the migration as a Q3 priority, not a hypothetical future concern. The threat model is operationalised.

The cheapest defensive posture is to not execute the untrusted code in the first place. Most production agents do not need this capability, and the engineering pattern that the field has converged on, "every agent gets a Python REPL just in case," is a security debt the market is starting to charge interest on. Audit your agents. Remove the code-execution capability from any agent that does not specifically need it.

The sandbox-as-a-service market is mature enough that building Firecracker orchestration from scratch is a deliberate choice driven by compliance, not by lack of options. LangSmith Sandboxes, E2B, Modal, and Daytona are real products with real differentiation. Pick the one that matches your ecosystem and your posture. Build the DIY path only if the regulator's letter on your desk explicitly requires it.

The teams shipping production AI to regulated buyers in 2026 are going to be the ones who can answer "how do you isolate untrusted code generated by your agents?" in one paragraph, with specific names of products, specific isolation technologies, and a specific decision logic. Anyone who shows up with "we run it in Docker, we patch quickly" is going to lose the deal, and the worst version of that meeting is the one where the buyer's CISO has read the runc CVEs and you have not.

FAQs

Are the November 2025 runc CVEs really enough to change the architecture?

Yes. All three landed with proof-of-concept exploits in coordinated disclosure, and they affect every managed Kubernetes service offered by every major cloud provider. The viral framing, that a Python script generated by an LLM could root the host, is architecturally honest. The script is just a chain of system calls; if those reach a disclosed-but-unpatched vulnerability in your kernel or container runtime, the script roots the box.

What's the difference between Firecracker and gVisor?

Firecracker is a hardware-virtualised microVM — the untrusted code gets its own Linux kernel running on top of KVM, with hardware-enforced isolation via Intel VT-x or AMD-V. gVisor is a userspace kernel reimplementation that intercepts syscalls in a Go process, so the guest's calls never reach the host kernel directly. Firecracker is the stronger guarantee; gVisor is lighter but with a weaker isolation surface.

Why is Daytona on the list if it's not a microVM?

Because it's the fastest sandbox on the list (sub-90ms cold start) and the chosen substrate for many production agent platforms that have deliberately optimised for boot latency over isolation strength. For constrained workloads, such as a Python REPL with no shell and no network, that trade-off is defensible. For untrusted code with shell access, it is not. Daytona is honest about the trade.

When does the DIY Firecracker path actually make sense?

When the regulator's letter on the desk requires the sandbox to run inside your own VPC, on your own hardware, or on-prem — typically FSI, defence-adjacent, and certain healthcare workloads. Expect two to three engineer-quarters minimum to build the orchestration, warm pool, network plane, auth surface, and audit trail. The reason is compliance, not engineering preference.

What should most enterprise teams do first?

Path 0 — audit the agents, find the ones that have a code-execution capability they don't specifically need, and remove it. Every Python REPL added "just in case" is security debt the market is now charging interest on. For agents that genuinely need code execution, pick the path that matches your ecosystem and regulatory posture.

Companion content

The Blueprint for Air-Gapped LLM Deployments on AWS Bedrock — the architectural pattern that DIY-Firecracker sandboxes need to inherit from, applied to model endpoints
Designing Strict RBAC for Enterprise Knowledge Bases — the audit pattern for any agent action, including code execution
Mitigating Non-Deterministic AI Failures in Production Systems — drift monitoring extended to the sandbox surface
The Self-Improving Agent: SIA's Three Levers and the Production Pattern — what the safety surface looks like when the agent edits its own code AND executes it
Why 95% of Enterprise AI Pilots Fail at the Deployment Phase — the strategic frame this fits inside

Sources

LangSmith Sandboxes GA announcement: langchain.com/blog
runc CVE disclosures (November 2025): Sysdig technical overview
CNCF runc breakout overview: cncf.io/blog
AI sandbox landscape comparison: Northflank, Spheron
Firecracker microVM architecture: AWS open-source repository
