A viral LinkedIn post crossed every AI-engineering feed last week with a sharp claim. If your AI agent executes code inside a standard Docker container, you are one prompt away from a breached host. The argument: containers share the host kernel, agents run untrusted code by definition, and a kernel exploit reaches straight through container isolation. The recommended fix: ditch Docker for hardware-virtualised microVMs, like LangChain's newly released LangSmith Sandboxes.
The post is rhetorically heated and architecturally correct in shape. It also landed at the exact moment the receipts arrived.
In November 2025, three weeks before the post went viral, the runc maintainers disclosed three separate container-escape CVEs in a coordinated release. CVE-2025-31133. CVE-2025-52565. CVE-2025-52881. All three allow a malicious container to break out of runc-based isolation and gain write access to sensitive host-level procfs files. The CVEs affect Docker, containerd, Kubernetes, and effectively every managed Kubernetes service offered by every major cloud provider. The fixes shipped in runc 1.2.8, 1.3.3, and 1.4.0-rc.3.
If you needed a concrete proof point that the threat model is not theoretical, the proof point landed on its own.
This piece reads the architectural shift honestly. What the threat model actually is. What LangSmith Sandboxes, E2B, Modal, Daytona, and the Firecracker-DIY path actually offer. The trade-offs the marketing leaves out. And the four real paths a team should actually be choosing between, matched to engineering capacity and regulatory posture.
The threat model, concretely
Containers were designed to isolate known, vetted application code — your services, written by your engineers, packaged for deployment. They were never designed to isolate untrusted code arriving over an API. The isolation surface is process namespaces, cgroups, capability dropping, and seccomp filters — all of which run on top of a single shared host kernel.
For known, vetted application code, that surface is adequate. For untrusted code generated by an LLM and executed against that same shared kernel, the surface is about what you would expect: hard enough to resist incidental probing, soft enough to fall to a determined kernel or runtime exploit.
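The shared-kernel point can be checked from inside any container. The sketch below is a minimal illustration, assuming a Linux host: it reads the seccomp mode and effective capability mask from `/proc/self/status` and reports the kernel release, which inside a container is the host's kernel. The function names are ours, not part of any library.

```python
import os


def parse_proc_status(status_text):
    """Extract the seccomp mode and effective capabilities from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # Kernel-defined values for the Seccomp field.
    seccomp_modes = {"0": "disabled", "1": "strict", "2": "filter"}
    return {
        "seccomp": seccomp_modes.get(fields.get("Seccomp", ""), "unknown"),
        # CapEff is a hexadecimal bitmask of the effective capabilities.
        "cap_eff": int(fields.get("CapEff", "0"), 16),
    }


def describe_isolation(status_path="/proc/self/status"):
    """Report the software isolation settings of the current process (Linux only)."""
    with open(status_path) as fh:
        report = parse_proc_status(fh.read())
    # Inside a container this is the host's kernel release: there is only one kernel.
    report["kernel_release"] = os.uname().release
    return report
```

Running `describe_isolation()` on the host and again inside a container on that host should show the same `kernel_release`, with only the seccomp and capability fields differing.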
The runc CVEs from November are the operationalisation of that softness. CVE-2025-31133 abuses a symlink replacement during container init to bind-mount an attacker-controlled target read-write into the container. CVE-2025-52565 exploits a race on the /dev/console bind mount before security protections are fully in place. CVE-2025-52881 lets an attacker circumvent LSM relabel protections and turn ordinary runc writes into arbitrary writes against the host procfs. None of these are theoretical. All three landed with proof-of-concept exploits in coordinated disclosure.
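For teams that cannot leave runc immediately, the fixed releases named earlier (1.2.8, 1.3.3, 1.4.0-rc.3) translate into a simple fleet check. The following is a sketch under stated assumptions: it takes a version string you have already collected (for example from `runc --version`), treats release lines older than 1.2 as unpatched because no fix is listed for them here, and assumes lines newer than 1.4 carry the fixes. Confirm both assumptions against the runc advisories before relying on it.

```python
import re

FINAL = 10**6  # sentinel: a final release sorts after any release candidate

# First fixed release per release line, as listed in this article.
FIXED_BY_LINE = {
    (1, 2): (1, 2, 8, FINAL),
    (1, 3): (1, 3, 3, FINAL),
    (1, 4): (1, 4, 0, 3),  # 1.4.0-rc.3
}


def parse_runc_version(text):
    """Parse strings like '1.2.8' or 'v1.4.0-rc.3' into a comparable tuple."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)(?:-rc\.?(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised runc version: {text!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    rc = int(match.group(4)) if match.group(4) else FINAL
    return (major, minor, patch, rc)


def is_patched(text):
    """Return True if the version is at or past the fixed release for its line."""
    version = parse_runc_version(text)
    line = version[:2]
    if line in FIXED_BY_LINE:
        return version >= FIXED_BY_LINE[line]
    # Assumption: newer lines include the fixes, older lines do not.
    return line > (1, 4)
```

A check like this tells you whether a known hole is closed. It says nothing about the next one, which is the argument for not sharing a kernel at all.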
The viral post's framing ("a simple Python script generated by an LLM could root your entire infrastructure") is rhetorically punchy and architecturally honest. A Python script written by an LLM is not magic. It is a chain of system calls. If those calls reach a vulnerability in the kernel or the container runtime that has been disclosed but not patched in your environment, the script roots the box. The LLM does not have to know it is exploiting anything. The exploit was already there.
The right defensive posture for a system that runs untrusted code is not "patch runc and hope." It is "do not share a kernel with the untrusted code in the first place."
What hardware-virtualised isolation actually means
A microVM gives the untrusted code its own Linux kernel, running on top of the host's hypervisor (typically KVM on Linux). The agent's code can crash that kernel. Can compromise that kernel. Can install rootkits inside that kernel. None of that reaches the host, because the host is running a different kernel behind the hypervisor's hardware-enforced boundary. CPU-level virtualisation extensions — Intel VT-x, AMD-V — make the boundary architecturally hard rather than software-enforced.
Two pieces of technology you should know by name.
Firecracker is AWS's open-source microVM technology. It boots a stripped-down Linux kernel in roughly 125 milliseconds, on commodity hardware, with memory overhead in the single-digit-megabytes range. Firecracker powers AWS Lambda and Fargate, where it has executed tens of trillions of customer functions in production. It is the substrate most modern agent sandboxes are built on.
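To make "its own Linux kernel" concrete, here is a sketch of the JSON document a Firecracker microVM can be started from. The field names follow Firecracker's config-file format as best we can tell, and the kernel and root filesystem paths are hypothetical placeholders. Verify the schema against the Firecracker documentation for the release you run.

```python
import json


def build_microvm_config(kernel_path, rootfs_path, vcpus=1, mem_mib=256):
    """Build a Firecracker-style VM configuration as a plain dict."""
    return {
        "boot-source": {
            # The guest boots this kernel image, not the host's kernel.
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [
            {
                "drive_id": "rootfs",
                "path_on_host": rootfs_path,
                "is_root_device": True,
                "is_read_only": False,
            }
        ],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }


def write_config(path, config):
    """Serialise the configuration to disk for the VMM to consume."""
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2)
```

The point of the example is the `kernel_image_path` line: every sandbox names the kernel it boots, and that kernel is the one the untrusted code gets to attack.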
gVisor is Google's userspace-kernel reimplementation. It is not a microVM. Instead, it intercepts system calls from the guest and re-implements them in a userspace Go process, so the guest's syscalls never reach the host kernel directly. gVisor is lighter than a full microVM (no separate guest kernel), but its isolation guarantee is weaker (the boundary is a Go process, not hardware virtualisation). Modal uses it. Google uses it in GKE Sandbox and Cloud Run.
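Because gVisor ships as an OCI runtime (`runsc`), trying it is mostly a matter of selecting a different runtime for the same image. The sketch below only builds the command line. It assumes `runsc` is installed and registered with the Docker daemon under the name `runsc`, and the image and command shown in the usage note are placeholders.

```python
def gvisor_run_argv(image, command, runtime="runsc"):
    """Build a `docker run` command line that executes under the given OCI runtime."""
    argv = [
        "docker", "run", "--rm",
        f"--runtime={runtime}",  # route the container through gVisor instead of runc
        "--network=none",        # no network unless the workload needs one
        image,
    ]
    return argv + list(command)
```

For example, `gvisor_run_argv("python:3.12-slim", ["python", "-c", "print(1)"])` yields an argument list you can hand to `subprocess.run`.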
Both are legitimately stronger than container isolation. Firecracker is the stronger of the two, with the trade-off being slightly higher cold-start cost and slightly higher memory overhead per sandbox.
The landscape as it actually exists
The agent-sandbox market has consolidated around five real options. Each one occupies a specific niche.
| Platform | Isolation | Cold start | Notes |
|---|---|---|---|
| LangSmith Sandboxes | Hardware-virtualised microVM | Warm-pool backed | LangChain's GA product. Native integration with Deep Agents and Open SWE. Templates, auth proxy, persistent state over WebSockets. |
| E2B | Firecracker microVMs | ~150ms | Largest community in the AI sandbox space. $35M funded. Battle-tested SDK. Purpose-built for untrusted code execution. |
| Modal | gVisor | Sub-second | The outlier — only platform where a sandbox can hold a GPU. Sits inside a broader compute platform covering inference, training, batch jobs. |
| Daytona | Docker containers (hardened) | Sub-90ms | Fastest cold start on the list. NOT microVM. $24M Series A. Trades isolation strength for boot latency. Honest about this. |
| Firecracker direct | Firecracker microVM | ~125ms | The open-source DIY path. You operate the orchestration, the warm pool, the network plane, the auth surface. Powers AWS Lambda and Fargate. |
A few things worth noting that the marketing tends to leave out.
Daytona is not a microVM. The viral post's framing would put Daytona on the wrong side of the line, despite Daytona being the fastest sandbox on the list and the chosen substrate for many production agent platforms. Daytona has chosen to optimise for boot latency over isolation strength, with a hardened Docker container as the runtime. For low-stakes workloads where the agent's code surface is constrained — a Python REPL with no shell access, no network — this is a defensible choice. For untrusted code with shell access, it is not.
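What "constrained" can look like at the Docker level is sketched below. This is a generic hardening baseline built from standard `docker run` flags, not Daytona's actual configuration, which we have not inspected. The resource limits and the unprivileged user ID are illustrative values.

```python
HARDENING_FLAGS = [
    "--cap-drop=ALL",                     # drop every Linux capability
    "--security-opt=no-new-privileges",   # block setuid-style privilege gains
    "--network=none",                     # no network access
    "--read-only",                        # immutable root filesystem
    "--pids-limit=64",                    # cap process count (illustrative value)
    "--memory=256m",                      # cap memory (illustrative value)
    "--user=65534:65534",                 # run as an unprivileged user
]


def hardened_run_argv(image, command):
    """Build a `docker run` command line with a restrictive baseline."""
    return ["docker", "run", "--rm", *HARDENING_FLAGS, image, *command]
```

Every one of these flags narrows what the code can do. None of them changes the fact that the code is making system calls against the host kernel.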
Modal is the only sandbox you can put a GPU in. This matters for any agent doing ML inference, training, or GPU-accelerated data work. The trade-off is gVisor's weaker isolation. For agent workloads where the GPU is the point — a research assistant running CUDA kernels, an inference agent fine-tuning small models — Modal is the only credible option in the public market today.
Cold start hides a longer story. Firecracker boots in 125ms, but provisioning the runtime, restoring filesystem state, installing the agent's dependencies, and warming the network plane easily takes another two to ten seconds the first time. Warm pools — pre-provisioned sandboxes sitting idle and ready — are how every serious vendor closes this gap. LangSmith Sandboxes, E2B, and Modal all run warm pools. DIY Firecracker means you build the warm pool yourself.
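The warm-pool idea fits in a few lines. This is a minimal single-process sketch, not how any vendor named here implements it: `factory` stands in for whatever slow call provisions a sandbox, and a production pool would refill in the background and handle health checks and expiry.

```python
import queue


class WarmPool:
    """Keep a fixed number of pre-provisioned sandboxes ready to hand out."""

    def __init__(self, factory, size):
        self._factory = factory  # hypothetical callable that provisions one sandbox
        self._size = size
        self._ready = queue.Queue()
        self.refill()

    def refill(self):
        """Provision sandboxes until the pool is back at its target size."""
        while self._ready.qsize() < self._size:
            self._ready.put(self._factory())

    def acquire(self):
        """Return (sandbox, was_warm). Falls back to a cold start when empty."""
        try:
            return self._ready.get_nowait(), True
        except queue.Empty:
            return self._factory(), False
```

The second element of the returned tuple is worth logging: the cold-start rate is the number that tells you whether the pool is sized correctly for your traffic.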
Per-second cost matters at agent scale. Container compute costs roughly 30-50% less per second than microVM compute at equivalent CPU/memory specs. For high-volume agent traffic — thousands of code executions per hour — the cost difference compounds into a material line item. Worth modelling against the security posture decision, not assumed away.
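A rough model makes the line item visible. The sketch below applies the 30-50% range above to a microVM bill. The per-second price, execution volume, and duration in the usage note are hypothetical inputs, not quotes from any vendor.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(executions_per_hour, seconds_per_execution, price_per_second):
    """Monthly compute spend for a steady stream of sandboxed executions."""
    return (
        executions_per_hour
        * seconds_per_execution
        * price_per_second
        * HOURS_PER_MONTH
    )


def microvm_premium(executions_per_hour, seconds_per_execution,
                    microvm_price_per_second, container_discount):
    """Extra monthly spend for microVMs when containers cost `container_discount` less."""
    if not 0 <= container_discount < 1:
        raise ValueError("container_discount must be in [0, 1)")
    microvm = monthly_cost(
        executions_per_hour, seconds_per_execution, microvm_price_per_second
    )
    container = microvm * (1 - container_discount)
    return microvm - container
```

At 5,000 executions per hour, 10 seconds each, and a hypothetical microVM price of $0.00005 per second, the microVM bill is about $1,825 a month, and the premium over containers lands between roughly $548 and $913 depending on where in the 30-50% range you sit.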
The four real paths
The viral post identified three implementation paths. We would add a fourth, which most enterprise teams should be on before any of the other three.
Path 0 — Do not execute code
The cheapest sandbox is the one you do not run.
Most production agents do not need code execution. They need retrieval, classification, summarisation, structured extraction, and tool-calling against well-defined APIs. A documentation assistant does not need a Python REPL. A customer-support agent does not need a shell. An evidence-pack generator does not need to install npm packages.
Before reaching for any sandbox technology, the question to answer is: what specifically must this agent do that requires executing arbitrary code? If the answer is "nothing specific — we thought it might be useful," the right architectural move is to remove the code-execution path entirely. Code execution is a capability surface; capability surfaces are attack surfaces; the agent that does not need the surface should not have it.
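In code, Path 0 is an allowlist. The sketch below shows the shape, with hypothetical tool names and toy implementations: the agent can only invoke functions that were registered ahead of time, and a request for anything else, including a code-execution tool, fails closed.

```python
class ToolNotAllowed(Exception):
    """Raised when the agent asks for a tool that was never registered."""


def search_docs(query):
    """Hypothetical retrieval tool; a real one would query an index."""
    return f"results for {query!r}"


def classify_ticket(text):
    """Hypothetical classifier; a real one would call a model."""
    return "billing" if "invoice" in text.lower() else "general"


# The complete capability surface of the agent. Note what is absent:
# no shell, no Python REPL, no package installer.
ALLOWED_TOOLS = {
    "search_docs": search_docs,
    "classify_ticket": classify_ticket,
}


def dispatch(tool_name, **kwargs):
    """Run a registered tool, or refuse if the name is not on the allowlist."""
    if tool_name not in ALLOWED_TOOLS:
        raise ToolNotAllowed(f"tool {tool_name!r} is not registered")
    return ALLOWED_TOOLS[tool_name](**kwargs)
```

With this shape, adding code execution later is a reviewed change to one dictionary, not a default every agent inherits.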
For most enterprise AI workloads we ship into regulated environments, Path 0 is the right path.
Path 1 — Ecosystem-Native (LangSmith Sandboxes)
If you are already building on LangChain or LangGraph, LangSmith Sandboxes removes the infrastructure burden. Hardware-virtualised microVMs, native integration with Deep Agents and Open SWE, sandbox templates with reusable image and resource configurations, warm pools, an auth proxy for injecting credentials without hardcoding secrets, persistent state across sessions, long-running session support over WebSockets. Generally available since 2026.
This is the path for teams whose value-add is the agent itself, not the sandbox infrastructure underneath it. You write the agent, LangChain runs the safety surface. The trade-off is platform lock-in — your sandbox lifecycle is now bound to LangSmith's product roadmap and pricing.
Path 2 — API-First (E2B, Modal, Daytona)
If you are not in the LangChain ecosystem, or you want to keep the sandbox choice independent of the agent framework, the API-first sandbox providers are the right path. E2B for general-purpose secure code execution with the strongest isolation among the API-first set. Modal for anything involving a GPU. Daytona for low-latency, low-stakes workloads where Docker isolation is acceptable.
The trade-off is the same as any managed platform: you give up some control over the runtime surface in exchange for not having to operate it. For most mid-market teams, this is the right trade-off.
Path 3 — DIY (Firecracker, Kata Containers)
If your security or compliance posture requires the sandbox to run inside your own VPC, on your own hardware, or on-prem — and FSI, defence-adjacent, and certain healthcare workloads do — you build it. Firecracker is the open-source substrate. Kata Containers provides a container-API-compatible runtime backed by microVMs. The orchestration, the warm pool, the auth proxy, the persistent-state layer, the audit trail, the cost telemetry — all of that, you build.
This is meaningful engineering work. Two to three engineer-quarters at minimum to get a production-grade Firecracker-based sandbox surface stood up with the surrounding operational infrastructure. The reason to take this path is not engineering preference. It is the regulator's posture, the data residency requirement, or a Cybersecurity-Assurance-Standard requirement that the sandbox must not egress to a third-party API. We covered the architectural pattern for this kind of isolation, applied to LLM endpoints rather than sandbox compute, in The Blueprint for Air-Gapped LLM Deployments on AWS Bedrock.
A short decision framework
Three questions, in order. Answer them honestly.
1. Does this agent specifically need to execute arbitrary code, or just call well-defined tools? If the answer is tools, Path 0. If the answer is code, continue.
2. Does your security or compliance posture require the sandbox to run inside your own infrastructure? If yes, Path 3. If no, continue.
3. Are you already operating on LangChain or LangGraph, and want the sandbox bundled? If yes, Path 1. If no, Path 2.
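The three questions reduce to a short function, which is a useful way to confirm that the order matters. This is a direct transcription of the questions above; the parameter names and return strings are ours.

```python
def choose_path(needs_arbitrary_code, must_run_in_own_infra, on_langchain):
    """Apply the three questions in order and return the recommended path."""
    if not needs_arbitrary_code:
        return "Path 0: do not execute code"
    if must_run_in_own_infra:
        return "Path 3: DIY (Firecracker, Kata Containers)"
    if on_langchain:
        return "Path 1: ecosystem-native (LangSmith Sandboxes)"
    return "Path 2: API-first (E2B, Modal, Daytona)"
```

Note that a LangChain team with an in-VPC requirement lands on Path 3, not Path 1: the compliance question is asked first and wins.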
This framework gets approximately the right answer for approximately every enterprise team. The teams it fails for — frontier-research labs running custom agent harnesses, sovereign-cloud deployments with bespoke compliance, on-device agent deployments at the edge — already know they are exceptions and will engineer the bespoke answer regardless.
What this teaches us about enterprise scaling
Three things.
One. The architectural shift toward hardware-virtualised microVM sandboxes is real, the runc CVEs from November are the receipt, and any production deployment of agent code execution that is still running on a hardened Docker container should treat the migration as a Q3 priority, not a hypothetical future concern. The threat model is operationalised.
Two. The cheapest defensive posture is to not execute the untrusted code in the first place. Most production agents do not need this capability, and the engineering pattern that the field has converged on — every agent gets a Python REPL just in case — is a security debt the market is starting to charge interest on. Audit your agents. Remove the code-execution capability from any agent that does not specifically need it.
Three. The sandbox-as-a-service market is mature enough that building Firecracker orchestration from scratch is a deliberate choice driven by compliance, not by lack of options. LangSmith Sandboxes, E2B, Modal, and Daytona are real products with real differentiation. Pick the one that matches your ecosystem and your posture. Build the DIY path only if the regulator's letter on your desk explicitly requires it.
The teams shipping production AI to regulated buyers in 2026 are going to be the ones who can answer "how do you isolate untrusted code generated by your agents" in one paragraph, with specific names of products, specific isolation technologies, and a specific decision logic. Anyone who shows up with "we run it in Docker, we patch quickly" is going to lose the deal.
Companion content
- The Blueprint for Air-Gapped LLM Deployments on AWS Bedrock — the architectural pattern that DIY-Firecracker sandboxes need to inherit from, applied to model endpoints
- Designing Strict RBAC for Enterprise Knowledge Bases — the audit pattern for any agent action, including code execution
- Mitigating Non-Deterministic AI Failures in Production Systems — drift monitoring extended to the sandbox surface
- The Self-Improving Agent: SIA's Three Levers and the Production Pattern — what the safety surface looks like when the agent edits its own code AND executes it
- Why 95% of Enterprise AI Pilots Fail at the Deployment Phase — the strategic frame this fits inside
Sources
- LangSmith Sandboxes GA announcement: langchain.com/blog
- runc CVE disclosures (November 2025): Sysdig technical overview
- CNCF runc breakout overview: cncf.io/blog
- AI sandbox landscape comparison: Northflank, Spheron
- Firecracker microVM architecture: AWS open-source repository
How to engage
We design and ship agent execution architectures for regulated enterprises — sandbox selection against your specific posture, DIY Firecracker orchestration inside customer VPCs where compliance requires it, the migration path from hardened-Docker to microVM sandboxes for teams running code execution today. If your agents execute untrusted code in production and you have not yet had the runc-CVE conversation with your CISO, that conversation is overdue. Talk to us at creativeminds.dev/contact.
