Sandboxed Code Execution for AI Agents: E2B, Modal, and Firecracker in Practice
When your AI agent needs to run the code it writes, you can't let it touch your production servers. Here's how the main isolation options work and when to use each.
Anurag Verma
The moment an AI agent can write and execute code, you have a security problem. It does not matter how well you’ve prompted the model to be careful. Code execution is code execution. An agent that writes import os; os.system("rm -rf /") and runs it on your server is a serious incident, not a bug to be handled gracefully.
The answer is not to avoid code-executing agents. These are some of the most capable tools in the current wave of AI development — data analysis agents, coding assistants, test-running agents, automated debugging pipelines. The answer is isolation: run the agent’s code somewhere that cannot affect your infrastructure.
This piece covers the three main approaches: cloud sandbox APIs (E2B), serverless GPU/CPU containers (Modal), and microVM-based isolation (Firecracker). They solve the same underlying problem with different tradeoffs in latency, cost, control, and complexity.
What You’re Trying to Isolate
Before picking a tool, be clear about your threat model. Code-executing agents need isolation from:
- Your host system: file system access, process access, environment variables with secrets
- Other tenants: in multi-user systems, one user’s agent code should not see another’s data or resources
- Your network: agent code should not be able to make arbitrary outbound requests (to exfiltrate data or hit internal services), and traffic from the sandbox should never reach your internal infrastructure
- Resource exhaustion: a rogue agent should not be able to run a CPU-exhausting loop or allocate 50GB of memory
Different tools give you different amounts of control over each.
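One way to make the threat model concrete is to write it down as a policy object before choosing a tool. The sketch below is illustrative only: `SandboxPolicy`, its field names, and its default values are hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical policy record; field names are illustrative, not a vendor API."""
    # Host system: no host environment variable is exposed unless listed here.
    allowed_env_vars: frozenset = frozenset()
    # Other tenants: each run is tagged so storage and logs can be partitioned.
    tenant_id: str = "anonymous"
    # Network: outbound traffic is denied unless a host is allow-listed.
    egress_allowlist: frozenset = frozenset()
    # Resource exhaustion: hard caps on wall-clock time and memory.
    timeout_seconds: int = 60
    memory_mb: int = 512

    def violations(self) -> list:
        """Return human-readable problems with this policy (empty if none)."""
        problems = []
        markers = ("KEY", "TOKEN", "SECRET", "PASSWORD")
        secretish = [v for v in self.allowed_env_vars
                     if any(m in v.upper() for m in markers)]
        if secretish:
            problems.append(f"secret-looking env vars exposed: {sorted(secretish)}")
        if "*" in self.egress_allowlist:
            problems.append("egress allowlist contains a wildcard")
        if self.timeout_seconds <= 0 or self.memory_mb <= 0:
            problems.append("timeout and memory caps must be positive")
        return problems
```

Whatever tool you pick, each field then has to map onto a setting that tool actually enforces. A field with no enforcing setting is a gap in your isolation.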
E2B: The Fastest Path to Isolated Execution
E2B runs cloud sandboxes that launch in under 500ms; the examples here use its Code Interpreter SDK. Each sandbox is an isolated Linux environment with a shell, a Python environment, and a file system. When the sandbox is done, it disappears.
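from e2b_code_interpreter import Sandbox

# Reads the API key from the E2B_API_KEY environment variable.
# Newer SDK releases may expect Sandbox.create() instead of Sandbox().
sandbox = Sandbox()
try:
    # Execute arbitrary Python code in the sandbox
    execution = sandbox.run_code("""
import pandas as pd
import json
data = [{"name": "Alice", "score": 92}, {"name": "Bob", "score": 87}]
df = pd.DataFrame(data)
print(df.to_json(orient='records'))
""")
    print("".join(execution.logs.stdout))  # stdout output
    print(execution.error)                 # None unless the code raised
    print(execution.results)               # rich outputs (DataFrames, plots, etc.)
finally:
    # Always release the sandbox, even if the calls above fail
    sandbox.kill()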
The sandbox runs in E2B’s cloud. You don’t manage servers. The Python SDK handles sandbox lifecycle and returns structured outputs including text, errors, and rich types (matplotlib figures come back as base64 PNG, DataFrames as structured data).
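Because figures come back as base64 text rather than files, the caller has to decode them. The helper below is a minimal sketch using only the standard library. It assumes you have already pulled the base64 string off a result object (how you do that depends on the SDK version), and `save_png_result` is a hypothetical name.

```python
import base64
from pathlib import Path

# Every valid PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_png_result(png_base64: str, path: str) -> int:
    """Decode a base64 PNG string, check its signature, write it, return the byte count."""
    data = base64.b64decode(png_base64)
    if not data.startswith(PNG_SIGNATURE):
        # Treat sandbox output as untrusted: refuse anything that is not a PNG.
        raise ValueError("decoded data is not a PNG")
    Path(path).write_bytes(data)
    return len(data)
```

Checking the signature before writing is a small instance of the output validation discussed later in this piece.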
For AI coding agents, run_code accepts output callbacks, so you can show output in real time rather than waiting for the full execution:
with Sandbox() as sandbox:
    sandbox.run_code(
        """
import time
for i in range(5):
    print(f"Step {i+1} complete", flush=True)
    time.sleep(0.5)
""",
        # Called for each piece of stdout as it arrives
        on_stdout=lambda msg: print(msg.line, end="", flush=True),
    )
What E2B gives you: fast startup, managed infrastructure, Python/JS/TypeScript SDKs, network isolation by default, per-sandbox file systems.
What E2B doesn’t give you: GPU access (CPU-only unless you’re on specific plans), persistent storage between sandboxes (you manage that externally), or the ability to run entirely on your own infrastructure.
Pricing: $0.000224 per second of sandbox CPU time (as of early 2026). A 30-second analysis task costs about $0.007. For agents that run many short tasks, it’s cheap. For long-running computations, it adds up.
Modal: Serverless with GPU Access and More Control
Modal is a serverless compute platform oriented toward Python. You define your environment as code, and Modal runs your functions in isolated containers. It is not specifically an AI agent sandbox, but it is frequently used for that purpose because it gives you GPU access and more control over the execution environment.
import modal

app = modal.App("agent-executor")

# Define the isolated environment
image = modal.Image.debian_slim().pip_install(
    "pandas", "numpy", "matplotlib", "scikit-learn"
)


@app.function(image=image, timeout=120, memory=2048)
def run_agent_code(code: str) -> dict:
    """Run arbitrary agent code in an isolated Modal container."""
    import contextlib
    import io
    import traceback

    stdout_capture = io.StringIO()
    stderr_capture = io.StringIO()
    result = {"stdout": "", "stderr": "", "error": None, "return_value": None}
    namespace = {}
    try:
        # Redirect output only while the agent code runs; the context
        # managers restore sys.stdout and sys.stderr however the block exits
        with contextlib.redirect_stdout(stdout_capture), \
                contextlib.redirect_stderr(stderr_capture):
            exec(compile(code, "<agent>", "exec"), namespace)
        result["return_value"] = namespace.get("__result__")
    except Exception:
        result["error"] = traceback.format_exc()
    result["stdout"] = stdout_capture.getvalue()
    result["stderr"] = stderr_capture.getvalue()
    return result


# Calling from your application
if __name__ == "__main__":
    # app.run() starts an ephemeral Modal app for the duration of the block
    with app.run():
        output = run_agent_code.remote("""
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
__result__ = df.describe().to_dict()
""")
    print(output)
The @app.function decorator defines the container environment. Each call to run_agent_code.remote() runs in a container built from that image, with no access to your local file system and with configurable network access. Modal can route a call to a warm container that already served an earlier one, so do not rely on process-level state being clean between calls.
For GPU workloads (fine-tuning, inference at the agent level, image processing):
@app.function(
    image=image,
    gpu="A10G",
    timeout=300,
    memory=8192,
)
def run_gpu_agent_code(code: str) -> dict:
    # Same pattern, GPU available inside the function
    ...
What Modal gives you: GPU access, precise environment control, network policies, persistent volumes (if you need to share state between calls), and the ability to run the same code locally for testing.
What Modal doesn’t give you: the ~500ms startup time of E2B (Modal containers take a few seconds to cold-start, though warm containers are fast). Modal also requires more code to set up compared to E2B’s SDK.
Pricing: $0.00006/second for CPU functions, $0.000583/second for A10G GPU. Pay only for active execution time.
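To put the per-second rates quoted in this article side by side, a few lines of arithmetic are enough. This is a rough sketch: the rates are copied from the text above and may be out of date, `task_cost` is a hypothetical helper, and the calculation ignores everything beyond the per-second rate (idle time, plan fees, any separately billed resources).

```python
# Per-second rates as quoted in this article; check current vendor pricing
# before relying on them.
RATES_USD_PER_SECOND = {
    "e2b_cpu": 0.000224,
    "modal_cpu": 0.00006,
    "modal_a10g": 0.000583,
}


def task_cost(rate_key: str, seconds: float, runs: int = 1) -> float:
    """Cost in USD of `runs` executions lasting `seconds` each."""
    return RATES_USD_PER_SECOND[rate_key] * seconds * runs


if __name__ == "__main__":
    for key in RATES_USD_PER_SECOND:
        per_task = task_cost(key, seconds=30)
        per_10k = task_cost(key, seconds=30, runs=10_000)
        print(f"{key:>10}: ${per_task:.5f} per 30 s task, "
              f"${per_10k:,.2f} per 10,000 tasks")
```

For the 30-second E2B task this gives $0.00672, which rounds to the roughly $0.007 quoted earlier.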
Firecracker: microVMs When You Control the Infrastructure
Firecracker is the microVM technology developed by AWS and used in Lambda and Fargate. It creates VMs that boot in under 125ms and have the isolation properties of full VMs (separate kernel, separate network stack, separate file system) with the performance profile of containers.
Running Firecracker directly is an infrastructure project, not a library call. You manage the host, the VM images, the network configuration. The payoff: strong isolation guarantees, full control over what the environment looks like, and no per-execution vendor charges.
Here is the minimum viable Firecracker setup for a code execution service:
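# Install Firecracker (example for Amazon Linux 2023)
# Releases ship as versioned tarballs, so resolve the latest tag first
ARCH="$(uname -m)"
release_url="https://github.com/firecracker-microvm/firecracker/releases"
latest="$(basename "$(curl -fsSLI -o /dev/null -w '%{url_effective}' "${release_url}/latest")")"
curl -L "${release_url}/download/${latest}/firecracker-${latest}-${ARCH}.tgz" | tar -xz
mv "release-${latest}-${ARCH}/firecracker-${latest}-${ARCH}" firecracker
chmod +x firecracker
# You also need a kernel image and root filesystem
# These are typically pre-built for your use case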
The typical architecture: a pool of pre-warmed Firecracker microVMs, each with a minimal Linux kernel and a Python environment. When an agent needs to execute code, your orchestration layer grabs a VM from the pool, injects the code, runs it, collects output, and discards the VM (or resets it for reuse).
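The pool pattern described above can be sketched without any Firecracker specifics. In the code below, `FakeMicroVM` is a stand-in with placeholder methods, and `WarmPool` is a hypothetical name. A real implementation would boot VMs through the Firecracker API, talk to a guest agent, and refill the pool in the background rather than inline.

```python
import itertools
import queue
from dataclasses import dataclass


@dataclass
class FakeMicroVM:
    """Stand-in for a booted microVM; a real one would wrap a Firecracker process."""
    vm_id: int

    def run(self, code: str) -> str:
        # Placeholder: a real implementation would send `code` to a guest
        # agent and read back its output.
        return f"vm-{self.vm_id} ran {len(code)} bytes"

    def destroy(self) -> None:
        # Placeholder for killing the VM process and deleting its disk image.
        pass


class WarmPool:
    """Keeps `size` VMs booted ahead of demand; each VM serves exactly one job."""

    def __init__(self, size: int):
        self._ids = itertools.count(1)
        self._ready = queue.Queue()
        for _ in range(size):
            self._ready.put(self._boot())

    def _boot(self) -> FakeMicroVM:
        return FakeMicroVM(vm_id=next(self._ids))

    def execute(self, code: str, wait_seconds: float = 5.0) -> str:
        # Raises queue.Empty if no warm VM becomes available in time.
        vm = self._ready.get(timeout=wait_seconds)
        try:
            return vm.run(code)
        finally:
            vm.destroy()                   # never reuse a VM that ran untrusted code
            self._ready.put(self._boot())  # refill so the next caller finds a warm VM
```

This sketch takes the discard-after-use branch. Resetting a VM for reuse is faster but puts the burden on you to prove the reset really clears all guest state.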
Most teams using Firecracker at the application layer are building on top of it through Kata Containers or a custom orchestration service. Building this from scratch is a significant investment.
What Firecracker gives you: the strongest isolation short of dedicated hardware, full control, no per-execution vendor cost (you pay for the host machines), ability to run on-premises.
What Firecracker requires: infrastructure expertise, ongoing maintenance, a team that can debug VM networking and kernel issues.
Comparing the Three
| | E2B | Modal | Firecracker (DIY) |
|---|---|---|---|
| Startup time | ~300-500ms | 2-5s (cold), ~100ms warm | ~125ms (pre-warmed) |
| GPU support | No (basic plans) | Yes | Limited (verify pass-through support) |
| Network isolation | Default | Configurable | Full control |
| Infrastructure burden | None | Low | High |
| Cost model | Per-second execution | Per-second execution | Infrastructure |
| On-premises | No | No | Yes |
| Max execution time | 24h (depends on plan) | Configurable | Unlimited |
The Security Controls That Matter Most
Regardless of which isolation layer you use, three controls deserve explicit attention:
Network egress. Unless you restrict it, sandboxed code can make outbound HTTP requests, so an agent that processes sensitive data can exfiltrate it over the network. E2B lets you disable a sandbox's internet access; Modal's network policies are configurable. Whichever you use, test that your agent’s code cannot reach your internal services or arbitrary external hosts.
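A simple way to run that test is to execute a small probe inside the sandbox and read back what it could reach. The following is a minimal sketch using only the standard library. `can_connect` and `egress_report` are hypothetical helper names, and the targets you pass in are yours to choose.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and
        # unreachable networks.
        return False


def egress_report(targets: list) -> dict:
    """Map each 'host:port' target to whether it was reachable."""
    report = {}
    for target in targets:
        host, _, port = target.rpartition(":")
        report[target] = can_connect(host, int(port))
    return report
```

Run `egress_report` inside the sandbox against an arbitrary external host, the address of one of your internal services, and the metadata endpoint many cloud providers expose at 169.254.169.254:80. Any `True` in the result is a path you either intended or need to close.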
Resource caps. Set memory limits, CPU limits, and execution timeouts. An agent running a poorly written loop will consume whatever you give it. Most sandboxing tools expose these as configuration:
# E2B: set at sandbox creation
sandbox = Sandbox(timeout=60)  # Kill after 60 seconds

# Modal: set on the function decorator
@app.function(timeout=120, memory=2048)
def run_agent_code(code: str) -> dict:
    ...
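If you control the process that runs inside the sandbox, you can add a second layer of caps there as well. The sketch below uses the standard library's `resource` and `subprocess` modules and is Unix-only. `run_limited` and its default limits are hypothetical choices. It supplements the sandbox's own limits and is not a replacement for them, since rlimits do nothing to isolate the file system or the network.

```python
import resource
import subprocess
import sys


def _cap(kind: int, value: int) -> None:
    """Lower both soft and hard limits so the child cannot raise them again."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)  # an unprivileged process cannot exceed the hard limit
    resource.setrlimit(kind, (value, value))


def run_limited(code: str, wall_seconds: float = 5.0, cpu_seconds: int = 2,
                memory_bytes: int = 512 * 1024 * 1024) -> dict:
    """Run `code` in a child Python process under time and memory caps."""
    def apply_limits():
        # Runs in the child just before exec.
        _cap(resource.RLIMIT_CPU, cpu_seconds)   # CPU time, in seconds
        _cap(resource.RLIMIT_AS, memory_bytes)   # address space, in bytes

    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: ignore env vars and user site
            capture_output=True, text=True,
            timeout=wall_seconds,                # wall-clock cap; child is killed on expiry
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "reason": "wall-clock timeout", "stdout": "", "stderr": ""}
    return {"ok": proc.returncode == 0, "reason": f"exit code {proc.returncode}",
            "stdout": proc.stdout, "stderr": proc.stderr}
```

The CPU cap catches busy loops and the wall-clock timeout catches code that sleeps or blocks, which is why both are set. Note that `preexec_fn` is not safe to use from a multi-threaded parent process.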
Output validation. Even isolated execution produces outputs. An agent that writes agent-controlled strings to your database, or returns agent-controlled data that gets executed elsewhere, can still cause harm. Validate and sanitize what comes out of the sandbox as carefully as what goes in.
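As a concrete illustration, suppose the sandbox is expected to print a JSON list of name and score records, like the pandas example earlier. A minimal validation sketch follows. The size cap, the schema, and the name `parse_sandbox_output` are all hypothetical.

```python
import json

MAX_OUTPUT_BYTES = 64 * 1024  # hypothetical cap; tune for your workload
EXPECTED_FIELDS = {"name": str, "score": (int, float)}  # hypothetical schema


def parse_sandbox_output(raw: str) -> list:
    """Validate raw sandbox stdout as a JSON list of {name, score} records."""
    if len(raw.encode("utf-8")) > MAX_OUTPUT_BYTES:
        raise ValueError("sandbox output exceeds size cap")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"sandbox output is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of records")
    clean = []
    for i, row in enumerate(data):
        if not isinstance(row, dict) or set(row) != set(EXPECTED_FIELDS):
            raise ValueError(f"record {i} has unexpected shape")
        for key, expected_type in EXPECTED_FIELDS.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(row[key], bool) or not isinstance(row[key], expected_type):
                raise ValueError(f"record {i}: field {key!r} has wrong type")
        # Rebuild the record from known keys instead of passing `row` through.
        clean.append({"name": row["name"], "score": float(row["score"])})
    return clean
```

Validated values are still data, not code. When they go into a database, pass them as query parameters rather than formatting them into SQL strings.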
Practical Recommendation
For most teams building AI coding or data analysis agents:
- Start with E2B. The SDK is simple, startup is fast, and the managed infrastructure means you ship in days rather than weeks. The per-execution pricing is cheap for typical agent tasks.
- Move to Modal when you need GPU access for model inference inside the agent, or when you want tighter control over the Python environment and can invest a day in the setup.
- Consider Firecracker-based infrastructure only if you are in a regulated environment (healthcare, finance) that prohibits cloud vendor code execution, or if your agent workloads are large enough that the per-execution cost of managed services is material compared to running your own fleet.
The hard part of code-executing agents is not the isolation layer — any of the three options above provides adequate isolation. The hard part is the orchestration: queueing agent tasks, managing sandbox pools, streaming outputs to the user, handling timeouts gracefully, and wiring agent-generated results back into the agent’s context. The isolation layer is a dependency, not the core product.
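Two of those orchestration concerns, bounded concurrency and graceful timeouts, can be sketched with `asyncio` alone. In the code below, `execute` is a hypothetical async callable standing in for whatever sandbox client you use, and `run_with_timeout` and `run_batch` are illustrative names.

```python
import asyncio


async def run_with_timeout(execute, code: str, timeout: float) -> dict:
    """Await one execution and turn timeouts and errors into structured results."""
    try:
        output = await asyncio.wait_for(execute(code), timeout=timeout)
        return {"status": "ok", "output": output}
    except asyncio.TimeoutError:
        # A message the agent can read and react to, instead of a crash.
        return {"status": "timeout",
                "output": f"Execution exceeded {timeout} s and was cancelled."}
    except Exception as exc:
        return {"status": "error", "output": f"{type(exc).__name__}: {exc}"}


async def run_batch(execute, jobs: list, max_concurrent: int = 4,
                    timeout: float = 30.0) -> list:
    """Run all jobs with at most `max_concurrent` in flight; results keep job order."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(code: str) -> dict:
        async with gate:
            return await run_with_timeout(execute, code, timeout)

    return await asyncio.gather(*(one(code) for code in jobs))
```

Returning a structured result for every job means a timeout can be fed back into the agent's context like any other output. Cancelling the local await does not by itself stop the remote sandbox, so a real client also has to kill the sandbox when the timeout fires.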