Sandboxed Code Execution: Letting AI Run Code Safely

Your AI data analyst agent writes Python code to analyze a CSV file. It generates: import os; os.system('rm -rf /'). Not because it is malicious - because the model hallucinated a shell command while trying to clean up temporary files. Without sandboxing, this executes on your server. With sandboxing, it executes in an isolated container that has no access to your file system, network, or other resources. The agent gets an error, reasons about it, and generates correct code.

Code execution gives AI agents superhuman capabilities: data analysis, mathematical computation, API integration, file transformation. But it also gives them superhuman destructive potential. The sandbox is the boundary that makes code execution safe enough to deploy in production - containing the blast radius of any mistake, hallucination, or adversarial input.

What sandboxed execution actually is

Sandboxing creates an isolated execution environment where code runs with restricted permissions. The code can compute, read/write within its boundary, and produce output - but cannot affect anything outside the sandbox.

Key properties of a good sandbox:

Isolation: Code cannot access the host system (files, network, processes)
Resource limits: CPU time, memory, and disk are capped (prevents crypto mining and denial-of-service)
Deterministic cleanup: The sandbox is destroyed after execution, leaving no artifacts
Output capture: stdout, stderr, and generated files are captured and returned
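
Three of these properties (resource limits, deterministic cleanup, output capture) can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration, not a real sandbox: it provides no filesystem or network isolation, it assumes a Linux host, and the function name `run_limited` and its default limits are made up for this example.

```python
import resource
import subprocess
import sys
import tempfile


def _apply_limits(cpu_seconds: int, memory_bytes: int):
    """Return a function that caps CPU time and address space in the child."""
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
    return set_limits


def run_limited(code: str, timeout: float = 5.0, cpu_seconds: int = 2,
                memory_bytes: int = 512 * 1024 * 1024) -> dict:
    # Deterministic cleanup: the scratch directory is deleted on exit.
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,   # output capture: stdout and stderr
                text=True,
                timeout=timeout,       # wall-clock limit
                preexec_fn=_apply_limits(cpu_seconds, memory_bytes),
            )
        except subprocess.TimeoutExpired:
            return {"exit_code": -1, "stdout": "",
                    "stderr": f"timeout after {timeout}s"}
    return {"exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}
```

Calling `run_limited("print('hello')")` returns the captured output and exit code. The isolation property is exactly what this sketch lacks, which is why the techniques in the next section exist.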

graph TD
  subgraph agent["AI Agent"]
      GEN["Generate Code"]
  end
  subgraph sandbox["Sandbox (Isolated)"]
      EXEC["Execute Code"]
      FS["Limited Filesystem<br/>(/tmp only)"]
      NET["No Network Access<br/>(or allowlisted only)"]
      RES["Resource Limits<br/>(30s CPU, 512MB RAM)"]
  end
  subgraph host["Host System (Protected)"]
      HFS["Host Filesystem"]
      HN["Host Network"]
      HP["Other Processes"]
  end

  GEN --> EXEC
  EXEC --> FS
  EXEC -.->|"BLOCKED"| HFS
  EXEC -.->|"BLOCKED"| HN
  EXEC -.->|"BLOCKED"| HP
  EXEC --> OUTPUT["Captured Output"]

  style sandbox fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style host fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style GEN fill:#EEEDFE,stroke:#534AB7,color:#3C3489

Sandboxing techniques

Container-based sandboxing

Run code in a Docker container with restricted capabilities:

import docker
import requests

client = docker.from_env()

def execute_in_sandbox(code: str, timeout: int = 30):
    container = client.containers.run(
        image="python:3.11-slim",
        command=["python", "-c", code],
        detach=True,
        mem_limit="512m",
        cpu_period=100000,
        cpu_quota=50000,  # 50% of one CPU
        network_disabled=True,
        read_only=True,
        tmpfs={"/tmp": "size=100m"},
        security_opt=["no-new-privileges"],
    )

    try:
        result = container.wait(timeout=timeout)
        logs = container.logs(stdout=True, stderr=True)
        return {"exit_code": result["StatusCode"], "output": logs.decode(errors="replace")}
    except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectionError):
        # wait() timed out, so the container is still running: stop it
        container.kill()
        return {"exit_code": -1, "output": f"Execution timeout: {timeout}s exceeded"}
    finally:
        # Always remove the container, whether it finished, failed, or timed out
        container.remove(force=True)

Pros: Strong isolation, familiar tooling, supports any language. Cons: Cold start latency (1-5 seconds per container), resource overhead.
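
The run call above can be hardened a little further. The sketch below builds the keyword arguments for docker-py's `containers.run` and shows how the CPU quota follows from the CPU fraction. The helper name `sandbox_run_kwargs`, the PID limit of 64, and the `65534:65534` user (the `nobody` account on Debian-based images) are assumptions for illustration.

```python
def sandbox_run_kwargs(memory_mb: int = 512, cpu_fraction: float = 0.5,
                       tmp_mb: int = 100, max_pids: int = 64) -> dict:
    """Build keyword arguments for docker-py's client.containers.run()."""
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    cpu_period = 100_000  # microseconds; Docker's default scheduling period
    return {
        "detach": True,
        "mem_limit": f"{memory_mb}m",
        "cpu_period": cpu_period,
        # quota / period = share of one CPU the container may use
        "cpu_quota": int(cpu_period * cpu_fraction),
        "network_disabled": True,
        "read_only": True,
        "tmpfs": {"/tmp": f"size={tmp_mb}m"},
        "security_opt": ["no-new-privileges"],
        "cap_drop": ["ALL"],      # drop every Linux capability
        "pids_limit": max_pids,   # cap process count to blunt fork bombs
        "user": "65534:65534",    # assumed non-root UID:GID
    }
```

It would be used as `client.containers.run(image=..., command=..., **sandbox_run_kwargs())`. Dropping capabilities and running as a non-root user narrows what a container escape attempt can do, although it does not change the fact that containers share the host kernel.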

MicroVM-based sandboxing (Firecracker)

AWS’s Firecracker provides VM-level isolation with container-like speed:

Full VM isolation (separate kernel) in ~125ms boot time
Used by AWS Lambda and Fly.io
Stronger isolation than containers (kernel-level separation)
More complex to set up than Docker

gVisor/Kata Containers

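gVisor interposes a user-space kernel between the sandboxed code and the host kernel. Kata Containers takes a different route to a similar goal, running each container inside a lightweight VM. For gVisor: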

Intercepts system calls and implements them in user space
Stronger than container namespaces (no direct kernel access)
Lower overhead than full VMs
Used by Google Cloud Run and GKE Sandbox

WASM (WebAssembly) sandboxing

Compile code to WASM and run in a WASM runtime:

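# Using the wasmtime Python bindings (pip install wasmtime)
from wasmtime import Engine, FuncType, Linker, Module, Store, ValType

def print_handler(value: int) -> None:
    # The only host function the guest is allowed to call
    print(value)

def execute_wasm_sandbox(wasm_bytes: bytes, input_data: int) -> int:
    engine = Engine()
    store = Store(engine)
    module = Module(engine, wasm_bytes)
    linker = Linker(engine)
    # Only link approved host functions; any other import the module
    # declares makes instantiation fail
    linker.define_func("env", "print", FuncType([ValType.i32()], []), print_handler)
    instance = linker.instantiate(store, module)
    main = instance.exports(store)["main"]
    return main(store, input_data)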

Pros: Near-native speed, tiny overhead, fine-grained permission control. Cons: Limited language support (Python requires special compilation), ecosystem maturity.

Process-level sandboxing (seccomp, AppArmor)

Restrict system calls available to a process:

import prctl  # python-prctl, Linux only

def sandbox_process():
    # Strict mode allows only read, write, _exit, and sigreturn;
    # any other system call kills the process with SIGKILL
    prctl.set_seccomp(True)
    # From here on this process cannot open files, create sockets, fork, etc.

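Pros: Near-zero overhead, no container/VM startup. Cons: Complex to configure correctly, easy to miss syscalls. Strict mode is too restrictive for most real programs, so practical setups use seccomp filter mode (seccomp-bpf) with an explicit syscall allowlist.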

graph LR
  subgraph spectrum["Isolation Spectrum"]
      SEC["seccomp/AppArmor<br/>Lightest, least isolated<br/>~0ms overhead"]
      GVISOR["gVisor<br/>User-space kernel<br/>~10ms overhead"]
      DOCKER["Docker + restricted<br/>Container isolation<br/>~1-5s cold start"]
      FIRECRACKER["Firecracker<br/>MicroVM isolation<br/>~125ms boot"]
      VM["Full VM<br/>Strongest isolation<br/>~seconds boot"]
  end

  style SEC fill:#FAEEDA,stroke:#854F0B,color:#633806
  style GVISOR fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style DOCKER fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style FIRECRACKER fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style VM fill:#F1EFE8,stroke:#888780,color:#444441

Production patterns for AI code execution

Pattern 1: Ephemeral containers with pre-warming

Pre-start a pool of sandbox containers to eliminate cold start latency:

import asyncio

# create_sandbox_container, run_in_container, and reset_container are
# placeholders for your own container-management helpers
class SandboxPool:
    def __init__(self, pool_size=10):
        self.available = asyncio.Queue()
        for _ in range(pool_size):
            container = create_sandbox_container()
            self.available.put_nowait(container)

    async def execute(self, code, timeout=30):
        # Blocks until a pre-warmed container is free
        container = await self.available.get()
        try:
            return await run_in_container(container, code, timeout)
        finally:
            # Reset and return to pool (or destroy and create fresh)
            await reset_container(container)
            await self.available.put(container)
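
The "destroy and create fresh" variant mentioned in the comment can be sketched as follows. To keep the example runnable without Docker, the create, destroy, and run steps are injected as async callables; the class name `FreshSandboxPool` and the in-memory fakes in `demo` are hypothetical stand-ins, not part of any library.

```python
import asyncio
import itertools


class FreshSandboxPool:
    def __init__(self, create, destroy, run, size: int = 2):
        self._create, self._destroy, self._run = create, destroy, run
        self._size = size
        self._available: asyncio.Queue = asyncio.Queue()

    async def start(self):
        # Pre-warm the pool before serving requests
        for _ in range(self._size):
            await self._available.put(await self._create())

    async def execute(self, code: str, timeout: float = 30.0):
        sandbox = await self._available.get()
        try:
            return await asyncio.wait_for(self._run(sandbox, code), timeout)
        finally:
            # Never reuse a sandbox that ran untrusted code: destroy it
            # and put a fresh one in its place.
            await self._destroy(sandbox)
            await self._available.put(await self._create())


async def demo():
    ids = itertools.count(1)
    destroyed = []

    async def create():
        return next(ids)              # fake sandbox: just an integer id

    async def destroy(sandbox):
        destroyed.append(sandbox)

    async def run(sandbox, code):
        return f"sandbox {sandbox} ran {code!r}"

    pool = FreshSandboxPool(create, destroy, run, size=2)
    await pool.start()
    first = await pool.execute("print(1)")
    second = await pool.execute("print(2)")
    return first, second, destroyed
```

Replacing rather than resetting trades some throughput for a simpler guarantee: no state from one execution can reach the next. In a real deployment the replacement would usually be created in the background so the caller does not wait for it.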

Pattern 2: Allowlisted capabilities

Instead of blocking everything, explicitly allow specific operations:

import ast

ALLOWED_MODULES = {"pandas", "numpy", "matplotlib", "json", "csv", "math", "datetime"}

def validate_code(code: str) -> bool:
    # Fast pre-filter only: anything not on the allowlist (os, sys,
    # subprocess, shutil, socket, requests, ...) is rejected
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports like "from . import x"
            names = [node.module or ""]
        else:
            continue
        if any(name.split(".")[0] not in ALLOWED_MODULES for name in names):
            return False
    return True
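
An import check alone is easy to sidestep: `__import__('os')` contains no import statement at all. The sketch below extends the same AST-walking idea to flag a few dynamic-execution builtins and dunder attribute access. The name `find_dangerous_calls` and the contents of `DANGEROUS_NAMES` are assumptions for illustration, and the list is deliberately incomplete.

```python
import ast

# Illustrative, not exhaustive
DANGEROUS_NAMES = {"__import__", "eval", "exec", "compile", "getattr"}


def find_dangerous_calls(code: str) -> list[str]:
    """Return the flagged names and dunder attributes the code references."""
    tree = ast.parse(code)
    found = set()
    for node in ast.walk(tree):
        # Flag any bare reference, not just direct calls, so aliasing
        # such as `f = eval; f(...)` is caught too.
        if isinstance(node, ast.Name) and node.id in DANGEROUS_NAMES:
            found.add(node.id)
        # Flag dunder attribute access, e.g. ().__class__.__bases__
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            found.add(node.attr)
    return sorted(found)
```

Even with checks like this, static validation remains a fast first filter; the sandbox is what actually contains code that slips through.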

Pattern 3: File I/O with volume mounting

Let the sandbox read input files and write output files via mounted volumes:

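import tempfile
from pathlib import Path

# run_sandbox and collect_outputs are placeholders for your own helpers;
# run_sandbox is assumed to return a docker-py Container
def execute_with_data(code: str, input_files: dict, timeout: int = 60):
    # Create temp directory with input files
    with tempfile.TemporaryDirectory() as tmpdir:
        for filename, content in input_files.items():
            # Keep only the base name so "../x" cannot escape tmpdir
            Path(tmpdir, Path(filename).name).write_bytes(content)

        # Mount as read-only input, separate writable output
        container = run_sandbox(
            code=code,
            mounts=[
                {"source": tmpdir, "target": "/data/input", "read_only": True},
                {"target": "/data/output", "tmpfs": True}
            ],
            timeout=timeout
        )

        # Collect output files while the container still exists
        # (a tmpfs mount is discarded when the container stops)
        output_files = collect_outputs(container, "/data/output")
        return {"stdout": container.logs().decode(), "files": output_files}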
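
Output collection deserves its own limits, since generated code controls what lands in the output directory. The sketch below assumes a different setup from the tmpfs mount above: the output directory is bind-mounted from the host, so it can be read with ordinary file operations. The name `collect_output_files` and the default caps are hypothetical.

```python
from pathlib import Path


def collect_output_files(output_dir: Path, max_files: int = 20,
                         max_total_bytes: int = 50 * 1024 * 1024) -> dict[str, bytes]:
    """Read regular files from a host-side output directory, within limits."""
    collected: dict[str, bytes] = {}
    total = 0
    for path in sorted(output_dir.iterdir()):
        # Skip symlinks and directories: a symlink could point outside the
        # output directory once it is followed on the host.
        if path.is_symlink() or not path.is_file():
            continue
        size = path.stat().st_size
        if len(collected) >= max_files or total + size > max_total_bytes:
            raise ValueError("sandbox output exceeds the configured limits")
        collected[path.name] = path.read_bytes()
        total += size
    return collected
```

Failing loudly when a cap is hit, rather than silently truncating, gives the agent an error it can react to (for example by writing a smaller summary file).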

Pattern 4: Iterative execution with state

For multi-step data analysis, maintain state across executions:

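class MaxExecutionsExceeded(Exception):
    """Raised when a session has used up its execution budget."""

# create_persistent_sandbox is a placeholder for your own helper
class StatefulSandbox:
    def __init__(self):
        self.container = create_persistent_sandbox()
        self.execution_count = 0
        self.max_executions = 20

    async def execute(self, code: str):
        if self.execution_count >= self.max_executions:
            raise MaxExecutionsExceeded()

        result = await self.container.exec(code)
        self.execution_count += 1
        return result

    async def cleanup(self):
        # Destroy the container so no state outlives the session
        await self.container.destroy()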
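
What "state" means here is easiest to see from the inside. The sketch below is one possible runner process that could live inside the persistent sandbox: it keeps a single namespace alive between snippets and captures stdout and stderr for each one. The class name `SessionRunner` is hypothetical, and `exec` provides no isolation by itself; the surrounding container is still the security boundary.

```python
import contextlib
import io
import traceback


class SessionRunner:
    def __init__(self, max_executions: int = 20):
        self.namespace: dict = {"__name__": "__main__"}
        self.execution_count = 0
        self.max_executions = max_executions

    def execute(self, code: str) -> dict:
        if self.execution_count >= self.max_executions:
            raise RuntimeError("execution budget for this session is exhausted")
        self.execution_count += 1
        stdout, stderr = io.StringIO(), io.StringIO()
        ok = True
        with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
            try:
                # Variables defined here persist in self.namespace
                exec(code, self.namespace)
            except Exception:
                ok = False
                traceback.print_exc()  # written to the captured stderr
        return {"ok": ok, "stdout": stdout.getvalue(), "stderr": stderr.getvalue()}
```

A first call such as `execute("x = 21")` followed by `execute("print(x * 2)")` shows the point: the second snippet sees the variable defined by the first, which is what lets an agent load a dataset once and query it over several steps.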

Where sandboxing gets hard

Dependency installation

AI-generated code often requires packages (pandas, numpy, scikit-learn). Pre-installing common packages in the base image helps, but the model might request obscure packages. Solutions: pre-built images with common packages, pip install with timeout and size limits, or deny unknown packages.

Network access for APIs

Some legitimate tasks require network access (calling APIs, downloading data). Blanket network blocking prevents this. Solutions: allowlisted domains only, proxy all outbound traffic through a monitored gateway, or provide a “fetch” tool that the agent uses instead of raw network access.
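
The domain allowlist behind a "fetch" tool might start from a check like the one below. This is a minimal sketch: the hosts in `ALLOWED_HOSTS` are placeholders, `is_fetch_allowed` is a hypothetical name, and a real gateway would also need to re-check redirects and resolved IP addresses.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com", "data.example.org"}  # placeholder hosts


def is_fetch_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False
    # Exact match on the parsed hostname: substring checks would accept
    # "api.example.com.evil.test" or "api.example.com@evil.test"
    return host in allowed_hosts
```

Comparing the parsed hostname, instead of searching the raw URL string, is the design choice that matters: it is what defeats the lookalike URLs mentioned in the comment.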

Long-running computations

Data analysis on large datasets can legitimately take minutes. Short timeouts kill valid work; long timeouts allow resource abuse. Solutions: progressive timeouts (30s default, extendable with explicit request), resource monitoring (kill if CPU/memory exceeds limits even within timeout), and user-visible progress.
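
A progressive timeout policy can be as small as a clamp. In the sketch below, the 30-second default comes from the text above, while the 300-second ceiling and the function name `grant_timeout` are assumptions chosen for illustration.

```python
from typing import Optional


def grant_timeout(requested_seconds: Optional[float] = None,
                  default_seconds: float = 30.0,
                  ceiling_seconds: float = 300.0) -> float:
    """Return the wall-clock timeout to apply to one execution."""
    if requested_seconds is None:
        return default_seconds          # no explicit request: use the default
    if requested_seconds <= 0:
        raise ValueError("requested timeout must be positive")
    # An explicit request can extend the limit, but never past the ceiling
    return min(requested_seconds, ceiling_seconds)
```

The ceiling bounds how long a runaway job can hold a sandbox, while memory and CPU limits continue to apply inside whatever window is granted.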

Secret management

Code might need API keys to call external services. Injecting secrets into the sandbox is risky - the AI-generated code could exfiltrate them. Solutions: never expose raw secrets; instead, provide pre-authenticated client objects or proxy services that add authentication transparently.
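
The proxy approach could look roughly like this on the host side: the sandbox sends an unauthenticated request to the proxy, and the proxy attaches credentials before forwarding. Everything here is a hypothetical sketch, including the name `add_upstream_auth`, the `SERVICE_CREDENTIALS` mapping, and the fake token value.

```python
from urllib.parse import urlsplit

# Lives on the host only; never mounted or passed into the sandbox.
SERVICE_CREDENTIALS = {"api.example.com": "Bearer test-token"}  # fake value


def add_upstream_auth(url: str, headers: dict[str, str],
                      credentials=SERVICE_CREDENTIALS) -> dict[str, str]:
    """Return the headers the proxy should send upstream for this request."""
    host = (urlsplit(url).hostname or "").lower()
    if host not in credentials:
        raise PermissionError(f"no credentials configured for {host!r}")
    # Drop any Authorization header the sandboxed code tried to set itself
    clean = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    clean["Authorization"] = credentials[host]
    return clean
```

Because the secret is added after the request leaves the sandbox, generated code never sees it and has nothing to exfiltrate.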

Real-world code execution systems

ChatGPT Code Interpreter - sandboxed Python environment with file upload/download, pre-installed data science packages, network-disabled
Claude Artifacts - client-side code execution (HTML/JS runs in user’s browser sandbox)
GitHub Copilot Workspace - executes generated code in cloud sandboxes for testing
E2B - cloud sandbox platform specifically built for AI code execution (Firecracker-based)
Modal - serverless compute with container isolation, popular for AI agent workloads
Replit - code execution with container isolation and resource limits

How to apply in practice

Default to network-disabled containers. Most AI code execution tasks (data analysis, computation, file transformation) do not need network access. Start with it off and add allowlisted access only when needed.

Pre-install common packages. Build a base image with pandas, numpy, matplotlib, scikit-learn, and other common data science packages. This eliminates most “package not found” errors and avoids the security risk of runtime pip installs.

Set conservative resource limits. 512MB RAM, 30-second CPU time, 100MB disk. These are sufficient for most analysis tasks and prevent resource abuse. Increase per-task if needed with explicit justification.

Always capture and return stderr. When code fails, the error message is what the agent needs to debug and retry. Suppressing errors makes the agent blind.
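
One way to return errors without flooding the model's context is to truncate stdout from the end and stderr from the start, since Python tracebacks put the exception type and message on the last lines. The function name `format_for_agent` and the 4000-character default are assumptions for this sketch.

```python
def _keep_head(text: str, limit: int) -> str:
    return text if len(text) <= limit else text[:limit] + "\n[...truncated...]"


def _keep_tail(text: str, limit: int) -> str:
    return text if len(text) <= limit else "[...truncated...]\n" + text[-limit:]


def format_for_agent(exit_code: int, stdout: str, stderr: str,
                     max_chars: int = 4000) -> str:
    """Render an execution result as text for the agent's next turn."""
    status = "success" if exit_code == 0 else f"failed (exit code {exit_code})"
    sections = [f"status: {status}", "stdout:", _keep_head(stdout, max_chars)]
    if stderr:
        # Keep the end of stderr: that is where the exception message is
        sections += ["stderr:", _keep_tail(stderr, max_chars)]
    return "\n".join(sections)
```

Including the exit status explicitly helps too, because some failures (a killed process, for instance) leave stderr empty.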

Clean up aggressively. Destroy sandbox environments after each execution (or short session). Persistent sandboxes accumulate state that can leak across tasks or sessions.

FAQ

Q: Container-based vs WASM sandboxing - which should I use?

Use containers for most production use cases: mature tooling, any language support, familiar debugging, and sufficient isolation for AI code execution. Use WASM when you need ultra-low latency (sub-millisecond startup), need to run in the browser, or want to embed execution in a larger application without containerization overhead. WASM’s language support is still limited for interpreted languages (Python in WASM works via Pyodide but with limitations).

Q: How do I handle the case where AI-generated code needs to read user-uploaded files?

Mount user files as read-only volumes in the sandbox. Never let the sandbox write to user storage directly - instead, capture output files from the sandbox and explicitly save them to user storage after execution. This prevents the sandbox from corrupting or deleting user data even if the generated code is buggy or malicious.

Q: Is static code analysis (linting) a sufficient alternative to sandboxing?

No. Static analysis can catch obvious dangerous patterns (os.system, eval, exec) but cannot catch all malicious or dangerous code. The space of harmful programs is too large to enumerate. A program that only uses “safe” operations can still consume infinite resources or produce wrong results. Sandboxing provides defense-in-depth: static analysis as a fast first filter, sandboxing as the actual security boundary.

Interview questions

Q: Design the code execution infrastructure for an AI data analyst that processes user-uploaded CSVs (up to 1GB). Users ask questions in natural language, the AI writes Python code to analyze the data.

Architecture: (1) Upload pipeline: user uploads CSV → stored in object storage (S3) → metadata extracted (columns, row count, sample). (2) Sandbox environment: Docker containers with Python 3.11, pandas, numpy, matplotlib pre-installed. Memory limit: 4GB (for 1GB files + processing overhead). CPU: 2 cores. Timeout: 120 seconds. Network: disabled. (3) Execution flow: agent generates code → static analysis (block os, subprocess, network imports) → mount CSV as read-only → execute → capture stdout + generated plots/files → return to agent. (4) Iteration: agent sees results, generates follow-up code, same sandbox (stateful session) for 10 executions max. (5) Output handling: generated plots saved as PNG, data results as JSON/CSV, returned to user. (6) Scale: container pool of 50 pre-warmed instances, auto-scale based on queue depth.

Q: Your AI agent’s sandboxed code execution works but is too slow - 3-5 seconds per execution due to container cold starts. Users experience this as lag between each analysis step. How do you optimize?

Multi-layer optimization: (1) Container pooling - pre-warm 20-50 containers and assign from the pool instead of creating fresh ones (takes container startup off the request path). (2) Persistent sessions - reuse the same container across multiple code executions within one conversation (state persists, no startup per step). Reset/destroy after the conversation ends. (3) Warm image optimization - minimize base image size, pre-install all packages, use multi-stage builds to keep the image small. (4) Consider alternatives for simple code: use WASM (Pyodide) for small computations that do not need heavy packages (sub-100ms once the runtime is loaded). Route complex analysis to containers, simple math/logic to WASM. (5) Predictive pre-warming: when a user starts a data analysis conversation, pre-allocate a container before the first code generation completes. Target: <200ms for pool-based execution, <50ms for WASM-routed simple calculations.