Build a Distributed Code Execution Runtime



System Design Deep Dive

Distributed Code Execution Runtime

How to run untrusted code safely at scale - isolation, throughput, and sub-second latency in tension


Running arbitrary user-submitted code is one of the more deceptively dangerous problems in backend infrastructure. On its surface, you are compiling and executing text. In practice, you are accepting an adversarial program, running it on your hardware, and promising to return output within a few seconds - all while thousands of other submissions do the same thing in parallel.

Think of it like operating a commercial kitchen that rents cooking stations to strangers. Each chef gets their own burners, knives, and ingredients. They can cook anything they like - but they cannot open the shared walk-in fridge, pick up a knife from the next station, or leave their burner running unattended. Your job is to enforce those boundaries perfectly, at every station, simultaneously, without slowing anyone down.

The naive approach - spinning up a process per submission with subprocess.run() - collapses immediately under adversarial input. A fork bomb exhausts the process table for the entire host. An infinite loop occupies a CPU core indefinitely. A malicious program reads /etc/passwd, writes arbitrary files to disk, opens outbound network connections, or exploits a kernel vulnerability to escape entirely. Worse, at a few thousand concurrent submissions you have thousands of unmanaged processes competing for resources with no accounting between them.

We need to solve for three tensions simultaneously: security (untrusted code cannot escape its sandbox or affect neighboring executions), throughput (thousands of concurrent submissions without resource contention), and latency (users waiting for output expect results within 2-3 seconds, not 30). These forces pull against each other - stronger isolation means heavier container overhead, which directly conflicts with low latency and high throughput. The architecture must manage this tradeoff explicitly, not hope it resolves itself.

Requirements and Constraints

Functional Requirements

  • Accept code submissions in at least 5 languages: Python 3, Go, Java, C++, JavaScript (Node.js)
  • Compile (where needed) and execute submitted code in an isolated sandbox
  • Return stdout, stderr, exit code, and execution time to the submitter
  • Enforce a configurable timeout per submission (default 10 seconds)
  • Stream output in near-real-time so users see partial results before completion
  • Support at least 5,000 concurrent active submissions

Non-Functional Requirements

  • End-to-end latency (submit to first output byte): under 2 seconds for interpreted languages, under 4 seconds for compiled languages (cold start included)
  • Throughput: 10,000 submissions per minute sustained
  • Sandbox startup overhead: under 500ms per container
  • Zero cross-submission contamination - one submission must never read or write another’s state
  • Availability: 99.9% (about 8.8 hours of downtime per year)
  • Result durability: execution outputs retained for 7 days

Constraints

  • Submitted code has no outbound network access
  • Maximum execution time: 30 seconds hard cap (configurable down to 1 second)
  • Maximum memory per submission: 512 MB
  • Maximum output size: 10 MB (stdout + stderr combined)
  • We assume the queue and result store are separate infrastructure from the worker pool
  • Language runtime images are pre-warmed - we do not build images at submission time

High-Level Architecture

The system has five major components working in a pipeline. The Submission API accepts code over HTTP, validates input, assigns a job ID, and enqueues the job. The Execution Queue durably buffers jobs between producers (API nodes) and consumers (workers), decoupling submission rate from execution capacity. The Worker Pool is a horizontally scalable fleet of nodes; each worker dequeues one job, spawns an isolated container, and supervises its execution. The Sandbox Runtime is the container itself - namespaced from the host, cgroup-limited, and filtered by a seccomp profile that blocks dangerous syscalls. Finally, the Result Store receives stdout/stderr from workers and makes results available to clients via polling or WebSocket push.

High-level architecture: client submits code through API gateway to Redis queue, workers dequeue and spawn sandboxed containers, results written to Redis/S3

Data flows in one direction. A client POSTs code to the API. The API writes the job payload to Redis and returns a job_id to the client immediately - no waiting. A worker picks up the job via BRPOPLPUSH (which atomically moves the job ID to a processing list, preventing loss on worker crash). The worker spawns a container, injects the source code, and starts execution. As the container writes to stdout/stderr, the worker relays chunks to a Redis pub/sub channel keyed by job_id. The client subscribes to that channel via WebSocket and receives output in real time. When execution completes (or times out), the worker writes the final result record to Redis with a 7-day TTL.

Key Insight

The queue is what makes this system safe under load - it is the pressure valve between unpredictable submission bursts and a finite execution capacity, and it is what lets you scale workers independently of the API layer without any coordination.

The Submission API

The Submission API’s job is to be a thin, fast gate: validate, assign an ID, enqueue, and return. It must never block on execution.

Input validation matters more than it appears. Before any job touches the queue, the API must check: language is in the supported set, source code size is under the limit (we cap at 256 KB), the user has not exceeded their rate limit (token bucket per user_id), and the submission payload is well-formed. Failing these checks early is cheap. Failing them inside the worker means you’ve wasted a container cold start.
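The per-user token bucket mentioned above can be sketched in a few lines. This is a minimal in-process version for illustration only; the class names, capacity, and refill rate are assumptions, and a multi-node API would need to keep the bucket state somewhere shared (for example in Redis) rather than in local memory.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """One bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity          # start full so new users are not throttled
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        # Add the tokens earned since the last call, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerUserLimiter:
    """Lazily creates one TokenBucket per user_id."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, self.clock)
            self._buckets[user_id] = bucket
        return bucket.allow()
```

The API would call allow(user_id) before enqueueing and return 429 when it is False. Injecting the clock keeps the limiter testable without sleeping.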

# FastAPI submission endpoint - validates, enqueues, returns job_id immediately
# Route registration (app = FastAPI(), @app.post(...)) and the per-user
# rate-limit check are omitted here; both run before this handler.
import json
import time
import uuid

import redis.asyncio as aioredis
from fastapi import HTTPException
from pydantic import BaseModel, constr

SUPPORTED_LANGUAGES = {"python3", "go", "java", "cpp", "javascript"}
MAX_CODE_BYTES = 256 * 1024
MIN_TIMEOUT_MS = 1_000    # 1 second floor
MAX_TIMEOUT_MS = 30_000   # 30 second hard cap
JOB_TTL_SECONDS = 86_400  # job payload lives for 24 hours

class SubmitRequest(BaseModel):
    language: str
    # Character-count cap; the byte-size check below is the authoritative limit
    source_code: constr(max_length=MAX_CODE_BYTES)
    timeout_ms: int = 10_000
    stdin: str = ""

async def submit(req: SubmitRequest, redis: aioredis.Redis, user_id: str):
    if req.language not in SUPPORTED_LANGUAGES:
        raise HTTPException(400, f"unsupported language: {req.language}")
    # Multi-byte characters can push the encoded size past the character cap
    if len(req.source_code.encode()) > MAX_CODE_BYTES:
        raise HTTPException(413, "source code exceeds 256KB limit")

    job_id = str(uuid.uuid4())
    job = {
        "id": job_id,
        "user_id": user_id,
        "language": req.language,
        "source_code": req.source_code,
        "stdin": req.stdin,
        # Clamp to the allowed window so zero or negative values cannot slip through
        "timeout_ms": max(MIN_TIMEOUT_MS, min(req.timeout_ms, MAX_TIMEOUT_MS)),
        "created_at": time.time(),
        "status": "queued"
    }

    # Store job payload with TTL, then enqueue the ID (one MULTI/EXEC round trip)
    pipe = redis.pipeline()
    pipe.setex(f"job:{job_id}", JOB_TTL_SECONDS, json.dumps(job))
    pipe.lpush("execution_queue", job_id)
    await pipe.execute()

    return {"job_id": job_id, "status": "queued"}

Watch Out

Do not store the full source code in the queue. Store the job ID in the queue and the payload separately in Redis with a TTL. A 256 KB payload in a 10,000-deep queue is 2.5 GB of queue memory - the queue becomes the bottleneck and memory pressure causes job loss.

The Execution Queue

The queue’s job is to buffer jobs between API nodes and workers, guarantee no job is lost on worker crash, and provide backpressure when workers fall behind.

Redis is a natural fit here - BRPOPLPUSH atomically moves a job ID from the execution_queue list to a worker-specific processing:{worker_id} list in a single operation. (Redis 6.2 deprecates BRPOPLPUSH in favor of BLMOVE, which performs the same atomic move.) If the worker crashes before completing the job, the job ID remains in the processing:{worker_id} list. A reaper process (or the worker itself on restart) scans these in-flight lists for jobs that have been in flight longer than the max execution timeout and re-queues them. This is what provides at-least-once delivery.

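# Worker queue consumer using BRPOPLPUSH for reliable delivery
import json
import time

import redis

EXECUTION_QUEUE = "execution_queue"
REQUEUE_AFTER_SECONDS = 35  # slightly above the 30 s hard timeout cap

def worker_loop(redis_client: redis.Redis, worker_id: str):
    processing_list = f"processing:{worker_id}"

    while True:
        # Atomically dequeue and move to our in-flight list
        # (BLMOVE is the non-deprecated equivalent on Redis 6.2+)
        raw = redis_client.brpoplpush(EXECUTION_QUEUE, processing_list, timeout=30)
        if raw is None:
            # Idle: anything still in our list is left over from an earlier crash
            requeue_stale_jobs(redis_client, processing_list)
            continue

        job_id = raw.decode()
        job_json = redis_client.get(f"job:{job_id}")
        if job_json is None:
            # Job payload expired before we could process it - discard
            redis_client.lrem(processing_list, 1, job_id)
            continue

        job = json.loads(job_json)
        # Record the claim time so staleness is measured from when execution
        # started, not from when the job was submitted (KEEPTTL needs Redis 6.0+)
        job["started_at"] = time.time()
        redis_client.set(f"job:{job_id}", json.dumps(job), keepttl=True)
        # execute_job is defined elsewhere; it is assumed to run the sandbox and
        # remove the job ID from processing_list once the result is stored
        execute_job(redis_client, job, processing_list)

def requeue_stale_jobs(redis_client: redis.Redis, processing_list: str):
    # Jobs in flight for longer than the max timeout are assumed lost
    for job_id in redis_client.lrange(processing_list, 0, -1):
        job_json = redis_client.get(f"job:{job_id.decode()}")
        if job_json is None:
            redis_client.lrem(processing_list, 1, job_id)
            continue
        job = json.loads(job_json)
        # Fall back to created_at for jobs claimed before started_at was recorded
        claimed_at = job.get("started_at", job["created_at"])
        if time.time() - claimed_at > REQUEUE_AFTER_SECONDS:
            # MULTI/EXEC so the ID is never missing from both lists
            pipe = redis_client.pipeline()
            pipe.lrem(processing_list, 1, job_id)
            pipe.lpush(EXECUTION_QUEUE, job_id)
            pipe.execute()

Real World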

Judge0, an open-source code execution API, takes a similar approach: submissions go onto a Redis-backed job queue and are consumed by a configurable number of workers. Large online judges also commonly route jobs to language-specific worker pools so that Python jobs do not queue behind slow Java compilations.

The Worker and Sandbox

The worker’s job is to take a job from the queue, spawn an isolated execution environment, enforce resource limits, and collect output - all within the job’s timeout window.

The analogy here is a prison workshop: the prisoner (user code) gets a tool (the language runtime) and a workbench (a container with a filesystem), but the room has no windows (no network), the tools are chained down (read-only root filesystem), and a guard watches with a timer (the worker process monitoring via cgroup events).

Container isolation is the core mechanism. We use Linux namespaces (PID, network, mount, UTS, IPC) to give each execution its own view of the system. The container sees its own process tree, its own network stack (which has no interfaces beyond loopback), and a fresh overlay filesystem. It sees nothing of the host or of neighboring containers.

# Docker flags for a single code execution container
docker run \
  --rm \
  --network none \
  --memory 512m \
  --memory-swap 512m \
  --cpus 1.0 \
  --pids-limit 64 \
  --read-only \
  --tmpfs /tmp:size=32m,noexec \
  --tmpfs /code:size=10m \
  --security-opt no-new-privileges \
  --security-opt seccomp=/etc/runner/seccomp-strict.json \
  --cap-drop ALL \
  --user 65534:65534 \
  code-runner:python3.12 \
  timeout 10s python3 /code/solution.py
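The worker has to assemble these flags per job. Below is a minimal sketch of such a helper, assuming a job dict shaped like the one the Submission API stores. The --name convention (job-<id>) and the Node image tag are assumptions made for illustration; only code-runner:python3.12 appears in this article. How the source file gets into /code is out of scope here.

```python
import math
from typing import Dict, List, Tuple

# language -> (image, command). The javascript entry is a hypothetical example.
RUNTIMES: Dict[str, Tuple[str, List[str]]] = {
    "python3": ("code-runner:python3.12", ["python3", "/code/solution.py"]),
    "javascript": ("code-runner:node20", ["node", "/code/solution.js"]),
}


def build_docker_args(job: dict) -> List[str]:
    """Return the argument list to pass after the `docker` executable."""
    language = job["language"]
    if language not in RUNTIMES:
        raise ValueError(f"unsupported language: {language}")
    image, command = RUNTIMES[language]

    # Round up to whole seconds and clamp to the 1-30 second window
    timeout_s = min(30, max(1, math.ceil(job["timeout_ms"] / 1000)))

    return [
        "run", "--rm",
        "--name", f"job-{job['id']}",   # assumed naming scheme, lets the worker kill by name
        "--network", "none",
        "--memory", "512m", "--memory-swap", "512m",
        "--cpus", "1.0",
        "--pids-limit", "64",
        "--read-only",
        "--tmpfs", "/tmp:size=32m,noexec",
        "--tmpfs", "/code:size=10m",
        "--security-opt", "no-new-privileges",
        "--security-opt", "seccomp=/etc/runner/seccomp-strict.json",
        "--cap-drop", "ALL",
        "--user", "65534:65534",
        image,
        "timeout", f"{timeout_s}s", *command,
    ]
```

Building the arguments as a list (rather than a shell string) means no part of the job payload is ever interpreted by a shell.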

Resource limits are set at two layers. The Docker flags above set cgroup limits (memory, CPU, PIDs); the PID limit is what stops a fork bomb. The seccomp profile blocks dangerous syscalls outright - things like ptrace, mount, setuid, and socket for network address families. A submission that tries to call socket() with AF_INET gets EPERM immediately rather than a kernel exploit opportunity.
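To make the seccomp layer concrete, here is a sketch that generates a deny-by-default profile in the JSON format Docker accepts via --security-opt seccomp=. The syscall allow-list is deliberately tiny and purely illustrative: a real Python or JVM runtime needs many more syscalls, and the exact set has to be derived by tracing the runtime. The function and constant names are assumptions.

```python
import json
from typing import Iterable

# Illustrative only - far too small to run a real language runtime.
BASE_ALLOWED_SYSCALLS = [
    "read", "write", "close", "fstat", "mmap", "munmap", "brk",
    "rt_sigaction", "rt_sigprocmask", "futex", "clock_gettime",
    "exit", "exit_group",
]

AF_UNIX = 1  # Linux address-family constant for local sockets


def build_seccomp_profile(allowed: Iterable[str] = BASE_ALLOWED_SYSCALLS) -> dict:
    return {
        # Anything not explicitly allowed fails with an errno (EPERM by default)
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
            {
                # socket() is allowed only when argument 0 (the address family)
                # is AF_UNIX, so AF_INET and AF_INET6 fall through to the default
                "names": ["socket"],
                "action": "SCMP_ACT_ALLOW",
                "args": [{"index": 0, "value": AF_UNIX, "op": "SCMP_CMP_EQ"}],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_seccomp_profile(), indent=2))
```

An allow-list with a denying default is the safer shape: a newly added kernel syscall is blocked until someone decides the runtime needs it.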

Sandbox internals: worker process supervises a container with overlay filesystem, cgroup limits on CPU/memory/PIDs, seccomp syscall filter, and isolated network namespace

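Timeout enforcement cannot rely solely on the container’s timeout command. The worker process also holds a Go context with the deadline. When the deadline fires, the worker sends SIGKILL to the docker CLI’s entire process group and stops the container itself (for example with docker kill). Both steps matter: the container’s processes are children of the container runtime rather than of the CLI, so signalling the CLI alone can leave the container running, while killing the whole container also takes down any detached subprocess the submission spawned.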

// Worker: run container with dual-layer timeout enforcement
package worker

import (
    "bytes"
    "context"
    "errors"
    "os/exec"
    "syscall"
    "time"
)

var ErrTimeout = errors.New("execution timeout")
var ErrOOM = errors.New("memory limit exceeded")

// Job and buildDockerArgs are defined elsewhere in the worker package.
// This sketch assumes Job has an ID field and that buildDockerArgs
// passes --name "job-<ID>" so the container can be killed by name.
func runContainer(job Job) (stdout, stderr []byte, exitCode int, err error) {
    ctx, cancel := context.WithTimeout(
        context.Background(),
        time.Duration(job.TimeoutMs)*time.Millisecond,
    )
    defer cancel()

    cmd := exec.CommandContext(ctx, "docker", buildDockerArgs(job)...)
    // New process group so SIGKILL reaches the CLI and any helpers it spawned
    cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

    var outBuf, errBuf bytes.Buffer
    cmd.Stdout = &outBuf
    cmd.Stderr = &errBuf

    if startErr := cmd.Start(); startErr != nil {
        return nil, nil, -1, startErr
    }

    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()

    select {
    case <-ctx.Done():
        // Kill the CLI's entire process group - not just the CLI itself
        _ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
        // Container processes are not children of the CLI, so stop the
        // container explicitly as well
        _ = exec.Command("docker", "kill", "job-"+job.ID).Run()
        <-done
        return outBuf.Bytes(), errBuf.Bytes(), -1, ErrTimeout
    case runErr := <-done:
        code := 0
        if cmd.ProcessState != nil {
            code = cmd.ProcessState.ExitCode()
        }
        // 137 = 128 + SIGKILL. The OOM killer is the usual cause, but any
        // SIGKILL produces the same code, so treat this as a heuristic.
        if code == 137 {
            return outBuf.Bytes(), errBuf.Bytes(), 137, ErrOOM
        }
        return outBuf.Bytes(), errBuf.Bytes(), code, runErr
    }
}

Watch Out

Not limiting the PID count is one of the most common misconfigurations. Without --pids-limit, a fork bomb like :(){ :|:& };: in bash spawns processes until the host’s process table is exhausted - starving every container on the node and usually forcing a full instance restart.

Output Streaming and Result Delivery

The output streaming component’s job is to relay stdout/stderr from inside the container back to the waiting client in real time - not after the job completes.

The pattern: as the worker’s outBuf receives bytes from the container, it publishes chunks to a Redis pub/sub channel keyed output:{job_id}. The client maintains a WebSocket connection to the API, which subscribes to this channel and forwards chunks as they arrive. When the worker publishes a final {"type":"done"} event, the client WebSocket closes cleanly.

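# Worker side: stream output chunks to Redis pub/sub as they arrive
# Assumes proc was started with stdout=PIPE and stderr=PIPE in binary mode
# with default buffering, so both pipes expose read1().
import codecs
import json
import subprocess
import threading
import time

import redis

def stream_and_publish(proc: subprocess.Popen, job_id: str, r: redis.Redis):
    channel = f"output:{job_id}"
    max_bytes = 10 * 1024 * 1024  # 10 MB output cap (stdout + stderr combined)
    total = 0
    lock = threading.Lock()        # both reader threads update total
    limit_hit = threading.Event()

    def read_pipe(pipe, stream_type: str):
        nonlocal total
        # Incremental decoder keeps multi-byte characters intact across chunks
        decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        # Read bounded chunks instead of lines so a newline-free flood cannot
        # buffer unbounded data in the worker
        for chunk in iter(lambda: pipe.read1(64 * 1024), b""):
            with lock:
                total += len(chunk)
                over_limit = total > max_bytes
                first_over = over_limit and not limit_hit.is_set()
                if first_over:
                    limit_hit.set()
            if over_limit:
                if first_over:
                    r.publish(channel, json.dumps({"type": "error", "msg": "output limit exceeded"}))
                    proc.kill()
                break
            text = decoder.decode(chunk)
            if not text:
                continue
            r.publish(channel, json.dumps({
                "type": stream_type,
                "data": text,
                "ts": time.time()
            }))
        pipe.close()

    t_out = threading.Thread(target=read_pipe, args=(proc.stdout, "stdout"))
    t_err = threading.Thread(target=read_pipe, args=(proc.stderr, "stderr"))
    t_out.start(); t_err.start()
    t_out.join(); t_err.join()
    proc.wait()

    r.publish(channel, json.dumps({"type": "done", "exit_code": proc.returncode}))
    # Store final result with a 7-day TTL
    ok = proc.returncode == 0 and not limit_hit.is_set()
    r.setex(f"result:{job_id}", 604800, json.dumps({
        "exit_code": proc.returncode,
        "status": "completed" if ok else "error"
    }))

Data flow from code submission through queue, worker, sandbox execution, and real-time output streaming back to client via WebSocket

Key Insight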

Stream output over Redis pub/sub rather than buffering everything and returning it at job completion - users judge a system’s “speed” by time-to-first-output, not total execution time. A job that prints its first line in 200ms feels instant even if total runtime is 4 seconds.
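On the receiving side, the client (or the API's WebSocket relay) only has to fold the event stream back into a result. The sketch below assumes the JSON event shapes published by the worker in this section (stdout, stderr, error, done); the function name and the returned dict layout are illustrative choices, not part of any fixed protocol.

```python
import json
from typing import Iterable, List, Optional


def collect_output(messages: Iterable[str]) -> dict:
    """Fold a stream of JSON-encoded events into a single result record."""
    stdout: List[str] = []
    stderr: List[str] = []
    exit_code: Optional[int] = None
    error: Optional[str] = None
    complete = False

    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "stdout":
            stdout.append(event["data"])
        elif kind == "stderr":
            stderr.append(event["data"])
        elif kind == "error":
            error = event.get("msg")
        elif kind == "done":
            exit_code = event.get("exit_code")
            complete = True
            break  # nothing after "done" belongs to this job

    return {
        "stdout": "".join(stdout),
        "stderr": "".join(stderr),
        "exit_code": exit_code,
        "error": error,
        "complete": complete,
    }
```

A caller that sees complete == False after the stream ends knows the connection dropped mid-job and can fall back to polling the stored result.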

Data Model

The persistence layer has two concerns: job metadata (durable, queryable) and execution output (temporary, fast). We keep them separate.

-- Core job tracking - durable, used for history and retries
CREATE TABLE submissions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL,
  language VARCHAR(32) NOT NULL
    CHECK (language IN ('python3', 'go', 'java', 'cpp', 'javascript')),
  source_code TEXT NOT NULL,
  stdin TEXT NOT NULL DEFAULT '',
  status VARCHAR(20) NOT NULL DEFAULT 'queued'
    CHECK (status IN ('queued', 'running', 'completed', 'timeout', 'error', 'oom')),
  worker_id VARCHAR(64),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  timeout_ms INTEGER NOT NULL DEFAULT 10000,
  memory_limit_bytes BIGINT NOT NULL DEFAULT 536870912
);

-- Execution outputs - queried by job_id, pruned after 7 days
CREATE TABLE execution_results (
  submission_id UUID PRIMARY KEY
    REFERENCES submissions(id) ON DELETE CASCADE,
  stdout TEXT,
  stderr TEXT,
  exit_code SMALLINT,
  execution_time_ms INTEGER,
  memory_used_bytes BIGINT,
  container_id VARCHAR(128)
);

-- Indexes for the hot query paths
CREATE INDEX idx_submissions_status_created
  ON submissions(status, created_at)
  WHERE status IN ('queued', 'running');

CREATE INDEX idx_submissions_user_history
  ON submissions(user_id, created_at DESC);

CREATE INDEX idx_submissions_worker_inflight
  ON submissions(worker_id)
  WHERE status = 'running';

The submissions table shards on user_id - most queries are “show me this user’s recent runs”. The execution_results table is co-located with its parent submission row (it is keyed by submission_id, so the shard is resolved through the foreign key). Redis holds the hot path: the queue (execution_queue list), in-flight job payloads (job:{id} keys, 24h TTL), and output streams (output:{id} pub/sub + result:{id} key, 7-day TTL). Postgres writes (status changes and the final result row) happen asynchronously - they are never on the submit-to-first-output critical path.

Real World

HackerRank separates “submission metadata” from “execution artifacts” in exactly this way - a relational store for metadata with full history and a fast KV store for the hot execution path. This lets them serve submission history pages from Postgres while keeping the live execution path on Redis with no join latency.

Key Algorithms and Protocols

Timeout Enforcement

Timeout enforcement has three layers, each catching failures the previous layer misses. Layer 1: the container’s own timeout command kills the process after N seconds. Layer 2: the Go worker’s context deadline fires independently and kills the container from the outside. Layer 3: the reaper process scans in-flight job lists for jobs that have been in flight longer than the max execution window and re-queues them.

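The non-obvious edge case is a submission that forks a detached child process. The timeout command only signals its direct child, so the grandchild keeps running for as long as the container does. What cleans it up is the end of the container: when PID 1 of a PID namespace exits, the kernel kills every remaining process in that namespace. On the worker side, Setpgid: true puts the docker CLI in its own process group so that SIGKILL to -pgid takes out the CLI and any helpers it spawned, but that signal does not reach inside the container - the worker still has to stop the container itself.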
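The process-group technique is easiest to see with a plain local process. The sketch below is a POSIX-only illustration, not the worker's actual code: it starts a child in its own session, and on deadline sends SIGKILL to the whole group. It applies to processes the worker spawns directly; a Docker container's processes are not descendants of the docker CLI, so for containers the equivalent step is killing the container.

```python
import os
import signal
import subprocess
import sys
from typing import List, Tuple


def run_with_deadline(argv: List[str], timeout_s: float) -> Tuple[int, bytes, bytes, bool]:
    """Run argv; return (returncode, stdout, stderr, timed_out)."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # child becomes leader of a new process group
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        try:
            # The group ID equals the child's PID because it leads its session
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        out, err = proc.communicate()  # reap the child and drain the pipes
        return proc.returncode, out, err, True


if __name__ == "__main__":
    code, _, _, timed_out = run_with_deadline(
        [sys.executable, "-c", "import time; time.sleep(30)"], 0.5
    )
    print(code, timed_out)
```

Killing the group rather than the single PID matters because a child that forks keeps the output pipes open, and the final communicate() would otherwise block until that grandchild exits.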

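# Reaper: scan all worker processing lists for stale jobs and re-queue them
import json
import time

import redis

def reap_stale_jobs(r: redis.Redis, max_age_seconds: int = 35):
    # Find all in-flight lists across all workers
    for key in r.scan_iter("processing:*"):
        for job_id in r.lrange(key, 0, -1):
            job_json = r.get(f"job:{job_id.decode()}")
            if not job_json:
                # Payload expired - nothing left to run
                r.lrem(key, 1, job_id)
                continue
            job = json.loads(job_json)
            # Measure age from when the worker claimed the job. created_at also
            # counts time spent waiting in the queue and would re-queue jobs that
            # are still running; it is only a fallback when started_at is absent.
            claimed_at = job.get("started_at", job["created_at"])
            age = time.time() - claimed_at
            if age > max_age_seconds:
                # Atomically move back to main queue (MULTI/EXEC)
                pipe = r.pipeline()
                pipe.lrem(key, 1, job_id)
                pipe.lpush("execution_queue", job_id)
                pipe.execute()

Key Insight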

The reaper is what turns at-least-once delivery from a promise into a guarantee - without it, a worker crash between BRPOPLPUSH and job completion silently drops the job forever, and the user waits indefinitely for a result that never comes.
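The claim, crash, and reap cycle can be exercised without a Redis server. The in-memory model below is a teaching sketch only: the class and method names are invented for illustration, the deque stands in for the execution_queue list, and the per-worker dicts stand in for the processing:{worker_id} lists. It measures staleness from the moment a job is claimed.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class InMemoryReliableQueue:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        self.pending: Deque[str] = deque()
        # worker_id -> {job_id: time the job was claimed}
        self.processing: Dict[str, Dict[str, float]] = {}

    def enqueue(self, job_id: str) -> None:
        self.pending.appendleft(job_id)          # like LPUSH

    def claim(self, worker_id: str) -> Optional[str]:
        if not self.pending:
            return None
        job_id = self.pending.pop()              # like the RPOP half of the move
        self.processing.setdefault(worker_id, {})[job_id] = self.clock()
        return job_id

    def ack(self, worker_id: str, job_id: str) -> None:
        # Called once the result is stored; the job leaves the in-flight set
        self.processing.get(worker_id, {}).pop(job_id, None)

    def reap(self, max_age_seconds: float) -> List[str]:
        """Re-queue every job that has been in flight for too long."""
        now = self.clock()
        requeued: List[str] = []
        for jobs in self.processing.values():
            for job_id, claimed_at in list(jobs.items()):
                if now - claimed_at > max_age_seconds:
                    del jobs[job_id]
                    self.pending.appendleft(job_id)
                    requeued.append(job_id)
        return requeued
```

A worker that crashes simply never calls ack, so its job stays in the in-flight set until reap returns it to the queue. Re-queued jobs go to the back of the line, which keeps one repeatedly failing job from starving the rest.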

Container Pre-warming

Cold-start latency for a Python container is 300-500ms for docker run. At 10,000 submissions per minute, adding 400ms per job means 4,000 container-seconds of pure startup overhead every minute - roughly 67 container slots permanently occupied with startup before any actual execution happens. The fix is pre-warmed container pools.

Each worker maintains a small pool of idle, running containers per language (typically 2-4). When a job arrives, it injects code into a waiting container via a mounted tmpfs rather than spawning a new one. After the job completes, the container is wiped (tmpfs cleared, process reset) and returned to the pool. New containers are started in the background to refill the pool. This brings effective cold-start overhead down to under 50ms.

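// Pre-warmed pool: maintain N ready containers per language
type ContainerPool struct {
    language string
    ready    chan string // buffered channel of ready container IDs
    poolSize int
}

// NewContainerPool sizes the channel buffer to poolSize. With an unbuffered
// channel, len(p.ready) is always 0 and Refill would block on its first send.
func NewContainerPool(language string, poolSize int) *ContainerPool {
    return &ContainerPool{
        language: language,
        ready:    make(chan string, poolSize),
        poolSize: poolSize,
    }
}

// Get blocks until a ready container is available or ctx is cancelled.
func (p *ContainerPool) Get(ctx context.Context) (string, error) {
    select {
    case id := <-p.ready:
        return id, nil
    case <-ctx.Done():
        return "", ctx.Err()
    }
}

// Refill tops the pool back up; run it in a background goroutine.
// startIdleContainer is assumed to be defined elsewhere in the worker package.
func (p *ContainerPool) Refill() {
    for len(p.ready) < p.poolSize {
        id, err := startIdleContainer(p.language)
        if err != nil {
            // Back off briefly before retrying a failed start
            time.Sleep(500 * time.Millisecond)
            continue
        }
        p.ready <- id
    }
}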

Scaling and Performance

Workers are the unit of scale. Each worker node runs 4-8 concurrent job goroutines, each supervising one container. Worker count scales based on queue depth, not CPU utilization. A worker spending 8 seconds waiting for a Python script to finish is at low CPU but fully occupied - CPU metrics are the wrong signal for autoscaling here.
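One way to turn queue depth into a scaling target is sketched below. It is a simple heuristic under stated assumptions, not a prescribed policy: the target is enough slots to keep current jobs running plus enough to drain the backlog within a chosen window. The function name, the 60-second drain window, and the min/max bounds are all illustrative.

```python
import math


def desired_workers(
    queue_depth: int,
    running_jobs: int,
    avg_exec_seconds: float,
    slots_per_worker: int,
    target_drain_seconds: float = 60.0,
    min_workers: int = 1,
    max_workers: int = 500,
) -> int:
    """Worker count needed to drain the backlog within target_drain_seconds."""
    # Each slot completes target_drain_seconds / avg_exec_seconds jobs in the
    # window, so this many slots clear the whole backlog in that time
    drain_slots = queue_depth * avg_exec_seconds / target_drain_seconds
    total_slots = running_jobs + drain_slots
    workers = math.ceil(total_slots / slots_per_worker)
    return max(min_workers, min(max_workers, workers))
```

For example, a backlog of 10,000 jobs with 500 already running, 3-second average execution, and 4 slots per worker yields a target of 250 workers. CPU utilization does not appear anywhere in the calculation.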

Autoscaling: queue depth metric drives worker fleet scaling, with language-specific worker pools to prevent language skew

Capacity estimation:
  Given:
    - 10,000 submissions/minute sustained
    - Average execution time: 3 seconds
    - 4 goroutines per worker node

  Concurrent active jobs: 10,000 / 60 * 3 = 500
  Workers needed: 500 / 4 = 125 worker nodes
  Queue depth target: keep < 60s of backlog = 10,000 jobs

  Memory per worker node:
    - 4 containers * 512MB limit = 2GB container memory
    - Worker process overhead: ~200MB
    - Total: ~2.2GB per worker node
    - Recommended instance: 4 vCPU (one per container slot), 4GB RAM or more

  Bandwidth (output streaming):
    - 10,000 jobs/min * 50KB avg output = 500MB/min = 8.3 MB/s
    - Well within what a single Redis instance can handle for pub/sub
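The same estimate can be written as a small function so the inputs are easy to vary. This is just the arithmetic above (concurrency = arrival rate x average duration) in executable form; the function and parameter names are illustrative, and the 0.4-second startup figure is the cold-start midpoint quoted earlier in this article.

```python
import math


def estimate_capacity(
    submissions_per_minute: float,
    avg_exec_seconds: float,
    slots_per_worker: int,
    container_memory_mb: int = 512,
    worker_overhead_mb: int = 200,
    startup_seconds: float = 0.4,
) -> dict:
    # Average jobs in flight = arrivals per second * seconds each job runs
    concurrent_jobs = submissions_per_minute * avg_exec_seconds / 60
    workers = math.ceil(concurrent_jobs / slots_per_worker)
    memory_per_worker_mb = slots_per_worker * container_memory_mb + worker_overhead_mb
    # Slots occupied purely by container startup if nothing is pre-warmed
    startup_slots = submissions_per_minute * startup_seconds / 60
    return {
        "concurrent_jobs": concurrent_jobs,
        "workers": workers,
        "memory_per_worker_mb": memory_per_worker_mb,
        "startup_slots": startup_slots,
    }


if __name__ == "__main__":
    print(estimate_capacity(10_000, 3, 4))
```

With the numbers from this section it reports 500 concurrent jobs, 125 workers, and 2,248 MB per worker node, matching the hand calculation.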

Hot spots emerge in two places. First, the Redis queue - at high submission rate, a single Redis instance can become a bottleneck. Partition the queue by language: execution_queue:python3, execution_queue:go, etc. Workers specialize per language and pull from their queue only. This also prevents a flood of slow Java compilations from starving fast Python jobs. Second, pre-warmed container pools deplete during traffic spikes. Monitor pool depth as a metric; alert when any pool drops below 1 available container.

Real World

Codeforces partitions its execution workers by language and by problem difficulty tier - easy problems get a faster, smaller worker pool while harder problems (which typically have longer time limits) use a separate queue. This prevents a single slow C++ submission from blocking 50 Python submissions waiting behind it.

Failure Modes and Recovery

| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Worker crash mid-execution | processing:{id} list persists past max timeout | Job stuck, user waits indefinitely | Reaper re-queues job after timeout window; new worker picks up |
| Container OOM kill | Exit code 137 from Docker | Single job fails with OOM error | Mark submission as oom, return error to user; no retry |
| Fork bomb / PID exhaustion | PID limit cgroup event | Container killed, job fails | Return error with “process limit exceeded”; host unaffected |
| Redis primary failure | Sentinel election, ~30s gap | Queue writes fail, new jobs rejected | Return 503 during failover; in-flight jobs complete normally |
| Disk full on worker | Write error on tmpfs mount | Container stdout truncated | Alert on disk usage; fail new containers cleanly; drain worker |
| Network split between API and Redis | REDIS_TIMEOUT errors in API | Job submissions fail | Return 503 with retry hint; reconnect with exponential backoff |

Watch Out

The most common operational mistake is not setting a memory limit on the Redis instance holding job payloads. Without maxmemory configured, a submission flood fills available RAM and Redis crashes - taking down the queue entirely and losing all queued job IDs that have not yet been moved to processing:* lists. Pair maxmemory with the noeviction policy so that a full Redis rejects new writes instead of silently evicting queued job payloads.

Comparison of Approaches

| Sandbox Technology | Isolation Strength | Cold-Start Latency | Complexity | Failure Mode | Best Fit |
|---|---|---|---|---|---|
| Docker + runc | Strong (kernel namespaces) | 300-500ms | Low | Kernel exploit escapes host | General purpose, most teams |
| gVisor (runsc) | Very strong (userspace kernel) | 500-800ms | Medium | gVisor kernel bugs | High-security untrusted code |
| Firecracker microVM | Maximum (hardware VM boundary) | 100-150ms | High | Hypervisor vulnerabilities | Multi-tenant cloud providers |
| nsjail + seccomp | Strong (tight syscall filter) | 50-100ms | Very High | Seccomp profile gaps | Competitive programming judges |

For most teams building a code execution platform, Docker + runc with a tight seccomp profile is the right starting point - it is well-understood, has a large ecosystem, and the isolation is strong enough for 99% of threat models. Move to gVisor if you are running code from genuinely hostile actors (security competitions, exploit development sandboxes) where kernel exploits are a real threat model. Use Firecracker if you are building a multi-tenant cloud product and need VM-level isolation guarantees for compliance.

The counterintuitive lesson: nsjail with a hand-tuned seccomp profile gives the lowest latency (50ms cold start vs 500ms for gVisor), but its complexity is extreme - a missing syscall in the profile silently breaks user code in ways that are hard to debug. For competitive programming judges, this tradeoff is worth it. For general SaaS, it is not.

Key Takeaways

  • Container isolation is not just about security - it is the mechanism that makes resource accounting possible at scale, because each job’s CPU, memory, and PID usage is tracked independently by cgroups.
  • BRPOPLPUSH (not BRPOP) is the correct pattern for reliable queue consumption - it atomically moves the job to an in-flight list, making crash recovery deterministic.
  • Autoscale on queue depth, not CPU utilization - a worker blocked on a running container is at low CPU but fully occupied, and CPU-based autoscaling will under-provision.
  • Pre-warm container pools to eliminate cold-start latency from the critical path - users judge responsiveness by time-to-first-output, not total execution time.
  • Partition queues by language to prevent slow compilation jobs from starving fast interpreted-language submissions.
  • Two-layer timeout enforcement (container-level + worker-level context deadline) is necessary because a container can enter an unresponsive state where its PID 1 is alive but not responding to signals.
  • seccomp profiles are the single most important security control after PID limits - they shrink the attack surface from the several hundred syscalls Linux exposes down to the 50-100 your language runtimes actually need.
  • Output size caps must be enforced inside the streaming loop, not after the fact - a job writing 1 GB to stdout will fill your Redis pub/sub buffer and cascade into other jobs’ output being dropped.

The counter-intuitive lesson from building these systems: security and performance point in the same direction. Tight namespace isolation, cgroup limits, and seccomp profiles do not just block attacks - they also prevent any single job from impacting others, which is what lets you run thousands of jobs on the same fleet without performance degradation.

Frequently Asked Questions

Q: Why not use WebAssembly (Wasm) as the sandbox instead of containers?

A: Wasm is compelling for JavaScript and C/C++ workloads - Wasmer and Wasmtime give sub-10ms startup with strong isolation. The problem is language support: Python, Java, and Go have incomplete or slow Wasm runtimes today. For a platform that must support 5+ languages equally, containers are the only option with consistent semantics. Use Wasm for JavaScript-only execution environments where startup latency is the primary concern.

Q: Why Redis for the queue instead of Kafka?

A: Kafka excels at ordered, high-throughput event streaming with long retention. For a code execution queue, you want low-latency blocking consumers (BRPOPLPUSH), simple at-least-once retry semantics, and a small total job count (jobs complete in seconds, not hours). Redis fits these requirements better. Use Kafka if you need to replay the submission history for analytics or audit, or if your queue depth regularly exceeds 1 million jobs.

Q: How do you handle stateful language runtimes like Java with JVM warm-up?

A: Java’s JVM takes 1-2 seconds of warm-up before JIT compilation kicks in - the first execution is always slow. The fix is to keep JVM processes alive between jobs (the pre-warming pool approach), injecting new source code via a class loader rather than restarting the JVM. This reduces Java cold-start from 2 seconds to under 100ms. The complexity is isolating JVM state between jobs (class statics, thread locals) - you need a fresh ClassLoader per job to prevent contamination.

Q: What happens if the queue depth grows faster than we can add workers?

A: You need to shed load. Add a max_queue_depth check in the submission API - if the queue exceeds a threshold (say, 30,000 jobs), return 429 Too Many Requests with a Retry-After header. This is better than accepting infinite submissions because unbounded queue growth means jobs enqueued now will not execute for hours, and users see phantom “queued” states with no feedback.
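The admission check itself is a few lines. The sketch below is illustrative: the function name is invented, the 30,000 threshold is the example figure from the answer above, and the drain rate (about 166 jobs per second, i.e. 10,000 per minute) is an assumed input the API would obtain from metrics. It derives Retry-After from how long the excess backlog should take to clear.

```python
import math
from typing import Dict, Tuple


def admission_decision(
    queue_depth: int,
    max_queue_depth: int = 30_000,
    drain_rate_per_second: float = 166.0,
) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, extra headers) for a new submission."""
    if queue_depth <= max_queue_depth:
        return 200, {}
    # Time for workers to bring the queue back under the threshold
    excess = queue_depth - max_queue_depth
    retry_after = max(1, math.ceil(excess / drain_rate_per_second))
    return 429, {"Retry-After": str(retry_after)}
```

Tying Retry-After to the measured drain rate spreads retries out in proportion to the overload, rather than telling every client to come back at the same fixed interval.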

Q: Why not run user code directly in Lambda or Cloud Run instead of managing workers?

A: Managed serverless functions have 100-500ms cold starts (container reuse aside), and you do not control the container image, seccomp profile, or network isolation - you are trusting the cloud provider’s sandboxing. For a general-purpose code execution platform, you need to set your own seccomp profile, control the exact language runtime version, and enforce custom cgroup limits. Run your own workers. The operational complexity is real, but so is the security requirement.

Q: How do you prevent a user from submitting code that reads another user’s source code from the result store?

A: The container has no network access (--network none) and a read-only root filesystem with no host volume mounts. The code can only read its own stdin and write to its own stdout/stderr - it has no path to reach Redis, the result store, or any other job’s data. The container’s PID namespace ensures it cannot even enumerate other processes on the host.

Interview Questions

Q: Walk me through how you guarantee a job is never silently lost when a worker crashes.

Expected depth: Describe the BRPOPLPUSH pattern - job ID atomically moved to processing:{worker_id} list. Explain that on worker crash, the job ID remains there and is not in the main queue. Describe the reaper process that scans in-flight lists for jobs older than the max timeout and moves them back. Discuss idempotency: if a worker re-executes a job that actually completed before the crash, the second run overwrites the same result store key (same job ID), so the client still sees exactly one result. This is safe only because submissions have no side effects outside the sandbox.

Q: How would you design the system to support interactive execution - a REPL session where the user sends input multiple times?

Expected depth: Explain that this requires a persistent container (not one-shot) and a bidirectional channel for stdin. Describe WebSocket-based stdin forwarding: the client sends input lines, the API forwards them to the worker via Redis pub/sub, the worker writes them to the container’s stdin pipe. Discuss session TTL (idle session timeout after 5 minutes), container pool management for interactive containers, and the risk of users holding containers open indefinitely.

Q: The system needs to support a “compare output” mode for competitive programming - run user code and expected code, compare outputs. How do you design this?

Expected depth: This is a “judge” problem. Run both submissions as separate jobs. The comparison job should not depend on wall-clock ordering - use a parent job ID that gates on both child jobs completing. Discuss output normalization (strip trailing whitespace, normalize line endings) before comparison. For floating-point answers, discuss epsilon comparison. The key design choice: run comparison server-side (deterministic) vs client-side (untrusted).
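The normalization and epsilon comparison can be sketched as follows. This is one reasonable judging policy among several, stated as an assumption: lines are compared token by token, so runs of internal whitespace are also treated as equivalent, and any two tokens that both parse as floats are compared with a tolerance. The function names and the default epsilon are illustrative.

```python
import math
from typing import List


def normalize(text: str) -> List[str]:
    """Normalize line endings, strip trailing whitespace and trailing blank lines."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    lines = [line.rstrip() for line in lines]
    while lines and lines[-1] == "":
        lines.pop()
    return lines


def tokens_match(expected: str, actual: str, eps: float) -> bool:
    if expected == actual:
        return True
    try:
        e, a = float(expected), float(actual)
    except ValueError:
        return False  # at least one token is not numeric and they differ
    return math.isclose(e, a, rel_tol=eps, abs_tol=eps)


def outputs_match(expected: str, actual: str, eps: float = 1e-6) -> bool:
    exp_lines, act_lines = normalize(expected), normalize(actual)
    if len(exp_lines) != len(act_lines):
        return False
    for exp_line, act_line in zip(exp_lines, act_lines):
        exp_tokens, act_tokens = exp_line.split(), act_line.split()
        if len(exp_tokens) != len(act_tokens):
            return False
        if not all(tokens_match(e, a, eps) for e, a in zip(exp_tokens, act_tokens)):
            return False
    return True
```

Running this server-side keeps the verdict deterministic and out of the submitter's control, which is the point of the design choice described above.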

Q: How do you handle the case where a user’s code opens thousands of file descriptors to exhaust the system?

Expected depth: The --ulimit nofile=64:64 Docker flag caps open file descriptors for each process in the container (and --pids-limit bounds how many processes there can be). Without this, a submission doing for i in range(100000): open("/tmp/f"+str(i), "w") can consume a large share of the host’s file descriptor table. Combine with the tmpfs size limit (--tmpfs /tmp:size=32m) to cap total disk usage. Discuss why the per-process rlimit (not a cgroup limit) is the binding constraint here: cgroups have no controller for file descriptors.
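The effect of the descriptor limit can be observed locally without Docker. The POSIX-only sketch below starts a child Python process that lowers its own RLIMIT_NOFILE to 64 (the same value as the flag above) and then opens files until the kernel refuses. The helper name and the use of /dev/null are illustrative choices.

```python
import errno
import subprocess
import sys
from typing import Tuple

# Runs in a child process so the parent's own limits are left untouched
CHILD_CODE = r"""
import resource
resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))
handles = []
try:
    for _ in range(1000):
        handles.append(open("/dev/null"))
except OSError as exc:
    print(len(handles), exc.errno)
"""


def probe_fd_limit() -> Tuple[int, int]:
    """Return (files opened before failure, errno of the failure)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        timeout=30,
    )
    opened, err = result.stdout.split()
    return int(opened), int(err)


if __name__ == "__main__":
    opened, err = probe_fd_limit()
    print(opened, errno.errorcode[err])
```

The child stops a little short of 64 opens because stdin, stdout, and stderr already count against the limit, and the failure surfaces as EMFILE inside the submission rather than as pressure on the host.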

Q: Design the monitoring and alerting setup for this system. What are the five most important metrics to watch?

Expected depth: (1) Queue depth per language - alert if depth exceeds 5,000; (2) Worker pool utilization - alert if any worker is at 100% goroutine capacity for more than 60 seconds; (3) Container pre-warm pool depth - alert if any language pool hits 0 ready containers; (4) p99 end-to-end latency (submit to first output byte) - alert if above 3 seconds; (5) seccomp violation rate - any increase indicates someone probing for syscall gaps. Discuss why CPU utilization per worker is a misleading signal for this system.
