The simplest possible online code editor: a single EditorService backed by PostgreSQL for file storage and a shared pre-warmed process pool for code execution. No real-time collaboration, no container isolation, no WebSocket streaming. Demonstrates why process-level isolation and lack of collaboration become untenable as user counts grow.
Online code editors are increasingly common system design interview questions because they combine real-time collaboration, sandboxed code execution, durable file persistence, and multi-language runtime management into a single problem. The teams behind Replit, CodeSandbox, GitHub Codespaces, and StackBlitz ask variants of this question because it maps directly to their core product engineering challenges.
The naive approach uses the simplest possible architecture: a single EditorService backed by PostgreSQL for all data (projects, files, execution output) and a shared pre-warmed process pool for code execution. Files are saved via explicit REST PUT requests — there is no auto-save, no real-time sync, no WebSocket connection. When a user clicks Save, the full file content is sent to EditorService which UPSERTs the file record in PostgreSQL. There is no versioning — the previous content is overwritten and lost.
Code execution uses a shared pool of pre-warmed OS processes. Each process has a pre-loaded language runtime (Python interpreter, Node.js V8 engine, Java JVM, Go runtime). When a user clicks Run, EditorService assigns an available process, writes the code to a temp file in the process sandbox, and executes it. Stdout/stderr is captured and stored in PostgreSQL as an execution record. The client polls GET /api/v1/executions/{id}/output every 2 seconds until the execution completes or times out at 30 seconds.
The shared process pool has two critical problems. First, isolation is process-level only: ulimits cap CPU time (30s) and memory (256MB), but processes share the host OS kernel, filesystem, and network stack. A malicious user could read other users' temp files, scan the internal network, or exhaust the PID table with a fork bomb. Second, host resources are shared: one user running a CPU-intensive computation (matrix multiplication, cryptocurrency mining) degrades execution performance for every other user on the same host.
There is no collaboration capability. Each file has a single editor — if two users open the same file, the last save wins with no conflict detection, no merge, and no notification. There is no cursor sharing, no presence awareness, and no real-time sync. This is acceptable for single-user coding tools (personal projects, homework assignments) but unusable for pair programming or collaborative development.
The polling model for execution output wastes bandwidth. At 1K concurrent executions polled every 2 seconds, output polling generates 500 QPS of GET requests, roughly 80% of which return unchanged data. WebSocket streaming (used in V1 and V2) pushes output as soon as it is available, which removes these redundant requests.
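The polling arithmetic can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the 80% unchanged-poll share is the estimate quoted in this template's complexity table, not a measurement.

```python
def polling_qps(concurrent_executions: int, poll_interval_s: float) -> float:
    """Each running execution is polled once per interval."""
    return concurrent_executions / poll_interval_s


def wasted_qps(total_qps: float, unchanged_pct: int) -> float:
    """Share of polls that return no new output."""
    return total_qps * unchanged_pct / 100


qps = polling_qps(1_000, 2.0)
print(qps)                   # 500.0
print(wasted_qps(qps, 80))   # 400.0
```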
This template exists to make the process pool exhaustion and isolation risks visible and measurable. Run the simulation with 200 concurrent execution requests and watch the shared pool exhaust while a single fork bomb crashes the host. Compare with the Container+OT variant (V1) where each session has strict CPU/memory limits via Firecracker microVMs, and the Container+CRDT variant (V2) where CRDT-based collaboration enables real-time multi-user editing.
Interviewers expect candidates to identify the shared process pool as a security risk, propose container-based isolation as the solution, discuss the transition from polling to WebSocket for real-time output, and reason about adding collaborative editing via OT or CRDT.
The naive code editor system is a five-component linear architecture: EditorClient, EditorLB (load balancer), EditorService, CodeDB (PostgreSQL), and ExecWorker (shared process pool). There is no WebSocket gateway, no collaboration service, no container orchestrator, no object storage, and no caching layer.
All traffic arrives at EditorLB (AWS ALB), which distributes requests across EditorService pods using round-robin. The load balancer adds approximately 1.5ms of routing latency and can handle up to 10K RPS — well above the system's actual limits, which are constrained by the process pool. The load balancer is never the bottleneck.
EditorService handles four types of requests. File operations (50% of traffic): users save files via PUT /api/v1/projects/{id}/files/{path} and load files via GET /api/v1/projects/{id}/files. File content is stored as TEXT columns in PostgreSQL — no S3, no versioning, no CDN. Each save overwrites the previous content via UPSERT. Code execution (20% of traffic): users click Run, EditorService reads the file from CodeDB, dispatches to an available process in ExecWorker's pool, and returns a 202 with an execution_id. The client polls for output. Project CRUD (20%): creating and listing projects — standard database operations. Output polling (10%): GET /api/v1/executions/{id}/output — the client polls every 2 seconds for execution results.
CodeDB (PostgreSQL) stores three tables: projects (metadata), files (content as TEXT columns), and executions (output as TEXT columns). A single primary instance with no read replicas handles all reads and writes. At peak: 2,500 file operations/sec + 1,000 execution dispatches/sec + 500 output polls/sec = 4,000 ops/sec. The database connection pool (200 max) can sustain this load but has no headroom for spikes.
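A rough way to sanity-check the connection pool claim is Little's law (in-flight requests = arrival rate x time in system). The sketch below assumes every operation holds a connection for about 40ms, which is the figure this template quotes for file saves only; polls are quoted as faster, so the result is closer to an upper bound than a prediction.

```python
PEAK_RPS = 5_000  # "~5K sustained" target quoted in this template

# Traffic mix from the EditorService description, in percent.
MIX = {"file_ops": 50, "execution": 20, "project_crud": 20, "output_poll": 10}

per_type = {name: PEAK_RPS * pct // 100 for name, pct in MIX.items()}


def in_flight(ops_per_sec: int, avg_latency_ms: int) -> float:
    """Little's law: concurrency = arrival rate x time in system."""
    return ops_per_sec * avg_latency_ms / 1000


# The 4,000 ops/sec figure in the text covers these three request types.
quoted_ops = per_type["file_ops"] + per_type["execution"] + per_type["output_poll"]

print(per_type["project_crud"])   # 1000 (not included in the quoted total)
print(quoted_ops)                 # 4000
print(in_flight(quoted_ops, 40))  # 160.0 connections busy, out of 200
```

Under that assumption about 160 of the 200 connections are busy at peak, which matches the "no headroom for spikes" description. Counting the project CRUD share as well would push the estimate to the pool limit.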
ExecWorker is a shared pool of 100 pre-warmed OS processes across 10 worker pods (10 processes each). Each process has a pre-loaded language runtime. Process-level isolation uses ulimits: CPU time limit (30 seconds), memory limit (256MB), file descriptor limit (64), PID limit (32). There is no container boundary — processes share the host kernel, filesystem namespace, and network. A fork bomb bypassing PID limits can crash the host and take down all 10 processes on that pod.
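The ulimit settings listed above map onto POSIX resource limits. The following is a minimal sketch using Python's standard `resource` module (Unix only), not the template's actual worker code. `apply_limits` and `run_limited` are hypothetical helpers, and using `RLIMIT_AS` as the memory cap is an assumption.

```python
import resource
import subprocess
import sys

# Limits quoted in the text. RLIMIT_AS as the memory cap is an assumption.
LIMITS = {
    resource.RLIMIT_CPU: 30,                 # CPU seconds
    resource.RLIMIT_AS: 256 * 1024 * 1024,   # address space in bytes
    resource.RLIMIT_NOFILE: 64,              # open file descriptors
    resource.RLIMIT_NPROC: 32,               # processes, counted per real user ID
}


def apply_limits() -> None:
    """Runs in the child between fork() and exec()."""
    for res, value in LIMITS.items():
        resource.setrlimit(res, (value, value))


def run_limited(path: str, wall_timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Run a Python file under the limits above (hypothetical helper)."""
    return subprocess.run(
        [sys.executable, path],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_timeout_s,
    )
```

Nothing in this sketch restricts which files the child can open or which hosts it can reach, which is the isolation gap the text describes. `RLIMIT_NPROC` is also counted per user rather than per process tree, so pool processes that share one OS user share that budget.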
The execution flow is synchronous from the user's perspective: click Run, wait for the poll to return output. There is no streaming, no real-time output, and no interactive stdin support. Long-running programs (web servers, infinite loops) are killed after 30 seconds with no graceful shutdown mechanism.
This sequence diagram traces the two primary flows: file saving (explicit Save action) and code execution (Run click + output polling). The critical insight is the shared process pool — all users' code runs in the same pool of 100 pre-warmed processes with no container boundary. A fork bomb or CPU-intensive computation in one process degrades all others.
The second insight is the polling model for output retrieval. The client polls every 2 seconds, meaning output appears with 0-2 second latency. At 1K concurrent executions, this generates 500 QPS of mostly redundant requests.
Step-by-Step Walkthrough
Pseudocode
// FILE SAVE: explicit Ctrl+S, no auto-save
async function saveFile(projectId, path, content):
    await db.execute(
        "INSERT INTO files (file_id, project_id, path, content, updated_at) " +
        "VALUES (gen_random_uuid(), $1, $2, $3, NOW()) " +
        "ON CONFLICT (project_id, path) DO UPDATE SET content=$3, updated_at=NOW()",
        [projectId, path, content]
    )  // ~40ms: UPSERT with TOAST compression for content > 2KB
    return 200

// CODE EXECUTION: shared process pool
async function runCode(projectId, filePath, language):
    rows = await db.execute(
        "SELECT content FROM files WHERE project_id=$1 AND path=$2",
        [projectId, filePath]
    )
    if (rows.length == 0) throw 404   // file was never saved
    content = rows[0].content
    process = processPool.acquire()   // O(1) from free list; waits in the queue if exhausted
    if (!process) throw 503           // queue wait timed out: pool exhausted
    execId = uuid()
    await db.execute(
        "INSERT INTO executions (execution_id, project_id, file_path, language, status, output) " +
        "VALUES ($1, $2, $3, $4, 'running', '')",   // start with '' so output || chunk is never NULL
        [execId, projectId, filePath, language]
    )
    // Register handlers before starting so early output is not missed
    process.onOutput(chunk => db.execute(
        "UPDATE executions SET output = output || $1 WHERE execution_id = $2",
        [chunk, execId]
    ))
    process.onExit(() => processPool.release(process))   // return the process to the pool
    // Execute in background: the process captures stdout/stderr
    process.exec(language, content, { timeout: 30000, maxMemory: 256_000_000 })
    return { execId, status: 'running' }

The schema reflects the naive approach's single-database design. All data (project metadata, file content, and execution output) lives in one PostgreSQL instance. The files table is the largest due to inline TEXT content storage. The executions table grows during active code runs as output is appended.
The critical column is files.content — a TEXT column storing full file content. PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique) to compress and store large values out-of-line, but reading/writing large TEXT values still incurs significant I/O compared to S3.
Choice
100 pre-warmed OS processes with ulimit-based isolation instead of containers
Rationale
Pre-warmed processes eliminate cold start entirely — execution begins in under 50ms because the language runtime is already loaded. Firecracker microVM boot + runtime load takes 1-2 seconds even with pre-warming. The trade-off is weak isolation: processes share the OS kernel, filesystem, and network. A fork bomb, filesystem traversal, or network scan from one process can affect all others on the same host. This is acceptable for trusted internal tools but unacceptable for public platforms running untrusted code.
Choice
Store file content as TEXT columns in the files table instead of S3
Rationale
Storing files in PostgreSQL eliminates the need for S3 SDK integration, presigned URLs, and eventual consistency handling. One table, one query to read or write. The trade-off is that large files (>1MB) bloat the database — PostgreSQL is not optimized for blob storage. TOAST compression helps but adds CPU overhead. At this naive scale (under 10K projects, ~200K files), total storage fits in a single PostgreSQL instance (~1GB of content). Beyond 50K projects, file content should migrate to S3.
Choice
Last-write-wins overwrites with no conflict detection or merge
Rationale
Real-time collaboration requires WebSocket connections, a conflict resolution algorithm (OT or CRDT), shared document state management in a fast cache, and broadcast fan-out to all participants. This adds 3-4 components and significant implementation complexity. The naive approach avoids all of this by treating each file as single-writer. If two users open the same file and both save, the second save silently overwrites the first with no merge, no warning, and no history. This is the simplest correct behavior for single-user tools.
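The silent overwrite is easy to reproduce. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption made so the example is self-contained), with a reduced `files` table and the same `ON CONFLICT ... DO UPDATE` shape as the save path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    " project_id TEXT NOT NULL,"
    " path TEXT NOT NULL,"
    " content TEXT NOT NULL,"
    " UNIQUE (project_id, path))"
)


def save_file(project_id: str, path: str, content: str) -> None:
    """UPSERT: insert the row, or overwrite content if it already exists."""
    conn.execute(
        "INSERT INTO files (project_id, path, content) VALUES (?, ?, ?) "
        "ON CONFLICT (project_id, path) DO UPDATE SET content = excluded.content",
        (project_id, path, content),
    )
    conn.commit()


def load_file(project_id: str, path: str) -> str:
    row = conn.execute(
        "SELECT content FROM files WHERE project_id = ? AND path = ?",
        (project_id, path),
    ).fetchone()
    return row[0]


save_file("p1", "main.py", "print('alice')")   # first user saves
save_file("p1", "main.py", "print('bob')")     # second user saves moments later
print(load_file("p1", "main.py"))              # print('bob')
```

The first save leaves no trace: there is still exactly one row for the path, and nothing records that another version ever existed.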
Choice
Client polls GET /executions/{id}/output every 2 seconds instead of WebSocket streaming
Rationale
Polling uses standard HTTP request/response — no persistent connections, no reconnection logic, no server-side pub/sub. Any HTTP client works. The trade-off is latency (up to 2 seconds for new output to appear) and wasted bandwidth (most polls return unchanged data). At 1K concurrent executions, polling generates 500 QPS of mostly redundant requests. WebSocket streaming in V1/V2 pushes output immediately with zero wasted requests.
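A client-side polling loop along these lines might look as follows. This is a sketch: `fetch` stands for whatever function issues the GET request, and the `{"status", "output"}` response shape is an assumption rather than something this template specifies.

```python
import time
from typing import Callable


def poll_output(
    fetch: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch() every interval_s until the execution leaves 'running'."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result["status"] != "running":
            return result
        if clock() >= deadline:
            # Give up client-side and return whatever output was seen last.
            return {"status": "timeout", "output": result["output"]}
        sleep(interval_s)
```

Every call to `fetch` before completion is a full HTTP round trip, whether or not the output has changed since the previous poll.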
Choice
UPSERT overwrites previous content with no version history
Rationale
Versioning requires either multiple rows per file (version column with incrementing counter) or S3 object versioning. Both add storage cost and query complexity. The naive approach keeps exactly one row per file — the current version. If a user makes a mistake and saves, the previous content is gone. Production editors maintain version history for undo/diff, but this adds significant storage and UX complexity.
| Metric | Target |
|---|---|
| Target RPS | ~5K sustained (ceiling at process pool) |
| Latency (p99) | ~50ms file save, ~50ms execution start, 0-2s output delivery |
| Storage | ~1 GB at 10K projects (file content in PostgreSQL) |
| Availability | ~99% (single instance, no redundancy) |
| Operation | Time | Space | Notes |
|---|---|---|---|
| File save (UPSERT) | O(1) — single row UPSERT on unique index | O(S) — S is the file content size | PostgreSQL UPSERT on (project_id, path) index. TOAST compression for content > 2KB adds ~5ms CPU overhead. Total latency: ~40ms including WAL flush. |
| Process allocation from pool | O(1) — pick first available process from free list | O(1) — single process assignment | The process pool maintains a free list. Allocation is O(1) when processes are available. When the pool is exhausted, requests queue with O(N) wait time where N is the queue depth. |
| Output polling (SELECT by execution_id) | O(1) — indexed PK lookup | O(S) — S is the output size (can be large) | Fast per-query (~10ms) but generated at 500 QPS for 1K concurrent executions. 80% of polls return unchanged data — pure waste that WebSocket streaming eliminates. |
| File listing (SELECT by project_id) | O(F) — F is the number of files in the project | O(F) — returns all file metadata | Indexed on (project_id, path). Typical projects have 5-50 files. Not a performance concern. |
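The free-list allocation in the table above, with a bounded wait once the pool is exhausted, can be sketched with a thread-safe queue. The integer slots stand in for pre-warmed processes. `ProcessPool` and `PoolExhausted` are hypothetical names, and the 10-second default wait mirrors the queue timeout mentioned in the failure scenarios.

```python
import queue


class PoolExhausted(Exception):
    """Would map to HTTP 503 in the service layer."""


class ProcessPool:
    def __init__(self, size: int) -> None:
        # Free list of slot ids; each id stands in for a pre-warmed process.
        self._free: "queue.Queue[int]" = queue.Queue()
        for slot in range(size):
            self._free.put(slot)

    def acquire(self, wait_s: float = 10.0) -> int:
        """Take a free slot, waiting up to wait_s if none is available."""
        try:
            return self._free.get(timeout=wait_s)
        except queue.Empty:
            raise PoolExhausted("no free process within wait timeout") from None

    def release(self, slot: int) -> None:
        """Return a slot to the free list once its execution has ended."""
        self._free.put(slot)

    def free_count(self) -> int:
        return self._free.qsize()
```

A caller that forgets `release` permanently shrinks the pool, which is one way stuck executions turn into exhaustion.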
Project metadata table. Write-once on creation, read on every project open. Small table (~10K rows for a moderate deployment). Not a performance concern.
Indexes: PK on project_id, idx_projects_owner ON (owner_id, created_at) for dashboard listing
Small, low-write table. Fully cached in PostgreSQL buffer pool after initial queries. Not a performance concern at any reasonable scale.
File content stored as TEXT columns. Each save overwrites the previous content via UPSERT on (project_id, path). No versioning — previous content is lost. The largest table by data volume due to inline content storage.
Indexes: PK on file_id, UNIQUE idx_files_project_path ON (project_id, path) for UPSERT
The content column is the bottleneck — large TEXT values trigger TOAST compression, adding CPU overhead on read/write. At 10K projects with 20 files each averaging 5KB, the table holds ~200K rows and ~1GB of content. Beyond 50K projects, file content should migrate to S3.
Execution records storing stdout/stderr output from code runs. Created on Run click, updated as output is captured. Polled by clients every 2 seconds. Retained for 24 hours then purged by a scheduled job.
Indexes: PK on execution_id, idx_executions_project ON (project_id, created_at) for history listing, idx_executions_status ON (status) WHERE status = 'running' for active execution tracking
Write-heavy during active executions (output appended every 100ms). The output column grows during execution — a program printing 1MB of output creates a 1MB TEXT value that is read on every poll. At 1K concurrent executions with 500 QPS polling, this table generates significant read I/O.
The naive approach has no concept of sessions — there is no persistent connection, no container association, and no session state. This table exists in V1/V2 to track container assignments and collaboration state.
Included for completeness in the variant comparison. The V1 variant adds session tracking for container-per-session association. The V2 variant extends it with CRDT collaboration state and Firecracker snapshot references.
Fork bomb in the shared process pool (user runs `:(){ :|:& };:`)
Impact
The fork bomb creates processes exponentially until the PID limit (32) is hit. Those 32 processes consume CPU scheduling time, degrading execution performance for all other processes on the same worker pod. If the PID limit is bypassed via race condition, the host's PID table is exhausted, causing all 10 processes on that pod to fail with EAGAIN.
Mitigation
Container isolation (V1/V2) limits fork bombs to the microVM's kernel PID namespace — the host is unaffected. In the naive approach, the only mitigation is aggressive PID limits (RLIMIT_NPROC=16) and automated pod restart on PID exhaustion detection.
Two users save the same file simultaneously (last-write-wins race)
Impact
Both PUT requests arrive at EditorService within milliseconds. PostgreSQL serializes the UPSERTs — one commits first, the second overwrites it. The first user's changes are silently lost with no notification, no merge, and no conflict detection. Neither user is aware of the data loss until they manually compare their local editor state with the saved version.
Mitigation
The V1/V2 variants solve this with OT/CRDT collaborative editing — concurrent edits are automatically merged without data loss. In the naive approach, the only mitigation is advisory file locking (check if another user has the file open), which adds complexity without guaranteeing correctness.
Database failure (single PostgreSQL instance goes down)
Impact
Total system outage — no file saves, no file loads, no execution records, no project CRUD. All in-progress executions lose their output (output is stored in the database). Users cannot save their work — unsaved code in the browser is the only copy.
Mitigation
Add RDS Multi-AZ for automated failover (30-60 seconds recovery). The V2 variant separates file storage (S3, 99.999999999% durability) from metadata (PostgreSQL), so file content survives database failures.
Process pool exhaustion during peak usage (100 concurrent Run clicks)
Impact
All 100 processes are occupied. Subsequent Run requests queue with a 10-second wait timeout. If executions average 5 seconds, the pool frees about 20 processes per second (100 / 5), so 100 queued requests clear in roughly 5 seconds. A request queued behind 200 others waits the full 10 seconds, so anything beyond that depth hits the timeout and fails with 503 Service Unavailable.
Mitigation
Increase pool size (more worker pods) or decrease execution timeout. The V1/V2 variants use container-per-session with a pre-warmed pool of 500+ VMs, providing 5x the execution capacity with independent resource isolation.
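The pool-exhaustion numbers in this scenario follow from a simple steady-state model, sketched below. It assumes processes free up at a uniform rate, which real workloads only approximate, and the function names are illustrative.

```python
def drain_rate(pool_size: int, avg_exec_s: float) -> float:
    """Executions completed per second when the pool is fully busy."""
    return pool_size / avg_exec_s


def queue_wait_s(position: int, pool_size: int, avg_exec_s: float) -> float:
    """Approximate wait for the request at a given queue position."""
    return position / drain_rate(pool_size, avg_exec_s)


def max_queue_depth(pool_size: int, avg_exec_s: float, wait_timeout_s: float) -> int:
    """Deepest queue position that is still served before the wait timeout."""
    return int(drain_rate(pool_size, avg_exec_s) * wait_timeout_s)


print(drain_rate(100, 5))             # 20.0 executions/sec
print(queue_wait_s(100, 100, 5))      # 5.0 seconds
print(queue_wait_s(200, 100, 5))      # 10.0 seconds, right at the timeout
print(max_queue_depth(100, 5, 10))    # 200
print(round(drain_rate(100, 30), 1))  # 3.3 if every run hits the 30s limit
```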
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| ExecWorker (Shared Process Pool) | Pool exhaustion from long-running or stuck processes | All execution requests fail with 503. Users can still save and load files but cannot run code. If stuck processes are not killed, the pool never recovers without pod restart. | Enforce 30-second hard kill timeout via SIGKILL. Monitor pool utilization and alert at 70% capacity. Auto-scale worker pods based on queue depth. |
| CodeDB (PostgreSQL) | Connection pool exhaustion from concurrent file saves + execution output writes | All requests fail — no saves, no loads, no execution records. 503 errors across the board. Total system outage. | Connection pooling via PgBouncer (transaction mode). Increase max_connections from 200 to 500. Long-term: separate file content storage (S3) from metadata (PostgreSQL). |
| EditorService | Thread starvation from slow database queries blocking all threads | If CodeDB is slow (>500ms queries), all 200 threads (4 pods x 50 threads) block on database responses. New requests queue and eventually time out. The failure cascades as clients retry. | Database query timeouts (500ms max). Separate thread pools for file operations vs execution dispatch (bulkhead pattern). Circuit breaker on database calls. |
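The circuit breaker named in the EditorService mitigation can be sketched as a small wrapper around database calls. This is a generic illustration rather than the template's implementation. The class names, the failure threshold, and the cool-down period are all assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpen(Exception):
    """Raised instead of calling the database while the breaker is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpen("database calls suspended")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, requests fail immediately instead of holding a worker thread for the full query timeout, which is what stops slow queries from starving the thread pool.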
Vertical scaling for PostgreSQL (upgrade instance size). Horizontal scaling for EditorService (add pods) and ExecWorker (add pods with more processes). Auto-scaling trigger: process pool utilization > 70% for 3 consecutive minutes. The ceiling is approximately 2K-3K concurrent users regardless of pod count, because the shared process pool's isolation model breaks down under adversarial usage. Beyond this ceiling, container isolation is required.
Key metrics to monitor: (1) Process pool utilization — percentage of processes currently executing user code. Alert at 70%, critical at 90%. This is the primary capacity indicator. (2) Execution queue depth — number of Run requests waiting for an available process. Alert if queue depth exceeds 50 (>2.5 second wait). (3) File save latency (p50, p99) — should be under 50ms. Alert at p99 > 200ms indicating database contention. (4) Output poll empty rate — percentage of GET /executions/{id}/output requests returning unchanged data. Expected ~80% waste; useful for justifying migration to WebSocket. (5) PostgreSQL active connections — alert at 70% of max_connections (140/200), critical at 85% (170/200). (6) Process kill rate — frequency of 30-second timeout kills. High kill rate indicates users running long-lived programs (web servers, infinite loops) that exhaust the pool.
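The alert thresholds above reduce to a simple classification. The sketch below uses the quoted numbers, with integer percentages to avoid floating-point edge cases. The `severity` helper and its labels are illustrative.

```python
def severity(value: int, alert_at: int, critical_at: int) -> str:
    """Classify a metric against its alert and critical thresholds."""
    if value >= critical_at:
        return "critical"
    if value >= alert_at:
        return "alert"
    return "ok"


def pool_utilization_pct(busy: int, pool_size: int = 100) -> int:
    return busy * 100 // pool_size


# Process pool: alert at 70%, critical at 90%.
print(severity(pool_utilization_pct(65), 70, 90))   # ok
print(severity(pool_utilization_pct(72), 70, 90))   # alert
print(severity(pool_utilization_pct(95), 70, 90))   # critical

# PostgreSQL connections: alert at 140 of 200, critical at 170 of 200.
print(severity(150, 140, 170))                      # alert
```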
At 1K concurrent users: EditorService 4 pods on ECS Fargate (~$250/month), ExecWorker 10 pods (~$625/month), PostgreSQL db.r7g.large (~$200/month), ALB (~$30/month). Total: ~$1,105/month. This is the cheapest variant but breaks down beyond 2K-3K concurrent users due to process pool exhaustion and isolation risks. The V1 Container+OT variant costs approximately $3,500/month but provides kernel-level isolation and real-time collaboration. The V2 Container+CRDT variant costs approximately $15K-20K/month at 100K concurrent sessions but provides the full production feature set.
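The monthly total can be reproduced from the line items quoted above. The figures are the template's own estimates and the dictionary keys are only labels.

```python
MONTHLY_COST_USD = {
    "EditorService (4 pods, ECS Fargate)": 250,
    "ExecWorker (10 pods)": 625,
    "PostgreSQL (db.r7g.large)": 200,
    "ALB": 30,
}

total = sum(MONTHLY_COST_USD.values())
print(total)                    # 1105
print(round(3_500 / total, 1))  # 3.2, the V1 cost as a multiple of the naive total
```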
The naive approach has critical security gaps. Process-level isolation via ulimits cannot prevent: (1) filesystem traversal — user code can read other users' temp files and the host filesystem, (2) network access — user code can scan internal services, make outbound HTTP requests, and exfiltrate data, (3) fork bombs — while PID limits cap process count, the CPU scheduling impact degrades all users on the same host. For internal tools with trusted users, these risks are acceptable with monitoring. For public platforms, container isolation (V1/V2) is mandatory. Additional measures: disable network access in exec processes, mount temp directories with noexec, and run exec processes as a dedicated low-privilege user.
Rolling deployment for EditorService — replace one pod at a time while ALB routes traffic to remaining pods. ExecWorker pods are drained before replacement: wait for all active processes to complete (up to 30 seconds), then terminate. Database migrations run during low-traffic windows. No blue-green needed at this scale — rolling updates provide sufficient safety with minimal complexity.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Shared Process Pool) | T1 | ~50ms exec start, 0-2s output delivery | ~5K RPS total | $1,100/month | Low | 99% (single DB) |
| V1: Container + OT (Firecracker + Operational Transform) | T2 | <2s cold start, <100ms collab | ~50K RPS peak | $3,500/month | Medium | 99.9% (multi-AZ) |
| V2: Container + CRDT (Firecracker + Yjs/Automerge) | T3 | <2s cold start, <100ms collab | ~50K RPS + 200K WS msg/sec | $15K-20K/month | Very High | 99.95% (multi-AZ, CRDT resilience) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Online code editors combine four hard distributed systems challenges: (1) sandboxed code execution — running untrusted code safely requires container isolation, resource limits, and network restrictions, (2) real-time collaboration — multi-user editing requires conflict resolution algorithms (OT or CRDT) with sub-100ms latency, (3) durable file persistence — code must never be lost, requiring S3-tier durability with versioning, and (4) multi-language runtime management — supporting Python, Node.js, Java, and Go requires per-language container images with runtime pre-warming. Replit, CodeSandbox, and GitHub Codespaces ask this question because it is their core product. Google, Meta, and Amazon ask it because it tests distributed systems fundamentals (isolation, real-time sync, durable storage) in a concrete, relatable context.
The shared process pool has two failure modes. First, resource exhaustion: in the worst case, where every execution runs until the 30-second timeout, 100 processes sustain only about 3.3 executions/second (100 / 30). A burst of 200 concurrent Run clicks exhausts the pool in under a second, and every subsequent request has to queue. Second, isolation failure: process-level ulimits cap CPU and memory but cannot prevent filesystem traversal (one user reads another's temp files), network scanning (probing internal services), or PID exhaustion (fork bombs that crash the host). A single malicious user can degrade or crash execution for all concurrent users on the same host.
Migrate to container isolation when any of these conditions is met: (1) you run untrusted code from public users (a security requirement, not a scale one), (2) concurrent executions regularly exceed 50% of pool capacity (resource contention becomes measurable), or (3) you need multi-user collaboration (which requires WebSocket plus OT/CRDT and justifies the infrastructure upgrade to containers). In practice, any public-facing code editor should use container isolation from day one, because the security risk of running untrusted code in shared processes is too high at any scale.
The naive approach handles a fork bomb poorly. The process-level PID limit (32) prevents the fork bomb from creating more than 32 child processes. However, those 32 processes consume CPU scheduling time from the host, degrading performance for all other processes in the pool. If the PID limit is bypassed (race condition during rapid fork), the fork bomb can exhaust the host's PID table, causing all processes on that worker pod to fail with EAGAIN errors. The Container variant (V1/V2) isolates fork bombs within a Firecracker microVM, whose kernel enforces its own PID limit independently of the host.
Adding WebSocket for output streaming is a partial upgrade that addresses the output delivery latency problem but not the isolation or collaboration problems. WebSocket requires a persistent connection server, connection lifecycle management (heartbeats, reconnection), and a mechanism to route output from the correct process to the correct client. This complexity is comparable to adding a full collaboration service. In practice, teams that add WebSocket for output streaming simultaneously add it for collaborative editing (OT/CRDT), since the infrastructure cost is shared.