Hard | 11 components | Interview: Medium

Online Code Editor — Collaborative Editing with OT

Design a browser-based code editor with container-per-session sandboxed execution, Operational Transform for real-time multi-user collaboration, and durable file storage with version history.

Real-Time, Collaboration, Containers, OT
Problem Statement

The online code editor is a compelling system design interview problem because it combines real-time collaboration, sandboxed code execution, and durable file persistence into a single system with fundamentally different scaling characteristics for each concern. Building a platform like Replit or CodeSandbox requires solving problems that span from low-latency collaborative editing (sub-100ms round-trip for keystrokes) to secure untrusted code execution (complete process isolation) to reliable file storage (zero data loss even during container crashes).

At production scale, the system must support 100,000 concurrent editing sessions, each potentially involving multiple collaborators editing the same file simultaneously. The collaborative editing challenge is particularly interesting: when two users type in the same document at the same time, their edits must be merged deterministically so that both clients converge to the same document state. This is the classic distributed consistency problem, and the two leading solutions — Operational Transform (OT) and Conflict-free Replicated Data Types (CRDTs) — each have distinct trade-offs in complexity, memory overhead, and server requirements.

Code execution adds a security dimension that most system design problems lack. Users submit arbitrary code in multiple languages (Python, JavaScript, Java, Go), and the system must execute it without allowing escape from the sandbox. A fork bomb, a network attack, or an attempt to read the host filesystem must all be contained. Container-per-session isolation using Firecracker microVMs or Docker with gVisor provides kernel-level security boundaries, but at the cost of cold start latency (1-2 seconds for container spin-up) and significant aggregate memory consumption at 100K concurrent sessions.

This template models the complete architecture: an API gateway, a load balancer routing REST and WebSocket traffic, an EditorService for project CRUD and file operations, a CollabService implementing Operational Transform, Redis caches for session and document state, PostgreSQL for project metadata, S3 for durable file storage, and a Kafka-backed execution pipeline feeding ContainerWorker pods. The simulation illustrates how OT operation latency varies with concurrent editor count, how container cold start affects the execution experience, and how the separation of EditorService and CollabService enables independent scaling of REST and WebSocket workloads.

Architecture Overview

The code editor architecture separates three concerns into distinct service boundaries: project management (EditorService), real-time collaboration (CollabService), and code execution (ContainerWorker). This separation is driven by fundamentally different scaling dimensions — EditorService scales with REST request volume, CollabService scales with concurrent WebSocket connections, and ContainerWorker scales with active execution sessions.

When a user opens a project, the EditorClient connects through the API Gateway and MainLB. The EditorService loads project metadata from ProjectDB (PostgreSQL) and file contents from FileStorage (S3), caching active session state in SessionCache (Redis). If the project has other active collaborators, the client also establishes a WebSocket connection to CollabService, which sends the current document state from DocCache (Redis) along with any recent operations the client needs to replay for catch-up.

Real-time collaboration is handled entirely by CollabService using Operational Transform. Each keystroke generates an OT operation (insert, delete, or retain) that is sent to CollabService via WebSocket. The service transforms the operation against any concurrent edits from other users — this is the core OT algorithm that ensures all clients converge to the same document state regardless of operation ordering. The transformed operation is applied to the canonical document in DocCache and broadcast to all session participants. The target latency for this full round-trip is under 100ms, which requires DocCache reads and writes to complete in under 10ms.
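The server-side ordering step described above can be sketched as follows. This is a minimal illustration, not the template's actual CollabService: it handles insert operations only, and the names `CanonicalDocument`, `receive`, `ops_since`, and `transform_insert` are hypothetical. A plain Python list stands in for the operation history that would live alongside the document in DocCache, and ties at the same position are resolved in favor of the operation the server applied first (an assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


def transform_insert(incoming: Insert, applied: Insert) -> Insert:
    """Shift `incoming` so it can run after `applied` (applied wins ties)."""
    if incoming.pos < applied.pos:
        return incoming
    return Insert(incoming.pos + len(applied.text), incoming.text)


class CanonicalDocument:
    """In-memory stand-in for the canonical document kept by the server."""

    def __init__(self, text: str = "") -> None:
        self.text = text
        # log[i] is the operation that moved the document from revision i to i + 1.
        self.log: list[Insert] = []

    @property
    def revision(self) -> int:
        return len(self.log)

    def receive(self, op: Insert, base_revision: int) -> Insert:
        """Transform a client op past everything it has not seen, then apply it."""
        for applied in self.log[base_revision:]:
            op = transform_insert(op, applied)
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.log.append(op)
        return op  # this transformed op is what would be broadcast

    def ops_since(self, revision: int) -> list[Insert]:
        """Operations a reconnecting client must replay to catch up."""
        return self.log[revision:]


doc = CanonicalDocument("ab")
doc.receive(Insert(1, "X"), base_revision=0)   # first client: "aXb"
doc.receive(Insert(2, "Y"), base_revision=0)   # second client typed at the end of "ab"
print(doc.text, doc.revision)                  # aXbY 2
```

Both clients started from revision 0, so the second operation is shifted past the first before it is applied. The same log also serves the catch-up case: a client that last saw revision 1 replays `doc.ops_since(1)`.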

Code execution follows an asynchronous pipeline. When the user clicks "Run," EditorService publishes an execution request to ExecStream (Kafka). ContainerWorker consumes the request, spins up an isolated Firecracker microVM or Docker container with strict resource limits (1 CPU core, 512MB RAM, restricted network), mounts the project files from S3, and executes the code. Stdout and stderr stream back to the client in real time through the CollabService WebSocket connection. Containers are torn down after execution completes or after a five-minute idle timeout.

File persistence uses a dual-write strategy: auto-save every five seconds flushes the canonical document state from DocCache to S3, and the file version is recorded in ProjectDB. S3 versioning provides built-in file history without custom implementation, enabling undo and version browsing. This ensures that even if a container crashes or a session drops, the user's work is preserved up to the last auto-save checkpoint.
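One auto-save pass could look like the sketch below. It is an illustration under stated assumptions: plain dictionaries stand in for DocCache, S3 versioning, and ProjectDB, and the `AutoSaver` class and its `flush` method are hypothetical names. Skipping unchanged documents by content hash is also an assumption, not something the template specifies. A real deployment would call `flush` from a five-second timer.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class AutoSaver:
    doc_cache: dict                                    # path -> current text (DocCache stand-in)
    file_storage: dict = field(default_factory=dict)   # path -> list of versions (S3 stand-in)
    project_db: dict = field(default_factory=dict)     # path -> latest version number (ProjectDB stand-in)
    last_digest: dict = field(default_factory=dict)    # path -> hash of the last saved text

    def flush(self) -> list[str]:
        """Save every document that changed since the previous flush."""
        saved = []
        for path, text in self.doc_cache.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.last_digest.get(path) == digest:
                continue  # unchanged since the last checkpoint
            # Write the content first, then record the version, so the
            # metadata never points at a version that was not stored.
            versions = self.file_storage.setdefault(path, [])
            versions.append(text)
            self.project_db[path] = len(versions)
            self.last_digest[path] = digest
            saved.append(path)
        return saved


cache = {"main.py": "print(1)"}
saver = AutoSaver(doc_cache=cache)
saver.flush()                      # saves version 1
cache["main.py"] = "print(2)"
saver.flush()                      # saves version 2
print(saver.project_db)            # {'main.py': 2}
```

Writing the file content before the version record means a crash between the two steps leaves an extra stored version rather than a dangling metadata row.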

Key Design Decisions
Collaboration Algorithm

Choice

Operational Transform (OT) with a central server (CollabService)

Rationale

OT requires a central server for operation ordering but is simpler to implement correctly for text editing than CRDTs. CRDTs are decentralized and work without a server, but they carry higher per-character memory overhead and require complex garbage collection to reclaim tombstoned deletions. At 100K concurrent sessions, the server cost of OT is acceptable, and the simpler implementation reduces the surface area for consistency bugs. Google Docs is a well-known example of a server-ordered OT design.

Code Execution Isolation

Choice

Container-per-session using Firecracker microVMs with strict resource limits

Rationale

Untrusted user code can attempt fork bombs, network attacks, filesystem access, and other malicious operations. Firecracker microVMs provide kernel-level isolation with a minimal attack surface: each session runs in its own lightweight VM with dedicated CPU, memory, and network namespace. The trade-off is cold start latency (1-2 seconds) and aggregate memory consumption (128MB per VM means about 12.8TB for 100K sessions). Pre-warmed container pools mitigate cold start for popular languages.
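The aggregate memory figure is simple multiplication, sketched below in decimal units (1 TB = 1,000,000 MB). In those units, 128MB across 100K sessions comes to 12.8TB; in binary units the same footprint is about 12.2 TiB. The second scenario uses the 512MB per-container limit mentioned elsewhere in this template as a worst case. The 256GB host size is purely a hypothetical input for illustration, and the host count ignores memory reserved for the host itself.

```python
import math

SESSIONS = 100_000
MB_PER_TB = 1_000_000  # decimal units


def aggregate_memory_tb(per_vm_mb: int, sessions: int = SESSIONS) -> float:
    """Total memory across all sessions, in decimal terabytes."""
    return sessions * per_vm_mb / MB_PER_TB


def hosts_needed(per_vm_mb: int, host_memory_gb: int, sessions: int = SESSIONS) -> int:
    """Hosts required if memory is the only constraint (no host overhead)."""
    vms_per_host = (host_memory_gb * 1000) // per_vm_mb
    return math.ceil(sessions / vms_per_host)


print(aggregate_memory_tb(128))   # 12.8
print(aggregate_memory_tb(512))   # 51.2
print(hosts_needed(128, 256))     # 50
print(hosts_needed(512, 256))     # 200
```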

File Storage Backend

Choice

Amazon S3 with built-in versioning for file history

Rationale

Code files range from tiny (100 bytes for a config file) to moderately large (10MB+ for bundled dependencies). S3 handles arbitrary file sizes with 99.999999999% durability and provides built-in versioning that automatically retains previous file versions. Storing file content in PostgreSQL would bloat the database, complicate backups, and degrade query performance. S3 also decouples storage scaling from database scaling.

Service Separation (EditorService vs. CollabService)

Choice

Separate services for REST operations and real-time WebSocket connections

Rationale

REST operations (project CRUD, file save/load, execution triggers) and WebSocket connections (OT operations, live collaboration) have fundamentally different scaling characteristics. REST traffic is stateless and bursty; WebSocket connections are long-lived and stateful. Separating them allows independent scaling — adding more CollabService pods when concurrent collaboration sessions spike without over-provisioning EditorService, and vice versa during periods of heavy project creation.

Scale & Performance

Target RPS

50,000 peak RPS (REST + OT operations combined)

Latency (p99)

<100ms (OT round-trip); <2s (execution cold start)

Storage

~20 TB/year (code files in S3 + project metadata)

Availability

99.95%

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
What is Operational Transform and how does it enable real-time collaboration?

Operational Transform (OT) is an algorithm for resolving concurrent edits in a shared document. When two users type simultaneously, their operations may conflict — for example, user A inserts a character at position 5 while user B deletes the character at position 3. OT transforms these operations against each other so that both clients arrive at the same final document state regardless of the order in which they receive the operations. The central server (CollabService) acts as the ordering authority, applying each operation to the canonical document and broadcasting the transformed result to all clients.
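The example in this answer (A inserts at position 5 while B deletes at position 3) can be worked through with a small transform function. This is a minimal sketch assuming single-character deletes and a simple priority flag for tie-breaking; the `Insert`, `Delete`, `apply`, and `transform` names are illustrative and not taken from any particular OT library.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int  # removes exactly one character


Op = Union[Insert, Delete]


def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # a no-op produced by transform
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, against: Op, op_has_priority: bool) -> Optional[Op]:
    """Rewrite `op` so it has the same effect when applied after `against`."""
    if isinstance(op, Insert) and isinstance(against, Insert):
        if op.pos < against.pos or (op.pos == against.pos and op_has_priority):
            return op
        return Insert(op.pos + len(against.text), op.text)
    if isinstance(op, Insert) and isinstance(against, Delete):
        return op if op.pos <= against.pos else Insert(op.pos - 1, op.text)
    if isinstance(op, Delete) and isinstance(against, Insert):
        return op if op.pos < against.pos else Delete(op.pos + len(against.text))
    # Delete against Delete
    if op.pos < against.pos:
        return op
    if op.pos > against.pos:
        return Delete(op.pos - 1)
    return None  # both users deleted the same character


doc = "abcdefgh"
a = Insert(5, "X")   # user A
b = Delete(3)        # user B removes "d"

# One client applies A first, the other applies B first.
a_then_b = apply(apply(doc, a), transform(b, a, op_has_priority=False))
b_then_a = apply(apply(doc, b), transform(a, b, op_has_priority=True))
print(a_then_b, b_then_a)   # abceXfgh abceXfgh
```

B's delete is unaffected by A's later-positioned insert, while A's insert shifts left by one because the character at position 3 is gone. Both orders produce the same text.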

How does the system isolate untrusted user code during execution?

Each coding session runs in an isolated Firecracker microVM or Docker container with gVisor. The container has strict resource limits: one CPU core, 512MB RAM, no host network access, and a read-only filesystem except for the project directory. Firecracker provides a minimal VM monitor with a reduced attack surface compared to a general-purpose virtual machine monitor. If a user's code attempts a fork bomb, a process-count limit caps how many processes it can create; if it tries to access the network, the firewall blocks all traffic. The container is destroyed after execution completes or after an idle timeout.
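For the Docker with gVisor variant, the limits listed above map onto standard `docker run` flags. The helper below only builds the argument list and does not launch anything. The function name `sandbox_command`, the `/workspace` mount point, and the process limit of 64 are assumptions for illustration. `--network none` is one simple way to cut off network access, and `runsc` is the runtime name gVisor is conventionally registered under, which depends on how the host is configured.

```python
def sandbox_command(image: str, project_dir: str, command: list[str],
                    use_gvisor: bool = True) -> list[str]:
    """Build a `docker run` argument list for one untrusted execution."""
    args = [
        "docker", "run", "--rm",
        "--cpus", "1",                 # one CPU core
        "--memory", "512m",            # 512MB RAM
        "--pids-limit", "64",          # hypothetical cap that contains fork bombs
        "--network", "none",           # no network interfaces besides loopback
        "--read-only",                 # root filesystem is read-only
        "--volume", f"{project_dir}:/workspace:rw",  # only the project dir is writable
        "--workdir", "/workspace",
    ]
    if use_gvisor:
        args += ["--runtime", "runsc"]  # assumes gVisor is installed under this name
    return args + [image, *command]


cmd = sandbox_command("python:3.12-slim", "/srv/projects/demo", ["python", "main.py"])
print(" ".join(cmd))
```

A worker could pass the resulting list to `subprocess.run` with a timeout; that step is omitted here so the sketch has no side effects.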

What is the difference between OT and CRDTs for collaborative editing?

OT and CRDTs are both algorithms for achieving consistency in collaborative editing, but they differ in architecture and trade-offs. OT requires a central server that orders operations and transforms them against each other; it is simpler to implement but creates a single point of coordination. CRDTs (Conflict-free Replicated Data Types) encode conflict resolution into the data structure itself, enabling peer-to-peer collaboration without a central server. However, CRDTs have higher per-character memory overhead due to unique character identifiers and require garbage collection for deleted characters. Well-known production editors such as Google Docs and Overleaf use OT.
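The memory overhead mentioned above comes from how a sequence CRDT stores text. The sketch below is deliberately not a complete CRDT (it has no merge logic between replicas); it only illustrates the storage shape, assuming each character carries a unique (site, counter) identifier and that deletes leave a tombstone behind. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Element:
    site: str          # replica that created the character
    counter: int       # per-site sequence number; (site, counter) is unique
    char: str
    deleted: bool = False   # tombstone flag


class TombstoneText:
    def __init__(self, site: str) -> None:
        self.site = site
        self.counter = 0
        self.elements: list[Element] = []

    def _physical_index(self, visible_index: int) -> int:
        """Map an index in the visible text to an index in `elements`."""
        seen = 0
        for i, element in enumerate(self.elements):
            if not element.deleted:
                if seen == visible_index:
                    return i
                seen += 1
        return len(self.elements)

    def insert(self, visible_index: int, char: str) -> None:
        self.counter += 1
        element = Element(self.site, self.counter, char)
        self.elements.insert(self._physical_index(visible_index), element)

    def delete(self, visible_index: int) -> None:
        # The element stays in memory; it is only marked as deleted.
        self.elements[self._physical_index(visible_index)].deleted = True

    def text(self) -> str:
        return "".join(e.char for e in self.elements if not e.deleted)

    def compact(self) -> None:
        """Drop tombstones; only safe once every replica has seen the deletes."""
        self.elements = [e for e in self.elements if not e.deleted]


t = TombstoneText("A")
for i, ch in enumerate("hello"):
    t.insert(i, ch)
t.delete(0)
t.delete(0)
print(t.text(), len(t.elements))   # llo 5
t.compact()
print(len(t.elements))             # 3
```

Three visible characters still cost five stored elements until compaction runs. An OT server, by contrast, can keep the document as a plain string.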

How does the system handle container cold start latency?

The first code execution in a new session requires spinning up a container, which takes 1-2 seconds with Firecracker. To mitigate this, the system maintains pre-warmed container pools for popular languages (Python, JavaScript, Java, Go). When a user opens a project, the system pre-allocates a warm container in the background so it is ready by the time they click Run. For subsequent executions within the same session, the existing container is reused, providing near-instant startup. Containers are torn down after a five-minute idle timeout to reclaim resources.
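A pre-warmed pool can be modeled as one queue of idle sandboxes per language. The sketch below is an in-memory illustration: `WarmPool`, `replenish`, and `acquire` are hypothetical names, and `_boot` stands in for the 1-2 second sandbox start by returning a string identifier instead of a real microVM.

```python
from collections import deque


class WarmPool:
    def __init__(self, target_per_language: dict[str, int]) -> None:
        self.target = dict(target_per_language)
        self.idle = {language: deque() for language in self.target}
        self._next_id = 0

    def _boot(self, language: str) -> str:
        """Placeholder for starting a sandbox (the slow, cold-start step)."""
        self._next_id += 1
        return f"{language}-vm-{self._next_id}"

    def replenish(self) -> None:
        """Top every pool back up to its target size (run in the background)."""
        for language, target in self.target.items():
            while len(self.idle[language]) < target:
                self.idle[language].append(self._boot(language))

    def acquire(self, language: str) -> tuple[str, str]:
        """Return (sandbox_id, 'warm' or 'cold')."""
        pool = self.idle.get(language)
        if pool:  # pooled language with at least one idle sandbox
            return pool.popleft(), "warm"
        return self._boot(language), "cold"  # pool empty or language not pooled


pool = WarmPool({"python": 2, "javascript": 1})
pool.replenish()
print(pool.acquire("python"))   # ('python-vm-1', 'warm')
print(pool.acquire("rust"))     # ('rust-vm-4', 'cold')
```

Sizing the targets is the real trade-off: each idle sandbox holds memory, so pool sizes would normally track recent demand per language.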

How is file persistence handled to prevent data loss?

The system uses a multi-layered persistence strategy. The canonical document state lives in DocCache (Redis) for fast OT operations. Every five seconds, an auto-save flushes the current state from DocCache to S3 (FileStorage) and records the file version in ProjectDB (PostgreSQL). S3 versioning automatically retains all previous versions, enabling version history and undo. Even if a container crashes or a WebSocket connection drops, the user's work is preserved up to the last auto-save. On reconnection, the client syncs from the canonical state in DocCache or, if DocCache was evicted, from the latest S3 version.
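The reconnection path described above is a cache lookup with a durable fallback. In this sketch, dictionaries stand in for DocCache and for the versioned objects in S3, and `load_document` is a hypothetical helper name. Repopulating the cache after a fallback read is an assumption.

```python
def load_document(path: str, doc_cache: dict, file_storage: dict) -> tuple[str, str]:
    """Return (text, source) for a reconnecting client."""
    text = doc_cache.get(path)
    if text is not None:
        return text, "doc_cache"      # canonical state is still in the cache
    versions = file_storage.get(path)
    if versions:
        text = versions[-1]           # latest auto-saved version
        doc_cache[path] = text        # repopulate the cache for later edits
        return text, "file_storage"
    raise KeyError(path)


doc_cache = {}                                        # the cache entry was evicted
file_storage = {"main.py": ["print(1)", "print(2)"]}
print(load_document("main.py", doc_cache, file_storage))   # ('print(2)', 'file_storage')
print(load_document("main.py", doc_cache, file_storage))   # ('print(2)', 'doc_cache')
```

On the fallback path, any edits made after the last auto-save are not recoverable, which bounds the loss window at the five-second auto-save interval.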
