1What is the key difference between ETL and ELT?
ETL (Extract, Transform, Load) pipelines move data from source systems to analytical destinations by extracting raw data, transforming it into a usable schema, and loading it into a target store. Modern variants include ELT (load raw, then transform in-warehouse) and reverse ETL (push warehouse data back to operational tools).
ETL pipelines are the circulatory system of data-driven organizations: they move data from where it is generated (transactional databases, SaaS APIs, event streams, log files) to where it is analyzed (data warehouses, data lakes, ML feature stores). Without reliable ETL, analytics dashboards show stale data, ML models train on incomplete features, and business decisions are based on gut feeling rather than evidence.
Traditional **ETL** extracts data from sources, transforms it in a separate processing engine (Spark, custom Python scripts), and loads the transformed result into the target. This made sense when compute was expensive -- you minimized what you loaded into the warehouse. Modern **ELT** reverses the T and L: raw data is loaded directly into a cloud warehouse (BigQuery, Snowflake, Redshift) or lakehouse, and transformations happen in-place using SQL. Cloud warehouses have near-infinite compute elasticity, making it cheaper to transform after loading.
**dbt** (data build tool) has become the industry standard for the transformation layer in ELT. dbt lets analytics engineers write SQL SELECT statements that define transformations, then compiles them into CREATE TABLE AS SELECT (CTAS) statements. dbt adds software engineering practices to SQL transformations: version control, automated testing (schema tests, data tests), documentation, and lineage graphs. This shifted the 'T' from data engineers writing Spark code to analytics engineers writing tested SQL.
Orchestration ties everything together. **Apache Airflow** is the dominant pipeline orchestrator, modeling ETL workflows as Directed Acyclic Graphs (DAGs) of tasks. Each task is an extract, transform, or load operation. Airflow handles scheduling, dependency management, retries, alerting, and backfills. Dagster and Prefect are modern alternatives that emphasize software-defined assets and better developer ergonomics.
Daily Sales Analytics Pipeline
A SaaS company runs a daily ELT pipeline. At midnight, Airbyte extracts new orders from the production PostgreSQL database (incremental, using the updated_at column), new events from Stripe's API (paginated, using cursor-based pagination), and marketing attribution data from Google Ads API. Raw data lands in BigQuery's 'raw' dataset as append-only tables. At 1 AM, a dbt job runs 30 SQL models: staging models clean and rename columns, intermediate models join orders with payments and attribution, and mart models produce the final 'monthly_recurring_revenue' and 'customer_lifetime_value' tables. dbt tests validate that revenue is never negative and that every order has a matching payment. Airflow orchestrates the entire flow and alerts the data team via Slack if any step fails.
Airbnb
Airbnb built Minerva, a metrics platform powered by ETL pipelines. Over 2,000 dbt models transform raw event data into certified business metrics (bookings, revenue, guest satisfaction). Airflow orchestrates 50,000+ ETL tasks daily. Their key innovation was 'metric certification' -- every metric must pass data quality tests and have documented ownership before appearing in dashboards, eliminating conflicting metric definitions across teams.
Spotify
Spotify's ETL platform processes over 3 PB of data daily for music recommendations, podcast analytics, and artist royalty calculations. They use Luigi (their own orchestrator) and Scio (Scala on Dataflow) for batch transformations. The royalty pipeline is mission-critical: it must accurately count every stream to calculate per-play payments to rights holders. Any ETL error directly impacts artist payments, requiring extensive reconciliation checks.
GitLab
GitLab practices 'analytics as code' with their fully open-source ELT pipeline. They extract data from 30+ SaaS sources (Salesforce, Zuora, Zendesk) using Meltano (their open-source EL tool), load into Snowflake, transform with dbt (1,500+ models), and orchestrate with Airflow. Their entire analytics codebase -- including dbt models, tests, and Airflow DAGs -- is public on GitLab.com, serving as a reference implementation for modern ELT.
| Aspect | Description |
|---|---|
| ETL vs ELT | ETL transforms before loading, reducing storage costs and ensuring only clean data enters the warehouse. But it requires a separate compute engine (Spark, Beam) and makes iterating on transformations slower (change code, redeploy, rerun). ELT loads raw data and transforms in-warehouse with SQL, enabling faster iteration and schema-on-read flexibility, but increases storage costs and exposes raw (potentially sensitive) data in the warehouse. |
| Full Refresh vs Incremental | Full refresh (reprocess everything) is simple and guarantees correctness but is expensive for large datasets. Incremental extraction (only new/changed rows since last run) is efficient but complex: you need a reliable change detection mechanism (timestamps, CDC, version columns), and you must handle late-arriving updates and deletes. Start with full refresh, switch to incremental when volume demands it. |
| Orchestrator Complexity vs Pipeline Reliability | Airflow provides fine-grained control (retries, SLAs, sensors, branching, dynamic DAGs) but has a steep learning curve and operational overhead (scheduler, workers, metadata DB). Simpler tools like cron + scripts are easier to start with but lack observability, retry logic, and dependency management. The complexity of Airflow pays off at scale (100+ DAGs, 10+ team members). |
dbt Transformation at JetBlue
Scenario
JetBlue's analytics team maintained over 200 stored procedures in their Teradata data warehouse for transforming raw flight, passenger, and revenue data into business metrics. Stored procedures were difficult to test, version-control, or document. When a procedure produced incorrect revenue numbers, root cause analysis took days because there was no lineage tracking. Different teams sometimes reported different revenue numbers because they used different procedures.
Solution
JetBlue migrated to Snowflake with dbt for transformations. They replaced stored procedures with 400+ dbt SQL models organized into staging, intermediate, and mart layers. Each model included schema tests (not-null, accepted-values, referential integrity) and data tests (revenue reconciliation against source systems). dbt's lineage graph provided full visibility into how raw data flowed to final metrics. Airflow orchestrated the daily dbt run.
Outcome
Metric discrepancies between teams were eliminated because everyone consumed the same certified dbt mart models. Data quality issues were caught automatically by dbt tests before bad data reached dashboards -- average time to detect data quality issues dropped from 3 days to 15 minutes. New metric development time dropped from weeks (writing and testing stored procedures) to days (writing and testing dbt SQL models).
See ETL Pipelines in action
Explore system design templates that use etl pipelines and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What is the key difference between ETL and ELT?
2Why is idempotency important for ETL pipeline tasks?