Vetora logo
๐Ÿ”„Data Engineering

ETL Pipelines

ETL (Extract, Transform, Load) pipelines move data from source systems to analytical destinations by extracting raw data, transforming it into a usable schema, and loading it into a target store. Modern variants include ELT (load raw, then transform in-warehouse) and reverse ETL (push warehouse data back to operational tools).

Overview

ETL pipelines are the circulatory system of data-driven organizations: they move data from where it is generated (transactional databases, SaaS APIs, event streams, log files) to where it is analyzed (data warehouses, data lakes, ML feature stores). Without reliable ETL, analytics dashboards show stale data, ML models train on incomplete features, and business decisions are based on gut feeling rather than evidence.

Traditional **ETL** extracts data from sources, transforms it in a separate processing engine (Spark, custom Python scripts), and loads the transformed result into the target. This made sense when compute was expensive -- you minimized what you loaded into the warehouse. Modern **ELT** reverses the T and L: raw data is loaded directly into a cloud warehouse (BigQuery, Snowflake, Redshift) or lakehouse, and transformations happen in-place using SQL. Cloud warehouses have near-infinite compute elasticity, making it cheaper to transform after loading.

**dbt** (data build tool) has become the industry standard for the transformation layer in ELT. dbt lets analytics engineers write SQL SELECT statements that define transformations, then compiles them into CREATE TABLE AS SELECT (CTAS) statements. dbt adds software engineering practices to SQL transformations: version control, automated testing (schema tests, data tests), documentation, and lineage graphs. This shifted the 'T' from data engineers writing Spark code to analytics engineers writing tested SQL.

Orchestration ties everything together. **Apache Airflow** is the dominant pipeline orchestrator, modeling ETL workflows as Directed Acyclic Graphs (DAGs) of tasks. Each task is an extract, transform, or load operation. Airflow handles scheduling, dependency management, retries, alerting, and backfills. Dagster and Prefect are modern alternatives that emphasize software-defined assets and better developer ergonomics.

Key Points
  • 1ETL (Extract, Transform, Load) transforms data before loading; ELT (Extract, Load, Transform) loads raw data first and transforms in the warehouse. ELT is dominant in modern cloud architectures because cloud warehouses provide elastic, cheap compute.
  • 2The Extract phase must handle schema evolution, API pagination, rate limiting, incremental extraction (only new/changed rows), and source system load management. CDC (Change Data Capture) is the most efficient extraction method for databases.
  • 3dbt is the standard for SQL-based transformations: it compiles SELECT statements into materialized tables/views, adds testing (not-null, unique, referential integrity), generates documentation, and tracks data lineage. dbt models are versioned in Git alongside application code.
  • 4Airflow DAGs define task dependencies, retry policies, SLA deadlines, and alerting. Best practice: each task should be atomic and idempotent. Use the 'sensor' pattern to wait for upstream data arrival rather than scheduling by wall-clock time.
  • 5Data quality is the hardest part of ETL. Implement checks at every stage: row count validation after extraction, schema conformance after transformation, reconciliation totals after loading. Tools like Great Expectations and dbt tests automate these checks.
  • 6Reverse ETL pushes transformed data from the warehouse back to operational tools (Salesforce, Marketo, Zendesk). Tools like Census and Hightouch sync warehouse segments to CRMs, enabling the warehouse to be the single source of truth for customer data.
Simple Example

Daily Sales Analytics Pipeline

A SaaS company runs a daily ELT pipeline. At midnight, Airbyte extracts new orders from the production PostgreSQL database (incremental, using the updated_at column), new events from Stripe's API (paginated, using cursor-based pagination), and marketing attribution data from Google Ads API. Raw data lands in BigQuery's 'raw' dataset as append-only tables. At 1 AM, a dbt job runs 30 SQL models: staging models clean and rename columns, intermediate models join orders with payments and attribution, and mart models produce the final 'monthly_recurring_revenue' and 'customer_lifetime_value' tables. dbt tests validate that revenue is never negative and that every order has a matching payment. Airflow orchestrates the entire flow and alerts the data team via Slack if any step fails.

Real-World Examples

Airbnb

Airbnb built Minerva, a metrics platform powered by ETL pipelines. Over 2,000 dbt models transform raw event data into certified business metrics (bookings, revenue, guest satisfaction). Airflow orchestrates 50,000+ ETL tasks daily. Their key innovation was 'metric certification' -- every metric must pass data quality tests and have documented ownership before appearing in dashboards, eliminating conflicting metric definitions across teams.

Spotify

Spotify's ETL platform processes over 3 PB of data daily for music recommendations, podcast analytics, and artist royalty calculations. They use Luigi (their own orchestrator) and Scio (Scala on Dataflow) for batch transformations. The royalty pipeline is mission-critical: it must accurately count every stream to calculate per-play payments to rights holders. Any ETL error directly impacts artist payments, requiring extensive reconciliation checks.

GitLab

GitLab practices 'analytics as code' with their fully open-source ELT pipeline. They extract data from 30+ SaaS sources (Salesforce, Zuora, Zendesk) using Meltano (their open-source EL tool), load into Snowflake, transform with dbt (1,500+ models), and orchestrate with Airflow. Their entire analytics codebase -- including dbt models, tests, and Airflow DAGs -- is public on GitLab.com, serving as a reference implementation for modern ELT.

Trade-Offs
AspectDescription
ETL vs ELTETL transforms before loading, reducing storage costs and ensuring only clean data enters the warehouse. But it requires a separate compute engine (Spark, Beam) and makes iterating on transformations slower (change code, redeploy, rerun). ELT loads raw data and transforms in-warehouse with SQL, enabling faster iteration and schema-on-read flexibility, but increases storage costs and exposes raw (potentially sensitive) data in the warehouse.
Full Refresh vs IncrementalFull refresh (reprocess everything) is simple and guarantees correctness but is expensive for large datasets. Incremental extraction (only new/changed rows since last run) is efficient but complex: you need a reliable change detection mechanism (timestamps, CDC, version columns), and you must handle late-arriving updates and deletes. Start with full refresh, switch to incremental when volume demands it.
Orchestrator Complexity vs Pipeline ReliabilityAirflow provides fine-grained control (retries, SLAs, sensors, branching, dynamic DAGs) but has a steep learning curve and operational overhead (scheduler, workers, metadata DB). Simpler tools like cron + scripts are easier to start with but lack observability, retry logic, and dependency management. The complexity of Airflow pays off at scale (100+ DAGs, 10+ team members).
Case Study

dbt Transformation at JetBlue

Scenario

JetBlue's analytics team maintained over 200 stored procedures in their Teradata data warehouse for transforming raw flight, passenger, and revenue data into business metrics. Stored procedures were difficult to test, version-control, or document. When a procedure produced incorrect revenue numbers, root cause analysis took days because there was no lineage tracking. Different teams sometimes reported different revenue numbers because they used different procedures.

Solution

JetBlue migrated to Snowflake with dbt for transformations. They replaced stored procedures with 400+ dbt SQL models organized into staging, intermediate, and mart layers. Each model included schema tests (not-null, accepted-values, referential integrity) and data tests (revenue reconciliation against source systems). dbt's lineage graph provided full visibility into how raw data flowed to final metrics. Airflow orchestrated the daily dbt run.

Outcome

Metric discrepancies between teams were eliminated because everyone consumed the same certified dbt mart models. Data quality issues were caught automatically by dbt tests before bad data reached dashboards -- average time to detect data quality issues dropped from 3 days to 15 minutes. New metric development time dropped from weeks (writing and testing stored procedures) to days (writing and testing dbt SQL models).

Common Mistakes
  • โš Building monolithic ETL jobs that extract, transform, and load in a single script. When the job fails, you cannot tell which phase failed or retry from the failure point. Break pipelines into discrete, idempotent tasks: extract (write raw to staging), transform (staging to clean), load (clean to final) -- each independently retriable.
  • โš Ignoring schema evolution from source systems. When a source adds, removes, or renames a column, brittle ETL pipelines break silently (dropping data) or loudly (crashing). Use schema registries, contract tests, or flexible formats (JSON in a VARCHAR column) to handle schema changes gracefully.
  • โš Treating data quality as an afterthought. Adding tests only after bad data causes an incident is reactive. Build quality checks into every pipeline from day one: row counts, null rates, value distributions, cross-source reconciliation. dbt tests and Great Expectations make this low-effort.
  • โš Over-engineering with streaming when batch ETL meets the SLA. If your dashboard refreshes hourly and your users are fine with hourly data, a batch ELT pipeline running every hour is far simpler and cheaper than a streaming pipeline. Match the pipeline complexity to the actual freshness requirement.
Related Concepts

See ETL Pipelines in action

Explore system design templates that use etl pipelines and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Build an ETL pipeline transforming raw logs to analytics

Metrics to watch
pipeline_latency_mstransform_throughput_rpsdata_quality_error_pctstage_duration_ms
Run Simulation
Test Your Understanding

1What is the key difference between ETL and ELT?

2Why is idempotency important for ETL pipeline tasks?

Deeper Reading