May 15, 2025 · Data Fundamentals

What Is ETL (and ELT)? The Data Plumbing Explained

ETL might be the most important acronym in data that nobody outside of data teams understands. It's the plumbing that makes everything else possible.

ETL: Extract, Transform, Load

ETL is the process of moving data from source systems to a destination (usually a data warehouse) while transforming it along the way.

[Diagram: Traditional ETL — Extract → Transform → Load → Warehouse. Modern ELT — Extract → Load → Transform in the warehouse (dbt, SQL, Spark). ELT is now the default for modern cloud data stacks.]

Extract: Pull data from source systems - your CRM, ERP, databases, APIs, files.

Transform: Clean, validate, and reshape the data. Standardize formats, apply business logic, join data from different sources.

Load: Put the transformed data into the destination system.
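To make the three steps concrete, here is a minimal ETL sketch in Python. The orders.csv source, the orders table, and the SQLite destination are illustrative assumptions, not any particular tool's API.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source file (hypothetical orders.csv)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean types and apply simple business logic before loading."""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):   # drop records missing the key
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount_usd": round(float(row["amount"]), 2),
            "status": row["status"].strip().lower(),
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into the destination table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_usd REAL, status TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO orders VALUES (:order_id, :amount_usd, :status)",
        rows,
    )
    con.commit()
    con.close()

load(transform(extract("orders.csv")))
```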

According to Rivery's analysis, ETL was designed to protect the data warehouse. When on-premises warehouses had expensive, limited compute, you transformed data before loading to minimize the processing burden on the warehouse.

ELT: The Modern Approach

ELT flips the process: Extract, Load, Transform. Load raw data first, transform it inside the destination.

Why the change? Pure Storage explains that cloud data warehouses changed the economics. When compute is cheap and scalable, it makes sense to load everything and transform where the data lives.

ELT advantages:
- Faster initial data loading
- Raw data preserved for future use cases
- Transformations can be modified without re-extracting
- Better suited for cloud data warehouses
- More flexibility for data scientists

ELT is now the default for modern data stacks. Tools like dbt (data build tool) have standardized the transformation-in-warehouse pattern.
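For contrast with the ETL sketch above, here is the ELT ordering: raw data lands first, and the transformation runs as SQL inside the destination. SQLite stands in for a cloud warehouse and the table names are assumptions; in practice the SQL step would typically live in a dbt model.

```python
import csv
import sqlite3

con = sqlite3.connect("warehouse.db")

# Extract + Load: land the raw data as-is, no cleanup yet.
with open("orders.csv", newline="") as f:
    rows = [(r["order_id"], r["amount"], r["status"]) for r in csv.DictReader(f)]
con.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, amount TEXT, status TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform: run inside the warehouse, where compute is cheap and scalable.
con.executescript("""
    DROP TABLE IF EXISTS orders_clean;
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER)      AS order_id,
           ROUND(CAST(amount AS REAL), 2) AS amount_usd,
           LOWER(TRIM(status))            AS status
    FROM raw_orders
    WHERE order_id IS NOT NULL AND order_id != '';
""")
con.commit()
con.close()
```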

The Components of a Data Pipeline

Whether ETL or ELT, data pipelines have common components:

Extraction

Pulling data from sources. This could be:
- Database replication - copying data from operational databases
- API calls - fetching data from SaaS applications
- File ingestion - loading CSV, JSON, or other file formats
- Change data capture (CDC) - capturing only what changed since the last extraction (see the sketch after this list)
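Full CDC usually reads the database's transaction log, but the simpler incremental pattern it builds on looks like this sketch. The updated_at column, table name, and source database are assumptions for illustration.

```python
import sqlite3

def extract_incremental(source_db, last_run_at):
    """Pull only rows changed since the previous extraction (a simple
    incremental pattern; true CDC would read the transaction log instead)."""
    con = sqlite3.connect(source_db)
    cursor = con.execute(
        "SELECT order_id, amount, status, updated_at "
        "FROM orders WHERE updated_at > ?",
        (last_run_at,),
    )
    rows = cursor.fetchall()
    con.close()
    return rows

# On each run, remember the high-water mark and pass it to the next run.
changed = extract_incremental("source.db", "2025-05-14T00:00:00")
```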

Transformation

Making data usable. Common transformations include (a few are sketched after the list):
- Cleaning - handling nulls, fixing data types, removing duplicates
- Standardization - consistent date formats, currency conversions, naming conventions
- Business logic - calculating metrics, applying rules, deriving new fields
- Aggregation - summarizing data for reporting
- Joining - combining data from multiple sources
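Here are a few of those transformations in one place, sketched with pandas; the column names and the EUR-to-USD rate are made up for illustration.

```python
import pandas as pd

df = pd.read_csv("orders.csv")

# Cleaning: drop duplicates and rows missing the key, fix types.
df = df.drop_duplicates(subset="order_id").dropna(subset=["order_id"])
df["order_id"] = df["order_id"].astype(int)

# Standardization: consistent dates and a single currency.
df["order_date"] = pd.to_datetime(df["order_date"]).dt.date
eur_to_usd = 1.08   # assumed static rate for the sketch
df["amount_usd"] = df.apply(
    lambda r: r["amount"] * eur_to_usd if r["currency"] == "EUR" else r["amount"],
    axis=1,
)

# Business logic + aggregation: daily revenue per order status.
daily_revenue = df.groupby(["order_date", "status"], as_index=False)["amount_usd"].sum()
```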

Loading

Getting data to its destination. Considerations include:
- Full vs incremental loads - reload everything or just changes?
- Batch vs streaming - scheduled loads or real-time?
- Upserts - handling updates to existing records (sketched below)
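Most warehouses expose some form of MERGE or upsert for that last point. A minimal sketch using SQLite's ON CONFLICT clause, with the same assumed orders table as earlier:

```python
import sqlite3

def upsert_orders(rows, db_path="warehouse.db"):
    """Insert new orders and update existing ones in place (an upsert)."""
    con = sqlite3.connect(db_path)
    con.executemany(
        """
        INSERT INTO orders (order_id, amount_usd, status)
        VALUES (:order_id, :amount_usd, :status)
        ON CONFLICT(order_id) DO UPDATE SET
            amount_usd = excluded.amount_usd,
            status     = excluded.status
        """,
        rows,
    )
    con.commit()
    con.close()

upsert_orders([{"order_id": 1, "amount_usd": 42.00, "status": "shipped"}])
```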

Orchestration

Coordinating the whole process. Tools like Apache Airflow manage (a DAG sketch follows the list):
- Scheduling (run this pipeline every hour)
- Dependencies (don't run transformation until extraction completes)
- Monitoring (alert if something fails)
- Retries (handle transient failures gracefully)
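Expressed as an Airflow DAG, that looks roughly like the sketch below. It assumes Airflow 2.4+ and a hypothetical my_pipeline module holding the extract/transform/load callables; the schedule and retry settings are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from my_pipeline import run_extract, run_transform, run_load  # hypothetical module

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2025, 5, 1),
    schedule="@hourly",                                                 # scheduling
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},   # retries
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=run_extract)
    transform = PythonOperator(task_id="transform", python_callable=run_transform)
    load = PythonOperator(task_id="load", python_callable=run_load)

    # Dependencies: transform waits for extract, load waits for transform.
    extract >> transform >> load
```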

Modern Tooling

The modern data stack has standardized around specific tools:

Ingestion: Fivetran, Airbyte, Stitch - pre-built connectors to extract from common sources

Transformation: dbt - SQL-based transformations that run in the warehouse

Orchestration: Airflow, Dagster, Prefect - workflow management

Warehouses: Snowflake, BigQuery, Redshift, Databricks - where the data lives

Why This Matters

Without ETL/ELT, your data warehouse is empty. Your dashboards have nothing to show. Your analysts can't answer questions.

The quality of your data pipeline directly impacts:
- Data freshness - how current is your data?
- Data quality - can you trust the numbers?
- Reliability - do pipelines fail frequently?
- Maintainability - can you change things without breaking everything?

Common Pitfalls

Building everything custom. Modern ingestion tools handle 90% of common sources. Don't build connectors from scratch.

No testing. Data pipelines need tests. Transformations should be validated. Assertions should catch data quality issues.
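A minimal example of the kind of assertion that belongs after a load (the table and rules are assumptions; dbt tests or a framework like Great Expectations would express the same checks declaratively):

```python
import sqlite3

def check_orders(db_path="warehouse.db"):
    """Fail loudly if basic data quality rules are violated after a load."""
    con = sqlite3.connect(db_path)

    null_keys = con.execute(
        "SELECT COUNT(*) FROM orders WHERE order_id IS NULL"
    ).fetchone()[0]
    assert null_keys == 0, f"{null_keys} orders have no order_id"

    negative = con.execute(
        "SELECT COUNT(*) FROM orders WHERE amount_usd < 0"
    ).fetchone()[0]
    assert negative == 0, f"{negative} orders have a negative amount"

    con.close()

check_orders()
```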

Poor documentation. When the person who built the pipeline leaves, can anyone else understand it?

Ignoring failures. Pipelines fail. What matters is how quickly you know and how gracefully you recover.

Data pipelines are built and maintained by data engineers. Learn about what data engineers actually do.
