What Is a Data Pipeline?
A data pipeline moves data from one place to another, transforming it along the way. Your CRM has customer data. Your website has traffic data. Your payment processor has transaction data. A data pipeline brings it all together so you can actually use it.
The Problem Pipelines Solve
Data lives everywhere. Sales uses Salesforce. Marketing uses HubSpot. Finance uses QuickBooks. Operations uses a custom app. Each system is a silo - useful on its own, but isolated.
Want to know which marketing campaigns generate the most valuable customers? You need marketing data (campaign source) connected to sales data (deal value) connected to product data (usage patterns). That connection is what a pipeline builds.
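To make that concrete, here is a toy sketch in Python using pandas. The campaign and deal data are made up, and the shared customer_id key is an assumption; in practice, most of a pipeline's work is getting data into this joinable shape.

```python
import pandas as pd

# Toy stand-ins for two silos: marketing's campaign attribution and sales' closed deals.
campaigns = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "campaign_source": ["webinar", "paid_search", "webinar"],
})
deals = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "deal_value": [12_000, 3_000, 8_000],
})

# The "connection": join on the shared customer key, then aggregate.
value_by_campaign = (
    deals.merge(campaigns, on="customer_id")
         .groupby("campaign_source")["deal_value"]
         .sum()
         .sort_values(ascending=False)
)
print(value_by_campaign)  # webinar: 20000, paid_search: 3000
```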
How Pipelines Work
A typical data pipeline has three stages:
Extract - Pull data from source systems. This might be API calls to Salesforce, database queries to your product, or file downloads from a vendor. The goal is getting raw data out.
Transform - Clean and reshape the data. Standardize date formats. Remove duplicates. Calculate derived metrics. Join data from different sources. This is where raw data becomes useful data.
Load - Put the transformed data somewhere it can be used. Usually this is a data warehouse (Snowflake, BigQuery, Redshift) where analysts and dashboards can access it.
This is the classic "ETL" pattern - Extract, Transform, Load. Modern approaches sometimes do "ELT" - load raw data first, transform it in the warehouse. The principle is the same.
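As a rough illustration, here is a minimal ETL script in Python. The CSV export, its column names, and the SQLite file standing in for a warehouse are all assumptions for the sketch; a real pipeline would call APIs and load into Snowflake, BigQuery, or Redshift, but the shape is the same.

```python
import csv
import sqlite3
from datetime import datetime

def extract(path):
    """Extract: pull raw rows out of a source export (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: standardize dates, drop duplicates, derive a revenue metric."""
    seen, cleaned = set(), []
    for row in rows:
        if row["order_id"] in seen:  # remove duplicates
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": row["order_id"],
            # assumes the source exports dates as MM/DD/YYYY
            "order_date": datetime.strptime(row["date"], "%m/%d/%Y").date().isoformat(),
            "revenue": round(float(row["amount"]) - float(row["discount"]), 2),
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the cleaned rows somewhere analysts and dashboards can query."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT PRIMARY KEY, order_date TEXT, revenue REAL)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO orders VALUES (:order_id, :order_date, :revenue)",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders_export.csv")))
```

Swapping SQLite for a real warehouse changes the load step, not the structure; an ELT variant would load the raw rows first and run the transform as SQL inside the warehouse.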
Types of Pipelines
Batch pipelines run on a schedule - hourly, daily, weekly. Good for data that doesn't need to be real-time. Most business reporting works fine with data that's a few hours old.
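A batch job is usually just a script plus a scheduler. A minimal sketch, assuming a run_nightly_batch function that wraps the extract-transform-load steps:

```python
from datetime import datetime

def run_nightly_batch():
    """Placeholder for extract -> transform -> load, e.g. the ETL sketch above."""
    print(f"[{datetime.now().isoformat()}] nightly batch finished")

# In production the scheduling lives outside the script, e.g. a cron entry
# that runs it at 6:00 every morning:
#   0 6 * * *  /usr/bin/python3 /opt/pipelines/nightly_batch.py
# or an orchestrator like Airflow or Dagster that also handles retries and alerting.
if __name__ == "__main__":
    run_nightly_batch()
```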
Streaming pipelines process data continuously as it arrives. Good for real-time dashboards, fraud detection, or operational alerts. More complex and expensive to build and maintain.
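Here is a small streaming sketch using the kafka-python client. The payments topic, the event fields, and the fraud threshold are invented for illustration:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Consume events continuously as they arrive and react immediately,
# e.g. flag suspiciously large payments for review.
consumer = KafkaConsumer(
    "payments",                        # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:  # blocks forever, yielding events as they stream in
    event = message.value
    if event.get("amount", 0) > 10_000:
        print(f"Review payment {event.get('payment_id')}: amount {event['amount']}")
```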
Reverse ETL pipelines push data from your warehouse back out to operational systems - for example, syncing a customer health score from the warehouse into Salesforce so sales reps can see it.
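A sketch of that sync, using the simple-salesforce client; the warehouse table, the Health_Score__c custom field, and the credentials are all placeholders:

```python
import sqlite3
from simple_salesforce import Salesforce  # pip install simple-salesforce

# Read the latest scores out of the warehouse (SQLite stands in here)...
con = sqlite3.connect("warehouse.db")
scores = con.execute(
    "SELECT salesforce_account_id, health_score FROM customer_health"
).fetchall()
con.close()

# ...and push them back into Salesforce so reps see them on the account record.
sf = Salesforce(username="...", password="...", security_token="...")
for account_id, score in scores:
    # Health_Score__c is a hypothetical custom field on the Account object.
    sf.Account.update(account_id, {"Health_Score__c": score})
```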
What Can Go Wrong
Source systems change. An API updates, a field gets renamed, a new required parameter appears. Suddenly your pipeline breaks at 2 AM on a Saturday.
Data quality issues. Nulls where there shouldn't be nulls. Duplicates that shouldn't exist. Invalid values that break downstream logic. Bad data in, bad data out.
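A basic quality gate can catch these before they reach the warehouse. The specific checks and field names below are illustrative, reusing the order fields from the earlier sketch:

```python
def validate(rows):
    """Separate rows that would break downstream logic from rows safe to load."""
    seen, good, bad = set(), [], []
    for row in rows:
        problems = []
        if not row.get("order_id"):
            problems.append("missing order_id")
        elif row["order_id"] in seen:
            problems.append("duplicate order_id")
        if row.get("revenue") is None:
            problems.append("null revenue")
        elif float(row["revenue"]) < 0:
            problems.append("negative revenue")
        if problems:
            bad.append((row, problems))  # quarantine for review instead of loading
        else:
            seen.add(row["order_id"])
            good.append(row)
    return good, bad
```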
Scaling problems. A pipeline that works for 10,000 records might choke on 10 million. Performance issues tend to appear suddenly when you cross a threshold.
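One common mitigation is to process data in chunks instead of loading everything into memory at once. A pandas sketch, with an arbitrary chunk size and the same hypothetical orders file:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("warehouse.db")

# Stream the file through in 50,000-row chunks rather than reading it all at once;
# the right chunk size depends on row width and available memory.
for chunk in pd.read_csv("orders_export.csv", chunksize=50_000):
    chunk = chunk.drop_duplicates(subset="order_id")
    chunk.to_sql("orders", con, if_exists="append", index=False)

con.close()
```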
Schema drift. The shape of your data changes over time. New fields get added, old fields get removed. Pipelines need to handle this gracefully.
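One simple defense is an explicit schema check: tolerate new columns, but fail loudly the moment a required one disappears. The column names here are illustrative:

```python
EXPECTED = {"order_id", "date", "amount"}   # columns the transform relies on
OPTIONAL = {"discount", "coupon_code"}      # known but not required

def check_schema(row):
    """Raise on missing required columns; log unrecognized ones instead of failing."""
    missing = EXPECTED - row.keys()
    if missing:
        raise ValueError(f"Source schema changed: missing columns {sorted(missing)}")
    unknown = row.keys() - EXPECTED - OPTIONAL
    if unknown:
        # New columns are logged, not fatal - someone can decide whether to map them.
        print(f"Ignoring unrecognized columns: {sorted(unknown)}")
```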
Dependency failures. Your pipeline depends on an upstream system being available. If that system has downtime, your pipeline can't run.
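Retries with backoff soften short outages; they won't save you from a prolonged one, but they stop every blip from becoming a failed run. A sketch using the requests library against a placeholder URL:

```python
import time
import requests  # pip install requests

def fetch_with_retry(url, attempts=5, base_delay=2):
    """Retry transient upstream failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure so the scheduler can alert
            delay = base_delay * 2 ** attempt
            print(f"Upstream call failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# data = fetch_with_retry("https://api.example.com/v1/orders")
```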
Build vs Buy
There are two broad approaches to building pipelines:
Managed tools (Fivetran, Airbyte, Stitch) handle the extraction for common sources. They maintain connectors to hundreds of APIs so you don't have to. You pay for convenience and reliability.
Custom code (Python, Spark, dbt) gives you full control. You write exactly what you need. You also maintain it forever.
Most organizations use both. Managed tools for standard connectors (Salesforce, Google Analytics, Stripe). Custom code for unique sources or complex transformations.
Signs You Need Better Pipelines
- Reports are based on manual exports and spreadsheets
- "Where did this number come from?" is a common question
- Different teams have different numbers for the same metric
- Data is always stale by the time you see it
- An engineer spends significant time fixing data issues
Getting Started
Start with your most painful data problem. Is it connecting marketing to sales? Understanding customer behavior? Financial reporting? Pick one high-value use case, build a pipeline that solves it, and expand from there.
Don't try to connect everything on day one. Every source you add is complexity you have to maintain.
Once your data is flowing, you need somewhere to put it. Learn about data warehouses and ETL vs ELT.