What Is a Data Stack?
A data stack is the collection of tools and technologies you use to collect, store, transform, and analyze data. It's the infrastructure that makes analytics possible.
The Layers
Think of a data stack as layers, each handling a specific job:
1. Data Sources - Where data originates. Your CRM, website, payment processor, marketing tools, product database. Not part of the stack itself, but the inputs.
2. Ingestion - Moving data from sources into your central storage. ETL/ELT tools, API connectors, streaming platforms.
3. Storage - Where data lives once collected. Data warehouses, data lakes, databases.
4. Transformation - Cleaning and reshaping data for analysis. dbt, SQL, Python scripts.
5. Analysis/BI - Where insights happen. BI tools, SQL clients, notebooks.
6. Orchestration - Coordinating everything. Scheduling jobs, handling dependencies, monitoring.
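To make the layers concrete, here is a toy sketch that maps each one to a small Python function. It is an illustration under stated assumptions, not a recommended architecture: the `RAW_PAYMENTS` records, the table names, and the use of an in-memory SQLite database as a stand-in for a warehouse are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, standing in for an export from a payment processor.
RAW_PAYMENTS = [
    {"id": 1, "customer": "acme", "amount_cents": 1250, "status": "paid"},
    {"id": 2, "customer": "acme", "amount_cents": 800, "status": "refunded"},
    {"id": 3, "customer": "globex", "amount_cents": 4300, "status": "paid"},
]


def ingest(conn, records):
    """Ingestion: copy source records into a raw table, unchanged."""
    conn.execute(
        "CREATE TABLE raw_payments "
        "(id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_payments VALUES (:id, :customer, :amount_cents, :status)",
        records,
    )


def transform(conn):
    """Transformation: reshape raw data into an analysis-ready table."""
    conn.execute(
        """
        CREATE TABLE revenue_by_customer AS
        SELECT customer, SUM(amount_cents) / 100.0 AS revenue
        FROM raw_payments
        WHERE status = 'paid'
        GROUP BY customer
        """
    )


def analyze(conn):
    """Analysis: the kind of query a BI tool or notebook would run."""
    rows = conn.execute(
        "SELECT customer, revenue FROM revenue_by_customer ORDER BY customer"
    ).fetchall()
    return dict(rows)


def run_pipeline():
    """Orchestration: run the steps in dependency order."""
    conn = sqlite3.connect(":memory:")  # Storage: in-memory stand-in for a warehouse
    ingest(conn, RAW_PAYMENTS)
    transform(conn)
    result = analyze(conn)
    conn.close()
    return result


if __name__ == "__main__":
    print(run_pipeline())
```

In a real stack each function would be a separate tool, but the division of responsibilities stays the same: raw data lands untouched, business logic lives in the transformation step, and analysis reads only the transformed tables.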
Common Tools by Layer
Ingestion:
- Fivetran, Airbyte, Stitch (managed connectors)
- Custom scripts for proprietary systems
- Kafka, Kinesis (streaming)
Storage:
- Snowflake, BigQuery, Redshift (cloud warehouses)
- Databricks (lakehouse)
- S3/GCS (raw storage)
Transformation:
- dbt (SQL-based, most popular)
- Dataform (Google)
- Custom Python/Spark
Analysis:
- Looker, Tableau, Power BI (enterprise BI)
- Metabase, Mode (mid-market)
- Jupyter notebooks (data science)
Orchestration:
- Airflow (open source standard)
- Dagster, Prefect (newer alternatives)
- Cloud-native options (AWS Step Functions, etc.)
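Whichever orchestrator you choose, the core idea is the same: jobs declare what they depend on, and the tool works out a safe run order. The sketch below shows that idea using only Python's standard-library `graphlib` module (Python 3.9+). The job names and the `run_all` helper are hypothetical, and this is not how Airflow, Dagster, or Prefect are implemented or configured.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
JOBS = {
    "ingest_crm": set(),
    "ingest_payments": set(),
    "transform_revenue": {"ingest_crm", "ingest_payments"},
    "refresh_dashboard": {"transform_revenue"},
}


def execution_order(jobs):
    """Return job names in an order where every dependency runs first."""
    return list(TopologicalSorter(jobs).static_order())


def run_all(jobs, runner):
    """Run each job via runner(name) in dependency order; stop at the first failure."""
    completed = []
    for name in execution_order(jobs):
        try:
            runner(name)
        except Exception as exc:
            # Remaining jobs are skipped so they never read stale or partial data.
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None
```

Stopping at the first failure is the simplest policy. Real orchestrators typically add retries, alerting, and the ability to keep running branches that do not depend on the failed job.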
Stack Evolution
Early stage (startup):
- Data sources → Google Sheets
- Maybe a basic BI tool connected to production database
- Manual exports and imports
- Works until it doesn't
Growth stage:
- Cloud warehouse (Snowflake/BigQuery)
- Managed ingestion (Fivetran)
- dbt for transformations
- BI tool for dashboards
- This handles most analytics needs
Scale stage:
- Multiple data sources, high volume
- Sophisticated orchestration
- Data quality monitoring
- Possibly real-time streaming
- Data governance and catalogs
Don't over-engineer early. Start simple, add complexity as needed.
The Integration Challenge
The hard part isn't picking tools. It's making them work together:
Data contracts - What format does data arrive in? What happens when it changes?
Lineage - Where did this number come from? Can you trace it back?
Consistency - Same business logic applied everywhere?
Monitoring - How do you know when something breaks?
A beautifully architected stack that's poorly integrated is worse than a simple stack that actually works.
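As one illustration of the data contract question, a contract can start as nothing more than a declared set of fields and types that every incoming record is checked against. The sketch below is a minimal version of that idea. The `ORDERS_CONTRACT` fields and the `contract_violations` helper are hypothetical names, and production setups usually rely on a schema tool rather than hand-written checks.

```python
# Hypothetical contract for an "orders" feed: field name -> expected Python type.
ORDERS_CONTRACT = {
    "order_id": int,
    "customer_email": str,
    "amount_cents": int,
}


def contract_violations(record, contract):
    """Return a list of readable problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # Unexpected fields often signal an upstream schema change worth reviewing.
    for field in record:
        if field not in contract:
            problems.append(f"unexpected field: {field}")
    return problems
```

Running a check like this at ingestion time answers "what happens when it changes?" with a visible failure at the boundary instead of a silently wrong dashboard.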
Build vs Buy
For most companies:
Buy: Warehouse, ingestion, BI tools. These are commoditized. No advantage to building.
Build: Custom transformations (your business logic), integrations to proprietary systems.
Evaluate: Orchestration, data quality, catalogs - depends on complexity.
Common Mistakes
Over-architecting early. You don't need Kafka when you have 10,000 daily events. Start simple.
Too many tools. Every tool is a maintenance burden. Consolidate where possible.
Ignoring data quality. The fanciest stack means nothing if the data is wrong.
No documentation. Stack complexity grows. Without documentation, knowledge lives in people's heads. That's fragile.
Vendor lock-in. Consider exit costs when choosing tools. How hard would it be to switch?
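To put the 10,000 daily events figure from the first mistake in perspective, a quick back-of-the-envelope calculation helps. The sketch below converts a daily volume into events per second. The 20x peak multiplier is an assumption chosen for illustration, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(daily_events):
    """Average rate if events were spread evenly across the day."""
    return daily_events / SECONDS_PER_DAY


def peak_events_per_second(daily_events, peak_factor):
    """Rough peak rate; peak_factor is an assumed multiplier over the average."""
    return average_events_per_second(daily_events) * peak_factor


if __name__ == "__main__":
    print(f"{average_events_per_second(10_000):.3f} events/s on average")
    print(f"{peak_events_per_second(10_000, peak_factor=20):.2f} events/s at an assumed 20x peak")
```

10,000 events a day averages out to roughly 0.12 events per second, and even an assumed 20x peak is only about 2.3 per second. Volumes like that rarely justify a streaming platform on throughput grounds alone.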
Evaluating Your Stack
Questions to ask:
- Can we answer the business questions we need to?
- How long does it take to add a new data source?
- Do we trust the numbers?
- Can we maintain this with our team?
- What's our monthly spend, and does it scale reasonably?
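One way to turn "Do we trust the numbers?" into something testable is a reconciliation check: compute the same metric at the source and in the warehouse, then compare. The sketch below assumes you can already obtain both totals. The `reconcile` helper and its 0.1% default tolerance are hypothetical choices, not an established standard.

```python
def reconcile(source_total, warehouse_total, tolerance=0.001):
    """Return True if the warehouse total is within tolerance of the source total.

    tolerance is the allowed relative difference (0.001 = 0.1%), an assumed default.
    """
    if source_total == 0:
        # Avoid dividing by zero: only an exact match is acceptable.
        return warehouse_total == 0
    relative_diff = abs(source_total - warehouse_total) / abs(source_total)
    return relative_diff <= tolerance
```

For example, `reconcile(100_000, 100_050)` passes because the difference is 0.05%, while `reconcile(1000, 990)` fails because the difference is 1%.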
Getting Started
If you're building from scratch:
1. Pick a cloud warehouse - Snowflake or BigQuery are safe choices
2. Set up basic ingestion - Fivetran or Airbyte for common sources
3. Start with dbt - For transformations, it's the standard
4. Choose a BI tool - Metabase (free open-source edition) or Looker/Tableau (paid)
5. Add orchestration when needed - Airflow when complexity demands it
This gives you a solid foundation that scales.
Your stack supports your analytics. Learn about data warehouses and data pipelines.