Data Quality and Observability (Intermediate)
What is Data Lineage and how do you track it?
Data Lineage
Data lineage describes the complete flow of data from its source to destination, including every transformation step along the way.
Why It Matters
- Impact analysis: Quickly identify which downstream reports or models are affected when an upstream table changes
- Root cause analysis: Trace back exactly which step introduced a data anomaly
- Compliance and audit: Regulations like GDPR require organizations to document how personal data is processed; lineage shows where that data flows and which systems hold it
- Trust building: Helps data consumers understand where data comes from, improving confidence
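Impact analysis and root cause analysis are both graph traversals over the same lineage data, just in opposite directions. The following is a minimal sketch, assuming table-level lineage stored as an adjacency map; the table names and the `EDGES` structure are hypothetical and not taken from any particular tool.

```python
from collections import deque

# Hypothetical table-level lineage: each key feeds the tables listed as its value.
EDGES = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue", "marts.churn_features"],
    "marts.revenue": ["dashboard.weekly_sales"],
}


def downstream(edges, start):
    """Impact analysis: every asset reachable from `start` (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(edges, start):
    """Root cause analysis: every asset that `start` depends on."""
    # Reverse the edges, then reuse the same traversal.
    reversed_edges = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream(reversed_edges, start)
```

Calling `downstream(EDGES, "raw.customers")` lists everything that could break if that table changes, while `upstream(EDGES, "dashboard.weekly_sales")` lists every place an anomaly in that dashboard could have originated.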
Lineage Granularity
| Level | Description | Tools |
|---|---|---|
| Column-level | Tracks the origin of each field | OpenLineage (column lineage facet), dbt Cloud |
| Table-level | Tracks dependencies between tables | dbt Core, Apache Atlas, Amundsen |
| Job-level | Tracks inputs/outputs of pipeline jobs | Airflow, Marquez |
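The levels nest: finer-grained lineage can always be rolled up into coarser lineage, but not the other way around. Here is a rough sketch of that roll-up, assuming columns are identified as `schema.table.column` strings; the `COLUMN_LINEAGE` mapping and its names are invented for illustration.

```python
# Hypothetical column-level lineage: output column -> input columns it derives from.
COLUMN_LINEAGE = {
    "marts.revenue.total_amount": ["staging.orders.amount", "staging.orders.tax"],
    "marts.revenue.customer_name": ["staging.customers.name"],
}


def table_of(column):
    """Strip the trailing column name: 'schema.table.col' -> 'schema.table'."""
    return column.rsplit(".", 1)[0]


def to_table_level(column_lineage):
    """Collapse column-level edges into table-level dependencies."""
    tables = {}
    for target, sources in column_lineage.items():
        deps = tables.setdefault(table_of(target), set())
        deps.update(table_of(source) for source in sources)
    return tables
```

The roll-up discards detail: at table level, `marts.revenue` depends on `staging.customers`, but only the column-level view shows that a change to `staging.customers.name` leaves `total_amount` untouched.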
Implementation Approaches
Manual documentation: Declare sources and describe models and columns in dbt YAML files
Automatic capture: Use the OpenLineage standard to have Spark and Airflow automatically report lineage to Marquez or DataHub
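To make automatic capture concrete, the sketch below builds by hand the kind of JSON run event that an OpenLineage integration emits. It follows the core fields of the OpenLineage run event (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`); the helper name `build_run_event`, the namespace, the producer URL, and the dataset names are assumptions for illustration, and the `schemaURL` should match whichever spec version you target.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs,
                    run_id=None, namespace="example_namespace"):
    """Build a dict shaped like an OpenLineage run event (hypothetical helper)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # START and COMPLETE events for the same run should share one runId.
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        "producer": "https://example.com/hypothetical-producer",  # placeholder
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


event = build_run_event(
    "COMPLETE", "daily_orders_load",
    inputs=["raw.orders"], outputs=["staging.orders"],
)
payload = json.dumps(event, indent=2)  # body that would be sent to a lineage backend
```

In practice you rarely write these events yourself: the Spark and Airflow integrations generate them and send them to the configured backend, which stitches the inputs and outputs of successive jobs into a lineage graph.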
Example (dbt): dbt builds a lineage graph (DAG) automatically from the ref() and source() calls in each model, and the dbt docs site renders it as an interactive graph.
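The idea behind the dbt example can be illustrated with a toy parser that extracts ref() and source() calls from model SQL and turns them into dependency edges. This is only a sketch under simplifying assumptions: dbt itself resolves these calls by rendering Jinja rather than by regex matching, and the model names and SQL below are made up.

```python
import re

# Match {{ ref('model') }} and {{ source('src', 'table') }} with either quote style.
REF = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")
SOURCE = re.compile(
    r"\{\{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)

# Hypothetical dbt-style models: model name -> SQL text.
MODELS = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "stg_customers": "select * from {{ source('shop', 'customers') }}",
    "fct_revenue": """
        select c.name, sum(o.amount) as total
        from {{ ref('stg_orders') }} o
        join {{ ref('stg_customers') }} c on o.customer_id = c.id
        group by c.name
    """,
}


def parents(sql):
    """Return the upstream models and sources referenced by one model."""
    refs = REF.findall(sql)
    sources = [f"source:{src}.{table}" for src, table in SOURCE.findall(sql)]
    return sorted(refs + sources)


# Model name -> list of direct upstream dependencies.
LINEAGE = {name: parents(sql) for name, sql in MODELS.items()}
```

Because the dependencies are declared in the code itself, the graph cannot drift out of date the way hand-maintained documentation can, which is the main argument for this approach.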
