What is Data Lineage and how do you track it?

Data Lineage

Data lineage describes the complete flow of data from its source to destination, including every transformation step along the way.

Why It Matters

  • Impact analysis: Quickly identify which downstream reports or models are affected when an upstream table changes
  • Root cause analysis: Trace back exactly which step introduced a data anomaly
  • Compliance and audit: Regulations like GDPR require organizations to document how personal data is processed, and lineage shows where that data flows and which systems hold it
  • Trust building: Helps data consumers understand where data comes from, improving confidence
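Impact analysis and root cause analysis are both graph traversals over the same lineage edges, walked in opposite directions. The sketch below assumes a small, hypothetical set of table-level edges (the table names and the helper names `build_graph` and `reachable` are invented for illustration) and is not tied to any particular lineage tool.

```python
from collections import defaultdict, deque

# Hypothetical table-level lineage edges: (upstream, downstream).
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "marts.revenue"),
    ("staging.customers", "marts.revenue"),
    ("marts.revenue", "dashboard.weekly_sales"),
]


def build_graph(edges):
    """Index the edges in both directions."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, neighbors):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


downstream, upstream = build_graph(EDGES)

# Impact analysis: what is affected if raw.orders changes?
impacted = reachable("raw.orders", downstream)

# Root cause analysis: which upstream tables could have introduced
# an anomaly seen in marts.revenue?
suspects = reachable("marts.revenue", upstream)

print(sorted(impacted))
print(sorted(suspects))
```

Keeping both a downstream and an upstream index means either question is answered with the same traversal function.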

Lineage Granularity

Level        | Description                           | Tools
Column-level | Tracks the origin of each field       | dbt, OpenLineage
Table-level  | Tracks dependencies between tables    | Apache Atlas, Amundsen
Job-level    | Tracks inputs/outputs of pipeline jobs | Airflow, Marquez
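The levels are related: finer-grained lineage can always be collapsed into a coarser view, but not the other way around. As a rough illustration, the sketch below assumes a hypothetical mapping from output columns to source columns (names written as `schema.table.column`) and derives table-level lineage from it. The data and the helper names are invented for this example.

```python
# Hypothetical column-level lineage: output column -> set of source columns.
COLUMN_LINEAGE = {
    "marts.revenue.total_amount": {"staging.orders.amount", "staging.orders.tax"},
    "marts.revenue.customer_name": {"staging.customers.name"},
}


def table_of(column):
    """Strip the column part: 'schema.table.column' -> 'schema.table'."""
    return column.rsplit(".", 1)[0]


def to_table_level(column_lineage):
    """Collapse column-level lineage into table -> set of upstream tables."""
    table_lineage = {}
    for target, sources in column_lineage.items():
        upstream_tables = table_lineage.setdefault(table_of(target), set())
        upstream_tables.update(table_of(src) for src in sources)
    return table_lineage


table_lineage = to_table_level(COLUMN_LINEAGE)
print(table_lineage)
```

The reverse derivation is impossible because table-level lineage no longer records which fields fed which, which is why column-level capture has to happen at the source.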

Implementation Approaches

Manual documentation: Describe sources and dependencies in dbt model YAML files

Automatic capture: Use the OpenLineage standard to have Spark and Airflow automatically report lineage to Marquez or DataHub
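To make automatic capture more concrete: an OpenLineage integration emits run events that name a job, a run, and the job's input and output datasets. The sketch below builds a simplified event as a plain dictionary using only the standard library. It is not the OpenLineage client API. The function name `make_run_event`, the namespace, the producer URL, and the dataset names are all hypothetical, and real events carry additional metadata (such as `schemaURL` and facets) that is omitted here.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs, namespace="example-namespace"):
    """Build a simplified, OpenLineage-shaped run event (illustrative only)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical producer identifier for this sketch.
        "producer": "https://example.com/hypothetical-producer",
    }


event = make_run_event(
    "COMPLETE",
    job_name="daily_revenue",
    inputs=["staging.orders", "staging.customers"],
    outputs=["marts.revenue"],
)
print(json.dumps(event, indent=2))
```

A lineage backend that receives a stream of such events can reconstruct the job-level graph by linking each job's inputs to its outputs.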

Example (dbt): dbt automatically builds a lineage graph from ref() and source() calls, visualized in the dbt docs interface.
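The idea behind this can be illustrated with a toy parser. dbt itself resolves ref() and source() while rendering Jinja templates. The sketch below only approximates that with regular expressions, and the model names and SQL are hypothetical.

```python
import re

# Match {{ ref('model') }} and {{ source('source_name', 'table') }}.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")
SOURCE_PATTERN = re.compile(
    r"\{\{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)

# Hypothetical models: model name -> SQL text.
MODELS = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "stg_customers": "select * from {{ source('shop', 'customers') }}",
    "revenue": (
        "select c.name, sum(o.amount) as total "
        "from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id "
        "group by c.name"
    ),
}


def parents(sql):
    """Return the sorted upstream models and sources referenced in the SQL."""
    refs = REF_PATTERN.findall(sql)
    sources = [f"{src}.{table}" for src, table in SOURCE_PATTERN.findall(sql)]
    return sorted(refs + sources)


lineage = {name: parents(sql) for name, sql in MODELS.items()}
for name, deps in lineage.items():
    print(f"{name} <- {deps}")
```

Because dependencies are declared inside the SQL itself, the lineage graph cannot drift out of date the way separately maintained documentation can.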

Copyright © 2026 Wood All Rights Reserved · FE Interview Hub