Data Quality Monitoring
Explain data quality monitoring methods in data pipelines.
Why Data Quality Matters
"Garbage in, garbage out." Downstream analytics, reports, and ML models depend on high-quality input. Data quality issues are often hard to detect and can silently corrupt decisions.
Common Data Quality Dimensions
- Completeness: Are required fields populated, or do critical columns contain NULLs?
- Uniqueness: Are primary keys duplicated?
- Timeliness: Does data arrive on schedule?
- Consistency: Is data consistent across systems?
- Validity: Are values within allowed ranges (e.g., date formats, enum values)?
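The sketch below shows how several of these dimensions can be measured directly on a pandas DataFrame. It is a minimal illustration: the column names (order_id, customer_id, status, loaded_at), the allowed status values, and the 24-hour freshness window are assumptions, not part of the original answer.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, max_lag_hours: float = 24.0) -> dict:
    """Compute simple metrics for completeness, uniqueness, timeliness, and validity."""
    now = pd.Timestamp.now(tz="UTC")  # assumes loaded_at is tz-aware UTC
    return {
        # Completeness: fraction of NULLs in a critical column
        "null_rate_customer_id": df["customer_id"].isna().mean(),
        # Uniqueness: duplicated primary keys beyond the first occurrence
        "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
        # Timeliness: hours since the newest record was loaded
        "hours_since_last_load": (now - df["loaded_at"].max()).total_seconds() / 3600,
        # Validity: values outside the allowed enum (hypothetical status set)
        "invalid_status_count": int((~df["status"].isin({"new", "paid", "shipped"})).sum()),
    }
```

In practice these metrics would be computed per batch and persisted, so trends such as a slowly rising null rate become visible over time rather than only at the moment they breach a threshold.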
Tools
dbt Tests: Declare tests on models in schema YAML (not_null, unique, accepted_values, relationships); run them with `dbt test`, or as part of `dbt build` so models and their tests execute together.
Great Expectations: Advanced Python framework for complex expectations (value distributions, inter-column relationships); generates data documentation.
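As a rough illustration of the Great Expectations style, the sketch below uses the legacy pandas-backed API (`great_expectations.from_pandas`); newer GX releases (1.x) use a context/validator workflow instead. The sample data and column names are made up.

```python
import great_expectations as ge
import pandas as pd

raw = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [19.99, -5.0, 42.0],
    "status": ["paid", "unknown", "shipped"],
})
df = ge.from_pandas(raw)  # wraps the DataFrame with expectation methods

df.expect_column_values_to_not_be_null("order_id")                           # completeness
df.expect_column_values_to_be_unique("order_id")                             # uniqueness
df.expect_column_values_to_be_between("amount", min_value=0)                 # validity: range
df.expect_column_values_to_be_in_set("status", {"new", "paid", "shipped"})   # validity: enum

results = df.validate()       # runs all expectations declared above
print(results.success)        # overall pass/fail; exact result shape varies by GX version
```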
Monitoring Alerts: Set threshold alerts (e.g., null rate > 5%, row count drops unexpectedly) and proactively notify on anomalies.
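A minimal, hand-rolled sketch of such threshold alerting is shown below; the specific thresholds, metric names, and the `send_alert` stub are assumptions for illustration rather than the API of any particular monitoring tool.

```python
def check_thresholds(metrics: dict, expected_rows: int) -> list[str]:
    """Compare current batch metrics against simple static thresholds."""
    alerts = []
    if metrics["null_rate_customer_id"] > 0.05:
        alerts.append(f"Null rate {metrics['null_rate_customer_id']:.1%} exceeds 5%")
    if metrics["row_count"] < 0.5 * expected_rows:
        alerts.append(f"Row count {metrics['row_count']} is below 50% of expected {expected_rows}")
    if metrics["hours_since_last_load"] > 24:
        alerts.append(f"Data is {metrics['hours_since_last_load']:.1f}h old (SLA: 24h)")
    return alerts

def send_alert(messages: list[str]) -> None:
    # Stub: in practice, post to Slack, PagerDuty, or email instead of printing.
    for msg in messages:
        print(f"[DATA QUALITY ALERT] {msg}")

alerts = check_thresholds(
    {"null_rate_customer_id": 0.08, "row_count": 1200, "hours_since_last_load": 3.0},
    expected_rows=5000,
)
if alerts:
    send_alert(alerts)
```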
