How do you implement anomaly detection in a data pipeline?
Types of Data Anomalies
Volume anomalies: Sudden increase or decrease in row counts (e.g., daily orders dropping from 10,000 to 100)
Distribution anomalies: Metric distributions deviating from historical patterns (e.g., average order value doubling overnight)
Freshness anomalies: Data updates delayed beyond expected SLA
Schema anomalies: Column type changes, new columns appearing, or columns disappearing
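Freshness and schema anomalies can be checked without any statistics. The sketch below is one possible approach, not a reference implementation: the function names `is_stale` and `schema_diff` are hypothetical, and it assumes the pipeline can supply a last-update timestamp and a column-name-to-type mapping.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime, now: datetime, sla: timedelta) -> bool:
    """Return True when the most recent update is older than the SLA allows."""
    return now - last_updated > sla


def schema_diff(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare column-name -> type mappings and report what changed."""
    shared = expected.keys() & observed.keys()
    return {
        # Columns that disappeared.
        "missing": sorted(expected.keys() - observed.keys()),
        # Columns that newly appeared.
        "added": sorted(observed.keys() - expected.keys()),
        # Columns present in both whose type differs.
        "type_changed": sorted(c for c in shared if expected[c] != observed[c]),
    }


# Illustrative values only: a table expected to refresh every 24 hours.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
stale = is_stale(datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc), now, timedelta(hours=24))
diff = schema_diff(
    {"id": "int", "amount": "float", "ts": "timestamp"},
    {"id": "int", "amount": "str", "note": "str"},
)
```

Here `stale` is true because the last update is 36 hours old, and `diff` reports `ts` as missing, `note` as added, and `amount` as type-changed.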
Detection Methods
Rule-based: static thresholds that are simple and transparent, for example:
- Alert when row count < 1,000
- Alert when NULL rate > 5%
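The two rules above can be sketched as a single check. This is a minimal illustration, assuming the row count and NULL count for a batch are already available; `rule_based_alerts` and its parameter names are hypothetical, and the default thresholds are the ones listed above.

```python
def rule_based_alerts(
    row_count: int,
    null_count: int,
    min_rows: int = 1_000,
    max_null_rate: float = 0.05,
) -> list[str]:
    """Return a human-readable alert for every static threshold that is violated."""
    alerts = []
    if row_count < min_rows:
        alerts.append(f"row count {row_count} is below {min_rows}")
    # An empty batch has no meaningful NULL rate; the row-count rule already covers it.
    null_rate = null_count / row_count if row_count else 0.0
    if null_rate > max_null_rate:
        alerts.append(f"NULL rate {null_rate:.1%} exceeds {max_null_rate:.0%}")
    return alerts


print(rule_based_alerts(row_count=100, null_count=1))
print(rule_based_alerts(row_count=10_000, null_count=600))
```

Returning a list of messages rather than raising on the first failure lets a single run report every violated rule.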
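Statistical: thresholds derived from the historical behavior of the data:
- Z-score: flag values that lie more than a chosen number of standard deviations (commonly 3) from the historical mean
- IQR (Interquartile Range): flag outliers that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
- Moving averages: flag values that depart sharply from a rolling average, which surfaces trend anomalies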
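The three statistical methods can be sketched with Python's standard `statistics` module. This is an illustrative sketch under stated assumptions: the function names, the 1.5 IQR multiplier, and the 50% trend tolerance are choices made for the example, and the sample history is invented.

```python
import math
import statistics


def z_score(history: list[float], value: float) -> float:
    """Number of sample standard deviations between `value` and the historical mean.

    Needs at least two historical points, because it uses the sample stdev.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A constant history: any different value is infinitely far away.
        return 0.0 if value == mean else math.copysign(math.inf, value - mean)
    return (value - mean) / stdev


def iqr_bounds(history: list[float], k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences; values outside them are treated as outliers."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def deviates_from_trend(values: list[float], window: int, tolerance: float = 0.5) -> bool:
    """True when the latest value differs from the mean of the preceding
    `window` values by more than `tolerance` (a fraction of that mean)."""
    if len(values) < window + 1:
        raise ValueError("need at least window + 1 values")
    baseline = statistics.fmean(values[-window - 1:-1])
    return abs(values[-1] - baseline) > tolerance * abs(baseline)


# Invented daily order counts, followed by a day that drops to 100.
history = [10_000, 10_200, 9_800, 10_100, 9_900]
print(abs(z_score(history, 100)) > 3)
print(iqr_bounds(history))
print(deviates_from_trend(history + [100], window=5))
```

The z-score assumes roughly symmetric data, while the IQR fences are less sensitive to extreme values already present in the history, so the two can disagree on skewed metrics.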
ML-based: machine learning models that learn normal patterns automatically; tools such as Monte Carlo Data and Anomalo take this approach.
Tool Comparison
| Tool | Characteristics |
|---|---|
| dbt tests | Lightweight, good for SQL rule checks |
| Great Expectations | Rich expectation library, CI/CD support |
| Monte Carlo | SaaS, ML-driven automatic anomaly detection |
| Soda Core | Open-source, declarative SodaCL syntax |
Best Practice
Place quality gates at key checkpoints throughout the pipeline so that anomalous data is stopped before it flows downstream, which avoids the "garbage in, garbage out" problem.
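A quality gate can be as simple as a function that raises when any check fails, so the pipeline halts instead of publishing bad data. The sketch below is hypothetical throughout: `QualityGateError`, `quality_gate`, `run_pipeline`, the `amount` field, and the thresholds are all assumptions made for illustration.

```python
class QualityGateError(Exception):
    """Raised to stop the pipeline when a checkpoint fails."""


def quality_gate(name: str, checks: dict[str, bool]) -> None:
    """Raise if any named check is False; otherwise let the pipeline continue."""
    failed = [label for label, passed in checks.items() if not passed]
    if failed:
        raise QualityGateError(f"{name}: failed checks: {', '.join(failed)}")


def run_pipeline(rows: list[dict], min_rows: int = 1_000, max_null_rate: float = 0.05) -> float:
    # Gate 1: validate the extracted batch before any transformation.
    null_count = sum(1 for row in rows if row.get("amount") is None)
    quality_gate("post-extract", {
        "row_count": len(rows) >= min_rows,
        "null_rate": len(rows) > 0 and null_count / len(rows) <= max_null_rate,
    })

    # Transformation step (a stand-in): total the non-NULL amounts.
    total = sum(row["amount"] for row in rows if row.get("amount") is not None)

    # Gate 2: validate the transformed output before it is published downstream.
    quality_gate("pre-publish", {"non_negative_total": total >= 0})
    return total


try:
    run_pipeline([{"amount": 10.0}] * 100)  # too few rows, so the first gate fails
except QualityGateError as exc:
    print(f"pipeline halted: {exc}")
```

Failing loudly at the first bad checkpoint keeps downstream tables at their last known-good state rather than overwriting them with anomalous data.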
