What is a Data Catalog and what problems does it solve?
Data Catalog
A data catalog is a centralized system for managing metadata about enterprise data assets, enabling users to discover, understand, and trust their data.
Core Problems It Solves
Data silos: Data is scattered across departments, making it hard to know what data exists where.
Poor data understanding: Column names like "status" or "type" are ambiguous, and without documented definitions consumers cannot tell what the values mean.
Redundant work: Different teams build the same datasets or metrics independently, wasting effort and creating inconsistent definitions.
Compliance risk: Without a catalog, there is no systematic way to track which tables contain sensitive data such as personally identifiable information (PII).
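To make two of these problems concrete, here is a minimal sketch of column-level metadata in Python. The `ColumnDoc` record, the sample tables, and the helper functions are all hypothetical, invented for illustration; real catalogs store far richer metadata. The sketch only shows how recording a description and a PII flag per column makes undocumented columns and sensitive tables queryable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDoc:
    """Hypothetical catalog entry describing one column."""
    table: str
    column: str
    description: str = ""
    contains_pii: bool = False


# Illustrative entries only; the table and column names are assumptions.
docs = [
    ColumnDoc("orders", "status", "Fulfilment state: placed, shipped, or cancelled"),
    ColumnDoc("orders", "type"),  # no description yet, so still ambiguous
    ColumnDoc("customers", "email", "Primary contact address", contains_pii=True),
]


def undocumented(entries):
    """Return 'table.column' for every column that lacks a description."""
    return [f"{e.table}.{e.column}" for e in entries if not e.description.strip()]


def pii_tables(entries):
    """Return the sorted names of tables with at least one PII column."""
    return sorted({e.table for e in entries if e.contains_pii})
```

With these sample entries, `undocumented(docs)` reports `orders.type` as still needing a definition, and `pii_tables(docs)` lists `customers` as the only table holding PII.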
Key Features
| Feature | Description |
|---|---|
| Data discovery | Search and find needed datasets or columns |
| Business glossary | Unified definitions for business terms, eliminating ambiguity |
| Data lineage | Visualize data flow and dependencies |
| Data quality scores | Display quality assessment results for each dataset |
| Sensitive data tagging | Mark locations of PII, financial, and other sensitive data |
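
Three of these features (data discovery, data lineage, and sensitive data tagging) can be sketched with a small in-memory model. The `Catalog`, `Dataset`, and `Column` classes below, along with the sample dataset names, are assumptions made for illustration and do not reflect the API of any particular catalog tool, which would also persist and index its metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    description: str = ""
    tags: set[str] = field(default_factory=set)  # e.g. {"PII"}


@dataclass
class Dataset:
    name: str
    description: str
    columns: list[Column] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)  # names of source datasets


class Catalog:
    """Hypothetical in-memory catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._datasets: dict[str, Dataset] = {}

    def register(self, dataset: Dataset) -> None:
        self._datasets[dataset.name] = dataset

    def search(self, keyword: str) -> list[str]:
        """Data discovery: match a keyword against names and descriptions."""
        kw = keyword.lower()
        hits = []
        for ds in self._datasets.values():
            texts = [ds.name, ds.description]
            for col in ds.columns:
                texts.extend([col.name, col.description])
            if any(kw in text.lower() for text in texts):
                hits.append(ds.name)
        return sorted(hits)

    def tagged(self, tag: str) -> list[str]:
        """Sensitive data tagging: list 'dataset.column' entries carrying a tag."""
        return sorted(
            f"{ds.name}.{col.name}"
            for ds in self._datasets.values()
            for col in ds.columns
            if tag in col.tags
        )

    def lineage(self, name: str) -> list[str]:
        """Data lineage: collect every direct and transitive upstream dataset."""
        seen: set[str] = set()
        stack = list(self._datasets[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self._datasets:
                stack.extend(self._datasets[current].upstream)
        return sorted(seen)


# Sample metadata; all names here are made up for the example.
catalog = Catalog()
catalog.register(Dataset(
    name="raw_orders",
    description="Orders exported from the shop database",
    columns=[
        Column("order_id", "Unique order identifier"),
        Column("customer_email", "Email given at checkout", {"PII"}),
    ],
))
catalog.register(Dataset(
    name="clean_orders",
    description="Deduplicated orders",
    columns=[Column("order_id", "Unique order identifier")],
    upstream=["raw_orders"],
))
catalog.register(Dataset(
    name="daily_revenue",
    description="Revenue per day",
    columns=[Column("revenue", "Sum of order totals")],
    upstream=["clean_orders"],
))
```

With this sample data, `catalog.search("email")` finds `raw_orders`, `catalog.tagged("PII")` points at `raw_orders.customer_email`, and `catalog.lineage("daily_revenue")` traces back through `clean_orders` to `raw_orders`. Storing only direct upstream links and walking them on demand keeps registration simple; a production system would more likely derive lineage from query logs or pipeline definitions.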
Common Tools
- Open source: Apache Atlas, Amundsen, DataHub
- Cloud-native: AWS Glue Data Catalog, Google Dataplex
- Commercial: Alation, Collibra
