Data Formats: Parquet vs ORC vs CSV
Compare file formats commonly used in big data processing.
CSV / JSON (Row-based Text Formats)
Pros: human-readable, schema-free, and supported by virtually every tool.
Cons: no built-in compression and no column statistics; a query must parse entire rows even when it needs a single column, so performance degrades badly at scale.
Use cases: Data exchange, manual inspection, small datasets.
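The row-based limitation above is easy to see with Python's standard-library `csv` module: even to extract one field, the reader still tokenizes every row in full. (The column names here are illustrative, not from any particular dataset.)

```python
import csv
import io

# A tiny CSV "file" held in memory for the demo.
rows = [
    {"user_id": "1", "country": "DE", "amount": "9.5"},
    {"user_id": "2", "country": "US", "amount": "3.2"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "country", "amount"])
writer.writeheader()
writer.writerows(rows)

# To read just "country", the parser must still scan and split
# every row end to end -- there is no way to skip the other columns.
buf.seek(0)
countries = [r["country"] for r in csv.DictReader(buf)]
print(countries)  # ['DE', 'US']
```

Note also that every value comes back as a string: CSV carries no schema, so type information must be reapplied by the consumer.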
Parquet (Columnar Binary Format)
Column-oriented storage — queries only scan needed columns, dramatically reducing I/O.
- Built-in schema with Schema Evolution support.
- Per-column compression (Snappy, ZSTD, Gzip) and statistics (min/max for Predicate Pushdown).
- Standard format for Spark, Hive, BigQuery, and Redshift.
Use cases: Large-scale analytical queries, data lakes.
ORC (Optimized Row Columnar)
A columnar format similar to Parquet, most common in the Hive ecosystem. It often achieves slightly better compression, but tool compatibility outside Hive is narrower.
Recommendation
Default to Parquet for new projects (widest ecosystem support). Consider ORC for Hive-centric systems. Avoid CSV in large analytical pipelines.