
Data Formats: Parquet vs ORC vs CSV


Compare file formats commonly used in big data processing.

CSV / JSON (Row-based Text Formats)

Human-readable, schema-free, supported by all tools.

Cons: No built-in compression or column statistics; reading even a single column requires parsing every full row, so analytical query performance is poor.

Use cases: Data exchange, manual inspection, small datasets.
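The column-read cost above can be sketched in plain Python with the standard-library `csv` module (the column names and values here are made up for illustration): even to extract one field, the reader must tokenize every field of every row.

```python
import csv
import io

# Hypothetical sample data: three rows, three columns.
rows = [
    {"user_id": "1", "country": "US", "amount": "10.5"},
    {"user_id": "2", "country": "DE", "amount": "3.2"},
    {"user_id": "3", "country": "US", "amount": "7.0"},
]

# Write the rows out as CSV text (using an in-memory buffer as the "file").
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "country", "amount"])
writer.writeheader()
writer.writerows(rows)

# To read just the "amount" column, the parser still scans and splits
# every byte of every row -- there is no way to skip the other columns.
buf.seek(0)
amounts = [float(row["amount"]) for row in csv.DictReader(buf)]
print(amounts)  # [10.5, 3.2, 7.0]
```

In a columnar format the same query would touch only the bytes of the `amount` column, which is the I/O saving the next section describes.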

Parquet (Columnar Binary Format)

Column-oriented storage — queries only scan needed columns, dramatically reducing I/O.

  • Built-in schema with Schema Evolution support.
  • Per-column compression (Snappy, ZSTD, Gzip) and statistics (min/max for Predicate Pushdown).
  • Standard format for Spark, Hive, BigQuery, and Redshift.

Use cases: Large-scale analytical queries, data lakes.

ORC (Optimized Row Columnar)

A columnar format similar to Parquet, most common in the Hive ecosystem. It often achieves slightly better compression, but has narrower tool compatibility.

Recommendation

Default to Parquet for new projects (widest ecosystem support). Consider ORC for Hive-centric systems. Avoid CSV in large analytical pipelines.


Copyright © 2026 Wood All Rights Reserved · FE Interview Hub