Spark Performance Tuning

Explain common Spark performance issues and tuning techniques.

1. Reduce Shuffle

Shuffle is often the biggest bottleneck because it moves data across the network and writes it to disk. Tuning approaches:

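  • Broadcast Join: When one side is small (below spark.sql.autoBroadcastJoinThreshold, 10 MB by default), broadcast it to all executors so the large side is joined without a shuffle.
  • Pre-aggregate before the join so that fewer rows are shuffled.
  • Tune spark.sql.shuffle.partitions (default 200): increase it for large datasets so each partition stays small enough to process in memory, and lower it for small datasets to avoid many tiny tasks.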
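
To make the broadcast idea concrete, here is a plain-Python model of a broadcast hash join. It is a sketch of the mechanism, not Spark API code; the function name, table names, and rows are all hypothetical. The small side becomes a lookup table that every partition of the large side probes locally, so no row of the large side changes partition.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against one shared lookup table."""
    # Build the hash table once from the small side. In Spark, this table is
    # what would be shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined in place with a dictionary probe; rows of
        # the large side are never redistributed.
        joined = [
            {**row, **match}
            for row in partition
            for match in lookup.get(row[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: a large fact table split into two partitions, and a
# small dimension table.
orders = [
    [{"order_id": 1, "country": "DE"}, {"order_id": 2, "country": "FR"}],
    [{"order_id": 3, "country": "DE"}, {"order_id": 4, "country": "XX"}],
]
countries = [
    {"country": "DE", "name": "Germany"},
    {"country": "FR", "name": "France"},
]

result = broadcast_hash_join(orders, countries, "country")
```

The output keeps the partitioning of the large side, which is the point of the technique. In PySpark the equivalent request is usually made by wrapping the small DataFrame in pyspark.sql.functions.broadcast() before the join.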

2. Handle Data Skew

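Confirm skew in the Spark UI (a few tasks in a stage run far longer or read far more shuffle data than the rest) or by counting rows per key; explain() only shows the query plan, not the data distribution. Then: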

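  • Salting: Add a random suffix to hot keys so their rows spread over several partitions, aggregate or join on the salted key, then strip the suffix and merge the partial results.
  • AQE (Adaptive Query Execution): Spark 3.x can split skewed join partitions automatically. It requires spark.sql.adaptive.enabled=true (the default since Spark 3.2) and spark.sql.adaptive.skewJoin.enabled=true (the default).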
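
The salting steps can be modelled in plain Python as a two-stage aggregation. This is a sketch under stated assumptions, not Spark code: the partition count, the number of salt buckets, the "#" separator, and the sample rows are all made up for illustration, and zlib.crc32 stands in for a hash partitioner.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed number of shuffle partitions
SALT_BUCKETS = 8    # assumed fan-out for each hot key


def partition_of(key):
    # Stable stand-in for a hash partitioner (Python's built-in hash() of a
    # string changes between processes).
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def max_partition_load(keys):
    """Number of rows in the busiest partition."""
    loads = Counter(partition_of(k) for k in keys)
    return max(loads.values())


def salt(key, hot_keys, rng):
    # Only hot keys get a random suffix; other keys are left alone.
    if key in hot_keys:
        return f"{key}#{rng.randrange(SALT_BUCKETS)}"
    return key


def unsalt(key):
    # Assumes "#" never appears in an original key.
    return key.split("#", 1)[0]


rng = random.Random(42)
# Hypothetical skewed input: one key owns 90% of the rows.
rows = [("hot", 1)] * 9000 + [(f"user{i % 100}", 1) for i in range(1000)]
salted_rows = [(salt(k, {"hot"}, rng), v) for k, v in rows]

# Stage 1: partial aggregation on the salted key (spread over partitions).
partial = Counter()
for k, v in salted_rows:
    partial[k] += v

# Stage 2: strip the salt and merge the partial results.
final = Counter()
for k, v in partial.items():
    final[unsalt(k)] += v

skewed_load = max_partition_load(k for k, _ in rows)
salted_load = max_partition_load(k for k, _ in salted_rows)
```

The final totals match an unsalted aggregation, while the busiest partition holds far fewer rows. The cost is the extra merge stage, so salting is worth applying only to keys that are actually hot.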

3. Memory Management

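  • Size spark.executor.memory and spark.driver.memory to the workload.
  • Increase spark.memory.fraction (the share of heap, after a 300 MB reserve, used for execution and storage combined; default 0.6) if tasks spill heavily.
  • Avoid calling collect() on large DataFrames; it pulls every row into the driver and can cause an out-of-memory error.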
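
As a rough worked example of what spark.memory.fraction controls, the arithmetic below follows the formula in the Spark tuning guide: the unified execution and storage region is a fraction of the heap minus 300 MiB. The helper name and the 8 GiB heap are assumptions for illustration, and the heap a JVM actually reports is usually slightly below the configured value.

```python
RESERVED_MIB = 300  # reserved memory subtracted before the fraction applies


def unified_memory_mib(heap_mib, memory_fraction=0.6):
    """Approximate size of the shared execution + storage region, in MiB."""
    return (heap_mib - RESERVED_MIB) * memory_fraction


heap_mib = 8 * 1024  # hypothetical executor with spark.executor.memory=8g

default_region = unified_memory_mib(heap_mib)      # with the default 0.6
larger_region = unified_memory_mib(heap_mib, 0.7)  # after raising the fraction

print(f"fraction 0.6: about {default_region:.0f} MiB")
print(f"fraction 0.7: about {larger_region:.0f} MiB")
```

Raising the fraction leaves less heap for user data structures and Spark's internal metadata, so it is a trade-off rather than a free gain.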

4. Serialization

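Use Kryo serialization (spark.serializer=org.apache.spark.serializer.KryoSerializer) instead of the default Java serialization. The Spark tuning guide describes Kryo as more compact and often as much as 10x faster. The gain applies mainly to RDDs and shuffled JVM objects; DataFrames already use Spark's internal binary format.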
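
One way to apply the setting is at submit time. The sketch below only assembles a spark-submit argument list; the helper name, the buffer size, and job.py are placeholders, not recommendations.

```python
kryo_settings = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Illustrative value; raise it only if large objects overflow the buffer.
    "spark.kryoserializer.buffer.max": "128m",
}


def to_submit_args(settings):
    """Turn a settings dict into repeated --conf key=value arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# job.py is a placeholder for the application script.
command = ["spark-submit", *to_submit_args(kryo_settings), "job.py"]
print(" ".join(command))
```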

5. Persistence Strategy

Use cache() or persist() on DataFrames that are reused across multiple actions to avoid recomputing them, and call unpersist() once they are no longer needed to free the memory.
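
Why caching helps follows from lazy evaluation: without it, every action re-runs the whole lineage. The toy class below models that behaviour in plain Python. It is not the Spark API, and LazyDataset and expensive_transform are hypothetical names.

```python
class LazyDataset:
    """Toy model of a lazily evaluated dataset with optional caching."""

    def __init__(self, compute):
        self._compute = compute
        self._cache_enabled = False
        self._cached = None
        self.compute_calls = 0  # how many times the lineage was executed

    def cache(self):
        # Like Spark, this only marks the dataset; nothing is computed yet.
        self._cache_enabled = True
        return self

    def unpersist(self):
        self._cache_enabled = False
        self._cached = None
        return self

    def _materialize(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        rows = self._compute()
        if self._cache_enabled:
            self._cached = rows
        return rows

    # Two "actions" that each need the full data.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


def expensive_transform():
    return [x * x for x in range(1000)]


uncached = LazyDataset(expensive_transform)
uncached.count()
uncached.total()

cached = LazyDataset(expensive_transform).cache()
cached.count()
cached.total()

print(uncached.compute_calls, cached.compute_calls)
```

The uncached dataset runs the transform once per action, while the cached one runs it only for the first action. The same reasoning is why caching a DataFrame that is used only once adds memory pressure without any benefit.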

Copyright © 2026 Wood All Rights Reserved · FE Interview Hub