Spark Performance Tuning
Explain common Spark performance issues and tuning techniques.
1. Reduce Shuffle
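Shuffle is often the biggest bottleneck, because it moves data between executors over the network and can spill to disk. Tuning approaches:
- Broadcast Join: When one side is small (under spark.sql.autoBroadcastJoinThreshold, 10MB by default), broadcast it to all executors so the large side is not shuffled.
- Pre-aggregate before the join so that fewer rows are shuffled.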
- Tune spark.sql.shuffle.partitions (default 200; increase for large datasets).
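The three approaches above can be combined in one place. The following is a minimal PySpark sketch, not a definitive recipe: the function names, the `facts` and `dims` DataFrames, and the 128 MiB target partition size are illustrative assumptions (the target is a common rule of thumb, not a Spark default).

```python
import math


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Partition count that keeps each shuffle partition near the target size.

    The 128 MiB target is an assumed rule of thumb, not a Spark default.
    """
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


def join_small_dimension(spark, facts, dims, key, shuffle_bytes):
    """Pre-aggregate a large table, then join it to a small one via broadcast.

    `facts` (large) and `dims` (small) are hypothetical DataFrames.
    """
    # Imported here so the sizing helper above works without PySpark installed.
    from pyspark.sql import functions as F

    # Size the shuffle for the aggregation below instead of relying on 200.
    spark.conf.set(
        "spark.sql.shuffle.partitions",
        str(suggest_shuffle_partitions(shuffle_bytes)),
    )

    # Pre-aggregate: collapse the large table to one row per key before joining.
    per_key = facts.groupBy(key).count()

    # Broadcast hint: copy the small table to every executor so the join
    # itself does not shuffle the aggregated rows.
    return per_key.join(F.broadcast(dims), on=key, how="left")
```

With roughly 100 GiB of shuffled data, `suggest_shuffle_partitions(100 * 1024**3)` yields 800 partitions under the assumed target.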
2. Handle Data Skew
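Confirm skew first, for example by comparing task durations within a stage in the Spark UI or by counting rows per key (explain() shows the physical plan, not how the data is distributed), then: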
- Salting: Add a random suffix to hot keys to scatter them; merge after processing.
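- AQE (Adaptive Query Execution): Spark 3.x can split skewed join partitions automatically at runtime. Enable it with spark.sql.adaptive.enabled=true (on by default since Spark 3.2); skew-join handling is controlled by spark.sql.adaptive.skewJoin.enabled.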
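Salting is easier to see on a small scale. The sketch below simulates the two phases in plain Python rather than Spark: phase 1 counts per (key, salt), which in Spark would be a groupBy on the key plus a salt column, and phase 2 drops the salt and merges the partial counts. The function name, the sample keys, and the choice of 8 salts are all illustrative assumptions.

```python
import random
from collections import Counter


def salted_count(keys, hot_keys, num_salts, seed=0):
    """Count rows per key in two phases, scattering hot keys across salts."""
    rng = random.Random(seed)

    # Phase 1: give each row of a hot key a random salt and count per
    # (key, salt). One huge group becomes up to `num_salts` smaller groups.
    partial = Counter()
    for key in keys:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += 1

    # Phase 2: drop the salt and merge the partial counts per original key.
    merged = Counter()
    for (key, _salt), count in partial.items():
        merged[key] += count
    return partial, merged


# Hypothetical skewed input: one key holds 90% of the rows.
rows = ["user_1"] * 9000 + ["user_2"] * 600 + ["user_3"] * 400
partial, merged = salted_count(rows, hot_keys={"user_1"}, num_salts=8)

# The merged result matches a direct count, while the largest group that any
# single task would have to process is now much smaller than 9000 rows.
assert merged == Counter(rows)
print(max(partial.values()), "rows in the largest salted group")
```

The second phase is cheap because it only sees one row per (key, salt) pair, which is why the extra aggregation step usually pays for itself on heavily skewed keys.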
3. Memory Management
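- Tune spark.executor.memory and spark.driver.memory.
- Increase spark.memory.fraction (the share of JVM heap, after a 300MB reserve, that execution and storage share; default 0.6).
- Avoid calling collect() on large DataFrames, since it pulls every row to the driver.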
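Executor and driver memory have to be fixed before the JVMs start, so they are usually passed at submit time. The following sketch builds a spark-submit command line from a settings dictionary; the helper name, the script name `etl_job.py`, and the memory values are placeholders chosen for illustration, not recommendations.

```python
import shlex

# Illustrative values only; size them from the job's observed usage.
MEMORY_CONF = {
    "spark.executor.memory": "8g",
    "spark.memory.fraction": "0.6",  # Spark's default, shown for visibility
}


def spark_submit_command(app, driver_memory, conf):
    """Build a spark-submit command line with memory settings applied."""
    # --driver-memory is used for the driver because in client mode the
    # driver JVM is already running by the time application code executes.
    args = ["spark-submit", "--driver-memory", driver_memory]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return shlex.join(args)


print(spark_submit_command("etl_job.py", "4g", MEMORY_CONF))
```

On the driver side, replacing collect() with take(n), or writing the result to storage, keeps the driver's memory use independent of the DataFrame's size.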
4. Serialization
Use Kryo serialization instead of the default Java serialization. Spark's tuning guide describes Kryo as significantly faster and more compact, often by as much as 10x.
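Kryo is switched on through configuration when the session is built. A minimal sketch, assuming a hypothetical `configure` helper and an illustrative buffer size; only the configuration keys and the serializer class name come from Spark itself.

```python
KRYO_CONF = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Illustrative value; raise it only if large objects fail to serialize.
    "spark.kryoserializer.buffer.max": "128m",
}


def configure(builder, conf):
    """Apply each setting to a SparkSession builder and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

Typical use would be `configure(SparkSession.builder.appName("kryo-demo"), KRYO_CONF).getOrCreate()`, where the application name is a placeholder. The serializer must be set before the session starts; changing it on a running session has no effect.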
5. Persistence Strategy
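Use cache() or persist() on DataFrames that are reused multiple times to avoid recomputation. cache() uses the default storage level, while persist() accepts an explicit one; call unpersist() once the data is no longer needed to free the memory.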
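One way to make sure cached data is always released is to scope it with a context manager. This is a sketch of that pattern, not a Spark API: `cached` is a hypothetical helper that relies only on the DataFrame's cache() and unpersist() methods.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache a DataFrame for the duration of a block, then release it."""
    cached_df = df.cache()  # lazy: materialized by the first action
    try:
        yield cached_df
    finally:
        # Free executor memory even if the block raised an exception.
        cached_df.unpersist()
```

For example, inside `with cached(events.filter("country = 'DE'")) as hot:` (where `events` is a hypothetical DataFrame) the first action computes and stores the filtered rows, and later actions in the same block reuse them instead of rescanning the source.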
