Apache Spark Core Architecture
Explain Apache Spark's core architecture and execution model.
Components
Driver: The main process of a Spark application. It runs the user code, builds the DAG of stages, schedules tasks on executors, and communicates with the Cluster Manager.
Executor: A process running on a worker node that performs the actual computation and caches data in memory or on disk. Each application gets its own set of executors.
Cluster Manager: Allocates cluster resources to applications (YARN, Kubernetes, Spark Standalone); a minimal configuration sketch follows this list.
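A minimal PySpark sketch of how these pieces are wired together, assuming local mode as a stand-in for a real cluster manager; the app name and executor settings are illustrative values, not recommendations:

```python
from pyspark.sql import SparkSession

# The driver is the process that runs this script; the builder below asks the
# cluster manager (here: local mode as a stand-in) to launch executors.
spark = (
    SparkSession.builder
    .appName("architecture-demo")           # hypothetical app name
    .master("local[4]")                      # stand-in for yarn / k8s / standalone
    .config("spark.executor.memory", "2g")   # per-executor memory (illustrative)
    .config("spark.executor.cores", "2")     # cores per executor (illustrative)
    .getOrCreate()
)

print(spark.sparkContext.master)  # shows which cluster manager / mode is in use
spark.stop()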
Core Abstractions
RDD (Resilient Distributed Dataset): Lowest-level distributed data abstraction — immutable, fault-tolerant, recomputable. Rarely used directly today.
DataFrame/Dataset: High-level APIs built on top of RDDs, with schema support and SQL queries, automatically optimized by the Catalyst Optimizer (see the sketch after this list).
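A small sketch contrasting the two abstractions, assuming a local PySpark session; the sample rows and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("abstractions-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

# Low-level RDD API: no schema, transformations are plain Python functions.
rdd = sc.parallelize([("alice", 34), ("bob", 29)])
adults_rdd = rdd.filter(lambda kv: kv[1] >= 30)

# DataFrame API: schema-aware, planned and optimized by Catalyst.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
adults_df = df.filter(df.age >= 30)

print(adults_rdd.collect())  # [('alice', 34)]
adults_df.show()
spark.stop()
```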
Lazy Evaluation
Transformations (map, filter, join) are not executed immediately; they only build up a DAG of the computation. Actions (collect, count, write) trigger actual execution, which lets Catalyst optimize the whole plan at once.
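A sketch of lazy evaluation in PySpark, assuming a local session; the numbers and expressions are arbitrary, the point is that nothing runs until the action at the end:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").master("local[2]").getOrCreate()

df = spark.range(1_000_000)                   # transformation: nothing runs yet
doubled = df.selectExpr("id * 2 AS doubled")  # transformation: still only a plan
filtered = doubled.where("doubled % 10 = 0")  # transformation: plan keeps growing

filtered.explain()       # prints the optimized physical plan Catalyst produced
print(filtered.count())  # action: only now are tasks scheduled on executors
spark.stop()
```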
Shuffle
Operations like groupBy and join require redistributing data across nodes over the network (a shuffle), which is typically the main performance bottleneck. Minimizing shuffles is the core of Spark tuning.
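A sketch of where a shuffle shows up and one common way to avoid it (broadcasting the small side of a join), assuming a local PySpark session with made-up sample data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("shuffle-demo").master("local[2]").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 10.0), (2, "DE", 20.0), (3, "US", 5.0)],
    ["order_id", "country", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country", "name"],
)

# groupBy repartitions rows by key across executors -> the plan shows an Exchange (shuffle).
orders.groupBy("country").sum("amount").explain()

# Broadcasting the small table avoids shuffling the large one -> BroadcastHashJoin in the plan.
orders.join(broadcast(countries), "country").explain()
spark.stop()
```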
