Batch Processing · Intermediate

Apache Spark Core Architecture

Components

Driver: The main process of a Spark application. Runs the user's main() function, builds the DAG of operations, schedules tasks onto executors, and negotiates resources with the Cluster Manager.

Executor: A process on a worker node that runs the actual tasks and stores data in memory or on disk. Executors belong to a single application and are not shared.

Cluster Manager: Allocates cluster resources to applications. Spark supports YARN, Kubernetes, and its own Standalone manager.

Core Abstractions

RDD (Resilient Distributed Dataset): Lowest-level distributed data abstraction — immutable, fault-tolerant, recomputable. Rarely used directly today.

DataFrame/Dataset: High-level APIs built on top of RDDs, with schema support and SQL queries. Query plans are automatically optimized by the Catalyst optimizer.
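The RDD properties above (immutable, fault-tolerant, recomputable) can be sketched with a toy model in plain Python. This is not Spark's API; `ToyRDD` and its methods are invented here purely to illustrate lineage-based recomputation.

```python
# Toy model (NOT Spark's API) of RDD lineage: each RDD is immutable and
# remembers how it was derived, so a lost partition can be recomputed
# from its parent instead of being replicated up front.

class ToyRDD:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent      # lineage pointer to the parent RDD
        self.fn = fn              # transformation to reapply during recompute
        self.data = data          # only the source RDD holds raw data

    def map(self, f):
        # Returns a NEW RDD; the original is never mutated (immutability).
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def compute(self):
        # Walk the lineage chain back to the source and reapply each step;
        # this mirrors how Spark recovers a lost partition (fault tolerance).
        if self.parent is None:
            return self.data
        return self.fn(self.parent.compute())

source = ToyRDD(data=[1, 2, 3])
doubled = source.map(lambda x: x * 2)
print(doubled.compute())  # [2, 4, 6]
```

Note that `doubled` stores no data of its own: asking for its contents replays the lineage, which is exactly why RDDs can be both cheap to define and recoverable after node failure.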

Lazy Evaluation

Transformations (map, filter, join) are not executed immediately — they only build up a DAG of operations. Actions (collect, count, write) trigger actual execution, which lets Catalyst optimize the entire plan at once.
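The transformation/action split can be sketched with a toy pipeline in plain Python. This is not Spark's implementation; `LazyPipeline` is a hypothetical class invented here to show that transformations only record steps, and nothing runs until an action is called.

```python
# Toy sketch (NOT Spark internals) of lazy evaluation: transformations
# append to a logical plan; only an action executes the plan.

class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.plan = []            # recorded transformations (the "DAG")

    # --- transformations: record the step, return a pipeline for chaining ---
    def filter(self, pred):
        self.plan.append(("filter", pred))
        return self

    def map(self, f):
        self.plan.append(("map", f))
        return self

    # --- action: the whole plan executes now, fused into a single pass ---
    def collect(self):
        out = []
        for row in self.data:
            keep = True
            for kind, fn in self.plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:  # map
                    row = fn(row)
            if keep:
                out.append(row)
        return out

result = (LazyPipeline(range(10))
          .filter(lambda x: x % 2 == 0)
          .map(lambda x: x * 10)
          .collect())
print(result)  # [0, 20, 40, 60, 80]
```

Because the plan is known in full before execution, `collect` can run filter and map in one pass per row rather than materializing an intermediate list — a simplified analogue of how Spark pipelines narrow transformations within a stage.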

Shuffle

Operations like groupBy and join require redistributing data across nodes over the network (a shuffle), which is usually the dominant performance cost. Minimizing shuffles is the core of Spark tuning.

Copyright © 2026 Wood All Rights Reserved · FE Interview Hub