MapReduce Computing Paradigm
Explain how MapReduce works.
What Is MapReduce
MapReduce is a distributed computing paradigm proposed by Google and the core computation model of Hadoop. It decomposes large-scale data processing into two phases: Map and Reduce.
Map Phase
Applies a mapping function independently to each input record, producing key-value pairs. Map tasks run in parallel across multiple nodes.
Example: For word count, each word emits (word, 1).
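A minimal sketch of a word-count map function in plain Python (the function name and input format are illustrative, not tied to any specific framework):

```python
def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.split()]

# Each record is processed independently, so map tasks parallelize trivially.
pairs = map_words("to be or not to be")
# pairs == [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
```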
Shuffle and Sort
The framework automatically groups and sorts Map outputs by key and routes each group to the appropriate Reducer. Because it moves data across the network between nodes, shuffle is typically the most expensive step of a job.
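The grouping step can be imitated locally with a dictionary; in a real cluster the framework also partitions keys across reducers (commonly by hashing the key) and ships each partition over the network. A sketch of the local grouping, with illustrative names:

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    """Group map outputs by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    # Sorting by key mimics the framework's sorted delivery to each reducer.
    return dict(sorted(groups.items()))

grouped = shuffle([("to", 1), ("be", 1), ("to", 1)])
# grouped == {"be": [1], "to": [1, 1]}
```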
Reduce Phase
Aggregates all values for the same key.
Example: Sum all (word, 1) to produce (word, total_count).
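Putting the three phases together, a single-process word count might look like this (a local sketch only; a real framework would run the map and reduce steps on different machines):

```python
from collections import defaultdict

def word_count(lines):
    """End-to-end word count: Map, Shuffle, and Reduce, all in one process."""
    # Map: emit (word, 1) for every word in every input record.
    mapped = [(word, 1) for line in lines for word in line.split()]
    # Shuffle: group the emitted pairs by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: sum all the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

totals = word_count(["to be or not", "to be"])
# totals == {"to": 2, "be": 2, "or": 1, "not": 1}
```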
Limitations of MapReduce
Every MapReduce job reads its input from disk and writes its output back to disk, so multi-stage pipelines that chain several jobs are slow. Spark's in-memory computation addressed this bottleneck and has gradually replaced MapReduce for many workloads.
Historical Significance
MapReduce established the foundation for big data batch processing, inspiring later frameworks like Spark and Flink. Understanding MapReduce helps in understanding modern distributed computing design.
