In this Hadoop tutorial, we walk through the end-to-end MapReduce job execution flow and describe each component involved in MapReduce processing in detail. This tutorial will help you answer: how does Hadoop MapReduce work, how does data flow in MapReduce, and how is a MapReduce job executed in Hadoop?
What is MapReduce?
Hadoop MapReduce is the data processing layer of Hadoop. It processes huge amounts of structured and unstructured data stored in HDFS. MapReduce processes data in parallel by splitting a job into a set of independent tasks, and this parallel processing improves both speed and reliability.
Hadoop MapReduce data processing occurs in 2 phases- Map and Reduce phase.
• Map phase: The first phase of data processing. In this phase, we specify all the complex logic, business rules, and costly code.
• Reduce phase: The second phase of processing. In this phase, we specify lightweight processing such as aggregation or summation.
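To make the two phases concrete, here is a minimal plain-Python sketch (not the Hadoop API) of a word count, where the Map phase tokenizes lines into intermediate pairs and the Reduce phase sums the counts per key:

```python
# Plain-Python sketch of the two MapReduce phases (not the Hadoop API).

def map_phase(line):
    # Map: per-record logic -- here, tokenize a line into (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce: lightweight aggregation -- sum the counts for one key.
    return (key, sum(values))

lines = ["big data big ideas", "big data"]

# Run the Map phase over every input record.
intermediate = [pair for line in lines for pair in map_phase(line)]

# Group intermediate pairs by key (the framework does this between phases).
grouped = {}
for key, value in intermediate:
    grouped.setdefault(key, []).append(value)

# Run the Reduce phase on each key group.
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)  # {'big': 3, 'data': 2, 'ideas': 1}
```

In a real Hadoop job the grouping in the middle is handled by the framework's shuffle and sort, as described in the steps below.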
Steps of MapReduce Job Execution flow
MapReduce processes the data in various phases with the help of different components. Let us discuss the steps of job execution in Hadoop.
Input Files
The data for a MapReduce job is stored in input files, which reside in HDFS. The input file format is arbitrary: line-based log files and binary formats can both be used.
InputFormat
Next, InputFormat defines how to split and read these input files. It selects the files or other objects used for input and creates the InputSplits.
InputSplit
An InputSplit represents the data that will be processed by an individual Mapper. One map task is created for each split, so the number of map tasks equals the number of InputSplits. The framework divides each split into records, which the mapper processes.
RecordReader
The RecordReader communicates with the InputSplit and transforms the data into key-value pairs suitable for reading by the Mapper. By default, with TextInputFormat, the RecordReader assigns the byte offset of each line in the file as the key and the line's contents as the value. It keeps interacting with the InputSplit until the file is fully read, and the resulting key-value pairs are then sent to the mapper for further processing.
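The byte-offset keying can be sketched in plain Python (this mimics the idea behind TextInputFormat's line-oriented RecordReader; it is not the Hadoop implementation):

```python
# Sketch of a line-oriented RecordReader that turns a split into
# (byte offset, line) key-value pairs, as TextInputFormat does.
# Assumes ASCII text, so character counts equal byte counts.

def record_reader(split_text):
    offset = 0
    for line in split_text.splitlines(keepends=True):
        # Key = byte offset where the line starts, value = line without newline.
        yield offset, line.rstrip("\n")
        offset += len(line)

data = "first line\nsecond line\nthird\n"
records = list(record_reader(data))
print(records)  # [(0, 'first line'), (11, 'second line'), (23, 'third')]
```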
Mapper
The Mapper processes the input records produced by the RecordReader and generates intermediate key-value pairs. The intermediate output can be entirely different from the input pairs. The Hadoop framework does not store the mapper's output on HDFS: the data is temporary, and writing it to HDFS would create unnecessary replicated copies. The mapper instead passes its output to the combiner for further processing.
Combiner
The combiner is a "mini-reducer" that performs local aggregation on the mapper's output, minimizing the data transferred between mapper and reducer. When the combiner finishes, the framework passes its output to the partitioner for further processing.
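The local-aggregation idea can be sketched as follows (plain Python, not the Hadoop API): the combiner shrinks one mapper's output before it crosses the network.

```python
# Sketch: a combiner locally aggregates one mapper's output before shuffle.

def combiner(mapper_output):
    # Sum values per key on the map side to shrink what crosses the network.
    combined = {}
    for key, value in mapper_output:
        combined[key] = combined.get(key, 0) + value
    return sorted(combined.items())

# One mapper emitted these four intermediate pairs:
mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
print(combiner(mapper_output))  # [('big', 3), ('data', 1)]
```

Here four pairs shrink to two before transfer, which is exactly the saving the combiner provides.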
Partitioner
The partitioner comes into play when we are working with more than one reducer. It takes the output of the combiner and partitions it.
Partitioning is based on the key in MapReduce: a hash function applied to the key (or a subset of the key) determines the partition. Records with the same key therefore go into the same partition, and each partition is sent to a reducer.
Partitioning in MapReduce execution permits even distribution of the map output over the reducers.
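The "hash of the key mod number of reducers" idea behind Hadoop's default HashPartitioner can be sketched in plain Python. The byte-sum hash below is a deterministic stand-in for Java's key.hashCode(), used only for illustration:

```python
# Sketch of hash partitioning: partition = hash(key) mod number of reducers,
# mirroring the idea behind Hadoop's default HashPartitioner.

def partition_for(key, num_reducers):
    # Deterministic stand-in for Java's key.hashCode(); Hadoop's default is
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
    h = sum(key.encode("utf-8"))
    return h % num_reducers

pairs = [("big", 3), ("hadoop", 1), ("data", 2)]
num_reducers = 2

partitions = {0: [], 1: []}
for key, value in pairs:
    partitions[partition_for(key, num_reducers)].append((key, value))

# Records with the same key always land in the same partition/reducer.
print(partitions)  # {0: [('big', 3), ('data', 2)], 1: [('hadoop', 1)]}
```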
Shuffling and Sorting
After partitioning, the output is shuffled to the reducer nodes. Shuffling is the physical transfer of the data over the network. Once all the mappers finish and their output is shuffled to the reducer nodes, the framework merges and sorts this intermediate output, which is then provided as input to the Reduce phase.
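The merge-and-sort step can be sketched in plain Python: intermediate pairs from all mappers are collected, sorted by key, and grouped so each reducer call sees one key with all of its values.

```python
# Sketch of shuffle and sort: collect intermediate pairs from all mappers,
# then sort and group them into (key, [values]) for the reducers.
from itertools import groupby
from operator import itemgetter

mapper1_output = [("data", 1), ("big", 1)]
mapper2_output = [("big", 2), ("flow", 1)]

# Shuffle: bring every mapper's output together (over the network in Hadoop).
shuffled = mapper1_output + mapper2_output

# Sort by key, then group values so each reducer sees (key, list-of-values).
shuffled.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g]) for k, g in groupby(shuffled, key=itemgetter(0))]
print(grouped)  # [('big', [1, 2]), ('data', [1]), ('flow', [1])]
```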
Reducer
The reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reducer function on each key group to generate the output. The reducer's output is the final output, which the framework stores on HDFS.
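Continuing the plain-Python sketch, the reducer is called once per sorted key group and emits the final pairs:

```python
# Sketch of the Reduce phase: one reducer call per key group,
# summing the values (e.g. word counts) into the final output.

def reducer(key, values):
    return (key, sum(values))

# Grouped input as produced by shuffle and sort:
grouped_input = [("big", [1, 2]), ("data", [1]), ("flow", [1])]
final_output = [reducer(k, vs) for k, vs in grouped_input]
print(final_output)  # [('big', 3), ('data', 1), ('flow', 1)]
```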
RecordWriter
The RecordWriter writes these output key-value pairs from the Reducer phase to the output files.
OutputFormat
OutputFormat defines the way the RecordWriter writes these output key-value pairs to the output files. The OutputFormat instances provided by Hadoop write files to HDFS; thus, they write the final output of the reducer to HDFS.
Conclusion – Hadoop MapReduce Job Execution Flow Chart
We have discussed the MapReduce job execution flow step by step. I hope this tutorial helps you understand how MapReduce works.