Mapper Class in Hadoop MapReduce: In this article, we will answer what the Mapper in Hadoop MapReduce is, how the Hadoop Mapper works, what the Mapper process in MapReduce looks like, and how Hadoop creates key-value pairs in MapReduce.
Introduction to Hadoop Mapper
Hadoop Mapper processes the input records produced by the RecordReader and generates intermediate key-value pairs. The intermediate output can be entirely different from the input pair.
The output of the mapper is the full collection of key-value pairs. Before the output of each map task is written, it is partitioned based on the key. Partitioning guarantees that all the values for a given key are grouped together.
Hadoop MapReduce creates one map task for each InputSplit.
Hadoop MapReduce only understands key-value pairs of data. Hence, before sending data to the mapper, the Hadoop framework must convert the data into key-value pairs.
How is the key-value pair generated in Hadoop MapReduce?
Now that we have understood what the mapper in Hadoop is, let us discuss how Hadoop generates key-value pairs.
• InputSplit – The logical representation of data created by the InputFormat. In a MapReduce program, it describes a unit of work that is processed by a single map task.
• RecordReader – It interacts with the InputSplit and transforms the data into key-value pairs suitable for reading by the Mapper. By default, Hadoop uses TextInputFormat, whose RecordReader takes the byte offset of each line as the key and the line itself as the value.
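The two steps above can be sketched in plain Java. This is a minimal illustration, not Hadoop's actual RecordReader API: it shows how a line-oriented reader derives one key-value pair per line, with the line's starting byte offset as the key and the line's text as the value.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch (plain Java, no Hadoop dependency) of how a
// line-oriented RecordReader turns a split's text into key-value pairs:
// the key is the byte offset where each line starts, the value is the line.
public class RecordReaderSketch {
    static List<String> toKeyValuePairs(String split) {
        List<String> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : split.split("\n", -1)) {
            if (!line.isEmpty()) {
                pairs.add(offset + "\t" + line); // key <TAB> value
            }
            offset += line.getBytes().length + 1; // +1 for the newline byte
        }
        return pairs;
    }

    public static void main(String[] args) {
        // "hello world" is 11 bytes, so the second line starts at offset 12.
        for (String kv : toKeyValuePairs("hello world\nhadoop mapper")) {
            System.out.println(kv);
        }
    }
}
```

The Mapper then receives each of these pairs in turn, one call per record.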
Mapper Process in Hadoop MapReduce
InputSplit converts the physical representation of the blocks into a logical one for the Mapper. For example, to read a 100 MB file with a 64 MB block size, 2 InputSplits are needed. By default, the framework generates one InputSplit for every block, and each InputSplit is assigned to one mapper.
The number of MapReduce InputSplits does not always match the number of data blocks. We can alter the number of splits by setting the mapred.max.split.size property (in newer Hadoop versions, mapreduce.input.fileinputformat.split.maxsize) during job execution.
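As a configuration fragment, lowering the maximum split size in a job driver might look like the following sketch (it assumes a Hadoop driver context and the Hadoop 2.x+ property name, so it is not a standalone program):

```java
// Sketch of a Hadoop job driver fragment (requires Hadoop on the classpath).
// Lowering the maximum split size below the block size produces more splits,
// and therefore more map tasks, per input file.
Configuration conf = new Configuration();
conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
             64L * 1024 * 1024); // 64 MB per split instead of one per block
Job job = Job.getInstance(conf, "more-mappers");
```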
MapReduce RecordReader is responsible for reading the data and converting it into key-value pairs until the end of the file. RecordReader assigns the byte offset of each line in the file as the key, with the line itself as the value. The Mapper receives these key-value pairs and generates the intermediate output (key-value pairs that the reducer understands).
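The map step itself can be sketched in plain Java as well. This is a hedged, word-count-style illustration (not Hadoop's Mapper API): the map function receives a (byte offset, line) pair and emits one intermediate (word, 1) pair per token, which is the form the reduce phase later consumes.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal word-count-style map step (plain Java, no Hadoop dependency):
// input is one (byte offset, line) record; output is the list of
// intermediate (word, 1) key-value pairs for that record.
public class MapperSketch {
    static List<String> map(long byteOffset, String line) {
        List<String> intermediate = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                intermediate.add(word + "\t1"); // intermediate key-value pair
            }
        }
        return intermediate;
    }

    public static void main(String[] args) {
        System.out.println(map(0, "deer bear river"));
    }
}
```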
How many Map tasks in Hadoop?
The number of map tasks depends on the total number of blocks of the input files. For a MapReduce job, the right level of parallelism is usually around 10-100 maps per node, though it can be pushed to about 300 maps for very CPU-light map tasks.
For example, suppose the block size is 128 MB and we expect 10 TB of input data. That produces about 82,000 maps. The number of maps is thus driven by the number of splits, which the InputFormat computes.
Mappers = (total data size) / (input split size)
Example – the data size is 1 TB (1,000,000 MB) and the input split size is 100 MB.
Mappers = 1,000,000 / 100 = 10,000
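The formula above can be checked with a few lines of Java. The ceiling division is an assumption on our part: a trailing partial split would still get its own mapper.

```java
// A small check of the mapper-count formula from the text:
// mappers = total data size / input split size.
public class MapperCount {
    static long mappers(long totalSizeMb, long splitSizeMb) {
        // Ceiling division: a final partial split still gets its own mapper.
        return (totalSizeMb + splitSizeMb - 1) / splitSizeMb;
    }

    public static void main(String[] args) {
        // 1 TB expressed as 1,000,000 MB, with 100 MB splits.
        System.out.println(mappers(1_000_000L, 100L));
    }
}
```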
In summary, the Mapper class in Hadoop MapReduce takes a set of data and transforms it into another set of data, breaking individual records down into tuples (key-value pairs).