Key Value Pair in MapReduce

In this tutorial, we give a complete introduction to the MapReduce key-value pair. First, we discuss what a key-value pair is in Hadoop and how key-value pairs are created in MapReduce. Finally, we explain MapReduce key-value pair generation with an example.

What is Key Value Pair in Hadoop MapReduce?

The key-value pair in MapReduce is the record entity that Hadoop MapReduce accepts for execution.

We use Hadoop mainly for data analysis. It deals with structured, unstructured, and semi-structured data. If the schema is static, we can work directly on the columns instead of on keys and values. If the schema is not static, we work on keys and values.

Keys and values are not intrinsic properties of the data. The user who analyzes the data chooses them.

MapReduce is the core component of Hadoop that provides data processing. It divides a job into two phases: the Map phase and the Reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output.

MapReduce Key value pair generation in Hadoop

In MapReduce job execution, the data is first converted into key-value pairs before it is sent to the mapper, because the mapper only understands key-value pairs.

Key-value pair in MapReduce is created as follows:

InputSplit – It is the logical representation of data that InputFormat generates. In a MapReduce program, it describes a unit of work that is processed by a single map task.

RecordReader – It interacts with the InputSplit and transforms the data into key-value pairs suitable for reading by the Mapper. By default, the job uses TextInputFormat, whose RecordReader turns each line into a key-value pair.
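The cooperation between an input split and a record reader can be sketched in plain Python. This is only a simplified model, not Hadoop's API: the function name `read_split`, the sample data, and the boundary rule are assumptions made for illustration. The assumed rule is that a reader skips the first partial line unless its split starts at byte 0, and finishes any line that starts at or before the end of its split.

```python
def read_split(data: bytes, start: int, length: int):
    """Yield (byte offset, line) pairs for one split of `data` (hypothetical helper)."""
    end = start + length
    pos = start
    if start != 0:
        # Assumed rule: the previous split's reader already consumed this line,
        # so skip forward to the start of the next one.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # Key: offset of the line start; value: the line without its terminator.
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


DATA = b"alpha\nbeta\ngamma\n"              # 17 bytes of made-up input
first_split = list(read_split(DATA, 0, 8))   # split covering bytes 0..7
second_split = list(read_split(DATA, 8, 9))  # split covering bytes 8..16
print(first_split)   # [(0, 'alpha'), (6, 'beta')]
print(second_split)  # [(11, 'gamma')]
```

The second split begins in the middle of "beta", yet every line is produced exactly once, because the first reader finishes the line it started and the second reader skips it.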

In MapReduce job execution, the Map function handles one key-value pair at a time and produces zero or more key-value pairs. The Reduce function handles all the values grouped under the same key and produces another set of key-value pairs as the output. The Map output types must match the Reduce input types, as shown below:

Map: (K1, V1) -> list(K2, V2)

Reduce: (K2, list(V2)) -> list(K3, V3)
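The type flow above can be illustrated with a small word count simulated in plain Python. This is a sketch under assumptions, not Hadoop code: `wc_map`, `wc_reduce`, and `run_job` are hypothetical names, and the in-memory grouping stands in for the framework's shuffle step.

```python
from collections import defaultdict


def wc_map(key, value):
    # (K1, V1) = (byte offset, line)  ->  list of (K2, V2) = (word, 1)
    return [(word, 1) for word in value.split()]


def wc_reduce(key, values):
    # (K2, list(V2)) = (word, [1, 1, ...])  ->  list of (K3, V3) = (word, count)
    return [(key, sum(values))]


def run_job(records):
    # Group every V2 under its K2, standing in for the shuffle between phases.
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in wc_map(k1, v1):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        output.extend(wc_reduce(k2, grouped[k2]))
    return output


records = [(0, "Chandler is Joey"), (17, "Mark is John")]
print(run_job(records))
# [('Chandler', 1), ('Joey', 1), ('John', 1), ('Mark', 1), ('is', 2)]
```

Here K1 is an integer offset while K2 and K3 are words, which shows that the key type may change between phases as long as the Map output types match the Reduce input types.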

On what basis is a key-value pair generated in MapReduce?

Key-value pair generation in MapReduce depends entirely on the data set and on the required output. The framework specifies key-value pairs in four places: Map input, Map output, Reduce input, and Reduce output.

Map Input

By default, the Map input key is the byte offset of the line within the file, and the value is the content of the line as Text. We can change both by using a custom input format.
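As an illustration of what a custom input format changes, the sketch below derives the key and value from the line itself instead of using the offset. It is a plain-Python model, not a Hadoop class: the function name `key_value_line_pairs`, the tab separator, and the sample lines are assumptions.

```python
def key_value_line_pairs(lines, separator="\t"):
    """Split each line at the first separator into (key, value) (hypothetical helper)."""
    pairs = []
    for line in lines:
        # partition() splits at the first separator only; if the separator is
        # missing, the whole line becomes the key and the value is empty.
        key, _, value = line.partition(separator)
        pairs.append((key, value))
    return pairs


sample_lines = ["joey\tactor", "chandler\tanalyst", "noseparator"]
print(key_value_line_pairs(sample_lines))
# [('joey', 'actor'), ('chandler', 'analyst'), ('noseparator', '')]
```

With this kind of format, the mapper receives a meaningful key directly and does not need to parse it out of the line.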

Map Output

The Map is responsible for filtering the data. It also sets up the grouping of the data by key.

Key – It is the field, text, or object on which the data is grouped and aggregated at the reducer.

Value – It is the field, text, or object that each individual reduce method call handles.
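A minimal sketch of a mapper that both filters records and chooses the grouping key is given below. The record layout (name, city, amount), the threshold of 100, and the names `filter_map` and `group_by_key` are all assumptions made for illustration.

```python
from collections import defaultdict


def filter_map(offset, line):
    # Input value is a made-up record of the form "name,city,amount".
    name, city, amount = line.split(",")
    amount = int(amount)
    if amount < 100:
        return []              # filtered out: the mapper emits nothing
    return [(city, amount)]    # key = field to group on, value = field to process


def group_by_key(pairs):
    # Collect all values under their key, as the reducer will see them.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


map_input = [(0, "ann,paris,250"), (14, "bob,rome,40"), (26, "cy,paris,120")]
map_output = [pair for off, line in map_input for pair in filter_map(off, line)]
print(map_output)                # [('paris', 250), ('paris', 120)]
print(group_by_key(map_output))  # {'paris': [250, 120]}
```

Choosing the city as the key means each reduce call receives all amounts for one city. Choosing the name instead would group the same data per person.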

Reduce Input

The Map output is the input to Reduce. Hence the Reduce input key and value types are the same as those of the Map output, with the values grouped by key.

Reduce Output

It completely depends on the required output.

MapReduce Key-value Pair Example

Suppose the file stored in HDFS contains the single line: Chandler is Joey Mark is John. The InputFormat defines how this file is split and read. By default, TextInputFormat is used, and its RecordReader transforms the file into key-value pairs.

Key – It is the byte offset of the beginning of the line within the file.

Value – It is the content of the line, excluding line terminators.

Here, the Key is 0 and the Value is Chandler is Joey Mark is John.
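The same result can be reproduced with a few lines of plain Python. This is a sketch of the offset and line rule described above, not the Hadoop implementation; `text_input_pairs` is a hypothetical name, and the two-line variant of the file is an assumption added to show a non-zero key.

```python
def text_input_pairs(content: bytes):
    """Return (byte offset, line without terminator) for every line (hypothetical helper)."""
    pairs = []
    offset = 0
    for raw in content.splitlines(keepends=True):
        pairs.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)  # the next line starts after this line and its terminator
    return pairs


one_line = text_input_pairs(b"Chandler is Joey Mark is John\n")
two_lines = text_input_pairs(b"Chandler is Joey\nMark is John\n")
print(one_line)   # [(0, 'Chandler is Joey Mark is John')]
print(two_lines)  # [(0, 'Chandler is Joey'), (17, 'Mark is John')]
```

In the two-line variant the second key is 17, because the first line occupies 16 bytes plus one byte for the line terminator.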

Conclusion

In conclusion, a key-value pair is the record entity that MapReduce receives for execution. The InputSplit and RecordReader create the key-value pairs. With the default TextInputFormat, the key is the byte offset of the line and the value is the content of the line.