In this tutorial, we are going to cover another component of the MapReduce process: Hadoop MapReduce InputFormat. We will discuss what InputFormat in Hadoop is and what functionality MapReduce InputFormat offers. We will also cover the types of InputFormat in MapReduce and how InputFormat delivers data to the mapper.
What is Hadoop MapReduce InputFormat?
Hadoop InputFormat describes the input specification for the execution of a MapReduce job.
InputFormat describes how to split up and read input files. It is the first step in MapReduce job execution, and it is responsible for creating the input splits and dividing them into records.
Input files store the data for a MapReduce job and reside in HDFS. The format of these files is arbitrary; line-based log files and binary formats can both be used. In MapReduce, the InputFormat class is therefore one of the fundamental classes, and it offers the following functionality:
• InputFormat selects the files or other objects for input.
• It also defines the data splits. A split determines both the size of an individual map task and its potential execution server.
• Hadoop InputFormat defines the RecordReader, which is responsible for reading the actual records from the input files.
How does the Mapper get the data?
The methods that deliver data to the mapper are getSplits() and createRecordReader(), declared in the InputFormat class as shown below:
// From the org.apache.hadoop.mapreduce package (new MapReduce API).
import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {

  // Logically splits the job's input files; one map task is created per split.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Creates the RecordReader that turns a split into key-value records for the mapper.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
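The framework first calls getSplits() to compute the splits for the job and then, for each split, calls createRecordReader() inside the map task to turn that split into records. As a rough illustration of how the pieces fit together, here is a minimal sketch of a custom InputFormat; MyInputFormat and MyRecordReader are assumed names, and it inherits getSplits() from FileInputFormat (covered in the next section) instead of implementing splitting itself:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical custom InputFormat: file splitting is inherited from FileInputFormat,
// and MyRecordReader (an assumed class) converts each split into key-value records.
public class MyInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    return new MyRecordReader();
  }
}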
Types of InputFormat in MapReduce
There are different types of MapReduce InputFormat in Hadoop which are used for different purposes. Let us discuss the Hadoop InputFormat types below:
FileInputFormat
FileInputFormat is the base class for all file-based InputFormats. It specifies the input directory where the data files are located. When we start a MapReduce job, FileInputFormat is given a path containing the files to read. It reads all the files in that directory and then splits them into one or more InputSplits.
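For example, a job driver typically hands FileInputFormat the input directory to read. A minimal sketch; the job name and HDFS path are placeholders, and Configuration, Job, Path, and FileInputFormat come from the org.apache.hadoop packages:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "inputformat-demo");              // placeholder job name
// FileInputFormat reads every file under this directory and splits it into InputSplits.
FileInputFormat.addInputPath(job, new Path("/user/data/input"));  // placeholder HDFS path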
TextInputFormat
TextInputFormat is the default InputFormat. It treats each line of each input file as a separate record and performs no parsing. It is useful for unformatted data or line-based records such as log files. Therefore,
• Key – The byte offset of the beginning of the line within the file (not within the split). It is therefore unique only when combined with the file name.
• Value – The contents of the line, excluding line terminators.
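A mapper reading from TextInputFormat therefore declares LongWritable keys and Text values. A minimal sketch; the output types (Text, IntWritable) and the class name are only illustrative:

// Input key = byte offset of the line, input value = the line itself.
public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // 'line' holds the contents of one line, without its terminator.
    context.write(line, new IntWritable(1));
  }
}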
KeyValueTextInputFormat
It is similar to TextInputFormat in that it also treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat splits the line into key and value at a tab character ('\t'). Hence,
• Key – Everything up to the tab character.
• Value – The remainder of the line after the tab character.
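In the driver, we select KeyValueTextInputFormat and, if a character other than tab should act as the separator, set it on the job's Configuration. A hedged sketch; the property name below is the one used by the newer MapReduce API:

// Tab is already the default separator; shown explicitly for illustration.
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
job.setInputFormatClass(KeyValueTextInputFormat.class);
// The mapper then receives Text keys and Text values, e.g.
//   public class KvMapper extends Mapper<Text, Text, Text, Text> { ... }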
SequenceFileInputFormat
It is an InputFormat that reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and provide direct serialization and deserialization of arbitrary data types. Hence,
• Key & Value – Both are user-defined.
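Since the key and value classes are whatever was written into the sequence file, the mapper's input types must match them. A minimal sketch, assuming a sequence file that stores IntWritable keys and Text values:

job.setInputFormatClass(SequenceFileInputFormat.class);
// Input types must match what the sequence file actually contains (assumed here):
//   public class SeqMapper extends Mapper<IntWritable, Text, Text, IntWritable> { ... }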
SequenceFileAsTextInputFormat
It is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects. It performs the conversion by calling toString() on the keys and values. SequenceFileAsTextInputFormat therefore makes sequence files suitable input for Hadoop Streaming.
SequenceFileAsBinaryInputFormat
SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat with which we can retrieve the sequence file's keys and values as opaque binary objects.
NLineInputFormat
It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and on the length of the lines. If we want our mapper to receive a fixed number of lines of input, we use NLineInputFormat.
N is the number of lines of input that each mapper receives.
By default (N=1), each mapper receives exactly one line of input.
Suppose N=2; then each split contains two lines, so one mapper receives the first two key-value pairs and another mapper receives the next two key-value pairs.
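The value of N is set on the job. A minimal sketch using the setNumLinesPerSplit() helper; the keys and values seen by the mapper are still the byte offset (LongWritable) and the line (Text):

job.setInputFormatClass(NLineInputFormat.class);
// Each input split, and hence each mapper, gets at most two lines (N = 2).
NLineInputFormat.setNumLinesPerSplit(job, 2);
// The equivalent configuration property is mapreduce.input.lineinputformat.linespermap.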
DBInputFormat
This InputFormat reads data from a relational database using JDBC. It is best suited to loading relatively small datasets, perhaps for joining with larger datasets from HDFS using MultipleInputs. Hence,
• Key – LongWritable
• Value – DBWritable.
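In the driver, DBConfiguration supplies the JDBC connection details and DBInputFormat.setInput() describes the table to read; the value class must implement DBWritable. A minimal sketch with placeholder connection details, a placeholder table, and a hypothetical MyRecord class:

// Placeholder JDBC driver, URL, and credentials.
DBConfiguration.configureDB(job.getConfiguration(),
    "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/mydb", "user", "password");
job.setInputFormatClass(DBInputFormat.class);
// MyRecord is a hypothetical class implementing Writable and DBWritable;
// "employees" and the field names are placeholders.
DBInputFormat.setInput(job, MyRecord.class, "employees", null, "id", "id", "name");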
Conclusion
Hence, InputFormat defines how to read data from a file into the Mapper instances. In this tutorial, we have learned about many types of InputFormat, such as FileInputFormat, TextInputFormat, and others. The default InputFormat is TextInputFormat.