Hadoop Streaming

What is Hadoop Streaming? Study How Streaming Works

Is it possible to write MapReduce jobs in languages other than Java?

Hadoop Streaming is a utility that lets us create and run MapReduce jobs with a script or executable written in any language, Java or non-Java, as the mapper or reducer.

This tutorial explains Hadoop Streaming and how it works, and then covers some of the Hadoop Streaming command options.



Introduction to Hadoop Streaming

The Hadoop MapReduce framework is written in Java and, by default, supports map/reduce programs written in Java only. Hadoop does, however, provide ways to write MapReduce programs in languages other than Java.

Hadoop Streaming is the utility that permits us to create and run MapReduce jobs with any script or executable as the mapper or the reducer. It uses Unix streams as the interface between Hadoop and our MapReduce program, so we can write the program in any language that can read from standard input and write to standard output.

Hadoop Streaming can run both Java and non-Java MapReduce jobs on a Hadoop cluster. Languages commonly used with it include Python, Perl, R, PHP, and C++.
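To make the stream-based interface concrete, here is a minimal sketch in Python of a program that reads lines from one stream and writes lines to another. The function name uppercase_lines and the uppercasing step are illustrative assumptions, not part of Hadoop. The function takes the streams as arguments so that it can be tried without a cluster.

```python
import io
import sys


def uppercase_lines(stdin, stdout):
    """Read lines from one stream and write a transformed line to another.

    The transformation (uppercasing) is only a placeholder for real
    mapper logic; the point is the stdin-to-stdout contract.
    """
    for line in stdin:
        stdout.write(line.upper())


if __name__ == "__main__":
    # Local demonstration with an in-memory stream standing in for stdin.
    demo_input = io.StringIO("hello hadoop\nstreaming\n")
    uppercase_lines(demo_input, sys.stdout)
```

To use such a function in an actual script, it would be called with sys.stdin and sys.stdout, which is all the contract described above asks for.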

The syntax for Hadoop Streaming

You can use the syntax below to run MapReduce code written in a language other than Java and process data with the Hadoop MapReduce framework.

  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input myInputDirs \
      -output myOutputDir \
      -mapper /bin/cat \
      -reducer /usr/bin/wc

Parameters Description

  Parameter               Description
  -input myInputDirs      Input location for the mapper
  -output myOutputDir     Output location for the reducer
  -mapper /bin/cat        Mapper executable
  -reducer /usr/bin/wc    Reducer executable

How Streaming Works

Let us now see how Hadoop Streaming works.

• Both the mapper and the reducer (in the above example) are executables that read their input line by line from stdin and emit their output to stdout.

• The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.

• When a script is specified for mappers, each mapper task launches the script as a separate process when the mapper is initialized.

• As the mapper task runs, it converts its inputs (key/value pairs) into lines and feeds the lines to the standard input of the process. Meanwhile, the mapper collects the line-oriented outputs from the standard output of the process and converts each line into a key/value pair, which is collected as the output of the mapper.

• When a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized.

• As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output of the process and converts each line into a key/value pair, which is collected as the output of the reducer.

• For both mapper and reducer, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, the entire line is considered the key and the value is null. This can be customized by setting the -inputformat command option for the mapper and the -outputformat option for the reducer, both described later in this article.
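The line protocol described above can be sketched in Python with a small word count. This is an illustration under stated assumptions, not Hadoop's own code: the names split_key_value, map_words, reduce_counts, and run_locally are hypothetical, and run_locally only imitates the framework by sorting the mapper output before handing it to the reducer. In a real job, the mapper and the reducer would each be a separate script that loops over standard input and prints to standard output.

```python
from itertools import groupby


def split_key_value(line):
    """Split a line at the first tab: the prefix is the key, the rest the value.

    A line with no tab is treated as a key with no value (None).
    """
    line = line.rstrip("\n")
    key, sep, value = line.partition("\t")
    return (key, value) if sep else (key, None)


def map_words(lines):
    """Mapper logic: emit one 'word<TAB>1' line for every word read."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(lines):
    """Reducer logic: sum the values of consecutive lines that share a key.

    Assumes the lines arrive sorted by key, so equal keys are adjacent.
    """
    pairs = (split_key_value(line) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Stand-in for the framework: map, sort the map output, then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for result in run_locally(["to be or not", "to be"]):
        print(result)
```

Keeping the logic in functions that accept any iterable of lines makes it possible to check the mapper and reducer on a laptop before submitting a job to a cluster.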

Let us now discuss some of the streaming command options.

Streaming Command Options

Hadoop Streaming supports a set of streaming command options as well as the generic Hadoop command options. As the syntax below shows, generic options come before streaming options.

The general command line syntax is:

  mapred streaming [genericOptions] [streamingOptions]

The streaming command options are:

1. -input directoryname or filename (Required)

It specifies the input location for the mapper.

2. -output directoryname (Required)

It specifies the output location for the reducer.

3. -mapper executable or JavaClassName (Optional)

It specifies the mapper executable. If it is not specified, IdentityMapper is used as the default.

4. -reducer executable or JavaClassName (Optional)

It specifies the reducer executable. If it is not specified, IdentityReducer is used as the default.

5. -inputformat JavaClassName (Optional)

The class you supply should return key/value pairs of the Text class. If it is not specified, TextInputFormat is used as the default.

6. -outputformat JavaClassName (Optional)

The class you supply should take key/value pairs of the Text class. If it is not specified, TextOutputFormat is used as the default.

7. -numReduceTasks number (Optional)

It specifies the number of reducers.

8. -file filename (Optional)

It makes the mapper, reducer, or combiner executable available locally on the compute nodes.

9. -mapdebug (Optional)

It is the script to call when a map task fails.

10. -reducedebug (Optional)

It is the script to call when a reduce task fails.

11. -partitioner JavaClassName (Optional)

This option specifies the class that determines which reducer a key is sent to.

These are some of the Hadoop Streaming command options.
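As a rough illustration of how these options combine on one command line, the Python sketch below assembles an argument list for the mapred streaming command. The helper name build_streaming_command and the script names mapper.py and reducer.py are assumptions made for the example. Only options listed above are used, and the helper builds the command without running it.

```python
import shlex


def build_streaming_command(input_path, output_path, mapper=None, reducer=None,
                            num_reduce_tasks=None, files=()):
    """Return the argument list for a `mapred streaming` invocation.

    -input and -output are required; the other options are added
    only when a value is supplied.
    """
    args = ["mapred", "streaming", "-input", input_path, "-output", output_path]
    if mapper is not None:
        args += ["-mapper", mapper]
    if reducer is not None:
        args += ["-reducer", reducer]
    if num_reduce_tasks is not None:
        args += ["-numReduceTasks", str(num_reduce_tasks)]
    for name in files:
        # Each -file entry ships one local file to the compute nodes.
        args += ["-file", name]
    return args


if __name__ == "__main__":
    command = build_streaming_command(
        "myInputDirs", "myOutputDir",
        mapper="mapper.py", reducer="reducer.py",   # hypothetical scripts
        num_reduce_tasks=2, files=["mapper.py", "reducer.py"],
    )
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces, and the list could be handed to a process launcher as it is.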

Summary

We hope that after reading this tutorial you clearly understand Hadoop Streaming. Hadoop Streaming allows us to write Map/Reduce jobs in any language (such as Python, Perl, Ruby, or C++) and run them as the mapper/reducer. It thus lets a person with no knowledge of Java write a MapReduce job in the language of their choice.

The tutorial also described the basic communication protocol between the MapReduce framework and the streaming mapper/reducer, and explained some of the Hadoop Streaming command options.