New Features in Hadoop 3

What is New in Hadoop 3? Study the Unique Hadoop 3 Features

The release of Hadoop 3.x is the next big milestone in the line of Hadoop releases. Many people wonder what feature improvements Hadoop 3.x offers over Hadoop 2.x. So in this tutorial, we will take a look at what is new in Hadoop 3 and how it differs from the older versions.


What is New in Hadoop 3?

Below are the changes available in Hadoop 3 that make it unique and fast. Let's have a look at what's new in Hadoop 3.x:

1. The Minimum Supported Version of Java is JDK 8

All the Hadoop jar files are compiled against the Java 8 runtime. Clients now have to install Java 8 to use Hadoop 3.0, and users still on JDK 7 must upgrade to JDK 8.
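Before installing Hadoop 3, a quick sanity check of the runtime can save trouble. This is a minimal sketch; the exact version string varies by JDK vendor and build:

```bash
# Print the installed Java version; Hadoop 3.x needs JDK 8 or later.
java -version
# A suitable runtime reports a 1.8.x (or newer) version string, e.g.:
#   openjdk version "1.8.0_292"
```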

2. HDFS Supports Erasure Coding

Hadoop 3.x uses erasure coding to provide fault tolerance, whereas Hadoop 2.x uses a replication technique to offer the same level of fault tolerance. Let us study the difference between the two.

First, we will look at replication, taking the default replication factor of 3. In Hadoop 2.x, for 6 blocks we have to store a total of 6 × 3 = 18 blocks. Each extra replica of a block adds a storage overhead of 100%, so with two extra copies per block the storage overhead is 200%.

Let us see what happens with erasure coding. For 6 data blocks, 3 parity blocks are calculated; this process is called encoding. Whenever a block goes missing or gets corrupted, it is recomputed from the remaining data and parity blocks; this process is called decoding. In this case, we store a total of 9 blocks for 6 blocks of data, a storage overhead of only 50%.

Therefore, we can achieve the same level of fault tolerance with much less storage. But there is always an overhead in terms of CPU and network for encoding and decoding, so erasure coding is typically used for rarely accessed data.
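In practice, erasure coding is applied per directory with the hdfs ec subcommand. A minimal sketch follows; the directory path is illustrative, while RS-6-3-1024k is the built-in Reed-Solomon policy matching the 6-data/3-parity example above:

```bash
# Show the erasure coding policies known to the cluster.
hdfs ec -listPolicies

# Enable the Reed-Solomon policy with 6 data blocks and 3 parity blocks.
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Apply the policy to a directory that holds rarely accessed data.
hdfs ec -setPolicy -path /archive/cold-data -policy RS-6-3-1024k

# Confirm which policy is now in effect on the directory.
hdfs ec -getPolicy -path /archive/cold-data
```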


3. YARN Timeline Service v.2

YARN Timeline Service v.2 is new in Hadoop 3. The timeline server is responsible for storing and retrieving an application's current and historical information. This information is of two types:

Generic information about completed applications

• Name of the queue

• User information

• Number of attempts per application

• Information about the containers that ran for each attempt

• Generic data stored by the ResourceManager about a completed application, accessed via the web UI

Per-framework information about running and completed applications

• Number of map tasks

• Number of reduce tasks

• Counters

• Information published by application developers to the Timeline Server via the Timeline Client


This data can be queried through a REST API by application- or framework-specific UIs.

Timeline Server v.2 addresses the main shortcomings of version v.1. One of the problems is scalability: Timeline Server v.1 has a single reader/writer instance and a single storage instance, so it does not scale beyond a small number of nodes. In v.2, the Timeline Service has a distributed writer architecture and scalable backend storage. It separates the collection (writing) of data from the serving (reading) of data, and it uses one collector per YARN application. The reader is a separate instance that serves queries via a REST API. Timeline Service v.2 uses HBase for storage, which scales to large sizes while giving good response times for reads and writes.
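As a sketch of how a UI or tool reads this data, the v.2 reader serves REST queries under /ws/v2/timeline/. The hostname, port, cluster name, and application ID below are placeholders:

```bash
# Fetch generic information about one application from the Timeline Reader.
curl "http://timeline-reader.example.com:8188/ws/v2/timeline/clusters/my-cluster/apps/application_1234567890123_0001"

# List the flows most recently active on the cluster.
curl "http://timeline-reader.example.com:8188/ws/v2/timeline/clusters/my-cluster/flows"
```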

4. Support for Opportunistic Containers and Distributed Scheduling

Hadoop 3 introduces the concept of an execution type for containers. An opportunistic container can be dispatched to a NodeManager even if no resources are available for it at that moment; it then queues at the NodeManager until resources free up. Opportunistic containers have lower priority than guaranteed containers, so if guaranteed containers arrive in the middle of the execution of opportunistic ones, the latter get preempted to make room for the guaranteed containers.
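A minimal yarn-site.xml sketch for turning the feature on, assuming centralized allocation through the ResourceManager; the queue length value is illustrative:

```xml
<!-- yarn-site.xml: enable opportunistic containers (values illustrative). -->
<property>
  <!-- Allow the ResourceManager to allocate OPPORTUNISTIC containers. -->
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Maximum number of opportunistic containers a NodeManager may queue. -->
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>
```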

5. Support for More Than Two NameNodes

Until now, Hadoop supported a single active NameNode and a single standby NameNode. With edits replicated to three JournalNodes, this architecture allowed for the failure of one NameNode.

But some situations require a higher level of fault tolerance. By arranging five JournalNodes, we can run a system of three NameNodes, which tolerates the failure of two NameNodes. Therefore, by introducing support for more than two NameNodes, Hadoop 3.0 has made the system even more highly available.
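A minimal hdfs-site.xml sketch of a three-NameNode HA setup; the nameservice ID, NameNode IDs, hostnames, and port are illustrative:

```xml
<!-- hdfs-site.xml: one nameservice with three NameNodes (names illustrative). -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <!-- Hadoop 3 accepts more than two NameNode IDs in this list. -->
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn3</name>
  <value>nn3.example.com:8020</value>
</property>
```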

6. Default Ports of Multiple Services Have Changed

Prior to Hadoop 3.0, many Hadoop services had their default ports in the Linux ephemeral port range (32768-61000). Because of this, these services would often fail to bind at startup, since they conflicted with other applications.

The default ports of these services have been moved out of the ephemeral range. The affected services include the NameNode, Secondary NameNode, DataNode, and Key Management Server (KMS).
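For illustration, a few of the relocated web UI defaults, shown as they would appear if set explicitly in hdfs-site.xml (the old defaults are noted in comments; these values apply automatically and need not be configured by hand):

```xml
<property>
  <!-- NameNode web UI: was 50070 before Hadoop 3. -->
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:9870</value>
</property>
<property>
  <!-- DataNode web UI: was 50075 before Hadoop 3. -->
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:9864</value>
</property>
<property>
  <!-- Secondary NameNode web UI: was 50090 before Hadoop 3. -->
  <name>dfs.namenode.secondary.http-address</name>
  <value>0.0.0.0:9868</value>
</property>
```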

7. Intra-DataNode Balancer

A DataNode manages many disks. During normal write operations, these disks fill up evenly. But adding or removing a disk results in significant skew, and the existing HDFS balancer handles inter-node data skew, not intra-node skew.

The intra-DataNode balancer addresses this situation. It is invoked via the hdfs diskbalancer CLI.
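A sketch of a typical run; the DataNode hostname is illustrative, and the plan file path comes from the output of the -plan step:

```bash
# Generate a balancing plan for one DataNode (hostname is illustrative).
hdfs diskbalancer -plan datanode1.example.com

# Execute the plan file written by the previous step (path is illustrative;
# the -plan step prints where the .plan.json file was stored).
hdfs diskbalancer -execute /system/diskbalancer/2024-Jan-01-00-00-00/datanode1.example.com.plan.json

# Check the progress of the plan running on that DataNode.
hdfs diskbalancer -query datanode1.example.com
```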

8. Daemon and Task Heap Management Reworked

There are various changes in the heap management of daemons and MapReduce tasks:

There are new ways to configure daemon heap sizes, and the system auto-tunes them based on the memory of the host. The HADOOP_HEAPSIZE variable is no longer used; in its place we have the HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN variables. The internal variable JAVA_HEAP_MAX has also been removed, and the default heap sizes have been eliminated, which allows auto-tuning by the JVM. All global and daemon heap-size variables support units; if a variable holds only a number, the size is taken to be in megabytes. If you want to restore the old default, configure HADOOP_HEAPSIZE_MAX in hadoop-env.sh.
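A minimal hadoop-env.sh sketch under these rules; the sizes are illustrative:

```bash
# hadoop-env.sh: daemon heap bounds in Hadoop 3 (values illustrative).
# Units are supported; a bare number is interpreted as megabytes.
export HADOOP_HEAPSIZE_MAX=4g   # upper bound handed to the JVM (-Xmx)
export HADOOP_HEAPSIZE_MIN=1g   # lower bound handed to the JVM (-Xms)
```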

If mapreduce.map.memory.mb or mapreduce.reduce.memory.mb is left at its default of -1, the system automatically derives the value from the Xmx setting in mapreduce.map.java.opts or mapreduce.reduce.java.opts (Xmx being the JVM heap-size option). The reverse is also possible: if no Xmx value is specified in the mapreduce.map/reduce.java.opts keys, the system derives it from the mapreduce.map/reduce.memory.mb keys. If neither value is stated, the default is 1024 MB. Configurations and job code that set these values explicitly are not affected.
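As a sketch in mapred-site.xml, only the heap side is stated and the container size is derived from it; the 2 GB figure is illustrative:

```xml
<property>
  <!-- -1 means: derive the container size from the Xmx in java.opts below. -->
  <name>mapreduce.map.memory.mb</name>
  <value>-1</value>
</property>
<property>
  <!-- A 2 GB heap; the map container size will be inferred from this. -->
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```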

9. Generalization of Yarn Resource Model

The YARN resource model has been generalized to include user-defined resources beyond CPU and memory. These user-defined resources can be software licenses, GPUs, or locally attached storage, and YARN tasks get scheduled based on them.

We can extend the YARN resource model to include arbitrary “countable” resources. A countable resource is one that is consumed by a container and released by the system after completion. Both CPU and memory are countable resources, and so are GPUs (Graphics Processing Units) and software licenses. By default, YARN tracks CPU and memory for each node, application, and queue, but it can be extended to track other user-defined countable resources such as GPUs and software licenses. The integration of GPUs with containers has improved the performance of Data Science and AI use cases.
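A sketch of declaring a custom countable resource; the resource name and the per-node capacity are illustrative (GPU scheduling additionally requires the GPU plugin to be configured):

```xml
<!-- resource-types.xml (ResourceManager and clients): declare the resource. -->
<property>
  <name>yarn.resource-types</name>
  <value>licenses</value>
</property>

<!-- node-resources.xml (each NodeManager): how many units this node offers. -->
<property>
  <name>yarn.nodemanager.resource-type.licenses</name>
  <value>4</value>
</property>
```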

10. Consistency and Metadata Caching for S3A Client

S3A clients can now store metadata for files and directories in a fast and consistent way by using a DynamoDB table. This new feature is called S3Guard. It caches directory data so that the S3A client gets faster lookups, and it offers resilience against inconsistencies between S3 list operations and the status of objects. Files created through S3Guard can always be found afterwards. S3Guard is experimental and should be considered unstable.
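A minimal core-site.xml sketch that switches S3Guard on, assuming DynamoDB access in the bucket's region; the table name is illustrative:

```xml
<property>
  <!-- Use DynamoDB as the S3Guard metadata store. -->
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <!-- DynamoDB table that holds the metadata (name illustrative). -->
  <name>fs.s3a.s3guard.ddb.table</name>
  <value>my-s3guard-table</value>
</property>
<property>
  <!-- Create the table automatically if it does not already exist. -->
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>
```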

So, we have studied many new features of Hadoop 3 that make it unique and powerful.

Summary

As Hadoop has progressed through its different versions, it has gotten better and better. The developers have included many changes to fix bugs, make it more user-friendly, and give it improved features. The changes made to the default ports of different Hadoop services have made it more comfortable to use. Hadoop 3 includes feature improvements like erasure coding, the introduction of Timeline Service v.2, the adoption of the intra-DataNode balancer, and so on. These changes have increased the chances of Hadoop's adoption by the industry.