Hadoop Pros and Cons (Advantages & Disadvantages)

The objective of this tutorial is to discuss the advantages and disadvantages of Hadoop 3.0. With the many changes introduced in Hadoop 3.0, it has become a much better product.

Hadoop is designed to store and manage large amounts of data. It has many pros, such as being free and open source, easy to use, and high-performing, but on the other hand it also has some weaknesses, which we call its disadvantages.

So, let us start exploring the top advantages and disadvantages of Hadoop.

Advantages of Hadoop

Hadoop is easy to use, scalable, and cost-effective, and it has many other advantages besides. Here we discuss the top 12 advantages of Hadoop. So, the following are the pros of Hadoop that make it so popular –

1. Varied Data Sources

Hadoop accepts a variety of data. Data can come from a range of sources like email conversations, social media, etc., and can be structured or unstructured in form. Hadoop can derive value from diverse data. It can accept data in text files, XML files, images, CSV files, etc.

2. Cost-effective

Hadoop is a cost-effective solution as it uses a cluster of commodity hardware to store data. Commodity hardware consists of inexpensive machines, so the cost of adding nodes to the framework is not very high. In Hadoop 3.0, the storage overhead is only about 50%, as opposed to 200% in Hadoop 2.x. Fewer machines are therefore needed to store the same data, as the surplus copies have decreased significantly.
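
As a rough illustration of that difference (assuming the default 3x replication in Hadoop 2.x and an RS(6,3) erasure coding layout in Hadoop 3.0), the small sketch below compares the raw disk space needed to hold 1 TB of user data; the class name and figures are only for illustration.

// Rough storage-overhead comparison: 3x replication vs. RS(6,3) erasure coding.
public class StorageOverhead {
    public static void main(String[] args) {
        double logicalTb = 1.0;                              // 1 TB of user data

        // Hadoop 2.x default: every block is stored 3 times.
        double replicatedTb = logicalTb * 3;                 // 3.0 TB raw (200% overhead)

        // Hadoop 3.0 RS(6,3): 6 data blocks + 3 parity blocks per stripe.
        double erasureCodedTb = logicalTb * (6 + 3) / 6.0;   // 1.5 TB raw (50% overhead)

        System.out.printf("3x replication : %.1f TB raw%n", replicatedTb);
        System.out.printf("RS(6,3) coding : %.1f TB raw%n", erasureCodedTb);
    }
}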

3. Performance

Hadoop, with its distributed processing and distributed storage architecture, processes huge amounts of data at high speed. In 2008, Hadoop even beat supercomputers, the fastest machines of the time, at sorting a terabyte of data. Hadoop divides the input data file into several blocks and stores these blocks over several nodes. It also divides the task that the user submits into various sub-tasks, which are assigned to the worker nodes containing the required data, and these sub-tasks run in parallel, thereby improving performance.
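
As a minimal sketch of how the input gets divided, assuming a 1 GB file and the default 128 MB block size, each block-sized split is typically handed to one map task that runs in parallel with the others:

// Illustration only: a 1 GB input file divided into 128 MB block-sized splits,
// each of which is typically processed by its own map task in parallel.
public class SplitCount {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024;                  // 1 GB input file
        long blockSize = 128L * 1024 * 1024;                   // 128 MB HDFS block size

        long splits = (fileSize + blockSize - 1) / blockSize;  // ceiling division
        System.out.println("Map tasks (one per split): " + splits);  // prints 8
    }
}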

4. Fault-Tolerant

In Hadoop 3.0, fault tolerance is provided by erasure coding. For example, with the erasure coding technique, 6 data blocks produce 3 parity blocks, so HDFS stores a total of 9 blocks. If any node fails, the affected data blocks can be recovered from the parity blocks and the remaining data blocks.
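
As a hedged sketch, erasure coding in HDFS is enabled per directory by applying a policy such as the built-in RS-6-3-1024k policy; the cluster URI and the /cold-data path below are placeholders used only for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: enabling the RS(6,3) erasure coding policy on an HDFS directory.
public class EnableErasureCoding {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);  // placeholder URI

        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        Path dir = new Path("/cold-data");                                      // placeholder path

        // Files written under /cold-data are striped as 6 data + 3 parity blocks.
        dfs.setErasureCodingPolicy(dir, "RS-6-3-1024k");
        System.out.println("Policy set: " + dfs.getErasureCodingPolicy(dir));
    }
}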

5. Highly Available

In Hadoop 2.x, the HDFS architecture has a single active NameNode and a single standby NameNode, so if the active NameNode goes down we have one standby NameNode to fall back on. Hadoop 3.0, however, supports multiple standby NameNodes, making the system even more highly available, as it can continue functioning even if two or more NameNodes crash.
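
A minimal sketch of the client-side configuration for such a setup follows, assuming a made-up nameservice id "mycluster" with one active and two standby NameNodes on hosts master1, master2, and master3 (all names here are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: HA client configuration listing three NameNodes for one nameservice.
public class HaClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "master1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "master2:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn3", "master3:8020");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client transparently fails over to a standby if the active NameNode dies.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Root exists: " + fs.exists(new Path("/")));
    }
}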

6. Low Network Traffic

In Hadoop, each job submitted by the user is split into many independent sub-tasks, and these sub-tasks are assigned to the data nodes, thereby moving a small amount of code to the data rather than moving massive amounts of data to the code, which leads to low network traffic.

7. High Throughput

Throughput means the amount of work done per unit time. Hadoop stores data in a distributed fashion, which permits distributed processing with ease. A given job gets divided into small jobs that work on chunks of data in parallel, thereby giving high throughput.

8. Open Source

Hadoop is an open-source technology i.e. its source code is freely available. We can modify the source code to suit a specific requirement.

9. Scalable

Hadoop works on the principle of horizontal scalability, i.e., we add entire machines to the cluster of nodes rather than changing the configuration of a single machine by adding RAM, disks, and so on (which is known as vertical scalability). Nodes can be added to the Hadoop cluster on the fly, making it a scalable framework.

10. Ease of use

The Hadoop framework takes care of parallel processing; MapReduce programmers do not need to worry about achieving distributed processing, as it is handled automatically at the backend, as shown in the sketch below.
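
A minimal sketch of the classic word-count job illustrates the point: the programmer writes only the map and reduce logic, while Hadoop handles input splitting, scheduling, shuffling, and retries. The class names and the use of command-line arguments for input and output paths are conventions chosen for illustration.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the programmer supplies only map and reduce logic.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // sum the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}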

11. Compatibility

Most emerging Big Data technologies, such as Spark and Flink, are compatible with Hadoop. Their processing engines run over Hadoop as a backend, i.e., we use Hadoop as the data storage platform for them.
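
For instance, here is a minimal sketch of Spark reading its input from HDFS (the hdfs:// path below is a placeholder): Spark supplies the processing engine while Hadoop merely stores the data.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

// Sketch: Spark using HDFS as its storage backend.
public class SparkOnHdfs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-on-hdfs")
                .getOrCreate();

        // Spark's engine does the processing; Hadoop (HDFS) only stores the data.
        Dataset<String> lines = spark.read().textFile("hdfs://namenode:8020/logs/access.log");
        System.out.println("Line count: " + lines.count());

        spark.stop();
    }
}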

12. Multiple Languages Supported

Developers can code using many languages on Hadoop like C, C++, Perl, Python, Ruby, and Groovy.

Disadvantages of Hadoop

1. Issue With Small Files

Hadoop is suitable for a small number of large files, but when it comes to applications that deal with a large number of small files, Hadoop falls short. A small file is nothing but a file that is significantly smaller than Hadoop's block size, which is 128 MB by default (and often configured to 256 MB). A large number of small files overloads the NameNode, since it holds the namespace for the entire system in memory, and makes it difficult for Hadoop to function.
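
To get a feel for the scale of the problem, the rough sketch below uses the commonly cited rule of thumb that each file, block, and directory object consumes on the order of 150 bytes of NameNode heap; the figures are estimates for illustration only.

// Rough illustration of the small-files problem using the ~150 bytes per
// namespace object rule of thumb (an estimate, not an exact figure).
public class SmallFilesEstimate {
    public static void main(String[] args) {
        long bytesPerObject = 150;             // rough per-object heap cost
        long files = 100_000_000L;             // 100 million small files
        long objects = files * 2;              // roughly one file object + one block object each

        double heapGb = objects * bytesPerObject / (1024.0 * 1024 * 1024);
        System.out.printf("Approximate NameNode heap needed: %.1f GB%n", heapGb);
        // Around 28 GB of metadata for files that may each hold only a few KB of data.
    }
}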

2. Vulnerable By Nature

Hadoop is written in Java, a widely used programming language whose weaknesses are well known to cybercriminals; this makes Hadoop comparatively easy to exploit and hence vulnerable to security breaches.

3. Processing Overhead

In Hadoop, data is read from disk and written to disk, which makes read/write operations very expensive when we are dealing with terabytes and petabytes of data. Hadoop cannot do in-memory computation, and hence it incurs processing overhead.

4. Supports Only Batch Processing

At its core, Hadoop has a batch processing engine, which is not efficient at stream processing. It cannot produce output in real time with low latency; it only works on data that we collect and store in files in advance of processing.

5. Iterative Processing

Hadoop cannot do iterative processing by itself. Machine learning and other iterative workloads have a cyclic data flow, whereas Hadoop's data flows through a chain of stages in which the output of one stage becomes the input of the next.

6. Security

For security, Hadoop uses Kerberos authentication, which is hard to manage. Encryption at the storage and network levels is not enabled by default, which is a major point of concern.

So, this was all about Hadoop Pros and Cons. I hope you liked our explanation.

Summary 

Every piece of software used by industry comes with its own set of drawbacks and benefits. If the software is essential for the organization, then one can exploit the benefits and take measures to minimize the drawbacks. We can see that Hadoop's benefits outweigh its shortcomings, making it a strong solution for Big Data needs.