Let’s look at the Hadoop scheduler and its different scheduling policies.
In this tutorial, we will discuss the Hadoop Scheduler in detail. The tutorial begins by explaining what the Hadoop Scheduler is. It then covers the various types of Hadoop schedulers (FIFO, Capacity Scheduler, and Fair Scheduler) along with their advantages and disadvantages.
Let us first begin with an introduction to Hadoop Scheduler.
Introduction to Hadoop Scheduler
Before Hadoop 2, Hadoop MapReduce, the software framework for writing applications that process huge amounts of data (terabytes to petabytes) in parallel on a large Hadoop cluster, was also responsible for scheduling tasks, monitoring them, and re-executing failed tasks.
Hadoop 2 introduced YARN (Yet Another Resource Negotiator). The fundamental idea behind YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons: the Resource Manager, the Application Master, and the Node Manager.
The Resource Manager is the master daemon that arbitrates resources among all the applications in the system. The Node Manager is the per-node worker daemon responsible for containers, monitoring their resource usage, and reporting it to the Resource Manager/Scheduler. The Application Master negotiates resources from the Resource Manager and works with the Node Managers to execute and monitor tasks.
The Resource Manager has two key components: the Scheduler and the Applications Manager.
The Scheduler in the YARN Resource Manager is a pure scheduler: it is responsible only for allocating resources to the running applications.
It does not monitor or track the status of an application. Therefore, the scheduler offers no guarantee of restarting tasks that fail due to either hardware failure or application failure.
The scheduler performs scheduling based on the resource requirements of the applications.
It has some pluggable policies that are responsible for partitioning the cluster resources among the various queues, applications, etc.
The FIFO Scheduler, Capacity Scheduler, and Fair Scheduler are such pluggable policies that are responsible for distributing resources to the applications.
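To make "pluggable" concrete: the Resource Manager picks its scheduler from the `yarn.resourcemanager.scheduler.class` property in `yarn-site.xml`. The sketch below only renders that property for each of the three policies. The helper name `scheduler_property` and the short keys `fifo`, `capacity`, and `fair` are our own illustrative choices, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Fully qualified scheduler classes that ship with Apache Hadoop YARN.
# The short dictionary keys are hypothetical labels used only in this sketch.
SCHEDULER_CLASSES = {
    "fifo": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler",
    "capacity": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler",
    "fair": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler",
}


def scheduler_property(policy: str) -> str:
    """Return the <property> element that selects `policy` in yarn-site.xml."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "yarn.resourcemanager.scheduler.class"
    ET.SubElement(prop, "value").text = SCHEDULER_CLASSES[policy]
    return ET.tostring(prop, encoding="unicode")


print(scheduler_property("fair"))
```

The printed element would go inside the `<configuration>` block of `yarn-site.xml`; the Resource Manager typically needs a restart to pick up a different scheduler class.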
Let us now discuss each of these Schedulers in detail.
Types of Hadoop Schedulers
1. FIFO Scheduler
First In First Out (FIFO) was the default scheduling policy in Hadoop 1; in Apache Hadoop YARN, the default is the Capacity Scheduler. The FIFO Scheduler gives preference to applications submitted earlier over those submitted later. It places the applications in a queue and executes them in the order of their submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application in the queue are allocated first. Only once the first application's requests are satisfied is the next application in the queue served.
Advantages:
- It is simple to understand and doesn’t need any configuration.
- Jobs are executed in the order of their submission.
Disadvantages:
- It is not suitable for shared clusters. If a large application arrives before a short one, the large application uses all the resources in the cluster, and the short application has to wait for its turn. This can starve short applications.
- It does not balance resource allocation between long and short applications.
2. Capacity Scheduler
The Capacity Scheduler permits multiple tenants to securely share a large Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
It supports hierarchical queues to reflect the structure of the organizations or groups that use the cluster resources. A queue hierarchy contains three types of queues: root, parent, and leaf.
The root queue represents the cluster itself, parent queues represent organizations/groups or sub-organizations/sub-groups, and leaf queues accept application submissions.
The Capacity Scheduler allows the sharing of the large cluster while giving capacity guarantees to each organization by allocating a fraction of cluster resources to each queue.
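To see how a queue hierarchy turns into cluster-wide guarantees, consider the sketch below. It assumes capacities are given as percentages of the parent queue, with the children of each parent adding up to 100. The queue names and numbers are hypothetical.

```python
# Hypothetical queue tree: each capacity is a percentage of the parent queue.
QUEUES = {
    "root": {"capacity": 100, "children": ["engineering", "marketing"]},
    "engineering": {"capacity": 70, "children": ["dev", "prod"]},
    "marketing": {"capacity": 30, "children": []},
    "dev": {"capacity": 40, "children": []},
    "prod": {"capacity": 60, "children": []},
}


def absolute_capacities(queues, name="root", parent_share=100.0):
    """Return {leaf_queue: percent of the whole cluster guaranteed to it}."""
    share = parent_share * queues[name]["capacity"] / 100.0
    children = queues[name]["children"]
    if not children:
        return {name: share}  # leaf queue: this is where apps are submitted
    result = {}
    for child in children:
        result.update(absolute_capacities(queues, child, share))
    return result


leaf_shares = absolute_capacities(QUEUES)
print(leaf_shares)
# {'dev': 28.0, 'prod': 42.0, 'marketing': 30.0}
```

Here `dev` is guaranteed 40% of engineering's 70%, that is, 28% of the whole cluster, and the leaf guarantees together add up to 100%.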
Also, free resources can be allocated to a queue beyond its capacity. When queues running below their capacity later demand these resources, the resources are assigned to the applications on those queues as the tasks currently using them complete. This provides elasticity for the organization in a cost-effective manner.
In addition, the Capacity Scheduler offers a comprehensive set of limits to ensure that a single application/user/queue cannot use a disproportionate amount of resources in the cluster.
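The elasticity idea can be sketched as a two-step allocation. This is a simplified model, not the Capacity Scheduler's actual algorithm: it ignores per-queue maximum-capacity limits and the reclaiming of lent resources. The function name and the numbers are illustrative assumptions.

```python
def elastic_allocation(total, guaranteed, demand):
    """Split `total` resource units among queues.

    Each queue first gets min(demand, guarantee). Leftover units are then
    lent, one at a time, to queues whose demand is still unmet.
    Guarantees are assumed to be greater than zero.
    """
    alloc = {q: min(demand[q], guaranteed[q]) for q in guaranteed}
    spare = total - sum(alloc.values())
    while spare > 0:
        hungry = [q for q in alloc if alloc[q] < demand[q]]
        if not hungry:
            break  # nobody wants the spare capacity
        # Lend to the queue with the lowest usage relative to its guarantee.
        q = min(hungry, key=lambda name: alloc[name] / guaranteed[name])
        alloc[q] += 1
        spare -= 1
    return alloc


allocation = elastic_allocation(
    total=100,
    guaranteed={"dev": 40, "prod": 60},
    demand={"dev": 90, "prod": 20},
)
print(allocation)
# {'dev': 80, 'prod': 20}
```

In this example `prod` needs only 20 of its 60 guaranteed units, so `dev` borrows the idle 40 and runs at 80, double its own guarantee.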
To ensure fairness and stability, it also provides limits on initialized and pending apps from a single user and queue.
Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• It provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations using the cluster.
Disadvantages:
• It is the most complex of the three schedulers.
3. Fair Scheduler
The Fair Scheduler allows YARN applications to share resources fairly in large Hadoop clusters. With this scheduler, there is no need to reserve a set amount of capacity because it dynamically balances resources between all running applications.
It assigns resources to applications in such a way that all applications get, on average, an equal amount of resources over time.
By default, this scheduler bases its fairness decisions on memory only. We can configure it to schedule on both memory and CPU.
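When both memory and CPU are considered, the Fair Scheduler uses the Dominant Resource Fairness (DRF) policy. The core quantity in DRF is an application's dominant share: the largest fraction it holds of any single cluster resource. The sketch below computes only that quantity. The cluster size, the application names, and the helper `dominant_share` are assumptions for illustration.

```python
def dominant_share(usage, cluster):
    """Largest fraction of any single cluster resource that an app holds."""
    return max(usage[resource] / cluster[resource] for resource in cluster)


cluster = {"memory_gb": 100, "vcores": 50}
apps = {
    "memory-heavy": {"memory_gb": 40, "vcores": 5},
    "cpu-heavy": {"memory_gb": 10, "vcores": 30},
}

shares = {name: dominant_share(usage, cluster) for name, usage in apps.items()}
# DRF favors the app with the smallest dominant share when offering resources.
next_app = min(shares, key=shares.get)
print(shares, next_app)
```

Judged on memory alone, `cpu-heavy` looks like the smaller consumer (10 GB versus 40 GB), yet it already holds 60% of the cluster's vcores. Under DRF its dominant share is 0.6 against 0.4 for `memory-heavy`, so `memory-heavy` is served next.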
When a single application is running, it uses the whole cluster's resources. When other applications are submitted, the resources that free up are assigned to the new apps so that every app eventually gets roughly the same amount of resources. The Fair Scheduler thus lets short apps complete in a reasonable time without starving long-lived apps.
Like the Capacity Scheduler, the Fair Scheduler supports hierarchical queues to reflect the structure of a large shared cluster.
Besides fair scheduling, the Fair Scheduler allows assigning guaranteed minimum shares to queues, which ensures that certain users, groups, or production applications always get enough resources. When a queue contains apps, it gets at least its minimum share; when the queue doesn’t need its full guaranteed share, the excess is distributed between the other running applications.
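The behavior above corresponds to what is usually called max-min fairness: apps that need less than an equal split keep only what they need, and the rest is divided equally among the others. The following is a minimal sketch of that idea with a single abstract resource. It is not the Fair Scheduler's implementation, and the names and numbers are made up.

```python
def fair_shares(total, demand):
    """Max-min fair split of `total` units among apps with the given demands."""
    shares = {}
    remaining = float(total)
    active = dict(demand)  # apps whose share is not yet decided
    while active:
        equal_cut = remaining / len(active)
        small = {app: d for app, d in active.items() if d <= equal_cut}
        if not small:
            # Everyone left wants more than an equal cut: split evenly.
            for app in active:
                shares[app] = equal_cut
            break
        for app, d in small.items():
            # These apps are fully satisfied; the rest of their cut is released.
            shares[app] = float(d)
            remaining -= d
            del active[app]
    return shares


print(fair_shares(100, {"long": 100}))
# {'long': 100.0}
print(fair_shares(100, {"long": 100, "short-1": 20, "short-2": 50}))
# {'short-1': 20.0, 'long': 40.0, 'short-2': 40.0}
```

A lone app gets the whole cluster. Once two more apps arrive, `short-1` gets the 20 units it asked for and the remaining 80 are split evenly between the two apps that want more.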
Advantages:
• It provides a reasonable way to share the Hadoop cluster among a number of users.
• Moreover, the Fair Scheduler can work with app priorities, where the priorities are used as weights in determining the fraction of the total resources that each application should get.
Disadvantages:
• It requires configuration.
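The use of priorities as weights, mentioned in the list above, boils down to a proportional split. A minimal sketch, assuming every app can use its entire share; the function name, app names, and weights are hypothetical:

```python
def weighted_shares(total, weights):
    """Split `total` units in proportion to each app's weight."""
    weight_sum = sum(weights.values())
    return {app: total * w / weight_sum for app, w in weights.items()}


shares_by_weight = weighted_shares(120, {"prod": 3, "adhoc": 1, "batch": 2})
print(shares_by_weight)
# {'prod': 60.0, 'adhoc': 20.0, 'batch': 40.0}
```

An app with weight 3 receives three times as much as an app with weight 1. With equal weights this reduces to the plain equal split.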
We hope that after reading this tutorial, you understand the pluggable scheduling policies (FIFO, Capacity Scheduler, and Fair Scheduler) that the Hadoop YARN Resource Manager provides for allocating resources among the applications running in a Hadoop cluster.