Hadoop Security

Are you worried about whether the data stored and processed using Hadoop is secure?

Hadoop is a software framework for storing and processing huge amounts of data. In this tutorial, we will study Hadoop Security. The tutorial first explains why Hadoop needs security. Then we will study what Hadoop Security is, the three A’s of Hadoop Security, and how Hadoop achieves security. The tutorial describes Kerberos, transparent encryption in HDFS, and HDFS file and directory permissions, which address HDFS security issues. The tutorial also introduces some Hadoop ecosystem components for monitoring and managing Hadoop Security.

Why Hadoop Security?

Hadoop was originally designed to manage huge amounts of data in a trusted environment, so security was not an important concern. But with the growth of the digital universe and the adoption of Hadoop in almost every domain, such as business, finance, health care, military, education, and government, security has become a crucial concern.

Earlier Hadoop implementations were short on security features, and the built-in security options that did exist were inconsistent across release versions. This affected many domains, such as business sectors, health and medical departments, national security, and the military. It became obvious that there should be a mechanism that ensures Hadoop Security.

So, Hadoop security is a crucial leap that the Hadoop framework needed to take.

What to do about Hadoop Security?

Security in Hadoop requires many of the same approaches used in traditional data management systems. These include the three A’s of security, along with data protection.

Let us first see what these are:

• Authentication

• Authorization

• Auditing

• Data Protection

Authentication: Authentication is the first stage, in which users prove their identity. In authentication, user credentials such as user ID and password are verified. Authentication guarantees that the user seeking to perform an operation is who they claim to be and can therefore be trusted.

Authorization: It is the second stage, which defines what individual users can do after they have been authenticated. Authorization controls what a particular user can do to a specific file, granting or denying the user access to the data.

Auditing: Auditing is the process of keeping track of what an authenticated, authorized user did once they gained access to the cluster. It records all activity of the authenticated user, including what data was accessed, added, or changed, and what analysis the user ran, from the time they logged in to the cluster.

Data Protection: It refers to the use of techniques like encryption and data masking to prevent access to sensitive data by unauthorized users and applications.

Introduction to Hadoop Security

Hadoop’s security was designed and implemented around 2009 and has been stabilizing since then. In 2010, security features were added to Hadoop with the following two fundamental goals:

• Preventing unauthorized access to the files stored in HDFS.

• Not incurring a high cost while achieving authorization.

Hadoop Security, therefore, refers to the process that provides authentication, authorization, and auditing, and secures the Hadoop data store by offering a strong wall of defense against cyber threats.

Let us now have a look at how Hadoop achieves its security.

How does Hadoop achieve Security?


1. Kerberos

Kerberos is an authentication protocol that is now used as a standard to implement authentication in the Hadoop cluster.

Hadoop, by default, does not perform any strong authentication, which can have severe consequences in corporate data centers. To overcome this limitation, Kerberos, which provides a secure way to authenticate users, was introduced into the Hadoop ecosystem.

Kerberos is a network authentication protocol developed at MIT, which uses “tickets” to allow nodes to prove their identity.

Hadoop uses the Kerberos protocol to ensure that whoever makes a request is who they claim to be.

In secure mode, all Hadoop nodes use Kerberos to perform mutual authentication. This means that when two nodes talk to each other, each makes sure that the other node is who it says it is.

Kerberos uses secret-key cryptography for providing authentication for client-server applications.
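As a quick check of whether a cluster is actually running in secure mode, you can read the relevant configuration properties from the command line. This is only a minimal sketch, assuming a working Hadoop client installation; it prints the hadoop.security.authentication and hadoop.security.authorization properties that administrators set in core-site.xml when enabling Kerberos:

  # "simple" on an unsecured cluster, "kerberos" when Kerberos authentication is enabled
  hdfs getconf -confKey hadoop.security.authentication

  # "true" when service-level authorization checks are also enabled
  hdfs getconf -confKey hadoop.security.authorization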

Kerberos in Hadoop

The client performs three steps when using Hadoop with Kerberos.

  1. Authentication: In Kerberos, the client first authenticates itself to the authentication server. The authentication server provides the timestamped Ticket-Granting Ticket (TGT) to the client.
  2. Authorization: The client then uses the TGT to request a service ticket from the Ticket-Granting Server.
  3. Service Request: On receiving the service ticket, the client directly interacts with the Hadoop cluster daemons such as NameNode and ResourceManager.

Authentication server and Ticket Granting Server together form the Key Distribution Center (KDC) of Kerberos.

The client performs the authorization and service request steps on the user’s behalf.

The authentication step is carried out by the user through the kinit command, which will ask for a password.

We don’t need to enter a password every time we run a job because the Ticket-Granting Ticket lasts for 10 hours by default and is renewable for up to a week.

If we don’t want to be prompted for the password, we can create a Kerberos keytab file using the ktutil command.

The keytab file stores the password and is supplied to kinit with the -t option.
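The commands below are a minimal sketch of this workflow using the MIT Kerberos tools; the principal name, realm, and keytab path are made-up placeholders:

  # Interactive authentication: obtain a Ticket-Granting Ticket by typing the password
  kinit alice@EXAMPLE.COM

  # Show the cached TGT, its lifetime, and its renewable lifetime
  klist

  # To avoid password prompts, build a keytab inside the interactive ktutil tool
  ktutil
    addent -password -p alice@EXAMPLE.COM -k 1 -e aes256-cts
    wkt /home/alice/alice.keytab
    quit

  # Authenticate non-interactively from the keytab
  kinit -k -t /home/alice/alice.keytab alice@EXAMPLE.COM

  # Hadoop commands now run as the authenticated principal
  hdfs dfs -ls /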

2. Transparent Encryption in HDFS

For data protection, Hadoop HDFS implements transparent encryption. Once it is configured, data read from and written to special HDFS directories is encrypted and decrypted transparently, without requiring any changes to user application code.

This is end-to-end encryption, which means that only the client encrypts and decrypts the data. Hadoop HDFS never stores or has access to unencrypted data or unencrypted data encryption keys, which satisfies both at-rest and in-transit encryption requirements.

At-rest encryption refers to the encryption of data while it resides on persistent media such as a disk.

In-transit encryption means encryption of data when data is moving over the network.

HDFS encryption enables the existing Hadoop applications to run transparently on the encrypted data.

This HDFS-level encryption also thwarts filesystem-level and OS-level attacks.

Architecture Design

Encryption Zone (EZ): A special directory whose content is transparently encrypted on write and transparently decrypted on read.

Encryption Zone Key (EZK): Every encryption zone has a single EZK, specified when the zone is created.

Data Encryption Key (DEK): Every file in an EZ has its own unique DEK, which is never handled directly by HDFS. DEKs are used to encrypt and decrypt the file data.

Encrypted Data Encryption Key (EDEK): HDFS handles only EDEKs. The client asks the KMS to decrypt an EDEK and then uses the resulting DEK to read or write data.

Key Management Server (KMS): The KMS is responsible for providing access to the stored EZKs, generating new EDEKs for storage on the NameNode, and decrypting EDEKs for use by HDFS clients.

The transparent encryption in HDFS works in the following manner:

  1. When a new file is created in an EZ, the NameNode asks the Key Management Server (KMS) to generate a new Encrypted Data Encryption Key (EDEK), encrypted with the zone’s EZK.
  2. This EDEK is stored on the NameNode as part of the file’s metadata.
  3. When a file within the encryption zone is read, the NameNode provides the client with the file’s EDEK along with the EZK version used to encrypt the EDEK.
  4. The client then asks the KMS to decrypt the EDEK. The KMS first checks whether the client has permission to access that encryption zone key version. If it does, the KMS returns the decrypted DEK, which the client uses to decrypt the file’s contents.

All these steps take place automatically through the Hadoop HDFS client, the NameNode, and the KMS interactions.
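For illustration, the commands below sketch how an administrator might create an encryption zone, assuming the Hadoop KMS is already configured as the key provider; the key name and paths are made-up placeholders:

  # Create an encryption zone key in the KMS
  hadoop key create reports_key

  # Create an empty directory and turn it into an encryption zone
  hdfs dfs -mkdir -p /secure/reports
  hdfs crypto -createZone -keyName reports_key -path /secure/reports

  # Verify the zone; files written under it are now encrypted transparently
  hdfs crypto -listZones
  hdfs dfs -put report.csv /secure/reports/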

3. HDFS File and Directory Permissions

For authorization, Hadoop HDFS checks file and directory permissions after the user has been authenticated.

The HDFS permission model is very similar to the POSIX model. Every file and directory in HDFS has an owner and a group.

Files and directories have separate permissions for the owner, the members of the group, and all other users.

For files, r is the permission to read the file and w is the permission to write or append to it (the execute permission x has no meaning for HDFS files).

For directories, r is the permission to list the content of the directory, w is the permission to create or delete files/directories, and x is the permission to access a child of the directory.

To prevent anyone except the file/directory owner and the superuser from deleting or moving files within a directory, we can set a sticky bit on the directory.

The owner of a file/directory is the user identity of the client process that created it, and the group of a file/directory is inherited from its parent directory.

Also, every client process that accesses HDFS has a two-part identity: a user name and a group list.

HDFS performs the permission check for a file or directory accessed by a client as follows:

  1. If the user name of the client process matches the owner of the file or directory, then HDFS tests the owner permissions;
  2. Otherwise, if the group of the file/directory matches any member of the client process’s group list, then HDFS tests the group permissions;
  3. Otherwise, HDFS tests the other permissions of the file/directory.

If the permissions check fails, then the client operation fails.
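For illustration, the following commands sketch how owners, groups, permissions, and the sticky bit are managed from the HDFS shell; the user, group, and paths are made-up placeholders:

  # Make alice the owner and analytics the group of a directory
  hdfs dfs -chown alice:analytics /data/reports

  # Owner: read/write/execute, group: read/execute, others: no access
  hdfs dfs -chmod 750 /data/reports

  # Set the sticky bit (leading 1) on a shared directory so only a file's owner or the superuser can delete or move it
  hdfs dfs -chmod 1777 /data/shared

  # Inspect the resulting owners and permission bits
  hdfs dfs -ls /data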

Tools for Hadoop Security

The Hadoop ecosystem contains several tools that support Hadoop Security. Two crucial Apache open-source projects that support Hadoop Security are Knox and Ranger.

1. Knox

Knox is a REST API-based perimeter security gateway that performs authentication, supports monitoring and auditing, and provides authorization management and policy enforcement for Hadoop clusters. It typically authenticates user credentials against LDAP or Active Directory. It allows only successfully authenticated users to access the Hadoop cluster.
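Because Knox proxies Hadoop’s REST APIs, such as WebHDFS, a client outside the cluster talks to the gateway over HTTPS instead of contacting Hadoop services directly. The request below is only a sketch; the host name, credentials, and the "default" topology name are made-up placeholders, while port 8443 and the "gateway" context path are common Knox defaults:

  # List the HDFS root directory through the Knox gateway using WebHDFS
  curl -k -u alice:alice-password \
    "https://knox.example.com:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"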

2. Ranger

Apache Ranger is an authorization system that grants or denies access to Hadoop cluster resources such as HDFS files, Hive tables, etc., based on predefined policies. Requests reaching Apache Ranger are assumed to be already authenticated. It provides authorization functionality for various Hadoop components such as YARN, Hive, HBase, etc.

Summary

In this tutorial, we studied Hadoop security. We learned how Hadoop uses Kerberos to authenticate users accessing Hadoop HDFS files or directories. We also discussed transparent encryption in HDFS for protecting files and directories. The tutorial described how HDFS checks a client’s permission to access files or directories. Besides this, the tutorial also highlighted some major Apache projects, such as Knox and Ranger, for monitoring and supporting Hadoop Security.