Hadoop Certifications

This is a complete guide to the various Spark and Hadoop certifications offered by Cloudera. In this Cloudera certification tutorial, we will discuss all the aspects of each exam: the pattern of the test, the number of questions, the passing score, the time limit, the required skills, and the weightage of every topic. We will cover all the certifications provided by Cloudera: “CCA Spark and Hadoop Developer Exam (CCA175)”, “Cloudera Certified Administrator for Apache Hadoop (CCAH)”, “CCP Data Scientist”, and “CCP Data Engineer”.

1. CCA Spark and Hadoop Developer Exam (CCA175)

For the CCA Spark and Hadoop Developer certification, you must write code in Scala or Python and run it on a cluster to prove your skills. The exam can be taken from any computer, anywhere in the world, at any time.

CCA175 is a hands-on, practical exam using Cloudera technologies. Candidates are each given their own CDH5 (currently 5.3.2) cluster pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, and the other software they need.

a. CCA Spark and Hadoop Developer Certification Exam (CCA175) Details:

• Number of Questions: 10–12 performance-based tasks on CDH5 cluster

• Time Limit: 120 minutes

• Passing Score: 70%

• Language: English, Japanese (forthcoming)

• CCA Spark and Hadoop Developer certification cost: USD 295

b. CCA175 Exam Question Format

Each CCA question requires you to solve a particular scenario. In some cases, a tool such as Impala or Hive may be used; in other cases, coding is required. For the Spark problems, a template (in Scala or Python) is often provided that contains a skeleton of the solution, and the candidate must fill in the missing lines with functional code.

c. Prerequisites

There are no prerequisites for taking any Cloudera certification exam.

d. Exam sections and related topics

I. Required Skills

Data Ingest: These are the skills needed to transfer data between external systems and your cluster (a hedged sketch follows this list). It consists of:

  • Using Sqoop to import data from a MySQL database into HDFS, changing the delimiter and file format of the data on import
  • Using Sqoop to export data to a MySQL database
  • Ingesting real-time and near-real-time streaming data into HDFS using Flume
  • Using Hadoop File System (FS) commands to load data into and out of HDFS
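
As a minimal sketch of these ingest tasks, the snippet below wraps the Sqoop and Hadoop FS command lines in Python so that every example in this guide stays in one language. The database host, credentials, table names, and HDFS paths are hypothetical, not exam data.

```python
# Hypothetical Sqoop/HDFS ingest commands; hosts, credentials, tables, and
# paths are assumptions. subprocess is used only to keep the example in Python.
import subprocess

def run(cmd):
    """Run a CLI command on a cluster gateway node; raise if it fails."""
    subprocess.run(cmd, check=True)

# Import a MySQL table into HDFS with a custom (tab) delimiter.
# Swapping in --as-avrodatafile or --as-parquetfile changes the file format.
run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/retail_db",
    "--username", "exam_user", "--password-file", "/user/exam_user/.pw",
    "--table", "orders",
    "--target-dir", "/user/exam_user/orders_text",
    "--fields-terminated-by", "\t",
])

# Export processed results from HDFS back to MySQL.
run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/retail_db",
    "--username", "exam_user", "--password-file", "/user/exam_user/.pw",
    "--table", "order_totals",
    "--export-dir", "/user/exam_user/order_totals",
])

# Hadoop FS commands to load data into and out of HDFS.
run(["hdfs", "dfs", "-put", "/tmp/events.log", "/user/exam_user/raw/"])
run(["hdfs", "dfs", "-get", "/user/exam_user/orders_text", "/tmp/orders_text"])
```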

 II. Transform, Stage, Store

This means converting a set of data values in a given format stored in HDFS into new data values and/or a new data format and writing them back into HDFS. It comprises writing Spark applications in Scala or Python for the tasks below (a hedged PySpark sketch follows the list):

• Load data from HDFS and store back results to HDFS

• Join dissimilar datasets together

• Calculate aggregate statistics (e.g., average or sum)

• Filter data into a smaller dataset

• Write a query that creates ranked or sorted data
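
A minimal PySpark sketch of these transform tasks is shown below; the input paths and column names (orders, customers, order_total, and so on) are assumptions made for illustration.

```python
# Hedged PySpark sketch of the transform tasks above; paths and column names
# are assumptions, not exam data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cca175-transform-sketch").getOrCreate()

# Load data from HDFS.
orders = (spark.read.option("header", True).csv("/user/exam/orders.csv")
          .withColumn("order_total", F.col("order_total").cast("double")))
customers = spark.read.option("header", True).csv("/user/exam/customers.csv")

# Join the datasets together.
joined = orders.join(customers, on="customer_id", how="inner")

# Calculate aggregate statistics (sum and average of order totals per customer).
stats = joined.groupBy("customer_id").agg(
    F.sum("order_total").alias("total_spent"),
    F.avg("order_total").alias("avg_order"),
)

# Filter into a smaller dataset and produce ranked/sorted output.
top = stats.filter(F.col("total_spent") > 1000).orderBy(F.col("total_spent").desc())

# Store the results back to HDFS.
top.write.mode("overwrite").parquet("/user/exam/top_customers")
```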

 III. Data Analysis

Data Definition Language (DDL) is used to create tables in the Hive metastore for use by Hive and Impala. (A hedged sketch follows this list.)

• Read and/or create a table in the Hive metastore in a given schema

• Extract an Avro schema from a set of data files

• Create a table in the Hive metastore using the Avro file format and an external schema file

• Improve query performance by creating partitioned tables in the Hive metastore

• Evolve an Avro schema by changing JSON files
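
The sketch below issues the corresponding Hive DDL through SparkSession.sql so that the example stays in Python (the same statements could be run in Hive or Impala). The table names, columns, locations, and the order.avsc schema file are hypothetical.

```python
# Hedged sketch of the Hive metastore DDL skills above; all names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cca175-ddl-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Table over existing Avro files; with avro.schema.url the Avro SerDe reads the
# authoritative schema from the external .avsc file, so the schema can evolve there.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_avro (
        order_id INT,
        customer_id INT,
        order_total DOUBLE
    )
    STORED AS AVRO
    LOCATION '/user/exam/orders_avro'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///user/exam/schemas/order.avsc')
""")

# Partitioned table to improve the performance of date-filtered queries.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_by_date (
        order_id INT,
        customer_id INT,
        order_total DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")
```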

2. Cloudera Certified Administrator for Apache Hadoop (CCAH)

Cloudera Certified Administrator for Apache Hadoop (CCAH) certification shows your technical knowledge, skills, and ability to configure, deploy, monitor, manage, maintain, and secure an Apache Hadoop cluster.

a. Cloudera Certified Administrator for Apache Hadoop (CCA-500) details

  • Number of Questions: 60 questions
  • Time Limit: 90 minutes
  • Passing score: 70%
  • Language: English, Japanese
  • Cloudera Certified Administrator for Apache Hadoop (CCAH) certification Price: USD 295

b. Exam sections and related topics

I. HDFS (17%)

• HDFS features and design principles, and the function of the HDFS daemons

• Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing

• Features of current computing systems that motivated a system like Apache Hadoop, and commands to manage files in HDFS

• Given a scenario, identify appropriate use cases for HDFS Federation

• Identify the components and daemons of an HDFS HA-Quorum cluster

• File read and write paths, and HDFS security (Kerberos)

• Determine the best data serialization choice for a given scenario

• Internals of HDFS read operations and HDFS write operations

II. YARN (17%)

• Understand how to install core ecosystem components, including Spark, Impala, and Hive

• Understand how to deploy MapReduce v2 (MRv2 / YARN)

• Understand the basic design strategy for YARN and how resource allocations are handled by YARN

• Understand the ResourceManager and NodeManager daemons

• Determine the workflow of a job running on YARN

• Determine which files you must change, and how, in order to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN

III. Hadoop Cluster Planning (16%)

• Major points to take into account while choosing the hardware and operating systems to host an Apache Hadoop cluster

• Be familiar with kernel tuning and disk swapping

• Given a scenario, identify the hardware configuration and ecosystem components your cluster needs

• Cluster sizing: identify the requirements for a workload, including CPU, memory, storage, and disk I/O, for a given case

• Disk sizing and configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster

• Network Topologies: be familiar with network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario

IV. Hadoop Cluster Installation and Administration (25%)

Understand how to install and configure a Hadoop cluster:

• Identify how the cluster will handle disk and machine failures in a given case

• Evaluate a logging configuration and logging configuration file format

• Be aware of the basics of Hadoop metrics and cluster health monitoring

• Deploy ecosystem components in CDH 5 like Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig, etc.

• Identify the function and purpose of available tools for managing the Apache Hadoop file system

V. Resource Management (10%)

• Understand the overall design goals of each of the Hadoop schedulers and the resource manager

• Given a scenario, determine how the Fair, FIFO, and Capacity Schedulers allocate cluster resources under YARN

VI. Monitoring and Logging (15%)

• Be familiar with the functions and features of Hadoop’s metric collection abilities

• Analyze the NameNode and YARN web UIs

• Be aware of how to monitor cluster daemons

• Recognize and monitor CPU usage on master nodes

• Explain how to monitor swap and memory allocation on all nodes

• Interpret a log file and identify how to manage Hadoop’s log files

3. CCP Data Scientist

A CCP Data Scientist is able to perform descriptive statistics, apply advanced analytical techniques, and develop machine learning models using standard tools. Candidates must demonstrate these abilities on a live cluster with large datasets in a variety of formats. Earning the credential requires passing all three CCP Data Scientist exams (DS700, DS701, and DS702), in any order, within 365 days of each other.

a. Common Skills (all exams)

• Extract appropriate features from a large dataset containing bad records, partial records, errors, or other forms of “noise” (a hedged sketch follows this list)

• Extract features from data in various formats like JSON, XML, raw text logs, industry-specific encodings, and graph link data
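
As a hedged sketch of feature extraction from noisy data, the snippet below reads JSON logs permissively, quarantines malformed lines, and derives simple per-user features; the input path and field names (user_id, event, ts) are assumptions.

```python
# Hedged sketch: pull usable features out of noisy JSON logs.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("ccp-ds-features-sketch").getOrCreate()

# Explicit schema; the extra _corrupt_record column collects malformed lines
# instead of failing the whole read.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", LongType()),
    StructField("_corrupt_record", StringType()),
])

raw = (spark.read
       .schema(schema)
       .option("mode", "PERMISSIVE")
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .json("/data/events_json"))   # hypothetical path

# Drop bad or partial records, then derive simple per-user features.
clean = (raw.filter(F.col("_corrupt_record").isNull())
            .filter(F.col("user_id").isNotNull() & F.col("event").isNotNull()))

features = clean.groupBy("user_id").agg(
    F.count("*").alias("n_events"),
    F.countDistinct("event").alias("n_event_types"),
    F.max("ts").alias("last_seen"),
)
features.show(5)
```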

b. Descriptive and Inferential Statistics on Big Data (DS700)

Determine the confidence for a hypothesis using statistical tests (a hedged sketch follows this list):

• Analyze common summary statistics, such as mean, variance, and counts

• Fit a distribution to a dataset and use it to predict event likelihoods

• Execute complex statistical calculations on a large dataset
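
As a minimal sketch of these tasks, the PySpark snippet below computes per-group summary statistics and derives a Welch t statistic from them; the dataset path, the group and value column names, and the two-group assumption are all hypothetical.

```python
# Hedged sketch: summary statistics and a simple two-sample test on a
# hypothetical "measurements" dataset; paths and column names are assumptions.
import math

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ds700-stats-sketch").getOrCreate()

df = spark.read.parquet("/data/measurements")   # hypothetical dataset

# Mean, variance, and counts per experimental group.
stats = df.groupBy("group").agg(
    F.count("*").alias("n"),
    F.mean("value").alias("mean"),
    F.variance("value").alias("var"),
)
stats.show()

# Welch's t statistic for the first two groups, computed from the aggregates
# (compare it against a t distribution to get a confidence level).
a, b = stats.collect()[:2]
t = (a["mean"] - b["mean"]) / math.sqrt(a["var"] / a["n"] + b["var"] / b["n"])
print("t statistic:", t)
```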

c. Advanced Analytical Techniques on Big Data (DS701)

• Create a model that contains appropriate features from a huge dataset (a hedged clustering sketch follows this list)

• Define appropriate data groupings and assign data records from a large dataset into a defined set of data groupings

• Analyze goodness of fit for a given set of data groupings and a dataset

• Employ innovative analytical techniques, such as network graph analysis or outlier detection
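
The snippet below is a hedged clustering sketch using PySpark MLlib: it assigns records to k-means groupings and scores the fit with a silhouette measure. It assumes a reasonably recent PySpark, and the dataset path and feature column names are hypothetical.

```python
# Hedged clustering and goodness-of-fit sketch; names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("ds701-clustering-sketch").getOrCreate()

df = spark.read.parquet("/data/customer_metrics")   # hypothetical dataset

# Assemble numeric columns into a single feature vector.
assembler = VectorAssembler(inputCols=["recency", "frequency", "monetary"],
                            outputCol="features")
vectors = assembler.transform(df)

# Define data groupings and assign each record to one of them.
model = KMeans(k=5, seed=42, featuresCol="features").fit(vectors)
assigned = model.transform(vectors)   # adds a 'prediction' cluster column

# Analyze goodness of fit with the silhouette score (closer to 1 is better).
silhouette = ClusteringEvaluator(featuresCol="features").evaluate(assigned)
print("silhouette:", silhouette)
```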

d. Machine Learning at Scale (DS702)

Create a model with appropriate features from a huge dataset and select a classification algorithm for it (a hedged sketch follows this list):

• Predict labels for an unlabeled dataset using a labeled dataset as a reference

• Tune algorithm meta parameters to maximize algorithm performance

• Determine the success of a given algorithm for the given dataset using validation techniques
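
A hedged sketch of these machine-learning tasks follows: it builds a logistic-regression pipeline, tunes its parameters with cross-validation, checks the result on a held-out split, and then predicts labels for an unlabeled dataset. The paths, feature columns (f1, f2, f3), and label column are assumptions.

```python
# Hedged sketch of training, tuning, and validating a classifier at scale.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("ds702-ml-sketch").getOrCreate()

labeled = spark.read.parquet("/data/labeled_events")      # has a 'label' column
unlabeled = spark.read.parquet("/data/unlabeled_events")  # same features, no label

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# Tune meta parameters with cross-validation (the validation technique here).
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())
evaluator = BinaryClassificationEvaluator(labelCol="label")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

train, test = labeled.randomSplit([0.8, 0.2], seed=7)
model = cv.fit(train)
print("held-out AUC:", evaluator.evaluate(model.transform(test)))

# Predict labels for the unlabeled dataset using the tuned model.
predictions = model.transform(unlabeled).select("f1", "f2", "f3", "prediction")
```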

e. What technologies/languages do you need to know?

You will be provided with a cluster pre-loaded with Hadoop technologies, plus standard tools such as Python and R. Among these standard technologies, it is your choice which to use to solve each problem.

4. CCP Data Engineer

A CCP Data Engineer has the core competencies required to ingest, transform, store, and analyze data in Cloudera’s CDH environment.

a. What do you need to know?

I. Data Ingestion

These are the skills needed to transport data between external systems and your cluster (a hedged streaming-ingest sketch follows this list). It consists of:

• Import and export data between an external RDBMS and your cluster, including specific subsets, changing the delimiter and file format of imported data during ingest, and changing the data access pattern or permissions.

• Ingest real-time and near-real-time (NRT) streaming data into HDFS, including distribution to multiple data sources and converting data on ingest from one format to another.

• Load data into and out of HDFS using the Hadoop File System HDFS commands.
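
As one hedged way to sketch near-real-time ingest in Python, the snippet below uses Spark Structured Streaming with a Kafka source to land events in HDFS as Parquet; it assumes the Kafka integration package is available on the cluster, and the broker, topic, and paths are hypothetical. Flume, which the exam also lists, is the other common route.

```python
# Hedged near-real-time ingest sketch; broker, topic, and paths are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nrt-ingest-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "web_events")
          .load())

# Convert on ingest: Kafka bytes -> strings -> columnar Parquet files in HDFS.
parsed = events.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("value"),
    "timestamp",
)

query = (parsed.writeStream
         .format("parquet")
         .option("path", "/data/landing/web_events")
         .option("checkpointLocation", "/data/checkpoints/web_events")
         .start())
query.awaitTermination()
```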

II. Transform, Stage, Store

This means converting a set of data values in a given format stored in HDFS into new data values and/or a new data format and writing them into HDFS or Hive/HCatalog. It includes the following (a hedged PySpark sketch follows the list):

• Convert data from one file format to another and write it with compression

• Change data from one set of values to another (e.g., Lat/Long to Postal Address using an external library)

• Delete bad records from a data set, e.g., null values

• De-duplicate and merge data

• De-normalize data from multiple disparate data sets

• Evolve an Avro or Parquet schema

• Partition an existing data set according to one or more partition keys

• Tune data for optimal query performance
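
A minimal PySpark sketch of these tasks is given below: it converts tab-delimited text to compressed Parquet, drops bad records, de-duplicates, and partitions the output by date. The input layout, column names, and partition key are assumptions.

```python
# Hedged sketch of the transform/stage/store tasks above; names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ccp-de-transform-sketch").getOrCreate()

# Read delimited text so it can be converted to a different file format.
raw = (spark.read
       .option("header", True)
       .option("delimiter", "\t")
       .csv("/data/raw/visits"))

# Remove bad records (null keys), then de-duplicate.
clean = (raw.filter(F.col("visit_id").isNotNull())
            .dropDuplicates(["visit_id"]))

# Derive a partition key (visit_ts assumed to be a timestamp string) and write
# compressed Parquet partitioned by date, which also tunes the layout for
# typical date-filtered queries.
out = clean.withColumn("visit_date", F.to_date("visit_ts"))
(out.write
    .mode("overwrite")
    .partitionBy("visit_date")
    .option("compression", "snappy")
    .parquet("/data/curated/visits"))
```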

III. Data Analysis

This covers filtering, sorting, joining, aggregating, and/or transforming one or more data sets in a given format stored in HDFS to produce a specified result. The queries will include complex data types (e.g., array, map, struct), the use of external libraries, partitioned data, and compressed data, and will require the use of metadata from Hive/HCatalog. (A hedged sketch follows this list.)

• Write a query to aggregate multiple rows of data and filter data

• Write a query that creates ranked or sorted data

• Write a query that joins various data sets

• Read and/or create a Hive or an HCatalog table from existing data in HDFS
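
The hedged Spark SQL sketch below runs a query over assumed Hive metastore tables (orders, customers) that filters, aggregates, joins, and ranks; the table and column names are hypothetical.

```python
# Hedged analysis sketch using Spark SQL against Hive metastore tables.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ccp-de-analysis-sketch")
         .enableHiveSupport()
         .getOrCreate())

ranked = spark.sql("""
    SELECT c.region,
           c.customer_id,
           SUM(o.order_total)                              AS total_spent,
           RANK() OVER (PARTITION BY c.region
                        ORDER BY SUM(o.order_total) DESC)  AS spend_rank
    FROM   orders o
    JOIN   customers c ON o.customer_id = c.customer_id
    WHERE  o.order_status = 'COMPLETE'      -- filter
    GROUP  BY c.region, c.customer_id       -- aggregate
""")

# Keep only the top 10 customers per region (ranked, sorted output).
ranked.filter("spend_rank <= 10").orderBy("region", "spend_rank").show()
```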

IV. Workflow

It includes the ability to create and execute various jobs and actions that move data towards greater value and use in a system. It includes:

• Create and execute a linear workflow with actions that consist of Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc.

• Create and execute a branching workflow with actions that consist of Hadoop jobs, Hive jobs, Pig jobs, custom action, etc.

• Orchestrate a workflow to run regularly at predefined times, including workflows that have data dependencies

b. What should you expect?

You are given five to eight customer problems, each with a unique, large data set, a CDH cluster, and four hours. For each problem, you must implement a technical solution that meets all the requirements, using any tool or combination of tools on the cluster; you get to choose the tool(s) that are right for the job.