
Lesson 1—Big Data and Hadoop - Introduction

What You’ll Learn

Data and existing solutions

The world of Big Data

Case studies

What Big Data is, why it is required, and where it is applicable

Hadoop and its ecosystem, core components, and capabilities

Disclaimer: All the logos used in this course belong to the respective organizations
The Value of Data

Data is critical to organizations for its immense value.

Technologies have advanced.

Businesses have become dynamic.

Organizations want to derive value from the existing data.

“We don’t have better algorithms, we just have more data.”

- Peter Norvig,
(Google’s Director of Research)
Big Opportunities — Bigger Challenges

Compare digital universe with worldwide installed raw storage capacity.

IDC forecasts the digital universe to be 16 ZB in 2017.

[Chart: growth from 2005 to 2017 of the Digital Universe versus Worldwide Installed Raw Capacity and
Worldwide Capacity Shipped. Units range from bytes (10^0) through KB (10^3), MB (10^6), GB (10^9),
TB (10^12), PB (10^15), and EB (10^18) up to ZB (10^21).]
Replace or Be Replaced

Traditional Solutions:
• RDBMS
• Inexpensive tapes
• Storage solutions for long-term data archiving

Newer Solutions:
• Big Data technologies with capabilities to store more data online
• NoSQL data stores
• Allow streaming
• Support agility and dynamism
Replace or Be Replaced (Contd.)

Data Generators

Organizations seek to glean intelligence from the available data and translate that into business
advantage.

• Digitization of business activities
• Newer sources of information
• Probable use of cheaper equipment

New era – Replace or Be Replaced
What is Big Data?

The characteristics of Big Data are as follows:

Volume – Terabytes to Zettabytes of data

Velocity – Batch to streaming data

Variety – Structured and unstructured data

Veracity – The quality of the data under consideration

Related characteristics sometimes listed include Viscosity, Virality, Volatility, and Validity.
What is Big Data? (Contd.)

Big Data is a term coined to describe large or complex datasets. Traditional data processing
solutions are inadequate to analyze this data and fail to capture, store, curate, search, share,
transfer, visualize, query, and process this kind of data.

The benefits of Big Data are as follows:

It offers a competitive edge and advantage.

It improves decision making.

It identifies the value of data.
Interesting Facts and Statistics

According to IDC (International Data Corporation), worldwide revenues for Big Data and business
analytics will grow from nearly $122 billion in 2015 to more than $187 billion in 2019.

“Organizations able to take advantage of the new generation of business analytics solutions can
leverage digital transformation to adapt to disruptive changes and to create competitive
differentiation in their markets. These organizations don't just automate existing processes – they
treat data and information as they would any valued asset by using a focused approach to extracting
and developing the value and utility of information.”

- Dan Vesset,
(IDC group vice president, Analytics and Information Management)
Interesting Facts and Statistics (Contd.)

“There is little question that Big Data and Analytics can have a considerable impact on just
about every industry,”
“Its promise speaks to the pressure to improve margins and performance while
simultaneously enhancing responsiveness and delighting customers and prospects.
Forward-thinking organizations turn to this technology for better and faster data-driven
decisions,"
- Jessica Goepfert,
(Program director for IDC’s Customer Insights and Analysis Group)
Big Data— Statistics and Challenges

BIG DATA

57.6% of organizations surveyed say that BIG DATA is a challenge.

72.7% consider driving operational efficiencies to be the biggest benefit of a BIG DATA strategy.

50% say that BIG DATA helps in meeting customer demands and facilitating growth.
Big Data Characteristics
Big Data Statistics
Big Data Customer: Behind the Big Data Curtain
Big Data Customers

Hadoop for analytics

Skybox partnered with Cloudera (a well-known Hadoop vendor) to implement its own distribution of
Hadoop and uses Hadoop for analytics:

• Uncover new meanings in satellite imagery
• Gain new insights from geospatial data

Skybox works on raw sensory data output that wouldn't be recognizable to an average user on his own;
the platform assembles, normalizes, and indexes this data to make meaningful connections.

Skybox customers can embed their own algorithms in the company’s platform and use its analytics
engine to crunch data for their own uses:

• Agricultural clients can monitor crop yields.
• Shipping and supply chain companies can monitor their vehicles.
• Oil and gas companies can evaluate land areas.
Big Data Customers - Case Studies

Here is a list of some more customers which have adopted Big Data and Hadoop-based technologies
to power Big Data applications.
Apache Hadoop

Apache Hadoop
An open-source software framework for distributed storage, distributed and parallel processing of very
large datasets on commodity machines that form a cluster.

The core of Apache Hadoop consists of the Hadoop Distributed File System (the storage layer) and
MapReduce (the processing layer).
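As a quick illustration of the two layers, here is a minimal shell sketch. It assumes a working
Hadoop 2.x installation, the bundled hadoop-mapreduce-examples jar, and a user home directory in
HDFS; exact paths may differ per distribution.

# Storage layer: put a local file into HDFS
hdfs dfs -mkdir -p input
hdfs dfs -put /etc/hosts input/

# Processing layer: run the bundled WordCount MapReduce example over it
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output

# Read the result back from HDFS
hdfs dfs -cat output/part-r-00000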
Hadoop Ecosystem and its Components

It is a collection of other open-source components/software packages that can be installed on top of


Hadoop and which can leverage the benefits of distributed file system and distributed processing.

Some of the packages are:

Flume, ZooKeeper, Hive, Apache Storm, HBase, Spark, Kafka, Pig, and Sqoop
Hadoop Ecosystem and its Components (Contd.)

The base Apache Hadoop framework consists of:

• Hadoop Common – libraries and utilities needed by other Hadoop modules

• Hadoop Distributed File System – a distributed file system that stores data on commodity machines

• Hadoop YARN – a resource management platform for managing computing resources in the cluster and
using them for scheduling and processing user applications

• Hadoop MapReduce – a programming model for large-scale distributed and parallel data processing

• Other open-source components/packages

[Diagram: the Hadoop ecosystem stack with Hadoop User Experience (HUE) on top; Sqoop (data
exchange), Flume (log control), Pig (scripting), Hive (SQL), Mahout (ML), Oozie (workflow), HBase
(columnar data store), and ZooKeeper (coordination) sitting on YARN/MapReduce v2 and the Hadoop
Distributed File System.]
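A quick way to see these modules as running daemons is the JDK's jps tool; a sketch, assuming a
pseudo-distributed or small cluster where jps is on the PATH (the output shown is only typical, not
guaranteed).

jps
# Typical output on a single-node setup:
#   NameNode            <- HDFS master
#   DataNode            <- HDFS slave
#   SecondaryNameNode   <- HDFS checkpointing helper
#   ResourceManager     <- YARN master
#   NodeManager         <- YARN slave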
Hadoop: Daemons, Roles, and Components

[Diagram: a Hadoop cluster with two logical layers. The processing layer runs the ResourceManager
(RM), an ApplicationMaster (AM), and NodeManagers (NM) on the slave nodes. The storage layer runs
the NameNode (NN), SecondaryNameNode (SNN), and DataNodes (DN).]
Hadoop Cluster: A complete picture

[Diagram: an API/client/application talks to the NameNode (metadata kept in RAM and on disk) and to
the DataNodes, which also run NodeManagers/TaskTrackers for Map and Reduce tasks. The
SecondaryNameNode and the JobTracker or ResourceManager run alongside the NameNode.]

Legend:
NN - NameNode
DN - DataNode
SNN - SecondaryNameNode
RM/JT - ResourceManager/JobTracker
NM/TT - NodeManager/TaskTracker
M - Map
R - Reduce
M and R - Map and Reduce
HDFS - Hadoop Distributed File System
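To relate this picture to a live cluster, the following commands report the daemons of each layer;
a sketch, assuming the HDFS and YARN client tools are configured to point at the cluster.

# Storage layer: NameNode view of the DataNodes (capacity, live/dead nodes)
hdfs dfsadmin -report

# Processing layer: ResourceManager view of the NodeManagers
yarn node -list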
Quiz

©Simplilearn. All rights reserved


Quiz Which characteristic of Big Data relates to quality of data under consideration?
1

a. Volume

b. Velocity

c. Veracity

d. Validity

©Simplilearn. All rights reserved


Quiz Which characteristic of Big Data relates to quality of data under consideration?
1

a. Volume

b. Velocity

c. Veracity

d. Validity

The correct answer is c.

Explanation: Veracity is the characteristic of Big Data that relates to quality of data
under consideration.

©Simplilearn. All rights reserved


Quiz
2
Why is it possible to achieve more accuracy in analysis when using a framework like Apache
Hadoop in comparison to existing RDBMS solutions?

a. Big Data technologies allow distributed processing.

b. This statement is not true.

c. HDFS makes it possible to store all data online and offers a scale-out approach.

d. RDBMSs are always better for analysis.

©Simplilearn. All rights reserved


Quiz Why is it possible to achieve more accuracy in analysis when using a framework like
2 Apache Hadoop in comparison to existing RDBMS solutions?

a. Big Data technologies allow distributed processing.

b. This statement is not true.

c. HDFS makes it possible to store all data online and offers a scale-out approach.

d. RDBMSs are always better for analysis.

The correct answer is c.

Explanation: HDFS makes it possible to store all data online and offers a scale-out approach; hence,
it is possible to achieve more accuracy in analysis when using a framework like Apache Hadoop than
with existing RDBMS solutions.
©Simplilearn. All rights reserved
Quiz Which nodes does HDFS use to store data?
3

a. NameNodes

b. Tasktracker

c. DataNodes

d. SecondaryNameNode

©Simplilearn. All rights reserved


Quiz Which nodes does HDFS use to store data?
3

a. NameNodes

b. Tasktracker

c. DataNodes

d. SecondaryNameNode

The correct answer is c.


Explanation: HDFS is a distributed file system and it uses DataNodes to store data.

©Simplilearn. All rights reserved


Quiz
Apache Hadoop as an ecosystem comprises other packages that can leverage the benefits
4 of the Hadoop framework. Select the appropriate list.

a. NameNode, DataNodes, SecondaryNameNode, ResourceManager, and NodeManager

b. NameNode, DataNodes, and SecondaryNameNode

c. ResourceManager, NodeManagers, and ApplicationMaster

d. Apache Hive, Pig, Flume, Kafka, Sqoop, and HBase

©Simplilearn. All rights reserved


Quiz
Apache Hadoop as an ecosystem comprises other packages that can leverage the benefits
4 of the Hadoop framework. Select the appropriate list.

a. NameNode, DataNodes, SecondaryNameNode, ResourceManager, and NodeManager

b. NameNode, DataNodes, and SecondaryNameNode

c. ResourceManager, NodeManagers, and ApplicationMaster

d. Apache Hive, Pig, Flume, Kafka, Sqoop, and HBase

The correct answer is d.

Explanation: Apache Hadoop as an ecosystem comprises other packages such as Apache Hive, Pig, Flume,
Kafka, Sqoop, and HBase, which can leverage the benefits of the Hadoop framework.

©Simplilearn. All rights reserved


Quiz
5 Choose the components that may constitute a Hadoop Ecosystem.

a. Namenode, DataNodes, ResourceManager, NodeManager, and SecondaryNameNode

b. Hive, Hbase, Sqoop, Flume, Kafka, and zookeeper

c. Namenodes, DataNodes, ResourceManagers, and NodeManager

d. Cloudera manager server and agents

©Simplilearn. All rights reserved


Quiz
5 Choose the components that may constitute a Hadoop Ecosystem.

a. Namenode, DataNodes, ResourceManager, NodeManager, and SecondaryNameNode

b. Hive, Hbase, Sqoop, Flume, Kafka, and zookeeper

c. Namenodes, DataNodes, ResourceManagers, and NodeManager

d. Cloudera manager server and agents

The correct answer is b.


Explanation: Hive, HBase, Sqoop, Flume, Kafka, and ZooKeeper constitute the Hadoop ecosystem.

©Simplilearn. All rights reserved


Quiz What is the main role of SecondaryNameNode in a Hadoop cluster?
6

a. It checks if NameNode is up and running.

b. It backs up the data from DataNodes.

c. It does automated checkpointing and preserves a copy of NameNode’s metadata.

d. It is the second NameNode for the cluster.

©Simplilearn. All rights reserved


Quiz What is the main role of SecondaryNameNode in a Hadoop cluster?
6

a. It checks if NameNode is up and running.

b. It backs up the data from DataNodes.

c. It does automated checkpointing and preserves a copy of NameNode’s metadata.

d. It is the second NameNode for the cluster.

The correct answer is c.


Explanation: The main role of SecondaryNameNode in the Hadoop cluster is to do automated checkpointing
and preserve a copy of NameNode’s metadata.

©Simplilearn. All rights reserved


When Data is written to HDFS, does the metadata in RAM and DISK
Quiz get updated too?
7

a. Yes, metadata in RAM and Disk gets updated too.

b. No, only the metadata in RAM is updated.

c. No, only the metadata in Disk is updated.

d. Metadata gets updated only when you do formatting of NameNode.

©Simplilearn. All rights reserved


Quiz
7 When Data is written to HDFS, does the metadata in RAM and DISK get updated too?

a. Yes, metadata in RAM and Disk gets updated too.

b. No, only the metadata in RAM is updated.

c. No, only the metadata in Disk is updated.

d. Metadata gets updated only when you do formatting of NameNode.

The correct answer is a.


Explanation: Yes, the metadata in RAM and Disk also gets updated when data is written to HDFS.

©Simplilearn. All rights reserved


Quiz
8 Can we have multiple NameNodes and ResourceManagers in the same cluster?

a. Yes, if HA is enabled

b. Yes, if we have nodes in the cluster

c. Only if 1 of each is allowed in a Hadoop cluster

d. Only if there can be multiple slave daemons in a cluster

©Simplilearn. All rights reserved


Quiz
8 Can we have multiple NameNodes and ResourceManagers in the same cluster?

a. Yes, if HA is enabled

b. Yes, if we have nodes in the cluster

c. Only if 1 of each is allowed in a Hadoop cluster

d. Only if there can be multiple slave daemons in a cluster

The correct answer is a.


Explanation: Yes, you can have multiple NameNodes and ResourceManagers in the same cluster only if HA is
enabled.

©Simplilearn. All rights reserved


Quiz Which module of Apache Hadoop Framework contains libraries and utilities needed by other
9 Hadoop modules?

a. Hadoop Distributed File System

b. Hadoop common

c. Hadoop Yarn and MapReduce

d. Hadoop user experience- Hue

©Simplilearn. All rights reserved


Quiz Which module of Apache Hadoop Framework contains libraries and utilities needed by other
9 Hadoop modules?

a. Hadoop Distributed File System

b. Hadoop common

c. Hadoop Yarn and MapReduce

d. Hadoop user experience- Hue

The correct answer is b.


Explanation: Hadoop common contains libraries and utilities needed by other Hadoop modules.

©Simplilearn. All rights reserved


Quiz
Which identifier’s mismatch can cause DataNodes to fail and not be able to communicate to
10 NameNode?

a. NamespaceID and ClusterID

b. BlockpoolID and relevant blocks missing

c. Only NamespaceID

d. Datanode IDs

©Simplilearn. All rights reserved


Quiz Which identifier’s mismatch can cause DataNodes to fail and not be able to communicate to
NameNode?
10

a. NamespaceID and ClusterID

b. BlockpoolID and relevant blocks missing

c. Only NamespaceID

d. Datanode IDs

The correct answer is a.


Explanation: A NamespaceID and ClusterID mismatch can cause DataNodes to fail and not be able to
communicate with the NameNode.

©Simplilearn. All rights reserved


Key Takeaways

Big Data offers competitive edge and advantage, improves decision making, and identifies the value
of data.

Organizations such as EMC, Skybox Imaging, Caesars Entertainment, Cerner, and many others are
transforming their business through Big Data and Hadoop-based technologies.

The four V’s of Big Data are Volume, Velocity, Variety, and Veracity.

The Apache Hadoop framework consists of modules such as Hadoop Common, Hadoop Distributed File
System, Hadoop YARN, Hadoop MapReduce, and other open-source components or packages.

Apache Hadoop consists of two logical layers — the storage layer and the processing layer.

Disclaimer: All the logos used in this course belong to the respective organizations
This concludes the lesson “Big Data and Hadoop- Introduction.”
The next lesson is “HDFS: Hadoop Distributed File System.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 2- HDFS: Hadoop Distributed File System
What You’ll Learn

Gain knowledge on HDFS, its internals, working, and features.


Differentiate or find similarities in different distributions of
Hadoop.
Identify the different constituents of a Hadoop cluster.

Learn about the replacements of HDFS.

Disclaimer: All the logos used in this course belong to the respective organizations
Lesson 2: HDFS: Hadoop Distributed File System
Topic 2.1: Introduction to HDFS
Scalability
Scalability

Let’s understand the situation that resulted in the need for


HDFS.
Scalability

So were there any existing solutions to handle massive data?


Yes
Evolution of Approach: Scale Up

The 70s and 90s saw vertical scalability as a solution to scalability problems.
Scale Out

The 90s and 2000s saw scale-out architecture emerge as the preferred alternative to vertical scaling.
Open Scale Out

The advent of cloud platforms has led to the emergence of applications that are highly scalable,
open, and capable of running on heterogeneous platforms.
COMMODITY COMPUTING - A Solution

Commodity computing or commodity supercomputing helps in scaling according to


requirements without incurring huge costs.
Commodity Computing - A Solution

The Internet giants have proved that commodity computing and distributed data storage can be efficiently used.
Commodity Computing - A Solution

How does Hadoop help in managing Big Data?

A logically distributed file system

A framework for processing and analyzing large datasets

Avoids vendor lock-ins

Allows resource growth according to demand

Designed to run on small commodity machines for faster processing
HDFS: Key Features

Large-scale processing
Uses the attached storage of commodity machines and shares the cost of network and hardware.

Stores large datasets
Allows organizations to store massive data at a very low cost per byte. (Some cost is inevitable
when vendor-specific distributions of Hadoop are used.)

Processes through high bandwidth
Employs high bandwidth to support MapReduce workloads.

Supports data in varied formats
Accepts varied data formats irrespective of any data constraints or schema limitations.

Allows fault tolerance and scalability
Allows linear scalability, flexibility, and reliability through built-in auto-replication.
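A small sketch of these features in practice, assuming an HDFS client configured for the cluster;
the file and directory names below are only placeholders.

# Store a large local dataset in HDFS; blocks are auto-replicated across DataNodes
hdfs dfs -put bigfile.csv /user/hadoop/data/

# Check how much space the dataset occupies
hdfs dfs -du -h /user/hadoop/data

# Change the replication factor of the stored file (fault tolerance vs. capacity trade-off)
hdfs dfs -setrep -w 2 /user/hadoop/data/bigfile.csv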
Lesson 2: HDFS: Hadoop Distributed File System
Topic 2.2: Hadoop Distributions and terminologies
Hadoop : Different Distributions

http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
Hadoop: Different Distributions

Helps in developing an ecosystem that accelerates the adoption of Hadoop in enterprises

Has successfully driven Hadoop’s open-source distribution in the IT market

Drives its innovations through the Hadoop open data platform

Has developed Apache Ambari, a Hadoop cluster management console

Has strong engineering partners like RedHat, Microsoft, SAP, etc.

Has giant customer accounts like Samsung, Bloomberg, eBay, etc.
Hadoop: Different Distributions

Offers BigSheets and BigInsights as enterprise-grade applications supporting Big Data
characteristics

Combines its Hadoop distribution with enterprise services on its smart cloud infrastructure,
enabling Big Data analytics
Terminologies in Distributions of Hadoop

The main difference between the vendor specific distributions and core Hadoop distribution is Services.

Hadoop: Daemons | Roles | Components

Let’s learn about a few differences in terminologies.


Lesson 2: HDFS: Hadoop Distributed File System
Topic 2.3: Working of HDFS
Internals and working of HDFS

Let’s now understand how HDFS works and why it is called a highly fault tolerant, distributed file
system
Internals and working of HDFS (contd.)

API/Client/Application

?
Hadoop Framework
Metadata in
File Split into Blocks - Blk1, Blk2, Blk3, Blk4
RAM
Blk1 Blk1 Blk1
Blk2 Blk2 Blk2
NameNode
Blk3 Blk3 Blk3
Blk4 Blk4 Blk4
Master daemon of cluster and storage layer

Auto-replicated per default or


Defined replication factor

Node Secondary Resource DataNode DataNode DataNode


Manager NameNode Manager
Slave daemons for the storage layer
Slave daemons for Master of the processing layer
Processing layer
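To see this block layout on a real file, you can ask the NameNode for the block report of a path;
a sketch, where the path is only a placeholder and the cluster must be reachable.

# List the blocks of a file, their replicas, and the DataNodes holding them
hdfs fsck /user/hadoop/data/bigfile.csv -files -blocks -locations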
Internals and working of HDFS (contd.)
HDFS writes and Replication
Reads from HDFS

Replication and Rack Awareness

Rack awareness makes HDFS more fault tolerant
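Rack awareness is driven by a topology mapping supplied by the administrator. A minimal sketch
follows; the script path is a placeholder, the property goes in core-site.xml, and the final command
verifies the mapping the NameNode is using.

# core-site.xml property that points HDFS at an admin-supplied rack-mapping script
#   <property>
#     <name>net.topology.script.file.name</name>
#     <value>/etc/hadoop/conf/topology.sh</value>
#   </property>

# Verify which rack each DataNode has been mapped to
hdfs dfsadmin -printTopology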


Lesson 2: HDFS: Hadoop Distributed File System
Topic 2.4: HDFS Benefits and Probable Replacements
Probable Replacements for HDFS

• Terms of performance
• Availability
• Enterprise-grade features
• Direct Access Storage (DAS) architecture
Benefits of HDFS Over the Other Contenders

Let us discuss the benefits of HDFS.

HDFS features
Scalability, inexpensive devices, and no lock-ins

HDFS and Hadoop usage
Successfully used in production environments

HDFS performance
Achieves success in handling data rates efficiently
Quiz
QUIZ What is a Hadoop Distributed File system?
1

a. Service that starts Hadoop related daemons

b. Component that takes care of storage
c. Role

d. Service and a distributed storage layer that offers fault tolerance.


QUIZ What is a Hadoop Distributed File system?
1

a. Service that starts Hadoop related daemons

b. Component that takes care of storage

c. Role

d. Service and a distributed storage layer that offers fault tolerance.

The correct answer is d.

Explanation: Hadoop Distributed File system is a service and a distributed storage layer that offers fault
tolerance.
QUIZ You are configuring your Hadoop cluster to run MapReduce v2 on YARN. What
2 are the two daemons that need to be installed?

a. Namenode, Datanodes

b. ResourceManager and NodeManager

c. HMaster and RegionServers

d. JobTracker and TaskTracker


QUIZ You are configuring your Hadoop cluster to run MapReduce v2 on YARN. What
2 are the two daemons that need to be installed?

a. Namenode, Datanodes

b. ResourceManager and NodeManager

c. HMaster and RegionServers

d. JobTracker and TaskTracker

The correct answer is b.

Explanation: The ResourceManager (on Master Node) and the NodeManager (on slave nodes) are the
daemons for managing applications in a distributed manner in YARN.
QUIZ
Which of the following are a list of services that run on a Cloudera Distribution
3 of Hadoop ( CDH)?

a. HDFS, MapReduce, Yarn, Flume, Sqoop

b. Yarn, ResourceManager, Hive, HMaster, HDFS
c. Namenode, Datanode, Secondarynamenode, ResourceManager and NodeManager

d. None of the above


QUIZ
Which of the following are a list of services that run on a Cloudera Distribution
3 of Hadoop ( CDH)?

a. HDFS, MapReduce, Yarn, Flume, Sqoop

b. Yarn, ResourceManager, Hive, HMaster, HDFS

c. NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager

d. None of the above

The correct answer is a.

Explanation: HDFS, MapReduce, Yarn, Flume, Sqoop run on a Cloudera Distribution of Hadoop.
QUIZ
What are the functions of Cloudera Manager?
4

a. Monitors the state of services and roles that are running in a cluster

b. Monitors a number of metrics for HDFS, MapReduce, YARN, HBase, ZooKeeper, Flume, and related
role instances

c. Both a and b

d. Cloudera Manager has nothing to do with services or roles


QUIZ

4 What are the functions of Cloudera Manager?

a. Monitors the state of services and roles that are running in a cluster

b. Monitors a number of metrics for HDFS, MapReduce, YARN, HBase, ZooKeeper, Flume, and related
role instances

c. Both a and b

d. Cloudera Manager has nothing to do with services or roles

The correct answer is c.

Explanation: Cloudera manager is responsible for monitoring the services and related roles running on the
hosts of your cluster. It also monitors metrics coming in from various services and roles.
QUIZ
Which service takes care of the activities related to installing CDH, configuring
5 services, and starting and stopping of services?

a. Cloudera manager/ Cloudera-scm-server

b. Cloudera-scm-agents
c. Namenode

d. Hadoop admin and start-up scripts


QUIZ
Which service takes care of the activities related to installing CDH, configuring
5 services, and starting and stopping of services?

a. Cloudera Manager/ Cloudera SCM Server

b. Cloudera SCM Agents

c. NameNode

d. Hadoop admin and start-up scripts

The correct answer is a.

Explanation: Cloudera manager, also known as Cloudera SCM Server or CMF server, takes care of
everything related to installing CDH, configuring services, and starting and stopping of services.
QUIZ
What are the default block sizes in HDFS, and can it be changed?
6

a. Default block size in Hadoop v1 is 64 MB and in Hadoop v2, it is 128 MB. It can be
changed by using the dfs.block.size parameter in hdfs-site.xml.

b. There is no default block size. It can be set by using the dfs.block.size parameter in
hdfs-site.xml.
c. Default block sizes depend on disk storage available. Block size can’t be set.

d. None of the above.


QUIZ

6 What are the default block sizes in HDFS, and can it be changed?

a. Default block size in Hadoop v1 is 64 MB and in Hadoop v2, it is 128 MB. It can be changed by
using the dfs.block.size parameter in hdfs-site.xml.

b. There is no default block size. It can be set by using the dfs.block.size parameter in
hdfs-site.xml.

c. Default block sizes depend on disk storage available. Block size can’t be set.

d. None of the above.

The correct answer is a.

Explanation: The default block size in Hadoop v1 is 64 MB and in Hadoop v2, it is 128 MB. It can be
changed by using the dfs.block.size parameter in hdfs-site.xml.
QUIZ With a cluster of 10 machines with 8 datanodes, what is the maximum
replication that can be achieved, and what is the maximum replication that can
7 be set in the configuration files?

a. Max achievable : 3; can set replication max : 8

b. Max achievable: 8; can set replication to any number
c. Max achievable: 10; can set up to 10

d. Max achievable and replication can be set to 3


QUIZ
With a cluster of 10 machines with 8 datanodes, what is the maximum
7 replication that can be achieved, and what is the maximum replication that can
be set in the configuration files?

a. Max achievable : 3; can set replication max : 8

b. Max achievable: 8; can set replication to any number

c. Max achievable: 10; can set up to 10

d. Max achievable and replication can be set to 3

The correct answer is b.

Explanation: The maximum replication that can be achieved is 8 and can set replication to any number in
the configuration files.
QUIZ
If Rack awareness is enabled, how are blocks placed by replication algorithm?
8 Choose the most appropriate option.

a. All the replicas are placed on the same rack for the same file.

b. All the replicas are never placed on the same rack.
c. If replication set is 3, a minimum of two blocks are placed on the same rack and
one block is placed on a different rack.

d. Rack awareness enabling has nothing to do with replica placement.


QUIZ
If Rack awareness is enabled, how are blocks placed by replication algorithm?
8 Choose the most appropriate option.

a. All the replicas are placed on the same rack for the same file.

b. All the replicas are never placed on the same rack.

c. If replication set is three, a minimum of two blocks are placed on the same rack
and one block is placed on a different rack.

d. Rack awareness enabling has nothing to do with replica placement.

The correct answer is c.

Explanation: Assuming the replication is set to three, two replicas are placed on different nodes on a rack
and one replica is placed on a different closest rack. Thus, all replicas are never placed on same rack.
QUIZ
Identify the most appropriate option for the list of distributions of Hadoop:
9

a. CDH, HDP, MapR, Ubuntu, Centos, Yarn

b. Yarn, HDFS, IBM Big Insight, MapR
c. CDH, HDP, MapR, IBM Big Insight, AWS EMR

d. Apache Hadoop
QUIZ

9 Identify the most appropriate option for the list of distributions of Hadoop:

a. CDH, HDP, MapR, Ubuntu, Centos, Yarn

b. Yarn, HDFS, IBM Big Insight, MapR

c. CDH, HDP, MapR, IBM Big Insight, AWS EMR

d. Apache Hadoop

The correct answer is c.

Explanation: The most appropriate option for the list of distributions of Hadoop are CDH, HDP, MapR, IBM
Big Insight, and AWS EMR.
QUIZ
Identify in which Distribution of Hadoop HDFS has been designed to be easily
10 portable from one platform to another?

a. Apache Hadoop distribution

b. Vendor-specific distribution like Cloudera/Hortonworks
c. All Hadoop distributions

d. Only distributions of Hadoop on cloud platforms


QUIZ
Identify in which Distribution of Hadoop HDFS has been designed to be easily
10 portable from one platform to another?

a. Apache Hadoop distribution

b. Vendor-specific distribution like Cloudera/Hortonworks

c. All Hadoop distributions

d. Only distributions of Hadoop on cloud platforms

The correct answer is b.

Explanation: In vendor specific distribution like Cloudera or Hortonworks HDFS has been designed to be
easily portable from one platform to another.
Key Takeaways

HDFS is a block-structured, distributed file system that solves the storage needs of Big Data and
makes data accessible to Hadoop services.

The various vendor-specific distributions include AWS EMR, Hortonworks Data Platform, Cloudera
Distribution of Hadoop, MapR Hadoop Distribution, IBM InfoSphere BigInsights Hadoop distribution,
and Microsoft Hadoop Distribution.

These distributions take care of functionalities such as support, reliability, and add-on tools to
customize, deploy, manage, and monitor clusters.

Replication is a sequential activity, whereas reading and writing data in HDFS are parallel
activities.

Rack awareness provides data availability and better performance and redundancy in the event of
network switch failure or other failures.

Disclaimer: All the logos used in this course belong to the respective organizations
This concludes the lesson “Hadoop Distributed File System.”
The next lesson is “Hadoop Cluster Setup and Working.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 3- Hadoop Cluster Setup and Working
What You’ll Learn

Setting up a Linux machine in a virtual environment, that is,


Oracle VMBOX
The need for a cluster management solution and the types of
installation methods—automated approach or manual
approach—in brief
Setting up a Hadoop cluster for Apache Hadoop Distribution
and Cloudera distribution of Hadoop and learn to work with
that cluster
Setting up machines using Amazon’s Elastic Compute Cloud
(Amazon EC2)
Ways of storing and managing data
Features of the Cloudera Manager and internals of a Hadoop
cluster

Disclaimer: All the logos used in this course belong to the respective organizations
Install and Prepare Your Machine with
Linux Operating System
Demonstration 1:
Getting Virtualization Software in Linux Disk Image
Demonstration 2:
Adding Machines to your VMBox
Demonstration 3:
Installing Linux into your Machines
Demonstration 4:
Preparing your Linux Machines (CentOS 6) Part 1
Demonstration 5:
Preparing your Linux Machines (CentOS 6) Part 2
Demonstration 6:
Preparing your Linux Machines (CentOS 7)
Cluster Management Solution

Cluster setup is implemented so that servers and network can work together as a centralized data
processing resource.
Cluster Management Solution (Contd.)
Cluster Management Solution Features
Cloudera Manager Vocabulary

Deployment: The configuration for a CM server and all the hosts configured against it.

Cluster: A grouping of hosts which all run the same versions of software. At most, one HDFS service
can run per cluster.

Host: A machine (typically physical) running the CM agent.

Rack: Machines in the same rack, typically served by the same switch.

Service: A system, which may be distributed (HDFS, Impala) or not (Oozie), running on a cluster.

Role: A participant in a system that is tied to a specific host, e.g., a specific DataNode.

Role Config Group: A set of roles (all of the same type) that are all configured the same way.

Config: A key-value pair associated with a scope.

[Diagram: an example deployment containing a cluster “Prod (CDH4)” with host “a001” on rack “/r1”
and host “b001” on rack “/r2”, an “HDFS” service, and role config groups for DataNodes and
NameNodes.]

http://blog.cloudera.com/blog/2013/07/how-does-cloudera-manager-work/
Cloudera Manager: Capabilities

CDH, roles, services, and Cloudera’s internal services

Cloudera Manager runs a central server, the Cloudera Manager Server (SCM Server), which:

• Installs CDH.
• Configures, starts, and stops services.
• Manages helper services for monitoring.
• Maintains client configurations and their life cycle.
• Maintains the entire state of the cluster.
• Passes instructions to the Cloudera Manager agents, also called SCM Agents.

[Diagram: the Cloudera Manager Server on Node 1 communicates over HTTP(S) with Cloudera Manager
agents on Nodes 2–4, which manage the roles and services on those nodes; the server keeps its state
in an embedded database.]
Cloudera Manager: Capabilities

Cloudera SCM agents:

• Send heartbeats to the server
• Inform the SCM server about their status
• Receive instructions for the tasks
• Are responsible for unpacking configuration
• Start or stop processes
• Monitor and send the status of roles/services/hosts
Cloudera Manager: Capabilities

The Data Model contains:
• An updated catalogue of nodes in a cluster
• Configurations assigned to each node
• Services and relevant roles

The Data Model:
• Sends configuration and task instructions to agents
• Tracks their heartbeats
• Receives information from agents
• Calculates the health status of services and the overall cluster
Cloudera Manager: Capabilities

Cloudera Manager:
• Deals with the configuration settings
• Tracks host metrics
• Monitors the cluster and the role status
• Keeps activity monitoring data and configuration changes

Cloudera SCM Agents:
• Listen to the server
• Receive instructions to start, stop, or monitor Hadoop-related roles/daemons
• Collect statistics to be sent to the CM server for health calculations and assist in health
state reporting
Cloudera’s Cluster Management Solution: Cloudera Manager

Cloudera Manager’s admin console is Cloudera’s cluster management solution or application.

Cloudera Manager helps with:

• CDH installation
• Upgrading CDH and its component versions
• Configuring services and changing settings
• Adding new clusters or hosts
• Adding, removing, and maintaining services
Cloudera’s Cluster Management Solution: Cloudera Manager (Contd.)

Cloudera Manager provides:

• End-to-end visibility and centralized control
• Automated solutions and service movement wizards
• A cluster-wide view of services, roles, and nodes
• Snapshots of cluster status and errors
• Collated reports on hardware/memory/job performance
• Cluster-wide configuration changes
• Correlation of services/roles/logs/alerts/metrics/workflows and reporting of cluster status
• A centralized approach to incorporate changes
• Automated start/stop/management of services and roles
• Cluster performance optimization
• Tools/widgets for monitoring, collecting, and reporting cluster metrics and cluster utilization
Editions Offered by Cloudera

DID YOU KNOW?

• Cloudera Express (no license required)
• Cloudera Enterprise Data Hub edition (60-day trial license)
• Cloudera Enterprise (needs a license)
Installation Choices and Know Hows

To install CDH, we need to know its prerequisites; this may involve knowing the supported:

• Operating systems
• Resource requirements
• JDK versions
• Browsers
• Transport layer security
• Databases
• Security and networking requirements
• CDH services and their versions

Refer to https://www.cloudera.com/documentation/enterprise/5-7-x/topics/cm_ig_cm_requirements.html
to know about compatibilities, requirements, and versions of CDH with its components.
Installation

A Cloudera Manager deployment comprises:

• Cloudera’s distribution of Hadoop with its services
• Oracle JDK
• Databases
• Cloudera Manager server and agents
Installation

There are three installation methods:

Option 1: Automated installation by Cloudera Manager (non-production deployments). This option is
not recommended for production because of its inability to scale.

Option 2: Installation using Cloudera Manager parcels or packages. Install packages and software
manually on the machine that will host the Cloudera Manager server; install packages and software
manually or automatically (using Cloudera Manager) on the machines that will host the Cloudera
Manager agents.

Option 3: Manual installation using Cloudera Manager tarballs. Install the Oracle JDK, the Cloudera
Manager server, and the agents manually using tarballs, and then use Cloudera Manager to automate
the installation of CDH.
Cloudera Manager Software Distribution Formats

Cloudera Manager is used to install CDH and manage services. It supports two software distribution
formats:

Package: a binary distribution format that contains compiled code and program files.

Parcel: a binary distribution format that contains the program files along with packaged meta
information used by Cloudera Manager.
Difference Between Packages and Parcels

PARCELS vs. PACKAGES

Version
• Parcels: Self-contained and installed in versioned directories; allow multiple versions to
co-exist; designate one version as active; decouple distribution from activation.
• Packages: Only one package can be installed at a time; the installed and active version are the
same, which poses limitations.

Location
• Parcels: No restrictions on installation location; default location: /opt/cloudera/parcels.
• Packages: Restricted installation location: /usr/lib.

Uniformity
• Parcels: Allow distribution of CDH as a single object; provide internal consistency; allow
rolling upgrades.
• Packages: A separate package for each role in CDH, which may lead to mismatched versions; only
the initial installation is straightforward, making upgrades or component additions harder.

Ease
• Parcels: Handle automatic downloads; allow easier setup of CDH and additional components; avoid
compatibility issues with additional tools.
• Packages: Manual intervention and effort is involved.
Demonstration 7:
Setting Apache Hadoop Cluster in Linux Machine in VMBox.
Demonstration 8:
Writing Data to Cluster and Checking Replication Status
Demonstration 9:
Setting up Linux Machine in AWS EC2 to setup Cloudera Cluster
Demonstration 10:
Setting Cloudera Cluster on your Linux Machine in AWS EC2
Quiz

©Simplilearn. All rights reserved


QUIZ Starting and Stopping of processes related to services in a CDH is handled by:
1

a. Cloudera Manager Server

b. Cloudera Manager Agent
c. Role

d. NameNode

©Simplilearn. All rights reserved


QUIZ Starting and Stopping of processes related to services in a CDH is handled by:
1

a. Cloudera Manager Server

b Cloudera Manager Agent


.
c. Role

d. NameNode

b. is
The correct answer

Explanation: The starting and Stopping of processes related to services in a CDH is handled by Cloudera
Manager Agent.

©Simplilearn. All rights reserved


QUIZ
What is the role of Cloudera SCM agent if the SCM server is installed on every host of cluster
2 to manage the cluster and its services?

a. If SCM server is installed, SCM agent is not needed.

b. Cloudera SCM agents start the processes related to services.
c. SCM agents are dummy services running on master host.

d. SCM agents start the processes for services, monitor them, and also communicate
with SCM server.

©Simplilearn. All rights reserved


QUIZ
What is the role of Cloudera SCM agent if the SCM server is installed on every host of cluster
2 to manage the cluster and its services?

a. If SCM server is installed, SCM agent is not needed.

b. Cloudera SCM agents start the processes related to services.

c. SCM agents are dummy services running on master host.

d. SCM agents start the processes for services, monitor them, and also communicate
with SCM server.

The correct answer is d.

Explanation: The role of a Cloudera SCM agent is to start the processes for services, monitor them,
and communicate with the SCM server.
©Simplilearn. All rights reserved
QUIZ
Which role manages and resolves the under-replication or over/mis replication of blocks?
3

a. NameNode

b. Cloudera manager
c. DataNode themselves.

d. Cloudera hadoop administrator

©Simplilearn. All rights reserved


QUIZ
Which role manages and resolves the under-replication or over/mis replication of blocks?
3

a. NameNode

b. Cloudera manager

c. DataNodes themselves

d. Cloudera Hadoop administrator

The correct answer is a.

Explanation: NameNode is the role that manages and resolves the under-replication or over/mis-
replication of blocks.

©Simplilearn. All rights reserved


QUIZ
How can services be added to existing list of services on cluster using Cloudera admin
4 console?

a. Click services tab> Top right corner under actions “Add services”>follow the
wizard

b. Download the packages related to services, edit configuration files, and start the
services.
c. Services cannot be added to existing list if services were not added during
installation.

d. Call cloudera support for assistance.

©Simplilearn. All rights reserved


QUIZ
How can services be added to existing list of services on cluster using Cloudera admin
4 console?

a. Click services tab> Top right corner under actions “Add services”>follow the
wizard

b. Download the packages related to services, edit configuration files, and start the
services.

c. Services cannot be added to existing list if services were not added during
installation.

d. Call Cloudera support for assistance.

The correct answer is a.

Explanation: To add services to an existing list of services on the cluster using the Cloudera admin
console, click the Services tab > in the top right corner, under Actions, click “Add services” >
follow the wizard.
©Simplilearn. All rights reserved
QUIZ
Which property helps to enable yarn framework in apache Hadoop v2?
5

a. mapreduce.framework.name

b. yarn.resourcemanager.address
c. yarn.nodemanager.name

d. YARN cannot be enabled; it exists by default

©Simplilearn. All rights reserved


QUIZ
Which property helps to enable yarn framework in apache Hadoop v2?
5

a. mapreduce.framework.name

b. yarn.resourcemanager.address

c. yarn.nodemanager.name

d. YARN cannot be enabled; it exists by default

The correct answer is a.

Explanation: mapreduce.framework.name (set to “yarn” in mapred-site.xml) is the property that enables
the YARN framework in Apache Hadoop v2.

©Simplilearn. All rights reserved


QUIZ
What is the benefit of using parcels to setup CDH?
6

a. Parcels can be installed in versioned directories, thus allowing different


versions to coexist.

b. It is always better to use packages rather than parcels.
c. Parcels can be installed anywhere in the filesystem.

d. A and C.

©Simplilearn. All rights reserved


QUIZ
What is the benefit of using parcels to setup CDH?
6

a. Parcels can be installed in versioned directories, thus allowing different


versions to coexist.

b. It is always better to use packages rather than parcels.

c. Parcels can be installed anywhere in the filesystem.

d. A and C.

The correct answer is d.

Explanation: Parcels can be installed in versioned directories and allow different versions. They can be
installed anywhere in the file system.

©Simplilearn. All rights reserved


QUIZ
When are configuration files with properties populated during the installation of CDH?
7

a. During the set up of cluster, configuration files are auto-populated with


properties.

b. They are populated; users need to manually edit configuration files.
c. They are generated after cluster is set up.

d. They are generated when services are started.

©Simplilearn. All rights reserved


QUIZ
When are configuration files with properties populated during the installation of CDH?
7

a. During the set up of cluster, configuration files are auto-populated with


properties.

b. They are populated; users need to manually edit configuration files.

c. They are generated after cluster is set up.

d. They are generated when services are started.

The correct answer is a.

Explanation: During the set up of the cluster, the configuration files are auto-populated with
properties.

©Simplilearn. All rights reserved


QUIZ
Can CDH be set up using single user mode?
8

a. Yes, it can be set up.

b. Yes, it can be set up, but the complexities increase.
c. CDH can only be set up using conventional methods.

d. CDH is always set up as root user.

©Simplilearn. All rights reserved


QUIZ
Can CDH be set up using single user mode?
8

a. Yes, it can be set up.

b. Yes, it can be set up, but the complexities increase.

c. CDH can only be set up using conventional methods.

d. CDH is always set up as root user.

The correct answer is b.

Explanation: Yes, CDH can be set up, but the complexities increase.

©Simplilearn. All rights reserved


QUIZ
In a Cloudera cluster, is the command ‘dfsadmin –report’ needed from the terminal? What
9 does it show?

a. It’s not mandatory. It shows the status of DataNodes.

b. This command cannot be issued by users.

c. This command only works with Apache Hadoop Cluster.

d. The command is not needed; the status can be seen using admin console > HDFS
services.

©Simplilearn. All rights reserved


QUIZ
In a Cloudera cluster, is the command ‘dfsadmin –report’ needed from the terminal? What
9 does it show?

a. It’s not mandatory. It shows the status of DataNodes.

b. This command cannot be issued by users.

c. This command only works with Apache Hadoop Cluster.

d. The command is not needed; the status can be seen using admin console > HDFS
services.

The correct answer is d.

Explanation: In a Cloudera cluster, we don’t need the command ‘dfsadmin -report’; we can see the status
using admin console> HDFS services.

©Simplilearn. All rights reserved


QUIZ
Can the config file properties be edited after the cluster has been started?
10

a. The config file properties are present by default; they cannot be changed.

b. The config file properties can be changed by restarting the services once the
changes are done.
c. The config file properties can be changed only with the permission of Kerberos.

d. The permission of the organization is needed in order to change the properties.

©Simplilearn. All rights reserved


QUIZ
Can the config file properties be edited after the cluster has been started?
10

a. The config file properties are present by default; they cannot be changed.

b. The config file properties can be changed by restarting the services once the
changes are done.

c. The config file properties can be changed only with the permission of Kerberos.

d. The permission of the organization is needed in order to change the properties.

The correct answer is b.

Explanation: The config file properties can be changed, but the services should be restarted once done.

©Simplilearn. All rights reserved


Key Takeaways

Cluster management solutions offer a centralized interface that helps in easier management of all
functionalities related to a cluster.

A cluster management solution offers features that include an installation wizard, node
provisioning, a graphical user interface, inbuilt monitoring, and widgets.

Cloudera Distribution of Hadoop can be set up in three ways: automated installation using the
Cloudera Manager, installation using parcels and packages, and installation of components using
tarballs and then using the Cloudera Manager to automate the installation of CDH.

Disclaimer: All the logos used in this course belong to the respective organizations
©Simplilearn. All rights reserved
This concludes the lesson “Hadoop Cluster Setup and Working.”
The next lesson is “Hadoop Configurations and Daemon Logs”.

Disclaimer: All the logos used in this course belong to the respective organizations
©Simplilearn. All rights reserved
Big Data and Hadoop Administrator
Lesson 4- Hadoop Configurations and Daemon Logs
What You’ll Learn

List and describe the files that control Hadoop configuration

Explain how to manage Hadoop configuration with a Cloudera Manager

Locate configuration files and make changes

Explain how to deal with stale configurations

Disclaimer: All the logos used in this course belong to the respective organizations
What You’ll Learn

Explain the RPC and HTTP default addresses and ports used by
Hadoop Daemons
Locate log files generated on hosts

Filter information in log files

Explain how to use logs to browse information and diagnose issues

Disclaimer: All the logos used in this course belong to the respective organizations
Hadoop Configurations and Daemon Logs

Topic 4.1: Hadoop Configuration


Hadoop Configuration Files

Hadoop’s cluster configuration can be set up in the following files:

hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
mapred-env.sh
yarn-env.sh
hadoop-metrics2.properties
hadoop-policy.xml
log4j.properties
slaves
Location of Configuration Files and Directories

Hadoop configuration files are found in:

/etc/hadoop

/opt/cloudera/parcels when using CDH

A specific directory chosen by the admin when using the Apache Hadoop core distribution
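A quick way to confirm which configuration directory a cluster is actually using is to list the
candidates; a sketch, where the directory names are typical defaults and may differ per
distribution and installation choice.

# Apache Hadoop tarball installation
ls $HADOOP_HOME/etc/hadoop

# CDH with parcels: daemon configs are generated per process by the agent,
# while client configurations typically live under /etc/hadoop/conf
ls /etc/hadoop/conf
ls /opt/cloudera/parcels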
Managing Hadoop Cluster Configurations

Cluster management tools:

• Provide automated wizards

• Include portable cluster configuration templates

• Provide centralized interface to tune configurations and resourcing


parameters
• Recommend pre- and post-alerts when changes in configuration settings
are required

• Keep all machines in sync

• Ensure that the machines are performing regular maintenance tasks


Do You Remember?

Let’s now revise some important terminologies that will help in understanding the topics in
this lesson

Service: A Service is a category of managed functionality in Cloudera Manager.

Service Instance: A Service Instance is an instance of a service running on a cluster that spans
many role instances.

Roles: Roles are daemons or processes that take care of a service.

Role Instance: A Role Instance is an instance of a role running on a host.

Role Group: A Role Group is a set of configuration properties for a set of role instances.
Configuration Management with Cloudera Manager

Let’s now discuss how to manage Hadoop configurations with Cloudera Manager.

Monitoring

Software management

Resource management
Configuration Management with Cloudera Manager (Contd.)

Let’s now understand how Cloudera Manager helps to handle configurations at different
levels.

@Service Level

@Role Group Level

@Role Instance Level


Configuration Management with Cloudera Manager

Let’s now understand how Cloudera Manager helps to handle configurations at different
levels.

Cloudera Manager helps to:


@Service Level
• Easily manage configurations of role
subsets
@Role Group Level • Maintain different configurations for
testing and managing of shared clusters
for a variety of workloads
@Role Instance Level • Independently control processes as all the
processes have separate execution and
configuration environments.

Note: That service related role instances obtain their configurations from a private per process
directory found under “/var/run/cloudera-scm-agent/process/unique-process-name”.
Specifying Configurations

Configuration resources are XML files made up of <property> elements, each with a <name> and a
<value>. For example, in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
Specifying Configurations

Let’s now look at how configuration is handled in Cloudera cluster and in Core Apache Hadoop.

In a Cloudera cluster, the wizard:
• Downloads the Hadoop components
• Assigns roles to hosts
• Autogenerates and populates the configuration files

In Core Apache Hadoop, the admin:
• Downloads the tarball for Hadoop or the Hadoop components
• Untars it
• Edits the configuration files on the different hosts
Configuration Files

Let’s now look at the files that set up the configurations.

First, we will look at setting up environment variables. Files that include these are:

hadoop-env.sh

mapred-env.sh for MapReduce

yarn-env.sh for YARN
Configuration Files

Let’s now look at the files that set up the configurations.

In hadoop-env.sh, set up the following configurations:

JAVA_HOME – path to the Java implementation

HADOOP_HEAPSIZE – defines the memory heap size

HADOOP_LOG_DIR – path where system log files can be found (default: $HADOOP_HOME/logs)
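A minimal hadoop-env.sh sketch with these three settings; the values are placeholders, and the JDK
path in particular depends on the machine.

# hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0          # path to the Java implementation (placeholder)
export HADOOP_HEAPSIZE=1024                  # daemon heap size in MB
export HADOOP_LOG_DIR=/var/log/hadoop        # where daemon log files are written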


Configuration Files (Contd.)

Let’s look at some other typical configuration files in detail.

core-site.xml

<?xml version="1.0"?>
<!– core-site.xml>
<configuration>
<property> <name>fs.defaultFS </name>
<value>hdfs://hostname:port
</value> </property>
</configuration>

Note: If no port is specified, the port 8020 is used by default.


Configuration Files (Contd.)

Hdfs-site.xml

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property> <name>dfs.namenode.name.dir</name>
<value>/disk1/hdfs/name,/remote/hdfs/name</value></property>
<property> <name>dfs.datanode.data.dir</name>
<value>/disk1/hdfs/data,/disk2/hdfs/data</value> </property>
<property><name>dfs.namenode.checkpoint.dir</name>
<value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
</value></property>
</configuration>

Note: If the path is not specified, the default location /tmp is used.
Configuration Files (Contd.)

Yarn-site.xml

<?xml version="1.0"?><!-- yarn-site.xml -->


<configuration>
<property><name>yarn.resourcemanager.hostname</name>
<value>hostname</value></property>
<property><name>yarn.resourcemanager.address</name>
<value>hostname:8031</value></property>
<property><name>yarn.resourcemanager.local-dirs</name>
<value>/tmp/yarn</value></property>
<property><name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value> </property>
<property><name>yarn.nodemanager.resource.memory-mb</name>
<value>16384</value></property>
<property><name>yarn.nodemanager.resource.cpu-vcores</name>
<value>16</value></property>
</configuration>
Stale Configurations
If any changes are made to the cluster, it leads to stale configurations. This requires a restart of
specific roles, services, or the cluster itself. An indicator icon flags stale configurations.
Stale Configurations
Attributes on the Stale Configurations page:

Environment variables – environment set for the role

Configuration files – files used by the role

Process user and group – represents the user and group for the role

System resources – system resources allocated for roles, such as ports, directories, and limits

Client configs metadata – represents client configurations
Fixing Stale Configurations

Let’s now look at how to fix stale configurations. The following actions will help fix stale
configurations:

Restarting the service

Restarting the role or role group

Restarting the Cloudera Management Service

Restarting the cluster or the set of prompted services

Refreshing stale services

Note: In an Apache Hadoop cluster, when any configuration changes are made, the related daemons and
master daemons must be restarted.
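For an Apache Hadoop cluster managed without Cloudera Manager, a restart after configuration changes
might look like the sketch below, using the standard sbin scripts shipped with Hadoop 2.x and run as
the user that owns the daemons.

# Restart the whole storage and processing layers
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Or restart a single role on one host, e.g. a DataNode
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode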
Demonstration 1:
Looking into Logs and Filtering Information
Hadoop Configurations and Daemon Logs

Topic 4.1: Hadoop Configuration


RPC Server Properties
Property name Default value Description

dfs.namenode.rpc-bind-host The address the namenode’s RPC server will bind to. If not set, the bind address is
determined by fs.defaultFS. It can be set to 0.0.0.0 to make the namenode listen on all
interfaces.
dfs.datanode.ipc.address 0.0.0.0:50020 The datanode’s RPC server address and port.

mapreduce.jobhistory.address The job history server’s RPC server address and port. This is used by the client (typically
0.0.0.0:10020
outside the cluster) to query job history.
mapreduce.jobhistory.address The address the job history server’s RPC and HTTP will bind to.

mapreduce.jobhistory.bind-host
The address the resource manager’s RPC and HTTP servers will bind to.

yarn.resourcemanager.bind-host The resource manager’s RPC server address and port. This is used by the client (typically
${y.rm.hostname}:8032
outside the cluster) to communicate with the resource manager.
yarn.resourcemanager.address The resource manager scheduler’s RPC server address and port. This is used by (in-cluster)
${y.rm.hostname}:8030
application masters to communicate with the resource manager.
yarn.resourcemanager.scheduler.ad The resource manager resource tracker’s RPC server address and port. This is used by (in-
dress ${y.rm.hostname}:8031
cluster) node managers to communicate with the resource manager.
yarn.nodemanager.hostname The hostname of the machine the node manager runs on. Abbreviated
0.0.0.0
{y.nm.hostname} below.
yarn.nodemanager.bind-host The address the node manager’s RPC and HTTP servers will bind to.

The node manager’s RPC server address and port. This is used by (in-cluster) application
yarn.nodemanager.address ${y.nm.hostname}:0
masters to communicate with node managers.
HTTP Server Properties
Property name | Default value | Description

dfs.namenode.http-address | 0.0.0.0:50070 | The namenode's HTTP server address and port.
dfs.namenode.http-bind-host | (not set) | The address the namenode's HTTP server will bind to.
dfs.namenode.secondary.http-address | 0.0.0.0:50090 | The secondary namenode's HTTP server address and port.
dfs.datanode.http.address | 0.0.0.0:50075 | The datanode's HTTP server address and port. (Note that the property name is inconsistent with the ones for the namenode.)
mapreduce.jobhistory.webapp.address | 0.0.0.0:19888 | The MapReduce job history server's address and port. This property is set in mapred-site.xml.
mapreduce.shuffle.port | 13562 | The shuffle handler's HTTP port number. This is used for serving map outputs and is not a user-accessible web UI. This property is set in mapred-site.xml.
yarn.resourcemanager.webapp.address | ${y.rm.hostname}:8088 | The resource manager's HTTP server address and port.
yarn.nodemanager.webapp.address | ${y.nm.hostname}:8042 | The node manager's HTTP server address and port.
yarn.web-proxy.address | (not set) | The web app proxy server's HTTP server address and port. If not set (the default), the web app proxy server will run in the resource manager process.

dfs.datanode.dns.interface specifies which network interface the datanodes use to determine the IP address they register with the RPC and HTTP servers.
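
A quick way to verify that the daemons are listening on these defaults is to probe the HTTP endpoints from a terminal. A minimal sketch; the hostnames are placeholders for your own nodes:

# NameNode web UI (default port 50070 in Hadoop 2)
curl -s http://namenode-host:50070/ | head

# ResourceManager web UI (default port 8088)
curl -s http://resourcemanager-host:8088/cluster | head

# NodeManager web UI (default port 8042)
curl -s http://worker-host:8042/node | head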
Hadoop Configurations and Daemon Logs

Topic 4.2: Working with Logs


Information in Logs

• Viewing log information on the Admin Console: Diagnostics > Logs
• Viewing logs for all roles: Diagnostics > Logs > Search

Each log entry shows the following fields:

• Host: host where the log entry appeared
• Log Level: severity associated with the log entry
• Time: date and time of the log entry
• Source: class that generated the message
• Message: message of the log entry
Filtering Information from Logs

You can filter information from logs based on the following parameters:

Time range selector Search Phrase

Select Sources Hosts

Log level and severity of messages Search time-out and results per page

Note: If required, you can download the Full log from the logs page.
Log Information in CDH

• Logs for a specific role or service running on a host:
  Linux terminal, under the /var/log path

• Log information about cloudera-scm-server and agents:
  Admin console: Diagnostics > Logs > Sources > Cloudera Manager > select agent/server > Search

• Cloudera Manager server log:
  Admin console: Diagnostics > Logs > Server log

• Cloudera Manager server, agent, event, and installer logs:
  Linux terminal:
  /var/log/cloudera-scm-server/
  /var/log/cloudera-scm-agent/
  /var/log/cloudera-scm-event*/
  /var/log/cloudera-scm-installer/
Log Information in Apache Hadoop Cluster
Let’s now look at finding log information in Apache Hadoop Cluster.

Path to logs for daemons (default): $HADOOP_HOME/logs

Types of log files: .out and .log

Naming convention for log files: hadoop-<user-running-hadoop>-<daemon>-<hostname>.log

Path to the log4j file: etc/hadoop/conf/log4j.properties
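
As a quick illustration of the naming convention above, the snippet below is a minimal sketch; the user, daemon, and hostname in the example file name are hypothetical:

# List the NameNode's log files under the default log directory
ls $HADOOP_HOME/logs/hadoop-hdfs-namenode-*.log

# Follow a daemon log in real time (user "hdfs", daemon "namenode", host "master01" are examples)
tail -f $HADOOP_HOME/logs/hadoop-hdfs-namenode-master01.log

# Search all daemon logs for errors
grep -i "ERROR" $HADOOP_HOME/logs/*.log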


More about Logs

Let’s now note some more important information about logs.

Job configuration XML logs: /var/log/hadoop or /var/log/hadoop/history

Convention to construct /hadoop file names: job_<job_ID>_conf.xml
Example: job_200704180028_0002_conf.xml

Convention to construct /hadoop/history file names: <hostname>_<epoch-of-jobtracker-start>_<job-id>_conf.xml
Example: ec2-52-43-63-183.compute-1.amazonaws.com_1240642372616_job_200704180028_0002_conf.xml

Standard error logs: /var/log/hadoop/userlogs/attempt_<job-id>_<map-or-reduce>_<attempt-id>
Demonstration 2:
Working with Configurations in Cloudera Cluster and Fixing Stale
Configurations
Quiz
QUIZ
Which of the following is the configuration file with properties defined for
1 metadata path, data path, and other paths related to roles/daemons?

a. Hdfs-site.xml

b Core-site.xml
.
c. Hadoop-policy.xml

d. Yarn-site.xml
QUIZ
Which of the following is the configuration file with properties defined for
1 metadata path, data path, and other paths related to roles/daemons?

a. Hdfs-site.xml

b Core-site.xml
.
c. Hadoop-policy.xml

d. Yarn-site.xml

The correct answer is a.

Explanation: Hdfs-site.xml is the configuration file with properties defined for metadata path, data
path, and other paths related to roles/daemons.
QUIZ
Which of the following is the default Heap_Size allocated to each daemon in a
2 cluster?

a. 10 % of node’s RAM

b 1 GB
.
c. 30 % of node’s RAM

d. None of the above


QUIZ
Which of the following is the default Heap_Size allocated to each daemon in a
2 cluster?

a. 10 % of node’s RAM

b 1 GB
.
c. 30 % of node’s RAM

d. None of the above

The correct answer is b.

Explanation: 1 GB is the default Heap_Size allocated to each daemon in a cluster.


QUIZ
Which of the following are the different levels at which configurations can be
3 defined and managed in a Cloudera Hadoop cluster?

a. Service level, Role group, and Role instance levels

b Cloudera manager, Hosts, Service level, Role group, and Role instance levels
.
c. Role group and Role instance levels only

d. Host and service level


QUIZ
Which of the following are the different levels at which configurations can be
3 defined and managed in a Cloudera Hadoop cluster?

a. Service level, Role group, and Role instance levels

b Cloudera manager, Hosts, Service level, Role group, and Role instance levels
.
c. Role group and Role instance levels only

d. Host and service level

The correct answer is b.

Explanation: Cloudera manager, Hosts, Service level, Role group, and Role instance levels are the
different levels at which configurations can be defined and managed in a Cloudera Hadoop cluster.
QUIZ
4 Which of the following is a fix for stale configurations?

a. Restart all affected services and redeploy client configurations

b Restart Cloudera Manager


.
c. Restart hosts

d. None of the above


QUIZ
4 Which of the following is a fix for stale configurations?

a. Restart all affected services and redeploy client configurations

b Restart Cloudera Manager


.
c. Restart hosts

d. None of the above

The correct answer is a.

Explanation: Restarting all affected services and redeploying client configurations is a fix for stale
configurations.
QUIZ
Which of the following specifies the amount of physical memory (in MB) that
5 may be allocated to containers being run by the node manager?

a. yarn.resourcemanager.resource.memory-mb

b yarn.nodemanager.resource.vcores-mb
.
c. yarn.nodemanager.resource.memory-mb

d. Hadoop admin and start-up scripts


QUIZ
Which of the following specifies the amount of physical memory (in MB) that
5
may be allocated to containers being run by the node manager?

a. yarn.resourcemanager.resource.memory-mb

b yarn.nodemanager.resource.vcores-mb
.
c. yarn.nodemanager.resource.memory-mb

d. Hadoop admin and start-up scripts

The correct answer is c.

Explanation: yarn.nodemanager.resource.memory-mb specifies the amount of physical memory (in


MB) that may be allocated to containers being run by the node manager.
QUIZ
Which of the following does Hadoop run to communicate between daemons
6 and to provide web pages?

a. Cloudera-scm-server and web server

b Roles and Services


.
c. RPC Server and HTTP Server

d. None of the above


QUIZ
Which of the following does Hadoop run to communicate between daemons
6 and to provide web pages?

a. Cloudera-scm-server and web server

b Roles and Services


.
c. RPC Server and HTTP Server

d. None of the above

The correct answer is c.

Explanation: Hadoop runs RPC Server and HTTP Server to communicate between daemons and to
provide web pages.
QUIZ
7 Which of these log files can be rotated?

a. .log files

b Slave logs
.
c. Daemon logs

d. Cloudera agent log


QUIZ
7 Which of these log files can be rotated?

a. .log files

b Slave logs
.
c. Daemon logs

d. Cloudera agent log

The correct answer is a.

Explanation: .log files can be rotated.


QUIZ
Which of the following is true of cluster configuration management with
8 Cloudera?

a. Cluster Setup wizard downloads Hadoop and related parcels and sets up
some services by default

b Cluster Setup wizard assigns roles to hosts based on internal check of node
configuration
.
c. Admin can configure services and assign or change role assignment to hosts

d. All of the above


QUIZ
Which of the following is true of cluster configuration management with
8 Cloudera?

a. Cluster Setup wizard downloads Hadoop and related parcels and sets up
some services by default

b Cluster Setup wizard assigns roles to hosts based on internal check of node
configuration
.
c. Admin can configure services and assign or change role assignment to hosts

d. All of the above

The correct answer is d.

Explanation: All of the above. All the above mentioned statements are true with respect to cluster
configuration management with Cloudera.
QUIZ Choose the default HTTP server ports for these daemons: Namenode,
9 Secondarynamenode, ResourceManager, nodemanager, and DataNodes. (in
order)

a. 50070,50090,8088,8042 & 50075

b 9000,50090,9001,8042 & 50025


.
c. Any available ports can be configured

d. 50070,50075,50090,50091 & 50075


QUIZ Choose the default HTTP server ports for these daemons: Namenode,
9 Secondarynamenode, ResourceManager, nodemanager, and DataNodes. (in
order)

a. 50070,50090,8088,8042 and 50075

b 9000,50090,9001,8042 and 50025


.
c. Any available ports can be configured

d. 50070,50075,50090,50091 & 50075

The correct answer is a.

Explanation: 50070,50090,8088,8042 and 50075 are the default HTTP server ports for the daemons:
NameNode, SecondaryNameNode, ResourceManager, NodeManager, and DataNodes respectively.
QUIZ
What is the resource manager resource tracker’s RPC default port used by (in-
10 cluster) node managers to communicate with the resource manager?

a. 8032

b 8030
.
c. 8031

d. 10020
QUIZ
What is the resource manager resource tracker’s RPC default port used by (in-
10 cluster) node managers to communicate with the resource manager?

a. 8032

b 8030
.
c. 8031

d. 10020

The correct answer is c.

Explanation: 8031 is the resource manager resource tracker’s RPC default port used by (in-cluster)
node managers to communicate with the resource manager.
Key Takeaways

Hadoop cluster configuration is set up through a number of files. Some of the important files are hadoop-env.sh, core-site.xml, hdfs-site.xml, and yarn-site.xml.

Cluster management tools are very effective in managing configurations.

Cloudera Manager and Ambari are two tools that are popular.

Cloudera Manager defines and manages configurations at


different levels, such as Service level, Role group level, and Role
instance level.

If any changes are made to the cluster, it leads to stale


configurations, and this requires a restart of specific roles,
services, or the cluster itself.

Disclaimer: All the logos used in this course belong to the respective organizations
Key Takeaways

Stale configurations can be viewed on the Stale configurations page.

Hadoop daemons run both an RPC server for communication between


daemons and an HTTP server to provide web pages.

The properties for setting a server’s RPC and HTTP addresses


determine the network interface that the server will bind to and are
used by clients or other machines in the cluster to connect to the
server.

Logs are generated by hosts and include role related information


for normal operations, errors, and internal diagnostic information.

Log information can be filtered by specifying the filter criteria.

Disclaimer: All the logos used in this course belong to the respective organizations
This concludes the lesson “Hadoop Configurations and Daemon Logs.”
The next lesson is “Cluster Maintenance and Administration.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 5 - Hadoop Cluster Maintenance and Administration
What You’ll Learn

Explain how to add and remove nodes in an adhoc way

Explain how to add and remove nodes in a systematic way, otherwise


known as commissioning and decommissioning of nodes

Explain how to balance a cluster

List the steps for managing services including adding,


deleting, starting, stopping and checking status of services

Disclaimer: All the logos used in this course belong to the respective organizations
What You’ll Learn

Explain the procedure to enable rack awareness

List the steps to add, remove and move role instances and hosts

Cite the challenges faced with the first version of Hadoop

Explain the features in the second version that help overcome the
challenges faced with the first version

Disclaimer: All the logos used in this course belong to the respective organizations
Lesson 5: Hadoop Cluster Maintenance and Administration
Topic 5.1: Maintaining Clusters
Adding and Removing Nodes: Adhoc Method

There are two ways to add or remove nodes from a cluster: the adhoc way and the systematic way.
Adding and Removing Nodes: Adhoc Method (Contd.)

Adhoc way

Add
Cluster
Adding and Removing Nodes: Adhoc Method (Contd.)

Concerns with the adhoc way:

• Replication consistency
• Availability of data
• Unplanned overhead
• Time-consuming
Adding and Deleting Nodes: Systematic Method

Systematic way
Cluster

Commissioning a
node

Decommissioning
a node
Adding and Deleting Nodes: Systematic Method (Contd.)

If you are using the Apache Hadoop Cluster, to add or delete a node, you will need to perform the
following tasks:

• Edit the configuration files hdfs-site.xml and yarn-site.xml during the cluster setup. Recall that in the first version of Hadoop, it was mapred-site.xml that needed editing.
• Create empty Include and Exclude files.
• Set the properties in the configuration files to point to the Include and Exclude files.
• Start the cluster.
• Edit the Include and Exclude files and specify the nodes to be included or excluded.
• Issue the commands "hdfs dfsadmin -refreshNodes" and "yarn rmadmin -refreshNodes".
• Issue an "hdfs balancer" command to ensure an even distribution of data.
• Update the "slaves" file.
(A configuration sketch follows this list.)
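
The property names below are the ones commonly used for this purpose; the file locations are examples, not fixed paths. A minimal sketch of pointing HDFS and YARN at the Include/Exclude files and refreshing the node lists:

<!-- hdfs-site.xml (excerpt) -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/include</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/exclude</value>
</property>

<!-- yarn-site.xml (excerpt) -->
<property>
  <name>yarn.resourcemanager.nodes.include-path</name>
  <value>/etc/hadoop/conf/include</value>
</property>
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/exclude</value>
</property>

After editing the Include and Exclude files, apply the changes:

hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes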
Demonstration 1: Adding or Removing Machines in an Adhoc Way in Apache Hadoop
Cluster
Demonstration 2: Commission and Decommission a Node in a Cloudera Cluster
Demonstration 3: Decommission and Commission a Node in an Apache Hadoop
Cluster
Balancing a Cluster

Uneven distribution of data across a cluster is typically caused by:

• Commissioning or decommissioning of nodes
• The NameNode not receiving heartbeats from some data nodes
• Sudden and multiple failures of data nodes
Balancing a Cluster (Contd.)

Replication Storm

Cluster

Cascading failures
Balancing a Cluster (Contd.)

Failures can be regular or catastrophic.

• Regular: a regular event depends on the rate of failure of data nodes. Regular events do not cause any major impact.
• Catastrophic: catastrophic events could be due to network issues, rack failures, or massive hardware failures. They trigger the loss of hundreds of nodes within a few minutes.

These cause performance issues as well as failed tasks and applications.


Balancing a Cluster (Contd.)

Consider a real-world example:

• A bad job started creating millions of files within a few minutes.
• This caused the NameNode to become sluggish because it had to run many 'create' transactions and was, therefore, unable to process heartbeats from data nodes.
• The cluster comprised three thousand nodes, and one thousand of them lost heartbeats, resulting in a replication storm.
Balancing a Cluster

The HDFS Balancer:

• Ensures even distribution of data
• Avoids replication storms
• Avoids cascading failures
• Avoids performance issues and failed tasks
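
As a quick illustration, the balancer can be run from a terminal; the threshold value below is an example, not a required setting. A minimal sketch:

# Rebalance the cluster until no DataNode's utilization differs from the
# cluster average by more than 10 percentage points
hdfs balancer -threshold 10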
Demonstration 4: Using Balancer to Balance Data in Hadoop Cluster
Lesson 5: Hadoop Cluster Maintenance and Administration
Topic 5.2: Managing Services
Demonstration 5: Adding a Service to Cloudera Cluster
Demonstration 6: Deleting a Service from Cloudera Cluster
Starting, Stopping, Restarting and Checking Services

Starting and Stopping of services should be done in the correct order because of the dependencies they
may have on other services.

Example:
MapReduce and YARN have a dependency on HDFS. So, you must start HDFS before starting
MapReduce or YARN.

Cloudera Management Service and Hue are the only two services on which no other services depend.
You can start and stop them any time, but there is a recommended order that needs to be followed.
Starting, Stopping, Restarting and Checking Services

To start or stop services with the Cloudera Admin console user interface:

• Select the Cluster tab and then select Services


• On the list of services, select the service and on the drop-down list next to a
service name, select Start, Stop, or Delete as required
Starting, Stopping, Restarting and Checking Services (Contd.)

Recommended order for starting services:
Cloudera Management Service > ZooKeeper > HDFS > Solr > Flume > HBase > Key-Value Store Indexer > MapReduce or YARN > Hive > Impala > Oozie > Sqoop > Hue

Recommended order for stopping services (the reverse):
Hue > Sqoop > Oozie > Impala > Hive > MapReduce or YARN > Key-Value Store Indexer > HBase > Flume > Solr > HDFS > ZooKeeper > Cloudera Management Service
Demonstration 7: Starting or Stopping Services in Cloudera Cluster
Managing Software Packages with Apache Hadoop

Download the tar file of the relevant software package

Untar the file

Install Java Development Kit

Ensure communication across nodes

Update the path of the package in .bashrc

Edit configuration files

Start the daemons on all nodes
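
A minimal sketch of these steps from a Linux terminal; the version number, download mirror, and install directory are examples only:

# Download and untar the package (version and mirror are illustrative)
wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz -C /opt

# Update the path of the package in .bashrc
echo 'export HADOOP_HOME=/opt/hadoop-2.7.3' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc

# After editing the configuration files, start the daemons
start-dfs.sh
start-yarn.sh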


Enabling Rack Awareness

Enabling rack awareness supplies node-to-rack information to the replication algorithm so that block replicas are spread across racks (for example, Rack 1 holding DataNodes 1-4 and Rack 2 holding DataNodes 5-8).

In a Cloudera cluster:
• Select Hosts > DataNodes > Assign Rack
• Restart HDFS

In an Apache Hadoop cluster:
• Create a topology file or script with the node-to-rack information.
• Update hdfs-site.xml with the property topology.script.file.name and include the path to the topology.sh script file.
• Restart HDFS.

(A sketch of the script and property follows.)
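
A minimal sketch of such a topology script and the corresponding property; the rack names, host names, and file paths are examples, and newer Hadoop releases spell the property net.topology.script.file.name:

#!/bin/bash
# topology.sh: print a rack name for each host or IP passed as an argument
# (the host-to-rack mapping below is purely illustrative)
for host in "$@"; do
  case "$host" in
    datanode1|datanode2|datanode3|datanode4) echo "/rack1" ;;
    datanode5|datanode6|datanode7|datanode8) echo "/rack2" ;;
    *) echo "/default-rack" ;;
  esac
done

<!-- hdfs-site.xml (excerpt) -->
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>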
Demonstration 8: Enabling Rack Awareness in Cloudera Cluster
Managing Role Instances

To add and start a role instance in Cloudera Manager:

• Select the service.
• Add the role instance using the Add Role Instances wizard.
• Select the host on which the instance should run (use or skip the recommendations).
• Select the role group for the role instance, or retain the default.
• Start the role instance.
Demonstration 9: Adding Role Instances in Cloudera Cluster
Managing Hosts

ADD HOSTS WIZARD

• You can add one or more hosts with the Add Hosts wizard (Add Hosts option on the Hosts tab).
• Select hosts with no assigned roles.
• The Add Hosts wizard will install the Oracle JDK, CDH, Impala, and the Cloudera Manager Agent packages.
• Once the packages are installed and the Cloudera Manager Agent is started, the Agent connects to the Cloudera Manager Server.
• Verify the health status of the host.


Managing Hosts (Contd.)

DELETE HOSTS

The procedure to decommission or recommission a host is the same as the one followed to decommission or recommission a data node or role instances.

• Select the host to decommission.
• Stop the Agent on the host.
• Delete the host.
Demonstration 10: Adding Hosts to Cloudera Cluster
Lesson 5: Hadoop Cluster Maintenance and Administration
Topic 5.3: Improvements in Hadoop Version 2
Challenges in Hadoop V1
• No horizontal scalability of the NameNode
• No High Availability of the NameNode
• Overburdened JobTracker: a single JobTracker handles both resource management and job scheduling for all the TaskTrackers
• Cannot process graphs


New Features in Hadoop V2

• High Availability of the NameNode
• Federation
• YARN processing framework, which takes over cluster resource management and relieves the overburdened JobTracker of version 1


Federation

In Hadoop v2, HDFS separates the namespace from the block storage.

NameNode
• Namespace
• Block Management: providing DataNode cluster membership, processing block reports and maintaining the location of blocks, supporting block-related operations, managing replica placement, and deleting over-replicated blocks

DataNodes (Block Storage)
• Provide storage by allowing storage of blocks on the local file system
• Handle the read and write requests
Federation (Contd.)

In a federated cluster, multiple NameNodes (NN-1 ... NN-k ... NN-n) each manage their own namespace (NS 1 ... NS k ... NS n), while all DataNodes (DataNode 1 ... DataNode m) provide common storage for all block pools.

• Block Pool: the set of blocks that belong to a single namespace (Pool 1 ... Pool k ... Pool n).
• BlockPoolID: identifies the location for the list of Block IDs.
• Namespace Volume: a namespace together with its block pool (Namespace + Block Storage).
• Cluster ID: identifies all the nodes in the cluster.
Federation (Contd.)

The advantages of Federation:

• Enables horizontal scalability of the NameNode.
• Provisions scalability of the filesystem, thereby delivering higher throughput for read-write operations.
• Isolates the NameNodes and enables multiple applications or users bound to specific namespaces to work simultaneously.
High Availability Using Network File System

With HA, one NameNode is Active and another is Standby; when the Active NameNode fails, a failover promotes the Standby.

High Availability using shared NFS storage:
• All namespace edits are logged to shared storage on NFS; only a single writer is allowed, which is enforced by fencing.
• The Standby NameNode reads the edit logs from the shared storage and applies them to its own namespace.

Split-Brain Scenario
• Both NameNodes end up in the 'active' state at the same time.

Fencing
• Terminates the previously active node's access to the shared storage if that node is still running, so that only one writer remains.
High Availability with Zookeeper

Apache ZooKeeper provides:
• Failure detection
• Active NameNode election

A ZooKeeper Quorum works with a ZooKeeper Failover Controller (ZKFC) on each NameNode, which performs:
• Health monitoring
• ZooKeeper session management
• ZooKeeper-based election
Demonstration 11: Enabling High Availability of NameNode and
ResourceManager in Cloudera Cluster
High Availability using Quorum Journal Manager

• When HA is set up with QJM, both the NameNodes communicate with a group of separate daemons called JournalNodes (JNs).
• The Active NameNode logs any namespace modification it performs to a majority of these JNs.
• The Standby NameNode is capable of reading the modifications from the JNs.

(A configuration sketch follows.)
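
A minimal hdfs-site.xml sketch of the properties involved; the nameservice name, host names, and JournalNode list are examples only, and this is not a complete HA setup:

<!-- hdfs-site.xml (excerpt) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>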
Hadoop V2: Overall Picture

A client interacts with two layers:

• HDFS (distributed data storage): the Active and Standby NameNodes are the masters, kept in sync through a shared edit log (on NFS or JournalNodes); the DataNodes are the slaves.
• YARN (distributed data processing): the ResourceManager, with its Scheduler and Applications Manager (AsM), is the master; the NodeManagers are the slaves, running containers and Application Masters.
Quiz
QUIZ Which of the following is the Utility or Role that takes care of an even
distribution of data across the cluster and plays an important part
1 when commissioning or decommissioning nodes?

a. Balancer

b. Replication algorithm

c. NameNode

d. Cloudera-scm-agents running on each host


QUIZ Which of the following is the Utility or Role that takes care of an even
distribution of data across the cluster and plays an important part
1 when commissioning or decommissioning nodes?

a. Balancer

b. Replication algorithm

c. NameNode

d. Cloudera-scm-agents running on each host

The correct answer is a .


Explanation: Balancer is the Utility or Role that takes care of an even distribution of data
across the cluster and plays an important part when commissioning or
decommissioning nodes.
QUIZ Which of the following is the feature in Hadoop that handles the issue
2 of single namespace and lack of horizontal scalability of NameNode?

a. YARN

b. Federation & High Availability

c. Cluster ID

d. None of the above


QUIZ Which of the following is the feature in Hadoop that handles the issue
2 of single namespace and lack of horizontal scalability of NameNode?

a. YARN

b. Federation & High Availability

c. Cluster ID

d. None of the above

The correct answer is b .


Explanation: Federation is the feature in Hadoop that handles the issue of single
namespace and lack of horizontal scalability of NameNode.
QUIZ Which of the following are ways of setting up HA with automatic
3 failover?

a. By using NFS or QJM

b. By using Cloudera and Hortonworks distribution of Hadoop

c. By having NameNodes in different geographical locations

d. None of the above


QUIZ Which of the following are ways of setting up HA with automatic
3 failover?

a. By using NFS or QJM

b. By using Cloudera and Hortonworks distribution of Hadoop

c. By having NameNodes in different geographical locations

d. None of the above

The correct answer is a .

Explanation: NFS or QJM are ways of setting up HA with automatic failover.


QUIZ Which of the following is a component of Zookeeper that is
4 responsible for monitoring and managing the state of the NameNode?

a. ZK Failover Controller

b. Zookeeper session elector

c. Zookeeper daemon

d. Zookeeper session handler


QUIZ Which of the following is a component of Zookeeper that is
responsible for monitoring and managing the state of the
4 NameNode?

a. ZK Failover Controller

b. Zookeeper session elector

c. Zookeeper daemon

d. Zookeeper session handler

The correct answer is a .


Explanation: ZK Failover Controller is a component of Zookeeper that is responsible for
monitoring and managing the state of the NameNode.
QUIZ Which of the following is a mechanism that avoids the Split-brain
scenario by cutting off the previous Active node’s access to the shared
5 edits storage?

a. Heartbeat mechanism

b. Manual failover of NameNodes

c. Fencing

d. Automatic failover of NameNodes


QUIZ Which of the following is a mechanism that avoids the Split-brain
scenario by cutting off the previous Active node’s access to the shared
5 edits storage?

a. Heartbeat mechanism

b. Manual failover of NameNodes

c. Fencing

d. Automatic failover of NameNodes

The correct answer is c .


Explanation: Fencing is a mechanism that avoids the Split-brain scenario by cutting off
the previous Active node’s access to the shared edits storage.
QUIZ To which of the following UI elements should you navigate to find
hosts that have no roles assigned but are already known to Cloudera
6 manager?

a. Hosts tab

b. Cluster tab

c. Hosts tab and Currently Managed Hosts

d. None of the above


QUIZ To which of the following UI elements should you navigate to find
hosts that have no roles assigned but are already known to Cloudera
6 manager?

a. Hosts tab

b. Cluster tab

c. Hosts tab and Currently Managed Hosts

d. None of the above

The correct answer is c .


Explanation: You should navigate to the Hosts tab and Currently Managed Hosts to find
hosts that have no roles assigned but are already known to Cloudera Manager.
QUIZ Which of the following are Cloudera services on which no other
7 service is dependent?

a. Cloudera Management Service and HUE

b. MapReduce and YARN

c. Oozie and Zookeeper

d. HDFS and YARN


QUIZ Which of the following are Cloudera services on which no other
7 service is dependent?

a. Cloudera Management Service and HUE

b. MapReduce and YARN

c. Oozie and Zookeeper

d. HDFS and YARN

The correct answer is a .

Explanation: Cloudera Management Service and HUE are Cloudera services on which no
other service is dependent.
QUIZ
Where are shared edits written when HA is set up using QJM?
8

a. On each NameNode

b. On the active NameNode

c. On Journal Nodes

d. Wherever shared edits are mounted


QUIZ
Where are shared edits written when HA is set up using QJM?
8

a. On each NameNode

b. On the active NameNode

c. On Journal Nodes

d. Wherever shared edits are mounted

The correct answer is c .


Explanation: Shared edits are written on Journal nodes when HA is setup using QJM.
QUIZ Which of the following is a situation that leads to an abrupt shutdown
9 of one or more DataNodes and inconsistent replication?

a.
The number of DataNodes available is less than the number
required for replication
b. NameNode is not available

c. Communication Failure

d. Amount of disk space in available DataNodes is insufficient


QUIZ Which of the following is a situation that leads to an abrupt shutdown
9 of one or more DataNodes and inconsistent replication?

a.
The number of DataNodes available is less than the number
required for replication.
b. NameNode is not available.

c. Communication Failure

d. Amount of disk space in available DataNodes is insufficient.

The correct answer is a .


Explanation: If the number of DataNodes available is less than the number required for
replication, an abrupt shutdown of one or more DataNodes and inconsistent replication
occurs.
QUIZ What is the disadvantage of enabling HA in a Hadoop cluster without
10 using Zookeeper?

a. Cluster will not be stable.

b. More bandwidth is required.

c. Huge amounts of data cannot be handled.

d. Automatic failover is not possible.


QUIZ What is the disadvantage of enabling HA in a Hadoop cluster without
10 using Zookeeper?

a. Cluster will not be stable.

b. More bandwidth is required.

c. Huge amounts of data cannot be handled.

d. Automatic failover is not possible.

The correct answer is d .


Explanation: The disadvantage of enabling HA in a Hadoop cluster without using
Zookeeper is Automatic failover is not possible.
Key Takeaways

There are two ways you can add or remove a node from a cluster: the adhoc way and the systematic way.

The systematic way is recommended because it avoids overheads and


loss of time.
Adding and Deleting nodes in the systematic way are called
Commissioning and Decommissioning respectively.

Balancing a cluster helps distribute data evenly and avoid issues


such as Replication storms and Cascading failures.

Services can be added to or deleted from a cluster after the


initial setup also. Services can also be started, stopped and
their status can be checked.

Disclaimer: All the logos used in this course belong to the respective organizations
Key Takeaways (Contd.)

Providing the rack information to the replication algorithm can be done


by enabling ‘Rack awareness’. This helps place blocks on appropriate
nodes.
You can add, remove or re-assign role instances after a service is added to
the cluster.
Federation and High Availability are features in Hadoop version 2
that overcome the challenges faced in version 1.

Federation provisions horizontal scalability of the NameNode by


allowing multiple NameNodes with unique Namespace IDs.

High Availability addresses the failover issue of Hadoop


version 1 by allowing two machines to be configured as
NameNodes within the same cluster.
HA can be set up in a cluster using NFS or QJM.

Disclaimer: All the logos used in this course belong to the respective organizations
Key Takeaways (Contd.)

HA can be set up with automatic failover with Zookeeper.

In an HA set up, one NameNode is ‘Active’ and the other is ‘Standby’.


The ‘active’ NameNode handles all client operations in the cluster,
while the ‘Standby’ NameNode acts as a slave and maintains a
ready state to provide a quick failover when necessary.

The Standby node is always synchronized with the


Active NameNode.

Disclaimer: All the logos used in this course belong to the respective organizations
This concludes the lesson “Cluster Maintenance and Administration.”
The next lesson is “Hadoop Computational Frameworks”.

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 06—Hadoop Computational Frameworks
What You’ll Learn

Describe the role of computational frameworks

Explain MapReduce concepts

Explain YARN framework and concepts

Describe MRv2 on YARN

Explain the configuring and understanding of YARN

Describe YARN applications

Describe YARN memory and CPU settings


Lesson 6: Hadoop Computational Frameworks
Topic 6.1: Computation Frameworks
Processing Data in Hadoop

Computation frameworks are execution engines responsible for executing


data processing tasks on Hadoop.
Processing Data in Hadoop

Hive

MapReduce Framework Pig

Spark

Cascading
YARN Framework
Crunch

Tez

Drill
Open Source Tools
Impala

Presto
Processing Data in Hadoop (contd.)

The Hadoop processing stack layers several categories of frameworks on top of the storage managers (HDFS, HBase, Solr):

• General-purpose execution engines: MapReduce, Spark, Tez
• Abstraction engines: Pig, Cascading, Crunch
• SQL engines: Hive, Impala, Presto, Drill, Spark SQL
• Graph processing engines: Giraph, GraphX, Dato (GraphLab)
• Machine learning engines: Mahout, MLlib, Oryx, H2O
• Real-time/streaming frameworks: Spark Streaming, Storm/Trident
Processing Frameworks: Categories

• General Purpose Processing: process data in Hadoop using a low-level API such as MapReduce or Spark.
• Abstraction: process data using a high-level abstraction.
• SQL: query data in Hadoop using SQL.
• Real-Time/Streaming Processing: provide near real-time processing capabilities for data in the Hadoop ecosystem.
• Machine Learning: perform machine learning analysis on Hadoop data.
• Graph Processing: enable graph processing capabilities on Hadoop.
Processing Framework Based on Architecture

Engines Libraries
Some frameworks Some frameworks do
have active not have any active
components, such as component and can
server, client, be considered
services, and so on. libraries
These can be
considered engines.
Selecting Processing Framework

Making appropriate choices depends on the


aspects of the framework:

Use case

Requirement

Available expertise

Experience
Selecting Processing Framework

Relying on a single framework for all data processing is rarely an option. Adopting more than one framework is the common industry choice, and it demands efficient resource management decisions.
Shared Nothing Architecture

All these frameworks implement a shared-nothing architecture, and Hadoop's processing frameworks use distributed HDFS storage. This provides:

• High scalability
• Fault tolerance
• No single point of failure
• Fast recovery
General Purpose Processing Framework

The General Purpose Processing framework is always needed, because other frameworks only solve a specific use case and may not be sufficient to handle all processing needs of an organization.

General Purpose Processing Framework

Examples: MapReduce, Spark, Apache Flink, Tez


Mapreduce and Spark

MapReduce Spark

Unless an Abstraction framework is used, migrating from MapReduce to Spark


necessitates rewriting of jobs in Spark.

The time and code required to write a Spark job is a lot less than for a MapReduce job.

Tez is better suited to building abstraction frameworks rather than building applications.
Abstraction and SQL Frameworks: Key Features

Abstraction and SQL frameworks reduce the time spent on writing jobs directly for general-purpose frameworks.

• As common processing tasks are not implemented using low-level APIs, time is saved.
• The underlying general-purpose framework can be changed when needed.
• Code changes are avoided whenever the framework changes, which might be required in case of direct coding on general-purpose frameworks.
• Reduced overheads for running an equivalent job on general-purpose frameworks.
• Queries run faster through Impala or Presto, as they use a different, specialized execution model.

Abstraction frameworks: Pig, Crunch, Cascading. SQL frameworks: Hive, Impala.
Graph Framework

Graph Frameworks

• Giraph: a library that runs on top of MapReduce.
• GraphX: a library for graph processing on Spark.
• GraphLab: a standalone, special-purpose graph processing framework that can also handle tabular data.
Machine-learning Frameworks

• Mahout: a library on top of MapReduce.
• MLlib: a machine learning library for Spark.
• Oryx: a standalone, specialized machine learning engine.
• H2O: a standalone, specialized machine learning engine.


Real-time /Streaming Frameworks

• Spark Streaming: a library for micro-batch streaming analysis on top of Spark.
• Apache Storm: a special-purpose, distributed, real-time computation engine.

Spark Streaming data flow: input data streams from sources such as Kafka, Flume, HDFS/S3, Kinesis, and Twitter are received by Spark Streaming, divided into batches of input data, processed by the Spark engine, and the batches of processed data are written out to HDFS, databases, and dashboards.
Lesson 6: Hadoop Computational Frameworks
Topic 6.2: MapReduce
What is Mapreduce?

HADOOP 1.0 HADOOP 2.0

MapReduce Others
(data processing) (data processing)

MapReduce YARN
(cluster resource management & (cluster resource management)
data processing)

HDFS HDFS
(redundant, reliable storage) (redundant, reliable storage)
Understanding Mapreduce Programming Model

MapReduce is the programming paradigm that allows the


processing and generation of large datasets.

MapReduce is Hadoop’s computation framework for processing


parallelizable problems across large datasets.

Data elements in MapReduce are immutable: the input is never modified in place. Instead, any transformation produces new, temporary intermediate output that is passed on for further execution.
Mapping Phase

The Mapping phase is taken care of by a Mapper(). This is the


first phase of the MapReduce program; here, a list of data Input list
elements is provided to a function called a Mapper. A
Mapper transforms the input data elements into a list of Mapping
output data elements. function

Output list
Reducing Phase

The Reducing phase is taken care of by a Reducer(). This is the last phase of the MapReduce program; values are aggregated in this phase. The Mapping phase provides an iterator of input values. The Reducer function receives and combines these values to return a single output value.
Mapreduce Flow
In HDFS, the input file is divided into splits (Split 1 ... Split 5). Each split is processed by a map task, which runs map(), partition(), and optionally combine(). The sorted map outputs are grouped into regions (Region 1, Region 2, Region 3), each of which is processed by a reduce task that runs sort() and reduce() and writes its results to the output file in HDFS. The job tracker coordinates the tasks.
Keys and Values

Mapping Driver Reducing

Several instances of Mapper are Component of MapReduce that Several instances of reducer
created on multiple nodes in a initializes the job, instructs method are also instantiated
cluster. Each instance receives a Hadoop platform to execute on different nodes that work
different input file. the job on input, and controls on the data generated by
where the output is placed. mapper and reduce it/sum it to
generate a final output value.
Mapreduce Data Flow

Across the nodes of the cluster (Node 1, Node 2, Node 3):

• Input data is pre-loaded locally on each node.
• The mapping process runs on each node over its local data.
• Intermediate data is produced by the mappers.
• Values are exchanged between nodes by the shuffle process.
• The reducing process runs on each node and generates the outputs.
• Outputs are stored locally.
Intermediate Phases

The five phases of MapReduce are: Mapper, Partition & Shuffle, Sort, Combiner, and Reducer.
Lesson 6: Hadoop Computational Frameworks
Topic 6.3: YARN
What is Yarn?

YARN is a resource management layer for Apache Hadoop ecosystem.

Capable of managing and monitoring workloads

Implements security control

Manages high availability features of Hadoop

YARN is like an operating system


on a server.
Types of Hosts

A YARN cluster contains two types of hosts:

• Master: runs the ResourceManager, the master daemon that communicates with the client, tracks resources (CPU, RAM) on the cluster, and coordinates work by assigning tasks to NodeManagers.
• Workers: each worker (Worker 1 ... Worker N) runs a NodeManager, the worker daemon that launches and tracks processes spawned on that worker host.
Types of Resources

YARN defines two types of resources:

• Vcores: the virtual portion of the CPU cores of a particular machine.
• Memory: the Random Access Memory of the worker nodes.


Types of Resources (contd.)

For example, in a cluster with 100 workers where each NodeManager advertises 64 vcores and 128 GB of RAM, the ResourceManager sees a total of 6,400 vcores and 12,800 GB of RAM.

• The NodeManager tracks its own resources and advertises its resource configuration to the ResourceManager.
• The ResourceManager keeps a record of the cluster's available resources and knows how to allocate resources when requested.
Basics of Yarn Framework

A client submits an application process to the ResourceManager on the master. The work itself runs inside containers on the workers: a container is a request for a slice of a worker's resources (for example, 1 vcore and 8 GB of memory) in which your code executes as a process.
Execution Flow

The client submits an application process to the ResourceManager, which records the request (for example, 1 vcore and 8 GB of memory) against the cluster's available resources.

Execution Flow (contd.)

The ResourceManager asks a NodeManager on a worker to launch the first container, which runs the Application Master for the job. The Application Master then negotiates further containers from the ResourceManager (for example, 1 vcore and 4 GB each) and works with the NodeManagers to launch them.

Execution Flow (contd.)

The tasks of the application, such as map tasks and reduce tasks, run inside these containers across the worker NodeManagers, and each NodeManager tracks the vcores and RAM consumed by its running containers.
Demo: Running Sample Mapreduce Jobs and Looking at Output

Running a MapReduce job in a YARN cluster from the terminal and using the web UI to look at the output.
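
A minimal sketch of what such a run looks like from a terminal; the example jar path and the input/output directories are illustrative and vary by distribution:

# Put some input into HDFS
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put /etc/hadoop/conf/*.xml /user/hadoop/input

# Run the bundled wordcount example on YARN
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# Check the application status and look at the job output
yarn application -list -appStates ALL
hdfs dfs -cat /user/hadoop/output/part-r-00000 | head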
Yarn Configuration Basics

YARN Allocation

Ideal Realistic

A YARN cluster can be configured to use up All the resources cannot be allocated to YARN due to:
all the resources on the cluster. • Overheads of non-Hadoop related services
running on nodes
• Operating system and utilities, custom programs,
and so on
• Other Hadoop related components that might
need dedicated resources and cannot share
resources
• Distribution specific services in case of CDH cluster
and Hadoop-specific roles
• Resources for Hbase slave daemons regionservers
Cluster Metrics on Yarn Allocation
Let's look at a snapshot of the ResourceManager web UI and understand the cluster metrics on YARN allocation:

• There are 50 worker nodes.
• yarn.nodemanager.resource.memory-mb is 90000 (90 GB per node).
• yarn.nodemanager.resource.vcores is 60.

Total memory available to YARN: 50 x 90 GB = 4,500 GB = 4.5 TB. Total vcores: 50 x 60 = 3,000 cores.
Configurations Considerations

Configuration category | Minimum | Maximum

YARN container vcore sizing | yarn.scheduler.minimum-allocation-vcores | yarn.scheduler.maximum-allocation-vcores
YARN container memory sizing | yarn.scheduler.minimum-allocation-mb | yarn.scheduler.maximum-allocation-mb
YARN container allocation size increments | yarn.scheduler.increment-allocation-mb (memory) | yarn.scheduler.increment-allocation-vcores (vcores)

Configurations Considerations

Values | Memory | Vcores | Container allocation size increments

Minimum | yarn.scheduler.minimum-allocation-mb = 0 | yarn.scheduler.minimum-allocation-vcores = 0 | yarn.scheduler.increment-allocation-vcores = 1
Maximum | >= minimum value | >= minimum value |
Any other sizing properties | should be <= yarn.nodemanager.resource.memory-mb | should be <= yarn.nodemanager.resource.vcores |
Mapreduce Configuration

• Map task memory property: mapreduce.map.memory.mb
• Reduce task memory property: mapreduce.reduce.memory.mb
Application Memory Configuration

• yarn.app.mapreduce.am.resource.mb is used to set the memory size for the ApplicationMaster.
• Since the ApplicationMaster itself runs in a container, this property should be less than the container maximum.

(A combined configuration sketch follows.)
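
Pulling these properties together, a minimal sketch of the relevant settings; all values are illustrative and must be tuned to the node hardware:

<!-- yarn-site.xml (excerpt) -->
<property><name>yarn.nodemanager.resource.memory-mb</name><value>90000</value></property>
<property><name>yarn.nodemanager.resource.cpu-vcores</name><value>60</value></property>
<property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
<property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>

<!-- mapred-site.xml (excerpt) -->
<property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>2048</value></property>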
Looking at Configuration Files and Properties for Yarn
Allocation and Management
Quiz
QUIZ
Frameworks can be categorized based on architecture and whether they have active
components. Select two such frameworks.
1

a. Engines : Hive; Libraries : Hbase

b Engines : MapReduce; Libraries : MLIB


.
c. Engines : Hive; Libraries : Spark MLIB

d. MapReduce and YARN.


QUIZ
Frameworks can be categorized based on architecture and whether they have active
components. Select two such frameworks.
1

a. Engines : Hive; Libraries : Hbase

b Engines : MapReduce; Libraries : MLIB


.
c. Engines : Hive; Libraries : Spark MLIB

d. MapReduce and YARN.

The correct answer is c.


Engines : Hive; Libraries : Spark MLIB are the two frameworks that are categorized based on architecture
and presence of active components.
QUIZ
What kind of frameworks can be used for querying data in Hadoop using querying languages
and exist on top of a general-purpose framework?
2

a. Abstraction frameworks

b Graph processing frameworks


.
c. SQL frameworks

d. Real-time/Streaming frameworks
QUIZ
What kind of frameworks can be used for querying data in Hadoop using querying languages
and exist on top of a general-purpose framework?
2

a. Abstraction frameworks

b Graph processing frameworks


.
c. SQL frameworks

d. Real-time/Streaming frameworks

The correct answer is c.


SQL frameworks are used for querying data in Hadoop using querying languages and exist on top of a
general-purpose framework.
QUIZ
In MapReduce, InputFormat defines how input files are split and read. What is the default
InputFormat provided with Hadoop?
3

a. KeyValueInputFormat

b SequenceFileInputFormat
.
c. TextInputFormat

d. FileInputFormat
QUIZ
In MapReduce, InputFormat defines how input files are split and read. What is the default
InputFormat provided with Hadoop?
3

a. KeyValueInputFormat

b SequenceFileInputFormat
.
c. TextInputFormat

d. FileInputFormat

The correct answer is c.


The TextInputFormat is the default InputFormat provided with Hadoop.
QUIZ
What is the parameter to control the split size for InputSplit to be processed by Map task?
4

a. mapred.min.split.size

b mapred.min.inputsplit.size
.
c. dfs.blocksize

d. mapreduce.tasktracker.map.tasks.maximum
QUIZ
What is the parameter to control the split size for InputSplit to be processed by Map task?
4

a. mapred.min.split.size

b mapred.min.inputsplit.size
.
c. dfs.blocksize

d. mapreduce.tasktracker.map.tasks.maximum

The correct answer is a.


mapred.min.split.size is the parameter to control the split size for InputSplit to be processed by Map task.
QUIZ
Speculative execution is enabled by default. How can speculative execution be disabled for
mappers and reducers?
5

a. It cannot be disabled; it is enabled by default for better performance.

b It can be disabled by setting tasks related to map and reduce properties to


false.
.
c. It can be disabled for mapper but not for reducer.

d. Set mapred.map(/reduce).tasks.speculative.execution=0
QUIZ
Speculative execution is enabled by default. How can speculative execution be disabled for
mappers and reducers?
5

a. It cannot be disabled; it is enabled by default for better performance.

b It can be disabled by setting tasks related to map and reduce properties to


false.
.
c. It can be disabled for mapper but not for reducer.

d. Set mapred.map(/reduce).tasks.speculative.execution=0

The correct answer is b.


It can be disabled by setting tasks related to map and reduce properties to false.
QUIZ
What component of YARN takes care of negotiating resources with ResourceManager and
works with NodeManagers ?
6

a. ApplicationsManager

b ApplicationsMaster
.
c. Container

d. NodesListManager
QUIZ
What component of YARN takes care of negotiating resources with ResourceManager and
works with NodeManagers ?
6

a. ApplicationsManager

b ApplicationsMaster
.
c. Container

d. NodesListManager

The correct answer is b


ApplicationsMaster component of YARN takes care of negotiating resources with ResourceManager and
works with NodeManagers.
QUIZ
What is responsible for allocating resources to various running applications and performing
scheduling function based on resource requirement?
7

a. ApplicationMasterLauncher

b YarnScheduler
.
c. ApplicationsManager

d. ResourceManager
QUIZ
Who is responsible for allocating resources to various running applications and performing
scheduling function based on resource requirement?
7

a. ApplicationMasterLauncher

b YarnScheduler
.
c. ApplicationsManager

d. ResourceManager

The correct answer is b.


YarnScheduler is responsible for allocating resources to various running applications and performing
scheduling function based on resource requirement.
QUIZ
NodesListManager manages and seeds list of nodes mentioned in configuration files under
certain properties. Choose the right properties.
8

a. yarn.resourcemanager.nodes.include-path/exclude-path

b dfs.hosts.include and dfs.host.exclude


.
c. Masters and slaves file

d. yarn-site.xml
QUIZ
NodesListManager manages and seeds list of nodes mentioned in configuration files under
certain properties. Choose the right properties.
8

a. yarn.resourcemanager.nodes.include-path/exclude-path

b dfs.hosts.include and dfs.host.exclude


.
c. Masters and slaves file

d. yarn-site.xml

The correct answer is a.


yarn.resourcemanager.nodes.include-path/exclude-path is the right property.
QUIZ
When allocating resources for YARN, what is the important consideration for optimum
allocation of resources?
9

a. Make sure to allocate maximum RAM, CPU, and storage for YARN.

b There should be overheads for OS and its utilities and other Hadoop and non-
Hadoop components
.
c. ResourceManager should be on a node with good RAM, CPU, and storage

d. Cloudera Manager should be used to allocate appropriate amount of


resources
QUIZ
When allocating resources for YARN, what is the important consideration for optimum
allocation of resources?
9

a. Make sure to allocate maximum RAM, CPU, and storage for YARN.

b There should be overheads for OS and its utilities and other Hadoop and non-
Hadoop components
.
c. ResourceManager should be on a node with good RAM, CPU, and storage

d. Cloudera Manager should be used to allocate appropriate amount of


resources

The correct answer is b.


There should be overheads for OS and its utilities, other hadoop and non Hadoop components is the
important consideration for optimum allocation of resources.
QUIZ
The values of the properties for Map and Reduce task memory should be less than the
container max size as these tasks run in containers. What are the properties?
10

a. Mapreduce.map.memory.mb & mapreduce.reduce.memory.mb

b Mapreduce.map.memory.gb & mapreduce.reduce.memory.gb


.
c. Mapreduce.map.memory.tb & mapreduce.reduce.memory.tb

d. Mapreduce.map.memory.max & mapreduce.reduce.memory.min


QUIZ
The values of the properties for Map and Reduce task memory should be less than the
container max size as these tasks run in containers. What are the properties?
10

a. Mapreduce.map.memory.mb & mapreduce.reduce.memory.mb

b Mapreduce.map.memory.gb & mapreduce.reduce.memory.gb


.
c. Mapreduce.map.memory.tb & mapreduce.reduce.memory.tb

d. Mapreduce.map.memory.max & mapreduce.reduce.memory.min

The correct answer is a.


Mapreduce.map.memory.mb & mapreduce.reduce.memory.mb are the properties.
Key Takeaways

There are five intermediate phases in MapReduce. They are Mapper, Partition and Shuffle,
Sort, Combiner, and Reducer.

YARN stands for Yet Another Resource Negotiator. It is like an operating system on a server.

YARN cluster contains two types of hosts - Master:ResourceManager and Worker:NodeManager.

YARN defines two resources. They are Vcores and Memory.


This concludes the lesson “Hadoop Computational Frameworks.”
The next lesson is “Scheduling: Managing Resources.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 07—Scheduling: Managing Resources
Describe scheduling concepts

Identify Schedulers

Explain the ways to manage resources using Schedulers

Describe FIFO, Fair Scheduler, and Capacity Scheduler

Explain how to configure Schedulers

Explain queue management


Lesson 7: Scheduling: Managing Resources
Topic 7.1: YARN: Cluster Scheduling
Schedulers

A standalone system has


multiple CPU cores, and each
core runs a single process.

Hundreds of processes run in


a system simultaneously.

Standalone
system

A scheduler is part of a node’s operating system that assigns a process to a


CPU core to run for a short duration.
Role of Cluster Scheduler

A cluster scheduler supports multi-tenancy and scalability.
Schedulers in Hadoop 2.0 and Yarn

Hadoop 2.0 YARN

In Hadoop 2.0, the In YARN, Schedulers


Scheduler is a pluggable neither monitor nor track
piece of code that resides resources. They don’t
in the ResourceManager restart failed tasks either.
(JobTracker in MR v1).
Available Scheduler Implementation

YARN ships with three scheduler implementations:

i. FIFO Scheduler: jobs run in the order of submission; a job submitted later waits in the FIFO queue until the earlier job finishes.
ii. Capacity Scheduler: separate queues (for example, queue A and queue B) each get a configured share of the cluster, so a job submitted to one queue can start while another job runs in a different queue.
iii. Fair Scheduler: all running jobs in a pool/queue dynamically receive a fair share of the cluster; when a second job is submitted, resources are rebalanced between the jobs.
FIFO Scheduler

• A FIFO Scheduler places applications in a queue and runs them in the order of submission.
• Resource requests for applications are satisfied in the order in which they run.
• A FIFO Scheduler is not preferred for shared clusters.
• A FIFO Scheduler does not support queues or require any specific configuration.
Capacity Scheduler

• The Capacity Scheduler was designed to allow significantly higher cluster utilization.
• Generally, queues are used to prevent applications from consuming more resources than they should.
Fair Scheduler

• A Fair Scheduler is a resource-aware scheduler.
• It allows cluster owners or administrators to allocate logical pools or queues.
• A Fair Scheduler dynamically balances resources amongst all running jobs.
• Fair sharing of resources allows multiple jobs to run at the same time.
Lesson 7: Scheduling: Managing Resources

Topic 7.2: Fair Scheduler


What is Fair Scheduler?

A Fair scheduler attempts to allocate resources to ensure that all


running applications get an equal share in resources.

Enable Fair Scheduler by updating the configuration property in yarn-site.xml.

Property: yarn.resourcemanager.scheduler.class

Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
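
Expressed as XML, the setting above looks like this minimal yarn-site.xml excerpt:

<!-- yarn-site.xml (excerpt): switch the ResourceManager to the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>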
Fair Scheduler: Scenario

When job 1 (in queue A) is the only job running, it can use the whole cluster. When job 2 is submitted to queue B, each queue settles at its fair share of the cluster. When job 3 is later submitted to queue B, it shares queue B's fair share with job 2, while queue A's share is unaffected.
Queue Configuration

<?xml version="1.0"?>
<allocations>
<defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>

<queue name="prod">
<weight>40</weight>
<schedulingPolicy>fifo</schedulingPolicy>
</queue>

<queue name="dev">
<weight>60</weight>
<queue name="eng" />
<queue name="science" />
</queue>

<queuePlacementPolicy>
<rule name="specified" create="false" />
<rule name="primaryGroup" create="false" />
<rule name="default" queue="dev.eng" />
</queuePlacementPolicy>
</allocations>
Fair-Scheduler.xml

The same fair-scheduler.xml (shown on the previous slide) defines a queue hierarchy: the root queue contains the prod and dev queues, and dev in turn contains the eng and science child queues. The default queue scheduling policy is fair; the prod queue uses FIFO scheduling with a weight of 40, and the dev queue has a weight of 60.
Demo: Setting up Fair-Scheduler.xml

In this demonstration, you will see how a fair-scheduler.xml is set up to implement the Fair Scheduler and understand its properties.

Demo: Sample Run of Jobs Using the Fair Scheduler

In this demonstration, you will see how to perform a sample run of jobs using the Fair Scheduler.
Queues: Scheduling Policies

The policy for a particular queue can be overridden using the schedulingPolicy element for that queue.

Queues can also be configured with minimum and maximum resources and a maximum number of running applications, as sketched below.
Lesson 7: Scheduling: Managing Resources

Topic 7.3: Capacity Scheduler


What is Capacity Scheduler?
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>prod,dev</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.queues</name>
<value>eng,science</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.capacity</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.capacity</name>
<value>60</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
<value>75</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.eng.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.science.capacity</name>
<value>50</value>
</property>
</configuration>
Demo: Setting up Capacity-Scheduler.xml

In this demonstration, you will see how to set up capacity-scheduler.xml for the Capacity Scheduler and understand its properties.
Queue Elasticity

A single job does not use more resources than the capacity of its queue. However, if there is more than one job in the queue and idle resources are available, then the Capacity Scheduler may allocate the spare resources to jobs in the queue, even if that exceeds the queue's capacity. This behavior is known as queue elasticity, and it can be bounded as shown below.
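Elasticity can be capped with a maximum capacity, as the dev queue already does in the capacity-scheduler.xml shown earlier (maximum-capacity of 75). A similar, purely illustrative cap for the prod queue would look like this; the value 50 is an assumption for the example:

<property>
  <!-- prod may elastically grow only up to 50% of the cluster -->
  <name>yarn.scheduler.capacity.root.prod.maximum-capacity</name>
  <value>50</value>
</property>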
Queue Hierarchy

(Diagram: the queue hierarchy defined by the capacity-scheduler.xml above. The root queue is split between Prod, for production, and Dev, for development, with capacities of 40% and 60% respectively; Dev is further divided equally between Eng, for engineering, and Science, for science.)
Demo: Sample Run of Jobs Using the Capacity Scheduler

In this demonstration, you will see how to perform a sample run of jobs using the Capacity Scheduler.
Delay Scheduling

In a busy cluster, an application requesting a particular node may conflict with the other containers that are already running on that node.

This can be resolved by immediately loosening the locality requirement and allocating a container on the same rack instead.

However, waiting for a few seconds can dramatically increase the chances of a container being allocated on the requested node, thereby increasing the efficiency of the cluster.

This feature is called delay scheduling, and it is supported by both the Capacity Scheduler and the Fair Scheduler.

(Diagram: a ResourceManager and NodeManagers with running containers; resources for new containers become available as scheduling opportunities occur.)
Delay Scheduling

Delay scheduling is configured by setting yarn.scheduler.capacity.node-locality-delay for the Capacity Scheduler, and yarn.scheduler.fair.locality.threshold.node and yarn.scheduler.fair.locality.threshold.rack for the Fair Scheduler.

The Fair Scheduler uses the number of scheduling opportunities to determine the delay.

(Diagram: the ResourceManager counts the scheduling opportunities offered by the NodeManagers.)
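A minimal sketch of these settings. The property names are the ones listed above; the values are illustrative assumptions (the Capacity Scheduler property counts missed scheduling opportunities, while the Fair Scheduler thresholds are fractions of the cluster size):

<!-- capacity-scheduler.xml -->
<property>
  <!-- pass up this many scheduling opportunities before relaxing to rack locality -->
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>40</value>
</property>

<!-- yarn-site.xml, for the Fair Scheduler -->
<property>
  <!-- wait until half the cluster has offered opportunities before giving up node locality -->
  <name>yarn.scheduler.fair.locality.threshold.node</name>
  <value>0.5</value>
</property>
<property>
  <name>yarn.scheduler.fair.locality.threshold.rack</name>
  <value>0.5</value>
</property>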
Dominant Resource Fairness

The concept of capacity or fairness is easy to determine when only a single resource, such as memory, is being scheduled. It becomes harder when applications need different mixes of resources; for example, App 1 demands more CPU and less memory, while App 2 demands more memory and less CPU.

Schedulers in YARN use the DRF approach to address this problem: they look at each user's dominant resource use or requirement and use it to measure cluster use.
Dominant Resource Fairness

Consider a cluster with 100 CPUs and 10 TB of memory. Application A requests containers of 2 CPUs with 300 GB of memory each, and Application B requests containers of 6 CPUs with 100 GB of memory each. A's request is 2% of the CPUs and about 3% of the memory, so its dominant resource is memory; B's request is 6% of the CPUs and 1% of the memory, so its dominant resource is CPU. DRF compares the dominant shares (3% versus 6%) when balancing the two applications.
DRF: Capacity and Fair Scheduler

The Capacity Scheduler can be configured to use DRF by setting yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator in capacity-scheduler.xml.
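A minimal sketch of that setting in capacity-scheduler.xml (both the property name and the calculator class are the ones quoted above):

<property>
  <!-- replace the default memory-only calculator with Dominant Resource Fairness -->
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>

The Fair Scheduler can be switched to DRF in a similar way, by setting the scheduling policy to drf in its allocation file.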
Quiz

QUIZ 1
What are the four scheduler implementations to manage resources available in Hadoop?

a. Oozie, Azkaban, FIFO, and Luigi
b. FIFO, Fair, Capacity, and DRF
c. Hive, HBase, Spark, and Impala
d. Validity

The correct answer is b.
The four scheduler implementations to manage resources available in Hadoop are FIFO, Fair, Capacity, and DRF.
QUIZ 2
In a multi-resource environment, determining fairness, comparing two applications, and finding the optimum resource allocation are difficult. What approach can be followed to solve this problem?

a. Configure the Capacity Scheduler and assign nominal resources
b. Configure the Fair Scheduler with appropriate allocations
c. Configure the Capacity or Fair Scheduler to use DRF (Dominant Resource Fairness)
d. Let the cluster use the FIFO Scheduler and the cluster will manage automatically

The correct answer is c.
The approach that can be followed in such a situation is to configure the Capacity or Fair Scheduler to use DRF (Dominant Resource Fairness).
QUIZ 3
In YARN, vcores (virtual cores) are used to normalize CPU resources across the cluster. Choose the property used to set the number of CPU cores that can be allocated for containers.

a. yarn.nodemanager.resource.cpu-vcores
b. yarn.scheduler.capacity.resource-calculator
c. yarn.scheduler.minimum-allocation-vcores
d. yarn.scheduler.maximum-allocation-vcores

The correct answer is a.
yarn.nodemanager.resource.cpu-vcores is the property used to set the number of CPU cores that can be allocated for containers.
QUIZ 4
The Capacity Scheduler reads the settings in capacity-scheduler.xml when it starts, or when the admin modifies the settings and reloads them. What is the command used to reload the settings?

a. yarn rmadmin -refreshNodes
b. yarn rmadmin -refreshQueues
c. yarn mradmin -refresh
d. Settings can be modified and reloaded using any command

The correct answer is b.
The command yarn rmadmin -refreshQueues is used to reload the settings; it refreshes the queues.
QUIZ 5
Which feature should be enabled to ensure that under-served queues can begin to claim their allocated resources without having to wait for applications in other queues to finish?

a. Enable Dominant Resource Fairness
b. Enable delay scheduling
c. Enable preemption
d. If other queues are utilizing all resources, this is not possible

The correct answer is c.
Preemption should be enabled to ensure that under-served queues can begin to claim their allocated resources without having to wait for applications in other queues to finish.
QUIZ 6
Two queues have the same resources available. One uses the FIFO ordering policy and the other uses the Fair Sharing policy. A user submits three jobs to each queue one after another, waiting just long enough for each job to start. The first job uses 6x the resource limit in the queue, the second 4x, and the last 2x. What is the order of completion of the jobs under FIFO and FAIR scheduling?

a. FIFO: 6x, 4x, 2x; FAIR: 2x, 4x, 6x
b. FAIR: 6x, 4x, 2x; FIFO: 6x, 4x, 2x
c. FAIR: 2x, 4x, 6x; FIFO: 2x, 4x, 6x
d. FIFO: 6x, 4x, 2x; FAIR: all jobs complete at the same time

The correct answer is a.
Under FIFO the jobs complete in the order 6x, 4x, 2x, and under FAIR they complete in the order 2x, 4x, 6x.
QUIZ 7
Which scheduler allows higher cluster utilization while providing predictability of workloads and shares resources in a predictable manner?

a. Fair Scheduler
b. FIFO Scheduler
c. Delay Scheduler
d. Capacity Scheduler

The correct answer is d.
The Capacity Scheduler allows higher cluster utilization while providing predictability of workloads, and it shares resources in a predictable manner.
QUIZ 8
Select the states of queues in YARN.

a. Started, on hold, stopped, terminated
b. Running or stopped
c. Initiated, started, on hold, stopped, terminated
d. Running or closed

The correct answer is b.
Running and stopped are the two states of queues in YARN.
QUIZ 9
Select the reasons why administrators decide to stop and drain applications in a queue.

a. Decommissioning of queues
b. Migrating users to different queues, decommissioning of queues
c. When no new applications are submitted
d. As per routine activities

The correct answer is b.
The two reasons why administrators decide to stop and drain applications in a queue are migrating users to different queues and decommissioning of queues.
QUIZ 10
What is the demerit of the FIFO scheduler?

a. It makes users wait in a single queue, based on the order of job submission
b. It is the default and cannot be used for a large number of jobs
c. It makes the system and processing very slow
d. It does not support queues

The correct answer is a.
The demerit of the FIFO scheduler is that it makes users wait in a single queue, based on the order of job submission.
Scheduling refers to the allocation of available resources in the
cluster.
A scheduler is a part of a node’s operating system that assigns
a process to a CPU core to run for a short duration.

The scheduler implementations available are FIFO (First In First Out), Fair Schedulers (resource-aware scheduling), and Capacity Schedulers.

The practice of waiting briefly for a container to be allocated on the requested node, rather than immediately relaxing the locality requirement, is called delay scheduling.

Dominant Resource Fairness (DRF) is the approach by which schedulers in YARN compare applications run by different users: they look at each user's dominant resource use or requirement to measure cluster use.
This concludes the lesson "Scheduling: Managing Resources."
The next lesson is "Hadoop Cluster Planning."

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 8 - Hadoop Cluster Planning
Planning a Hadoop cluster

General Planning considerations

Workload and cluster sizing

Hadoop Cluster Setup Options: Physical, Virtualization,


Cloud or Hybrid

Making Choices: Hardware, Software & Network

Making Choices: Master/Slave considerations

News from the world: Existing Setups


Lesson 8: Hadoop Cluster Planning
Topic 8.1: Planning a Cluster
Plan Your Cluster

Let us identify the key points involved while planning a cluster.

Knowing your requirement

Knowing eventual profile of workload

Choosing right setup option

Understanding the variables

Understanding deployment

Learning from other setups


Planning: Knowing the Requirements

Need:
What do I need a Hadoop cluster for?

Data Type:
What kind of data will it handle?

Volume:
What is the volume of data?

Speed:
How quickly is the data growing?

Processing:
How frequently do we need to process it?
Planning: Knowing the Requirements

Cost:
Are there any budget constraints?

Workload:
What kind of workload will my cluster manage?

Time:
How much time do I have?

Resource consumers/Applications:
How many applications do I have?
Cluster sizing tool/manual:
Do I use a cluster-sizing tool to help me estimate or do I use my
experience?
Size:
What would be the appropriate size of the cluster?
Questions- Responses- Reality Check

Need: Is it for “delivering new business value” or is it for “delivering data center efficiency”?

Data Type: A cluster might be used to handle a variety of data that may be generated from various
sources.

Volume: Is the data massive with the magnitude of petabytes and exabytes, or is it up to terabytes?
How long do we need to retain the data?

Speed: Is the growth expected or unexpected?

Processing: The frequency of data processing would directly be proportional to the choice of
resources in the cluster.
Questions- Responses- Reality Check

Cost: Is the cluster bound by costs and does it impose budget constraints?

Workload: Will it be network intensive, IO Intensive, CPU intensive, or will it be a balanced workload?
Are we looking at an evolving workload?

Time: If time is a constraint, usage of cluster-management solutions will be preferred and, if not,
open-source core edition could be used.

Resource consumers/applications: How many applications would be supported by the cluster? This
may affect the cluster size and count.
Questions- Responses- Reality Check

Cluster sizing tool/manual: For planning and estimation, will we use cluster-sizing tool/calculator or
rely on experience and expertise?

Size: Based on the above requirements, the size of the cluster can be decided.
Lesson 8: Hadoop Cluster Planning
Topic 8.2: Workload Patterns
Sample Cluster Sizing Tool

(Screenshot: a cluster-sizing tool offered by one of the vendors.)

Sample Output of Cluster Sizing/Estimation Tool

Based on the options selected in the tool, as shown in the previous screen, the tool estimates the hardware and infrastructure and shows some recommendations.
Workload Patterns

Hadoop clusters are used for massive data storage and processing.

Choices of hardware, software, and network in the cluster are


dependent on workload patterns handled by the cluster.
Workload Patterns

Let's understand the workload patterns and their impact on a cluster. The workload pattern impacts the choices of:

• The right Hadoop distribution with the required features
• The appropriate version and services
• The underlying OS with the appropriate system services
Workload Patterns

• Disk space consumed and choices of storage devices.


• I/O bandwidth required and choices of network settings.
• Computation power required for processing.
Understand Your Workload

Architects usually ask a few questions while trying to understand the workload patterns:

• Are the data access patterns uniform or skewed?


• What are the typical data set sizes?
• What is the frequency of accessing and re-accessing data?
• How does load vary over time?
• Are the cycles regular or predictable?
• What is the frequency and the size of load bursts?
• What are the compute patterns?
• What is the ratio between compute and the movement of data?
• What are the common processing types?
• What is used - high level query language or Java?
Typical Workload Patterns

Selecting the appropriate hardware that provides the best balance of performance and economy for a workload pattern is a critical decision to make when planning a Hadoop cluster.

Workload pattern types:

• Compute-intensive workload: CPU bound; demands a large number of CPUs and vast memory.
• I/O-intensive workload: I/O bound; demands more investment in disks and storage devices.
• Network-intensive workload: network bound; demands appropriate network devices and settings to support intense network traffic.
• Balanced workload: distributed workloads involving various job types.
• Unknown or evolving workloads: may demand CPU, I/O, or network resources depending on the ongoing processing.
Typical Workload Patterns

(Chart: machine configuration as per workload. Computation power per node ranges from low to high and the number of disks from fewer to more: CPU-optimized nodes have high computation power and fewer disks, storage-optimized nodes have more disks and low power consumption, and balanced nodes sit in between.)
Lesson 8: Hadoop Cluster Planning
Topic 8.3: Cluster Setup Options

Cluster Setup Options

The choices of infrastructure to deploy a Hadoop cluster include:

• Physical machines on premises
• Cloud, such as AWS
• Virtualization software
• Hybrid environment
Cluster Setup Options

Hybrid deployment offers:

• Backup options: helps achieve business continuity through replication between on-premises and cloud-based storage.
• Development: keeps the development and POC environment separate from the production environment.
• Data exploration: offers computation and storage capabilities for BI or ML clusters.

(Diagram: a production cluster alongside a BI or ML cluster, a development and POC cluster, and a backup and archive cluster.)
Cluster Setup Options

Choices of infrastructure to deploy a Hadoop cluster include:


• On-premises machines
• Cloud like Azure, AWS, and so on
• Virtualization software
• Hybrid environment
Making Choices: Hardware Considerations

Hardware choices involve selecting appropriate hardware and considering factors such as:

• NICs
• Power supply
• Cooling
• RAID or JBOD disk configurations
• RAM and RACK for both master and slave nodes

These choices are driven by cluster size, business requirements, and process criticality.


Making Choices: Hardware Considerations

Hadoop hardware comes in two different classes: master nodes and slave (worker) nodes.

To avoid a heterogeneous platform and reduce the proliferation of hardware profiles, architects select a single profile for master nodes and a single profile for slave nodes.

Master nodes: master nodes have a critical function, and they justify the higher cost of machines with high-end features.

(Diagram: master processes and worker processes distributed across master and worker machines.)
Making Choices: Hardware Considerations

Here is a sample hardware baseline profile:


Small clusters of up to 20 nodes have:
• Dual quad-core 2.6 GHz CPU,
• 24 GB of DDR3 RAM,
• Dual 1 GB Ethernet NICs,
• A SAS drive controller, and
• Minimum two SATA II drives in a JBOD configuration in addition to the host OS device.
Mid-size clusters of up to 300 nodes have:
• An additional 24 GB of RAM for a total of 48 GB
Master nodes in large clusters should have a total of 96 GB of RAM.
Making Choices: Hardware Considerations

Following are the points to be considered while setting up master nodes in a cluster:

Do not use commodity machines.

Use dual power supplies, dedicated cooling, bonded Network Interface Cards
(NICs) and raided disks.

Use machines with good RAM and nominal or moderate disk capacity.

The OS for master machines should be highly available, thus RAID hard drives
are recommended.

Add enough RAM to store metadata.

The NameNode should have access to an NFS mount to keep a copy of its metadata on disk.


Making Choices: Master/Slave Considerations

NameNode and Secondary NameNode:
• The Secondary NameNode is identical to the NameNode in setup.
• It is not a hot standby, but it can be used as replacement hardware for the NameNode when the NameNode fails.

JobTracker/ResourceManager:
• It is as memory hungry as the NameNode or Secondary NameNode.
• To provide job- and task-level status, counters, and progress, it keeps metadata in RAM.

Slave/Worker Nodes:
• Worker nodes are responsible for storage and computation.
• Considerations: commodity machines; enough storage capacity, CPU, and memory to process data; multiple disks from the same vendor, with no RAID; JBOD disk configurations.

Midline configuration:
• CPU: 2 × 6-core 2.9 GHz / 15 MB cache
• Memory: 64 GB DDR3-1600 ECC
• Disk controller: SAS 6 Gb/s
• Disks: 12 × 3 TB LFF SATA II 7200 RPM
• Network controller: 2 × 1 Gb Ethernet

High-end configuration:
• CPU: 2 × 6-core 2.9 GHz / 15 MB cache
• Memory: 96 GB DDR3-1600 ECC
• Disk controller: 2 × SAS 6 Gb/s
• Disks: 24 × 1 TB SFF Nearline/MDL SAS 7200 RPM
• Network controller: 1 × 10 Gb Ethernet
Making Choices: Software Considerations

1. Distribution: select the right distribution as per the features and stability required - a free or licensed distribution.
2. Hadoop version: select the right version as per the features and stability required.
3. Appropriate OS: select the appropriate Linux distribution or Windows version. The choice depends on the admin tools and on commercial support for the hardware and software.
Making Choices: Network Considerations

Network considerations for your Hadoop cluster include:

• Dedicated and preferably top-of-rack switches
• Nodes connected with a minimum speed of 1 GB/sec
• Higher speeds, such as 10 GB/sec, for large amounts of intermediate data
• Intensive bandwidth, with 1 GE oversubscription between racks
• 2 x 10 GB interconnect links per rack
• 1 GB links for all nodes in a 20-node rack
Can You Guess? Cluster Sizing Scenario

Consider a cluster with 10 DataNodes, each having 20 TB of storage capacity. Data grows by 5 TB/week, so with default replication, 15 TB of storage space is required every week. Assume overheads to be 30%.

When do you think we need to add more DataNodes to the cluster?
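One rough way to reason about it, assuming the 30% overhead is taken out of the raw disk capacity and the 15 TB/week already reflects 3x replication of the 5 TB/week growth:

  Raw capacity     = 10 DataNodes x 20 TB       = 200 TB
  Usable capacity  = 200 TB x (1 - 0.30)        = 140 TB
  Consumption      = 5 TB/week x replication 3  = 15 TB/week
  Weeks until full = 140 TB / 15 TB per week    = approximately 9.3 weeks

Under these assumptions, the cluster runs out of usable space after roughly 9 weeks, so new DataNodes would need to be added before then (in practice a little earlier, to keep headroom for intermediate data).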
Sizing Recommendation

Optimal results from a Hadoop implementation depend on choosing the right hardware and software stacks.

Machine type | Workload pattern/cluster type | Storage | Processor (# of cores) | Memory (GB) | Network
Slaves | Balanced | Twelve 2-3 TB disks | 8 | 128-256 | 1 GB onboard, 2x10 GbE mezzanine/external
Slaves | Compute-intensive | Twelve 1-2 TB disks | 10 | 128-256 | 1 GB onboard, 2x10 GbE mezzanine/external
Slaves | Storage-heavy | Twelve 4+ TB disks | 8 | 128-256 | 1 GB onboard, 2x10 GbE mezzanine/external
NameNode | Balanced | Four or more 2-3 TB disks, RAID 10 with spares | 8 | 128-256 | 1 GB onboard, 2x10 GbE mezzanine/external
Resource Manager | Balanced | Four or more 2-3 TB disks, RAID 10 with spares | 8 | 128-256 | 1 GB onboard, 2x10 GbE mezzanine/external
Industry Setups: Examples

PSG Tech, Coimbatore, India

• This organization works on determining evolutionary linkages and predicts


molecular structures. The dynamic nature of the algorithm coupled with
data and compute parallelism of Hadoop data grids improves the accuracy
and speed of sequence alignment.

• Cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 Quad
Core Rack Server, with 2x6MB Cache and 4 x 500 GB SATA Hard Drive to
E7200 / E7400 processors with 4 GB RAM and 160 GB HDD

Specific Media, United States

• They use Apache Hadoop for log aggregation, reporting, and analysis
• Two Apache Hadoop clusters, all nodes 16 cores, 32 GB RAM
• Cluster 1: 27 nodes (total 432 cores, 544GB RAM, 280TB storage)
• Cluster 2: 111 nodes (total 1776 cores, 3552GB RAM, 1.1PB storage)
Industry Setups: Examples

Yahoo!, United States

• Yahoo! has more than 100,000 CPUs in over 40,000 servers running
Hadoop, with its biggest Hadoop cluster running 4,500 nodes. Yahoo! stores
455 petabytes of data in Hadoop

• That's big, and approximately four times larger than Facebook's beefiest
Hadoop cluster
Quiz

QUIZ 1
A cluster with more nodes performs better than one with fewer, slightly faster nodes. State True or False.

a. True
b. False

The correct answer is a.
Explanation: It is true that a cluster with more nodes performs better than one with fewer, slightly faster nodes.
QUIZ 2
If the Secondary NameNode's hardware configuration is half that of the NameNode, will that create a problem?

a. No, there will be no problem if the Secondary NameNode's hardware configuration is half that of the NameNode.
b. Yes, if the Secondary NameNode is made the NameNode while doing a manual failover.
c. Yes, there will be a problem, as the Secondary NameNode should have a larger configuration than the NameNode.
d. Yes, there will be a problem only when the Secondary NameNode does checkpointing.

The correct answer is b.
Explanation: It will create a problem if the Secondary NameNode is made the NameNode during a manual failover.
QUIZ 3
While setting up DataNodes, would you prefer to have multiple disks or a single large disk?

a. Yes, multiple disks allow better fault tolerance, I/O, and parallelism.
b. No, it will be an operational issue.
c. No, DataNodes are daemons, they don't have disks.
d. None of the above.

The correct answer is a.
Explanation: While setting up DataNodes, multiple disks allow better fault tolerance, I/O, and parallelism.
QUIZ 4
The amount of memory required for the master nodes depends on the number of files and blocks. State True or False.

a. True
b. False

The correct answer is a.
Explanation: It is true. The amount of memory required for the master nodes depends on the number of files and blocks.
QUIZ 5
Name the workload characterized by the need for a large number of CPUs and vast memory.

a. I/O-intensive workload
b. Evolving workload
c. Network-intensive workload
d. Compute-intensive workload

The correct answer is d.
Explanation: A compute-intensive workload is characterized by the need for a large number of CPUs and large amounts of memory.
QUIZ 6
Based on data growth and replication, can we estimate the storage requirement and the need for adding DataNodes?

a. Yes, we can estimate the storage requirement and the need for adding DataNodes
b. Yes, but it depends on the NameNode
c. A and C
d. None of the above

The correct answer is a.
Explanation: Yes, based on data growth and replication we can estimate the storage requirement and the need for adding DataNodes.
QUIZ 7
Can understanding the frequency of processing data or the type of data processed help us in estimating the size of the cluster and the overall infrastructure?

a. Yes, it can help us estimate the size of a cluster and the overall infrastructure
b. No, it cannot estimate the size of a cluster and its overall infrastructure
c. Don't have knowledge about this
d. Frequency can help but the type of data doesn't

The correct answer is a.
Explanation: Yes, understanding the frequency of processing data or the type of data processed helps us in estimating the size of the cluster and the overall infrastructure.
QUIZ 8
Select a list of considerations when setting up master nodes.

a. JBOD disk configuration, no RAIDed drives, RAM same as DataNodes, multiple NICs
b. RAIDed hard drives, good RAM, multiple NICs, multiple cores
c. Multiple disks with JBOD configuration, default replication, dedicated power supply
d. None of the above

The correct answer is b.
Explanation: Considerations when setting up master nodes include RAIDed hard drives, good RAM, multiple NICs, and multiple cores.
QUIZ 9
Can your cluster benefit from physical and network isolation?

a. Cluster nodes, if isolated, trigger a lot of problems
b. Yes, physical and network isolation benefits clusters and avoids bottlenecks and resource sharing
c. Only vendor-specific distributions have this privilege
d. It isn't necessary

The correct answer is b.
Explanation: Yes, a cluster can benefit from physical and network isolation, which avoids bottlenecks and resource sharing.
QUIZ 10
What is the benefit of having multiple network interface cards in nodes?

a. They are defaults and cannot be changed
b. They can be bonded together to provide more throughput
c. Only if Kerberos allows
d. None of the above

The correct answer is b.
Explanation: The benefit of having multiple network interface cards in nodes is that they can be bonded together to provide more throughput.
Planning a Hadoop cluster requires knowing and
understanding the requirements.
Understanding workload patterns directly impacts the
choices made for selecting the right Hadoop distributions.
Typical workload pattern types include compute intensive
workload, I/O intensive workload, network intensive
workload, balanced workload, and unknown or evolving
workloads.
While setting up a cluster, the hardware considerations include selecting appropriate hardware and weighing factors such as NICs, power supply, cooling, RAID or JBOD disk configurations, and RAM and rack placement for both master and slave nodes.
The software considerations involve selecting the right
distribution, which is either free or licensed distribution, the
right version of Hadoop, and selecting the appropriate OS.
Network consideration is the most challenging parameter to
estimate due to varying workloads.
This concludes the lesson “Hadoop Cluster Planning.”
The next lesson is “Hadoop Clients and Hue Interface.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 09—Hadoop Clients and Hue Interface
Explain the concepts of Hadoop client, edge nodes, and gateway nodes

Install and configure Hadoop clients

Explain how Hue works

Install and configure Hue

Describe how authentication and authorization are managed in Hue
Lesson 9: Hadoop Clients and Hue Interface
Topic 9.1: Overview of Hadoop Client, Edge Nodes, and Gateway Nodes
Client Nodes: Hadoop/HDFS Clients

Hadoop Server Roles

(Diagram: the three Hadoop server roles. Clients sit outside the cluster. Masters run the JobTracker for distributed data processing with MapReduce, plus the NameNode and Secondary NameNode for distributed data storage with HDFS. Slaves are the machines that do all the storing and running of computations, each running a DataNode and a TaskTracker.)
Client Nodes: Hadoop/HDFS Clients

• Client machines have Hadoop, or packages of Hadoop components, installed with all the cluster settings but without any master or slave daemons.

• Clients regulate data access from the Hadoop cluster; client machines are used to run client programs or APIs, which load data into the cluster, and to submit jobs.

• The client node then retrieves information on data and job outputs when the job is complete.

• The client's interaction with the master node is inevitable for any kind of data read, write, or processing.
Client Nodes: HDFS Clients Performing FileSystem Metadata Operations

HDFS clients perform filesystem metadata operations through a single server known as the NameNode, and they send and retrieve filesystem data by communicating with a pool of DataNodes.

Data is replicated on multiple DataNodes, so the loss of a single DataNode should never be fatal to the cluster or cause data loss.

(Diagram: a Dev cluster with master (name) and service nodes, DataNodes, and a user-space edge node, connected over a private Hadoop network and a production network. The same kind of setup can be used for a Production, Disaster Recovery, Test, or UAT cluster.)
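The replication factor that makes the loss of a single DataNode non-fatal is controlled by dfs.replication in hdfs-site.xml; a minimal sketch with its usual default value:

<property>
  <!-- default number of block replicas; with 3 copies, losing one DataNode is not fatal -->
  <name>dfs.replication</name>
  <value>3</value>
</property>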
Client to Cluster Interaction

01 How the client-to-cluster communication takes place
02 How the write is done by the client
03 How the client ensures data integrity
Communication Between a Client and a Cluster

1. All HDFS communication protocols are layered on top of the TCP/IP protocol.
2. The client establishes a connection to a configurable TCP port on the NameNode.
3. The DataNode Protocol is used for DataNode communication with the NameNode.
4. An RPC abstraction wraps both the Client Protocol and the DataNode Protocol.
5. The NameNode never initiates any RPCs; it only responds to RPC requests issued by DataNodes or clients.

(Diagram: the same Dev cluster as before, showing the client, the NameNode, and the DataNodes exchanging these protocols.)
How Writes Are Done from the Client

The HDFS client caches the file data into a temporary local file. The NameNode inserts the file name into the filesystem hierarchy and allocates a data block for it.

(Diagram: the client writes blocks A, B, and C of File.txt to DataNodes 1, 5, and 6 out of DataNodes 1..N, as directed by the NameNode.)

When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed, and the NameNode commits the file creation operation into a persistent store.
Understand How the Client Ensures Data Integrity

The HDFS client software implements checksum checking on the contents of HDFS files.

(Diagram: the client verifies checksums while exchanging data with the NameNode namespace and the DataNodes.)
How to Configure and Deploy Clients

To allow clients to use Hbase, HDFS, Hive, MapReduce, or


Yarn services, Cloudera Manager creates configuration file
zip archives; they contain service properties, also known as
client configuration files.

How are clients Client configuration files are auto-generated and auto-
configured and deployed based on services and roles in the cluster
deployed

For each host that has a service role instance installed, or that is configured as a gateway node, any configuration changes made afterwards result in stale client configurations and may demand redeployment.
How to Configure and Deploy Clients (contd.)

• Downloading: the deploy function downloads the configuration zip file for each host that has a service role instance installed and for each host that is configured as a gateway node for that service.
• Unzipping: the deploy function unzips the archive into the appropriate configuration directory.
• The deploy function uses the Linux alternatives mechanism to set a configurable priority level.

If roles for multiple services are running on the same host, then the client configurations for both roles are deployed on that host, with the alternatives priority determining which configuration takes precedence.
How to Download Client Configuration Files

To download the client configuration files:
1. Log in to the Admin Console and click the Services tab.
2. From the Home button, select the status.
3. Click the Client Configuration URLs button, which shows links to configuration files based on services.
4. Save the link and download the configuration files.

To download an individual client configuration zip file:
1. Click the Services tab in the Admin Console.
2. Select the service instance whose configuration you want to download.
3. From the Actions menu, select Download Client Configuration.
4. This downloads the configuration files for the selected service.
Edge Nodes or Gateway Nodes

• Act as the interface between the Hadoop cluster and an external network.
• Are mostly used to run client applications and cluster administration tools.
• Are used as staging areas for intermittent data when data is being transferred to the Hadoop cluster.

(Diagram: the Dev cluster with two edge nodes sitting between the production network and the private Hadoop network; the edge nodes host tools such as Oozie, Pig, Sqoop, Hive, Ambari, Hue, and client applications.)
Considerations for Edge Nodes

There are some considerations for edge nodes, although the specifics always depend on business and technical requirements.

1. To successfully handle high inbound and outbound data transfer rates, edge nodes should have multiple pairs of bonded 10 GbE network connectors.
2. Edge nodes should be multi-homed, that is, connected to multiple networks and into the private subnet of the Hadoop cluster. Two pairs of bonded 1 GbE network connections are recommended: one to connect to the Hadoop cluster and the other for the external network.
3. The processor configuration should be the same as, or a little more than, that of slave nodes; 48 GB of RAM would be sufficient.
4. Edge nodes oriented to data ingestion should be equipped with optimum storage space.
5. Edge nodes should use carrier-class hardware.
6. Apart from edge nodes, no other nodes must be used to deploy and run administration tools.
7. Avoid placing data import/export services such as Sqoop on master and slave nodes, as the high data transfer volumes may lower the ability of Hadoop services to communicate with each other. High latency may cause the nodes to get detached from the cluster.
Lesson 9: Hadoop Clients and Hue Interface
Topic 9.2: Installing, Configuring, Refreshing and Working with Clients
Demonstration 1: Installing, Configuring, Refreshing and
Working with Clients
Lesson 9: Hadoop Clients and Hue Interface
Topic 9.3: Overview of Hadoop User Experience (Hue)
Introduction to Hue

Hue is a web interface for analyzing data with Apache Hadoop.

(Diagram: Hue plugins connect to YARN, the JobTracker, Oozie, HDFS, Pig, HiveServer2, the Hive Metastore, Cloudera Impala, HBase, ZooKeeper, Solr, and Sqoop2.)
Why is Hue Required

Why do we need a new interface?

The NameNode UI and the Resource Manager UI only allow users to browse cluster-specific or daemon-specific information. Making changes at the cluster level, service level, or role level is not allowed, nor is working with data on the cluster.
Why is Hue Required (contd.)

Hue is an application in the background that allows users to interact with a Hadoop cluster from a web browser and requires no client installation.

Hue does it all: through its plugins it reaches YARN, the JobTracker, Oozie, HDFS, Pig, HiveServer2, the Hive Metastore, Cloudera Impala, HBase, ZooKeeper, Solr, and Sqoop2.
Hue Architecture

The Hue server and its suite of applications:

(Diagram: the Hue UI and Hue DB sit behind the Hue Server, which hosts applications such as the Job Browser, Job Designer, Oozie Editor, Pig Editor, File Browser, Beeswax (Hive UI), Metastore Manager, Cloudera Impala Query UI, Shell, Sqoop2, and HBase. Hue plugins connect these applications to YARN, the JobTracker, Oozie, Pig, HDFS, the Hive Metastore, HiveServer2, Cloudera Impala, HBase, and Sqoop2 in CDH.)

How does Hue do it?

• Hue consists of a web service that runs on one of the nodes of a cluster.
• This node may be called the Hue Server.
• The Hue Server is a "container" web application that sits between the CDH installation and the browser. It hosts various web applications and communicates with CDH services.
More About Hue Architecture

• The Hue UI allows users to use various applications through a web interface.
• The Hue server makes communication easier.
• Hue applications may run Hue-specific daemons. Hue applications communicate with these daemons either by using a Thrift service or by exchanging state through a database.
• Hue applications internally interact with the services configured to run on the cluster.
• The Hue server uses a database to manage sessions, authentication, and Hue application data.
Lesson 9: Hadoop Clients and Hue Interface
Topic 9.4: Hue Application Interfaces
Hue Application Interface—Beeswax

The Beeswax application enables you to perform queries on Apache Hive.

• It allows primitive and collection data types to impose structure on data and process the data.
• It allows integration with user-defined functions.
• It makes working with Hive easier. The latest versions of Hue use HiveServer2 instead of Beeswax.
Hue Application Interface—Impala Query UI

The Impala Query UI application helps in performing queries on Apache Hadoop data stored in HDFS or in HBase.

• Impala provides fast, interactive SQL queries. It uses the same metadata, SQL syntax, ODBC driver, and user interface as Apache Hive.
• The Metastore app, under Browsers, helps users browse tables with ease, as it uses the same metastore as Hive.
• Most Hive queries are compatible with Impala, and using Impala through Hue makes it easier to work on queries.
RDBMS UI in DB Query

The RDBMS UI is a new application that enables the viewing of data in other RDBMSs.
Pig Editor

The Pig Editor application allows you to define and


run pig scripts that may be used to extract or
transform data and view the status of jobs.
Job Designer

The Job Designer application allows users to create and launch jobs on the Hadoop cluster.

The Job Designer can be enabled to accept variables when jobs are being run. It supports the actions supported by Oozie: MapReduce, Streaming, Java, Pig, Hive, Sqoop, Shell, SSH, DistCp (distributed copy), Fs, and Email.
Job Browser

The Job Browser application allows users to examine the jobs running on the Hadoop cluster. The Job Browser presents jobs and tasks in layers: the list of jobs > a job's tasks > a task's attempts and properties (start, end, output size, and so on) and job logs.
Metastore Tables in Hive Metastore Browser

The Table Browser application allows users to browse and work with Hive tables. Creating tables, browsing data, and loading data into tables is made easier through this application.
HBase in HBase UI

The HBase UI application allows users to work with HBase and its unstructured, column-oriented data on HDFS. Bulk data upload, a smart view of tables, and other features make this UI a useful interface for interacting with HBase; it is preferable to using the HBase shell from terminals.
Sqoop Transfer
Sqoop UI enables users to transfer, that is, import and export data between structured and unstructured data
sources using Sqoop.

Sqoop is a batch migration tool extensively used for


transferring data from relational databases to
Hadoop HDFS.

This application allows data import in various formats,


from various structured data sources and through direct
ingestion of data in components of Hadoop ecosystem
such as Hive, Hbase, Accumulo, and so on.
Zookeeper in Zookeeper Browser

The Zookeeper browser application


helps in listing the zookeeper cluster
statistics and clients.
File Browser

Using this GUI avoids logging into gateway


hosts, terminals, and clients on edge
nodes for these operations.

Allows users to browse, download,


rename, move, copy, and change
ownerships and permissions. It also allows
sort data by attributes, view content of
files, and upload data.
Oozie Editor, Administration, Security, and Search

• In the Oozie Editor and Dashboard, you define Oozie workflow, coordinator, and bundle applications, run the workflows, and view the status of jobs.
• Administration is used to grant users access to work in Hue and on HDFS.
• Security management is used to handle security and privileges for Hive tables, Solr collections, and file access control lists (ACLs).
• Use Solr Search to perform keyword searches across Hadoop data by using a wizard that lets you style the result snippets, specify facets to group the results, sort the results, and highlight result fields.
Demonstration 2: Adding Hue as a Service in CDH
Demonstration 3: Checking Hue Configurations and
Checking Status of Hue
Demonstration 4: Working With Hue and Using Hue
Interface
Demonstration 5: Hue Authentication and Authorization
Hadoop clients are used to run client programs or APIs, which load data
into the cluster, and to submit MapReduce jobs describing how that data
is to be processed. The client node then retrieves information on data and
job outputs when the job is complete.

To understand the process of interaction between the client and cluster,


you need to know how communication takes place, how the write is done
from the client, and how the client ensures data integrity.

To allow clients to use Hbase, HDFS, Hive, MapReduce, or Yarn services,


Cloudera Manager creates configuration file zip archives, which contain
service properties. These archives are also known as client configuration
files.

Edge nodes are mostly used to run client applications and cluster
administration tools.
Hue, although a web interface, is a background application that allows users to
interact with a Hadoop cluster from a web browser and requires no client
installation.

There are various Hue application interfaces, such as Beeswax, the Impala Query UI, the RDBMS UI in DB Query, the Pig Editor, the Job Designer, the Job Browser, Metastore Tables in the Hive Metastore Browser, HBase in the HBase UI, Sqoop Transfer, ZooKeeper in the ZooKeeper Browser, and the File Browser.
Quiz

QUIZ 1
Identify the disadvantage of running administration tools or data transfer tools like Sqoop on master/slave nodes.

a. It's mandatory to run the admin/data transfer tools on nodes that are not part of the cluster.
b. Conflict in resource usage and high-volume data transfer may impact Hadoop services.
c. Master/slave nodes are busy with data handling and cannot be efficient.
d. Services running on master/slave nodes may block data transfer or data administration.

The correct answer is b.
Explanation: The disadvantage of running administration tools or data transfer tools like Sqoop on master/slave nodes is that conflict in resource usage and high-volume data transfer may impact Hadoop services.
QUIZ 2
Identify the processes that can run on edge nodes.

a. NameNode, DataNodes, and Secondary NameNode
b. ResourceManager and NameNode only
c. Sqoop, Oozie, Ambari, Hue, Pig, Hive, and client applications
d. Only client applications

The correct answer is c.
Explanation: The processes that can run on edge nodes include Sqoop, Oozie, Ambari, Hue, Pig, Hive, and client applications.
QUIZ 3
Which of the following regulates the data access for the data stored in HDFS or in data stores using HDFS?

a. NameNode
b. ResourceManager
c. HDFS clients
d. Service nodes

The correct answer is c.
Explanation: HDFS clients regulate the data access for the data stored in HDFS or in data stores using HDFS.
QUIZ 4
How can the stale configurations issue be fixed?

a. Reinstalling the clients will be required.
b. It can be fixed by restarting hosts.
c. It can be fixed by restarting all affected services and redeploying client configurations.
d. Cloudera Manager handles this automatically.

The correct answer is c.
Explanation: Stale configurations can be fixed by restarting all affected services and redeploying client configurations.
QUIZ 5
Which of the following is the "container" web application that sits between the Hadoop installation and the browser and facilitates working with a cluster with ease?

a. Cloudera Manager
b. Hue Server
c. Hue Database
d. Apache Web Server

The correct answer is b.
Explanation: The Hue Server is the "container" web application that sits between the Hadoop installation and the browser and facilitates working with a cluster with ease.
QUIZ 6
Identify the application of Hue that allows users to issue fast interactive queries and shares a metastore with Hive.

a. Pig
b. Hive Metastore
c. HBase
d. Impala

The correct answer is d.
Explanation: Impala allows users to issue fast interactive queries and shares a metastore with Hive.
QUIZ 7
Identify the benefit of using the File Browser in Hue instead of working without the Hue UI.

a. It avoids logging in from edge/gateway nodes and working on the command line
b. It allows you to browse and edit files without permission issues
c. It helps you to work on HDFS as admin
d. It allows you to browse hidden files

The correct answer is a.
Explanation: The benefit of using the File Browser in Hue is that it avoids logging in from edge/gateway nodes and working on the command line.
QUIZ 8
What is the main difference between the daemon web interfaces and the Hue web interface?

a. Daemon web interfaces stop when daemons stop, but the Hue interface doesn't.
b. Hue allows users to work with data on HDFS, but daemon web interfaces allow only browsing.
c. The Hue web interface provides full access to HDFS, but daemon web interfaces don't.
d. Daemon web interfaces show accurate information, but the Hue interface doesn't.

The correct answer is b.
Explanation: The main difference is that Hue allows users to work with data on HDFS, whereas the daemon web interfaces allow only browsing.
QUIZ 9
Identify the application of Hue that allows you to work with unstructured, column-oriented data in HDFS.

a. Sqoop
b. Hive
c. HBase
d. Job Designer and Browser

The correct answer is c.
Explanation: HBase is the application of Hue that allows you to work with unstructured, column-oriented data in HDFS.
QUIZ 10
Which file is edited to make sure applications in Hue interact with the underlying services of the cluster?

a. Client configuration files
b. The hue.ini file
c. The hue.conf file
d. hue-site.xml

The correct answer is b.
Explanation: The hue.ini file is edited to make sure applications in Hue interact with the underlying services of the cluster.
This concludes the lesson “Hadoop Clients and Hue Interface.”
The next lesson is “Data Ingestion in Hadoop Cluster.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson10 - Data Ingestion in Hadoop Cluster
Explain data ingestion

List the tools used for data ingestion

Explain the types of data ingestion

Explain how Apache Flume works

Explain how Sqoop works


Data Ingestion in Hadoop Cluster
Data Ingestion
Data Ingestion—What is it?
Data Ingestion Tools

Apache NiFi
• Automates data movement
• Provides real-time control
• Supports different data formats, schemas, protocols, speeds, and sizes
• Allows tracing of data in real time

Apache Flume
• Is reliable and highly available
• Provides extensibility for online analytic applications
• Maintains a central list of ongoing data flows

Apache Sqoop
• Offers two-way replication with snapshots
• Supports the incremental load of a single table or free-form SQL queries
• Allows saving jobs for repeated runs
• Allows importing data into Hive or HBase

Apache Kafka
• Is designed to provide high-throughput persistent messaging
• Uses compression to optimize I/O performance
• Uses mirroring to improve availability and scalability
Data Ingestion Tools

Amazon Kinesis
• Captures and stores terabytes of data per hour from numerous data sources

Chukwa
• Provides a scalable distributed system for monitoring and analysis of log-based data

Apache Storm
• Handles real-time analytics, online machine learning, and continuous computation

Gobblin
• Extracts, transforms, and loads large volumes of data from a variety of data sources onto Hadoop
• Handles the routine tasks required for all data ingestion ETLs
Data Ingestion Tools

Fluentd
• Decouples data sources from backend systems by providing a unified logging layer in between

Cloudera Morphlines
• Reduces the time consumed in building and changing Hadoop ETL stream processing applications

Apache Samza
• Distributed stream processing framework that uses Apache Kafka for messaging and YARN for fault tolerance, processor isolation, security, and resource management
Data Ingestion Types
Data Ingestion in Hadoop Cluster
Streaming Data Ingestion with Apache Flume
Apache Flume—Features enabling Data Ingestion

• Insulates systems
• Guaranteed data delivery
• Ingestion of stream data
• Scales horizontally

Source Streams

• Application logs
• Social media
• Sensor and machine data
• Geo-location data


Apache Flume—How it works

Log/event generators such as web servers, applications, and social media feeds produce log/event data. Flume collects this data and delivers it to centralized stores such as HDFS or cloud storage.

Flume is reliable, fault tolerant, scalable, and manageable and customizable.
Terminologies in Flume

A Flume agent is a JVM process that hosts a Source, one or more Channels, and one or more Sinks. The Source receives events (for example, from JMS or log generators) and places them on Channels; each Channel buffers events until its Sink delivers them to a destination such as HDFS or HBase.
Data Flow in Apache Flume

Web services run Flume agents that forward log/event data to collectors; the collectors write the data to centralized stores such as HDFS and HBase.
Types of Data Flow

Multi-hop Flow: events travel through more than one agent before reaching the destination; the sink of one agent forwards events to the source of the next agent, which delivers them to a store such as HBase.

Fan-out Flow: a single source writes events to multiple channels, so the same event can be delivered to more than one sink and destination such as HBase.

Fan-in Flow: multiple sources feed events into the channels of a single agent, which delivers them through its sink to a destination such as HBase.
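To make the source, channel, and sink roles concrete, a minimal single-agent configuration might look like the following sketch; the agent name, port, and HDFS path are hypothetical:

# example.conf: one source, one memory channel, one HDFS sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel = ch1

The agent can then be started with a command such as:
flume-ng agent --conf conf --conf-file example.conf --name agent1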
Demonstration 1:
Adding Flume Service to CDH Cluster
Data Ingestion in Hadoop Cluster
Structured Data Ingestion with Apache Sqoop
Apache Sqoop—Features

• Can import sequential datasets from the mainframe
• Can import directly to ORC files
• Can support parallel data transfer
• Can move certain data from external stores and EDWs into Hadoop
• Can speedily make data copies from external systems into Hadoop
Apache Sqoop—How it works

• Sqoop connects to external systems (document-based systems, relational databases, and enterprise data warehouses) in an optimal way through a pluggable mechanism; various connectors support connectivity to popular database and data warehousing systems.
• Sqoop uses MapReduce map tasks to import and export the data in HDFS or directly in Hive or HBase.
• With the Sqoop extension API, new connectors are embedded into Sqoop installations.
• Sqoop integrates with Apache Oozie, which coordinates the workflow.
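For example, a typical Sqoop 1 import of a relational table into HDFS or Hive might look like the following; the database, credentials, table, and paths are placeholders:

sqoop import \
  --connect jdbc:mysql://dbserver.example.com/salesdb \
  --username dbuser --password-file /user/dbuser/.pwd \
  --table orders \
  --target-dir /user/hadoop/orders \
  -m 4                                      # run 4 parallel map tasks

# Import the same table directly into a Hive table instead of a plain HDFS directory
sqoop import --connect jdbc:mysql://dbserver.example.com/salesdb \
  --username dbuser --password-file /user/dbuser/.pwd \
  --table orders --hive-import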
Apache Sqoop 1—Challenges

• Cryptic and contextual command-line arguments lead to errors in connector matching.
• Root privileges are required for local configuration and installation; thus, these tasks are difficult to manage.
• Tight coupling between data transfer and the serialization format causes issues.
• Debugging the map job is limited to turning on the verbose flag.
• There are security concerns with openly shared credentials.
• Connectors have to follow the JDBC model and use JDBC vocabulary such as URL, database, table, and so on, whether it is applicable or not.
Sqoop 2—Features

Sqoop 2 provides a web-based user interface and a command-line client, both built on top of a REST API exposed by the Sqoop Server. The server hosts the built-in connectors and the metadata repository and launches the Map and Reduce tasks that move data between external systems (document-based systems, relational databases, enterprise data warehouses) and HDFS, HBase, or Hive.
Sqoop 2—Features

Usability
• Installed and configured on the server side only
• Provides a web-based service, a command-line interface, or a browser
• Integrates with Hive and HBase on the server side
• Decouples itself from Oozie; Oozie manages Sqoop tasks via the REST API

Extensibility
• Does not restrict usage to JDBC connectors
• Connectors are responsible only for data transport
• Web-based UI eliminates the chances of user errors
• Enables users to choose the connectors used
• Integrates better with external systems such as Oozie

Security
• Operates as a server-based application
• Supports secure access to external systems by providing role-based access to connection objects
• Can be installed once and used from anywhere
Demonstration 2:
Adding Sqoop Service to a CDH Cluster
Data ingestion is the process of importing, transferring, or loading data
into a persistent storage layer for immediate or later use or for storage.

Apache Flume, Apache Sqoop, and Apache Kafka are some of the
tools used for data ingestion.

A Flume agent is a Java Virtual Machine (JVM) process that hosts the
components through which events flow.

The types of data flow are Multi-hop flow, Fan-out flow, and
Fan-in flow.

Apache Flume enables ingestion of high-volume streaming


data into HDFS.

Apache Sqoop facilitates efficient transfer of bulk data


between Apache Hadoop and structured data stores
such as relational databases.
Quiz
QUIZ
Which Framework/tools can be used for streaming data ingestion?
1

a. CopyFromLocal and CopyToLocal HDFS commands

b. Apache Flume, Chukwa, Kafka, Storm

c. Apache Sqoop 2

d. Apache Samza and Apache NiFi


QUIZ
Which Framework/tools can be used for streaming data ingestion?
1

a. CopyFromLocal and CopyToLocal HDFS commands

b. Apache Flume, Chukwa, Kafka, Storm

c. Apache Sqoop 2

d. Apache Samza and Apache NiFi

The correct answer is b .


Explanation: Apache Flume, Chukwa, Kafka, Storm are frameworks that can be used for
streaming data ingestion.
QUIZ Name the entities through which data enters into Flume and data is delivered to
2 destination?

a. Agent and collector

b. Producer and consumer

c. Source and Sink

d. Social media events delivered via channels


QUIZ Name the entities through which data enters into Flume and data is delivered to
2 destination?

a. Agent and collector

b. Producer and consumer

c. Source and Sink

d. Social media events delivered via channels

The correct answer is c .


Explanation: Source and Sink are the entities through which data enters Flume and
data is delivered to the destination.
QUIZ
Which tool allows bulk data transfer across Hadoop and structured data stores?
3

a. Apache Flume

b. Apache Sqoop

c. Apache Storm

d. Spark SQL
QUIZ
Which tool allows bulk data transfer across Hadoop and structured data stores?
3

a. Apache Flume

b. Apache Sqoop

c. Apache Storm

d. Spark SQL

The correct answer is b .


Explanation: Apache Sqoop allows bulk data transfer across Hadoop and structured
data stores.
QUIZ What is the type of data flow in Flume when events may travel through more
than one agent?
4

a. Fan-in Flow

b. Fan-out Flow

c. Multiplexing Flow

d. Multi-Hop Flow
QUIZ What is the type of data flow in Flume when events may travel through more
4 than one agent?

a. Fan-in Flow

b. Fan-out Flow

c. Multiplexing Flow

d. Multi-Hop Flow

The correct answer is d .


Explanation: Multi-hop flow is the type of data flow in Flume in which events may travel
through more than one agent.
QUIZ What is the name of the data ingestion tool that allows automation of data
movement between systems, provides real time control, and enables ease of
5 data movement?

a. Apache Kafka

b. Apache Sqoop

c. Apache NiFi

d. Amazon Kinesis
QUIZ What is the name of the data ingestion tool that allows automation of data
movement between systems, provides real time control, and enables ease of
5 data movement?

a. Apache Kafka

b. Apache Sqoop

c. Apache NiFi

d. Amazon Kinesis

The correct answer is c .


Explanation: Apache NiFi is the data ingestion tool that allows automation of data
movement between systems, provides real-time control, and enables ease of
data movement.
QUIZ Which of the following is the term used for the process that hosts the
6 components through which events flow within Flume?

a. Flume event

b. Flume agent

c. Flume channels

d. Flume sinks
QUIZ Which of the following is the term used for the process that hosts the
6 components through which events flow within Flume?

a. Flume event

b. Flume agent

c. Flume channels

d. Flume sinks

The correct answer is b .


Explanation: Flume agent is the term used for the process that hosts the components
through which events flow within Flume.
QUIZ What is the capability of Sqoop that mitigates excessive storage and processing
7 loads to other systems?

a. Load Balancing

b. Offload data processing

c. Import/export of data

d. Performance optimization
QUIZ What is the capability of Sqoop that mitigates excessive storage and processing
7 loads to other systems?

a. Load Balancing

b. Offload data processing

c. Import/export of data

d. Performance optimization

The correct answer is a .


Explanation: Load balancing is the capability of Sqoop that mitigates excessive storage
and processing loads to other systems.
QUIZ Which of the following services does Sqoop integrate with to allow automation
8 and scheduling of import/export tasks?

a. Sqoop Server

b. Apache Kafka

c. Apache Oozie

d. Apache NiFi
QUIZ Which of the following services does Sqoop integrate with to allow automation
8 and scheduling of import/export tasks?

a. Sqoop Server

b. Apache Kafka

c. Apache Oozie

d. Apache NiFi

The correct answer is c .


Explanation: Apache oozie, the workflow coordinator, is the service that Sqoop
integrates with to allow automation and scheduling of import/export tasks.
QUIZ How can Sqoop improve compression and achieve light-weight indexing for
9 improved performance?

a. By using Oozie to schedule workflows and run jobs

b. By importing data into the ORCFile format

c. By specifying compression while importing data

d. By storing imported data in Hive


QUIZ How can Sqoop improve compression and achieve light-weight indexing for
9 improved performance?

a. By using Oozie to schedule workflows and run jobs

b. By importing data into the ORCFile format

c. By specifying compression while importing data

d. By storing imported data in Hive

The correct answer is b .


Explanation: By importing data into the ORCFile format, Sqoop improves compression and
achieves lightweight indexing for improved performance.
QUIZ
How does Flume ensure guaranteed delivery of events to the destination?
10

a. By using source, channels, and sinks

b. By using channel-based transactions

c. By using agents and collectors

d. By ingesting only smaller files


QUIZ
How does Flume ensure guaranteed delivery of events to the destination?
10

a. By using source, channels, and sinks

b. By using channel-based transactions

c. By using agents and collectors

d. By ingesting only smaller files

The correct answer is b .


Explanation: Flume uses channel-based transactions to ensure reliable message delivery. When
a message moves from one agent to another, two transactions are started, one on the agent
that delivers the event and the other on the agent that receives the event.
This concludes the lesson “Data Ingestion in Hadoop Cluster.”
The next lesson is “Hadoop Ecosystem Components.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 11—Hadoop Ecosystem Components/Services
List some of the services and open-source components that work
within the Hadoop ecosystem

List the advantages and key features of Hive

Describe the components of Hive briefly

Explain how to configure Hive in different modes

Explain the architecture of HBase and cite the


advantages of using HBase

Explain the working of Apache Kafka

Describe the architecture of Apache Spark


Lesson 11: Hadoop Ecosystem Components/Services
Topic 11.1: Apache Hive
Apache Hive What is it?

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provisions data summarization, query, and analysis.
Apache Hive Applications

• Machine learning and research development
• Data mining and analysis
• Hosting fact and dimension data
• Social graph analysis
• Log data normalization
• User analytics
• Reporting and ad hoc queries
• Model building and trend detection
• Log processing and analysis
• Business intelligence and analysis
Apache Hive Features

Let us learn the key features of Hive


Apache Hive Features

Enables easy data extract, transform, and load


(ETL) with in-built tools
Apache Hive Features

Supports structuring of multiple and varied data


formats
Apache Hive Features

Facilitates access to files stored directly in Apache
HDFS or in other data storage systems such as
Apache HBase
Apache Hive Features

Provisions query execution via MapReduce


Apache Hive Features

SQL

Supports query execution using simple query


language
Apache Hive Features

SQL

Supports plugging in of custom mappers and


reducers
Apache Hive Features

Facilitates sophisticated analysis


Apache Hive Features

Supports extension of the query language with User-Defined Functions

Apache Hive Features

Permits read and write of data in non-Hive formats
Apache Hive Features

Works effectively with Thrift, control-delimited, and specialized data formats
Apache Hive Features

Effective for batch jobs over large append-only data sets; cannot support Online Transaction Processing (OLTP) workloads or real-time queries
Apache Hive Features

Designed to support the scalability that Hadoop


offers and also to have loose coupling with input
formats
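As a sketch of the ETL and query capabilities listed above, a user might work with Hive through Beeline as follows; the HiveServer2 host, table, and input path are hypothetical:

beeline -u jdbc:hive2://hiveserver2-host:10000 -n hadoopuser

-- Create a table over tab-delimited text files, load data, and query it
CREATE TABLE web_logs (ip STRING, ts STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/data/raw/web_logs' INTO TABLE web_logs;
SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url;    -- executed via MapReduce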
Apache Hive Components

• Driver: handles sessions, connectivity of Hive with Hadoop, and other relevant activities
• Compiler: parses queries, prepares query plans, and optimizes the plans
• Execution Engine: compiles queries and executes the plans
• Shell: the interface through which Hive can interact with data; it is used to query data, connect to external databases, change environment variables, or edit configuration properties
• Metastore: defines the mode in which Hive is set up and contains information about the schema, that is, metadata about data such as object definitions and how they are mapped to data
Apache Hive Components

The metastore service is accessed from client nodes.
Apache Hive Configurable Modes

Hive can be configured with an Embedded Metastore, a Local Metastore, or a Remote Metastore.

Embedded Metastore: the Hive driver, the metastore interface, and the Derby database all run inside a single JVM.

Local Metastore: the Hive driver and the metastore interface run in the same JVM, while the metastore database (for example, MySQL) runs as a separate process; more than one Hive driver can connect to it.

Remote Metastore: the metastore runs in its own JVM as a metastore service; the Hive drivers of multiple users connect to the service, and only the service talks to the metastore database (for example, MySQL).
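In a remote metastore setup, client nodes are typically pointed at the metastore service through hive-site.xml. A minimal sketch is shown below; the host names are placeholders, and the JDBC settings belong on the metastore service host rather than on clients:

<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore-host.example.com:9083</value>
</property>

<!-- On the metastore service host only -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://db-host.example.com/metastore</value>
</property>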


In this demo, you will see how Hive is added as a service
in CDH, how to look into configurations and how to use
CLI or HUE to work with Hive.
In this demo, you will see how to configure the different
modes of Hive in an Apache Hadoop cluster and work
with Hive using Hue.
Lesson 11: Hadoop Ecosystem Components/Services

Topic 11.3 HBase


HBase: Need

Relational database systems depend on normalizing and indexing data, implementing joins, executing stored procedures, using transactions to ensure data consistency and referential integrity, and using a domain-specific language such as SQL.

Scaling a relational system typically follows these steps:
1. Set up a single master database server
2. Scale the master server vertically
3. De-normalize schemas
4. Store only the amount of data that can enable optimization of access patterns
5. Add slave database servers
6. Add a cache
7. Avoid use of built-in features
8. Determine the costliest queries
HBase: Need

SQL
Non relational Database Systems
Not-Only SQL or NoSQL!!
(Term coined by Eric Evans)
HBase: Applications

Capture incremental data from various sources

Capture user-interaction data

Capture raw clickstream and telemetry and user


interaction data incrementally and process
it using different processing mechanisms

Back applications on which users consume or


generate large amounts of content, such as Twitter,
Facebook, Instagram, and other micro blogs
HBase: Features

• Stores structured, semi-structured, and unstructured data
• Stores tweets, parsed log files, product catalogs, customer reviews, videos, and images
• Stores integers in one row and strings in another row of the same column, that is, it has a column-oriented data storage mechanism
• Is capable of filtering and has better scanning throughput
• Is flexible and scalable
• Provides key-based access to a cell or a range of cells
• Replicates data and thus protects data from loss due to cluster node failure
• Guarantees high availability by using standby master nodes
• Provides record-level consistency
• Provides Atomicity, Consistency, Isolation, and Durability semantics for each row of data
• Runs on a cluster of nodes
• Supports co-processors
• Facilitates random write actions
• Optimizes read actions
HBase: Usage Scenarios

• The data access patterns are well known in advance.
• There is a need for faster and random read-write actions.
• There is an existing Hadoop cluster.
• The volume of data is huge.
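A quick way to see the column-oriented, key-based access described above is the HBase shell. A small sketch, with a hypothetical table and column family, could be:

hbase shell
create 'user_actions', 'cf'                     # table with one column family
put 'user_actions', 'row1', 'cf:page', '/home'
put 'user_actions', 'row1', 'cf:clicks', '3'
get 'user_actions', 'row1'                      # key-based access to a single row
scan 'user_actions', {LIMIT => 10}              # scan a range of cells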
HBase—Architecture

Java client APIs and external APIs (Thrift, Avro, REST) communicate with the HBase Master and the Region Servers. A Region Server hosts multiple regions, each with a MemStore and HFiles, and maintains a Write-Ahead Log (WAL). Storage goes through the Hadoop FileSystem API to the Hadoop Distributed FileSystem (HDFS), and ZooKeeper provides coordination.
HBase—Architecture

Tables are horizontally partitioned into key ranges called regions, and regions are assigned to Region Servers. The META table, whose location is kept in ZooKeeper along with the HMaster location, maps each table and row key to the Region Server that serves it. Clients look up the Region Server for a row key in META, cache the result, and then send Put or Get requests directly to that Region Server.

Within a Region Server, the Block Cache is the read cache (least recently used entries are evicted), each region has a MemStore (the write cache, a sorted map of KeyValues in memory), HFiles hold sorted KeyValues on disk, and a write-ahead log (WAL) on the HDFS DataNode is used for recovery.
HBase—Architecture

Clients coordinate through ZooKeeper and the HMaster. Each HRegionServer hosts a number of HRegions and maintains an HLog (write-ahead log). An HRegion contains one Store per column family; each Store has a MemStore and one or more StoreFiles (HFiles). Data is written through a DFS client to HDFS DataNodes.
In this demo, you will see how to add HBase as a service
in CDH.
Lesson 11: Hadoop Ecosystem Components/Services

Topic 11.4 Apache Kafka


Apache Kafka What is it?

Kafka handles the real-time monitoring and processing of events at Netflix.

Kafka is used to obtain performance and usage data from the end-users’ browsers to be used in projects.

Kafka powers the Storm stream processing infrastructure at Twitter.

At Square, Kafka acts as a bus and moves systems events through various datacenters.

LinkedIn uses Apache Kafka to stream activity data and to obtain operational metrics.

Many teams in Yahoo use Kafka, including the Media Analytics team which uses it for real-time analytics.
Apache Kafka What is it?

This interface helps to:


• Identify topics that are unevenly distributed across the cluster.
• Identify topics that have partition leaders unevenly distributed across the cluster.
• Manage multiple clusters, select the required replicas, re-assign replicas, and create topics.
• Obtain an overview of the cluster.
Apache Kafka How it works
Apache Kafka How it works (Contd.)

Logs refer to any kind of append-only file with ordered records. The Kafka cluster maintains a partitioned log for each topic. A partition contains an ordered sequence of messages that does not change over time, and messages are continuously appended to it.

Advantages of the partitions are:

• They work in parallel, as a single unit.
• A topic can have many partitions so that it can handle any amount of data.
• The log can scale to a size that is beyond a single server’s capacity.
Anatomy of a Topic

A topic is divided into partitions (for example, Partition 0, Partition 1, and Partition 2). Writes are continuously appended to the end of each partition's commit log, and each message within a partition is identified by its offset.
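A partitioned topic like the one above can be created and exercised from the command line. The following sketch assumes a Kafka release whose tools register topics in ZooKeeper (host and topic names are placeholders; newer releases use --bootstrap-server throughout):

kafka-topics.sh --create --zookeeper zk-host:2181 --replication-factor 2 --partitions 3 --topic user-events
kafka-topics.sh --describe --zookeeper zk-host:2181 --topic user-events

# Publish and consume a few messages
kafka-console-producer.sh --broker-list broker-host:9092 --topic user-events
kafka-console-consumer.sh --bootstrap-server broker-host:9092 --topic user-events --from-beginning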
Apache Kafka How it works (Contd.)

Producers (Producer 1 to Producer 4) publish messages to a Kafka topic. Consumers are organized into consumer groups (Consumer Group 1 with consumers 1-1, 1-2, and 1-3, and Consumer Group 2 with consumers 2-1, 2-2, and 2-3); each group subscribes to the topic and processes the messages.
Apache Kafka Distribution

The partitions of a topic (My Topic with Partition 1, Partition 2, and Partition 3) are distributed across the cluster, and producers (Producer A, Producer B, and Producer C) append messages to the partitions.
Apache Kafka Unique Features

Treats topic partitions as logs

Retains unread messages

Supports a large number of consumers

Retains large amount of data with negligible overhead


Apache Kafka Unique Features

Data streams in Kafka are structured as partitions


and each partition is a log.

Apache Kafka Unique Features

Structuring as a log is a major advantage because using logs is


the best solution, as seen in databases, and this ensures:

• Reliability of data storage


• Synchronization with a replica
• Consensus in a distributed algorithm such as Raft
• Recording of activity data
• Robustness of data infrastructure when scaled
Apache Kafka Applications

Messaging Website Activity Tracking Metrics Event Sourcing

Kafka has better throughput Kafka rebuilds user activities Kafka collects the Kafka supports storage
and built-in partitions to as real-time publish- distributed applications and of huge log data.
handle replication and subscribe feeds. It provides generates centralized feeds
achieve better fault- real-time processing, real- of operational data.
tolerance. time monitoring, offline data
warehousing systems, and
reporting. Kafka is a good
solution as activity tracking
generates high volumes of
data.
Apache Kafka Applications

Log Aggregation Stream Processing Commit Log

Kafka sometimes replaces a log Data is processed in stages. Raw data Kafka has a log compaction feature
aggregation solution. Kafka abstracts is consumed from topics and then that enables it to serve as an external
the file details and provides the log transformed into new Kafka topics to commit-log for a distributed system.
or event data as a stream of be used by applications. The log supports replication of data
messages. between nodes and re-synchronizes
This helps to achieve: failed nodes to restore data.
• lower-latency processing
• support for multiple data sources
• Provision of distributed data
consumption
Apache Spark What is it?

Let us learn the features of Apache Spark


Apache Spark What is it?

Provides fast, simple solutions for existing


development environments
Apache Spark What is it?

SQL

Integrates easily with existing applications


Supports programming languages
Apache Spark What is it?

Runs on clusters that are often used for


Hadoop jobs
Apache Spark What is it?

Amazon Web Services can easily launch Spark


Apache Spark What is it?

Integrates with data storage systems such as


Apache Hbase and Hive
Apache Spark What is it?

Spark can cache datasets in memory


Apache Spark What is it?

Applications that can leverage Spark’s features can be


created with Spark running on YARN
Apache Spark What is it?

Apache Spark consists of the Spark core and a set of


libraries that provide a platform for distributed ETL
applications
Apache Spark Users
Apache Spark Stack and Architecture

The Spark Core Engine underpins a set of libraries: Spark SQL, Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation), and Spark R (R on Spark), along with BlinkDB (approximate SQL, in alpha/pre-alpha).
Apache Spark Computing through Resilient Distributed Datasets

Spark computes through Resilient Distributed Datasets (RDDs). Operations on RDDs fall into two groups: transformations, which produce new RDDs, and actions, which return results to the driver.
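A short spark-shell session illustrates the split between transformations and actions; the input path below is hypothetical:

spark-shell --master yarn

scala> val logs = sc.textFile("hdfs:///data/web_logs")        // build an RDD from HDFS
scala> val errors = logs.filter(_.contains("ERROR")).cache()  // transformations are lazy
scala> errors.count()                                         // an action triggers computation
scala> errors.take(5).foreach(println)                        // another action returns results to the driver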
Apache Spark and Hadoop

Hadoop provides the YARN ResourceManager, HDFS, disaster recovery, data security, and a distributed data platform. Spark contributes to Hadoop-based jobs through YARN and adds rapid in-memory processing of large data volumes with SQL, streaming, and graph processing capability.
Quiz
QUIZ Which component of Apache Hive takes care of session handling and connectivity of
Hive with Hadoop?
1

a. Execution engine

b Compiler
.
c. Metastore

d. Driver
QUIZ Which component of Apache Hive takes care of session handling and connectivity of
Hive with Hadoop?
1

Execution engine
a.

Compiler
b
.
c. Metastore

d. Driver

The correct answer is d.

Explanation: Driver is the component of Apache Hive that takes care of session handling and connectivity of Hive with
Hadoop.
QUIZ
Which component of Apache Hive defines the mode in which Hive is setup and
contains metadata about data?
2

a. Compiler

b Execution Engine
.
c. Metastore

d. Metastore service
QUIZ
Which component of Apache Hive defines the mode in which Hive is setup and
contains metadata about data?
2

a. Compiler

b Execution Engine
.
c. Metastore

Metastore service
d.

The correct answer is c .


Explanation: Metastore is the component of Apache Hive that defines the mode in which Hive is setup and contains
metadata about data.
QUIZ Which setup of Hive allows multiple users to access Hive via CLI or Hue’s Hive
interface?
3

a. Embedded metastore

b Local metastore
.
c. Remote metastore

d. Metastore service
QUIZ Which setup of Hive allows multiple users to access Hive via CLI or Hue’s Hive
interface?
3

a. Embedded metastore

b Local metastore
.
c. Remote metastore

d. Metastore service

The correct answer is b.

Explanation: Local metastore is the setup of Hive that allows multiple users to access Hive via CLI or Hue’s Hive
interface.
QUIZ Which Apache service can be used to capture unstructured, incremental, and user
interaction data?
4

a. Apache Kafka

b Apache Spark
.
c. Apache Hive

d. Apache HBase
QUIZ Which Apache service can be used to capture unstructured, incremental, and user
interaction data?
4

a. Apache Kafka

b Apache Spark
.
c. Apache Hive

d. Apache HBase

The correct answer is d.

Explanation: Apache Hbase can be used to capture unstructured, incremental, and user interaction data.
QUIZ What are three main components of Apache Hbase that enable the working of Hbase
in a Hadoop cluster?
5

a. WAL, Hfiles, and memstore

b HMaster, Hregionserver, and Zookeeper


.
c. Column family, Region, and Regionservers

d. Block cache, memstore, and Hfiles


QUIZ What are three main components of Apache Hbase that enable the working of Hbase
in a Hadoop cluster?
5

a. WAL, Hfiles, and memstore

b HMaster, Hregionserver, and Zookeeper


.
c. Column family, Region, and Regionservers

d. Block cache, memstore, and Hfiles

The correct answer is b.

Explanation: The three main components of Apache Hbase that enable the working of Hbase in a Hadoop cluster are
HMaster, Hregionserver, and Zookeeper.
QUIZ
How many regions can a regionserver serve?
6

a. 10000

b Any number
.
c. 1000

d. Same as the number of region servers in a cluster


QUIZ How many regions can a regionserver serve?

a. 10000

b Any number
.
c. 1000

d. Same as the number of region servers in a cluster

The correct answer is c.

Explanation: A region server can serve 1000 regions.


QUIZ
Which of the following maintains the information about server states in a cluster and
acts as a distribution coordination service?
7

a. Zookeeper

b HMaster
.
c. Zookeeper Quorum

d. Ephemeral Nodes
QUIZ
Which of the following maintains the information about server states in a cluster and
acts as a distribution coordination service?
7

a. Zookeeper

b HMaster
.
c. Zookeeper Quorum

d. Ephemeral Nodes

The correct answer is c.

Explanation: Zookeeper Quorum maintains the information about server states in a cluster and acts as a distribution
coordination service.
QUIZ
In Apache Kafka, what are the processes that subscribe to topics and process the
messages?
8

a. Kafka brokers

b Partitions
.
c. Consumers

d. Kafka logs
QUIZ
In Apache Kafka, what are the processes that subscribe to topics and process the
messages?
8

a. Kafka brokers

b Partitions
.
c. Consumers

d. Kafka logs

The correct answer is c.

Explanation: Consumers are the processes that subscribe to topics and process the messages.
QUIZ
Which server handles read/write requests for a partition within a Kafka cluster?
9

a. Follower

b Leader
.
c. Producer

d. Consumer
QUIZ
Which server handles read/write requests for a partition within a Kafka cluster?
9

a. Follower

b Leader
.
c. Producer

d. Consumer

The correct answer is b.

Explanation: The leader server handles read/write requests for a partition within a Kafka cluster.
QUIZ
Which component of the Apache Spark stack can be used for hypothesis testing,
regression analysis, classification, and principal component analysis?
10

a. Dataframe API

b MLlib
.
c. Spark Streaming

d. Spark core
QUIZ
Which component of the Apache Spark stack can be used for hypothesis testing,
regression analysis, classification, and principal component analysis?
10

a. Dataframe API

b MLlib
.
c. Spark Streaming

d. Spark core

The correct answer is b.

Explanation: MLlib is the component of the Apache Spark stack that can be used for hypothesis testing, regression analysis,
classification, and principal component analysis.
Several services or open-source components work within the Hadoop ecosystem.
These include Apache Hive, Apache Pig, Impala, HBase, Apache Kafka, and Apache
Spark.
Apache Hive is a data warehouse infrastructure built on top of Hadoop to
provision data summarization, query, and analysis.
HBase is a service that is built on top of Hadoop and Zookeeper.
It is also called Hadoop Database.
Kafka is a fast, scalable, and durable distributed messaging system. It
follows the publish-subscribe messaging pattern.

Apache Spark is an engine for large-scale data processing.


This concludes the lesson “Hadoop Ecosystem Components.”
The next lesson is “Hadoop Security.”

Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 12—Hadoop Security
Describe the different ways to avoid risks and secure data

Identify the different threat categories

Describe the security aspects for different nodes

Describe operating system security

Describe Kerberos and how it works

Describe Service Level Authorization


Lesson 12: Hadoop Security
Topic 12.1: Security models
Secure Data

The Hadoop ecosystem and its components, along with its processing frameworks, allow you to store data and process it in new and exciting ways. The risks of leaving that data unsecured include time and financial loss, privacy violations, damage to business continuity plans, regulatory infractions, data integrity compromise, and damage to corporate image and shareholder value.
CIA Model

The CIA model consists of Confidentiality, Integrity, and Availability.

The three components of the model:

• Can be applied to a wide range of information systems
• Help the CIA model to organize information security
Confidentiality

Confidentiality is a security principle that emphasizes the notion that information should only be seen by the intended or authorized recipients.

Confidentiality is established by using a unique identity or identification method that allows data access to specific personnel.

Encryption, though not mandatory, is an important concept of confidentiality. It is a mechanism of applying mathematical algorithms to mask data or information.
Integrity

Integrity is an important component of information


or data security. It ensures that information remains
unchanged and uncompromised irrespective of
whether it stays in one place or travels from one
point to another.
Availability

Availability is about preparedness.

Availability of data or services can be impacted by:


• Regular outages
• Implementation of security patches
• Security events
Tripple A’s

AAA

Accounting

Authorization

Authenticatio
n
Authentication, Authorization, and Accounting

The triple A’s refer to architectural pattern in computer security.

Users are granted A record of user


Users prove their identify access based on actions is
predefined rules and maintained for
policies auditing purposes

Identity and identification method refers to the process in the system that distinguishes between
different entities, users, and services and allows or disallows the user to access the data.
Pillars of Enterprise Security

The five pillars of enterprise security are:

Administration Authorization Data protection

Authentication Audit
Securing Distributed Systems

Potential threats and complexity:
• Are directly proportional to the magnitude of the system being distributed
• Increase as the distributed system is scaled

To arrive at a robust security architecture, it is important to:
• Understand the probable and potential threats
• Categorize them
Securing Distributed Systems: Example
Threat Categories

Insider threat
The attack comes from a business or from regular
users such as employees, contractors, or
consultants.

Threat
Unauthorized access or Denial of service
Categories
masquerade
Denial of service is a situation
A masquerade attack refers to an where one or more clients are
event where an invalid user unable to access a service.
presents himself or herself as a
valid user by obtaining valid
credentials.

In a distributed environment, attacks to distributed multiple systems might not be possible.


Risk Assessment

Risks of potential threats can be assessed through:

• User assessment: the important process of assessing users who will have direct or indirect access to the distributed system
• Environment assessment: the critical process of assessing the distributed system and its environment
• Vulnerabilities: observing and fixing software or network vulnerabilities by using regular software or network patches
Lesson 12: Hadoop Security
Topic 12.2: Implementation
Security Strategy and Implementation

Each Hadoop ecosystem component has services and each service has roles running on different nodes.

Service Roles

HDFS: NameNode (Active/Standby/Secondary), DataNode, JournalNode, FailoverController, HttpFS, NFSGateway

YARN: ResourceManager (Active/Standby), NodeManager, JobHistory Server

Hive: Hive Metastore Server, HiveServer2, WebHCatServer

The separation strategy involves the following steps:


• Identify all the master services to be run on master nodes and worker services on worker nodes.
• Identify which components require client configuration files to be deployed so that users can
access the services.
Master Nodes

Master nodes are the most important nodes of the cluster; therefore, they have a strict security policy to protect them. Only administrators are allowed to access the master nodes. The reasons for this limitation are as follows:
• Avoid any chance of resource contention
• Avoid security vulnerabilities
Worker Nodes

Worker nodes handle the bulk of the functions of a Hadoop cluster, including the storing and processing of data. Only administrators are allowed to access the worker nodes. The reasons for this limitation are as follows:
• Avoids resource contention and skew in resource management
• Avoids worker role skew in behavior
Management Nodes

Management nodes provide the mechanism to install, configure, monitor, and maintain the Hadoop cluster. Administrators are allowed to access the management nodes. The most critical role hosted on a management node is the configuration management software; administrators configure the cluster here.
Edge Nodes

Edge nodes host web interfaces, proxies, and client configurations


that ultimately provide the mechanism for users to work with
Hadoop cluster. Users have direct or remote access to these nodes.
Operating System Security

The Hadoop ecosystem is inherently complex and needs tools and


access methods that are available to allow access to cluster nodes.

Examples of such tools and access methods are Firewall and


SELinux.
Firewalls

iptables -N hdfs
iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 8020 -j ACCEPT

iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 8022 -j ACCEPT


Users’ access to specific nodes can
be controlled through remote
access control tools. Host firewalls iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 50070 -j ACCEPT
can be used to limit the types of
traffic going into and out of a node.
iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 50470 -j ACCEPT

iptables -A INPUT -j hdfs


SELinux

SELinux implements security at kernel level and provides Linux kernel enhancements.
In a CDH setup, this has to be disabled on every node of the cluster.

Disabled: SELinux is not active and does not provide any additional level of security to the operating system.
Permissive: SELinux is enabled but does not protect the system.
Enabled: SELinux protects the system based on the specified SELinux policy.
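On each node, the current SELinux mode can be checked and, where CDH requires it, disabled. A typical sequence on a Red Hat-style system (verify against your OS documentation) is:

getenforce                  # prints Enforcing, Permissive, or Disabled
setenforce 0                # switches to permissive mode until the next reboot
# Make the change permanent by setting SELINUX=disabled in /etc/selinux/config,
# then reboot the node.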
Kerberos

• Kerberos is a Network Authentication Protocol built on


the assumption that network connections are unreliable.

• Kerberos uses secret-key cryptography to enable strong


authentication by providing user-to-server
authentication.

Kerberos • Kerberos is one of the most used and relied-on security


mechanisms to protect Hadoop clusters and prevent
unauthorized access to data stored on the clusters.
Need for Kerberos

By default, Hadoop does not question or verify the identity of the user
accessing the cluster.

By default, everyone has read access to the cluster and the petabytes of the
data that it stores.

In the case of large clusters, managing access to the cluster at user, group, or
data level is not enough to protect it.

Someone has to verify the identity of the user or the service before the
cluster and its data is hampered.
Kerberos Internals

Hadoop uses Kerberos as the basis for strong authentication


and identity propagation for both users and services.

Kerberos is a third party authentication mechanism.

The Kerberos server itself is known as the Key Distribution


Center or KDC.
Kerberos Terminologies

• Key Distribution Center (KDC)
• Kerberos server
• Kerberos client
• KDC admin account
• Principals
• Realm
• Keytabs
• Tickets
Implementing Kerberos in CDH

If Kerberos is enabled, authentication in a CDH cluster is handled by it.

Integration of Kerberos with AD or LDAP is also possible by allowing options to


manage and store credentials in AD.

Without Kerberos enabled, Hadoop only examines a user and his or her
group membership to verify if he or she is allowed to access HDFS.

With Kerberos enabled, a user must first authenticate himself or herself to a


Kerberos KDC to obtain a valid TGT.
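Once Kerberos is enabled, a user or service must obtain a TGT before working with the cluster. A typical session, assuming a hypothetical principal alice@EXAMPLE.COM, looks like this:

kinit alice@EXAMPLE.COM                     # prompts for the password and obtains a TGT
klist                                       # lists the cached tickets, including the TGT
hdfs dfs -ls /user/alice                    # now succeeds against the Kerberized cluster

# Services and scripts usually authenticate with a keytab instead of a password
kinit -kt /etc/security/keytabs/alice.keytab alice@EXAMPLE.COM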
Demonstration 1:
Enabling Kerberos
Lesson 12: Hadoop Security
Topic 12.3: Service Level Authorization
Securing Cluster

Service Level Authorization is the initial


authorization mechanism to ensure that clients
connecting to a particular Hadoop service have
the necessary, pre-configured permissions and
are authorized to access the given service.

A MapReduce cluster can use this mechanism to allow a configured list of users/groups to submit jobs.
It is disabled by default.
To enable: Edit $HADOOP_CONF_DIR/core-site.xml
Property: hadoop.security.authorization = true
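A minimal sketch of the change in core-site.xml, followed by reloading the policy without a restart, is shown below:

<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>

After editing $HADOOP_CONF_DIR/hadoop-policy.xml, the ACLs can be reloaded with:
hdfs dfsadmin -refreshServiceAcl
yarn rmadmin -refreshServiceAcls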
Configuration Properties

Property Service

security.client.protocol.acl ACL for ClientProtocol, which is used by user code via the DistributedFileSystem.

ACL for DatanodeProtocol, which is used by datanodes to communicate with the


security.datanode.protocol.acl
namenode.

ACL for NamenodeProtocol, the protocol used by the secondary namenode to


security.namenode.protocol.acl
communicate with the namenode.

ACL for InterTrackerProtocol, used by the tasktrackers to communicate with the


security.inter.tracker.protocol.acl
jobtracker.

security.job.submission.protocol.acl
ACL for JobSubmissionProtocol, used by job clients to communicate with the jobtracker
for job submission, querying job status, etc.

ACL for RefreshAuthorizationPolicyProtocol, used by the dfsadmin and mradmin


security.refresh.policy.protocol.acl
commands to refresh the security policy in-effect.
Implement Service Level Authorization

The access control list for each Hadoop service is defined in $HADOOP_CONF_DIR/hadoop-policy.xml.

An ACL value takes the form: user1,user2,user3 group1,group2,group3

The list of users includes only the users who are allowed to access the service. The list of groups, separated from the user list by a blank space, includes the groups that have access to the service. All users who have access to the service are denoted by '*'. If no value is specified for a property, then no one has access to the service.
Implement Service Level Authorization

If the access control list is not defined for a service, the value of security.service.authorization.default.acl is applied. If security.service.authorization.default.acl is not defined, then * is applied.

<property>
<name>security.job.submission.protocol.acl</name>
<value>user1,user2 yarngrp</value>
</property>

To specify a blocked access control list for a service, suffix 'blocked' at the end of the property name.

Example: security.client.protocol.acl becomes security.client.protocol.acl.blocked
HDFS ACLS

• ACLs are used to restrict access to HDFS.

• HDFS supports a permission model equivalent to


traditional Unix permission.

• For each file or directory, permissions are


managed for owner, group, and others.

• There are three different permissions controlled


for each user class: read, write, and execute.
IMPLEMENT ACLS

The property used to enable ACLs is the following:

<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>

The commands used to interact with ACLs are the following:

hdfs dfs -setfacl -m group:groupname:r-- /
hdfs dfs -getfacl /
hdfs dfs -setfacl -m default:group:groupname:--- /

The property used for admins is dfs.cluster.administrators = ACL-for-admins.
YARN ACLS

ACLs can be enabled for YARN processing and used to control who can act as the administrator
of the YARN cluster or submit jobs to the YARN cluster and its configured queues.

The property is yarn.acl.enable = true

Queue Level Implementation: set properties in yarn-site.xml for the capacity scheduler implementation.

Fair-Scheduler: set the properties aclSubmitApps and aclAdministerApps with the list of users/groups in the allocations file.
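A minimal sketch of both styles is shown below; the queue, user, and group names are hypothetical:

<!-- yarn-site.xml -->
<property>
<name>yarn.acl.enable</name>
<value>true</value>
</property>

<!-- Fair Scheduler allocations file -->
<queue name="analytics">
<aclSubmitApps>user1,user2 analystgrp</aclSubmitApps>
<aclAdministerApps>hadoopadmin</aclAdministerApps>
</queue>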
Controlling Via Commands

They grant ownership of specific directories to specific users.

They grant read/write/execute permissions to files/directories to


specific users on HDFS

To set a limit on the number of files and directories in a particular path, use setQuota <limit>.

To set a limit on the volume of data that can be written to HDFS in a particular path, use setSpaceQuota.
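For example (paths and limits are placeholders):

hdfs dfsadmin -setQuota 10000 /user/projectA        # at most 10,000 names (files and directories)
hdfs dfsadmin -setSpaceQuota 500g /user/projectA    # at most 500 GB of raw disk space
hadoop fs -count -q /user/projectA                  # view current quotas and usage
hdfs dfsadmin -clrQuota /user/projectA              # remove the name quota
hdfs dfsadmin -clrSpaceQuota /user/projectA         # remove the space quota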
Demonstration 2:
Using Quotas to control Amount of Data
Written in HDFS
Demonstration 3:
Granting Permission to Users on HDFS
Demonstration 4:
Enabling ACLS for YARN and HDFS
Quiz
QUIZ
What are the three critical components of Secure computing?
1

a. Confidentiality, integrity, and Availability

b. Identity, integrity, and Availability

c. Authentication, authorization, and accounting

d. Availability, authorization, and confidentiality


QUIZ
What are the three critical components of Secure computing?
1

a. Confidentiality, integrity, and Availability

b. Identity, integrity, and Availability

c. Authentication, authorization, and accounting

d. Availability, authorization, and confidentiality

The correct answer is c .


Explanation: The three critical components of Secure computing are Authentication,
authorization, and accounting.
QUIZ Which component of CIA model ensures that information remains
2 unchanged and uncompromised?

a. Confidentiality

b. Integrity

c. Availability

d. Identity
QUIZ Which component of CIA model ensures that information remains
2 unchanged and uncompromised?

a. Confidentiality

b. Integrity

c. Availability

d. Identity

The correct answer is b.


Explanation: Integrity ensures that information remains unchanged and uncompromised.
QUIZ Which Pillar of enterprise security constitutes provisioning access to
3 data?

a. Authorization

b. Data protection

c. Administration

d. Authentication
QUIZ Which Pillar of enterprise security constitutes provisioning access to
3 data?

a. Authorization

b. Data protection

c. Administration

d. Authentication

The correct answer is a .


Explanation: The Pillar of enterprise security that constitutes provisioning access to data is
authorization.
QUIZ Which category of threat is most dangerous and arises when an
unauthorized user has access to data via some unknown authorized
4 user?

a. Denial of Service

b. Unauthorized access

c. Masquerade

d. Insider threat
QUIZ Which category of threat is most dangerous and arises when an
unauthorized user has access to data via some unknown authorized
4 user?

a. Denial of Service

b. Unauthorized access

c. Masquerade

d. Insider threat

The correct answer is d.


Explanation: Insider threat is the most dangerous threat and arises when an unauthorized
user has access to data via some unknown authorized user.
QUIZ What is the preferred status of SELINUX in CDH to implement security at
5 kernel level?

a. Permissive

b. Enabled

c. Disabled

d. Blocked
QUIZ What is the preferred status of SELINUX in CDH to implement security at
5 kernel level?

a. Permissive

b. Enabled

c. Disabled

d. Blocked

The correct answer is c .


Explanation: Disabled is the preferred status of SELINUX in CDH to implement security at
kernel level.
QUIZ What is the trusted source for authentication in a Kerberos-enabled
6 environment called?

a. KDC-Key Distribution centre

b. Kerberos client

c. Principal

d. Authentication Server
QUIZ What is the trusted source for authentication in a Kerberos-enabled
6 environment called?

a. KDC-Key Distribution centre

b. Kerberos client

c. Principal

d. Authentication Server

The correct answer is a .


Explanation: KDC-Key Distribution centre is the trusted source for authentication in a
Kerberos-enabled environment.
QUIZ In a Kerberos-enabled environment, who takes care of initial
7 authentication and issues a TGT (Ticket Granting Ticket)?

a. Ticket granting Server

b. Kerberos database

c. Kerberos server

d. Authentication Server
QUIZ In a Kerberos-enabled environment, who takes care of initial
7 authentication and issues a TGT (Ticket Granting Ticket)?

a. Ticket granting Server

b. Kerberos database

c. Kerberos server

d. Authentication Server

The correct answer is d.


Explanation: The Authentication Server takes care of initial authentication and issues a TGT
in a Kerberos enabled environment.
QUIZ What is the file that contains resource principal’s authentication
8 credentials called?

a. Realm

b. Admin principal

c. Keytab

d. Ticket
QUIZ What is the file that contains resource principal’s authentication
8 credentials called?

a. Realm

b. Admin principal

c. Keytab

d. Ticket

The correct answer is c .


Explanation: The file that contains resource principal’s authentication credentials is called
Keytab.
QUIZ What is the default status of Service Level Authorization in any Hadoop
9 Cluster?

a. Enabled

b. Disabled

c. Inactive

d. Active
QUIZ What is the default status of Service Level Authorization in any Hadoop
9 Cluster?

a. Enabled

b. Disabled

c. Inactive

d. Active

The correct answer is b.


Explanation: The default status of Service Level Authorization in any Hadoop Cluster is
‘Disabled’.
QUIZ How do we control the volume of data that can be written by an
10 authorized user on HDFS?

a. By using small & multiple disks

b. By using Quotas

c. By enabling ACLs

d. By blocking NameNode’s RAM


QUIZ How do we control the volume of data that can be written by an
10 authorized user on HDFS?

a. By using small & multiple disks

b. By using Quotas

c. By enabling ACLs

d. By blocking NameNode’s RAM

The correct answer is b.


Explanation: We can control the volume of data that can be written by an authorized user
on HDFS using Quotas.
One of the main information security models is the CIA
model. It stands for confidentiality, integrity, and availability.
The components of Triple A’s are authentication,
authorization, and accounting. These are critical to secure
computing.
The five pillars of enterprise security are administration,
authentication, authorization, audit, and data protection.
The three threat categories are unauthorized access or
masquerade, insider threat, and denial of service.
Risks of potential threats can be assessed by user
assessment, environment assessment, and vulnerabilities.
Each Hadoop ecosystem component has services, and each
service has roles running on different nodes that take care
of the service’s functionality.
The Hadoop ecosystem is inherently complex and needs tools
and access methods to allow access to cluster nodes.
Examples of such tools and access methods include Firewall
and SELinux.
Kerberos is a Network Authentication Protocol built on the
assumption that network connections are unreliable.
Service Level Authorization is the initial authorization
mechanism to ensure that clients connecting to a particular
Hadoop service have the necessary, pre-configured,
permissions and are authorized to access the given service.
This is performed before performing other access control
checks at file level on HDFS or before checking permissions at
job queue level.
This concludes the lesson “Hadoop Security.”
The next lesson is “Hadoop Cluster Monitoring.”

Disclaimer: All the logos used in this course belong to the respective organizations
Describe cluster monitoring

Describe the ways to choose the right monitoring solutions

List the features and considerations of Cloudera


manager for monitoring

Describe the different categories of Hadoop Metrics

List the different types of Hadoop Metrics

List the steps to monitor a cluster by using


Cloudera Manager
Hadoop Cluster Monitoring
Cluster Monitoring
Hadoop Cluster Monitoring

Organizations monitor clusters, the various components within them, and the systems within the clusters. Building mission-critical systems or Hadoop clusters to host massive data for processing requires mechanisms to know their operational state and ways to gather and use performance metrics.
Making a Choice

Hadoop comes with a monitoring challenge.

Considerations: scalability, flexibility, extensibility, and zero configuration.
Hadoop Performance Monitoring Tools: Features

Real-time monitoring of the performance of applications running in a Hadoop cluster is very important. Application performance monitoring solutions provide information on what has failed and the reason.

Features to look for include:
• Application or node performance visualization
• Monitoring of application execution
• Performance history and trend analysis
• Compatibility
• Consolidated monitoring across technologies
• Notifications and alerts
• Multi-cluster support and reporting
• Custom views
• Task performance metrics
• Application service level management
• Cluster monitoring
• Self-service troubleshooting
Categorizing Monitoring Solutions and Monitoring

Monitoring systems can be categorized by metric collection and the use of metrics.

Categorizing Monitoring Solutions and Monitoring

Monitoring is divided into health monitoring and performance monitoring.
Monitoring Examples

• Checking if each daemon runs with appropriate memory


consumption, when monitoring HDFS

• Checking if each daemon responds to requests in the


defined window of time

• Checking the percentage of slave machines that


communicate with the master node or look into block
distribution on save nodes

• Checking how applications perform on top of HDFS


Cloudera Manager for Monitoring

Cloudera Manager provides visibility into the overall cluster: its hosts, services, and nodes. It:
• Tracks the performance and resource demands of the user jobs running on your cluster
• Notifies asynchronously when an important event of interest occurs
• Manages some of its helper services
• Allows monitoring with ease without installing additional tools or configuration
Cloudera Manager for Monitoring: Capabilities

• Hadoop services: information from specific metrics is collected, aggregated, and presented on dashboards or in the form of charts; this helps diagnose problems.
• Alert notifications: Cloudera Manager can be configured to generate alerts on a variety of events.
• Hosts: host monitoring allows you to view information related to all the hosts in the cluster.
• User activities: activity monitoring allows you to monitor the activities that are running on the cluster and the users running them.
• Report generation: reports provide a historical view and current statistics pertaining to various aspects of the cluster.
Service Monitor

Cloudera Manager collects metrics, which are simple numeric values. Charts help users to query and explore the metrics being collected.

Health Checks: the Service Monitor helps to:
• Evaluate "health checks" for every entity in the system
• Check for disk space in every node
• Check for successful checkpoints or connectivity of DataNodes with the NameNode
• Project the status of a service based on health checks done on underlying daemons
Sample Yarn Metrics
Hadoop Cluster Monitoring
Metrics
Hadoop Metrics Details

Each daemon can be configured to collect metrics data from internal components at regular intervals and then aggregate the metrics. Related metrics are grouped into a named context, such as jvm, rpc, dfs, and mapred. Some contexts are common to all daemons; other contexts apply only to daemons of a specific service.
Handling Metrics Data Using Plug-ins

Disabling metrics (the NullContext plug-in simply discards metric updates):

# hadoop-metrics.properties
jvm.class = org.apache.hadoop.metrics.spi.NullContext
dfs.class = org.apache.hadoop.metrics.spi.NullContext

Writing metrics to local files (FileContext appends a record every period, specified in seconds):

# hadoop-metrics.properties
jvm.class = org.apache.hadoop.metrics.file.FileContext
jvm.period = 10
jvm.fileName = /tmp/jvm-metrics.log
dfs.class = org.apache.hadoop.metrics.file.FileContext
dfs.period = 10
dfs.fileName = /tmp/dfs-metrics.log
Handling Metrics Data Using Plug-ins (Contd.)

Ganglia plug-in classes:
• org.apache.hadoop.metrics.ganglia.GangliaContext (for Ganglia 3.0 and earlier)
• org.apache.hadoop.metrics.ganglia.GangliaContext31 (for Ganglia 3.1 and later)
Handling Metrics Data Using Plug-ins (Contd.)

01. gmond collects metrics locally
02. gmond relays the data to a central gmetad process
03. gmetad records the data in a series of RRD (round-robin database) files
04. A PHP web application, served by an Apache Web Server, displays the data
Handling Metrics Data Using Plug-ins (Contd.)

# hadoop-metrics.properties (sample)
jvm.class = org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period = 10
jvm.servers = 10.0.0.xxx
# The servers value may be a comma-separated list of host:port pairs.
# The port is optional, in which case it defaults to 8649.
# jvm.servers = gmond-host-a, gmond-host-b:8649
dfs.class = org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period = 10
dfs.servers = 10.0.0.xxx
Hadoop Metrics
Health Monitoring

• Which metrics are important?
• Which metrics represent the health of the monitored services?
• What thresholds should be set to indicate issues and generate alerts, in alignment with cluster usage and growth?
Hadoop Metrics: Categories

Hadoop metrics fall into the following categories:
• HDFS Metrics
• MapReduce Counters
• YARN Metrics
• ZooKeeper Metrics
HDFS Metrics

HDFS metrics are grouped into:
• NameNode-emitted metrics
• NameNode JVM metrics
• DataNode metrics

NameNode Metrics

Metrics emitted by the NameNode include:
• CapacityRemaining
• MissingBlocks
• NumDeadDataNodes
• FilesTotal
• TotalLoad
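All of these NameNode metrics can also be read programmatically from the NameNode's built-in JMX servlet, which is how many external monitoring tools collect them. A minimal sketch, assuming a hypothetical NameNode host and the default web UI port (9870 on Hadoop 3.x; 50070 on Hadoop 2.x); the FSNamesystem bean is the usual location for these counters, though bean layouts can differ slightly between Hadoop versions.

import json
from urllib.request import urlopen

# Hypothetical NameNode host; 9870 is the Hadoop 3.x default web UI port.
NAMENODE = "http://namenode.example.com:9870"

def fsnamesystem_metrics():
    # The /jmx servlet returns registered MBeans as JSON; the qry parameter
    # narrows the response to the bean that carries the HDFS-level counters.
    url = NAMENODE + "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    beans = json.load(urlopen(url, timeout=10)).get("beans", [])
    return beans[0] if beans else {}

if __name__ == "__main__":
    metrics = fsnamesystem_metrics()
    for name in ("CapacityRemaining", "MissingBlocks", "FilesTotal", "TotalLoad"):
        print(name, metrics.get(name))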

CapacityRemaining

CapacityRemaining is the total available capacity remaining across the entire HDFS cluster.

• DataNodes that are out of space are likely to fail on boot.
• Any running jobs that write out temporary data may fail due to lack of capacity.
• It is a good practice to ensure that disk use never exceeds 80 percent of capacity.
MissingBlocks

A corrupt block is not the same as a missing block: HDFS can recover a corrupt replica from the remaining healthy replicas.

1. The NameNode receives a read request from a client.
2. It locates the requested block and serves it to the client.
3. If the checksum does not match, the client reports the corruption.
4. The NameNode tells the original DataNode to delete its corrupted copy, and the client reads the data from one of the other replicas.
5. The NameNode, meanwhile, schedules a re-replication of the block from one of the healthy copies.
MissingBlocks
• A missing block cannot be recovered by copying a replica.
• If a series of DataNodes were taken offline for maintenance, missing blocks may be reported until they are brought back up.
NumDeadDataNodes

 Ideally, the number of live DataNodes will be equal to the number of DataNodes provisioned for the cluster.
 If the number of live DataNodes drops unexpectedly, it may warrant an investigation.
 When the NameNode does not hear from a DataNode within 30 seconds, the DataNode is marked "stale."
 If the DataNode fails to communicate with the NameNode within 10 minutes, the DataNode is marked "dead." (The sketch below shows where these thresholds come from.)

The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of the blocks lost on the dead node.
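The 30-second and 10-minute figures are not magic numbers; they fall out of the heartbeat-related settings in hdfs-site.xml. A minimal sketch of the arithmetic, assuming the stock defaults (all three properties are tunable):

# Stock HDFS defaults (all configurable in hdfs-site.xml):
heartbeat_interval_s = 3          # dfs.heartbeat.interval
recheck_interval_ms = 300_000     # dfs.namenode.heartbeat.recheck-interval
stale_interval_ms = 30_000        # dfs.namenode.stale.datanode.interval

# A DataNode is marked "stale" once no heartbeat has been seen for the
# stale interval, and "dead" after 2 x recheck interval + 10 x heartbeat.
stale_after_s = stale_interval_ms / 1000
dead_after_s = 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s

print(f"marked stale after ~{stale_after_s:.0f} s")        # ~30 s
print(f"marked dead after ~{dead_after_s / 60:.1f} min")   # ~10.5 min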
FilesTotal

FilesTotal is a running count of the number of files being tracked by the NameNode. The NameNode stores all metadata in memory.

• As the number of files tracked increases, the memory required by the NameNode also increases.
• Each object (file, directory, and block) tracked by the NameNode consumes roughly 150 bytes of memory.
• The default replication factor is 3 (so 900 bytes per file), and each replica adds an additional 16 bytes, resulting in a total of roughly 1 KB of metadata per file (see the estimate below).
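These per-object figures lend themselves to a quick, back-of-the-envelope sizing exercise for NameNode heap. The sketch below simply applies the lesson's ~1 KB-per-file rule of thumb; the 2x headroom factor is an assumed safety margin for directories, snapshots, and GC overhead, not a measured value.

# Rule of thumb from this lesson: ~1 KB of NameNode metadata per file
# (file object + block objects + replica overhead at replication factor 3).
BYTES_PER_FILE = 1024

def estimated_namenode_heap_gb(num_files, headroom=2.0):
    # headroom is an assumed safety factor, not a measured constant.
    return num_files * BYTES_PER_FILE * headroom / 1024**3

print(round(estimated_namenode_heap_gb(10_000_000), 1), "GB for 10 million files")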
TotalLoad

TotalLoad is the current number of concurrent file accesses (read/write) across all DataNodes.

• Since worker nodes running the DataNode daemon also perform MapReduce tasks, extended periods of high I/O, indicated by a high TotalLoad, generally translate to degraded job execution performance.
• Tracking TotalLoad over time can help get to the bottom of job performance issues.
NameNode JVM Metrics

• The NameNode runs in a Java virtual machine (JVM).
• It depends on Java garbage collection processes to free memory.
• More activity in the cluster triggers more garbage collection.

Excessive pauses during garbage collection can be fixed by upgrading the JDK version or the garbage collector. Additionally, the Java runtime can be tuned to minimize garbage collection.
DataNode Metrics

Metrics emitted by the DataNode include:

• Remaining: points to the remaining disk space on DataNodes
• NumFailedVolumes: shows the number of failed volumes among the total number of disks or volumes in DataNodes
MapReduce Counters

The MapReduce framework exposes a number of counters to track statistics on MapReduce job runs.

Counter types: job counters, task counters, file system counters, and custom counters.

Counters can be viewed through the Hue Job Browser, the ResourceManager and NodeManager web UIs, and terminals.
MapReduce Counters

Examples of MapReduce counters:
• MILLIS_MAPS / MILLIS_REDUCES
• NUM_FAILED_MAPS / NUM_FAILED_REDUCES
• DATA_LOCAL_MAPS / RACK_LOCAL_MAPS / OTHER_LOCAL_MAPS
• REDUCE_INPUT_RECORDS

Task counters:
• GC_TIME_MILLIS
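Counters for completed jobs can also be fetched over HTTP from the MapReduce JobHistory Server REST API. A minimal sketch, assuming a hypothetical history server host (19888 is the default port) and a placeholder job ID; the JSON field names follow the History Server "job counters" API, but are worth verifying against your Hadoop version.

import json
from urllib.request import urlopen

# Hypothetical JobHistory Server and a placeholder job ID.
HISTORY_SERVER = "http://historyserver.example.com:19888"
JOB_ID = "job_1700000000000_0042"

url = f"{HISTORY_SERVER}/ws/v1/history/mapreduce/jobs/{JOB_ID}/counters"
groups = json.load(urlopen(url, timeout=10))["jobCounters"]["counterGroup"]

# Print every counter, e.g. MILLIS_MAPS, NUM_FAILED_MAPS, GC_TIME_MILLIS.
for group in groups:
    for counter in group.get("counter", []):
        print(group.get("counterGroupName"), counter.get("name"),
              counter.get("totalCounterValue"))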
YARN Metrics

YARN metrics are grouped into:
• Cluster Metrics
• Application Metrics
• NodeManager Metrics
Cluster Metrics

Name                  | Description                                          | Metric Type
unhealthyNodes        | Number of unhealthy nodes                            | Resource: Error
activeNodes           | Number of currently active nodes                     | Resource: Availability
lostNodes             | Number of lost nodes                                 | Resource: Error
appsFailed            | Number of failed applications                        | Work: Error
totalMB / allocatedMB | Total amount of memory / amount of memory allocated  | Resource: Utilization
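The cluster metrics in the table above are exposed by the ResourceManager REST API, so they can be scraped without any additional agents. A minimal sketch, assuming a hypothetical ResourceManager host and the default web port 8088:

import json
from urllib.request import urlopen

# Hypothetical ResourceManager host; 8088 is the default web/REST port.
RESOURCE_MANAGER = "http://resourcemanager.example.com:8088"

# /ws/v1/cluster/metrics returns a single "clusterMetrics" object.
metrics = json.load(
    urlopen(RESOURCE_MANAGER + "/ws/v1/cluster/metrics", timeout=10)
)["clusterMetrics"]

for name in ("activeNodes", "unhealthyNodes", "lostNodes",
             "appsFailed", "totalMB", "allocatedMB"):
    print(name, metrics.get(name))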
Application Metrics

Application metrics provide detailed information on the execution of individual YARN applications.

Progress provides a real-time window into the execution of a YARN application. The reported value will always be in the range of zero to one (inclusive).
NodeManager Metrics

• NodeManager metrics provide resource information at the individual node level.
• containersFailed tracks the number of containers that failed to launch on a particular NodeManager.
• Launching of containers may fail due to the NodeManager's disk being full or due to insufficient heap memory.
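Per-node container metrics such as containersFailed can be read from each NodeManager's JMX servlet. A minimal sketch, assuming a hypothetical NodeManager host and the default web port 8042; the NodeManagerMetrics bean name and attribute capitalization (e.g. ContainersFailed) may vary slightly between Hadoop versions, so treat this as a starting point.

import json
from urllib.request import urlopen

# Hypothetical NodeManager host; 8042 is the default NodeManager web port.
NODEMANAGER = "http://worker01.example.com:8042"

url = NODEMANAGER + "/jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics"
beans = json.load(urlopen(url, timeout=10)).get("beans", [])
nm_metrics = beans[0] if beans else {}

# Container launch/failure counts for this node.
for name in ("ContainersLaunched", "ContainersCompleted", "ContainersFailed"):
    print(name, nm_metrics.get(name))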
ZooKeeper Metrics

ZooKeeper plays an important role in a Hadoop deployment. If High Availability is enabled, monitoring ZooKeeper metrics can be beneficial.

Useful ZooKeeper metrics include:
• zk_followers (leader only)
• zk_avg_latency
• zk_num_alive_connections
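These ZooKeeper statistics are reported by the mntr four-letter-word command over the client port. A minimal sketch, assuming a hypothetical ZooKeeper host on the default port 2181; note that recent ZooKeeper releases require mntr to be allowed via 4lw.commands.whitelist before the server will respond.

import socket

def zk_mntr(host="zk1.example.com", port=2181):
    # Hypothetical ZooKeeper server; sends the "mntr" four-letter command and
    # parses the tab-separated key/value lines it returns. Requires Python 3.8+.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"mntr")
        data = b""
        while chunk := sock.recv(4096):
            data += chunk
    return dict(line.split("\t", 1)
                for line in data.decode().splitlines() if "\t" in line)

stats = zk_mntr()
# zk_followers is reported by the leader only, hence the fallback value.
for name in ("zk_avg_latency", "zk_num_alive_connections", "zk_followers"):
    print(name, stats.get(name, "n/a"))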
Monitoring Hadoop Cluster

• The Apache Hadoop core distribution does not offer any built-in monitoring services or tools to monitor the cluster.
• The NameNode web UI and the ResourceManager web UI can be used to browse information about the cluster.
• To monitor the cluster, gather metrics, and view them, integration with third-party software such as Ganglia and Nagios is required.
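As an illustration of what such an integration looks like, here is a minimal Nagios-style check sketch that alerts on missing HDFS blocks. The NameNode host is a placeholder, the JMX endpoint is the one shown earlier, and the exit codes follow the usual Nagios convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN); a production check would add thresholds, timeouts, and authentication as appropriate.

#!/usr/bin/env python3
"""Minimal Nagios-style check: alert when HDFS reports missing blocks."""
import json
import sys
from urllib.request import urlopen

NAMENODE = "http://namenode.example.com:9870"  # hypothetical host

try:
    url = NAMENODE + "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    missing = json.load(urlopen(url, timeout=10))["beans"][0]["MissingBlocks"]
except Exception as exc:  # network error, unexpected JSON, etc.
    print(f"UNKNOWN - could not query NameNode JMX: {exc}")
    sys.exit(3)

if missing == 0:
    print("OK - no missing blocks")
    sys.exit(0)

print(f"CRITICAL - {missing} missing block(s)")
sys.exit(2)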
Cloudera Manager's Service Monitoring: Features

• Monitoring the health and performance metrics of the services and role instances running on the cluster, and checking metrics against configurable thresholds
• Presenting health and performance data in a variety of formats, including interactive charts
• Generating events related to system and service health and to critical log entries, and making them available for searching and alerting
• Maintaining a complete record of service-related actions and configuration changes
Monitor Health and Status of Services

To check the status and health of a service, click the Services tab under the Cluster tab and select All Services.

This shows the following:
 Type of service
 Service status
 Overall health of the service
 Type and number of roles configured for the service

Services show the current status of a service. The earlier status of a service can be seen by adjusting the Time Marker in the Cloudera admin interface.
Demonstration 1:
Monitoring Your CDH Cluster
Demonstration 2:
Monitoring Your CDH Cluster - 2
Quiz
QUIZ
What are the two distinct categories of monitoring?
1

a. Service monitoring and system monitoring

b. Cluster monitoring and service monitoring

c. Health monitoring and performance monitoring

d. Metrics collection and usage of metrics


QUIZ
What are the two distinct categories of monitoring?
1

a. Service monitoring and system monitoring

b. Cluster monitoring and service monitoring

c. Health monitoring and performance monitoring

d. Metrics collection and usage of metrics

The correct answer is c .


Explanation: The two distinct categories of monitoring are health monitoring and
performance monitoring.
QUIZ Which service of Cloudera manager helps in collecting information pertaining to
2 activities running on the cluster, and viewing current and historical activity?

a. Host monitor

b. Service monitor

c. Activity monitor

d. Reports manager
QUIZ Which service of Cloudera manager helps in collecting information pertaining to
2 activities running on the cluster, and viewing current and historical activity?

a. Host monitor

b. Service monitor

c. Activity monitor

d. Reports manager

The correct answer is c .


Explanation: The activity monitor of Cloudera manager helps in collecting information
pertaining to activities running on the cluster, and in viewing current and historical activity.
QUIZ
Which service of Cloudera manager does the most metric collection?
3

a. Host monitor

b. Service monitor

c. Cloudera-scm-server

d. Cloudera-scm-agents
QUIZ
Which service of Cloudera manager does the most metric collection?
3

a. Host monitor

b. Service monitor

c. Cloudera-scm-server

d. Cloudera-scm-agents

The correct answer is d .


Explanation: The Cloudera-scm-agents do most of the metric collection.
QUIZ
Which metrics-related contexts are common to all daemons?
4

a. DFS & Mapred operations

b. DFS & RPC operations

c. RPC & Mapred operations

d. JVM & RPC operations


QUIZ
Which metrics-related contexts are common to all daemons?
4

a. DFS & Mapred operations

b. DFS & RPC operations

c. RPC & Mapred operations

d. JVM & RPC operations

The correct answer is d .


Explanation: JVM & RPC operations are common to all daemons.
QUIZ Which Hadoop configuration file is updated to enable plug-ins for metrics
5 collection?

a. Core-site.xml

b. Hadoop-policy.xml

c. Hadoop-metrics.properties

d. Hdfs-site.xml
QUIZ Which Hadoop configuration file is updated to enable plug-ins for metrics
5 collection?

a. Core-site.xml

b. Hadoop-policy.xml

c. Hadoop-metrics.properties

d. Hdfs-site.xml

The correct answer is c .


Explanation: Hadoop-metrics.properties is updated to enable plug-ins for metrics collection.
QUIZ What are the two daemons of Ganglia that collect metrics and record metrics
6 and can help in displaying cluster/system metrics?

a. Ganglia master and Ganglia slave

b. GangliaContext & GangliaContext31

c. Gmond & Gmetad

d. Ganglia master and Ganglia monitor


QUIZ What are the two daemons of Ganglia that collect metrics and record metrics
6 and can help in displaying cluster/system metrics?

a. Ganglia master and Ganglia slave

b. GangliaContext & GangliaContext31

c. Gmond & Gmetad

d. Ganglia master and Ganglia monitor

The correct answer is c .


Explanation: Gmond and Gmetad are the two daemons of Ganglia that collect and record metrics and can help in displaying cluster/system metrics.
QUIZ What are the two categories of Hadoop metrics pertaining to processing
7 framework?

a. Yarn metrics and ZooKeeper metrics

b. HDFS metrics and YARN metrics

c. MapReduce counters and YARN metrics

d. NodeManager metrics and application metrics


QUIZ What are the two categories of Hadoop metrics pertaining to processing
7 framework?

a. Yarn metrics and ZooKeeper metrics

b. HDFS metrics and YARN metrics

c. MapReduce counters and YARN metrics

d. NodeManager metrics and application metrics

The correct answer is c .


Explanation: MapReduce counters and YARN metrics are the two categories of Hadoop
metrics pertaining to processing framework.
QUIZ Choose the list of metrics emitted by NameNode that help in cluster
8 monitoring.

a. Missing Blocks, JVM metrics, TotalLoad

b. NumFailedVolumes, ActiveNodes, lostNodes

c. CapacityRemaining, MissingBlocks, FilesTotal

d. appsFailed, activeNodes, unhealthyNodes


QUIZ Choose the list of metrics emitted by NameNode that help in cluster
8 monitoring.

a. Missing Blocks, JVM metrics, TotalLoad

b. NumFailedVolumes, ActiveNodes, lostNodes

c. CapacityRemaining, MissingBlocks, FilesTotal

d. appsFailed, activeNodes, unhealthyNodes

The correct answer is c .


Explanation: CapacityRemaining, MissingBlocks, and FilesTotal are metrics emitted by the NameNode that help in cluster monitoring.
QUIZ
What kind of metrics provide resource information at the individual node level?
9

a. Application metrics

b. NodeManager metrics

c. Yarn cluster metrics

d. MapReduce and task counters


QUIZ
What kind of metrics provide resource information at the individual node level?
9

a. Application metrics

b. NodeManager metrics

c. Yarn cluster metrics

d. MapReduce and task counters

The correct answer is b .


Explanation: NodeManager metrics provide resource information at the individual node
level.
QUIZ Which metrics under the MapReduce category track the time spent across all
10 map and reduce tasks?

a. Data_Local_Maps and rack_local_maps

b. Milli_maps and milli_reduces

c. Reduce_input_records and map_input_records

d. GC_time_millis
QUIZ Which metrics under the MapReduce category track the time spent across all
10 map and reduce tasks?

a. Data_Local_Maps and rack_local_maps

b. Milli_maps and milli_reduces

c. Reduce_input_records and map_input_records

d. GC_time_millis

The correct answer is b .


Explanation: Milli_maps and milli_reduces track the time spent across all map and reduce
tasks.
Considerations involved in choosing the right monitoring solution from the various solutions available include scalability, flexibility, and extensibility, among other factors.

Monitoring the performance of applications running in a Hadoop cluster is very important, and choosing the right monitoring solution is equally important. The features offered by monitoring tools can help organizations decide which tool or solution suits their Hadoop cluster.
Cloudera Manager consists of various features for
monitoring the health and performance of the
components. It also tracks the performance and resource
demands of the user jobs running on your cluster.

Hadoop metrics can be categorized into HDFS Metrics,


MapReduce Counters, YARN Metrics, and Zookeeper
Metrics.

To check the status and health of a service, click on the


Services tab under Cluster tab and select all services.
This concludes the lesson “Hadoop Cluster Monitoring.”
Thank You

Disclaimer: All the logos used in this course belong to the respective organizations
