BDH Admin Ebook
Case studies
Disclaimer: All the logos used in this course belong to the respective organizations
The Value of Data
Data Generators
Organizations seek to glean intelligence from the available data and translate that into business
advantage.
Big Data is commonly characterized by a set of "V"s:
• Volume – the scale of data, from terabytes to zettabytes
• Velocity – the speed of data, from batch to streaming
• Variety – structured and unstructured data
• Veracity – the trustworthiness and quality of the data
• Additional characteristics sometimes considered: Validity, Volatility, Viscosity, and Virality
What is Big Data? (Contd.)
Big Data is a term coined to describe datasets so large or complex that traditional data processing
solutions are inadequate to capture, store, search, share, transfer, curate, query, visualize,
analyze, and otherwise handle them.
According to IDC (International Data Corporation), worldwide revenues for Big Data and business
analytics will grow from nearly $122 billion in 2015 to more than $187 billion in 2019.
- Dan Vesset,
(IDC group vice president, Analytics and Information Management)
Interesting Facts and Statistics (Contd.)
“There is little question that Big Data and Analytics can have a considerable impact on just
about every industry,”
“Its promise speaks to the pressure to improve margins and performance while
simultaneously enhancing responsiveness and delighting customers and prospects.
Forward-thinking organizations turn to this technology for better and faster data-driven
decisions,"
- Jessica Goepfert,
(Program director for IDC’s Customer Insights and Analysis Group)
Big Data— Statistics and Challenges
Skybox partnered with Cloudera (a well-known Hadoop vendor) to implement its own distribution of
Hadoop. The platform assembles, normalizes, and indexes incoming data and makes meaningful
connections. Skybox customers can embed their own algorithms in the company's platform and use its
analytics engine to crunch data for their own uses:
• Agricultural clients can monitor crop yields.
• Shipping and supply chain companies can monitor their vehicles.
• Oil and gas companies can evaluate land areas.
Big Data Customers - Case Studies
Here is a list of some more customers which have adopted Big Data and Hadoop-based technologies
to power Big Data applications.
Apache Hadoop
An open-source software framework for distributed storage, distributed and parallel processing of very
large datasets on commodity machines that form a cluster.
Key ecosystem components include Flume, ZooKeeper, Storm, Hive, HBase, Spark, Kafka, Pig, and Sqoop.
Hadoop Ecosystem and its Components (Contd.)
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines
• Hadoop YARN (YARN/MapReduce v2) – a resource management platform for managing computing resources in the cluster and using them for scheduling and processing user applications
• Hadoop MapReduce – a programming model for large-scale distributed and parallel data processing
• Other open-source components/packages that sit on top of these layers: Sqoop (data exchange), Pig (scripting), Hive (SQL), Mahout (machine learning), Oozie (workflow), HBase (columnar data store), ZooKeeper (coordination), and Flume (log collection)
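A minimal sketch of how an administrator typically interacts with these two layers from the command line (the HDFS paths and file names are illustrative):
# Storage layer: copy a local file into HDFS and inspect it
hdfs dfs -mkdir -p /user/admin/input
hdfs dfs -put /tmp/sample.txt /user/admin/input/
hdfs dfs -ls /user/admin/input
hdfs dfsadmin -report
# Processing layer: check YARN nodes and running applications
yarn node -list
yarn application -list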
Hadoop: Daemons, Roles, and Components
A Hadoop cluster has two layers of daemons:
• Processing layer: a ResourceManager (RM), an ApplicationMaster (AM) per application, and NodeManagers (NM) on the worker nodes
• Storage layer: a NameNode (NN) and SecondaryNameNode (SNN) on master nodes, and DataNodes (DN) on the worker nodes
Hadoop Cluster: A complete picture
Clients, APIs, and applications talk to the cluster. The NameNode holds the file system metadata in RAM and persists it to disk; the SecondaryNameNode periodically checkpoints this metadata. The JobTracker or ResourceManager coordinates processing, while each worker node runs a DataNode together with a NodeManager or TaskTracker that executes Map (M) and Reduce (R) tasks.
Abbreviations: NN – NameNode, DN – DataNode, SNN – SecondaryNameNode, RM/JT – ResourceManager/JobTracker, NM/TT – NodeManager/TaskTracker, M – Map, R – Reduce, HDFS – Hadoop Distributed File System.
Quiz
Which characteristic of Big Data relates to the quality of the data under consideration?
a. Volume
b. Velocity
c. Veracity
d. Validity
The correct answer is c.
Explanation: Veracity is the characteristic of Big Data that relates to the quality of the data under consideration.
c. HDFS makes it possible to store all data online and offers a scaling-out approach.
Explanation: HDFS makes it possible to store all data online and offers a scaling-out approach; hence, more accuracy in analysis can be achieved with a framework like Apache Hadoop than with existing RDBMS solutions.
Quiz Which nodes does HDFS use to store data?
3
a. NameNodes
b. Tasktracker
c. DataNodes
d. SecondaryNameNode
The correct answer is c.
Explanation: HDFS stores data blocks on DataNodes.
a. Yes, if HA is enabled
b. Hadoop common
c. Only NamespaceID
d. Datanode IDs
This concludes the lesson “Big Data and Hadoop- Introduction.”
The next lesson is “HDFS: Hadoop Distributed File System.”
Big Data and Hadoop Administrator
Lesson 2- HDFS: Hadoop Distributed File System
What You’ll Learn
Lesson 2: HDFS: Hadoop Distributed File System
Topic 2.1: Introduction to HDFS
Scalability
From the 1970s through the 1990s, vertical scalability (scaling up) was the usual solution to scalability problems.
Scale Out
The 1990s and 2000s saw scale-out architecture become the preferred option over vertical scaling.
Open Scale Out
The advent of cloud platforms has led to the emergence of applications that are
highly scalable, open, and capable of running on heterogeneous platforms.
Commodity Computing: A Solution
The Internet giants have proved that commodity computing and distributed data storage can be used efficiently.
A logically distributed
file system
http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
Hadoop: Different Distributions
The main difference between vendor-specific distributions and the core Hadoop distribution is the set of services they bundle.
Let's now understand how HDFS works and why it is called a highly fault-tolerant, distributed file system.
Internals and working of HDFS (contd.)
When a client, API, or application writes a file through the Hadoop framework, the file is split into blocks (Blk1, Blk2, Blk3, Blk4 in the figure). Each block is replicated (three times by default) across DataNodes. The NameNode, the master daemon of the cluster's storage layer, keeps the metadata for these blocks in RAM.
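As a minimal illustration of this block-based layout, the block locations and replication factor of a file already stored in HDFS can be inspected with fsck (the file path below is illustrative):
# Show the blocks, their DataNode locations, and the replication factor of one file
hdfs fsck /user/admin/input/sample.txt -files -blocks -locations
# Change the replication factor of that file to 3 and wait for it to complete
hdfs dfs -setrep -w 3 /user/admin/input/sample.txt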
Probable Replacements for HDFS
Benefits of HDFS Over the Other Contenders
HDFS features: scalability, inexpensive devices, and no vendor lock-in
HDFS performance: handles high data rates efficiently
Quiz
QUIZ What is a Hadoop Distributed File system?
1
The correct answer is d.
Explanation: The Hadoop Distributed File System is a service and a distributed storage layer that offers fault tolerance.
QUIZ You are configuring your Hadoop cluster to run MapReduce v2 on YARN. What
2 are the two daemons that need to be installed?
a. NameNode, DataNodes
b. ResourceManager, NodeManagers
The correct answer is b.
Explanation: The ResourceManager (on the master node) and the NodeManager (on each slave node) are the daemons that manage applications in a distributed manner in YARN.
QUIZ
Which of the following are a list of services that run on a Cloudera Distribution
3 of Hadoop ( CDH)?
The correct answer is a.
Explanation: HDFS, MapReduce, YARN, Flume, and Sqoop run on a Cloudera Distribution of Hadoop (CDH).
QUIZ
What are the functions of Cloudera Manager?
4
a. Monitors the state of services and roles that are running in a cluster
The correct answer is c.
Explanation: Cloudera manager is responsible for monitoring the services and related roles running on the
hosts of your cluster. It also monitors metrics coming in from various services and roles.
QUIZ
Which service takes care of the activities related to installing CDH, configuring
5 services, and starting and stopping of services?
b. Cloudera-scm-agents
c. Namenode
The correct answer is c.
Explanation: Cloudera manager, also known as Cloudera SCM Server or CMF server, takes care of
everything related to installing CDH, configuring services, and starting and stopping of services.
QUIZ
What are the default block sizes in HDFS, and can it be changed?
6
a. The default block size in Hadoop v1 is 64 MB and in Hadoop v2 it is 128 MB. It can be changed using the dfs.block.size parameter in hdfs-site.xml.
The correct answer is a.
Explanation: The default block size in Hadoop v1 is 64 MB and in Hadoop v2 it is 128 MB. It can be changed using the dfs.block.size parameter in hdfs-site.xml.
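A sketch of how the block size could be overridden in hdfs-site.xml (the value is 128 MB expressed in bytes; note that newer releases use the property name dfs.blocksize, while dfs.block.size is the older, deprecated name):
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB in bytes -->
</property>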
QUIZ With a cluster of 10 machines with 8 datanodes, what is the maximum
replication that can be achieved, and what is the maximum replication that can
7 be set in the configuration files?
The correct answer is b.
Explanation: The maximum replication that can be achieved is 8 (one replica per DataNode), and replication can be set to any number in the configuration files.
QUIZ
If Rack awareness is enabled, how are blocks placed by replication algorithm?
8 Choose the most appropriate option.
a. All the replicas are placed on the same rack for the same file.
The correct answer is c.
Explanation: Assuming replication is set to three, two replicas are placed on different nodes of one rack and one replica is placed on a node in a different (closest) rack. Thus, all replicas are never placed on the same rack.
QUIZ
Identify the most appropriate option for the list of distributions of Hadoop:
9
d. Apache Hadoop
The correct answer is c.
Explanation: The most appropriate list of distributions of Hadoop is CDH, HDP, MapR, IBM BigInsights, and AWS EMR.
QUIZ
Identify in which Distribution of Hadoop HDFS has been designed to be easily
10 portable from one platform to another?
The correct answer is b.
Explanation: In vendor-specific distributions like Cloudera or Hortonworks, HDFS has been designed to be easily portable from one platform to another.
Key Takeaways
This concludes the lesson “Hadoop Distributed File System.”
The next lesson is “Hadoop Cluster Setup and Working.”
Big Data and Hadoop Administrator
Lesson 3- Hadoop Cluster Setup and Working
What You’ll Learn
Install and Prepare Your Machine with
Linux Operating System
Demonstration 1:
Getting Virtualization Software in Linux Disk Image
Demonstration 2:
Adding Machines to your VMBox
Demonstration 3:
Installing Linux into your Machines
Demonstration 4:
Preparing your Linux Machines (CentOS 6) Part 1
Demonstration 5:
Preparing your Linux Machines (CentOS 6) Part 2
Demonstration 6:
Preparing your Linux Machines (CentOS 7)
Cluster Management Solution
Cluster setup is implemented so that servers and network can work together as a centralized data
processing resource.
Cluster Management Solution (Contd.)
Cluster Management Solution Features
Cloudera Manager Vocabulary
The Cloudera Manager Server communicates over HTTP(S) with a Cloudera Manager Agent running on every node and persists its state in an embedded database. Agents manage the roles assigned to their nodes and report back to the server.
Cloudera Manager: Capabilities
The Cloudera Manager Server maintains a data model that contains:
• An updated catalogue of the nodes in a cluster
• The configurations assigned to each node
• Services and their relevant roles
Using this data model, the server:
• Sends configuration and task instructions to agents
• Tracks their heartbeats
• Receives information from agents
• Calculates the health status of services and of the overall cluster
Cloudera Manager: Capabilities (Contd.)
Cloudera Manager:
• Deals with the configuration settings
• Tracks host metrics
• Monitors the cluster and the role status
• Keeps activity-monitoring data and configuration changes
Cloudera SCM Agents handle:
• CDH installation
• Upgrading CDH and its component versions
• Configuring services and changing settings
• Adding new clusters or hosts
• Adding, removing, and maintaining services
Cloudera’s Cluster Management Solution: Cloudera Manager (Contd.)
To install CDH, we need to know its prerequisites; this involves knowing the supported operating systems, resource requirements, JDK versions, and CDH service versions.
Refer to https://www.cloudera.com/documentation/enterprise/5-7-x/topics/cm_ig_cm_requirements.html to learn about compatibilities, requirements, and versions of CDH and its components.
Installation
Option 3: Manual installation using Cloudera Manager tarballs – install the Oracle JDK, the Cloudera Manager Server, and the agents manually from tarballs, and then use Cloudera Manager to automate the rest of the installation.
Cloudera Manager Software Distribution Formats
Cloudera Manager is used to install CDH and manage services. It supports two software distribution formats:
• Packages
• Parcels
d. NameNode
The correct answer is b.
Explanation: The starting and stopping of processes related to services in CDH is handled by the Cloudera Manager Agent.
d. SCM agents start the processes for services, monitor them, and also communicate with the SCM server.
The correct answer is d.
Explanation: The role of a Cloudera SCM agent is to start the processes for services, monitor them, and communicate with the SCM server.
QUIZ
Which role manages and resolves the under-replication or over/mis replication of blocks?
3
a. NameNode
b. Cloudera Manager
c. The DataNodes themselves
The correct answer is a.
Explanation: The NameNode is the role that manages and resolves under-replication, over-replication, or mis-replication of blocks.
a. Click the Services tab > in the top right corner, under Actions, click "Add Services" > follow the wizard
b. Download the packages related to the services, edit the configuration files, and start the services
c. Services cannot be added to the existing list if they were not added during installation
The correct answer is a.
Explanation: To add services to an existing cluster using the Cloudera admin console, click the Services tab, then in the top right corner under Actions click "Add Services" and follow the wizard.
QUIZ
Which property helps to enable yarn framework in apache Hadoop v2?
5
a. mapreduce.framework.name
b. yarn.resourcemanager.address
c. yarn.nodemanager.name
The correct answer is a.
Explanation: mapreduce.framework.name (set to "yarn" in mapred-site.xml) is the property that enables the YARN framework in Apache Hadoop v2.
d. A and C.
The correct answer is d.
Explanation: Parcels can be installed in versioned directories and allow different versions. They can be
installed anywhere in the file system.
The correct answer is a.
Explanation: During the setup of the cluster, the configuration files are auto-populated with properties.
The correct answer is b.
Explanation: Yes, CDH can be set up this way, but the complexity increases.
d. The command is not needed; the status can be seen using the admin console > HDFS service.
The correct answer is d.
Explanation: In a Cloudera cluster, the command 'hdfs dfsadmin -report' is not needed; the same status can be seen in the admin console under the HDFS service.
a. The config file properties are present by default; they cannot be changed.
b. The config file properties can be changed by restarting the services once the changes are done.
c. The config file properties can be changed only with the permission of Kerberos.
The correct answer is b.
Explanation: The config file properties can be changed, but the services should be restarted once the changes are done.
This concludes the lesson “Hadoop Cluster Setup and Working.”
The next lesson is “Hadoop Configurations and Daemon Logs”.
Big Data and Hadoop Administrator
Lesson 4- Hadoop Configurations and Daemon Logs
What You’ll Learn
Explain the RPC and HTTP default addresses and ports used by
Hadoop Daemons
Locate log files generated on hosts
Hadoop Configurations and Daemon Logs
Environment files: hadoop-env.sh, mapred-env.sh, yarn-env.sh
Site files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
Other files: hadoop-metrics2.properties, log4j.properties, hadoop-policy.xml, slaves
Location of Configuration Files and Directories
/etc/hadoop
Let’s now revise some important terminologies that will help in understanding the topics in
this lesson
Service Instance: an instance of a service running on a cluster; it spans many role instances.
Role Instance: an instance of a role running on a host.
Service: a category of managed functionality in Cloudera Manager.
Roles: daemons or processes that take care of a service.
Role Group: a set of configuration properties for a set of role instances.
Configuration Management with Cloudera Manager
Let’s now discuss how to manage Hadoop configurations with Cloudera Manager.
Monitoring
Software management
Resource management
Configuration Management with Cloudera Manager (Contd.)
Let’s now understand how Cloudera Manager helps to handle configurations at different
levels.
@Service Level
Note that service-related role instances obtain their configurations from a private per-process directory found under "/var/run/cloudera-scm-agent/process/unique-process-name".
Specifying Configurations
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
Let's now look at how configuration is handled in a Cloudera cluster and in core Apache Hadoop.
First, we will look at setting up environment variables; these are defined in files such as hadoop-env.sh.
core-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property> <name>fs.defaultFS </name>
<value>hdfs://hostname:port
</value> </property>
</configuration>
Hdfs-site.xml
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property> <name>dfs.namenode.name.dir</name>
<value>/disk1/hdfs/name,/remote/hdfs/name</value></property>
<property> <name>dfs.datanode.data.dir</name>
<value>/disk1/hdfs/data,/disk2/hdfs/data</value> </property>
<property><name>dfs.namenode.checkpoint.dir</name>
<value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
</value></property>
</configuration>
Yarn-site.xml
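A minimal yarn-site.xml sketch in the same style as the files above (the hostname is a placeholder):
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property> <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager-host</value> </property>
  <property> <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value> </property>
</configuration>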
Stale Configurations
An indicator icon marks roles whose configuration is stale.
Attributes on the Stale Configurations page:
Environment variables
Environment set for the role
Configuration files
Files used by the role
Let’s now look at how to fix stale configurations. The following actions will
help fix stale configurations:
Note: In Apache Hadoop cluster, when any configuration changes are made,
related daemons and master daemons must be restarted.
Demonstration 1:
Looking into Logs and Filtering Information
Hadoop Configurations and Daemon Logs
dfs.namenode.rpc-bind-host (default: not set) – The address the namenode's RPC server will bind to. If not set, the bind address is determined by fs.defaultFS. It can be set to 0.0.0.0 to make the namenode listen on all interfaces.
dfs.datanode.ipc.address (default: 0.0.0.0:50020) – The datanode's RPC server address and port.
mapreduce.jobhistory.address (default: 0.0.0.0:10020) – The job history server's RPC server address and port. This is used by the client (typically outside the cluster) to query job history.
mapreduce.jobhistory.bind-host (default: not set) – The address the job history server's RPC and HTTP servers will bind to.
yarn.resourcemanager.bind-host (default: not set) – The address the resource manager's RPC and HTTP servers will bind to.
yarn.resourcemanager.address (default: ${y.rm.hostname}:8032) – The resource manager's RPC server address and port. This is used by the client (typically outside the cluster) to communicate with the resource manager.
yarn.resourcemanager.scheduler.address (default: ${y.rm.hostname}:8030) – The resource manager scheduler's RPC server address and port. This is used by (in-cluster) application masters to communicate with the resource manager.
yarn.resourcemanager.resource-tracker.address (default: ${y.rm.hostname}:8031) – The resource manager resource tracker's RPC server address and port. This is used by (in-cluster) node managers to communicate with the resource manager.
yarn.nodemanager.hostname (default: 0.0.0.0) – The hostname of the machine the node manager runs on. Abbreviated ${y.nm.hostname} below.
yarn.nodemanager.bind-host (default: not set) – The address the node manager's RPC and HTTP servers will bind to.
yarn.nodemanager.address (default: ${y.nm.hostname}:0) – The node manager's RPC server address and port. This is used by (in-cluster) application masters to communicate with node managers.
HTTP Server Properties
dfs.namenode.http-bind-host (default: not set) – The address the namenode's HTTP server will bind to.
dfs.namenode.secondary.http-address (default: 0.0.0.0:50090) – The secondary namenode's HTTP server address and port.
dfs.datanode.http.address (default: 0.0.0.0:50075) – The datanode's HTTP server address and port. (Note that the property name is inconsistent with the ones for the namenode.)
mapreduce.jobhistory.webapp.address (default: 0.0.0.0:19888) – The MapReduce job history server's address and port. This property is set in mapred-site.xml.
mapreduce.shuffle.port (default: 13562) – The shuffle handler's HTTP port number. This is used for serving map outputs and is not a user-accessible web UI. This property is set in mapred-site.xml.
yarn.resourcemanager.webapp.address (default: ${y.rm.hostname}:8088) – The resource manager's HTTP server address and port.
yarn.nodemanager.webapp.address (default: ${y.nm.hostname}:8042) – The node manager's HTTP server address and port.
yarn.web-proxy.address (default: not set) – The web app proxy server's HTTP server address and port. If not set (the default), the web app proxy server runs in the resource manager process.
dfs.datanode.dns.interface – Specifies which network interface the datanodes use as their IP address for connecting with the RPC and HTTP servers.
Hadoop Configurations and Daemon Logs
Log Information
Each log entry shows the host, log level, time, source, and message.
You can filter information from logs based on the following parameters: log level and severity of messages, search time-out, and results per page.
Note: If required, you can download the full log from the Logs page.
Log Information in CDH
Default log directory: $HADOOP_HOME/logs (daemons write both .out and .log files)
Naming convention for log files: hadoop-<user-running-hadoop>-<daemon>-<hostname>.log
Convention for job configuration file names under /hadoop: job_<job_ID>_conf.xml
  Example: job_200704180028_0002_conf.xml
Convention for /hadoop/history file names: <hostname>_<epoch-of-jobtracker-start>_<job-id>_conf.xml
  Example: ec2-52-43-63-183.compute-1.amazonaws.com_1240642372616_job_200704180028_0002_conf.xml
Standard error logs: /var/log/hadoop/userlogs/attempt_<job-id>_<map-or-reduce>_<attempt-id>
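A quick sketch of locating and filtering these logs on a host, assuming a plain Apache Hadoop installation (the user, daemon, and hostname in the file name are illustrative; on a CDH cluster role logs typically live under /var/log/<service> instead):
# List daemon logs and follow the NameNode log
ls $HADOOP_HOME/logs
tail -f $HADOOP_HOME/logs/hadoop-hdfs-namenode-host1.log
# Keep only WARN and ERROR messages
grep -E "WARN|ERROR" $HADOOP_HOME/logs/hadoop-hdfs-namenode-host1.log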
Demonstration 2:
Working with Configurations in Cloudera Cluster and Fixing Stale
Configurations
Quiz
QUIZ
Which of the following is the configuration file with properties defined for
1 metadata path, data path, and other paths related to roles/daemons?
a. hdfs-site.xml
b. core-site.xml
c. hadoop-policy.xml
d. yarn-site.xml
The correct answer is a.
Explanation: hdfs-site.xml is the configuration file with properties defined for the metadata path, data path, and other paths related to roles/daemons.
QUIZ
Which of the following is the default Heap_Size allocated to each daemon in a
2 cluster?
a. 10% of the node's RAM
b. 1 GB
c. 30% of the node's RAM
b. Cloudera Manager, Hosts, Service level, Role group, and Role instance levels
c. Role group and Role instance levels only
Explanation: Cloudera manager, Hosts, Service level, Role group, and Role instance levels are the
different levels at which configurations can be defined and managed in a Cloudera Hadoop cluster.
QUIZ
4 Which of the following is a fix for stale configurations?
Explanation: Restarting all affected services and redeploying client configurations is a fix for stale
configurations.
QUIZ
Which of the following specifies the amount of physical memory (in MB) that
5 may be allocated to containers being run by the node manager?
a. yarn.resourcemanager.resource.memory-mb
b. yarn.nodemanager.resource.vcores-mb
c. yarn.nodemanager.resource.memory-mb
Explanation: Hadoop runs RPC Server and HTTP Server to communicate between daemons and to
provide web pages.
QUIZ
7 Which of these log files can be rotated?
a. .log files
b. Slave logs
c. Daemon logs
a. The Cluster Setup wizard downloads Hadoop and related parcels and sets up some services by default
b. The Cluster Setup wizard assigns roles to hosts based on an internal check of node configuration
c. The admin can configure services and assign or change role assignments to hosts
Explanation: All of the above. All the statements above are true with respect to cluster configuration management with Cloudera.
QUIZ Choose the default HTTP server ports for these daemons: Namenode,
9 Secondarynamenode, ResourceManager, nodemanager, and DataNodes. (in
order)
Explanation: 50070,50090,8088,8042 and 50075 are the default HTTP server ports for the daemons:
NameNode, SecondaryNameNode, ResourceManager, NodeManager, and DataNodes respectively.
QUIZ
What is the resource manager resource tracker’s RPC default port used by (in-
10 cluster) node managers to communicate with the resource manager?
a. 8032
b 8030
.
c. 8031
d. 10020
The correct answer is c.
Explanation: 8031 is the default port of the resource manager's resource tracker RPC service, used by (in-cluster) node managers to communicate with the resource manager.
Key Takeaways
Hadoop's cluster configuration is set up through files; some of the important ones are hadoop-env.sh, core-site.xml, hdfs-site.xml, and yarn-site.xml.
Cloudera Manager and Ambari are two popular cluster-management tools.
This concludes the lesson “Hadoop Configurations and Daemon Logs.”
The next lesson is “Cluster Maintenance and Administration.”
Big Data and Hadoop Administrator
Lesson 5- Hadoop Cluster Maintenance and Administrations
What You’ll Learn
What You’ll Learn
List the steps to add, remove and move role instances and hosts
Explain the features in the second version that help overcome the
challenges faced with the first version
Lesson 5: Hadoop Cluster Maintenance and Administration
Topic 5.1: Maintaining Clusters
Adding and Removing Nodes: Adhoc Method
Adhoc way
Add
Cluster
Adding and Removing Nodes: Adhoc Method (Contd.)
Drawbacks of the adhoc way: replication overhead, consistency issues, reduced availability of data, unplanned effort, and time consumption.
Adding and Deleting Nodes: Systematic Method
Systematic way
Cluster
Commissioning a
node
Decommissioning
a node
Adding and Deleting Nodes: Systematic Method (Contd.)
If you are using the Apache Hadoop Cluster, to add or delete a node, you will need to perform the
following tasks:
• Edit the configuration files hdfs-site.xml and yarn-site.xml during the cluster setup. Recall that in the first version of Hadoop, it was mapred-site.xml that needed editing.
• Create empty include and exclude files.
• Set the properties in the configuration files to point to the include and exclude files.
• Start the cluster.
• Edit the include and exclude files and specify the nodes to be included or excluded.
• Issue the commands "hdfs dfsadmin -refreshNodes" and "yarn rmadmin -refreshNodes".
• Issue an "hdfs balancer" command to ensure an even distribution of data.
• Update the "slaves" file.
A condensed sketch of these steps is shown below.
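The sketch below condenses these steps for an Apache Hadoop cluster (file locations and the hostname are illustrative):
# hdfs-site.xml: point HDFS at the include/exclude files
#   dfs.hosts          -> /etc/hadoop/conf/include
#   dfs.hosts.exclude  -> /etc/hadoop/conf/exclude
# yarn-site.xml: the equivalent YARN properties
#   yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path
echo "datanode5.example.com" >> /etc/hadoop/conf/exclude   # node to decommission
hdfs dfsadmin -refreshNodes     # NameNode re-reads the include/exclude files
yarn rmadmin -refreshNodes      # ResourceManager does the same
hdfs balancer                   # redistribute blocks evenly afterwards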
Demonstration 1: Adding or Removing Machines in an Adhoc Way in Apache Hadoop
Cluster
Demonstration 2: Commission and Decommission a Node in a Cloudera Cluster
Demonstration 3: Decommission and Commission a Node in an Apache Hadoop Cluster
Balancing a Cluster
Causes of uneven distribution of data:
• Commissioning or decommissioning of nodes
• The NameNode not receiving heartbeats from some DataNodes
• Sudden and multiple failures of DataNodes
Balancing a Cluster (Contd.)
An unbalanced cluster can suffer replication storms and cascading failures.
Balancing a Cluster (Contd.)
Failures can be regular or catastrophic:
• A regular event depends on the rate of failure of DataNodes. Regular events do not cause any major impact.
• Catastrophic events could be due to network issues, rack failures, or massive hardware failures. They can trigger the loss of hundreds of nodes within a few minutes.
Balancing a Cluster
The HDFS Balancer:
• Ensures an even distribution of data
• Avoids performance issues and failed tasks
• Avoids replication storms
• Avoids cascading failures
Demonstration 4: Using Balancer to Balance Data in Hadoop Cluster
Lesson 5: Hadoop Cluster Maintenance and
Administrations
Topic 5.2: Managing Services
Demonstration 5: Adding a Service to Cloudera Cluster
Demonstration 6: Deleting a Service from Cloudera Cluster
Starting, Stopping, Restarting and Checking Services
Starting and Stopping of services should be done in the correct order because of the dependencies they
may have on other services.
Example:
MapReduce and YARN have a dependency on HDFS. So, you must start HDFS before starting
MapReduce or YARN.
Cloudera Management Service and Hue are the only two services on which no other services depend.
You can start and stop them any time, but there is a recommended order that needs to be followed.
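On a plain Apache Hadoop cluster the same ordering applies; a minimal sketch using the bundled start/stop scripts:
# Start the storage layer first, then the processing layer
$HADOOP_HOME/sbin/start-dfs.sh      # NameNode, DataNodes, SecondaryNameNode
$HADOOP_HOME/sbin/start-yarn.sh     # ResourceManager, NodeManagers
# Stop in the reverse order
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh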
Starting, Stopping, Restarting, and Checking Services (Contd.)
To start or stop services with the Cloudera admin console user interface:
• Starting order: Cloudera Management Service, HDFS, Flume, Key-Value Store Indexer, Hive, Oozie, Hue
• Stopping order: Hue, Oozie, Hive, Key-Value Store Indexer, Flume, HDFS, Cloudera Management Service
Enabling Rack Awareness
• Create a topology file or script with node-to-rack information, for example mapping DataNodes 1-4 to Rack 1 and DataNodes 5-8 to Rack 2.
• Update hdfs-site.xml with the property topology.script.file.name and include the path to the topology.sh script file.
• Restart HDFS so the replication algorithm becomes rack aware and assigns racks to blocks.
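A minimal sketch of such a setup (the script path, IP ranges, and rack names are illustrative; topology.script.file.name is the classic property name, and newer releases also accept net.topology.script.file.name):
#!/bin/bash
# /etc/hadoop/conf/topology.sh - prints one rack name per host/IP argument
while [ $# -gt 0 ]; do
  case $1 in
    10.1.1.*)  echo -n "/rack1 " ;;
    10.1.2.*)  echo -n "/rack2 " ;;
    *)         echo -n "/default-rack " ;;
  esac
  shift
done
And the matching property:
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>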
Demonstration 8: Enabling Rack Awareness in Cloudera Cluster
Managing Role Instances
Select the
Service
Select the
Instance
Start the
Role
Instance
DELETE HOSTS
Delete Host
Demonstration 10: Adding Hosts to Cloudera Cluster
Demonstration
Hadoop V1
In Hadoop V1, a single JobTracker handled resource management and scheduling for all jobs (Job 1, Job 2, Job 3), which left it overburdened; high availability was also missing for the JobTracker and the NameNode.
The NameNode provides two functions:
• Namespace (NS): providing DataNode cluster membership, processing block reports, and maintaining the location of blocks
• Block management and storage: supporting block-related operations, deleting over-replicated blocks, and providing storage by allowing blocks to be stored on the local file system of each DataNode
Hadoop V2 addresses these limitations.
Federation
With HDFS Federation in Hadoop V2, multiple NameNodes manage independent namespaces (NS 1 ... NS k ... NS n). Each namespace has a block pool (a set of blocks); a namespace plus its block pool is called a namespace volume. The DataNodes provide common storage for the blocks of all the block pools.
High Availability
HDFS high availability uses an active NameNode and a standby NameNode with failover between them. All namespace edits are logged to shared NFS storage with a single writer (fencing); the standby NameNode reads the edit logs and applies them to its own namespace.
Split-Brain Scenario: both NameNodes believe they are in the 'active' state.
Fencing: terminates a NameNode's access to the shared storage so that only one NameNode can write at a time.
On the processing side, YARN (next-generation MapReduce) replaces the JobTracker with a ResourceManager, and each worker runs a NodeManager that hosts application containers.
High Availability with ZooKeeper
ZooKeeper provides health monitoring, session management, and ZooKeeper-based leader election for automatic failover.
Demonstration 11: Enabling High Availability of NameNode and
ResourceManager in Cloudera Cluster
High Availability using Quorum Journal Manager
Data Blocks
Hadoop V2: Overall Picture
A client interacts with two layers:
• HDFS (distributed data storage): an active and a standby NameNode as masters, sharing edit logs through JournalNodes, with DataNodes as slaves
• YARN (distributed data processing): a ResourceManager with its Scheduler and Applications Manager (AsM) as master, and NodeManagers as slaves that host Application Masters and containers
Quiz
QUIZ Which of the following is the Utility or Role that takes care of an even
distribution of data across the cluster and plays an important part
1 when commissioning or decommissioning nodes?
a. Balancer
b. Replication algorithm
c. NameNode
The correct answer is a.
Explanation: The Balancer ensures an even distribution of data across the cluster and plays an important part when commissioning or decommissioning nodes.
a. YARN
c. Cluster ID
a. ZK Failover Controller
c. Zookeeper daemon
a. Heartbeat mechanism
c. Fencing
a. Hosts tab
b. Cluster tab
Explanation: Cloudera Management Service and HUE are Cloudera services on which no
other service is dependent.
QUIZ
Where are shared edits written when HA is set up using QJM?
8
a. On each NameNode
c. On Journal Nodes
a. The number of DataNodes available is less than the number required for replication.
b. NameNode is not available.
c. Communication failure
There are two ways you can add or remove a node from a cluster: the adhoc way and the systematic way.
This concludes the lesson “Cluster Maintenance and Administration.”
The next lesson is “Hadoop Computational Frameworks”.
Big Data and Hadoop Administrator
Lesson 06—Hadoop Computational Frameworks
What You’ll Learn
Hive
Spark
Cascading
YARN Framework
Crunch
Tez
Drill
Open Source Tools
Impala
Presto
Processing Data in Hadoop (Contd.)
Many frameworks sit on top of the general-purpose processing engines MapReduce, Spark, and Tez:
• Abstraction frameworks (process data using a high-level abstraction): Pig, Cascading, Crunch
• SQL frameworks: Hive, Spark SQL, Impala, Presto, Drill
• Machine-learning frameworks and libraries (perform machine-learning analysis on Hadoop data): Mahout, MLlib, Oryx, H2O, Dato (GraphLab)
• Graph frameworks: Giraph, GraphX
• Real-time/streaming frameworks: Spark Streaming, Storm/Trident, Hive (in beta)
Engines and Libraries
Some frameworks have active components, such as servers, clients, and services; these can be considered engines. Some frameworks do not have any active component and can be considered libraries.
Selecting Processing Framework
Use case
Requirement
Available expertise
Experience
Selecting a Processing Framework (Contd.)
The general-purpose processing framework is always needed, because other frameworks solve only specific use cases and may not be sufficient to handle all the processing needs of an organization.
General-Purpose Processing Frameworks: MapReduce and Spark
The time and code required to write a Spark job are much less than for a MapReduce job.
Tez is better suited to building abstraction frameworks rather than building applications.
Abstraction Frameworks
Abstraction and SQL frameworks, such as Pig and Crunch, reduce the time spent writing jobs directly for the general-purpose frameworks and offer the possibility of changing the underlying general-purpose framework when needed.
Graph Frameworks
• Giraph is a library that runs on top of MapReduce.
• GraphX is a library for graph processing on Spark.
• GraphLab is a standalone, special-purpose graph processing framework that can also handle tabular data.
Machine-learning Frameworks
Mahout is a library on top of MapReduce.
Real-Time/Streaming Frameworks
Typical pipelines ingest data from sources such as Kafka, Flume, Kinesis, and Twitter, process it with a streaming framework, and write results to HDFS/S3, databases, or dashboards.
Hadoop 1 vs. Hadoop 2
• Hadoop 1: MapReduce (cluster resource management and data processing) on top of HDFS (redundant, reliable storage)
• Hadoop 2: YARN (cluster resource management) with MapReduce and other frameworks (data processing) on top of HDFS (redundant, reliable storage)
Understanding the MapReduce Programming Model
MapReduce processes data in two phases: a mapping phase that transforms each input record into intermediate key-value pairs, and a reducing phase that aggregates the values for each key into an output list.
Mapreduce Flow
Input splits stored in the Hadoop Distributed File System (HDFS) are read by map() tasks; the output of each map is partitioned, combined, and sorted, then fetched by reduce() tasks, which write the final output regions back to HDFS. The JobTracker (or ResourceManager in YARN) coordinates the tasks.
Keys and Values
• Mapper: several instances of the Mapper are created on multiple nodes in a cluster; each instance receives a different input split.
• Driver: the component of MapReduce that initializes the job, instructs the Hadoop platform to execute the job on the input, and controls where the output is placed.
• Reducer: several instances of the reduce method are instantiated on different nodes; they work on the data generated by the mappers and reduce (for example, sum) it to generate the final output values.
MapReduce Data Flow
Intermediate data from the mappers is partitioned, combined, and sorted, then exchanged between nodes by the shuffle process and delivered to the reducers. The phases, in order, are: Mapper, Partition and Shuffle, Sort, Combiner, and Reducer.
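A minimal way to see this flow end to end is to run a Hadoop Streaming job that uses ordinary Unix commands as the mapper and reducer (the jar path and HDFS paths are illustrative and vary by distribution):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input  /user/admin/input \
  -output /user/admin/streaming-out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
# Inspect the reduced output
hdfs dfs -cat /user/admin/streaming-out/part-*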
Lesson 6: Hadoop Computational Frameworks
Topic 6.3: YARN
What is Yarn?
Master
Worker 1 A ResourceManager is the master daemon that:
CPU
• Communicates with the client
Node
Resource • Tracks resources on the cluster
Manager
Manager RAM • Coordinates work by assigning tasks to
NodeManagers
Vcores Memory
YARN
For example, with a ResourceManager as master and 100 worker nodes, each worker running a NodeManager that offers 64 vcores and 128 GB of RAM, the cluster as a whole offers 6,400 vcores and 12,800 GB of RAM.
• The NodeManager tracks its own resources and advertises its resource
configuration to the ResourceManager.
• The ResourceManager keeps a record of the cluster’s available resources and
knows how to allocate resources when requested.
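A sketch of how a worker node would advertise its resources in yarn-site.xml (the values match the cluster-metrics example later in this lesson and are illustrative; they should leave headroom for the OS and other daemons):
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>90000</value>   <!-- about 90 GB offered to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>60</value>
</property>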
Basics of Yarn Framework
A client submits an application process to the ResourceManager (master). Containers are requested with a resource specification (for example, 1 vcore and 8 GB of memory each) and are launched on worker nodes, where your code runs as a process using the node's CPU, RAM, and disk.
Execution Flow
• The client submits an application to the ResourceManager.
• The ResourceManager allocates the first container on a NodeManager for the ApplicationMaster (for example, 1 vcore and 8 GB of memory).
• The ApplicationMaster negotiates further containers (for example, 1 vcore and 4 GB of memory each) with the ResourceManager, and the tasks run in those containers on the NodeManagers.
• Each NodeManager tracks its own used vcores and RAM, and the ResourceManager tracks the totals across the cluster.
Execution Flow (Contd.)
As the job runs, the ApplicationMaster requests one container per task: map tasks and reduce tasks each run in their own container (for example, 1 vcore and 4 GB of memory per task, with 8 GB for the ApplicationMaster itself). The NodeManagers report their used vcores and RAM to the ResourceManager, and containers are released as tasks complete.
Demo: Running Sample Mapreduce Jobs and Looking at Output
YARN Allocation
Ideal: a YARN cluster can be configured to use up all the resources on the cluster.
Realistic: all the resources cannot be allocated to YARN because of:
• Overheads of non-Hadoop services running on the nodes
• The operating system and utilities, custom programs, and so on
• Other Hadoop components that need dedicated resources and cannot share them
• Distribution-specific services (in the case of a CDH cluster) and Hadoop-specific roles
• Resources for the HBase slave daemons (RegionServers)
Cluster Metrics on Yarn Allocation
Let’s look at a snapshot of ResourceManager Web UI and understand cluster metrics on YARN allocation
There are 50 worker nodes
YARN related configuration properties: yarn.nodemanager.resource.memory-mb is 90000
yarn.nodemanager.resource.vcores is 60
Total memory: 50 x 90 GB = 4,500 GB = 4.5 TB
Total vcores: 50 x 60 = 3,000 cores
Configurations Considerations
QUIZ 2: What kind of frameworks can be used for querying data in Hadoop using query languages, and exist on top of a general-purpose framework?
a. Abstraction frameworks
d. Real-time/Streaming frameworks
QUIZ 3: In MapReduce, InputFormat defines how input files are split and read. What is the default InputFormat provided with Hadoop?
a. KeyValueInputFormat
b. SequenceFileInputFormat
c. TextInputFormat
d. FileInputFormat
QUIZ 4: What is the parameter that controls the split size of the InputSplit processed by a map task?
a. mapred.min.split.size
b. mapred.min.inputsplit.size
c. dfs.blocksize
d. mapreduce.tasktracker.map.tasks.maximum
QUIZ
Frameworks can be categorized based on architecture and whether they have active
components. Select two such frameworks.
5
d. Set mapred.map(/reduce).tasks.speculative.execution=0
QUIZ 6: Which component of YARN takes care of negotiating resources with the ResourceManager and works with the NodeManagers?
a. ApplicationsManager
b. ApplicationMaster
c. Container
d. NodesListManager
QUIZ 7: Which component is responsible for allocating resources to the various running applications and performing the scheduling function based on resource requirements?
a. ApplicationMasterLauncher
b. YarnScheduler
c. ApplicationsManager
d. ResourceManager
QUIZ 8: The NodesListManager manages and seeds the list of nodes mentioned in the configuration files under certain properties. Choose the right properties.
a. yarn.resourcemanager.nodes.include-path/exclude-path
d. yarn-site.xml
a. Make sure to allocate maximum RAM, CPU, and storage to YARN.
b. There should be overheads for the OS and its utilities and for other Hadoop and non-Hadoop components.
c. The ResourceManager should be on a node with good RAM, CPU, and storage.
There are five intermediate phases in MapReduce. They are Mapper, Partition and Shuffle,
Sort, Combiner, and Reducer.
YARN stands for Yet Another Resource Negotiator. It is like an operating system on a server.
Big Data and Hadoop Administrator
Lesson 07—Scheduling: Managing Resources
Describe scheduling concepts
Identify Schedulers
A scheduler lets a cluster move beyond a standalone system to support multi-tenancy and scalability.
Schedulers in Hadoop 2.0 and Yarn
Job timelines illustrate the difference: with a FIFO Scheduler, Job 2 waits until Job 1 completes; with a Fair Scheduler, both jobs receive a fair share of the pool/queue as soon as they are submitted.
FIFO Scheduler
A FIFO Scheduler runs jobs in the order of submission; a job submitted later waits for the earlier job to finish.
Fair Scheduler
A Fair Scheduler dynamically balances resources amongst all running jobs, giving each job a fair share of the pool/queue.
Property: yarn.resourcemanager.scheduler.class
Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
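Set in yarn-site.xml, this could look like the following sketch:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>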
Fair Scheduler: Scenario
When Job 1 is submitted to queue A, it can use the whole cluster. When Job 2 is submitted to queue B, each queue settles at its fair share of utilization. When Job 3 is later submitted to queue B, it shares queue B's fair share with Job 2.
Queue Configuration
<?xml version="1.0"?>
<allocations>
<defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
<queue name="prod">
<weight>40</weight>
<schedulingPolicy>fifo</schedulingPolicy>
</queue>
<queue name="dev">
<weight>60</weight>
<queue name="eng" />
<queue name="science" />
</queue>
<queuePlacementPolicy>
<rule name="specified" create="false" />
<rule name="primaryGroup" create="false" />
<rule name="default" queue="dev.eng" />
</queuePlacementPolicy>
</allocations>
Fair-Scheduler.xml
The fair-scheduler.xml above defines a queue hierarchy under the root queue: prod (weight 40, FIFO scheduling) and dev (weight 60), with dev subdivided into the eng and science queues.
Demo: Setting up Fair-Scheduler.xml
Capacity Scheduler
A single job does not use more resources than the capacity of its queue. However, if there is more than one job in the queue and idle resources are available, the Capacity Scheduler may allocate the spare resources to jobs in the queue, even if that exceeds the queue's capacity.
Queue Hierarchy
• Root
  • Prod (for production): 40%
  • Dev (for development): 60%, subdivided into Eng (for engineering) and Science (for science)
Demo:Sample Run of Jobs Using Capacity Scheduler
Delay Scheduling
All schedulers try to run a container on the node that holds the data. If the requested node is busy, the scheduler could immediately fall back to another node, but waiting a few seconds (a few scheduling opportunities) dramatically increases the chance of a container being allocated on the requested node, and thereby the efficiency of the cluster. This feature is called delay scheduling, and it is supported by both the Capacity Scheduler and the Fair Scheduler.
Delay scheduling is configured by setting:
• yarn.scheduler.capacity.node-locality-delay (Capacity Scheduler)
• yarn.scheduler.fair.locality.threshold.node and yarn.scheduler.fair.locality.threshold.rack (Fair Scheduler)
The Fair Scheduler uses the number of scheduling opportunities to determine the delay.
Dominant Resource Fairness
When applications request different mixes of resources (for example, Application A requests 2 CPUs with 300 GB each while Application B is CPU-heavy), it is hard to compare them on a single dimension. Schedulers in YARN use the Dominant Resource Fairness (DRF) approach: they look at each user's dominant resource use or requirement and use it to measure cluster use.
DRF is enabled by setting yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator in capacity-scheduler.xml.
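As a sketch, in capacity-scheduler.xml:
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>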
Quiz
QUIZ 1: What are the four scheduler implementations available in Hadoop to manage resources?
QUIZ 2: In a multi-resource environment, determining fairness, comparing two applications, and finding the optimum resource allocation are difficult. What approach can be followed to solve this problem?
d. Let the cluster use the FIFO Scheduler, and the cluster will manage automatically
QUIZ 3: In YARN, vcores (virtual cores) are used to normalize CPU resources across the cluster. Choose the property used to set the number of CPU cores that can be allocated to containers.
a. yarn.nodemanager.resource.cpu-vcores
b. yarn.scheduler.capacity.resource-calculator
c. yarn.scheduler.minimum-allocation-vcores
d. yarn.scheduler.maximum-allocation-vcores
QUIZ 7: Which scheduler allows higher cluster utilization while providing predictability of workloads and shares resources in a predictable manner?
a. Fair Scheduler
b. FIFO Scheduler
c. Delay Scheduler
d. Capacity Scheduler
QUIZ 8: Select the possible states of queues in YARN.
b. Running or stopped
c. Initiated, Started, On hold, Stopped, Terminated
d. Running or closed
a. Decommissioning of queues
a. Makes the users wait in a single queue based on the order of job submission
Big Data and Hadoop Administrator
Lesson 8- Hadoop Cluster Planning
Planning a Hadoop cluster
Understanding deployment
Need:
What do I need a Hadoop cluster for?
Data Type:
What kind of data will it handle?
Volume:
What is the volume of data?
Speed:
How quickly is the data growing?
Processing:
How frequently do we need to process it?
Planning: Knowing the Requirements
Cost:
Are there any budget constraints?
Workload:
What kind of workload will my cluster manage?
Time:
How much time do I have?
Resource consumers/Applications:
How many applications do I have?
Cluster sizing tool/manual:
Do I use a cluster-sizing tool to help me estimate or do I use my
experience?
Size:
What would be the appropriate size of the cluster?
Questions- Responses- Reality Check
Need: Is it for “delivering new business value” or is it for “delivering data center efficiency”?
Data Type: A cluster might be used to handle a variety of data that may be generated from various
sources.
Volume: Is the data massive with the magnitude of petabytes and exabytes, or is it up to terabytes?
How long do we need to retain the data?
Processing: The frequency of data processing would directly be proportional to the choice of
resources in the cluster.
Questions- Responses- Reality Check
Cost: Is the cluster bound by costs and does it impose budget constraints?
Workload: Will it be network intensive, IO Intensive, CPU intensive, or will it be a balanced workload?
Are we looking at an evolving workload?
Time: If time is a constraint, usage of cluster-management solutions will be preferred and, if not,
open-source core edition could be used.
Resource consumers/applications: How many applications would be supported by the cluster? This
may affect the cluster size and count.
Questions- Responses- Reality Check
Cluster sizing tool/manual: For planning and estimation, will we use cluster-sizing tool/calculator or
rely on experience and expertise?
Size: Based on the above requirements, the size of the cluster can be decided.
Lesson 8: Hadoop Cluster Planning
Topic 8.2: Workload Patterns
Sample Cluster Sizing Tool
Cluster-sizing tool offered by one of the vendors
Sample Output of Cluster Sizing/Estimation Tool
Based on the options selected in the tool, as shown in the previous screen, the tool estimates the hardware and infrastructure and shows some recommendations.
Workload Patterns
Hadoop clusters are used for massive data storage and processing.
Architects usually ask a few questions while trying to understand the workload patterns:
Selecting the appropriate hardware that provides the best balance of performance and economy for a
workload pattern is a critical decision to make when planning a Hadoop cluster.
• Compute-intensive workload: CPU bound; demands a large number of CPUs and vast memory.
• I/O-intensive workload: I/O bound; demands more investment in disks and storage devices.
• Network-intensive workload: network bound; demands appropriate network devices and settings to support intense network traffic.
• Balanced workload: balanced across CPU, disk, and network.
• Unknown or evolving workloads: when the workload is not yet known, a balanced configuration is the safest starting point. Hardware choices range from computation-optimized nodes (high CPU power per node, fewer disks) to storage-optimized nodes (more disks, lower CPU power).
Cluster Options
Organizations often run several clusters for different purposes: a production cluster, a development and POC cluster, a backup and archive cluster, a BI or ML cluster, and a data exploration cluster.
• Backup and archive: helps achieve business continuity through replication to on-premises and cloud-based storage.
• Development: keeps the development environment separate from the production environment.
Hardware choices involve selecting appropriate hardware and considering factors such as:
• NICs
• Power supply
• Cooling
• RAID or JBOD disk configurations
• RAM and RACK for both master and slave nodes
Hadoop hardware comes in two different classes: Master nodes and Slave nodes.
To avoid a heterogeneous platform and reduce proliferation of hardware profiles, architects select
single profile for master nodes and slave nodes.
Making Choices: Hardware Considerations
Use dual power supplies, dedicated cooling, bonded Network Interface Cards
(NICs) and raided disks.
Use machines with good RAM and nominal or moderate disk capacity.
The OS for master machines should be highly available, thus RAID hard drives
are recommended.
Slave/Worker Nodes
• Worker nodes are responsible for storage and
computation
Considerations
• Commodity machines
• Should have enough storage capacity, CPU, and memory
to process data
• Multiple disks from same vendor, with no RAID
• JBOD disk configurations
Network Considerations for Your Hadoop Cluster
• Nodes connected with a minimum bandwidth of 1 GB/sec; network-intensive workloads may need speeds such as 10 GB/sec for large amounts of intermediate data
• Dedicated and preferably top-of-rack switches
• Avoid over-subscription (such as 1 GE) between racks
With default replication and data growing by 5 TB/week, 15 TB of storage space is required every week.
Example master node hardware (NameNode and ResourceManager): balanced workload profile, four or more 2-3 TB disks in RAID 10 with spares, 8 cores, 128-256 GB RAM, 1 GB onboard NIC plus 2x10 GbE mezzanine/external.
Industry Setups: Examples
• Cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 Quad
Core Rack Server, with 2x6MB Cache and 4 x 500 GB SATA Hard Drive to
E7200 / E7400 processors with 4 GB RAM and 160 GB HDD
• They use Apache Hadoop for log aggregation, reporting, and analysis
• Two Apache Hadoop clusters, all nodes 16 cores, 32 GB RAM
• Cluster 1: 27 nodes (total 432 cores, 544GB RAM, 280TB storage)
• Cluster 2: 111 nodes (total 1776 cores, 3552GB RAM, 1.1PB storage)
Industry Setups: Examples
• Yahoo! has more than 100,000 CPUs in over 40,000 servers running
Hadoop, with its biggest Hadoop cluster running 4,500 nodes. Yahoo! stores
455 petabytes of data in Hadoop
• That's big, and approximately four times larger than Facebook's beefiest
Hadoop cluster
Quiz
a. True
b. False
a. True
b. False
d. Yes, there will be a problem only when secondary NameNode does checkpointing.
d. Yes, there will be a problem only when secondary NameNode does checkpointing.
a. Yes, multiple disks allow better fault tolerance, I/O, and parallelism.
a. Yes, multiple disks allow better fault tolerance, I/O, and parallelism.
Explanation: While setting up DataNodes, multiple disks allow better fault tolerance, I/O,
and parallelism.
QUIZ 4
The amount of memory required for the master nodes depends on the number of files and blocks.
a. True
b. False
QUIZ 4
The amount of memory required for the master nodes depends on the number of files and blocks.
a. True
b. False
b. Evolving workload
b. Evolving workload
a. Yes, we can estimate the storage requirement and the need for
adding DataNodes
c. A and C
a. Yes, we can estimate the storage requirement and the need for adding DataNodes
c. A and C
a. Yes, it can help us estimate the size of a cluster and overall infrastructure
b. No, it cannot estimate the size of a cluster and its overall infrastructure
a. Yes, it can help us estimate the size of a cluster and overall infrastructure
b. No, it cannot estimate the size of a cluster and its overall infrastructure.
Explanation: Yes, understanding the frequency of processing data or type of data processed
helps us in estimating the size of cluster and the overall infrastructure.
QUIZ
Select a list of considerations when setting up Master nodes.
8
a. JBOD disk configuration, NO RAIDED drives, RAM same as DataNodes, multiple NICs
c. Multiple disks with JBOD configuration, default replication, dedicated power supply
a. JBOD disk configuration, NO RAIDED drives, RAM same as DataNodes, multiple NICs
c. Multiple disks with JBOD configuration, default replication, dedicated power supply
b. Yes, physical and network isolation benefit clusters and avoid bottlenecks and resource sharing
d. It isn't necessary
QUIZ
Can your cluster benefit from physical and network isolation?
9
b. Yes, physical and network isolation benefit clusters and avoid bottlenecks and resource sharing
d. It isn’t necessary
Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 09—Hadoop Clients and Hue Interface
Explain the concepts of Hadoop client, edge nodes, and gateway nodes
Clients
Masters: NameNode, Secondary NameNode, and Job Tracker.
Slaves: DataNode and Task Tracker machines that do all the storing and running of computations.
Client Nodes: Hadoop/HDFS Clients
• HDFS clients perform filesystem metadata operations through a single server known as the NameNode.
• A client establishes a connection to the NameNode, while a separate protocol is used for communication with DataNodes.
• An RPC abstraction wraps both protocols; the NameNode never initiates any RPCs and only responds to RPC requests issued by DataNodes or clients.
(Diagram: a DEV cluster of DataNodes identified by hostname and IP address.)
How Writes are Done from Client
(Diagram: a client writing blocks A, B, and C of File.txt, coordinated through the NameNode.)
When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the
DataNode. The client then tells the NameNode that the file is closed. The NameNode commits the
file creation operation into a persistent store.
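To illustrate this flow from the client's point of view, here is a minimal sketch that writes and reads a file over WebHDFS using the third-party Python hdfs package. The NameNode address, port (9870 on Hadoop 3.x; older releases commonly use 50070), user, and path are placeholder assumptions, and the library choice is an illustration rather than something prescribed by the course.

# hdfs_write_sketch.py: writing a file to HDFS through WebHDFS (third-party 'hdfs' package)
from hdfs import InsecureClient

client = InsecureClient('http://namenode-host:9870', user='hdfs')   # assumed NameNode web address

# The client streams the data; HDFS splits it into blocks on the DataNodes.
client.write('/tmp/File.txt', data=b'Blk A Blk B Blk C', overwrite=True)

# Once the write is closed, the NameNode has committed the file creation.
with client.read('/tmp/File.txt') as reader:
    print(reader.read())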
Understand How Client Ensures Data Integrity
The HDFS client software implements checksum checking on the contents
of HDFS files.
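The sketch below illustrates the idea behind that checksum verification. It is a simplified, self-contained illustration using zlib.crc32 over a whole payload; the real HDFS client checksums fixed-size chunks of each block and keeps the checksums in separate metadata, so treat this only as a conceptual sketch.

# checksum_check_sketch.py: simplified illustration of client-side checksum verification
import zlib

def checksum(data: bytes) -> int:
    # Compute a CRC over the payload (HDFS uses CRC-based checksums per chunk)
    return zlib.crc32(data)

data_as_written = b"example block contents"
stored_checksum = checksum(data_as_written)   # recorded when the data was written

data_as_read = b"example block contents"      # data later read back from a DataNode

if checksum(data_as_read) != stored_checksum:
    # A real HDFS client would report the corrupt replica and read another healthy copy.
    print("Checksum mismatch: report corruption and retry from another replica")
else:
    print("Checksum OK: data is intact")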
How are clients configured and deployed?
• Client configuration files are auto-generated and auto-deployed based on the services and roles in the cluster.
• The deploy function unzips the generated configuration into the appropriate configuration directory.
If roles for multiple services are running on the same host, then the client configurations for
both roles are deployed on that host, with the alternatives priority determining which
configuration takes precedence
How to Download Client Configuration Files
Option A:
1. Log in to the Admin Console and click the Services tab.
2. From the Home button, select the status.
4. Save the link and download the configuration files.
Option B:
1. Click the Services tab in the Admin Console.
2. Select the service instance whose configuration you want to download.
4. This downloads the configuration files for the selected service.
Edge Nodes or Gateway Nodes
Edge nodes hold intermittent data while data is being transferred to the Hadoop cluster.
(Diagram: the cluster's DataNodes identified by hostname and IP address.)
There are some considerations for edge nodes, although specifics always depend on business and technical
requirements.
1. To successfully handle high inbound and outbound data transfer rates, edge nodes should have multiple pairs of bonded 10 GbE network connectors.
2. Edge nodes should be multi-homed, that is, connected to multiple networks and into the private subnet of the Hadoop cluster. Two pairs of bonded 1 GbE network connections are recommended: one to connect to the Hadoop cluster and the other for the external network.
3. The processor configuration should be the same as or a little more than that of slave nodes; 48 GB of RAM would be sufficient.
4. Edge nodes oriented to data ingestion should be equipped with optimum storage space.
5. Edge nodes should use carrier-class hardware.
6. Avoid placing data import/export services such as Sqoop on master and slave nodes, as the high data transfer volumes may lower the ability of Hadoop services to communicate with each other; high latency may cause the nodes to get detached from the cluster.
7. Apart from edge nodes, no other nodes must be used to deploy and run administration tools.
Lesson 9: Hadoop Clients and Hue Interface
Topic 9.2: Installing, Configuring, Refreshing and Working with Clients
Demonstration 1: Installing, Configuring, Refreshing and
Working with Clients
Lesson 9: Hadoop Clients and Hue Interface
Topic 9.3: Overview of Hadoop User Experience (Hue)
Introduction to Hue
Hue Plugins
Hue plugs into services such as YARN, Job Tracker, Oozie, HDFS, Pig, HiveServer2, Hive Metastore, Cloudera Impala, HBase, Solr, Zookeeper, and Sqoop2.
Why is Hue Required
Name Node UI
Resource Manager UI
Hue is an application in the background that allows users to interact with a Hadoop cluster from a
web browser and requires no client installation.
Hue Plugins
Hue plugins cover the CDH services listed above: YARN, Job Tracker, Oozie, Pig, HDFS, HiveServer2, Hive Metastore, HBase, Cloudera Impala, Solr, Zookeeper, and Sqoop2.
Hue UI applications, backed by the Hue database, include: Job Browser, Job Designer, Oozie Editor, Pig Editor, File Browser, Beeswax (Hive UI), Metastore Manager, Cloudera Impala Query UI, and Shell.
RDBMS UI is a new application that enables the viewing of data in other RDBMSs.
Edge nodes are mostly used to run client applications and cluster
administration tools.
Hue, although a web interface, is a background application that allows users to
interact with a Hadoop cluster from a web browser and requires no client
installation.
There are various Hue application interfaces, such as Beeswax, Impala Query UI, RDBMS UI in DB Query, Pig Editor, Job Designer, Job Browser, Metastore Tables in the Hive Metastore Browser, HBase in the HBase UI, Sqoop Transfer, Zookeeper in the Zookeeper Browser, and File Browser.
Quiz
QUIZ 1
Identify the disadvantage of running administration tools or data transfer tools like Sqoop on master/slave nodes.
a. It’s mandatory to run the admin /data transfer tools on nodes that are not
part of the cluster.
b. Conflict in resource usage and high volume data transfer may impact
Hadoop services.
c. Master/slave nodes are busy with data handling and cannot be efficient.
a. It’s mandatory to run the admin /data transfer tools on nodes that are not
part of cluster.
b. Conflict in resource usage and high volume data transfer may impact
Hadoop services.
c. Master/slave nodes are busy with data handling and cannot be efficient.
a. NameNode
b. ResourceManager
c. HDFS Clients
d. Service Nodes
QUIZ 3
Which of the following regulates the data access for the data stored in HDFS or data stores using HDFS?
a. NameNode
b. ResourceManager
c. HDFS Clients
d. Service Nodes
a. Cloudera Manager
b. Hue Server
c. Hue Database
a. Cloudera manager
b. Hue Server
c. Hue Database
a. Pig
b. Hive metastore
c. Hbase
d. Impala
QUIZ 6
Identify the application of Hue that allows users to issue fast interactive queries and shares the metastore with Hive.
a. Pig
b. Hive metastore
c. Hbase
d. Impala
a. Daemon web interfaces stop when daemons stop, but Hue interface doesn't.
b. Hue allows users to work with data on HDFS, but daemon web interfaces allow only browsing.
c. Hue web interface provides full access on HDFS, but daemon web interfaces don't.
d. Daemon web interfaces show accurate information, but Hue interface doesn't.
QUIZ 8
What is the main difference between daemon web interfaces and the Hue web interface?
a. Daemon web interfaces stop when daemons stop, but Hue interface doesn't.
b. Hue allows users to work with data on HDFS, but daemon web interfaces allow only browsing.
c. Hue web interface provides full access on HDFS, but daemon web interfaces don't.
d. Daemon web interfaces show accurate information, but Hue interface doesn't.
a. Sqoop
b. Hive
c. Hbase
a. Sqoop
b. Hive
c. Hbase
b. Hue.ini file
c. Hue.conf file
d. Hue-site.xml
QUIZ 10
Which file is edited to make sure applications in Hue interact with the underlying services of the cluster?
b. Hue.ini file
c. Hue.conf file
d. Hue-site.xml
Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson10 - Data Ingestion in Hadoop Cluster
Explain data ingestion
Data ingestion tools include Chukwa, Apache Storm, Gobblin, and Amazon Kinesis.
Data sources include cloud streams, log/event data, social media, and web servers.
A Flume agent is a JVM process that hosts a source, one or more channels, and sinks: the source receives events (for example, from JMS or a web service), the channels buffer them, and the sinks deliver them onward. Agents running alongside web services can forward events to collector agents, which write to centralized stores such as HDFS and HBase.
Types of Data Flow
• Multi-hop flow: events travel through more than one Flume agent before reaching the destination.
• Fan-out flow: a source delivers events to multiple channels, either replicating or multiplexing them.
• Fan-in flow: events from multiple sources are consolidated into a single agent.
Sqoop
In the Sqoop 2 architecture, a Sqoop client (command line or browser) talks to the Sqoop server, which hosts the built-in connectors and a metadata repository; imports and exports run as map (and reduce) tasks that move data between external stores and HDFS, HBase, or Hive.
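For illustration, the sketch below launches a classic Sqoop 1 command-line import from Python. The JDBC URL, credentials, table name, and target directory are placeholder assumptions; in a Sqoop 2 deployment the equivalent job would instead be defined through the Sqoop server and its connectors.

# sqoop_import_sketch.py: launching a Sqoop 1 CLI import from Python (placeholder values)
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",    # assumed source database
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop_user/.password",  # keeps credentials off the command line
    "--table", "orders",                              # assumed source table
    "--target-dir", "/user/hadoop/orders",            # HDFS directory for the imported data
    "--num-mappers", "4",                             # parallel map tasks doing the transfer
]

# Each mapper pulls a slice of the table and writes it to HDFS as part of a MapReduce job.
subprocess.run(cmd, check=True)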
Sqoop 2—Features
Apache Flume, Apache Sqoop, and Apache Kafka are some of the
tools used for data ingestion.
Flume agent is a Java Virtual machine or JVM process that hosts the
components through which events flow.
The types of data flow are Multi-hop flow, Fan-out flow, and
Fan-in flow.
c. Apache Sqoop 2
c. Apache Sqoop 2
a. Apache Flume
b. Apache Sqoop
c. Apache Storm
d. Spark SQL
QUIZ
Which tool allows bulk data transfer across Hadoop and structured data stores?
3
a. Apache Flume
b. Apache Sqoop
c. Apache Storm
d. Spark SQL
a. Fan-in Flow
b. Fan-out Flow
c. Multiplexing Flow
d. Multi-Hop Flow
QUIZ 4
What is the type of data flow in Flume when events may travel through more than one agent?
a. Fan-in Flow
b. Fan-out Flow
c. Multiplexing Flow
d. Multi-Hop Flow
a. Apache Kafka
b. Apache Sqoop
c. Apache NiFi
d. Amazon Kinesis
QUIZ 5
What is the name of the data ingestion tool that allows automation of data movement between systems, provides real-time control, and enables ease of data movement?
a. Apache Kafka
b. Apache Sqoop
c. Apache NiFi
d. Amazon Kinesis
a. Flume event
b. Flume agent
c. Flume channels
d. Flume sinks
QUIZ 6
Which of the following is the term used for the process that hosts the components through which events flow within Flume?
a. Flume event
b. Flume agent
c. Flume channels
d. Flume sinks
a. Load Balancing
c. Import/export of data
d. Performance optimization
QUIZ 7
What is the capability of Sqoop that mitigates excessive storage and processing loads to other systems?
a. Load Balancing
c. Import/export of data
d. Performance optimization
a. Sqoop Server
b. Apache Kafka
c. Apache Oozie
d. Apache NiFi
QUIZ 8
Which of the following services does Sqoop integrate with to allow automation and scheduling of import/export tasks?
a. Sqoop Server
b. Apache Kafka
c. Apache Oozie
d. Apache NiFi
Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 11—Hadoop Ecosystem Components/Services
List some of the services and open-source components that work
within the Hadoop ecosystem
Apache Hive What is it?
Apache Hive is a data warehouse system built on Hadoop for data summarization, query, and analysis using SQL.
Apache Hive Applications
Apache Hive Features
• Supports extension of the query language with User-Defined Functions (UDFs)
• Permits reading and writing of data in non-Hive formats
Apache Hive Configurable Modes
Hive's metastore can be configured in embedded, local, or remote mode. In local metastore mode, the metastore runs in the same JVM as the Hive service, while the metastore database (for example, MySQL) runs as a separate process.
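Once a metastore is in place, clients typically reach Hive through HiveServer2. The sketch below is a minimal example using the third-party PyHive package; the host, port, user, table, and query are placeholder assumptions.

# hive_query_sketch.py: querying Hive through HiveServer2 with PyHive (placeholder values)
from pyhive import hive

connection = hive.Connection(host="hiveserver2-host", port=10000, username="hadoop")
cursor = connection.cursor()

# HiveQL is compiled into jobs on the cluster; the metastore supplies the table metadata.
cursor.execute("SELECT page, COUNT(*) AS views FROM web_logs GROUP BY page LIMIT 10")
for page, views in cursor.fetchall():
    print(page, views)

cursor.close()
connection.close()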
Working with and scaling a traditional relational database typically involves:
1. Using transactions to ensure data consistency and referential integrity
2. Using a domain-specific language such as SQL
3. De-normalizing schemas
4. Storing only the amount of data that can enable optimization of access patterns
5. Setting up a single master and adding slave database servers
6. Scaling the master server vertically and adding a cache
7. Avoiding the use of built-in features
8. Determining the costliest queries
HBase: Need
Non-relational database systems: Not-Only SQL, or NoSQL (term coined by Eric Evans)
HBase: Applications
• The volume of data is huge.
• There is a need for faster and random read-write actions.
HBase—Architecture
• Tables are horizontally partitioned into regions, each covering a range of row keys, and regions are assigned to Region Servers.
• ZooKeeper stores the location of the META table and of the HMaster.
• A client looks up in the META table which Region Server holds a given row key, caches the result in its Meta Cache, and then sends Put or Get requests directly to that Region Server.
• Each region has a MemStore, a write cache that keeps a sorted map of KeyValues in memory.
• HFiles hold sorted KeyValues on disk, and a write-ahead log (WAL) on disk is used for recovery.
• Region Servers run alongside HDFS DataNodes.
HBase—Architecture
Each HRegionServer hosts an HLog (write-ahead log) and multiple HRegions. An HRegion contains one Store per column family; each Store has a MemStore and a set of StoreFiles, which are persisted as HFiles on the underlying HDFS DataNodes.
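To show the client side of this architecture, the sketch below uses the third-party happybase library, which talks to HBase through its Thrift gateway; the Thrift server host, table name, and column family are placeholder assumptions, and the HBase Thrift service must be running for it to work.

# hbase_client_sketch.py: simple put/get against HBase via the Thrift gateway (happybase)
import happybase

connection = happybase.Connection("hbase-thrift-host")   # assumed Thrift server
table = connection.table("user_actions")                  # assumed existing table with family 'cf'

# Writes land in the region's MemStore first and are later flushed to HFiles on HDFS.
table.put(b"row-001", {b"cf:action": b"login"})

# Reads are served from the MemStore or the HFiles of the region that owns the row key.
row = table.row(b"row-001")
print(row[b"cf:action"])

connection.close()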
In this demo, you will see how to add HBase as a service
in CDH.
Lesson 11: Hadoop Ecosystem Components/Services
Kafka is used to obtain performance and usage data from the end-users’ browsers to be used in projects.
At Square, Kafka acts as a bus and moves systems events through various datacenters.
LinkedIn uses Apache Kafka to stream activity data and to obtain operational metrics.
Many teams in Yahoo use Kafka, including the Media Analytics team which uses it for real-time analytics.
Apache Kafka What is it?
A Kafka topic is an ordered, partitioned commit log: producers append writes to the end of each partition, and every message within a partition is identified by a sequential offset.
Apache Kafka How it works (Contd.)
Producers publish messages to the partitions of a topic; consumers, organized into consumer groups, read from those partitions, and within a group each partition is consumed by only one consumer at a time.
Apache Kafka Unique Features
• Kafka has better throughput and built-in partitioning to handle replication and achieve better fault tolerance.
• Kafka rebuilds user activities as real-time publish-subscribe feeds and supports real-time processing, real-time monitoring, offline data warehousing, and reporting; it is a good fit because activity tracking generates high volumes of data.
• Kafka collects data from distributed applications and generates centralized feeds of operational data.
• Kafka supports storage of huge volumes of log data.
Apache Kafka Applications
• Log aggregation: Kafka sometimes replaces a log aggregation solution. It abstracts the file details and provides the log or event data as a stream of messages, which helps achieve lower-latency processing, support for multiple data sources, and distributed data consumption.
• Stream processing: data is processed in stages; raw data is consumed from topics and then transformed into new Kafka topics to be used by applications.
• Commit log: Kafka has a log compaction feature that enables it to serve as an external commit log for a distributed system; the log supports replication of data between nodes and re-synchronizes failed nodes to restore data.
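As a concrete illustration of the publish-subscribe flow described above, the sketch below uses the third-party kafka-python package; the broker address, topic name, and consumer group are placeholder assumptions.

# kafka_pubsub_sketch.py: minimal producer and consumer using kafka-python (placeholder values)
from kafka import KafkaProducer, KafkaConsumer

BROKER = "broker-host:9092"    # assumed Kafka broker address
TOPIC = "web-activity"         # assumed topic name

# The producer appends messages to the partitions of the topic (the commit log).
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b"page_view user=42 url=/home")
producer.flush()

# Consumers in the same group share the topic's partitions between them.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="analytics-group",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    break   # stop after the first message in this sketch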
Apache Spark What is it?
The Spark stack includes components such as Spark SQL and BlinkDB, an approximate SQL engine that is in alpha/pre-alpha.
Apache Spark Computing through Resilient Distributed Datasets (RDDs)
Operations on RDDs fall into two categories: transformations and actions.
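A small PySpark sketch makes the distinction concrete: map and filter below are transformations that lazily describe new RDDs, while count and collect are actions that trigger execution. The local master URL is an assumption for testing outside a cluster.

# spark_rdd_sketch.py: RDD transformations vs. actions in PySpark (illustrative)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("rdd-sketch").setMaster("local[2]")   # assumed local test run
sc = SparkContext(conf=conf)

numbers = sc.parallelize(range(1, 11))            # build an RDD from a small dataset

# Transformations: lazily define new RDDs; nothing runs yet
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger the distributed computation and return results to the driver
print(even_squares.count())      # 5
print(even_squares.collect())    # [4, 16, 36, 64, 100]

sc.stop()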
Apache Spark and Hadoop
Hadoop provides:
• YARN ResourceManager
• HDFS
• Disaster recovery
• Data security
• A distributed data platform
Spark provides:
• Contribution to Hadoop-based jobs through YARN
• Rapid in-memory processing of large data volumes
• SQL, streaming, and graph processing capability
Quiz
QUIZ Which component of Apache Hive takes care of session handling and connectivity of
Hive with Hadoop?
1
a. Execution engine
b. Compiler
c. Metastore
d. Driver
QUIZ Which component of Apache Hive takes care of session handling and connectivity of
Hive with Hadoop?
1
a. Execution engine
b. Compiler
c. Metastore
d. Driver
Explanation: Driver is the component of Apache Hive that takes care of session handling and connectivity of Hive with
Hadoop.
QUIZ
Which component of Apache Hive defines the mode in which Hive is setup and
contains metadata about data?
2
a. Compiler
b. Execution Engine
c. Metastore
d. Metastore service
QUIZ
Which component of Apache Hive defines the mode in which Hive is setup and
contains metadata about data?
2
a. Compiler
b. Execution Engine
c. Metastore
d. Metastore service
a. Embedded metastore
b. Local metastore
c. Remote metastore
d. Metastore service
QUIZ Which setup of Hive allows multiple users to access Hive via CLI or Hue’s Hive
interface?
3
a. Embedded metastore
b. Local metastore
c. Remote metastore
d. Metastore service
Explanation: Local metastore is the setup of Hive that allows multiple users to access Hive via CLI or Hue’s Hive
interface.
QUIZ Which Apache service can be used to capture unstructured, incremental, and user
interaction data?
4
a. Apache Kafka
b. Apache Spark
c. Apache Hive
d. Apache HBase
QUIZ Which Apache service can be used to capture unstructured, incremental, and user
interaction data?
4
a. Apache Kafka
b. Apache Spark
c. Apache Hive
d. Apache HBase
Explanation: Apache HBase can be used to capture unstructured, incremental, and user interaction data.
QUIZ 5
What are the three main components of Apache HBase that enable the working of HBase in a Hadoop cluster?
Explanation: The three main components of Apache HBase that enable the working of HBase in a Hadoop cluster are HMaster, HRegionServer, and ZooKeeper.
QUIZ
How many regions can a regionserver serve?
6
a. 10000
b. Any number
c. 1000
a. 10000
b. Any number
c. 1000
a. Zookeeper
b. HMaster
c. Zookeeper Quorum
d. Ephemeral Nodes
QUIZ
Which of the following maintains the information about server states in a cluster and
acts as a distribution coordination service?
7
a. Zookeeper
b. HMaster
c. Zookeeper Quorum
d. Ephemeral Nodes
Explanation: Zookeeper Quorum maintains the information about server states in a cluster and acts as a distribution
coordination service.
QUIZ
In Apache Kafka, what are the processes that subscribe to topics and process the
messages?
8
a. Kafka brokers
b. Partitions
c. Consumers
d. Kaf logs
QUIZ
In Apache Kafka, what are the processes that subscribe to topics and process the
messages?
8
a. Kafka brokers
b. Partitions
c. Consumers
d. Kaf logs
Explanation: Consumers are the processes that subscribe to topics and process the messages.
QUIZ
Which server handles read write requests for a partition within a kafka cluster?
9
a. Follower
b. Leader
c. Producer
d. Consumer
QUIZ
Which server handles read write requests for a partition within a kafka cluster?
9
a. Follower
b. Leader
c. Producer
d. Consumer
Explanation: The leader server handles read/write requests for a partition within a Kafka cluster.
QUIZ
Which component of the Apache Spark stack can be used for hypothesis testing,
regression analysis, classification, and principal component analysis?
10
a. Dataframe API
b. MLlib
c. Spark Streaming
d. Spark core
QUIZ
Which component of the Apache Spark stack can be used for hypothesis testing,
regression analysis, classification, and principal component analysis?
10
a. Dataframe API
b. MLlib
c. Spark Streaming
d. Spark core
Explanation: MLlib is the component of the Apache Spark stack that can be used for hypothesis testing, regression analysis, classification, and principal component analysis.
Several services or open-source components work within the Hadoop ecosystem.
These include Apache Hive, Apache Pig, Impala, HBase, Apache Kafka, and Apache
Spark.
Apache Hive is a data warehouse infrastructure built on top of Hadoop to
provision data summarization, query, and analysis.
HBase is a service that is built on top of Hadoop and Zookeeper.
It is also called Hadoop Database.
Kafka is a fast, scalable, and durable distributed messaging system. It follows the publish-subscribe messaging pattern.
Disclaimer: All the logos used in this course belong to the respective organizations
Big Data and Hadoop Administrator
Lesson 12—Hadoop Security
Describe the different ways to avoid risks and secure data
The Hadoop ecosystem and its components, along with its processing frameworks, allow you to store data and process it in new and exciting ways.
Risks include:
• Damage to business continuity plans
• Regulatory infractions
• Data integrity compromise
• Damage to corporate image and shareholder value
CIA Model
Confidentiality, Integrity, and Availability
AAA: Authentication, Authorization, and Accounting
Authentication, Authorization, and Accounting
Identity and identification method refers to the process in the system that distinguishes between
different entities, users, and services and allows or disallows the user to access the data.
Pillars of Enterprise Security
• Authentication
• Audit
Securing Distributed Systems
Insider threat
The attack comes from within the business, from regular users such as employees, contractors, or consultants.
Threat Categories
• Unauthorized access or masquerade: a masquerade attack refers to an event where an invalid user presents himself or herself as a valid user by obtaining valid credentials.
• Denial of service: a situation where one or more clients are unable to access a service.
Risks
Each Hadoop ecosystem component has services and each service has roles running on different nodes.
Master Nodes
Master nodes are the most important nodes of the cluster; therefore, they have a strict security policy to protect them. Only administrators are allowed to access the master nodes. The reasons for this limitation are as follows:
• Avoid any chance of resource contention
• Avoid security vulnerabilities
Worker Nodes
Worker nodes handle the bulk of the functions of a Hadoop cluster, including the storing and processing of data. Only administrators are allowed to access the worker nodes. The reasons for this limitation are as follows:
• Avoids resource contention and skew in resource management
• Avoids worker role skew in behavior
Management Nodes
SELinux implements security at kernel level and provides Linux kernel enhancements.
In a CDH setup, this has to be disabled on every node of the cluster.
• Permissive: SELinux is enabled but does not protect the system.
• Enabled: SELinux protects the system based on the specified SELinux policy.
Kerberos
By default, Hadoop does not question or verify the identity of the user
accessing the cluster.
By default, everyone has read access to the cluster and the petabytes of the
data that it stores.
In the case of large clusters, managing access to the cluster at user, group, or
data level is not enough to protect it.
Someone has to verify the identity of the user or the service before the cluster and its data are harmed.
Kerberos Internals
Key Kerberos concepts include: the Key Distribution Center (KDC), the Kerberos server, the Kerberos client, the KDC admin account, realms, principals, keytabs, and tickets.
Implementing Kerberos in CDH
Without Kerberos enabled, Hadoop only examines a user and his or her
group membership to verify if he or she is allowed to access HDFS.
A MapReduce cluster can use this mechanism to allow a configured list of users/groups to submit jobs.
Service Level Authorization is disabled by default.
To enable it, edit $HADOOP_CONF_DIR/core-site.xml and set the property hadoop.security.authorization to true.
Configuration Properties
Property: security.client.protocol.acl
Service: ACL for ClientProtocol, which is used by user code via the DistributedFileSystem.
Property: security.job.submission.protocol.acl
Service: ACL for JobSubmissionProtocol, used by job clients to communicate with the JobTracker for job submission, querying job status, etc.
The commands used to implement ACLs are the following:
The commands used to interact with ACLs are the following:
ACLs can be enabled for YARN processing and used to control who can act as the administrator of the YARN cluster or submit jobs to the YARN cluster and configured queues.
a. Confidentiality
b. Integrity
c. Availability
d. Identity
QUIZ 2
Which component of the CIA model ensures that information remains unchanged and uncompromised?
a. Confidentiality
b. Integrity
c. Availability
d. Identity
a. Authorization
b. Data protection
c. Administration
d. Authentication
QUIZ 3
Which pillar of enterprise security constitutes provisioning access to data?
a. Authorization
b. Data protection
c. Administration
d. Authentication
a. Denial of Service
b. Unauthorized access
c. Masquerade
d. Insider threat
QUIZ 4
Which category of threat is most dangerous and arises when an unauthorized user has access to data via some unknown authorized user?
a. Denial of Service
b. Unauthorized access
c. Masquerade
d. Insider threat
a. Permissive
b. Enabled
c. Disabled
d. Blocked
QUIZ 5
What is the preferred status of SELinux in CDH to implement security at the kernel level?
a. Permissive
b. Enabled
c. Disabled
d. Blocked
b. Kerberos client
c. Principal
d. Authentication Server
QUIZ 6
What is the trusted source for authentication in a Kerberos-enabled environment called?
b. Kerberos client
c. Principal
d. Authentication Server
b. Kerberos database
c. Kerberos server
d. Authentication Server
QUIZ 7
In a Kerberos-enabled environment, who takes care of initial authentication and issues a TGT (Ticket Granting Ticket)?
b. Kerberos database
c. Kerberos server
d. Authentication Server
a. Realm
b. Admin principal
c. Keytab
d. Ticket
QUIZ 8
What is the file that contains the resource principal's authentication credentials called?
a. Realm
b. Admin principal
c. Keytab
d. Ticket
a. Enabled
b. Disabled
c. Inactive
d. Active
QUIZ 9
What is the default status of Service Level Authorization in any Hadoop cluster?
a. Enabled
b. Disabled
c. Inactive
d. Active
b. By using Quotas
c. By enabling ACLs
b. By using Quotas
c. By enabling ACLs
Disclaimer: All the logos used in this course belong to the respective organizations
Describe cluster monitoring
Organizations monitor clusters and their various components.
Considerations for monitoring tools include scalability, flexibility, extensibility, and zero configuration.
Hadoop Performance Monitoring Tools: Features
• Consolidated monitoring across technologies
• Notifications and alerts
• Multi-cluster support and reporting
• Custom views
Cluster Monitoring
Cluster monitoring spans task performance troubleshooting and application metrics management; self-service tools provide service-level information on what has failed.
Categorizing Monitoring Solutions and Monitoring
Monitoring systems can be categorized by how metrics are collected and how metrics are used.
Monitoring itself falls into two categories: health monitoring and performance monitoring.
Monitoring Examples
Nodes
Cloudera Manager for Monitoring: Capabilities
• Metrics are simple numeric values, and Cloudera Manager collects them.
• Charts help users to query and explore the metrics being collected.
• Health checks: the Service Monitor helps Cloudera Manager evaluate health checks for every entity in the system, check for disk space on every node, check for a successful checkpoint or connectivity of a DataNode with the NameNode, and project the status of a service based on the health checks done on its underlying daemons.
Sample Yarn Metrics
Hadoop Cluster Monitoring
Metrics
Hadoop Metrics Details
Disabling Metrics
# hadoop-metrics.properties
jvm.class = org.apache.hadoop.metrics.spi.NullContext
dfs.class = org.apache.hadoop.metrics.spi.NullContext
Writing Metrics
# hadoop-metrics.properties
jvm.class = org.apache.hadoop.metrics.file.FileContext
jvm.period = 10
jvm.fileName = /tmp/jvm-metrics.log
dfs.class = org.apache.hadoop.metrics.file.FileContext
dfs.period = 10
dfs.fileName = /tmp/dfs-metrics.log
Handling Metrics Data Using Plug-ins (Contd.)
Ganglia plug-in classes:
org.apache.hadoop.metrics.ganglia.GangliaContext
org.apache.hadoop.metrics.ganglia.GangliaContext31
Handling Metrics Data Using Plug-ins (Contd.)
1. gmond collects metrics locally.
2. gmond relays the data to a central gmetad process.
3. gmetad records the data in a series of RRD files.
4. A PHP web application or Apache web server displays the data.
Handling Metrics Data Using Plug-ins (Contd.)
# hadoop-metrics.properties (sample)
jvm.class = org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period = 10
jvm.servers = 10.0.0.xxx
# The server value may be a comma separated list of host:port pairs.
# The port is optional, in which case it defaults to 8649.
# jvm.servers = gmond-host-a, gmond-host-b:8649
dfs.class = org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period = 10
dfs.servers = 10.0.0.xxx
Hadoop Metrics
Health Monitoring
• Which metrics represent the health of the monitored services?
• What thresholds are set to indicate issues and generate alerts in alignment with cluster usage and growth?
Hadoop Metrics: Categories
HDFS
Metrics
YARN
Metrics
HDFS Metrics
HDFS metrics include NameNode-emitted metrics, NameNode JVM metrics, and DataNode metrics.
NameNode Metrics
Metrics emitted by the NameNode include CapacityRemaining, MissingBlocks, NumDeadDataNodes, FilesTotal, and TotalLoad.
CapacityRemaining
Any running jobs that write out temporary data may fail due to lack of capacity.
It is a good practice to ensure that disk use never exceeds 80 percent capacity.
MissingBlocks
When a client reads a block, it locates the requested block and verifies its checksum. If the checksum does not match, the client reports the corruption; the NameNode, meanwhile, schedules a re-replication of the block from one of the healthy copies.
MissingBlocks
A missing block cannot be recovered by copying a replica. If a series of DataNodes were taken offline for maintenance, missing blocks may be reported until they are brought back up.
NumDeadDataNodes
FilesTotal
FilesTotal is a running count of the number of files being tracked by the NameNode. The NameNode stores all metadata in memory.
Excessive pauses during garbage collection can be fixed by upgrading the JDK version or garbage collector.
Additionally, Java runtime can be tuned to minimize garbage collection.
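These NameNode metrics can also be read programmatically from the NameNode's /jmx servlet, which exposes the FSNamesystem bean with attributes such as CapacityRemaining, MissingBlocks, and FilesTotal. The sketch below is illustrative: the host and port (9870 on Hadoop 3.x, commonly 50070 on Hadoop 2.x) and the alert threshold are assumptions.

# namenode_jmx_sketch.py: reading NameNode health metrics from the /jmx servlet (placeholder host)
import json
import urllib.request

NAMENODE = "http://namenode-host:9870"   # assumed NameNode web address
QUERY = "Hadoop:service=NameNode,name=FSNamesystem"

with urllib.request.urlopen(f"{NAMENODE}/jmx?qry={QUERY}") as response:
    beans = json.load(response)["beans"][0]

print("CapacityRemaining (TB):", round(beans["CapacityRemaining"] / 1024 ** 4, 2))
print("MissingBlocks:", beans["MissingBlocks"])
print("FilesTotal:", beans["FilesTotal"])

# A simple check in the spirit of the health monitoring described above (assumed threshold)
if beans["MissingBlocks"] > 0:
    print("ALERT: missing blocks detected")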
DataNode Metrics
Metrics emitted by the DataNode.
Counters
Counters include file system counters and custom counters; they can be viewed in terminals and in the ResourceManager and NodeManager web UIs.
MapReduce Counters
• MILLIS_MAPS / MILLIS_REDUCES
• NUM_FAILED_MAPS / NUM_FAILED_REDUCES
• DATA_LOCAL_MAPS / RACK_LOCAL_MAPS / OTHER_LOCAL_MAPS
• REDUCE_INPUT_RECORDS
Task Counters
• GC_TIME_MILLIS
YARN Metrics
YARN metrics fall into three groups: cluster metrics, application metrics, and NodeManager metrics.
Cluster Metrics
• appsFailed: number of failed applications (Work: Error)
• totalMB/allocatedMB: total amount of memory / amount of memory allocated (Resource: Utilization)
Application Metrics
Progress provides a real-time window into the execution of a YARN application. The reported value will always be in the range of zero to one (inclusive).
NodeManager Metrics
If High Availability is enabled, monitoring ZooKeeper metrics such as zk_avg_latency and zk_num_alive_connections can be beneficial.
Monitoring Hadoop Cluster
The Services page shows the current status of a service. An earlier status of a service can be seen by adjusting the Time Marker in the Cloudera admin interface.
Demonstration 1:
Monitoring Your CDH Cluster
Demonstration 2:
Monitoring Your CDH Cluster - 2
Quiz
QUIZ
What are the two distinct categories of monitoring?
1
a. Host monitor
b. Service monitor
c. Activity monitor
d. Reports manager
QUIZ 2
Which service of Cloudera Manager helps in collecting information pertaining to activities running on the cluster, and viewing current and historical activity?
a. Host monitor
b. Service monitor
c. Activity monitor
d. Reports manager
a. Host monitor
b. Service monitor
c. Cloudera-scm-server
d. Cloudera-scm-agents
QUIZ
Which service of Cloudera manager does the most metric collection?
3
a. Host monitor
b. Service monitor
c. Cloudera-scm-server
d. Cloudera-scm-agents
a. Core-site.xml
b. Hadoop-policy.xml
c. Hadoop-metrics.properties
d. Hdfs-site.xml
QUIZ 5
Which Hadoop configuration file is updated to enable plug-ins for metrics collection?
a. Core-site.xml
b. Hadoop-policy.xml
c. Hadoop-metrics.properties
d. Hdfs-site.xml
b. GangliaContext& GangliaContext31
c. Gmond&Gmetad
b. GangliaContext& GangliaContext31
c. Gmond&Gmetad
a. Application metrics
b. NodeManager metrics
a. Application metrics
b. NodeManager metrics
d. GC_time_millis
QUIZ 10
Which metrics under the MapReduce category track the time spent across all map and reduce tasks?
d. GC_time_millis
Disclaimer: All the logos used in this course belong to the respective organizations