Hadoop Basics with IBM BigInsights
Contents
LAB 2 HADOOP ADMINISTRATION
    2.1 MANAGING A HADOOP CLUSTER BY ADDING/REMOVING NODES
        2.1.1 PREPARE YOUR ENVIRONMENT
        2.1.2 SETTING UP SSH
        2.1.3 WORKING WITH THE AMBARI WEB CONSOLE
    2.2 MANAGING A HADOOP CLUSTER
        2.2.1 VISUAL HEALTH CHECK OF A CLUSTER USING THE AMBARI WEB CONSOLE
        2.2.2 DFS DISK CHECK USING A TERMINAL WINDOW
    2.3 HADOOP ADMINISTRATION
        2.3.1 ADMINISTERING SPECIFIC SERVICES
        2.3.2 CONFIGURING HADOOP DEFAULT SETTINGS
        2.3.3 INCREASING STORAGE BLOCK SIZE
        2.3.4 CONFIGURING THE REPLICATION FACTOR
        2.3.5 LIMIT DATANODE DISK USAGE
    2.4 SUMMARY
IBM Software
Lab 2 Hadoop administration
                        Username   Password
VM image setup screen   root       password
Linux                   virtuser   password
For this lab, all Hadoop components should be up and running. If they are, you may
move on to Section 2 of this lab. Otherwise, refer to Hadoop Basics Unit 1: Exploring Hadoop
Distributed File System, Section 1.1, to start them.
Page 4 Unit 4
2.1 Managing a Hadoop cluster by adding/removing nodes
2.1.1 Prepare your environment
So far you have been working with a single-node cluster. To add a second node to the cluster, you
would need a second VMware image. For clarity, the existing image will be referred to as the
master image and the second one as the child image.
That second copy of your QSE VMware image would need to be stored in a different directory. Boot it
and go through the same process of accepting the licenses that you did for the master image. Specify the
same passwords for root and virtuser on the child image as you did on the master image.
The users on all the cluster nodes need to have the same logins and ID numbers. Thus the child image
would need a username of virtuser.
A node can be added to a BigInsights cluster only if BigInsights is not already installed on it. Thus
BigInsights would need to be uninstalled on the child image.
Child image
The hostname and IP address on the child image have to be different from those on the master image
(since the child is a direct copy, it initially has the same hostname and IP address as the master).
You also need to update the /etc/hosts file so the child image can communicate with the master
image. These operations are done in a terminal window (right-click the desktop and click Open in
Terminal).
The procedure would be: switch to root (su -), edit /etc/HOSTNAME (gedit /etc/HOSTNAME),
append “2” to the first part of the hostname (e.g., change rvm.svl.ibm.com to rvm2.svl.ibm.com),
then save your work and close the editor. This change, however, does not take effect until the next
reboot.
To change the hostname immediately (the change lasts until the next reboot), you would execute
hostname rvm2.svl.ibm.com and then verify the result by executing hostname again, this time
without an argument.
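The renaming rule above (append “2” to the first part of the fully qualified hostname) can be sketched as a small shell helper. The function name child_hostname is hypothetical and not part of the lab; it only prints the new name and changes nothing on the system.

```shell
#!/bin/sh
# Hypothetical helper (not part of the lab): build the child hostname by
# appending a suffix to the first label of a fully qualified hostname.
child_hostname() {
    fqdn=$1
    suffix=$2
    case $fqdn in
        *.*) printf '%s%s.%s\n' "${fqdn%%.*}" "$suffix" "${fqdn#*.}" ;;
        *)   printf '%s%s\n' "$fqdn" "$suffix" ;;   # name without a domain part
    esac
}

child_hostname rvm.svl.ibm.com 2   # prints rvm2.svl.ibm.com
```

As root on the child image, the printed name is what would be written to /etc/HOSTNAME and passed to the hostname command.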
Get the IP address of the master image. On the master image, right-click the desktop and select Open in
Terminal. Then, from the command line, execute:
su -
ifconfig
On the child image, edit the /etc/hosts file (gedit /etc/hosts). Change the child's own entry to its new
hostname (rvm2.svl.ibm.com in this example), then add the IP address and hostname of the master. Save
your work and close the editor.
The following /etc/hosts file is an example. In this case, the IP address of the master was 192.168.70.202
and the IP address of the child was 192.168.70.201.
# File is generated from /home/virtuser/setHostname.sh
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
192.168.70.202 rvm.svl.ibm.com master
192.168.70.201 rvm2.svl.ibm.com second
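One way to confirm that a hosts file maps each name to the expected address is a small lookup helper. This is a sketch only: the function name hosts_lookup is hypothetical, and the scratch file simply repeats the example entries shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the IP address that a hosts file maps to a
# given hostname or alias. Comment lines are skipped.
hosts_lookup() {
    file=$1
    name=$2
    awk -v name="$name" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$file"
}

# Try it on a scratch copy of the example file from this lab.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
192.168.70.202 rvm.svl.ibm.com master
192.168.70.201 rvm2.svl.ibm.com second
EOF
hosts_lookup "$tmp" rvm2.svl.ibm.com   # prints 192.168.70.201
hosts_lookup "$tmp" master             # prints 192.168.70.202
rm -f "$tmp"
```

Run against /etc/hosts on both images, the helper should print the same address for the same name. The standard command getent hosts rvm2.svl.ibm.com performs a similar check through the system resolver.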
Master image
On the master image, switch user to root. Then edit /etc/hosts and add the hostname and IP address of
the child image, exactly as they appear in /etc/hosts on the child image. Save your work and close the
editor.
Hostname and IP address resolution must be the same on all members of a cluster, no matter what its
size.
One of the key parts of managing a Hadoop cluster is being able to scale the cluster with ease, adding and
removing nodes as needed. A node can be added in several ways; this lab covers two of them: adding from
a BigInsights console and adding from a terminal. Both methods achieve the same result.
Again, we are not executing these actions. The description here only outlines how a second (or later)
node would be set up in a cluster.
2.1.2 Setting up SSH
Before adding a node, you should first verify that you can access the node you are trying
to add. This can be done by simply “sshing” (pronounced ess-ess-aitch-ing) to the given node(s) as
follows.
On the master image, open a terminal window by right-clicking the desktop and selecting Open in
Terminal.
Type the following ssh command to make sure that you have connectivity between the master and the
child images: ssh root@rvm2.svl.ibm.com
When you connect by ssh to a host for the first time, you will typically get an authenticity message:
The authenticity of host 'rvm2.svl.ibm.com (192.168.70.201)' can't be established.
RSA key fingerprint is 29:2f:72:9f:f4:97:16:89:cf:d9:cc:09:d3:16:d9:bf.
Are you sure you want to continue connecting (yes/no)?
2.1.3 Working with the Ambari Web Console
One of the great features of IBM BigInsights 4.0 is the Ambari Web Console. The web console is a
user-friendly way of performing the tasks associated with administering Hadoop and other services.
All of the following steps are done on the Master node. The BigInsights services must be started.
__1. Sign in to your VMware Image (virtuser / password).
__2. Start the Ambari Web Console:
● Start the Firefox browser
● Use the URL: localhost:8080
● Sign in to Ambari (admin / admin)
__3. When you are signed in, you will see the following Dashboard
page. If a service on the left-hand side shows a red triangle with an
exclamation point, that service is not currently running;
if it shows a green circle with a check mark, the
service is running.
To start all services, click Actions at the bottom of the left-hand side, and then Start All:
Once all components have started successfully as shown on the Ambari Web Console, you can
proceed with the following actions using the Ambari Web Console.
__4. You will now see the Ambari Dashboard. To add or
delete hosts from the Ambari Web Console, select
the Hosts tab.
__5. On the Hosts tab, select Add New Node from the Actions drop-down on the left.
__6. This opens the form where you fill in the
information for adding hosts to your cluster:
Since we do not have another host to add to our cluster for this lab, we will not actually perform this
work. Instead, review the meaning of these individual steps at:
● https://ambari.apache.org/1.2.3/installing-hadoop-using-ambari/content/ambari-chap3.html
For this review, expand the “+” of the explorer-type listing “3. Installing, Configuring, and
Deploying the Cluster” that you will find on the left side of this documentation.
Also review how this can be done on the command line in a terminal window, as described on
the Ambari Wiki. The relevant command-line access to the Ambari API uses curl:
● https://cwiki.apache.org/confluence/display/AMBARI/Add+a+host+and+deploy+components+using+APIs
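To give a feel for the curl-based approach, the sketch below only prints the kind of requests involved and sends nothing. The cluster name mycluster is a placeholder, and the admin / admin credentials and localhost:8080 address are taken from the sign-in steps earlier in this lab. It assumes the new host already runs an Ambari agent that has registered with the server; refer to the wiki page above for the complete sequence, including deploying components.

```shell
#!/bin/sh
# Sketch only: print (do not run) curl calls against the Ambari REST API.
# AMBARI_URL and the cluster name used below are assumptions.
AMBARI_URL=${AMBARI_URL:-http://localhost:8080}

ambari_host_url() {
    # $1 = cluster name, $2 = fully qualified hostname
    printf '%s/api/v1/clusters/%s/hosts/%s\n' "$AMBARI_URL" "$1" "$2"
}

# List the hosts that have registered with the Ambari server.
echo "curl -u admin:admin $AMBARI_URL/api/v1/hosts"

# Add a registered host to a cluster. Ambari expects the X-Requested-By
# header on requests that modify state.
echo "curl -u admin:admin -H 'X-Requested-By: ambari' -X POST $(ambari_host_url mycluster rvm2.svl.ibm.com)"
```

Printing the commands first makes it easy to check the URL and cluster name before anything is changed on a real cluster.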
The remaining parts of this Hands-on Lab can be adequately done on a single-node cluster and
thus the following steps should be done in full.
2.2 Managing a Hadoop cluster
2.2.1 Visual health check of a cluster using the Ambari Web Console
Servers, machines, and disk drives are all prone to physical failure over time. When running a large
cluster with dozens of nodes, it is crucial to check the health of the hardware regularly and
take appropriate action when necessary. BigInsights v4 provides a quick and simple way to perform
these types of health checks on a cluster.
You can visually check the status of your cluster by following these simple steps, which require a
login on the Ambari Web Console:
__7. Open the Ambari Web Console, and click on the Dashboard Tab:
On the left-side check the list of services. If any service shows a red-triangle with an exclamation
point, that service is not running.
__8. On the left-hand panel, click on HDFS. For a healthy disk system, you should see:
2.2.2 DFS disk check using a terminal window
There are various ways to monitor DFS disk usage, and this should be done from time to time to avoid the
problems that can arise when little disk space remains. One such problem can occur when the “hadoop
healthcheck” (also referred to as the heartbeat) detects that a node has gone offline. If a node is offline
for a certain period of time, the data that it was storing will be re-replicated to other nodes (with
a replication factor of 3, the data is still available on two other nodes). If disk space is limited,
this can quickly cause an issue.
Some of these commands require that you are logged in as the HDFS administrator, hdfs. Since you do not
currently know the password for hdfs, but you do know the root password, you can log in as hdfs by first
switching to root.
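The commands below are common HDFS disk checks, offered as an assumption about what such a check can look like rather than as the lab's own steps. The helper name dfs_used_percent is hypothetical, and it assumes that the report contains a line of the form "DFS Used%: 35.00%". The cluster commands run only if the hdfs command is available.

```shell
#!/bin/sh
# Sketch of a DFS disk check. On the lab image, first become hdfs via root:
#   su -          (enter the root password)
#   su - hdfs     (no password is asked when coming from root)

# Extract the first "DFS Used%" figure (the cluster-wide one) from the
# text of a dfsadmin report read on standard input.
dfs_used_percent() {
    awk -F': *' '/^DFS Used%/ { sub(/%$/, "", $2); print $2; exit }'
}

if command -v hdfs >/dev/null 2>&1; then
    hdfs dfsadmin -report | dfs_used_percent   # cluster-wide usage in percent
    hdfs dfs -df -h /                          # capacity, used, and available space
    hdfs fsck /                                # block health summary
else
    echo "hdfs command not found; run this on a cluster node" >&2
fi
```

A rising percentage over successive checks is the signal to free space or add DataNodes before re-replication after a node failure fills the remaining disks.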
2.3 Hadoop administration
2.3.1 Administering specific services
A single node can have a wide variety of services running at any given time, as seen in the screenshot
below. Depending on your system and needs, it may not always be necessary to have all of the services
running, since the more services that are running, the more resources and computing power they
consume.
In the Ambari Web Console, on the Hosts tab, there is a list of the hosts in the cluster. Here we have just
one (rvm.svl.ibm.com), as listed in the first (left) column. In the right-hand column, you can see that 31
components are running. By clicking on 31 Components, you will get a list of the actual components
(highlighted for this illustration with a vertical bar and arrow):
Stopping specific services can be done easily through the Ambari Web Console. We are not going to stop
any services in this Hands-On Lab, but this is where you can find what is running.
2.3.2 Configuring Hadoop default settings
__11. The configuration settings for Hadoop and the various services are kept in a number of
files whose names end with -site.xml. You can easily find these files by doing a
search as root:
su -
find /etc -name "*-site.xml"
Certain attributes are inherited from Apache Hadoop, and some have been changed to
improve performance. One such attribute is the default block size used for storing large files.
Consider the following short example. You have a 1GB file on a cluster with a replication factor of 3.
With a block size of 128MB, this file is split into 8 blocks, and each block is replicated 3 times, for a
total of 24 stored block replicas, which are then distributed across the Hadoop cluster under the
direction of the master node. Increasing or decreasing the block size has implications that depend on
the use case; this lab does not cover those Hadoop-specific questions, but rather shows how to change
these default values.
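The arithmetic in this example can be checked with a short shell sketch. The function names are hypothetical, and sizes are given in whole megabytes for simplicity.

```shell
#!/bin/sh
# Number of HDFS blocks for a file: ceiling(file size / block size).
blocks_for() {
    # $1 = file size in MB, $2 = block size in MB
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Total number of block replicas stored in the cluster.
replicas_for() {
    # $1 = file size in MB, $2 = block size in MB, $3 = replication factor
    echo $(( $(blocks_for "$1" "$2") * $3 ))
}

blocks_for 1024 128       # prints 8
replicas_for 1024 128 3   # prints 24
blocks_for 1024 64        # prints 16
blocks_for 1000 128       # prints 8 (the last block is only partly filled)
```

Halving the block size doubles the number of blocks that the NameNode has to track for the same file, which is one reason larger blocks are preferred for large files.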
2.3.3 Increasing storage block size
Hadoop uses a standard block storage system to store the data across its data nodes. Since block size is
slightly more of an advanced topic, we will not cover the specifics as to what and why the data is stored as
blocks throughout the cluster.
The default block size value for IBM BigInsights 4.0 is currently set at 128MB (as opposed to the Hadoop
default of 64MB in versions of Hadoop prior to Hadoop 2). If your specific use-case requires you to
change this, it can be easily modified through Hadoop configuration files.
When making changes to the Hadoop core configuration, it is good practice (and in most cases a
requirement) to stop the services you are changing before you make the changes. For the block size,
you must stop the “Hadoop” and “Console” services before proceeding, if you have not done so in the
previous steps, and restart them after you have made the changes.
__12. Move to the directory where Hadoop staging configuration files are stored. In this directory, you
will see a file named “hdfs-site.xml”, one of the site-specific configuration files, which is on
every host in your cluster. Edit the file with gedit:
cd /etc/hadoop/conf.empty
gedit hdfs-site.xml
__13. Navigate to the property dfs.blocksize (use Search in the toolbar of gedit), and you will see the
value is set to 128MB, the default block size for BigInsights. For the purpose of this lab, we will
not change the value.
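The same value can also be read without opening an editor. The sketch below uses a hypothetical helper, site_property, on a scratch file that imitates the usual layout of hdfs-site.xml; it is not a general XML parser. It also assumes that the block size is stored as a plain number of bytes, in which case 134217728 corresponds to 128MB.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> of a named property from a
# Hadoop *-site.xml file. Assumes the usual layout, in which <value>
# follows <name> on the same line or on a later line.
site_property() {
    file=$1
    key=$2
    awk -v key="$key" '
        index($0, "<name>" key "</name>") { found = 1 }
        found && /<value>/ {
            gsub(/.*<value>|<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}

# Scratch file with illustrative values (not copied from the lab image).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF
bytes=$(site_property "$tmp" dfs.blocksize)
echo "$bytes bytes = $(( bytes / 1024 / 1024 )) MB"   # 134217728 bytes = 128 MB
rm -f "$tmp"
```

On a cluster node, hdfs getconf -confKey dfs.blocksize prints the value that Hadoop actually resolves, which is a useful cross-check against the file.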
2.3.4 Configuring the replication factor
__14. Navigate to the property dfs.replication. The current default replication factor depends on the
number of DataNodes that you have in your cluster and on the way that the cluster has been set
up. If you have only one node, then realistically the value is 1. If you have two DataNodes, then a
value of 2 makes sense. For three or more DataNodes, the standard default
value is 3. You can change the default by setting the value of the dfs.replication property in
this file (hdfs-site.xml) to the number of your choice.
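The rule of thumb in this step, and the shape of the property element it refers to, can be sketched as follows. Both function names are hypothetical, and the printed fragment only illustrates the usual name/value layout of a *-site.xml property; nothing is written to any file.

```shell
#!/bin/sh
# Rule of thumb from this step: use the number of DataNodes, capped at 3.
suggested_replication() {
    # $1 = number of DataNodes in the cluster
    if [ "$1" -lt 3 ]; then
        echo "$1"
    else
        echo 3
    fi
}

# Print a Hadoop <property> element in the layout used by *-site.xml files.
site_property_xml() {
    # $1 = property name, $2 = property value
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Single-node lab cluster: one DataNode, so the suggested factor is 1.
site_property_xml dfs.replication "$(suggested_replication 1)"
```

Keep in mind that this default applies to files written after the change; files that already exist keep the replication factor they were written with.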
2.3.5 Limit DataNode disk usage
__15. Navigate to the property named dfs.datanode.du.reserved. This value represents reserved space in
bytes per volume. HDFS will always leave this much space free for non-DFS use.
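Because the value is expressed in bytes, it is easy to be off by a factor of 1024 when typing it in. A minimal sketch of the conversion follows; the function name and the 10 GiB figure are only illustrations, not recommended settings.

```shell
#!/bin/sh
# dfs.datanode.du.reserved is given in bytes per volume.
# Hypothetical helper: convert a size in GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 10   # prints 10737418240
```

With that value in place, a DataNode would stop placing blocks on a volume once less than 10 GiB of free space remains on it.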
For the purpose of this lab, we will not save any configuration changes. This part of the Hands-On Lab is
intended just to let you see where some of the configuration values are set, for when you need to change
them later on.
However, in real life, once you have made changes to this file, you would then synchronize the file
across all appropriate nodes in the cluster.
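One simple way to picture the synchronization step is a loop that copies the edited file to each of the other nodes. The sketch below only prints the scp commands; the function name, the file path, and the host list are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: print the scp commands that would copy an edited configuration
# file to other nodes. The host list is hypothetical. Remove the echo in
# sync_config to actually copy (this needs SSH access as a privileged user).
sync_config() {
    file=$1
    shift
    for host in "$@"; do
        echo scp "$file" "root@$host:$file"
    done
}

sync_config /etc/hadoop/conf.empty/hdfs-site.xml rvm2.svl.ibm.com
# prints: scp /etc/hadoop/conf.empty/hdfs-site.xml root@rvm2.svl.ibm.com:/etc/hadoop/conf.empty/hdfs-site.xml
```

On a cluster managed through Ambari, configuration is normally distributed by Ambari itself, so manually copied files may be overwritten; treat this loop as an illustration of the idea rather than a recommended procedure.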
Configuration can also be done through the Ambari Web Console, but that is beyond the scope and time
allotted for this Lab.
For more information on the configuration files:
● https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
● Core: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
● HDFS: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
● MapRed: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
● YARN: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
2.4 Summary
Congratulations! You have now experienced some common Hadoop administration tasks.
© Copyright IBM Corporation 2015.