BDCC - 3 - Setting up Hadoop
Apache Hadoop can be set up in the following three different
configurations:
* Developer mode: Developer mode can be used to run programs in a standalone
manner. This arrangement does not require any Hadoop daemons, and jars can
run directly. This mode is useful if developers wish to debug their MapReduce code.
* Pseudo cluster (single node Hadoop): A pseudo cluster is a single node cluster that
has similar capabilities to that of a standard cluster; it is also used for the development
and testing of programs before they are deployed on a production cluster. Pseudo
clusters provide an independent environment for all developers for coding and testing.
* Cluster mode: This mode is the real Hadoop cluster where you will set up multiple
nodes of Hadoop across your production environment. You should use it to solve all of
your big data problems.
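For example, in developer (standalone) mode a MapReduce jar runs as a single local JVM process against the local file system, with no daemons involved. A minimal sketch, assuming the Hadoop binaries are on the PATH and a local directory named input/ exists (both assumptions, not part of the original text):
# run the bundled example jar locally; input/ and output/ are plain local directories
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount input/ output/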
Best practices for Hadoop deployment
+ Start small: Like other software projects, a Hadoop implementation also involves risks and
costs. It's always better to set up a small Hadoop cluster of four nodes. This small cluster can be
set up as a proof of concept (POC). Before using any new Hadoop component, it can be added to the
existing Hadoop POC cluster as a proof of technology (POT). This allows the infrastructure and
development teams to understand big data project requirements. After successful completion of
the POC and POT, additional nodes can be added to the existing cluster.
+ Hadoop cluster monitoring: Proper monitoring of the NameNode and all DataNodes is
required to understand the health of the cluster. It helps in taking corrective actions in the event
of node problems. If a service goes down, timely action can help avoid big problems in the
future. Ganglia and Nagios are popular choices for configuring alerts and monitoring.
For a Hortonworks cluster, Ambari monitoring, and for a Cloudera (CDH) cluster, Cloudera
Manager monitoring, can be set up easily.
+ Automated deployment: Use of tools like Puppet or Chef is essential for Hadoop deployment.
It becomes super easy and productive to deploy the Hadoop cluster with automated tools
instead of manual deployment. Give importance to data analysis and data processing using
available tools/components. Give preference to using Hive or Pig scripts for problem solving
rather than writing heavy, custom MapReduce code. The goal should be to develop less and
analyze more.
+ High availability: In the event of a failure or crash, the system should be able to recover itself or fail over to another data center/site.
+ Security: Data needs to be protected by creating users and groups, and mapping users to the
groups. Each user group should be locked down by setting appropriate permissions and
enforcing strong passwords (see the sketch after this list).
+ Data protection: The identification of sensitive data is critical before moving it to the Hadoop
cluster. It's very important to understand privacy policies and government regulations for the
better identification and mitigation of compliance exposure risks.
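As a sketch of the kind of lockdown described in the security bullet above (the user, group, and directory names are hypothetical examples, not from the original text):
# restrict an HDFS directory to a specific user and group
hdfs dfs -mkdir -p /data/finance
hdfs dfs -chown finuser:fingroup /data/finance
hdfs dfs -chmod 750 /data/finance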
Batch processing
* Very efficient in processing a high volume of data.
* All data processing steps (that is, data collection, data ingestion, data processing, and results
presentation) are done as one single batch job.
+ Throughput carries more importance than latency. Latency is always more than a single
minute.
+ Throughput directly depends on the size of the data and available computational system
resources.
+ Available tools include Apache Sqoop, MapReduce jobs, Spark jobs, the Hadoop DistCp utility, and
so on.
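For example, batch copying and ingestion with two of the tools above might look like this (the hostnames, database, table, and paths are hypothetical):
# copy data between clusters with the DistCp utility
hadoop distcp hdfs://nn1:8020/data/src hdfs://nn2:8020/data/dst
# bulk-import a relational table into HDFS with Sqoop
sqoop import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /data/orders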
Real-time processing
+ Latency is extremely important, for example, less than one second
+ Computation is relatively simple
* Data is processed as an independent unit
* Available tools include Apache Storm, Spark Streaming, Apache Flink, Apache Kafka, and so on
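As a small illustration of the streaming style, Kafka's console tools publish and consume individual records as they arrive (the topic name and server address are hypothetical, and --bootstrap-server assumes a recent Kafka release):
# publish records to a topic, one line per record
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic events
# consume records as they arrive
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events --from-beginning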
File formats
+ Text/CSV file
Text and CSV files are very common in Hadoop data processing algorithms. Each line in the file
is treated as a new record.
+ JSON
The JSON format is typically used in data exchange applications, and it is treated as an object,
record, struct, or array. JSON files are text files and support schema evolution. It's very
easy to add or delete attributes from a JSON file.
* Sequence file
A sequence file is a flat file consisting of binary key/value pairs. They are extensively used in
MapReduce as input/output formats. They are mostly used for intermediate data storage
within a sequence of MapReduce jobs.
+ Avro
Avro is a widely used file type within the Hadoop community. It is popular because it supports
schema evolution. It contains serialized data in a binary format. An Avro file is splittable and
supports block compression. It contains both data and metadata. It uses a separate JSON file to
define the schema format. When Avro data is stored in a file, its schema is stored with it so that
the file can be processed later by any program.
+ Parquet
Parquet stores nested data structures in a flat columnar format. Parquet is more efficient in
terms of storage and performance than any row-level file formats. Parquet stores binary data
in a column-oriented way. In the Parquet format, new columns are added at the end of the
structure.
* ORC
ORC (Optimized Row Columnar) files are an extended version of RC files.
They offer great compression and are best suited for Hive SQL performance when Hive is
reading, writing, and processing data, reducing access time and storage space.
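To illustrate how these formats are chosen in practice, a Hive table can be declared with a specific storage format from the command line (the table and column names here are hypothetical):
# create an ORC-backed Hive table; Parquet works the same way with STORED AS PARQUET
hive -e "CREATE TABLE logs (ts STRING, msg STRING) STORED AS ORC;"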
Install OpenJDK
sudo apt install openjdk-11-jdk -y
Once the installation process is complete, verify
the current Java version:
aaronjohns@aaron-hadoop:~$ java -version; javac -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
javac 11.0.10
Install OpenSSH
sudo apt install openssh-server openssh-client -y
Create a user named hadoop and set the
password for this user
sudo adduser hadoop
Then switch to that user
su - hadoop
Generate an SSH key pair for this user
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
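As a quick sanity check (not part of the original steps), confirm that passwordless SSH now works:
# should log in without prompting for a password; type exit to return
ssh localhost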
Download Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Extract the files
tar xzf hadoop-3.3.6.tar.gz
Since we are deploying Hadoop on a single node, apply
the following settings.
Open the .bashrc file and append the following to the bottom of the file.
Then save and exit the file after you make the changes.
nano .bashrc
#Hadoop Related Options
export HADOOP_HOME=/home/hadoop/hadoop-3.3.6
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Apply the changes to the current running environment
source ~/.bashrc
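To confirm the environment variables took effect (a quick check, not in the original steps):
# should print /home/hadoop/hadoop-3.3.6 and the Hadoop version banner
echo $HADOOP_HOME
hadoop version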
Find the path to your JDK and note it down:
aaronjohns@aaron-hadoop:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-11-openjdk-amd64/bin/javac
Edit the hadoop-env.sh file. Uncomment the $JAVA_HOME variable (i.e.,
remove the # sign) and add the full path to the OpenJDK installation on
your system.
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Edit the core-site.xml file:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration to override the default values for the
temporary directory and add your HDFS URL to replace the default
local file system setting:
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmpdata</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
Edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file:
<configuration>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hadoop/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hadoop/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>
Edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration to the file:
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>
Edit the yarn-site.xml file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following configuration to the file:
<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
Format the NameNode before starting the Hadoop services for the first time:
hdfs namenode -format
The shutdown notification signifies the end of the
NameNode format process.
Start the NameNode and DataNode:
cd ~/hadoop-3.3.6/sbin
./start-dfs.sh
hadoop@aaron-hadoop:~/hadoop-3.3.6/sbin$ ./start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [aaron-hadoop]
aaron-hadoop: Warning: Permanently added 'aaron-hadoop' (ECDSA) to the list of known hosts.
Start the YARN resource manager and node managers:
./start-yarn.sh
hadoop@aaron-hadoop:~/hadoop-3.3.6/sbin$ ./start-yarn.sh
Starting resourcemanager
Starting nodemanagers
Verify that all the Hadoop daemons are running using the jps command:
jps
hadoop@aaron-hadoop:~/hadoop-3.3.6/sbin$ jps
4611 Jps
3878 SecondaryNameNode
3451 NameNode
3645 DataNode
4094 ResourceManager
4271 NodeManager
Note your IP address down:
ip a
In the ip a output, the loopback interface lo shows 127.0.0.1/8, and the inet entry on the primary network interface is the machine's IP address (10.211.55.21/24 in this example).
a6ya7n6124, 8:41 AM DCC -3- Setting up Hacaop.
BDCC
Overview ‘ocainost:2000" (active)
Summary
| ran cen cp cs Oni ek = et
ap ony a 248549 Coma ey np aan.
onaiescaey race
The default port 9864 is used to access individual DataNodes
directly from your browser
The DataNode page ('DataNode on aaron-hadoop:9866') shows the DataNode's version and a Block Pools table listing the namenode address, block pool ID, actor state, and the times of the last heartbeat and last block report.
The YARN Resource Manager UI (default port 8088, e.g. http://localhost:8088) shows an All Applications page with cluster metrics and the status of running applications.
TESTING A PROGRAM ON THE STANDALONE NODE
Create a file named WordCount.java in a directory named programs
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
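Assuming the class is completed into the canonical WordCount example (with a reducer and a main method that sets up the Job), one way to compile and run it in standalone mode is the following sketch (the jar name and directory names are illustrative):
# compile against the Hadoop client libraries and package a jar
javac -classpath "$(hadoop classpath)" -d . WordCount.java
jar cf wc.jar WordCount*.class
# in standalone mode, input/ and output/ are plain local directories
hadoop jar wc.jar WordCount input/ output/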