Complete Hadoop Map Reduce Hive Setup Step by Step
NOTE: While creating the virtual machine, point its storage to the D: or E: drive; by default the VM files occupy space on the C: drive and you will run into storage issues.
Create a folder such as bigdata on the D: drive, and set the OS type to Linux and the version to Ubuntu (64-bit).
You can start with 2048 MB (i.e. 2 GB) of memory and increase it later to 3 GB or 4 GB.
Select the Create Virtual Hard Disk Now option and click the Create button.
Next, point the VM to the ISO image for Ubuntu 20.04 (64-bit) by following the steps below:
Select the bigdata VM, right-click, and choose Settings.
Click the Storage option; a pop-up window like the following appears.
Select the disk symbol marked Empty; in the right panel, under Attributes, click the down arrow next to the disk icon to choose the optical drive/ISO file.
You can go with the 16.04.7 server version for a full install, or choose the 20.04.4 desktop-amd64 file.
Once you select the right ISO image (ubuntu-20.04.4-desktop-amd64…), your screen looks like the above.
Keep the network setting as NAT so the Linux guest can access the internet.
Select OK and proceed.
Now you are ready to start your virtual machine, which will be booted and initialized with the Ubuntu 20.04 ISO image (this process takes up to about 30 minutes depending on your system). Select the VM, click Start, and choose the Normal Start option as below:
Maximize the VM screen where the installation is happening.
By default it selects the India (Calcutta) region; go with it. If it shows a different region, select yours accordingly.
THE NEXT SCREEN IS VERY IMPORTANT: REMEMBER TO GIVE THE RIGHT USER NAME AND PASSWORD, AND STORE/REMEMBER IT.
bigdata2023
*** Do not upgrade any packages in the Ubuntu/Linux OS, even if it asks insistently. ***
Use the following command to update your system before initiating a new installation:
Step-1
sudo apt update
drdvenkat@drdvenkat-VirtualBox:~/Desktop$ sudo apt update
[sudo] password for drdvenkat:
Hit:1 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:2 http://in.archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://in.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:4 http://in.archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists... Done
Building dependency tree
Reading state information... Done
329 packages can be upgraded. Run 'apt list --upgradable' to see them.
drdvenkat@drdvenkat-VirtualBox:~/Desktop$
To change the long default shell prompt to a short one of your choice, use the following step:
drdvenkat@drdvenkat-VirtualBox:~/Desktop$ export PS1=$:
$:
Step-2
sudo apt install openjdk-8-jdk -y
$:which java
/usr/bin/java
$:java -version
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (build 1.8.0_352-8u352-ga-1~20.04-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)
$:which javac
/usr/bin/javac
$:javac -version
javac 1.8.0_352
$:
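To confirm where the JDK actually lives (this path is needed later for JAVA_HOME in hadoop-env.sh), a quick check; the location shown is the usual one for the openjdk-8-jdk package and may differ on your machine:
$:readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-openjdk-amd64/bin/javac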
Set Up a Non-Root User for Hadoop Environment
It is advisable to create a non-root user, specifically for the
Hadoop environment. A distinct user improves security and helps you
manage your cluster more efficiently. To ensure the smooth
functioning of Hadoop services, the user should have the ability to
establish a passwordless SSH connection with the localhost.
Install the OpenSSH server and client using the following command:
Step-4
sudo apt install openssh-server openssh-client -y
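The intermediate steps (creating the dedicated user, switching to it, and generating the SSH key pair) are not reproduced above. A minimal sketch, assuming the username hdoop used in the rest of this guide:
sudo adduser hdoop
su - hdoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
After this, continue with Step-9 below to restrict permissions on the authorized_keys file.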
Step-9
chmod 0600 ~/.ssh/authorized_keys
$:ls -lt ~/.ssh/authorized_keys
-rw------- 1 hdoop hdoop 580 Jan 24 22:04 /home/hdoop/.ssh/authorized_keys
$:
The new user is now able to SSH without needing to enter a password every time.
Verify everything is set up correctly by using the hdoop user to SSH to localhost:
Step-10
ssh localhost
$:ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is
SHA256:CXz9eqInsu9wcBTgemSUKUdujMiDkgM91L0lU758Yj0.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.15.0-58-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
330 updates can be applied immediately.
255 of these updates are standard security updates.
hdoop@drdvenkat-VirtualBox:~$
Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page, and select the version of Hadoop you want to
implement.
Go to the archive page and select Hadoop 3.2.1:
https://archive.apache.org/dist/hadoop/common/
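Download the tarball into the hdoop home directory with wget; the command below assumes the standard layout of the archive URL above:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz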
hadoop-3.2.1.tar.gz 100%[===================================>]
342.56M 2.91MB/s in 2m 0s
hdoop@drdvenkat-VirtualBox:~$
Extract the files to initiate the Hadoop installation
Step-12:
tar xzf hadoop-3.2.1.tar.gz
hdoop@drdvenkat-VirtualBox:~$ tar xzf hadoop-3.2.1.tar.gz
The Hadoop binary files are now located within the hadoop-3.2.1 directory
hdoop@drdvenkat-VirtualBox:~$ PS1=$:
$:id
uid=1001(hdoop) gid=1001(hdoop) groups=1001(hdoop)
$:ls -lt
total 350788
-rw-rw-r-- 1 hdoop hdoop 359196911 Jul 3 2020 hadoop-3.2.1.tar.gz
drwxr-xr-x 9 hdoop hdoop 4096 Sep 10 2019 hadoop-3.2.1
Single Node Hadoop Deployment (Pseudo-Distributed Mode)
Hadoop excels when deployed in a fully distributed mode on a large cluster of networked
servers. However, if you are new to Hadoop and want to explore basic commands or test
applications, you can configure Hadoop on a single node. This setup, also called pseudo-
distributed mode, allows each Hadoop daemon to run as a single Java process.
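Before editing the Hadoop configuration files, the Hadoop environment variables are added to ~/.bashrc. The exact entries are not reproduced here; a sketch matching the variables shown in the env output below (adjust the paths if you extracted Hadoop elsewhere):
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"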
Once you have edited and added the entries, save the file and exit, then source the environment file using the following command:
hdoop@drdvenkat-VirtualBox:~$ source .bashrc
hdoop@drdvenkat-VirtualBox:~$ env | grep HADOOP
HADOOP_OPTS=-Djava.library.path=/home/hdoop/hadoop-3.2.1/lib/nativ
HADOOP_INSTALL=/home/hdoop/hadoop-3.2.1
HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_COMMON_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_HDFS_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_COMMON_LIB_NATIVE_DIR=/home/hdoop/hadoop-3.2.1/lib/native
hdoop@drdvenkat-VirtualBox:~$
Edit hadoop-env.sh File
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and
Hadoop-related project settings. When setting up a single node Hadoop cluster, you need to
define which Java implementation is to be utilized. Use the previously created
$HADOOP_HOME variable to access the hadoop-env.sh file:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
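Inside hadoop-env.sh, uncomment the JAVA_HOME line and point it at the JDK installation directory; with the openjdk-8-jdk package installed earlier, the path is usually:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64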
Edit core-site.xml File
The core-site.xml file defines the HDFS address and the Hadoop temporary directory. Create the temporary data directory, then open the file for editing and add the following configuration:
mkdir /home/hdoop/tmpdata
vi $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
Edit hdfs-site.xml File
YOU HAVE TO CREATE TWO DIRECTORIES TO HOLD THE NAMENODE AND DATANODE DATA:
$:mkdir -p /home/hdoop/dfsdata/namenode
$:mkdir -p /home/hdoop/dfsdata/datanode
The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage
file, and edit log file. Configure the file by defining the NameNode and DataNode storage
directories.
Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match
the single node setup.
Use the following command to open the hdfs-site.xml file for editing:
$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file and, if needed,
adjust the NameNode and DataNode directories to your custom
locations:
<!-- ENTRIES ADDED BY Dr D VENKAT. -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Edit mapred-site.xml File
Use the following command to access the mapred-site.xml file and define MapReduce
values:
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<!-- Entries Added by Dr D VENKAT for MAP REDUCE to use YARN scheduler -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
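Note: on Hadoop 3.x this entry alone is sometimes not enough; if a MapReduce job later fails with "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster", the usual remedy (an addition, not part of the original steps) is to also add the following properties to mapred-site.xml, pointing at this guide's install path:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.2.1</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.2.1</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.2.1</value>
</property>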
Edit yarn-site.xml File
The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations
for the Node Manager, Resource Manager, Containers, and Application Master.
Open the yarn-site.xml file in a text editor:
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
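The remaining yarn-site.xml entries are not reproduced above. For a pseudo-distributed setup, the additional properties commonly placed inside the same <configuration> block (shown here as an assumption based on the standard single-node configuration, not the original file) are:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>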
##########SETTING UP THE HADOOP SINGLE NODE CLUSTER######################
STEP-14: Format HDFS NameNode
It is important to format the NameNode before starting Hadoop services for the first time:
hdfs namenode -format
$:which hdfs
/home/hdoop/hadoop-3.2.1/bin/hdfs
$:which hadoop
/home/hdoop/hadoop-3.2.1/bin/hadoop
…………………………
SHUTDOWN_MSG: Shutting down NameNode at drdvenkat-VirtualBox/127.0.1.1
************************************************************/
If you see any error at this point, do not proceed; look up the error message online and fix it first.
Step-15: Start Hadoop Cluster
Navigate to the hadoop-3.2.1/sbin directory and execute the following commands to start the
NameNode and DataNode:
$:source .bashrc
hdoop@drdvenkat-VirtualBox:~$ PS1=$:
$:env | grep hadoop
YARN_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_OPTS=-Djava.library.path=/home/hdoop/hadoop-3.2.1/lib/nativ
HADOOP_INSTALL=/home/hdoop/hadoop-3.2.1
HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_COMMON_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_HDFS_HOME=/home/hdoop/hadoop-3.2.1
HADOOP_COMMON_LIB_NATIVE_DIR=/home/hdoop/hadoop-3.2.1/lib/native
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/
games:/usr/local/games:/snap/bin:/home/hdoop/hadoop-3.2.1/sbin:/home/
hdoop/hadoop-3.2.1/bin
$:cd $HADOOP_HOME
$:cd sbin
$:./start-dfs.sh
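The YARN daemons in the process list below come from start-yarn.sh, and the list itself from jps; those two commands (their intermediate output is omitted here) are:
$:./start-yarn.sh
$:jps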
12882 NodeManager
12499 SecondaryNameNode
12757 ResourceManager
13225 Jps
12330 DataNode
12186 NameNode
$:
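As a quick sanity check (not part of the original steps), you can open the web UIs on the default Hadoop 3.x ports and run a small MapReduce job from the bundled examples jar; the HDFS paths below are illustrative:
NameNode UI: http://localhost:9870  -  ResourceManager UI: http://localhost:8088
$:hdfs dfs -mkdir -p /user/hdoop/input
$:hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hdoop/input
$:hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /user/hdoop/input /user/hdoop/output
$:hdfs dfs -cat /user/hdoop/output/part-r-00000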
FOR HIVE
FOR KAFKA…..