Commands in Hadoop
1. ls:
This command is used to list all the files in a directory. Use -ls -R (or the older -lsr) for a recursive listing; it is useful when we want the hierarchy of a folder.
Syntax:
Example:
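A typical form, for reference (the /user path is illustrative):
hadoop fs -ls <path>
hadoop fs -ls /user
hadoop fs -ls -R /user   # recursive listing of the folder hierarchy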
2. mkdir:
To create a directory. In Hadoop fs there is no home directory by default. So let’s first create it.
Syntax:
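A typical form, assuming we want a directory named /user at the HDFS root (the name is illustrative):
hadoop fs -mkdir <folder name>
hadoop fs -mkdir /user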
3. touchz:
This command creates an empty (zero-byte) file at the given path in HDFS.
Syntax:
Example:
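A typical form (the file name myfile.txt is illustrative):
hadoop fs -touchz <file path>
hadoop fs -touchz /user/myfile.txt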
4. copyFromLocal (or put):
To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS.
Syntax:
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder user
present on hdfs.
OR
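Both variants, for reference, using the Desktop file from the example above (the relative local path is illustrative):
hadoop fs -copyFromLocal <local file path> <dest (on hdfs)>
hadoop fs -copyFromLocal ../Desktop/AI.txt /user
hadoop fs -put ../Desktop/AI.txt /user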
5. cat:
This command prints the contents of a file stored in HDFS to the console.
Syntax:
Example:
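A typical form (the file path is illustrative; AI.txt is the file copied in the previous step):
hadoop fs -cat <path>
hadoop fs -cat /user/AI.txt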
6. copyToLocal (or get):
To copy files/folders from the HDFS store to the local file system.
Syntax:
Example:
(OR)
hadoop fs -get /user/data.txt ../Desktop
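The copyToLocal form of the same example, for reference (data.txt and the Desktop path are as above):
hadoop fs -copyToLocal <src (on hdfs)> <local destination>
hadoop fs -copyToLocal /user/data.txt ../Desktop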
7. cp:
This command is used to copy files within HDFS. Let's copy the folder user to user_copied.
Syntax:
Example:
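A typical form and the example described above (the exact destination path is illustrative):
hadoop fs -cp <src (on hdfs)> <dest (on hdfs)>
hadoop fs -cp /user /user_copied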
8. mv:
This command is used to move files within HDFS. Let's cut-paste a file myfile.txt from the user folder to user_copied.
Syntax:
Example:
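A typical form and the example described above (paths are illustrative):
hadoop fs -mv <src (on hdfs)> <dest (on hdfs)>
hadoop fs -mv /user/myfile.txt /user_copied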
9. du:
This command shows the disk usage, i.e. the size of each file/directory under the given path in HDFS.
Syntax:
Example:
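A typical form (the directory name is illustrative):
hadoop fs -du <dirName>
hadoop fs -du /user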
10. dus:
This command gives the summarized (total) size of a directory, like du with the -s flag; in newer Hadoop releases it is deprecated in favour of hadoop fs -du -s.
Syntax:
hadoop fs -dus <dirName>
Example:
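For example (the directory name is illustrative):
hadoop fs -dus /user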
Step 1: Create a file with the name word_count_data.txt and add some data to it.
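For instance, from a terminal (the sample sentences are only placeholders; any text will do):
cd Documents/
echo big data is fun and big data is powerful > word_count_data.txt
echo hadoop makes big data processing simple >> word_count_data.txt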
Step 2: Create a mapper.py file that implements the mapper logic. It will read the data from STDIN (standard input), split each line into words, and emit every word together with a count of 1.
#!/usr/bin/env python
# import sys because we need to read and write data to STDIN and STDOUT
import sys

# read each line from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit every word with a count of 1, separated by a tab
    for word in words:
        print('%s\t%s' % (word, 1))
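The mapper can be sanity-checked locally before involving Hadoop, simply by piping a sample line through it (the sample text is illustrative):
echo big data is fun | python mapper.py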
Step 3: Create a reducer.py file that implements the reducer logic. It will read the output of mapper.py from STDIN (standard input), aggregate the occurrences of each word, and write the final output to STDOUT.
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# read the mapper's output from STDIN; Hadoop sorts it by word before this step
for line in sys.stdin:
    line = line.strip()
    # parse the "word<TAB>count" pair produced by mapper.py
    word, count = line.split('\t', 1)
    count = int(count)
    # identical words arrive consecutively, so we can sum them as we go
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
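The whole pipeline can also be simulated locally; the sort step stands in for Hadoop's shuffle/sort phase (this assumes a Unix-like shell and the data file from Step 1):
cat word_count_data.txt | python mapper.py | sort -k1,1 | python reducer.py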
Step 4: Now let’s start all our Hadoop daemons with the below command.
start-all.cmd
Now make a directory word_count_in_python in the root directory of our HDFS, which will store our word_count_data.txt file, with the below command.
hdfs dfs -mkdir /word_count_in_python
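The data file itself still has to be copied into that HDFS directory; a sketch using copyFromLocal, assuming word_count_data.txt sits in the Documents folder as above:
hdfs dfs -copyFromLocal Documents/word_count_data.txt /word_count_in_python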
Let’s give executable permission to our mapper.py and reducer.py with the help of the below commands.
cd Documents/
chmod 777 mapper.py reducer.py   # read, write and execute permission for user, group and others
Step 5: Now download the latest hadoop-streaming jar file and place it in a location from which you can easily access it.
Now let’s run our python files with the help of the Hadoop streaming utility as shown below.
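A sketch of the streaming invocation, assuming the jar was saved as /hadoop-streaming.jar and the scripts are in the current directory (the jar path, output directory and the backslash line continuations, which assume a bash-style shell, are illustrative):
hadoop jar /hadoop-streaming.jar \
  -input /word_count_in_python/word_count_data.txt \
  -output /word_count_in_python/output \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -file mapper.py \
  -file reducer.py
Once the job finishes, the word counts can be inspected with:
hdfs dfs -cat /word_count_in_python/output/*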