Ecosystem Notes


Java MapReduce :-

You need to have the single-node machine ready up to the jps command. On the CLI (terminal) we can see
that the NameNode, DataNode, JobTracker & TaskTracker are running with their IDs.
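
( A quick way to verify, assuming a Hadoop 1.x single-node setup; the exact process IDs will differ on your machine: )

jps <Enter>

Result > the running daemons with their process IDs, typically NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker & Jps itself.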

Step 1. cat >WordCount.java <Enter>

( This command makes a new file called "WordCount.java"; write the code below into it )

//package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

( After entering the code, press CTRL + C to go back to the machine's command prompt )

Step 2. export CLASSPATH=/usr/local/hadoop/hadoop-core-1.2.1.jar

( We export the CLASSPATH to point at our Hadoop jar file; the Java imports defined at the beginning will
be resolved from this .jar file )

Step 3. mkdir wordcount_classes

( With this command we make a directory (folder) named "wordcount_classes" )

Step 4. javac -d wordcount_classes/ WordCount.java

( javac is the Java compiler; -d sets the destination directory for the compiled classes. Three class files will be stored in the wordcount_classes folder )

Step 5. cd wordcount_classes, then type ls to see the result below

Result > WordCount.class WordCount$Map.class WordCount$Reduce.class

( We have three files: the Map class, the Reduce class & WordCount.class (the driver class) )

Step 6. cd <Enter>

( To go back to our home directory on the machine's command line )

Step 7. jar -cvf wordcount.jar -C wordcount_classes/ .

( tar is the tape archiver and jar is the Java archiver. c is for create, v is for verbose (show us what you
are doing), f specifies the archive file name (here wordcount.jar). -C tells jar to find the classes in
wordcount_classes, and the trailing . packs everything there; the jar is created in the parent directory. )
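
( To double-check that the three classes really made it into the jar, its contents can be listed; this check is not part of the original steps: )

jar -tf wordcount.jar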

Step 8. ls <Enter>

Result = hadoop-1.2.1.tar.gz wordcount_classes wordcount.jar WordCount.java

( We can see the jar file in the parent folder )

Step 9. exit <Enter>

( To exit from the machine's command line & come back to our home folder, so that we can send our input file to the cluster. )

Step 10. scp -i "cloudm.pem" en.7nov.txt ubuntu@ec2-18-212-29-171.compute-1.amazonaws.com:~/

( In this scenario my file name is en.7nov.txt. We send this text file to the machine; after ubuntu@ use
your machine's address & add :~/ at the end )

Step 11. Now connect again to your machine & type ls <Enter>

Result = en.7nov.txt hadoop-1.2.1.tar.gz wordcount_classes wordcount.jar WordCount.java

( We can see the en.7nov.txt file in the machine's home folder )

Step 12. hadoop fs -put en.7nov.txt .

( We use this command to put the file into HDFS )

Step 13. hadoop fs -du en.7nov.txt

( To check whether the file is present on HDFS & to see its size )

Result : Found 1 items
591874 hdfs://localhost:9000/user/ubuntu/en.7nov.txt

Step 14. hadoop jar wordcount.jar WordCount en.7nov.txt result

( We run the jar file. The jar contains the WordCount class; it processes the .txt data & the output is saved
in the result directory )

Step 15. Go to the browser & open the machine IP followed by :50070 <Enter>

( To access the files through the web UI )

Step 16. Follow this link to view the data /user/ubuntu/result/part-00000


( To view the result in the GUI )

Step 17. Go to the terminal & type this command: hadoop fs -lsr /user/ubuntu/result

( To view the result on the CLI )

Step 18. hadoop fs -get /user/ubuntu/result/part-00000 results

( With this command we get the same information that is shown in the GUI onto the terminal, in a new
local file called results )

Step 19. cat results

( To view the information on the CLI )

Step 20. sort -n -k2 results > result

( We sort the data numerically by the second column (the count) & write it to a file called result )

Step 21. cat result

( We can see our information sorted in numerical order )
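
( As a small variation that is not part of the original steps, the most frequent words can be shown directly by sorting in reverse & taking the top lines: )

sort -nr -k2 results | head -20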


With this process we can see how often each word occurs in whatever book we process.
Apache Hive Installation

Step 1. wget http://www-us.apache.org/dist/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz

( This command is used to download Hive )

Step 2. tar -zxf apache-hive-1.2.2-bin.tar.gz

( To extract the Hive tarball )

Step 3. sudo mv apache-hive-1.2.2-bin /usr/local/hive

( Move Hive to the local directory /usr/local/hive )

Step 4. nano .bashrc <Enter>

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

then exec bash <Enter>

( We are telling Linux where Hive is )

Step 5. cd $HIVE_HOME/conf

( Change into Hive's configuration folder, where we will configure it )

Step 6. cp hive-env.sh.template hive-env.sh

( With this command we copy the template file to hive-env.sh )

Step 7. nano $HIVE_HOME/conf/hive-env.sh

( Now we open Hive's environment file to tell it where Hadoop is )

Step 8. export HADOOP_HOME=/usr/local/hadoop

( Add this line in hive-env.sh to inform Hive where Hadoop is )

Step 9. hive <Enter>

( It will not work yet, because Hive lacks permission on the /tmp directory in HDFS, but enter it anyway )

Step 10. hadoop fs -chmod -R 777 /tmp

( 777 is full permission. This /tmp folder is on Hadoop (HDFS) )

Step 11. hive <Enter>


( To go onto the Hive command line. The Hive command line is not secure, which is why companies use
the beeline command instead. To get access to beeline we need to secure the cluster, and for that we need
Kerberos installed. )
“Loading Server Log Data in Hive”

Step 1. wget https://s3.amazonaws.com/cloud-age/eventlog.log

( As we don't have any data set, we download an event log file with this command )

Step 2. hadoop fs -rmr /user/ubuntu/*

( We delete the result & results files from Hadoop with this command )

Step 3. hadoop fs -copyFromLocal /home/ubuntu/eventlog.log /user/ubuntu/serverlog.log

( We copy the file into HDFS, changing its name to serverlog.log )

Step 4. hadoop fs -ls /user/ubuntu/

( To check the files currently present on Hadoop )

Step 5. type hive <Enter>

( To go onto the Hive command line )

Step 6. Create database server;

( From the Hive command line we create a database named server )

Step 7. Show databases;

( To view the databases )

Step 8. Use server;

( To switch to the server database )

Step 9. create table serverdata (time STRING, ip STRING, country STRING, status STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/user/ubuntu/';

( To create the schema (table) over the data in /user/ubuntu/ )

Step 10. select * from serverdata limit 10;

( To view the first 10 rows of serverdata )


Result >
2016-03-28T13:15:26  54.115.148.98    FR  SUCCESS
2016-03-28T13:15:26  12.134.245.229   FR  SUCCESS
2016-03-28T13:15:26  228.164.202.212  DE  SUCCESS
2016-03-28T13:15:26  148.96.126.81    DE  SUCCESS
2016-03-28T13:15:26  170.196.91.97    DE  SUCCESS
2016-03-28T13:15:26  242.90.117.80    GB  SUCCESS
2016-03-28T13:15:26  5.191.139.199    GB  SUCCESS
2016-03-28T13:15:26  190.147.43.163   GB  SUCCESS

Step 11. SELECT * FROM serverdata where country = "IN" LIMIT 5;

( To view data from the IN country )

Step 12. select * from serverdata where country = "GB" ;

( To view data from the GB country )

Step 13. select * from serverdata where country = "IN" AND status = "ERROR";

( To view data from IN with status ERROR )

Step 14. select * from serverdata where country = "FR" AND status = "SUCCESS";

( To view data from the FR country with a SUCCESS message )

Step 15. SELECT ip, time FROM Serverdata;

( To view the IP & time of all transactions )

Step 16. SELECT DISTINCT ip, time from serverdata;

( This command needs processing, hence our Map & Reduce process will run & we will get the result )

Step 17. SELECT DISTINCT ip from serverdata;

( This command needs processing, hence our Map & Reduce process will run & we will get the result )

Step 18. SELECT DISTINCT * FROM serverdata;

( This command needs processing, hence our Map & Reduce process will run & we will get the result )

Step 19. create table doc(text string) row format delimited fields terminated by '\n' stored as textfile;

( To create a table doc with a single text column, one log line per row )

Step 20. load data inpath '/user/ubuntu/serverlog.log' overwrite into table doc;

( To load the data into the doc table )

Step 21. SELECT word, COUNT(*) FROM doc LATERAL VIEW explode(split(text, ' ')) lTable as word GROUP BY word;

( This splits every line into words & counts each word; Map & Reduce jobs will run to produce the result. A sorted variation is sketched below. )
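
( A variation on the query above, not in the original notes: the counts can be ordered so the most common words come first. )

SELECT word, COUNT(*) AS cnt FROM doc LATERAL VIEW explode(split(text, ' ')) lTable AS word GROUP BY word ORDER BY cnt DESC LIMIT 20;
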
Apache Pig Installation

Step 1. wget https://archive.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz

( To download Pig )

Step 2. tar -zxvf pig-0.16.0.tar.gz

( To extract the Pig tarball )

Step 3. sudo mv pig-0.16.0 /usr/local/pig

( To move Pig to the local directory /usr/local/pig )

Step 4. export PIG_HOME=/usr/local/pig/


export PATH=$PATH:$PIG_HOME/bin

( To inform Linux where Pig is )

Step 5. pig <Enter>

( To open Pig's command line, the Grunt shell )


Step 6. lines = LOAD '/user/hive/warehouse/server.db/doc/serverlog.log' AS (line:chararray);

( To load the server log (which Hive moved into its warehouse under server.db/doc) as lines of text )


Step 7. words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
Step 8. grouped = GROUP words BY word;
Step 9. wordcount = FOREACH grouped GENERATE group, COUNT(words);
Step 10. DUMP wordcount

( Split each line into words, group identical words, count each group & print the result in the Grunt shell; see the sketch below for saving the output instead )
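
( A minimal sketch, not in the original notes: instead of dumping to the screen, the counts can be written back to HDFS. The output path here is illustrative. )

STORE wordcount INTO '/user/ubuntu/pig_wordcount';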
Flume Log Data Ingestion :-

Step 1 . sudo apt-get update && sudo apt-get upgrade -y


( We have more than 1000 ATM transaction records, so we fetch the data through Flume & put it onto the
Hadoop cluster. Step 1 is only needed if we have a brand-new machine. )
Step 2. wget http://archive.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz
( Download Flume )
Step 3. tar -zxvf apache-flume-1.4.0-bin.tar.gz
( To extract Flume; we don't put Flume under /usr because we want to run it from outside the cluster )
Step 4. cd apache-flume-1.4.0-bin/conf/
( Flume's configuration folder )
Step 5. mv flume-env.sh.template flume-env.sh
( To turn the template file into flume-env.sh )
Step 6. nano /home/ubuntu/apache-flume-1.4.0-bin/conf/flume-env.sh
( To configure the Flume environment )
Step 7. JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
FLUME_CLASSPATH="/home/ubuntu/apache-flume-1.4.0-bin/lib/*.jar"
( Add these lines to tell Flume where Java & its jars are )
Step 8. nano flume.conf
( To create Flume's agent configuration file )
Step 9. Paste the below classes in conf file
# Flume agent config
cloudage.sources = eventlog
cloudage.channels = file_channel
cloudage.sinks = sink_to_hdfs
# Define / Configure source
cloudage.sources.eventlog.type = exec
cloudage.sources.eventlog.command = tail -F /var/log/flume/eventlog.log
cloudage.sources.eventlog.restart = true
cloudage.sources.eventlog.batchSize = 1000
#cloudage.sources.eventlog.type = seq
# HDFS sinks
cloudage.sinks.sink_to_hdfs.type = hdfs
cloudage.sinks.sink_to_hdfs.hdfs.fileType = DataStream
cloudage.sinks.sink_to_hdfs.hdfs.path = hdfs://localhost:9000/user/ubuntu/flume/events
cloudage.sinks.sink_to_hdfs.hdfs.filePrefix = eventlog
cloudage.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
cloudage.sinks.sink_to_hdfs.hdfs.batchSize = 1000
# Use a channel which buffers events on the local file system
cloudage.channels.file_channel.type = file
cloudage.channels.file_channel.checkpointDir = /var/log/flume/checkpoint
cloudage.channels.file_channel.dataDirs = /var/log/flume/data
# Bind the source and sink to the channel
cloudage.sources.eventlog.channels = file_channel
cloudage.sinks.sink_to_hdfs.channel = file_channel
( Channel: we have two channel types, the memory channel & the file (hard drive) channel. In the Twitter
case we take data from a third-party app; Twitter does not give access to its servers, only to its network,
which means we can read the data flowing on the network but cannot land on their servers. That makes it
streaming data for us, and for streaming data we need the memory channel. In the ATM example we need
the file channel, because the ATM data is our own internal data and we can get at it directly. The channel
gets its data from the sources, and because we put the data onto HDFS we also declare a sink in the conf file.

Sources: our source is the event log, of type exec, i.e. an executable command. The command is tail:
putting tail on any file lets us see every change made to that file, and -F makes it follow the file
continuously. restart = true means that if the command stops or times out (for example when the machine
is idle for a while), the Flume agent restarts it automatically. batchSize = 1000 means we commit
transactions of 1000 events for this example. The seq source line is commented out because the data
already comes in sequence.

Flume reads the source first, then the channel, then the sink. In the sink we set the path to
hdfs://localhost:9000; the agent knocks on port 9000 & writes into /user/ubuntu/flume/events. The sink
batch size is also 1000, the same size we set in the source. Here we use a file channel; for Twitter we will
use a memory channel. The benefit of the file channel is zero data loss, but it is slower because it runs at
hard-drive speed. The memory channel gives high speed, but in a failure situation we may face data loss.
Organisations configure both, because they use both channels (memory & file). )
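
( For comparison, a memory channel, which we use later for the Twitter stream, would be wired in roughly like this. This is only a sketch; the channel name mem_channel is illustrative & the capacities mirror the Twitter configuration further below: )

cloudage.channels = mem_channel
cloudage.channels.mem_channel.type = memory
cloudage.channels.mem_channel.capacity = 10000
cloudage.channels.mem_channel.transactionCapacity = 1000
cloudage.sources.eventlog.channels = mem_channel
cloudage.sinks.sink_to_hdfs.channel = mem_channel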
Step 10. sudo mkdir /var/log/flume/
( In the conf file we promised to create this folder. Inside it Flume keeps a checkpoint, so whenever the
agent restarts it knows where it left off. From this data we can also work out the idle time (when no
customers visit), which helps us schedule maintenance; we hand that data to the WFM team. )

Step 11. sudo mkdir /var/log/flume/checkpoint/


( Whenever Flume checkpoints, the events themselves are written into the data directory. Rather than
exposing the checkpoint information, we share the information from the data directory, which is the smaller file. )
Step 12. sudo mkdir /var/log/flume/data/
( To make the folder named data )
Step 13. sudo chmod 777 -R /var/log/flume
( 777 is full permission )
Step 14. hadoop fs -mkdir hdfs://localhost:9000/user/ubuntu/flume/events
( We make this folder on Hadoop; it is for the sink )
Step 15. Go to the browser & open /user/ubuntu/flume/events
( We can see that the folder above is empty, so we need an application to put data into this
folder )
Step 16. wget https://s3.amazonaws.com/cloud-age/generate_logs.py
( To download the log-generator script )
Step 17. ls
( To view the files in our Linux folder )
Step 18. nano generate_logs.py
( To open the script & edit its configuration )
Now go to the parser add-option line (the last line with the parser name), find eventlog-demo.log & remove
the -demo from it (making it eventlog.log), then add flume before eventlog (log/flume/eventlog.log). Go to
the end of the same line & make the same changes we made first (remove -demo & add flume/).
Step 19. Open a new terminal & connect it to the machine. ( This will be our 2nd terminal )
Step 20. tail -F /var/log/flume/eventlog.log ( 2nd terminal )
( We will get an error that the file cannot be opened, because it has not been created yet )
Step 21. sudo python generate_logs.py ( 1st terminal )
( We will now see the log lines appear on the 2nd terminal )
Step 22. mv flume.conf apache-flume-1.4.0-bin/conf/
( To move the flume.conf file into Flume's conf folder )
Step 23. cd apache-flume-1.4.0-bin/bin
( To go into the bin folder )
Step 24. ls <Enter>
( We can see the flume-ng script )
Step 25. ./flume-ng agent --conf /home/ubuntu/apache-flume-1.4.0-bin/conf/ --conf-file /home/ubuntu/apache-flume-1.4.0-bin/conf/flume.conf --name cloudage
( To start the agent )
Step 26. Go to the 2nd terminal & run sudo python generate_logs.py <Enter>
( To run the Python script again; the sink picks up the events & puts them on HDFS )
Step 27. Go to the browser & refresh the link ( /user/ubuntu/flume/events )
( We will be able to see all the transaction details in the GUI )
Now press CTRL + C on the 1st terminal. We will get the message that the background worker is shutting down.
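
( The same events can also be checked from the CLI; the path matches the sink configured in flume.conf: )

hadoop fs -ls /user/ubuntu/flume/events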

Twitter Data Fetching to HDFS

Note : The consumer key & access token used here may not work. Replace them with your
personal consumer key & access token.

Step 1. cd .. <Enter>

( On the 2nd terminal; this takes us out of the bin folder )
Step 2. cd lib <Enter>
( To go into the library folder )
Step 3. wget https://s3.amazonaws.com/cloud-age/flume-sources-1.0-SNAPSHOT.jar
( We put this jar directly into the lib directory )
Step 4. cd <Enter> ( on the 2nd terminal )
Step 5. FLUME_CLASSPATH="/home/ubuntu/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"
( In flume-env.sh, comment out (#) the earlier FLUME_CLASSPATH line because it uses *.jar, & use this one instead )
Step 6. nano twitter.conf
( To create twitter.conf )
Step 7. TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = bsITNAn6Gq0WrSopSWvp4VCfb
TwitterAgent.sources.Twitter.consumerSecret = Pd6gq5woj6rkFZ0ATt6G0b2rZvuhx52UnnoeNiQHOoHte7z8gw
TwitterAgent.sources.Twitter.accessToken = 2238647731-nARVjyclOs0YxcbFUNdjYFNt2ycwnKxLHgsO1Ut
TwitterAgent.sources.Twitter.accessTokenSecret = ayPrZalppaFEhG04dw6QojDSPgFmyPyfYnRogqtosoGri
TwitterAgent.sources.Twitter.keywords = TSLPRB, hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, Nandi Awards, new data
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/ubuntu/tweeter
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 900
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 10000
( We can add our own hot-topic keywords in TwitterAgent.sources.Twitter.keywords )
Step 8. mv twitter.conf apache-flume-1.4.0-bin/conf/
( We move twitter.conf to the configuration folder )
Step 9. cd /home/ubuntu/apache-flume-1.4.0-bin/bin
( To go to the bin folder so we can run the agent for the third-party source, which is Twitter )
Step 10. ./flume-ng agent --conf /home/ubuntu/apache-flume-1.4.0-bin/conf/ -f /home/ubuntu/apache-flume-1.4.0-bin/conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

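( Once the agent is running, the collected tweets can be checked from the CLI; the path matches the HDFS sink configured in twitter.conf: )

hadoop fs -ls /user/ubuntu/tweeter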
