Exercise 2 Configuring Flume For Data Loading: IBM Software
Exercise 2
Configuring Flume for Data Loading
© Copyright IBM Corporation, 2013
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Lab 2 Configuring Flume for Data Loading
1.1 Getting Started
1.2 Getting PuTTY Setup
1.3 Install the Flume Service
1.4 Creating a Configuration File for Basic Testing
1.5 Testing Your First Agent
1.6 A More Complicated Configuration Overview
1.7 Setup Your Agents
1.8 Test Your Configuration
1.9 Transfer Data
This version of the lab was updated and tested on the InfoSphere BigInsights 4.1 Quick Start Edition.
Throughout this lab you will be using the following account login information. If your passwords are
different, please note the difference.
Username Password
Flume is a distributed service for efficiently moving large amounts of data. The original use
case of Flume was to gather a set of log files on various machines in a cluster and aggregate
them to a centralized persistent store such as Hadoop Distributed File System (HDFS). It has
since been re-architected and expanded to cover a much wider variety of data.
Probably the most important ingredient for this exercise is your imagination. You are going to
have to draw upon the inner child within you. Your exercises are running on a BigInsights cluster
that has a whopping one node. So it should be obvious that you are not going to be able to
move data from one node to another. For this exercise to be somewhat meaningful, you are
going to have to imagine that, when you start multiple agents, you have multiple nodes and that
Flume is running on each of those nodes.
Note:
Be aware that the initialization of the Flume agents will take several minutes. Please be patient.
1.1 Getting Started

__1. Open a Web browser and navigate to http://rvm.svl.ibm.com:8080, and sign in using the
Ambari user id and password specified at the beginning of this document.
Notice that most of the BigInsights components listed at the left are in a Stopped state as
indicated by the red, triangular warning icon.
__4. Click Actions -> Selected Hosts (1) -> Hosts -> Start All Components
Note: Be sure to allow ample time for all the components to start. The first time you start the components
and services, it may take approximately 30 minutes or even longer, depending on the physical resources
on your machine.
__6. Periodically examine the background operations indicator at the top of the screen. It should show
that 1 operation is running. When it updates to 0 ops, the Start All script is complete and the
components should all be running, as indicated by the green check mark icons on the left.
Note: If your icons still show red warning signs after the startup, it may be that the Ambari interface did
not refresh properly, even though the details in the background operations show 100% and display a
successful message. Feel free to click the Admin button at the top right of the window, and then click
Sign out. You will be presented with the Ambari login screen. Log back in using the credentials at the
beginning of this document and the component list should be updated with the correct, green check-mark
icons.
Now that the components are started, you may move on to the next section.
1.2 Getting PuTTY Setup

__3. Now for your Host Name: you should be able to find it easily on the page shown when you first
log into the BigInsights QuickStart 4.1 VM image.
Note: To get to this page when you're already logged in, just type “exit” until you get back to this
screen (instead of restarting the whole VM).
__4. Now just input the Host Name (IP address) into your PuTTY client under “Host Name” and
click “Open.” It should look like this: (Optional: click “Save” after you input your details to keep
them for next time!)
__5. Next you should be greeted with a command line requesting login details. Just enter the
VM image setup screen login (default username: root, password: password). Then you have
successfully set up PuTTY to connect to your VM!
1.3 Install the Flume Service

Keep clicking Next through the screens and finally click the Deploy button. You should receive
confirmation that the service has been deployed and started. After the install, you may be asked to
restart some of the Hadoop components. Restart the requested components and move on to the
next section of the lab.
1.4 Creating a Configuration File for Basic Testing

__ 1. Make sure you are logged into your lab image as root.
__ 2. If Hadoop is not running, start it using Ambari.
__ 3. Your Flume configuration file can reside anywhere as long as the agent can access it.
The convention is to place the configuration file in Flume’s conf directory. For this lab, we
will simply place our Flume configuration files in the /home/virtuser directory.
Start the vi editor, or use a text editor such as Notepad (transferring the file via a VM shared folder).
Important:
Did you notice that the binding definition for the source contains the keyword channels (plural)
while the binding definition for the sink contains the keyword channel (singular)? This is because
a source can write to multiple channels, whereas a sink can only read from a single channel.
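The intermediate steps that build the file are not shown in this copy of the lab. A minimal single-agent configuration consistent with the description (component names here are illustrative, not necessarily those used in the lab) might look like this:

```properties
# Hypothetical flume_agent1.properties for basic testing.
# A seq source generates an endless stream of numbered events,
# which a logger sink writes to the console.
agent1.sources = seqSource
agent1.sinks = logSink
agent1.channels = memChannel

agent1.sources.seqSource.type = seq

agent1.sinks.logSink.type = logger

agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100

# Sources bind with "channels" (plural); sinks bind with "channel" (singular)
agent1.sources.seqSource.channels = memChannel
agent1.sinks.logSink.channel = memChannel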
__ 10. Save your work into File System->home->virtuser and make sure the file is named
flume_agent1.properties.
1.5 Testing Your First Agent

Note:
To stop your running agent, press ctrl-z in the console window. This is true both for an agent
that did not initialize properly due to a configuration error and for one that is running just fine.
Note, however, that ctrl-z does not terminate the Java process.
If you have a configuration error, press ctrl-z, correct your problem, and restart your agent. You
do not have to worry about killing the process; the existing process is able to reload the
configuration file.
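Since ctrl-z only suspends the shell job while the Java process stays alive, you can remove the process entirely with ordinary job control if you ever need to. This is not a required lab step, and the job number %1 below assumes a single suspended job:

```shell
# List the shell's jobs; the suspended agent shows up as "Stopped"
jobs
# Send SIGTERM to job 1 (adjust the job number to match your jobs output)
kill %1
```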
Now start your Flume agent. Override the default logging configuration and write
informational records to the console. Your agent’s name is agent1. Note: a lot of data will
be written to the console; ctrl-z will stop the output.
bin/flume-ng agent --name agent1 --conf conf --conf-file /home/virtuser/flume_agent1.properties -Dflume.root.logger=INFO,console

or

bin/flume-ng agent -n agent1 --conf conf -f /home/virtuser/flume_agent1.properties -Dflume.root.logger=INFO,console
1.7 Setup Your Agents

agent13 is to do the initial read of the data from files dropped into a specified directory.
__ 1. First create your directory. From a command line
mkdir /home/virtuser/flumesourcedata
__ 2. Open a new file in your text editor.
__ 3. Define the source, sink, and channel to be used by agent13 as well as the bindings.
__ a. The source name is spoolDirSource. Its type is spooldir.
__ b. The sink name is avroSink. Its type is avro. (Remember that to pass events from one
agent to another requires avro sinks and sources.)
The hostname for binding is localhost and the port is 10013.
__ c. The channel name is memChannel. Its type is memory and it has a capacity of 100.
#These statements are for agent13
agent13.sources = spoolDirSource
agent13.sinks = avroSink
agent13.channels = memChannel
agent13.sources.spoolDirSource.type = spooldir
agent13.sources.spoolDirSource.spoolDir= /home/virtuser/flumesourcedata
agent13.sinks.avroSink.type = avro
agent13.sinks.avroSink.hostname = localhost
agent13.sinks.avroSink.port = 10013
agent13.channels.memChannel.type = memory
agent13.channels.memChannel.capacity = 100
agent13.sources.spoolDirSource.channels = memChannel
agent13.sinks.avroSink.channel = memChannel
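As an aside, the plural/singular binding rule can be checked mechanically. The small sketch below is illustrative only (it is not part of the lab or of Flume itself); it parses the agent13 properties above and verifies that every source and sink binding refers to a channel declared in the agent's channels list:

```python
# Illustrative sanity check for a Flume-style properties snippet:
# every sink "channel" entry and source "channels" entry must name
# a channel declared in "<agent>.channels".
CONFIG = """
agent13.sources = spoolDirSource
agent13.sinks = avroSink
agent13.channels = memChannel
agent13.sources.spoolDirSource.type = spooldir
agent13.sources.spoolDirSource.spoolDir = /home/virtuser/flumesourcedata
agent13.sinks.avroSink.type = avro
agent13.sinks.avroSink.hostname = localhost
agent13.sinks.avroSink.port = 10013
agent13.channels.memChannel.type = memory
agent13.channels.memChannel.capacity = 100
agent13.sources.spoolDirSource.channels = memChannel
agent13.sinks.avroSink.channel = memChannel
"""

def check_bindings(config: str) -> list:
    # Parse "key = value" lines into a dict, skipping blanks and comments
    props = {}
    for line in config.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    errors = []
    for key, value in props.items():
        parts = key.split(".")
        # Match "<agent>.sinks.<name>.channel" and "<agent>.sources.<name>.channels"
        if len(parts) == 4 and parts[3] in ("channel", "channels"):
            declared = set(props.get(parts[0] + ".channels", "").split())
            for ch in value.split():
                if ch not in declared:
                    errors.append(f"{key} refers to undeclared channel {ch}")
    return errors

print(check_bindings(CONFIG))  # -> [] (all bindings resolve)
```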
__ 4. agent99 is to get its events from agent13. Since the avro sink for agent13 was bound to
localhost at port 10013, that implies that the avro source for agent99 will also be bound to
localhost at port 10013.
Also, you are going to enhance your events by adding a timestamp to the header for
each event.
Define the source, sink, and channel to be used by agent99 as well as the bindings.
#These statements are for agent99
agent99.sources = avroSource
agent99.sinks = avroSink
agent99.channels = memChannel
agent99.sources.avroSource.type = avro
agent99.sources.avroSource.bind = localhost
agent99.sources.avroSource.port = 10013
agent99.sources.avroSource.interceptors = ts
agent99.sources.avroSource.interceptors.ts.type = timestamp
agent99.sinks.avroSink.type = avro
agent99.sinks.avroSink.hostname = localhost
agent99.sinks.avroSink.port = 10099
agent99.channels.memChannel.type = memory
agent99.channels.memChannel.capacity = 100
agent99.sources.avroSource.channels = memChannel
agent99.sinks.avroSink.channel = memChannel
__ 5. agent86 is to get its events from agent99. So there must be an avro source to receive the
data and the data is to be passed to an hdfs sink.
Define the source, sink, and channel to be used by agent86 as well as the bindings.
__ a. The source name is avroSource. Its type is avro.
The bind parameter is localhost
The port is 10099
__ b. The sink name is hdfsSink. Its type is hdfs.
A portion of the hdfs path is created by extracting date and time information from the
header of each event: hdfs://rvm.svl.ibm.com:8020/user/virtuser/flume/%y-%m-%d/%H%M
The filePrefix is Log.
The writeFormat is Text
#These statements are for agent86
agent86.sources = avroSource
agent86.sinks = hdfsSink
agent86.channels = memChannel
agent86.sources.avroSource.type = avro
agent86.sources.avroSource.bind = localhost
agent86.sources.avroSource.port = 10099
agent86.sinks.hdfsSink.type = hdfs
agent86.sinks.hdfsSink.hdfs.path = hdfs://rvm.svl.ibm.com:8020/user/virtuser/flume/%y-%m-%d/%H%M
agent86.sinks.hdfsSink.hdfs.filePrefix = Log
agent86.sinks.hdfsSink.hdfs.writeFormat = Text
agent86.sinks.hdfsSink.hdfs.fileType = DataStream
agent86.channels.memChannel.type = memory
agent86.channels.memChannel.capacity = 100
agent86.sources.avroSource.channels = memChannel
agent86.sinks.hdfsSink.channel = memChannel
__ 6. Save your work into File System->home->virtuser and call the file
flume_agents.properties.
1.8 Test Your Configuration

agent13
__ 1. Open a PuTTY window, connect and change to the flume directory.
cd /usr/iop/4.1.0.0/flume/
__ 2. When you start agent13, even though you coded your configuration statements correctly,
you will see what looks like a Java exception when you start the agent. That is because
the avro sink is not able to connect to the source yet. Once agent99 starts and the avro
source does its bind, you should see a statement something like the following:
INFO sink.AvroSink: Avro sink avroSink: Building RpcClient with hostname: rvm, port: 10013
Execute the following:
bin/flume-ng agent -n agent13 --conf conf -f /home/virtuser/flume_agents.properties -Dflume.root.logger=INFO,console
agent99
__ 3. Open PuTTY window, connect and change to the flume directory.
cd /usr/iop/4.1.0.0/flume/
__ 4. When you start agent99, even though you coded your configuration statements correctly,
you will see what looks like a Java exception when you start the agent. That is because
the avro sink is not able to connect to the source yet. Once agent86 starts and the avro
source does its bind, you should see a statement something like the following:
Rpc sink avroSink: Building RpcClient with hostname: localhost, port: 10099
Execute the following:
bin/flume-ng agent -n agent99 --conf conf -f /home/virtuser/flume_agents.properties -Dflume.root.logger=INFO,console
agent86
__ 5. Open a PuTTY window, connect and change to the flume directory.
cd /usr/iop/4.1.0.0/flume/
__ 6. Start agent86. You should see a statement as follows:
INFO source.AvroSource: Avro source avroSource started.
Execute the following:
bin/flume-ng agent -n agent86 --conf conf -f /home/virtuser/flume_agents.properties -Dflume.root.logger=INFO,console
You should have 3 windows that look similar to this after they are all running:
agent13:
agent99:
agent86:
1.9 Transfer Data

__ 6. Next check to see that the data was moved to HDFS. Return to the terminal window
where you started agent86. You should see some statements indicating that a file was
created as a temporary file and then renamed. Notice that the directory structure is
made up of date and time information.
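The first steps of this section are missing from this copy of the lab; they have you create a file named text.txt (referenced in step 7) and drop it into agent13's spooling directory to set the whole flow in motion, roughly as follows. The file contents here are illustrative; the original lab text is not shown in this copy:

```shell
# Drop a finished file into the spooldir source's watched directory;
# agent13 will pick it up and forward its lines as events.
# (Contents are illustrative.)
echo "hello flume" > /home/virtuser/flumesourcedata/text.txt
```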
__ 7. View the contents of the newly created file. Return to the terminal window where you
created the text.txt file. Execute the following replacing the file name with your file name.
(You can do a copy and paste of the file name from the terminal window for agent86.)
Note: You must switch to the hdfs user to run hadoop commands unless you’ve changed permissions.
su hdfs
hadoop fs -cat /user/virtuser/flume/16-01-27/1342/Log.1453930980072
__ 8. Execute ctrl-z in each of the three windows where the agents are running in order to
terminate them.
__ 9. You can close your open terminal windows.
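For reference, the date-and-time directory structure seen in step 6 comes from the %y-%m-%d/%H%M escapes in agent86's hdfs.path. Those escapes happen to coincide with strftime codes, so their expansion can be sketched in Python. This is a rough illustration, not Flume's actual implementation; Flume formats the event's timestamp header in the agent's local time zone by default, while UTC is used here for a deterministic result:

```python
from datetime import datetime, timezone

def expand_hdfs_path(pattern: str, timestamp_ms: int) -> str:
    """Roughly mimic Flume's date escapes, which line up with strftime codes.

    Flume reads the timestamp (in milliseconds) from each event's header,
    which is where the interceptor added by agent99 comes in.
    """
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return dt.strftime(pattern)

# 1453930980072 is the file-name suffix seen in this lab's example output
print(expand_hdfs_path("/user/virtuser/flume/%y-%m-%d/%H%M", 1453930980072))
# -> /user/virtuser/flume/16-01-27/2143
```

The lab's own directory (16-01-27/1342) differs in the time portion because it was produced in the VM's local time zone, at the minute the file was actually rolled.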
End of exercise