Getting Started With AWS: Analyzing Big Data
Table of Contents
Analyzing Big Data ........................................................................................................................ 1
Key AWS Services for Big Data ................................................................................................ 1
Setting Up .................................................................................................................................... 3
Sign Up for AWS ................................................................................................................... 3
Create a Key Pair .................................................................................................................. 3
Tutorial: Sentiment Analysis ............................................................................................................. 5
Step 1: Create a Twitter Developer Account ................................................................................ 6
Step 2: Create an Amazon S3 Bucket ....................................................................................... 6
Step 3: Collect and Store the Sentiment Data ............................................................................. 7
Launch an Instance Using AWS CloudFormation ................................................................ 7
Collect Tweets Using the Instance .................................................................................... 8
Store the Tweets in Amazon S3 ...................................................................................... 10
Step 4: Customize the Mapper ............................................................................................... 10
Step 5: Create an Amazon EMR Cluster .................................................................................. 11
Step 6: Examine the Sentiment Analysis Output ........................................................................ 14
Step 7: Clean Up ................................................................................................................. 15
Tutorial: Web Server Log Analysis ................................................................................................... 16
Step 1: Create an Amazon EMR Cluster .................................................................................. 16
Step 2: Connect to the Master Node ........................................................................................ 18
Step 3: Start and Configure Hive ............................................................................................ 19
Step 4: Create the Hive Table and Load Data into HDFS ............................................................. 19
Step 5: Query Hive ............................................................................................................... 20
Step 6: Clean Up ................................................................................................................. 21
More Big Data Options ................................................................................................................. 23
Related Resources ...................................................................................................................... 25
Challenge: Data sets can be very large. Storage can become expensive, and data corruption and loss can have far-reaching implications.
Solution: Amazon S3 can store large amounts of data, and its capacity can grow to meet your needs. It is highly redundant and secure, protecting against data loss and unauthorized use. Amazon S3 also has an intentionally small feature set to keep its costs low.

Challenge: Maintaining a cluster of physical servers to process data is expensive and time-consuming.
Solution: When you run an application on a virtual Amazon EC2 server, you pay for the server only while the application is running, and you can increase the number of servers within minutes, not hours or days, to meet the processing needs of your application.

Challenge: Hadoop and other open-source big-data tools can be challenging to configure, monitor, and operate.
Solution: Amazon EMR handles cluster configuration, monitoring, and management. Amazon EMR also integrates open-source tools with other AWS services to simplify large-scale data processing, so you can focus on data analysis and extracting value.
Setting Up
Before you use AWS for the first time, complete the following tasks. (These steps prepare you for both
of the tutorials in this guide.)
Tasks
Sign Up for AWS (p. 3)
Create a Key Pair (p. 3)
2. From the navigation bar, in the region selector, click US West (Oregon).
3. In the navigation pane, click Key Pairs.
4.
5.
6.
Important
This is the only chance for you to save the private key file. You'll need to provide the name
of your key pair when you launch an instance and the corresponding private key each time
you connect to the instance.
7. Prepare the private key file. This process depends on the operating system of the computer that you're using.
If your computer runs Mac OS X or Linux, use the following command to set the permissions of
your private key file so that only you can read it.
$ chmod 400 my-key-pair.pem
If your computer runs Windows, use the following steps to convert your .pem file to a .ppk file
for use with PuTTY.
a.
b.
c.
d.
e.
f.
g.
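If the PuTTYgen graphical steps aren't convenient, the Unix command-line build of PuTTYgen (an assumption; it is packaged separately from the Windows GUI and may not be installed by default) can do the conversion in a single command:
$ puttygen my-key-pair.pem -o my-key-pair.ppk -O private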
1. Go to https://dev.twitter.com/user/login and log in with your Twitter user name and password. If you do not yet have a Twitter account, click the Sign up link that appears under the Username field.
2. If you have not yet used the Twitter developer site, you'll be prompted to authorize the site to use your account. Click Authorize app to continue.
3. Go to the Twitter applications page at https://dev.twitter.com/apps and click Create a new application. Follow the on-screen instructions. For the application Name, Description, and Website, you can specify any text; you're simply generating credentials to use with this tutorial, rather than creating a real application.
4. Twitter displays the details page for your new application. Click the Keys and Access Tokens tab to collect your Twitter developer credentials. You'll see a Consumer key and Consumer secret. Make a note of these values; you'll need them later in this tutorial. You may want to store your credentials in a text file.
5. At the bottom of the page, click Create my access token. Make a note of the Access token and Access token secret values that appear, or add them to the text file you created in the preceding step.
6. If you need to retrieve your Twitter developer credentials at any point, go to https://dev.twitter.com/apps and select the application that you created for the purposes of this tutorial.
2.
3.
   a. Specify a name for your bucket, such as mysentimentjob. To meet Hadoop requirements, your bucket name is restricted to lowercase letters, numbers, periods (.), and hyphens (-).
   b. For the region, select US Standard.
   c. Click Create.
4. Select your new bucket from the All Buckets list and click Create Folder. In the text box, specify input as the folder name, and then press Enter or click the check mark.
5. Repeat the previous step to create a folder named mapper at the same level as the input folder.
6. For the purposes of this tutorial (to ensure that all services can use the folders), make the folders public as follows:
a.
b.
c.
Under Stack, in the Name box, specify a name that is easy for you to remember. For example, my-sentiment-stack.
5. In the KeyPair box, specify the name of the key pair that you created in Create a Key Pair (p. 3). Note that this key pair must be in the US East (N. Virginia) region.
6. In the TwitterConsumerKey, TwitterConsumerSecret, TwitterToken, and TwitterTokenSecret boxes, specify your Twitter credentials. For best results, copy and paste the Twitter credentials from the Twitter developer site or the text file you saved them in.
   Note
   The order of the Twitter credential boxes on the Specify Parameters page may not match the display order on the Twitter developer site. Verify that you're pasting the correct value in each box.
7.
8. Click Next.
   Note
   Stacks take several minutes to launch. To see whether this process is complete, click Refresh. When your stack status is CREATE_COMPLETE, it's ready to use.
Note
You may encounter an issue installing Tweepy. If you do, edit setup.py as follows.
Remove the following:
Select the Outputs tab. Copy the DNS name of the Amazon EC2 instance that AWS CloudFormation
created from the EC2DNS key.
2. Connect to the instance using SSH. Specify the name of your key pair and the user name ec2-user. For more information, see Connect to Your Linux Instance.
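For example, from a Linux or Mac OS X terminal, the command looks similar to the following sketch; replace the DNS name with the EC2DNS value that you copied from the Outputs tab:
$ ssh -i my-key-pair.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com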
3. In the terminal window, run the following command:
$ cd sentiment
4. To collect tweets, run the following command, where term1 is your search term. Note that the collector script is not case sensitive. To use a multi-word term, enclose it in quotation marks.
$ python collector.py term1
Examples:
$ python collector.py kindle
$ python collector.py "kindle fire"
5. Press Enter to run the collector script. Your terminal window displays the following message:
Collecting tweets. Please wait.
Note
If your SSH connection is interrupted while the script is still running, reconnect to the instance
and run the script with nohup (for example, nohup python collector.py > /dev/null
&).
The script collects 500 tweets, which could take several minutes. If you're searching for a subject that is
not currently popular on Twitter (or if you edited the script to collect more than 500 tweets), the script will
take longer to run. You can interrupt it at any time by pressing Ctrl+C.
When the script has finished running, your terminal window displays the following message:
Finished collecting tweets.
1. In your SSH window, run the following command. (The current directory should still be sentiment. If it's not, use cd to navigate to the sentiment directory.)
   $ ls
   You should see a file named tweets.date-time.txt, where date and time reflect when the script was run. This file contains the ID numbers and full text of the tweets that matched your search terms.
2. To copy the Twitter data to Amazon S3, run the following command, where tweet-file is the file you identified in the previous step and your-bucket is the name of your bucket.
   Important
   Be sure to include the trailing slash, to indicate that input is a folder. Otherwise, Amazon S3 will create an object called input in your base S3 bucket.
   $ s3cmd put tweet-file s3://your-bucket/input/
   For example:
   $ s3cmd put tweets.Nov12-1227.txt s3://mysentimentjob/input/
3. To verify that the file was uploaded to Amazon S3, run the following command:
   $ s3cmd ls s3://your-bucket/input/
You can also use the Amazon S3 console to view the contents of your bucket and folders.
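If you prefer the AWS CLI to s3cmd (an optional alternative; this assumes the AWS CLI is installed and configured on the instance), the equivalent copy and list commands are:
$ aws s3 cp tweets.Nov12-1227.txt s3://mysentimentjob/input/
$ aws s3 ls s3://mysentimentjob/input/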
On your computer, open a text editor. Copy and paste the following script into a new file, and save
the file as sentiment.py.
#!/usr/bin/python
import sys
import cPickle as pickle
import nltk.classify.util
from nltk.tokenize import word_tokenize

def word_feats(words):
    return dict([(word, True) for word in words])

def subj(subjLine):
    subjgen = subjLine.lower()
    # Replace term1 with your subject term
    subj1 = "term1"
    if subjgen.find(subj1) != -1:
        subject = subj1
        return subject
    else:
        subject = "No match"
        return subject

def main(argv):
    # Load the pre-trained classifier provided with the tutorial
    classifier = pickle.load(open("classifier.p", "rb"))
    for line in sys.stdin:
        tolk_posset = word_tokenize(line.rstrip())
        d = word_feats(tolk_posset)
        subjectFull = subj(line)
        # Emit one aggregate record per tweet for the Hadoop aggregate reducer
        if subjectFull == "No match":
            print "LongValueSum:" + " " + subjectFull + ": " + "\t" + "1"
        else:
            print "LongValueSum:" + " " + subjectFull + ": " + classifier.classify(d) + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
2. Replace term1 with the search term that you used in Step 3: Collect and Store the Sentiment Data (p. 7) and save the file.
Important
Do not change the spacing in the file. Incorrect indentation causes the Hadoop streaming
program to fail.
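(Optional) Before you upload the mapper, you can sanity-check it locally with a pipeline that mimics Hadoop streaming. This is only a sketch; it assumes that classifier.p, your tweets file, and the NLTK module are all available on the machine where you run it:
$ cat tweets.Nov12-1227.txt | python sentiment.py | sort | head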
3. Open the Amazon S3 console and locate the mapper folder that you created in Step 2: Create an Amazon S3 Bucket (p. 6).
4. Click Upload and follow the on-screen instructions to upload your customized sentiment.py file.
5. Select the file, click Actions, and then click Make Public.
program in Amazon EMR, you specify a mapper and a reducer, as well as any supporting files. The following list provides a summary of the files you'll use for this tutorial.
• For the mapper, you'll use the file you customized in the preceding step.
• For the reducer method, you'll use the predefined Hadoop package aggregate. For more information about the aggregate package, see the Hadoop documentation. (An example of how aggregate sums the mapper output follows this list.)
• Sentiment analysis usually involves some form of natural language processing. For this tutorial, you'll use the Natural Language Toolkit (NLTK), a popular Python platform. You'll use an Amazon EMR bootstrap action to install the NLTK Python module. Bootstrap actions load custom software onto the instances that Amazon EMR provisions and configures. For more information, see Create Bootstrap Actions in the Amazon Elastic MapReduce Developer Guide.
• Along with the NLTK module, you'll use a natural language classifier file that we've provided in an Amazon S3 bucket.
• For the job's input data and output files, you'll use the Amazon S3 bucket you created (which now contains the tweets you collected).
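For example, the mapper emits one line per tweet in roughly the following form (illustrative values):
LongValueSum: kindle: positive    1
LongValueSum: kindle: negative    1
LongValueSum: kindle: positive    1
The aggregate reducer strips the LongValueSum: prefix and sums the values for each distinct key, producing output such as:
kindle: negative    1
kindle: positive    2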
Note that the files used in this tutorial are for illustration purposes only. When you perform your own
sentiment analysis, you'll need to write your own mapper and build a sentiment model that meets your
needs. For more information about building a sentiment model, see Learning to Classify Text in Natural
Language Processing with Python, which is provided for free on the NLTK site.
To create a cluster
1.
2.
3.
4.
Note
In a production environment, logging and debugging can be useful tools for analyzing errors
or inefficiencies in Amazon EMR steps or programs. For more information, see
Troubleshooting in the Amazon Elastic MapReduce Developer Guide.
5. Under Software Configuration, leave the default Hadoop distribution setting, Amazon, and select the latest 3.x AMI version. Under Applications to be installed, click each X to remove Hive, Pig, and Hue from the list.
6. Under File System Configuration and Hardware Configuration, leave the default settings.
7. Under Security and Access, select your key pair from EC2 key pair. Leave the default IAM user access. If this is your first time using Amazon EMR, click Create Default Role to create the default EMR role and then click Create Default Role to create the default EC2 instance profile.
8. In the Add bootstrap action list, select Custom action. You'll add a custom action that installs and configures the Natural Language Toolkit on the cluster.
Click Configure and add.
In the Add Bootstrap Action dialog box, in the Amazon S3 location box, copy and paste the
following path:
s3://awsdocs-tutorials/sentiment-analysis/config-nltk.sh
Alternatively, copy and paste the following script into a new file, upload that file to an Amazon
S3 bucket, and use that location instead:
#!/bin/bash
sudo yum -y install git gcc python-dev python-devel
sudo pip install -U numpy
sudo pip install pyyaml nltk
sudo pip install -e git://github.com/mdp-toolkit/mdp-toolkit#egg=MDP
sudo python -m nltk.downloader -d /usr/share/nltk_data all
d.
9.
c. In the Add Step dialog box, configure the job as follows, replacing your-bucket with the name of the Amazon S3 bucket you created earlier, and then click Add:
   • Set Name to Sentiment analysis
   • Set Mapper to s3://your-bucket/mapper/sentiment.py
   • Set Reducer to aggregate
   • Set Input S3 location to s3://your-bucket/input/
   • Set Output S3 location to s3://your-bucket/output/ (note that this folder must not exist)
   • Set Arguments to -cacheFile s3://awsdocs-tutorials/sentiment-analysis/classifier.p#classifier.p
   • Set Action on failure to Continue
2.
3.
4.
Open the Amazon S3 console and locate the bucket you created in Step 2: Create an Amazon S3 Bucket (p. 6). You should see a new output folder in your bucket. You might need to click the refresh arrow in the top right corner to see the new folder.
The job output is split into several files: an empty status file named _SUCCESS and several
part-xxxxx files. The part-xxxxx files contain sentiment measurements generated by the Hadoop
streaming program.
Select an output file, click Actions, and then click Download. Right-click the link in the pop-up window
and click Save Link As to download the file.
Repeat this step for each output file.
Open the files in a text editor. You'll see the total number of positive and negative tweets for your
search term, as well as the total number of tweets that did not match any of the positive or negative
terms in the classifier (usually because the subject term was in a different field, rather than in the
actual text of the tweet).
For example:
kindle: negative    13
kindle: positive    479
No match:           8
In this example, the sentiment is overwhelmingly positive. In most cases, the positive and negative
totals are closer together. For your own sentiment analysis work, you'll want to collect and compare
data over several time periods, possibly using several different search terms, to get as accurate a
measurement as possible.
Step 7: Clean Up
To prevent your account from accruing additional charges, you should clean up the AWS resources that
you created for this tutorial.
Because you ran a Hadoop streaming program and set it to auto-terminate after running the steps in the
program, the cluster should have been automatically terminated when processing was complete.
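If you want to confirm this from the command line (optional; the tutorial itself uses the console, and this assumes you have the AWS CLI installed and configured), you can list any clusters that are still active and terminate stragglers by cluster ID:
$ aws emr list-clusters --active
$ aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX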
To create a cluster
1.
2.
3.
4.
Specify a Cluster name or leave the default value of My cluster. Set Termination protection to
No and clear the Logging Enabled check box.
Note
In a production environment, logging and debugging can be useful tools for analyzing errors
or inefficiencies in Amazon EMR steps or programs. For more information, see
Troubleshooting in the Amazon Elastic MapReduce Developer Guide.
5. Under Software Configuration, leave the default Hadoop distribution setting, Amazon, and the latest AMI version. Under Applications to be installed, leave the default Hive settings. Click the X to remove Pig from the list.
6.
Note
When you analyze data in a real application, you might want to increase the size or number
of these nodes to improve processing power and reduce computational time. You may also
want to use spot instances to further reduce your costs. For more information, see Lowering
Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.
7. Under Security and Access, select your key pair from EC2 key pair. Leave the default IAM user access. If this is your first time using Amazon EMR, click Create Default Role to create the default EMR role and then click Create Default Role to create the default EC2 instance profile.
8. Leave the default Bootstrap Actions and Steps settings. Bootstrap actions and steps allow you to customize and configure your application. For this tutorial, we are using Hive, which is already included in our configuration.
9. At the bottom of the page, click Create cluster.
When the summary of your new cluster appears, the status is Starting. It takes a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster and change the status to Waiting.
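To connect, use SSH with your key pair and the user name hadoop. The following is only a sketch; substitute your cluster's Master public DNS name:
$ ssh -i my-key-pair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com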
When you've successfully connected to the master node, you'll see a welcome message and prompt
similar to the following:
-----------------------------------------------------------------------------
Welcome to Amazon Elastic MapReduce running Hadoop and Amazon Linux.

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands:

  ResourceManager    lynx http://ip-172-16-43-158:9026/
  NameNode           lynx http://ip-172-16-43-158:9101/
-----------------------------------------------------------------------------
[hadoop@ip-172-16-43-158 ~]$
In the terminal window for the master node, at the hadoop prompt, run the following command.
[hadoop@ip-172-16-43-158 ~]$ hive
Tip
If the hive command is not found, be sure that you specified hadoop as the user name when
connecting to the master node, not ec2-user. Otherwise, close this connection and connect
to the master node again.
2.
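Add the Hive contrib JAR, which provides the SerDe used later in this tutorial, to the Hive class path. Based on the confirmation message shown below, the command looks similar to the following sketch; adjust the path if your AMI version stores the JAR elsewhere:
hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;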
Tip
If /home/hadoop/hive/hive_contrib.jar is not found, it's possible that there's a
problem with the AMI that you selected. Follow the directions in Clean Up (p. 21) and then
start this tutorial again, using a different AMI version.
When the command completes, you'll see a confirmation message similar to the following:
Added /home/hadoop/hive/lib/hive_contrib.jar to class path
Added resource: /home/hadoop/hive/lib/hive_contrib.jar
translation using a serializer/deserializer (SerDe). SerDes exist for a variety of data formats. For information
about how to write a custom SerDe, see the Apache Hive Developer Guide.
The SerDe we'll use in this example uses regular expressions to parse the log file data. It comes from
the Hive open-source community. Using this SerDe, we can define the log files as a table, which we'll
query using SQL-like statements later in this tutorial. After Hive has loaded the data, the data persists in
HDFS storage as long as the Amazon EMR cluster is running, even if you shut down your Hive session
and close the SSH connection.
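A statement of this kind defines an external table over the log files, giving the RegexSerDe a regular expression with one capturing group per column. The following is a minimal sketch only; the column list, regular expression, and LOCATION shown here are illustrative placeholders rather than the tutorial's actual values:
CREATE EXTERNAL TABLE serde_regex (
  host STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '([^ ]*) (\\[[^\\]]*\\]) ("[^"]*") ([0-9]*) ([0-9-]*) ("[^"]*") ("[^"]*")'
)
LOCATION 's3://your-bucket/apache-logs/';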
2. At the hive command prompt, paste the command (use Ctrl+Shift+V in a terminal window or right-click in a PuTTY window), and then press Enter.
When the command finishes, you'll see a message like this one:
OK
Time taken: 12.56 seconds
Note that the LOCATION parameter specifies the location of a set of sample Apache log files in Amazon
S3. To analyze your own Apache web server log files, replace the URL in the example command with
the location of your own log files in Amazon S3.
Example 1: Count the number of rows in the Apache web server log files
select count(1) from serde_regex;
After the query finishes, you'll see output similar to the following:
Total MapReduce CPU Time Spent: 13 seconds 860 msec
OK
239344
Time taken: 86.92 seconds, Fetched: 1 row(s)
Example 2: Return all fields from one row of log file data
select * from serde_regex limit 1;
After the query finishes, you'll see output similar to the following:
OK
66.249.67.3   [20/Jul/2009:20:12:22 -0700]   "GET /gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode=detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15741&g2_returnName=photo HTTP/1.1"   302   5   "-"   "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Time taken: 4.444 seconds, Fetched: 1 row(s)
Example 3: Count the number of requests from the host with an IP address of 192.168.1.198
select count(1) from serde_regex where host="192.168.1.198";
After the query finishes, you'll see output similar to the following:
Total MapReduce CPU Time Spent: 13 seconds 870 msec
OK
46
Time taken: 73.077 seconds, Fetched: 1 row(s)
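You can adapt these queries to other columns defined for the table. For example, assuming the table definition includes a status column for the HTTP response code (an assumption; use the column names from your own CREATE TABLE statement), you could count requests by response code:
select status, count(1) from serde_regex group by status;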
Step 6: Clean Up
To prevent your account from accruing additional charges, clean up the following AWS resources that
you created for this tutorial.
3.
2.
3.
If you're not able to find an open-source tool that meets your needs, you can write a custom Hadoop
map-reduce application and run it on Amazon EMR. For more information, see Run a Hadoop Application
to Process Data in the Amazon Elastic MapReduce Developer Guide.
Alternatively, you can create a Hadoop streaming job that reads data from standard input. For an
example, see Tutorial: Sentiment Analysis (p. 5) in this guide. For more details, see Launch a Streaming
Cluster in the Amazon Elastic MapReduce Developer Guide.
Related Resources
The following list describes some of the AWS resources that you'll find useful as you work with AWS.

AWS Documentation
Official documentation for each AWS product, including service introductions, service features, and API reference.

Contact Us
The hub for creating and managing your AWS Support cases. Also includes links to other helpful resources, such as forums, technical FAQs, service health status, and AWS Trusted Advisor.

AWS Support
The home page for AWS Support, a one-on-one, fast-response support channel to help you build and run applications in the cloud.

Provides access to information, tools, and resources to compare the costs of Amazon Web Services with IT infrastructure alternatives.

Provides technical whitepapers that cover topics such as architecture, security, and economics. These whitepapers have been written by the Amazon team, customers, and solution providers.

AWS Blogs

AWS Podcast