Cloudera Data Analyst Training Exercise Manual
General Notes
Cloudera provides a hands-on environment to accompany this training course. This
consists of a virtual machine (VM) running Linux, with a recent version of CDH
already installed and configured for you. This VM runs in pseudo-distributed mode, a
configuration that enables a Hadoop cluster to run on a single machine.
Restarting Required Services
If any of the services required for these exercises stop or fail, you can restart them by
running the following script:
$ $ADIR/scripts/restart_services.sh
2. Exercises often contain steps with commands that look like this:
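For example, a command continued across two lines might look like this (the specific
command shown is just an illustration):
$ hdfs dfs -ls \
/user/training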
The dollar sign ($) represents the command prompt. Do not include this character
when copying and pasting commands into your terminal window. Also, the
backslash (\) signifies that the command continues on the next line. You may
either enter the code as shown (on two lines), or omit the backslash and type the
command on a single line.
HDFS Warnings
Due to a bug in HDFS, you may see java.lang.InterruptedException warnings
when running some of the commands in these exercises. These warnings are harmless
and can be safely ignored.
3. Although many students are comfortable using UNIX text editors like vi or emacs,
some might prefer a graphical text editor. To invoke the graphical editor from
the command line, type gedit followed by the path of the file you wish to edit.
Appending & to the command allows you to type additional commands while the
editor is still open. Here is an example of how to edit a file named myfile.txt:
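$ gedit myfile.txt &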
Catch-Up Script
If you are unable to complete an exercise, we have provided a script to catch you up
automatically. Each exercise has instructions for running the catch-up script.
Bonus Exercises
Many of the exercises contain one or more optional “bonus” sections. We encourage you
to work through these if time remains after you finish the main exercise and you would
like an additional challenge to practice what you have learned.
In this exercise, you will practice using the Hadoop command line utility to
interact with Hadoop’s Distributed Filesystem (HDFS) and use Sqoop to import
tables from a relational database to HDFS.
2. In Firefox, click the Hue bookmark in the bookmark toolbar (or type
http://localhost:8888/hue/home into the address bar and press the
Enter key.)
3. After a few seconds, you should see Hue’s home screen. The first time you log in,
you will be prompted to create a new username and password. Enter training in
both the username and password fields, and then click the Create Account button.
Note: You can close the tour that appears. These exercises will present the features
you need for this course.
4. Click the menu icon ( ) to the left of the Hue icon, then click Browsers > Files. The
File Browser displays your HDFS home directory (since your user ID on the cluster
is training, your home directory in HDFS is /user/training). This directory
does not yet contain any files or directories.
In Case of Error
If Hue displays the error message Cannot access:
/user/training/, this indicates that one or more of the required
services has failed and needs to be restarted. Refer to the above
section “Restarting Required Services” for instructions about how
to restart the required services.
5. Create a temporary sub-directory: select the +New menu and click Directory.
6. Enter directory name test and click the Create button. Your home directory now
contains a directory called test.
7. Click test to view the contents of that directory; currently it contains no files or
subdirectories.
10. Choose any of the data files in that directory and click the Open button.
In Case of Error
If Hue displays the error message Error: IOException:
Failed to find datanode, this indicates that one or more of
the required services has failed and needs to be restarted. Refer to
the above section “Restarting Required Services” for instructions
about how to restart the required services.
11. The file you selected will be loaded into the current HDFS directory. Click the
filename to see the file’s contents. Because HDFS is designed to store very large
files, Hue will not display the entire file, just the first page of data. You can click the
arrow buttons or use the scrollbar to see more of the data.
12. Return to the test directory by clicking View file location in the left hand panel.
13. Above the list of files in your current directory is the full path of the directory you
are currently displaying. You can click on any directory in the path, or on the first
slash (/) to go to the top level (root) directory. Click training to return to your
home directory.
14. Delete the temporary test directory you created, including the file in it, by
selecting the checkbox next to the directory name then clicking the Move to trash
button. Confirm that you want to delete by clicking Yes.
$ hdfs dfs
This displays a help message describing all subcommands associated with hdfs
dfs.
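Now try listing the top level of HDFS:
$ hdfs dfs -ls /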
This lists the contents of the HDFS root directory. One of the directories listed
is /user. Each user on the cluster has a “home” directory below /user
corresponding to his or her user ID.
18. If you do not specify a path, hdfs dfs assumes you are referring to your home
directory:
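$ hdfs dfs -ls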
19. Note the /dualcore directory. Most of your work in this course will be in that
directory. Try creating a temporary subdirectory in /dualcore:
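$ hdfs dfs -mkdir /dualcore/test1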
20. Next, add a web server log file to this new directory in HDFS:
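The following command assumes the log file is in the course data directory; adjust the
path if your copy of access.log is elsewhere:
$ hdfs dfs -put $ADIR/data/access.log /dualcore/test1/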
To remove a file:
$ hdfs dfs -rm /dualcore/example.txt
21. Verify the last step by listing the contents of the /dualcore/test1 directory.
You should observe that the access.log file is present and occupies 106,339,468
bytes of space in HDFS:
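$ hdfs dfs -ls /dualcore/test1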
23. In a terminal window, log in to MySQL and select the dualcore database:
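The exact invocation may vary; a command like the following works with the training
account used elsewhere in these exercises:
$ mysql --user=training --password=training dualcore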
24. Next, list the available tables in the dualcore database (mysql> represents the
MySQL client prompt and is not part of the command):
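mysql> SHOW TABLES;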
25. Review the structure of the employees table and examine a few of its records:
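For example (the LIMIT value here is arbitrary):
mysql> DESCRIBE employees;
mysql> SELECT * FROM employees LIMIT 10;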
26. Exit MySQL by typing quit, and press the Enter key:
mysql> quit
27. Run the following command, which imports the employees table into the
/dualcore directory in HDFS, using tab characters to separate each field:
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table employees
It is normal for Sqoop to produce a lot of log messages, which are shown in your
terminal window, during the import process.
Hiding Passwords
Typing the database password on the command line is a potential
security risk since others may see it. An alternative to using the
--password argument is to use -P and let Sqoop prompt you for
the password, which is then not visible when you type it.
28. Revise the previous command and import the customers table into HDFS.
29. Revise the previous command and import the products table into HDFS.
30. Revise the previous command and import the orders table into HDFS.
31. Next, import the order_details table into HDFS. The command is slightly
different because this table only holds references to records in the orders and
products tables, and it lacks a primary key of its own. Consequently, you will need
to specify the --split-by option and instruct Sqoop to divide the import work
among tasks based on values in the order_id field. An alternative is to use the
-m 1 option to force Sqoop to import all the data with a single task, but this would
significantly reduce performance.
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table order_details \
--split-by order_id
In this exercise, you will practice using Pig to explore, correct, and reorder data in
files from two different ad networks. You will first experiment with small samples
of this data using Pig in local mode, and once you are confident that your ETL
scripts work as you expect, you will use them to process the complete data sets in
HDFS by using Pig in MapReduce mode.
IMPORTANT: This exercise builds on the previous one. If you were unable to complete
the previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
Background Information
Dualcore has recently started using online advertisements to attract new customers to
our e-commerce site. Each of the two ad networks we use provides data about the ads
that have been placed. This includes the site where the ad was placed, the date when it
was placed, what keywords triggered its display, whether the user clicked the ad, and
the per-click cost.
Unfortunately, the data from each network is in a different format. Each file also
contains some invalid records. Before we can analyze the data, we must first correct
these problems by using Pig to do the following:
$ cd $ADIR/exercises/pig_etl
2. Copy a small number of records from the input file to another file on the local file
system. When you start Pig, you will run in local mode. For testing, you can work
faster with small local files than large files in HDFS.
It is not essential to choose a random sample here—just a handful of records in the
correct format will suffice. Use the command below to capture the first 25 records
so you have enough to test your script:
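The source file is assumed to be in the course data directory:
$ head -n 25 $ADIR/data/ad_data1.txt > sample1.txt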
3. Start the Grunt shell in local mode so that you can work with the local
sample1.txt file.
$ pig -x local
grunt>
4. Load the data in the sample1.txt file into Pig and dump it:
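grunt> data = LOAD 'sample1.txt';
grunt> DUMP data;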
You should see the 25 records that comprise the sample data file.
5. Load the first two columns’ data from the sample file as character data, and then
dump that data:
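A statement along these lines will work (the relation name is illustrative; Pig discards
the extra columns that are not named in the schema):
grunt> first_two = LOAD 'sample1.txt' AS (keyword:chararray, campaign_id:chararray);
grunt> DUMP first_two;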
7. See what happens if you run the DESCRIBE command on data. Recall that when
you loaded data, you did not define a schema.
grunt> QUIT;
9. Edit the first_etl.pig file to complete the LOAD statement and read the data
from the sample1.txt file you created earlier. The following table shows the
format of the data in the file. For simplicity, you should leave the date and time
fields separate, so each will be of type chararray, rather than converting them to
a single field of type datetime.
Index Field Data Type Description Example
0 keyword chararray Keyword that triggered ad tablet
1 campaign_id chararray Uniquely identifies our ad A3
2 date chararray Date of ad display 05/29/2013
3 time chararray Time of ad display 15:49:21
4 display_site chararray Domain where ad shown www.example.com
5 was_clicked int Whether ad was clicked 1
6 cpc int Cost per click, in cents 106
7 country chararray Name of country in which ad ran USA
8 placement chararray Where on page was ad displayed TOP
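One possible form of the completed LOAD statement, based on the table above (the
relation name data is illustrative):
data = LOAD 'sample1.txt' AS (keyword:chararray, campaign_id:chararray,
    date:chararray, time:chararray, display_site:chararray, was_clicked:int,
    cpc:int, country:chararray, placement:chararray);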
10. Once you have edited the LOAD statement, try it by running your script in local
mode:
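$ pig -x local first_etl.pig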
Ensure that the output shows all fields in the expected order and that the values appear
similar in format to those shown in the table above before proceeding to the next
step. You may find it helpful to also use a DESCRIBE statement at the end of the
script to display the structure of the data.
11. Make each of the following changes, running your script in local mode after each
one to verify that your change is correct:
a. Filter out all records except those where the country field equals USA.
c. Update your script to convert the keyword field to uppercase and to remove
any leading or trailing whitespace. (Hint: You can nest calls to the two built-in
functions inside the FOREACH ... GENERATE statement from the previous step;
a sketch follows below.)
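A sketch of steps 11a and 11c (relation names are illustrative; the field order shown
matches the output format used later in this exercise):
-- keep only US records
usa_only = FILTER data BY country == 'USA';
-- trim and uppercase the keyword while generating the output fields
reordered = FOREACH usa_only GENERATE campaign_id, date, time,
    TRIM(UPPER(keyword)) AS keyword, display_site, placement, was_clicked, cpc;
12. Add the full dataset for the first ad network to HDFS (the local file is assumed to
be in the course data directory):
$ hdfs dfs -put $ADIR/data/ad_data1.txt /dualcore/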
13. Edit first_etl.pig and change the path in the LOAD statement to match the
path of the file you just added to HDFS (/dualcore/ad_data1.txt).
14. Next, replace DUMP with a STORE statement that will write the output of your
processing as tab-delimited records to the /dualcore/ad_data1 directory.
15. Run this script in Pig’s cluster mode to analyze the entire file in HDFS:
$ pig first_etl.pig
If your script fails, check your code carefully, fix the error, and then try running
it again. Don’t forget that you must remove output in HDFS from a previous run
before you execute the script again.
16. Check the first 20 output records that your script wrote to HDFS and ensure they
look correct:
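The output file names may vary; a command like the following shows the first 20
records:
$ hdfs dfs -cat /dualcore/ad_data1/part* | head -n 20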
You can ignore the message that says cat is unable to write to the output stream;
this simply happens because you are writing more data with the hdfs dfs -cat
command than you are reading with the head command.
17. Create a small sample of the data from the second ad network that you can test
locally while you develop your script:
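Assuming the second network's data file is ad_data2.txt in the course data directory,
a command like this creates the sample:
$ head -n 25 $ADIR/data/ad_data2.txt > sample2.txt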
18. Edit the second_etl.pig file to complete the LOAD statement and read the data
from the sample you just created. (Hint: The fields are comma-delimited.) The
following table shows the order of fields in this file, again with each row of the table
representing the contents of one column of the data:
Index Field Data Type Description Example
0 campaign_id chararray Uniquely identifies our ad A3
1 date chararray Date of ad display 05/29/2013
2 time chararray Time of ad display 15:49:21
3 display_site chararray Domain where ad shown www.example.com
4 placement chararray Ad display location on page TOP
5 was_clicked int Whether ad was clicked Y
6 cpc int Cost per click, in cents 106
19. Once you have edited the LOAD statement, use the DESCRIBE keyword and then
run your script in local mode to check that the schema matches the table above:
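$ pig -x local second_etl.pig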
20. Replace DESCRIBE with a DUMP statement and then make each of the following
changes to second_etl.pig, running this script in local mode after each change
to verify what you’ve done before you continue with the next step:
a. This ad network sometimes logs a given record twice. Add a statement to the
second_etl.pig file so that you remove any duplicate records. If you have
done this correctly, you should only see one record where the display_site
field has a value of siliconwire.example.com.
b. As before, you need to store the fields in a different order than you received
them. Use a FOREACH ... GENERATE statement to create a new relation
containing the fields in the same order you used to write the output from the first
ad network (shown again in the table below). Also, convert the keyword field
to uppercase and remove any leading or trailing whitespace, as you did with the
data from the first ad network:
Index Field Description
0 campaign_id Uniquely identifies our ad
1 date Date of ad display
2 time Time of ad display
3 keyword Keyword that triggered ad
4 display_site Domain where ad shown
5 placement Where on page was ad displayed
6 was_clicked Whether ad was clicked
7 cpc Cost per click, in cents
c. The date field in this dataset is in the format MM-DD-YYYY, while the data
you previously wrote is in the format MM/DD/YYYY. Edit the FOREACH ...
GENERATE statement to call the REPLACE(date, '-', '/') function to
correct this.
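A sketch of steps 20a through 20c (relation names are illustrative; the loaded relation
is assumed to include a keyword field, as implied by the output table above):
-- remove duplicate records
unique_data = DISTINCT data;
-- reorder fields, clean up the keyword, and normalize the date format
reordered = FOREACH unique_data GENERATE campaign_id,
    REPLACE(date, '-', '/') AS date, time, TRIM(UPPER(keyword)) AS keyword,
    display_site, placement, was_clicked, cpc;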
21. Once you are sure the script works locally, add the full dataset to HDFS:
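Assuming the full dataset is ad_data2.txt in the course data directory:
$ hdfs dfs -put $ADIR/data/ad_data2.txt /dualcore/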
22. Edit the script to have it LOAD the file you just added to HDFS, and then replace
the DUMP statement with a STORE statement to write your output as tab-delimited
records to the /dualcore/ad_data2 directory.
23. Run your script against the data you added to HDFS:
$ pig second_etl.pig
24. Check the first 15 output records written in HDFS by your script:
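The output file names may vary; a command like the following shows the first 15
records:
$ hdfs dfs -cat /dualcore/ad_data2/part* | head -n 15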
In this exercise, you will write Pig scripts that analyze data to optimize our
advertising with two online ad networks, helping Dualcore to save money and
attract new customers.
IMPORTANT: This exercise builds on the previous one. If you were unable to complete
the previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
$ cd $ADIR/exercises/analyze_ads
2. Obtain a local subset of the input data by running the following command:
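A command along these lines creates a small local test file from the first ad network's
data in HDFS (the sample size is arbitrary):
$ hdfs dfs -cat /dualcore/ad_data1/part* | head -n 100 > test_ad_data.txt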
You can ignore the message that says cat is unable to write to the output stream;
this simply happens because you are writing more data with the hdfs dfs -cat
command than you are reading with the head command.
Note: As mentioned in the previous exercise, it is faster to test Pig scripts by using
a local subset of the input data. You can use local subsets of data when testing
Pig scripts throughout this course. Although explicit steps are not provided for
creating local data subsets in upcoming exercises, doing so will help you perform
the exercises more quickly.
3. Open the low_cost_sites.pig file in your editor, and then make the following
changes:
b. Add a line that creates a new relation to include only records where
was_clicked has a value of 1.
d. Create a new relation that includes two fields: the display_site and the
total cost of all clicks on that site.
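A sketch of steps 3b and 3d (relation names are illustrative; the GROUP BY
display_site step corresponds to an intermediate step not shown above):
-- keep only ads that were actually clicked
clicked = FILTER data BY was_clicked == 1;
-- group by site and total the cost of all clicks on each site
by_site = GROUP clicked BY display_site;
site_costs = FOREACH by_site GENERATE group AS display_site,
    SUM(clicked.cpc) AS total_cost;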
4. Once you have made these changes, try running your script against the sample data:
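$ pig -x local low_cost_sites.pig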
5. In the LOAD statement, replace the test_ad_data.txt file with a file glob
(pattern) that will load both the /dualcore/ad_data1 and /dualcore/
ad_data2 directories (and does not load any other data, such as the text files from
the previous exercise).
6. Once you have made these changes, try running your script against the data in
HDFS:
$ pig low_cost_sites.pig
7. Since this will be a slight variation on the code you have just written, copy that file
as high_cost_keywords.pig:
$ cp low_cost_sites.pig high_cost_keywords.pig
8. Edit the high_cost_keywords.pig file and make the following three changes:
c. Display the top five results to the screen instead of the top three as before.
9. Once you have made these changes, try running your script against the data in
HDFS:
$ pig high_cost_keywords.pig
$ cd bonus_01
a. Group the records (filtered by was_clicked == 1) so that you can call the
aggregate function in the next step.
b. Invoke the COUNT function to calculate the total number of clicked ads, as
sketched below. (Hint: Because we shouldn’t have any null records, you can use the
COUNT function instead of COUNT_STAR, and the choice of field you supply to the
function is arbitrary.)
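A sketch of these two steps (relation names are illustrative):
-- group all filtered records into a single group
everything = GROUP clicked ALL;
-- count the clicked ads; the field passed to COUNT is arbitrary
total = FOREACH everything GENERATE COUNT(clicked.campaign_id);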
3. Once you have made these changes, try running your script against the data in
HDFS:
$ pig total_click_count.pig
$ cd ../bonus_02
2. Because this code will be similar to the code you wrote in the previous step, start by
copying that file as project_next_campaign_cost.pig:
$ cp ../bonus_01/total_click_count.pig \
project_next_campaign_cost.pig
a. Since you are trying to determine the highest possible cost, you should not
limit your calculation to the cost for ads actually clicked. Remove the FILTER
statement so that you consider the possibility that any ad might be clicked.
b. Change the aggregate function to the one that returns the maximum value in the
cpc field. (Hint: Don’t forget to change the name of the relation in the preceding
GROUP statement, and the name of the relation the cpc field comes from, to
account for the removal of the FILTER statement in the previous instruction
step.)
4. Once you have made these changes, try running your script against the data in
HDFS:
$ pig project_next_campaign_cost.pig
Question: What is the maximum you expect this campaign might cost? You can
compare your solution to the one in the bonus_02/solution/ subdirectory.
$ cd ../bonus_03
a. Within the nested FOREACH, filter the records to include only records for which
the ad was clicked.
b. Create a new relation on the line that follows the FILTER statement which
counts the number of records within the current group.
c. Add another line below that to calculate the click-through rate in a new field
named ctr.
d. After the nested FOREACH, sort the records in ascending order of click-through
rate and display the first three to the screen.
3. Once you have made these changes, try running your script against the data in
HDFS:
$ pig lowest_ctr_by_site.pig
If you still have time remaining, modify your script to display the three keywords
with the highest click-through rate. You can compare your solution to the
highest_ctr_by_keyword.pig file in the solution directory.
In this exercise, you will practice combining, joining, and analyzing the product
sales data previously exported from Dualcore’s MySQL database so you can
observe the effects that our recent advertising campaign has had on sales.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
$ cd $ADIR/exercises/disparate_datasets
a. Following the FILTER statement, create a new relation with just one field: the
order’s year and month. (Hint: Use the SUBSTRING built-in function to extract
the first part of the order_dtm field, which contains the month and year.)
b. Count the number of orders in each of the months you extracted in the previous
step.
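A sketch of steps a and b (relation names are illustrative; the SUBSTRING indices assume
an order_dtm value that begins with the year and month, such as 2013-05):
-- extract the year and month portion of the order timestamp
months = FOREACH recent_orders GENERATE SUBSTRING(order_dtm, 0, 7) AS month;
-- count the orders in each month
by_month = GROUP months BY month;
order_counts = FOREACH by_month GENERATE group AS month, COUNT(months) AS num_orders;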
3. Once you have made these changes, try running your script against the data in
HDFS:
$ pig count_orders_by_period.pig
Question: Does the data suggest that the advertising campaign we started in May
led to a substantial increase in orders?
a. Join the two relations on the order_id field they have in common.
b. Create a new relation from the joined data that contains a single field:
the order’s year and month, similar to what you did previously in the
count_orders_by_period.pig file.
c. Group the records by month and then count the records in each group.
5. Once you have made these changes, try running your script against the data in
HDFS:
$ pig count_tablet_orders_by_period.pig
Question: Does the data show an increase in sales of the advertised product
corresponding to the month in which Dualcore’s campaign was active?
$ cd bonus_01
a. Filter the orders by date (using a regular expression) to include only those
placed during the campaign period (May 1, 2013 through May 31, 2013).
b. Filter the orders to include only those that contain the advertised product
(product ID 1274348). (Hint: Apply DISTINCT to remove duplicate records
representing orders that contain two or more of the advertised product.)
c. Create a new relation containing the order_id and product_id fields for
these orders.
3. Once you have made these changes, try running your script against the data in
HDFS:
$ pig average_order_size.pig
Question: Does the data show that the average order contained at least two items
in addition to the tablet we advertised?
Since we are considering the total sales price of orders in addition to the number of
orders a customer has placed, not every customer with at least five orders during
2012 will qualify. In fact, only about one percent of our customers will be eligible for
membership in one of these three groups.
During this exercise, you will write the code needed to filter the list of orders based on
date, group them by customer ID, count the number of orders per customer, and then
filter this to exclude any customer who did not have at least five orders. You will then
join this information with the order details and products datasets in order to calculate
the total sales of those orders for each customer, split them into the groups based on
the criteria described above, and then write the data for each group (customer ID and
total sales) into a separate directory in HDFS.
$ cd ../bonus_02
2. Edit the loyalty_program.pig file and implement the steps described above.
The code to load the three datasets you will need is already provided for you.
3. After you have written the code, run it against the data in HDFS:
$ pig loyalty_program.pig
4. If your script completed successfully, use the hdfs dfs -getmerge command
to create a local text file for each group so you can check your work (note that the
name of the directory shown here may not be the same as the one you chose):
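For example, if you wrote the groups to directories named platinum, gold, and silver
under /dualcore/loyalty (substitute whatever paths you actually used):
$ hdfs dfs -getmerge /dualcore/loyalty/platinum platinum.txt
$ hdfs dfs -getmerge /dualcore/loyalty/gold gold.txt
$ hdfs dfs -getmerge /dualcore/loyalty/silver silver.txt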
5. Use the UNIX head and tail commands to check a few records and ensure that
the total sales prices fall into the correct ranges:
$ head platinum.txt
$ tail gold.txt
$ head silver.txt
$ wc -l platinum.txt
$ wc -l gold.txt
$ wc -l silver.txt
In this exercise, you will practice using the Hue query editor and the Impala and
Beeline shells to execute simple queries. These exercises use the tables that
have been populated with data you imported to HDFS using Sqoop in an earlier
exercise.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
1. Start the Firefox web browser if it isn’t running, then click on the Hue bookmark in
the Firefox bookmark toolbar (or type http://localhost:8888/home into the
address bar and press the Enter key).
2. After a few seconds, you should see Hue’s home screen. If you don’t currently have
an active session, you will first be prompted to log in. Enter training in both the
username and password fields, and then click the Sign In button.
3. Click the Query button. Note that there are query editors for both Hive and Impala
(as well as other tools such as Pig), accessible using the drop-down menu attached
to this button. The interface is very similar for both Hive and Impala. For these
exercises, you will use the Impala query editor.
4. Make sure the default database is selected in the database list on the left side of
the page.
5. Below the selected database is a list of the tables in that database. Click the
customers table to view the columns in the table. It may take a few seconds to
display the columns.
6. Hover over the customers row and click the information icon (i) that appears next to
the table name. Then click the Sample tab to see sample data from the table.
8. All you know about the winner is that her name is Bridget and she lives in Kansas
City. In the Impala Query Editor, enter a query in the text area to find the winning
customer. Use the LIKE operator to do a wildcard search for names such as Bridget,
Bridgette, or Bridgitte. Remember to filter on the customer’s city.
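One possible query (the column names fname and city are assumptions about the
customers table's schema):
SELECT *
FROM customers
WHERE fname LIKE 'Bridg%'
AND city = 'Kansas City';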
9. After entering the query, click the Execute button ( ) to the left of the text area.
While the query is executing, the Query History tab displays the status of the
query. When the query is complete, the Results tab opens, displaying the results of
the query.
Question: Which customer did your query identify as the winner of the $5,000
prize?
10. Start a terminal window if you don’t currently have one running.
11. On the Linux command line in the terminal window, start the Impala shell:
$ impala-shell
Impala displays the URL of the Impala server in the shell command prompt, for
example:
[localhost.localdomain:21000] >
12. At the prompt, review the schema of the products table by entering the following:
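> DESCRIBE products;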
Remember that SQL commands in the shell must be terminated by a semicolon (;),
unlike in the Hue query editor.
14. Execute a query that displays the three most expensive products. (Hint: Use ORDER
BY.)
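One possible query (the column name price is an assumption about the products
table's schema):
> SELECT * FROM products ORDER BY price DESC LIMIT 3;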
> quit;
$ cd $ADIR/exercises/queries
$ cat verify_tablet_order.sql
$ impala-shell -f verify_tablet_order.sql
$ beeline -u jdbc:hive2://localhost:10000
Beeline displays the URL of the Hive server in the shell command prompt, such as:
0: jdbc:hive2://localhost:10000>
20. Execute a query to find all the Gigabux brand products whose price is less than
1000 (less than $10).
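One possible query (the column names brand and price are assumptions):
SELECT *
FROM products
WHERE brand = 'Gigabux'
AND price < 1000;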
> !exit
In this exercise, you will practice using several common techniques for creating
and populating tables.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
3. Click the link for the customers table in the main panel to display the table
overview and review the list of columns.
4. Click the Sample tab to view the first hundred rows of data.
5. Before creating the table, review the files containing the product ratings data. The
files are in /home/training/training_materials/analyst/data. You
can use the head command in a terminal window to see the first few lines:
$ head $ADIR/data/ratings_2012.txt
$ head $ADIR/data/ratings_2013.txt
6. Copy the data files to the /dualcore directory in HDFS. You may use either the
Hue File Browser, or the hdfs command in the terminal window:
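$ hdfs dfs -put $ADIR/data/ratings_2012.txt /dualcore/
$ hdfs dfs -put $ADIR/data/ratings_2013.txt /dualcore/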
7. Return to the Table Browser in Hue. Click on the Create a new table icon ( ) in the
upper right to start the table definition wizard.
8. The first wizard step is to pick whether to add the data at creation using a file, or to
create an empty table so you can add the data later.
a. With Type showing file, click the field next to Path. Navigate up the directory
hierarchy to find the /dualcore directory and choose ratings_2012.txt.
(You will load the 2013 data later.)
b. Check the information that has been added to the main panel, to verify that Hue
is interpreting the data correctly. You should see Format options, including
Field Separator (set to Tab), Record Separator (set to New line), and
Quote Character (set to Double Quote). The Has Header box should not be
checked. If any of these are set incorrectly, correct them. You should also see a
Preview that shows the data separated into fields (labeled field_1, field_2, and
so on).
a. Under DESTINATION, click in the Name field and change the supplied
table name (based on the file name) to ratings. (You may also call it
default.ratings, but as long as you are in the default database,
specifying the database is not necessary.)
b. Under PROPERTIES, choose the file format. File formats will be covered later in
the course. For now, Format should show Text. Correct it if needed.
c. Click Extras to see the options provided there. The settings there are correct
for this table, but note that this allows you to change your mind on some of the
settings from the previous step. It also allows you to add a description for your
table, and to set delimiters for Array, Map, and Struct fields. For this simple
table, only the field terminator is relevant; collection and map delimiters are
used for complex data in Hive and will be covered later in the course.
d. Scroll down to the Fields section. Notice that the field types are selected
automatically; however, you should check that these are correct. (For example, Hue typically
chooses bigint for all integer fields, but perhaps you know that int or even
tinyint is more appropriate.) Use the following descriptions of the fields to
add the field names and correct the types as needed:
Field Name Field Type
posted timestamp
e. When you have added all the columns, click Submit. This will start a task to
define the table in the metastore and create the warehouse directory in HDFS to
store the data.
f. When the task is complete, a Task History pop-up window will appear. Check
that no error message is given, then click the X to close the pop-up window.
10. The new table ratings will appear in the Table Browser. Scroll down to confirm
that the fields are correct and the data has been added to the table.
11. Optional: Use the Hue File Browser or the hdfs command to view the
/user/hive/warehouse directory to confirm creation of the ratings
subdirectory.
12. Try querying the data in the table. In Hue, click the Query button to switch to the
Impala Query Editor. The ratings table should appear in the table list on the left.
Try counting the number of ratings:
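SELECT COUNT(*) FROM ratings;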
13. You can also load data to an existing table. One way to do this is in Hue; another is
to use the LOAD DATA INPATH command. Try doing it using Hue:
a. Return to the Hue Table Browser if necessary, and click the link for the new
ratings table in the main panel.
b. Then click the Import Data icon ( ) in the upper right corner.
c. In the Import Data dialog, enter or browse to the HDFS location of the 2013
product ratings data file: /dualcore/ratings_2013.txt. Be sure that the
Overwrite existing data box is not checked, and then click Submit.
d. The Task History pop-up window should appear again. Check that the data
loaded without error. It should look similar to this:
14. The LOAD DATA INPATH command and the Hue table creation and data import
tools move the file to the table’s directory. Using the Hue File Browser or hdfs
dfs commands, verify that the files are no longer present in the original directory:
15. Optional: Verify that the 2013 data is shown alongside the 2012 data in the table’s
warehouse directory.
16. With the additional data, there are now 21,997 records. Try counting the records in
the ratings table again.
a. Use the command below, and note that the count is not 21,997.
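SELECT COUNT(*) FROM ratings;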
b. After loading the new data, you need to invalidate the metadata, so the
additional data can be accessed. (Invalidating is done automatically by Hue
when you create a table, but not when you add data.) Do either of the following:
• Execute the following command in the query editor:
INVALIDATE METADATA;
• Click the Refresh icon ( ) in the panel on the left, then select Invalidate all
metadata and rebuild index and click the Refresh button.
c. Execute the count command again, and verify that all 21,997 records are now
included in the table.
17. Write and execute a CREATE TABLE statement to create an external table for the
tab-delimited records in HDFS at /dualcore/employees. The format is shown
below:
Field Name Field Type
emp_id STRING
fname STRING
lname STRING
address STRING
city STRING
state STRING
zipcode STRING
job_title STRING
email STRING
active STRING
salary INT
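One possible statement, using the field names and types above (a sketch; the same
syntax works in both Impala and Hive):
CREATE EXTERNAL TABLE employees (
  emp_id STRING, fname STRING, lname STRING, address STRING, city STRING,
  state STRING, zipcode STRING, job_title STRING, email STRING,
  active STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/dualcore/employees';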
18. Run the following query to verify that you have created the table correctly.
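One possible form of the verification query:
SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3;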
It should show that Sales Associate, Cashier, and Assistant Manager are the three
most common job titles at Dualcore.
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--table suppliers \
--hive-import
2. It is always a good idea to verify that data has been added as intended. Execute the
following query to count the number of suppliers in Texas. You may use either the
Impala shell or the Hue Impala Query Editor. Remember to invalidate the metadata
cache so that Impala can find the new table.
INVALIDATE METADATA;
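One possible form of the count query (the state column name is an assumption about
the suppliers table's schema):
SELECT COUNT(*) FROM suppliers WHERE state = 'TX';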
In this exercise, you will modify the suppliers table you imported using Sqoop in the
previous exercise. You may complete these exercises using either the Impala shell or
the Impala query editor in Hue.
2. Use the DESCRIBE command on the suppliers table to verify the change.
In this exercise, you will create a table for ad click data that is partitioned by the
network on which the ad was displayed.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
1. Optional: View the first few lines in the data files for both networks.
3. Alter the ads table to add two partitions, one for network 1 and one for network 2.
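For example, assuming the partition column network takes the integer values 1 and 2:
ALTER TABLE ads ADD PARTITION (network=1);
ALTER TABLE ads ADD PARTITION (network=2);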
6. Verify that the data for both ad networks were correctly loaded by counting the
records for each:
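For example (assuming integer partition values):
SELECT COUNT(*) FROM ads WHERE network = 1;
SELECT COUNT(*) FROM ads WHERE network = 2;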
Note: If you are using Hue and put multiple commands in the Query Editor at the
same time, Hue runs them one at a time. After it runs a command, Hue will report
the results, then wait for another click of the Execute button before continuing.
Network 1 should have 438,389 records and Network 2 should have 350,563
records.
1. The ETL data output directory is provided as $ADIR/data/latlon. Copy the data
directory to the /dualcore directory in HDFS. You can use the Hue File Browser,
or the following hdfs command:
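$ hdfs dfs -put $ADIR/data/latlon /dualcore/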
• Hint: The actual data in the data directory is in a MapReduce output file called
part-m-00000.parquet. Use the LIKE PARQUET command to use the
existing column definitions in the data file.
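A sketch of one possible CREATE TABLE statement (whether you make the table
EXTERNAL and where you point LOCATION depend on how you copied the data):
CREATE EXTERNAL TABLE latlon
LIKE PARQUET '/dualcore/latlon/part-m-00000.parquet'
STORED AS PARQUET
LOCATION '/dualcore/latlon';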
3. Review the table in Impala to confirm it was correctly created with columns zip,
latitude and longitude.
4. Perform a query or preview the data in the Impala Query Editor to confirm that the
data in the data file is being accessed correctly.
In this exercise, you will write queries to analyze data in tables that have been
populated with data you imported to HDFS using Sqoop in a previous exercise.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
Several analysis questions are described below, and you will need to run queries
to answer them. You can use whichever tool you prefer—Hive or Impala—using
whichever method you like best, including shell, script, or the Hue Query Editor.
◦ Hint: The order_date column in the orders table is of type TIMESTAMP. Use
the function to_date to get just the date portion of the value.
There are several ways you could write these queries. One possible solution for each is
in the solution directory.
• Run a query to show how each day’s profit ranks compared to other days within the
same year and month.
◦ Hint: Use the previous exercise’s solution as a sub-query; find the ROW_NUMBER of
the results within each year and month.
There are several ways you could write this query. One possible solution is in the
bonus_01/solution directory.
In this exercise, you will use Hive and Impala to work with complex data from a
customer loyalty program.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
3. Load the data file by placing it into the HDFS warehouse directory for the new table.
You can use either the Hue File Browser, or the hdfs command:
4. Using either the Beeline shell or Hue’s Hive Query Editor, run a query to select the
HOME phone number for customer ID 1200866. (Hint: MAP keys are case-sensitive.)
You should see 408-555-4914 as the result.
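One possible query (the customer ID column name cust_id is an assumption):
SELECT phone['HOME']
FROM loyalty_program
WHERE cust_id = 1200866;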
5. Select the third element from the order_ids ARRAY for customer ID 1200866.
(Hint: elements are indexed from zero.) The query should return 5278505.
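A similar query selects the third array element (again assuming a cust_id column):
SELECT order_ids[2]
FROM loyalty_program
WHERE cust_id = 1200866;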
9. Run a query to return the distinct key values in the phone column. (Hint: You can
query a MAP column in Impala as if it were a separate table with columns key and
value.)
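For example, assuming the Parquet-based copy of the table is named
loyalty_program_parquet, as in the later bonus steps:
SELECT DISTINCT key FROM loyalty_program_parquet.phone;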
10. These distinct key values represent types of phone numbers. Now run a query to
count how many of each type of phone number there are in the phone column.
11. Run a query to count the total number of order IDs listed in the order_ids
column. (Hint: You can query an ARRAY column in Impala as if it were a separate
table with columns item and pos.)
12. Select the first element in the order_ids ARRAY for each customer. (Hint: The pos
column represents the position of each element in an ARRAY, indexed from zero.)
13. Run a query to count how many customers have 30 or more orders.
1. Run a Hive query to return the order IDs for customer ID 1200866, one per row.
(Hint: Use the explode function.)
2. Run a Hive query to return the customer IDs and order IDs for all customers who
have an average order total of greater than $900 (90,000 cents). The query should
return one order ID per row. (Hint: Use LATERAL VIEW.)
1. Run an Impala query to return the same result set as the previous Hive query: the
customer IDs and order IDs for all customers who have an average order total of
greater than $900. (Hint: Use join notation, and remember to query the Parquet-
based table.)
2. Now run an Impala query that returns the customer IDs, first names, last names,
phone numbers, and phone number types for these customers who have an average
order of greater than $900.
In this exercise, you will use a Regex SerDe to load web server log data into
a table. Afterwards, you will use Hive’s text processing features to analyze
customers’ comments and product ratings, uncover problems, and propose
potential solutions.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
$ beeline -u jdbc:hive2://localhost:10000 \
-f $ADIR/exercises/text_analysis/create_web_logs.hql
2. Populate the table by adding the log file to the table’s directory in HDFS:
3. Verify that the data is loaded correctly by running this query to show the top three
items users searched for on our website:
This query may take several minutes to run. You should see that it returns tablet
(303), ram (153), and wifi (148).
Note: The REGEXP operator, which is available in some SQL dialects, is similar
to LIKE, but uses regular expressions for more powerful pattern matching. The
REGEXP operator is synonymous with the RLIKE operator.
1. Review the ratings table structure using the Hive Query Editor or using the
DESCRIBE command in the Beeline shell.
2. We want to find the product that customers like most, but we must guard against
being misled by products that have few ratings assigned. Run the following query to
find the product with the highest average among all those with at least 50 ratings:
3. Rewrite, and then execute, the query above to find the product with the lowest
average among products with at least 50 ratings. You should see that the result is
product ID 1274673 with an average rating of 1.10.
1. The following query normalizes all comments on that product to lowercase, breaks
them into individual words using the sentences function, and passes those to the
ngrams function to find the five most common bigrams (two-word combinations).
Run the query:
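One possible form of this query:
SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 2, 5)) AS bigrams
FROM ratings
WHERE prod_id = 1274673;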
2. Most of these words are too common to provide much insight, though the word
“expensive” does stand out in the list. Modify the previous query to find the five
most common trigrams (three-word combinations), and then run that query in
Hive.
3. Among the patterns you see in the result is the phrase “ten times more.” This might
be related to the complaints that the product is too expensive. Now that you’ve
identified a specific phrase, look at a few comments that contain it by running this
query:
SELECT message
FROM ratings
WHERE prod_id = 1274673
AND lower(message) LIKE '%ten times more%'
LIMIT 3;
You should see comments that say, “Why does the red one cost ten times more than
the others?” and “Red is ten times more expensive than the others!”
4. We can infer that customers are complaining about the price of the red-color
version of this item. Write and execute a query that will find all distinct comments
containing the word “red” that are associated with product ID 1274673.
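One possible query:
SELECT DISTINCT message
FROM ratings
WHERE prod_id = 1274673
AND LOWER(message) LIKE '%red%';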
• “Why does the red one cost ten times more than the others?”
The first and third comment imply that this product is overpriced relative to similar
products. Write and run a query that will display the record for product ID 1274673
in the products table.
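One possible query (the column name prod_id is an assumption about the products
table's schema):
SELECT * FROM products WHERE prod_id = 1274673;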
6. Your query should have shown that the product was a “16GB USB Flash Drive
(Red)” from the “Orion” brand. Next, run this query to identify similar products:
SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand='Orion';
The query results show that we have three products that are almost identical,
differing only in color. However, the product with the negative reviews (the red
one) is priced about ten times higher than the others, just as some of the comments
said.
The costs for these differently colored products are the same, but the price of the
red one is 42999 cents ($429.99) whereas the prices of the other two are both 4299
cents ($42.99). It appears that using text processing on the product reviews may
have helped us to uncover a pricing error.
In this exercise, you will practice techniques to improve Hive query performance.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
IMPORTANT: Use a single session of Beeline or Hue’s Hive Query Editor to complete all
the steps of this exercise. Except for this one session, close all other sessions of Beeline
and close all other web browser windows containing Hue’s Hive Query Editor.
1. The following Hive query returns the top 10 brands as measured by the number
of products for sale. Paste this HiveQL into Beeline or Hue’s Hive Query Editor and
execute the query:
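A query along these lines returns the top 10 brands by product count:
SELECT brand, COUNT(*) AS num_products
FROM products
GROUP BY brand
ORDER BY num_products DESC
LIMIT 10;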
3. View Hive’s execution plan by prefixing your query with EXPLAIN or by using the
Explain button ( ) in Hue. Observe that the Hive query executes as a MapReduce
job that includes both a map phase and reduce phase.
4. Because this query processes a very small amount of data (the products table
contains 1,114 rows and its data is about 60 KB in size), a large proportion of the
query’s running time is consumed by the overhead of the MapReduce job. Because
the data is small, we can reduce this overhead by using Hadoop standalone mode
to run the query in a single Java Virtual Machine (JVM) on the Hive Server. Enable
standalone mode by issuing the Hive command:
SET mapreduce.framework.name=local;
5. Now execute the same Hive query that was executed in the first step of this exercise
again, and observe the amount of time the query takes to complete.
6. In preparation for running other queries on larger datasets, switch back to using
distributed mode by issuing the Hive command:
SET mapreduce.framework.name=yarn;
7. The following Hive query returns the top 10 brands based on total sales for
customers in New York. This requires joining four different tables, some of which
contain millions of rows, as well as performing aggregation and ordering the result:
View Hive’s execution plan by prefixing the query with EXPLAIN or by using the
Explain button ( ) in Hue. Observe that the Hive query executes as a MapReduce
job with numerous map and reduce phases.
8. You may execute this query if you wish. You should expect it to take about five
minutes to complete.
9. Because this query runs slowly using Hive on MapReduce, we could instead run it
using Hive on Spark and compare the query performance using these two engines.
To use Spark as Hive’s execution engine, issue the Hive command:
SET hive.execution.engine=spark;
10. Recall that Spark must initialize when you submit the first Hive on Spark query.
Before issuing the complex query again to compare performance, first issue a
simpler query to cause Spark to initialize. You should expect a long delay before the
query completes:
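For example, counting the products table returns 1,114:
SELECT COUNT(*) FROM products;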
11. Once the query completes and returns a result (1,114), issue the same simple query
a second time and observe that it returns very quickly.
12. Now issue the complex query that returns the top 10 brands based on total sales
for customers in New York, and observe the amount of time the query takes to
complete.
13. View the Hive on Spark execution plan by prefixing the query with EXPLAIN or by
using the Explain button in Hue. Observe that the number of stages is much smaller
and the execution plan is much simpler than with Hive on MapReduce.
14. End the Spark session and return to using Hive on MapReduce by issuing the Hive
command:
SET hive.execution.engine=mr;
This command will shut down Spark and free up memory on the hands-on
environment, preventing out-of-memory errors or performance problems when
completing later exercises.
In this exercise, you will explore the query execution plan for various types of
queries in Impala.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
2. Note that the query explanation includes a warning that table and column stats are
not available for the products table. Compute the stats by executing the following
command:
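COMPUTE STATS products;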
3. Now view the query plan again, this time without the warning.
4. The previous query was very simple, and included only a single table in the FROM
clause. Try reviewing the query plan of a more complex query. The following query
returns the top three products sold. Before using EXPLAIN, compute stats on the
tables to be queried.
Questions: How many stages are there in this query? What are the estimated per-
host memory requirements for this query? What is the total size of all partitions to
be scanned?
5. The tables in the queries above each have only a single partition. Try reviewing the
query plan for a partitioned table. Recall that in an earlier exercise, you created an
ads table partitioned on the network column. First, compute stats for this table:
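COMPUTE STATS ads;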
6. Now compare the query plans for the following two queries. The first calculates the
total cost of clicked ads for each ad campaign; the second does the same, but for all
ads on one of the ad networks.
Questions: What are the estimated per-host memory requirements for the two
queries? What explains the difference?
1. Try executing one of the queries you examined above; for example:
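For example, a query like the one described in step 6, which totals the cost of clicked
ads for each campaign:
SELECT campaign_id, SUM(cpc) AS total_cost
FROM ads
WHERE was_clicked = 1
GROUP BY campaign_id;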
2. After the query completes, view a summary of how it executed by entering the
following command in the Impala shell:
SUMMARY;
3. Questions: Which stage took the longest average time to complete? Which took the
most memory?
In this exercise, you will explore the data from Dualcore’s web server that
you loaded in an earlier exercise. Queries on that data will reveal that many
customers abandon their shopping carts before completing the checkout process.
You will create several additional tables, using data from a TRANSFORM script and
a supplied UDF, which you will use later to analyze how Dualcore could turn this
problem into an opportunity.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
1. Run the following query in Hive to show the number of requests for each step of the
checkout process:
The results of this query highlight a problem: about one out of every three
customers abandons their cart after the second step. This might mean millions of
dollars in lost revenue, so let’s see if we can determine the cause.
2. The log file’s cookie field stores a value that uniquely identifies each user session.
Since not all sessions involve checkouts at all, create a new table containing the
session ID and number of checkout steps completed for just those sessions that do:
3. Run this query to show the number of people who completed only one checkout
step (view cart), only two checkout steps (see shipping cost), or all four checkout
steps:
You should see from the differences in these numbers that many customers
abandon their order after the second step, which is when they first learn how much
it will cost to ship their order.
4. Optional: Because the new checkout_sessions table does not use a SerDe, it can
be queried in Impala. Try running the same query as in the previous step in Impala.
What happens?
c. The new table contains the ZIP code instead of an IP address, plus the other two
fields from the original table.
b. The script splits them into individual fields using a tab delimiter.
d. The three fields in each output record are delimited with tabs and printed to
standard output.
7. Copy the Python file to HDFS so that the Hive Server can access it. You may use the
Hue File Browser or the hdfs dfs command:
8. Run the script to create the cart_zipcodes table. You can either paste the code
into the Hive Query Editor, or use Beeline in a terminal window:
$ beeline -u jdbc:hive2://localhost:10000 \
-f $ADIR/exercises/transform/create_cart_zipcodes.hql
9. Write a HiveQL statement to create and load a table called cart_items with two
fields, cookie and prod_id, based on data selected from the web_logs table. Keep the
following in mind when writing your statement:
a. The prod_id field should contain only the seven-digit product ID. (Hint: Use
the regexp_extract function.)
b. Use a WHERE clause with REGEXP using the same regular expression as above,
so that you only include records where customers are adding items to the cart.
10. Verify the contents of the new table by running this query:
11. Run the following HiveQL to create a table called cart_orders with this
information:
12. Before you can use a UDF, you must make it available to Hive. First, copy the file to
HDFS so that the Hive Server can access it. You may use the Hue File Browser or the
hdfs dfs command:
13. Next, register the function with Hive and provide the name of the UDF class as
well as the alias you want to use for the function. Run the Hive command below to
associate the UDF with the alias calc_shipping_cost:
14. Now create a new table called cart_shipping that will contain the session
ID, number of steps completed, total retail price, total wholesale cost, and the
estimated shipping cost for each order based on data from the cart_orders
table:
15. Finally, verify your table by running the following query to check a record:
This should show that session as having two completed steps, a total retail price of
$263.77, a total wholesale cost of $236.98, and a shipping cost of $9.09.
Note: The total_price, total_cost, and shipping_cost columns in the
cart_shipping table contain the number of cents as integers. Be sure to divide
results containing monetary amounts by 100 to get dollars and cents.
In this exercise, you will analyze the abandoned cart data you extracted in an
earlier exercise.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
For this exercise, you can use whichever tool you prefer—Hive or Impala—using
whichever method you like best, including shell, script, or the Hue Query Editor.
If you plan to use Impala for these exercises, you will first need to invalidate Impala’s
metadata cache in order to access those tables.
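INVALIDATE METADATA;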
You should see that abandoned carts mean that Dualcore is potentially losing out on
more than $2 million in revenue. Clearly it’s worth the effort to do further analysis.
Note: The total_price, total_cost, and shipping_cost columns in the
cart_shipping table contain the number of cents as integers. Be sure to divide
results containing monetary amounts by 100 to get dollars and cents.
2. The number returned by the previous query is revenue, but what counts is profit.
Gross profit for an item is the price minus the cost. Write and execute a query
similar to the one above, but which reports the total lost profit from abandoned
carts. If you need a hint on how to write this query, you can check the solution/
abandoned_checkout_profit.sql file.
• After running your query, you should see that Dualcore is potentially losing
$111,058.90 in profit due to customers not completing the checkout process.
3. How does this compare to the amount of profit Dualcore receives from customers
who do complete the checkout process? Modify your previous query to consider
only those records where steps_completed = 4, and then re-execute it. Check
solution/completed_checkout_profit.sql if you need a hint.
• The result should show that we earn a total of $177,932.93 on completed orders,
so abandoned carts represent a substantial proportion of additional profits.
4. The previous two queries returned the total profit for abandoned and completed
orders, but these aren’t directly comparable because there were different
numbers of each. It might be the case that one is much more profitable than
the other on a per-order basis. Write and execute a query that will calculate
the average profit based on the number of steps completed during the
checkout process. If you need help writing this query, check the solution/
checkout_profit_by_step.sql file.
• You should observe that carts abandoned after step 2 represent an even higher
average profit per order than completed orders.
5. Run the following query to compare the average shipping cost for orders
abandoned after the second step versus completed orders:
• You will see that the shipping cost for abandoned orders was almost 10% higher
than for completed purchases. Offering free shipping, at least for some orders,
might actually bring in more money than passing on the cost to customers and
risking abandoned orders.
6. Run the following query to determine the average profit per order over the entire
month for the data you are analyzing in the log file. This will help you to determine
whether we could absorb the cost of offering free shipping:
• You should see that the average profit for all orders during May was $7.80.
An earlier query you ran showed that the average shipping cost was $8.83
for completed orders and $9.66 for abandoned orders, so clearly we would
lose money by offering free shipping on all orders. However, it might still be
worthwhile to offer free shipping on orders over a certain amount.
7. Run the following query, which is a slightly revised version of the previous one, to
determine whether offering free shipping only on orders of $10 or more would be a
good idea:
• You should see that our average profit on orders of $10 or more was $9.09, so
absorbing the cost of shipping would leave very little profit.
8. Repeat the previous query, modifying it slightly each time to find the average profit
on orders of at least $50, $100, and $500.
• You should see that there is a huge spike in the amount of profit for orders of
$500 or more ($111.05 on average for these orders).
9. How much does shipping cost on average for orders totaling $500 or more? Write
and run a query to find out (solution/avg_shipping_cost_50000.sql
contains the solution, in case you need a hint).
• You should see that the average shipping cost is $12.28, which happens to be
about 11% of the profit on those orders.
10. Since there’s no way to know in advance who will abandon their cart, Dualcore
would have to absorb the $12.28 average cost on all orders of at least $500. Would
the extra money from abandoned carts offset the added cost of free shipping for
customers who would have completed their purchases anyway? Run the following
query to see the total profit on completed purchases:
• After running this query, you should see that the total profit for completed orders
is $107,582.97.
11. Now, run the following query to find the potential profit, after subtracting shipping
costs, if all customers completed the checkout process:
Since the result of $120,355.26 is greater than the current profit of $107,582.97
from completed orders, it appears that Dualcore could earn nearly $13,000 more by
offering free shipping for all orders of at least $500.
Congratulations! Your hard work analyzing a variety of data with Hadoop’s tools
has helped make Dualcore more profitable than ever.
In this exercise, you will use STREAM in Pig to analyze metadata from Dualcore’s
customer service call recordings to identify the cause of a sudden increase
in complaints. You will then use this data in conjunction with a user-defined
function to propose a solution for resolving the problem.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete
any previous exercise or think you may have made a mistake, run the following
command to prepare for this exercise before continuing:
$ $ADIR/scripts/catchup.sh
When prompted, enter the number for the previous exercise, “Analyzing Abandoned
Carts.”
Background Information
Dualcore outsources its call center operations, and our costs have recently risen due
to an increase in the volume of calls handled by these agents. Unfortunately, we do not
have access to the call center’s database, but they provide us with recordings of these
calls stored in MP3 format. By using Pig’s STREAM to invoke a provided Python script,
you can extract the category and timestamp from the files, and then analyze that data to
learn what is causing the recent increase in calls.
$ cd $ADIR/exercises/extending_pig
The provided Python script extracts metadata from each call recording, including the file
path, the call category, the agent ID, the customer ID, and the timestamp for when
the agent answered the call.
Your first step is to create a text file containing the paths of the files to analyze,
with one line for each file. You can easily create the data in the required format by
capturing the output of the UNIX find command:
c. Display the top three categories (based on number of calls) to the screen.
4. Once you have made these changes, run your script to check the top three
categories in the month before Dualcore started the online advertising campaign:
5. Now run the script again, this time specifying the parameter for May:
The output should confirm not only that call volume is substantially higher in May,
but also that the SHIPPING_DELAY category has more than twice as many calls as the
other two combined.
To solve this problem, Dualcore will open a new distribution center to improve
shipping times.
The ZIP codes for the three proposed sites are 02118, 63139, and 78237. You will look
up the latitude and longitude of these ZIP codes, as well as the ZIP codes of customers
who have recently ordered, using a supplied dataset. Once you have the coordinates,
you will use the HaversineDistInMiles UDF distributed with DataFu to determine
how far each customer is from the three distribution centers. You will then calculate
the average distance for all customers to each of these distribution centers in order to
propose the one that will benefit the most customers.
6. The latlon.tsv file on your local drive is a tab-delimited file that maps ZIP codes
to latitude/longitude points. Add it to HDFS:
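The destination directory is assumed to be /dualcore, matching the other datasets:
$ hdfs dfs -put $ADIR/data/latlon.tsv /dualcore/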
shipped by airplane). The Pig Latin code joins these customers’ ZIP codes with the
latitude/longitude dataset uploaded in the previous step, then writes those three
columns (ZIP code, latitude, and longitude) as the result. Examine the script to see
how it works, and then run it to create the customer location data in HDFS:
$ pig create_cust_location_data.pig
$ egrep '^02118|^63139|^78237' \
$ADIR/data/latlon.tsv > warehouses.tsv
b. Use the UDF to calculate the distance from the customer to the warehouse
11. After you have finished implementing the Pig Latin code described above, run the
script:
$ pig calc_average_distances.pig
Question: Which of these three proposed ZIP codes has the smallest average
number of miles to our customers?
Hive/Impala Tables
The following table lists record counts for the tables that are created or queried during
the hands-on exercises. Use the DESCRIBE tablename command to see the table structure.
Table Name Record Count
ads 788,952
cart_items 33,812
cart_orders 12,955
cart_shipping 12,955
cart_zipcodes 12,955
checkout_sessions 12,955
customers 201,375
employees 61,712
latlon 42,968
loyalty_program 311
loyalty_program_parquet 311
order_details 3,333,244
orders 1,662,951
products 1,114
ratings 21,997
suppliers (renamed vendors) 66
web_logs 412,860
Character Classes
[057] Matches any single digit that is either 0, 5, or 7
[0-9] Matches any single digit between 0 and 9
[3-6] Matches any single digit between 3 and 6
[a-z] Matches any single lowercase letter
[C-F] Matches any single uppercase letter between C and F
For example, the pattern [C-F][3-6] would match the string D3 or F5 but would fail
to match G3 or C7.
There are also some built-in character classes that are shortcuts for common sets of
characters. For example, \d matches any digit, \w matches any word character (a letter,
digit, or underscore), and \s matches any whitespace character.
Matching Quantifiers
{5} Preceding character may occur exactly five times
{0,6} Preceding character may occur between zero and six times
? Preceding character is optional (may occur zero or one times)
+ Preceding character may occur one or more times
* Preceding character may occur zero or more times
By default, quantifiers try to match as many characters as possible. If you used the
pattern ore.+a on the string Dualcore has a store in Florida, you might
be surprised to learn that it matches ore has a store in Florida rather than
ore ha or ore in Florida as you might have expected. This is because matches
are “greedy” by default. Adding a question mark makes the quantifier match as few
characters as possible instead, so the pattern ore.+?a on this string would match
ore ha.
Finally, there are two special metacharacters that match zero characters: the caret (^)
matches only at the beginning of a string, and the dollar sign ($) matches only at the
end. They are used to ensure that a string matches a pattern only when the pattern
occurs at the beginning or end of a string.