Getting and Cleaning Data Course Notes: Xing Su
Xing Su
Contents
Overview
Raw and processed data
Tidy Data
Download files
Reading Excel files
Reading XML
Reading JSON
data.table
Reading from MySQL
HDF5
Web Scraping (tutorial)
Working with API
Reading from Other Sources
dplyr
tidyr
lubridate
Subsetting and Sorting
Summarizing Data
Creating New Variables
Reshaping Data
Merging Data
Editing Text Variables
Regular Expressions
Working with Dates
Data Sources
Overview
Tidy Data
1. Raw Data
no software processing has been done
data that has not been manipulated, removed, or summarized in any way
2. Tidy data set
end goal of cleaning data process
each variable should be in one column
each observation of that variable should be in a different row
one table for each kind of variable
if there are multiple tables, there should be a column to link them
include a row at the top of each file with variable names (variable names should make sense)
in general, data should be saved in one file per table
3. Code book describing each variable and its values in the tidy data set
information about the variables (with units) in the dataset that is NOT contained in the tidy data
information about the summary choices that were made (e.g. median vs. mean)
information about experimental study design (data collection methods)
common format for this document = markdown/Word/text
study design section = thorough description of how data was collected
code book section = describes each variable and units
4. Explicit steps and exact recipe to get through 1 - 3 (instruction list)
ideally a computer script (no parameters)
output = processed tidy data
in addition to the script, the instruction list may need to describe which files to run, how the script is run, and any other explicit steps
Download files
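A minimal sketch of downloading a raw data file with download.file() (the URL and local file name are placeholders):

    # keep raw files in one place
    if (!file.exists("data")) dir.create("data")

    fileUrl <- "https://example.com/dataset.csv"   # placeholder URL
    download.file(fileUrl, destfile = "./data/dataset.csv",
                  method = "curl")                 # "curl" may be needed for https on some systems
    dateDownloaded <- date()                       # record when the file was pulled

    raw <- read.csv("./data/dataset.csv")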
Reading XML
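A minimal sketch with the XML package (the file name and node names are placeholders):

    library(XML)

    doc      <- xmlTreeParse("menu.xml", useInternal = TRUE)   # placeholder file
    rootNode <- xmlRoot(doc)

    xmlName(rootNode)                            # name of the top-level node
    xmlSApply(rootNode, xmlValue)                # text content of every child node
    xpathSApply(rootNode, "//name", xmlValue)    # XPath: value of all <name> nodes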
Reading JSON
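A minimal sketch with the jsonlite package (the URL is a placeholder):

    library(jsonlite)

    jsonData <- fromJSON("https://api.example.com/items.json")  # URL or local file
    names(jsonData)                     # nested JSON arrives as data frames / lists

    # going the other way: data frame -> JSON text -> data frame
    txt  <- toJSON(head(iris), pretty = TRUE)
    back <- fromJSON(txt)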
data.table
inherits from data.frame (external package); all functions that accept a data.frame also work on a
data.table
can be much faster (written in C); much, much faster at subsetting, grouping, and updating
syntax: dt <- data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
tables() = returns all data tables in memory
shows name, nrow, MB, cols, key
some subsetting works like before = dt[2, ], dt[dt$y == "a", ]
dt[c(2, 3)] = subset by rows, rows 2 and 3 in this case
column subsetting (modified for data.table)
argument after comma is called an expression (collection of statements enclosed in {})
dt[, list(mean(x), sum(z))] = returns mean of x column and sum of z column (no "" needed
to specify column names, x and z in example)
dt[, table(y)] = gets a table of the y values (any function can be applied in the expression)
add new columns
dt[, w := z^2]
unlike a data.frame, adding a column this way does not create a new copy of the data — the
column is added by reference (important for large datasets, where copying is expensive)
dt2 <- dt; dt[, y := 2]
because plain assignment does not copy a data.table, changes made to dt by reference also
show up in dt2
Note: if copy must be made, use the copy() function instead
multiple operations
dt[, m:= {temp <- (x+z); log2(temp +5)}] adds a column that equals log2(x+z + 5)
plyr like operations
dt[, a := x > 0] = creates a new column a that returns TRUE if x > 0, and FALSE otherwise
dt[, b := mean(x + w), by = a] = creates a new column b containing the mean of x + w computed
within each group of a (TRUE/FALSE), so every row where a is TRUE gets the same b value, and
likewise for every row where a is FALSE
special variables
.N = an integer of length 1 containing the number of observations in the current group (essentially a count)
dt <- data.table(x = sample(letters[1:3], 1E5, TRUE)) = generates data table
dt[, .N, by = x] = creates a table counting observations by the value of x
keys (quickly filter/subset)
example: dt <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300)) =
generates data table
setkey(dt, x) = set the key to the x column
dt['a'] = returns the subset of rows where x == "a" (effectively a filter)
joins (merging tables)
example: dt1 <- data.table(x = c('a', 'b', ...), y = 1:4) = generates data table
dt2 <- data.table(x= c('a', 'd', ...), z = 5:7) = generates data table
setkey(dt1, x); setkey(dt2, x) = sets the keys for both data tables to be column x
merge(dt1, dt2) = returns a merged table, combining the two tables on the key column x and
keeping only the rows whose x values appear in both tables (i.e. "a"), with the data from both
merged together
fast reading of files
example: big_df <- data.frame(x = rnorm(1e6), y = rnorm(1e6)) = generates a large data frame
file <- tempfile() = generates empty temp file
write.table(big_df, file = file, row.names = FALSE, col.names = TRUE, sep =
"\t", quote = FALSE) = writes the generated data from big_df to the empty temp
file
fread(file) = read file and load data = much faster than read.table()
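A short runnable sketch tying the data.table operations above together (column names and values are made up):

    library(data.table)

    dt <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300))
    dt[, z := y^2]                    # add a column by reference (no copy)

    dt[, mean(y), by = x]             # grouped summary: mean of y per level of x
    dt[, .N, by = x]                  # count of rows per level of x

    setkey(dt, x)                     # keyed subsetting and joins
    dt["a"]                           # rows where x == "a"
    dt2 <- data.table(x = c("a", "b"), label = c("first", "second"))
    setkey(dt2, x)
    merge(dt, dt2)                    # join the two tables on the key column x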
Reading from MySQL
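A minimal sketch with the RMySQL package (host, credentials, database, and table names are placeholders):

    library(RMySQL)

    con <- dbConnect(MySQL(), user = "user", password = "pass",
                     host = "db.example.com", dbname = "mydb")

    dbListTables(con)                                  # tables in the database
    dbListFields(con, "mytable")                       # columns of one table
    result <- dbGetQuery(con, "SELECT * FROM mytable LIMIT 10")

    dbDisconnect(con)                                  # always close the connection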
HDF5
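A minimal sketch with the Bioconductor rhdf5 package (file, group, and dataset names are placeholders):

    library(rhdf5)

    h5createFile("example.h5")                 # new HDF5 file
    h5createGroup("example.h5", "foo")         # hierarchical group inside it
    A <- matrix(1:10, nrow = 5)
    h5write(A, "example.h5", "foo/A")          # write a dataset into the group

    h5ls("example.h5")                         # list groups and datasets
    h5read("example.h5", "foo/A")              # read the dataset back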
Web Scraping (tutorial)
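A minimal sketch of pulling raw HTML, either with base R connections or with httr (the URL is a placeholder):

    con  <- url("https://example.com/page.html")
    html <- readLines(con)                 # page source, one element per line
    close(con)

    library(httr)
    resp <- GET("https://example.com/page.html")
    content(resp, as = "text")             # same page source as a single string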
Working with API
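A minimal sketch of calling a JSON API with httr and jsonlite (the URL is a placeholder; authenticated APIs additionally require tokens/keys):

    library(httr)
    library(jsonlite)

    resp  <- GET("https://api.example.com/v1/items?limit=5")   # placeholder endpoint
    status_code(resp)                      # 200 means success
    json  <- content(resp, as = "text")    # response body as a JSON string
    items <- fromJSON(json)                # parse into R data structures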
Reading from Other Sources
dplyr
rename()
capable of renaming multiple columns at the same time, no quotes needed
mutate()
create a new variable based on the value of one or more existing variables in the dataset
capable of modifying existing columns/variables as well
example: mutate(dataFrameTable, newColumn = size / 2^20) = creates a new column
with the specified name and calculation
multiple columns can be created at the same time by using , as separator, new variables can
even reference each other in terms of calculation
summarize()
collapses the dataset into a single row
example: summarize(dataFrameTable, avg = mean(size)) = returns the mean from the
column in a single variable with the specified name
summarize() can return the requested value for each group in the dataset (when applied to grouped data)
group_by()
example: by_package <- group_by(cran, package) = creates a grouped data frame table
by specified variable
summarize(by_package, mean(size)) = returns the mean size of each group (instead of 1
value from the summarize() example above)
Note: n() = counts the number of observations in the current group
Note: n_distinct() = efficiently count the number of unique values in a vector
Note: quantile(variable, probs = 0.99) = returns the 99th percentile of the data
Note: by default, dplyr prints the first 10 rows of data if there are more than 100 rows; if
there are not, it will print everything
rbind_list()
binds multiple data frames by row, matching columns up by name
example: rbind_list(passed, failed)
Chaining/Piping
allows stringing together multiple function calls in a way that is compact and readable, while still
accomplishing the desired result
Note: all variable calls refer to the tbl_df specified at the same level of the call
%>% = chaining operator
Note: ?chain brings up relevant documentation for the chaining operator
Code on the right of the operator operates on the result from the code on the left
exp1 %>% exp2 %>% exp3 ...
exp1 is calculated first
exp2 is then applied on exp1 to achieve a result
exp3 is then applied to the result of that operation, etc.
Note: the chaining aspect is done with the data frame table that is being passed from one call
to the next
Note: if the last call takes no additional arguments (print(), for example), it is possible to
leave the () off
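A short sketch chaining the verbs above together; cran, package, and size come from the examples above, while filter() and arrange() are standard dplyr verbs assumed here:

    library(dplyr)

    cran %>%
      group_by(package) %>%                     # one group per package
      summarize(count  = n(),                   # downloads per package
                avg_mb = mean(size) / 2^20) %>% # average download size in MB
      filter(count > 100) %>%                   # keep frequently downloaded packages
      arrange(desc(avg_mb)) %>%                 # largest average size first
      print                                     # () can be left off, as noted above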
tidyr
gather()
gather columns into key value pairs
example: gather(students, sex, count, -grade) = gather each key (in this case named sex),
value (in this case count) pair into one row
effectively translates to (columnName, value) with new names imposed on both = all combi-
nations of column name and value
-grade = signifies that the column does not need to be remapped, so that column is preserved
class1:class5 = can be used instead to specify where to gather the key values
separate()
separate one column into multiple columns
example: separate(data = res, col = sex_class, into = c("sex", "class")) = split the
specified column in the data frame into two columns
Note: the new columns are created in place, and the other columns are pushed to the right
Note: separate() is able to automatically split non-alphanumeric values by finding the logical
separator; it is also possible to specify the separator by using the sep argument
spread()
spread key-value pairs across multiple columns = turn values of a column into column head-
ers/variables/new columns
example: spread(students3, test, grade) = splits test column into variables by using it as
a key, and grade as values
Note: no need to specify what the columns are going to be called, since they are going to be
generated using the values in the specified column
Note: the values will be matched and split up according to their alignment with the key (test)
= midterm, A
extract_numeric()
extract numeric component of variable
example: extract_numeric("class5") = returns 5
example: mutate(class = extract_numeric(class)) = changes the class name to numbers only
unique() = general R function, not specific to tidyr
returns a vector with the duplicates removed
Note: when there is redundant information, it's better to split the info up into multiple tables; however,
each table should also contain primary keys, which identify observations and link data from one table to
the next
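A small sketch of gather(), separate(), and spread() on made-up data shaped like the students examples above:

    library(tidyr)

    students <- data.frame(grade  = c("A", "B", "C"),
                           male   = c(5, 4, 8),
                           female = c(3, 1, 6))

    long <- gather(students, sex, count, -grade)   # key = sex, value = count
    spread(long, grade, count)                     # grades back out as columns

    res <- data.frame(sex_class = c("male_1", "female_2"), count = c(7, 4))
    separate(res, col = sex_class, into = c("sex", "class"))  # splits on "_"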
lubridate
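A minimal sketch of lubridate's parsing helpers (dates chosen arbitrarily):

    library(lubridate)

    ymd("20240131")                              # parses regardless of separator
    mdy("01/31/2024")
    dmy("31-01-2024")
    ymd_hms("2024-01-31 10:15:00", tz = "UTC")   # date-times with a time zone

    d <- ymd("20240131")
    wday(d, label = TRUE)                        # day of the week as a label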
Subsetting and Sorting
subsetting
x <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = (11:15)) =
initiates a data frame with three named columns
x <- x[sample(1:5), ] = this scrambles the rows
x$var2[c(2,3)] = NA = setting the 2nd and 3rd element of the second column to NA
x[1:2, "var2"] = subsetting the first two rows of the second column
x[(x$var1 <= 3 | x$var3 > 15), ] = return all rows of x where the first column is less than
or equal to three or where the third column is bigger than 15
x[which(x$var2 >8), ] = returns the rows where the second column value is larger than 8
Note: which(condition) = useful in dealing with NA values as it returns the indices of the
values where the condition holds true (returns FALSE for NA)
sorting/ordering
sort(x$var1) = sort the vector in increasing/alphabetical order
decreasing = TRUE = use decreasing argument to sort vector in decreasing order
na.last = TRUE = use na.last argument to sort the vector such that all the NA values will
be listed last
x[order(x$var1, x$var2), ] = order the x data frame according to var1 first and var2 second
plyr package: arrange(data.frame, var1, desc(var2)) = see dplyr section
adding row/columns
x$var4 <- rnorm(5) = adds a new column to the end called var4
cbind(x, rnorm(5)) = combines the data frame with a vector (as a column on the right)
rbind() = combines two objects by putting them on top of each other (as a row on the
bottom)
Note: order specified in the argument is the order in which the operation is performed
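A runnable sketch combining the subsetting and ordering calls above:

    x <- data.frame(var1 = sample(1:5), var2 = sample(6:10), var3 = 11:15)
    x <- x[sample(1:5), ]                          # scramble the rows
    x$var2[c(2, 3)] <- NA                          # introduce missing values

    x[which(x$var2 > 8), ]                         # which() silently drops the NA rows
    x[order(x$var1, x$var2, na.last = TRUE), ]     # order by var1, then var2
    sort(x$var1, decreasing = TRUE)                # sort one column as a vector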
Summarizing Data
         Admitted  Rejected
Male         1198      1493
Female        557      1278
object.size(obj) = returns size of object in bytes
print(object.size(obj), units = "Mb") = prints size of object in Mb
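A minimal sketch of common summarizing calls, using a tiny made-up data frame:

    df <- data.frame(score = c(90, 75, NA, 88), group = c("a", "b", "a", "b"))

    head(df, n = 3)                       # first few rows
    summary(df)                           # quantiles / counts for every column
    str(df)                               # classes and example values
    quantile(df$score, na.rm = TRUE)      # quartiles, ignoring missing values
    table(df$group, useNA = "ifany")      # counts per category, including NAs
    sum(is.na(df$score))                  # how many values are missing
    print(object.size(df), units = "Mb")  # memory footprint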
Creating New Variables
sequences
s <- seq(1, 10, by = 2) = creates a sequence from 1 to 10 by intervals of 2
length = 3 = use the length argument to specify how many numbers to generate
seq(along = x) = create as many elements as vector x
subsetting variables
restData$nearMe = restData$neighborhood %in% c("Roland", "Homeland") = creates a
new variable nearMe that returns TRUE if the neighborhood value is Roland or Homeland, and
FALSE otherwise
binary variables
restData$zipWrong = ifelse(restData$zipCode<0, TRUE, FALSE) = creates a new variable
zipWrong that returns TRUE if the zipcode is less than 0, and FALSE otherwise
ifelse(condition, result1, result2) = this function is the same as a if-else statement
categorical variables
restData$zipGroups = cut(restData$zipCode, breaks = quantile(restData$zipCode)) =
creates new variable zipGroups that specify ranges for the zip code data such that the observations
are divided into groups created by the quantile function
cut(variable, breaks) = cuts a variable/vector into groups at the specified breaks
Note: class of resultant variable = factor
quantile(variable) = returns the 0, .25, .5, .75, and 1 quantiles by default, and thus provides
the ranges/groups for the data to be divided into
using Hmisc package
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode, g = 4)
cut2(variable, g = 4) = automatically divides the variable values into 4 groups according
to the quantiles
Note: class of resultant variable = factor
factor variables
restData$zcf <- factor(restData$zipCode) = converts an existing vector to a factor variable
levels = c("yes", "no") = use the levels argument to specify the order of the factor
levels
Note: by default, when converting a variable to the factor class, the levels are ordered alpha-
betically unless otherwise specified
as.numeric(factorVariable) = converts factor variable values into numeric by assigning the
lowest (first) level 1, the second lowest level 2, . . . , etc.
category + factor split
using plyr and Hmisc packages
library(plyr); library(Hmisc)
restData2 <- mutate(restData, zipGroups = cut2(zipCode, g = 4))
this creates zipGroups and splits the data from zipCode all at the same time
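A small sketch of the grouping calls above, with made-up zip codes standing in for the restaurant data:

    library(Hmisc)   # cut2()
    library(plyr)    # mutate()

    zips <- data.frame(zipCode = c(21201, 21202, -21218, 21230, 21211, 21209))

    zips$zipWrong  <- ifelse(zips$zipCode < 0, TRUE, FALSE)            # binary flag
    zips$zipGroups <- cut(zips$zipCode, breaks = quantile(zips$zipCode))
    zips2 <- mutate(zips, zipGroups2 = cut2(zipCode, g = 4))           # same split, one step

    class(zips$zipGroups)     # "factor"
    table(zips2$zipGroups2)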
common transforms
abs(x) = absolute value
sqrt(x) = square root
ceiling(x), floor(x) = round up/down to integer
round(x, digits = n) = round to the number of digits after the decimal point
signif(x, digits = n) = round to the number of significant digits
cos(x), sin(x), tan(x) . . . etc = trigonometric functions
log(x), log2(x), log10(x) = natural log, log 2, log 10
exp(x) = exponential of x
Reshaping Data
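A minimal sketch of the usual melt/dcast workflow with the reshape2 package, using the built-in mtcars data as a stand-in:

    library(reshape2)

    mtcars$carname <- rownames(mtcars)
    carMelt <- melt(mtcars, id = c("carname", "gear", "cyl"),
                    measure.vars = c("mpg", "hp"))     # long format: one value per row

    dcast(carMelt, cyl ~ variable)         # recast wide: counts by default
    dcast(carMelt, cyl ~ variable, mean)   # recast wide: means instead of counts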
Merging Data
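A minimal sketch of merge() on two small made-up data frames:

    df1 <- data.frame(id = 1:5, x = rnorm(5))
    df2 <- data.frame(key = 3:7, y = letters[1:5])

    merge(df1, df2, by.x = "id", by.y = "key")              # inner join on id/key
    merge(df1, df2, by.x = "id", by.y = "key", all = TRUE)  # keep non-matching rows too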
Editing Text Variables
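A minimal sketch of common text-cleaning calls on a made-up column name:

    name <- " Camera.Address_2 "

    tolower(name)                   # lower-case everything
    strsplit(name, "\\.")           # split on a literal period
    sub("_", "", name)              # remove the first underscore
    gsub(" ", "", name)             # remove every space
    grepl("Address", name)          # does the pattern occur? TRUE/FALSE
    nchar(name)                     # number of characters
    paste0("clean_", trimws(name))  # glue strings together with no separator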
Regular Expressions
RegEx = combination of literals and metacharacters
used with the grep/grepl/sub/gsub functions, or any others that involve searching for strings in character
objects/variables
^ = start of the line (metacharacter)
example: ^text matches lines that start with text . . .
$ = end of the line (metacharacter)
example: text$ matches lines such as . . . text
[] = set of characters that will be accepted in the match (character class)
example: [Ii] matches lines such as I . . . or i . . .
[0-9] = searches for a range of characters (character class)
example: [a-zA-Z] will match any letter in upper or lower case
[^?.] = when ^ is used at the beginning of a character class, it means not (metacharacter)
example: [^?.]$ matches any line that does not end in . or ?
. = any character (metacharacter)
example: 9.11 matches 9/11, 9911, 9-11, etc
| = or, used to combine subexpressions called alternatives (metacharacter)
example: ^([Gg]ood|[Bb]ad) matches any line that starts with lower/upper-case Good . . . or
Bad . . .
Note: () limits the scope of alternatives divided by | here
? = expression is optional = 0/1 of some character/expression (metacharacter)
example: [Gg]eorge( [Ww]\.)? [Bb]ush matches george bush, George W. Bush
Note: " was added before." because . is a metacharacter, . called escape dot, tells the
expression to read it as an actual period instead of an operator
* = any number of repetition, including none = 0 or more of some character/expression (metacharacter)
example: .* matches any combination of characters
Note: * is greedy = always matches the longest possible string that satisfies the regular expression
greediness of * can be turned off with the ?
example: s.*?s matches the shortest s. . . s text
+ = 1 or more repetitions = 1 or more of some character/expression (metacharacter)
example: [0-9]+ matches numbers with at least one digit, such as 0, 90, or 021442132
{m, n} = interval quantifier, allows specifying the minimum and maximum number of matches (metachar-
acter)
m = at least, n = not more than
{m} = exactly m matches
{m, } = at least m matches
example: Bush( +[^ ]+ +){1,5} debates matches Bush + (at least one space + any word that
doesn't contain a space + at least one space), with this pattern repeated between 1 and 5 times, + debates
() = define a group as the text in parentheses; groups will be remembered and can be referred to by
\1, \2, etc.
example: ([a-zA-Z]+) +\1 + matches any word + at least one space + the same word repeated
+ at least one space = night night, so so, etc.
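A short sketch of these patterns used with R's matching functions (the character vector is made up for illustration):

    x <- c("Good morning", "bad timing", "say night night now", "9-11", "end?")

    grepl("^([Gg]ood|[Bb]ad)", x)      # starts with Good/good or Bad/bad
    grepl("9.11", x)                   # . matches any single character
    grepl("[^?.]$", x)                 # does not end in ? or .
    grep(" +([a-zA-Z]+) +\\1 +", x)    # repeated word, e.g. "night night"
    sub("[0-9]+", "N", x)              # replace the first run of digits
    gsub(" +", "-", x)                 # replace every run of spaces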
Working with Dates
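A minimal sketch of base R date handling (dates chosen arbitrarily):

    d1 <- Sys.Date()                                    # today, as a Date object
    format(d1, "%a %b %d %Y")                           # custom formatting

    d2 <- as.Date("01/31/2024", format = "%m/%d/%Y")    # parse a character date
    as.numeric(d2 - d1)                                 # difference in days
    weekdays(d2); months(d2)                            # day-of-week / month names
    julian(d2)                                          # days since 1970-01-01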
Data Sources
quantmod package = get historical stock prices for publicly traded companies on NASDAQ or NYSE
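A minimal sketch of pulling historical prices with quantmod (ticker and date range are arbitrary):

    library(quantmod)

    getSymbols("AAPL", from = "2020-01-01", to = "2020-12-31", auto.assign = TRUE)
    head(AAPL)       # OHLC prices, volume, and adjusted close per trading day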