data.table Tutorial (With 50 Examples)
https://www.listendata.com/2016/10/r-data-table.html
The data.table R package is considered the fastest package for data manipulation. This tutorial includes various examples and practice questions to make you familiar with the package. Analysts often describe R as unsuitable for big datasets (> 10 GB) because it is not memory efficient and loads everything into RAM. The data.table package was built to change that perception; it was designed to be concise and painless. Many benchmarks comparing dplyr with data.table have been run in the past, and data.table wins in every one. Its efficiency has also been compared with Python's pandas package, and data.table again comes out ahead. On CRAN, more than 200 packages depend on data.table, which places it among the top 5 R packages.
data.table Syntax
The general form of data.table syntax is shown below:
DT[ i , j , by]
1. The first parameter of data.table, i, refers to rows. It implies subsetting rows and is equivalent to the WHERE clause in SQL.
2. The second parameter, j, refers to columns. It implies subsetting (dropping or keeping) columns and is equivalent to the SELECT clause in SQL.
3. The third parameter, by, refers to grouping, so that all calculations are done within a group. It is equivalent to SQL's GROUP BY clause. A combined example is shown after this list.
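For instance, a minimal sketch combining all three parts on a made-up toy table (the column names here are purely illustrative):
library(data.table)
DT = data.table(origin = c("JFK", "JFK", "LGA"), dep_delay = c(10, 20, 5))
DT[dep_delay > 0,                    # i  : filter rows (WHERE)
   .(avg_delay = mean(dep_delay)),   # j  : compute / select columns (SELECT)
   by = origin]                      # by : group the calculation (GROUP BY)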
The data.table syntax is NOT RESTRICTED to only 3 parameters. There are other arguments
that can be added to data.table syntax. The list is as follows -
1. with, which
2. allow.cartesian
3. roll, rollends
4. .SD, .SDcols
5. on, mult, nomatch
The above arguments are explained in the later part of the post.
Read Data
The data.table package provides the fread() function to read or get data from your computer or from a web page. It is the data.table equivalent of base R's read.csv() function.
library(data.table)
mydata = fread("https://github.com/arunsrinivasan/satrdays-workshop/raw/master/flights_2014.csv")
Describe Data
This dataset contains 253K observations and 17 columns, covering flights' arrival and departure times, delays, cancellations and destinations in the year 2014.
nrow(mydata)
[1] 253316
ncol(mydata)
[1] 17
names(mydata)
[1] "year" "month" "day" "dep_time" "dep_delay" "arr_time" "arr_delay"
[8] "cancelled" "carrier" "tailnum" "flight" "origin" "dest" "air_time"
[15] "distance" "hour" "min"
head(mydata)
year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight
1: 2014 1 1 914 14 1238 13 0 AA N338AA 1
2: 2014 1 1 1157 -3 1523 13 0 AA N335AA 3
3: 2014 1 1 1902 2 2224 9 0 AA N327AA 21
4: 2014 1 1 722 -8 1014 -26 0 AA N3EHAA 29
5: 2014 1 1 1347 2 1706 1 0 AA N319AA 117
6: 2014 1 1 1824 4 2145 0 0 AA N3DEAA 119
origin dest air_time distance hour min
1: JFK LAX 359 2475 9 14
2: JFK LAX 363 2475 11 57
3: JFK LAX 351 2475 19 2
4: LGA PBI 157 1035 7 22
5: JFK LAX 350 2475 13 47
6: EWR LAX 339 2454 18 24
Selecting or Keeping Columns
Suppose you need to select only the 'origin' column. You can use the code below -
dat1 = mydata[ , origin] # returns a vector
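If you want the result as a data.table rather than a vector, or you need several columns at once, you can wrap the names in .() - a small sketch (dat2 and dat3 are illustrative names):
dat2 = mydata[, .(origin)]                 # one-column data.table
dat3 = mydata[, .(origin, dest, carrier)]  # several columns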
Dropping a Column
Suppose you want to keep all the variables except one column, say 'origin'. It can easily be done by adding the ! sign (which implies negation in R) -
dat5 = mydata[, !c("origin"), with=FALSE]
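Similarly, you can drop several columns at once by negating a character vector of names (a small sketch; dat6 is just an illustrative name):
dat6 = mydata[, !c("origin", "dest"), with=FALSE]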
Keeping variables that contain 'dep'
You can use the %like% operator to find a pattern. It is similar to base R's grepl() function, SQL's LIKE operator and SAS's CONTAINS function.
dat7 = mydata[,names(mydata) %like% "dep", with=FALSE]
Rename Variables
You can rename variables with the setnames() function. In the following code, we are renaming the variable 'dest' to 'Destination'.
setnames(mydata, c("dest"), c("Destination"))
To rename multiple variables, you can simply supply a vector of old names and a vector of new names (shown here as if run on the original column names) -
setnames(mydata, c("dest", "origin"), c("Destination", "origin.of.flight"))
Faster Data Manipulation with Indexing
data.table uses a binary search algorithm, which makes data manipulation faster.
Suppose you are searching for the value 20 in a list of seven values. See how the binary search algorithm works -
1. First, we sort the values.
2. We calculate the middle value, i.e. 10.
3. We check whether 20 = 10. No, 20 is greater than 10.
4. Since 20 is greater than 10, it should be somewhere after 10, so we can ignore all the values that are lower than or equal to 10.
5. We are left with 13, 20, 26. The middle value is 20.
6. We check again whether 20 = 20. Yes, the match is found.
If we did not use this algorithm, we would have to scan the whole list of seven values one by one.
It is important to set a key on your dataset, which tells the system that the data is sorted by the key column. For example, suppose you have employees' names, addresses, salaries, designations, departments and employee IDs. You can use 'employee ID' as a key to search for a particular employee.
Set Key
In this case, we are setting 'origin' as a key in the dataset mydata.
# Indexing (Set Keys)
setkey(mydata, origin)
Performance Comparison
You can compare the performance of the filtering process with and without a key.
system.time(mydata[origin %in% c("JFK", "LGA")])
system.time(mydata[c("JFK", "LGA")])
Result : the second (keyed) subset runs noticeably faster, because data.table performs a binary search on the key column instead of scanning every row. You can check which column(s) are currently set as keys with key(mydata).
Sorting Data
We can sort data using the setorder() function. By default, it sorts data in ascending order.
mydata01 = setorder(mydata, origin)
Sorting Data on descending order
In this case, we are sorting data by the 'origin' variable in descending order.
mydata02 = setorder(mydata, -origin)
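You can also sort by several columns at once and mix directions - a small sketch (the chosen columns are illustrative):
setorder(mydata, origin, -dep_delay)   # ascending origin, descending dep_delay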
IF THEN ELSE
The 'IF THEN ELSE' conditions are very popular for recoding values. In the data.table package, this can be done in several ways. Suppose we want to set flag = 1 if min is less than 50 and flag = 0 otherwise.
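A minimal sketch of two equivalent ways to do this (the column name 'flag' is illustrative):
mydata[, flag := ifelse(min < 50, 1, 0)]
# or, using a logical comparison coerced to 0/1
mydata[, flag := as.integer(min < 50)]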
As another example, we can first compute the scheduled departure time and then select only the relevant columns by chaining two data.table calls, as sketched below.
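A hedged sketch of this chain, assuming the scheduled departure time is the actual departure time minus the delay (the column name 'dep_sch' is illustrative):
mydata[, dep_sch := dep_time - dep_delay][, .(dep_time, dep_delay, dep_sch)]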
Summarize or Aggregate Columns
Like the SAS PROC MEANS procedure, we can generate summary statistics for specific variables. In this case, we are calculating the mean, median, minimum and maximum value of the variable arr_delay.
mydata[, .(mean = mean(arr_delay, na.rm = TRUE),
median = median(arr_delay, na.rm = TRUE),
min = min(arr_delay, na.rm = TRUE),
max = max(arr_delay, na.rm = TRUE))]
If you need to calculate summary statistics for a larger list of variables, you can use the .SD and .SDcols operators. The .SD operator stands for 'Subset of Data'.
mydata[, lapply(.SD, mean), .SDcols = c("arr_delay", "dep_delay")]
In this case, we are calculating mean of two variables - arr_delay and dep_delay.
Summary by group
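A minimal sketch of a grouped summary, assuming we want the mean arrival and departure delays for each origin (the by argument does the grouping):
mydata[, lapply(.SD, mean, na.rm = TRUE),
       .SDcols = c("arr_delay", "dep_delay"), by = origin]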
Remove Duplicates
You can remove non-unique / duplicate cases with the unique() function. Suppose you want to eliminate duplicates based on a single variable, say carrier.
setkey(mydata, "carrier")
unique(mydata, by = "carrier")   # 'by' makes the comparison use carrier only
Suppose you want to remove duplicates based on all the variables. You can use the command below -
setkey(mydata, NULL)
unique(mydata)
SQL's RANK OVER PARTITION
In SQL, window functions are very useful for solving complex data problems. RANK OVER PARTITION is the most popular window function. It can be easily translated into data.table with the help of the frank() function, which is similar to base R's rank() function but much faster. See the code below.
dt = mydata[, rank:=frank(-distance,ties.method = "min"), by=carrier]
In this case, we are calculating the rank of the variable 'distance' within each 'carrier', assigning rank 1 to the highest value of 'distance'.
The %like% Operator
The %like% operator is mainly used to find all the values that match a pattern.
DT = data.table(Name=c("dep_time","dep_delay","arrival"), ID=c(2,3,4))
DT[Name %like% "dep"]
Merging / Joins
Merging in data.table is very similar to the base R merge() function. The only difference is that data.table, by default, uses the common key variable as the primary key to merge two datasets, whereas data.frame uses the common variable name.
Sample Data
(dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))
(dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
Inner Join
It returns all the matching observations in both the datasets.
merge(dt1, dt2, by="A")
Left Join
It returns all observations from the left dataset and the matched observations from the right dataset.
merge(dt1, dt2, by="A", all.x = TRUE)
Right Join
It returns all observations from the right dataset and the matched observations from the left dataset.
merge(dt1, dt2, by="A", all.y = TRUE)
Full Join
It returns all rows from both datasets, whether or not they have a match.
merge(dt1, dt2, all=TRUE)
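Besides merge(), joins can also be written with data.table's own X[Y, on = ...] syntax, which uses the on, mult and nomatch arguments listed earlier. A small sketch on the same sample data (offered as an illustration, not as the tutorial's own code):
dt2[dt1, on = "A"]               # all rows of dt1, with matching columns from dt2
dt1[dt2, on = "A", nomatch = 0]  # inner join: only rows that match in both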
Convert a Data Frame to a Data Table
You can use the setDT() function to convert a data frame to a data table by reference (as.data.table() does the same but returns a copy).
set.seed(123)
X = data.frame(A = sample(3, 10, TRUE),
               B = sample(letters[1:3], 10, TRUE))
setDT(X)
Rolling Joins
data.table supports rolling joins, which are commonly used for analyzing time series data. Very few R packages support this kind of join. A sketch follows.
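A minimal sketch of a rolling join on made-up trade and quote times (all names and values here are illustrative). With roll = TRUE, each trade is matched to the most recent quote at or before its time:
trades = data.table(time = c(10, 20, 30), trade_id = 1:3, key = "time")
quotes = data.table(time = c(8, 18, 28), price = c(100, 101, 102), key = "time")
quotes[trades, roll = TRUE]   # last observation carried forward onto each trade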
Q3. Find origins of flights whose average total delay is greater than 20 minutes
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay",
"dep_delay"), by = origin][(arr_delay + dep_delay) > 20]
Q4. Extract average of arrival and departure delays for carrier ==
'DL' by 'origin' and 'dest' variables
mydata[carrier == "DL",
lapply(.SD, mean, na.rm = TRUE),
by = .(origin, dest),
.SDcols = c("arr_delay", "dep_delay")]
Q5. Pull the first value of 'air_time' by 'origin' and then sum the returned values that are greater than 300
mydata[, .SD[1], .SDcols="air_time", by=origin][air_time > 300,
sum(air_time)]
Endnotes
This package provides a one-stop solution for data wrangling in R. It offers two main benefits - less coding and lower computing time. However, it is not the first choice of some R programmers; some prefer the dplyr package for its simplicity. I would recommend learning both packages. Check out the dplyr tutorial. If you are working on data smaller than 1 GB, you can use the dplyr package; it offers decent speed, but is slower than the data.table package.