data.table Tutorial (With 50 Examples)
https://www.listendata.com/2016/10/r-data-table.html
The data.table R package is considered the fastest package for data manipulation. This tutorial includes various examples and practice questions to make you familiar with the package. Analysts often describe R as unsuitable for big datasets (> 10 GB) because it is not memory efficient and loads everything into RAM. The data.table package was built to change that perception; it was designed to be concise and painless. Many benchmarks comparing dplyr with data.table have been run in the past, and data.table wins in every one. Its efficiency has also been compared with Python's pandas package, and data.table again comes out ahead. On CRAN, more than 200 packages depend on data.table, which places it among the top 5 R packages.
data.table Syntax
The general form of data.table syntax is shown below:
DT[ i , j , by]
1. The first parameter of data.table, i, refers to rows. It implies subsetting rows and is equivalent to the WHERE clause in SQL.
2. The second parameter, j, refers to columns. It implies subsetting (dropping or keeping) columns and is equivalent to the SELECT clause in SQL.
3. The third parameter, by, refers to grouping, so that all calculations are done within a group. It is equivalent to SQL's GROUP BY clause. A combined example is shown after this list.
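For instance, a minimal sketch combining all three parts on a made-up toy table (the column names here are purely illustrative):
library(data.table)
DT = data.table(origin = c("JFK", "JFK", "LGA"), dep_delay = c(10, 20, 5))
DT[dep_delay > 0,                    # i  : filter rows (WHERE)
   .(avg_delay = mean(dep_delay)),   # j  : compute / select columns (SELECT)
   by = origin]                      # by : group the calculation (GROUP BY)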
The data.table syntax is NOT RESTRICTED to only 3 parameters. There are other arguments
that can be added to data.table syntax. The list is as follows -
1. with, which
2. allow.cartesian
3. roll, rollends
4. .SD, .SDcols
5. on, mult, nomatch
The above arguments are explained in the later part of the post.
Read Data
The data.table package provides the fread() function to read or get data from your computer or from a web page. It is the data.table equivalent of base R's read.csv() function.
library(data.table)
mydata = fread("https://github.com/arunsrinivasan/satrdays-workshop/raw/master/flights_2014.csv")
Describe Data
This dataset contains 253K observations and 17 columns, covering flights' arrival and departure times, delays, cancellations and destinations in the year 2014.
nrow(mydata)
[1] 253316
ncol(mydata)
[1] 17
names(mydata)
[1] "year" "month" "day" "dep_time" "dep_delay" "arr_time" "arr_delay"
[8] "cancelled" "carrier" "tailnum" "flight" "origin" "dest" "air_time"
[15] "distance" "hour" "min"
head(mydata)
year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight
1: 2014 1 1 914 14 1238 13 0 AA N338AA 1
2: 2014 1 1 1157 -3 1523 13 0 AA N335AA 3
3: 2014 1 1 1902 2 2224 9 0 AA N327AA 21
4: 2014 1 1 722 -8 1014 -26 0 AA N3EHAA 29
5: 2014 1 1 1347 2 1706 1 0 AA N319AA 117
6: 2014 1 1 1824 4 2145 0 0 AA N3DEAA 119
origin dest air_time distance hour min
1: JFK LAX 359 2475 9 14
2: JFK LAX 363 2475 11 57
3: JFK LAX 351 2475 19 2
4: LGA PBI 157 1035 7 22
5: JFK LAX 350 2475 13 47
6: EWR LAX 339 2454 18 24
Selecting or Keeping Columns
Suppose you need to select only the 'origin' column. You can use the code below -
dat1 = mydata[ , origin] # returns a vector
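If you want the result as a data.table rather than a vector, or you need several columns at once, you can wrap the names in .() - a small sketch (dat2 and dat3 are illustrative names):
dat2 = mydata[, .(origin)]                 # one-column data.table
dat3 = mydata[, .(origin, dest, carrier)]  # several columns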
Dropping a Column
Suppose you want to keep all the variables except one column, say 'origin'. It can easily be done by adding the ! sign (which implies negation in R) -
dat5 = mydata[, !c("origin"), with=FALSE]
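Similarly, you can drop several columns at once by negating a character vector of names (a small sketch; dat6 is just an illustrative name):
dat6 = mydata[, !c("origin", "dest"), with=FALSE]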
Keeping variables that contain 'dep'
You can use the %like% operator to find a pattern. It is similar to base R's grepl() function, SQL's LIKE operator and SAS's CONTAINS function.
dat7 = mydata[,names(mydata) %like% "dep", with=FALSE]
Rename Variables
You can rename variables with the setnames() function. In the following code, we are renaming the variable 'dest' to 'Destination'.
setnames(mydata, c("dest"), c("Destination"))
To rename multiple variables, you can simply supply a vector of old names and a vector of new names (shown here as if run on the original column names) -
setnames(mydata, c("dest", "origin"), c("Destination", "origin.of.flight"))
Faster Data Manipulation with Indexing
data.table uses a binary search algorithm, which makes data manipulation faster.
Suppose you are searching for the value 20 in a list of seven values. See how the binary search algorithm works -
1. First, we sort the values.
2. We calculate the middle value, i.e. 10.
3. We check whether 20 = 10. No, 20 is greater than 10.
4. Since 20 is greater than 10, it should be somewhere after 10, so we can ignore all the values that are lower than or equal to 10.
5. We are left with 13, 20, 26. The middle value is 20.
6. We check again whether 20 = 20. Yes, the match is found.
If we did not use this algorithm, we would have to scan the whole list of seven values one by one.
It is important to set a key on your dataset, which tells the system that the data is sorted by the key column. For example, suppose you have employees' names, addresses, salaries, designations, departments and employee IDs. You can use 'employee ID' as a key to search for a particular employee.
Set Key
In this case, we are setting 'origin' as a key in the dataset mydata.
# Indexing (Set Keys)
setkey(mydata, origin)
Performance Comparison
You can compare the performance of the filtering process with and without a key.
system.time(mydata[origin %in% c("JFK", "LGA")])
system.time(mydata[c("JFK", "LGA")])
Result : the second (keyed) subset runs noticeably faster, because data.table performs a binary search on the key column instead of scanning every row. You can check which column(s) are currently set as keys with key(mydata).
Sorting Data
We can sort data using the setorder() function. By default, it sorts data in ascending order.
mydata01 = setorder(mydata, origin)
Sorting Data on descending order
In this case, we are sorting data by the 'origin' variable in descending order.
mydata02 = setorder(mydata, -origin)
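You can also sort by several columns at once and mix directions - a small sketch (the chosen columns are illustrative):
setorder(mydata, origin, -dep_delay)   # ascending origin, descending dep_delay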
IF THEN ELSE
The 'IF THEN ELSE' conditions are very popular for recoding values. In the data.table package, this can be done in several ways. Suppose we want to set flag = 1 if min is less than 50 and flag = 0 otherwise.
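A minimal sketch of two equivalent ways to do this (the column name 'flag' is illustrative):
mydata[, flag := ifelse(min < 50, 1, 0)]
# or, using a logical comparison coerced to 0/1
mydata[, flag := as.integer(min < 50)]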
As another example, we can first compute the scheduled departure time and then select only the relevant columns by chaining two data.table calls, as sketched below.
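A hedged sketch of this chain, assuming the scheduled departure time is the actual departure time minus the delay (the column name 'dep_sch' is illustrative):
mydata[, dep_sch := dep_time - dep_delay][, .(dep_time, dep_delay, dep_sch)]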
Summarize or Aggregate Columns
Like the SAS PROC MEANS procedure, we can generate summary statistics for specific variables. In this case, we are calculating the mean, median, minimum and maximum value of the variable arr_delay.
mydata[, .(mean = mean(arr_delay, na.rm = TRUE),
median = median(arr_delay, na.rm = TRUE),
min = min(arr_delay, na.rm = TRUE),
max = max(arr_delay, na.rm = TRUE))]
If you need to calculate summary statistics for a larger list of variables, you can use the .SD and .SDcols operators. The .SD operator stands for 'Subset of Data'.
mydata[, lapply(.SD, mean), .SDcols = c("arr_delay", "dep_delay")]
In this case, we are calculating mean of two variables - arr_delay and dep_delay.
Summary by group
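A minimal sketch of a grouped summary, assuming we want the mean arrival and departure delays for each origin (the by argument does the grouping):
mydata[, lapply(.SD, mean, na.rm = TRUE),
       .SDcols = c("arr_delay", "dep_delay"), by = origin]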
Remove Duplicates
You can remove non-unique / duplicate cases with the unique() function. Suppose you want to eliminate duplicates based on a single variable, say carrier.
setkey(mydata, "carrier")
unique(mydata, by = "carrier")   # 'by' makes the comparison use carrier only
Suppose you want to remove duplicates based on all the variables. You can use the command below -
setkey(mydata, NULL)
unique(mydata)
SQL's RANK OVER PARTITION
In SQL, window functions are very useful for solving complex data problems. RANK OVER PARTITION is the most popular window function. It can be easily translated into data.table with the help of the frank() function, which is similar to base R's rank() function but much faster. See the code below.
dt = mydata[, rank:=frank(-distance,ties.method = "min"), by=carrier]
In this case, we are calculating the rank of the variable 'distance' within each 'carrier', assigning rank 1 to the highest value of 'distance'.
The %like% Operator
The %like% operator is mainly used to find all the values that match a pattern.
DT = data.table(Name=c("dep_time","dep_delay","arrival"), ID=c(2,3,4))
DT[Name %like% "dep"]
Merging / Joins
Merging in data.table is very similar to the base R merge() function. The only difference is that data.table, by default, uses the common key variable as the primary key to merge two datasets, whereas data.frame uses the common variable name.
Sample Data
(dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))
(dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
Inner Join
It returns all the matching observations in both the datasets.
merge(dt1, dt2, by="A")
Left Join
It returns all observations from the left dataset and the matched observations from the right dataset.
merge(dt1, dt2, by="A", all.x = TRUE)
Right Join
It returns all observations from the right dataset and the matched observations from the left dataset.
merge(dt1, dt2, by="A", all.y = TRUE)
Full Join
It returns all rows from both datasets, whether or not they have a match.
merge(dt1, dt2, all=TRUE)
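Besides merge(), joins can also be written with data.table's own X[Y, on = ...] syntax, which uses the on, mult and nomatch arguments listed earlier. A small sketch on the same sample data (offered as an illustration, not as the tutorial's own code):
dt2[dt1, on = "A"]               # all rows of dt1, with matching columns from dt2
dt1[dt2, on = "A", nomatch = 0]  # inner join: only rows that match in both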
Convert a Data Frame to a Data Table
You can use the setDT() function to convert a data frame to a data table by reference (as.data.table() does the same but returns a copy).
set.seed(123)
X = data.frame(A = sample(3, 10, TRUE),
               B = sample(letters[1:3], 10, TRUE))
setDT(X)
Rolling Joins
data.table supports rolling joins, which are commonly used for analyzing time series data. Very few R packages support this kind of join. A sketch follows.
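A minimal sketch of a rolling join on made-up trade and quote times (all names and values here are illustrative). With roll = TRUE, each trade is matched to the most recent quote at or before its time:
trades = data.table(time = c(10, 20, 30), trade_id = 1:3, key = "time")
quotes = data.table(time = c(8, 18, 28), price = c(100, 101, 102), key = "time")
quotes[trades, roll = TRUE]   # last observation carried forward onto each trade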
Q3. Find origins of flights whose average total delay is greater than 20 minutes
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay",
"dep_delay"), by = origin][(arr_delay + dep_delay) > 20]
Q4. Extract average of arrival and departure delays for carrier ==
'DL' by 'origin' and 'dest' variables
mydata[carrier == "DL",
lapply(.SD, mean, na.rm = TRUE),
by = .(origin, dest),
.SDcols = c("arr_delay", "dep_delay")]
Q5. Pull the first value of 'air_time' by 'origin' and then sum the returned values that are greater than 300
mydata[, .SD[1], .SDcols="air_time", by=origin][air_time > 300,
sum(air_time)]
Endnotes
This package provides a one-stop solution for data wrangling in R. It offers two main benefits - less coding and lower computing time. However, it is not the first choice of some R programmers; some prefer the dplyr package for its simplicity. I would recommend learning both packages. Check out the dplyr tutorial. If you are working on data smaller than 1 GB, you can use the dplyr package; it offers decent speed, but is slower than the data.table package.