MIT 302 - Statistical Computing II - Tutorial 02

This document provides an overview of advanced data manipulation techniques in R using packages such as dplyr, tidyr, data.table, and stringr. These techniques allow efficient transformation and manipulation of data to extract meaningful insights. Specifically, it discusses filtering rows and selecting columns using dplyr, reshaping data between wide and long formats using tidyr, aggregating data using dplyr, and merging datasets using inner joins in dplyr. Mastering these advanced techniques enables effective cleaning, transformation, reshaping and analysis of data.

Uploaded by

evansojoshuz

MIT 302 – STATISTICAL COMPUTING II
TUTORIAL 2: Advanced Data Manipulation in R

1 Overview
Advanced data manipulation techniques in R allow you to efficiently transform and manipulate
your data to extract meaningful insights. These techniques involve packages such as dplyr,
tidyr, data.table, and stringr, which provide a wide range of functions for complex data
operations.
The dplyr package is a powerful tool for data manipulation in R. It offers functions like filter,
select, arrange, mutate, and summarize, which allow you to perform tasks such as filtering rows
based on conditions, selecting specific columns, sorting rows, creating new columns, and
summarizing data. The syntax of dplyr is concise and easy to understand, making it a popular
choice for data manipulation tasks.
The tidyr package complements dplyr by providing functions for reshaping and tidying data.
With tidyr, you can convert data between wide and long formats using the gather and spread
functions. These functions are useful when you need to transform your data to fit a specific
analysis or visualization requirement. tidyr also offers functions like separate and unite for
splitting and combining columns, as well as fill for replacing missing values with the nearest
preceding or following entry.
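As a minimal sketch of separate, fill, and unite working together, the example below uses a small hypothetical sales table (the column names year_month, region, and period are assumptions made for illustration, not part of the tutorial's data):

```r
library(tidyr)

# Hypothetical data: a combined "year_month" column and a gap in "region"
sales <- data.frame(
  year_month = c("2023-01", "2023-02", "2023-03"),
  region = c("North", NA, "South"),
  stringsAsFactors = FALSE
)

# separate(): split one column into two at the "-" delimiter
sales_split <- separate(sales, year_month, into = c("year", "month"), sep = "-")

# fill(): replace the missing region with the last non-missing value above it
sales_filled <- fill(sales_split, region)

# unite(): combine year and month back into a single column
sales_united <- unite(sales_filled, "period", year, month, sep = "/")
```

By default fill() works downward, so the missing region in the second row takes the value "North" from the first row.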
The data.table package is known for its efficiency in handling large datasets. It provides
optimized functions for joins, aggregation, and reshaping operations. With data.table, you can
perform fast and memory-efficient operations on large datasets, making it a valuable tool for
data manipulation tasks that involve substantial amounts of data.
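The tutorial does not include a data.table example, so the following is only an illustrative sketch of its dt[i, j, by] form, which filters, computes, and groups in a single call. The table dt and its contents are hypothetical:

```r
library(data.table)

# Hypothetical data using the same category/value layout as the later examples
dt <- data.table(
  category = c("A", "B", "A", "C", "B"),
  value = c(10, 20, 15, 5, 25)
)

# i: keep rows with value > 5
# j: compute the mean and the row count (.N)
# by: do the computation separately for each category
dt_summary <- dt[value > 5, .(avg_value = mean(value), n_rows = .N), by = category]
```

Category "C" does not appear in dt_summary because its only row is removed by the filter before grouping.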
Another package, stringr, focuses on advanced string manipulation. It provides functions for
pattern matching, string extraction, replacement, and splitting. With stringr, you can efficiently
work with text data, extract relevant information, perform string replacements, and split strings
into substrings based on patterns.
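A short sketch of the four stringr operations named above, using hypothetical product codes invented for this illustration:

```r
library(stringr)

# Hypothetical product codes, used only for illustration
codes <- c("item-001-red", "item-042-blue", "part-007-green")

# Pattern matching: which codes start with "item"?
is_item <- str_detect(codes, "^item")

# Extraction: the first run of digits in each code
numbers <- str_extract(codes, "[0-9]+")

# Replacement: str_replace() changes only the first match in each string
replaced <- str_replace(codes, "-", "_")

# Splitting: returns a list with one character vector per input string
parts <- str_split(codes, "-")
```

To replace every match rather than only the first, stringr also provides str_replace_all().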
By mastering these techniques, you can clean, transform, and reshape your data, handle large
datasets efficiently, and extract information from strings. Together, these packages streamline
your data manipulation workflow and enhance your data analysis capabilities.

2 Advanced techniques using packages such as dplyr and tidyr

2.1 Filtering Rows with Multiple Conditions
The dplyr package provides the filter() function, which allows you to filter rows based on
one or more conditions. You can combine multiple conditions using logical operators such
as & (AND) and | (OR).
Example:
library(dplyr)

# Create a dataset
df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

# Filter rows based on multiple conditions
filtered_data <- filter(df, category == "A" & value > 10)
In this example, we have a dataset with three columns: id, category, and value. We use
the filter() function to select rows where the category is "A" and the value is greater than
10. The result, filtered_data, contains the rows with id 3, 7, and 10; the row with id 1 is
excluded because its value is exactly 10, which is not greater than 10.
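The AND case can be varied in a few ways. The sketch below reuses the same df and shows an OR condition, comma-separated conditions (which filter() combines with AND), and the %in% operator for matching against several values. The result names either, both, and b_or_c are chosen for this illustration:

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

# OR: rows in category "A", or rows with value above 20
either <- filter(df, category == "A" | value > 20)

# Comma-separated conditions are combined with AND
both <- filter(df, category == "A", value > 10)

# %in% tests membership in a set of values
b_or_c <- filter(df, category %in% c("B", "C"), value >= 20)
```

Here either keeps ids 1, 3, 5, 6, 7, and 10, both keeps ids 3, 7, and 10, and b_or_c keeps ids 2, 5, and 6.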
2.2 Selecting Columns with Conditions
The select() function in dplyr allows you to choose specific columns by name or by a condition
on the column names. You can use helper functions like starts_with(), ends_with(), contains(),
and matches() to match column names.
Example:
library(dplyr)

# Select columns based on conditions
selected_columns <- select(df, starts_with("c"))

# Select columns using regex pattern
selected_columns <- select(df, matches("^c.*$"))
In this example, the first select() call uses the starts_with() helper to choose columns whose
names start with "c". The second select() call uses a regular expression pattern (^c.*$) to
match the same columns. In df, only the category column matches, so both calls return the same
single-column result.
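The other helpers follow the same pattern. As a sketch on a three-row version of df, the calls below show ends_with(), contains(), dropping a column with a minus sign, and selecting by column type with where(). The where() call assumes dplyr 1.0.0 or later:

```r
library(dplyr)

df <- data.frame(
  id = 1:3,
  category = c("A", "B", "C"),
  value = c(10, 20, 15)
)

# Columns whose names end with "e"
cols_end <- names(select(df, ends_with("e")))

# Columns whose names contain "a"
cols_contain <- names(select(df, contains("a")))

# All columns except id
cols_drop <- names(select(df, -id))

# Columns selected by type rather than by name (assumes dplyr >= 1.0.0)
cols_numeric <- names(select(df, where(is.numeric)))
```

cols_end is "value", cols_contain and cols_drop are both "category" and "value", and cols_numeric is "id" and "value".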
2.3 Arranging Rows
The arrange() function in dplyr allows you to sort rows based on one or more columns. By
default, rows are sorted in ascending order, but you can use the desc() function to sort in
descending order.
Example:
library(dplyr)

# Arrange rows based on a column
arranged_data <- arrange(df, value)

# Arrange rows based on multiple columns
arranged_data <- arrange(df, category, desc(value))
In this example, we first use the arrange() function to sort the rows in ascending order based
on the value column. The second arrange() call sorts the rows first by category in ascending
order and then by value in descending order.
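To make the resulting order concrete, the sketch below sorts the same df and inspects the id column. Combining arrange() with head() to keep the largest values is an addition for illustration, and the names top3 and by_category are hypothetical:

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

# Sort by value, largest first, and keep the first three rows
top3 <- df %>%
  arrange(desc(value)) %>%
  head(3)

# Sort by category, then from largest to smallest value within each category
by_category <- arrange(df, category, desc(value))
```

top3 holds ids 6, 5, and 10 (values 30, 25, and 22), and by_category lists the four "A" rows first, starting with id 10 (value 22).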
2.4 Creating New Variables
The mutate() function in dplyr allows you to create new variables based on existing variables.
You can perform calculations or transformations on the existing columns and store the results
in new columns.
Example:
library(dplyr)

# Create a new variable based on existing variables
mutated_data <- mutate(df, value_squared = value^2)
In this example, we use the mutate() function to create a new variable called value_squared.
We calculate the square of the value column and store the result in the new column.
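A single mutate() call can create several columns, and a later column can use one created earlier in the same call. The sketch below illustrates this on the same df; the column names value_cubed, value_share, and size, and the threshold of 20, are assumptions chosen for the example:

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

mutated <- mutate(
  df,
  value_squared = value^2,
  # value_squared was defined on the line above and can be reused here
  value_cubed = value_squared * value,
  # Each value as a fraction of the column total
  value_share = value / sum(value),
  # Conditional column: label rows by comparing value with a threshold
  size = if_else(value >= 20, "large", "small")
)
```

The original columns are kept, and the four new columns are appended to the right of them.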
2.5 Summarizing Data
The summarize() function in dplyr allows you to compute summary statistics or aggregate
values, either over the whole dataset or within groups created by group_by(). You can use
functions like mean(), sum(), min(), max(), and n() (count) to summarize the data.
Example:
library(dplyr)

# Summarize data based on a group variable
summary <- df %>%
  group_by(category) %>%
  summarize(avg_value = mean(value), total_count = n())
In this example, we first use the %>% operator (pipe) to chain operations together. We group the
data by the category variable using group_by(). Then, we use summarize() to calculate the
average value (avg_value) and the total count of rows (total_count) within each category.
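The other summary functions mentioned above fit into the same pipeline. A minimal sketch on the same df follows; the output column names are hypothetical, and the .groups argument assumes dplyr 1.0.0 or later:

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

summary_stats <- df %>%
  group_by(category) %>%
  summarize(
    total = sum(value),
    smallest = min(value),
    largest = max(value),
    n_rows = n(),
    .groups = "drop"  # return an ungrouped result (assumes dplyr >= 1.0.0)
  )
```

The result has one row per category, with totals of 59 for A, 63 for B, and 43 for C.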
These advanced techniques using packages such as dplyr and tidyr allow you to perform
complex data manipulation tasks in R efficiently. By filtering rows, selecting columns,
arranging rows, creating new variables, and summarizing data, you can transform and analyze
your data effectively.

3 Reshaping, aggregating, and merging datasets

3.1 Reshaping Data
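Reshaping data involves transforming the layout of your dataset from one form to another.
The tidyr package provides functions like gather() and spread() for reshaping data between
wide and long formats. Since tidyr 1.0.0 these two functions are superseded by pivot_longer()
and pivot_wider(); they still work, but new code would normally use the pivot functions.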
Example:
library(tidyr)

# Create a wide-format dataset
wide_data <- data.frame(
  id = 1:3,
  category = c("A", "B", "C"),
  value_1 = c(10, 20, 30),
  value_2 = c(15, 25, 35)
)

# Reshape from wide to long format
long_data <- gather(wide_data, key = "variable", value = "value", -id, -category)
In this example, we start with a wide-format dataset in which the two measurements, value_1
and value_2, each have their own column. We use the gather() function to reshape the data into
a long format. The key argument specifies the name of the column that will store the original
column names, and the value argument specifies the name of the column that will store the
corresponding values. We exclude the id and category columns from the reshaping, so long_data
has six rows (three ids times two measurements) and the columns id, category, variable, and
value.
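The same reshaping can be written with pivot_longer(), and reversed with pivot_wider(), in tidyr 1.0.0 and later. The sketch below assumes that version or newer and reuses the tutorial's wide_data; wide_again is a name introduced here:

```r
library(tidyr)

wide_data <- data.frame(
  id = 1:3,
  category = c("A", "B", "C"),
  value_1 = c(10, 20, 30),
  value_2 = c(15, 25, 35)
)

# Wide to long: names_to and values_to play the roles of key and value in gather()
long_data <- pivot_longer(
  wide_data,
  cols = c(value_1, value_2),
  names_to = "variable",
  values_to = "value"
)

# Long back to wide: one new column per distinct entry in "variable"
wide_again <- pivot_wider(long_data, names_from = variable, values_from = value)
```

One visible difference from gather() is row order: pivot_longer() keeps the rows for each id together, whereas gather() stacks all value_1 rows before all value_2 rows.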
3.2 Aggregating Data
Aggregating data involves summarizing or collapsing your dataset based on one or more
variables. The dplyr package provides the group_by() and summarize() functions for
aggregating data based on groups.
Example:
library(dplyr)

# Create a dataset
df <- data.frame(
  category = c("A", "A", "B", "B", "B", "C"),
  value = c(10, 20, 15, 25, 30, 12)
)

# Aggregate data by category
aggregated_data <- df %>%
  group_by(category) %>%
  summarize(avg_value = mean(value), total_count = n())
In this example, we have a dataset with two columns: category and value. We use
the %>% operator to chain operations together. First, we group the data by the category variable
using group_by(). Then, we use summarize() to calculate the average value (avg_value) and
the total count of rows (total_count) within each category. The result has one row per
category: A has an average of 15 over 2 rows, B about 23.33 over 3 rows, and C 12 over 1 row.
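Grouping by more than one variable works the same way. The sketch below adds a hypothetical year column to the tutorial's data (the year values are invented for illustration) and also shows count(), a dplyr shortcut for group_by() followed by summarize(n = n()):

```r
library(dplyr)

# Same category/value data as above, plus a hypothetical year column
df_year <- data.frame(
  category = c("A", "A", "B", "B", "B", "C"),
  year = c(2022, 2023, 2022, 2022, 2023, 2023),
  value = c(10, 20, 15, 25, 30, 12)
)

# One output row per combination of category and year
by_category_year <- df_year %>%
  group_by(category, year) %>%
  summarize(total = sum(value), n_rows = n(), .groups = "drop")

# count() returns the number of rows per category in a column named n
rows_per_category <- count(df_year, category)
```

Only combinations that occur in the data appear in the result, so by_category_year has five rows rather than six.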
3.3 Merging Datasets
Merging datasets involves combining data from multiple datasets based on common columns
or keys. The dplyr package provides the inner_join() function for performing inner joins
between datasets.
Example:
library(dplyr)

# Create two datasets
df1 <- data.frame(
  id = c(1, 2, 3),
  value1 = c(10, 20, 30)
)

df2 <- data.frame(
  id = c(2, 3, 4),
  value2 = c(15, 25, 35)
)

# Merge datasets based on id
merged_data <- inner_join(df1, df2, by = "id")
In this example, we have two datasets, df1 and df2, each with an id column. We use
the inner_join() function to merge the datasets based on the common id column. The
resulting merged_data dataset contains only the rows where the id values are present in both
datasets. Here those are the rows with id 2 and 3; id 1 (only in df1) and id 4 (only in df2)
are dropped.
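For contrast with the inner join, dplyr also provides left_join() and full_join(), which keep unmatched rows and fill the missing columns with NA. The tutorial covers only inner_join(), so the following is a supplementary sketch on the same df1 and df2:

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), value1 = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), value2 = c(15, 25, 35))

# Inner join: only ids present in both datasets
inner <- inner_join(df1, df2, by = "id")

# Left join: every row of df1; value2 is NA where df2 has no matching id
left <- left_join(df1, df2, by = "id")

# Full join: every id from either dataset, with NA where a side is missing
full <- full_join(df1, df2, by = "id")
```

inner has two rows, left has three (value2 is NA for id 1), and full has four (value1 is NA for id 4).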
These techniques of reshaping, aggregating, and merging datasets in R are essential for data
manipulation and analysis. By reshaping your data, you can transform its structure to fit
specific analysis or visualization needs. Aggregating data allows you to summarize and gain
insights from your dataset. Merging datasets enables you to combine information from multiple
sources to create a comprehensive dataset for analysis.
