MIT 302 - Statistical Computing II - Tutorial 02

This document provides an overview of advanced data manipulation techniques in R using packages such as dplyr, tidyr, data.table, and stringr. These techniques allow efficient transformation and manipulation of data to extract meaningful insights. Specifically, it discusses filtering rows and selecting columns using dplyr, reshaping data between wide and long formats using tidyr, aggregating data using dplyr, and merging datasets using inner joins in dplyr. Mastering these advanced techniques enables effective cleaning, transformation, reshaping and analysis of data.

Uploaded by

evansojoshuz
Copyright
© All Rights Reserved

MIT 302 – STATISTICAL COMPUTING II
TUTORIAL 2: Advanced Data Manipulation in R

1 Overview
Advanced data manipulation techniques in R allow you to efficiently transform and manipulate
your data to extract meaningful insights. These techniques involve packages such as dplyr,
tidyr, data.table, and stringr, which provide a wide range of functions for complex data
operations.
The dplyr package is a powerful tool for data manipulation in R. It offers functions like filter,
select, arrange, mutate, and summarize, which allow you to perform tasks such as filtering rows
based on conditions, selecting specific columns, sorting rows, creating new columns, and
summarizing data. The syntax of dplyr is concise and easy to understand, making it a popular
choice for data manipulation tasks.
The tidyr package complements dplyr by providing functions for reshaping and tidying data.
With tidyr, you can convert data between wide and long formats using the gather and spread
functions (current tidyr releases recommend pivot_longer and pivot_wider for the same job,
though gather and spread still work). These functions are useful when you need to transform
your data to fit a specific analysis or visualization requirement. tidyr also offers functions like
separate and unite for splitting and combining columns, as well as fill for filling in missing
values from neighbouring rows.
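As a rough illustration of separate, unite, and fill, the sketch below works on a small made-up data frame. The sales data and its column names are assumptions invented for this example; they are not part of the tutorial's datasets.

```r
library(tidyr)

# Hypothetical data: a combined "year_quarter" column and gaps in "region"
sales <- data.frame(
  year_quarter = c("2023_Q1", "2023_Q2", "2023_Q3"),
  region = c("North", NA, NA),
  amount = c(100, 120, 90)
)

# separate(): split one column into two at the underscore
split_sales <- separate(sales, year_quarter, into = c("year", "quarter"), sep = "_")

# fill(): carry the last non-missing region downward (the default direction)
filled_sales <- fill(split_sales, region)

# unite(): combine the two columns back into a single "period" column
united_sales <- unite(filled_sales, "period", year, quarter, sep = "-")
```

Each step returns a new data frame, so the intermediate results can be inspected one at a time.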
The data.table package is known for its efficiency in handling large datasets. It provides
optimized functions for joins, aggregation, and reshaping operations. With data.table, you can
perform fast and memory-efficient operations on large datasets, making it a valuable tool for
data manipulation tasks that involve substantial amounts of data.
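A minimal sketch of aggregation and joining in data.table follows. The tables dt and labels are hypothetical and deliberately tiny; the benefits described above only show up on much larger data.

```r
library(data.table)

# Hypothetical data, kept small for readability
dt <- data.table(
  category = c("A", "A", "B", "B", "B", "C"),
  value = c(10, 20, 15, 25, 30, 12)
)

# Aggregation: mean and row count per category (.N is the number of rows in each group)
agg <- dt[, .(avg_value = mean(value), total_count = .N), by = category]

# Join: merge() on a data.table performs an inner join by default
labels <- data.table(category = c("A", "B"), label = c("alpha", "beta"))
joined <- merge(dt, labels, by = "category")
```

Because labels has no entry for category "C", the inner join drops that row from joined.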
Another package, stringr, focuses on advanced string manipulation. It provides functions for
pattern matching, string extraction, replacement, and splitting. With stringr, you can efficiently
work with text data, extract relevant information, perform string replacements, and split strings
into substrings based on patterns.
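The four kinds of string operation mentioned above might look like this in practice. The notes vector is made up for illustration.

```r
library(stringr)

# Hypothetical text data
notes <- c("order 123 shipped", "order 456 pending", "no order yet")

has_number <- str_detect(notes, "\\d+")           # pattern matching: TRUE/FALSE per string
order_ids  <- str_extract(notes, "\\d+")          # extraction: first match, NA when none
cleaned    <- str_replace(notes, "order", "ORD")  # replacement of the first match
words      <- str_split(notes[1], " ")            # splitting: returns a list of character vectors
```

Note that str_replace() changes only the first match in each string; str_replace_all() replaces every match.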
Together, these packages cover the main steps of a data manipulation workflow: dplyr and
tidyr for cleaning, transforming, and reshaping data, data.table for working efficiently with
large datasets, and stringr for extracting information from text. The rest of this tutorial
works through dplyr and tidyr in detail.

2 Advanced techniques using packages such as dplyr and tidyr


2.1 Filtering Rows with Multiple Conditions
The dplyr package provides the filter() function, which allows you to filter rows based on
one or more conditions. You can combine multiple conditions using logical operators such
as & (AND) and | (OR).
Example:
library(dplyr)

# Create a dataset
df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

# Filter rows based on multiple conditions
filtered_data <- filter(df, category == "A" & value > 10)
In this example, we have a dataset with three columns: id, category, and value. We use
the filter() function to select rows where the category is "A" and the value is greater than
10. The result, filtered_data, is a new data frame containing the rows with id 3, 7, and 10.
The row with id 1 is excluded because its value is exactly 10, and the original df is left
unchanged.
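The | (OR) operator mentioned above works the same way. The sketch below repeats the df definition so that it runs on its own; the object names either and both are chosen here for illustration.

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

# OR: rows that are in category "C" or have a value above 20
either <- filter(df, category == "C" | value > 20)

# Conditions separated by commas are combined with AND,
# so this is equivalent to category == "A" & value > 10
both <- filter(df, category == "A", value > 10)
```

Here either keeps the rows with id 4, 5, 6, 9, and 10, while both keeps the rows with id 3, 7, and 10.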
2.2 Selecting Columns with Conditions
The select() function in dplyr allows you to choose specific columns based on conditions.
You can use helper functions like starts_with(), ends_with(), contains(), and matches() to
match column names.
Example:
library(dplyr)

# Select columns based on conditions
selected_columns <- select(df, starts_with("c"))

# Select columns using regex pattern
selected_columns <- select(df, matches("^c.*$"))
In this example, both select() calls pick the columns whose names start with "c"; in df,
that is only the category column. The first call uses the starts_with() helper function.
The second call uses matches() with the regular expression ^c.*$ to achieve the same
result.
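The other helpers named above, ends_with() and contains(), follow the same pattern. This sketch redefines df so that it runs on its own; the result names are illustrative.

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

# Columns whose names end with "e": only value
by_suffix <- select(df, ends_with("e"))

# Columns whose names contain "a": category and value
by_substring <- select(df, contains("a"))

# A minus sign drops a column instead of keeping it
without_id <- select(df, -id)
```

By default these helpers ignore case when matching column names.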
2.3 Arranging Rows
The arrange() function in dplyr allows you to sort rows based on one or more columns. By
default, rows are sorted in ascending order, but you can use the desc() function to sort in
descending order.
Example:
library(dplyr)

# Arrange rows based on a column
arranged_data <- arrange(df, value)

# Arrange rows based on multiple columns
arranged_data <- arrange(df, category, desc(value))
In this example, we first use the arrange() function to sort the rows in ascending order based
on the value column. The second arrange() call sorts the rows first by category in ascending
order and then by value in descending order.
2.4 Creating New Variables
The mutate() function in dplyr allows you to create new variables based on existing variables.
You can perform calculations or transformations on the existing columns and store the results
in new columns.
Example:
library(dplyr)

# Create a new variable based on existing variables
mutated_data <- mutate(df, value_squared = value^2)
In this example, we use the mutate() function to create a new variable called value_squared.
We calculate the square of the value column and store the result in the new column.
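mutate() can also create several columns in one call, and a later column may refer to one created earlier in the same call. The sketch below assumes the same df as above; the is_large and size columns, and the thresholds behind them, are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

mutated <- mutate(
  df,
  value_squared = value^2,
  is_large = value_squared > 400,              # uses the column defined just above
  size = if_else(value >= 20, "high", "low")   # conditional transformation
)
```

The original columns are kept, and the three new columns are appended on the right.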
2.5 Summarizing Data
The summarize() function in dplyr computes summary statistics or aggregate values.
Combined with group_by(), it computes them separately for each group defined by one or
more variables. You can use functions like mean(), sum(), min(), max(), and n() (count)
to summarize the data.
Example:
library(dplyr)

# Summarize data based on a group variable
summary <- df %>%
  group_by(category) %>%
  summarize(avg_value = mean(value), total_count = n())
In this example, we first use the %>% operator (pipe) to chain operations together. We group the
data by the category variable using group_by(). Then, we use summarize() to calculate the
average value (avg_value) and the total count of rows (total_count) within each category.
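The other summary functions listed above slot into summarize() in the same way. A short sketch, again assuming the df from section 2.1 (the output column names are illustrative):

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

stats <- df %>%
  group_by(category) %>%
  summarize(
    total = sum(value),
    smallest = min(value),
    largest = max(value)
  )
```

The result has one row per category; for category "A", for instance, total is 59, smallest is 10, and largest is 22.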
These advanced techniques using packages such as dplyr and tidyr allow you to perform
complex data manipulation tasks in R efficiently. By filtering rows, selecting columns,
arranging rows, creating new variables, and summarizing data, you can transform and analyze
your data effectively.
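These verbs are often chained into a single pipeline with %>%. The following sketch combines filtering, creating a variable, sorting, and selecting on the same df; the particular steps are chosen for illustration only.

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  category = c("A", "B", "A", "C", "B", "C", "A", "B", "C", "A"),
  value = c(10, 20, 15, 5, 25, 30, 12, 18, 8, 22)
)

result <- df %>%
  filter(value > 10) %>%                   # keep rows with value above 10
  mutate(value_squared = value^2) %>%      # add a derived column
  arrange(desc(value)) %>%                 # largest value first
  select(id, category, value_squared)      # keep only the columns of interest
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.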

3 Reshaping, aggregating, and merging datasets


3.1 Reshaping Data
Reshaping data involves transforming the layout of your dataset from one form to another.
The tidyr package provides functions like gather() and spread() for reshaping data between
wide and long formats.
Example:
library(tidyr)

# Create a wide-format dataset
wide_data <- data.frame(
  id = 1:3,
  category = c("A", "B", "C"),
  value_1 = c(10, 20, 30),
  value_2 = c(15, 25, 35)
)

# Reshape from wide to long format
long_data <- gather(wide_data, key = "variable", value = "value", -id, -category)
In this example, we start with a wide-format dataset where each measurement (value_1 and
value_2) has its own column. We use the gather() function to reshape the data into a long
format. The key argument specifies the name of the column that will store the original
column names, and the value argument specifies the name of the column that will store the
corresponding values. We exclude the id and category columns from the reshaping, so they
are repeated as identifiers. The resulting long_data has six rows, one for each combination
of id and variable.
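The reverse direction, from long back to wide, uses spread(). The sketch below completes the round trip and also shows pivot_longer(), which current tidyr documentation recommends in place of gather(); that call is an addition for comparison and is not part of the original example.

```r
library(tidyr)

wide_data <- data.frame(
  id = 1:3,
  category = c("A", "B", "C"),
  value_1 = c(10, 20, 30),
  value_2 = c(15, 25, 35)
)

# Wide to long, as in the example above
long_data <- gather(wide_data, key = "variable", value = "value", -id, -category)

# Long back to wide: one column per distinct entry in "variable"
wide_again <- spread(long_data, key = variable, value = value)

# Wide to long using the newer interface
long_pivot <- pivot_longer(
  wide_data,
  cols = c(value_1, value_2),
  names_to = "variable",
  values_to = "value"
)
```

Both long versions contain the same six observations, although the rows are ordered differently.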
3.2 Aggregating Data
Aggregating data involves summarizing or collapsing your dataset based on one or more
variables. The dplyr package provides the group_by() and summarize() functions for
aggregating data based on groups.
Example:
library(dplyr)

# Create a dataset
df <- data.frame(
  category = c("A", "A", "B", "B", "B", "C"),
  value = c(10, 20, 15, 25, 30, 12)
)

# Aggregate data by category
aggregated_data <- df %>%
  group_by(category) %>%
  summarize(avg_value = mean(value), total_count = n())
In this example, we have a dataset with two columns: category and value. We use
the %>% operator to chain operations together. First, we group the data by the category variable
using group_by(). Then, we use summarize() to calculate the average value (avg_value) and
the total count of rows (total_count) within each category.
3.3 Merging Datasets
Merging datasets involves combining data from multiple datasets based on common columns
or keys. The dplyr package provides the inner_join() function for performing inner joins
between datasets.
Example:
library(dplyr)

# Create two datasets
df1 <- data.frame(
  id = c(1, 2, 3),
  value1 = c(10, 20, 30)
)

df2 <- data.frame(
  id = c(2, 3, 4),
  value2 = c(15, 25, 35)
)

# Merge datasets based on id
merged_data <- inner_join(df1, df2, by = "id")
In this example, we have two datasets, df1 and df2, each with an id column. We use
the inner_join() function to merge the datasets based on the common id column. The
resulting merged_data dataset contains only the rows where the id values are present in both
datasets.
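To make the effect of the inner join concrete, the sketch below repeats the merge and contrasts it with left_join(), another dplyr join that keeps every row of the first dataset. The left join is an addition for comparison and goes beyond what this section covers.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), value1 = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), value2 = c(15, 25, 35))

# Inner join: only ids 2 and 3 appear in both datasets
merged_data <- inner_join(df1, df2, by = "id")

# Left join: all rows of df1 are kept; value2 is NA where df2 has no match
left_data <- left_join(df1, df2, by = "id")
```

In merged_data the row with id 1 (only in df1) and the row with id 4 (only in df2) are both dropped, whereas left_data keeps id 1 with a missing value2.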
These techniques of reshaping, aggregating, and merging datasets in R are essential for data
manipulation and analysis. By reshaping your data, you can transform its structure to fit
specific analysis or visualization needs. Aggregating data allows you to summarize and gain
insights from your dataset. Merging datasets enables you to combine information from multiple
sources to create a comprehensive dataset for analysis.
