
UNIT 2

1. Loading the Dataset


Data can be imported from various sources such as CSV, Excel, databases,
or JSON files.

Example: Importing CSV

data <- read.csv("data.csv /Path to your dataset")


or

data <- read.csv("C:\\Users\\ADMIN\\Desktop\\DAVR\\claimants.csv")



Example: Excel File (Requires readxl package)

library(readxl)

data <- read_excel("data.xlsx")



Example: SQL Database Connection (requires DBI and RSQLite packages)

library(DBI)

conn <- dbConnect(RSQLite::SQLite(), "database.sqlite")

data <- dbGetQuery(conn, "SELECT * FROM table_name")

dbDisconnect(conn)  # close the connection when finished


2. Exploring Data
Once imported, the dataset needs to be inspected to understand its
structure.

View the first few rows:

head(data)

View the first 10 rows:

head(data,10)
View the last few rows:

tail(data)

View the last 10 rows:

tail(data,10)
Check structure and types of variables:

str(data)

Summary of dataset:

summary(data)
Change the contents of the data frame as follows:

data$Gender[data$Gender == 1] <- "Female"

data$Gender[data$Gender == 0] <- "Male"

The $ operator tells R that you are referring to a specific variable (column) within the data frame.
Finding Column Names:

colnames(data)

Renaming Column Names:

colnames(data)[colnames(data) == "V1"] <- "Gender"

Or rename all columns at once:

colnames(data) <- c("NewName1", "NewName2", "NewName3")
Breakdown of the data using the table() command:

table(data$Gender)

Now, you try it:

Change the name of the second column in 'my_data_frame' to 'Income'.
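One possible answer (a minimal sketch, assuming a data frame named my_data_frame already exists):

colnames(my_data_frame)[2] <- "Income"  # rename the second column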
Data Wrangling
As the upper right-hand pane of RStudio shows, the imported data have a certain number of observations and variables.
By default, R assumes that the first line of the file contains the variable names.
In versions of R before 4.0, read.csv converted strings to factors during import unless you told it not to.
This can become a problem if you try to perform string operations on variables that are actually factors.
To find the "class" of a variable (whether it is numeric, character, or factor), we can use the class() command.
If we want to prevent R from defaulting to the factor behaviour, we can add an option to our read.csv command, as shown below.

Options for most commands are specified after a comma, following the name of the object you want to apply the command to.
To find the options for an R command, we can run ?command_name (for example, ?read.csv).
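For example, a minimal sketch of disabling the factor conversion via the stringsAsFactors argument of read.csv:

# Keep string columns as character rather than converting them to factor
data <- read.csv("data.csv", stringsAsFactors = FALSE)

# Check the class of a column
class(data$category)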
Convert a column to a factor:

data$category <- as.factor(data$category)

Create a new column:

data$new_column <- data$old_column * 2


Installing Packages
Packages are user-contributed bundles of code.

They greatly extend R's functionality.


Installing a Package (only needs to be done once):

install.packages("haven")

haven is used for reading data from Stata, SPSS, and SAS.

Each time we want to use a package, we need to load it into R's memory as follows:
library(haven)

We can also write:

require("haven")
Subsetting the Data
Cleaning, reshaping, or "wrangling" data is a core task of computational sociology and "data science" more broadly.

A recent New York Times article suggests 80% of data scientists' time is spent cleaning data, while only 20% of their time is spent analyzing it.
Filtering & Selecting Data
To extract meaningful information, we filter and select relevant columns.

Select specific columns:

selected_data <- data[, c("column1", "column2")]

Filter rows based on conditions:

filtered_data <- subset(data, column1 > 50 & column2 == "Yes")


Using dplyr for filtering & selecting:

library(dplyr)

filtered_data <- data %>%
  filter(column1 > 50, column2 == "Yes") %>%
  select(column1, column2, column3)
Aggregation and Summarization
Summarizing data helps in generating meaningful insights.

Group & summarize using dplyr:

summary_data <- data %>%
  group_by(category) %>%
  summarise(avg_value = mean(numeric_column, na.rm = TRUE),
            count = n())

Find the total sum of a column:

sum(data$numeric_column, na.rm = TRUE)
Merging and Joining Data
Data from multiple sources can be combined using joins.

Merging datasets (inner_join from dplyr):

merged_data <- inner_join(data1, data2, by = "common_column")

Inner join: returns only the rows where there is a match in both datasets. Rows that do not have matching values in both tables are removed.
Other types of joins:

Left join: returns all rows from data1 and only matching rows from data2. If there is no match in data2, missing values (NA) are inserted.

left_join(data1, data2, by = "common_column")  # Keeps all rows from data1

Right join: returns all rows from data2 and only matching rows from data1. If there is no match in data1, missing values (NA) are inserted.

right_join(data1, data2, by = "common_column")  # Keeps all rows from data2

Full join: returns all rows from both datasets. If there is no match, NA is used.

full_join(data1, data2, by = "common_column")  # Keeps all rows from both

Join Type  | Rows from data1 | Rows from data2 | Keeps only matching rows? | NA where no match?
Inner Join | Only matching   | Only matching   | Yes                       | No
Left Join  | All             | Only matching   | No                        | Yes (for data2 values)
Right Join | Only matching   | All             | No                        | Yes (for data1 values)
Full Join  | All             | All             | No                        | Yes (for both)

There are many different commands for merging data frames in R (e.g. the merge command in base R).

Here, we will use the plyr package because it is more powerful, faster, and easier to use.

install.packages("plyr")

library(plyr)
The command for merging datasets in plyr is called join:

merged_data <- join(data1, data2)

This command from the plyr package automatically searches for column names that are shared by both files.
The %in% operator identifies common elements in two vectors.

data1$race %in% data2$race

Output: a logical vector (TRUE where the value also appears in data2$race, FALSE otherwise).

We can also combine %in% with ! to negate it:

!(data1$race %in% data2$race)

Output: the reversed logical vector (FALSE, TRUE).
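A minimal sketch with hypothetical race values to make the output concrete:

data1 <- data.frame(race = c("white", "black", "asian"))
data2 <- data.frame(race = c("black", "asian", "other"))

data1$race %in% data2$race     # FALSE  TRUE  TRUE
!(data1$race %in% data2$race)  # TRUE  FALSE FALSE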


Reshaping Data
Reshaping data helps in rearranging it for better analysis.
Wide to long format (tidyr package):

library(tidyr)

long_data <- pivot_longer(data, cols = c("Jan", "Feb", "Mar"),
                          names_to = "Month", values_to = "Value")

Long to wide format:

wide_data <- pivot_wider(long_data, names_from = "Month",
                         values_from = "Value")
Handling Missing Values
Missing values are data points that are absent for a specific variable in a
dataset.

They can be represented in various ways, such as blank cells, null values,
or special symbols like “NA” or “unknown.”

These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.
Missing values can pose a significant challenge in data analysis, as they
can:

Reduce the sample size: This can decrease the accuracy and reliability of
your analysis.
Introduce bias: If the missing data is not handled properly, it can bias the
results of your analysis.
Make it difficult to perform certain analyses: Some statistical techniques require complete data for all variables, making them inapplicable when missing values are present.
Why Is Data Missing From the Dataset?

Data can be missing for many reasons like technical issues, human errors,
privacy concerns, data processing issues, or the nature of the variable
itself.
Understanding the cause of missing data helps choose appropriate
handling strategies and ensure the quality of your analysis.

It’s important to understand the reasons behind missing data:


Identifying the type of missing data: Is it Missing Completely at Random
(MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR)?
Evaluating the impact of missing data: Is the missingness causing bias or
affecting the analysis?
Choosing appropriate handling strategies: Different techniques are
suitable for different types of missing data.
Types of Missing Values :
There are three main types of missing values:
Missing Completely at Random (MCAR): MCAR is a specific type of missing data in which
the probability of a data point being missing is entirely random and independent of any
other variable in the dataset. In simpler terms, whether a value is missing or not has
nothing to do with the values of other variables or the characteristics of the data point
itself.
Missing at Random (MAR): MAR is a type of missing data where the probability of a data
point missing depends on the values of other variables in the dataset, but not on the
missing variable itself. This means that the missingness mechanism is not entirely random,
but it can be predicted based on the available information.
Missing Not at Random (MNAR): MNAR is the most challenging type of missing data to deal
with. It occurs when the probability of a data point being missing is related to the missing
value itself. This means that the reason for the missing data is informative and directly
associated with the variable that is missing.
Methods of Handling Missing Values:

Check for missing values:

sum(is.na(data))

Column-wise summation of missing values:

colSums(is.na(data))

Remove rows with missing values:

data <- na.omit(data)
Missing Values Imputation – filling missing values with mean, median
and mode

Mean Imputation:
Replaces missing values with the average of the non-missing values in that
column.
Median Imputation:
Replaces missing values with the middle value when the non-missing values
are sorted in ascending order.
Mode Imputation:
Replaces missing values with the most frequent value in that column, suitable
for categorical data.
Code Examples

# Sample data with missing values

data <- data.frame(
  numeric_col = c(1, 2, NA, 4, 5),
  categorical_col = c("A", "B", NA, "C", "B")
)
# --- Mean Imputation ---
# Calculate the mean of numeric_col

mean_val <- mean(data$numeric_col, na.rm = TRUE)


# Replace NA values with the mean

data$numeric_col[is.na(data$numeric_col)] <- mean_val


# --- Median Imputation ---
# (Each method is shown independently; in practice, choose one.)
# Calculate the median of numeric_col

median_val <- median(data$numeric_col, na.rm = TRUE)

# Replace NA values with the median

data$numeric_col[is.na(data$numeric_col)] <- median_val


# --- Mode Imputation ---
# Define a function to calculate the mode (most frequent non-missing value)

get_mode <- function(x) {
  ux <- unique(na.omit(x))  # ignore NAs when counting
  ux[which.max(tabulate(match(x, ux)))]
}

mode_val <- get_mode(data$categorical_col)

# Replace NA values with the mode

data$categorical_col[is.na(data$categorical_col)] <- mode_val

# Display the imputed data


print(data)
Removing Duplicates
1. R base functions:
duplicated(): for identifying duplicated elements, and
unique(): for extracting unique elements.

2. distinct() [dplyr package]: to remove duplicate rows in a data frame.
Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

To find the positions of duplicate elements in x, use:

duplicated(x)

O/P: [1] FALSE TRUE FALSE FALSE TRUE FALSE

Extract the duplicate elements:

x[duplicated(x)]

O/P: [1] 1 4
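To complement duplicated(), a minimal sketch of unique() and dplyr's distinct(), using a small hypothetical data frame:

unique(x)          # [1] 1 4 5 6 - keeps one copy of each value
x[!duplicated(x)]  # same result

library(dplyr)
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
distinct(df)       # removes the duplicated row, leaving 2 rows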
Feature Encoding
Feature encoding is a technique for converting categorical or non-numeric data into a numeric representation.

# Finding the unique values in a column

unique_values <- unique(data$column_name)

# Example output: "group_A" "group_B" "group_C"
One-Hot Encoding
One-hot encoding is a way to represent categorical variables as numeric values.

Each unique value in the categorical column is transformed into a separate column with binary values (0s and 1s) indicating the presence or absence of that value.

The original data frame "data" is combined with the encoded columns using the "cbind" function, resulting in the new data frame.
One-Hot Encoding Using model.matrix()
# 1. Assume 'original_data' is your data frame
# 2. 'CategoryColumn' is the categorical variable you want to encode

# Create a new data frame with one-hot encoded variables
# (the "- 1" removes the intercept so every level gets its own column)

encoded_cols <- as.data.frame(
  model.matrix(~ CategoryColumn - 1, data = original_data)
)

# Add the encoded columns back to the original data

encoded_data <- cbind(original_data, encoded_cols)
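For instance, with a hypothetical CategoryColumn containing levels "A", "B", and "C", model.matrix() produces one indicator column per level:

original_data <- data.frame(CategoryColumn = factor(c("A", "B", "A", "C")))

encoded_cols <- as.data.frame(model.matrix(~ CategoryColumn - 1, data = original_data))
print(encoded_cols)
#   CategoryColumnA CategoryColumnB CategoryColumnC
# 1               1               0               0
# 2               0               1               0
# 3               1               0               0
# 4               0               0               1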
Ordinal encoding
Ordinal encoding is used to represent categorical variables with ordered
or hierarchical relationships using numerical values.
In the case of the “PracticeSport” column, we assigned numerical
values to the categories “never”, “sometimes”, and “regularly” based on
their order. “never” was assigned 0, “sometimes” was assigned 1, and
“regularly” was assigned 2.
# Create a copy of the original data frame
encoded_data <- data

# Define the mapping of categories to numerical values

mapping <- c("never" = 0, "sometimes" = 1, "regularly" = 2)

# Apply the ordinal encoding

encoded_data$PracticeSport <- mapping[as.character(encoded_data$PracticeSport)]

# Print the first few rows of the encoded data frame

head(encoded_data)
Feature Scaling
Datasets often contain features that are measured on different scales.

When using the gradient descent algorithm to train machine learning models, it is advised to use features that are on the same scale, so that training is stable and fast.

There are different methods of feature scaling, such as standardization and normalization, the latter also known as Min-Max scaling.
Standardization
Standardization puts variables on a common scale. Imagine a group of friends of different heights and weights: the raw numbers are measured in different units and are difficult to compare directly.

standardized_data <- data.frame(data)

# Create a new variable for the standardized column

standardized_data$column_name_scaled <- scale(standardized_data$column_name)

# Print the first few rows of the standardized data

head(standardized_data)
Z-Score Standardization (Standardization)
Z-score standardization (also called standard scaling) transforms the data so that it has a mean (μ) of 0 and a standard deviation (σ) of 1.

Formula:

Z = (X − μ) / σ

where:
X = original value
μ = mean of the dataset
σ = standard deviation of the dataset
Normalization
Normalization adjusts the values so that they all fall within a selected range, typically 0 to 1.
This puts variables on a comparable footing, making the data easier to examine and analyze.

# Min-Max normalization of all numeric columns
starwars_normalized <- starwars_transformed %>%
  mutate(across(where(is.numeric), ~ (. - min(.)) / (max(.) - min(.))))
Min-Max Normalization (Scaling between 0 and 1)
Min-Max normalization scales values within a fixed range, typically [0, 1].

Formula:

X_norm = (X − X_min) / (X_max − X_min)

where:
X = original value
X_min = minimum value in the dataset
X_max = maximum value in the dataset
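A minimal sketch of this formula as an R function, applied to a hypothetical vector:

min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

x <- c(10, 20, 30, 40, 50)
min_max(x)  # 0.00 0.25 0.50 0.75 1.00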
Handling outliers
Outliers are extreme values that significantly deviate from the majority of data points.
They can skew analysis results and need proper handling.

1. Detecting Outliers
Before handling outliers, we must detect them using various methods.
A. Using Summary Statistics (IQR Method)
The Interquartile Range (IQR) method helps detect outliers.
Formula:

IQR = Q3 − Q1
Lower bound = Q1 − 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR

Outliers are values outside these bounds.


Using Boxplots

A boxplot visually displays outliers.

boxplot(data$column_name, main = "Boxplot of column_name", col = "lightblue")


# Compute Q1, Q3, and IQR
Q1 <- quantile(data$column_name, 0.25)
Q3 <- quantile(data$column_name, 0.75)
IQR_value <- IQR(data$column_name)

# Calculate lower and upper bounds

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Find outliers
outliers <- data$column_name[data$column_name < lower_bound |
                             data$column_name > upper_bound]
print(outliers)
2. Using Z-Score (Standardization)

Z-score measures how far a data point is from the mean in terms of standard deviations.
Formula:

Z = (X − μ) / σ

Outliers have Z-score > 3 or < -3.

# Compute Z-score
data$z_score <- (data$column_name - mean(data$column_name)) / sd(data$column_name)

# Find outliers (Z-score > 3 or < -3)

outliers <- data[data$z_score > 3 | data$z_score < -3, ]
print(outliers)
Summary Statistics
Summary statistics are numerical measures that describe key characteristics of a dataset.
The main summary statistics include:
Mean, median, minimum value, maximum value, 1st quartile value & 3rd quartile value.
1.1 Central Tendency
Central tendency refers to the measure that identifies the central or
typical value in a dataset. The three primary measures are:
Mean (Average)
The mean is the sum of all values divided by the total number of observations.
Formula:

x̄ = (x₁ + x₂ + … + xₙ) / n = Σxᵢ / n
Median (Middle Value)
The median is the middle number when the data is arranged in
ascending order.
If the dataset has an odd number of observations, the median is the
middle value.
If it has an even number of observations, the median is the average of
the two middle values.
Example: For (50,000, 60,000, 70,000, 80,000, 90,000), the median is 70,000 (the middle value).
For (50,000, 60,000, 70,000, 80,000), the median is the average of the two middle values: (60,000 + 70,000) / 2 = 65,000.
Mode (Most Frequent Value)
The mode is the most frequently occurring value in the dataset.
Example: If exam scores are (85, 90, 88, 90, 92, 85, 90), then 90 is the
mode (occurs the most times).
A dataset can have one mode (unimodal), two modes (bimodal), or
multiple modes (multimodal).
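A minimal sketch of computing all three in R, using the exam scores above (base R has no built-in function for the statistical mode, so we tabulate frequencies):

scores <- c(85, 90, 88, 90, 92, 85, 90)

mean(scores)    # 88.57 (approximately)
median(scores)  # 90

# Mode: the most frequent value
names(sort(table(scores), decreasing = TRUE))[1]  # "90"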
2. Variability/Dispersion (Spread of Data)
Variability measures how spread out the data is. The key measures of variability include:
Range
The simplest measure of variability.
Formula:

Range = Maximum value − Minimum value

Example: If the salaries of employees are (50,000, 60,000, 70,000, 80,000, 90,000):
Range = 90,000 − 50,000 = 40,000
Limitation: The range considers only the extreme values and ignores the distribution of data in between.
Variance (σ²)
Variance measures the average squared deviation from the mean.
Formula for a population:

σ² = Σ(xᵢ − μ)² / N

Formula for a sample:

s² = Σ(xᵢ − x̄)² / (n − 1)

Example Calculation:
If the dataset is (10, 20, 30, 40, 50):
Mean = (10 + 20 + 30 + 40 + 50) / 5 = 30
Differences from mean: (-20, -10, 0, 10, 20)
Squared Differences: (400, 100, 0, 100, 400)
Variance = (400 + 100 + 0 + 100 + 400) / 5 = 200
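The same calculation in R. Note that the built-in var() and sd() use the sample formulas (dividing by n − 1), so they differ from the population variance computed in the example above:

x <- c(10, 20, 30, 40, 50)

var(x)                 # 250 (sample variance, divides by n - 1)
sd(x)                  # 15.81 (sample standard deviation)

mean((x - mean(x))^2)  # 200 (population variance, as in the worked example)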
Standard Deviation (σ or s)
Standard deviation is the square root of variance.
Formula:

σ = √(Σ(xᵢ − μ)² / N)

It measures how much the data deviates from the mean.

A low standard deviation means the values are close to the mean, while a high standard deviation indicates they are spread out.
Quartiles (Q1, Q2, Q3)
Quartiles divide the dataset into four equal parts:
Q1 (First Quartile - 25th percentile): The middle value between the
minimum and median.
Q2 (Second Quartile - 50th percentile): The median.
Q3 (Third Quartile - 75th percentile): The middle value between the
median and the maximum.
Interquartile Range (IQR)
The Interquartile Range (IQR) is:

IQR = Q3 − Q1

Example: If the dataset is (10, 20, 30, 40, 50, 60, 70, 80, 90, 100), then:
Q1 = 30, Q2 = 50, Q3 = 70
IQR = 70 − 30 = 40
IQR is useful for detecting outliers (values beyond Q1 − 1.5 × IQR or Q3 + 1.5 × IQR).
Data Distributions and Histograms
Data Distributions
A distribution shows how data values are spread.
Common distributions include:
Normal Distribution (Bell Curve): Most values cluster around the mean.
Skewed Distribution: Data is asymmetric (left-skewed or right-skewed).
Uniform Distribution: All values occur with equal probability.
Histogram (Graphical Representation of Data Distribution)
A histogram divides data into bins (intervals) and counts occurrences.
Example: Given a set of exam scores, a histogram will show how frequently the scores fall into different ranges, as sketched below.
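A minimal sketch using base R's hist() with hypothetical exam scores:

scores <- c(55, 62, 68, 70, 71, 75, 78, 80, 85, 88, 90, 95)

hist(scores, breaks = 5, main = "Histogram of Exam Scores",
     xlab = "Score", col = "lightblue")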
Data Distributions: Skewness and Kurtosis
Skewness measures the asymmetry of a distribution around its mean; kurtosis measures the heaviness of its tails relative to a normal distribution.
Identifying Patterns and Anomalies
Patterns in Data
Trends: Long-term movement in data (e.g., increasing sales over time).
Seasonality: Repeated patterns at regular intervals (e.g., high sales during festivals).
Cyclic Patterns: Fluctuations without a fixed interval (e.g., economic cycles).
Detecting Anomalies (Outliers)
Outliers are data points that deviate significantly from the rest of the data.
Methods to detect outliers:
IQR Method: Data points beyond Q1 - 1.5 × IQR or Q3 + 1.5 × IQR.
Z-Score Method: Data points with Z-score > 3 or < -3.
Box Plots: Visual representation of outliers.
