Unit2
library(readxl)
library(DBI)
View the first few rows:
head(data)
head(data, 10)
View the last few rows:
tail(data)
tail(data,10)
Check structure and types of variables:
str(data)
Summary of dataset:
summary(data)
Change the contents of the data frame as follows:
The $ operator tells R that you are looking for a specific column (variable)
within the data frame.
Finding Column Names:
colnames(data)
Renaming a column:
colnames(data)[colnames(data) == "V1"] <- "Gender"
Or
colnames(data) <- c("NewName1", "NewName2", "NewName3")
Breakdown of the data using the table command:
table(data$Gender)
Options for most commands are specified after a comma that follows the name
of the object you want to apply the command to.
To find the options for an R command:
We can Run
Convert a column to a factor:
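The command for this step is missing from the notes. A minimal sketch, assuming a hypothetical data frame whose Gender column is coded as 1 and 2 (the codes and labels are illustrative, not taken from the original dataset):

```r
# Hypothetical data frame; the numeric codes 1 and 2 are assumed for illustration
data <- data.frame(Gender = c(1, 2, 2, 1, 2))

# Convert the numeric column to a factor with readable labels
data$Gender <- factor(data$Gender, levels = c(1, 2), labels = c("Male", "Female"))

str(data$Gender)    # now a factor with 2 levels
table(data$Gender)  # counts per category
```

Once the column is a factor, summary() and table() report counts per category instead of numeric statistics.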
install.packages("haven")
Every time we want to use a package, we need to load it into R's memory
as follows:
library(haven)
library(dplyr)
Inner join: Returns only the rows where there is a match in both
datasets. Rows that do not have matching values in both tables are
removed.
Other types of joins:
Left join: Returns all rows from data1 and only matching rows from data2. If
there is no match in data2, missing values (NA) are inserted.
Right join: Returns all rows from data2 and only matching rows from data1. If
there is no match in data1, missing values (NA) are inserted.
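The three join types can be sketched with base R's merge(), used here only so the example runs without extra packages; dplyr's inner_join(), left_join() and right_join() are the counterparts. The tables data1 and data2 and their contents are hypothetical.

```r
# Hypothetical example tables sharing the key column "id"
data1 <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ben", "Chen"))
data2 <- data.frame(id = c(2, 3, 4), score = c(85, 90, 75))

# Inner join: only ids present in both tables (2 and 3)
inner <- merge(data1, data2, by = "id")

# Left join: all rows of data1; score is NA where data2 has no match (id 1)
left <- merge(data1, data2, by = "id", all.x = TRUE)

# Right join: all rows of data2; name is NA where data1 has no match (id 4)
right <- merge(data1, data2, by = "id", all.y = TRUE)

print(inner)
print(left)
print(right)
```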
Here, we will use the plyr package because it is more powerful, faster,
and easier to use.
install.packages("plyr")
library(plyr)
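The command for merging datasets in plyr is called join. By default it performs a left join on all columns the two datasets share:
merged_data <- join(data1, data2)
# For an inner join instead: merged_data <- join(data1, data2, type = "inner")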
library(tidyr)
Missing values can be represented in various ways, such as blank cells, null values,
or special symbols like "NA" or "unknown".
Missing values can:
Reduce the sample size: This can decrease the accuracy and reliability of
your analysis.
Introduce bias: If the missing data is not handled properly, it can bias the
results of your analysis.
Make it difficult to perform certain analyses: Some statistical techniques
require complete data for all variables, making them inapplicable when
missing values are present.
Why Is Data Missing From the Dataset?
Data can be missing for many reasons, such as technical issues, human errors,
privacy concerns, data processing issues, or the nature of the variable
itself.
Understanding the cause of missing data helps choose appropriate
handling strategies and ensure the quality of your analysis.
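Before choosing a strategy, it helps to see where values are missing. A minimal sketch in base R, using a hypothetical data frame df (the column names and values are made up for illustration):

```r
# Hypothetical data frame with some missing values
df <- data.frame(
  age  = c(25, NA, 31, 40),
  city = c("Pune", "Delhi", NA, "Goa")
)

is.na(df$age)                        # flags which ages are missing
missing_per_column <- colSums(is.na(df))  # count of missing values per column
print(missing_per_column)

# Keep only the rows that have no missing values in any column
complete_rows <- df[complete.cases(df), ]
print(complete_rows)
```

Dropping incomplete rows is the simplest option, but as noted above it shrinks the sample, so imputation is often preferred when many rows are affected.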
Mean Imputation:
Replaces missing values with the average of the non-missing values in that
column.
Median Imputation:
Replaces missing values with the middle value when the non-missing values
are sorted in ascending order.
Mode Imputation:
Replaces missing values with the most frequent value in that column, suitable
for categorical data.
Code Examples
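The code for this part is not present in the notes. The following is a sketch of the three imputation methods in base R, assuming a hypothetical data frame with a numeric salary column and a categorical dept column. The helper stat_mode is an illustrative name, not a built-in function.

```r
# Hypothetical data with missing values in a numeric and a categorical column
df <- data.frame(
  salary = c(50000, NA, 70000, 80000, NA),
  dept   = c("HR", "IT", NA, "IT", "IT"),
  stringsAsFactors = FALSE
)

# Mean imputation: replace NA with the mean of the observed values
salary_mean <- df$salary
salary_mean[is.na(salary_mean)] <- mean(df$salary, na.rm = TRUE)

# Median imputation: replace NA with the median of the observed values
salary_median <- df$salary
salary_median[is.na(salary_median)] <- median(df$salary, na.rm = TRUE)

# Mode imputation: base R has no statistical mode function, so use a helper
stat_mode <- function(x) {
  counts <- table(x)                # table() ignores NA by default
  names(counts)[which.max(counts)]  # most frequent value (first one on ties)
}
dept_mode <- df$dept
dept_mode[is.na(dept_mode)] <- stat_mode(df$dept)

print(salary_mean)
print(salary_median)
print(dept_mode)
```

The median is less affected by extreme values than the mean, so it is usually the safer choice for skewed numeric columns.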
When training machine learning algorithms with gradient descent, it is
advisable to use features that are on the same scale so that training is
stable and fast.
Formula:
# Min-Max Normalization: rescale every numeric column to [0, 1]
starwars_normalized <- starwars_transformed %>%
  mutate(across(where(is.numeric),
                ~ (.x - min(.x, na.rm = TRUE)) /
                  (max(.x, na.rm = TRUE) - min(.x, na.rm = TRUE))))
Min-Max Normalization (Scaling between 0 and 1)
Min-Max normalization scales values within a fixed range, typically [0,1].
Formula:
X' = (X - Xmin) / (Xmax - Xmin)
where:
X = Original value
Xmin = Minimum value in the dataset
Xmax = Maximum value in the dataset
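As a small worked example of the definitions above, the sketch below applies min-max scaling to a made-up vector. The function name min_max and the sample values are assumptions for illustration.

```r
# Min-max scaling of a numeric vector to the range [0, 1]
min_max <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  (x - x_min) / (x_max - x_min)
}

x <- c(10, 20, 30, 40, 50)  # hypothetical values
scaled <- min_max(x)
print(scaled)  # 0.00 0.25 0.50 0.75 1.00
```

If every value in the column is identical, Xmax - Xmin is zero and the result is NaN, so constant columns need separate handling.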
Handling Outliers
Outliers are extreme values that significantly deviate from the majority of data points.
They can skew analysis results and need proper handling.
1. Detecting Outliers
Before handling outliers, we must detect them using various methods.
A. Using Summary Statistics (IQR Method)
The Interquartile Range (IQR) method helps detect outliers.
Formula:
IQR = Q3 - Q1
Lower bound = Q1 - 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR
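# Find outliers in the mpg column using the IQR bounds
q <- quantile(data$mpg, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
lower_bound <- q[1] - 1.5 * iqr
upper_bound <- q[2] + 1.5 * iqr
outliers <- data$mpg[data$mpg < lower_bound | data$mpg > upper_bound]
print(outliers)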
B. Using Z-Score (Standardization)
Z-score measures how far a data point is from the mean in terms of standard
deviations.
Formula:
Z = (X - μ) / σ
where μ is the mean and σ is the standard deviation of the data.
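A sketch of z-score based detection on a made-up vector follows. The values and the cutoff of 3 are illustrative assumptions, and sd() computes the sample standard deviation (dividing by n - 1).

```r
# Hypothetical measurements with one extreme value
x <- c(48, 50, 52, 49, 51, 50, 47, 53, 50, 49, 51, 50, 120)

# Z-score: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

# Flag points more than 3 standard deviations from the mean
outliers <- x[abs(z) > 3]
print(round(z, 2))
print(outliers)
```

In very small samples a single extreme value inflates the standard deviation so much that no point can reach a z-score of 3, which is one reason the IQR method is often preferred for small datasets.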
Range
Range = Maximum value - Minimum value
Example: If the salaries of employees are (50,000, 60,000, 70,000, 80,000, 90,000):
Range = 90,000 - 50,000 = 40,000
Limitation: The range considers only the extreme values and ignores the distribution
of data in between.
Variance (σ²)
Variance measures the average squared deviation from the mean.
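Formula for a population:
σ² = (1/N) × Σ (xi - μ)²
where N is the number of values, xi is each value, and μ is the population mean.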
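The range and population variance can be checked in R with the salary values from the range example. Note that R's built-in var() computes the sample variance (dividing by N - 1), so the population version is written out by hand in this sketch.

```r
# Salary values from the range example
x <- c(50000, 60000, 70000, 80000, 90000)

# Range: maximum minus minimum
value_range <- diff(range(x))

# Population variance: average squared deviation from the mean (divides by N)
pop_var <- mean((x - mean(x))^2)

# Sample variance: R's var() divides by N - 1 instead
samp_var <- var(x)

print(value_range)  # 40000
print(pop_var)      # 2e+08
print(samp_var)     # 2.5e+08
```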
Interquartile Range (IQR)
Example: If the dataset is (10, 20, 30, 40, 50, 60, 70, 80, 90), then:
Q1 = 30, Q2 = 50, Q3 = 70
IQR = Q3 - Q1 = 70 - 30 = 40
IQR is useful for detecting outliers (values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR).
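A short sketch of the quartile and IQR calculation in R, assuming the nine values 10, 20, ..., 90 as data. It uses quantile() with its default method (type 7); other quartile conventions can give slightly different values on small datasets.

```r
# Nine evenly spaced values, assumed for illustration
x <- seq(10, 90, by = 10)

# Quartiles using R's default quantile method (type 7)
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(x)

# Outlier fences based on the 1.5 x IQR rule
lower <- q[["25%"]] - 1.5 * iqr
upper <- q[["75%"]] + 1.5 * iqr

print(q)               # 30, 50, 70
print(iqr)             # 40
print(c(lower, upper)) # -30, 130
```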
Data Distributions and Histograms
Data Distributions
A distribution shows how data values are spread.
Common distributions include:
Normal Distribution (Bell Curve): Most values cluster around the mean.
Skewed Distribution: Data is asymmetric (left-skewed or right-skewed).
Uniform Distribution: All values occur with equal probability.
Histogram (Graphical Representation of Data Distribution)
A histogram divides data into bins (intervals) and counts occurrences.
Example: Consider the following exam scores:
The histogram will show how frequently scores fall into different ranges.
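The exam scores themselves are not included in the notes, so the sketch below uses made-up scores and bins of width 10 purely to illustrate how hist() counts values per interval.

```r
# Hypothetical exam scores (not from the original notes)
scores <- c(45, 52, 58, 61, 64, 67, 70, 72, 75, 78, 81, 85, 88, 92, 97)

# Bins of width 10 from 40 to 100; each bin includes its right endpoint
h <- hist(scores,
          breaks = seq(40, 100, by = 10),
          main = "Exam scores",
          xlab = "Score")

print(h$counts)  # 1 2 4 3 3 2
```

Because each bin is closed on the right by default, a score of 70 is counted in the 60 to 70 bin rather than the 70 to 80 bin.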
Data Distributions
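Skewness: Measures the asymmetry of a distribution around its mean.
Kurtosis: Measures how heavy the tails of a distribution are compared with a normal distribution.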
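Base R has no built-in skewness or kurtosis functions, so the sketch below computes the moment-based versions by hand. The function names and sample vectors are assumptions for illustration; add-on packages may use slightly different (bias-corrected) formulas.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based kurtosis: fourth central moment over variance^2
# (a normal distribution has kurtosis 3 under this definition)
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

symmetric    <- c(1, 2, 3, 4, 5)       # hypothetical symmetric data
right_skewed <- c(1, 1, 2, 2, 3, 10)   # hypothetical data with a long right tail

print(skewness(symmetric))     # 0 for perfectly symmetric data
print(skewness(right_skewed))  # positive for right-skewed data
print(kurtosis(symmetric))     # 1.7
```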
Identifying Patterns and Anomalies
Patterns in Data
Trends: Long-term movement in data (e.g., increasing sales over time).
Seasonality: Repeated patterns at regular intervals (e.g., high sales during festivals).
Cyclic Patterns: Fluctuations without a fixed interval (e.g., economic cycles).
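Trend and seasonality can be separated with decompose() from base R's stats package. The sketch below uses the built-in AirPassengers monthly series as a stand-in dataset, which is an assumption and not part of the original notes.

```r
# Built-in monthly airline passenger counts, 1949 to 1960
data("AirPassengers")

# Split the series into trend, seasonal and random components
parts <- decompose(AirPassengers)

head(parts$seasonal, 12)  # one full seasonal cycle (12 months)
plot(parts)               # panels: observed, trend, seasonal, random
```

decompose() assumes a fixed seasonal period, so it captures trends and seasonality but not cyclic patterns with irregular length.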
Detecting Anomalies (Outliers)
Outliers are data points that deviate significantly from the rest of the data.
Methods to detect outliers:
IQR Method: Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
Z-Score Method: Data points with Z-score > 3 or < -3.
Box Plots: Visual representation of outliers.
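For the box plot method, boxplot.stats() returns the same points that boxplot() draws as individual dots. A sketch on made-up values (the vector x is hypothetical):

```r
# Hypothetical values with one extreme observation
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 45)

# Points lying more than 1.5 x IQR beyond the box are reported in $out
bp <- boxplot.stats(x)
print(bp$out)  # 45

# Draw the box plot; the outlier appears as a separate point
boxplot(x, horizontal = TRUE, main = "Box plot with one outlier")
```

boxplot.stats() bases the box on hinges, which can differ slightly from the quartiles returned by quantile() on small samples.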