
Experiment 5

The document demonstrates various techniques for loading, cleaning, and summarizing data in R. It shows how to import data from CSV and Excel files, remove rows with missing values, replace missing values with means, remove duplicate rows, and drop specific rows. Descriptive statistics like mean, median, mode, variance, and quantiles are calculated on the loaded data.

Uploaded by

anjanisvecw2004

EXPERIMENT 5

#LOAD DATA FROM DIFFERENT FILES LIKE CSV AND EXCEL IN R

getwd()

setwd("C:/Users/HP/Desktop/R LAB PROGRAMS")

OUTPUT:

> getwd()

[1] "C:/Users/HP/Documents"

> setwd("C:/Users/HP/Desktop/R LAB PROGRAMS")

#IMPORTING DATA USING READ.CSV

f1<- read.csv("marks.csv")

f1

#length of data frame (for a data frame, length() returns the number of columns)

length(f1)

#summary of the data frame

summary(f1)

OUTPUT:

Student.ID Name JAVA.MARKS DM.MARKS OS.MARKS CGPA

1 101 Alice 85 90 88 3.8

2 102 Bob 78 85 80 3.4

3 103 Charlie 92 88 95 4.0

4 104 David 75 82 77 3.2

5 105 Emily 88 95 89 3.9

6 106 Frank 80 78 82 3.1

7 107 Grace 95 92 97 4.2

8 108 Henry 70 75 67 2.9

9 109 Irene 89 87 91 3.7

10 110 Jack 84 91 86 3.6

> length(f1)

[1] 6
> summary(f1)

Student.ID Name JAVA.MARKS DM.MARKS


Min. :101.0 Length:10 Min. :70.00 Min. :75.00
1st Qu.:103.2 Class :character 1st Qu.:78.50 1st Qu.:82.75
Median :105.5 Mode :character Median :84.50 Median :87.50
Mean :105.5 Mean :83.60 Mean :86.30
3rd Qu.:107.8 3rd Qu.:88.75 3rd Qu.:90.75
Max. :110.0 Max. :95.00 Max. :95.00
OS.MARKS CGPA
Min. :67.0 Min. :2.900
1st Qu.:80.5 1st Qu.:3.250
Median :87.0 Median :3.650
Mean :85.2 Mean :3.580
3rd Qu.:90.5 3rd Qu.:3.875
Max. :97.0 Max. :4.200
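
length() on a data frame counts columns, which is why the output above is 6 rather than 10. A minimal sketch of the row and column helpers follows; marks_demo is a hypothetical stand-in built from the first three rows of marks.csv, not part of the original script.

```r
# Hypothetical stand-in for the first three rows of marks.csv
marks_demo <- data.frame(
  Student.ID = c(101, 102, 103),
  Name       = c("Alice", "Bob", "Charlie"),
  JAVA.MARKS = c(85, 78, 92),
  DM.MARKS   = c(90, 85, 88),
  OS.MARKS   = c(88, 80, 95),
  CGPA       = c(3.8, 3.4, 4.0)
)

n_cols <- length(marks_demo)  # a data frame is a list of columns, so this counts columns
n_rows <- nrow(marks_demo)    # number of rows (students)
dims   <- dim(marks_demo)     # rows and columns together

print(n_cols)
print(n_rows)
print(dims)
```

On the full marks.csv the same calls would be applied to f1.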

#DESCRIPTIVE STATISTICS

#DATA DESCRIPTION

#COMPUTING MEAN VALUE

mean = mean(f1$CGPA)

mean

#computing median

median = median(f1$CGPA)

median

#computing mode

install.packages("modeest")

library(modeest)

mode = mfv(f1$CGPA)

print(mode)

cat("mean, median & mode of the CGPA are: \n")

print(mean)

print(median)

print(mode)

OUTPUT:

> mean

[1] 3.58

> median

[1] 3.65

package ‘modeest’ successfully unpacked and MD5 sums checked

> print(mode)

[1] 2.9 3.1 3.2 3.4 3.6 3.7 3.8 3.9 4.0 4.2

mean, median & mode of the CGPA are:

[1] 3.58

[1] 3.65

[1] 2.9 3.1 3.2 3.4 3.6 3.7 3.8 3.9 4.0 4.2
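
Every CGPA value occurs exactly once, so mfv() reports all ten values as the mode. The same "most frequent value" idea can be sketched in base R without modeest; most_frequent is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector
most_frequent <- function(x) {
  counts <- table(x)                              # frequency of each distinct value
  modes  <- names(counts)[counts == max(counts)]  # keep every value tied for the top count
  as.numeric(modes)                               # table() stores the values as names
}

cgpa <- c(3.8, 3.4, 4.0, 3.2, 3.9, 3.1, 4.2, 2.9, 3.7, 3.6)  # CGPA column from the data above

all_unique_mode <- most_frequent(cgpa)           # every value ties, so all are returned
single_mode     <- most_frequent(c(1, 2, 2, 3))  # 2 occurs most often

print(all_unique_mode)
print(single_mode)
```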

##MEASURES OF VARIABILITY

max = max(f1$JAVA.MARKS)

cat("maximum value: \n",max)

min = min(f1$JAVA.MARKS)

cat("minimum value: \n",min)

OUTPUT:

maximum value:

95

minimum value:

70

#calculate range

range(f1$JAVA.MARKS)

#calculate range, difference between max and min

ranged = max - min

cat("Difference between max and min, Range is : ")

ranged

range(f1$JAVA.MARKS)

max(f1$JAVA.MARKS)

OUTPUT:

> range(f1$JAVA.MARKS)

[1] 70 95

Difference between max and min, Range is : > ranged

[1] 25

> range(f1$JAVA.MARKS)

[1] 70 95
> max(f1$JAVA.MARKS)

[1] 95
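
As a compact alternative to subtracting max and min by hand, the spread can be sketched with diff(range()) and the interquartile range with IQR(). The vector below copies the JAVA.MARKS column shown earlier; the variable names are illustrative.

```r
java <- c(85, 78, 92, 75, 88, 80, 95, 70, 89, 84)  # JAVA.MARKS column from the data above

spread <- diff(range(java))  # max minus min in one step
iqr    <- IQR(java)          # 75th percentile minus 25th percentile

print(spread)
print(iqr)
```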

#Data variability

#Variance and standard deviation

cat("Variance of the java marks: ")

variance = var(f1$JAVA.MARKS)

variance

stdevq = sd(f1$JAVA.MARKS)

meanvq = mean(f1$JAVA.MARKS)

cv = (stdevq/meanvq)* 100

cv

cat("Variance of the java marks: ",variance)

cat("Standard deviation of the java marks: ", stdevq)

varianced = var(f1$DM.MARKS)

stdeva = sd(f1$DM.MARKS)

meannd = mean(f1$DM.MARKS)

meannd

stdeva

cv1 = (stdeva/meannd)*100

cv1

cat("Variance of the dm marks: ",varianced)

cat("Standard deviation of dm marks: ",stdeva)

cat("S.D in java and dm marks: ",stdevq,stdeva)

print("Data variability in java and dm marks")

if(cv > cv1){
  print("Java marks have more variability than dm")
}else{
  print("Dm marks have more variability than java")
}

OUTPUT:
Variance of the java marks:

[1] 61.6

> cv

[1] 9.388238

Variance of the java marks: 61.6

Standard deviation of the java marks: 7.848567

> meannd

[1] 86.3

> stdeva

[1] 6.360468

> cv1

[1] 7.370183

Variance of the dm marks: 40.45556

Standard deviation of dm marks: 6.360468

S.D in java and dm marks: 7.848567 6.360468

[1] "Data variability in java and dm marks"

[1] "Java marks have more variability than dm"
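
The comparison above rests on the coefficient of variation, the standard deviation expressed as a percentage of the mean. A minimal sketch that wraps the calculation in a function is given below; coef_var is a hypothetical helper name, and the two vectors copy the JAVA.MARKS and DM.MARKS columns shown earlier.

```r
# Hypothetical helper: coefficient of variation as a percentage
coef_var <- function(x) sd(x) / mean(x) * 100

java <- c(85, 78, 92, 75, 88, 80, 95, 70, 89, 84)  # JAVA.MARKS
dm   <- c(90, 85, 88, 82, 95, 78, 92, 75, 87, 91)  # DM.MARKS

cv_java <- coef_var(java)
cv_dm   <- coef_var(dm)

# The subject with the larger coefficient of variation is relatively more spread out
more_variable <- if (cv_java > cv_dm) "java" else "dm"

print(cv_java)
print(cv_dm)
print(more_variable)
```

Because the coefficient of variation is unit-free, it allows a comparison even when the two subjects have different means.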

##QUANTILES OF THE DATA

quartiles = quantile(f1$JAVA.MARKS)

quartiles

probs = seq(0,1,0.25)

quantile(f1$JAVA.MARKS,probs)

OUTPUT:

> quartiles

0% 25% 50% 75% 100%

70.00 78.50 84.50 88.75 95.00

> quantile(f1$JAVA.MARKS,probs)

0% 25% 50% 75% 100%

70.00 78.50 84.50 88.75 95.00

#DECILES
probs = seq(0,1,0.1)

quantile(f1$JAVA.MARKS,probs)

OUTPUT:

> quantile(f1$JAVA.MARKS,probs)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

70.0 74.5 77.4 79.4 82.4 84.5 86.2 88.3 89.6 92.3 95.0

#PERCENTILES (reported here at every 10th percentile)

h = c(1:2000)

probs1 = seq(0,1,0.1)

quantile(h,probs1)

OUTPUT:

> quantile(h,probs1)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

1.0 200.9 400.8 600.7 800.6 1000.5 1200.4 1400.3 1600.2 1800.1 2000.0
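
The output above uses a step of 0.1, so it shows only every 10th percentile. A sketch of all 101 percentile points, assuming a step of 0.01 is what is wanted, could look like this (the variable name percentiles is illustrative):

```r
h <- 1:2000

# One probability per percentile: 0%, 1%, ..., 100%
percentiles <- quantile(h, probs = seq(0, 1, 0.01))

print(length(percentiles))         # number of percentile points
print(percentiles[c(2, 51, 100)])  # the 1st, 50th and 99th percentiles
```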

#IMPORTING DATA USING READ_EXCEL

getwd()

setwd("C:/Users/HP/Desktop/R LAB PROGRAMS")

path = "C:/Users/HP/Desktop/R LAB PROGRAMS/student-marks.xlsx"

library("readxl")

#skip=4 skips the first four rows of the sheet, so the header row is lost
#and the row for student 104 is read as the column names (see output)

ps = read_excel(path , skip=4)

ps

OUTPUT:

> ps

# A tibble: 6 × 6

`104` David `75` `82` `77` `3.2`

<dbl> <chr> <dbl> <dbl> <dbl> <dbl>

1 105 Emily 88 95 89 3.9


2 106 Frank 80 78 82 3.1

3 107 Grace 95 92 97 4.2

4 108 Henry 70 75 67 2.9

5 109 Irene 89 87 91 3.7

6 110 Jack 84 91 86 3.6
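
When leading rows are skipped, the header goes with them unless column names are supplied explicitly. The sketch below shows the idea with base-R read.csv() on an in-memory CSV so that it runs without a spreadsheet file; csv_text and tail_rows are hypothetical names. read_excel() has a col_names argument that also accepts a character vector, so the same approach should carry over to the Excel case.

```r
# Hypothetical in-memory CSV standing in for the spreadsheet
csv_text <- "Student.ID,Name,CGPA
101,Alice,3.8
102,Bob,3.4
103,Charlie,4.0"

# skip = 2 drops the header line and the first data row,
# so the column names are passed in by hand
tail_rows <- read.csv(
  text      = csv_text,
  skip      = 2,
  header    = FALSE,
  col.names = c("Student.ID", "Name", "CGPA")
)

print(tail_rows)
```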

#DATA CLEANING IN R

#Data cleaning refers to the process of transforming raw data into data that is suitable to perform operations

#Method 1: Remove Rows With Missing Values

library(dplyr)

#remove rows with any missing values

df <- data.frame(
  PatientID = c(1, 2, 3, 4, 5),
  PatientName = c("John Doe", "Jane Smith", "Bob Johnson", NA, "Charlie Brown"),
  Age = c(35, 28, NA, 60, 42),
  AdmissionDate = as.Date(c("2024-01-01", "2024-02-15", "2024-03-10", "2024-04-05", "2024-05-20")),
  DischargeDate = as.Date(c("2024-01-10", "2024-03-05", NA, "2024-05-15", "2024-06-10")),
  Diagnosis = c("Flu", "Fractured Leg", "Pneumonia", "Hypertension", "Appendicitis"),
  RoomNumber = c(101, NA, 310, 415, 520),
  DoctorName = c("Dr. Smith", "Dr. Johnson", "Dr. Davis", "Dr. Wilson", NA)
)

print(df)

df %>% na.omit()

OUTPUT:

> print(df)

PatientID PatientName Age AdmissionDate DischargeDate Diagnosis RoomNumber

1 1 John Doe 35 2024-01-01 2024-01-10 Flu 101

2 2 Jane Smith 28 2024-02-15 2024-03-05 Fractured Leg NA

3 3 Bob Johnson NA 2024-03-10 <NA> Pneumonia 310

4 4 <NA> 60 2024-04-05 2024-05-15 Hypertension 415

5 5 Charlie Brown 42 2024-05-20 2024-06-10 Appendicitis 520


DoctorName

1 Dr. Smith

2 Dr. Johnson

3 Dr. Davis

4 Dr. Wilson

5 <NA>

> df %>% na.omit()

PatientID PatientName Age AdmissionDate DischargeDate Diagnosis RoomNumber

1 1 John Doe 35 2024-01-01 2024-01-10 Flu 101

DoctorName

1 Dr. Smith
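
Before dropping rows it can help to see how many values are missing and where, since na.omit() keeps only fully complete rows. A base-R sketch on a cut-down copy of the patient data follows; patients and the other variable names are illustrative.

```r
# Cut-down copy of the patient data with the same NA pattern in Age and RoomNumber
patients <- data.frame(
  PatientID  = 1:5,
  Age        = c(35, 28, NA, 60, 42),
  RoomNumber = c(101, NA, 310, 415, 520)
)

na_per_column <- colSums(is.na(patients))  # count of missing values in each column
complete_rows <- complete.cases(patients)  # TRUE where a row has no NA at all
n_kept        <- sum(complete_rows)        # rows that na.omit() would keep

print(na_per_column)
print(n_kept)
```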

#replace missing values in numeric columns with the column mean

df1 <- df %>% mutate(across(where(is.numeric), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))


print(df1)

OUTPUT:

  PatientID   PatientName   Age AdmissionDate DischargeDate     Diagnosis RoomNumber  DoctorName
1         1      John Doe 35.00    2024-01-01    2024-01-10           Flu      101.0   Dr. Smith
2         2    Jane Smith 28.00    2024-02-15    2024-03-05 Fractured Leg      336.5 Dr. Johnson
3         3   Bob Johnson 41.25    2024-03-10          <NA>     Pneumonia      310.0   Dr. Davis
4         4          <NA> 60.00    2024-04-05    2024-05-15  Hypertension      415.0  Dr. Wilson
5         5 Charlie Brown 42.00    2024-05-20    2024-06-10  Appendicitis      520.0        <NA>
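
The mutate(across(...)) call only touches numeric columns, which is why the missing name, date and doctor remain NA. The same mean imputation can be sketched in base R; impute_mean is a hypothetical helper name and patients is a cut-down copy of the data frame above.

```r
# Hypothetical helper: replace NA in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

patients <- data.frame(
  PatientName = c("John Doe", "Jane Smith", "Bob Johnson", NA, "Charlie Brown"),
  Age         = c(35, 28, NA, 60, 42),
  RoomNumber  = c(101, NA, 310, 415, 520)
)

# Apply the helper to numeric columns only; character columns are left untouched
numeric_cols <- sapply(patients, is.numeric)
patients[numeric_cols] <- lapply(patients[numeric_cols], impute_mean)

print(patients)
```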

#remove NA values when computing the mean of a vector

zz=c(23,29,NA,9,21.19)
zz
length(zz)
mean(zz)
M=mean(zz,na.rm=TRUE)
print(M)

OUTPUT:

> zz=c(23,29,NA,9,21.19)

> zz

[1] 23.00 29.00 NA 9.00 21.19

> length(zz)

[1] 5

> mean(zz)

[1] NA

> M=mean(zz,na.rm=TRUE)

> print(M)

[1] 20.5475

#Remove rows with missing values from a data frame

df <- data.frame(
  Name = c("Alice", "Jack", "Bob", "Anny", "Christie"),
  Age = c(24, 35, 23, 32, 40),
  City = c("New York", "Los Angeles", NA, "San Francisco", "Seattle"),
  Grade = c("A", "B", "C", "A", "B")
)

print(df)

new_df<- df %>% na.omit()

print(new_df)

OUTPUT:

> print(df)

Name Age City Grade

1 Alice 24 New York A

2 Jack 35 Los Angeles B

3 Bob 23 <NA> C

4 Anny 32 San Francisco A

5 Christie 40 Seattle B

> new_df<- df %>% na.omit()


> print(new_df)

Name Age City Grade

1 Alice 24 New York A

2 Jack 35 Los Angeles B

4 Anny 32 San Francisco A

5 Christie 40 Seattle B

#Remove duplicate rows from a data frame

df_with_duplicates <- data.frame(
  Name = c("Jack", "Jane", "Jack", "Jane", "Bob", "Alice", "Christie"),
  Age = c(25, 30, 25, 30, 22, 28, 35),
  City = c("New York", "Los Angeles", "New York", "Los Angeles", "Chicago", "San Francisco", "Seattle"),
  Grade = c("A", "B", "A", "B", "C", "A", "B")
)

print(df_with_duplicates)

new_df<- df_with_duplicates %>% distinct(.keep_all = TRUE)

print(new_df)

OUTPUT:

> print(df_with_duplicates)

Name Age City Grade

1 Jack 25 New York A

2 Jane 30 Los Angeles B

3 Jack 25 New York A

4 Jane 30 Los Angeles B

5 Bob 22 Chicago C

6 Alice 28 San Francisco A

7 Christie 35 Seattle B

> new_df<- df_with_duplicates %>% distinct(.keep_all = TRUE)


> print(new_df)

Name Age City Grade

1 Jack 25 New York A

2 Jane 30 Los Angeles B

3 Bob 22 Chicago C

4 Alice 28 San Francisco A

5 Christie 35 Seattle B
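
distinct() keeps the first occurrence of each repeated row. In base R the same result can be sketched with duplicated(), which also shows which rows were flagged; the people data frame below is a hypothetical cut-down example. Unlike the distinct() output above, base R keeps the original row names.

```r
people <- data.frame(
  Name = c("Jack", "Jane", "Jack", "Bob"),
  Age  = c(25, 30, 25, 22)
)

dup_flags <- duplicated(people)   # TRUE for a row that repeats an earlier row
deduped   <- people[!dup_flags, ] # keep only first occurrences
same_as_unique <- unique(people)  # unique() on a data frame does the same filtering

print(dup_flags)
print(deduped)
```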

#Drop rows with missing values (drop_na() is provided by the tidyr package)
library(tidyr)
df <- data.frame(
Name = c("John", "Jane", "Bob", "Alice", "Charlie"),
Age = c(25, 30, 22, 28, 35),
City = c("New York", "Los Angeles", NA, "San Francisco", "Seattle"),
Grade = c("A", "B", "C", "A", "B")
)
df %>% drop_na()

OUTPUT:

Name Age City Grade

1 John 25 New York A

2 Jane 30 Los Angeles B

3 Alice 28 San Francisco A

4 Charlie 35 Seattle B

#drop rows with missing values in a specific column

df %>% drop_na(City)

OUTPUT:

Name Age City Grade

1 John 25 New York A

2 Jane 30 Los Angeles B

3 Alice 28 San Francisco A

4 Charlie 35 Seattle B
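
In the example above only City contains an NA, so drop_na() and drop_na(City) return the same rows. The difference shows up when another column also has a missing value. A base-R sketch of the two filters on a hypothetical data frame (people, with an extra NA in Age) follows.

```r
people <- data.frame(
  Name = c("John", "Jane", "Bob", "Alice"),
  Age  = c(25, NA, 22, 28),
  City = c("New York", "Los Angeles", NA, "San Francisco")
)

# Drop a row if any column is missing: removes Jane (Age) and Bob (City)
any_na_dropped <- people[complete.cases(people), ]

# Drop a row only if City is missing: removes Bob, keeps Jane
city_na_dropped <- people[!is.na(people$City), ]

print(any_na_dropped)
print(city_na_dropped)
```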
# to get column names, types and a preview of the values
glimpse(df)

OUTPUT:

Rows: 5

Columns: 4

$ Name <chr> "John", "Jane", "Bob", "Alice", "Charlie"

$ Age <dbl> 25, 30, 22, 28, 35

$ City <chr> "New York", "Los Angeles", NA, "San Francisco", "Seattle"

$ Grade <chr> "A", "B", "C", "A", "B"

#bind rows

df1 <- data.frame(


Name = c("Aarav", "Aditi", "Arjun", "Ananya", "Ayush"),
Age = c(25, 30, 22, 28, 35),
City = c("Mumbai", "Delhi", "Bangalore", "Kolkata", "Chennai"),
Grade = c("A", "B", "C", "A", "B")
)
df2 <- data.frame(
Name = c("Bhavya", "Chirag", "Deepika", "Dhruv", "Esha"),
Age = c(27, 32, 24, 30, 28),
City = c("Jaipur", "Ahmedabad", "Lucknow", "Hyderabad", "Pune"),
Grade = c("B", "C", "A", "B", "A")
)
df1
df2
com<- bind_rows(df1,df2)
print(com)

OUTPUT:

Name Age City Grade

1 Aarav 25 Mumbai A
2 Aditi 30 Delhi B

3 Arjun 22 Bangalore C

4 Ananya 28 Kolkata A

5 Ayush 35 Chennai B

Name Age City Grade

1 Bhavya 27 Jaipur B

2 Chirag 32 Ahmedabad C

3 Deepika 24 Lucknow A

4 Dhruv 30 Hyderabad B

5 Esha 28 Pune A

Name Age City Grade

1 Aarav 25 Mumbai A

2 Aditi 30 Delhi B

3 Arjun 22 Bangalore C

4 Ananya 28 Kolkata A

5 Ayush 35 Chennai B

6 Bhavya 27 Jaipur B

7 Chirag 32 Ahmedabad C

8 Deepika 24 Lucknow A

9 Dhruv 30 Hyderabad B

10 Esha 28 Pune A
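
When both data frames have identical column names, as they do here, base-R rbind() stacks them in the same way; bind_rows() is the more forgiving choice because it fills columns missing from one input with NA. A minimal base-R sketch with hypothetical two-row batches:

```r
batch1 <- data.frame(Name = c("Aarav", "Aditi"),   Age = c(25, 30))
batch2 <- data.frame(Name = c("Bhavya", "Chirag"), Age = c(27, 32))

# rbind() needs the same column names in both inputs
stacked <- rbind(batch1, batch2)

print(stacked)
```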
