Daur Unit 2
INTRODUCTION:
DATA FRAMES
Imagine a data frame as something akin to a database table or an Excel spreadsheet. It has a
specific number of columns, each of which is expected to contain values of a particular data type.
It also has an indeterminate number of rows, i.e. sets of related values for each column.
Assume we have been asked to store data about our employees (such as employee ID, name and the
project that they are working on). We have been given three independent vectors, namely
“EmpNo”, “EmpName” and “ProjName”, that hold details such as employee IDs, employee
names and project names, respectively.
However, we need a data structure similar to a database table or an Excel spreadsheet that can
bind all these details together. We create a data frame by the name, “Employee” to store all the
three vectors together.
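A minimal sketch of how such a data frame might be built is given here. The values are taken from the outputs shown later in this unit. The fourth vector, EmpExpYears, is not one of the three vectors named above; it is included as an assumption because it appears in those outputs.

```r
# Vectors holding the employee details (values as in the outputs of this unit)
EmpNo <- c(1000, 1001, 1002, 1003, 1004)
EmpName <- c("Jack", "Jane", "Margaritta", "Joe", "Dave")
ProjName <- c("P01", "P02", "P03", "P04", "P05")
EmpExpYears <- c(5, 9, 6, 12, 7)  # assumed fourth vector: years of experience

# Bind the vectors column-wise into a data frame.
# stringsAsFactors = TRUE gives factor columns for the text vectors;
# from R 4.0.0 onwards the default is FALSE (character columns).
Employee <- data.frame(EmpNo, EmpName, ProjName, EmpExpYears,
                       stringsAsFactors = TRUE)
print(Employee)
```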
We have just created a data frame, “Employee” with data neatly organised into rows and the
variable names serving as column names across the top.
Use the syntax as shown next to display the content of the data frame, “Employee” in descending
order of “EmpExpYears”.
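One possible way to do this, sketched with the base order() function, is given here. The data frame is rebuilt from the values shown in this unit so that the example runs on its own; the variable name sorted is only illustrative.

```r
# Rebuild the Employee data frame (values as shown in this unit)
Employee <- data.frame(
  EmpNo = c(1000, 1001, 1002, 1003, 1004),
  EmpName = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

# order() returns the row positions that sort the column;
# decreasing = TRUE gives descending order
sorted <- Employee[order(Employee$EmpExpYears, decreasing = TRUE), ]
print(sorted)
```

The row with 12 years of experience comes first and the row with 5 years comes last.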
We will explore the data held in the data frame with the help of the following R functions:
➢ dim()
• nrow()
• ncol()
➢ str()
➢ summary()
➢ names()
➢ head()
➢ tail()
➢ edit()
dim() Function
The dim() function is used to obtain the dimensions of a data frame. It returns the number of
rows followed by the number of columns.
> dim(Employee)
[1] 5 4
The data frame, “Employee” has 5 rows and 4 columns.
➢ nrow() Function
The nrow() function returns the number of rows in a data frame.
> nrow(Employee)
[1] 5
The data frame, “Employee” has 5 rows.
➢ ncol() Function
The ncol() function returns the number of columns in a data frame.
> ncol(Employee)
[1] 4
The data frame, “Employee” has 4 columns.
str() Function
The str() function compactly displays the internal structure of R objects. We will use it to display
the internal structure of the dataset, “Employee”.
> str (Employee)
‘data.frame’ : 5 obs. of 4 variables:
$ EmpNo : num 1000 1001 1002 1003 1004
$ EmpName : Factor w/ 5 levels “Dave”, “Jack”, ..: 2 3 5 4 1
$ ProjName : Factor w/ 5 levels “P01”, “P02”, “P03”, ..: 1 2 3 4 5
$ EmpExpYears : num 5 9 6 12 7
summary() Function
We will use the summary() function to return result summaries for each column of the dataset.
> summary (Employee)
     EmpNo          EmpName      ProjName   EmpExpYears
 Min.   :1000   Dave      :1     P01:1     Min.   : 5.0
 1st Qu.:1001   Jack      :1     P02:1     1st Qu.: 6.0
 Median :1002   Jane      :1     P03:1     Median : 7.0
 Mean   :1002   Joe       :1     P04:1     Mean   : 7.8
 3rd Qu.:1003   Margaritta:1     P05:1     3rd Qu.: 9.0
 Max.   :1004                              Max.   :12.0
names() Function
The names() function returns the names of the elements of an object; for a data frame, these are
the column names. We will use it to return the column headers for the dataset, “Employee”.
> names(Employee)
[1] "EmpNo" "EmpName" "ProjName" "EmpExpYears"
In the example, names(Employee) returns the column headers of the dataset “Employee”.
The str() function helps in returning the basic structure of the dataset. This function provides an
overall view of the dataset.
head() Function
The head() function is used to obtain the first n observations, where n is set as 6 by default.
Examples
1. In this example, the value of n is set as 3 and hence, the resulting output would contain
the first 3 observations of the dataset.
> head(Employee, n=3)
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
tail() Function
The tail() function is used to obtain the last n observations, where n is set as 6 by default.
> tail(Employee, n=3)
EmpNo EmpName ProjName EmpExpYears
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
Example
Consider the case where the value of n is negative. When a negative n is given, the tail()
function drops the first |n| observations and returns the rest, i.e. the last x + n observations,
where x is the number of rows. Here x = 5 and n = -2, so the example given as follows returns
the last 3 records from the dataset, “Employee”.
> tail(Employee, n=-2)
EmpNo EmpName ProjName EmpExpYears
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
edit() Function
The edit() function invokes an editor on the R object; for a data frame, it opens a
spreadsheet-like data editor. We will use the edit() function to open the dataset, “Employee” in
the editor.
Let us look at how R can load data into data frames from external files.
We have created a .csv file by the name, “item.csv” in the D:\ drive. It has the following
content:
EXPLORING DATA
Data in R is a set of organized information. Statistical data, the most common kind in R, is a
set of observations in which each observation records a value for every variable. These variables
are used in measuring, controlling or manipulating the results of a program. Variables differ in
size and type.
R supports the following basic data types to explore:
➢ Integer
➢ Numeric
➢ Logical
➢ Character/string
➢ Factor
➢ Complex
Based on the specific data characteristics in R, data can be explored in different ways.
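The data types listed above can be inspected with the class() function. The sketch below uses illustrative values that are not part of the original notes.

```r
# One illustrative value for each basic data type
values <- list(
  integer   = 42L,
  numeric   = 3.14,
  logical   = TRUE,
  character = "R",
  factor    = factor(c("low", "high", "low")),
  complex   = 2 + 3i
)

# class() reports the type of each value
types <- sapply(values, class)
print(types)
```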
DATA SUMMARY
Data summary in R can be obtained by using various R functions.
We will execute a few R functions on the dataset, “Employee”. Let us begin by displaying the
contents of the dataset, “Employee”.
> Employee
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
The summary(Employee[4]) function works on the fourth column, “EmpExpYears” and
computes the minimum, 1st quartile, median, mean, 3rd quartile and maximum for its value.
> summary(Employee[4])
  EmpExpYears
 Min.   : 5.0
 1st Qu.: 6.0
 Median : 7.0
 Mean   : 7.8
 3rd Qu.: 9.0
 Max.   :12.0
The min(Employee[4]) function works on the fourth column, “EmpExpYears” and determines
the minimum value for this column.
Similarly, other functions (such as mean, min, max, range and quantile) can be used with the
sapply() function to obtain the desired output. Consider the same data frame, Employee.
> which.min(Employee$EmpExpYears)
[1] 1
The employee at position 1 has the minimum years of experience, 5.
> which.max(Employee$EmpExpYears)
[1] 4
The employee at position 4 has the maximum years of experience, 12.
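The min() and sapply() calls described above can be sketched as follows. The data frame is rebuilt from the values shown in this unit so that the example runs on its own; the result variable names are only illustrative.

```r
# Rebuild the Employee data frame (values as shown in this unit)
Employee <- data.frame(
  EmpNo = c(1000, 1001, 1002, 1003, 1004),
  EmpName = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

# min() works directly on a data frame whose columns are all numeric
min_years <- min(Employee[4])

# sapply() applies a function to each column of the selected data frame
mean_years  <- sapply(Employee[4], mean)      # named vector
range_years <- sapply(Employee[4], range)     # 2 x 1 matrix: min and max
quartiles   <- sapply(Employee[4], quantile)  # 5 x 1 matrix: 0%, 25%, ..., 100%

print(min_years)
print(mean_years)
print(range_years)
print(quartiles)
```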
Data can also be summarized by group: the data is first grouped on some specific conditions or
variables and summary functions are then applied to each group. There are three ways to do
this, as listed below.
• ddply() requires the “plyr” package
• summaryBy() requires the “doBy” package
• aggregate() is included in the base installation of R.
A simple code to explain the ddply() function is:
data <- read.table(header = TRUE, text = 'no sex before after change
1 M 54.2 5.2 -9.2
2 F 63.2 61.0 1.1
3 F 52 24.5 3.5
4 F 25 55 2.5
# ... (remaining rows omitted)
54 M 54 45 1.2'
)
KAMALA CHALLA,ASST.PROF,IT,VNR VJIET. Page 12
When applying the ddply() function to the above input,
library(plyr)
# Compute the length, sd and mean of the "change" values for each group of inputs.
# Break down by the values of "no"
cdata <- ddply(data, c("no"), summarise,
N = length(change),
sd = sd(change),
mean = mean(change)
)
cdata
Output of the ddply() function is:
 no  N   sd  mean
  1  5  4.02  2.3
  2 14  5.5   2.1
  3  4  2.1   1.0
  .
  .
 54  9  2.0   0.9
2. na.exclude(c) excludes the missing values and returns the object. It differs from na.omit()
only in that some residual and prediction functions pad their results with NA to the original length.
> na.exclude(c)
V1 V2
1 1 4
2 2 5
4. na.fail(c) returns an error when a missing value is found. It returns the object only when
there is no missing value.
> na.fail(c)
Error in na.fail.default(c) : missing values in object
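The two calls above can be tried on a small data frame. The definition of c is not shown in these notes, so the sketch below assumes a hypothetical data frame with columns V1 and V2 whose third row contains a missing value.

```r
# Hypothetical data frame named `c` as in the notes; the third row has an NA.
# (Assumed values: only rows 1 and 2 appear in the output shown above.)
c <- data.frame(V1 = c(1, 2, NA), V2 = c(4, 5, 6))

# na.exclude() drops the rows that contain missing values
cleaned <- na.exclude(c)
print(cleaned)

# na.fail() stops with an error when a missing value is present;
# tryCatch() captures the error message instead of halting the script
result <- tryCatch(na.fail(c), error = function(e) conditionMessage(e))
print(result)

# With no missing values, na.fail() simply returns the object
print(na.fail(cleaned))
```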
In R, special checks are conducted for handling invalid values. An invalid value can be NA,
NaN, Inf or -Inf. Functions for these invalid values include anyNA(x), anyInvalid(x) and
is.invalid(x), where the value of x can be a vector, matrix or array. Of these, only anyNA() is
part of base R; anyInvalid() and is.invalid() are not base R functions and have to be defined by
the user. The anyNA() function returns TRUE if the input has any NA or NaN values. Else, it
returns FALSE. This function is equivalent to any(is.na(x)).
The anyInvalid() function returns TRUE if the input has any invalid values. Else, it returns
FALSE. This function is equivalent to any(is.invalid(x)).
Unlike the other two functions, is.invalid() returns one value for each input value: TRUE if
that input value is invalid, else FALSE. This function is equivalent to
(is.na(x) | is.infinite(x)).
A few examples with the above functions are:
> anyNA(c(-9,NaN,9))
[1] TRUE
> is.finite(c(-9, Inf,9))
[1] TRUE FALSE TRUE
> is.infinite(c(-9, Inf, 9))
[1] FALSE TRUE FALSE
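Since is.invalid() and anyInvalid() are not available in base R, they can be written as small user-defined helpers. The sketch below simply follows the equivalences stated above; it is one possible definition, not an official one.

```r
# User-defined helpers (not base R), following the equivalences in the notes
is.invalid <- function(x) is.na(x) | is.infinite(x)
anyInvalid <- function(x) any(is.invalid(x))

x <- c(-9, NaN, Inf, 9)

print(is.invalid(x))           # FALSE TRUE TRUE FALSE
print(anyInvalid(x))           # TRUE
print(anyInvalid(c(1, 2, 3)))  # FALSE
```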
The basic idea of invalid values and outliers can be explained with a simple example.
Obtain the minimum, maximum, median, mean, 1st quartile and 3rd quartile values using the
summary() function.
The output is explained as follows:
> summary(custdata$income)
# returns the minimum, maximum, mean, median and quartile values of ‘income’ from the ‘custdata’ input values.
Minimum 1st Quartile Median Mean 3rd Quartile Maximum
-8700 14600 35000 53500 67000 615000
> summary(custdata$age)
# returns the minimum, maximum, mean, median and quartile values of ‘age’ from the ‘custdata’ input values.
Minimum 1st Quartile Median Mean 3rd Quartile Maximum
0.0 38.0 50.0 51.7 64.0 146.7
The above two outputs illustrate invalid values and outliers. In the first output, one of the
values of ‘income’ is negative (-8700). In practice, a person cannot earn a negative income; a
negative value may instead be an indicator of debt. Either way, such negative values need to be
treated before the data is analysed.
A decision is required on how to handle these types of inputs, i.e. either to drop the negative
values for the income or to convert the negative income into zero.
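The two treatments mentioned above can be sketched as follows. The income values are hypothetical and are not the custdata set used in these notes.

```r
# Hypothetical income values; one of them is negative
income <- c(-8700, 14600, 35000, 67000, 615000)

# Option 1: drop the negative values
income_dropped <- income[income >= 0]

# Option 2: convert the negative values to zero
income_zeroed <- ifelse(income < 0, 0, income)

print(income_dropped)
print(income_zeroed)
```

Which option is appropriate depends on what the negative value means in the data; dropping it changes the number of observations, while converting it to zero changes the mean.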
In the second case, one of the values of ‘age’ is zero and another is greater than 120; both are
considered outliers because they fall outside the range of expected values. Outliers are often
incorrect values or errors in the input data. In such cases, an age of ‘0’ could refer to unknown
data or the customer may never have disclosed the age, and an age of more than 120 years would
mean that the customer has lived unusually long.
A negative value in the age field could be a sentinel value and an outlier could be an error data,
unusual data or sentinel value. In case of missing a proper input to the field, an action is required
DESCRIPTIVE STATISTICS
1. Data Range
The data range helps in identifying how widely the input data is spread. The data range of the
observation variable is the difference between the largest and the smallest data value in a dataset.
The value of a data range can be calculated by subtracting the smallest value from the largest
value, i.e. Range = Largest value – Smallest value.
For example, the range or the duration of rainfall can be computed as
# Extract the rainfall durations from the dataset "time"
> duration = time$rainfall
# Apply the max and min functions to return the range
> max(duration) - min(duration)
This sample code returns the range of the rainfall duration by subtracting the minimum value from
the maximum value. In the example above, the duration of rainfall is used in predicting the
probability of rainfall; for it to be useful, there should be enough variation in the amount of
rainfall and the duration of the rainfall.
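A runnable version of the range calculation is sketched below. The duration values are hypothetical, since the dataset “time” is not included in these notes.

```r
# Hypothetical rainfall durations (in minutes)
duration <- c(12, 45, 7, 30, 60)

# Range = largest value - smallest value
data_range <- max(duration) - min(duration)
print(data_range)             # 53

# range() returns c(min, max); diff() of that pair gives the same result
print(diff(range(duration)))  # 53
```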
2. Frequencies and Mode
Frequency
Frequency is a summary of how often data values occur in a collection of non-overlapping
categories. In R, the table() function can be used to find the frequency distribution of a vector.
In the example given, mtcars is the dataset and the frequency distribution of the gear variable is
the summary of the number of cars for each number of gears.
> head(subset(mtcars, select = 'gear'))
gear
Mazda RX4 4
Mazda RX4 Wag 4
Datsun 710 4
Hornet 4 Drive 3
Hornet Sportabout 3
Valiant 3
> factor(mtcars$gear)
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5
> w = table(mtcars$gear)
>w
3 4 5
15 12 5
> t = as.data.frame(w)
>t
Var1 Freq
1 3 15
2 4 12
3 5 5
> names(t)[1] = 'gear'
>t
gear Freq
1 3 15
2 4 12
3 5 5
The cbind()function can be used to display the result in column format.
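A short sketch of the cbind() usage on the gear frequencies is given below; the variable name freq_col and the column label Freq are only illustrative.

```r
# Frequency table of the number of gears in the built-in mtcars dataset
w <- table(mtcars$gear)

# cbind() turns the table into a one-column matrix,
# with the gear values as row names
freq_col <- cbind(Freq = w)
print(freq_col)
```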
Mode
Mode is related to frequency: the mode is the value that has the highest number of occurrences
in a dataset. Mode can be computed for both numeric and character data.
R does not have a standard inbuilt function to calculate the statistical mode of the given inputs
(the base function mode() returns the storage mode of an object instead). Hence, a user-defined
function is required to calculate mode in R. Here, the input is a vector and the output is the
mode value.
A sample code to return the mode value is
#Create the function
getmode <- function(y){
uniqy <- unique(y)
uniqy[which.max(tabulate(match(y,uniqy)))]
}
# Define the input vector values
v <- c(5,6,4,8,5,7,4,6,5,8,3,2,1)
#Calculate the mode with user-defined functions
resultmode<- getmode(v)
print(resultmode)
#Define characters as input vector values
charv <- c("as","is","is","it","in")
#Calculate mode using user-defined function
resultmode <- getmode(charv)
print(resultmode)
Executing the above code will give the result as:
[1] 5
[1] "is"
> #Create the function
> getmode <- function(y) {
+ uniqy <- unique (y)
+ uniqy[which.max(tabulate(match(y, uniqy)))]
+}
>
> v <- c(5,6,4,8,5,7,4,6,5,8,3,2,1)
> resultmode<- getmode(v)
> print(resultmode)
[1] 5
> charv <- c("as","is","is","it","in")
> resultmode <- getmode(charv)
> print(resultmode)
[1] "is"
Statistical data in R is analyzed using inbuilt functions. These inbuilt functions are found in the
base R package. The functions take vector values as input with arguments and produce the
output.
Mean
Mean is the sum of the input values divided by the number of inputs. It is also called the
average of the input values. In R, the mean is calculated with the inbuilt function mean().
Basic syntax for the mean() function in R is:
mean(x, trim=0, na.rm = FALSE,...)
where x is the input vector, trim is the fraction (0 to 0.5) of observations to be dropped from
each end of the sorted input vector before the mean is computed, and na.rm, when TRUE, removes
the missing values from the input vector.
Example 1
A sample code to calculate the mean in R is
#Define a vector
x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
# Find the mean of the vector inputs
result.mean <- mean(x)
print(result.mean)
Output
[1] 12.65
> x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
> result.mean <- mean(x)
> print (result.mean)
[1] 12.65
When the trim parameter is given, the vector values are first sorted and a fraction trim of the
values is dropped from each end before the mean is calculated. Say trim = 0.4; then 4 values from
each end of the 10 sorted values are dropped. With the above sample, the vector values
(15,54,6,5,9.2,36,5.3,8,-7,-5) are sorted to (-7,-5,5,5.3,6,8,9.2,15,36,54) and 4 values are
removed from both the ends, i.e. (-7,-5,5,5.3) from the left and (9.2,15,36,54) from the right,
leaving (6,8). With trim = 0.3, as in Example 2, 3 values are removed from each end, leaving
(5.3,6,8,9.2), whose mean is 7.125.
Example 2
#Define a vector
x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
# Find the mean of the vector inputs
result.mean <- mean(x, trim =0.3)
print(result.mean)
Output
[1] 7.125
> x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
> result.mean <- mean(x, trim =0.3)
> print(result.mean)
[1] 7.125
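The effect of trim can be checked by trimming the sorted vector by hand. This is a sketch of the idea only; the variable names n, k and kept are illustrative.

```r
x <- c(15, 54, 6, 5, 9.2, 36, 5.3, 8, -7, -5)
trim <- 0.3

n <- length(x)
k <- floor(n * trim)               # values dropped from each end: 3

# Sort, then keep only the middle values
kept <- sort(x)[(k + 1):(n - k)]
print(kept)                        # 5.3 6.0 8.0 9.2

# The mean of the kept values matches mean() with the trim argument
print(mean(kept))                  # 7.125
print(mean(x, trim = trim))        # 7.125
```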
Example 3
In case of any missing value, the mean() function would return NA. In order to overcome such
cases, na.rm = TRUE is used to remove the NA values from the list for calculating the mean in
R.
#Define a vector
x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5,NA)
# Find the mean of the vector inputs
result.mean <- mean(x)
print(result.mean)
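The effect of na.rm on the same vector can be sketched as follows; without it the result is NA, and with na.rm = TRUE the missing value is dropped before the mean is computed.

```r
x <- c(15, 54, 6, 5, 9.2, 36, 5.3, 8, -7, -5, NA)

# Without na.rm, the missing value propagates to the result
print(mean(x))                 # NA

# With na.rm = TRUE, the NA is removed first
result.mean <- mean(x, na.rm = TRUE)
print(result.mean)             # 12.65
```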
Step 4: Use the abline function to have a straight line (horizontal line) run through the bar
plot at the mean value.
The abline function can take an h parameter with a value to draw a horizontal line or a v
parameter for a vertical line. When it’s called, it updates the previous plot. Draw a horizontal line
across the plot at the mean.
> barplot (numbers)
> abline (h = mean (numbers))
Median
Median is the middle value of the given inputs. In R, the median can be found using the median()
function. Basic syntax for calculating the median in R is
median(x, na.rm=FALSE)
where x is the input vector and na.rm, when TRUE, removes the missing values from the input vector.
Example 1
A sample to find out the median value of the input vector in R is
#Define a vector
x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
# Find the median value
median.result <-median(x)
print(median.result)
[1] 7
> x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
> median.result <-median (x)
> print (median.result)
[1] 7
Example 2
Objective: To determine the median of a set of numbers. To plot the numbers in a bar plot
and have a straight line run through the plot at the median.
Step 1: To create a vector, “numbers”.
> numbers <- c(1, 3, 5, 2, 8, 7, 9, 10)
Step 2: To compute the median value of the set of numbers contained in the
vector,“numbers”.
> median(numbers)
[1] 6
Step 3: To plot a bar plot using the vector, “numbers”. Use the abline function to have a
straight line (horizontal line) run through the bar plot at the median.
> barplot (numbers)
> abline (h = median (numbers))
Outcome: A straight line at the computed median value (6.0) runs through the bar plot computed
on the vector, “numbers”.
4. Standard Deviation
Objective: To determine the standard deviation.
To plot the numbers in a bar plot and have a straight line run through the plot at the mean and
another straight line run through the plot at mean + standard deviation.
Step 1: To create a vector, “numbers”.
> numbers <- c(1,3,5,2,8,7,9,10)
Step 2: To compute the mean value of the set of numbers contained in the vector,
“numbers”.
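The objective stated above can be sketched in full as follows. The steps after Step 2 are not shown in these notes, so the calls to sd() and abline() below are an assumption about how they would proceed; the dashed line style is only a choice made here to tell the two lines apart.

```r
numbers <- c(1, 3, 5, 2, 8, 7, 9, 10)

# Mean and (sample) standard deviation of the numbers
mean_value <- mean(numbers)    # 5.625
sd_value <- sd(numbers)
print(mean_value)
print(sd_value)

# Bar plot with one horizontal line at the mean
# and another at mean + standard deviation
barplot(numbers)
abline(h = mean_value)
abline(h = mean_value + sd_value, lty = 2)  # dashed line
```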
5. Mode
Objective: To determine the mode of a set of numbers.
R does not have a standard inbuilt function to determine the mode. We will write our own
“Mode” function. This function will take a vector as the input and return the mode as the
output value.
Step 1: Create a user-defined function, “Mode”.
> Mode <-function(v) {
+ UniqValue <- unique(v)
+ UniqValue[which.max(tabulate(match(v, UniqValue)))]
+}
While writing the above function, “Mode”, we have used three other functions provided by R,
viz. “unique”, “tabulate” and “match”.
unique function: The “unique” function will take the vector as the input and returns the vector
with the duplicates removed.
>v
[1] 2 1 2 3 1 2 3 4 1 5 5 3 2 3
> unique(v)
[1] 2 1 3 4 5
match function: Takes two vectors as the input and returns a vector that holds, for each element
of its first argument, the position of its (first) match in its second argument.
>v
[1] 2 1 2 3 1 2 3 4 1 5 5 3 2 3
> UniqValue <- unique(v)
> UniqValue
[1] 2 1 3 4 5
> match(v,UniqValue)
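The result of the match() call and the remaining steps of the “Mode” function can be traced as follows. This is a sketch using the vector v shown above; the intermediate variable names positions and counts are illustrative.

```r
v <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)

UniqValue <- unique(v)             # 2 1 3 4 5
positions <- match(v, UniqValue)   # 1 2 1 3 2 1 3 4 2 5 5 3 1 3
counts <- tabulate(positions)      # 4 3 4 1 2 (occurrences of 2, 1, 3, 4, 5)
print(positions)
print(counts)

# which.max() picks the position of the largest count
Mode <- function(v) {
  UniqValue <- unique(v)
  UniqValue[which.max(tabulate(match(v, UniqValue)))]
}
print(Mode(v))                     # 2
```

Here the values 2 and 3 both occur four times. which.max() returns the first maximum, so the function reports 2; with ties, only one of the modes is returned.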
For a better understanding of input data, pictures or charts are preferred over text. Visualization
engages the audience well; numerical values alone, by comparison, cannot represent a big dataset
in an engaging manner.
The use of graphical representation to examine the given set of data is called visualization. With
visualization, it is easier to do the following:
• Determine the peak value of the age of the customers (maximum value)
• Estimate the existence of sub-populations
• Determine the outlier values.
The graphical representation displays the maximum available information from the lowest to the
highest value. It also presents users with greater data clarity. For better usage of visualization, the
right aspect ratio and scaling of data is needed.
The additional black curve in Figure 4.2 refers to a bimodal distribution, i.e. one with two peaks.
In general, if a distribution contains more than one peak, it is considered multimodal. The
black curve has the same mean age as that of the grey curve. However, here the curve concentrates
on two sets of populations: younger customers aged between 20 and 30 and older customers aged
above 70.
These two sets of populations have different patterns in their behaviour and the probability of
customers who have health insurance also differs. In such a case, using a logistic regression or
linear regression fails to represent the current scenario. Hence, in order to overcome the
difficulties faced for such representations, a density plot or histogram can examine the
distribution of the input values. Moving forward, the histogram makes the representation simpler
as compared to density plots and is the preferred method for presenting findings from
quantitative analysis.
1) Histograms
The hist() function draws a histogram of a numeric vector. The ‘col’ parameter fills the bars
with a colour, and a border colour can be given to the bars by passing a value to the ‘border’
parameter.
> H <- c(8,13,30,5,28)
> hist(H, xlab="Categories", col="red")
Example 3
The parameters xlim and ylim are used to denote the range of values that are used in the X
and Y axes. The breaks parameter suggests the number of bars (cells), which in turn determines
the width of each bar.
#Create data for the histogram
H <- c(8,13,30,5,28)
# Give a file name for the histogram
png(file = "samplelimhistogram.png")
#Create a samplelimhistogram.png
hist(H, xlab = "Values", ylab = "Colours", col = "green", xlim = c(0,30),
ylim = c(0,5), breaks = 5)
#Save the samplelimhistogram.png file
dev.off()
> H <- c(8,13,30,5,28)
> hist(H, xlab = "Values", ylab = "Colours", col = "green", xlim = c(0,30), ylim = c(0,5), breaks = 5)
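To see how the breaks value translates into bars, the histogram can be computed without drawing it. The sketch below uses plot = FALSE to inspect the cell boundaries and counts, and then draws the same data with a border colour; the colours chosen are illustrative.

```r
H <- c(8, 13, 30, 5, 28)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(H, breaks = 5, plot = FALSE)
print(h$breaks)   # cell boundaries chosen by R (breaks is only a suggestion)
print(h$counts)   # number of values falling in each cell

# Draw the histogram with a fill colour and a border colour
hist(H, xlab = "Values", col = "red", border = "blue")
```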
2) Density Plots
A density plot can be thought of as a ‘continuous histogram’ of the given variable, scaled so that
the area under the curve is equal to 1. A point on the density plot therefore corresponds to the
fraction of the data (the percentage divided by 100) per unit of the variable near that value,
which is why the values on the Y-axis are usually very small.
A density plot is an effective way to assess the distribution of a variable. It provides a better
reference in finding a parametric distribution. The basic syntax to create a plot is
plot(density(x)), where x is a numeric vector value.
Example 1
A simple density plot can be created by just passing the values and using the plot()function
(Figure 4.5).
# Create data for the density plot
h <- density (c(0.0, 38.0, 50.0, 51.7, 64.0, 146.0))
#Create density plot for h
plot(h)
> h <- density (c(0.0, 38.0, 50.0, 51.7, 64.0, 146.0))
> plot(h, xlab="Values", ylab="Density")
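The property that the area under a density curve is equal to 1 can be checked numerically. The sketch below approximates the area with a simple sum over the evaluation grid returned by density(); the result is close to 1 but not exact, because the curve is evaluated on a finite grid.

```r
ages <- c(0.0, 38.0, 50.0, 51.7, 64.0, 146.0)

# density() returns equally spaced x values and the estimated density y
d <- density(ages)

# Approximate the area under the curve: sum of heights times grid spacing
step <- d$x[2] - d$x[1]
area <- sum(d$y) * step
print(area)

# The heights themselves are small because the x-axis spans a wide range
print(max(d$y))
```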
Example 3
To display the dollar symbol on the axis labels, the labels=dollar parameter is passed to
scale_x_continuous(). Hence, the amount is displayed with the dollar $ symbol (Figure 4.7).
library(ggplot2)
library(scales)
ggplot(custdata) + geom_density(aes(x=income)) + scale_x_continuous(labels=dollar)
3) Bar Charts
A bar chart is a pictorial representation of statistical data. Both vertical and horizontal bars can
be drawn using R. It also provides an option to colour the bars in different colours. The length of
each bar is directly proportional to the value it represents. R uses the barplot() function to
create a bar chart. The basic syntax for creating a bar chart using R is
barplot(H, xlab, ylab, main, names.arg, col)
where H is a matrix or a vector that contains the numeric values used in the bar chart, xlab is the
label of the X-axis, ylab is the label of the Y-axis, main is the main title of the bar chart,
names.arg is a vector of names appearing under each bar and col is used to give colours to the bars.
When the “horiz” parameter is set to TRUE, the bar chart is displayed in a horizontal position;
otherwise it is displayed as the default vertical bar chart.
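A horizontal bar chart using these parameters can be sketched as follows. The values are reused from the histogram examples, while the bar names, labels and title are hypothetical.

```r
H <- c(8, 13, 30, 5, 28)
bar_names <- c("A", "B", "C", "D", "E")  # hypothetical bar names

# horiz = TRUE draws the bars horizontally, so the values run along the X-axis.
# barplot() returns the positions of the bar midpoints.
midpoints <- barplot(H,
                     names.arg = bar_names,
                     xlab = "Values",
                     ylab = "Categories",
                     main = "Horizontal bar chart",
                     col = "blue",
                     horiz = TRUE)
print(midpoints)
```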
A grouped bar chart in R is used to handle multiple series of inputs and takes its values as a
matrix. It is created using the barplot() function, which accepts matrix inputs.
Example
> colors <- c("green","orange","brown")
> months <- c("Mar","Apr","May","Jun","Jul")
> regions <- c("East","West","North")
> Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11),nrow=3,ncol= 5,byrow = TRUE)
> Values
Here the legend column is included on the top right side of the bar chart.
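The plotting call for this example is not shown in these notes. A possible version is sketched below, assuming beside = TRUE is used to place the bars of each month side by side; the title and axis labels are hypothetical.

```r
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)

# beside = TRUE groups the rows (regions) side by side within each column (month)
bar_pos <- barplot(Values,
                   main = "Grouped bar chart",
                   names.arg = months,
                   xlab = "Months",
                   ylab = "Values",
                   col = colors,
                   beside = TRUE)

# Legend on the top right side of the chart
legend("topright", regions, cex = 1.3, fill = colors)
```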
A stacked bar chart is similar to a grouped bar chart in that it takes multiple series as a matrix
input. Instead of placing the values of a group side by side, however, the stacked bar chart
stacks them one on top of the other within a single bar.
Example
> days <- c("Mon","Tues","Wed")
> months <- c("Jan","Feb","Mar","Apr","May")
> colours <- c("red","blue","green")
> val <- matrix(c(2,5,8,6,9,4,6,4,7,10,12,5,6,11,13), nrow =3, ncol=5, byrow =TRUE)
> barplot(val,main="Total",names.arg=months,xlab="Months",ylab="Days",col=colours)
> legend("topleft", days, cex=1.3,fill=colours)