
Daur Unit 2

UNIT-4

Uploaded by

jayanthroy555
Copyright
© © All Rights Reserved

Exploring Data in R

INTRODUCTION:

• R provides interactive data visualizations to support analyses of statistical data.


• In R, data is usually stored in data frames because they can hold columns of different data
types. A matrix, by contrast, can store data of only one type.

DATA FRAMES

Imagine a data frame as something akin to a database table or an Excel spreadsheet. It has a
specific number of columns, each of which is expected to contain values of a particular data type.
It also has an indeterminate number of rows, i.e. sets of related values for each column.

Assume we have been asked to store data about our employees (employee ID, name and the
project that they are working on). We have been given three independent vectors, namely
“EmpNo”, “EmpName” and “ProjName”, that hold the employee IDs, employee names and
project names, respectively.

> EmpNo <- c(1000, 1001, 1002, 1003, 1004)
> EmpName <- c("Jack", "Jane", "Margaritta", "Joe", "Dave")
> ProjName <- c("P01", "P02", "P03", "P04", "P05")

However, we need a data structure similar to a database table or an Excel spreadsheet that can
bind all these details together. We create a data frame by the name, “Employee” to store all the
three vectors together.

>Employee <- data.frame(EmpNo, EmpName, ProjName)

Let us print the content of the data frame, “Employee”.

We have just created a data frame, “Employee” with data neatly organised into rows and the
variable names serving as column names across the top.
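The printed output is not reproduced in these notes. As a minimal sketch, the data frame can be rebuilt and printed as follows. The fourth column, “EmpExpYears”, and the project codes are taken from the outputs shown later in these notes; `stringsAsFactors = TRUE` is an assumption made here so that the name columns appear as factors, as in those later outputs.

```r
# Rebuild the example vectors (project codes as they appear in later output).
EmpNo       <- c(1000, 1001, 1002, 1003, 1004)
EmpName     <- c("Jack", "Jane", "Margaritta", "Joe", "Dave")
ProjName    <- c("P01", "P02", "P03", "P04", "P05")
# EmpExpYears is taken from the outputs shown later in these notes.
EmpExpYears <- c(5, 9, 6, 12, 7)

# Assumption: stringsAsFactors = TRUE reproduces the Factor columns shown
# later; since R 4.0.0 the default is FALSE (character columns).
Employee <- data.frame(EmpNo, EmpName, ProjName, EmpExpYears,
                       stringsAsFactors = TRUE)

# Printing shows one row per employee, with the vector names as column names.
print(Employee)
```

Without `stringsAsFactors = TRUE`, recent versions of R would report “EmpName” and “ProjName” as character columns in str() and summary().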

➢ Data Frame Access


There are two ways to access the content of data frames:
i. By providing the index number in square brackets
ii. By providing the column name as a string in double brackets.

➢ By Providing the Index Number in Square Brackets


Example 1
To access the second column, “EmpName”, we type the following command at the R prompt.

KAMALA CHALLA,ASST.PROF,IT,VNR VJIET. Page 1


➢ By Providing the Column Name as a String in Double Brackets
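The worked examples for the two access methods are not reproduced in these notes. The following is a minimal sketch of both, using a small copy of the “Employee” data; the variable names on the left-hand side (by_index, by_name and so on) are illustrative only.

```r
# A small copy of the Employee data used in this section.
Employee <- data.frame(
  EmpNo    = c(1000, 1001, 1002, 1003, 1004),
  EmpName  = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  ProjName = c("P01", "P02", "P03", "P04", "P05")
)

# Index number in single square brackets: returns a one-column data frame.
by_index <- Employee[2]

# Column name as a string in double brackets: returns the column as a vector.
by_name <- Employee[["EmpName"]]

# The $ operator is a common shorthand for the double-bracket form.
by_dollar <- Employee$EmpName

# Rows and columns can be selected together as [rows, columns].
by_rowcol <- Employee[1:2, c("EmpNo", "EmpName")]

print(by_index)
print(by_name)
```

Single brackets keep the data frame structure, whereas double brackets (and `$`) extract the underlying vector, which is what most summary functions expect.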

➢ Ordering the Data Frames


Let us display the content of the data frame, “Employee” in ascending order of “EmpExpYears”.

Use the syntax as shown next to display the content of the data frame, “Employee” in descending
order of “EmpExpYears”.
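The commands themselves are not reproduced in these notes. One common way to order a data frame in base R is the order() function, sketched below under the assumption that “Employee” already contains the “EmpExpYears” column shown in later outputs.

```r
Employee <- data.frame(
  EmpNo       = c(1000, 1001, 1002, 1003, 1004),
  EmpName     = c("Jack", "Jane", "Margaritta", "Joe", "Dave"),
  EmpExpYears = c(5, 9, 6, 12, 7)
)

# order() returns the row positions that sort the column; indexing the rows
# with those positions returns the data frame in ascending order.
asc <- Employee[order(Employee$EmpExpYears), ]

# decreasing = TRUE reverses the sort for descending order.
desc <- Employee[order(Employee$EmpExpYears, decreasing = TRUE), ]

print(asc)
print(desc)
```

Note that the original row numbers are kept as row names, so the ordered output shows them out of sequence.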



R FUNCTIONS FOR UNDERSTANDING DATA IN DATA FRAMES

We will explore the data held in the data frame with the help of the following R functions:
➢ dim()
    • nrow()
    • ncol()
➢ str()
➢ summary()
➢ names()
➢ head()
➢ tail()
➢ edit()

dim() Function

The dim() function is used to obtain the dimensions of a data frame. It returns the number of
rows followed by the number of columns.
> dim(Employee)
[1] 5 4
The data frame, “Employee” has 5 rows and 4 columns.
➢ nrow() Function
The nrow() function returns the number of rows in a data frame.
> nrow(Employee)
[1] 5
The data frame, “Employee” has 5 rows.
➢ ncol() Function
The ncol() function returns the number of columns in a data frame.
> ncol(Employee)



[1] 4
The data frame, “Employee” has 4 columns.

str() Function

The str() function compactly displays the internal structure of R objects. We will use it to display
the internal structure of the dataset, “Employee”.
> str(Employee)
'data.frame':   5 obs. of  4 variables:
 $ EmpNo      : num  1000 1001 1002 1003 1004
 $ EmpName    : Factor w/ 5 levels "Dave","Jack",..: 2 3 5 4 1
 $ ProjName   : Factor w/ 5 levels "P01","P02","P03",..: 1 2 3 4 5
 $ EmpExpYears: num  5 9 6 12 7

summary() Function

We will use the summary() function to return result summaries for each column of the dataset.
> summary(Employee)
     EmpNo            EmpName  ProjName  EmpExpYears
 Min.   :1000   Dave      :1   P01:1     Min.   : 5.0
 1st Qu.:1001   Jack      :1   P02:1     1st Qu.: 6.0
 Median :1002   Jane      :1   P03:1     Median : 7.0
 Mean   :1002   Joe       :1   P04:1     Mean   : 7.8
 3rd Qu.:1003   Margaritta:1   P05:1     3rd Qu.: 9.0
 Max.   :1004                            Max.   :12.0

names() Function

The names() function returns the names of an object. We will use it to return the column
headers of the dataset, “Employee”.
> names(Employee)
[1] "EmpNo"       "EmpName"     "ProjName"    "EmpExpYears"
In the example, names(Employee) returns the column headers of the dataset “Employee”, whereas
the str() function described earlier returns the basic structure of the dataset and so provides
an overall view of it.

head() Function

The head()function is used to obtain the first n observations where n is set as 6 by default.
Examples
1. In this example, the value of n is set as 3 and hence, the resulting output would contain
the first 3 observations of the dataset.
> head(Employee, n=3)
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6



2. Let x be the total number of observations. If a negative value is given for n in the
head() function, the output consists of the first x + n observations, i.e. all but the last
|n| observations. In this example, x = 5 and n = -2, so the number of observations returned
is x + n = 5 + (-2) = 3.
> head(Employee, n=-2)
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6

tail() Function

The tail()function is used to obtain the last n observations where n is set as 6 by default.
> tail(Employee, n=3)
EmpNo EmpName ProjName EmpExpYears
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
Example
Consider an example where the value of n is negative. Here x = 5 and n = -2. When a
negative input is given to the tail() function, it returns the last x + n observations, i.e.
all but the first |n| observations.
The example given as follows returns the last 3 records from the dataset, “Employee”.
> tail(Employee, n=-2)
EmpNo EmpName ProjName EmpExpYears
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7

edit() Function

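The edit() function invokes an editor on an R object; for a data frame, this is a spreadsheet-like
data editor. We will use the edit() function to open the dataset, “Employee”, in the editor. Note
that edit() returns the edited copy and leaves the original object unchanged, so the result must
be assigned to keep the changes, e.g. Employee <- edit(Employee).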



LOAD DATA FRAMES

Let us look at how R can load data into data frames from external files.

➢ Reading from a .csv (comma separated values file)

We have created a .csv file by the name, “item.csv” in the D:\ drive. It has the following
content:
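The file content and the command that loads it are not reproduced in these notes. The sketch below reconstructs the three rows from the subsetting and merging outputs shown later in this section, and writes them to a temporary file rather than D:\ so that it runs on any system; the temporary path is an assumption made for illustration.

```r
# Write a small CSV file; the rows are reconstructed from later outputs.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Itemcode,ItemCategory,ItemPrice",
  "I1001,Electronics,700",
  "I1002,Desktop supplies,300",
  "I1003,Office supplies,350"
), csv_path)

# read.csv() treats the first line as the header and the comma as separator.
ItemDataFrame <- read.csv(csv_path)
print(ItemDataFrame)
```

With the file stored in the D:\ drive as described, the equivalent call would be read.csv("d:/item.csv").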

➢ Subsetting Data Frame


To subset the data frame and display the details of only those items whose price is greater than or
equal to 350.
> subset(ItemDataFrame, ItemPrice >=350)
Itemcode ItemCategory ItemPrice
1 I1001 Electronics 700
3 I1003 Office supplies 350
To subset the data frame and display only the category to which the items belong (items whose
price is greater than or equal to 350).
> subset(ItemDataFrame,ItemPrice >=350, select = c(ItemCategory))
ItemCategory
1 Electronics
3 Office supplies
To subset the data frame and display only the items where the category is either “Office
supplies” or “Desktop supplies”.
> subset(ItemDataFrame, ItemCategory == "Office supplies" | ItemCategory == "Desktop supplies")
Itemcode ItemCategory ItemPrice
2 I1002 Desktop supplies 300
3 I1003 Office supplies 350

➢ Reading from a Tab Separated Value File


For any file that uses a delimiter other than a comma, one can use the read.table command.
Example
We have created a tab separated file by the name, “item-tab-sep.txt” in the D:\ drive. It has the
following content.
Itemcode  ItemQtyOnHand  ItemReorderLvl
I1001     70             25
I1002     30             25
I1003     35             25
Let us load this file using the read.table function. We will read the content from the file but will
not store it in a data frame.
> read.table("d:/item-tab-sep.txt", sep = "\t")
        V1            V2             V3
1 Itemcode ItemQtyOnHand ItemReorderLvl
2    I1001            70             25
3    I1002            30             25
4    I1003            35             25
Notice the use of V1, V2 and V3 as column headings. It means that our specified column names,
“Itemcode”, “ItemQtyOnHand” and “ItemReorderLvl”, are not considered. In other words, the first
line is not automatically treated as a column header.
Let us modify the syntax so that the first line is treated as a column header.
> read.table("d:/item-tab-sep.txt", sep = "\t", header = TRUE)
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            70             25
2    I1002            30             25
3    I1003            35             25
Now let us read the content of the specified file into the data frame, “ItemDataFrame”.
> ItemDataFrame <- read.table("d:/item-tab-sep.txt", sep = "\t", header = TRUE)
> ItemDataFrame
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            70             25
2    I1002            30             25
3    I1003            35             25
➢ Reading from a Table
A data table can reside in a text file. The cells inside the table are separated by blank characters.
An example of a table with 4 rows and 3 columns is given as follows:
1001 Physics 85
2001 Chemistry 87
3001 Mathematics 93
4001 English 84



Copy and paste the table into a file named “d:/mydata.txt” with a text editor and then load the
data into the workspace with the function read.table.
> mydata <- read.table("d:/mydata.txt")
> mydata
V1 V2 V3
1 1001 Physics 85
2 2001 Chemistry 87
3 3001 Mathematics 93
4 4001 English 84

➢ Merging Data Frames


Let us now attempt to merge two data frames using the merge function. The merge function
takes an x frame (item.csv) and a y frame (item-tab-sep.txt) as arguments. By default, it joins the
two frames on columns with the same name (the two “Itemcode” columns).
> csvitem <- read.csv("d:/item.csv")
> tabitem <- read.table("d:/item-tab-sep.txt", sep = "\t", header = TRUE)
> merge(x = csvitem, y = tabitem)
Itemcode ItemCategory ItemPrice ItemQtyOnHand ItemReorderLvl
1 I1001 Electronics 700 70 25
2 I1002 Desktop supplies 300 30 25
3 I1003 Office supplies 350 35 25
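The default join above keeps only the item codes present in both frames. As a hedged sketch of how the join column and the join type can be made explicit, the example below builds the two frames inline instead of reading them from disk; the item “I1004” is a hypothetical row added here only to show what happens when the frames do not match exactly.

```r
csvitem <- data.frame(
  Itemcode     = c("I1001", "I1002", "I1003"),
  ItemCategory = c("Electronics", "Desktop supplies", "Office supplies"),
  ItemPrice    = c(700, 300, 350)
)
# Hypothetical stock data: I1003 is missing and I1004 is extra.
tabitem <- data.frame(
  Itemcode       = c("I1001", "I1002", "I1004"),
  ItemQtyOnHand  = c(70, 30, 10),
  ItemReorderLvl = c(25, 25, 25)
)

# Inner join: name the key column explicitly with `by`.
inner <- merge(csvitem, tabitem, by = "Itemcode")

# Left join: all.x = TRUE keeps every row of x and fills gaps with NA.
left <- merge(csvitem, tabitem, by = "Itemcode", all.x = TRUE)

print(inner)
print(left)
```

Naming the key with `by` guards against accidental joins when the two frames happen to share other column names.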

EXPLORING DATA

Data in R is a set of organized information. Statistical data, the most common kind in R, is a
set of observations, each of which records values for one or more variables. These variables
are used in measuring, controlling or manipulating the results of a program. Each variable
differs in size and type.
R supports the following basic data types to explore:
➢ Integer
➢ Numeric
➢ Logical
➢ Character/string
➢ Factor
➢ Complex
Based on the specific data characteristics in R, data can be explored in different ways.

❖ Exploratory Data Analysis


Exploratory data analysis (EDA) is an approach used to summarize the main characteristics of a
dataset, often in the form of visual representations. It differs from initial data analysis.
EDA focuses on:
• Exploring data by understanding its structure and variables
• Developing an intuition about the dataset
• Considering how the dataset came into existence
• Deciding how to investigate by providing a formal statistical method



• Extending better insights about the dataset
• Formulating a hypothesis that leads to new data collection
• Handling any missing values
• Investigating with more formal statistical methods.

Some of the graphical techniques used by EDA are:


• Box plot
• Histogram
• Scatter plot
• Run chart
• Bar chart
• Density plots
• Pareto chart

DATA SUMMARY
Data summary in R can be obtained by using various R functions.

We will execute a few R functions on the dataset, “Employee”. Let us begin by displaying the
contents of the dataset, “Employee”.
> Employee
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
The summary(Employee[4]) function works on the fourth column, “EmpExpYears”, and
computes the minimum, 1st quartile, median, mean, 3rd quartile and maximum of its values.
> summary(Employee[4])
  EmpExpYears
 Min.   : 5.0
 1st Qu.: 6.0
 Median : 7.0
 Mean   : 7.8
 3rd Qu.: 9.0
 Max.   :12.0
The min(Employee[4]) function works on the fourth column, “EmpExpYears” and determines
the minimum value for this column.



> min(Employee[4])
[1] 5
The max(Employee[4]) function works on the fourth column, “EmpExpYears” and determines
the maximum value for this column.
> max(Employee[4])
[1] 12
The range(Employee[4]) function works on the fourth column, “EmpExpYears” and determines
the range of values for this column.
> range(Employee[4])
[1] 5 12
The Employee[,4] command at the R prompt displays the value of the column,“EmpExpYears” .
> Employee[,4]
[1] 5 9 6 12 7
The mean(Employee[,4]) function works on the fourth column, “EmpExpYears” and determines
the mean value for this column.
> mean (Employee[,4])
[1] 7.8
The median(Employee[,4]) function works on the fourth column, “EmpExpYears” and
determines the median value for this column.
> median(Employee[,4])
[1] 7
The mad(Employee[,4]) function returns the median absolute deviation value.
> mad (Employee[,4])
[1] 2.9652
The IQR(Employee[,4]) function returns the interquartile range.
> IQR (Employee[,4])
[1] 3
The quantile(Employee[,4]) function returns the quantile values for the column,“EmpExpYears”.
> quantile(Employee[,4])
0% 25% 50% 75% 100%
5 6 7 9 12
The sapply() function applies a given function to every column of a data frame (or every element
of a list) and simplifies the result. It is a convenient way to obtain descriptive statistics such
as mean, var, min, max, sd, quantile and range for all columns at once.
For a data frame of numeric columns, here called sampledata as a placeholder, the column means
are found using:
sapply(sampledata, mean, na.rm = TRUE)

Similarly, other functions (such as min, max, range and quantile) can be used with the
sapply() function to obtain the desired output. Consider the same data frame, Employee.



> sapply(Employee[4], max)
EmpExpYears
         12
> sapply(Employee[4], mean)
EmpExpYears
        7.8
> sapply(Employee[4], range)
     EmpExpYears
[1,]           5
[2,]          12
> sapply(Employee[4], min)
EmpExpYears
          5
> sapply(Employee[4], quantile)
     EmpExpYears
0%             5
25%            6
50%            7
75%            9
100%          12

> which.min(Employee$EmpExpYears)
[1] 1
At position 1, is the employee with the minimum years of experience, “5”.
> which.max(Employee$EmpExpYears)
[1] 4
At position 4, is the employee with the maximum years of experience, “12”.

Data can also be grouped by one or more variables before each group is summarized. Three
common ways to do this are listed below.
• ddply() requires the “plyr” package
• summaryBy() requires the “doBy” package
• aggregate() ships with R, so no extra package is needed.
A simple code to explain the ddply() function is:
data <- read.table(header = TRUE, text = 'no sex before after change
1 M 54.2 5.2 -9.2
2 F 63.2 61.0 1.1
3 F 52 24.5 3.5
4 F 25 55 2.5
.
.
.
.
.
.
.
54 M 54 45 1.2'
)
When applying the ddply() function to the above input,
library(plyr)
# Compute the length, standard deviation and mean of "change" for each group of inputs.
# Break down by the values of "no".
cdata <- ddply(data, c("no"), summarise,
               N = length(change),
               sd = sd(change),
               mean = mean(change)
)
cdata
Output of the ddply()function is:
> no N sd mean
1 5 4.02 2.3
2 14 5.5 2.1
3 4 2.1 1.0
.
.
54 9 2.0 0.9
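Of the three grouping options listed above, aggregate() needs no extra package. The sketch below shows the same group-then-summarize idea with aggregate(); the six rows of data are made up for illustration and are not the elided dataset above, and grouping by “sex” rather than “no” is a choice made here so that each group has more than one row.

```r
# Illustrative data only: six made-up observations.
data <- data.frame(
  sex    = c("M", "F", "F", "F", "M", "M"),
  change = c(-9.2, 1.1, 3.5, 2.5, 1.2, 0.4)
)

# The formula reads "summarize change for each value of sex".
means  <- aggregate(change ~ sex, data = data, FUN = mean)
counts <- aggregate(change ~ sex, data = data, FUN = length)
sds    <- aggregate(change ~ sex, data = data, FUN = sd)

print(means)
print(counts)
print(sds)
```

Each call returns a data frame with one row per group, which can be combined with merge() if a single summary table is wanted.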

FINDING THE MISSING VALUES

In R, missing data is indicated as NA in the dataset, where NA refers to “Not Available”.


It is neither a string nor a numeric value, but it is used to specify the missing data. Input
vectors can be created with the missing values as follows:
x <- c(2,5,86,9,NA,45,3)
y <- c(“red”,NA,“NA”)
In this case, x contains numeric values as the input. Here, NA can be used to avoid any errors or
other numeric exceptions like infinity. In the second example, y contains the string values as the
input. Here, the third value is a string ‘NA’ and the second value NA is a missing value. The
function is.na()is used in R to identify the missing values.
This function returns a Boolean value as either TRUE or FALSE.
> is.na(x)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
> is.na(y)
[1] FALSE TRUE FALSE
The is.na() function is used to find missing values, and its replacement form,
is.na(x) <- value, is used to create them. The na.action setting provides options for treating
the missing data.
Possible na.action settings include:
• na.omit, na.exclude: These functions return the object with the observations that contain
missing values removed.
• na.pass: This function returns the object unchanged, even if it contains missing values.
• na.fail: This function returns the object only if it has no missing values; otherwise it
signals an error.
To inspect the current na.action setting, use getOption("na.action"); to change it, use
options(na.action = "na.omit") or another of the settings above.
Examples
1. Populating the matrix with sample input values as follows:
> c <- as.data.frame (matrix(c(1:5,NA),ncol=2))
>c
V1 V2
1 1 4
2 2 5
3 3 NA



na.omit(c) drops every row that contains a missing value and returns the remaining rows.
> na.omit(c)
V1 V2
1 1 4
2 2 5

2. na.exclude(c) excludes the missing values and returns the object. A slight difference can
be found in some residual and prediction functions.
> na.exclude(c)
V1 V2
1 1 4
2 2 5

3. na.pass(c)returns the object unchanged along with the missing values.


> na.pass(c)
V1 V2
1 1 4
2 2 5
3 3 NA

4. na.fail(c)returns an error when a missing value is found. It returns an object only when
there is no missing value.
> na.fail(c)
Error in na.fail.default(c) : missing value in object
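Before choosing one of the treatments above, it often helps to see how many values are missing and where. The following is a minimal sketch using only base R functions; the data frame is the same 3 x 2 example, renamed df here (an illustrative name) to avoid masking the c() function.

```r
# Same sample data as above, stored under a name that does not mask c().
df <- as.data.frame(matrix(c(1:5, NA), ncol = 2))

# Number of missing values in each column.
na_per_column <- colSums(is.na(df))

# TRUE for rows that have no missing value in any column.
complete_rows <- complete.cases(df)

# Many summary functions can skip missing values with na.rm = TRUE.
v2_mean <- mean(df$V2, na.rm = TRUE)

print(na_per_column)
print(complete_rows)
print(v2_mean)
```

Indexing with df[complete_rows, ] keeps the same rows that na.omit(df) would keep.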

INVALID VALUES AND OUTLIERS

In R, special checks are conducted for handling invalid values. An invalid value can be NA,
NaN, Inf or -Inf. Functions for these invalid values include anyNA(x), anyInvalid(x) and
is.invalid(x), where the value of x can be a vector, matrix or array. Of these, only anyNA() is
part of base R; anyInvalid() and is.invalid() are not base R functions and are assumed here to
come from an add-on package or to be user-defined. The anyNA function returns TRUE if the
input has any NA or NaN values; otherwise, it returns FALSE. This function is equivalent to
any(is.na(x)).
The anyInvalid function returns TRUE if the input has any invalid values; otherwise, it returns
FALSE. This function is equivalent to any(is.invalid(x)).
Unlike the other two functions, is.invalid returns one logical value for each input element:
TRUE if the element is invalid and FALSE otherwise. This function is equivalent to
(is.na(x) | is.infinite(x)).
A few examples with related base R functions are:
> anyNA(c(-9, NaN, 9))
[1] TRUE
> is.finite(c(-9, Inf, 9))
[1]  TRUE FALSE  TRUE
> is.infinite(c(-9, Inf, 9))
[1] FALSE  TRUE FALSE
> is.nan(c(-9, Inf, 9))
[1] FALSE FALSE FALSE
> is.nan(c(-9, Inf, NaN))
[1] FALSE FALSE  TRUE
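Because is.invalid() and anyInvalid() are not available in base R, equivalent helpers can be written from the equivalences stated above. The sketch below does this; the names is_invalid and any_invalid are hypothetical and chosen here only for illustration.

```r
# Hypothetical helper: TRUE for NA, NaN, Inf and -Inf; FALSE otherwise.
is_invalid <- function(x) is.na(x) | is.infinite(x)

# Hypothetical helper: TRUE if at least one element is invalid.
any_invalid <- function(x) any(is_invalid(x))

x <- c(-9, NaN, Inf, NA, 9)
flags <- is_invalid(x)

print(flags)
print(any_invalid(x))
print(any_invalid(c(1, 2, 3)))
```

For numeric input, !is.finite(x) gives the same flags in a single call.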

The basic idea of invalid values and outliers can be explained with a simple example.
Obtain the minimum, 1st quartile, median, mean, 3rd quartile and maximum values using the
summary() function:
> summary(custdata$income)
# returns the minimum, maximum, mean, median and quartile values of 'income' from the
# 'custdata' input values.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
> summary(custdata$age)
# returns the minimum, maximum, mean, median and quartile values of 'age' from the
# 'custdata' input values.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    38.0    50.0    51.7    64.0   146.7

The above two outputs illustrate invalid values and outliers. In the first output, one of the
values of ‘income’ is negative (-8700). A negative income may simply be bad data, or it may
carry a special meaning, such as an indicator of debt. Either way, such negative values need
to be treated deliberately.
A decision is required on how to handle these inputs, i.e. whether to drop the records with
negative income or to convert the negative income to zero.
In the second output, one value of ‘age’ is zero and another is greater than 120, which is
considered an outlier. Both fall outside the expected range of the data. Outliers are often
errors in the input data, but not always: an age of ‘0’ could mean that the age is unknown or
that the customer never disclosed it, while an age above 120 could be a data entry error or,
far less likely, an unusually long-lived customer.
In short, a zero or negative value in the age field could be a sentinel value, and an outlier
could be erroneous data, unusual data or a sentinel value. When a field lacks a proper input,
an action is required to handle the scenario, i.e. whether to drop the field, drop the record
or convert the improper data.
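One of the treatments discussed above, converting improper values to missing values, can be sketched as follows. The custdata data frame here is a made-up five-row stand-in built from the summary values shown earlier, not the real dataset; the cut-off of 120 years comes from the text, and the column names income_clean and age_clean are illustrative.

```r
# Made-up stand-in for custdata (the real dataset is not available here).
custdata <- data.frame(
  income = c(-8700, 14600, 35000, 67000, 615000),
  age    = c(0, 38, 50, 64, 146.7)
)

# Treat negative income as missing rather than silently keeping it.
custdata$income_clean <- ifelse(custdata$income < 0, NA, custdata$income)

# Treat age 0 (possible sentinel) and age above 120 (outlier) as missing.
custdata$age_clean <- ifelse(custdata$age <= 0 | custdata$age > 120,
                             NA, custdata$age)

# The summaries now report the number of NA values alongside the statistics.
print(summary(custdata[c("income_clean", "age_clean")]))
```

Keeping the original columns next to the cleaned ones makes it possible to review which records were changed before any are dropped.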

DESCRIPTIVE STATISTICS
1. Data Range
Data range in R helps in identifying the spread of the input data. The data range of an
observation variable is the difference between the largest and the smallest data value in a
dataset, i.e. Range = Largest value – Smallest value.
For example, the range of rainfall durations can be computed as
# Extract the rainfall durations (column 'rainfall' of the data frame 'time').
> duration = time$rainfall
# Apply the max and min functions to return the range
> max(duration) - min(duration)
This sample code returns the range of the durations by subtracting the minimum value from the
maximum value. Note that the base R function range() returns the minimum and maximum as a
pair rather than their difference. A wide range indicates considerable variation in the
duration of rainfall, which matters when the durations are used to predict the probability of
rainfall lasting a given time.
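The calculation above can be tried on a small vector. The durations below are hypothetical values in minutes, used only to show that max() - min() and diff(range()) give the same result.

```r
# Hypothetical rainfall durations in minutes.
duration <- c(12, 45, 30, 8, 60)

# Range as defined above: largest value minus smallest value.
data_range <- max(duration) - min(duration)

# range() returns c(min, max); diff() turns that pair into the difference.
min_max   <- range(duration)
same_range <- diff(min_max)

print(min_max)
print(data_range)
```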
2. Frequencies and Mode
Frequency
Frequency is a count of how often each value occurs across a collection of non-overlapping
categories. In base R, the table() function can be used to find the frequency distribution of a
vector. In the example given, consider the built-in mtcars dataset; the frequency distribution
of the gear variable is the number of cars with each number of gears.
> head(subset(mtcars, select = 'gear'))
gear
Mazda RX4 4
Mazda RX4 Wag 4
Datsun 710 4
Hornet 4 Drive 3
Hornet Sportabout 3
Valiant 3
> factor(mtcars$gear)
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5
> w = table(mtcars$gear)
>w
3 4 5
15 12 5
> t = as.data.frame(w)
>t
Var1 Freq
1 3 15
2 4 12
3 5 5
> names(t)[1] = "gear"
>t
gear Freq
1 3 15
2 4 12
3 5 5
The cbind() function can be used to display the result in column format.
> w
 3  4  5
15 12  5
> cbind(w)
   w
3 15
4 12
5  5

Mode

The mode is related to frequency: it is the value that occurs the highest number of times in a
dataset. The mode can be computed for both numeric and character data.
R does not have a standard inbuilt function to calculate the statistical mode of the given
inputs (the base function mode() returns the storage mode of an object instead). Hence, a
user-defined function is required to calculate the mode in R. Here, the input is a vector and
the output is the mode value.
A sample code to return the mode value is
# Create the function
getmode <- function(y) {
  uniqy <- unique(y)
  uniqy[which.max(tabulate(match(y, uniqy)))]
}
# Define the input vector values
v <- c(5, 6, 4, 8, 5, 7, 4, 6, 5, 8, 3, 2, 1)
# Calculate the mode with the user-defined function
resultmode <- getmode(v)
print(resultmode)
# Define characters as input vector values
charv <- c("as", "is", "is", "it", "in")
# Calculate the mode using the user-defined function
resultmode <- getmode(charv)
print(resultmode)
Executing the above code will give the result as:
[1] 5
[1] "is"



3. Mean and Median

Statistical data in R is analyzed using inbuilt functions. These inbuilt functions are found in the
base R package. The functions take vector values as input with arguments and produce the
output.
Mean
The mean is the sum of the input values divided by the number of inputs. It is also called the
average of the input values. In R, the mean is calculated by an inbuilt function: mean()
returns the mean value.
Basic syntax for the mean() function in R is:
mean(x, trim = 0, na.rm = FALSE, ...)
where x is the input vector, trim is the fraction (0 to 0.5) of observations to drop from each
end of the sorted input vector before the mean is computed, and na.rm = TRUE removes the
missing values from the input vector.
Example 1
A sample code to calculate the mean in R is
#Define a vector
x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
# Find the mean of the vector inputs
result.mean <- mean(x)
print(result.mean)
Output
[1] 12.65
When the trim parameter is supplied, the vector values are first sorted and then a fraction
trim of the values is dropped from each end before the mean is calculated. The number of values
dropped from each end is floor(n × trim), where n is the number of inputs. Say trim = 0.4: with
10 values, 4 values are dropped from each end of the sorted vector. With the above sample, the
vector values (15,54,6,5,9.2,36,5.3,8,-7,-5) are sorted to (-7,-5,5,5.3,6,8,9.2,15,36,54) and
4 values are removed from each end, i.e. (-7,-5,5,5.3) from the left and (9.2,15,36,54) from
the right, leaving (6,8) with a mean of 7.
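The trimming steps described above can be checked by hand. The sketch below reproduces the trim = 0.4 case step by step and compares it with the built-in result; the intermediate names (k, sorted_x, kept) are illustrative.

```r
x <- c(15, 54, 6, 5, 9.2, 36, 5.3, 8, -7, -5)
trim <- 0.4

# Number of values to drop from each end of the sorted vector.
k <- floor(length(x) * trim)

# Sort, then keep only the middle values.
sorted_x <- sort(x)
kept <- sorted_x[(k + 1):(length(x) - k)]

# The mean of the kept values should match mean(x, trim = 0.4).
manual  <- mean(kept)
builtin <- mean(x, trim = 0.4)

print(kept)
print(manual)
print(builtin)
```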
Example 2
#Define a vector
x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
# Find the mean of the vector inputs
result.mean <- mean(x, trim =0.3)
print(result.mean)
Output
[1] 7.125
With trim = 0.3, three values are dropped from each end of the sorted vector, and the mean of
the remaining values (5.3, 6, 8, 9.2) is 7.125.
Example 3
In case of any missing value, the mean() function would return NA. In order to overcome such
cases, na.rm = TRUE is used to remove the NA values from the list for calculating the mean in
R.
# Define a vector
x <- c(15, 54, 6, 5, 9.2, 36, 5.3, 8, -7, -5, NA)
# Find the mean of the vector inputs
result.mean <- mean(x)
print(result.mean)
# Drop the NA values before finding the mean
result.mean <- mean(x, na.rm = TRUE)
print(result.mean)
Output
[1] NA
[1] 12.65
Example 4
Objective: To determine the mean of a set of numbers. To plot the numbers in a barplot and have
a straight line run through the plot at the mean.
Step 1: To create a vector, “numbers”.
> numbers <-c(1, 3, 5, 2, 8, 7, 9, 10)
Step 2: To compute the mean value of the set of numbers contained in the
vector,“numbers”.
> mean (numbers)
[1] 5.625
Outcome: The mean value for the vector, “numbers” is computed as 5.625.
Step 3: To plot a bar plot using the vector, “numbers”.
> barplot(numbers)
Step 4: Use the abline function to have a straight (horizontal) line run through the bar
plot at the mean value.
The abline function can take an h parameter with a value to draw a horizontal line or a v
parameter for a vertical line. When it is called, it adds the line to the existing plot. Draw a
horizontal line across the plot at the mean.
> abline(h = mean(numbers))



Outcome: A straight line at the computed mean value (5.625) runs through the bar plot
computed on the vector, “numbers”.

Median
Median is the middle value of the given inputs. In R, the median can be found using the median()
function. Basic syntax for calculating the median in R is
median(x, na.rm=FALSE)
where,x is the input vector value and na.rm removes the missing values in the input vector.
Example 1
A sample to find out the median value of the input vector in R is
#Define a vector
x<- c(15,54,6,5,9.2,36,5.3,8,-7,-5)
# Find the median value
median.result <-median(x)
print(median.result)
Output
[1] 7
Example 2
Objective: To determine the median of a set of numbers. To plot the numbers in a bar plot
and have a straight line run through the plot at the median.
Step 1: To create a vector, “numbers”.
> numbers <- c(1, 3, 5, 2, 8, 7, 9, 10)
Step 2: To compute the median value of the set of numbers contained in the
vector,“numbers”.
> median(numbers)
[1] 6
Step 3: To plot a bar plot using the vector, “numbers”. Use the abline function to have a
straight line (horizontal line) run through the bar plot at the median.
> barplot (numbers)
> abline (h = median (numbers))

Outcome: A straight line at the computed median value (6.0) runs through the bar plot computed
on the vector, “numbers”.
4. Standard Deviation
Objective: To determine the standard deviation.
To plot the numbers in a bar plot and have a straight line run through the plot at the mean and
another straight line run through the plot at mean + standard deviation.
Step 1: To create a vector, “numbers”.
> numbers <- c(1,3,5,2,8,7,9,10)
Step 2: To compute the mean value of the set of numbers contained in the vector,
“numbers”.



> mean(numbers)
[1] 5.625
Step 3: To determine the standard deviation of the set of numbers held in the vector,
“numbers”.
> deviation <- sd(numbers)
> deviation
[1] 3.377975
Step 4: To plot a bar plot using the vector, “numbers”.
> barplot (numbers)
Step 5: Use the abline function to have a straight (horizontal) line run through the bar
plot at the mean value (5.625) and another straight line run through the bar plot at mean
value + standard deviation (5.625 + 3.377975 = 9.002975).
> abline(h = mean(numbers))
> abline(h = mean(numbers) + sd(numbers))

5. Mode
Objective: To determine the mode of a set of numbers.
R does not have a standard inbuilt function to determine the mode. We will write our own
“Mode” function. This function will take a vector as the input and return the mode as the
output value.
Step 1: Create a user-defined function, “Mode”.
> Mode <-function(v) {
+ UniqValue <- unique(v)
+ UniqValue[which.max(tabulate(match(v, UniqValue)))]
+}
While writing the above function, “Mode”, we have used three other functions provided by R,
viz. “unique”, “tabulate” and “match”.
unique function: Takes a vector as the input and returns the vector with the duplicates
removed.
>v
[1] 2 1 2 3 1 2 3 4 1 5 5 3 2 3
> unique(v)
[1] 2 1 3 4 5
match function: Takes two vectors as the input and returns, for each element of the first
argument, the position of its (first) match in the second argument.
>v
[1] 2 1 2 3 1 2 3 4 1 5 5 3 2 3
> UniqValue <- unique(v)
> UniqValue
[1] 2 1 3 4 5
> match(v,UniqValue)



[1] 1 2 1 3 2 1 3 4 2 5 5 3 1 3
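tabulate function: Takes an integer-valued vector as the input and counts the number of times
each integer occurs in it.
> tabulate(match(v, UniqValue))
[1] 4 3 4 1 2
Going by our example, “2” occurs four times, “1” occurs three times, “3” occurs four times, “4”
occurs one time and “5” occurs two times. Both “2” and “3” occur four times; which.max returns
the position of the first maximum, so the “Mode” function reports the tied value that appears
first in the vector, i.e. “2”.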
Step 2: Create a vector, “v”.
> v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
Step 3: Call the function, “Mode” and pass the vector, “v” to it.
> Output <- Mode(v)
Step 4: Print out the mode value of the vector, “v”.
> print(Output)
[1] 2
Let us pass a character vector, “charv” to the “Mode” function.
Step 1: Create a character vector, “charv”.
> charv <- c("o", "it", "the", "it", "it")
Step 2: Call the function, “Mode” and pass the character vector, “charv” to it.
> Output <- Mode(charv)
Step 3: Print out the mode value of the vector, “charv”.
> print(Output)
[1] "it"

SPOTTING PROBLEMS IN DATA WITH VISUALISATION

For a better understanding of input data, pictures or charts are often preferred over text.
Visualization engages the audience well, whereas a table of numerical values, by comparison,
cannot represent a big dataset in an engaging manner.
The use of graphical representation to examine the given set of data is called visualization.
With visualization, it is easier to:
• Determine the peak value of the age of the customers (maximum value)
• Estimate the existence of sub-populations
• Determine the outlier values.
A good graphical representation displays as much of the available information as possible, from
the lowest to the highest value, and presents it with greater clarity. To use visualization
well, the right aspect ratio and scaling of the data are needed.

1) Visually Checking Distributions for a Single Variable

With R visualization, one can answer the following questions:
• What is the peak value of the distribution?
• How many peaks are there in a distribution? (Basically bimodality vs unimodality)
• Is it normal data or lognormal data?
• How does the given data vary?
• Is the given data concentrated in a certain interval or category?
Generally, a visual representation of data helps to grasp the shape of the data distribution.
Summary statistics, by contrast, implicitly assume that the data is more or less close to a normal distribution.

Figure 4.2 shows a unimodal distribution: the grey curve has only one peak and is close to a normal
distribution. The figure also presents the values in a more visually understandable way. The mean customer
age is about 51.7, which is nearly 52, and 50% of the customers fall in the age group of
38 to 64 years. From these statistics, it can be concluded that the typical customer is a middle-aged
person in the age range of 38–64 years.

The additional black curve in Figure 4.2 shows a bimodal distribution. Usually, if a distribution
contains more than one peak, it is considered multimodal. The black curve has the
same mean age as the grey curve. However, it is concentrated on two
sub-populations: younger customers aged between 20 and 30, and older customers aged above 70.
These two sub-populations behave differently, and the probability that a
customer has health insurance also differs between them. In such a case, a logistic regression or
linear regression may fail to represent the situation. To detect this problem before modelling,
a density plot or a histogram can be used to examine the
distribution of the input values. A histogram is simpler to read
than a density plot and is often the preferred way of presenting findings from
quantitative analysis.
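To illustrate why summary statistics alone can mislead, the following sketch simulates two samples with roughly the same mean, one unimodal and one bimodal, and overlays their density curves. The data is simulated for illustration only (the variable names and the parameters passed to rnorm are assumptions); it is not the customer data shown in Figure 4.2.

```r
# Simulated ages, for illustration only (not the customer data of Figure 4.2)
set.seed(1)
unimodal <- rnorm(1000, mean = 50, sd = 15)
bimodal  <- c(rnorm(500, mean = 25, sd = 4),   # younger sub-population
              rnorm(500, mean = 75, sd = 4))   # older sub-population

# The means are close to each other, so the mean alone cannot tell the samples apart
print(mean(unimodal))
print(mean(bimodal))

# The density curves reveal the difference: one peak versus two peaks
plot(density(bimodal), col = "black", main = "Unimodal vs bimodal", xlab = "Age")
lines(density(unimodal), col = "grey")
```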

1) Histograms

A histogram is a graphical illustration of the distribution of numerical data in successive
numerical intervals of equal size. It looks similar to a bar graph. However, values are grouped
into continuous ranges in a histogram. The height of a histogram bar represents the number of
values occurring in a particular range.
R uses the hist(x) function to create simple histograms, where x is a numeric vector of the values to be plotted.
The basic syntax to create a histogram using R is:
hist(v,main,xlab,xlim,ylim,breaks,col,border)
where v is a vector that contains numeric values, ‘main’ is the main title of the histogram, xlab
is the label of the X-axis, xlim specifies the range of values on the X-axis, ylim specifies the
range of values on the Y-axis, ‘breaks’ controls the bins (either a suggested number of bins or a
vector of break points, which fixes the width of each bar), ‘col’ sets the colour of the bars and
‘border’ sets the border colour of the bars.
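As a small sketch of how ‘breaks’ behaves, hist() can be called with plot = FALSE so that it returns the computed bins without drawing anything. A single number is treated by R only as a suggestion for the number of bins, whereas a vector of break points fixes the bin edges exactly. The vector H and the break points below are illustrative.

```r
H <- c(8, 13, 30, 5, 28)

# A single number is a suggestion: R picks "pretty" break points near it
suggested <- hist(H, breaks = 5, plot = FALSE)
print(suggested$breaks)
print(suggested$counts)

# A vector gives the exact bin edges (here three bins of width 10)
fixed <- hist(H, breaks = seq(0, 30, by = 10), plot = FALSE)
print(fixed$breaks)
print(fixed$counts)
```

Inspecting the breaks and counts in this way is a quick check that the bars drawn by hist() correspond to the intended intervals.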
Example 1
A simple histogram can be created by just providing the input vector; the other
parameters are optional.
# Create data for the histogram
h <- c(8,13,30,5,28)
# Create histogram for h
hist(h)
Example 2
A simple histogram can be saved to a file by providing the input vector, a file name, the label for
the X-axis “xlab” and the fill colour “col” as shown:
# Create data for the histogram
H <- c(8,13,30,5,28)
# Give a file name for the histogram
png(file = "samplehistogram.png")
# Create a sample histogram
hist(H, xlab="Categories", col="red")
# Save the sample histogram file
dev.off()
Executing the above code produces the output shown in Figure 4.3.

The bars are filled with the colour given by the ‘col’ parameter. A border colour for the bars can be set by
passing a value to the ‘border’ parameter.
> H <- c(8,13,30,5,28)
> hist(H, xlab="Categories", col="red")
Example 3
The parameters xlim and ylim specify the range of values shown on the X
and Y axes, and breaks suggests the number of bins, which in turn determines the width of each bar.
# Create data for the histogram
H <- c(8,13,30,5,28)
# Give a file name for the histogram
png(file = "samplelimhistogram.png")
# Create the histogram in samplelimhistogram.png
hist(H, xlab="Values", ylab="Colours", col="green", xlim=c(0,30),
ylim=c(0,5), breaks=5)
# Save the samplelimhistogram.png file
dev.off()
> H <- c(8,13,30,5,28)
> hist(H, xlab="Values", ylab="Colours", col="green", xlim=c(0,30), ylim=c(0,5), breaks=5)

2) Density Plots

A density plot can be thought of as a ‘continuous histogram’ of the given variable, scaled so that
the area under the curve is equal to 1. A point on the density plot therefore
corresponds to the fraction of the data (that is, the percentage of the data divided by 100) that takes
a particular value. This fraction is usually very small.
A density plot is an effective way to assess the distribution of a variable. It provides a better
reference for finding a matching parametric distribution. The basic syntax to create a density plot is
plot(density(x)), where x is a numeric vector.
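The statement that the area under the curve equals 1 can be checked numerically. The sketch below approximates the area under the estimated density with the trapezoid rule; the result should be very close to 1, although not exactly 1, because density() evaluates the curve on a finite grid.

```r
x <- c(0.0, 38.0, 50.0, 51.7, 64.0, 146.0)
d <- density(x)   # d$x holds the grid points, d$y the estimated density

# Trapezoid rule: width of each interval times the average of its two heights
widths  <- diff(d$x)
heights <- (head(d$y, -1) + tail(d$y, -1)) / 2
area    <- sum(widths * heights)
print(area)
```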

Example 1
A simple density plot can be created by just passing the values to density() and using the plot() function
(Figure 4.5).
# Create data for the density plot
h <- density(c(0.0, 38.0, 50.0, 51.7, 64.0, 146.0))
# Create density plot for h
plot(h)
> h <- density(c(0.0, 38.0, 50.0, 51.7, 64.0, 146.0))
> plot(h, xlab="Values", ylab="Density")

When the data range is very wide, the distribution is concentrated on one side of the plot.
In that case it is very difficult to determine the exact value at the peak.
Example 2
For non-negative data, another way to plot the distribution is
on a logarithmic scale, which is equivalent to drawing the density plot of
log10(input value).

In Figure 4.5 it is very hard to find the peak value of the distribution. Hence, in order
to simplify the visual representation, a log10 scale is used. In
Figure 4.6 the peak value of the income distribution is clearly visible at ~$40,000. For
widely spread data, this logarithmic approach can give a much clearer picture.
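A minimal sketch of the log10 approach, using base R graphics, is given below. The incomes are simulated from a lognormal distribution purely for illustration (the variable name income and the parameters passed to rlnorm are assumptions); they are not the data behind Figures 4.5 and 4.6.

```r
# Simulated, right-skewed incomes (illustrative only)
set.seed(42)
income <- rlnorm(1000, meanlog = log(40000), sdlog = 0.8)

# On the raw scale the mass is squeezed to the left of the plot
plot(density(income), main = "Income", xlab = "Income")

# On the log10 scale the peak is much easier to locate
d <- density(log10(income))
plot(d, main = "Income (log10 scale)", xlab = "log10(income)")

# Convert the location of the peak back to the original units
peak <- 10^d$x[which.max(d$y)]
print(peak)
```

Taking log10 requires strictly positive values, so zero incomes would have to be removed or shifted before this transformation is applied.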

Example 3
To display the dollar symbol on the axis labels, the labels=dollar parameter is passed. Hence, the
amount is displayed with the dollar $ symbol (Figure 4.7).
# custdata is assumed to be a data frame with a numeric income column
library(ggplot2)
library(scales)
ggplot(custdata) + geom_density(aes(x=income)) + scale_x_continuous(labels=dollar)

3) Bar Charts

A bar chart is a pictorial representation of statistical data. Both vertical and horizontal bars can
be drawn using R, which also provides an option to colour the bars in different colours. The length of
each bar is directly proportional to the value it represents. R uses the barplot() function to create a
bar chart. The basic syntax for creating a bar chart using R is
barplot(H, xlab, ylab, main, names.arg, col)

where H is a matrix or a vector that contains the numeric values used in the bar chart, xlab is the
label of the X-axis, ylab is the label of the Y-axis, main is the main title of the bar chart,
names.arg is the collection of names to appear under each bar and col is used to give colours to
the bars.
Some basic bar charts commonly used in R are:
• Simple bar chart
• Grouped bar chart
• Stacked bar chart
1. Simple Bar Chart
A simple bar chart is created by just providing the input values and a name to the bar chart. The
following code creates and saves a bar chart using the barplot() function in R.
Example 1
# Create data for the bar chart
H <- c(8,13,30,5,28)
# Give a file name for the bar chart
png(file = "samplebarchart.png")
# Plot bar chart using barplot() function
barplot(H)
# Save the file
dev.off()
> H <- c(8,13,30,5,28)
> barplot(H, xlab="Categories", ylab="Values", col="blue")
A bar chart can be drawn both vertically and horizontally. Labels for the X and Y axes can be given
with the xlab and ylab parameters. The col parameter sets the fill colour of the bars.
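The names.arg and main parameters listed in the syntax are not used in Example 1. The following sketch adds them and also writes each value above its bar, using the fact that barplot() returns the positions of the bar midpoints. The category names and the title are hypothetical.

```r
H <- c(8, 13, 30, 5, 28)
categories <- c("A", "B", "C", "D", "E")   # hypothetical names for the bars

# barplot() returns the x positions of the bar midpoints
mids <- barplot(H, names.arg = categories, main = "Sample bar chart",
                xlab = "Categories", ylab = "Values", col = "blue",
                ylim = c(0, max(H) * 1.2))  # leave room for the labels

# Write each value just above its bar
text(mids, H, labels = H, pos = 3)
```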
Example 2
The bar chart is drawn horizontally by setting the “horiz” parameter to TRUE. This can be
shown with a sample program as follows:
# Create data for the bar chart
H <- c(8,13,30,5,28)
# Give a file name for the bar chart
png(file = "samplebarchart.png")
# Plot a horizontal bar chart using barplot() function
barplot(H, horiz=TRUE)
# Save the file
dev.off()
> barplot(H, xlab="Values", ylab="Categories", col="blue", horiz=TRUE)

Here, when the “horiz” parameter is set to TRUE, the bar chart is displayed horizontally;
otherwise, it is displayed as the default vertical bar chart.

2. Group Bar Chart

A group bar chart is used to handle multiple series of inputs, which are supplied as a matrix. It
is created using the barplot() function with a matrix input and beside=TRUE.
Example
> colors <- c("green","orange","brown")
> months <- c("Mar","Apr","May","Jun","Jul")
> regions <- c("East","West","North")
> Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11),nrow=3,ncol= 5,byrow = TRUE)
> Values
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    9    3   11    9
[2,]    4    8    7    3   12
[3,]    5    2    8   10   11
> rownames(Values) <- regions
> rownames(Values)
[1] "East"  "West"  "North"
> Values
      [,1] [,2] [,3] [,4] [,5]
East     2    9    3   11    9
West     4    8    7    3   12
North    5    2    8   10   11
> colnames(Values) <- months
> Values
      Mar Apr May Jun Jul
East    2   9   3  11   9
West    4   8   7   3  12
North   5   2   8  10  11
> barplot(Values, col=colors, width=2, beside=TRUE, names.arg=months, main="Total Revenue 2015 by month")
> legend("topleft", regions, cex=0.6, bty="n", fill=colors)

Here the legend is placed at the top left of the bar chart.
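As an alternative to a separate legend() call, barplot() can build the legend itself: when legend.text is TRUE and the input is a matrix, the row names are used as the legend labels, and args.legend passes placement options on to legend(). The sketch below rebuilds the same matrix so that it can be run on its own.

```r
colors  <- c("green", "orange", "brown")
months  <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
Values  <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11), nrow = 3, ncol = 5,
                  byrow = TRUE, dimnames = list(regions, months))

# legend.text = TRUE takes the legend labels from the row names of Values
mp <- barplot(Values, col = colors, beside = TRUE,
              main = "Total Revenue 2015 by month",
              legend.text = TRUE,
              args.legend = list(x = "topleft", bty = "n", cex = 0.6))

# With beside = TRUE, mp holds one midpoint per bar (same shape as Values)
print(dim(mp))
```

Keeping the labels in the matrix dimnames means that the bars, the axis labels and the legend all come from one object, which reduces the chance of a mismatch.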

3. Stacked Bar Chart

A stacked bar chart is similar to a group bar chart in that it displays multiple series of
inputs. Instead of grouping the values side by side, the stacked bar chart stacks the values of each
category one on top of the other.
Example
> days <- c("Mon","Tues","Wed")
> months <- c("Jan","Feb","Mar","Apr","May")
> colours <- c("red","blue","green")
> val <- matrix(c(2,5,8,6,9,4,6,4,7,10,12,5,6,11,13), nrow =3, ncol=5, byrow =TRUE)
> barplot(val,main="Total",names.arg=months,xlab="Months",ylab="Days",col=colours)
> legend("topleft", days, cex=1.3,fill=colours)

In the figure, ‘Total’ is set as the main title of the stacked bar chart, with ‘Months’ as the
X-axis label and ‘Days’ as the Y-axis label.
The code legend("topleft", days, cex=1.3, fill=colours)
specifies that the legend is displayed at the top left
of the bar chart, with the colours filled accordingly.
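In a stacked bar chart the total height of each bar is the sum of the corresponding matrix column, so colSums() gives the heights to expect. The sketch below uses this to set ylim with some headroom so that the legend is less likely to overlap the bars; the factor of 1.3 is an arbitrary choice.

```r
days    <- c("Mon", "Tues", "Wed")
months  <- c("Jan", "Feb", "Mar", "Apr", "May")
colours <- c("red", "blue", "green")
val <- matrix(c(2,5,8,6,9,4,6,4,7,10,12,5,6,11,13), nrow = 3, ncol = 5, byrow = TRUE)

# Height of each stacked bar = sum of that month's column
totals <- colSums(val)
print(totals)

# Extend the Y-axis beyond the tallest bar to make room for the legend
barplot(val, main = "Total", names.arg = months, xlab = "Months", ylab = "Days",
        col = colours, ylim = c(0, max(totals) * 1.3))
legend("topleft", days, fill = colours)
```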
