Basic R
Review of Basic Data Analytic Methods and Exploratory Data Analysis
(Intro to a data analytic tool)
Outline
• Introduction to a data analytic tool
• Data import and export
• Attribute and data types
• Descriptive statistics
• Exploratory Data Analysis
Introduction to R
• Software for Statistical Data Analysis and Graphics
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software (under the GNU General Public License)
Introduction to R
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy,
and
• a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output
facilities.
Why R?
• It's free and under active development.
• Reproducible analysis.
• R can handle large datasets; its limit is set mainly by available memory.
– An Excel worksheet is limited to 1,048,576 rows and 16,384 columns.
• A better way to explore, present and interpret your data.
Obtaining R
• https://www.r-project.org
• https://cran.r-project.org
• CRAN is a network of ftp and web servers around
the world that store identical, up-to-date, versions of
code and documentation for R.
[Figure: RStudio window with four panes: scripts, workspace, plot area, and console]
RStudio
RStudio presents four window panes: scripts, workspace, plot area, and console.
RStudio
- Help on an R function can be obtained by entering ?lm at the console prompt. Alternatively, help(lm) could be entered at the console prompt.
- Functions such as edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be made in RStudio by selecting the appropriate variable from the workspace pane.
- R allows one to save the workspace environment (the objects it contains) into an .RData file using the save.image() function. An existing .RData file can be loaded using the load() function.
- Tools such as RStudio ask the user whether to save the workspace contents before exiting the GUI.
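The save-and-restore cycle described above might look like the following sketch. The file name example_workspace.RData and the variable x are made up for illustration, and the file is written to a temporary directory so nothing in the working directory is touched.

```r
# Help for lm() can be requested in either of these ways at the console:
# ?lm
# help(lm)

# Hypothetical file name, placed in a temporary directory.
ws_file <- file.path(tempdir(), "example_workspace.RData")

x <- c(1, 2, 3)              # an example variable
save.image(file = ws_file)   # write the objects in the global workspace to disk
load(ws_file)                # read the saved objects back into the session
```

Note that load() restores saved objects only; packages have to be attached again with library().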
# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/Chapter 3/yearly_sales.csv")
summary()
• The summary() function provides some descriptive statistics, such as the mean and median, for each data column.
• Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided.
• Because the gender column contains two possible characters, an "F" (female) or "M" (male), the summary() function provides the count of each character's occurrence.
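A small sketch of summary() follows. The data frame is a made-up stand-in for yearly_sales.csv, with column names taken from the slides. One assumption to flag: gender is created as a factor here, because in R 4.0 and later read.csv() keeps text columns as character unless stringsAsFactors = TRUE is passed, and summary() reports counts only for factors.

```r
# Made-up rows standing in for the yearly_sales.csv data.
sales <- data.frame(
  cust_id       = 100001:100006,
  sales_total   = c(800.64, 217.53, 74.58, 498.60, 723.11, 69.43),
  num_of_orders = c(3L, 3L, 2L, 3L, 4L, 2L),
  gender        = factor(c("F", "F", "M", "M", "F", "F"))
)

# Min, quartiles, median, mean, max for numeric columns; counts for factors.
summary(sales)

# Counts per level of a single factor column.
gender_counts <- summary(sales$gender)
gender_counts
```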
Plotting
• Plotting a dataset's contents can provide information about the relationships between the various columns.
• In this example, the plot() function generates a scatterplot of the annual sales (sales$sales_total) against the number of orders (sales$num_of_orders).
• The $ is used to reference a specific column in the dataset sales.
plot(sales$num_of_orders, sales$sales_total, main="Number of Orders vs. Sales")
Plotting
• Each point corresponds to the number of orders and the total sales for each customer.
• The plot indicates that the annual sales tend to increase with the number of orders placed.
Statistical Analysis
• lm() fits a linear model, such as a linear regression.
• A linear model can be created and saved into a variable, here called results.
• In this particular case, we are using the number of orders to predict the total sales.
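A minimal sketch of fitting such a model is shown here. The data values are invented for illustration, so the fitted coefficients will not match the ones reported on the slides; only the column names and the variable name results come from the text.

```r
# Made-up data standing in for the sales data frame.
sales <- data.frame(
  num_of_orders = c(1, 2, 3, 4, 5, 6),
  sales_total   = c(20, 170, 350, 500, 690, 840)
)

# Regress total sales on the number of orders and keep the fitted model.
results <- lm(sales_total ~ num_of_orders, data = sales)

# Intercept and slope of the fitted line.
coef(results)
```

The formula sales_total ~ num_of_orders reads as "sales_total explained by num_of_orders".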
Output:
• The intercept is -154.1 and the coefficient for the number of orders is 166.2.
• Therefore, the complete regression equation is
Total Sales = -154.1 + 166.2 * (number of orders)
• This equation tells us that the predicted total sales increase by 166.2 for every unit increase in the number of orders.
• Suppose that our research question asks for the expected total sales when the number of orders is 7. We can use the regression equation to calculate the answer:
Predicted Total Sales = -154.1 + 166.2 * 7 = 1009.3
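The regression equation can also be wrapped in a small function so that it is easy to evaluate for other order counts. The function name predict_sales is hypothetical; the intercept and slope are the values quoted above.

```r
# Hypothetical helper that evaluates the regression equation from the slide.
predict_sales <- function(num_of_orders) {
  -154.1 + 166.2 * num_of_orders
}

predict_sales(7)      # expected total sales for 7 orders
predict_sales(1:3)    # the function is vectorized over order counts
```

With a fitted lm object, predict() with a newdata data frame would serve the same purpose.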
summary(results)
Output:
• The summary() function provides a wealth of information about the fitted model, including t-tests, the F-test, R-squared, residuals, and significance values. All of this can be used to answer important research questions related to our linear model.
Data Structure
• Supports virtually any type of data
• Numbers, characters, logicals (TRUE/ FALSE)
• Arrays of virtually unlimited sizes
• Simplest: Vectors and Matrices
• Lists: Can Contain mixed type variables
• Data Frame: Rectangular Data Set
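As a quick illustration of the structures listed above, the following sketch builds one of each. All names and values are made up.

```r
v  <- c(2.5, 4.0, 7.5)                  # vector: values of one type
m  <- matrix(1:6, nrow = 2, ncol = 3)   # matrix: 2 rows by 3 columns
l  <- list(name = "football",           # list: mixed types allowed
           price = 7.5,
           flags = c(TRUE, FALSE))
df <- data.frame(id = 1:3,              # data frame: rectangular dataset
                 label = c("a", "b", "c"))

dim(m)      # rows and columns of the matrix
names(l)    # element names of the list
```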
Data Structure
[Figure: linear and rectangular data structures]
Running R
• Directly in the windowing system (console)
• Using editors
  • Notepad, WinEdt, Tinn-R (Windows)
  • XEmacs with ESS (Emacs Speaks Statistics)
• From a script file written in an editor:
  • source("filename.R")
• Output can be diverted to a file by using
  • sink("filename.Rout")
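The source() and sink() calls can be combined as in the sketch below. Temporary files are used in place of filename.R and filename.Rout, and the two-line script is invented for the example.

```r
# Temporary stand-ins for filename.R and filename.Rout.
script_file <- tempfile(fileext = ".R")
output_file <- tempfile(fileext = ".Rout")

# Write a tiny script to disk.
writeLines(c("x <- 1:10", "print(mean(x))"), script_file)

sink(output_file)      # start diverting console output to the file
source(script_file)    # run the script; its printed output goes to the file
sink()                 # stop diverting and return output to the console

readLines(output_file)
```

Calling sink() with no arguments is what restores normal console output, so it should follow every diversion.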
R Session
Review of Basic Data Analytic Methods and Exploratory Data Analysis
(Data Import and Export)
• R packages such as DBI and RODBC are available for connecting R to a database management system (DBMS).
• These packages provide database interfaces for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum. The RODBC package can be installed with the install.packages() function, for example install.packages("RODBC").
Review of Basic Data Analytic Methods and Exploratory Data Analysis
(Attribute and Data Types)
Interval Variable
Properties: you can take the difference of two values.
You may not be able to take ratios of two values.
Ratio Variable
Properties: you can take a ratio of two values.
It has a clear definition of 0.0.
NOT RATIO!!
- Temperature, expressed in F or C, is not a ratio variable.
A temperature of 0.0 on either of those scales does not mean 'no heat'.
A temperature of 100 degrees C is not twice as hot as 50 degrees C.
- pH = 0 just means a 1 molar concentration of H+, and the definition of molar is fairly arbitrary.
A pH of 0.0 does not mean 'no acidity' (quite the opposite!).
A pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.
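The arithmetic behind these two examples can be checked with a short sketch. It relies on two facts not stated on the slide: the Kelvin scale is Celsius plus 273.15 and does have a true zero, and pH is the negative base-10 logarithm of the H+ concentration.

```r
# Temperature: ratios on the Celsius scale are not meaningful.
celsius <- c(50, 100)
kelvin  <- celsius + 273.15     # Kelvin has a true zero

celsius[2] / celsius[1]         # 2, but 100 C is not "twice as hot"
kelvin[2] / kelvin[1]           # about 1.155 on a ratio scale

# pH: compare the underlying H+ concentrations (mol/L) instead.
h_conc <- 10^(-c(3, 6))         # concentrations at pH 3 and pH 6
h_conc[1] / h_conc[2]           # pH 3 has about 1000 times the H+ of pH 6
```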
Data Conversion
• Data of one attribute type may be converted to another.
• For example, the quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is
considered ordinal but can be converted to nominal {Good, Excellent} with a defined
mapping.
• Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as
{Infant, Adolescent, Adult, Senior}.
• Understanding the attribute types in a given dataset is important to ensure that the
appropriate descriptive statistics and analytic methods are applied and properly
interpreted. For example, the mean and standard deviation of U.S. postal ZIP codes
are not very meaningful or appropriate.
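One way to convert a ratio attribute into an ordinal one in R is the cut() function, sketched below. The ages and the cut-off points (2, 18, and 65 years) are assumptions made for the example; the slide names the categories but not their boundaries.

```r
age <- c(1, 15, 34, 70)   # made-up ages in years

# Assumed boundaries: [0,2) Infant, [2,18) Adolescent, [18,65) Adult, 65+ Senior.
age_group <- cut(
  age,
  breaks = c(0, 2, 18, 65, Inf),
  labels = c("Infant", "Adolescent", "Adult", "Senior"),
  right = FALSE,            # intervals are closed on the left
  ordered_result = TRUE     # keep the ordering of the categories
)

age_group
```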
• R provides functions, such as class() and typeof(), to examine the characteristics of a given variable.
• The class() function represents the abstract class of an object.
• The typeof() function determines the way an object is stored in memory. A variable such as i <- 1 appears to hold an integer, but i is internally stored using double precision.
• The application of the length() function reveals that the created variables each have a
length of 1.
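The behaviour of class(), typeof(), and length() can be seen in a short sketch. The variable names sport and flag and their values are made up; i follows the text.

```r
i     <- 1            # looks like an integer, stored as a double
sport <- "football"   # character
flag  <- TRUE         # logical

class(i);     typeof(i)       # "numeric"   and "double"
class(sport); typeof(sport)   # "character" and "character"
class(flag);  typeof(flag)    # "logical"   and "logical"

j <- 1L               # the L suffix requests true integer storage
typeof(j)             # "integer"

length(i); length(sport); length(flag)   # each is a vector of length 1
```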
Vectors
• Vectors are a basic building block for data in R.
• Simple R variables are actually vectors. A vector can only consist of values of the same class. Whether an object is a vector can be tested with the is.vector() function.
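A brief sketch of these points, using made-up values: c() combines values into a vector, is.vector() tests for one, and mixing classes forces R to coerce everything to a common class.

```r
u <- c("red", "yellow", "blue")   # character vector of length 3
is.vector(u)                      # TRUE
u[1:2]                            # first two elements

n <- 42
is.vector(n)                      # TRUE: a single number is a vector of length 1

mixed <- c(1, "a", TRUE)          # mixed input is coerced to character
class(mixed)
```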
• Sometimes it is necessary to initialize a vector of a specific length and then populate the
content of the vector later.
• The vector() function, by default, creates a logical vector.
• A vector of a different type can be specified by using the mode parameter. An integer vector of length 0 may be useful when the number of elements is not initially known and new elements will be added to the end of the vector as the values become available.
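A sketch of initializing vectors with vector() follows. The names a, b, and c echo the slide's reference to a vector c; the appended values are made up.

```r
a <- vector(length = 3)                     # logical by default: FALSE FALSE FALSE
b <- vector(mode = "numeric", length = 3)   # numeric: 0 0 0
c <- vector(mode = "integer", length = 0)   # empty integer vector

# Grow the empty vector as values become available.
# (Calling c() still works even though a variable is also named c.)
c <- c(c, 7L)
c <- c(c, 9L)

a; b; c
```

Naming a variable c is legal but easy to misread; a more descriptive name is usually a better choice outside of a short demonstration.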
Data Frames
• Data frames provide a structure for storing and accessing several variables of possibly different
data types. In fact, as the is.data.frame() function indicates, a data frame was created by the
read.csv() function.
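The following sketch reads a few made-up CSV rows from a string (the text argument of read.csv()) instead of a file, then confirms that the result is a data frame and accesses parts of it.

```r
# Made-up rows in the layout of the sales data.
csv_text <- "cust_id,sales_total,num_of_orders,gender
100001,800.64,3,F
100002,217.53,3,F
100003,74.58,2,M"

sales <- read.csv(text = csv_text)

is.data.frame(sales)   # TRUE
sales$sales_total      # one column, by name
sales[2, ]             # second row, all columns
sales[, c(1, 3)]       # first and third columns
```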
Data Frames
• Because of their flexibility to handle many data types, data frames are the preferred input
format for many of the modeling functions available in R.
• The use of the str() function provides the structure of the sales data frame. This function
identifies the integer and numeric (double) data types, the factor variables and levels, as well
as the first few values for each variable.
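As an illustration, str() can be applied to a small made-up data frame. The gender column is created as a factor explicitly, which is an assumption of this sketch; whether read.csv() produces factors depends on the R version and the stringsAsFactors argument.

```r
# Made-up rows; column names follow the slides.
sales <- data.frame(
  cust_id       = c(100001L, 100002L, 100003L),
  sales_total   = c(800.64, 217.53, 74.58),
  num_of_orders = c(3L, 3L, 2L),
  gender        = factor(c("F", "F", "M"))
)

# One line per variable: type, factor levels if any, and the first values.
str(sales)
```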
List
A list is a collection of objects that can be of various types, including other lists.
v <- 1:5
M <- matrix(c(1,3,3,5,0,4,3,3,3), nrow=3, ncol=3)
M
     [,1] [,2] [,3]
[1,]    1    5    3
[2,]    3    0    3
[3,]    3    4    3
List
• A single set of brackets, [ ], returns a sub-list containing the selected item, not the item's content.
• Double brackets, [[ ]], return the content itself.
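The difference between the two kinds of brackets can be sketched as follows. The list name assortment and its first two elements are made up; v and M are defined as on the List slide.

```r
v <- 1:5
M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)

# A list holding a string, a number, a vector, and a matrix.
assortment <- list("football", 7.5, v, M)

assortment[4]        # a list of length 1 that contains M
assortment[[4]]      # the matrix M itself

is.list(assortment[4])       # TRUE
is.matrix(assortment[[4]])   # TRUE
assortment[[3]][2]           # second element of v

str(assortment)      # compact overview of every element
```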
List
• The str() function offers details about the structure of a list.
Factors
• A factor denotes a categorical variable, typically with a small, finite set of levels, such as 'F' and 'M' for gender.
Factors
• Included with the ggplot2 package, the diamonds data frame contains three ordered factors.
• Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good,
Premium, and Ideal.
• diamonds$cut contains ordinal data.
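An ordered factor with the same five levels can be built by hand, which avoids needing the ggplot2 package. The four observations below are made up and are not taken from the diamonds data.

```r
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Made-up observations, stored as an ordered factor.
cut_quality <- factor(c("Ideal", "Fair", "Premium", "Good"),
                      levels = cut_levels, ordered = TRUE)

is.ordered(cut_quality)     # TRUE
levels(cut_quality)         # levels in order of improving cut
cut_quality < "Premium"     # comparisons respect the level order
```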
Factors
• Factors can be used to categorize sales$sales_total into three groups (small, medium, and big) according to the amount of the sales.
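One possible way to build such a grouping is cut(), sketched here. The thresholds of 100 and 500 and the sales values are assumptions for illustration; the slide does not state where the groups begin and end.

```r
# Made-up sales totals.
sales <- data.frame(sales_total = c(74.58, 217.53, 800.64, 99.99, 500))

# Assumed thresholds: below 100 is small, 100 to below 500 is medium, else big.
spender <- cut(
  sales$sales_total,
  breaks = c(-Inf, 100, 500, Inf),
  labels = c("small", "medium", "big"),
  right = FALSE,            # intervals are closed on the left
  ordered_result = TRUE     # small < medium < big
)

spender
```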
Factors
• Create and add the factor to a data frame.
• The cbind() function is used to combine variables column-wise.
• The rbind() function is used to combine datasets row-wise.
• Factors are important in several R statistical modeling functions, such as analysis of variance, aov(), and in building contingency tables.
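The two combining functions can be sketched on a small made-up data frame: cbind() adds the spender factor as a new column, and rbind() appends a new customer as a new row. The data frame more_sales is hypothetical.

```r
sales <- data.frame(cust_id = 1:3,
                    sales_total = c(74.58, 217.53, 800.64))
spender <- factor(c("small", "medium", "big"),
                  levels = c("small", "medium", "big"), ordered = TRUE)

# Column-wise: add the factor to the data frame.
sales <- cbind(sales, spender)

# Row-wise: append another data frame with the same columns.
more_sales <- data.frame(
  cust_id = 4L,
  sales_total = 45.10,
  spender = factor("small", levels = levels(spender), ordered = TRUE)
)
all_sales <- rbind(sales, more_sales)

all_sales
```

For rbind() the column names of the two data frames must match.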
Factors
• An R factor can be viewed simply as a vector with a bit of extra information: a record of the distinct values in that vector, called levels.
• For a factor xf built from the values (5, 7, 9, 7), the core of xf is not (5, 7, 9, 7) but rather (1, 2, 3, 2), i.e. (level-1, level-2, level-3, level-2). So, the data has been recoded by level.
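This recoding can be inspected directly. The sketch assumes xf was created with factor() from the vector (5, 7, 9, 7), which is consistent with the values quoted above.

```r
x  <- c(5, 7, 9, 7)
xf <- factor(x)

levels(xf)        # the distinct values, stored as strings: "5" "7" "9"
as.integer(xf)    # the underlying codes: 1 2 3 2

# Recover the original numbers by going through the level labels.
as.numeric(as.character(xf))
```

Calling as.numeric(xf) directly would return the codes rather than the original values, which is a common source of mistakes.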
Contingency table
• Table refers to a class of objects used to store the observed counts across the factors for a given dataset.
• Such a table is also known as a contingency table and is the basis for performing a statistical test on the independence of the factors used to build the table.
• A contingency table can be built with the table() function from the sales$gender and sales$spender factors.
• Based on the observed counts in the table, the summary() function performs a chi-squared test on the independence of the two factors. Because the reported p-value is greater than 0.05, the assumed independence of the two factors is not rejected.
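A sketch of building and testing a contingency table follows. The 60 observations are invented so that the two factors are nearly independent; the counts and the resulting p-value are not those of the original sales data.

```r
# Made-up factors: 30 female and 30 male customers.
gender <- factor(rep(c("F", "M"), each = 30))
spender <- factor(
  c(rep(c("small", "medium", "big"), times = c(10, 12, 8)),    # F customers
    rep(c("small", "medium", "big"), times = c(11, 10, 9))),   # M customers
  levels = c("small", "medium", "big")
)

# Observed counts for every gender/spender combination.
sales_table <- table(gender, spender)
sales_table
class(sales_table)    # "table"

# summary() on a table runs a chi-squared test of independence.
chisq <- summary(sales_table)
chisq
chisq$p.value         # well above 0.05 here, so independence is not rejected
```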
Review of Basic Data Analytic Methods and Exploratory Data Analysis
(Descriptive Statistics)
Descriptive Statistics
• It has already been shown that the summary() function provides several descriptive statistics, such as the mean and median, about a variable such as the sales data frame.
• The results now include the counts for the three levels of the spender variable based on the earlier examples involving factors.
The apply() function
• The apply() function is useful when the same function is to be applied to several variables in a data frame.
• For example, apply() can calculate the standard deviation for the first three variables in sales.
• In the call,
  setting MARGIN=2 specifies that the function is applied over the columns;
  setting MARGIN=1 specifies that the function is applied over the rows.
• Other functions, such as lapply() and sapply(), apply a function to a list or vector.
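The calls described above might look like the following sketch, which uses a made-up version of the sales data frame with three numeric columns followed by a factor.

```r
# Made-up rows; column names follow the slides.
sales <- data.frame(
  cust_id       = c(100001, 100002, 100003, 100004),
  sales_total   = c(800.64, 217.53, 74.58, 498.60),
  num_of_orders = c(3, 3, 2, 3),
  gender        = factor(c("F", "F", "M", "M"))
)

# Standard deviation of each of the first three columns (MARGIN = 2).
col_sd <- apply(sales[, 1:3], MARGIN = 2, FUN = sd)
col_sd

# Sum across each row of two numeric columns (MARGIN = 1).
row_sum <- apply(sales[, 2:3], MARGIN = 1, FUN = sum)

# sapply() returns a vector; lapply() returns a list.
sapply(sales[, 1:3], sd)
lapply(list(a = 1:3, b = 4:9), mean)
```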
User-defined functions