Basic R

The document provides an overview of basic data analytic methods and exploratory data analysis using R, a free and open-source software for statistical data analysis and graphics. It covers data import/export, attribute and data types, descriptive statistics, and the use of R for graphical representation and statistical modeling. Key features of R, including its ability to handle large datasets and support for various data structures, are also discussed.

Uploaded by

s211021185

Review of Basic Data Analytic Methods and Exploratory Data Analysis
Part 1

Review of Basic Data Analytic Methods and Exploratory Data Analysis
(Intro to a data analytic tool)

Outline
• Introduction to a data analytic tool
• Data import and export
• Attribute and data types
• Descriptive statistics
• Exploratory Data Analysis


Introduction to R
• Software for Statistical Data Analysis and Graphics
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software (under GNU General
Public License)


Introduction to R
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy,
and
• a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output
facilities.


Why R?
• It's free and under active development.
• Analyses are reproducible.
• R can handle large datasets, limited mainly by available memory.
– Excel is limited to 1,048,576 rows and 16,384 columns per worksheet.
• A better way to explore, present and interpret your data.


Graph and Plotting Using R


Obtaining R
• https://www.r-project.org
• https://cran.r-project.org
• CRAN is a network of ftp and web servers around
the world that store identical, up-to-date, versions of
code and documentation for R.


R Graphical User Interface (GUI) & Editor


• Rgui.exe
• R commander
• Rattle
• RStudio
• Jupyter Notebook
• Replit.com
RStudio

[Figure: the RStudio window with four labeled panes: scripts, workspace, plot area, and console]


RStudio
The four highlighted window panes are as follows.

• Scripts: Serves as an area to write and save R code

• Workspace: Lists the datasets and variables in the R environment

• Plots: Displays the plots generated by the R code and provides a straightforward mechanism to export the plots

• Console: Provides a history of the executed R code and the output

RStudio
- The console pane can be used to obtain help information on R. For example, entering ?lm at the
console prompt displays the help page for lm(). Alternatively, help(lm) could be entered at the
console prompt.

- Functions such as edit() and fix() allow the user to update the contents of an R
variable. Alternatively, such changes can be implemented with RStudio by selecting the
appropriate variable from the workspace panel.

- R allows one to save the objects in the workspace environment, such as variables and
user-defined functions, into an .RData file using the save.image() function. An existing
.RData file can be loaded using the load() function.

- Tools such as RStudio prompt the user whether to save the workspace contents prior to
exiting the GUI.


• Annual Sales Example (yearly_sales.csv)


• 10,000 retail customers


# import a csv file of the total annual sales for each customer
sales <- read.csv("c:/Chapter 3/yearly_sales.csv")

# examine the imported dataset


head(sales)
summary(sales)


# plot num_of_orders vs. sales


plot(sales$num_of_orders,sales$sales_total,
main="Number of Orders vs. Sales")

# perform a statistical analysis (fit a linear regression model)


results <- lm(sales$sales_total ~ sales$num_of_orders)
results
summary(results)

# perform some diagnostics on the fitted model


# plot histogram of the residuals
hist(results$residuals, breaks = 800)


summary()
• The summary() function provides some descriptive statistics, such as the mean and median, for each data
column.
• Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided.
• Because the gender column contains two possible characters, an "F" (female) or "M" (male), the
summary() function provides the count of each character's occurrence.


Plotting

• Plotting a dataset's contents can provide information about the relationships between the various
columns.

• In this example, the plot() function generates a scatterplot of the number of orders
(sales$num_of_orders) against the annual sales (sales$sales_total).
• The $ is used to reference a specific column in the dataset sales.
plot(sales$num_of_orders, sales$sales_total, main="Number of Orders vs. Sales")


Plotting

• Each point corresponds to the number of orders and the total sales for each customer.
• The plot indicates that the annual sales are proportional to the number of orders placed.


Statistical Analysis
• lm() is R's linear model function; it is used here to fit a linear regression.
• The sample code below demonstrates how to create a linear model and save it into a variable
named results.
• In this particular case, we are using the number of orders to predict the total sales.

# perform a statistical analysis (fit a linear regression model)


results <- lm(sales$sales_total ~ sales$num_of_orders)
results

Output:

• The intercept is -154.1 and the coefficient for the num. of orders is 166.2.
• Therefore, the complete regression equation is
Total Sales= -154.1 + 166.2 * num. of orders
• This equation tells us that the predicted total sales will increase by 166.2 for every unit increase in
the num. of orders.
• Suppose that our research question asks for the expected total sales when the number of
orders is 7. We can use the regression equation to calculate the answer to this
question.
predicted total sales = -154.1 + 166.2 * 7 = 1009.3
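
As a minimal sketch of the calculation above, the rounded coefficients quoted on the slide can be plugged into a small helper. The function name predict_sales is hypothetical and is not part of the original material.

```r
# Rounded coefficients as quoted on the slide
intercept <- -154.1
slope <- 166.2

# Hypothetical helper: predicted total sales for a given number of orders
predict_sales <- function(num_of_orders) {
  intercept + slope * num_of_orders
}

predicted <- predict_sales(7)
print(predicted)
```

With a fitted model object, coef(results) returns the unrounded coefficients, so a prediction computed from them may differ slightly from this value.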


summary(results)

Output:

• The summary() function has provided us with a wealth of information, including t-test, F-test,
R-squared, residual, and significance values. All of this data can be used to answer important
research questions related to our linear model.

Data Structure
• Supports virtually any type of data
• Numbers, characters, logicals (TRUE/ FALSE)
• Arrays of virtually unlimited sizes
• Simplest: Vectors and Matrices
• Lists: Can Contain mixed type variables
• Data Frame: Rectangular Data Set


Data Structure

                 Linear     Rectangular
All Same Type    VECTORS    MATRIX*
Mixed            LIST       DATA FRAME


Running R
• Directly in the Windowing System (Console)
• Using Editors
• Notepad, WinEdt, Tinn-R: Windows
• Xemacs, ESS (Emacs Speaks Statistics)
• On the Editor:
• source("filename.R")
• Outputs can be diverted by using
• sink("filename.Rout")


R Session

• First, read data from other sources


• Use packages, libraries, and functions
• Write functions wherever necessary
• Conduct Statistical Data Analysis
• Save outputs to files, write tables
• Save R workspace if necessary (exit prompt)


Data Import and Export


• Data import into R using the read.csv() function
sales <- read.csv("c:/data/yearly_sales.csv")

• R uses a forward slash (/) as the separator character in directory and file paths,
whereas Windows users may be accustomed to using a backslash (\) as a separator.
• To simplify the import of multiple files with long path names, the
setwd() function can be used to set the working directory for the
subsequent import and export operations.
setwd("c:/data/")
sales <- read.csv("yearly_sales.csv")


Data Import and Export


Function        Headers   Separator   Decimal Point
read.table()    FALSE     ""          "."
read.csv()      TRUE      ","         "."
read.csv2()     TRUE      ";"         ","
read.delim()    TRUE      "\t"        "."
read.delim2()   TRUE      "\t"        ","
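
To illustrate how the defaults in the table differ, the sketch below parses the same small table with read.csv() and read.csv2(). The data and the variable names (csv_text, csv2_text, us, eu) are made up for illustration; the text argument is used so that no file is needed.

```r
# Two tiny made-up inputs that encode the same table
csv_text  <- "item,price\npen,1.5\nbook,12.25"  # "," separator, "." decimal point
csv2_text <- "item;price\npen;1,5\nbook;12,25"  # ";" separator, "," decimal point

us <- read.csv(text = csv_text)
eu <- read.csv2(text = csv2_text)

# Both calls should yield a numeric price column with the same values
print(us)
print(all.equal(us, eu))
```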


Data Import and Export


• R datasets can be exported to an external file using write.table(), write.csv(), and write.csv2().
• The following code adds an additional column to the sales dataset and exports the modified
dataset to an external file.

# add a column for the average sales per order
sales$per_order <- sales$sales_total/sales$num_of_orders

# export data without the row names
write.table(sales, "sales_modified.txt", row.names=FALSE)


Data Import and Export


• Sometimes it is necessary to read data from a database management system (DBMS).

• R packages such as DBI and RODBC are available for this purpose.
• These packages provide database interfaces for communication between R and DBMSs such as
MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum. The RODBC package is installed
with the install.packages() function.

• The library() function loads the package into the R workspace.

• Finally, a connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via
open database connectivity (ODBC) with user user. The training2 database must be defined either in
the /etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.


Data Import and Export


• The connector needs to be present to submit a SQL query to an ODBC database by using the
sqlQuery() function from the RODBC package.
• The following R code retrieves specific columns from the housing table in which household income
(hinc) is greater than $1,000,000.

housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms
                                from housing
                                where hinc > 1000000")
head(housing_data)


Data Import and Export


• Although plots can be saved using the RStudio GUI, plots can also be saved using R code by
specifying the appropriate graphic devices.
• Using the jpeg() function, R code can create a new JPEG file, add a histogram
plot to the file, and then close the file. Such techniques are useful when automating
standard reports. Other functions, such as png(), bmp(), pdf(), and postscript(), are
available in R to save plots in the desired format.
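
The open, plot, close sequence described above might look like the following sketch. The data (orders) is simulated and the output goes to a temporary file; both are assumptions for illustration, and the example presumes an R build with JPEG support.

```r
# Simulated order counts (made-up data, not the sales dataset)
set.seed(1)
orders <- rpois(200, lambda = 3)

# Open a JPEG graphics device that writes to a temporary file
out_file <- tempfile(fileext = ".jpeg")
jpeg(filename = out_file)

# Draw the histogram into the file instead of the screen
hist(orders, main = "Number of Orders")

# Close the device so the file is written to disk
invisible(dev.off())

print(file.exists(out_file))
```

Swapping jpeg() for png() or pdf() changes only the device call; the plotting and dev.off() steps stay the same.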


Attribute and Data Types


Interval Variable
Properties: you can take the difference of two values.
you may not be able to take the ratio of two values.

Example: temperature in Celsius and in Fahrenheit.

You can take the difference of two values.
- If the temperature in Perlis is 40 degrees Celsius and that in Shah Alam is 20 degrees
Celsius, then Perlis is 20 degrees Celsius hotter than Shah Alam (taking the difference).
- When we measure temperature in Fahrenheit, the distance from 30 to 40 is the same as the
distance from 70 to 80.

You may not be able to take the ratio of two values.
- You cannot say that Perlis is twice as hot as Shah Alam (taking a ratio is not allowed).
- Ratios do not make sense: 80 degrees is not twice as hot as 40 degrees (although the
attribute value is twice as large).

Ratio Variable
Properties: you can take a ratio of two values.
it has a clear definition of 0.0.

You can take a ratio of two values.
- Example: 40 kg is twice as heavy as 20 kg (taking ratios).
- A weight of 4 grams is twice a weight of 2 grams.

It has a clear definition of 0.0.
- When the variable equals 0.0, there is none of that variable. Variables like height,
weight, and enzyme activity are ratio variables.
- Temperature in Kelvin is a ratio variable, as 0.0 Kelvin really does mean 'no heat'.


NOT RATIO!!
- Temperature, expressed in F or C, is not a ratio variable.
A temperature of 0.0 on either of those scales does not mean 'no heat'.
A temperature of 100 degrees C is not twice as hot as 50 degrees C.

- pH = 0 just means 1 molar of H+, and the definition of molar is fairly arbitrary.
A pH of 0.0 does not mean 'no acidity' (quite the opposite!).
A pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.


Data Conversion
• Data of one attribute type may be converted to another.
• For example, the quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is
considered ordinal but can be converted to nominal {Good, Excellent} with a defined
mapping.
• Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as
{Infant, Adolescent, Adult, Senior}.
• Understanding the attribute types in a given dataset is important to ensure that the
appropriate descriptive statistics and analytic methods are applied and properly
interpreted. For example, the mean and standard deviation of U.S. postal ZIP codes
are not very meaningful or appropriate.


Numeric, Character, and Logical Data Types

• R provides functions, such as class() and typeof(), to examine the characteristics of a given variable.
• The class() function represents the abstract class of an object.
• The typeof() function determines the way an object is stored in memory. Although a variable i
that is assigned a whole number such as 1 appears to be an integer, i is internally stored using
double precision.


Numeric, Character, and Logical Data Types


• To test whether i is an integer, use the is.integer() function.
• To coerce i into a new integer variable, j, use the as.integer() function.
• Similar functions can be applied for double, character, and logical types.

• Applying the length() function reveals that the created variables each have a
length of 1.
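
The type checks described on the last two slides can be tried with a few one-line assignments. The values assigned to i, sport, and flag below are illustrative assumptions, since the original code did not survive extraction.

```r
i <- 1               # a whole number typed without the L suffix
sport <- "football"  # a character value
flag <- TRUE         # a logical value

print(class(i))      # abstract class of i
print(typeof(i))     # storage type of i
print(class(sport))
print(class(flag))

print(is.integer(i)) # test whether i is stored as an integer
j <- as.integer(i)   # coerce i into a new integer variable
print(is.integer(j))

print(length(i))     # simple variables are vectors of length 1
```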


Vectors
• Vectors are a basic building block for data in R.
• Simple R variables are actually vectors. A vector can only consist of values in the same
class. Whether an object is a vector can be tested using the is.vector() function.
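
A short sketch of these points follows. The variable names (u, v, w, mixed) are hypothetical; the last example shows what happens when values of different classes are combined into one vector.

```r
i <- 1
print(is.vector(i))           # a single number is a vector of length 1

u <- c("red", "yellow", "blue")  # character vector built with c()
v <- 1:5                         # integer sequence 1, 2, 3, 4, 5
w <- v * 2                       # arithmetic is applied element by element

# Mixing classes forces all elements into one common class
mixed <- c(1, "a", TRUE)
print(class(mixed))
print(w)
```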


Vectors: Creation and Manipulation


• Sometimes it is necessary to initialize a vector of a specific length and then populate the
content of the vector later.
• The vector() function, by default, creates a logical vector.
• A vector of a different type can be specified by using the mode parameter. For example,
an integer vector of length 0 may be useful when the number of elements is not initially
known and the new elements will later be added to the end of the vector as the values
become available.
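
The initialization patterns above can be sketched as follows. The variable names (a, b, counts) and the appended values are assumptions chosen for illustration.

```r
a <- vector(length = 3)                         # logical vector by default
b <- vector(mode = "numeric", length = 3)       # numeric vector of zeros
counts <- vector(mode = "integer", length = 0)  # empty integer vector

# Append elements to the end as values become available
counts <- c(counts, 4L)
counts <- c(counts, 7L)

print(a)
print(b)
print(counts)
```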


Arrays and Matrices


• The array() function can be used to restructure a vector as an array.
• For example, a three-dimensional array can hold the quarterly sales for three regions
over a two-year period, and an element assignment can then set the sales amount of
$158,000 for the second region for the first quarter of the first year.
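
A minimal sketch of that example, assuming the dimensions are ordered as region, quarter, year and that the array is named quarterly_sales (both are assumptions):

```r
# 3 regions x 4 quarters x 2 years, initialized to zero
quarterly_sales <- array(0, dim = c(3, 4, 2))

# Second region, first quarter, first year
quarterly_sales[2, 1, 1] <- 158000

print(dim(quarterly_sales))
print(quarterly_sales[, , 1])  # all regions and quarters for the first year
```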


Arrays and Matrices


• A two-dimensional array is known as a matrix. The matrix() function can initialize a matrix to
hold the quarterly sales for the three regions.
• The parameters nrow and ncol define the number of rows and columns, respectively, for
the sales_matrix.
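
One way this initialization might be written, assuming one row per region and one column per quarter:

```r
# 3 regions (rows) x 4 quarters (columns), initialized to zero
sales_matrix <- matrix(0, nrow = 3, ncol = 4)

# Hypothetical entry: second region, first quarter
sales_matrix[2, 1] <- 158000

print(sales_matrix)
print(dim(sales_matrix))
```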


Arrays and Matrices


• R provides the standard matrix operations such as addition, subtraction, and
multiplication, as well as the transpose function t() and the inverse matrix function
matrix.inverse() included in the matrixcalc package.
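
The operations above can be sketched with base R alone. Note one substitution: this example uses the base function solve() to compute the inverse, rather than matrix.inverse() from matrixcalc, so that no extra package needs to be installed.

```r
M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)

M_sum  <- M + M       # element-wise addition
M_diff <- M - M       # element-wise subtraction
M_t    <- t(M)        # transpose
M_prod <- M %*% M     # matrix multiplication (not element-wise)

# Inverse via base R; multiplying back should give the identity matrix
M_inv <- solve(M)
print(round(M %*% M_inv, 10))
```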

Data Frames
• Data frames provide a structure for storing and accessing several variables of possibly different
data types. In fact, as the is.data.frame() function indicates, a data frame was created by the
read.csv() function.

• The variables stored in a data frame can be easily accessed using the $ notation.

• A factor denotes a categorical variable, typically with a few finite levels such as 'F' and 'M' in
the case of gender.

Data Frames
• Because of their flexibility to handle many data types, data frames are the preferred input
format for many of the modeling functions available in R.

• The use of the str() function provides the structure of the sales data frame. This function
identifies the integer and numeric (double) data types, the factor variables and levels, as well
as the first few values for each variable.
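
Since yearly_sales.csv is not available here, the sketch below builds a small stand-in data frame by hand. The column names sales_total, num_of_orders, and gender come from the slides; the cust_id column and all values are made up.

```r
# Hand-built stand-in for the sales data frame (values are made up)
sales <- data.frame(
  cust_id       = c(100001, 100002, 100003, 100004),
  sales_total   = c(800.64, 217.53, 74.58, 498.60),
  num_of_orders = c(3L, 3L, 2L, 3L),
  gender        = factor(c("F", "F", "M", "M"))
)

print(is.data.frame(sales))     # TRUE for data frames
print(sales$num_of_orders)      # access one variable with $
print(is.factor(sales$gender))  # gender is stored as a factor
str(sales)                      # types, factor levels, and first values
```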


Data Frames: Subsetting Operators


• R's subsetting operators are powerful in that they allow one to express complex operations in
a succinct fashion and easily retrieve a subset of the dataset.

• The class of the sales variable is data frame; its type is list.
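
A few common subsetting forms are sketched below on a small hand-built data frame. The cust_id column, the values, and the 200 threshold are assumptions for illustration.

```r
# Hand-built stand-in for the sales data frame (values are made up)
sales <- data.frame(
  cust_id       = c(100001, 100002, 100003, 100004),
  sales_total   = c(800.64, 217.53, 74.58, 498.60),
  num_of_orders = c(3L, 3L, 2L, 3L),
  gender        = factor(c("F", "F", "M", "M"))
)

fourth_col <- sales[, 4]            # fourth column, all rows
first_two  <- sales[1:2, ]          # first two rows, all columns
some_cols  <- sales[, c(1, 3, 4)]   # columns 1, 3, and 4
big_sales  <- sales[sales$sales_total > 200, ]  # rows matching a condition

print(big_sales)
print(class(sales))   # abstract class
print(typeof(sales))  # underlying storage type
```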

List
A list is a collection of objects that can be of various types, including other
lists.

v <- 1:5

M <- matrix(c(1,3,3,5,0,4,3,3,3), nrow=3, ncol=3)
M
     [,1] [,2] [,3]
[1,]    1    5    3
[2,]    3    0    3
[3,]    3    4    3

List
• The use of the single set of brackets, [ ], only accesses an item in the list, not its content.
• The double brackets, [[ ]], display the content.


List
• The str() function offers details about the structure of a list. The double brackets, [[ ]], display
the content.
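
The difference between the two bracket forms can be seen on a small list. The list contents and the names housing and assortment are illustrative assumptions; v and M follow the definitions shown earlier.

```r
v <- 1:5
M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)

housing <- list("own", "rent")
assortment <- list("football", 7.5, housing, v, M)  # mixed types, nested list

item    <- assortment[5]    # single brackets: a list holding the fifth item
content <- assortment[[5]]  # double brackets: the fifth item itself

print(class(item))
print(is.matrix(content))
str(assortment)             # structure of every element in the list
```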


Factors
• A factor denotes a categorical variable, typically with a few finite levels such as 'F' and 'M' in
the case of gender.

• Factors can be ordered or unordered.

• sales$gender contains nominal data.


Factors
• Included with the ggplot2 package, the diamonds data frame contains three ordered factors.
• Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good,
Premium, and Ideal.
• diamonds$cut contains ordinal data.
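
To show the nominal versus ordinal distinction without loading ggplot2, the sketch below builds two small factors by hand. The level names for cut are taken from the slide; the observations themselves are made up.

```r
# Unordered (nominal) factor
gender <- factor(c("F", "M", "F", "M"))
print(is.ordered(gender))

# Ordered (ordinal) factor with the five cut levels, worst to best
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_quality <- factor(c("Ideal", "Fair", "Premium", "Good"),
                      levels = cut_levels, ordered = TRUE)
print(is.ordered(cut_quality))

# Ordered factors support comparisons between levels
print(cut_quality[1] > cut_quality[2])
```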


Factors
• The sales$sales_total values can be categorized into three groups (small, medium, and big)
according to the amount of the sales.
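
One possible way to do this grouping is with the base function cut(). The thresholds of 100 and 500 and the sample values are assumptions for illustration; the slides do not state the cut-offs.

```r
# Made-up sales totals
sales_total <- c(800.64, 217.53, 74.58, 498.60, 723.11)

# Assumed thresholds: below 100 is small, 100 up to 500 is medium, 500+ is big
spender <- cut(sales_total,
               breaks = c(-Inf, 100, 500, Inf),
               labels = c("small", "medium", "big"),
               right = FALSE,           # intervals are closed on the left
               ordered_result = TRUE)   # return an ordered factor

print(spender)
print(table(spender))
```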


Factors
• Create and add the factor to a data frame.
• The cbind() function is used to combine variables column-wise.
• The rbind() function is used to combine datasets row-wise.
• The use of factors is important in several R statistical modeling functions, such as analysis of
variance, aov(), and in the use of contingency tables.


Factors

Factors as vectors

An R factor might be viewed simply as a vector with a bit of extra information that consists of a record
of the distinct values in that vector, called levels.

For a factor xf built from the values (5, 7, 9, 7), the core of xf is not (5, 7, 9, 7) but rather
(1, 2, 3, 2), i.e. (level-1, level-2, level-3, level-2). So, the data has been recoded by level.
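
The recoding can be inspected directly. This sketch assumes xf was created with factor() from the four values quoted above.

```r
x <- c(5, 7, 9, 7)
xf <- factor(x)

print(levels(xf))      # the distinct values, stored as labels
print(as.integer(xf))  # the underlying level codes
```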

cbind() and rbind()


• Assign new column names with a vector of names.
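
A minimal sketch of both functions, followed by renaming the columns. The vectors and the new column names are made up for illustration.

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

by_col <- cbind(a, b)  # each vector becomes a column: 3 rows x 2 columns
by_row <- rbind(a, b)  # each vector becomes a row: 2 rows x 3 columns

# Replace the default column names with a character vector
colnames(by_col) <- c("first", "second")

print(by_col)
print(by_row)
```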


Contingency table
• Table refers to a class of objects used to store the observed counts across the factors for a given dataset.
• A contingency table is the basis for performing a statistical test on the independence of the factors
used to build the table.

• A contingency table can be built from the sales$gender and sales$spender factors.
• Based on the observed counts in the table, the summary() function performs a chi-squared test on the
independence of the two factors. Because the reported p-value is greater than 0.05, the assumed
independence of the two factors is not rejected.
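
The steps above can be sketched on made-up counts. The cell counts below are invented for illustration and are not taken from the sales data, so the resulting p-value will differ from the one discussed on the slide.

```r
# Made-up cell counts for each gender and spender combination
counts <- data.frame(
  gender  = c("F", "F", "F", "M", "M", "M"),
  spender = c("small", "medium", "big", "small", "medium", "big"),
  n       = c(12, 20, 8, 10, 22, 8)
)

# Expand the counts into one row per customer
idx <- rep(seq_len(nrow(counts)), times = counts$n)
gender <- factor(counts$gender[idx])
spender <- factor(counts$spender[idx],
                  levels = c("small", "medium", "big"), ordered = TRUE)

# Build the contingency table and run the chi-squared test of independence
sales_table <- table(gender, spender)
test_result <- summary(sales_table)

print(sales_table)
print(test_result$p.value)
```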


Descriptive Statistics
• It has already been shown that the summary() function provides several descriptive statistics,
such as the mean and median, about a variable such as the sales data frame.
• The results now include the counts for the three levels of the spender variable based on the
earlier examples involving factors.


Functions for Descriptive Statistics


• The IQR() function provides the difference between the third and the first quartiles.
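
Several of the standard descriptive-statistics functions are sketched below on a small made-up vector; x is a hypothetical stand-in for a column such as sales$sales_total.

```r
x <- c(12, 15, 11, 19, 23, 15, 30)  # made-up values

print(mean(x))    # arithmetic mean
print(median(x))  # middle value
print(range(x))   # minimum and maximum
print(sd(x))      # sample standard deviation
print(var(x))     # sample variance
print(IQR(x))     # third quartile minus first quartile

# IQR() agrees with the quartiles reported by quantile()
q <- quantile(x, c(0.25, 0.75))
print(unname(q[2] - q[1]))
```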


The apply() function

• The apply() function is useful when the same function is to be applied to several
variables in a data frame.
• For example, apply() can calculate the standard deviation for the first three
variables in sales.
• In the call,
setting MARGIN=2 specifies that the function is applied over the columns.
setting MARGIN=1 specifies that the function is applied over the rows.

• Other functions, such as lapply() and sapply(), apply a function to a list or vector.
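
A sketch of these calls on a small hand-built data frame follows. The cust_id column and all values are made up; only the numeric columns are passed to apply().

```r
# Hand-built stand-in for the sales data frame (values are made up)
sales <- data.frame(
  cust_id       = c(100001, 100002, 100003, 100004),
  sales_total   = c(800.64, 217.53, 74.58, 498.60),
  num_of_orders = c(3L, 3L, 2L, 3L),
  gender        = factor(c("F", "F", "M", "M"))
)

# MARGIN = 2: standard deviation of each of the first three columns
col_sd <- apply(sales[, 1:3], MARGIN = 2, FUN = sd)

# MARGIN = 1: largest value in each row of two numeric columns
row_max <- apply(sales[, 2:3], MARGIN = 1, FUN = max)

# sapply() treats the data frame as a list of columns
col_sd_alt <- sapply(sales[, 1:3], sd)

print(col_sd)
print(row_max)
```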


User-defined functions

• Additional descriptive statistics can be applied with user-defined functions.
• For example, a function my_range() can be defined to compute the
difference between the maximum and minimum values returned by
the range() function.
• In general, user-defined functions are useful for any task or operation
that needs to be frequently repeated.
• Use help("function") for more information.
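
One way my_range() might be written is sketched below; the body is an assumption, since the original code did not survive extraction.

```r
# Difference between the maximum and minimum of a numeric vector
my_range <- function(v) {
  r <- range(v)  # r[1] is the minimum, r[2] is the maximum
  r[2] - r[1]
}

print(my_range(c(3, 10, 7)))
```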
