Da Session 4
http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
Download R: http://cran.r-project.org/bin/
help.start()
help(topic)
?topic
??topic
R command in integrated environment
How to use R for simple Mathematics
> 3+5
> 12 + 3 / 4 - 5 + 3*8
> (12 + 3 / 4 - 5) + 3*8
> pi * 2^3 - sqrt(4)
Note: R ignores spaces
> factorial(4)
>log(2,10)
>log(2, base=10)
>log10(2)
>log(2)
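As a quick check of the log() calls above, the following sketch confirms that the three base-10 forms agree and that log() with a single argument is the natural logarithm. The variable names are illustrative only.

```r
# log(x, base), log(x, base = ...) and log10(x) give the same base-10 logarithm
a   <- log(2, 10)
b   <- log(2, base = 10)
c10 <- log10(2)
print(c(a, b, c10))

# With one argument, log() is the natural logarithm (base e),
# so exponentiating the result recovers the input
nat <- log(2)
print(all.equal(exp(nat), 2))
```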
How to store results of calculations for future use
> x = 3+5
>x
> y = 12 + 3 / 4 - 5 + 3*8
> y
> z = (12 + 3 / 4 - 5) + 3*8
>z
> A <- 6 + 8 ## no space should be between < & -
>a ## Note: R is case sensitive
>A
Using the c() command
> data1 = c(3, 6, 9, 12, 78, 34, 5, 7, 7) ## numerical data
> data1.text = c('Mon', 'Tue', "Wed") ## text data
## Single or double quotes are both OK
## Copy/paste from slides into the R console may not work (curly quotes are not valid R)
> data1.text = c(data1.text, 'Thu', 'Fri')
Scan command for making data
> d3 = scan(what = 'character')
1: mon
2: tue
3: wed thu
5:
> d3
[1] "mon" "tue" "wed" "thu"
> d3[2]
[1] "tue"
> d3[5] = 'fri'
> d3
[1] "mon" "tue" "wed" "thu" "fri"
## Starting again from the four-element d3:
> d3[2] = 'mon'
> d3[6] = 'sat'   ## assigning past the end fills the gap with NA
> d3
[1] "mon" "mon" "wed" "thu" NA    "sat"
> d3[2] = 'tue'
> d3
[1] "mon" "tue" "wed" "thu" NA    "sat"
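The scan() session above is interactive. As a non-interactive sketch of the same idea, scan() also accepts its input through the text argument. The example below also shows what happens when an element is assigned beyond the end of a vector.

```r
# Read four whitespace-separated strings without typing at the console
d3 <- scan(text = "mon tue wed thu", what = "character", quiet = TRUE)
print(d3)

# Assigning to position 6 of a length-4 vector leaves position 5 as NA
d3[6] <- "sat"
print(length(d3))
print(is.na(d3[5]))
```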
Concept of working directory
>getwd()
[1] "C:/Users/DA/R/Database"
Matrix
Data Frame
List
Vectors in R
>x=c(1,2,3,4,56)
>x
> x[2]
> x = c(3, 4, NA, 5)
>mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 4
> x = c(3, 4, NULL, 5)
>mean(x)
[1] 4
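The difference between NA and NULL seen above can be made explicit with a small sketch: NA occupies a position in the vector, while NULL is dropped by c(). The variable names are illustrative.

```r
x_na   <- c(3, 4, NA, 5)    # NA is a placeholder for a missing value
x_null <- c(3, 4, NULL, 5)  # NULL contributes nothing to the vector

len_na   <- length(x_na)    # NA still counts as an element
len_null <- length(x_null)  # NULL does not

m_default <- mean(x_na)                # missing value propagates
m_removed <- mean(x_na, na.rm = TRUE)  # missing value is skipped
n_missing <- sum(is.na(x_na))          # count the missing entries

print(c(len_na, len_null, n_missing))
print(c(m_default, m_removed))
```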
More on Vectors in R
>y = c(x,c(-1,5),x)
>length(x)
>length(y)
There are useful methods to create long vectors whose elements are in
arithmetic progression:
> x=1:20
>x
If the common difference is not 1 or -1 then we can use the seq function
> y=seq(2,5,0.3)
>y
[1] 2.0 2.3 2.6 2.9 3.2 3.5 3.8 4.1 4.4 4.7 5.0
> length(y)
[1] 11
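Besides a fixed step, seq() can also count downwards or produce a fixed number of points. A short sketch, with illustrative variable names:

```r
y <- seq(2, 5, by = 0.3)        # same call as above, step named explicitly
z <- seq(0, 1, length.out = 5)  # five equally spaced points from 0 to 1
w <- seq(10, 1, by = -3)        # a negative step counts down

print(length(y))
print(z)
print(w)
```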
More on Vectors in R
> x=1:5
> mean(x)
[1] 3
> x
[1] 1 2 3 4 5
> x^2
[1] 1 4 9 16 25
> x+1
[1] 2 3 4 5 6
> 2*x
[1] 2 4 6 8 10
> exp(sqrt(x))
[1] 2.718282 4.113250 5.652234 7.389056 9.356469
It is very easy to add/subtract/multiply/divide two vectors entry by entry.
> y=c(0,3,4,0)
> x+y
[1] 1 5 7 4 5
Warning message:
In x + y : longer object length is not a multiple of shorter object length
> y=c(0,3,4,0,9)
> x+y
[1] 1 5 7 4 14
> x=1:6
> y=c(9,8)
> x+y
[1] 10 10 12 12 14 14
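The last result above comes from R's recycling rule: the shorter vector is repeated until it matches the length of the longer one. A minimal sketch that makes the repetition explicit with rep():

```r
x <- 1:6
y <- c(9, 8)

# What R does implicitly: repeat y to the length of x
y_recycled <- rep(y, length.out = length(x))
print(y_recycled)

# The implicit and explicit versions give the same sums
implicit_sum <- x + y
explicit_sum <- x + y_recycled
print(implicit_sum)
```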
Matrices in R
All elements of a matrix must have the same data type/mode: numeric, character, or logical.
a.matrix <- matrix(vector, nrow = r, ncol = c, byrow = FALSE,
dimnames = list(char-vector-rownames, char-vector-col-names))
## dimnames is optional argument, provides labels for rows & columns.
> y <- matrix(1:20, nrow = 4, ncol = 5)
>A = matrix(c(1,2,3,4),nrow=2,byrow=T)
>A
>A = matrix(c(1,2,3,4),ncol=2)
>B = matrix(2:7,nrow=2)
>C = matrix(5:2,ncol=2)
>mr <- matrix(1:20, nrow = 5, ncol = 4, byrow = T)
>mc <- matrix(1:20, nrow = 5, ncol = 4)
>mr
>mc
More on matrices in R
>dim(B) #Dimension
>nrow(B)
>ncol(B)
>A+C
>A-C
>A%*%C #Matrix multiplication. What will the result be?
>A*C #Entry-wise multiplication
>t(A) #Transpose
>A[1,2]
>A[1,]
>B[1,c(2,3)]
>B[,-1]
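To see the difference between %*% and * on the matrices defined earlier (the second definition of A, plus B and C), the following sketch recomputes them and prints the results.

```r
A <- matrix(c(1, 2, 3, 4), ncol = 2)  # filled column by column: rows (1,3) and (2,4)
B <- matrix(2:7, nrow = 2)            # rows (2,4,6) and (3,5,7)
C <- matrix(5:2, ncol = 2)            # rows (5,3) and (4,2)

prod_matrix <- A %*% C  # row-by-column matrix product
prod_entry  <- A * C    # entry-by-entry product
print(prod_matrix)
print(prod_entry)

first_row_cols <- B[1, c(2, 3)]  # row 1, columns 2 and 3
without_col1   <- B[, -1]        # a negative index drops column 1
print(first_row_cols)
print(without_col1)
```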
Lists in R
Vectors and matrices in R are two ways to work with a
collection of objects.
>names(x)
>x$name
>x$hei #abbreviations are OK
>x$marks
>x$m[2]
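The list x used in the commands above is not defined in these notes. The sketch below builds a hypothetical list with components name, height and marks (the values are invented for illustration) so that the commands can be tried.

```r
# Hypothetical list; component values are made up for illustration.
# Unlike a vector, a list can hold components of different types and lengths.
x <- list(name = "Asha", height = 165, marks = c(72, 85, 90))

print(names(x))
print(x$name)
print(x$hei)    # partial matching: "hei" uniquely identifies "height"
print(x$marks)
print(x$m[2])   # "m" uniquely identifies "marks"; take its second element
```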
Data frame in R
A data frame is more general than a matrix, in that different columns
can have different modes (numeric, character, factor, etc.).
>d <- c(1,2,3,4)
>e <- c("red", "white", "red", NA)
>f <- c(TRUE,TRUE,TRUE,FALSE)
>myframe <- data.frame(d,e,f)
>names(myframe) <- c("ID","Color","Passed") # Variable names
>myframe
>myframe[1:3,] # Rows 1 , 2, 3 of data frame
>myframe[,1:2] # Col 1, 2 of data frame
>myframe[c("ID","Color")] #Columns ID and color from data frame
>myframe$ID # Variable ID in the data frame
Factors in R
In R we can make a variable nominal by making it a factor.
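A minimal sketch of creating a factor from a character vector follows. The colour values are illustrative and not taken from any dataset in these notes.

```r
colour <- c("red", "white", "red", "blue")
colour_f <- factor(colour)  # levels are the sorted unique values

print(levels(colour_f))      # the categories
print(table(colour_f))       # frequency of each category
print(as.integer(colour_f))  # internal integer codes, one per level
```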
Function Help
?rnorm
Package Installation
install.packages("ggplot2")
Library Call (for use)
library(ggplot2)
DESCRIPTIVE STATISTICS
The monthly credit card expenses of an individual (in thousands of rupees) are given below.
Summarize the data.
Format   Code
Excel    library(xlsx)
         mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "Sheet1")
Function                           Description
odbcConnect(dsn, uid="", pwd="")   Open a connection to an ODBC database
sqlFetch(channel, sqtable)         Read a table from an ODBC database into a data frame
Operators - Arithmetic
Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
x %% y modulus (x mod y), e.g. 5 %% 2 is 1
x %/% y integer division, e.g. 5 %/% 2 is 2
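The operators in the table can be tried directly. A short sketch, with illustrative variable names, that also shows how %% and %/% behave for a negative left operand:

```r
pow1 <- 2^3     # exponentiation
pow2 <- 2 ** 3  # ** is accepted as a synonym for ^

rem  <- 5 %% 2   # remainder of 5 divided by 2
quot <- 5 %/% 2  # integer part of 5 divided by 2

# For negative numbers, %/% rounds towards minus infinity
neg_quot <- (-5) %/% 2
neg_rem  <- (-5) %% 2

print(c(pow1, pow2, rem, quot, neg_quot, neg_rem))
```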
Operators - Logical
Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x Not x
x|y x OR y
x&y x AND y
isTRUE(x) test if x is TRUE
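The comparison and logical operators work element by element on vectors, whereas isTRUE() expects a single value. A small sketch with an illustrative vector:

```r
x <- c(1, 5, 8)

in_range <- x > 2 & x <= 8  # AND, element by element
outside  <- x < 2 | x > 6   # OR, element by element
not_five <- x != 5          # not equal
negated  <- !in_range       # NOT

print(in_range)
print(outside)
print(not_five)

# isTRUE() is TRUE only for a single TRUE value, not for a longer vector
single_check <- isTRUE(x[2] == 5)
vector_check <- isTRUE(in_range)
print(c(single_check, vector_check))
```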
Descriptive Statistics
Computation of descriptive statistics for variable CC
Function: Quantile
Code: > quantile(CC)
Output:
Quantile   0%   25%   50%   75%   100%
Value      53   57    59    61    65
Function: Summary
Code: > summary(CC)
Output:
Minimum   Q1   Median   Mean   Q3   Maximum
53        57   59       59.2   61   65
Function: describe
Code:
> library(psych)
> describe(CC)
Output:
Statistic   Value
n           20
mean        59.2
sd          3.11
median      59
trimmed     59.25
mad         2.97
min         53
max         65
range       12
skew        -0.08
kurtosis    -0.85
se          0.69
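Several of the statistics reported by describe() can be reproduced with base R functions. The sketch below uses a small hypothetical vector v, not the CC data (which is not reproduced in these notes), so its numbers will not match the table above.

```r
# Hypothetical values for illustration only; this is NOT the CC data
v <- c(55, 57, 59, 61, 63)

n   <- length(v)
m   <- mean(v)
s   <- sd(v)            # sample standard deviation (divides by n - 1)
med <- median(v)
rng <- diff(range(v))   # max minus min
se  <- s / sqrt(n)      # standard error of the mean

print(c(n = n, mean = m, sd = s, median = med, range = rng, se = se))
```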
Graphs
Graph                      Code
Histogram                  > hist(CC)
Histogram, colour "blue"   > hist(CC, col = "blue")
Dot plot                   > dotchart(CC)
Box plot                   > boxplot(CC)
Box plot, coloured         > boxplot(CC, col = "dark green")
Histogram : Variable - CC
Read data and simple scatter plot using function ggplot() with geom_point().
> train <- read.csv("C:/Users/Data/Big_Mart_Dataset.csv")
> View(train)   ## note the capital V
> library(ggplot2)
> ggplot(train, aes(Item_Visibility, Item_MRP)) + geom_point() +
    scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
    scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) + theme_bw()
DATA VISUALIZATION
1. Scatter Plot: We can also show a third variable in the same chart, say a
categorical variable (Item_Type), which gives the category of each data point.
Each Item_Type category is drawn in a different colour in the chart.
> library(ggplot2)
> ggplot(train, aes(Item_Visibility, Item_MRP)) +
    geom_point(aes(color = Item_Type)) +
    scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
    scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
    theme_bw() + labs(title = "Scatterplot")
1. Scatter Plot: We can make it visually clearer still by creating a separate
scatter plot for each Item_Type.
library(ggplot2)
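The code for the separate scatter plots is not reproduced in these notes. One way to get one panel per category in ggplot2 is facet_wrap(); whether the original slide used it is an assumption. The sketch below runs on a small simulated data frame named demo (hypothetical values) that borrows the column names used above; with the real data, train would take its place.

```r
library(ggplot2)

# Simulated stand-in for the Big Mart data; values are random, not real
set.seed(1)
demo <- data.frame(
  Item_Visibility = runif(60, 0, 0.35),
  Item_MRP        = runif(60, 30, 270),
  Item_Type       = rep(c("Dairy", "Snacks", "Drinks"), each = 20)
)

# facet_wrap() draws one scatter-plot panel per Item_Type
p <- ggplot(demo, aes(Item_Visibility, Item_MRP)) +
  geom_point(aes(color = Item_Type)) +
  facet_wrap(~ Item_Type) +
  theme_bw() +
  labs(title = "Scatterplot by Item_Type")
print(p)
```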
2. Histogram: It is used to plot a continuous variable. It breaks the data into bins
and shows the frequency distribution across these bins. We can always change the bin
size and see the effect it has on the visualization.
3. Stack Bar Chart: It is an advanced version of bar chart, used for visualizing
a combination of categorical variables.
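No code accompanies the histogram and stacked bar chart descriptions in these notes. The sketch below shows one possible ggplot2 version of each, using geom_histogram() and geom_bar() on simulated data. The data frame demo and the columns Outlet_Type and Outlet_Size are assumptions made for illustration.

```r
library(ggplot2)

# Simulated stand-in data; column names and values are assumptions
set.seed(2)
demo <- data.frame(
  Item_MRP    = runif(200, 30, 270),
  Outlet_Type = sample(c("Grocery", "Supermarket"), 200, replace = TRUE),
  Outlet_Size = sample(c("Small", "Medium", "High"), 200, replace = TRUE)
)

# Histogram: binwidth controls the bin size; try other values to see the effect
h <- ggplot(demo, aes(Item_MRP)) +
  geom_histogram(binwidth = 20) +
  labs(title = "Histogram", x = "Item MRP", y = "Count")
print(h)

# Stacked bar chart: one bar per Outlet_Type, split by Outlet_Size
# (geom_bar() stacks the fill groups by default)
b <- ggplot(demo, aes(Outlet_Type, fill = Outlet_Size)) +
  geom_bar() +
  labs(title = "Stacked Bar Chart", x = "Outlet Type", y = "Count")
print(b)
```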
R Code (box plot):
> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) +
    geom_boxplot(fill = "red") +
    scale_y_continuous("Item Outlet Sales", breaks = seq(0, 15000, by = 500)) +
    labs(title = "Box Plot", x = "Outlet Identifier")
The black points are outliers. Outlier detection and removal is an essential step of
successful data exploration.
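Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. A minimal base R sketch of that rule follows, on a hypothetical vector of sales values (not taken from the Big Mart data).

```r
# Hypothetical sales figures; the last value is deliberately extreme
sales <- c(120, 135, 150, 160, 170, 180, 190, 900)

q   <- unname(quantile(sales, c(0.25, 0.75)))  # first and third quartiles
iqr <- q[2] - q[1]                             # interquartile range

lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag values outside the fences, then keep only the rest
is_outlier <- sales < lower | sales > upper
outliers   <- sales[is_outlier]
cleaned    <- sales[!is_outlier]

print(outliers)
print(cleaned)
```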
For Big_Mart_Dataset, when we want to analyse the trend of item outlet sales, an
area chart can be plotted. It shows the count of outlets on the basis of
sales.
R Code:
> ggplot(train, aes(Item_Outlet_Sales)) +
    geom_area(stat = "bin", bins = 30, fill = "steelblue") +
    scale_x_continuous(breaks = seq(0, 11000, 1000)) +
    labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")
The area chart shows the continuity of Item Outlet Sales, using the function ggplot() with
geom_area().
R Code:
> install.packages("corrgram")
> library(corrgram)
> corrgram(train, order=NULL, panel=panel.shade, text.panel=panel.txt,
main="Correlogram")
• The darker the colour, the stronger the correlation between variables. Positive
correlations are displayed in blue and negative correlations in red. Colour
intensity is proportional to the correlation value.
• We can see that Item cost & Outlet sales are positively correlated, while Item
weight & its visibility are negatively correlated.
DATA PRE-PROCESSING
Option 2: Replace the missing values with variable mean, median, etc
Replacing the missing values with mean
SL No   cmusage   l3musage      avrecharge    Proj Growth   Circle
1       5.1       3.5           99.4          11            1
2       4.9       3             98.6          11            1
3       5.975     3.2           96.14117647   11            1
4       4.6       3.1           98.5          1             1
5       5         3.105882353   98.4          11            1
6       5.4       3.9           98.3          12            1
7       7         3.2           95.3          6             2
8       6.4       3.2           95.5          7             2
9       6.9       3.1           95.1          7             2
10      5.975     2.3           96            5             2
11      6.5       2.8           95.4          7             2
12      5.7       3.105882353   95.5          5             2
13      6.3       3.3           96.14117647   8             2
14      6.7       3.3           94.3          3             3
15      6.7       3             94.8          2             3
16      6.3       2.5           95            10            3
17      5.975     3             94.8          4             3
18      6.2       3.4           94.6          2             3
19      5.9       3             94.9          9             3
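Mean replacement of the kind shown in the table can be sketched in base R as follows. The data frame usage is a small hypothetical example that borrows two of the column names, so the imputed values differ from those in the table.

```r
# Hypothetical data with one missing value per column
usage <- data.frame(
  cmusage  = c(5.1, 4.9, NA, 4.6),
  l3musage = c(3.5, 3.0, 3.2, NA)
)

# For each column, replace NA with the mean of the observed values
for (col in names(usage)) {
  missing <- is.na(usage[[col]])
  usage[[col]][missing] <- mean(usage[[col]], na.rm = TRUE)
}
print(usage)
```

Using median() in place of mean() in the loop gives median replacement instead.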
TRANSFORMATION / NORMALIZATION
z transform:
Transformed data = (Data – Mean) / SD
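The z transform can be computed directly from the formula or with the built-in scale() function. A short sketch on an illustrative vector shows that both give the same result, with mean 0 and standard deviation 1.

```r
v <- c(2, 4, 6, 8, 10)  # illustrative values

# Direct application of (Data - Mean) / SD
z_manual <- (v - mean(v)) / sd(v)

# scale() centres and scales by default; it returns a one-column matrix
z_scale <- as.numeric(scale(v))

print(z_manual)
print(c(mean(z_manual), sd(z_manual)))
```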
Example: Take a random sample of size 60 (10%) from the data given in the
file bank-data.csv and save it as a new csv file.
> mydata = read.csv("bank-data.csv")   ## assumes the file is in the working directory
> mysample = mydata[sample(1:nrow(mydata), 60, replace = FALSE),]
> write.csv(mysample, "E:/IIFT/mysample.csv")
Example: Randomly split the data given in the file bank-data.csv into two sets, namely
training (75%) and test (25%).
> mydata = read.csv("bank-data.csv")   ## assumes the file is in the working directory
> grp = sample(2, nrow(mydata), replace = TRUE, prob = c(0.75, 0.25))
> sample1 = mydata[grp == 1, ]   ## training set (about 75% of rows)
> sample2 = mydata[grp == 2, ]   ## test set (about 25% of rows)
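Assigning each row to a group with probabilities 0.75 and 0.25 gives group sizes that are only approximately 75% and 25%. If an exact split is wanted, one alternative is to sample row indices, as sketched below. The data frame here is a simulated stand-in with 600 rows (the size implied by "60 (10%)" above); its columns are hypothetical.

```r
set.seed(42)  # makes the random split reproducible

# Simulated stand-in for the bank data; columns are hypothetical
mydata <- data.frame(id = 1:600, income = runif(600, 5000, 60000))

# Draw exactly 75% of the row numbers without replacement
n_train   <- floor(0.75 * nrow(mydata))
train_idx <- sample(nrow(mydata), n_train)

training <- mydata[train_idx, ]   # the sampled rows
test     <- mydata[-train_idx, ]  # every remaining row

print(c(nrow(training), nrow(test)))
```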
Any questions?
You may also send your question(s) to tanujitisi@gmail.com