Module 1: Unit - 1.1: Introduction To Analytics or R Programming
Introduction to Analytics or R programming
Topic: Introduction to Analytics or R programming
Session Goals: By the end of this session, you will be able to:
1. Understand R
2. Use functions of R
Session Plan:
Activity Location
Knowing language R Classroom
Summary Classroom
Student Handbook – SSC/ Q2101 – Associate Analytics
Look at R!
R-Studio Interface
R-Commander Interface
Using R as calculator
R can be used as a calculator. For example, to find the value of 2+2, type:
> 2+2
Press Enter and R prints the answer:
[1] 4
Any other calculation can be done in the same way as on a calculator.
1. Log of 2
2. 2^3 × 3^2
3. e^3
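The three exercises above can be tried directly at the R prompt. The sketch below reads them as the natural log of 2, 2^3 × 3^2 and e^3, which is an assumption based on the layout of the exercise list; the variable names are only illustrative.

```r
# Natural logarithm of 2 (log() uses base e unless another base is given)
log_of_2 <- log(2)

# 2 cubed multiplied by 3 squared
power_product <- 2^3 * 3^2

# e raised to the power 3
e_cubed <- exp(3)

print(log_of_2)
print(power_product)
print(e_cubed)
```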
Understanding components of R
1. Data Type:
At a very broad level, data is classified into two types: Numeric and Character data.
Numeric Data: - It is made up of the digits 0 to 9, the decimal point “.” and the negative sign “-”.
Character Data: - Everything that is not numeric data is character data. For example, names,
gender etc.
For example, “1, 2, 3…” are quantitative data while “Good”, “Bad” etc. are qualitative data.
However, we can convert qualitative data into quantitative data using ordinal values.
For example, “Good” can be rated as 9, “Average” as 5 and “Bad” as 0.
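One way to carry out such a conversion in R is a named lookup vector. This is a minimal sketch: the sample ratings are made up for illustration, and the scores 9, 5 and 0 are the ones suggested in the text.

```r
# Hypothetical qualitative ratings (character data)
ratings <- c("Good", "Bad", "Average", "Good")

# Ordinal scores taken from the text: Good = 9, Average = 5, Bad = 0
scores <- c(Good = 9, Average = 5, Bad = 0)

# Look up each rating by name; unname() drops the labels from the result
numeric_ratings <- unname(scores[ratings])

print(class(ratings))          # character
print(class(numeric_ratings))  # numeric
print(numeric_ratings)
```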
2. Data Frame:
A data frame is used for storing data tables. It is a list of vectors of equal length.
The top line of the table, called the header, contains the column names. Each horizontal line afterward
denotes a data row, which begins with the name of the row, followed by the actual data. Each data
member of a row is called a cell. To retrieve the data in a cell, we enter its row and column coordinates in
the single square bracket "[]" operator. The two coordinates are separated by a comma. In other words, the
coordinates begin with the row position, followed by a comma, and end with the column position. The
order is important.
For Example,
Here is the cell value from the first row, second column of mtcars.
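A short sketch of this lookup, using the mtcars data frame that ships with R. The second lookup by row and column name is an extra illustration, not part of the original example.

```r
# Row 1, column 2 of mtcars: row position first, then column position
cell <- mtcars[1, 2]
print(cell)

# The same cell addressed by row name and column name
cell_by_name <- mtcars["Mazda RX4", "cyl"]
print(cell_by_name)
```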
We have two different options for constructing matrices or arrays. Either we use the
creator functions matrix() and
array(), or we simply change the dimensions using the dim() function.
For example, we make an array with four columns, three rows, and two “tables” like this:
> my.array <- array(1:24, dim=c(3,4,2))
In the above example, “my.array” is the name we have given to the array, and “<-” is the
assignment operator.
The array holds 24 values, written as “1:24”, arranged in three dimensions “(3, 4, 2)”.
Note: - Although the rows are given as the first dimension, the tables are filled column-wise. So, for arrays,
R fills the columns, then the rows, and then the rest.
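The column-wise filling described in the note can be checked directly. The sketch below assumes the array is built with array(1:24, dim = c(3, 4, 2)), which matches the dimensions given in the text.

```r
# 3 rows, 4 columns, 2 "tables", filled with the numbers 1 to 24
my.array <- array(1:24, dim = c(3, 4, 2))

# The first table: values 1 to 3 run down the first column, 4 to 6 down the second
first_table <- my.array[, , 1]
print(first_table)

# Row 2, column 3 of the second table
value <- my.array[2, 3, 2]
print(value)
```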
Alternatively, you could just add the dimensions using the dim() function. This is a little hack that goes a
bit faster than using the array() function; it’s especially useful if you have your data already in a vector.
(This little trick also works for creating matrices, by the way, because a matrix is nothing more than an
array with only two dimensions.)
Say you already have a vector with the numbers 1 through 24, like this:
> my.vector <- 1:24
You can easily convert that vector to an array exactly like my.array simply by assigning the dimensions,
like this:
> dim(my.vector) <- c(3,4,2)
You can check whether two objects are identical by using the identical() function.
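Putting the two approaches side by side gives a self-contained check. This sketch assumes my.array was created with array(1:24, dim = c(3, 4, 2)), as the text describes.

```r
# Approach 1: the array() creator function
my.array <- array(1:24, dim = c(3, 4, 2))

# Approach 2: start from a plain vector and assign its dimensions
my.vector <- 1:24
dim(my.vector) <- c(3, 4, 2)

# identical() returns a single TRUE or FALSE
same <- identical(my.array, my.vector)
print(same)
```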
We can import datasets from various sources and in various file types, for example:
.csv format
Big data tool – Impala
CSV File
Sample data can be stored in comma separated values (CSV) format. Each cell inside such a data file is
separated by a special character, which is usually a comma, although other characters can be used as well.
The first row of the data file should contain the column names instead of the actual data. Here is a sample
of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
After we copy and paste the data above into a file named "mydata.csv" with a text editor, we can read the
data with the function read.csv.
In various European locales, where the comma character serves as the decimal point, the
function read.csv2 should be used instead. For further details of the read.csv and read.csv2 functions, please
consult the R documentation.
> help(read.csv)
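The whole round trip can be sketched as follows. To keep the example self-contained, it writes the sample rows to a temporary folder instead of relying on a text editor; the path csv_path is therefore an assumption of this sketch, not a required location.

```r
# The sample rows from the text
csv_text <- c("Col1,Col2,Col3",
              "100,a1,b1",
              "200,a2,b2",
              "300,a3,b3")

# Write them to mydata.csv in a temporary folder
csv_path <- file.path(tempdir(), "mydata.csv")
writeLines(csv_text, csv_path)

# read.csv treats the first row as column names by default (header = TRUE)
mydata <- read.csv(csv_path)
print(mydata)
```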
Big data tool – Impala
Cloudera Impala is a massively parallel processing (MPP) SQL query engine that runs natively
in Apache Hadoop.
RImpala enables querying the data residing in HDFS and Apache HBase from R, which can be further
processed as an R object using R functions. RImpala is now available for download from the
Comprehensive R Archive Network (CRAN) under GNU General Public License (GPL3).
To install RImpala:
We use the following code to install the RImpala package.
> install.packages("RImpala")
“getwd” means get the working directory (wd), and “setwd” is used to set the working directory.
Shaan, 21, M
Ritu, 24, F
Raj, 31, M
Working on Variables
Before learning about creating and modifying variables in R, we look at the various operators in R.
Imputing missing data using standard methods and algorithmic approaches (mice package in R):
In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing zero by zero) are represented by the symbol NaN (not a number).
Unlike SAS, R uses the same symbol for character and numeric data.
For example,
we define “y” and then check whether it contains any missing value. TRUE means that the value is
missing.
y <- c(1,2,3,NA)
is.na(y)
# returns the vector (FALSE FALSE FALSE TRUE)
For example,
we can create a new dataset without missing data as below:
newdata <- na.omit(mydata)
Or, we can pass “na.rm=TRUE” as an argument to the function. In the example below, na.rm removes the
missing value before the mean is computed, and we get the desired result.
x <- c(1,2,NA,3)
mean(x, na.rm=TRUE)
# returns 2
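Instead of dropping the missing value, a standard imputation method replaces it. The sketch below uses simple mean imputation in base R; it is not the algorithm used by the mice package, and the name x_imputed is only illustrative.

```r
x <- c(1, 2, NA, 3)

# Copy the vector, then fill every missing position with the mean of the observed values
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

print(x_imputed)
print(anyNA(x_imputed))
```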
Outliers:
An outlier is a point or an observation that deviates significantly from the other observations.
●Outliers arise due to experimental errors or “special circumstances”
●Outlier detection tests are used to check for outliers
●Outlier treatment –
Retention
Exclusion
The “outliers” package in R can be used to detect and treat outliers in data.
Normally we use a box plot or a scatter plot to find outliers graphically.
Scatter Plot
Box Plot
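The rule behind a box plot can also be applied without drawing anything. This sketch uses boxplot.stats() from base R rather than the outliers package, on made-up measurements; points lying more than 1.5 times the box height beyond the box are reported as outliers.

```r
# Hypothetical measurements with one suspicious value
x <- c(10, 12, 11, 13, 12, 11, 50)

# boxplot.stats() returns the values a box plot would draw as separate points
flagged <- boxplot.stats(x)$out
print(flagged)

# Outlier treatment by exclusion: keep only the values that were not flagged
x_clean <- x[!(x %in% flagged)]
print(x_clean)
```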
To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join
two data frames by one or more common key variables (i.e., an inner join).
For example,
to merge two data frames by ID:
total <- merge(dataframeA, dataframeB, by="ID")
To merge on more than one criterion, we pass a vector of column names as the “by” argument.
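Both kinds of merge can be sketched with small made-up data frames. All column values and the names dataframeC, dataframeD and both_keys are hypothetical.

```r
# Merge on a single key
dataframeA <- data.frame(ID = c(1, 2, 3), Name = c("Shaan", "Ritu", "Raj"))
dataframeB <- data.frame(ID = c(2, 3, 4), Age = c(24, 31, 40))
total <- merge(dataframeA, dataframeB, by = "ID")
print(total)   # only IDs present in both data frames are kept

# Merge on two keys: a row matches only if ID and Gender both agree
dataframeC <- data.frame(ID = c(1, 2), Gender = c("M", "F"), Score = c(70, 85))
dataframeD <- data.frame(ID = c(1, 2), Gender = c("M", "M"), City = c("Pune", "Delhi"))
both_keys <- merge(dataframeC, dataframeD, by = c("ID", "Gender"))
print(both_keys)
```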
To join two data frames vertically, we use the rbind function.
For example,
total <- rbind(dataframeA, dataframeB)
Note:-
If dataframeA has variables that dataframeB does not, then either:
1. Delete the extra variables in dataframeA or
2. Create the additional variables in dataframeB and set them to NA
(missing) before joining them with rbind().
We use the cbind() function to combine data by column; the syntax is the same as for rbind().
We can also use rbind.fill() from the plyr package in R. It combines a list of data frames, filling missing columns
with NA.
For example,
rbind.fill(mtcars[c("mpg", "wt")], mtcars[c("wt", "cyl")])
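The second option in the note, creating the additional variables and setting them to NA before calling rbind(), can be sketched in base R without plyr. The data frames and the helper name missing_cols are hypothetical.

```r
dataframeA <- data.frame(ID = c(1, 2), Name = c("Shaan", "Ritu"), Age = c(21, 24))
dataframeB <- data.frame(ID = 3, Name = "Raj")   # has no Age column

# Find the variables that dataframeA has and dataframeB lacks
missing_cols <- setdiff(names(dataframeA), names(dataframeB))

# Create them in dataframeB and set them to NA
dataframeB[missing_cols] <- NA

# Both data frames now have the same variables, so rbind() works
total <- rbind(dataframeA, dataframeB)
print(total)
```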
“FOR” Loop:-
We can repeat an action for every value in a vector by using a “for” loop.
We construct a “for” loop in R as follows:
for(i in values){
... do something ...
}
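A concrete instance of this pattern, with made-up values, adds up the elements of a vector one at a time.

```r
values <- c(2, 4, 6)
running_total <- 0

# The loop body runs once for each element of values
for (i in values) {
  running_total <- running_total + i
  cat("added", i, "running total is", running_total, "\n")
}
```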
“IFELSE” Function:-
When using R, sometimes we need our function to do something if a condition is true and something else if
it is not.
You could do this with two if statements, but there’s an easier way in R: an if...else statement.
An if…else statement contains the same elements as an if statement, and then some extra:
●The keyword else, placed after the first code block
●A second block of code, contained within braces, that has to be carried out if and only if the result of the
condition in the if() statement is FALSE
For example,
if(hours > 100) net.price <- net.price * 0.9
if(public) {tot.price <- net.price * 1.06} else {tot.price <- net.price * 1.12}
round(tot.price)
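The lines above read like the body of a function. One possible complete version is sketched below; the function name priceCalculator, the hourly rate argument pph and its default value are assumptions made only so that the example runs.

```r
# hours: hours worked; pph: price per hour (assumed); public: TRUE for public-sector clients
priceCalculator <- function(hours, pph = 40, public = TRUE) {
  net.price <- hours * pph
  # 10% discount for more than 100 hours
  if (hours > 100) net.price <- net.price * 0.9
  # Add 6% for public clients, 12% otherwise
  if (public) {
    tot.price <- net.price * 1.06
  } else {
    tot.price <- net.price * 1.12
  }
  round(tot.price)
}

public_price <- priceCalculator(hours = 25, public = TRUE)
private_price <- priceCalculator(hours = 110, public = FALSE)
print(public_price)
print(private_price)
```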
2. Which of the following is the function to display the first 5 rows of data?
Case Study:
Segregate Sepal length based on the “Sepal Length” and “Sepal Width”. Use the dataset named “IRIS”.
Data Set
Summary
Topic: Summarizing Data, and Revisiting Probability
Session Goals: By the end of this session, you will be able to:
1. Summarize Data
2. Work on Probability.
Session Plan:
Activity Location
Summary statistics- summarizing data with R Classroom
Probability Classroom
Summary Classroom
•summary(data_frame)
•summary(iris)
•Output : Mean, Median, Minimum, Maximum, 1st and 3rd quartile
> summary(dataset)
For example,
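A minimal sketch using the iris dataset that ships with R. The choice of the Sepal.Length column is only for illustration.

```r
# Summary of one numeric column: minimum, quartiles, median, mean, maximum
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)

# Summary of the whole data frame, one block per column
print(summary(iris))
```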
Probability
A probability distribution describes how the values of a random variable are distributed.
For example, the collection of all possible outcomes of a sequence of coin tosses is known to follow
the binomial distribution, whereas the means of sufficiently large samples of a data population are
known to resemble the normal distribution. Since the characteristics of these theoretical distributions are
well understood, they can be used to make statistical inferences about the entire data population.
Now, “at random” means that no card is treated with any bias and the result is left entirely
to chance.
So, number of Aces of Diamonds in a pack = S = 1
Total number of possible outcomes = total number of cards in the pack = P = 52
Probability of a positive outcome = S/P = 1/52
That is, we have a 1.92% chance of getting a positive outcome.
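The same calculation can be done in R and compared with a simulation. In this sketch the cards are numbered 1 to 52 and card 1 stands for the Ace of Diamonds; that numbering is an assumption made for illustration.

```r
favourable <- 1    # S: number of Aces of Diamonds in the pack
possible <- 52     # P: number of cards in the pack

p <- favourable / possible
percent <- round(100 * p, 2)
print(percent)

# Simulate drawing one card at random many times and count how often card 1 appears
set.seed(1)
draws <- sample(1:52, size = 100000, replace = TRUE)
estimate <- mean(draws == 1)
print(estimate)
```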
Expected value
The expected value of a random variable is intuitively the long-run average value of repetitions of the
experiment it represents.
For example, the expected value of a die roll is 3.5 because, roughly speaking, the average of an
extremely large number of die rolls is practically always nearly equal to 3.5.
Less roughly, the law of large numbers guarantees that the arithmetic mean of the values
almost surely converges to the expected value as the number of repetitions goes to infinity.
The expected value is also known as the expectation, mathematical expectation, EV, mean, or first
moment.
More practically, the expected value of a discrete random variable is the probability-weighted average of
all possible values. In other words, each possible value the random variable can assume is multiplied by its
probability of occurring, and the resulting products are summed to produce the expected value. The same
works for continuous random variables, except the sum is replaced by an integral and the probabilities
by probability densities. The formal definition subsumes both of these and also works for distributions
which are neither discrete nor continuous: the expected value of a random variable is the integral of the
random variable with respect to its probability measure. The expected value is a key aspect of how one
characterizes a probability distribution; it is a location parameter.
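For a fair six-sided die, the probability-weighted average and the long-run average can be compared in a few lines. This is a sketch: the number of simulated rolls and the seed are arbitrary choices.

```r
faces <- 1:6
probs <- rep(1/6, 6)   # a fair die: every face is equally likely

# Expected value: each value multiplied by its probability, then summed
expected_value <- sum(faces * probs)
print(expected_value)

# Law of large numbers: the mean of many rolls comes close to the expected value
set.seed(42)
rolls <- sample(faces, size = 100000, replace = TRUE)
sample_mean <- mean(rolls)
print(sample_mean)
```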
Random Variable:
A random variable, aleatory variable or stochastic variable is a variable whose value is subject to
variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take
on a set of possible different values (similarly to other mathematical variables), each with an
associated probability, in contrast to other mathematical variables.
Random variables can be discrete, that is, taking any of a specified finite or countable list of values,
endowed with a probability mass function, characteristic of a probability distribution; or continuous, taking
any numerical value in an interval or collection of intervals, via a probability density function that is
characteristic of a probability distribution; or a mixture of both types. The realizations of a random
variable, that is, the results of randomly choosing values according to the variable's probability distribution
function, are called random variates.
For Example,
If we toss a coin 10 times and get heads 8 times, we still cannot say whether the 11th toss will
give a head or a tail. But we are sure that we will get either a head or a tail.
Probability Distribution
A Probability Distribution Function or PDF is the function that defines the probability of outcomes based on
certain conditions.
Based on these conditions, there are five major types of PDF.
Normal Distribution
We come now to the most important continuous probability density function and perhaps the most
important probability distribution of any sort, the normal distribution.
On several occasions, we have observed its occurrence in graphs from apparently widely differing
sources: the sums when three or more dice are thrown; the binomial distribution for large values of n; and
the hypergeometric distribution.
There are many other examples as well and several reasons, which will appear here, to call this
distribution “normal.”
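If X has the probability density function
$$f(x) = \frac{1}{b\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-a}{b}\right)^{2}}, \quad -\infty < x < \infty, \quad b > 0,$$
we say that X has a normal probability distribution. Here a is the mean and b is the standard deviation.
A graph of a normal distribution, where we have chosen a = 0 and b = 1, appears in the figure below: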
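The density can be evaluated in R either by hand or with the built-in dnorm() function. The sketch below assumes the usual normal density with mean a and standard deviation b; the helper normal_density is written only for comparison.

```r
a <- 0   # mean
b <- 1   # standard deviation

# Normal density written out by hand
normal_density <- function(x, a, b) {
  exp(-0.5 * ((x - a) / b)^2) / (b * sqrt(2 * pi))
}

x <- seq(-4, 4, by = 0.1)
manual <- normal_density(x, a, b)
builtin <- dnorm(x, mean = a, sd = b)

# The two agree up to floating-point rounding
print(all.equal(manual, builtin))

# The curve peaks at x = a
peak_at <- x[which.max(builtin)]
print(peak_at)
```

Calling plot(x, builtin, type = "l") on these values draws the familiar bell-shaped curve.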
The normal distribution f(x), with any mean μ and any positive standard deviation σ, has the
following properties:
It is symmetric around the point x = μ, which is at the same time the mode, the median and the mean of the
distribution.
It is unimodal: its first derivative is positive for x < μ, negative for x > μ, and zero only at x = μ.
Its density has two inflection points (where the second derivative of f is zero and changes sign), located one
standard deviation away from the mean, namely at x = μ − σ and x = μ + σ.
Its density is log-concave.
Its density is infinitely differentiable, indeed supersmooth of order 2.
Its second derivative f′′(x) is equal to twice its derivative with respect to its variance σ².
is normally distributed.
• In Bayesian statistics, one does not "test normality" per se, but rather computes the likelihood that
the data come from a normal distribution with given parameters μ,σ (for all μ,σ), and compares that with the
likelihood that the data come from other distributions under consideration, most simply using a Bayes factor
(giving the relative likelihood of seeing the data given different models), or more finely taking a prior
distribution on possible models and parameters and computing a posterior distribution given the computed
likelihoods.
1. Graphical methods:
An informal approach to testing normality is to compare a histogram of the sample data to a normal
probability curve. The empirical distribution of the data (the histogram) should be bell-shaped and
resemble the normal distribution. This might be difficult to see if the sample is small. In this case one might
proceed by regressing the data against the quantiles of a normal distribution with the same mean and
variance as the sample. Lack of fit to the regression line suggests a departure from normality (see
Anderson–Darling coefficient and Minitab).
A graphical tool for assessing normality is the normal probability plot, a quantile-quantile plot (QQ plot) of
the standardized data against the standard normal distribution. Here the correlation between the sample data
and normal quantiles (a measure of the goodness of fit) measures how well the data are modeled by a
normal distribution. For normal data the points plotted in the QQ plot should fall approximately on a
straight line, indicating high positive correlation. These plots are easy to interpret and also have the benefit
that outliers are easily identified.
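The QQ-plot correlation described above can be computed with base R. In this sketch the sample is simulated, so the sample size, mean, standard deviation and seed are all arbitrary assumptions.

```r
set.seed(123)
sample_data <- rnorm(100, mean = 50, sd = 10)   # simulated data for illustration

# qqnorm() pairs each sorted observation with its theoretical normal quantile;
# plot.it = FALSE returns the coordinates without drawing the plot
qq <- qqnorm(sample_data, plot.it = FALSE)

# Correlation between theoretical quantiles (x) and sample values (y)
qq_correlation <- cor(qq$x, qq$y)
print(qq_correlation)
```

Calling qqnorm(sample_data) followed by qqline(sample_data) draws the plot itself with a reference line.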
Back-of-the-envelope test
A simple back-of-the-envelope test takes the sample maximum and minimum and computes their z-score, or
more properly t-statistic (the number of sample standard deviations that a sample is above or below the sample
mean), and compares it to the 68–95–99.7 rule: if one has a 3σ event (properly, a 3s event) and substantially
fewer than 300 samples, or a 4s event and substantially fewer than 15,000 samples, then a normal
distribution will understate the maximum magnitude of deviations in the sample data.
This test is useful in cases where one faces kurtosis risk – where large deviations matter – and has the
benefits that it is very easy to compute and to communicate: non-statisticians can easily grasp that "6σ
events are very rare in normal distributions".
2. Frequentist tests:
Tests of univariate normality include D'Agostino's K-squared test, the Jarque–Bera test, the
Anderson–Darling test, the Cramér–von Mises criterion, the Lilliefors test for normality (itself an adaptation of the
Kolmogorov–Smirnov test), the Shapiro–Wilk test, Pearson's chi-squared test, and the Shapiro–Francia
test. A 2011 paper from The Journal of Statistical Modeling and Analytics, comparing the Shapiro–Wilk,
Kolmogorov–Smirnov, Lilliefors, and Anderson–Darling tests, concludes that Shapiro–Wilk has
the best power for a given significance, followed closely by Anderson–Darling.
Some published works recommend the Jarque–Bera test. But it is not without weakness. It has low power for
distributions with short tails, especially for bimodal distributions. Other authors have declined to include
Historically, the third and fourth standardized moments (skewness and kurtosis) were some of the earliest
tests for normality. The Jarque–Bera test is itself derived from skewness and kurtosis estimates. Mardia's
multivariate skewness and kurtosis tests generalize the moment tests to the multivariate case. Other early
test statistics include the ratio of the mean absolute deviation to the standard deviation and of the range to
the standard deviation.
More recent tests of normality include the energy test (Székely and Rizzo) and the tests based on the
empirical characteristic function (ecf) (e.g. Epps and Pulley, Henze–Zirkler, BHEP test). The energy and
the ecf tests are powerful tests that apply for testing univariate or multivariate normality and are statistically
consistent against general alternatives.
The normal distribution has the highest entropy of any distribution for a given standard deviation. There
are a number of normality tests based on this property, the first attributable to Vasicek.
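Of the tests named above, the Shapiro–Wilk test is available in base R as shapiro.test(). The sketch below applies it to two simulated samples; the sample sizes, distributions and seed are assumptions made for illustration.

```r
set.seed(2024)
normal_sample <- rnorm(200)   # drawn from a normal distribution
skewed_sample <- rexp(200)    # drawn from a strongly skewed distribution

normal_result <- shapiro.test(normal_sample)
skewed_result <- shapiro.test(skewed_sample)

# A small p-value is evidence against normality
print(normal_result$p.value)
print(skewed_result$p.value)
```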
3. Bayesian tests:
Kullback–Leibler divergences between the whole posterior distributions of the slope and variance do not
indicate non-normality. However, the ratio of expectations of these posteriors and the expectation of the
ratios give similar results to the Shapiro–Wilk statistic except for very small samples, when non-
informative priors are used.
Spiegelhalter suggests using a Bayes factor to compare normality with a different class of
distributional alternatives. This approach has been extended by Farrell and Rogers-Stewart.
The central limit theorem states that under certain (fairly common) conditions, the sum of many random
variables will have an approximately normal distribution.
More specifically, where X1, …, Xn are independent and identically distributed random variables with the
same arbitrary distribution, zero mean, and variance σ², and Z is their mean scaled by √n,
$$Z = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right),$$
then, as n increases, the probability distribution of Z will tend to the normal distribution with zero
mean and variance σ².
The central limit theorem also implies that certain distributions can be approximated by the
normal distribution, for example:
• The binomial distribution B(n, p) is approximately normal with mean np and variance np(1−p) for
large n and for p not too close to zero or one.
• The Poisson distribution with parameter λ is approximately normal with mean λ and variance λ,
for large values of λ.
• The chi-squared distribution χ²(k) is approximately normal with mean k and variance 2k, for large
k.
• The Student's t-distribution t(ν) is approximately normal with mean 0 and variance 1 when ν is
large.
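The theorem can be illustrated by simulation. This sketch assumes the X values are drawn from a Uniform(−1, 1) distribution, which has zero mean and variance 1/3; the sample size n, the number of repetitions and the seed are arbitrary.

```r
set.seed(7)
n <- 50          # number of variables averaged in each repetition
reps <- 5000     # number of repetitions
sigma2 <- 1/3    # variance of Uniform(-1, 1)

# Z is the sample mean scaled by sqrt(n)
z <- replicate(reps, sqrt(n) * mean(runif(n, min = -1, max = 1)))

# Z should have mean close to 0 and variance close to sigma2
print(mean(z))
print(var(z))
```

A histogram drawn with hist(z) shows the bell shape even though each individual X is uniform.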
Random walk
A random walk is a mathematical formalization of a path that consists of a succession of random steps.
For example, the path traced by a molecule as it travels in a liquid or a gas, the search path of a foraging
animal, the price of a fluctuating stock and the financial status of a gambler can all be modeled as random
walks, although they may not be truly random in reality. The term random walk was first introduced by
Karl Pearson in 1905.
Random walks have been used in many fields: ecology, economics, psychology, computer science,
physics, chemistry, and biology.
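A simple one-dimensional random walk can be sketched with cumsum(). The step size of 1, the number of steps and the seed are assumptions made for illustration.

```r
set.seed(10)

# Each step is +1 or -1 with equal probability
steps <- sample(c(-1, 1), size = 1000, replace = TRUE)

# The position after each step is the running sum of the steps
walk <- cumsum(steps)

print(head(walk, 10))
print(range(walk))
```

Calling plot(walk, type = "l") draws the path.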
Summary
Normality tests are used to determine if a data set is well-modeled by a normal distribution and to
compute how likely it is for a random variable underlying the data set to be normally distributed.
A random walk is a mathematical formalization of a path that consists of a succession of random steps.
Probability Distribution Function or PDF is the function that defines probability of outcomes based
on certain conditions.
A random variable, aleatory variable or stochastic variable is a variable whose value is subject
to variations due to chance.
Amir buys a chocolate bar every day during a promotion that says one out of six chocolate bars
has a gift coupon inside. Answer the following questions:
•What is the distribution of the number of chocolates with gift coupons in seven days?
•What is the probability that Amir gets no chocolates with gift coupons in seven days?
•Amir gets no gift coupons for the first six days of the week. What is the chance that he will get
one on the seventh day?
•Amir buys a bar every day for six weeks. What is the probability that he gets at least three
gift coupons?
•How many days of purchase are required so that Amir’s chance of getting at least one
gift coupon is 0.95 or greater?
Solution:
Hints:
Formula: $$P(X = r) = {}^{n}C_{r}\, p^{r} q^{n-r}$$
Where
n is the no. of trials
r is the number of successful outcomes
p is the probability of success, and
q is the probability of failure.
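The hint can be turned into R code as follows. This is a sketch that assumes each bar independently contains a coupon with probability 1/6, and it reads "six weeks" as 42 daily purchases; the variable names are illustrative.

```r
p <- 1/6      # probability of success: a bar contains a coupon
q <- 1 - p    # probability of failure

# Distribution of the number of coupons in 7 days: the formula by hand and via dbinom()
n <- 7
r <- 0:n
by_formula <- choose(n, r) * p^r * q^(n - r)
by_dbinom <- dbinom(r, size = n, prob = p)

# Probability of no coupons in seven days
p_none <- dbinom(0, size = 7, prob = p)

# Purchases are independent, so six earlier failures do not change the seventh day
p_seventh_day <- p

# Probability of at least three coupons in 42 days
p_at_least_three <- 1 - pbinom(2, size = 42, prob = p)

# Smallest number of days with a chance of at least 0.95 of one or more coupons
days_needed <- ceiling(log(1 - 0.95) / log(q))

print(round(by_dbinom, 4))
print(p_none)
print(p_at_least_three)
print(days_needed)
```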