1st Unit Notes (1)
Probability Distributions:-
Suppose you are a teacher at a university. After checking assignments for a week, you graded all
the students. You gave these graded papers to a data entry operator at the university and told him
to create a spreadsheet containing the grades of all the students. But he only stored the grades
and not the corresponding students.
He made another blunder: in a hurry he missed a couple of entries, and we have no idea whose
grades are missing. Let's find a way to solve this.
One way is that you visualize the grades and see if you can find a trend in the data.
The graph that you have plotted is called the frequency distribution of the data. You see that there
is a smooth curve-like structure that defines our data, but do you notice an anomaly? We have an
abnormally low frequency at a particular score range. So the best guess is that the missing entries
lie in that range, which would remove the dent in the distribution.
This is how you would try to solve a real-life problem using data analysis. For any data scientist,
student, or practitioner, distributions are a must-know concept. They provide the basis for
analytics and inferential statistics.
While the concept of probability gives us the mathematical calculations, distributions help us
actually visualize what’s happening underneath.
Table of Contents:-
Types of Distributions
Bernoulli Distribution
Let’s start with the easiest distribution that is Bernoulli Distribution. It is actually easier to
understand than it sounds!
All you cricket junkies out there! At the beginning of any cricket match, how do you decide who
is going to bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let's say
if the toss results in a head, you win. Else, you lose. There’s no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure),
and a single trial. So the random variable X which has a Bernoulli distribution can take value 1
with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.
Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two possible
outcomes.
The probability mass function is given by: P(X = x) = p^x (1 − p)^(1−x), where x ∈ {0, 1}.
It can also be written as P(X = 1) = p and P(X = 0) = 1 − p.
The probabilities of success and failure need not be equally likely, like the result of a fight
between me and Undertaker. He is pretty much certain to win. So in this case the probability of my
success is 0.15, while that of my failure is 0.85.
Here, the probability of success (p) is not the same as the probability of failure. So, the chart below
shows the Bernoulli Distribution of our fight.
Here, the probability of success = 0.15 and the probability of failure = 0.85. The expected value is
exactly what it sounds like. If I punch you, I may expect you to punch me back. Basically, the
expected value of any distribution is the mean of the distribution. The expected value of a random variable
X from a Bernoulli distribution is found as follows:
E(X) = 1*p + 0*(1-p) = p
The variance of a random variable from a Bernoulli distribution is:
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
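To make these formulas concrete, here is a minimal sketch (assuming SciPy is available) that checks E(X) = p and V(X) = p(1 − p) for the fight example with p = 0.15; the printed values are only illustrative.

# Bernoulli distribution sketch for p = 0.15
from scipy.stats import bernoulli

p = 0.15                    # assumed probability of success
dist = bernoulli(p)

print(dist.pmf(1))          # P(X = 1) = p           -> 0.15
print(dist.pmf(0))          # P(X = 0) = 1 - p       -> 0.85
print(dist.mean())          # E(X) = p               -> 0.15
print(dist.var())           # V(X) = p(1 - p)        -> 0.1275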
There are many examples of a Bernoulli distribution, such as whether it is going to rain tomorrow
or not (where rain denotes success and no rain denotes failure), or winning (success) versus losing
(failure) a game.
Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are
equally likely, and that is the basis of a uniform distribution. Unlike the Bernoulli distribution, all
n possible outcomes of a uniform distribution are equally likely.
A variable X is said to be uniformly distributed if its density function is:
f(x) = 1 / (b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
You can see that the shape of the uniform distribution curve is rectangular, which is why the
uniform distribution is also called the rectangular distribution.
For a Uniform Distribution, a and b are the parameters.
The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of
40 and a minimum of 10.
Let’s try calculating the probability that the daily sales will fall between 15 and 30.
The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5
Similarly, the probability that daily sales are greater than 20 is = 0.667
The mean and variance of X following a uniform distribution is:
Mean -> E(X) = (a+b)/2
Variance -> V(X) = (b-a)²/12
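As a rough illustration of the flower-shop example, the sketch below (assuming SciPy is available) reproduces the two probabilities and the mean/variance formulas; note that SciPy parameterizes the uniform distribution with loc = a and scale = b − a.

# Uniform distribution sketch for the flower-shop example (a = 10, b = 40)
from scipy.stats import uniform

a, b = 10, 40
dist = uniform(loc=a, scale=b - a)

# P(15 <= X <= 30) = (30 - 15) / (40 - 10) = 0.5
print(dist.cdf(30) - dist.cdf(15))

# P(X > 20) = (40 - 20) / (40 - 10) ≈ 0.667
print(1 - dist.cdf(20))

print(dist.mean())   # (a + b) / 2 = 25
print(dist.var())    # (b - a)^2 / 12 = 75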
The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard uniform
density is given by f(x) = 1 for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
Binomial Distribution
Let's get back to cricket. Suppose that you won the toss today, and this indicates a successful
event. You toss again, but you lose this time. If you win a toss today, this does not mean that
you will win the toss tomorrow. Let's assign a random variable, say X, to the number of times
you won the toss. What can the possible values of X be? It can be any number, depending on the
number of times you tossed a coin.
There are only two possible outcomes. Head denoting success and tail denoting failure.
Therefore, probability of getting a head = 0.5 and the probability of failure can be easily
computed as: q = 1- p = 0.5.
A distribution where only two outcomes are possible, such as success or failure, gain or loss, win
or lose, and where the probability of success and failure is the same for all trials, is called a
binomial distribution.
The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of
failure can be easily computed as q = 1 – 0.2 = 0.8.
Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number
of times is called binomial. The parameters of a binomial distribution are n and p where n is the
total number of trials and p is the probability of success in each trial.
On the basis of the above explanation, the properties of a Binomial Distribution are
1. Each trial is independent.
2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials. (Trials are identical.)
A binomial distribution graph where the probability of success does not equal the probability of
failure is skewed, whereas when the two probabilities are equal (p = 0.5) the graph is symmetric.
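As a hedged sketch of the toss example, the code below (assuming SciPy is available) uses an assumed n = 10 tosses with p = 0.5 to compute a few binomial probabilities along with the mean np and variance np(1 − p).

# Binomial distribution sketch: 10 tosses of a fair coin
from scipy.stats import binom

n, p = 10, 0.5
dist = binom(n, p)

print(dist.pmf(6))    # probability of winning exactly 6 tosses  ≈ 0.205
print(dist.cdf(4))    # probability of winning at most 4 tosses  ≈ 0.377
print(dist.mean())    # n * p = 5
print(dist.var())     # n * p * (1 - p) = 2.5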
Normal Distribution
Normal distribution represents the behavior of most situations in the universe (that is why it is
called a "normal" distribution, I guess!). The sum of a large number of (small) random variables
often turns out to be normally distributed, which contributes to its widespread application. A
distribution is normal if its mean, median, and mode coincide and its curve is bell-shaped and
symmetric about the mean.
A normal distribution is quite different from a binomial distribution. However, if the number of
trials is large enough (and p is not too close to 0 or 1), the binomial distribution is well
approximated by a normal distribution.
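The approximation can be seen numerically. In the sketch below (assuming SciPy is available), a Binomial(1000, 0.5) CDF is compared with a normal CDF that has the same mean and standard deviation; the specific numbers are assumptions for illustration.

# Normal approximation to the binomial for a large number of trials
from scipy.stats import binom, norm

n, p = 1000, 0.5
b = binom(n, p)
g = norm(loc=n * p, scale=(n * p * (1 - p)) ** 0.5)   # matching mean and std

# For large n the two cumulative probabilities are close
print(b.cdf(520))
print(g.cdf(520))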
Poisson Distribution
Suppose you work at a call center; approximately how many calls do you get in a day? It can be
any number. The total number of calls at a call center in a day can be modeled by a Poisson
distribution. Some more examples are the number of emergency calls recorded at a hospital in a
day, or the number of customers arriving at a shop in an hour.
You can now think of many examples following the same course. Poisson
Distribution is applicable in situations where events occur at random points of
time and space wherein our interest lies only in the number of occurrences of
the event.
Note that as the mean (λ) increases, the Poisson curve shifts to the right.
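A minimal sketch of the call-center idea, assuming SciPy is available and an assumed average rate of λ = 4 calls per hour (the rate is made up purely for illustration):

# Poisson distribution sketch with an assumed rate of 4 calls per hour
from scipy.stats import poisson

lam = 4
dist = poisson(mu=lam)

print(dist.pmf(2))              # probability of exactly 2 calls in an hour
print(dist.cdf(5))              # probability of at most 5 calls in an hour
print(dist.mean(), dist.var())  # for a Poisson, mean = variance = λ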
Exponential Distribution
Let's consider the call center example one more time. What about the interval of time between
the calls? Here, the exponential distribution comes to our rescue: it models the interval of time
between calls.
f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0, where λ > 0 is the rate parameter.
For survival analysis, λ is called the failure rate of a device at any time t, given
that it has survived up to t.
P{X > x} = e^(−λx), which corresponds to the area under the density curve to the right of x.
P{x1 < X ≤ x2} = e^(−λx1) − e^(−λx2), which corresponds to the area under the density curve
between x1 and x2.
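These two formulas can be evaluated directly. The sketch below assumes the same made-up rate of λ = 4 calls per hour, so times are measured in hours.

# Exponential distribution sketch with an assumed rate λ = 4 per hour
import math

lam = 4
x1, x2 = 0.25, 0.5   # times in hours

# P(X > x1) = e^(-λ x1): probability the next call takes longer than 15 minutes
print(math.exp(-lam * x1))                        # ≈ 0.368

# P(x1 < X <= x2) = e^(-λ x1) - e^(-λ x2)
print(math.exp(-lam * x1) - math.exp(-lam * x2))  # ≈ 0.233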
Inferential Statistics: - Inferential statistics is a branch of statistics that makes the use of various
analytical tools to draw inferences about the population data from sample data. Inferential
statistics help to draw conclusions about the population while descriptive statistics summarizes
the features of the data set.
Inferential statistics enables one to make descriptions of data and draw inferences and
conclusions from the respective data.
Inferential statistics uses sample data because it is more cost-effective and less tedious than
collecting data from an entire population.
It allows one to come to reasonable assumptions about the larger population based on a sample’s
characteristics.
There are two main types of inferential statistics:
1. Hypothesis testing.
2. Regression analysis.
The samples chosen in inferential statistics need to be representative of the entire population.
Inferential Statistics:-
  Hypothesis testing
    ANOVA
  Regression Analysis
    Linear Regression
    Non-linear Regression
    Logistic Regression
Hypothesis Testing:
Hypothesis testing is a type of inferential statistics that is used to test assumptions and draw
conclusions about the population from the available sample data. It involves setting up a null
hypothesis and an alternative hypothesis followed by conducting a statistical test of significance.
A conclusion is drawn based on the value of the test statistic, the critical value, and
the confidence intervals. A hypothesis test can be left-tailed, right-tailed, or two-tailed. Commonly
used hypothesis tests in inferential statistics include the z-test, the t-test, and the F-test (ANOVA).
In today's data-driven world, decisions are based on data all the time. Hypothesis testing plays a
crucial role in that process, whether in making business decisions, in the health sector, in academia,
or in quality improvement. Without hypotheses and hypothesis tests, you risk drawing the wrong
conclusions and making bad decisions. In this tutorial, you will look at Hypothesis Testing
in Statistics.
Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a
population parameter to the test. It is used to estimate the relationship between 2 statistical
variables.
Now that you know about hypothesis testing, look at the two types of hypothesis testing in
statistics.
Null Hypothesis
Alternate Hypothesis
The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no
bearing on the study's outcome unless it is rejected.
Example:
A sanitizer manufacturer claims that its product kills 95 percent of germs on average.
To put this company's claim to the test, create a null and alternate hypothesis.
H0 (Null Hypothesis): Average = 95%.
Alternative Hypothesis (H1): The average is less than 95%.
Another straightforward example for understanding this concept is determining whether or not a coin
is fair and balanced. The null hypothesis states that the probability of showing heads is equal to
the probability of showing tails. In contrast, the alternative hypothesis states that the probabilities
of showing heads and tails would be very different.
Example (simple vs. composite hypothesis): a simple hypothesis specifies the population parameter
completely, whereas a composite hypothesis specifies a range of values.
A company is claiming that their average sales for this quarter are 1000 units. This is an example
of a simple hypothesis.
Suppose the company claims that the sales are in the range of 900 to 1000 units. Then this is a
case of a composite hypothesis.
Example (Type I and Type II errors): a Type I error occurs when we reject H0 although it is actually
true, while a Type II error occurs when we fail to reject H0 although H1 is true. Suppose a teacher
evaluates an examination paper to decide whether a student passes or fails. A Type II error would be
the case where the teacher passes the student [does not reject H0] although the student did not score
the passing marks [H1 is true].
Level of Significance:
The alpha value is a criterion for determining whether a test statistic is statistically significant. In
a statistical test, Alpha represents an acceptable probability of a Type I error. Because alpha is a
probability, it can be anywhere between 0 and 1. In practice, the most commonly used alpha
values are 0.01, 0.05, and 0.1, which represent a 1%, 5%, and 10% chance of a Type I error,
respectively (i.e. rejecting the null hypothesis when it is in fact correct).
P-Value
A p-value is a metric that expresses the likelihood that an observed difference could have
occurred by chance. As the p-value decreases the statistical significance of the observed
difference increases. If the p-value is too low, you reject the null hypothesis.
Here you have taken an example in which you are trying to test whether the new advertising
campaign has increased the product's sales. The p-value is the likelihood that the null hypothesis,
which states that there is no change in the sales due to the new advertising campaign, is true. If
the p-value is .30, then there is a 30% chance that there is no increase or decrease in the product's
sales. If the p-value is 0.03, then there is a 3% probability that there is no increase or decrease in
the sales value due to the new advertising campaign. As you can see, the lower the p-value, the
stronger the evidence against the null hypothesis, i.e. the stronger the evidence that the new
advertising campaign did change sales.
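As a hedged illustration of this idea, the sketch below (assuming SciPy is available) runs a one-sample t-test on hypothetical daily sales figures against the claimed mean of 1000 units; the data values and the 0.05 significance level are assumptions for illustration, not real results.

# One-sample t-test sketch: is the mean of daily sales still 1000 units?
from scipy import stats

# Hypothetical daily sales (in units) after the new advertising campaign
sales = [1012, 987, 1040, 995, 1031, 1008, 1022, 990, 1015, 1003]

t_stat, p_value = stats.ttest_1samp(sales, popmean=1000)
print(t_stat, p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the mean appears to differ from 1000 units.")
else:
    print("Fail to reject H0: no significant evidence of a change.")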
Conclusion:
You should now have a much better understanding of hypothesis testing, one of the most important
concepts in the field of Data Science. The majority of hypotheses are based on speculation about
observed behavior, natural phenomena, or established theories.
Random sampling can be a complex process and often depends on the particular characteristics
of a population. However, the fundamental principles involve:
1. Defining a population
2. Deciding your sample size
3. Randomly selecting a sample
4. Analyzing the data sample
1. Defining a population
This simply means determining the pool from which you will draw your sample, a population
can be anything—it isn’t limited to people. So it could be a population of objects, cities, cats,
pugs, or anything else from which we can derive measurements.
2. Deciding your sample size
The bigger your sample size, the more representative it will be of the overall population.
Drawing large samples can be time-consuming, difficult, and expensive. Indeed, this is why we
draw samples in the first place—it is rarely feasible to draw data from an entire population. Your
sample size should therefore be large enough to give you confidence in your results but not so
small that the data risk being unrepresentative. This is where using descriptive statistics can help,
as they allow us to strike a balance between size and accuracy.
3. Randomly selecting a sample
Once you've determined the sample size, you can draw a random selection. You might do this
using a random number generator, assigning each value a number and selecting the numbers at
random. Or you could do it using a range of similar techniques or algorithms (we won’t go into
detail here, as this is a topic in its own right, but you get the idea).
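A minimal sketch of this step, using Python's standard random module on a made-up population of 10,000 values; the population and the sample size of 100 are assumptions for illustration.

# Simple random sampling sketch
import random

population = list(range(10_000))       # hypothetical population of measurements

random.seed(42)                        # fixed seed for a reproducible illustration
sample = random.sample(population, k=100)   # draw 100 members without replacement

sample_mean = sum(sample) / len(sample)
population_mean = sum(population) / len(population)
print(sample_mean, population_mean)    # close, but rarely identical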
4. Analyzing the data sample
Once you have a random sample, you can use it to infer information about the larger population.
It’s important to note that while a random sample is representative of a population, it will never
be 100% accurate. For instance, the mean (or average) of a sample will rarely match the mean of
the full population, but it will give you a good idea of it. For this reason, it’s important to
incorporate your margin of error into any analysis. This is why any result from inferential techniques
is expressed in the form of a probability.
1. What is statistics?
It might seem silly to define a concept as ‘basic’ as statistics on a data analytics blog. However,
when we use these terms often, it’s easy to take them for granted.
Put simply, statistics is the area of applied math that deals with the collection, organization,
analysis, interpretation, and presentation of data. Sound familiar? It should. These are all vital
steps in the data analytics process. In fact, in many ways, data analytics is statistics. When we
use the term ‘data analytics’ what we really mean is ‘the statistical analysis of a given dataset or
data sets’. But that’s a bit of a mouthful, so we tend to shorten it.
Since they are so fundamental to data analytics, statistics are also vitally important to any field
that data analysts work in. From science and psychology to marketing and medicine, the wide
range of statistical techniques out there can be broadly divided into two categories: descriptive
statistics and inferential statistics. But what’s the difference between them?
In a nutshell, descriptive statistics focus on describing the visible characteristics of a dataset (a
population or sample). Meanwhile, inferential statistics focus on making predictions or
generalizations about a larger dataset, based on a sample of those data. Before we explore
these two categories of statistics further, it helps to understand what population and sample
mean. Let’s find out.
Two basic but vital concepts in statistics are those of population and sample. We can define them
as follows.
A population is the entire group that you wish to draw data from. While in day-to-day life the
word is often used to describe groups of people (such as the population of a country), in
statistics it can apply to any group from which you will collect information. This is
often people, but it could also be cities of the world, animals, objects, plants, colors, and
so on.
A sample is a representative group of a larger population. Random sampling from
representative groups allows us to draw broad conclusions about an overall population.
This approach is commonly used in polling. Pollsters ask a small group of people about
their views on certain topics. They can then use this information to make informed
judgments about what the larger population thinks. This saves time, hassle, and the
expense of extracting data from an entire population.
A typical illustration of population and sample shows that, using random sample measurements
from a representative group, we can estimate, predict, or infer characteristics about the larger
population. While there are many technical variations on this technique, they all follow the same
underlying principles.
Descriptive statistics are used to describe the characteristics or features of a dataset. The term
‘descriptive statistics’ can be used to describe both individual quantitative observations as well
as the overall process of obtaining insights from these data. We can use descriptive statistics to
describe both an entire population or an individual sample. Because they are merely explanatory,
descriptive statistics are not heavily concerned with the differences between the two types of
data.
So what do descriptive statistics measure? While there are many measures, important ones include:
Distribution
Central tendency
Variability
What is distribution?
Distribution shows us the frequency of different outcomes (or data points) in a population or
sample. We can show it as numbers in a list or table, or we can represent it graphically. As a
basic example, consider the number of people with each of several different hair colors in a
dataset of 286 people.
We can also represent this information visually, for instance in a pie chart.
Generally, using visualizations is common practice in descriptive statistics. It helps us more
readily spot patterns or trends in a dataset.
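As a rough sketch (assuming Matplotlib is available), the code below plots a pie chart of hypothetical hair-color counts that total 286; the individual counts are assumed for illustration, not the exact figures behind the example.

# Frequency distribution shown as a pie chart (hypothetical counts)
import matplotlib.pyplot as plt

counts = {"Brown": 120, "Black": 101, "Blonde": 39, "Red": 13, "Grey": 13}

plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.1f%%")
plt.title("Hair color distribution (n = {})".format(sum(counts.values())))
plt.show()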
Central tendency is the name for measurements that look at the typical central values within a
dataset. This does not just refer to the central value within an entire dataset, which is called the
median. Rather, it is a general term used to describe a variety of central measurements. For
instance, it might include central measurements from different quartiles of a larger dataset.
Common measures of central tendency include the mean, median, and mode.
Once again using the hair color example, we can determine that the mean measurement is 57.2 (the
total value of all the measurements divided by the number of values), the median is 39 (the
central value), and the mode is 13 (because it appears twice, which is more than any of the other
data points).
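The sketch below uses Python's built-in statistics module on an assumed split of the counts (13, 13, 39, 101, 120) that is consistent with the mean, median, and mode quoted above; the exact split is a guess made for illustration.

# Central tendency sketch on assumed hair-color counts
import statistics

counts = [13, 13, 39, 101, 120]

print(statistics.mean(counts))    # 57.2
print(statistics.median(counts))  # 39
print(statistics.mode(counts))    # 13 (appears twice)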
What is variability?
The variability, or dispersion, of a dataset describes how values are distributed or spread out.
Identifying variability relies on understanding the central tendency measurements of a dataset.
However, like central tendency, variability is not just one measure. It is a term used to describe a
range of measurements, including:
Standard deviation: This shows us the amount of variation or dispersion. Low standard
deviation implies that most values are close to the mean. High standard deviation
suggests that the values are more broadly spread out.
Minimum and maximum values: These are the highest and lowest values in a dataset or
quartile.
Range: This measures the size of the distribution of values. This can be easily
determined by subtracting the smallest value from the largest.
Kurtosis: This measures whether or not the tails of a given distribution contain extreme
values (also known as outliers). If a tail lacks outliers, we can say that it has low kurtosis.
If a dataset has a lot of outliers, we can say it has high kurtosis.
Skewness: This is a measure of a dataset's symmetry. If you were to plot a bell curve
and the right-hand tail was longer and fatter, we would call this positive skewness. If the
left-hand tail is longer and fatter, we call this negative skewness.
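These measures can be computed with NumPy and SciPy. The sketch below reuses the same assumed hair-color counts from the central tendency example; the data are illustrative only.

# Variability sketch on the assumed counts
import numpy as np
from scipy import stats

data = np.array([13, 13, 39, 101, 120])

print(np.std(data))              # standard deviation
print(data.min(), data.max())    # minimum and maximum values
print(data.max() - data.min())   # range
print(stats.kurtosis(data))      # kurtosis (tail-heaviness)
print(stats.skew(data))          # skewness (asymmetry)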
Used together, distribution, central tendency, and variability can tell us a surprising amount of
detailed information about a dataset. Within data analytics, they are very common measures,
especially in the area of exploratory data analysis. Once you’ve summarized the main features of
a population or sample, you’re in a much better position to know how to proceed with it. And
this is where inferential statistics come in.
Regression:-
Regression is defined as a statistical method that helps us to analyze and understand the
relationship between two or more variables of interest. The process used to perform regression
analysis helps us to understand which factors are important, which factors can be ignored, and
how they influence each other.
In regression, we normally have one dependent variable and one or more independent
variables. Here we try to “find” the value of the dependent variable “Y” with the help of the
independent variables. In other words, we are trying to understand, how the value of ‘Y’
changes with respect to change in ‘X’.
Outliers:-
Suppose there is an observation in the dataset that has a very high or very low value compared
to the other observations in the data, i.e. it does not appear to belong to the population. Such
an observation is called an outlier. In simple words, it is an extreme value. An outlier is a
problem because it often distorts the results we get.
Multi collinearity:-
When the independent variables are highly correlated to each other, then the variables are
said to be multicollinear. Many types of regression techniques assume multicollinearity
should not be present in the dataset. It is because it causes problems in ranking variables
based on its importance, or it makes the job difficult in selecting the most important
independent variable.
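A quick way to spot multicollinearity is to inspect the correlation matrix of the predictors. The sketch below (assuming NumPy and pandas are available) builds made-up data in which x2 is almost a multiple of x1, so the two are highly correlated; in practice, variance inflation factors (VIFs) are also commonly used for this check.

# Detecting multicollinearity with a correlation matrix (hypothetical data)
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.05, size=200)   # nearly a multiple of x1
x3 = rng.normal(size=200)                         # unrelated noise

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(X.corr())    # the x1-x2 correlation will be close to 1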
Heteroscedasticity:-
When the variability of the target variable is not constant across values of the independent
variable, it is called heteroscedasticity. Example: as one's income increases, the variability of food
consumption will increase. A poorer person will spend a rather constant amount by always
eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at
other times, eat expensive meals. Those with higher incomes display a greater variability of
food consumption.
Underfit and Overfit:-
When we use unnecessary explanatory variables, it might lead to overfitting. Overfitting means
that our algorithm works well on the training set but performs poorly on the test set. It is also
known as a problem of high variance.
When our algorithm works so poorly that it is unable to fit even a training set well, then it is
said to underfit the data. It is also known as a problem of high bias.
Types of Regression:-
For different types of Regression analysis, there are assumptions that need to be considered
along with understanding the nature of variables and their distribution.
Linear Regression
Polynomial Regression
Logistic Regression
Linear Regression
The simplest of all regression types is Linear Regression which tries to establish
relationships between Independent and Dependent variables. The Dependent variable
considered here is always a continuous variable.
What is Linear Regression?
Linear Regression is a predictive model used for finding the linear relationship between a
dependent variable and one or more independent variables.
Linear regression can be further divided into two types of algorithm: simple linear regression (one
independent variable) and multiple linear regression (two or more independent variables).
Here, the simple linear regression line can be written as y = a0 + a1x, where a0 is the intercept and
a1 is the slope (coefficient) of the line.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response variable.
The goal of multiple linear regression is to model the linear relationship between the
explanatory (independent) variables and response (dependent) variables. In essence, multiple
regression is the extension of ordinary least-squares (OLS) regression because it involves more
than one explanatory variable.
Multiple linear regression is based on the following assumptions:
There is a linear relationship between the dependent variable and the independent variables.
The independent variables are not too highly correlated with each other.
The yi observations are selected independently and randomly from the population.
Residuals should be normally distributed with a mean of 0 and constant variance σ².
The coefficient of determination (R-squared) is a statistical metric that is used to measure how
much of the variation in outcome can be explained by the variation in the independent variables.
R² always increases as more predictors are added to the MLR model, even though the predictors
may not be related to the outcome variable.
R² by itself therefore cannot be used to identify which predictors should be included in a model and
which should be excluded. R² can only be between 0 and 1, where 0 indicates that the outcome
cannot be predicted by any of the independent variables and 1 indicates that the outcome can be
predicted without error from the independent variables.
When interpreting the results of multiple regression, beta coefficients are valid while holding all
other variables constant ("all else equal"). The output from a multiple regression can be
displayed horizontally as an equation, or vertically in table form.
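As a hedged sketch (assuming scikit-learn and NumPy are available), the code below fits a multiple regression on made-up data with two explanatory variables and reports the estimated intercept, coefficients, and R²; the true coefficients 7.8 and −1.5 are chosen only to echo the worked example that follows.

# Multiple linear regression sketch on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 + 7.8 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # estimates close to 3.0 and [7.8, -1.5]
print(model.score(X, y))               # R²: share of variation explained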
In reality, multiple factors predict the outcome of an event. The price movement of
ExxonMobil, for example, depends on more than just the performance of the overall market.
Other predictors such as the price of oil, interest rates, and the price movement of oil futures can
affect the price of XOM and stock prices of other oil companies. To understand a relationship in
which more than two variables are present, multiple linear regression is used.
Assume we run our XOM price regression model through statistical software, which returns
estimated coefficients for each predictor along with R²:
An analyst would interpret this output to mean if other variables are held constant, the price of
XOM will increase by 7.8% if the price of oil in the markets increases by 1%. The model also
shows that the price of XOM will decrease by 1.5% following a 1% rise in interest rates.
R² indicates that 86.5% of the variations in the stock price of Exxon Mobil can be explained by
changes in the interest rate, oil price, oil futures, and S&P 500 index.
When working with linear regression, our main goal is to find the best fit line, which means the
error between the predicted values and the actual values should be minimized. The best fit line will
have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines,
so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate these we
use a cost function.
Cost function-
o Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, and the cost function is used to estimate the values of the
coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights. It measures
how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function
is also known as the hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost
function, which is the average of the squared errors between the
predicted values and the actual values. It can be written as:
MSE = (1/N) Σ (yi − (a0 + a1xi))², where N is the number of observations.
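A minimal sketch of the MSE cost function, evaluated on made-up data points that lie close to the line y = 2 + 3x; the data are assumptions for illustration. A good choice of a0 and a1 gives a small MSE, and the best fit line is the one that minimizes it.

# MSE cost function sketch
def mse(a0, a1, xs, ys):
    """Mean squared error of the line y = a0 + a1 * x on points (xs, ys)."""
    n = len(xs)
    return sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys)) / n

# Toy data lying close to y = 2 + 3x
xs = [0, 1, 2, 3, 4]
ys = [2.1, 4.9, 8.2, 10.8, 14.1]

print(mse(2, 3, xs, ys))    # small: this line fits well
print(mse(0, 1, xs, ys))    # much larger: a poor choice of a0, a1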
Polynomial Regression
This type of regression technique is used to model nonlinear equations by taking polynomial
functions of independent variables.
In the figure given below, you can see the red curve fits the data better than the green curve.
Hence in the situations where the relationship between the dependent and independent
variable seems to be non-linear, we can deploy Polynomial Regression Models.
In the case of multiple variables, say X1 and X2, we can create a third new feature (say X3)
which is the product of X1 and X2, i.e. X3 = X1 · X2.
The main drawback of this type of regression model is that if we create unnecessary extra
features or fit polynomials of too high a degree, this may lead to overfitting of the model.
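As a rough sketch (assuming NumPy is available), the code below fits both a straight line and a degree-2 polynomial to made-up quadratic data with numpy.polyfit; the data and the chosen degree are assumptions for illustration.

# Polynomial regression sketch: line vs. quadratic on synthetic data
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 50)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

linear_fit = np.polyfit(x, y, deg=1)   # straight line (underfits the curve)
quad_fit = np.polyfit(x, y, deg=2)     # degree-2 polynomial fits much better

print(linear_fit)
print(quad_fit)                        # coefficients close to [0.5, 2, 1]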
Logistic Regression
Logistic Regression, also known as the logit or maximum-entropy classifier, is a
supervised learning method for classification. It establishes a relation
between a dependent class variable and independent variables using
regression.
The dependent variable is categorical i.e. it can take only integral values
representing different classes. The probabilities describing the possible
outcomes of a query point are modelled using a logistic function. This
model belongs to a family of discriminative classifiers. They rely on
attributes which discriminate the classes well. This model is used when we
have 2 classes of the dependent variable. When there are more than 2
classes, we use multinomial logistic regression (described below) to
predict the target variable.
There are two broad categories of Logistic Regression algorithms
1. Binary Logistic Regression when the dependent variable is strictly
binary
2. Multinomial Logistic Regression is when the dependent variable has
multiple categories.
There are two types of Multinomial Logistic Regression
1. Ordered Multinomial Logistic Regression (dependent variable has
ordered values)
2. Nominal Multinomial Logistic Regression (dependent variable has
unordered categories)
Process Methodology
Logistic regression takes into consideration the different classes of
dependent variables and assigns probabilities to the event happening for
each row of information. These probabilities are found by assigning
different weights to each independent variable by understanding the
relationship between the variables. If a variable is positively related to the
outcome, a positive weight is assigned, and in the case of an inverse
relationship, a negative weight is assigned.
Assumptions
The dependent variable is categorical. Dichotomous for binary logistic
regression and multi-label for multi-class classification
The log odds, i.e. log(p / (1 − p)), should be linearly related to the
independent variables (attributes)
Attributes are independent of each other (low or no multicollinearity)
In binary logistic regression, the class of interest is coded 1 and the other
class 0
In multi-class classification using multinomial logistic regression or the OVR
(one-vs-rest) scheme, the class of interest is coded 1 and the rest 0 (this is
done by the algorithm)
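A hedged sketch of binary logistic regression (assuming scikit-learn and NumPy are available): the class of interest is coded 1, its probability follows a logistic function of one made-up feature, and the fitted coefficients should land near the values used to generate the data. All values here are assumptions for illustration.

# Binary logistic regression sketch on synthetic data
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))   # logistic function of the feature
y = (rng.uniform(size=200) < p).astype(int)    # class of interest coded as 1

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)           # estimates near 0.5 and 2.0
print(model.predict_proba([[1.0]]))            # P(class 0), P(class 1) for x = 1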
Some applications of logistic regression:
Predicting the weather: You can only have a few definite weather
types: stormy, sunny, cloudy, rainy, and a few more.
Medical diagnosis: Given the symptoms, predict the disease the patient
is suffering from.
Credit default: Whether a loan should be given to a particular candidate
depends on his identity check, account summary, any properties he
holds, any previous loans, etc.
HR Analytics: IT firms recruit a large number of people, but one of the
problems they encounter is that, after accepting the job offer, many
candidates do not join. This results in cost overruns because they
have to repeat the entire process. Now, when you get an
application, can you actually predict whether that applicant is likely to
join the organization (Binary Outcome – Join / Not Join).
Elections: Suppose that we are interested in the factors that influence
whether a political candidate wins an election. The outcome (response)
variable is binary (0/1); win or lose. The predictor variables of interest
are the amount of money spent on the campaign and the amount of
time spent campaigning negatively.
ANOVA:-
Analysis of variance (ANOVA) is a statistical technique that is used to check if the means
of two or more groups are significantly different from each other. ANOVA checks the
impact of one or more factors by comparing the means of different samples.
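As a minimal sketch (assuming SciPy is available), the code below runs a one-way ANOVA on three made-up groups of exam scores using scipy.stats.f_oneway; the scores are assumptions for illustration.

# One-way ANOVA sketch on three hypothetical groups
from scipy import stats

group_a = [85, 88, 90, 84, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [90, 93, 91, 94, 92]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
# A small p-value suggests at least one group mean differs from the others.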