Unit-4 DS Student


UNIT 4

Model Development

Simple and Multiple Regression, Model Evaluation using Visualization, Residual Plot,
Distribution Plot, Polynomial Regression and Pipelines, Measures for In-sample
Evaluation, Prediction and Decision Making.

Model:

A model is a transformation engine that helps us express dependent variables as a function of independent variables.

Parameters:

Parameters are ingredients added to the model for estimating the output.

Concept

Linear regression models provide a simple approach towards supervised learning. They are
simple yet effective.

Wait, what do we mean by linear?

Linear implies the following: arranged in or extending along a straight or nearly straight
line. Linear suggests that the relationship between the dependent and independent variables can
be expressed as a straight line.

Recall the geometry lesson from high school. What is the equation of a line?

y = mx + c

Linear regression is nothing but a manifestation of this simple equation.

● y is the dependent variable i.e. the variable that needs to be estimated and
predicted.

● x is the independent variable i.e. the variable that is controllable. It is the input.

● m is the slope. It determines what will be the angle of the line. It is the
parameter denoted as β.

● c is the intercept. A constant that determines the value of y when x is 0.


1. Simple and Multiple Regression

Simple Linear Regression Models

1. Simple Linear Regression

This method uses a single independent variable to predict a dependent variable by fitting a
best linear relationship.

⚫ Linear regression is one of the easiest and most popular Machine Learning algorithms.

⚫ It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary,
age, product price, etc.

⚫ The linear regression algorithm shows a linear relationship between a dependent (y)
variable and one or more independent (x) variables, hence it is called linear regression.

⚫ Since linear regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.

⚫ The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:


⚫ Linear Regression Line


⚫ A linear line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:

⚫ Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable
increases on X-axis, then such a relationship is termed as a Positive linear
relationship.

⚫ Negative Linear Relationship:
⚫ If the dependent variable decreases on the Y-axis and the independent variable
increases on the X-axis, then such a relationship is called a negative linear
relationship.
⚫ Finding the best fit line:
⚫ When working with linear regression, our main goal is to find the best fit line,
which means the error between the predicted values and the actual values should be
minimized. The best fit line will have the least error.

⚫ The different values for the weights or coefficients of the line (a0, a1) give a different
line of regression, so we need to calculate the best values for a0 and a1 to find the
best fit line; to calculate this we use the cost function.

⚫ Cost function-
⚫ The different values for the weights or coefficients of the line (a0, a1) give different
lines of regression, and the cost function is used to estimate the values of the
coefficients for the best fit line.

⚫ Cost function optimizes the regression coefficients or weights. It measures how a
linear regression model is performing.

⚫ We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable.
⚫ This mapping function is also known as Hypothesis function.
⚫ For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and the
actual values. It can be written as:

MSE = (1/N) * Σ (Yi − (a1xi + a0))²

⚫ Where,
⚫ N = total number of observations
⚫ Yi = actual value
⚫ (a1xi + a0) = predicted value.

⚫ Residuals: The distance between the actual value and the predicted value is called
the residual.

⚫ If the observed points are far from the regression line, then the residuals will be
high, and so the cost function will be high.

⚫ If the scatter points are close to the regression line, then the residuals will be
small, and hence the cost function will be small.
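To make the cost function above concrete, here is a minimal NumPy sketch of computing the residuals and the MSE for one candidate line; the x and y arrays and the coefficients a0, a1 are made-up values for illustration only.

```python
import numpy as np

# Made-up data: x could be engine size, y the observed price (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# A candidate regression line y_hat = a1*x + a0
a0, a1 = 1.0, 2.0
y_hat = a1 * x + a0            # predicted values

residuals = y - y_hat          # distance between actual and predicted values
mse = np.mean(residuals ** 2)  # Mean Squared Error cost
print("Residuals:", residuals)
print("MSE:", mse)
```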

Assumptions of Linear Regression

⚫ Below are some important assumptions of Linear Regression. These are formal
checks to perform while building a Linear Regression model, which ensure we get the best
possible result from the given dataset.

⚫ Linear relationship between the features and target:


⚫ Linear regression assumes the linear relationship between the dependent and
independent variables.

⚫ Small or no multicollinearity between the features:


⚫ Due to multicollinearity, it may be difficult to find the true relationship between the
predictors and the target variable.
⚫ Or we can say, it is difficult to determine which predictor variable is affecting the
target variable and which is not.

Independent and Dependent variables:

In the context of Statistical learning, there are two types of data:

● Independent variables: Data that can be controlled directly.

● Dependent variables: Data that cannot be controlled directly.

The data that can’t be controlled, i.e. the dependent variables, need to be predicted or estimated.

● What is Linear Regression?


● It’s a method to predict a target variable by fitting the best linear
relationship between the dependent and independent variable.


● What is the Best Fit?
● It can be of any shape depending on the number of independent variables (a point on
the axis, a line in two dimensions, a plane in three dimensions, or a hyperplane in
higher dimensions).
Least Squares Method: The best fit is done by making sure that the sum of all the
distances between the shape and the actual observations at each point is as small as
possible. The fit of the shape is “best” in the sense that no other position would
produce less error given the choice of shape.

George Box, a famous British statistician, is often quoted as saying:

“All models are wrong, but some are useful.”

Linear regression models are not perfect. They try to approximate the relationship between
the dependent and independent variables with a straight line. Approximation leads to errors. Some
errors can be reduced. Some errors are inherent in the nature of the problem and cannot be
eliminated. They are called irreducible error: the noise term in the true relationship that
cannot fundamentally be reduced by any model.

The same equation of a line can be re-written as:

y = β0 + β1x + ε

β0 and β1 are two unknown constants that represent the intercept and slope. They are the
parameters.

ε is the error term.

Formulation

Let us go through an example to explain the terms and workings of a Linear regression
model.

Fernando is a Data Scientist. He wants to buy a car. He wants to estimate or predict the car
price that he will have to pay. He has a friend at a car dealership company. He asks for prices
for various other cars along with a few characteristics of the car. His friend provides him with
some information.

The following are the data provided to him:


● make: make of the car.

● fuelType: type of fuel used by the car.

● nDoor: number of doors.

● engineSize: size of the engine of the car.

● price: the price of the car.

First, Fernando wants to evaluate if indeed he can predict car price based on engine size. The
first set of analysis seeks the answers to the following questions:

● Is the price of the car related to the engine size?

● How strong is the relationship?

● Is the relationship linear?

● Can we predict/estimate car price based on engine size?

Fernando does a correlation analysis. Correlation is a measure of how much two variables
are related. It is measured by a metric called the correlation coefficient. Its value is
between -1 and 1.

If the correlation coefficient is a large (> 0.7) positive number, it implies that as one variable
increases, the other variable increases as well. A large negative number indicates that as one
variable increases, the other variable decreases.
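As a rough sketch of how such a correlation check could be done with pandas, assuming a hypothetical cars.csv file with engineSize and price columns (the file name and column names are assumptions, not part of the original example):

```python
import pandas as pd

# Assumed dataset and column names, for illustration only
cars = pd.read_csv("cars.csv")   # e.g. columns: make, fuelType, nDoor, engineSize, price

# Pearson correlation coefficient between engine size and price (ranges from -1 to 1)
r = cars["engineSize"].corr(cars["price"])
print(f"Correlation between engine size and price: {r:.3f}")

# A scatter plot gives a visual feel for whether the relationship looks linear
cars.plot.scatter(x="engineSize", y="price")
```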

He does a correlation analysis. He plots the relationship between price and engine size.

He splits the data into training and test set. 75% of data is used for training. Remaining is
used for the test.

He builds a linear regression model. He uses a statistical package to create the model. The
model creates a linear equation that expresses price of the car as a function of engine size.
Following are the answers to the questions:

● Is the price of the car related to the engine size?

● Yes, there is a relationship.

● How strong is the relationship?

● The correlation coefficient is 0.872 => There is a strong relationship.

● Is the relationship linear?

● A straight line can fit => A decent prediction of price can be made using
engine size.

● Can we predict/estimate the car price based on engine size?

● Yes, car price can be estimated based on engine size.

Fernando now wants to build a linear regression model that will estimate the price of the car
based on engine size. Superimposing the equation onto the car price problem, Fernando
formulates the following equation for price prediction.

price = β0 + β1 x engine size


Model Building and Interpretation

Model

The data needs to be split into training and testing sets. The training data is used to learn about
the data and to create the model. The testing data is used to evaluate the
model performance.

Fernando splits the data into training and test set. 75% of data is used for training. Remaining
is used for the test. He builds a linear regression model. He uses a statistical package to create
the model. The model produces a linear equation that expresses price of the car as a function
of engine size.

The model estimates the parameters:

● β0 is estimated as -6870.1

● β1 is estimated as 156.9

The linear equation is estimated as:

price = -6870.1 + 156.9 x engine size
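A minimal scikit-learn sketch of this workflow (75/25 split, fit, then read off the intercept β0 and slope β1). The file name and column names are assumptions; the printed coefficients would only match -6870.1 and 156.9 if the underlying data matched Fernando's.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

cars = pd.read_csv("cars.csv")        # assumed dataset
X = cars[["engineSize"]]              # independent variable (input)
y = cars["price"]                     # dependent variable (to be predicted)

# 75% of the data for training, the remaining 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("beta0 (intercept):", model.intercept_)
print("beta1 (slope):", model.coef_[0])

# Predicted average price for a hypothetical engine size of 150
print(model.predict(pd.DataFrame({"engineSize": [150]})))
```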

Interpretation
The model provides the equation for predicting the average car price given a specific
engine size. This equation means the following:

One unit increase in engine size will increase the average price of the car by 156.9 units.

Evaluation

The model is built. The robustness of the model needs to be evaluated. How can we be sure
that the model will be able to predict the price satisfactorily? This evaluation is done in two
parts. First, a test to establish the robustness of the model. Second, a test to evaluate the accuracy
of the model.

Fernando first evaluates the model on the training data and gets a set of summary statistics.
There are a lot of statistics in there; let us focus on the key ones. Recall the
discussion on hypothesis testing. The robustness of the model is evaluated using hypothesis
testing.

H0 and Ha need to be defined. They are defined as follows:

● H0 (NULL hypothesis): There is no relationship between x and y, i.e. there is
no relationship between price and engine size.

● Ha (Alternate hypothesis): There is some relationship between x and y, i.e.
there is a relationship between price and engine size.

β1: The value of β1 determines the relationship between price and engine size. If β1 = 0 then
there is no relationship. In this case, β1 is positive. It implies that there is some relationship
between price and engine size.

t-stat: The t-stat value is how many standard deviations the coefficient estimate (β1) is
away from zero. The further it is from zero, the stronger the relationship between price and
engine size and the more significant the coefficient. In this case, the t-stat is 21.09. It is far
enough from zero.

p-value: The p-value is a probability value. It indicates the chance of seeing the given t-statistic
under the assumption that the NULL hypothesis is true. If the p-value is small, e.g. < 0.0001, it
implies that the probability that this result is by chance (and there is no relation) is very low. In this
case, the p-value is small. It means that the relationship between price and engine size is not by
chance.

With these metrics, we can safely reject the NULL hypothesis and accept the alternate
hypothesis. There is a robust relationship between price and engine size.
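If you want the coefficient estimates, t-statistics and p-values in a single regression summary, statsmodels reports them directly. This is a sketch with the same assumed file and column names as before:

```python
import pandas as pd
import statsmodels.api as sm

cars = pd.read_csv("cars.csv")            # assumed dataset
X = sm.add_constant(cars["engineSize"])   # adds the intercept term beta0
y = cars["price"]

ols_model = sm.OLS(y, X).fit()
# The summary table includes beta0, beta1, their t-statistics and p-values, and R-squared
print(ols_model.summary())
```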

The relationship is established. How about accuracy? How accurate is the model? To get a
feel for the accuracy of the model, a metric named R-squared or coefficient of
determination is important.

R-squared or Coefficient of determination: To understand this metric, let us break it
down into its components.

● Error (e) is the difference between the actual y and the predicted y. The
predicted y is denoted as ŷ. This error is evaluated for each observation. These
errors are also called residuals.

● Then all the residual values are squared and added. This term is called
the Residual Sum of Squares (RSS). The lower the RSS, the better it is.

● There is another part of the equation of R-squared. To get the other part, first,
the mean value of the actual target is computed i.e. average value of the price
of the car is estimated. Then the differences between the mean value and actual
values are calculated. These differences are then squared and added. It is
the total sum of squares (TSS).

● R-squared, a.k.a. the coefficient of determination, is computed as 1 - RSS/TSS. This
metric explains the fraction of the variance in the actual values that is explained by the
model, as opposed to simply predicting the mean of the actual values. This value is
between 0 and 1. The higher it is, the better the model can explain the
variance.

Let us look at an example.

In the example above, RSS is computed based on the predicted price for three cars. The RSS
value is 41450201.63. The mean value of the actual price is 11,021. TSS is calculated as
44,444,546. R-squared is computed as 6.737%. For these three specific data points, the model
is only able to explain 6.737% of the variation. Not good enough!!

However, for Fernando’s model, it is a different story. The R-squared for the training set
is 0.7503, i.e. 75.03%. It means that the model can explain more than 75% of the variation.
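A small sketch of the R-squared arithmetic (RSS, TSS, 1 - RSS/TSS) next to scikit-learn's r2_score, which computes the same quantity. The actual and predicted prices below are made up and do not reproduce the three-car example above.

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([9000.0, 11000.0, 13000.0])   # illustrative actual prices
y_pred = np.array([8500.0, 12500.0, 15000.0])     # illustrative predicted prices

rss = np.sum((y_actual - y_pred) ** 2)            # Residual Sum of Squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)   # Total Sum of Squares
r_squared = 1 - rss / tss

print(r_squared, r2_score(y_actual, y_pred))      # both give the same value
```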

⚫ So, the model assumes either little or no multicollinearity between the features or
independent variables.
⚫ Homoscedasticity Assumption:
⚫ Homoscedasticity is a situation in which the error term is the same for all values of
the independent variables.

⚫ With homoscedasticity, there should be no clear pattern in the distribution of the data in the
scatter plot.

⚫ Normal distribution of error terms:


⚫ Linear regression assumes that the error terms should follow the normal distribution
pattern.

⚫ If the error terms are not normally distributed, then confidence intervals will become
either too wide or too narrow, which may cause difficulties in finding coefficients.

⚫ It can be checked using a Q-Q plot. If the plot shows a straight line without any
deviation, the errors are normally distributed.

⚫ No autocorrelations:
⚫ The linear regression model assumes no autocorrelation in error terms.
⚫ If there is any correlation in the error terms, it will drastically reduce the
accuracy of the model.

⚫ Autocorrelation usually occurs if there is a dependency between residual errors.


2. Multiple Linear Regressions

This method uses more than one independent variable to predict a dependent variable by
fitting a best linear relationship.

Note: It works best when multicollinearity is absent. Multicollinearity is a phenomenon in which two or
more predictor variables are highly correlated.
⚫ Example: Prediction of CO2 emission based on engine size and number of cylinders in a car.

⚫ Some key points about MLR:


⚫ For MLR, the dependent or target variable (Y) must be continuous/real, but
the predictor or independent variables may be of continuous or categorical form.

⚫ Each feature variable must model the linear relationship with the dependent
variable.

⚫ MLR tries to fit a regression line through a multidimensional space of
data-points.

⚫ Assumptions for Multiple Linear Regression:


⚫ A linear relationship should exist between the Target and predictor variables.
⚫ The regression residuals must be normally distributed.
⚫ MLR assumes little or no multicollinearity (correlation between the independent
variables) in the data.

In case of Multiple Regression, the parameters can be found in the same way as that in the
case of simple linear regression, by minimising the cost function using:

⚫ Gradient Descent: Given a function defined by a set of parameters, Gradient Descent
starts with an initial set of parameter values and iteratively moves towards a set of
values that minimise the function. This iterative minimisation is done using calculus,
taking steps in the negative direction of the function gradient.
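Before moving on, here is a minimal scikit-learn sketch of a multiple linear regression for the CO2 example mentioned earlier. The vehicles.csv file and its column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("vehicles.csv")            # assumed columns: engineSize, cylinders, co2
X = df[["engineSize", "cylinders"]]         # two independent variables
y = df["co2"]                               # dependent variable

mlr = LinearRegression().fit(X, y)
print("Intercept:", mlr.intercept_)
print("Coefficients:", mlr.coef_)           # one coefficient per feature

# Predicted CO2 emission for a hypothetical 2.4 L, 4-cylinder car
new_car = pd.DataFrame({"engineSize": [2.4], "cylinders": [4]})
print(mlr.predict(new_car))
```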


3. Stepwise Regression
This regression model is used when we have more than one independent variable. It uses an
automatic procedure to select the important independent variables, with no human
intervention.

Forward Stepwise Regression

● Here, we start with the null model, which has no predictors, just one
intercept (the mean of the dependent variable).

● Now, fit p (the total number of variables) simple linear regression models, each
with one of the variables in. Thus, we have searched through all the single-variable
models; pick the best one and fix it in the model.

● Similarly, search through the remaining p-1 variables one by one, but this time
with the variable selected in the previous step already in the model. Now
choose the best among these p-1 models.

● Continue until some stopping rule is satisfied like some threshold value of the
number of variables to be selected.

Backward Stepwise Regression

● It starts with the full least squares model containing all p predictors.

● Now remove the variable with the largest p-value i.e. the least significant
predictor.

● The new model shall have (p-1) variables. Remove the variable with largest
p-value again.

● Continue until some stopping rule is satisfied like all variables have a p-value
smaller than a threshold value.

As we can see, Stepwise Linear Regression applies Multiple Linear Regression multiple
times, selecting the important variables or removing the least significant predictors each
time.
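Classical p-value-based stepwise selection is not built into scikit-learn, but its SequentialFeatureSelector implements the same greedy idea, adding (or removing) one variable at a time based on cross-validated score rather than p-values. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for a dataset with p = 8 candidate predictors
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, noise=10, random_state=0)

# Forward stepwise: start from the null model and greedily add variables until 3 are chosen
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
forward.fit(X, y)
print("Selected feature indices:", forward.get_support(indices=True))

# direction="backward" would start from the full model and drop variables instead
```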

Note 1: For Backward Stepwise Linear Regression or Multiple Linear Regression to work
fine, the number of observations (n) should be more than the number of variables(p). It
is because we can do least squares regression only when n is greater than p. For p greater than
n, least squares model is not even defined.

Note 2: Automatic procedures may not choose the right significant variables from a practical
point of view, as they do not have the special knowledge the analyst might have.

4. Logistic Regression
⚫ Logistic regression is another supervised learning algorithm which is used to solve
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.

⚫ Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes
or No, True or False, Spam or not spam, etc.

⚫ It is a predictive analysis algorithm which works on the concept of probability.

⚫ Logistic regression is a type of regression, but it is different from the linear regression
algorithm in terms of how it is used.

⚫ Logistic regression uses the sigmoid function or logistic function, together with a more
complex cost function than linear regression.

⚫ This sigmoid function is used to model the data in logistic regression. The function
can be represented as:

f(x) = 1 / (1 + e^(-x))

⚫ f(x) = output between the 0 and 1 value.

⚫ x = input to the function

⚫ e = base of the natural logarithm.

⚫ When we provide the input values (data) to the function, it gives the S-curve as
follows:
⚫ It uses the concept of threshold levels: values above the threshold level are rounded
up to 1, and values below the threshold level are rounded down to 0.

⚫ There are three types of logistic regression:

⚫ Binary (0/1, pass/fail)
⚫ Multinomial (cats, dogs, lions)
⚫ Ordinal (low, medium, high)
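A short sketch of the sigmoid function and a binary (0/1) logistic regression fit, using synthetic data from scikit-learn; all values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    # Maps any real input to an output between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-4.0), sigmoid(0.0), sigmoid(4.0))   # ~0.018, 0.5, ~0.982

# Binary classification on synthetic data
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

print(clf.predict_proba(X[:3]))   # class probabilities given by the S-curve
print(clf.predict(X[:3]))         # probabilities thresholded at 0.5 to give 0 or 1
```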
5. Polynomial Regression

⚫ Polynomial Regression is a type of regression which models a non-linear
dataset using a linear model.

⚫ It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and the corresponding conditional values of y.

⚫ Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion; for such a case, linear regression will not best fit those
datapoints.

⚫ To cover such datapoints, we need Polynomial Regression.

⚫ In Polynomial Regression, the original features are transformed into polynomial
features of a given degree and then modeled using a linear model.

⚫ This means the datapoints are best fitted using a polynomial line.
⚫ The equation for polynomial regression is also derived from the linear regression
equation; that is, the linear regression equation Y = b0 + b1x is transformed into the
polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.

⚫ Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients,
and x is our independent/input variable.

⚫ The model is still linear because the coefficients are still linear, even though the
features are quadratic or of higher degree.
⚫ This is different from Multiple Linear Regression in that, in Polynomial Regression,
a single variable appears with different degrees instead of multiple variables with
the same degree.

⚫ It is also called a special case of Multiple Linear Regression in ML, because we
add some polynomial terms to the Multiple Linear Regression equation to convert it
into Polynomial Regression.

⚫ It is a linear model with some modification in order to increase the accuracy.


⚫ The dataset used in Polynomial regression for training is of non-linear nature.
⚫ It makes use of a linear regression model to fit complicated, non-linear
functions and datasets.

⚫ Need for Polynomial Regression:


⚫ The need of Polynomial Regression in ML can be understood in the below points:
⚫ If we apply a linear model on a linear dataset, it provides us a good result, as we
have seen in Simple Linear Regression,

⚫ but if we apply the same model without any modification on a non-linear dataset, then
it will produce drastically poor output.
⚫ Due to this, the loss function will increase, the error rate will be high, and accuracy will
decrease.

⚫ So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model.

⚫ We can understand it in a better way using the below comparison diagram of the
linear dataset and non-linear dataset.

In the above image, we have taken a dataset which is arranged non-linearly.

⚫ So if we try to cover it with a linear model, then we can clearly see that it hardly
covers any data point.

⚫ On the other hand, a curve is suitable to cover most of the data points, which is
what the Polynomial model does.

⚫ Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.
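Since this unit pairs Polynomial Regression with Pipelines, here is a minimal scikit-learn Pipeline sketch that chains the polynomial feature transform and the linear model into one object. The synthetic data and the choice of degree 3 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic non-linear data: y is roughly cubic in x plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(scale=1.0, size=100)

poly_model = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=3)),   # transforms x into [1, x, x^2, x^3]
    ("linreg", LinearRegression()),           # still linear in the coefficients b0..bn
])
poly_model.fit(x, y)

print(poly_model.predict([[2.0]]))            # prediction for x = 2
```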

Gradient Descent

When there are one or more inputs, you can optimize the values of the coefficients by
iteratively minimizing the error of the model on your training data.
This operation is called Gradient Descent and works by starting with random values for each
coefficient. The sum of the squared errors is calculated for each pair of input and output
values. A learning rate is used as a scale factor, and the coefficients are updated in the
direction that minimizes the error. The process is repeated until a minimum sum of squared
errors is achieved or no further improvement is possible.

When using this method, you must select a learning rate (alpha) parameter that determines
the size of the improvement step to take on each iteration of the procedure.
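As a rough illustration of the procedure just described, here is a plain NumPy gradient descent loop for a simple linear regression; the data, learning rate and iteration count are arbitrary choices.

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8])

b0, b1 = 0.0, 0.0      # starting coefficient values (could also be random)
alpha = 0.01           # learning rate: size of each improvement step

for _ in range(5000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    # Step in the direction that minimizes the error
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(b0, b1)          # should approach the least-squares intercept and slope
```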

Regularization

There are extensions of the training of the linear model called regularization methods. These
seek both to minimize the sum of the squared error of the model on the training data (using
ordinary least squares) and to reduce the complexity of the model (such as the number or
absolute size of the sum of all coefficients in the model).

Two popular examples of regularization procedures for linear regression are:

● Lasso Regression: where Ordinary Least Squares is modified to also minimize the
absolute sum of the coefficients (called L1 regularization).

● Ridge Regression: where Ordinary Least Squares is modified to also minimize the
squared sum of the coefficients (called L2 regularization).

These methods are effective to use when there is collinearity in your input values and ordinary least
squares would overfit the training data.
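A minimal sketch of the two regularized variants in scikit-learn; alpha controls the strength of the penalty, and the value used here is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data standing in for inputs with some collinearity
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can shrink some coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients but keeps them non-zero

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```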
Model evaluation through Visualization

How to use Residual Plots for regression model validation?

One of the most important parts of any Data Science/ML project is model validation. For
regression, there are numerous methods to evaluate the goodness of your fit i.e. how well the
model fits the data. R² values are just one such measure. But they are not always the best at
making us feel confident about our model.

Residuals

A residual is a measure of how far away a point is vertically from the regression line. Simply,
it is the error between a predicted value and the observed actual value.

Residual equation: residual (e) = observed value (y) − predicted value (ŷ).

Figure 1 is an example of how to visualize residuals against the line of best fit. The vertical
lines are the residuals.
Residual Plots

A typical residual plot has the residual values on the Y-axis and the independent variable on
the X-axis. Figure 2 below is a good example of what a typical residual plot looks like.

Residual Plot Analysis

The most important assumption of a linear regression model is that the errors are
independent and normally distributed.

Every regression model inherently has some degree of error since you can never predict
something 100% accurately. More importantly, randomness and unpredictability are always a
part of the regression model. Hence, a regression model can be explained as:

Regression model = Deterministic component + Stochastic (random) error

Ideally, our linear equation model should accurately capture the predictive information.
Essentially, what this means is that if we capture all of the predictive information, all that is
left behind (residuals) should be completely random & unpredictable i.e stochastic. Hence,
we want our residuals to follow a normal distribution. And that is exactly what we look for in
a residual plot.

Characteristics of Good Residual Plots

A few characteristics of a good residual plot are as follows:

1. It has a high density of points close to the origin and a low density of points
away from the origin
2. It is symmetric about the origin

To explain why Fig. 3 is a good residual plot based on the characteristics above, we project
all the residuals onto the y-axis. As seen in Figure 3b, we end up with a normally distributed
curve; satisfying the assumption of the normality of the residuals.

Fig. 3: Good Residual Plot

Project onto the y-axis

Finally, one other reason this is a good residual plot is that, independent of the value of the
independent variable (x-axis), the residual errors are approximately distributed in the same
manner. In other words, we do not see any patterns in the value of the residuals as we move
along the x-axis.

Hence, this satisfies our earlier assumption that regression model residuals are independent
and normally distributed.

Using the characteristics described above, we can see why Figure 4 is a bad residual plot.
This plot has high density far away from the origin and low density close to the origin. Also,
when we project the residuals on the y-axis, we can see the distribution curve is not normal.
Example of Bad Residual plot

Project onto the y-axis

It is important to understand here that these plots signify that we have not completely
captured the predictive information of the data in our model, which is why it is “seeping” into
our residuals. A good model should always only have random error left after using the
predictive information.
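One way such a residual plot could be drawn in Python is with seaborn's residplot, which fits a simple regression internally and plots its residuals; the cars.csv file and column names are assumptions carried over from the earlier example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

cars = pd.read_csv("cars.csv")   # assumed dataset with engineSize and price columns

# Residuals of a simple linear fit of price on engine size, plotted against engine size
sns.residplot(x="engineSize", y="price", data=cars)
plt.axhline(0, color="grey", linestyle="--")   # reference line at zero residual
plt.xlabel("engineSize")
plt.ylabel("Residuals")
plt.show()
```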
Data distributions
You may have noticed that numerical data is often summarized with the average value. For
example, the quality of a high school is sometimes summarized with one number: the average
score on a standardized test. Occasionally, a second number is reported: the standard
deviation. For example, you might read a report stating that scores were 680 plus or minus 50
(the standard deviation). The report has summarized an entire vector of scores with just two
numbers. Is this appropriate? Is there any important piece of information that we are missing
by only looking at this summary rather than the entire list?

Our first data visualization building block is learning to summarize lists of factors or numeric
vectors. More often than not, the best way to share or explore this summary is through data
visualization. The most basic statistical summary of a list of objects or numbers is its
distribution. Once a vector has been summarized as a distribution, there are several data
visualization techniques to effectively relay this information.

In this chapter, we first discuss properties of a variety of distributions and how to visualize
distributions using a motivating example of student heights. We then discuss
the ggplot2 geometries for these visualizations.

Variable types

We will be working with two types of variables: categorical and numeric. Each can be
divided into two other groups: categorical can be ordinal or not, whereas numerical variables
can be discrete or continuous.

When each entry in a vector comes from one of a small number of groups, we refer to the
data as categorical data. Two simple examples are sex (male or female) and regions
(Northeast, South, North Central, West). Some categorical data can be ordered even if they
are not numbers per se, such as spiciness (mild, medium, hot). In statistics textbooks, ordered
categorical data are referred to as ordinal data.

Examples of numerical data are population sizes, murder rates, and heights. Some numerical
data can be treated as ordered categorical. We can further divide numerical data into
continuous and discrete. Continuous variables are those that can take any value, such as
heights, if measured with enough precision. For example, a pair of twins may be 68.12 and
68.11 inches, respectively. Counts, such as population sizes, are discrete because they have to
be round numbers.

Keep in mind that discrete numeric data can be considered ordinal. Although this is
technically true, we usually reserve the term ordinal data for variables belonging to a small
number of different groups, with each group having many members. In contrast, when we
have many groups with few cases in each group, we typically refer to them as discrete
numerical variables. So, for example, the number of packs of cigarettes a person smokes a
day, rounded to the closest pack, would be considered ordinal, while the actual number of
cigarettes would be considered a numerical variable. But, indeed, there are examples that can
be considered both numerical and ordinal when it comes to visualizing data.

Let’s start by understanding what exactly a distribution means. The term “distribution” in data
science or statistics usually means a probability distribution. A distribution is nothing but a
function which provides the possible values of a variable and how often they occur. A probability
distribution is a mathematical function which provides the possibilities of occurrence of the various
possible outcomes that can occur in an experiment.

There are many types of probability distributions, but the following are mainly considered in regression:

1. Normal distribution

2. Binomial distribution

3. Bernoulli distribution

4. Uniform distribution

5. Poisson distribution

Normal distribution:

● Normal distribution is the most important distribution because it fits many natural
phenomena.

For instance: height, blood pressure, IQ scores, etc.

● Normal distribution is also called the Gaussian distribution.

● Let’s consider X, a random variable that follows a normal distribution with mean μ and
standard deviation σ. If we plot the histogram or pdf (probability density function) of the
random variable, it will look like a bell curve as shown below:
● Following are three important properties of the normal distribution. These properties are also
called the empirical rule.

1. The probability that the variable falls within a range of 1 standard deviation, i.e.

(μ − σ to μ + σ),

is equal to 68%.
It means 68% of the data points of X fall within 1 standard deviation of the mean.

2. The probability that the variable falls within a range of 2 standard deviations is equal to 95%.

95% of the data points of the random variable X fall within 2 standard deviations of the mean.

3. The probability that the variable falls within a range of 3 standard deviations is equal to 99.7%.

99.7% of the data points of the random variable X fall within 3 standard deviations of the mean.

During Exploratory Data Analysis, we try to plot the features. If a feature makes the bell curve, then
all the above properties apply to it, because it is a normal distribution or Gaussian
distribution.
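During EDA, a distribution plot plus a quick empirical-rule check can be sketched as follows; the synthetic "heights" data here is only a stand-in for a real feature.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=1000)   # synthetic, roughly normal feature

# Distribution plot: histogram with a smoothed density estimate on top
sns.histplot(heights, kde=True)
plt.show()

# Empirical rule check: fraction of points within 1, 2 and 3 standard deviations of the mean
mu, sigma = heights.mean(), heights.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(heights - mu) <= k * sigma)
    print(f"within {k} standard deviation(s): {within:.3f}")   # roughly 0.68, 0.95, 0.997
```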

Data Science Pipeline

One well-known data science pipeline is O.S.E.M.N.

OSEMN Pipeline

● O — Obtaining our data

● S — Scrubbing / Cleaning our data

● E — Exploring / Visualizing our data will allow us to find patterns and trends

● M — Modeling our data will give us our predictive power as a wizard

● N — Interpreting our data

Data Science is an interdisciplinary field that focuses on extracting knowledge from data sets
that are typically huge in amount. The field encompasses analysis, preparing data for
analysis, and presenting findings to inform high-level decisions in an organization. As such, it
incorporates skills from computer science, mathematics, statistics, information visualization,
graphic design, and business.

In simple words, a pipeline in data science is “a set of actions which changes the raw (and
confusing) data from various sources (surveys, feedback, lists of purchases, votes, etc.) into
an understandable format so that we can store it and use it for analysis.”
But besides storage and analysis, it is important to formulate the questions that we will solve
using our data. And these questions would yield the hidden information which will give us
the power to predict results, just like a wizard. For instance:

● What type of sales will reduce risks?

● Which product will sell more during a crisis?

● Which practice can bring more business?

After getting hold of our questions, now we are ready to see what lies inside the data science
pipeline. When the raw data enters a pipeline, it’s unsure of how much potential it holds
within. It is we data scientists, waiting eagerly inside the pipeline, who bring out its worth by
cleaning it, exploring it, and finally utilizing it in the best way possible. So, to understand its
journey let’s jump into the pipeline.
The raw data undergoes different stages within a pipeline which are:

1) Fetching/Obtaining the Data

This stage involves identifying data from the internet or internal/external databases
and extracting it into useful formats. Prerequisite skills:

● Distributed Storage: Hadoop, Apache Spark/Flink.

● Database Management: MySQL, PostgreSQL, MongoDB.

● Querying Relational Databases.

● Retrieving Unstructured Data: text, videos, audio files, documents.

2) Scrubbing/Cleaning the Data

This is the most time-consuming stage and requires more effort. It is further divided into two
stages:

● Examining Data:  

● identifying errors

● identifying missing values

● identifying corrupt records

● Cleaning of data:

● replace or fill missing values/errors


Prerequisite skills:  

● Coding language: Python, R.

● Data Modifying Tools: Python libs, Numpy, Pandas, R.

● Distributed Processing: Hadoop, Map Reduce/Spark.

3) Exploratory Data Analysis

When data reaches this stage of the pipeline, it is free from errors and missing values, and
hence is suitable for finding patterns using visualizations and charts.

Prerequisite skills:

● Python: NumPy, Matplotlib, Pandas, SciPy.

● R: GGplot2, Dplyr.

● Statistics: Random sampling, Inferential.

● Data Visualization: Tableau.

4) Modeling the Data

This is the stage of the data science pipeline where machine learning comes into play. With the
help of machine learning, we create data models. Data models are nothing but general rules in
a statistical sense, which are used as predictive tools to enhance our business decision-making.

Prerequisite skills:  

● Machine Learning: Supervised/Unsupervised algorithms.

● Evaluation methods.

● Machine Learning Libraries: Python (Sci-kit Learn, NumPy).

● Linear algebra and Multivariate Calculus.

5) Interpreting the Data

This stage is similar to paraphrasing your data science model. Always remember: if you can’t explain it to
a six-year-old, you don’t understand it yourself. So, communication becomes the key! This
is the most crucial stage of the pipeline, where, with the use of psychological techniques,
correct business domain knowledge, and your immense storytelling abilities, you can explain
your model to a non-technical audience.
Prerequisite skills:

● Business domain knowledge.

● Data visualization tools: Tableau, D3.js, Matplotlib, ggplot2, Seaborn.

● Communication: Presenting/speaking and reporting/writing.

Video --
https://www.coursera.org/lecture/data-analysis-with-python/measures-for-in-sam
ple-evaluation-h6K6H
Prediction and Decision Making.

Video--
https://www.coursera.org/lecture/data-analysis-with-python/prediction-and-decision-making-
4Li1D

Here’s how you can use Data Science for better decision making:

1. Personal Movie Recommender System

Netflix recommends new movies for you to watch based on your viewing history. But
there’s a limitation: Netflix can only recommend the movies that it has in its library.

By making a personal movie recommender system, you can avoid all of this hassle.
This may be a bit complicated for a beginner.

But, if you have an idea of basic ML models you can try this out in a different way.
Create a dataset by feeding-in all the movies you have watched along with their
IMDB scores, genre, major actors, language, director, etc. and give them a personal
rating out of 10. Use this personal rating as the target variable, pick a validation
approach, and then use the appropriate modeling technique on it.

In a similar manner, gather a list of all the movies that you want to watch and using
the above-fitted model, try to get the predicted ratings on each of them!

What’s next? Start watching these movies in descending order and enjoy it!

2. Doing a Self-Analysis

 I have always been this guy who would avoid his emotions and would distract
himself if something serious would pop into his head. The pandemic made me realize
how important it is to face your emotions, scrutinize them and then let them go.

So, I am thinking about creating a journal wherein I would write the majority of
things I did throughout the day followed by a review of the day. You can do this too!

Learn some basic techniques for analysing textual data, also known as Natural
Language Processing or NLP, and let’s try to put them into action. A sentiment
analysis can be done to understand how you were feeling when you wrote that
message and how often you feel like that.

Here is a comprehensive guide on how to perform sentiment analysis.


Also, you wrote down what you did throughout the day, right? You can associate
those sets of activities with the emotions you feel at the end of the day and then
understand what kind of activities make your day better.

3. Creating a Chatbot to answer your friends when you’re busy

When the COVID pandemic broke out, everything got affected: offices had to be
closed because we were all supposed to stay inside our homes, and of course call
centres and service centres were closed as well. With a limited number of people
at their disposal, companies had to look for alternatives which could make
their jobs easier. Many companies used chatbots to talk to us and record our
problems. If the solution was easy and could be provided without any human assistance,
the chatbot itself would recommend some solutions.

You can also make a chatbot that answers for you when you’re busy. Let’s say that
you’re at the office and someone texts you asking where you are; you don’t have to leave
your work and sit back to reply to that person. The chatbot that you program can
handle that effectively for you.

You can check this out to know more about Chatbots and how to implement them.

4. Researching efficiently before buying anything

 You are thinking about buying a smartphone and you google “Best phones under
35k”. You open the first link and the blog lists 5 choices for you, and the confusion
game starts. Also, it doesn’t talk about the features that matter to you.

To be honest, only you know what matters to you and those blogs are written from a
generic point of view. So how do you choose the best (for yourself) among thousands
of different options?

Let me tell you how an analyst, i.e. you, could do that job. Prepare a list of all kinds of
smartphones available in the market along with their latest prices. You
could do this activity efficiently if you are aware of Web Scraping Techniques.

Now, you have a structured dataset in front of you! What are you waiting for? Start
digging! Apply all sorts of filters on Battery, Storage, Price, etc. and voila! You have
‘Best smartphones under 35k.’

For these shortlisted smartphones, you can gather the top 500 or 1000 reviews and
then analyse using NLP techniques.

Here is how you can do that.

5. Investment
You have a great job, but saving money is a task for you. And to be very honest, your
money will not multiply if you keep the majority of it in your Savings Account; you
will not be able to catch up with the inflation rate.

Stock Investing sounds very fancy right? By the way, it’s not just for investment
bankers. Anyone can do it! If you choose the right set of stocks, in just one month,
you could earn an amount of money equivalent to what you would get had you put
this money in a Savings Bank account.

People do it by conventional methods: qualitative analysis, looking at charts, etc. But
you’re comfortable with computers and have analytical skills too. Why don’t you
try Algorithmic Trading, where your computer does most of your job, analyzing the
historical data and picking the right time to invest your money?

Pick up various stocks, diversify your portfolio and make efficient use of your Data
skills.

How Data Science Is Enabling Better Decision-making

Good decision-making is key to companies and institutions running efficiently and
overcoming unforeseeable obstacles. With the help of data science, decision
makers are now able to make better-informed choices than in the past by shaping and filtering
the data their organizations have collected. By using this data, they are able to formulate
predictions of the future based on what would happen if they decided to take their
organization in a brand new direction, for example, or how they would rebuild themselves
after a financial disaster, among a variety of other things.

It should be mentioned, however, that having data science alone is not always enough. As
Irina Peregud, of InData Labs, explains: “Data scientists analyze data to find insights but it’s
the job of product managers and business leaders to tell them what to look for.” Essentially,
business leaders and heads of governmental institutions need to know what the problem is
before they send in their data science troops to try and solve it. Data scientists are able to dig
up masses of information but it’s worth nothing unless they are led by someone who
understands the setting in which they are working — a leader with industry experience. Goals
need to be clearly set before data scientists are able to theorize ways to reach them.

Automatization

Building automated response systems is often seen as an end goal for many business leaders
seeking to invest in data science. Many small decisions can be automated with ease when the
right data is collected and utilized. For example, many banks that grant loans have for many
years been using credit scoring systems to predict their clients’ ‘credit-worthiness’;
however, now, with the aid of data science, they are able to do this with a much higher degree
of accuracy, which has relieved their employees of some of the decision-making process,
lowered the possibility of not getting a return on their loans if the customer was not ‘worthy’,
and also sped up the process.

On top of that, data science is also able to help automate much more complicated
decision-making processes, with the ability to provide numerous solid directions to choose
from with data as evidence for those possibilities. Using data science, it is possible to forecast
the impacts of decisions that are yet to even be made. There are many examples of this, but
perhaps some of the best known are of companies on the brink of collapse that placed their
trust in data science and were saved by remodeling their company based on what the data told
them would work, such as Dunkin’ Donuts and Timberland. The former invested in a loyalty
system and the latter invested in identifying its ideal customer. Having the data to back big
decisions such as these allows decision makers to feel more confident in what they are doing
and invest more in the idea financially as well as psychologically.

Healthcare and Insurance

Healthcare is another area where data science has shown to be highly beneficial for
decision-making in a variety of sectors. Obviously, providing adequate treatment is the
number one priority. Many healthcare providers now are moving towards evidence-based
medicine, which when used in conjunction with data science, enables physicians to provide
patients with a more personalized experience by accessing a larger pool of sources before
making a decision on treatment.

Health and life insurance are other areas of healthcare that benefit significantly from data
science. Similar to how banks grant loans based on a ‘credit-worthiness’ score as mentioned
above, health and life insurers are able to develop ‘well-being’ scores. To develop such a
score can involve the collection of data from a large number of places, including social
media, financial transactions, and even body sensors. This can also be seen throughout the
insurance industry as a whole. In fact, as Datafloq explains: “Insurance companies rely on
growing their number of customers by adapting insurance policies for each individual
customer”. By using data science, insurers are able to grant more personalized insurance
schemes that work for both the customer and the insurer. This makes the decision-making
process easier as it is no longer a question of insurers saying ‘yes’ or ‘no’ to customers, but
about questioning the terms that will work for both parties involved.

Smoother Operations

Data science is also well-known to those who aim to improve operational standards. By
applying data science to operational procedures, decision makers are able to implement
changes much more efficiently and monitor if they are successful or not much more closely
through trial and error. Such methods can be applied to hiring and firing employees by
collecting and measuring data to see who fits the job best, as well as measuring performance
targets to see who really deserves to get promoted. On top of this, it helps employers see
where work is really needed and where it can be cut.
William Edwards Deming once said: “In God we trust. All others must bring data.” Though
he passed away more than 24 years ago, his words now hold more truth today than when they
were spoken. With the aid of data science, decision makers — whatever industry they may be
in — can make much more precise choices than ever before. Or, in some situations, wipe out
the whole decision-making process by automating it.

By harnessing data science to its full potential, top-ranking decision makers in all industries,
not only make better-informed decisions but make them with clearer predictions of the future.
With that advantage on their side, they are able to stabilize businesses that have not always
had a clear vision and save businesses that are on the brink of collapse.

However, it should be mentioned again that data science is only an advantage when decision
makers know there is a problem to be solved and can give the data scientists under their
leadership goals to aim for. Once goals have been established, data scientists can work their
magic and theorize how to fix it. Data science alone is not an advantage for decision-making,
data science combined with good leadership is.

Prediction

Predictive analytics uses historical data to predict future events. Typically, historical data is
used to build a mathematical model that captures important trends. That predictive model is
then used on current data to predict what will happen next, or to suggest actions to take for
optimal outcomes.

Predictive analytics has received a lot of attention in recent years due to advances in
supporting technology, particularly in the areas of big data and machine learning.

Developing Predictive Models

Your aggregated data tells a complex story. To extract the insights it holds, you need an
accurate predictive model.

Predictive modeling uses mathematical and computational methods to predict an event or
outcome. These models forecast an outcome at some future state or time based upon changes
to the model inputs. Using an iterative process, you develop the model using a training data
set and then test and validate it to determine its accuracy for making predictions. You can try
out different machine learning approaches to find the most effective model.

Examples include time-series regression models for predicting airline traffic volume,
predicting fuel efficiency based on a linear regression model of engine speed
versus load, and remaining useful life estimation models for prognostics.


Predictive Analytics vs. Prescriptive Analytics


Organizations that have successfully implemented predictive analytics see prescriptive
analytics as the next frontier. Predictive analytics creates an estimate of what will happen
next; prescriptive analytics tells you how to react in the best way possible given the
prediction.

Prescriptive analytics is a branch of data analytics that uses predictive models to suggest
actions to take for optimal outcomes. Prescriptive analytics relies on optimization and
rules-based techniques for decision making. Forecasting the load on the electric grid over the
next 24 hours is an example of predictive analytics, whereas deciding how to operate power
plants based on this forecast represents prescriptive analytics.
