Unit-III-1
Regression
Regression Concepts:
Regression analysis is a form of predictive modelling technique which investigates
the relationship between a dependent (target) variable and one or more independent
(predictor) variables. This technique is used for forecasting, time-series modelling and
finding the causal-effect relationship between variables. For example, the relationship
between rash driving and the number of road accidents caused by a driver is best
studied through regression.
Dependent variable - target variable, e.g.: test score
Independent variable - predictor or explanatory variable, e.g.: age
Regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:
Let’s say, you want to estimate growth in sales of a company based on current
economic conditions. You have the recent company data which indicates that the
growth in sales is around two and a half times the growth in the economy. Using this
insight, we can predict future sales of the company based on current & past
information.
There are multiple types of regression techniques. Some of the most commonly used are as follows:
Linear Regression
Logistic Regression
Polynomial Regression
Ridge Regression
Lasso Regression
1. Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually
among the first few topics people pick up while learning predictive modeling. In
this technique, the dependent variable is continuous, the independent variable(s) can be
continuous or discrete, and the nature of the regression line is linear.
The relationship between the two variables can be of three types:
(i) Positive relationship
(ii) Negative relationship
(iii) No relationship
The regression line is represented by the equation y = a + bx,
where
y is the dependent variable
x is the independent variable
b is the slope --> how much the line rises for each unit increase in x
a is the y-intercept --> the value of y when x = 0.
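As a minimal sketch (not part of the original notes), the code below fits a simple linear regression by hand with NumPy; the age and test-score values are made up purely to illustrate the slope and intercept formulas for y = a + bx.

```python
import numpy as np

# Hypothetical data: age (independent variable x) vs. test score (dependent variable y)
x = np.array([18, 20, 22, 24, 26], dtype=float)
y = np.array([65, 70, 74, 80, 85], dtype=float)

# Least-squares estimates of the slope (b) and y-intercept (a) for y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"Fitted line: y = {a:.2f} + {b:.2f}x")
print("Predicted score at age 23:", round(a + b * 23, 2))
```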
Logistic Regression
Logistic Regression is used to solve classification problems, so it is called a
classification algorithm that models the probability of the output class.
It is used when the target variable is categorical.
Unlike Linear Regression, the output of Logistic Regression is represented by
discrete values such as binary 0 and 1.
It estimates relationship between a dependent variable (target) and one or more
independent variable (predictors) where dependent variable is categorical/nominal.
Logistic regression is a supervised learning classification algorithm used to predict
the probability of a dependent variable.
The nature of target or dependent variable is dichotomous(binary), which means
there would be only two possible classes.
In simple words, the dependent variable is binary in nature, with data coded as
either 1 (stands for success/yes) or 0 (stands for failure/no). However, instead of giving
the exact values 0 and 1, the model gives probabilistic values which lie between 0
and 1.
Logistic Regression is very similar to Linear Regression except in how it is used:
Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, whose output is bounded between the two extreme values 0 and 1.
Sigmoid Function:
It is the logistic expression used in Logistic Regression.
The sigmoid function converts the straight regression line into a curve whose outputs
stay between 0 and 1 and can therefore be mapped to the discrete classes 0 and 1.
Here we see how the continuous output of linear regression can be manipulated and
converted into a classifier (logistic regression).
The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
It maps any real value into another value within a range of 0 and 1.
The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the Sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides between
class 0 and class 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
P = 1 / (1 + e^(−Y))
where
P represents the probability of the output class, and
Y represents the predicted output (the output of the underlying linear model).
Example
Class (y)   Feature (x)
0           4.2
0           5.1
0           5.5
1           8.2
1           9.0
1           9.9
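A short sketch of how this tiny example could be fitted with scikit-learn's LogisticRegression (assuming scikit-learn is available); the interpretation of the columns (first column = class y, second column = feature x) follows the table above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The small example above: one feature x and a binary class label y
X = np.array([[4.2], [5.1], [5.5], [8.2], [9.0], [9.9]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba gives probabilities between 0 and 1; predict applies the 0.5 threshold
print(model.predict_proba([[6.5]]))
print(model.predict([[6.5]]))
```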
Polynomial Regression
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between
the value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of given degree and then modeled using a linear
model. Which means the datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression
equation: the linear regression equation Y = b0 + b1x is transformed into the
Polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
o The model is still a linear model because the coefficients enter it linearly, even
though the features (x², x³, ...) are non-linear functions of x.
When we compare the simple linear equation (Y = b0 + b1x), the multiple linear
equation (Y = b0 + b1x1 + b2x2 + ... + bnxn) and the polynomial equation above, we
can clearly see that all three are polynomial equations that differ only in the degree
of the variables. The simple and multiple linear equations are polynomial equations
of degree one, and the polynomial regression equation is a linear equation of
degree n.
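A minimal sketch of polynomial regression with scikit-learn on a small made-up non-linear dataset: the original feature is transformed into polynomial features of degree 2 and then fitted with an ordinary linear model, as described above. The data values and the degree are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y roughly follows a quadratic in x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float).reshape(-1, 1)
y = np.array([2.1, 4.9, 10.2, 17.8, 26.5, 37.0])

# Transform the original feature into polynomial features (x, x^2),
# then model them with a linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print("b1, b2:", model.coef_)
print("b0:", model.intercept_)
print("Prediction at x = 7:", model.predict(poly.transform([[7]])))
```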
Stepwise Regression
• This form of regression is used when we deal with multiple independent
variables. In this technique, the selection of independent variables is done
with the help of an automatic process, which involves no human intervention.
• Stepwise regression basically fits the regression model by adding/dropping
co-variates one at a time based on a specified criterion. Some of the most
commonly used Stepwise regression methods are listed below:
Standard stepwise regression does two things: it adds and removes predictors
as needed at each step.
Forward selection starts with the most significant predictor in the model and
adds a variable at each step.
Backward elimination starts with all predictors in the model and removes
the least significant variable at each step.
The aim of this modeling technique is to maximize prediction power with the
minimum number of predictor variables. It is one of the methods used to handle
the higher dimensionality of a data set.
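As a sketch, forward selection and backward elimination can be approximated with scikit-learn's SequentialFeatureSelector (scikit-learn 0.24 or newer). Note that this selector adds or drops variables based on cross-validated model score rather than the F-probabilities used by classical stepwise procedures; the diabetes dataset and the choice of four features are assumptions made for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# A standard regression dataset with 10 candidate predictors
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Forward selection: start with no predictors and add one variable per step
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward")
forward.fit(X, y)
print("Forward selection kept:", list(X.columns[forward.get_support()]))

# Backward elimination: start with all predictors and drop one variable per step
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="backward")
backward.fit(X, y)
print("Backward elimination kept:", list(X.columns[backward.get_support()]))
```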
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in
which a small amount of bias is introduced so that we can get better long term
predictions.
o The amount of bias added to the model is known as the Ridge Regression
penalty. This penalty term is computed by multiplying lambda (λ) by the
squared weight of each individual feature.
o The equation (cost function) for ridge regression will be:
Cost = Σ(y − ŷ)² + λ Σ(w²)
where the first term is the usual sum of squared errors, w are the feature weights
(coefficients), and λ controls the strength of the penalty.
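A brief sketch comparing ordinary least squares with ridge regression on made-up, nearly collinear data; in scikit-learn's Ridge, the parameter alpha plays the role of the lambda penalty above. The data values and the alpha setting are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data with two nearly collinear features, where plain OLS can be unstable
rng = np.random.default_rng(0)
z = rng.normal(size=(30, 2))
X = np.column_stack([z[:, 0], 0.95 * z[:, 0] + 0.05 * z[:, 1]])
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha = lambda, the penalty strength

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)   # shrunk towards zero by the penalty
```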
WHAT IS AN ESTIMATOR?
• In statistics, an estimator is a rule for calculating an estimate of a given quantity
based on observed data
• Example-
i. X follows a normal distribution, but we do not know the parameters of our
distribution, namely the mean (μ) and variance (σ²).
ii. To estimate the unknowns, the usual procedure is to draw a random sample
of size ‘n’ and use the sample data to estimate parameters.
TWO TYPES OF ESTIMATORS
• Point Estimators A point estimate of a population parameter is a single value of a
statistic. For example, the sample mean x̄ is a point estimate of the population mean
μ. Similarly, the sample proportion p is a point estimate of the population proportion
P.
• Interval Estimators An interval estimate is defined by two numbers, between
which a population parameter is said to lie. For example, a < μ < b is an interval
estimate of the population mean μ. It indicates that the population mean is greater
than a but less than b.
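A small sketch of both kinds of estimators using NumPy and SciPy: a point estimate of the mean from a random sample, and a 95% interval estimate based on the t distribution. The population parameters and sample size are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

# Draw a random sample of size n = 25 from a normal population
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=25)

# Point estimates
mean_hat = sample.mean()        # point estimate of the population mean
var_hat = sample.var(ddof=1)    # point estimate of the population variance

# 95% interval estimate (a, b) for the population mean
sem = stats.sem(sample)
a, b = stats.t.interval(0.95, df=len(sample) - 1, loc=mean_hat, scale=sem)

print(f"Point estimate of the mean: {mean_hat:.2f}")
print(f"95% interval estimate: ({a:.2f}, {b:.2f})")
```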
PROPERTIES OF BLUE
• B-BEST
• L-LINEAR
• U-UNBIASED
• E-ESTIMATOR
An estimator is BLUE if the following hold:
1. It is linear (Regression model)
2. It is unbiased
3. It is an efficient estimator (an unbiased estimator with the least variance)
LINEARITY
• An estimator is said to be a linear estimator of (β) if it is a linear function of the
sample observations
• Sample mean is a linear estimator because it is a linear function of the X values.
UNBIASEDNESS
• A desirable property of an estimator is that the mean of its sampling distribution
equals the true value of the parameter being estimated.
• Formally, an estimator is an unbiased estimator if the expected value of its sampling
distribution equals the true population value.
• We write this as follows:
E(β̂) = β
• If this is not the case, we say that the estimator is biased:
Bias = E(β̂) − β
MINIMUM VARIANCE
• Just as we want the mean of the sampling distribution to be centered around the
true population value, so too it is desirable for the sampling distribution to be as
narrow (or precise) as possible.
– Centering around “the truth” but with high variability might be of very little use.
• One way of narrowing the sampling distribution is to increase the sample size.
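A quick simulation sketch (with assumed population values) showing both properties at once: the sample mean stays centered on the true mean (unbiasedness), and its sampling distribution narrows as the sample size grows (the minimum-variance idea).

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

# Sampling distribution of the sample mean for two different sample sizes
for n in (10, 100):
    means = rng.normal(loc=true_mean, scale=5, size=(5000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of estimates ~ {means.mean():.3f}  variance ~ {means.var():.4f}")

# The mean of the estimates stays near the true mean (unbiased),
# while increasing n narrows the sampling distribution.
```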
Least Squares Regression
Imagine you have some points, and want to have a line that best fits them like this:
We can place the line "by eye": try to have the line as close as possible to all points,
and a similar number of points above and below the line.
But for better accuracy let's see how to calculate the line using Least Squares
Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a
line :
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x, y) point, calculate x² and xy.
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy.
Step 3: Calculate the slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
Step 4: Calculate the intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of the line:
y = mx + b
Done!
Example
Example: Sam found how many hours of sunshine vs how many ice creams were
sold at the shop from Monday to Friday:
Hours of Sunshine (x)   Ice Creams Sold (y)
2                       4
3                       5
5                       7
7                       10
9                       15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b
x     y     x²     xy
2     4     4      8
3     5     9      15
5     7     25     35
7     10    49     70
9     15    81     135
Σx = 26   Σy = 41   Σx² = 168   Σxy = 263

m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = 249 / 164
  = 1.5183...

b = (Σy − m Σx) / N
  = (41 − 1.5183 × 26) / 5
  = 0.3049...
y = mx + b
y = 1.518x + 0.305
x     y     predicted y = 1.518x + 0.305     error (predicted − observed)
2     4     3.34                             −0.66
3     5     4.86                             −0.14
5     7     7.89                             0.89
7     10    10.93                            0.93
9     15    13.97                            −1.03
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so
he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 ice creams.
Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
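The same worked example can be reproduced in a few lines of NumPy, using the slope and intercept formulas from the Steps section above; this is only a sketch of the calculation, not part of the original notes.

```python
import numpy as np

# Sam's data: hours of sunshine (x) vs. ice creams sold (y)
x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)

N = len(x)
m = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / N

print(f"y = {m:.3f}x + {b:.3f}")                            # y = 1.518x + 0.305
print("Estimate for 8 hours of sun:", round(m * 8 + b, 2))  # about 12.45 ice creams
```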
It works by making the total of the square of the errors as small as possible (that is
why it is called "least squares"):
You can imagine (but not accurately) each data point connected to a straight bar by
springs:
Outliers
Be careful! Least squares is sensitive to outliers. A strange value will pull the line
towards it.
The ordinary least squares, or OLS, can also be called the linear least squares. It
is a method for approximately determining the unknown parameters of a linear
regression model. The ordinary least squares estimate is obtained by minimizing the
total of the squared vertical distances between the observed responses in the dataset
and the responses predicted by the linear approximation. In the simple case of a
single regressor on the right-hand side of the linear regression model, the resulting
estimator can be expressed by a simple formula.
For example, suppose you have a system consisting of several equations with
unknown parameters, and more equations than unknowns. You may use the ordinary
least squares method because it is the standard approach for finding an approximate
solution to such an overdetermined system. In other words, it is the overall solution
that minimizes the sum of the squares of the errors in the equations. Data fitting is
its most common application: the fit that is best in the ordinary least squares sense
minimizes the sum of squared residuals, where a residual is the difference
between an observed value and the fitted value provided by the model.
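As a sketch, the same overdetermined-system view of OLS can be expressed with NumPy's least-squares solver; the design matrix below (a column of ones for the intercept plus the sunshine values from the earlier example) is an assumption chosen so the result matches the worked example.

```python
import numpy as np

# Overdetermined system: more equations (rows) than unknown parameters (columns)
A = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0],
              [1.0, 9.0]])        # column of ones (intercept) plus x
y = np.array([4.0, 5.0, 7.0, 10.0, 15.0])

# lstsq returns the parameters that minimize the sum of squared residuals
params, residual_ss, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print("intercept, slope:", params)            # same fit as the worked example
print("sum of squared residuals:", residual_ss)
```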
Variable Rationalization:
Method selection allows you to specify how independent variables are entered into
the analysis. Using different methods, you can construct a variety of regression
models from the same set of variables.
Enter (Regression). A procedure for variable selection in which all variables in a
block are entered in a single step.
Stepwise. At each step, the independent variable not in the equation that has the
smallest probability of F is entered, if that probability is sufficiently small. Variables
already in the regression equation are removed if their probability of F becomes
sufficiently large. The method terminates when no more variables are eligible for
inclusion or removal.
Remove. A procedure for variable selection in which all variables in a block are
removed in a single step.
Backward Elimination. A variable selection procedure in which all variables are
entered into the equation and then sequentially removed. The variable with the
smallest partial correlation with the dependent variable is considered first for
removal. If it meets the criterion for elimination, it is removed. After the first variable
is removed, the variable remaining in the equation with the smallest partial
correlation is considered next. The procedure stops when there are no variables in
the equation that satisfy the removal criteria.
Forward Selection. A stepwise variable selection procedure in which variables are
sequentially entered into the model. The first variable considered for entry into the
equation is the one with the largest positive or negative correlation with the
dependent variable.
Model Building:
In regression analysis, model building is the process of developing a probabilistic
model that best describes the relationship between the dependent and independent
variables. The major issues are finding the proper form (linear or curvilinear) of the
relationship and selecting which independent variables to include. In building
models it is often desirable to use qualitative as well as quantitative variables. As
noted above, quantitative variables measure how much or how many; qualitative
variables represent types or categories. For instance, suppose it is of interest to
predict sales of an iced tea that is available in either bottles or cans. Clearly, the
independent variable “container type” could influence the dependent variable
“sales.” Container type is a qualitative variable, however, and must be assigned
numerical values if it is to be used in a regression study. So-called dummy variables
are used to represent qualitative variables in regression analysis. For example, the
dummy variable x could be used to represent container type by setting x = 0 if the
iced tea is packaged in a bottle and x = 1 if the iced tea is in a can. If the beverage
could be placed in glass bottles, plastic bottles, or cans, it would require two dummy
variables to properly represent the qualitative variable container type. In general, k -
1 dummy variables are needed to model the effect of a qualitative variable that may
assume k values.
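A small sketch of the dummy-variable idea with pandas (the container data below are made up): with k = 3 container types, drop_first=True keeps k − 1 = 2 dummy columns and treats the dropped category as the baseline.

```python
import pandas as pd

# Hypothetical iced-tea data: container type is qualitative with k = 3 categories
df = pd.DataFrame({
    "container": ["glass bottle", "plastic bottle", "can", "can", "glass bottle"],
    "sales": [120, 150, 200, 180, 130],
})

# k - 1 = 2 dummy variables; the dropped category acts as the baseline level
dummies = pd.get_dummies(df["container"], prefix="container", drop_first=True)
print(dummies)
```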
The general linear model y = β0 + β1x1 + β2x2 + . . . + βpxp + ε can be used to model
a wide variety of curvilinear relationships between dependent and independent
variables. For instance, each of the independent variables could be a
nonlinear function of other variables. Also, statisticians sometimes find it necessary
to transform the dependent variable in order to build a satisfactory model. A
logarithmic transformation is one of the more common types.
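A brief sketch of a logarithmic transformation of the dependent variable, using made-up data that grow multiplicatively with x: the model is fitted on log(y), and predictions are transformed back with the exponential.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data where y grows multiplicatively (roughly exponentially) with x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float).reshape(-1, 1)
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])

# Fit the linear model on log(y) instead of y, then transform predictions back
model = LinearRegression().fit(x, np.log(y))
pred_log = model.predict([[7]])
print("Predicted y at x = 7:", np.exp(pred_log))
```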
Logistic Regression
Model Theory:
Logistic regression is a statistical method for predicting binary classes. The outcome
or target variable is binary in nature. For example, it can be used for cancer detection
problems. It computes the probability of an event occurrence.
Sigmoid Function:
Sigmoid curve
The sigmoid function, also called the logistic function, gives an ‘S’-shaped curve that
can take any real-valued number and map it into a value between 0 and 1. As the input
goes to positive infinity, the predicted y becomes 1, and as the input goes to negative
infinity, the predicted y becomes 0. If the output of the sigmoid function is more than
0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify
it as 0 or NO. If the output is 0.75, we can say in terms of probability: there is a
75 percent chance that the patient will suffer from cancer.
The sigmoid function is
f(x) = 1 / (1 + e^(−x))
which approaches ‘0’ as x approaches −∞
and approaches ‘1’ as x approaches +∞.
Thus, if the output is more than 0.5, we can classify the outcome as 1 (or YES), and if
it is less than 0.5, we can classify it as 0 (or NO).
For example: If the output is 0.65, we can say in terms of probability as:
“There is a 65 percent chance that your favorite cricket team is going to win today ”.
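A minimal NumPy sketch of the sigmoid function and the 0.5 threshold rule described above; the raw scores are made-up linear-model outputs.

```python
import numpy as np

def sigmoid(x):
    """Map any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-4.0, -1.0, 0.0, 2.0, 5.0])   # assumed raw linear-model outputs
probs = sigmoid(scores)
labels = (probs > 0.5).astype(int)               # apply the 0.5 threshold

print(probs)    # probabilities between 0 and 1
print(labels)   # 0 below the threshold, 1 above it
```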
For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.
The independent variables should be independent of each other. That is, the
model should have little or no multicollinearity.
The measure of total variation, SST, is the sum of the squared deviations of the
dependent variable about its mean: SST = Σ(y − ȳ)². This quantity is known as the total
sum of squares. The measure of unexplained variation, SSE, is referred to as the residual
sum of squares; it is the sum of the squared distances from each point to the
estimated regression line: SSE = Σ(y − ŷ)². SSE is also commonly referred to as the error
sum of squares. The measure of explained variation, SSR = Σ(ŷ − ȳ)², is called the
regression sum of squares. A key result in the analysis of variance is that SSR + SSE = SST.
The ratio r2 = SSR/SST is called the coefficient of determination. If the data points
are clustered closely about the estimated regression line, the value of SSE will be
small and SSR/SST will be close to 1. Using r2, whose values lie between 0 and 1,
provides a measure of goodness of fit; values closer to 1 imply a better fit. A value
of r2 = 0 implies that there is no linear relationship between the dependent and
independent variables.
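These quantities can be checked numerically; the sketch below reuses the ice-cream fit from the least-squares example (y = 1.518x + 0.305) to compute SST, SSE, SSR and r². The choice of that example is an assumption made here for continuity.

```python
import numpy as np

# Ice-cream example from earlier: observed data and the fitted line y = 1.518x + 0.305
x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)
y_hat = 1.518 * x + 0.305

sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
sse = np.sum((y - y_hat) ** 2)       # residual (error) sum of squares
ssr = sst - sse                      # regression sum of squares: SSR = SST - SSE
r2 = ssr / sst                       # coefficient of determination

print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}, r^2 = {r2:.3f}")
```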
A related goodness-of-fit measure for logistic regression is the Hosmer-Lemeshow statistic,
H = Σ (Og − Eg)² / (Ng πg (1 − πg)), summed over the G groups,
where Og, Eg, Ng, and πg denote the observed events, expected events, number of
observations, and mean predicted risk for the gth risk decile group, and G is the number of groups.
Model Construction:
One Model Building Strategy
We've talked before about the "art" of model building. Unsurprisingly, there are
many approaches to model building, but here is one strategy—consisting of seven
steps—that is commonly used when building a regression model.
The first step
Decide on the type of model that is needed in order to achieve the goals of the study.
In general, the model may be needed for one (or more) of the following reasons:
For predictive reasons — that is, the model will be used to predict the
response variable from a chosen set of predictors.
For theoretical reasons — that is, the researcher wants to estimate a model
based on a known theoretical relationship between the response and
predictors.
For control purposes — that is, the model will be used to control a response
variable by manipulating the values of the predictor variables.
For inferential reasons — that is, the model will be used to explore the
strength of the relationships between the response and the predictors.
For data summary reasons — that is, the model will be used merely as a
way to summarize a large set of data by a single equation.
The second step
Decide which predictor variables and which response variable to collect data on.
Collect the data.
The third step
Explore the data:
On a univariate basis, check for outliers, gross data errors, and missing
values.
Study bivariate relationships to reveal other outliers, to suggest possible
transformations, and to identify possible multicollinearities.
I can't possibly over-emphasize the importance of this step. There's not a data
analyst out there who hasn't made the mistake of skipping this step and later
regretting it when a data point was found in error, thereby nullifying hours of
work.
The fourth step
Randomly divide the data into a training set and a validation set:
The training set, with at least 15-20 error degrees of freedom, is used to
estimate the model.
The validation set is used for cross-validation of the fitted model.
The fifth step
Using the training set, identify several candidate models: