Unit-III-1
Regression
Regression Concepts:
Regression analysis is a form of predictive modelling technique which investigates
the relationship between a dependent (target) variable and one or more independent
(predictor) variables. This technique is used for forecasting, time-series modelling and
finding the causal-effect relationship between variables. For example, the relationship
between rash driving and the number of road accidents caused by a driver is best
studied through regression.
Dependent variable - target variable, e.g.: test score
Independent variable - predictor or explanatory variable, e.g.: age
Regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:
Let’s say, you want to estimate growth in sales of a company based on current
economic conditions. You have the recent company data which indicates that the
growth in sales is around two and a half times the growth in the economy. Using this
insight, we can predict future sales of the company based on current & past
information.
There are multiple types of regression techniques. Some of the most commonly used are as follows:
Linear Regression
Logistic Regression
Polynomial Regression
Ridge Regression
Lasso Regression
1. Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually
among the first few topics people pick up while learning predictive modeling. In
this technique, the dependent variable is continuous, the independent variable(s) can be
continuous or discrete, and the nature of the regression line is linear.
The relationship between the two variables can be of three types:
(i) Positive relationship
(ii) Negative relationship
(iii) No relationship
The regression line is represented by the equation y = a + bx,
where
y is the dependent variable
x is the independent variable
b is the slope --> how much the line rises for each unit increase in x
a is the y-intercept --> the value of y when x = 0.
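As a minimal sketch (not part of the original notes), the code below fits a simple linear regression by hand with NumPy; the age and test-score values are made up purely to illustrate the slope and intercept formulas for y = a + bx.

```python
import numpy as np

# Hypothetical data: age (independent variable x) vs. test score (dependent variable y)
x = np.array([18, 20, 22, 24, 26], dtype=float)
y = np.array([65, 70, 74, 80, 85], dtype=float)

# Least-squares estimates of the slope (b) and y-intercept (a) for y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"Fitted line: y = {a:.2f} + {b:.2f}x")
print("Predicted score at age 23:", round(a + b * 23, 2))
```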
Logistic Regression
Logistic Regression is used to solve classification problems, so it is called a
classification algorithm that models the probability of the output class.
It is used when the target variable is categorical.
Unlike Linear Regression, the output of Logistic Regression is represented by
discrete values such as binary 0 and 1.
It estimates relationship between a dependent variable (target) and one or more
independent variable (predictors) where dependent variable is categorical/nominal.
Logistic regression is a supervised learning classification algorithm used to predict
the probability of a dependent variable.
The nature of target or dependent variable is dichotomous(binary), which means
there would be only two possible classes.
In simple words, the dependent variable is binary in nature, with data coded as
either 1 (stands for success/yes) or 0 (stands for failure/no). However, instead of giving
the exact values 0 and 1, the model gives probabilistic values which lie between 0
and 1.
Logistic Regression is very similar to Linear Regression except in how it is used:
Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, whose output is bounded between the two extreme values 0 and 1.
Sigmoid Function:
It is the logistic expression used in Logistic Regression.
The sigmoid function converts the straight regression line into a curve whose outputs
stay between 0 and 1 and can therefore be mapped to the discrete classes 0 and 1.
Here we see how the continuous output of linear regression can be manipulated and
converted into a classifier (logistic regression).
The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
It maps any real value into another value within a range of 0 and 1.
The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the Sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides between
class 0 and class 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
P = 1 / (1 + e^(−Y))
where
P represents the probability of the output class, and
Y represents the predicted output (the output of the underlying linear model).
Example
Class (y)   Feature (x)
0           4.2
0           5.1
0           5.5
1           8.2
1           9.0
1           9.9
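A short sketch of how this tiny example could be fitted with scikit-learn's LogisticRegression (assuming scikit-learn is available); the interpretation of the columns (first column = class y, second column = feature x) follows the table above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The small example above: one feature x and a binary class label y
X = np.array([[4.2], [5.1], [5.5], [8.2], [9.0], [9.9]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba gives probabilities between 0 and 1; predict applies the 0.5 threshold
print(model.predict_proba([[6.5]]))
print(model.predict([[6.5]]))
```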
Polynomial Regression
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between
the value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of given degree and then modeled using a linear
model. Which means the datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression
equation: the linear regression equation Y = b0 + b1x is transformed into the
Polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
o The model is still a linear model because the coefficients enter it linearly, even
though the features (x², x³, ...) are non-linear functions of x.
When we compare the simple linear equation (Y = b0 + b1x), the multiple linear
equation (Y = b0 + b1x1 + b2x2 + ... + bnxn) and the polynomial equation above, we
can clearly see that all three are polynomial equations that differ only in the degree
of the variables. The simple and multiple linear equations are polynomial equations
of degree one, and the polynomial regression equation is a linear equation of
degree n.
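A minimal sketch of polynomial regression with scikit-learn on a small made-up non-linear dataset: the original feature is transformed into polynomial features of degree 2 and then fitted with an ordinary linear model, as described above. The data values and the degree are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y roughly follows a quadratic in x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float).reshape(-1, 1)
y = np.array([2.1, 4.9, 10.2, 17.8, 26.5, 37.0])

# Transform the original feature into polynomial features (x, x^2),
# then model them with a linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print("b1, b2:", model.coef_)
print("b0:", model.intercept_)
print("Prediction at x = 7:", model.predict(poly.transform([[7]])))
```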
Stepwise Regression
• This form of regression is used when we deal with multiple independent
variables. In this technique, the selection of independent variables is done
with the help of an automatic process, which involves no human intervention.
• Stepwise regression basically fits the regression model by adding/dropping
co-variates one at a time based on a specified criterion. Some of the most
commonly used Stepwise regression methods are listed below:
Standard stepwise regression does two things: it adds and removes predictors
as needed at each step.
Forward selection starts with the most significant predictor in the model and
adds a variable at each step.
Backward elimination starts with all predictors in the model and removes
the least significant variable at each step.
The aim of this modeling technique is to maximize prediction power with the
minimum number of predictor variables. It is one of the methods used to handle
the higher dimensionality of a data set.
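As a sketch, forward selection and backward elimination can be approximated with scikit-learn's SequentialFeatureSelector (scikit-learn 0.24 or newer). Note that this selector adds or drops variables based on cross-validated model score rather than the F-probabilities used by classical stepwise procedures; the diabetes dataset and the choice of four features are assumptions made for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# A standard regression dataset with 10 candidate predictors
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Forward selection: start with no predictors and add one variable per step
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward")
forward.fit(X, y)
print("Forward selection kept:", list(X.columns[forward.get_support()]))

# Backward elimination: start with all predictors and drop one variable per step
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="backward")
backward.fit(X, y)
print("Backward elimination kept:", list(X.columns[backward.get_support()]))
```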
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in
which a small amount of bias is introduced so that we can get better long term
predictions.
o The amount of bias added to the model is known as the Ridge Regression
penalty. This penalty term is computed by multiplying lambda (λ) by the
squared weight of each individual feature.
o The equation (cost function) for ridge regression will be:
Cost = Σ(y − ŷ)² + λ Σ(w²)
where the first term is the usual sum of squared errors, w are the feature weights
(coefficients), and λ controls the strength of the penalty.
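A brief sketch comparing ordinary least squares with ridge regression on made-up, nearly collinear data; in scikit-learn's Ridge, the parameter alpha plays the role of the lambda penalty above. The data values and the alpha setting are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data with two nearly collinear features, where plain OLS can be unstable
rng = np.random.default_rng(0)
z = rng.normal(size=(30, 2))
X = np.column_stack([z[:, 0], 0.95 * z[:, 0] + 0.05 * z[:, 1]])
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha = lambda, the penalty strength

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)   # shrunk towards zero by the penalty
```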
WHAT IS AN ESTIMATOR?
• In statistics, an estimator is a rule for calculating an estimate of a given quantity
based on observed data
• Example-
i. X follows a normal distribution, but we do not know the parameters of our
distribution, namely the mean (μ) and variance (σ²).
ii. To estimate the unknowns, the usual procedure is to draw a random sample
of size ‘n’ and use the sample data to estimate parameters.
TWO TYPES OF ESTIMATORS
• Point Estimators A point estimate of a population parameter is a single value of a
statistic. For example, the sample mean x̄ is a point estimate of the population mean
μ. Similarly, the sample proportion p is a point estimate of the population proportion
P.
• Interval Estimators An interval estimate is defined by two numbers, between
which a population parameter is said to lie. For example, a < μ < b is an interval
estimate of the population mean μ. It indicates that the population mean is greater
than a but less than b.
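A small sketch of both kinds of estimators using NumPy and SciPy: a point estimate of the mean from a random sample, and a 95% interval estimate based on the t distribution. The population parameters and sample size are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

# Draw a random sample of size n = 25 from a normal population
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=25)

# Point estimates
mean_hat = sample.mean()        # point estimate of the population mean
var_hat = sample.var(ddof=1)    # point estimate of the population variance

# 95% interval estimate (a, b) for the population mean
sem = stats.sem(sample)
a, b = stats.t.interval(0.95, df=len(sample) - 1, loc=mean_hat, scale=sem)

print(f"Point estimate of the mean: {mean_hat:.2f}")
print(f"95% interval estimate: ({a:.2f}, {b:.2f})")
```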
PROPERTIES OF BLUE
• B-BEST
• L-LINEAR
• U-UNBIASED
• E-ESTIMATOR
An estimator is BLUE if the following hold:
1. It is linear (Regression model)
2. It is unbiased
3. It is an efficient estimator (an unbiased estimator with the least variance)
LINEARITY
• An estimator is said to be a linear estimator of (β) if it is a linear function of the
sample observations
• Sample mean is a linear estimator because it is a linear function of the X values.
UNBIASEDNESS
• A desirable property of an estimator is that the mean of its sampling distribution
equals the true value of the parameter being estimated.
• Formally, an estimator is an unbiased estimator if the expected value of its sampling
distribution equals the true population value.
• We write this as follows:
E(β̂) = β
• If this is not the case, we say that the estimator is biased:
Bias = E(β̂) − β
MINIMUM VARIANCE
• Just as we want the mean of the sampling distribution to be centered around the
true population value, so too it is desirable for the sampling distribution to be as
narrow (or precise) as possible.
– Centering around “the truth” but with high variability might be of very little use.
• One way of narrowing the sampling distribution is to increase the sample size.
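A quick simulation sketch (with assumed population values) showing both properties at once: the sample mean stays centered on the true mean (unbiasedness), and its sampling distribution narrows as the sample size grows (the minimum-variance idea).

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

# Sampling distribution of the sample mean for two different sample sizes
for n in (10, 100):
    means = rng.normal(loc=true_mean, scale=5, size=(5000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of estimates ~ {means.mean():.3f}  variance ~ {means.var():.4f}")

# The mean of the estimates stays near the true mean (unbiased),
# while increasing n narrows the sampling distribution.
```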
Least Squares Regression
Imagine you have some points, and want to have a line that best fits them like this:
We can place the line "by eye": try to have the line as close as possible to all points,
and a similar number of points above and below the line.
But for better accuracy let's see how to calculate the line using Least Squares
Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a
line :
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x, y) point, calculate x² and xy.
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy.
Step 3: Calculate the slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
Step 4: Calculate the intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of the line:
y = mx + b
Done!
Example
Example: Sam found how many hours of sunshine vs how many ice creams were
sold at the shop from Monday to Friday:
Hours of Sunshine (x)   Ice Creams Sold (y)
2                       4
3                       5
5                       7
7                       10
9                       15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b
x     y     x²     xy
2     4     4      8
3     5     9      15
5     7     25     35
7     10    49     70
9     15    81     135
Σx = 26   Σy = 41   Σx² = 168   Σxy = 263

m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = 249 / 164
  = 1.5183...

b = (Σy − m Σx) / N
  = (41 − 1.5183 × 26) / 5
  = 0.3049...
y = mx + b
y = 1.518x + 0.305
x     y     predicted y = 1.518x + 0.305     error (predicted − observed)
2     4     3.34                             −0.66
3     5     4.86                             −0.14
5     7     7.89                             0.89
7     10    10.93                            0.93
9     15    13.97                            −1.03
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so
he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 ice creams.
Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
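The same worked example can be reproduced in a few lines of NumPy, using the slope and intercept formulas from the Steps section above; this is only a sketch of the calculation, not part of the original notes.

```python
import numpy as np

# Sam's data: hours of sunshine (x) vs. ice creams sold (y)
x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)

N = len(x)
m = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / N

print(f"y = {m:.3f}x + {b:.3f}")                            # y = 1.518x + 0.305
print("Estimate for 8 hours of sun:", round(m * 8 + b, 2))  # about 12.45 ice creams
```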
It works by making the total of the square of the errors as small as possible (that is
why it is called "least squares"):
You can imagine (but not accurately) each data point connected to a straight bar by
springs:
Outliers
Be careful! Least squares is sensitive to outliers. A strange value will pull the line
towards it.
The ordinary least squares, or OLS, can also be called the linear least squares. It
is a method for approximately determining the unknown parameters of a linear
regression model. The ordinary least squares estimate is obtained by minimizing the
total of the squared vertical distances between the observed responses in the dataset
and the responses predicted by the linear approximation. In the simple case of a
single regressor on the right-hand side of the linear regression model, the resulting
estimator can be expressed by a simple formula.
For example, suppose you have a system consisting of several equations with
unknown parameters, and more equations than unknowns. You may use the ordinary
least squares method because it is the standard approach for finding an approximate
solution to such an overdetermined system. In other words, it is the overall solution
that minimizes the sum of the squares of the errors in the equations. Data fitting is
its most common application: the fit that is best in the ordinary least squares sense
minimizes the sum of squared residuals, where a residual is the difference
between an observed value and the fitted value provided by the model.
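As a sketch, the same overdetermined-system view of OLS can be expressed with NumPy's least-squares solver; the design matrix below (a column of ones for the intercept plus the sunshine values from the earlier example) is an assumption chosen so the result matches the worked example.

```python
import numpy as np

# Overdetermined system: more equations (rows) than unknown parameters (columns)
A = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0],
              [1.0, 9.0]])        # column of ones (intercept) plus x
y = np.array([4.0, 5.0, 7.0, 10.0, 15.0])

# lstsq returns the parameters that minimize the sum of squared residuals
params, residual_ss, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print("intercept, slope:", params)            # same fit as the worked example
print("sum of squared residuals:", residual_ss)
```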
Variable Rationalization:
Method selection allows you to specify how independent variables are entered into
the analysis. Using different methods, you can construct a variety of regression
models from the same set of variables.
Enter (Regression). A procedure for variable selection in which all variables in a
block are entered in a single step.
Stepwise. At each step, the independent variable not in the equation that has the
smallest probability of F is entered, if that probability is sufficiently small. Variables
already in the regression equation are removed if their probability of F becomes
sufficiently large. The method terminates when no more variables are eligible for
inclusion or removal.
Remove. A procedure for variable selection in which all variables in a block are
removed in a single step.
Backward Elimination. A variable selection procedure in which all variables are
entered into the equation and then sequentially removed. The variable with the
smallest partial correlation with the dependent variable is considered first for
removal. If it meets the criterion for elimination, it is removed. After the first variable
is removed, the variable remaining in the equation with the smallest partial
correlation is considered next. The procedure stops when there are no variables in
the equation that satisfy the removal criteria.
Forward Selection. A stepwise variable selection procedure in which variables are
sequentially entered into the model. The first variable considered for entry into the
equation is the one with the largest positive or negative correlation with the
dependent variable.
Model Building:
In regression analysis, model building is the process of developing a probabilistic
model that best describes the relationship between the dependent and independent
variables. The major issues are finding the proper form (linear or curvilinear) of the
relationship and selecting which independent variables to include. In building
models it is often desirable to use qualitative as well as quantitative variables. As
noted above, quantitative variables measure how much or how many; qualitative
variables represent types or categories. For instance, suppose it is of interest to
predict sales of an iced tea that is available in either bottles or cans. Clearly, the
independent variable “container type” could influence the dependent variable
“sales.” Container type is a qualitative variable, however, and must be assigned
numerical values if it is to be used in a regression study. So-called dummy variables
are used to represent qualitative variables in regression analysis. For example, the
dummy variable x could be used to represent container type by setting x = 0 if the
iced tea is packaged in a bottle and x = 1 if the iced tea is in a can. If the beverage
could be placed in glass bottles, plastic bottles, or cans, it would require two dummy
variables to properly represent the qualitative variable container type. In general, k -
1 dummy variables are needed to model the effect of a qualitative variable that may
assume k values.
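A small sketch of the dummy-variable idea with pandas (the container data below are made up): with k = 3 container types, drop_first=True keeps k − 1 = 2 dummy columns and treats the dropped category as the baseline.

```python
import pandas as pd

# Hypothetical iced-tea data: container type is qualitative with k = 3 categories
df = pd.DataFrame({
    "container": ["glass bottle", "plastic bottle", "can", "can", "glass bottle"],
    "sales": [120, 150, 200, 180, 130],
})

# k - 1 = 2 dummy variables; the dropped category acts as the baseline level
dummies = pd.get_dummies(df["container"], prefix="container", drop_first=True)
print(dummies)
```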
The general linear model y = β0 + β1x1 + β2x2 + . . . + βpxp + ε can be used to model
a wide variety of curvilinear relationships between dependent and independent
variables. For instance, each of the independent variables could be a
nonlinear function of other variables. Also, statisticians sometimes find it necessary
to transform the dependent variable in order to build a satisfactory model. A
logarithmic transformation is one of the more common types.
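A brief sketch of a logarithmic transformation of the dependent variable, using made-up data that grow multiplicatively with x: the model is fitted on log(y), and predictions are transformed back with the exponential.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data where y grows multiplicatively (roughly exponentially) with x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float).reshape(-1, 1)
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])

# Fit the linear model on log(y) instead of y, then transform predictions back
model = LinearRegression().fit(x, np.log(y))
pred_log = model.predict([[7]])
print("Predicted y at x = 7:", np.exp(pred_log))
```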
Logistic Regression
Model Theory:
Logistic regression is a statistical method for predicting binary classes. The outcome
or target variable is binary in nature. For example, it can be used for cancer detection
problems. It computes the probability of an event occurrence.
Sigmoid Function:
Sigmoid curve
The sigmoid function, also called the logistic function, gives an ‘S’-shaped curve that
can take any real-valued number and map it into a value between 0 and 1. As the input
goes to positive infinity, the predicted y becomes 1, and as the input goes to negative
infinity, the predicted y becomes 0. If the output of the sigmoid function is more than
0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify
it as 0 or NO. If the output is 0.75, we can say in terms of probability: there is a
75 percent chance that the patient will suffer from cancer.
The sigmoid function is
f(x) = 1 / (1 + e^(−x))
which approaches ‘0’ as x approaches −∞
and approaches ‘1’ as x approaches +∞.
Thus, if the output is more than 0.5, we can classify the outcome as 1 (or YES), and if
it is less than 0.5, we can classify it as 0 (or NO).
For example: If the output is 0.65, we can say in terms of probability as:
“There is a 65 percent chance that your favorite cricket team is going to win today ”.
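A minimal NumPy sketch of the sigmoid function and the 0.5 threshold rule described above; the raw scores are made-up linear-model outputs.

```python
import numpy as np

def sigmoid(x):
    """Map any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-4.0, -1.0, 0.0, 2.0, 5.0])   # assumed raw linear-model outputs
probs = sigmoid(scores)
labels = (probs > 0.5).astype(int)               # apply the 0.5 threshold

print(probs)    # probabilities between 0 and 1
print(labels)   # 0 below the threshold, 1 above it
```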
For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.
The independent variables should be independent of each other. That is, the
model should have little or no multicollinearity.
The measure of total variation, SST, is the sum of the squared deviations of the
dependent variable about its mean: SST = Σ(y − ȳ)². This quantity is known as the total
sum of squares. The measure of unexplained variation, SSE, is referred to as the residual
sum of squares; it is the sum of the squared distances from each point to the
estimated regression line: SSE = Σ(y − ŷ)². SSE is also commonly referred to as the error
sum of squares. The measure of explained variation, SSR = Σ(ŷ − ȳ)², is called the
regression sum of squares. A key result in the analysis of variance is that SSR + SSE = SST.
The ratio r2 = SSR/SST is called the coefficient of determination. If the data points
are clustered closely about the estimated regression line, the value of SSE will be
small and SSR/SST will be close to 1. Using r2, whose values lie between 0 and 1,
provides a measure of goodness of fit; values closer to 1 imply a better fit. A value
of r2 = 0 implies that there is no linear relationship between the dependent and
independent variables.
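These quantities can be checked numerically; the sketch below reuses the ice-cream fit from the least-squares example (y = 1.518x + 0.305) to compute SST, SSE, SSR and r². The choice of that example is an assumption made here for continuity.

```python
import numpy as np

# Ice-cream example from earlier: observed data and the fitted line y = 1.518x + 0.305
x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)
y_hat = 1.518 * x + 0.305

sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
sse = np.sum((y - y_hat) ** 2)       # residual (error) sum of squares
ssr = sst - sse                      # regression sum of squares: SSR = SST - SSE
r2 = ssr / sst                       # coefficient of determination

print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}, r^2 = {r2:.3f}")
```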
A related goodness-of-fit measure for logistic regression is the Hosmer-Lemeshow statistic,
H = Σ (Og − Eg)² / (Ng πg (1 − πg)), summed over the G groups,
where Og, Eg, Ng, and πg denote the observed events, expected events, number of
observations, and mean predicted risk for the gth risk decile group, and G is the number of groups.
Model Construction:
One Model Building Strategy
We've talked before about the "art" of model building. Unsurprisingly, there are
many approaches to model building, but here is one strategy—consisting of seven
steps—that is commonly used when building a regression model.
The first step
Decide on the type of model that is needed in order to achieve the goals of the study.
In general, the model may be needed for one (or more) of the following reasons:
For predictive reasons — that is, the model will be used to predict the
response variable from a chosen set of predictors.
For theoretical reasons — that is, the researcher wants to estimate a model
based on a known theoretical relationship between the response and
predictors.
For control purposes — that is, the model will be used to control a response
variable by manipulating the values of the predictor variables.
For inferential reasons — that is, the model will be used to explore the
strength of the relationships between the response and the predictors.
For data summary reasons — that is, the model will be used merely as a
way to summarize a large set of data by a single equation.
The second step
Decide which predictor variables and which response variable to collect data on.
Collect the data.
The third step
Explore the data:
On a univariate basis, check for outliers, gross data errors, and missing
values.
Study bivariate relationships to reveal other outliers, to suggest possible
transformations, and to identify possible multicollinearities.
I can't possibly over-emphasize the importance of this step. There's not a data
analyst out there who hasn't made the mistake of skipping this step and later
regretting it when a data point was found in error, thereby nullifying hours of
work.
The fourth step
Randomly divide the data into a training set and a validation set:
The training set, with at least 15-20 error degrees of freedom, is used to
estimate the model.
The validation set is used for cross-validation of the fitted model.
The fifth step
Using the training set, identify several candidate models: