Linear Regression - Least-Squares
Some of the simplest supervised models are linear models. A linear model expresses the target output
value as a sum of weighted input variables. For example, our goal may be to predict the market
value of a house, that is, its expected sales price in the next month. Suppose we're given
two input variables: how much tax the property is assessed each year by the local government,
and the age of the house in years. You can imagine that these
two features of the house would each have some
information that's helpful in predicting
the market price because in most places there's a positive correlation between the
tax assessment on a
house and its market value. Indeed, the tax assessment
is often partly based on market prices
from previous years. There may also be a negative
correlation between a house's age in years and
its market value, since older houses may need more repair and
upgrading, for example. One linear model, which
I've made up as an example, could compute the
expected market price in US dollars by starting
with a constant term here, 212,000, and then
adding some number, let's say 109 times the
value of tax paid last year, and then subtracting 2,000 times the age of
the house in years. For example, this linear model would estimate the
market price of a house where the
tax assessment was $10,000 and that was 75 years
old as about $1.2 million.
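As a quick check of that arithmetic, here's a minimal sketch in Python; the coefficients and feature values are just the made-up numbers from the example above:

```python
# Made-up example model: price = 212,000 + 109 * tax_assessment - 2,000 * house_age
b_hat = 212_000       # constant (bias/intercept) term, in US dollars
w_tax = 109           # weight on the yearly tax assessment
w_age = -2_000        # weight on the age of the house in years

tax_assessment = 10_000   # example house: $10,000 tax assessment
house_age = 75            # example house: 75 years old

predicted_price = b_hat + w_tax * tax_assessment + w_age * house_age
print(predicted_price)    # 1152000, i.e. roughly $1.2 million
```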
Now, I just made up this particular linear model myself as an example, but in general, when we talk about
training a linear model, we mean estimating values for the parameters of the model
or coefficients of the model, as we sometimes call
them, which here are the constant value 212,000 and
the weights 109 and negative 2,000, in such a way that the
resulting predictions for the outcome variable y (price) for different houses
are a good fit to the data from actual past sales. We'll discuss what good
fit means shortly. Predicting house price is an
example of a regression task using a linear model called, not surprisingly,
linear regression. In general, in a
linear regression model, there may be multiple input
variables or features, which we'll denote
x_0, x_1, etc. Each feature x_i has a
corresponding weight w_i. The predicted output,
which we denote y hat, is a weighted sum of features
plus a constant term b hat. I've put a hat over all the
quantities here that are estimated during the
regression training process. The w hat and b hat values, which we call the trained
parameters or coefficients, are estimated from
training data, and y hat is estimated from
the linear function of the input feature values and the trained parameters.
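Written out, the prediction just described takes the following form, where n is the number of input features and the hats mark quantities estimated from training data:

```latex
\hat{y} = \hat{w}_0 x_0 + \hat{w}_1 x_1 + \cdots + \hat{w}_{n-1} x_{n-1} + \hat{b}
```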
For example, in the simple housing
price example we just saw, w_0 hat was 109, x_0 represented tax paid, w_1 hat was
negative 2,000, x_1 was house age, and b hat was 212,000. We call these w_i hat values
model coefficients or
sometimes feature weights, and b hat is called the bias term or the
intercept of the model. Here's an example of a
linear regression model with just one input variable or feature x_0 on a simple
artificial example dataset. The blue cloud of
points represents a training set of (x_0, y) pairs. In this case, the formula for
predicting the output y hat is just w_0 hat times
x_0 plus b hat, which you might recognize as the familiar slope intercept
formula for a straight line, where w_0 hat is the slope, and b hat is the y-
intercept. The gray and red lines represent different possible linear
regression models that could attempt to explain the
relationship between x_0 and y. You can see that some lines
are a better fit than others. The better fitting
models capture the approximately
linear relationship where as x_0 increases, y also increases in
a linear fashion. The red line seems
especially good. Intuitively, there are not as many blue training
points that are very far above or very far below the
red linear model prediction. Let's take a look at
a very simple form of linear regression
model that just has one input variable or feature
to use for prediction. In this case, the vector x has just
a single component, which we'll call x_0. That's the input variable,
the input feature. In this case, because
there's just one variable, the predicted output is simply the product of the
weight w_0 with the input variable
x_0 plus a bias term b. x_0 is the value that's provided and
comes with the data, and so the parameters we
have to estimate during training are w_0 and b for
this linear regression model. This formula may look familiar. It's the formula for
a line
in terms of its slope. In this case, slope
corresponds to the weight w_0 and b
corresponds to the y-intercept, which we call the bias term. Here, the job of the model
is to take a point along the x-axis as input, that is, a value of x_0,
and predict the corresponding y value. w_0 corresponds to the
slope of this line, and b corresponds to the
y-intercept of the line. Finding these two parameters together defines a straight
line in this feature space. Now, the important
thing to remember is that there's a training phase
and a prediction phase. The training phase using
the training data is what we'll use to estimate w_0 and b. One widely used
method for estimating w and b for linear
regression problems is called least-squares
linear regression, also known as ordinary
least-squares. Least-squares linear regression
finds the line through this cloud of points
that minimizes what is called the mean squared
error of the model. The mean squared error of
the model is essentially the average of the squared
differences between the predicted target value and the actual target value over all
the points
in the training set. This plot illustrates
what that means. The blue points represent
points in the training set. The red line here represents the least-squares model
that was found through this cloud
of training points. These black lines show the difference between
the y value that was predicted for a
training point based on its x position and the actual y value of
the training point. For example, this point here, let's say, has an x
value of negative 1.75. If we plug it into the formula
for this linear model, we get a prediction here at this point
on the line which is somewhere around, let's say 60. But the actual observed value
in the training set for this point was
maybe closer to 10. In this case, for this
particular point, the squared difference between
the predicted target and the actual target would
be 60 minus 10 squared. We can do this calculation for every one of the points
in the training set. We can compute this
squared difference between the y value we observe
in the training set for a point and the y value
that would be predicted by the linear model given that
training point's x value. Then if we add all of these squared differences up
and divide by the number of training points, taking the average,
that will be the mean squared error of the model.
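Here's a minimal sketch of that calculation in Python. The training points and the candidate w and b values below are made-up placeholders, not the dataset shown in the plot:

```python
import numpy as np

# Made-up one-feature training data: x values and observed target values y
x = np.array([-1.75, -0.5, 0.8, 1.6])
y = np.array([10.0, 95.0, 160.0, 220.0])

# A candidate linear model: y_hat = w0 * x + b
w0, b = 45.0, 140.0

y_hat = w0 * x + b                 # predicted y for each training point
squared_diffs = (y - y_hat) ** 2   # squared difference for each point
mse = squared_diffs.mean()         # average them: the mean squared error
print(mse)
```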
The technique of least-squares is designed to find the slope, the w value, and the b value, the y-intercept, that minimize
this mean squared error. One thing to note about this linear regression model is
that there are
no parameters to control the model complexity. No matter what the
values of w and b, the result is always going
to be a straight line. This is both a strength and a weakness of the model
as we'll see later. When you have a moment, compare this simple linear model to the
more complex regression
model learned with k-nearest neighbors regression
on the same dataset. You can see that
linear models make a strong prior assumption about the relationship between
the input x and output y. Linear models may
seem simplistic, but for data with many features, linear models can be
very effective and generalize well to new data
beyond the training set. Now the question is, how exactly do we estimate the
linear model's w and b parameters so
the model is a good fit? Well, the w and b parameters are estimated using
the training data, and there are lots of
different methods for estimating w and b
depending on the criteria you'd like to use for the definition of
what a good fit to the training data is and how you want to control
model complexity. For linear models,
model complexity is based on the nature
of the weights w on the input features. Simpler linear models have a weight vector
w that's
closer to zero, i.e., where more features are either
not used at all (they have zero weight) or have less influence on the outcome
(they have a very small weight). Typically, given
possible settings for the model parameters, the learning algorithm predicts the
target value for
each training example, and then computes
what is called a loss function for each
training example. That's a penalty value for
incorrect predictions. A prediction is incorrect when the predicted target value is
different from the actual target value in
the training set. For example, a squared
loss function would return the squared difference between the predicted target value and the
actual value as the penalty. The learning algorithm then computes or searches
for the set of w, b parameters that minimize the total of this loss function
over all training points. The most popular way to estimate w and b parameters
is using what's called least squares linear regression or
ordinary least-squares. Least-squares finds
the values of w and b that minimize the total sum of squared differences between
the predicted y value and the actual y value
in the training set. Or equivalently, it minimizes the mean squared
error of the model. Least-squares is based on the squared loss function
mentioned before. This is illustrated graphically
here where I've zoomed in on the left lower portion of the simple
regression data set. The red line represents the least squares solution for w and b
through
the training data. The vertical lines
represent the difference between the actual y value
of a training point (x_i, y_i) and
its predicted y value given x_i, which lies on the
red line where x equals x_i. Adding up all the squared
values of these differences for all the training points gives
the total squared error. This is what the least
squares solution minimizes. Here there are no parameters
to control model complexity. The linear model
always uses all of the input variables and is always represented
by a straight line. Another name for this quantity is the residual sum of squares.
The actual target value is given in y_i and the predicted
y-hat value for the same training example
is given by the right side of the formula using
the linear model with parameters w and b.
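The formula being referred to is the standard residual sum of squares; with n training points (x_i, y_i), it reads:

```latex
RSS(w, b) = \sum_{i=1}^{n} \bigl( y_i - (w \cdot x_i + b) \bigr)^2
```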
Let's look at how to implement this in scikit-learn. Linear regression in
scikit-learn is implemented by the LinearRegression class in the sklearn.linear_model
module. As we did with other
estimators in scikit-learn, like the nearest neighbors classifier and the
regression models, we use the train_test_split
function on the original data
set and then create and fit the linear
regression object using the training data in X_train and the corresponding training
data target values in y_train. Here, note that we're doing the creation and fitting
of the linear regression object in one line by chaining the fit method with the
constructor for the new object. The linear regression fit method acts to estimate
the
feature weights w, which it calls the
coefficients of the model. These are stored in the coef_ attribute, and
the bias term b is stored in the
intercept_ attribute. Note that if a scikit-learn object's attribute ends
with an underscore, this means that these
attributes were derived from training data, and are not, say, quantities that
were set by the user. If we dump the coef_ and intercept_ attributes
for this simple example, we see that because there's only one input feature
variable, there's only one
element in the coef_ list, the value 45.7. The intercept_ attribute has
a value of about 148.4. We can see that indeed
these correspond to the red line shown
in the plot which has a slope of 45.7 and a
y-intercept of about 148.4.
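Here's a minimal sketch of the sequence just described. The dataset here is a synthetic placeholder standing in for the course's simple regression dataset, so the printed coefficient and intercept will not exactly match the 45.7 and 148.4 quoted above:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Placeholder one-feature regression dataset (stands in for the course's example data)
X, y = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create and fit the linear regression object in one line,
# chaining the fit method with the constructor
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w):', linreg.coef_)        # one weight per input feature
print('linear model intercept (b):', linreg.intercept_)
```

The score method of the fitted regression object returns the R-squared value used in the comparison below.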
Here's the same code in the notebook with additional code to score the quality of the
regression model in the same way that we did
for k nearest neighbors regression using the
R-squared metric. Here's the notebook
code we use to plot the least-squares linear
solution for this data set. Now that we've seen both
K-nearest neighbors regression and least-squares regression, it's interesting
to compare the least-squares linear
regression results with the K-nearest
neighbor results. Here we can see how these
two regression methods represent two complementary
types of supervised learning. The K nearest neighbor
regressor doesn't make a lot of assumptions about
the structure of the data, and it gives
potentially accurate but sometimes
unstable predictions that are sensitive to small
changes in the training data. It has a
correspondingly higher training set R-squared score compared to least squares
linear regression. K-NN achieves an R-squared
score of 0.72 and least-squares
achieves an R-squared of 0.679 on the training set. On the other hand, linear
models make strong assumptions about
the structure of the data, in other words, that the target value can be predicted
using a weighted sum of the input variables, and linear models give stable but
potentially
inaccurate predictions. However, in this case, it turns out that
the linear model's strong assumption that there's a linear relationship
between the input and output variables happens to be a good fit for this data set.
It's better at more
accurately predicting the y value for new x values that weren't
seen during training. We can see that
the linear model gets a slightly
better test set score of 0.492 versus 0.471 for
K-nearest neighbors. This indicates its
ability to better generalize and capture
this global linear trend.
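A minimal sketch of that comparison, again on a synthetic placeholder dataset; the exact R-squared numbers quoted above come from the course's own dataset and won't be reproduced here:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Placeholder one-feature regression dataset
X, y = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knnreg = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
linreg = LinearRegression().fit(X_train, y_train)

# score() returns the R-squared value for regression estimators
print('KNN    R^2 (train/test): {:.3f} / {:.3f}'.format(
    knnreg.score(X_train, y_train), knnreg.score(X_test, y_test)))
print('Linear R^2 (train/test): {:.3f} / {:.3f}'.format(
    linreg.score(X_train, y_train), linreg.score(X_test, y_test)))
```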