Lecture 10
Application of Python language for Data Science and Artificial Intelligence
Regression
(self-study lecture)
What Is Regression?
Regression searches for relationships among variables.
In regression analysis, you consider some phenomenon of interest and have a number of
observations.
Each observation has two or more features.
Following the assumption that at least one of the features depends on the others, you try to
establish a relation among them.
You need to find a function that maps some features or variables to others sufficiently well.
The dependent features are called the dependent variables, outputs, or responses.
The independent features are called the independent variables, inputs, regressors, or
predictors.
Regression problems usually have one continuous and unbounded dependent variable.
The inputs, however, can be continuous, discrete, or even categorical.
It’s a common practice to denote the outputs with 𝑦 and the inputs with 𝑥.
If there are two or more independent variables, then they can be represented as the vector 𝐱 =
(𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of inputs.
Linear Regression
Linear regression is probably one of the most important and widely used regression
techniques.
It’s among the simplest regression methods.
One of its main advantages is the ease of interpreting results.
Problem Formulation
When implementing linear regression of some dependent variable 𝑦 on the set of independent
variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship
between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀.
This equation is the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is
the random error.
Linear regression calculates the estimators of the regression coefficients or simply the
predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ.
These estimators define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ.
This function should capture the dependencies between the inputs and output sufficiently
well.
The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as
close as possible to the corresponding actual response 𝑦ᵢ.
The differences 𝑦ᵢ – 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals.
Regression is about determining the best predicted weights – that is, the weights
corresponding to the smallest residuals.
To get the best weights, you usually minimize the sum of squared residuals (SSR) for all
observations 𝑖 = 1, …, 𝑛:
SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))².
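As a concrete illustration, SSR can be computed directly with NumPy once the predicted responses are known; a minimal sketch with made-up values for y and y_pred:

import numpy as np

y = np.array([5.0, 20.0, 14.0, 32.0, 22.0, 38.0])       # illustrative actual responses
y_pred = np.array([8.3, 13.9, 19.6, 25.2, 30.8, 36.4])  # illustrative predicted responses

residuals = y - y_pred        # y_i - f(x_i)
ssr = np.sum(residuals**2)    # sum of squared residuals
print(ssr)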
Regression Performance
The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due to the dependence on the
predictors 𝐱ᵢ.
However, there’s also an additional inherent variance of the output.
The coefficient of determination, denoted as 𝑅², tells you how much of the variation in 𝑦 can
be explained by the dependence on 𝐱, using the particular regression model.
A larger 𝑅² indicates a better fit and means that the model can better explain the variation of
the output with different inputs.
The value 𝑅² = 1 corresponds to SSR = 0.
That’s the perfect fit, since the predicted and actual responses match each other exactly.
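As a concrete illustration, 𝑅² can be computed from SSR and the total sum of squares; a minimal sketch reusing made-up values for y and y_pred (in scikit-learn, LinearRegression.score() returns the same quantity):

import numpy as np

y = np.array([5.0, 20.0, 14.0, 32.0, 22.0, 38.0])       # illustrative actual responses
y_pred = np.array([8.3, 13.9, 19.6, 25.2, 30.8, 36.4])  # illustrative predicted responses

ssr = np.sum((y - y_pred)**2)    # sum of squared residuals
sst = np.sum((y - y.mean())**2)  # total sum of squares
r2 = 1 - ssr/sst                 # coefficient of determination
print(r2)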
In simple linear regression, with a single input 𝑥, the estimated regression function, usually drawn as a straight line through the data, has the equation:
𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥.
Your goal is to calculate the optimal values of the predicted weights 𝑏₀ and 𝑏₁ that minimize
SSR and determine the estimated regression function.
The value of 𝑏₀, also called the intercept, shows the point where the estimated regression line
crosses the 𝑦 axis. It’s the value of the estimated response 𝑓(𝑥) for 𝑥 = 0.
The residuals can be calculated as:
𝑦ᵢ - 𝑓(𝑥ᵢ) = 𝑦ᵢ - 𝑏₀ - 𝑏₁𝑥ᵢ for 𝑖 = 1, …, 𝑛.
Geometrically, they are the vertical distances between the observed points and the fitted regression line.
When you implement linear regression, you’re actually trying to minimize these distances and bring the fitted values as close to the observed responses as possible.
Task_01
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(0)
# 50 noisy observations with an approximately linear dependence y ≈ 2x - 1
x = 10*np.random.rand(50)
y = 2*x - 1 + np.random.rand(50)*6
plt.scatter(x, y, color='b')

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
X = x[:, np.newaxis]
print(X.shape)
model = LinearRegression(fit_intercept=True).fit(X, y)

# re-seed from system entropy so the new inputs differ between runs
np.random.seed()
xnew = 10*np.random.rand(10); print(xnew.shape)
ypred = model.predict(xnew.reshape(-1, 1))
plt.scatter(xnew, ypred, color='r')
plt.savefig('fig_01.png')
plt.show()
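The later tasks also ask for the parameters of the regression equation; a short addition (not part of the original listing) that reads them from the fitted Task_01 model:

print(f"intercept b0: {model.intercept_:.3f}")
print(f"slope b1: {model.coef_[0]:.3f}")
print(f"R^2 on the training data: {model.score(X, y):.3f}")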
Task_02
Repeat building the linear regression model for the data below.
Data (as np.array):
x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])
xnew = np.arange(5)
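A possible solution sketch for Task_02, following the same pattern as Task_01:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])
X = x.reshape(-1, 1)                    # column vector for scikit-learn

model = LinearRegression().fit(X, y)
print("b0 =", model.intercept_, "b1 =", model.coef_[0])

xnew = np.arange(5)
ypred = model.predict(xnew.reshape(-1, 1))

plt.scatter(x, y, color='b')
plt.scatter(xnew, ypred, color='r')
plt.show()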
Task_03
Generate a set of observations (x, y) with 127 elements (a possible solution sketch follows this list).
x values should be in the range [-10,10].
y values should depend linearly on x plus a random factor.
• Draw a scatter plot for these observations.
• Draw a linear regression line for these observations.
• Print the parameters of the regression equation.
• Generate 23 new values of the independent variable x.
• For these values, calculate the predicted value of y.
• On the graph, mark the new observations x and the predicted y values for them.
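A possible sketch for Task_03, assuming the x values are drawn uniformly from [-10, 10] and y = 3x + 2 plus uniform noise; the specific coefficients, noise, and seed are illustrative choices, not part of the task:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 127)           # 127 observations in [-10, 10]
y = 3*x + 2 + rng.uniform(-4, 4, 127)   # assumed linear dependence plus noise

model = LinearRegression().fit(x.reshape(-1, 1), y)
print("b0 =", model.intercept_, "b1 =", model.coef_[0])

xline = np.linspace(-10, 10, 100)
plt.scatter(x, y, color='b')                                # observations
plt.plot(xline, model.predict(xline.reshape(-1, 1)), 'k-')  # regression line

xnew = rng.uniform(-10, 10, 23)                             # 23 new inputs
plt.scatter(xnew, model.predict(xnew.reshape(-1, 1)), color='r')
plt.show()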
Task_04
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
# two inputs and a response with an approximately linear dependence y ≈ 2*x1 + 3*x2 - 3
x1 = 10*np.random.rand(100)
x2 = 10*np.random.rand(100)
y = 2*x1 + 3*x2 - 3 + np.random.rand(100)*10

ax = plt.axes(projection='3d')
ax.scatter3D(x1, x2, y, c=y, cmap='PuBu')

# stack the two inputs into a (100, 2) feature matrix and fit the model
x = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
model = LinearRegression().fit(x, y)

# regular grid over the input domain, to be used for plotting the fitted plane
X1 = np.linspace(0, 10, 100)
X2 = np.linspace(0, 10, 100)
XX1, XX2 = np.meshgrid(X1, X2)
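The original listing stops after building the grid; one way it could be finished is to predict the response on every grid point and draw the fitted plane (this continuation is an assumption, not part of the original code):

# predict the response on every grid point and reshape back to the grid
grid = np.hstack((XX1.reshape(-1, 1), XX2.reshape(-1, 1)))
YY = model.predict(grid).reshape(XX1.shape)

ax.plot_surface(XX1, XX2, YY, alpha=0.3, color='r')  # fitted regression plane
print("b0 =", model.intercept_, "b1, b2 =", model.coef_)
plt.show()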
Task_05
Repeat building the linear regression model for the data below.
x1 = [0, 5, 15, 25, 35, 45, 55, 60]
x2 = [1, 1, 2, 5, 11, 15, 34, 35]
y = [4, 5, 20, 14, 32, 22, 38, 43]
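A possible solution sketch for Task_05, reusing the multiple-regression pattern from Task_04:

import numpy as np
from sklearn.linear_model import LinearRegression

x1 = np.array([0, 5, 15, 25, 35, 45, 55, 60])
x2 = np.array([1, 1, 2, 5, 11, 15, 34, 35])
y = np.array([4, 5, 20, 14, 32, 22, 38, 43])

# feature matrix with one row per observation and one column per input
X = np.column_stack((x1, x2))
model = LinearRegression().fit(X, y)
print("b0 =", model.intercept_, "b1, b2 =", model.coef_)
print("R^2 =", model.score(X, y))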
Task_06
Generate a set of observations (x, y), where x = (x1, x2), with 134 elements (a possible solution sketch follows this list).
x values should be in the range [-10,10].
y values should depend linearly on x plus a random factor.
• Draw a scatter plot for these observations.
• Draw the regression surface for these observations.
• Print the parameters of the regression equation.
• Generate 32 new values of the independent variable x.
• For these values, calculate the predicted value of y.
• On the graph, mark the new observations x and the predicted y values for them.
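A possible sketch for Task_06, assuming both inputs are drawn uniformly from [-10, 10] and y = 2·x1 - x2 + 5 plus noise; the coefficients, noise, and seed are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, (134, 2))                     # 134 observations of (x1, x2)
y = 2*X[:, 0] - X[:, 1] + 5 + rng.uniform(-3, 3, 134)  # assumed linear dependence plus noise

model = LinearRegression().fit(X, y)
print("b0 =", model.intercept_, "b1, b2 =", model.coef_)

ax = plt.axes(projection='3d')
ax.scatter3D(X[:, 0], X[:, 1], y, color='b')

# regression surface over a grid
g = np.linspace(-10, 10, 30)
G1, G2 = np.meshgrid(g, g)
Yhat = model.predict(np.column_stack((G1.ravel(), G2.ravel()))).reshape(G1.shape)
ax.plot_surface(G1, G2, Yhat, alpha=0.3, color='k')

# 32 new inputs and their predicted responses, marked in red
Xnew = rng.uniform(-10, 10, (32, 2))
ax.scatter3D(Xnew[:, 0], Xnew[:, 1], model.predict(Xnew), color='r')
plt.show()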
Polynomial Regression
You can regard polynomial regression as a generalized case of linear regression.
You assume the polynomial dependence between the output and inputs and, consequently, the
polynomial estimated regression function.
In addition to linear terms like 𝑏₁𝑥₁, regression function 𝑓 can include nonlinear terms such as
𝑏₂𝑥₁², 𝑏₃𝑥₁³, 𝑏₄𝑥₁𝑥₂, 𝑏₅𝑥₁²𝑥₂, ...
The simplest example of polynomial regression has a single independent variable, and the
estimated regression function is a polynomial of degree two:
𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥 + 𝑏₂𝑥².
Because this estimated regression function is still linear in the weights, you can solve the polynomial regression problem as a linear problem, with the term 𝑥² regarded as an additional input variable.
In the case of two variables and the polynomial of degree two, the regression function has this
form:
𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ + 𝑏₃𝑥₁² + 𝑏₄𝑥₁𝑥₂ + 𝑏₅𝑥₂².
The procedure for solving the problem is identical to the previous case.
You apply linear regression for five inputs: 𝑥₁, 𝑥₂, 𝑥₁², 𝑥₁𝑥₂, and 𝑥₂².
As the result of regression, you get the values of six weights that minimize SSR: 𝑏₀, 𝑏₁, 𝑏₂, 𝑏₃,
𝑏₄, and 𝑏₅.
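In scikit-learn this expansion is what PolynomialFeatures produces; a minimal sketch with illustrative data, showing the five generated inputs before an ordinary LinearRegression is fitted:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 3], [3, 5], [4, 4], [5, 7], [6, 6]])  # columns: x1, x2 (illustrative)
y = np.array([3, 8, 20, 19, 42, 40])                            # illustrative responses

poly = PolynomialFeatures(degree=2, include_bias=False)
X_ = poly.fit_transform(X)             # columns: x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out())    # feature names (scikit-learn >= 1.0)

model = LinearRegression().fit(X_, y)  # b0 = intercept_, b1..b5 = coef_
print(model.intercept_, model.coef_)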
Task_07
Nonlinear (polynomial) regression with one explanatory variable and a polynomial of degree two.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

# the original listing omits the data and the fit; the lines below are an assumed reconstruction
x = np.arange(-5, 6, dtype=float)                # illustrative observations
y = 2*x**2 - 3*x + 1 + np.random.rand(x.size)*5  # quadratic dependence plus noise

x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(x_, y)

y_pred = model.predict(x_)
print(f"predicted response:\n{y_pred}")
print(f"y=\n{y}")
plt.scatter(x, y, color='b')
plt.scatter(x, y_pred, color='r')
plt.show()
Task_08
A variant of Task_07.
Generate a set of observations (x, y) with 33 elements (a possible solution sketch follows this list).
• Draw a scatter plot for these observations.
• Build nonlinear (polynomial) regression.
• Print the parameters of the regression equation.
• Draw a plot of this nonlinear regression.
• Generate 11 new values of the independent variable x.
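A possible sketch for Task_08, assuming a quadratic dependence with uniform noise; the function, ranges, and seed are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 33)                     # 33 observations
y = 0.5*x**2 + x - 2 + rng.uniform(-1, 1, 33)  # assumed quadratic dependence plus noise

poly = PolynomialFeatures(degree=2, include_bias=False)
x_ = poly.fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(x_, y)
print("b0 =", model.intercept_, "b1, b2 =", model.coef_)

xline = np.linspace(-5, 5, 200)
plt.scatter(x, y, color='b')
plt.plot(xline, model.predict(poly.transform(xline.reshape(-1, 1))), 'k-')

xnew = rng.uniform(-5, 5, 11)                  # 11 new inputs
plt.scatter(xnew, model.predict(poly.transform(xnew.reshape(-1, 1))), color='r')
plt.show()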