
Unit-3: Linear Models for Regression

Dr. Anil B. Gavade


June 1, 2024

Introduction to Linear Models for Regression


Understanding Regression Analysis
Regression analysis aims to model the relationship between a dependent vari-
able (response) and one or more independent variables (predictors). The goal
is to predict the value of the dependent variable based on the values of the
independent variables.

Linear Models
Linear models assume that the relationship between the independent variables
and the dependent variable is linear. This means that changes in the indepen-
dent variables result in a proportional change in the dependent variable.
A linear model is represented as:

y = β0 + β1 x1 + β2 x2 + ... + βn xn + ϵ
Where:
• y is the dependent variable.
• x1 , x2 , ..., xn are the independent variables.
• β0 , β1 , ..., βn are the coefficients (parameters) of the model.
• ϵ is the error term, representing the difference between the observed and
predicted values of the dependent variable.
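To make the notation concrete, the following minimal NumPy sketch generates synthetic data from a two-predictor linear model of exactly this form; the coefficient values and noise level are arbitrary illustrative choices, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)
n = 100

x1 = rng.uniform(0, 10, n)             # first independent variable
x2 = rng.uniform(0, 5, n)              # second independent variable
eps = rng.normal(0.0, 1.0, n)          # error term epsilon

beta0, beta1, beta2 = 1.5, 2.0, -0.7   # hypothetical coefficients
y = beta0 + beta1 * x1 + beta2 * x2 + eps   # dependent variable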

Importance of Linear Models


Linear models offer several advantages that make them widely used in regression
analysis:
1. Interpretability: The coefficients in a linear model provide direct inter-
pretation of the relationship between the independent variables and the
dependent variable.
2. Simplicity: Linear models are simple and easy to understand, mak-
ing them suitable for explaining relationships between variables to non-
technical audiences.

3. Efficiency: Linear models can be estimated efficiently even with large
datasets, making them computationally tractable for many applications.
4. Flexibility: Despite their name, linear models can capture nonlinear re-
lationships between variables by including appropriate transformations or
interactions of the independent variables.

Applications of Linear Models


Linear models find applications across various domains, including:

• Economics: Modeling the relationship between variables such as income, expenditure, and GDP.
• Finance: Predicting stock prices based on financial indicators.
• Healthcare: Predicting patient outcomes based on clinical variables.
• Marketing: Analyzing the impact of marketing campaigns on sales.

Linear models for regression provide a powerful and versatile framework for
modeling the relationship between variables in a wide range of applications.
By understanding the principles of linear regression, practitioners can make
informed decisions and extract valuable insights from their data. In the fol-
lowing sections, we will delve deeper into specific aspects of linear regression,
including model specification, estimation techniques, and advanced topics such
as regularization and Bayesian methods.

Model Specification
In the context of linear basis function models, model specification refers to the
process of defining the structure and components of the regression model. This
involves selecting the appropriate basis functions, determining the number of
parameters, and establishing how the input variables are related to the output
variable.
1. Basis Functions Selection:
The choice of basis functions is critical as it determines the flexibility and
expressiveness of the model. Basis functions transform the input features into a
higher-dimensional space where the relationship between the input and output
variables becomes linear. Common choices for basis functions include the following (sketched in code after the list):

• Polynomial Basis Functions: These are functions of the form ϕ_j(x) = x^j, where j is the degree of the polynomial. Polynomial basis functions can capture nonlinear relationships in the data.
• Radial Basis Functions (RBF): RBFs are centered around specific points in the input space and decrease with distance from these points. They are particularly useful for capturing local variations in the data.
• Sigmoidal Basis Functions: Sigmoidal functions, such as the logistic function or the hyperbolic tangent function, can transform input variables into a bounded range, making them suitable for modeling binary or categorical outcomes.
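A minimal Python sketch of these three families of basis functions; the centre, width, and slope values are arbitrary illustrative choices.

import numpy as np

def polynomial_basis(x, j):
    # phi_j(x) = x**j
    return x ** j

def gaussian_rbf(x, centre, width):
    # Decreases with distance from the centre; captures local structure
    return np.exp(-((x - centre) ** 2) / (2.0 * width ** 2))

def sigmoidal_basis(x, centre, slope):
    # Logistic function of a shifted, scaled input; output lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-(x - centre) / slope))

x = np.linspace(-3, 3, 7)
print(polynomial_basis(x, 2))
print(gaussian_rbf(x, centre=0.0, width=1.0))
print(sigmoidal_basis(x, centre=0.0, slope=0.5))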

2. Number of Basis Functions:


The number of basis functions determines the complexity of the model. A
larger number of basis functions can lead to a more flexible model capable of
capturing intricate patterns in the data. However, too many basis functions may
result in overfitting, where the model learns noise in the training data rather
than the underlying relationship.
3. Input-Output Relationship:
Once the basis functions are chosen, the model specification involves defining
how the input variables are related to the output variable. This is typically
done by specifying the linear combination of basis functions that approximate
the target function. The model is represented as:
y(x, w) = Σ_{j=0}^{M−1} w_j ϕ_j(x)

where x is the input vector, w is the vector of parameters to be learned, ϕ_j(x) are the basis functions, and M is the number of basis functions.
Example:
Let’s consider a simple example of modeling the relationship between a single
input variable x and an output variable y using polynomial basis functions. We
can specify the model as:

y(x, w) = w_0 + w_1 x + w_2 x^2 + . . . + w_{N−1} x^{N−1}

Here, N represents the number of polynomial terms included in the model, and w_i are the parameters to be estimated from the data.
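A minimal sketch of this polynomial specification, assuming N = 4 terms and arbitrary example weights: the design matrix collects the basis functions ϕ_j(x) = x^j as columns, and y(x, w) is the corresponding linear combination.

import numpy as np

x = np.linspace(0.0, 1.0, 5)           # example inputs
N = 4                                  # number of polynomial terms

# Design matrix with columns phi_j(x) = x**j for j = 0, ..., N-1
Phi = np.column_stack([x ** j for j in range(N)])

w = np.array([0.5, -1.0, 2.0, 0.3])    # arbitrary example weights
y = Phi @ w                            # y(x, w) = sum_j w_j * phi_j(x)
print(y)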
Model specification is a crucial step in building linear basis function models
for regression. By carefully selecting appropriate basis functions, determining
the number of parameters, and defining the input-output relationship, we can
create models that accurately capture the underlying patterns in the data and
make reliable predictions. However, it’s essential to strike a balance between
model complexity and generalization performance to avoid overfitting.

Maximum Likelihood Estimation (MLE)


Maximum Likelihood Estimation (MLE) is a statistical method used to estimate
the parameters of a probability distribution based on observed data. In the
context of linear models for regression, MLE is commonly employed to determine
the optimal parameters that best fit the observed data.

Introduction to MLE
The fundamental idea behind maximum likelihood estimation is to find the set
of parameters for a statistical model that maximizes the likelihood function given the observed data. The likelihood function represents the probability
of observing the data given the model parameters. Maximizing this function
essentially means finding the parameter values that make the observed data the
most likely under the assumed model.

Mathematical Formulation
Consider a dataset consisting of N observations {(x_n, y_n)}_{n=1}^N, where x_n represents the input vector and y_n represents the corresponding target value. In the context of linear regression, the model assumes that the target variable y is related to the input variables x through a linear relationship:

y(x, w) = w^T x + ϵ
where w is the vector of parameters to be estimated, and ϵ represents the
random error term.
The likelihood function L(w) is defined as the joint probability density function (or mass function) of observing the target values {y_n}_{n=1}^N given the input data {x_n}_{n=1}^N and the parameter vector w:

L(w) = p({y_n}_{n=1}^N | {x_n}_{n=1}^N, w)

Maximizing the Likelihood


To find the maximum likelihood estimate ŵ, we seek the parameter vector w
that maximizes the likelihood function L(w):

ŵ = arg max_w L(w)

In practice, it is often more convenient to work with the log-likelihood function, denoted as ℓ(w), which is the natural logarithm of the likelihood function:

ℓ(w) = log L(w)


Maximizing the log-likelihood is mathematically equivalent to maximizing
the likelihood itself. This simplifies the optimization process and ensures nu-
merical stability.

Practical Considerations
In real-world applications, optimization techniques such as gradient-based meth-
ods (e.g., gradient descent, Newton’s method) or numerical optimization al-
gorithms (e.g., BFGS, L-BFGS) are typically employed to find the maximum
likelihood estimate. Additionally, regularization techniques may be applied to
prevent overfitting and improve generalization performance.
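As a rough sketch of this workflow, assuming Gaussian noise with unit variance (so the negative log-likelihood reduces, up to an additive constant, to half the sum of squared residuals), SciPy's L-BFGS-B routine can be used to find the maximum likelihood estimate; the synthetic data and true weights below are purely illustrative.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # input vectors x_n
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=1.0, size=200)

def neg_log_likelihood(w):
    # With unit noise variance, the Gaussian negative log-likelihood is,
    # up to a constant, half the sum of squared residuals
    resid = y - X @ w
    return 0.5 * np.dot(resid, resid)

def gradient(w):
    return -X.T @ (y - X @ w)

result = minimize(neg_log_likelihood, x0=np.zeros(3),
                  jac=gradient, method="L-BFGS-B")
w_mle = result.x                               # maximum likelihood estimate
print(w_mle)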
Maximum Likelihood Estimation is a powerful and widely used method for
estimating the parameters of statistical models, including linear regression mod-
els. By maximizing the likelihood function, MLE provides a principled approach
to parameter estimation that is grounded in probability theory and statistics.

Least Squares
Least squares is a method used to estimate the parameters of a linear model by
minimizing the sum of squared errors between the observed target values and
the model predictions.
The objective function to minimize is the sum of squared errors:
J(w) = Σ_{i=1}^{N} (y_i − x_i^T w)^2

To find the least squares estimate ŵ, we take the derivative of J(w) with
respect to w and set it equal to zero:

∂J(w)/∂w = −2X^T (y − Xw) = 0

Solving this equation yields the least squares estimate ŵ = (X^T X)^{−1} X^T y.
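A minimal NumPy sketch of this closed-form estimate on illustrative synthetic data; in practice np.linalg.lstsq (or a QR/SVD-based solver) is numerically preferable to forming X^T X explicitly.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# Normal equations: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, more numerically stable solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)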

Comparison
Maximum likelihood estimation and least squares are closely related, especially
in the case of Gaussian errors. MLE assumes a probabilistic model for the errors,
while least squares directly minimizes the error between observed and predicted
values. In the case of Gaussian noise, MLE is equivalent to least squares, making
least squares a special case of MLE. However, MLE can be more flexible and
applicable to situations with non-Gaussian noise distributions.

Geometry of Least Squares


In linear regression, the goal is to find the best-fitting line or hyperplane that
minimizes the sum of squared errors between the observed target values and
the predicted values from the model. Geometrically, this can be understood as
finding the projection of the target vector onto the subspace spanned by the
basis functions.
Consider a dataset consisting of N data points, where each data point is
represented by a vector xi and a target value yi . We can represent the dataset
as a matrix X with N rows and M columns, where M is the number of features
or basis functions. Let y be the target vector.
The linear regression model can be represented as:

y = Xw + ϵ

where w is the vector of model parameters and ϵ is the vector of errors.


The least squares solution seeks to minimize the squared error:
J(w) = (1/2) ∥y − Xw∥_2^2

This optimization problem can be solved analytically by setting the gradient
of J(w) with respect to w to zero:

X^T (Xw − y) = 0

Solving this equation yields the optimal parameter vector w∗ :

w* = (X^T X)^{−1} X^T y

Geometrically, X^T X is the Gram matrix of the features (proportional to their covariance matrix when the features are centered), and X^T y measures how each feature co-varies with the target variable. The fitted vector Xw* is the orthogonal projection of y onto the subspace spanned by the columns of X, and w* gives the coordinates of that projection in the basis formed by those columns.
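The projection interpretation can be checked numerically: the residual y − Xw* is orthogonal to every column of X. A brief sketch with illustrative random data:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)

w_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w_star              # projection of y onto the column space of X
residual = y - y_hat

# Orthogonality: X^T (y - X w*) should be (numerically) zero
print(X.T @ residual)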

Sequential Learning
Sequential learning techniques involve updating model parameters incremen-
tally as new data becomes available. This is particularly beneficial for online
learning scenarios where data arrives sequentially and the model needs to adapt
to changing conditions.
One common approach to sequential learning is stochastic gradient descent
(SGD), where model parameters are updated after processing each data point.
This allows the model to quickly adapt to new observations and can be compu-
tationally efficient for large datasets.
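A minimal sketch of stochastic gradient descent for linear regression with squared-error loss; the learning rate, number of epochs, and synthetic data are illustrative choices, not prescriptions.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, 0.5, -2.0])
y = X @ w_true + rng.normal(scale=0.1, size=500)

w = np.zeros(3)        # initial parameters
eta = 0.01             # learning rate

for epoch in range(5):
    for i in rng.permutation(len(y)):   # process one data point at a time
        error = X[i] @ w - y[i]
        w -= eta * error * X[i]         # gradient step for a single sample
print(w)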
Another approach is online learning with regularized least squares, where
model parameters are updated using a combination of the new data and infor-
mation from previous observations. This helps prevent overfitting and ensures
that the model remains stable over time.

Regularized Least Squares


Regularized least squares methods introduce a penalty term to the least squares
criterion to prevent overfitting. Two commonly used regularization techniques
are Ridge regression (L2 regularization) and Lasso regression (L1 regulariza-
tion).
In Ridge regression, the penalty term is proportional to the squared magni-
tude of the parameter vector:
J_Ridge(w) = (1/2) ∥y − Xw∥_2^2 + λ∥w∥_2^2
where λ is the regularization parameter. Ridge regression shrinks the pa-
rameter estimates towards zero, effectively reducing the model’s complexity and
preventing overfitting.
In Lasso regression, the penalty term is proportional to the absolute magni-
tude of the parameter vector:
J_Lasso(w) = (1/2) ∥y − Xw∥_2^2 + λ∥w∥_1

Lasso regression encourages sparsity in the parameter estimates, leading to
a simpler model with fewer non-zero coefficients. This makes Lasso regression
particularly useful for feature selection and model interpretability.
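Ridge regression admits a closed-form solution, whereas Lasso has no closed form and is typically solved iteratively (for example by coordinate descent). A minimal sketch of the Ridge estimate implied by the objective above, with an illustrative λ:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

lam = 0.1   # regularization parameter lambda (illustrative)

# For J(w) = 0.5*||y - Xw||^2 + lam*||w||^2, setting the gradient to zero
# gives (X^T X + 2*lam*I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)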

Multiple Outputs
In linear regression, the target variable can have multiple dimensions or outputs.
This scenario arises when predicting multiple related variables simultaneously,
such as predicting the prices of multiple stocks or the coordinates of multiple
points in space.
Multiple output regression can be addressed using several approaches:
1. Independently Model Each Output: Treat each output variable inde-
pendently and fit a separate linear regression model to each output dimension.
This approach ignores any relationships or dependencies between the output
variables.
2. Jointly Model All Outputs: Fit a single regression model that jointly
predicts all output variables simultaneously. This approach captures correla-
tions between the output variables and can lead to more accurate predictions,
especially when the outputs are related.
3. Hierarchical Modeling: Hierarchical models capture dependencies be-
tween output variables by introducing hierarchical structures or constraints on
the model parameters. This approach is useful when there are known relation-
ships or dependencies between subsets of output variables.
Overall, multiple output regression extends the concepts of linear regres-
sion to handle multidimensional target variables, offering flexibility in modeling
complex relationships between inputs and outputs.
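For the joint-fitting approach, the least squares machinery carries over directly by stacking the outputs as columns of a target matrix Y; note that this plain least squares form is equivalent to fitting each output independently, and capturing cross-output correlations requires richer joint models. A minimal sketch with two illustrative output dimensions:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))          # shared inputs
W_true = rng.normal(size=(3, 2))       # one weight column per output dimension
Y = X @ W_true + rng.normal(scale=0.1, size=(100, 2))   # two target variables

# Same closed form as the single-output case, applied column-wise:
# W_hat = (X^T X)^{-1} X^T Y
W_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(W_hat.shape)     # (3, 2): one column of weights per output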

Bayesian Linear Regression


Bayesian linear regression offers a probabilistic framework for estimating model
parameters, making predictions, and quantifying uncertainty.

Parameter Distribution
In Bayesian linear regression, the parameters of the linear model are treated
as random variables with prior distributions. These prior distributions capture
our beliefs about the parameters before observing any data. For example, we
might assume Gaussian (normal) distributions for the parameters, with means
and variances reflecting our prior knowledge or assumptions.
When new data is observed, Bayesian inference updates the prior distribu-
tions to posterior distributions using Bayes’ theorem. The posterior distribu-
tions represent our updated beliefs about the parameters after considering the
observed data. They combine the information from the prior distributions with
the likelihood of the observed data given the parameters.

Mathematically, if w represents the vector of parameters, p(w) is the prior
distribution, p(y|X, w) is the likelihood function, and p(w|X, y) is the posterior
distribution, Bayes’ theorem states:

p(w|X, y) = p(y|X, w) · p(w) / p(y|X)

where X is the matrix of input features and y is the vector of observed target
values.
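A minimal sketch of this update in the standard conjugate setting, assuming a zero-mean isotropic Gaussian prior p(w) = N(0, α⁻¹I) and Gaussian noise with precision β; the values of α and β and the synthetic data are illustrative.

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))           # design matrix of input features
w_true = np.array([1.0, -0.5, 2.0])
y = X @ w_true + rng.normal(scale=0.5, size=50)

alpha = 2.0     # prior precision
beta = 4.0      # noise precision (1 / noise variance)

# Posterior p(w | X, y) = N(m_N, S_N) in the conjugate Gaussian setting:
#   S_N^{-1} = alpha * I + beta * X^T X
#   m_N      = beta * S_N X^T y
S_N_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ X.T @ y
print(m_N)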

Predictive Distribution
The predictive distribution in Bayesian linear regression gives the distribution
of the target variable y given new input data x. It incorporates uncertainty not
only in the parameters but also in the observation noise.
The predictive distribution is obtained by integrating over the posterior dis-
tribution of the parameters:
p(y|x, X, y) = ∫ p(y|x, w) · p(w|X, y) dw

where p(y|x, w) is the likelihood function, representing the distribution of y given x and w.
The predictive distribution provides a probabilistic framework for making
predictions. Instead of just predicting a single value for y, it gives a distribution
of possible values, reflecting the uncertainty in the prediction.
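In the same conjugate Gaussian setting, this integral has a closed form: the predictive distribution for a new input x is Gaussian with mean m_N^T x and variance 1/β + x^T S_N x, combining observation noise with parameter uncertainty. A brief self-contained sketch (illustrative values):

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.5, size=50)

alpha, beta = 2.0, 4.0                          # illustrative prior / noise precisions
S_N = np.linalg.inv(alpha * np.eye(3) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y

x_new = np.array([0.3, -1.2, 0.8])              # a new input point
pred_mean = m_N @ x_new                         # mean of p(y | x, X, y)
pred_var = 1.0 / beta + x_new @ S_N @ x_new     # noise + parameter uncertainty
print(pred_mean, pred_var)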

Equivalent Kernel
In some cases, the predictive distribution in Bayesian linear regression can be
expressed in terms of an equivalent kernel. This kernel captures the similarities
between input points in the feature space and is useful for understanding the
model’s behavior.
The equivalent kernel represents the inner product between the feature vec-
tors in the feature space. It allows us to compute predictions without explicitly
calculating the posterior distribution over the parameters.
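Under the same Gaussian assumptions, the predictive mean can be written as y(x) = Σ_n k(x, x_n) y_n with equivalent kernel k(x, x') = β ϕ(x)^T S_N ϕ(x'). A brief sketch checking this identity numerically (illustrative values):

import numpy as np

rng = np.random.default_rng(7)
Phi = rng.normal(size=(40, 3))                  # design matrix of basis functions
y = Phi @ np.array([0.5, 1.0, -1.5]) + rng.normal(scale=0.3, size=40)

alpha, beta = 1.0, 10.0
S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

phi_x = np.array([0.2, -0.4, 1.1])              # features of a new input

# Equivalent kernel: k(x, x_n) = beta * phi(x)^T S_N phi(x_n)
k = beta * phi_x @ S_N @ Phi.T                  # one weight per training point

# The two expressions for the predictive mean agree
print(k @ y, m_N @ phi_x)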
Understanding the equivalent kernel provides insights into how the model
generalizes to new data and can guide the selection of appropriate basis functions
or regularization techniques.
In summary, Bayesian linear regression offers a powerful probabilistic frame-
work for regression analysis. By treating model parameters as random variables
and incorporating uncertainty in predictions, it provides more robust and inter-
pretable results compared to traditional regression methods. The predictive dis-
tribution and equivalent kernel are key concepts that enhance our understanding
of the model’s behavior and facilitate model evaluation and interpretation.

Bayesian Model Comparison
Bayesian model comparison enables the comparison of different models based
on their posterior probabilities given observed data.

The Evidence Approximation


The evidence function, also known as the marginal likelihood, quantifies the
goodness-of-fit of a model to the data. It integrates the likelihood function with
respect to the prior distribution over the parameters, providing a measure of
how well the model explains the observed data.

Evaluation of the Evidence Function


The evidence function is computed by integrating the likelihood function with
respect to the prior distribution over the parameters. This integration accounts
for the uncertainty in the parameters and penalizes overly complex models.
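A hedged sketch of this computation in the conjugate Gaussian setting, using the standard closed-form log marginal likelihood ln p(y|α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln|A| − (N/2) ln 2π, where A = αI + βΦ^TΦ and E(m_N) = (β/2)∥y − Φm_N∥^2 + (α/2) m_N^T m_N; the hyperparameter values and data below are illustrative.

import numpy as np

def log_evidence(Phi, y, alpha, beta):
    # Closed-form log marginal likelihood for Bayesian linear regression
    # with prior N(0, alpha^{-1} I) and noise precision beta
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ y)
    E_mN = 0.5 * beta * np.sum((y - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(8)
Phi = rng.normal(size=(60, 4))
y = Phi @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=60)

# Comparing candidate hyperparameter settings by their evidence
for alpha in (0.1, 1.0, 10.0):
    print(alpha, log_evidence(Phi, y, alpha=alpha, beta=10.0))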

Maximizing the Evidence Function


Maximizing the evidence function with respect to model parameters allows for
automatic model selection and hyperparameter tuning. Models with higher evi-
dence values are favored, indicating better overall fit to the data while penalizing
model complexity.

Effective Number of Parameters


The effective number of parameters quantifies the complexity of a model and
penalizes overly complex models in the evidence approximation. It takes into
account both the number of parameters and their uncertainty, providing a more
accurate measure of model complexity.
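In the same Gaussian setting, this quantity takes the standard form γ = Σ_i λ_i / (α + λ_i), where λ_i are the eigenvalues of βΦ^TΦ: directions with λ_i much larger than α are well determined by the data, while directions with λ_i much smaller than α are effectively fixed by the prior. A brief sketch (illustrative values):

import numpy as np

rng = np.random.default_rng(9)
Phi = rng.normal(size=(60, 4))
alpha, beta = 1.0, 10.0

# Eigenvalues of beta * Phi^T Phi determine how many directions
# in parameter space are well constrained by the data
eigvals = np.linalg.eigvalsh(beta * Phi.T @ Phi)
gamma = np.sum(eigvals / (alpha + eigvals))   # effective number of parameters
print(gamma, "out of", Phi.shape[1])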

Limitations of Fixed Basis Functions


Fixed basis function models, such as polynomial regression, may struggle to
capture complex nonlinear relationships in the data. They rely on a fixed set of
basis functions, limiting their flexibility and ability to model intricate patterns
in the data.
