ML Unit 3
Linear Models
Linear models assume that the relationship between the independent variables
and the dependent variable is linear. This means that changes in the indepen-
dent variables result in a proportional change in the dependent variable.
A linear model is represented as:
y = β0 + β1 x1 + β2 x2 + ... + βn xn + ϵ
Where:
• y is the dependent variable.
• x1 , x2 , ..., xn are the independent variables.
• β0 , β1 , ..., βn are the coefficients (parameters) of the model.
• ϵ is the error term, representing the difference between the observed and
predicted values of the dependent variable.
Linear models offer several practical advantages, including:
3. Efficiency: Linear models can be estimated efficiently even with large
datasets, making them computationally tractable for many applications.
4. Flexibility: Despite their name, linear models can capture nonlinear re-
lationships between variables by including appropriate transformations or
interactions of the independent variables.
Linear models for regression provide a powerful and versatile framework for
modeling the relationship between variables in a wide range of applications.
By understanding the principles of linear regression, practitioners can make
informed decisions and extract valuable insights from their data. In the fol-
lowing sections, we will delve deeper into specific aspects of linear regression,
including model specification, estimation techniques, and advanced topics such
as regularization and Bayesian methods.
Model Specification
In the context of linear basis function models, model specification refers to the
process of defining the structure and components of the regression model. This
involves selecting the appropriate basis functions, determining the number of
parameters, and establishing how the input variables are related to the output
variable.
1. Basis Functions Selection:
The choice of basis functions is critical as it determines the flexibility and
expressiveness of the model. Basis functions transform the input features into a
higher-dimensional space where the relationship between the input and output
variables becomes linear. Common choices for basis functions include:
• Polynomial basis functions, which use powers of the input variables and allow the model to capture smooth nonlinear trends.
• Sigmoidal basis functions, which map the inputs into a bounded range, making them suitable for modeling binary or categorical outcomes.
For example, with polynomial basis functions of a single input x, the model takes the form:
y(x, w) = w_0 + w_1 x + w_2 x^2 + . . . + w_{N-1} x^{N-1}
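As a brief illustrative sketch (not part of the original notes; the degree and the synthetic data are arbitrary choices), the snippet below builds a polynomial basis expansion of a one-dimensional input and fits the weights by ordinary least squares:

import numpy as np

# Hypothetical one-dimensional data with a nonlinear trend.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)

# Polynomial basis expansion: columns are x^0, x^1, ..., x^degree.
degree = 5
Phi = np.vander(x, N=degree + 1, increasing=True)

# The model is linear in w even though it is nonlinear in x,
# so the weights can be found with an ordinary least squares solve.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

The key point is that the nonlinearity lives entirely in the basis expansion; the estimation problem itself remains linear in the parameters w.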
Introduction to MLE
The fundamental idea behind maximum likelihood estimation is to find the set
of parameters for a statistical model that maximizes the likelihood function
given the observed data. The likelihood function represents the probability
of observing the data given the model parameters. Maximizing this function
essentially means finding the parameter values that make the observed data the
most likely under the assumed model.
Mathematical Formulation
Consider a dataset consisting of N observations {(x_n, y_n)}_{n=1}^{N}, where x_n represents the input vector and y_n represents the corresponding target value. In
the context of linear regression, the model assumes that the target variable y is
related to the input variables x through a linear relationship:
y(x, w) = w^T x + ϵ
where w is the vector of parameters to be estimated, and ϵ represents the
random error term.
The likelihood function L(w) is defined as the joint probability density function (or mass function) of observing the target values {y_n}_{n=1}^{N} given the input data {x_n}_{n=1}^{N} and the parameter vector w:
L(w) = p({y_n}_{n=1}^{N} | {x_n}_{n=1}^{N}, w)
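If, for illustration, the error term ϵ is assumed to be zero-mean Gaussian noise with variance σ^2 and the observations are independent, the log-likelihood takes the form
ln L(w) = −(1 / (2σ^2)) Σ_{n=1}^{N} (y_n − w^T x_n)^2 − (N/2) ln(2πσ^2)
so maximizing the likelihood with respect to w amounts to minimizing the sum of squared errors, which anticipates the connection to least squares discussed below.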
Practical Considerations
In real-world applications, optimization techniques such as gradient-based meth-
ods (e.g., gradient descent, Newton’s method) or numerical optimization al-
gorithms (e.g., BFGS, L-BFGS) are typically employed to find the maximum
likelihood estimate. Additionally, regularization techniques may be applied to
prevent overfitting and improve generalization performance.
Maximum Likelihood Estimation is a powerful and widely used method for
estimating the parameters of statistical models, including linear regression mod-
els. By maximizing the likelihood function, MLE provides a principled approach
to parameter estimation that is grounded in probability theory and statistics.
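As a minimal, self-contained sketch (assuming Gaussian noise and synthetic data chosen purely for illustration), the following snippet finds the maximum likelihood estimate for a linear model by running plain gradient descent on the sum of squared errors, which under that assumption is the negative log-likelihood up to constants:

import numpy as np

# Synthetic data: y = X @ w_true + Gaussian noise (illustrative assumption).
rng = np.random.default_rng(1)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=N)

# Gradient descent on the sum of squared errors (equivalent to MLE under
# Gaussian noise); in practice Newton's method, BFGS, or L-BFGS is often used.
w = np.zeros(D)
learning_rate = 0.05
for _ in range(2000):
    grad = -2.0 * X.T @ (y - X @ w) / N   # average gradient of the squared error
    w -= learning_rate * grad

print("MLE estimate:", w)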
Least Squares
Least squares is a method used to estimate the parameters of a linear model by
minimizing the sum of squared errors between the observed target values and
the model predictions.
The objective function to minimize is the sum of squared errors:
J(w) = Σ_{i=1}^{N} (y_i − x_i^T w)^2
To find the least squares estimate ŵ, we take the derivative of J(w) with
respect to w and set it equal to zero:
∂J(w)/∂w = −2 X^T (y − Xw) = 0
Solving this equation yields the least squares estimate ŵ = (X^T X)^{-1} X^T y.
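The closed-form solution can be computed directly; the sketch below (with arbitrary synthetic data) solves the normal equations with a linear solver rather than forming the matrix inverse explicitly, and shows an SVD-based solver as a more numerically robust alternative:

import numpy as np

# Arbitrary synthetic design matrix X (N x D) and targets y.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# Normal equations: (X^T X) w = X^T y; solving is preferred over explicit inversion.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent result, preferable when X is ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)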
Comparison
Maximum likelihood estimation and least squares are closely related, especially
in the case of Gaussian errors. MLE assumes a probabilistic model for the errors,
while least squares directly minimizes the error between observed and predicted
values. In the case of Gaussian noise, MLE is equivalent to least squares, making
least squares a special case of MLE. However, MLE can be more flexible and
applicable to situations with non-Gaussian noise distributions.
In matrix form, with design matrix X and target vector y, the model is written as
y = Xw + ϵ
and the least squares objective becomes J(w) = ||y − Xw||^2. This optimization problem can be solved analytically by setting the gradient of J(w) with respect to w to zero:
X^T (Xw − y) = 0
w* = (X^T X)^{-1} X^T y
Sequential Learning
Sequential learning techniques involve updating model parameters incremen-
tally as new data becomes available. This is particularly beneficial for online
learning scenarios where data arrives sequentially and the model needs to adapt
to changing conditions.
One common approach to sequential learning is stochastic gradient descent
(SGD), where model parameters are updated after processing each data point.
This allows the model to quickly adapt to new observations and can be compu-
tationally efficient for large datasets.
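A minimal sketch of this idea, assuming a squared-error loss and a simulated stream of observations, might look as follows:

import numpy as np

def sgd_update(w, x_n, y_n, learning_rate=0.01):
    """One stochastic gradient step on the squared error (y_n - w^T x_n)^2."""
    error = y_n - w @ x_n
    return w + learning_rate * error * x_n   # factor of 2 absorbed into the learning rate

# Simulated data stream: the parameters adapt one observation at a time.
rng = np.random.default_rng(3)
w_true = np.array([1.0, -0.5, 2.0])
w = np.zeros(3)
for _ in range(5000):
    x_n = rng.normal(size=3)
    y_n = w_true @ x_n + 0.1 * rng.normal()
    w = sgd_update(w, x_n, y_n)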
Another approach is online learning with regularized least squares, where
model parameters are updated using a combination of the new data and infor-
mation from previous observations. This helps prevent overfitting and ensures
that the model remains stable over time.
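One simple way to realize this, sketched here under the assumption of an L2 penalty (online ridge regression), is to maintain running sufficient statistics and re-solve for the parameters after each observation:

import numpy as np

class OnlineRidge:
    """Online regularized least squares via accumulated sufficient statistics."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)   # regularization term plus running sum of x x^T
        self.b = np.zeros(dim)       # running sum of y * x

    def update(self, x_n, y_n):
        self.A += np.outer(x_n, x_n)
        self.b += y_n * x_n
        return np.linalg.solve(self.A, self.b)   # current parameter estimate

Because the penalty term lam * I is fixed while the data terms accumulate, early estimates are strongly regularized and the influence of the prior information fades as more observations arrive.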
Regularized Least Squares
Regularized least squares adds a penalty on the magnitude of the weights to the sum-of-squares objective; Lasso regression uses an L1 penalty.
Lasso regression encourages sparsity in the parameter estimates, leading to
a simpler model with fewer non-zero coefficients. This makes Lasso regression
particularly useful for feature selection and model interpretability.
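For a concrete, purely illustrative example, the scikit-learn Lasso estimator can be applied to synthetic data in which only the first two of ten features carry signal; the L1 penalty drives most of the remaining coefficients exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features matter.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.coef_)   # irrelevant features typically receive zero coefficients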
Multiple Outputs
In linear regression, the target variable can have multiple dimensions or outputs.
This scenario arises when predicting multiple related variables simultaneously,
such as predicting the prices of multiple stocks or the coordinates of multiple
points in space.
Multiple output regression can be addressed using several approaches:
1. Independently Model Each Output: Treat each output variable inde-
pendently and fit a separate linear regression model to each output dimension.
This approach ignores any relationships or dependencies between the output
variables.
2. Jointly Model All Outputs: Fit a single regression model that jointly
predicts all output variables simultaneously. This approach captures correla-
tions between the output variables and can lead to more accurate predictions,
especially when the outputs are related.
3. Hierarchical Modeling: Hierarchical models capture dependencies be-
tween output variables by introducing hierarchical structures or constraints on
the model parameters. This approach is useful when there are known relation-
ships or dependencies between subsets of output variables.
Overall, multiple output regression extends the concepts of linear regres-
sion to handle multidimensional target variables, offering flexibility in modeling
complex relationships between inputs and outputs.
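A brief sketch of the joint approach, under the assumption that all outputs share the same design matrix, stacks the targets into an N x K matrix so that a single least squares solve recovers all K weight vectors at once:

import numpy as np

# Illustrative setup: N samples, D inputs, K related outputs.
rng = np.random.default_rng(5)
N, D, K = 100, 4, 3
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, K))
Y = X @ W_true + 0.1 * rng.normal(size=(N, K))   # target matrix, one column per output

# Normal equations applied column-wise: W_hat = (X^T X)^{-1} X^T Y.
W_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # shape (D, K)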
Parameter Distribution
In Bayesian linear regression, the parameters of the linear model are treated
as random variables with prior distributions. These prior distributions capture
our beliefs about the parameters before observing any data. For example, we
might assume Gaussian (normal) distributions for the parameters, with means
and variances reflecting our prior knowledge or assumptions.
When new data is observed, Bayesian inference updates the prior distribu-
tions to posterior distributions using Bayes’ theorem. The posterior distribu-
tions represent our updated beliefs about the parameters after considering the
observed data. They combine the information from the prior distributions with
the likelihood of the observed data given the parameters.
Mathematically, if w represents the vector of parameters, p(w) is the prior
distribution, p(y|X, w) is the likelihood function, and p(w|X, y) is the posterior
distribution, Bayes’ theorem states:
p(w|X, y) = p(y|X, w) · p(w) / p(y|X)
where X is the matrix of input features and y is the vector of observed target
values.
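As a sketch of the Gaussian conjugate case (assuming a zero-mean isotropic Gaussian prior with precision alpha and Gaussian observation noise with precision beta, both values chosen arbitrarily for illustration), the posterior mean and covariance have a closed form:

import numpy as np

def posterior(X, y, alpha=1.0, beta=25.0):
    """Mean and covariance of p(w | X, y) for a Gaussian prior and Gaussian noise."""
    D = X.shape[1]
    S_inv = alpha * np.eye(D) + beta * X.T @ X   # posterior precision matrix
    S = np.linalg.inv(S_inv)                     # posterior covariance
    m = beta * S @ (X.T @ y)                     # posterior mean
    return m, S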
Predictive Distribution
The predictive distribution in Bayesian linear regression gives the distribution
of the target variable y given new input data x. It incorporates uncertainty not
only in the parameters but also in the observation noise.
The predictive distribution is obtained by integrating over the posterior dis-
tribution of the parameters:
p(y | x, X, y) = ∫ p(y | x, w) · p(w | X, y) dw
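Continuing the Gaussian sketch above, the integral has a closed form: the predictive distribution at a new input x is itself Gaussian, with a mean given by the posterior mean weights and a variance that adds the observation noise to the parameter uncertainty.

import numpy as np

def predictive(x, m, S, beta=25.0):
    """Mean and variance of the Gaussian predictive distribution at input x."""
    mean = m @ x                      # posterior mean weights applied to x
    var = 1.0 / beta + x @ S @ x      # noise variance plus parameter uncertainty
    return mean, var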
Equivalent Kernel
In some cases, the predictive distribution in Bayesian linear regression can be
expressed in terms of an equivalent kernel. This kernel captures the similarities
between input points in the feature space and is useful for understanding the
model’s behavior.
The equivalent kernel is a weighted inner product between feature vectors in the feature space: it expresses the predictive mean at a new input as a weighted sum of the training targets. This lets us reason about predictions directly in terms of the training data, without explicitly examining the posterior distribution over the parameters.
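In the Gaussian case sketched above, for example, the predictive mean can be written as
y(x) = Σ_{n=1}^{N} k(x, x_n) y_n, with k(x, x') = β φ(x)^T S_N φ(x')
where S_N is the posterior covariance, φ denotes the basis function mapping (the identity map in the formulation used earlier), and k(x, x') is the equivalent kernel.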
Understanding the equivalent kernel provides insights into how the model
generalizes to new data and can guide the selection of appropriate basis functions
or regularization techniques.
In summary, Bayesian linear regression offers a powerful probabilistic frame-
work for regression analysis. By treating model parameters as random variables
and incorporating uncertainty in predictions, it provides more robust and inter-
pretable results compared to traditional regression methods. The predictive dis-
tribution and equivalent kernel are key concepts that enhance our understanding
of the model’s behavior and facilitate model evaluation and interpretation.
Bayesian Model Comparison
Bayesian model comparison enables the comparison of different models based
on their posterior probabilities given observed data.