Linear models
1 Linear functions
A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions. A linear function is a function T that maps any linear combination of inputs to the same linear combination of the corresponding outputs,

    T (x1 + x2) = T (x1) + T (x2),   (1)
    T (α x) = α T (x),   (2)

for any vectors x1, x2, x and any scalar α.
Theorem 1.2 (Equivalence between matrices and linear functions). For finite m, n every
linear function T : Rn → Rm can be represented by a matrix T ∈ Rm×n .
This implies that in order to analyze linear models in finite-dimensional spaces we can restrict
our attention to matrices.
The range of a matrix A ∈ Rm×n is the set of all possible vectors in Rm that we can reach by applying the matrix to a vector in Rn,

    range (A) := { Ax | x ∈ Rn }.   (4)
Definition 1.4 (Null space). The null space of A ∈ Rm×n contains the vectors in Rn that
A maps to the zero vector.
The following lemma, proved in Section A.2 of the appendix, shows that the null space is the orthogonal complement of the row space of the matrix.

Lemma 1.5. For any matrix A ∈ Rm×n, the null space of A is the orthogonal complement of its row space,

    null (A) = row (A)⊥.   (6)

The lemma implies that the matrix is invertible if we restrict the inputs to be in the row space of the matrix.
Corollary 1.6. Any matrix A ∈ Rm×n is invertible when acting on its row space: for any two nonzero vectors x1 ≠ x2 in the row space of A, Ax1 ≠ Ax2.
Proof. Assume that Ax1 = Ax2 for two different nonzero vectors x1 and x2 in the row space of A. Then x1 − x2 is a nonzero vector in the null space of A. By Lemma 1.5 this implies that x1 − x2 is orthogonal to the row space of A and consequently to itself, so that x1 − x2 = 0 and x1 = x2, which contradicts the assumption that the vectors are different.
This means that for every matrix A ∈ Rm×n we can decompose any vector in Rn into two components: one lies in the row space and, if it is nonzero, is mapped to a nonzero vector in Rm that is unique in the sense that no other vector in row (A) is mapped to it; the other lies in the null space and is mapped to the zero vector.
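As a numerical illustration of this decomposition (a sketch added here, not part of the original notes; the matrix A and the vector x are arbitrary examples), the right singular vectors corresponding to nonzero singular values can be used to project onto the row space:

```python
import numpy as np

# Arbitrary example: a 2 x 3 matrix of rank 1, so its null space has dimension 2.
A = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0]])
x = np.array([3.0, -1.0, 2.0])

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10)            # numerical rank
V_row = Vt[:r].T                 # orthonormal basis of row(A)

x_row = V_row @ (V_row.T @ x)    # component in the row space
x_null = x - x_row               # component in the null space

print(np.allclose(A @ x_null, 0))      # True: the null-space component maps to zero
print(np.allclose(A @ x, A @ x_row))   # True: only the row-space component matters
```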
1.2 Interpretation using the SVD
Recall that the left singular vectors of a matrix A that correspond to nonzero singular values are a basis of the column space of A. It follows that they are also a basis for the range. The right singular vectors corresponding to nonzero singular values are a basis of the row space. As a result, any orthonormal set of vectors that completes these right singular vectors to an orthonormal basis of Rn is a basis of the null space of the matrix. We can therefore write any matrix A ∈ Rm×n with rank r such that m ≥ n as
    A = [ u1 u2 ⋯ ur ur+1 ⋯ un ] diag (σ1 , . . . , σr , 0, . . . , 0) [ v1 v2 ⋯ vr vr+1 ⋯ vn ]^T ,

where u1 , . . . , ur form a basis of range (A), v1 , . . . , vr form a basis of row (A) and vr+1 , . . . , vn form a basis of null (A).
Note that the vectors ur+1 , . . . , un are a subset of an orthonormal basis of the orthogonal
complement of the range, which has dimension m − r.
The SVD provides a very intuitive characterization of the mapping between x ∈ Rn and Ax ∈ Rm for any matrix A ∈ Rm×n with rank r,

    Ax = Σ_{i=1}^{r} σi (vi^T x) ui .   (8)

Applying A to x can therefore be decomposed into three steps:
1. Compute the projection of x onto the right singular vectors of A: v1T x, v2T x, . . . , vrT x.
2. Scale the projections using the corresponding singular value: σ1 v1T x, σ2 v2T x, . . . , σr vrT x.
3. Multiply each scaled projection by the corresponding left singular vector ui and sum the results (see the numerical sketch below).
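The following NumPy sketch (an illustration added here, using an arbitrary random matrix) checks that these three steps reproduce Ax as in (8).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # arbitrary 5 x 3 matrix (full rank with probability 1)
x = rng.standard_normal(3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-10)

projections = Vt[:r] @ x          # step 1: v_i^T x
scaled = s[:r] * projections      # step 2: sigma_i v_i^T x
Ax = U[:, :r] @ scaled            # step 3: combine with the left singular vectors u_i

print(np.allclose(Ax, A @ x))     # True
```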
In a linear model we assume that the data y ∈ Rm can be represented as the result of
applying a linear function or matrix A ∈ Rm×n to an unknown vector x ∈ Rn ,
Ax = y. (9)
The aim is to determine x from the measurements. Depending on the structure of A and y
this may or may not be possible.
If we expand the matrix-vector product, the linear model is equivalent to a system of m linear equations in n unknowns,

    A11 x[1] + A12 x[2] + ⋯ + A1n x[n] = y[1],
    A21 x[1] + A22 x[2] + ⋯ + A2n x[n] = y[2],
                    ⋮
    Am1 x[1] + Am2 x[2] + ⋯ + Amn x[n] = y[m].
If the number of equations m is greater than the number of unknowns n, the system is said to be overdetermined. If there are more unknowns than equations, n > m, then the system is underdetermined.
Recall that range (A) is the set of vectors that can be reached by applying A. If y does not
belong to this set, then the system cannot have a solution.
Lemma 1.7. The system y = Ax has one or multiple solutions if and only if y ∈ range (A).
Proof. If y = Ax has a solution, then y ∈ range (A) by (4). If y ∈ range (A), then by (4) there is a linear combination of the columns of A that yields y, so the system has at least one solution.
If the null space of the matrix has dimension greater than 0, then the system cannot have a
unique solution.
Lemma 1.8. If dim (null (A)) > 0, then if Ax = y has a solution, the system has an infinite
number of solutions.
Proof. The null space has dimension at least one, so it contains infinitely many vectors h. For any such h and any solution x with Ax = y we have A (x + h) = Ax + Ah = y, so x + h is also a solution.
In the critical case m = n, linear systems may have a unique solution if the matrix is full
rank, i.e. if all its rows (and its columns) are linearly independent. This means that the
data in the linear model completely specify the unknown vector of interest, which can be
recovered by inverting the matrix.
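For instance, in NumPy such a square full-rank system can be solved as follows (a minimal sketch with an arbitrary 2 × 2 matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])        # rows (and columns) are linearly independent
y = np.array([3.0, 5.0])

x = np.linalg.solve(A, y)         # unique solution of Ax = y
print(np.allclose(A @ x, y))                  # True
print(np.allclose(x, np.linalg.inv(A) @ y))   # same as applying the inverse
```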
Lemma 1.9. For any square matrix A ∈ Rn×n , the following statements are equivalent.

1. dim (null (A)) = 0.

2. A is full rank.

3. A is invertible.

4. The system Ax = y has a unique solution for every vector y ∈ Rn .
Proof. We prove that the statements imply each other in the order (1) ⇒ (2) ⇒ (3) ⇒
(4) ⇒ (1).
(1) ⇒ (2): If dim (null (A)) = 0, by Lemma 1.5 the row space of A is the orthogonal
complement of {0}, so it is equal to Rn and therefore the rows are all linearly independent.
(2) ⇒ (3): If A is full rank, its rows span all of Rn so range (A) = Rn and A is invertible by
Corollary 1.6.
(3) ⇒ (4): If A is invertible, then A^{-1} y is a solution to Ax = y. It is unique: if Ax1 = Ax2 = y for some x1 ≠ x2 , then the nonzero vector x1 − x2 and 0 have the same image, so A is not invertible.

(4) ⇒ (1): If Ax = y has a unique solution for every y, then in particular Ax = 0 has the unique solution x = 0, so dim (null (A)) = 0.
Recall that the inverse of a product of invertible matrices is equal to the product of the inverses in reverse order, (AB)^{-1} = B^{-1} A^{-1}, and that the inverse of the transpose of a matrix is the transpose of the inverse,

    (A^T)^{-1} = (A^{-1})^T .   (15)
Using these facts, the inverse of a matrix A can be written in terms of its singular vectors
and singular values,
    A^{-1} = (U S V^T)^{-1}   (16)
           = (V^T)^{-1} S^{-1} U^{-1}   (17)
           = V diag (1/σ1 , 1/σ2 , . . . , 1/σn ) U^T   (18)
           = Σ_{i=1}^{n} (1/σi) vi ui^T .   (19)
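As a quick numerical check of (16)-(19) (a sketch added here, with an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))   # arbitrary square matrix, invertible with probability 1

U, s, Vt = np.linalg.svd(A)
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T          # V S^{-1} U^T, as in (18)
print(np.allclose(A_inv, np.linalg.inv(A)))    # True
```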
Note that if one of the singular values of a matrix is very small, the corresponding term
in (19) becomes very large. As a result, the solution to the system of equations becomes
very susceptible to noise in the data. In order to quantify the stability of the solution of a
system of equations we use the condition number of the corresponding matrix.
Definition 1.10 (Condition number). The condition number of a matrix is the ratio between its largest and its smallest singular values,

    cond (A) = σmax / σmin .   (20)
If the condition number is very large, then perturbations in the data may be dramatically
amplified in the corresponding solution. This is illustrated by the following example.
The matrix appearing in the two systems below has a condition number of approximately 4 · 10³. Compare the solutions of the corresponding system of equations for two very similar vectors:
    [ 1   1     ]^{-1} [ 1 ]   [ 1 ]
    [ 1   1.001 ]      [ 1 ] = [ 0 ] ,              (22)

    [ 1   1     ]^{-1} [ 1.1 ]   [  101 ]
    [ 1   1.001 ]      [ 1   ] = [ −100 ] .         (23)
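The amplification can be checked numerically. The following NumPy sketch (added here for illustration, not part of the original notes) computes the condition number of this matrix and solves the two systems above.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.001]])

# Condition number: ratio of largest to smallest singular value (Definition 1.10).
print(np.linalg.cond(A))                           # roughly 4e3

# A small change in the data produces a completely different solution.
print(np.linalg.solve(A, np.array([1.0, 1.0])))    # [ 1.  0.]
print(np.linalg.solve(A, np.array([1.1, 1.0])))    # roughly [ 101., -100.]
```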
2 Least squares
Just like a system for which m = n, an overdetermined system will have a solution as long
as y ∈ range (A). However, if m > n then range (A) is a low-dimensional subspace of Rm .
This means that even a small perturbation in a random direction is bound to kick y out
of the subspace. As a result, in most cases overdetermined systems do not have a solution.
However, we may still compute the point in range (A) that is closest to the data y. If we measure distance using the ℓ2 norm, this is known as the method of least squares.
Definition 2.1 (Least squares). The method of least squares consists of estimating x by solving the optimization problem

    xLS := arg min_x ||Ax − y||_2 .   (24)
Tall matrices (with more rows than columns) are said to be full rank if all their columns are linearly independent. If A is full rank, then the least-squares problem has a closed-form solution, given by the following theorem.
Theorem 2.2 (Least-squares solution). If A ∈ Rm×n is full rank and m ≥ n the solution to
the least-squares problem (24) is equal to
    xLS = V S^{-1} U^T y   (25)
        = (A^T A)^{-1} A^T y.   (26)
Proof. The least-squares problem (24) consists of finding the vector Ax in range (A) that is closest to y, i.e. of solving

    minimize ||z − y||_2  subject to  z ∈ range (A),   (27)

since every x ∈ Rn corresponds to a unique z = Ax ∈ range (A) (we are assuming that the matrix is full rank, so the null space only contains the zero vector). By Theorem 1.7 in Lecture Notes 9, the solution to Problem (27) is the orthogonal projection of y onto range (A),

    P_range(A) y = Σ_{i=1}^{n} (ui^T y) ui   (28)
                 = U U^T y.   (29)
Setting A xLS = U S V^T xLS equal to this projection and multiplying both sides by U^T we obtain

    U^T U S V^T xLS = U^T U U^T y.   (32)

We have

    U^T U = I,   (33)

because the columns of U are orthonormal (note that U U^T ≠ I if m > n!). As a result,

    xLS = (S V^T)^{-1} U^T y   (34)
        = (V^T)^{-1} S^{-1} U^T y   (35)
        = V S^{-1} U^T y,   (36)
where we have used the fact that

    V^{-1} = V^T  and  (V^T)^{-1} = V,   (37)

and the fact that S is diagonal, so S^T = S, and A is full rank, so that all the singular values are nonzero and S is indeed invertible. This establishes (25). Expression (26) follows because A^T A = V S U^T U S V^T = V S^2 V^T, so that (A^T A)^{-1} A^T = V S^{-2} V^T V S U^T = V S^{-1} U^T.
The matrix (A^T A)^{-1} A^T is called the pseudoinverse of A. In the square case it reduces to the inverse of the matrix.
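As an illustration (a sketch with arbitrary random data, not part of the original notes), the least-squares solution can be computed equivalently via the SVD expression (25), via the pseudoinverse, or with a library solver:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 3))   # tall full-rank matrix (m > n)
y = rng.standard_normal(20)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / s)                    # V S^{-1} U^T y, expression (25)
x_pinv = np.linalg.pinv(A) @ y                    # pseudoinverse (A^T A)^{-1} A^T y
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]    # library least-squares solver

print(np.allclose(x_svd, x_pinv), np.allclose(x_svd, x_lstsq))   # True True
```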
A very important application of least squares is fitting linear regression models. In linear regression, we assume that a quantity of interest can be expressed as a linear combination of other observed quantities,

    a ≈ Σ_{j=1}^{n} θj cj ,   (42)

where a is the response, c1 , . . . , cn are the covariates and θ1 , . . . , θn are the parameters of the model. Given m observations of the response and the covariates, we can place the response in a vector y and the covariates in a matrix X such that each column corresponds to a different covariate. We can then fit the parameters so that the model approximates the response as closely as possible in ℓ2 norm. This is achieved by solving the least-squares problem

    θLS := arg min_θ ||y − Xθ||_2 .   (43)
Lemma 2.3. Let Y and Z be random vectors of dimension n such that

    Y = Xθ + Z,   (44)

where Z is a Gaussian random vector with mean zero and covariance matrix equal to the identity. Then the maximum-likelihood estimate of θ given a realization y of Y is the least-squares estimate (43).
Proof. Setting Σ = I in Definition 2.20 of Lecture Notes 3, the likelihood function is

    L (θ) = (1 / √((2π)^n)) exp ( −(1/2) ||y − Xθ||_2^2 ) .   (45)

The exponential is monotone, so maximizing L (θ) is equivalent to minimizing ||y − Xθ||_2^2 , and hence to solving the least-squares problem (43).
Figure 1: Data and fitted model described by Example 2.4 for maximum and minimum temperatures (in degrees Celsius). The panels show the full record (1860–2000) together with close-ups of the periods 1900–1905 and 1960–1965.
Example 2.4 (Global warming). In this example we build a model for temperature data taken at a weather station in Oxford over 150 years.¹ The model is of the form

    y ≈ a + b̃ cos (2πt/12 + c̃) + d t   (49)
      = a + b cos (2πt/12) + c sin (2πt/12) + d t,   (50)
where t denotes the time in months. The parameter a represents the mean temperature, b and c account for periodic yearly fluctuations and d is the overall trend. If d is positive then the model indicates that temperatures are increasing, whereas if it is negative then it indicates that temperatures are decreasing. To fit these parameters using the data, we build a matrix A with four columns, containing a constant, cos (2πt/12), sin (2πt/12) and t evaluated at the measurement times, compile the temperatures in a vector y and solve a least-squares problem. The results are shown in Figures 1 and 2. The fitted model indicates that both the maximum and minimum temperatures are increasing by about 0.8 degrees Celsius (around 1.4 °F).
¹ The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.
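A rough NumPy sketch of the fitting procedure follows (added for illustration). The arrays t and y below are placeholders standing in for the time stamps (in months) and the monthly temperatures parsed from the Oxford station file; the parsing step is omitted.

```python
import numpy as np

# Placeholder data: t in months, y in Celsius. In the example these would be read
# from the Oxford station file referenced above.
t = np.arange(120, dtype=float)
y = 10.0 + 5.0 * np.cos(2 * np.pi * t / 12) + 0.001 * t

# Design matrix with the four columns of the model (50): constant, cosine, sine, trend.
A = np.column_stack([np.ones_like(t),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])

a, b, c, d = np.linalg.lstsq(A, y, rcond=None)[0]
print(a, b, c, d)    # a positive d indicates an increasing trend
```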
Figure 2: Maximum and minimum temperature data (1860–2000) together with the fitted trend.
A Proofs
A.1 Proof of Theorem 1.2

The matrix is

    T := [ T (e1 ) T (e2 ) ⋯ T (en ) ] ,   (52)
i.e. the columns of the matrix are the result of applying T to the standard basis of Rn .
Indeed, for any vector x ∈ Rn

    T (x) = T ( Σ_{i=1}^{n} x[i] ei )   (53)
          = Σ_{i=1}^{n} x[i] T (ei )   by (1) and (2)   (54)
          = T x.   (55)
A.2 Proof of Lemma 1.5

We prove (6) by showing that both sets are subsets of each other.
Any vector x in the row space of A can be written as

    x = A^T z   (56)

for some z ∈ Rm . If y belongs to the null space of A, so that Ay = 0, then

    y^T x = y^T A^T z   (57)
          = (Ay)^T z   (58)
          = 0,   (59)

so every vector in the null space is orthogonal to the row space, i.e. null (A) ⊆ row (A)⊥.