Linear models
1 Linear functions
A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions. A linear function is a function T that maps any linear combination of inputs to the same linear combination of the corresponding outputs,

    T (x1 + x2) = T (x1) + T (x2),   (1)
    T (α x) = α T (x),   (2)

for any vectors x1, x2, x and any scalar α.
Theorem 1.2 (Equivalence between matrices and linear functions). For finite m, n every
linear function T : Rn → Rm can be represented by a matrix T ∈ Rm×n .
This implies that in order to analyze linear models in finite-dimensional spaces we can restrict
our attention to matrices.
The range of a matrix A ∈ Rm×n is the set of all possible vectors in Rm that we can reach by applying the matrix to a vector in Rn,

    range (A) := { Ax | x ∈ Rn }.   (4)
Definition 1.4 (Null space). The null space of A ∈ Rm×n contains the vectors in Rn that
A maps to the zero vector.
The following lemma, proved in Section A.2 of the appendix, shows that the null space is the orthogonal complement of the row space of the matrix.

Lemma 1.5. For any matrix A ∈ Rm×n, the null space of A is the orthogonal complement of its row space,

    null (A) = row (A)⊥.   (6)

The lemma implies that the matrix is invertible if we restrict the inputs to be in the row space of the matrix.
Corollary 1.6. Any matrix A ∈ Rm×n is invertible when acting on its row space: for any two nonzero vectors x1 ≠ x2 in the row space of A, Ax1 ≠ Ax2.
Proof. Assume that Ax1 = Ax2 for two different nonzero vectors x1 and x2 in the row space of A. Then x1 − x2 is a nonzero vector in the null space of A. By Lemma 1.5 this implies that x1 − x2 is orthogonal to the row space of A and consequently to itself, so that x1 − x2 = 0 and x1 = x2, which contradicts the assumption that the vectors are different.
This means that for every matrix A ∈ Rm×n we can decompose any vector in Rn into two components: one lies in the row space and, if it is nonzero, is mapped to a nonzero vector in Rm that is unique in the sense that no other vector in row (A) is mapped to it; the other lies in the null space and is mapped to the zero vector.
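As a numerical illustration of this decomposition (a sketch added here, not part of the original notes; the matrix A and the vector x are arbitrary examples), the right singular vectors corresponding to nonzero singular values can be used to project onto the row space:

```python
import numpy as np

# Arbitrary example: a 2 x 3 matrix of rank 1, so its null space has dimension 2.
A = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0]])
x = np.array([3.0, -1.0, 2.0])

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10)            # numerical rank
V_row = Vt[:r].T                 # orthonormal basis of row(A)

x_row = V_row @ (V_row.T @ x)    # component in the row space
x_null = x - x_row               # component in the null space

print(np.allclose(A @ x_null, 0))      # True: the null-space component maps to zero
print(np.allclose(A @ x, A @ x_row))   # True: only the row-space component matters
```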
1.2 Interpretation using the SVD
Recall that the left singular vectors of a matrix A that correspond to nonzero singular values are a basis of the column space of A. It follows that they are also a basis for the range. The right singular vectors corresponding to nonzero singular values are a basis of the row space. As a result, any orthonormal set of vectors that completes these right singular vectors to an orthonormal basis of Rn is a basis of the null space of the matrix. We can therefore write any matrix A ∈ Rm×n with rank r such that m ≥ n as
    A = [ u1 u2 ⋯ ur ur+1 ⋯ un ] diag (σ1 , . . . , σr , 0, . . . , 0) [ v1 v2 ⋯ vr vr+1 ⋯ vn ]^T ,

where u1 , . . . , ur form a basis of range (A), v1 , . . . , vr form a basis of row (A) and vr+1 , . . . , vn form a basis of null (A).
Note that the vectors ur+1 , . . . , un are a subset of an orthonormal basis of the orthogonal
complement of the range, which has dimension m − r.
The SVD provides a very intuitive characterization of the mapping between x ∈ Rn and Ax ∈ Rm for any matrix A ∈ Rm×n with rank r,

    Ax = Σ_{i=1}^{r} σi (vi^T x) ui .   (8)

Applying A to x can therefore be decomposed into three steps:
1. Compute the projection of x onto the right singular vectors of A: v1T x, v2T x, . . . , vrT x.
2. Scale the projections using the corresponding singular value: σ1 v1T x, σ2 v2T x, . . . , σr vrT x.
3. Multiply each scaled projection by the corresponding left singular vector ui and sum the results (see the numerical sketch below).
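The following NumPy sketch (an illustration added here, using an arbitrary random matrix) checks that these three steps reproduce Ax as in (8).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # arbitrary 5 x 3 matrix (full rank with probability 1)
x = rng.standard_normal(3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-10)

projections = Vt[:r] @ x          # step 1: v_i^T x
scaled = s[:r] * projections      # step 2: sigma_i v_i^T x
Ax = U[:, :r] @ scaled            # step 3: combine with the left singular vectors u_i

print(np.allclose(Ax, A @ x))     # True
```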
In a linear model we assume that the data y ∈ Rm can be represented as the result of
applying a linear function or matrix A ∈ Rm×n to an unknown vector x ∈ Rn ,
Ax = y. (9)
The aim is to determine x from the measurements. Depending on the structure of A and y
this may or may not be possible.
If we expand the matrix-vector product, the linear model is equivalent to a system of m linear equations in n unknowns,

    A11 x[1] + A12 x[2] + ⋯ + A1n x[n] = y[1],
    A21 x[1] + A22 x[2] + ⋯ + A2n x[n] = y[2],
                    ⋮
    Am1 x[1] + Am2 x[2] + ⋯ + Amn x[n] = y[m].
If the number of equations m is greater than the number of unknowns n, the system is said to be overdetermined. If there are more unknowns than equations, n > m, then the system is underdetermined.
Recall that range (A) is the set of vectors that can be reached by applying A. If y does not
belong to this set, then the system cannot have a solution.
Lemma 1.7. The system y = Ax has one or multiple solutions if and only if y ∈ range (A).
Proof. If y = Ax has a solution, then y ∈ range (A) by (4). If y ∈ range (A), then by (4) there is a linear combination of the columns of A that yields y, so the system has at least one solution.
If the null space of the matrix has dimension greater than 0, then the system cannot have a
unique solution.
Lemma 1.8. If dim (null (A)) > 0, then if Ax = y has a solution, the system has an infinite
number of solutions.
Proof. The null space has dimension at least one, so it contains infinitely many vectors h. For any such h and any solution x with Ax = y we have A (x + h) = Ax + Ah = y, so x + h is also a solution.
In the critical case m = n, linear systems may have a unique solution if the matrix is full
rank, i.e. if all its rows (and its columns) are linearly independent. This means that the
data in the linear model completely specify the unknown vector of interest, which can be
recovered by inverting the matrix.
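For instance, in NumPy such a square full-rank system can be solved as follows (a minimal sketch with an arbitrary 2 × 2 matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])        # rows (and columns) are linearly independent
y = np.array([3.0, 5.0])

x = np.linalg.solve(A, y)         # unique solution of Ax = y
print(np.allclose(A @ x, y))                  # True
print(np.allclose(x, np.linalg.inv(A) @ y))   # same as applying the inverse
```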
Lemma 1.9. For any square matrix A ∈ Rn×n , the following statements are equivalent.

1. dim (null (A)) = 0.

2. A is full rank.

3. A is invertible.

4. The system Ax = y has a unique solution for every vector y ∈ Rn .
Proof. We prove that the statements imply each other in the order (1) ⇒ (2) ⇒ (3) ⇒
(4) ⇒ (1).
(1) ⇒ (2): If dim (null (A)) = 0, by Lemma 1.5 the row space of A is the orthogonal
complement of {0}, so it is equal to Rn and therefore the rows are all linearly independent.
(2) ⇒ (3): If A is full rank, its rows span all of Rn so range (A) = Rn and A is invertible by
Corollary 1.6.
(3) ⇒ (4): If A is invertible, then A^{-1} y is a solution to Ax = y. It is unique: if Ax1 = Ax2 = y for some x1 ≠ x2 , then the nonzero vector x1 − x2 and 0 have the same image, so A is not invertible.

(4) ⇒ (1): If Ax = y has a unique solution for every y, then in particular Ax = 0 has the unique solution x = 0, so dim (null (A)) = 0.
Recall that the inverse of a product of invertible matrices is equal to the product of the inverses in reverse order, (AB)^{-1} = B^{-1} A^{-1}, and that the inverse of the transpose of a matrix is the transpose of the inverse,

    (A^T)^{-1} = (A^{-1})^T .   (15)
Using these facts, the inverse of a matrix A can be written in terms of its singular vectors
and singular values,
    A^{-1} = (U S V^T)^{-1}   (16)
           = (V^T)^{-1} S^{-1} U^{-1}   (17)
           = V diag (1/σ1 , 1/σ2 , . . . , 1/σn ) U^T   (18)
           = Σ_{i=1}^{n} (1/σi) vi ui^T .   (19)
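As a quick numerical check of (16)-(19) (a sketch added here, with an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))   # arbitrary square matrix, invertible with probability 1

U, s, Vt = np.linalg.svd(A)
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T          # V S^{-1} U^T, as in (18)
print(np.allclose(A_inv, np.linalg.inv(A)))    # True
```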
Note that if one of the singular values of a matrix is very small, the corresponding term
in (19) becomes very large. As a result, the solution to the system of equations becomes
very susceptible to noise in the data. In order to quantify the stability of the solution of a
system of equations we use the condition number of the corresponding matrix.
Definition 1.10 (Condition number). The condition number of a matrix is the ratio between its largest and its smallest singular values,

    cond (A) = σmax / σmin .   (20)
If the condition number is very large, then perturbations in the data may be dramatically
amplified in the corresponding solution. This is illustrated by the following example.
The matrix appearing in the two systems below has a condition number of approximately 4 · 10³. Compare the solutions of the corresponding system of equations for two very similar vectors:
    [ 1   1     ]^{-1} [ 1 ]   [ 1 ]
    [ 1   1.001 ]      [ 1 ] = [ 0 ] ,              (22)

    [ 1   1     ]^{-1} [ 1.1 ]   [  101 ]
    [ 1   1.001 ]      [ 1   ] = [ −100 ] .         (23)
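The amplification can be checked numerically. The following NumPy sketch (added here for illustration, not part of the original notes) computes the condition number of this matrix and solves the two systems above.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.001]])

# Condition number: ratio of largest to smallest singular value (Definition 1.10).
print(np.linalg.cond(A))                           # roughly 4e3

# A small change in the data produces a completely different solution.
print(np.linalg.solve(A, np.array([1.0, 1.0])))    # [ 1.  0.]
print(np.linalg.solve(A, np.array([1.1, 1.0])))    # roughly [ 101., -100.]
```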
2 Least squares
Just like a system for which m = n, an overdetermined system will have a solution as long
as y ∈ range (A). However, if m > n then range (A) is a low-dimensional subspace of Rm .
This means that even a small perturbation in a random direction is bound to kick y out
of the subspace. As a result, in most cases overdetermined systems do not have a solution.
However, we may still compute the point in range (A) that is closest to the data y. If we measure distance using the ℓ2 norm, this is known as the method of least squares.
Definition 2.1 (Least squares). The method of least squares consists of estimating x by solving the optimization problem

    xLS := arg min_x ||Ax − y||_2 .   (24)
Tall matrices (with more rows than columns) are said to be full rank if all their columns are linearly independent. If A is full rank, then the least-squares problem has a closed-form solution, given by the following theorem.
Theorem 2.2 (Least-squares solution). If A ∈ Rm×n is full rank and m ≥ n the solution to
the least-squares problem (24) is equal to
    xLS = V S^{-1} U^T y   (25)
        = (A^T A)^{-1} A^T y.   (26)
Proof. The least-squares problem (24) consists of finding the vector Ax in range (A) that is closest to y, i.e. of solving

    minimize ||z − y||_2  subject to  z ∈ range (A),   (27)

since every x ∈ Rn corresponds to a unique z = Ax ∈ range (A) (we are assuming that the matrix is full rank, so the null space only contains the zero vector). By Theorem 1.7 in Lecture Notes 9, the solution to Problem (27) is the orthogonal projection of y onto range (A),

    P_range(A) y = Σ_{i=1}^{n} (ui^T y) ui   (28)
                 = U U^T y.   (29)
Setting A xLS = U S V^T xLS equal to this projection and multiplying both sides by U^T we obtain

    U^T U S V^T xLS = U^T U U^T y.   (32)

We have

    U^T U = I,   (33)

because the columns of U are orthonormal (note that U U^T ≠ I if m > n!). As a result,

    xLS = (S V^T)^{-1} U^T y   (34)
        = (V^T)^{-1} S^{-1} U^T y   (35)
        = V S^{-1} U^T y,   (36)
where we have used the fact that

    V^{-1} = V^T  and  (V^T)^{-1} = V,   (37)

and the fact that S is diagonal, so S^T = S, and A is full rank, so that all the singular values are nonzero and S is indeed invertible. This establishes (25). Expression (26) follows because A^T A = V S U^T U S V^T = V S^2 V^T, so that (A^T A)^{-1} A^T = V S^{-2} V^T V S U^T = V S^{-1} U^T.
The matrix (A^T A)^{-1} A^T is called the pseudoinverse of A. In the square case it reduces to the inverse of the matrix.
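As an illustration (a sketch with arbitrary random data, not part of the original notes), the least-squares solution can be computed equivalently via the SVD expression (25), via the pseudoinverse, or with a library solver:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 3))   # tall full-rank matrix (m > n)
y = rng.standard_normal(20)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / s)                    # V S^{-1} U^T y, expression (25)
x_pinv = np.linalg.pinv(A) @ y                    # pseudoinverse (A^T A)^{-1} A^T y
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]    # library least-squares solver

print(np.allclose(x_svd, x_pinv), np.allclose(x_svd, x_lstsq))   # True True
```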
A very important application of least squares is fitting linear regression models. In linear regression, we assume that a quantity of interest can be expressed as a linear combination of other observed quantities,

    a ≈ Σ_{j=1}^{n} θj cj ,   (42)

where a is the response, c1 , . . . , cn are the covariates and θ1 , . . . , θn are the parameters of the model. Given m observations of the response and the covariates, we can place the response in a vector y and the covariates in a matrix X such that each column corresponds to a different covariate. We can then fit the parameters so that the model approximates the response as closely as possible in ℓ2 norm. This is achieved by solving the least-squares problem

    θLS := arg min_θ ||y − Xθ||_2 .   (43)
Lemma 2.3. Let Y and Z be random vectors of dimension n such that

    Y = Xθ + Z,   (44)

where Z is a Gaussian random vector with mean zero and covariance matrix equal to the identity. Then the maximum-likelihood estimate of θ given a realization y of Y is the least-squares estimate (43).
Proof. Setting Σ = I in Definition 2.20 of Lecture Notes 3, the likelihood function is

    L (θ) = (1 / √((2π)^n)) exp ( −(1/2) ||y − Xθ||_2^2 ) .   (45)

The exponential is monotone, so maximizing L (θ) is equivalent to minimizing ||y − Xθ||_2^2 , and hence to solving the least-squares problem (43).
Figure 1: Data and fitted model described by Example 2.4 for maximum and minimum temperatures (in degrees Celsius). The panels show the full record (1860–2000) together with close-ups of the periods 1900–1905 and 1960–1965.
Example 2.4 (Global warming). In this example we build a model for temperature data taken at a weather station in Oxford over 150 years.¹ The model is of the form

    y ≈ a + b̃ cos (2πt/12 + c̃) + d t   (49)
      = a + b cos (2πt/12) + c sin (2πt/12) + d t,   (50)
where t denotes the time in months. The parameter a represents the mean temperature, b and c account for periodic yearly fluctuations and d is the overall trend. If d is positive then the model indicates that temperatures are increasing, whereas if it is negative then it indicates that temperatures are decreasing. To fit these parameters using the data, we build a matrix A with four columns, containing a constant, cos (2πt/12), sin (2πt/12) and t evaluated at the measurement times, compile the temperatures in a vector y and solve a least-squares problem. The results are shown in Figures 1 and 2. The fitted model indicates that both the maximum and minimum temperatures are increasing by about 0.8 degrees Celsius (around 1.4 °F).
¹ The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.
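A rough NumPy sketch of the fitting procedure follows (added for illustration). The arrays t and y below are placeholders standing in for the time stamps (in months) and the monthly temperatures parsed from the Oxford station file; the parsing step is omitted.

```python
import numpy as np

# Placeholder data: t in months, y in Celsius. In the example these would be read
# from the Oxford station file referenced above.
t = np.arange(120, dtype=float)
y = 10.0 + 5.0 * np.cos(2 * np.pi * t / 12) + 0.001 * t

# Design matrix with the four columns of the model (50): constant, cosine, sine, trend.
A = np.column_stack([np.ones_like(t),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])

a, b, c, d = np.linalg.lstsq(A, y, rcond=None)[0]
print(a, b, c, d)    # a positive d indicates an increasing trend
```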
Figure 2: Maximum and minimum temperature data (1860–2000) together with the fitted trend.
A Proofs
A.1 Proof of Theorem 1.2

The matrix is

    T := [ T (e1 ) T (e2 ) ⋯ T (en ) ] ,   (52)
i.e. the columns of the matrix are the result of applying T to the standard basis of Rn .
Indeed, for any vector x ∈ Rn

    T (x) = T ( Σ_{i=1}^{n} x[i] ei )   (53)
          = Σ_{i=1}^{n} x[i] T (ei )   by (1) and (2)   (54)
          = T x.   (55)
A.2 Proof of Lemma 1.5

We prove (6) by showing that both sets are subsets of each other.
Any vector x in the row space of A can be written as

    x = A^T z   (56)

for some z ∈ Rm . If y belongs to the null space of A, so that Ay = 0, then

    y^T x = y^T A^T z   (57)
          = (Ay)^T z   (58)
          = 0,   (59)

so every vector in the null space is orthogonal to the row space, i.e. null (A) ⊆ row (A)⊥.