Vision Dummy PDF
A geometric interpretation of the covariance matrix
Introduction
In this article, we provide an intuitive, geometric interpretation of the covariance matrix by exploring the relation between linear transformations and the resulting data covariance. Most textbooks explain the shape of data based on the concept of covariance matrices. Instead, we take a backwards approach and explain the concept of covariance matrices based on the shape of the data.
In a previous article, we discussed the concept of variance, and provided a derivation and proof of
the well known formula to estimate the sample variance. Figure 1 was used in this article to show
that the standard deviation, as the square root of the variance, provides a measure of how much
the data is spread across the feature space.
Figure 1. Gaussian density function. For normally distributed data, 68% of the samples fall within
the interval defined by the mean plus and minus the standard deviation.
We showed that an unbiased estimator of the sample variance can be obtained by:

(1) $\sigma_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$
However, variance can only be used to explain the spread of the data in the directions parallel to
the axes of the feature space. Consider the 2D feature space shown by figure 2:
For this data, we could calculate the variance $\sigma(x,x)$ in the x-direction and the variance $\sigma(y,y)$ in the y-direction. However, the horizontal spread and the vertical spread of the data do not explain the clear diagonal correlation. Figure 2 clearly shows that on average, if the x-value of a data point increases, then the y-value also increases, resulting in a positive correlation.
This correlation can be captured by extending the notion of variance to what is called the
covariance of the data:
(2) $\sigma(x,y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$
For 2D data, we thus obtain $\sigma(x,x)$, $\sigma(y,y)$, $\sigma(x,y)$ and $\sigma(y,x)$. These four values can be summarized in a matrix, called the covariance matrix:

(3) $\Sigma = \begin{bmatrix} \sigma(x,x) & \sigma(x,y) \\ \sigma(y,x) & \sigma(y,y) \end{bmatrix}$

If x is positively correlated with y, y is also positively correlated with x. In other words, we can state that $\sigma(x,y) = \sigma(y,x)$. Therefore, the covariance matrix is always a symmetric matrix with the variances on its diagonal and the covariances off-diagonal. Two-dimensional normally distributed data is explained completely by its mean and its $2 \times 2$ covariance matrix. Figure 3 illustrates how the overall shape of the data defines the covariance matrix:
Figure 3. The covariance matrix defines the shape of the data. Diagonal spread is captured by the
covariance, while axis-aligned spread is captured by the variance.
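This relation is easy to verify numerically. The following Python sketch (with synthetic data; the article's own demonstrations are in Matlab) estimates the covariance matrix of positively correlated 2D data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data with a positive diagonal correlation:
# y = x + noise, so sigma(x, y) > 0.
x = rng.normal(0.0, 2.0, 100_000)
y = x + rng.normal(0.0, 1.0, 100_000)

# np.cov uses the unbiased N-1 normalization by default,
# matching equations (1) and (2).
sigma = np.cov(np.vstack([x, y]))
print(sigma)
# The diagonal holds the variances, the off-diagonal the
# (symmetric) covariance sigma(x, y) = sigma(y, x).
```

The positive off-diagonal entries capture exactly the diagonal spread that the individual variances cannot express.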
If we would like to represent the covariance matrix with a vector and its magnitude, we should simply try to find the vector that points into the direction of the largest spread of the data, and whose magnitude equals the spread (variance) in this direction.

If we define this vector as $\vec{v}$, then the variance of the data projected onto this vector is $\vec{v}^{\intercal} \Sigma \vec{v}$. Since we are looking for the vector $\vec{v}$ that points into the direction of the largest variance, we should choose its components such that $\vec{v}^{\intercal} \Sigma \vec{v}$ is maximized. Maximizing an expression of this form with respect to a unit vector $\vec{v}$ is known to be achieved by setting $\vec{v}$ equal to the largest eigenvector of the matrix $\Sigma$.
In other words, the largest eigenvector of the covariance matrix always points into the direction of
the largest variance of the data, and the magnitude of this vector equals the corresponding
eigenvalue. The second largest eigenvector is always orthogonal to the largest eigenvector, and
points into the direction of the second largest spread of the data.
Now let's have a look at some examples. In an earlier article we saw that a linear transformation matrix is completely defined by its eigenvectors and eigenvalues. Applied to the covariance matrix, the eigenvalue equation reads:

(4) $\Sigma \vec{v} = \lambda \vec{v}$

where $\vec{v}$ is an eigenvector of $\Sigma$, and $\lambda$ is the corresponding eigenvalue.
If the covariance matrix of our data is a diagonal matrix, such that the covariances are zero, then this means that the variances must be equal to the eigenvalues $\lambda$. This is illustrated by the figure below, where the eigenvectors are shown in green and magenta, and where the eigenvalues clearly equal the variance components of the covariance matrix.
Now let's forget about covariance matrices for a moment. Each of the examples in figure 3 can simply be considered to be a linearly transformed instance of figure 6:

(5) $D = T\,D'$

where $D'$ is the original white data and $T$ is a transformation matrix consisting of a rotation matrix $R$ and a scaling matrix $S$:

(6) $T = R\,S$

These matrices are defined as:

(7) $R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}$

where $\theta$ is the rotation angle, and:

(8) $S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}$

where $s_x$ and $s_y$ are the scaling factors in the x direction and the y direction respectively.
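Equations (6)-(8) can be written out directly. In this Python sketch, the rotation angle and scale factors are arbitrary choices for illustration:

```python
import numpy as np

theta = np.deg2rad(30)   # arbitrary rotation angle
sx, sy = 4.0, 1.0        # arbitrary scaling factors

# Rotation matrix R, equation (7)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Scaling matrix S, equation (8)
S = np.array([[sx, 0.0],
              [0.0, sy]])

# Transformation matrix T = R S, equation (6)
T = R @ S

# Applying T to white data D' yields the transformed data D = T D',
# equation (5).
rng = np.random.default_rng(1)
D_white = rng.standard_normal((2, 1000))
D = T @ D_white
```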
In the following paragraphs, we will discuss the relation between the covariance matrix $\Sigma$ and the linear transformation matrix $T = R\,S$.
Let's start with unscaled (scale equals 1) and unrotated data. In statistics this is often referred to as white data because its samples are drawn from a standard normal distribution and therefore correspond to white (uncorrelated) noise. The covariance matrix of this white data equals the identity matrix, such that the variances equal 1 and the covariances equal 0:

(9) $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

Now let's scale the data in the x-direction with a factor 4:

(10) $D = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix} D'$

The covariance matrix $\Sigma$ of the scaled data $D$ is now:

(11) $\Sigma = \begin{bmatrix} 16 & 0 \\ 0 & 1 \end{bmatrix}$

Thus, the covariance matrix $\Sigma$ of the resulting data is related to the linear transformation $T$ that is applied to the original data as follows:

(12) $D = T\,D'$, where $T = \sqrt{\Sigma} = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$
However, although equation (12) holds when the data is scaled in the x and y direction, the question arises whether it also holds when a rotation is applied. To investigate the relation between the linear transformation matrix $T$ and the covariance matrix $\Sigma$ in the general case, we will try to decompose the covariance matrix into the product of rotation and scaling matrices.
As we saw earlier, we can represent the covariance matrix by its eigenvectors and eigenvalues:

(13) $\Sigma \vec{v} = \lambda \vec{v}$

where $\vec{v}$ is an eigenvector of $\Sigma$, and $\lambda$ is the corresponding eigenvalue. Since the covariance matrix of 2D data is a $2 \times 2$ matrix, there are two eigenvectors and two eigenvalues. The system of two equations defined by equation (13) can be represented efficiently using matrix notation:

(14) $\Sigma V = V L$

where $V$ is the matrix whose columns are the eigenvectors of $\Sigma$ and $L$ is the diagonal matrix whose non-zero elements are the corresponding eigenvalues. This means that we can represent the covariance matrix as a function of its eigenvectors and eigenvalues:

(15) $\Sigma = V L V^{-1}$
Equation (15) is called the eigendecomposition of the covariance matrix and can be obtained using a Singular Value Decomposition algorithm. Whereas the eigenvectors represent the directions of the largest variance of the data, the eigenvalues represent the magnitude of this variance in those directions. In other words, $V$ represents a rotation matrix, while $\sqrt{L}$ represents a scaling matrix. The covariance matrix can thus be decomposed further as:

(16) $\Sigma = R\,S\,S\,R^{-1}$

where $R = V$ is a rotation matrix and $S = \sqrt{L}$ is a scaling matrix.

In equation (6) we defined a linear transformation $T = R\,S$. Since $S$ is a diagonal scaling matrix, $S = S^{\intercal}$. Furthermore, since $R$ is an orthogonal matrix, $R^{-1} = R^{\intercal}$. Therefore, $T^{\intercal} = (R\,S)^{\intercal} = S^{\intercal} R^{\intercal} = S\,R^{-1}$. The covariance matrix can thus be written as:

(17) $\Sigma = R\,S\,S\,R^{-1} = T\,T^{\intercal}$

In other words, if we apply the linear transformation defined by $T = R\,S$ to the original white data, we obtain rotated and scaled data with covariance matrix $T\,T^{\intercal} = \Sigma$:
Figure 10. The covariance matrix represents a linear transformation of the original data.
The colored arrows in figure 10 represent the eigenvectors. The largest eigenvector, i.e. the
eigenvector with the largest corresponding eigenvalue, always points in the direction of the largest
variance of the data and thereby defines its orientation. Subsequent eigenvectors are always
orthogonal to the largest eigenvector due to the orthogonality of rotation matrices.
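The decomposition $\Sigma = R\,S\,S\,R^{-1} = T\,T^{\intercal}$ can be checked numerically. A Python sketch, using an arbitrary example covariance matrix:

```python
import numpy as np

# Arbitrary symmetric, positive-definite covariance matrix.
sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Eigendecomposition: columns of V are eigenvectors, equation (15).
eigenvalues, V = np.linalg.eigh(sigma)

R = V                              # orthogonal matrix of eigenvectors
S = np.diag(np.sqrt(eigenvalues))  # scaling matrix S = sqrt(L)
T = R @ S                          # linear transformation T = R S

# Sigma is recovered as T T^T, equation (17).
print(np.allclose(T @ T.T, sigma))  # True
```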
Conclusion
In this article we showed that the covariance matrix of observed data is directly related to a linear
transformation of white, uncorrelated data. This linear transformation is completely defined by the
eigenvectors and eigenvalues of the data. While the eigenvectors represent the rotation matrix,
the eigenvalues correspond to the square of the scaling factor in each dimension.
1 Introduction
2 Axis-aligned confidence ellipses
3 Arbitrary confidence ellipses
4 Source Code
5 Conclusion
Introduction
In this post, I will show how to draw an error ellipse, a.k.a. confidence ellipse, for 2D
normally distributed data. The error ellipse represents an iso-contour of the Gaussian
distribution, and allows you to visualize a 2D confidence interval. The following figure
shows a 95% confidence ellipse for a set of 2D normally distributed data samples. This
confidence ellipse defines the region that contains 95% of all samples that can be drawn
from the underlying Gaussian distribution.
Furthermore, it is clear that the magnitudes of the ellipse axes depend on the variance of
the data. In our case, the largest variance is in the direction of the X-axis, whereas the
smallest variance lies in the direction of the Y-axis.
In general, the equation of an axis-aligned ellipse with a major axis of length $2a$ and a minor axis of length $2b$, centered at the origin, is defined by the following equation:

(1) $\left(\frac{x}{a}\right)^2 + \left(\frac{y}{b}\right)^2 = 1$

In our case, the lengths of the axes are defined by the standard deviations $\sigma_x$ and $\sigma_y$ of the data, such that the equation of the error ellipse becomes:

(2) $\left(\frac{x}{\sigma_x}\right)^2 + \left(\frac{y}{\sigma_y}\right)^2 = s$
where $s$ defines the scale of the ellipse and could be any arbitrary number (e.g. s=1). The question is now how to choose $s$ such that the scale of the resulting ellipse represents a chosen confidence level (e.g. a 95% confidence level corresponds to s=5.991).
Our 2D data is sampled from a multivariate Gaussian with zero covariance. This means
that both the x-values and the y-values are normally distributed too. Therefore, the left
hand side of equation (2) actually represents the sum of squares of independent normally
distributed data samples. The sum of squared Gaussian data points is known to be
distributed according to a so called Chi-Square distribution. A Chi-Square distribution is
defined in terms of degrees of freedom, which represent the number of unknowns. In our
case there are two unknowns, and therefore two degrees of freedom.
Therefore, we can easily obtain the probability that the above sum, and thus $s$, equals a specific value by calculating the Chi-Square likelihood. In fact, since we are interested in a confidence interval, we are looking for the probability that $s$ is less than or equal to a specific value, which can easily be obtained using the cumulative Chi-Square distribution. As statisticians are lazy people, we usually don't try to calculate this probability, but simply look it up in a probability table: https://people.richland.edu/james/lecture/m170/tbl-chi.html.
For example, using this probability table we can easily find that, in the 2-degrees-of-freedom case:

$P(s < 5.991) = 1 - 0.05 = 0.95$
Therefore, a 95% confidence interval corresponds to s=5.991. In other words, 95% of the data will fall inside the ellipse defined as:

(3) $\left(\frac{x}{\sigma_x}\right)^2 + \left(\frac{y}{\sigma_y}\right)^2 = 5.991$
Similarly, a 99% confidence interval corresponds to s=9.210 and a 90% confidence interval
corresponds to s=4.605.
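Instead of a lookup table, we can also use the fact that the cumulative Chi-Square distribution with 2 degrees of freedom has the closed form $F(s) = 1 - e^{-s/2}$, so the scale $s$ can be computed directly (a Python sketch; the helper name is ours):

```python
import math

def chi2_scale_2dof(confidence: float) -> float:
    """Invert F(s) = 1 - exp(-s/2), the cumulative Chi-Square
    distribution with 2 degrees of freedom, to get the scale s."""
    return -2.0 * math.log(1.0 - confidence)

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} -> s = {chi2_scale_2dof(confidence):.3f}")
# 90% -> s = 4.605, 95% -> s = 5.991, 99% -> s = 9.210
```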
The error ellipse shown by figure 2 can therefore be drawn as an ellipse with a major axis length equal to $2\sqrt{5.991\,\lambda_1}$ and a minor axis length equal to $2\sqrt{5.991\,\lambda_2}$, where $\lambda_1$ and $\lambda_2$ are the largest and smallest eigenvalues of the covariance matrix. The angle $\alpha$ between the major axis and the x-axis is:

(4) $\alpha = \arctan\left(\frac{v_1(y)}{v_1(x)}\right)$

where $v_1$ is the eigenvector of the covariance matrix that corresponds to the largest eigenvalue.
Based on the minor and major axis lengths and the angle $\alpha$ between the major axis and the x-axis, it becomes trivial to plot the confidence ellipse. Figure 3 shows error ellipses for several confidence values:
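Putting the pieces together, the ellipse parameters follow directly from the eigendecomposition of the covariance matrix. A Python sketch (the article's own source code is in Matlab and C++; the helper name and example covariance matrix here are ours):

```python
import math
import numpy as np

def error_ellipse(sigma, s):
    """Return (major, minor, angle) of the confidence ellipse for
    covariance matrix `sigma` at Chi-Square scale `s`."""
    eigenvalues, eigenvectors = np.linalg.eigh(sigma)
    # eigh returns eigenvalues in ascending order.
    lam2, lam1 = eigenvalues
    v1 = eigenvectors[:, 1]            # largest eigenvector
    major = 2.0 * math.sqrt(s * lam1)  # major axis length
    minor = 2.0 * math.sqrt(s * lam2)  # minor axis length
    angle = math.atan2(v1[1], v1[0])   # equation (4)
    return major, minor, angle

# Hypothetical correlated data, 95% confidence (s = 5.991).
sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
print(error_ellipse(sigma, 5.991))
```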
Source Code
Matlab source code
C++ source code (uses OpenCV)
Conclusion
In this article we showed how to obtain the error ellipse for 2D normally distributed data,
according to a chosen confidence value. This is often useful when visualizing or analyzing
data and will be of interest in a future article about PCA.
1 Introduction
2 Minimum variance, unbiased estimators
o 2.1 Parameter bias
o 2.2 Parameter variance
3 Maximum Likelihood estimation
4 Estimating the variance if the mean is known
o 4.1 Parameter estimation
o 4.2 Performance evaluation
5 Estimating the variance if the mean is unknown
o 5.1 Parameter estimation
o 5.2 Performance evaluation
o 5.3 Fixing the bias
6 Conclusion
Introduction
In this article, we will derive the well known formulas for calculating the
mean and the variance of normally distributed data, in order to answer the
question in the article's title. However, for readers who are not interested in
the why of this question but only in the when, the answer is quite simple:
If you have to estimate both the mean and the variance of the data (which is
typically the case), then divide by N-1, such that the variance is obtained as:

$\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$

If, on the other hand, the mean of the true population is known such that only
the variance needs to be estimated, then divide by N, such that the variance is
obtained as:

$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Whereas the former is what you will typically need, an example of the latter
would be the estimation of the spread of white Gaussian noise. Since the mean
of white Gaussian noise is known to be zero, only the variance needs to be
estimated in this case.
If data is normally distributed we can completely characterize it by its mean $\mu$ and its
variance $\sigma^2$. The variance is the square of the standard deviation $\sigma$, which represents the
average deviation of each data point from the mean. In other words, the variance represents
the spread of the data. For normally distributed data, 68.3% of the observations will have a
value between $\mu - \sigma$ and $\mu + \sigma$. This is illustrated by the following figure, which shows a
Gaussian density function with mean $\mu$ and variance $\sigma^2$:
Figure 1. Gaussian density function. For normally distributed data, 68% of the samples fall
within the interval defined by the mean plus and minus the standard deviation.
Usually we do not have access to the complete population of the data. In the above
example, we would typically have a few observations at our disposal but we do not have
access to all possible observations that define the x-axis of the plot. For example, we might
have the following set of observations:
Table 1

Observation ID | Observed Value
Observation 1  | 10
Observation 2  | 12
Observation 3  | 7
Observation 4  | 5
Observation 5  | 11
If we now calculate the empirical mean by summing up all values and dividing by the
number of observations, we have:

(1) $\hat{\mu} = \frac{10 + 12 + 7 + 5 + 11}{5} = 9$
Usually we assume that the empirical mean is close to the actually unknown mean of the
distribution, and thus assume that the observed data is sampled from a Gaussian
distribution with mean $\hat{\mu} = 9$. In this example, the actual mean of the distribution is 10, so
the empirical mean indeed is close to the actual mean.
The variance of the data is calculated as follows:

(2) $\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{\mu})^2 = \frac{(10-9)^2 + (12-9)^2 + (7-9)^2 + (5-9)^2 + (11-9)^2}{4} = 8.5$
Again, we usually assume that this empirical variance is close to the real and unknown
variance of underlying distribution. In this example, the real variance was 9, so indeed the
empirical variance is close to the real variance.
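For the observations of table 1, these formulas are easy to check in a short Python sketch:

```python
# Observed values from table 1.
values = [10, 12, 7, 5, 11]
n = len(values)

# Empirical mean, equation (1).
mean = sum(values) / n

# Empirical variance with the N-1 normalization, equation (2).
variance = sum((x - mean) ** 2 for x in values) / (n - 1)

print(mean)      # 9.0
print(variance)  # 8.5
```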
The question at hand is now why the formulas used to calculate the empirical mean and
the empirical variance are correct. In fact, another often used formula to calculate the
variance, is defined as follows:
(3) $\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$
The only difference between equation (2) and (3) is that the former divides by N-1,
whereas the latter divides by N. Both formulas are actually correct, but when to use which
one depends on the situation.
In the following sections, we will completely derive the formulas that best approximate the
unknown variance and mean of a normal distribution, given a few samples from this
distribution. We will show in which cases to divide the variance by N and in which cases to
normalize by N-1.
Minimum variance, unbiased estimators
A formula that approximates a parameter (mean or variance) is called an estimator. In the
following, we will denote the real and unknown parameters of the distribution by $\mu$ and $\sigma^2$.
The estimators, e.g. the empirical average and empirical variance, are denoted as $\hat{\mu}$ and $\hat{\sigma}^2$.
To find the optimal estimators, we first need an analytical expression for the likelihood of
observing a specific data point $x$, given the fact that the population is normally distributed
with a given mean $\mu$ and standard deviation $\sigma$. A normal distribution with known
parameters is usually denoted as $N(\mu, \sigma^2)$, and its density function is:

(4) $P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
To calculate the mean and variance, we obviously need more than one sample from this
distribution. In the following, let vector $\vec{x} = (x_1, x_2, \ldots, x_N)$ be a vector that contains all
the available samples (e.g. all the values from the example in table 1). If all these samples
are statistically independent, we can write their joint likelihood function as the product of all
the individual likelihoods:

(5) $P(\vec{x}; \mu, \sigma) = \prod_{i=1}^{N} P(x_i; \mu, \sigma)$
Plugging equation (4) into equation (5) then yields an analytical expression for this joint
probability density function:

(6) $P(\vec{x}; \mu, \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^N \exp\left(-\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{2\sigma^2}\right)$
Equation (6) will be important in the next sections and will be used to derive the well known
expressions for the estimators of the mean and the variance of a Gaussian distribution.
Parameter bias
Imagine that we could obtain different (disjoint) subsets of the complete population. In
analogy to our previous example, imagine that, apart from the data in Table 1, we also
have a Table 2 and a Table 3 with different observations. Then a good estimator for the
mean, would be an estimator that on average would be equal to the real mean. Although
we can live with the idea that the empirical mean from one subset of data is not equal to
the real mean like in our example, a good estimator should make sure that the average of
the estimated means from all subsets is equal to the real mean. This constraint is
expressed mathematically by stating that the Expected Value of the estimator should equal
the real parameter value:
(7) $E[\hat{\mu}] = \mu \quad \text{and} \quad E[\hat{\sigma}^2] = \sigma^2$
If the above conditions hold, then the estimators are called unbiased estimators. If the
conditions do not hold, the estimators are said to be biased, since on average they will
either underestimate or overestimate the true value of the parameter.
Parameter variance
Unbiased estimators guarantee that on average they yield an estimate that equals the real
parameter. However, this does not mean that each estimate is a good estimate. For
instance, if the real mean is 10, an unbiased estimator could estimate the mean as 50 on
one population subset and as -30 on another subset. The expected value of the estimate
would then indeed be 10, which equals the real parameter, but the quality of the estimator
clearly also depends on the spread of each estimate. An estimator that yields the
estimates (10, 15, 5, 12, 8) for five different subsets of the population is unbiased just like
an estimator that yields the estimates (50, -30, 100, -90, 10). However, all estimates from
the first estimator are closer to the true value than those from the second estimator.
Therefore, a good estimator not only has a low bias, but also yields a low variance. This
variance is expressed as the mean squared error of the estimator:

$Var(\hat{\mu}) = E[(\hat{\mu} - \mu)^2]$

A good estimator is therefore a low bias, low variance estimator. The optimal estimator,
if such estimator exists, is then the one that has no bias and a variance that is lower than
any other possible estimator. Such an estimator is called the minimum variance, unbiased
(MVU) estimator. In the next section, we will derive the analytical expressions for the mean
and the variance estimators of a Gaussian distribution. We will show that the MVU
estimator for the variance of a normal distribution requires us to divide by N
under certain assumptions, and requires us to divide by N-1 if these assumptions do not
hold.
Maximum Likelihood estimation
An often used technique to obtain estimators is maximum likelihood estimation: we pick the
parameter values under which the observed data is most likely. Indeed, if we would plot the
likelihood function of equation (6) as a function of the parameters, we would find that it is
maximal at the most likely parameter values.
Therefore, the formula to compute the variance based on the sample data is simply
derived by finding the peak of the maximum likelihood function.
In the following paragraphs we will use this technique to obtain the MVU estimators of both
$\mu$ and $\sigma^2$. We consider two cases:
The first case assumes that the true mean of the distribution is known. Therefore, we
only need to estimate the variance and the problem then corresponds to finding the
maximum in a one-dimensional likelihood function, parameterized by $\sigma^2$. Although this
situation does not occur often in practice, it definitely has practical applications. For
instance, if we know that a signal (e.g. the color value of a pixel in an image) should have a
specific value, but the signal has been polluted by white noise (Gaussian noise with zero
mean), then the mean of the distribution is known and we only need to estimate the
variance.
The second case deals with the situation where both the true mean and the true variance
are unknown. This is the case you would encounter most and where you would obtain an
estimate of the mean and the variance based on your sample data.
In the next paragraphs we will show that each case results in a different MVU estimator.
More specifically, the first case requires the variance estimator to be normalized by N to be
MVU, whereas the second case requires division by N-1 to be MVU.
Estimating the variance if the mean is known
Parameter estimation
If the true mean $\mu$ of the distribution is known, only the variance needs to be estimated, by
maximizing the likelihood function of equation (6) with respect to $\sigma^2$:

(8) $\hat{\sigma}^2 = \arg\max_{\sigma^2} P(\vec{x}; \mu, \sigma^2)$

However, calculating the derivative of $P(\vec{x}; \mu, \sigma^2)$, defined by equation (6), is rather involved
due to the exponent in the function. In fact, it is much easier to maximize the log-likelihood
function instead of the likelihood function itself. Since the logarithm is a monotonic
function, the maximum will be the same. Therefore, we solve the following problem
instead:
(9) $\hat{\sigma}^2 = \arg\max_{\sigma^2} \ln P(\vec{x}; \mu, \sigma^2)$

In the following we set $s = \sigma^2$ to obtain a simpler notation. To find the maximum of the
log-likelihood function, we simply calculate the derivative of the logarithm of equation (6)
with respect to $s$ and set it to zero:

$\frac{\partial}{\partial s} \ln P(\vec{x}; \mu, s) = -\frac{N}{2s} + \frac{1}{2s^2}\sum_{i=1}^{N}(x_i - \mu)^2 = 0$

It is clear that this equation is solved by:

(10) $\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Note that this maximum likelihood estimator for the variance divides by N.
However, the maximum likelihood method does not guarantee to deliver an unbiased
estimator. On the other hand, if the obtained estimator is unbiased, then the maximum
likelihood method does guarantee that the estimator is also minimum variance and thus
MVU. Therefore, we need to check if the estimator in equation (10) is unbiased.
Performance evaluation
To check if the estimator defined by equation (10) is unbiased, we need to check if the
condition of equation (7) holds, and thus if $E[\hat{\sigma}^2] = \sigma^2$. We plug in equation (10) and write:

$E[\hat{\sigma}^2] = E\left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\right] = \frac{1}{N}\sum_{i=1}^{N} E[(x_i - \mu)^2] = \frac{1}{N} N \sigma^2 = \sigma^2$

Since $E[\hat{\sigma}^2] = \sigma^2$, the condition shown by equation (7) holds, and therefore the obtained
estimator for the variance of the data is unbiased. Furthermore, because the maximum
likelihood method guarantees that an unbiased estimator is also minimum variance (MVU),
this means that no other estimator exists that can do better than the one obtained here.
Therefore, we have to divide by N instead of N-1 while calculating the variance of
normally distributed data, if the true mean of the underlying distribution is known.
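A small simulation illustrates this case: when deviations are taken from the known true mean, the 1/N estimator shows no bias on average (a Python sketch; the distribution parameters are arbitrary):

```python
import random

random.seed(7)

TRUE_MEAN, TRUE_VAR = 0.0, 4.0   # e.g. white Gaussian noise
N, TRIALS = 5, 200_000

total = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_VAR ** 0.5) for _ in range(N)]
    # Equation (10): deviations from the KNOWN mean, divided by N.
    total += sum((x - TRUE_MEAN) ** 2 for x in sample) / N

print(total / TRIALS)  # close to the true variance 4.0: no bias
```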
Estimating the variance if the mean is unknown
Parameter estimation
If the true mean of the distribution is unknown, it has to be estimated as well. Setting the
derivative of the log-likelihood function with respect to $\mu$ to zero yields:

$\frac{\partial}{\partial \mu} \ln P(\vec{x}; \mu, s) = \frac{1}{s}\sum_{i=1}^{N}(x_i - \mu) = 0$

If $s \neq 0$, then it is clear that the above equation only has a solution if:

(11) $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$

Note that indeed this is the well known formula to calculate the mean of a distribution.
Although we all knew this formula, we now proved that it is the maximum likelihood
estimator for the true and unknown mean of a normal distribution. For now, we will just
assume that the estimator that we found earlier for the variance, defined by equation (10),
is still the MVU variance estimator. In the next section however, we will show that this
estimator is no longer unbiased now.
Performance evaluation
To check if the estimator $\hat{\mu}$ for the true mean is unbiased, we need to check if
the condition of equation (7) holds:

$E[\hat{\mu}] = E\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N} E[x_i] = \frac{1}{N} N \mu = \mu$

Since $E[\hat{\mu}] = \mu$, this means that the obtained estimator for the mean of the distribution is
unbiased. Since the maximum likelihood method guarantees to deliver the minimum
variance estimator if the estimator is unbiased, we proved that $\hat{\mu}$ is the MVU estimator of
the mean.
To check if the earlier found estimator for the variance is still unbiased if it is based on
the empirical mean $\hat{\mu}$ instead of the true mean $\mu$, we simply plug the obtained
estimator $\hat{\mu}$ into the earlier derived estimator of equation (10):

$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$

To check if this estimator is still unbiased, we now need to check again if the condition of
equation (7) holds:

$E[\hat{\sigma}^2] = E\left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2\right]$
An important property of variance is that the true variance $\sigma^2$ can be written
as $\sigma^2 = E[x^2] - \mu^2$, such that $E[x^2] = \sigma^2 + \mu^2$. Using this property in the above equation yields:

$E[\hat{\sigma}^2] = \frac{N-1}{N}\sigma^2$

Since clearly $E[\hat{\sigma}^2] \neq \sigma^2$, this shows that the estimator for the variance of the distribution is no
longer unbiased. In fact, this estimator on average underestimates the true variance with
a factor $\frac{N-1}{N}$. As the number of samples approaches infinity ($N \to \infty$), this bias
converges to zero. For small sample sets however, the bias is significant and should be
eliminated.
Fixing the bias
Since the biased estimator on average underestimates the true variance with a factor $\frac{N-1}{N}$,
we can fix the bias by multiplying with $\frac{N}{N-1}$, and thus estimate $\sigma^2$ as follows:

$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$
This estimator is now unbiased and indeed resembles the traditional formula to calculate
the variance, where we divide by N-1 instead of N. However, note that the resulting
estimator is no longer the minimum variance estimator, but it is the estimator with the
minimum variance amongst all unbiased estimators. If we divide by N, then the estimator
is biased, and if we divide by N-1, the estimator is not the minimum variance estimator.
However, in general having a biased estimator is much worse than having a slightly higher
variance estimator. Therefore, if the mean of the population is unknown, division
by N-1 should be used instead of division by N.
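The bias of the 1/N estimator when the mean is estimated from the same data can be demonstrated with a small simulation (a Python sketch; the distribution parameters are arbitrary):

```python
import random

random.seed(42)

TRUE_MEAN, TRUE_VAR = 10.0, 9.0
N, TRIALS = 5, 200_000

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_VAR ** 0.5) for _ in range(N)]
    m = sum(sample) / N              # empirical mean, equation (11)
    ss = sum((x - m) ** 2 for x in sample)
    sum_biased += ss / N             # divides by N: biased
    sum_unbiased += ss / (N - 1)     # divides by N-1: unbiased

# On average the 1/N estimator underestimates by (N-1)/N = 0.8.
print(sum_biased / TRIALS)    # close to 0.8 * 9 = 7.2
print(sum_unbiased / TRIALS)  # close to the true variance 9.0
```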
Conclusion
In this article, we showed where the usual formulas for calculating the mean and the
variance of normally distributed data come from. Furthermore, we have proven that the
normalization factor in the variance estimator formula should be N if the true mean of the
population is known, and should be N-1 if the mean itself also has to be estimated from the data.
1 Introduction
2 PCA as a decorrelation method
3 PCA as an orthogonal regression method
4 A practical PCA application: Eigenfaces
5 The PCA recipe
o 5.1 1) Center the data
o 5.2 2) Normalize the data
o 5.3 3) Calculate the eigendecomposition
o 5.4 4) Project the data
6 PCA pitfalls
7 Source Code
8 Conclusion
Introduction
In this article, we discuss how Principal Component Analysis (PCA) works, and how it can be used as a
dimensionality reduction technique for classification problems. At the end of this article, Matlab source
code is provided for demonstration purposes.
In an earlier article, we discussed the so called Curse of Dimensionality and showed that classifiers tend to
overfit the training data in high dimensional spaces. The question then arises which features should be
preferred and which ones should be removed from a high dimensional feature vector.
If all features in this feature vector were statistically independent, one could simply eliminate the least
discriminative features from this vector. The least discriminative features can be found by various
greedy feature selection approaches. However, in practice, many features depend on each other or on an
underlying unknown variable. A single feature could therefore represent a combination of multiple types of
information by a single value. Removing such a feature would remove more information than needed. In
the next paragraphs, we introduce PCA as a feature extraction solution to this problem, and introduce its
inner workings from two different perspectives.
PCA as a decorrelation method
As an example, consider the case where our feature vector contains the R, G and B color components of
an image pixel. All three components are affected by the brightness of the incoming
light. As a result, the R, G, B components of a pixel are statistically correlated. Therefore, simply eliminating
the R component from the feature vector also implicitly removes information about the G and B channels.
In other words, before eliminating features, we would like to transform the complete feature space such
that the underlying uncorrelated components are obtained.
Consider the following example of a 2D feature space. The two features, illustrated by figure 1, are
clearly correlated: their covariance matrix has large off-diagonal covariances.
In an earlier article we discussed the geometric interpretation of the covariance matrix. We saw that the
covariance matrix can be decomposed as a sequence of rotation and scaling operations on white,
uncorrelated data, where the rotation matrix is defined by the eigenvectors of this covariance matrix.
Therefore, intuitively, it is easy to see that the data shown in figure 1 can be decorrelated by rotating
each data point such that the eigenvectors become the new reference axes:

(1) $D' = V^{\intercal} D$

where the columns of $V$ contain the eigenvectors of the covariance matrix of the data $D$.
In fact, the original data used in this example and shown by figure 1 was generated by linearly combining
two underlying 1D Gaussian feature vectors. Since the observed features are linear combinations of these
unknown underlying components, directly eliminating either observed feature would have removed some
information from both components. Instead, rotating the data by the eigenvectors of its covariance matrix
allowed us to directly recover the independent components (up to a scaling factor). This can
be seen as follows. The eigenvectors of the covariance matrix of the original data are (each column
represents an eigenvector):

$V = \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix}$

The first thing to notice is that $V$ in this case is a rotation matrix, corresponding to a rotation of 45 degrees
(cos(45)=0.7071), which indeed is evident from figure 1. Secondly, treating $V$ as a linear transformation
matrix results in a new coordinate system, such that each new feature is expressed as a linear
combination of the original features x and y:
(2) $x' = 0.7071\,x + 0.7071\,y$

and

(3) $y' = -0.7071\,x + 0.7071\,y$
In other words, decorrelation of the feature space corresponds to the recovery of the unknown,
uncorrelated components of the data (up to an unknown scaling factor if the transformation
matrix was not orthogonal). Once these components have been recovered, it is easy to reduce the
dimensionality of the feature space by simply eliminating either one of them.
In the above example we started with a two-dimensional problem. If we would like to reduce the
dimensionality, the question remains which of the two recovered components should be eliminated.
Although this choice could depend on many factors such as the separability of the data in case of
classification problems, PCA simply assumes that the most interesting feature is the one with the largest
variance or spread. This assumption is based on an information theoretic point of view, since the dimension
with the largest variance corresponds to the dimension with the largest entropy and thus encodes the
most information. The smallest eigenvectors will often simply represent noise components, whereas the
largest eigenvectors often correspond to the principal components that define the data.
Dimensionality reduction by means of PCA is then accomplished simply by projecting the data onto the
largest eigenvectors of its covariance matrix. For the above example, the resulting 1D feature space is
illustrated by figure 3:
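The decorrelation step above can be sketched in Python (the article's demonstration code is in Matlab; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2D data: a 45-degree rotation of two
# independent 1D Gaussian components with different variances.
components = np.vstack([rng.normal(0, 3, 50_000),
                        rng.normal(0, 1, 50_000)])
angle = np.deg2rad(45)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
data = rotation @ components

# Decorrelate: project the data onto the eigenvectors of its
# covariance matrix (rotate so the eigenvectors become the axes).
eigenvalues, V = np.linalg.eigh(np.cov(data))
decorrelated = V.T @ data

# The covariance of the projected data is diagonal: the
# off-diagonal covariances vanish.
print(np.cov(decorrelated))
```

Keeping only the row of `decorrelated` with the largest variance is exactly the projection onto the largest eigenvector described above.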
Figure 4. 3D data projected onto a 2D or 1D linear subspace by means of Principal Component Analysis.
In general, PCA allows us to obtain a linear M-dimensional subspace of the original N-dimensional data,
where $M \le N$. Furthermore, if the unknown, uncorrelated components are Gaussian distributed, then
PCA actually acts as an independent component analysis since uncorrelated Gaussian variables are
statistically independent. However, if the underlying components are not normally distributed, PCA merely
generates decorrelated variables which are not necessarily statistically independent. In this case, non-linear
dimensionality reduction algorithms might be a better choice.
PCA as an orthogonal regression method
In the previous section, dimensionality reduction was obtained by projecting the data onto the largest
eigenvectors of its covariance matrix. Alternatively, we can look for the vector such that projecting the
data onto this vector corresponds to a projection error that is lower than
the projection error that would be obtained when projecting the data onto any other possible vector. The
question is then how to find this optimal vector.
Consider the example shown by figure 5. Three different projection vectors are shown, together with the
resulting 1D data. In the next paragraphs, we will discuss how to determine which projection vector
minimizes the projection error. Before searching for a vector that minimizes the projection error, we have to
define this error function.
In classical least squares regression, we fit a linear model $f(x) = ax + b$ to the data such that the sum
of squared errors $\sum_i (f(x_i) - y_i)^2$ is minimized. In other words, if $x$ is treated as the independent
variable, then the obtained regressor $f(x)$ is a linear function that can predict the dependent
variable $y$ such that the squared error is minimal. The resulting model is illustrated by the blue line in
figure 5, and the error that is minimized is illustrated in figure 6.
Figure 6. Linear regression where x is the independent variable and y is the dependent variable,
corresponds to minimizing the vertical projection error.
However, in the context of feature extraction, one might wonder why we would define feature $x$ as the
independent variable and feature $y$ as the dependent variable. In fact, we could easily define $y$ as the
independent variable and find a linear function $g(y)$ that predicts the dependent variable $x$, such
that $\sum_i (g(y_i) - x_i)^2$ is minimized. This corresponds to minimization of the horizontal projection
error and results in a different linear model, as shown by figure 7:
Figure 7. Linear regression where y is the independent variable and x is the dependent variable,
corresponds to minimizing the horizontal projection error.
Clearly, the choice of independent and dependent variables changes the resulting model, making ordinary
least squares regression an asymmetric regressor. The reason for this is that least squares regression
assumes the independent variable to be noise-free, whereas the dependent variable is assumed to be
noisy. However, in the case of classification, all features are usually noisy observations, such that
neither $x$ nor $y$ should be treated as independent. In fact, we would like to obtain a model that
minimizes both the horizontal and the vertical projection error simultaneously. This corresponds to finding
a model such that the orthogonal projection error is minimized, as shown by figure 8.
Figure 8. Linear regression where both variables are independent corresponds to minimizing the orthogonal
projection error.
The resulting regression is called Total Least Squares regression or orthogonal regression, and assumes that
both variables are imperfect observations. An interesting observation is now that the obtained vector,
representing the projection direction that minimizes the orthogonal projection error, corresponds to the
largest principal component of the data:
Figure 9. The vector which the data can be projected unto with minimal orthogonal error corresponds to
the largest eigenvector of the covariance matrix of the data.
In other words, if we want to reduce the dimensionality by projecting the original data onto a vector such
that the squared projection error is minimized in all directions, we can simply project the data onto the
largest eigenvectors. This is exactly what we called Principal Component Analysis in the previous section,
where we showed that such projection also decorrelates the feature space.
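This equivalence can be checked numerically. The following sketch (Python/NumPy rather than the Matlab used later in this article; all names are illustrative) generates correlated 2D data and verifies that no projection direction achieves a smaller total squared orthogonal error than the largest eigenvector of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: y depends linearly on x, plus a little noise.
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)
centered = np.column_stack([x, y])
centered = centered - centered.mean(axis=0)

# Largest eigenvector of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
v_max = eigvecs[:, np.argmax(eigvals)]

def orthogonal_error(direction):
    """Total squared distance of the points to the line through
    the origin with the given unit direction."""
    proj = centered @ direction
    residual = centered - np.outer(proj, direction)
    return (residual ** 2).sum()

# Scan many candidate directions: none beats the largest eigenvector.
angles = np.linspace(0.0, np.pi, 1000)
errors = [orthogonal_error(np.array([np.cos(a), np.sin(a)])) for a in angles]
assert orthogonal_error(v_max) <= min(errors) + 1e-6
```

The scan over angles is only there as a sanity check; the eigendecomposition alone already yields the optimal direction.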
Consider, for example, the problem of face recognition, where each 32×32 grayscale image is unrolled into a 1024-dimensional feature vector. To identify the person in a new image, we could simply compute the Euclidean distance between its 1024-dimensional vector and the feature vectors of the people in our training dataset. The smallest distance then tells us which person we are looking at.
However, operating in a 1024-dimensional space becomes problematic if we only have a few hundred
training samples. Furthermore, Euclidean distances behave strangely in high dimensional spaces as
discussed in an earlier article. Therefore, we could use PCA to reduce the dimensionality of the feature
space by calculating the eigenvectors of the covariance matrix of the set of 1024-dimensional feature
vectors, and then projecting each feature vector onto the largest eigenvectors.
Since an eigenvector of 2D data is 2-dimensional, and an eigenvector of 3D data is 3-dimensional, the eigenvectors of 1024-dimensional data are themselves 1024-dimensional. In other words, we could reshape each of the 1024-dimensional eigenvectors into a 32×32 image for visualization purposes. Figure 10 shows the first four eigenvectors obtained by eigendecomposition of the Cambridge face dataset:
eigenvectors obtained by eigendecomposition of the Cambridge face dataset:
Figure 10. The four largest eigenvectors, reshaped to images, resulting in so called EigenFaces.
(source:https://nl.wikipedia.org/wiki/Eigenface)
Each 1024-dimensional feature vector (and thus each face) can now be projected onto the N largest
eigenvectors, and can be represented as a linear combination of these eigenfaces. The weights of these
linear combinations determine the identity of the person. Since the largest eigenvectors represent the
largest variance in the data, these eigenfaces describe the most informative image regions (eyes, nose,
mouth, etc.). By only considering the first N (e.g. N=70) eigenvectors, the dimensionality of the feature
space is greatly reduced.
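The projection step described above can be sketched as follows (Python/NumPy; small random vectors stand in for the 1024-dimensional face vectors, so the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the training set: 100 samples of dimension 64
# (the article uses 1024-dimensional face vectors).
D, n_samples, N = 64, 100, 8
faces = rng.normal(size=(n_samples, D))
mean_face = faces.mean(axis=0)

# Eigendecomposition of the covariance matrix of the training set.
cov = np.cov((faces - mean_face).T)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort eigenvalues, largest first
eigenfaces = eigvecs[:, order[:N]]      # keep the N largest (D x N)

# Each face is represented by N weights: its projection onto the
# eigenfaces. These weights are the coefficients of the linear
# combination of eigenfaces that approximates the face.
weights = (faces - mean_face) @ eigenfaces      # (n_samples x N)

# Approximate reconstruction: mean face + weighted sum of eigenfaces.
reconstruction = mean_face + weights @ eigenfaces.T
print(weights.shape)  # (100, 8): dimensionality reduced from 64 to 8
```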
The remaining question is now how many eigenfaces should be used or, in the general case, how many
eigenvectors should be kept. Removing too many eigenvectors might remove important information from the
feature space, whereas eliminating too few eigenvectors leaves us with the curse of dimensionality.
Regrettably there is no straight answer to this problem. Although cross-validation techniques can be used
to obtain an estimate of this hyperparameter, choosing the optimal number of dimensions remains a
problem that is mostly solved in an empirical (an academic term that means not much more than trial-and-error) manner. Note that it is often useful to check how much (as a percentage) of the variance of the original data is kept while eliminating eigenvectors.
original data is kept while eliminating eigenvectors. This is done by dividing the sum of the kept eigenvalues
by the sum of all eigenvalues.
$$s = \frac{\sum_{i=1}^{N} \lambda_i}{\sum_{i=1}^{D} \lambda_i} \qquad (4)$$

where $N$ is the number of eigenvalues that are kept and $D$ is the dimensionality of the original feature space.
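As a minimal sketch of this retained-variance check (Python/NumPy; the function name `retained_variance` is illustrative):

```python
import numpy as np

def retained_variance(eigenvalues, n_kept):
    """Fraction of the total variance that is kept when only the
    n_kept largest eigenvalues (and their eigenvectors) are retained."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return ev[:n_kept].sum() / ev.sum()

# Example: eigenvalues 4, 2, 1, 1 -> keeping the 2 largest retains 75%.
print(retained_variance([4, 2, 1, 1], 2))  # 0.75
```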
PCA pitfalls
In the above discussion, several assumptions have been made. In the first section, we discussed how PCA
decorrelates the data. In fact, we started the discussion by expressing our desire to recover the unknown,
underlying independent components of the observed features. We then assumed that our data was
normally distributed, such that statistical independence simply corresponds to the lack of a linear
correlation. Indeed, PCA allows us to decorrelate the data, thereby recovering the independent
components in case of Gaussianity. However, it is important to note that decorrelation only corresponds to
statistical independency in the Gaussian case. Consider the data obtained by sampling half a period
of
:
Figure 11. Uncorrelated data is only statistically independent if normally distributed. In this example a clear
non-linear dependency still exists: y=sin(x).
Although the above data is clearly uncorrelated (on average, the y-value increases as much as it decreases
when the x-value goes up) and therefore corresponds to a diagonal covariance matrix, there still is a clear
non-linear dependency between both variables.
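This can be checked numerically. In the sketch below (Python/NumPy), the sample covariance between x and y is (numerically) zero even though y is a deterministic function of x:

```python
import numpy as np

# Sample half a period of y = sin(x), symmetrically around x = pi/2.
x = np.linspace(0.0, np.pi, 1001)
y = np.sin(x)

# The off-diagonal entry of the covariance matrix is numerically zero:
# on average, y increases as much as it decreases when x goes up.
cov_xy = np.cov(x, y)[0, 1]
assert abs(cov_xy) < 1e-10

# Yet the variables are clearly dependent: y is fully determined by x.
assert np.allclose(y, np.sin(x))
```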
In general, PCA only decorrelates the data but does not remove statistical dependencies. If the underlying
components are known to be non-Gaussian, techniques such as ICA could be more interesting. On the
other hand, if non-linearities clearly exist, dimensionality reduction techniques such as non-linear PCA can
be used. However, keep in mind that these methods are prone to overfitting themselves, since more
parameters are to be estimated based on the same amount of training data.
A second assumption that was made in this article, is that the most discriminative information is captured
by the largest variance in the feature space. Since the direction of the largest variance encodes the most
information, this is likely to be true. However, there are cases where the discriminative information actually
resides in the directions of the smallest variance, such that PCA could greatly hurt classification
performance. As an example, consider the two cases of figure 12, where we reduce the 2D feature space to
a 1D representation:
Figure 12. In the first case, PCA would hurt classification performance because the data becomes linearly inseparable. This happens when the most discriminative information resides in the smaller eigenvectors.
If the most discriminative information is contained in the smaller eigenvectors, applying PCA might actually
worsen the Curse of Dimensionality because now a more complicated classification model (e.g. non-linear
classifier) is needed to classify the lower dimensional problem. In this case, other dimensionality reduction
methods might be of interest, such as Linear Discriminant Analysis (LDA), which tries to find the projection
vector that optimally separates the two classes.
Source Code
The following code snippet shows how to perform principal component analysis for dimensionality
reduction in Matlab:
Matlab source code
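For readers without Matlab, the same procedure can be sketched in Python/NumPy (center the data, eigendecompose the sample covariance matrix, and project onto the largest eigenvectors); the function name `pca_reduce` is illustrative, not taken from the original listing:

```python
import numpy as np

def pca_reduce(data, n_components):
    """Reduce 'data' (n_samples x n_features) to n_components
    dimensions by projecting onto the largest eigenvectors of the
    sample covariance matrix."""
    mean = data.mean(axis=0)
    centered = data - mean
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # largest first
    components = eigvecs[:, order[:n_components]]  # (n_features x n_components)
    return centered @ components, components, mean

# Toy usage: reduce 5-dimensional data to 2 dimensions.
rng = np.random.default_rng(42)
data = rng.normal(size=(200, 5))
reduced, components, mean = pca_reduce(data, 2)
print(reduced.shape)  # (200, 2)
```

The returned `components` and `mean` allow new samples to be projected into the same reduced space.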
Conclusion
In this article, we discussed the advantages of PCA for feature extraction and dimensionality reduction from
two different points of view. The first point of view explained how PCA allows us to decorrelate the feature
space, whereas the second point of view showed that PCA actually corresponds to orthogonal regression.
Furthermore, we briefly introduced Eigenfaces as a well known example of PCA based feature extraction,
and we covered some of the most important disadvantages of Principal Component Analysis.
1 Introduction
2 Calculating the eigenvalues
3 Calculating the first eigenvector
4 Calculating the second eigenvector
5 Conclusion
Introduction
Eigenvectors and eigenvalues have many important applications in computer vision and machine learning
in general. Well known examples are PCA (Principal Component Analysis) for dimensionality reduction
or EigenFaces for face recognition. An interesting use of eigenvectors and eigenvalues is also illustrated in
my post about error ellipses. Furthermore, eigendecomposition forms the basis of the geometric interpretation of covariance matrices, discussed in a more recent post. In this article, I will provide a
gentle introduction into this mathematical concept, and will show how to manually obtain the
eigendecomposition of a 2D square matrix.
An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.
Consider the image below in which three vectors are shown. The green square is only drawn to illustrate
the linear transformation that is applied to each of these three vectors.
Eigenvectors (red) do not change direction when a linear transformation (e.g. scaling) is applied to them.
Other vectors (yellow) do.
The transformation in this case is a simple scaling with factor 2 in the horizontal direction and factor 0.5 in
the vertical direction, such that the transformation matrix $A$ is defined as:

$$A = \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}$$

A vector $v = (x, y)$ is then scaled by applying this transformation as $v' = A\,v$. The above figure
shows that the direction of some vectors (shown in red) is not affected by this linear transformation. These
vectors are called eigenvectors of the transformation, and uniquely define the square matrix $A$. This unique, deterministic relation is exactly the reason that those vectors are called eigenvectors ('eigen' means 'specific' in German).
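For the scaling transformation above, this is easy to verify numerically (a Python/NumPy sketch; the helper name `same_direction` is illustrative):

```python
import numpy as np

# Scaling by factor 2 horizontally and factor 0.5 vertically.
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])

def same_direction(v, w):
    """True if w is a (positive or negative) scalar multiple of v."""
    v = v / np.linalg.norm(v)
    w = w / np.linalg.norm(w)
    return bool(np.isclose(abs(v @ w), 1.0))

e1 = np.array([1.0, 0.0])   # horizontal vector: an eigenvector of A
d  = np.array([1.0, 1.0])   # diagonal vector: not an eigenvector

assert same_direction(e1, A @ e1)    # direction preserved (scaled by 2)
assert not same_direction(d, A @ d)  # direction changes: A d = (2, 0.5)
```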
In general, the eigenvector $v$ of a matrix $A$ is the vector for which the following holds:

$$A\,v = \lambda\,v \qquad (1)$$

where $\lambda$ is a scalar value called the eigenvalue. This means that the effect of the linear transformation $A$ on the vector $v$ is completely defined by $\lambda$.
We can rewrite equation (1) as follows:

$$(A - \lambda I)\,v = 0 \qquad (2)$$

where $I$ is the identity matrix of the same dimensions as $A$.
However, assuming that $v$ is not the null-vector, equation (2) can only hold if $(A - \lambda I)$ is not invertible. If a square matrix is not invertible, its determinant must equal zero. Therefore, to find the eigenvectors of $A$, we simply have to solve the following equation:
$$\det(A - \lambda I) = 0 \qquad (3)$$
In the following sections we will determine the eigenvectors and eigenvalues of a matrix $A$ by solving equation (3). Matrix $A$ in this example is defined by:

$$A = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \qquad (4)$$
Calculating the eigenvalues
Substituting $A$ into equation (3) gives:

$$\det(A - \lambda I) = \det \begin{bmatrix} 2-\lambda & 3 \\ 2 & 1-\lambda \end{bmatrix} = 0 \qquad (5)$$
Calculating the determinant gives:
$$(2-\lambda)(1-\lambda) - 6 = \lambda^2 - 3\lambda - 4 = 0 \qquad (6)$$
To solve this quadratic equation in $\lambda$, we find the discriminant:

$$D = b^2 - 4ac = (-3)^2 - 4 \cdot 1 \cdot (-4) = 25$$

Since the discriminant is strictly positive, two different values for $\lambda$ exist:

$$\lambda_1 = \frac{-b + \sqrt{D}}{2a} = \frac{3 + 5}{2} = 4, \qquad \lambda_2 = \frac{-b - \sqrt{D}}{2a} = \frac{3 - 5}{2} = -1 \qquad (7)$$
We have now determined the two eigenvalues $\lambda_1$ and $\lambda_2$. Note that a square matrix of size $N \times N$ always has exactly $N$ eigenvalues, each with a corresponding eigenvector. The eigenvalue specifies the scaling associated with its eigenvector.
Calculating the first eigenvector
We can now determine the first eigenvector by substituting $\lambda_1 = 4$ into equation (2):

$$(A - 4I)\,v = \begin{bmatrix} -2 & 3 \\ 2 & -3 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Since this is simply the matrix notation for a system of equations, we can write it in its equivalent form:

$$\begin{cases} -2x + 3y = 0 \\ 2x - 3y = 0 \end{cases} \qquad (8)$$

and solve the first equation as a function of $x$, resulting in:

$$y = \tfrac{2}{3}x \qquad (9)$$
Since an eigenvector simply represents an orientation (the corresponding eigenvalue represents the magnitude), all scalar multiples of the eigenvector are vectors that are parallel to this eigenvector, and are therefore equivalent (if we would normalize the vectors, they would all be equal). Thus, instead of further solving the above system of equations, we can freely choose a real value for either $x$ or $y$, and determine the other one by using equation (9).
For this example, we arbitrarily choose $x = 3$, such that $y = 2$. The eigenvector that corresponds to eigenvalue $\lambda_1 = 4$ is therefore:

$$v_1 = \begin{bmatrix} 3 \\ 2 \end{bmatrix} \qquad (10)$$

Calculating the second eigenvector
In a similar manner, we find the second eigenvector by substituting $\lambda_2 = -1$ into equation (2):

$$(A + I)\,v = \begin{bmatrix} 3 & 3 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \qquad (11)$$
Written as a system of equations, this is equivalent to:

$$\begin{cases} 3x + 3y = 0 \\ 2x + 2y = 0 \end{cases} \qquad (12)$$

Solving the first equation as a function of $x$ results in:

$$y = -x \qquad (13)$$

We then arbitrarily choose $x = 1$, and find $y = -1$. The eigenvector that corresponds to eigenvalue $\lambda_2 = -1$ is therefore:

$$v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad (14)$$
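The hand computation can be cross-checked numerically (a Python/NumPy sketch, using the example matrix and the eigenpairs derived above; note that `numpy.linalg.eig` may return the eigenpairs in a different order and with normalized eigenvectors):

```python
import numpy as np

# The example matrix from equation (4).
A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# Verify the hand-computed eigenpairs: A v = lambda v.
for lam, v in [(4.0, np.array([3.0, 2.0])),
               (-1.0, np.array([1.0, -1.0]))]:
    assert np.allclose(A @ v, lam * v)

# Compare against NumPy's eigendecomposition.
eigvals, eigvecs = np.linalg.eig(A)
assert np.allclose(sorted(eigvals), [-1.0, 4.0])
```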
Conclusion
In this article we reviewed the theoretical concepts of eigenvectors and eigenvalues. These concepts are of
great importance in many techniques used in computer vision and machine learning, such as
dimensionality reduction by means of PCA, or face recognition by means of EigenFaces.