Informatics 2B: Learning and Data, Note 10
Discriminant functions
Hiroshi Shimodaira

4 March 2015
In the previous chapter we saw how we can combine a Gaussian probability density function with class prior probabilities using Bayes’ theorem to estimate class-conditional posterior probabilities. For each point in the input space we can estimate the posterior probability of each class, assigning that point to the class with the maximum posterior probability. We can view this process as dividing the input space into decision regions, separated by decision boundaries. In the next section we investigate whether the maximum posterior probability rule is indeed the best decision rule (in terms of minimising the number of errors). In the following sections we introduce discriminant functions, which define the decision boundaries, and investigate the form of the decision functions induced by Gaussian pdfs with different constraints on the covariance matrix.

Figure 1: Decision regions for the three-class two-dimensional problem from the previous chapter. Class A (red), class B (blue), class C (cyan).
Consider a two-class problem in which points falling in region R_1 are classified as class c_1 and points falling in region R_2 are classified as class c_2. A classification error occurs whenever a point belonging to class c_1 falls in R_2, or a point belonging to class c_2 falls in R_1. Thus the probability of the total error may be written as:

P(\text{error}) = P(x \in R_2, c_1) + P(x \in R_1, c_2)
               = \int_{R_2} p(x \mid c_1) P(c_1) \, dx + \int_{R_1} p(x \mid c_2) P(c_2) \, dx .

This error probability is minimised by assigning each point x to the class for which p(x \mid c) P(c), and hence the posterior probability P(c \mid x), is largest.
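As a quick numerical illustration (not part of the original note), the following Python sketch estimates this total error probability for two made-up one-dimensional Gaussian classes by summing the misclassified probability mass on a dense grid; the class means, variances, priors and grid limits are all illustrative assumptions.

import numpy as np

def gauss_pdf(x, mu, sigma):
    """One-dimensional Gaussian probability density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Illustrative (made-up) class-conditional densities and priors.
mu1, sigma1, P1 = -1.0, 1.0, 0.6   # class c1
mu2, sigma2, P2 = 2.0, 1.5, 0.4    # class c2

x = np.linspace(-12.0, 14.0, 20001)          # dense grid covering both pdfs
dx = x[1] - x[0]
joint1 = gauss_pdf(x, mu1, sigma1) * P1      # p(x | c1) P(c1)
joint2 = gauss_pdf(x, mu2, sigma2) * P2      # p(x | c2) P(c2)

# Maximum posterior rule: R1 is where x is assigned to c1, R2 is the rest.
R1 = joint1 > joint2

# P(error) = mass of c2 falling in R1 plus mass of c1 falling in R2.
p_error = np.sum(np.where(R1, joint2, joint1)) * dx
print("estimated probability of total error:", round(p_error, 4))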
2 Discriminant functions

If we have a set of K classes then we may define a set of K discriminant functions y_k(x), one for each class. Data point x is assigned to class c if

y_c(x) > y_k(x)   for all k \neq c .

In other words: assign x to the class c whose discriminant function y_c(x) is biggest.

This is precisely what we did in the previous chapter when classifying based on the values of the log posterior probability. Thus the log posterior probability of class c given a data point x is a possible discriminant function:

y_c(x) = \ln P(c \mid x) = \ln p(x \mid c) + \ln P(c) + \text{const.}
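This decision rule is easy to express directly in code. The following minimal Python sketch (ours, not from the note) defines the log-posterior discriminant for a few made-up Gaussian class models, using scipy.stats.multivariate_normal for the class-conditional densities, and assigns a point to the class whose discriminant is biggest.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative (made-up) class models: mean, covariance and prior for each class.
classes = {
    "A": {"mean": np.array([0.0, 0.0]),  "cov": np.eye(2), "prior": 0.5},
    "B": {"mean": np.array([3.0, 1.0]),  "cov": np.eye(2), "prior": 0.3},
    "C": {"mean": np.array([-2.0, 3.0]), "cov": np.eye(2), "prior": 0.2},
}

def discriminant(x, model):
    """Log-posterior discriminant y_c(x) = ln p(x | c) + ln P(c) (up to a constant)."""
    return multivariate_normal.logpdf(x, mean=model["mean"], cov=model["cov"]) + np.log(model["prior"])

def classify(x):
    """Assign x to the class whose discriminant function is biggest."""
    return max(classes, key=lambda c: discriminant(x, classes[c]))

print(classify(np.array([2.5, 0.5])))   # expected to land in class B's region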
The posterior probability could also be used as a discriminant function, with the same results: choosing the class with the largest posterior probability is an identical decision rule to choosing the class with the largest log posterior probability.

As discussed above, classifying a point as the class with the largest (log) posterior probability corresponds to the decision rule which minimises the probability of misclassification. In that sense, it forms an optimal discriminant function. A decision boundary occurs at points in the input space where discriminant functions are equal. If the region of input space classified as class c_k (R_k) and the region classified as class c_ℓ (R_ℓ) are contiguous, then the decision boundary separating them is given by:

y_k(x) = y_\ell(x) .

Let’s consider the case in which the Gaussian pdfs for each class all share the same covariance matrix. That is, for all classes c, Σ_c = Σ. In this case Σ is class-independent (since it is equal for all classes), therefore the term -\frac{1}{2} \ln |\Sigma| may also be dropped from the discriminant function and we have:

y_c(x) = -\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c) + \ln P(c) .

If we explicitly expand the quadratic matrix-vector expression we obtain the following:

y_c(x) = -\frac{1}{2} \left( x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_c - \mu_c^T \Sigma^{-1} x + \mu_c^T \Sigma^{-1} \mu_c \right) + \ln P(c) .   (4)

The mean µ_c depends on class c, but (as stated before) the covariance matrix is class-independent. Therefore, terms that do not include the mean or the prior probabilities are class-independent, and may be dropped. Thus we may drop x^T \Sigma^{-1} x from the discriminant.
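A small sketch (ours, with made-up means, covariance and priors) can confirm this numerically: for a shared covariance matrix, dropping the class-independent x^T \Sigma^{-1} x term changes the value of each discriminant but never which class attains the maximum.

import numpy as np

# Illustrative shared covariance and two made-up class means and priors.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
means  = {"A": np.array([0.0, 0.0]), "B": np.array([3.0, 1.0])}
priors = {"A": 0.6, "B": 0.4}

def y_full(x, c):
    """Quadratic discriminant: -1/2 (x - mu_c)^T Sigma^{-1} (x - mu_c) + ln P(c)."""
    d = x - means[c]
    return -0.5 * d @ Sigma_inv @ d + np.log(priors[c])

def y_dropped(x, c):
    """Same discriminant with the class-independent x^T Sigma^{-1} x term dropped."""
    mu = means[c]
    return mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[c])

rng = np.random.default_rng(0)
for x in rng.normal(size=(5, 2)) * 3:
    full = max(means, key=lambda c: y_full(x, c))
    dropped = max(means, key=lambda c: y_dropped(x, c))
    assert full == dropped   # dropping class-independent terms never changes the decision
    print(x.round(2), "->", full)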
Figure 3: Discriminant function for equal covariance Gaussians. Two classes, C1 and C2, in the (x1, x2) plane, separated by the decision boundary y1(x) = y2(x).

We can simplify this discriminant function further. It is a fact that for a symmetric matrix M and vectors a and b:

a^T M b = b^T M a .

Now since the covariance matrix Σ is symmetric, it follows that Σ^{-1} is also symmetric (it also follows that x^T \Sigma^{-1} x \ge 0 for any x). Therefore:

x^T \Sigma^{-1} \mu_c = \mu_c^T \Sigma^{-1} x .

We can thus simplify (4) as:

y_c(x) = \mu_c^T \Sigma^{-1} x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \ln P(c) .   (5)

This equation has three terms on the right hand side, but only the first depends on x. We can define two new variables w_c (a d-dimensional vector) and w_{c0}, which are derived from µ_c, P(c), and Σ:

w_c^T = \mu_c^T \Sigma^{-1}   (6)
w_{c0} = -\frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \ln P(c) = -\frac{1}{2} w_c^T \mu_c + \ln P(c) .   (7)

Substituting (6) and (7) into (5) we obtain:

y_c(x) = w_c^T x + w_{c0} .   (8)

This is a linear equation in d dimensions. We refer to w_c as the weight vector and w_{c0} as the bias for class c.

We have thus shown that the discriminant function for a Gaussian which shares the same covariance matrix with the Gaussian pdfs of all the other classes may be written as (8). We call such discriminant functions linear discriminants: they are linear functions of x. If x is two-dimensional, the decision boundaries will be straight lines, as illustrated in Figure 3. In three dimensions the decision boundaries will be planes. In d dimensions the decision boundaries are called hyperplanes.
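As an illustrative sketch (made-up means, covariance and priors, not taken from the note), the following Python code computes the weight vectors and biases of equations (6) and (7) and checks that the linear form (8) reproduces the discriminant (5).

import numpy as np

# Illustrative shared covariance, class means and priors (made-up values).
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
means  = {"A": np.array([0.0, 0.0]), "B": np.array([3.0, 1.0])}
priors = {"A": 0.6, "B": 0.4}

# Weight vector and bias per equations (6) and (7); since Sigma is symmetric, w_c = Sigma^{-1} mu_c.
w  = {c: Sigma_inv @ means[c] for c in means}
w0 = {c: -0.5 * w[c] @ means[c] + np.log(priors[c]) for c in means}

def y_linear(x, c):
    """Linear discriminant (8): y_c(x) = w_c^T x + w_c0."""
    return w[c] @ x + w0[c]

def y_eq5(x, c):
    """Discriminant as written in (5)."""
    mu = means[c]
    return mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[c])

x = np.array([1.5, -0.5])
for c in means:
    assert np.isclose(y_linear(x, c), y_eq5(x, c))   # (8) and (5) agree
print({c: round(y_linear(x, c), 3) for c in means})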
5 Spherical Gaussians with equal covariance

Let’s look at an even more constrained case, where not only do all the classes share a covariance matrix, but that covariance matrix is spherical: the off-diagonal terms (covariances) are all zero, and the diagonal terms (variances) are equal for all components. In this case the matrix may be defined by a single number, σ², the value of the variances:

\Sigma = \sigma^2 I
\Sigma^{-1} = \frac{1}{\sigma^2} I ,

where I is the identity matrix.

Since this is a special case of Gaussians with equal covariance, the discriminant functions are linear, and may be written as (8). However, we can get another view of the discriminant functions if we write them as:

y_c(x) = -\frac{\lVert x - \mu_c \rVert^2}{2\sigma^2} + \ln P(c) .   (9)

If the prior probabilities are equal for all classes, the decision rule simply assigns an unseen vector to the nearest class mean (using the Euclidean distance). In this case the class means may be regarded as class templates or prototypes.

Exercise: Show that (9) is indeed reduced to a linear discriminant.
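The nearest-mean interpretation is easy to check numerically. In this illustrative Python sketch (made-up class means, a spherical covariance σ²I and equal priors, all our own assumptions), the class that maximises (9) is always the class whose mean is closest in Euclidean distance.

import numpy as np

# Made-up class means; spherical covariance sigma^2 I and equal priors are assumed.
means = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [-2.0, 3.0]])
sigma2 = 1.5
log_prior = np.log(1.0 / len(means))   # equal priors

def y_spherical(x):
    """Discriminant (9) for every class: -||x - mu_c||^2 / (2 sigma^2) + ln P(c)."""
    return -np.sum((x - means) ** 2, axis=1) / (2 * sigma2) + log_prior

x = np.array([2.2, 0.3])
by_discriminant = np.argmax(y_spherical(x))
by_nearest_mean = np.argmin(np.linalg.norm(x - means, axis=1))
assert by_discriminant == by_nearest_mean
print("assigned to class", by_discriminant)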
6 Two-class linear discriminants

To get some more insight into linear discriminants, we can look at another special case: two-class problems. Two-class problems occur quite often in practice, and they are more straightforward to think about because we are considering a single decision boundary between the two classes.

In the two-class case it is possible to use a single discriminant function: for example, one which takes value zero at the decision boundary, negative values for one class and positive values for the other. A suitable discriminant function in this case is the log odds (log ratio of posterior probabilities):

y(x) = \ln \frac{P(c_1 \mid x)}{P(c_2 \mid x)} = \ln \frac{p(x \mid c_1)}{p(x \mid c_2)} + \ln \frac{P(c_1)}{P(c_2)}
     = \ln p(x \mid c_1) - \ln p(x \mid c_2) + \ln P(c_1) - \ln P(c_2) .   (10)

Feature vector x is assigned to class c_1 when y(x) > 0; x is assigned to class c_2 when y(x) < 0. The decision boundary is defined by y(x) = 0.

If the pdf for each class is a Gaussian, and the covariance matrix is shared, then the discriminant function is linear:

y(x) = w^T x + w_0 ,

where w is a function of the class-dependent means and the class-independent covariance matrix, and w_0 is a function of the means, the covariance matrix and the prior probabilities.

The decision boundary for the two-class linear discriminant corresponds to a (d − 1)-dimensional hyperplane in the input space. Let x_a and x_b be two points on the decision boundary. Then:

y(x_a) = 0 = y(x_b) .

And since y(x) is a linear discriminant:

w^T x_a + w_0 = 0 = w^T x_b + w_0 .

And a little rearranging gives us:

w^T (x_a - x_b) = 0 .   (11)
In three dimensions (11) is the equation of a plane, with w being the vector normal to the plane. In higher dimensions, this equation describes a hyperplane, and w is normal to any vector lying on the hyperplane. The hyperplane is the decision boundary in this two-class problem.

If x is a point on the hyperplane, then the normal distance from the hyperplane to the origin is given by:

\ell = \frac{w^T x}{\lVert w \rVert} = -\frac{w_0}{\lVert w \rVert}   (using y(x) = 0),

which is illustrated in Figure 4.

Figure 4: The two-class decision boundary y(x) = 0 in the (x1, x2) plane, with weight vector w normal to the boundary and normal distance −w0/||w|| from the origin to the boundary.
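To tie the pieces together, here is an illustrative Python sketch (a made-up two-class Gaussian model with a shared covariance, all parameters our own) that forms the two-class linear discriminant y(x) = w^T x + w_0 as the difference of the per-class discriminants from (6) and (7), uses its sign to classify a point, computes the normal distance -w_0/||w|| of the boundary from the origin, and verifies (11) for two points on the boundary.

import numpy as np

# Made-up two-class model with a shared covariance matrix.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mu1, mu2 = np.array([3.0, 1.0]), np.array([0.0, 0.0])
P1, P2 = 0.4, 0.6

# Two-class discriminant y(x) = w^T x + w0, obtained as y_1(x) - y_2(x) using (6) and (7).
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2 + np.log(P1 / P2)

def y(x):
    return w @ x + w0

print("assign to c1" if y(np.array([2.0, 1.0])) > 0 else "assign to c2")

# Normal distance from the origin to the decision boundary y(x) = 0: -w0 / ||w||.
print("distance from origin to boundary:", -w0 / np.linalg.norm(w))

# Any two points on the boundary satisfy w^T (xa - xb) = 0.
x0 = -w0 * w / (w @ w)            # the boundary point closest to the origin
v = np.array([-w[1], w[0]])       # direction perpendicular to w (lies in the boundary)
xa, xb = x0 + 1.0 * v, x0 - 2.0 * v
assert np.isclose(y(xa), 0) and np.isclose(y(xb), 0) and np.isclose(w @ (xa - xb), 0)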
Learning Objectives:
• Understand the concept and purpose of Linear Discriminant Analysis (LDA)
• Learn how LDA performs dimensionality reduction and classification
• Grasp the mathematical principles behind Fisher’s Linear Discriminant
• Explore LDA implementation and applications using Python
What is Linear Discriminant Analysis?
• Linear Discriminant Analysis (LDA) is a statistical technique for categorizing data into groups. It identifies patterns in features to distinguish between different classes. For instance, it may analyze characteristics like size and color to classify fruits as apples or oranges. LDA aims to find a straight line or plane that best separates these groups while minimizing overlap within each class. By maximizing the separation between classes, it enables accurate classification of new data points. In simpler terms, LDA helps make sense of data by finding the most efficient way to separate different categories. Consequently, this aids in tasks like pattern recognition and classification (a short Python example follows below).
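As a brief, non-prescriptive illustration of LDA in Python, the sketch below uses scikit-learn's LinearDiscriminantAnalysis on the Iris dataset (the choice of library and dataset is ours) both to reduce the four features to two discriminant directions and to classify held-out samples.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Load the Iris data and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit LDA as a classifier and as a projection onto at most (number of classes - 1) = 2 directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_proj = lda.fit_transform(X_train, y_train)   # dimensionality reduction: 4 -> 2
accuracy = lda.score(X_test, y_test)                 # classification on held-out data

print("projected training data shape:", X_train_proj.shape)
print("test accuracy:", round(accuracy, 3))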
Linear Discriminant Analysis in Machine Learning is a generalized form of Fisher’s Linear Discriminant (FLD). In his original paper, Fisher used a discriminant function to classify between two plant species, Iris Setosa and Iris Versicolor.
The basic idea of FLD is to project data points onto a line so as to maximize the between-class scatter and minimize the within-class scatter, enhancing the separation between the classes along a single linear dimension.
This might sound cryptic, but it is quite straightforward. Before we delve into the derivation, we need to familiarize ourselves with some terms and notation.
• Suppose we have d-dimensional data points x1, …, xn belonging to 2 classes C1 and C2, with N1 and N2 samples respectively.
• Consider W as a unit vector onto which we project the data points. Since we are only concerned with the direction, a unit vector suffices.
• Total number of samples: N = N1 + N2.
• If x(n) are the samples in the feature space, then W^T x(n) are the data points after projection.
• Means of classes before projection: m_i.
• Means of classes after projection: M_i = W^T m_i (these quantities are illustrated in the sketch below).
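The following illustrative Python sketch (made-up two-class data and an arbitrary unit direction W, chosen only for illustration) computes the quantities defined above: the class means m_i before projection, the projected samples W^T x(n), and the projected means M_i = W^T m_i.

import numpy as np

rng = np.random.default_rng(1)

# Made-up 2-D data for two classes (N1 and N2 samples).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))   # class C1
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(60, 2))   # class C2

# Unit projection vector W (an arbitrary direction; only the direction matters).
W = np.array([1.0, 1.0])
W = W / np.linalg.norm(W)

# Class means before projection (m_i) and after projection (M_i = W^T m_i).
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
M1, M2 = W @ m1, W @ m2

# Projected samples W^T x(n) for each class.
p1, p2 = X1 @ W, X2 @ W

print("m1 =", m1.round(3), " m2 =", m2.round(3))
print("M1 =", round(M1, 3), " M2 =", round(M2, 3))
print("separation of projected means:", round(abs(M1 - M2), 3))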
Scatter matrix: used to make estimates of the covariance matrix. It is a d × d positive semi-definite matrix, given by the sample covariance multiplied by the number of samples.
Note: scatter and variance measure the same thing on different scales, so the two words are sometimes used interchangeably.
Two Types of Scatter Matrices
Here we will deal with two types of scatter matrices:
• Between-class scatter, Sb: measures the separation between the class means.
• Within-class scatter, Sw: measures the spread of each class around its own mean (both are computed in the sketch below).
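As a hedged sketch (the same made-up two-class data as in the previous example), the code below computes the within-class scatter Sw by summing each class's scatter around its own mean, and the between-class scatter Sb using the common two-class form (m1 − m2)(m1 − m2)^T.

import numpy as np

rng = np.random.default_rng(1)

# Made-up two-class data (same setup as the earlier projection sketch).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(60, 2))
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_w: spread of each class around its own mean, summed over classes.
def scatter(X, m):
    D = X - m
    return D.T @ D                      # sum_n (x_n - m)(x_n - m)^T

Sw = scatter(X1, m1) + scatter(X2, m2)

# Between-class scatter S_b for the two-class case: separation between the class means.
d = (m1 - m2).reshape(-1, 1)
Sb = d @ d.T

print("S_w =\n", Sw.round(2))
print("S_b =\n", Sb.round(2))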