Classification
The first provides some explanation of how its classifications are reached; the second can be a 'black box' that makes a decision without any explanation. In many applications no explanation is required (no one cares how machines read postal (zip) codes, only that the envelope is correctly sorted) but in others, especially in medicine, some explanation may be necessary to get the methods adopted.
Classification is also central to data mining, although some of data mining is exploratory in the sense of Chapter 11. Hand et al. (2001) and (especially) Hastie et al. (2001) are pertinent introductions.
Some of the methods considered in earlier chapters are widely used for classification: classification trees, logistic regression for two groups and multinomial log-linear models (Section 7.3) for more than two groups.
12.1 Discriminant Analysis

Suppose that we have a set of g classes, and for each case we know the class (assumed correctly). We can then use the class information to help reveal the structure of the data. Let W denote the within-class covariance matrix, that is the covariance matrix of the variables centred on the class mean, and B denote the between-classes covariance matrix, that is, of the predictions by the class means. Let M be the g x p matrix of class means, and G be the n x g matrix of class indicator variables (so g_ij = 1 if and only if case i is assigned to class j). Then the predictions are GM. Let x̄ be the means of the variables over the whole sample. Then the sample covariance matrices are

    W = \frac{(X - GM)^T (X - GM)}{n - g}, \qquad B = \frac{(GM - 1\bar{x})^T (GM - 1\bar{x})}{g - 1}    (12.1)
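A small sketch (not code from the text) of (12.1): it computes W and B directly and compares them with the linear discriminants returned by lda. The use of the iris data here is purely illustrative.

    ## Within- and between-class covariance matrices of (12.1), on the iris data
    library(MASS)
    X  <- as.matrix(iris[, 1:4])                   # n x p data matrix
    cl <- iris$Species                             # class labels
    n  <- nrow(X); g <- nlevels(cl)

    G <- model.matrix(~ cl - 1)                    # n x g class indicator matrix
    M <- apply(X, 2, function(col) tapply(col, cl, mean))   # g x p matrix of class means
    xbar <- colMeans(X)                            # overall means

    W <- crossprod(X - G %*% M) / (n - g)                   # within-class covariance
    B <- crossprod(G %*% M - rep(1, n) %o% xbar) / (g - 1)  # between-classes covariance

    ## The linear discriminants of lda() maximize the ratio of between-class
    ## to within-class variance; compare the scaling with W and B above.
    lda(X, cl)$scaling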
[Figure 12.1 appears here: the data plotted on the first two linear discriminants, with 'first linear discriminant' on the horizontal axis and 'second linear discriminant' on the vertical axis; points are labelled by class as s, c and v.]
Let π_c denote the prior probabilities of the classes, and p(x | c) the densities of the distributions of the observations within each class. Then the posterior distribution of the classes after observing x is

    p(c \mid x) = \frac{\pi_c\, p(x \mid c)}{\sum_d \pi_d\, p(x \mid d)} \propto \pi_c\, p(x \mid c)    (12.2)

and it is fairly simple to show that the allocation rule which makes the smallest expected number of errors chooses the class with maximal p(c | x); this is known as the Bayes rule. (We consider a more general version in Section 12.2.)
Now suppose the distribution for class c is multivariate normal with mean μ_c and covariance Σ_c. Then the Bayes rule minimizes

    Q_c = -2 \log p(x \mid c) - 2 \log \pi_c = (x - \mu_c)\, \Sigma_c^{-1} (x - \mu_c)^T + \log |\Sigma_c| - 2 \log \pi_c + \text{const}    (12.3)

The first term of (12.3) is the squared Mahalanobis distance of x from the class centre; the constant is common to all classes and can be ignored. Differences in Q_c between classes are quadratic functions of x, giving quadratic discriminant analysis. If the classes have a common covariance matrix Σ the quadratic terms cancel, the differences are linear functions of x, and we have linear discriminant analysis; since the class means span at most g - 1 dimensions we can work in at most min(p, g - 1) dimensions. If there are just two classes, there is a single linear discriminant, and the difference in scores between the classes is

    x\, \Sigma^{-1} (\mu_1 - \mu_2)^T + \text{const}
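Written as code, the quadratic score Q_c of (12.3) is only a few lines; this is a sketch, and the parameter values in the example call are made up.

    ## Quadratic discriminant score of (12.3) for one class (constant dropped)
    Qc <- function(x, mu, Sigma, prior) {
      d <- x - mu
      drop(d %*% solve(Sigma, d)) + log(det(Sigma)) - 2 * log(prior)
    }
    ## Allocate x to the class with the smallest score, e.g.
    Qc(c(1, 2), mu = c(0, 0), Sigma = diag(2), prior = 0.5)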
Consider the crabs dataset. Can we construct a rule to predict the sex of a future Leptograpsus crab of unknown colour form (species)? We noted earlier that one of the measurements is taken differently for males and females, so it seemed prudent to omit it from the analysis. To start with, we ignore the differences between the forms. Linear discriminant analysis gives a discriminant that is essentially a contrast of the logged measurements, hence a ratio of measurements and a dimensionally neutral quantity. Six errors are made, all for the blue form.
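A sketch of such a fit follows. The choice of which measurement to drop (RW here) and the use of logged measurements are illustrative assumptions, so the resubstitution errors need not match the six quoted in the text.

    ## LDA for crab sex on logged measurements (one measurement omitted)
    library(MASS)
    lcrabs <- log(crabs[, c("FL", "CL", "CW", "BD")])
    cr.lda <- lda(lcrabs, crabs$sex)
    ## Resubstitution confusion matrix (an optimistic assessment; see Section 12.2)
    table(true = crabs$sex, predicted = predict(cr.lda)$class)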
It does make sense to take the colour forms into account, especially as the within-group distributions look close to joint normality (see Figure 4.13).
[Figure 12.2 appears here: the crabs data plotted as Second LD against First LD, with points labelled B, O, b and o; see the caption below.]
Figure 12.2: Linear discriminants for the crabs data. Males are coded as capitals, females as lower case, colours as the initial letter of blue or orange. The crosses are the group means for a linear discriminant for sex (solid line) and the dashed line is the decision boundary for sex based on four groups.
1 Adopted by R in package .
[Figure 12.3 appears here: the forensic glass data plotted on linear discriminants (axes LD1 and LD2) for two estimates of the within-class covariance matrix; points are labelled by glass type (F, N, V, C, T, H).]
A simple robust estimate can be obtained by selecting those points whose Mahalanobis distance from the initial mean (using the initial covariance estimate) falls well within the 97.5% point under normality, and returning their mean and variance matrix.
An alternative approach is to extend the idea of M-estimation to this setting, fitting a multivariate t distribution for a small number of degrees of freedom. This is implemented in our function cov.trob; the theory behind the algorithm used is given in Kent, Tyler and Vardi (1994) and Ripley (1996). Normally cov.trob is faster than cov.rob, but it lacks the latter's extreme resistance. We can use linear discriminant analysis on more than two classes, and illustrate this with the forensic glass dataset fgl.
Our function lda has an argument method that allows the minimum volume ellipsoid estimate (but without robust estimation of the group centres) or the multivariate t distribution to be used, by setting method = "mve" or method = "t". This makes a considerable difference for the forensic glass data fgl, as Figure 12.3 shows. We use the default number of degrees of freedom for the t.
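A sketch in the spirit of Figure 12.3 (not the book's plotting code) is:

    ## Standard and multivariate-t ("robust") linear discriminant analyses of fgl
    library(MASS)
    fgl.lda  <- lda(type ~ ., data = fgl)                 # moment estimates
    fgl.rlda <- lda(type ~ ., data = fgl, method = "t")   # multivariate t, default nu
    ## Projections onto the first two linear discriminants differ noticeably
    head(predict(fgl.lda,  dimen = 2)$x)
    head(predict(fgl.rlda, dimen = 2)$x)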
12.2 Classification Theory

In the terminology of pattern recognition the given examples together with their classifications are known as the training set, and future cases form the test set. Our primary measure of success is the error (or misclassification) rate. Note that we would obtain (possibly seriously) biased estimates by re-classifying the training set, but that the error rate on a test set randomly chosen from the whole population will be an unbiased estimator.
It may be helpful to know the type of errors made. A confusion matrix gives the number of cases with true class i classified as class j. In some problems some errors are considered to be worse than others, so we assign costs L_ij to allocating a case of class i to class j. Then we will be interested in the average error cost rather than the error rate.
It is fairly easy to show (Ripley, 1996, p. 19) that the average error cost is minimized by the Bayes rule, which is to allocate to the class c minimizing ∑_i L_ic p(i | x), where p(i | x) is the posterior distribution of the classes after observing x. If the costs of all errors are the same, this rule amounts to choosing the class c with the largest posterior probability p(c | x). The minimum average cost is known as the Bayes risk. We can often estimate a lower bound for it by the method of Ripley (1996, pp. 196–7) (see the example on page 347).
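Given a matrix of posterior probabilities, the cost-sensitive Bayes rule is a one-liner. A sketch, with a made-up example (the cost matrix L follows the convention above):

    ## Allocate each case to the class minimizing its expected cost
    bayes.rule <- function(posterior, L) {
      expected.cost <- posterior %*% L                    # n x g matrix of expected costs
      colnames(posterior)[apply(expected.cost, 1, which.min)]
    }
    post <- rbind(c(0.7, 0.2, 0.1), c(0.3, 0.4, 0.3))     # two cases, three classes
    colnames(post) <- c("a", "b", "c")
    L <- 1 - diag(3)                                      # equal costs: reduces to max posterior
    bayes.rule(post, L)                                   # "a" then "b"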
We saw in Section 12.1 how p(c | x) can be computed for normal populations, and how estimating the Bayes rule with equal error costs leads to linear and quadratic discriminant analysis. As our functions lda and qda return posterior probabilities from their predict methods, they can also be used to estimate the Bayes rule for unequal error costs. The posterior probabilities may also be modelled directly: for more than two classes it may be possible to do this using a surrogate log-linear Poisson GLM model (Section 7.3), but using the multinom function in library section nnet will usually be faster and easier. A classification tree estimates p(c | x) directly by a special multiple logistic model, one in which the right-hand side is a single factor specifying which leaf the case will be assigned to by the tree. Again, since the posterior probabilities are given by the predict method it is easy to estimate the Bayes rule for unequal error costs.
To apply the Bayes rule we need to know the posterior probabilities p(c | x). Since these are unknown we use an explicit or implicit parametric family p(c | x; θ). In the methods considered so far we act as if p(c | x; θ̂) were the actual posterior probabilities, where θ̂ is an estimate computed from the training set T, often by maximizing some appropriate likelihood. This is known as the 'plug-in' rule. However, the 'correct' estimate of p(c | x) is (Ripley, 1996, §2.4) to use the predictive estimates

    \tilde p(c \mid x) = p(c \mid x, T) = \int p(c \mid x; \theta)\, p(\theta \mid T)\, d\theta    (12.5)

If we are very sure of our estimate θ̂ there will be little difference between p(c | x; θ̂) and the predictive estimate; otherwise the predictive estimate will normally be less extreme (not as near 0 or 1). The 'plug-in' estimate ignores the uncertainty in the parameter estimate θ̂ which the predictive estimate takes into account.

It is not often possible to perform the integration in (12.5) analytically, but it is possible for linear and quadratic discrimination with appropriate 'vague' priors on θ (Aitchison and Dunsmore, 1975; Geisser, 1993; Ripley, 1996). This estimate is implemented by method = "predictive" of the predict methods for our functions lda and qda. Often the differences are small, especially for linear discrimination, provided there are enough data for a good estimate of the variance matrices. When there are not, Moran and Murphy (1979) argue that considerable improvement can be obtained by using an unbiased estimator of log p(x | c), implemented by the argument method = "debiased".
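As an illustration (a sketch, not the book's code), the three estimates can be compared with qda on the Cushings data used in the figures that follow; the log transformation matches the later analyses.

    ## Plug-in, predictive and debiased posterior probabilities from qda
    library(MASS)
    known <- Cushings$Type != "u"                  # drop the unknown cases
    cush  <- log(as.matrix(Cushings[known, 1:2]))
    tp    <- factor(Cushings$Type[known])
    cush.qda <- qda(cush, tp)
    round(predict(cush.qda, cush, method = "plug-in")$posterior[1:3, ], 3)
    round(predict(cush.qda, cush, method = "predictive")$posterior[1:3, ], 3)
    round(predict(cush.qda, cush, method = "debiased")$posterior[1:3, ], 3)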
[Figure 12.4 appears here: panels labelled LDA and QDA plotting Pregnanetriol against Tetrahydrocortisone (log scales), with points labelled a, b, c and u; the caption follows.]
Figure 12.4: Linear and quadratic discriminant analysis applied to the Cushing’s syndrome
data.
[Figure 12.5 appears here: two further panels of Pregnanetriol against Tetrahydrocortisone with points labelled a, b, c and u; the caption follows.]
Figure 12.5: Further classifiers applied to the Cushing's syndrome data.

The results are shown in Figure 12.5.
12.3 Non-Parametric Rules

There are also classifiers based on non-parametric estimates of the class densities or of the log posterior. Library section class implements the k-nearest neighbour classifier and related methods (Devijver and Kittler, 1982; Ripley, 1996) and learning vector quantization (Kohonen, 1990, 1995). Both are based on finding the k nearest examples in some reference set, and taking a majority vote among the classes of these examples, or, equivalently, estimating the posterior probabilities p(c | x) by the proportions of the classes among the k examples.

The methods differ in their choice of reference set. The k-nearest neighbour methods use the whole training set or an edited subset. Learning vector quantization is similar to K-means in selecting points in the space other than the training set examples to summarize the training set, but unlike K-means it takes the classes of the examples into account.

[Figure 12.6 appears here: 1-NN and 3-NN classification of the Cushing's syndrome data, plotted as Pregnanetriol against Tetrahydrocortisone with points labelled a, b, c and u.]
These methods almost always measure ‘nearest’ by Euclidean distance. For
the Cushing’s syndrome data we use Euclidean distance on the logged covariates,
rather arbitrarily scaling them equally.
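A sketch with knn from library section class (assuming, as above, logged covariates; this is not the book's exact code):

    ## 1-NN and 3-NN predictions for the unknown Cushing's syndrome cases
    library(MASS); library(class)
    known <- Cushings$Type != "u"
    cush  <- log(as.matrix(Cushings[known, 1:2]))
    tp    <- factor(Cushings$Type[known])
    unk   <- log(as.matrix(Cushings[!known, 1:2]))
    knn(train = cush, test = unk, cl = tp, k = 1)
    knn(train = cush, test = unk, cl = tp, k = 3)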
This dataset is too small to try the editing and LVQ methods in library section class.
2 … S environment.
12.4 Neural Networks
[Figure 12.7 appears here: several panels of Pregnanetriol against Tetrahydrocortisone with points labelled a, b, c and u; the caption follows.]
Figure 12.7: Neural networks applied to the Cushing's syndrome data.
The results are shown in Figure 12.7. We see that in all cases there are multiple local maxima of the (penalized) likelihood. Once we have a penalty, the choice of the number of hidden units is often not critical (see Figure 12.7). The spirit of the predictive approach is to average the predicted posterior probabilities over the local maxima.
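A sketch of this idea (not the book's code): fit the same small network several times from random starting weights and average the predicted probabilities. The settings (three hidden units, decay 0.01) are illustrative.

    ## Average nnet class probabilities over several random restarts
    library(MASS); library(nnet)
    known <- Cushings$Type != "u"
    cush  <- data.frame(log(Cushings[known, 1:2]), Type = factor(Cushings$Type[known]))
    set.seed(123)
    fits  <- lapply(1:10, function(i)
      nnet(Type ~ ., data = cush, size = 3, decay = 0.01, maxit = 500, trace = FALSE))
    probs <- Reduce(`+`, lapply(fits, predict, newdata = cush, type = "raw")) / length(fits)
    round(probs[1:3, ], 3)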
[Figure 12.8 appears here: two panels of Pregnanetriol against Tetrahydrocortisone with points labelled a, b, c and u; the caption follows.]
Figure 12.8: Neural networks with three hidden units and weight decay applied to the Cushing's syndrome data.
Note that there are two quite different types of local maxima occurring here, and some local maxima occur several times (up to convergence tolerances). An average of the predicted probabilities over these fits is in the spirit of the predictive approach.
12.5 Support Vector Machines

Support vector machines (SVMs) have been promoted enthusiastically, but with little respect to the selection effects of choosing the test problem and the member of the large class of classifiers to use. The original ideas are in Boser et al. (1992), Cortes and Vapnik (1995) and Vapnik (1995, 1998); the books by Cristianini and Shawe-Taylor (2000) and Hastie et al. (2001, 4.5, 12.2, 12.3) present the underlying theory.
The method for two classes is fairly simple to describe. Suppose first that the classes are linearly separable, so that some hyperplane has all class-one points on one side and all class-two points on the other; logistic regression would find such a hyperplane, but not a unique one. It would be a coincidence for there to be only one such hyperplane, and the support vector machine chooses the one in the middle of the 'gap' between the classes, that is with maximal margin (the distance from the hyperplane to the nearest point). This is a quadratic programming problem that can be solved by standard methods.3 Such a hyperplane has support vectors, data points that are exactly the margin distance away from the hyperplane. It will typically be a very good classifier.

Usually, however, no separating hyperplane will exist. This is tackled in two ways. First, we can allow some points to be on the wrong side of their margin (and for some on the wrong side of the hyperplane) subject to a bound on the total amount by which they do so; the resulting fit is not dissimilar (Hastie et al., 2001, p. 380) to a logistic regression with weight decay. Second, the set of variables can be greatly expanded by taking non-linear functions of the original variables. The claimed advantage of SVMs is that, because the fit is determined only by the support vectors, they can cope well with such a large expanded set of variables.
3 See Section 16.2 for S software for this problem; however, special-purpose software is often used.
4 Code by David Meyer based on C++ code by Chih-Chung Chang and Chih-Jen Lin. A port to S-PLUS is available for machines with a C++ compiler.
The extension to more than two classes is much less elegant, and several ideas have been used. The svm function uses one attributed to Knerr et al. (1990) in which classifiers are built comparing each pair of classes, and the majority vote amongst the resulting classifiers determines the predicted class.
12.6 Forensic Glass Example

The forensic glass dataset fgl has 214 points from six classes with nine measurements on each. As we have seen (Figures 4.17 on page 99, 5.4 on page 116, 11.5 on page 309 and 12.3 on page 337) the types of glass do not form compact well-separated groupings, and the marginal distributions are far from normal. There are some small classes (with 9, 13 and 17 examples), so we cannot use quadratic discriminant analysis.
We assess their performance by 10-fold cross-validation, using the same random partition for all the methods. Logistic regression provides a suitable benchmark (as is often the case), and in this example linear discriminant analysis does equally well.
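A sketch of that benchmark (not the book's code): fix one random partition and reuse it for every method. The fold labels rand defined here are reused by the later sketches.

    ## Fixed 10-fold partition and a multinomial logistic benchmark for fgl
    library(MASS); library(nnet)
    set.seed(101)
    rand <- sample(rep(1:10, length.out = nrow(fgl)))       # fold label for each case
    pred <- factor(rep(NA, nrow(fgl)), levels = levels(fgl$type))
    for (i in 1:10) {
      fit <- multinom(type ~ ., data = fgl[rand != i, ], trace = FALSE, maxit = 1000)
      pred[rand == i] <- predict(fit, fgl[rand == i, ])
    }
    table(true = fgl$type, predicted = pred)                # cross-validated confusion matrix
    mean(pred != fgl$type)                                  # cross-validated error rate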
We can use nearest-neighbour methods to estimate the lower bound on the Bayes risk as about 10% (Ripley, 1996, pp. 196–7). Classification trees can also be grown for this dataset; we need to cross-validate over the choice of tree size, which does vary by group from four to seven.
Neural networks
We wrote some general functions for testing neural network models by V-fold cross-validation. First we rescale the dataset so the inputs have range [0, 1].
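Those general functions are not reproduced here. A compressed sketch of the same idea, rescaling the inputs and choosing the size of the network by cross-validated error over the partition rand defined earlier (the tuning grid and decay value are illustrative), is:

    ## Choose the number of hidden units by 10-fold cross-validation
    library(MASS); library(nnet)
    rescale01 <- function(x) (x - min(x)) / diff(range(x))
    fgl01 <- data.frame(lapply(fgl[, 1:9], rescale01), type = fgl$type)
    set.seed(101)
    rand <- sample(rep(1:10, length.out = nrow(fgl01)))     # same partition as before
    cv.nnet <- function(size, decay) {
      wrong <- 0
      for (i in 1:10) {
        fit <- nnet(type ~ ., data = fgl01[rand != i, ], size = size,
                    decay = decay, maxit = 500, trace = FALSE)
        wrong <- wrong + sum(predict(fit, fgl01[rand == i, ], type = "class") !=
                             fgl01$type[rand == i])
      }
      wrong / nrow(fgl01)
    }
    sapply(c(2, 4, 8), function(s) cv.nnet(size = s, decay = 0.01))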
This code chooses between neural nets on the basis of their cross-validated error rate. An alternative is to use logarithmic scoring, which is equivalent to comparing deviances, and needs only a small change to the cross-validation function.
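For example, a logarithmic score to replace the error count might be sketched as follows (probs is an n x g matrix of predicted probabilities whose columns are in the order of levels(true)):

    ## Logarithmic score: smaller is better, and twice the score is a deviance
    log.score <- function(probs, true)
      -sum(log(pmax(probs[cbind(seq_along(true), as.integer(true))], 1e-6)))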
Support vector machines
The following is faster, but not strictly comparable with the results above, as a
different random partition will be used.
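A sketch of such a quicker assessment, using the built-in cross-validation of svm from package e1071 (which draws its own random partition):

    ## Support vector machine for fgl with 10-fold cross-validation
    library(MASS); library(e1071)
    set.seed(103)
    fgl.svm <- svm(type ~ ., data = fgl, cross = 10)
    summary(fgl.svm)      # reports the accuracy in each of the ten folds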
Learning vector quantization

We set an even prior over the classes as otherwise there are too few representatives of the smaller classes. Our initialization code in lvqinit follows Kohonen's in selecting the number of representatives; in this problem 24 points are selected, four from each class.
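A sketch with the LVQ functions in library section class (the scaling of the variables here is an illustrative choice, not the book's):

    ## Learning vector quantization for fgl with an even prior
    library(MASS); library(class)
    x  <- scale(as.matrix(fgl[, 1:9]))
    cl <- fgl$type
    set.seed(105)
    cd <- lvqinit(x, cl, prior = rep(1, 6)/6)        # initial codebook, even prior
    table(cd$cl)                                     # representatives selected per class
    cd <- olvq1(x, cl, cd)                           # optimized LVQ1 training pass
    table(true = cl, predicted = lvqtest(cd, x))     # resubstitution assessment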
12.7 Calibration Plots

One measure that a suitable model for p(c | x) has been found is that the predicted probabilities are well calibrated; that is, that a fraction of about p of the events we predict with probability p actually occur. Methods for testing calibration of probability forecasts have been developed in connection with weather forecasts (Dawid, 1982, 1986).
For the forensic glass example we are making six probability forecasts for each case, one for each class. To ensure that they are genuine forecasts, we should use the cross-validation procedure. A minor change to the code gives the probability predictions:
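That code is not reproduced here. A minimal sketch of how such a plot can be drawn, assuming a matrix probs of cross-validated predicted probabilities with columns in the order of levels(fgl$type), is:

    ## Calibration plot: observed frequency of events against forecast probability
    calibration.plot <- function(probs, true, breaks = seq(0, 1, 0.1)) {
      event <- outer(as.integer(true), seq_len(ncol(probs)), "==")  # did each forecast event occur?
      bin   <- cut(as.vector(probs), breaks, include.lowest = TRUE)
      obs   <- tapply(as.vector(event), bin, mean)                  # observed frequency per bin
      mid   <- (head(breaks, -1) + tail(breaks, -1)) / 2
      plot(mid, obs, xlim = c(0, 1), ylim = c(0, 1), type = "b",
           xlab = "predicted probability", ylab = "observed frequency")
      abline(0, 1, lty = 2)                                         # ideal line of slope one
    }
    ## e.g. calibration.plot(probs, fgl$type)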
[Calibration plot appears here: the horizontal axis is predicted probability, running from 0 to 1.]
The forecasts are clearly over-confident: the observed frequencies fall well below the ideal line of slope one. Indeed, only 22/64 of the events predicted with probability greater than 0.9 occurred. (The underlying cause is the multimodal nature of some of the underlying class distributions.)
Over-confidence of this sort can also result from the use of plug-in rather than predictive estimates. Then the plot can be used to adjust the probabilities (which may need further adjustment to sum to one for more than two classes).
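One simple way to make such an adjustment (an illustrative approach, not taken from the text) is to regress the event indicator on the forecast probability and renormalize; again probs and the true classes are assumed as above.

    ## Recalibrate forecast probabilities and renormalize each case to sum to one
    recalibrate <- function(probs, true) {
      event <- as.vector(outer(as.integer(true), seq_len(ncol(probs)), "=="))
      fit   <- glm(event ~ as.vector(probs), family = binomial)
      adj   <- matrix(fitted(fit), nrow = nrow(probs))   # adjusted individual forecasts
      sweep(adj, 1, rowSums(adj), "/")                   # rows now sum to one
    }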