Multidimensional Data Analysis
Contents

1 Introduction
2 Variables, Metrics, Similarity Measures
3 Some General Methods of Multidimensional Data Analysis
  3.1 Projection Pursuit
  3.2 Nonlinear Mapping
  3.3 Comparison of Multidimensional Data Sets
4 Cluster methods
  4.1 Hierarchical Methods
    4.1.1 Hierarchical Clustering using the MST
    4.1.2 Selection of Jets by Hierarchical Clustering
  4.2 Nonhierarchical Techniques
    4.2.1 The Valley-Seeking Technique
    4.2.2 The Cluster Algorithm CLUCOV
    4.2.3 Gelsema's Interactive Clustering Technique
Chapter 1
Introduction
The necessity for the application of multidimensional data analysis in high energy physics arises from at least two reasons:
- Events of a given reaction are described by a (multiplicity-dependent) large number of kinematical variables.
- The number of events is normally high, so that the application of statistical methods becomes essential.
Being confronted with a large amount of multidimensional data, the physicist often chooses the traditional way of analyzing these data, i.e. producing one- or two-dimensional projections of the data.
Low-dimensional projections of multidimensional data are often insufficient, as they reveal only a small part of the information contained in the data. Multidimensional data analysis therefore aims to use the full experimental information.
If there exists a physical model of the measured process, the parameters of this model can be fitted to the data. Such parametric methods as partial wave analysis, analytical multichannel analysis or prism plot analysis rely heavily on a priori physical knowledge and will not be considered here.
Also beyond the scope of this survey are classification procedures that start from a known category structure within the data and sort the data points into categories defined beforehand. We describe methods to find structure in multidimensional data with minimum a priori knowledge of the underlying physics and data structure.
In chapter 2 the importance of a skilled choice of the variables spanning the multidimensional space, and the definition of similarity in the case of cluster methods, are discussed. Methods of lowering the dimensionality of the multidimensional space and of comparing multidimensional data sets are presented in chapter 3. Looking for an appropriate separation algorithm, the analyst has to take into account the computing demands of different algorithms. The hierarchical algorithms presented in section 4.1 are only applicable to relatively small numbers of data points, while the nonhierarchical methods of section 4.2 can also be applied to large data sets.
For theoretical discussion and applications of different methods of multidimensional data analysis we refer the reader to the review of Kittel (KITT76) and the proceedings of the meetings on high energy data analysis (CERN76, NIJM78). The reader especially interested in cluster analysis should consult the monographs of Anderberg (ANDE73) and Duran and Odell (DURA74).
Chapter 2
Variables, Metrics, Similarity Measures
Information and structure do not exist by themselves but only in the context of a given application. Thus, for example, the noise in a radio receiver contains no information for the ordinary listener, but it can well give some information on a defect in the receiver.
The existence of structure within multidimensional data sets depends largely on the variables that span the multidimensional space and on the metrics defined in this space. For classification procedures one furthermore has to decide on the meaning of similarity and dissimilarity of categories.
The importance of the choice of variables for the ultimate results of the analysis is well illustrated in fig. 2.1. Scaling one coordinate already suggests a new group structure within the four data points.
Figure 2.1: The effect of the scale of variables on the group structure
The answers obtained from the analysis thus depend largely on the questions asked by the analyst.
Multidimensional data in high energy physics most frequently consist of the momentum and energy variables of many-particle final states. Therefore we shall consider the problem of the choice of variables in this context.
It would be desirable to find some physical requirements that make the choice of variables less arbitrary or even unique. One approach would be to look for a complete set of Lorentz-invariant variables which does not favour any of the final state particles.
The reaction
$$a + b \to 1 + 2 + \ldots + n$$
can be described in terms of the following variables
$$(ai) = (p_a \cdot p_i), \qquad (bi) = (p_b \cdot p_i), \qquad (ik) = (p_i \cdot p_k)$$
with i, k = 1, ..., n.
These variables are relativistically invariant and invariant under permutations of the incoming particles a, b and of the final state particles i, k. On the other hand they are directly related to momentum transfers and masses squared, so that they are relevant for the dynamics of the reaction under investigation.
However, the number of invariants exceeds the number of independent variables, which is 3n - 5 for an n-particle final state at fixed total energy. To our knowledge there is no subset of variables that maintains both Lorentz invariance and permutation symmetry. Therefore we have to resort to the somewhat weaker property of quasi-permutation invariance, which means permutation invariance up to a linear transformation. This leads to the demand for distance measures that are invariant under linear transformations.
Provided one has such a distance measure, a relativistically and quasi-permutation invariant subset of variables should be good for a multidimensional analysis.
For a three-particle final state one can simply choose the four invariants
$$(a1), \ (a2), \ (b1), \ (b2).$$
For four-particle final states, Yang (BYER64, BYCK73) proposed the following seven variables
$$(a1), \ (a2), \ (a3), \ (b1), \ (b2), \ (b3) \quad\text{and}\quad \Delta_4(1234),$$
where $\Delta_4$ is the Gram determinant defined by
$$\Delta_k(12\ldots k) = \begin{vmatrix} (11) & \cdots & (1k) \\ \vdots & \ddots & \vdots \\ (k1) & \cdots & (kk) \end{vmatrix}.$$
In order to generalize the Yang variables one has to take into account the kinematic constraints on the invariants. Such a generalization has been performed in (BECK76). For this purpose one defines the quantities
$$Z_m^{(n)} = \Big[(-1)^{m+1} \sum_{i_1 < \cdots < i_m}^{n} \Delta_m(i_1, \ldots, i_m)\Big]^{1/m}$$
and
$$Z_1^{(n)} = \sum_{i=1}^{n} m_i^2$$
with $m_i$ the mass of particle i, so that only the $Z_m^{(n)}$ with m > 1 can be incorporated in the variable sets.
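As an illustration of these definitions, the following Python sketch computes the Gram determinants $\Delta_k$ from four-momenta, taken as (E, px, py, pz) in the (+,-,-,-) metric, and the quantities $Z_m^{(n)}$ in the form reconstructed above. Function names and the toy momenta are illustrative only, and the momenta are not required to satisfy energy-momentum conservation; the sketch only shows the algebra.

import numpy as np
from itertools import combinations

def minkowski_dot(p, q):
    # scalar product of two four-momenta p = (E, px, py, pz) in the (+,-,-,-) metric
    return p[0] * q[0] - np.dot(p[1:], q[1:])

def gram_determinant(momenta):
    # Delta_k(1...k): determinant of the matrix of scalar products (ik) = p_i . p_k
    k = len(momenta)
    g = np.array([[minkowski_dot(momenta[i], momenta[j]) for j in range(k)]
                  for i in range(k)])
    return np.linalg.det(g)

def z_variable(momenta, m):
    # Z_m^(n) = [(-1)^(m+1) * sum of Delta_m over all m-tuples]^(1/m), m > 1 (as above)
    total = sum(gram_determinant([momenta[i] for i in idx])
                for idx in combinations(range(len(momenta)), m))
    val = (-1) ** (m + 1) * total
    return np.sign(val) * abs(val) ** (1.0 / m)   # signed m-th root to stay real-valued

# toy example: five final-state four-momenta (GeV)
rng = np.random.default_rng(0)
momenta = [np.array([2.0 + rng.random(), *rng.normal(0, 0.3, 3)]) for _ in range(5)]
print(gram_determinant(momenta[:4]), z_variable(momenta, 3))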
For five particles in the final state we have the ten invariants
$$(a1), \ldots, (a4), \quad (b1), \ldots, (b4) \quad\text{and}\quad Z_4^{(5)},\ Z_3^{(5)},$$
and for six particles in the final state the thirteen invariants
$$(a1), \ldots, (a5), \quad (b1), \ldots, (b5) \quad\text{and}\quad Z_4^{(6)},\ Z_3^{(6)},\ Z_2^{(6)}.$$
For a detailed analysis of many-particle final states the full set of (3n - 5) variables should be used. Experience has shown, however, that most of the global information on the reaction mechanisms is already contained in the (2n - 2) four-momentum products (BECK76)
$$(a1), \ldots, (a\,n{-}1), \quad (b1), \ldots, (b\,n{-}1).$$
We now turn our attention to the problem of metrics. A nonnegative real function d(X_i, X_j) is called a metric in a p-dimensional Euclidean space E_p if
1. d(X_i, X_j) >= 0 for all X_i and X_j in E_p,
2. d(X_i, X_j) = 0 if and only if X_i = X_j,
3. d(X_i, X_j) = d(X_j, X_i),
4. d(X_i, X_j) <= d(X_i, X_k) + d(X_k, X_j).
Frequently used examples are the Euclidean distance
$$d_2(X_i, X_j) = \Big[\sum_{k=1}^{p} (X_{ki} - X_{kj})^2\Big]^{1/2}$$
and the city-block distance
$$d_1(X_i, X_j) = \sum_{k=1}^{p} |X_{ki} - X_{kj}|.$$
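For completeness, a minimal Python sketch of these two distance measures (the function names are of course arbitrary):

import numpy as np

def euclidean_distance(x, y):
    # d_2(X_i, X_j) = [sum_k (X_ki - X_kj)^2]^(1/2)
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def city_block_distance(x, y):
    # d_1(X_i, X_j) = sum_k |X_ki - X_kj|
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y))

a, b = [1.0, 2.0, 0.5], [0.0, 2.5, 1.5]
print(euclidean_distance(a, b), city_block_distance(a, b))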
Chapter 3
Some General Methods of Multidimensional Data Analysis
Most experiments in high energy physics (but not only there) lead to multidimensional data. The methods to analyze such data are much less developed than those for the one-dimensional case. In some cases one has a model attempting to describe the process from which the data derive. Then the methods of parametric density estimation, such as maximum likelihood or moments, can be used. If one is not so lucky as to have a model for the data, one will normally check whether all the dimensions are really needed to describe the data, i.e. one will look at the intrinsic dimensionality of the data. For this, principal component analysis can be used in the linear case. In the nonlinear case the generalized principal component analysis or successive application of the principal component analysis (FRIE76, FUKU71) can be used.
In this chapter we shall describe some methods for exploratory data analysis which have proven to be of broad applicability.
3.1 Projection Pursuit
Projection pursuit (FRIE74) is a mapping technique which searches for the one- or two-dimensional projection exhibiting as much structure of the data as possible. First the one-dimensional case is discussed.
Let $\vec{X}_i$ (i = 1, ..., N) be the data set. Then the aim is to search for that direction $\vec{a}$ with $|\vec{a}| = 1$ for which the so-called projection index $I(\vec{a})$ is maximized. In (FRIE74) the following construction of $I(\vec{a})$ is proposed:
$$I(\vec{a}) = S(\vec{a})\, d(\vec{a})$$
with $S(\vec{a})$ measuring the spread of the data (trimmed standard deviation),
$$S(\vec{a}) = \Big[\sum_{i=pN}^{(1-p)N} \frac{(\vec{X}_i \cdot \vec{a} - \bar{X}_a)^2}{(1-2p)N}\Big]^{1/2}, \qquad \bar{X}_a = \sum_{i=pN}^{(1-p)N} \frac{\vec{X}_i \cdot \vec{a}}{(1-2p)N},$$
where the data are ordered along $\vec{a}$ and the fraction p of points is trimmed at each end, and with $d(\vec{a})$ measuring the local density of the projected points,
$$d(\vec{a}) = \sum_{i=1}^{N} \sum_{j=1}^{N} f(r_{ij})\, \Theta(R - r_{ij}), \qquad r_{ij} = |(\vec{X}_i - \vec{X}_j) \cdot \vec{a}|,$$
$$\Theta(\eta) = \begin{cases} 1 & \text{if } \eta > 0 \\ 0 & \text{if } \eta \le 0. \end{cases}$$
The function f(r) should be monotonically decreasing in [0, R]. The algorithm is insensitive to the special form of f(r), but
$$\bar{r} = \frac{\int_0^R r f(r)\,dr}{\int_0^R f(r)\,dr}$$
determines the size of the structure which is searched for. Finding the maximum of
I(a) is a nonlinear problem which can be solved by standard procedures.
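The following Python sketch illustrates the one-dimensional projection index under the assumption f(r) = R - r for the decreasing weight function, with a crude random search standing in for a proper nonlinear optimizer; the trimming is done by simply discarding the fraction p of the smallest and largest projected values. All names and parameter values are illustrative.

import numpy as np

def projection_index(X, a, p=0.1, R=1.0):
    # Friedman-Tukey style 1-D projection index I(a) = S(a) * d(a) (sketch)
    z = np.sort(X @ a)                       # projected data, ordered along a
    n = len(z)
    lo, hi = int(p * n), int((1 - p) * n)    # trim roughly p*N points at each end
    S = z[lo:hi].std()                       # trimmed spread
    r = np.abs(z[:, None] - z[None, :])      # pairwise projected distances r_ij
    f = np.where(r < R, R - r, 0.0)          # assumed form f(r) = R - r inside [0, R]
    return S * f.sum()

def best_direction(X, n_trials=2000, seed=0, **kw):
    # crude random search over unit directions (a stand-in for a real optimizer)
    rng = np.random.default_rng(seed)
    best_a, best_I = None, -np.inf
    for _ in range(n_trials):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)
        I = projection_index(X, a, **kw)
        if I > best_I:
            best_a, best_I = a, I
    return best_a, best_I

# toy data: two clusters in 3-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 3)), rng.normal(2, 0.3, (100, 3))])
a, I = best_direction(X)
print("best direction:", np.round(a, 2), "index:", round(I, 1))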
If one aims to find two-dimensional projections the generalization is straightforward: the projection is given by two orthogonal directions $\vec{a}$, $\vec{b}$ ($\vec{a} \cdot \vec{b} = 0$) with
$$S(\vec{a}, \vec{b}) = S(\vec{a})\, S(\vec{b}), \qquad r_{ij} = \big([(\vec{X}_i - \vec{X}_j) \cdot \vec{a}]^2 + [(\vec{X}_i - \vec{X}_j) \cdot \vec{b}]^2\big)^{1/2}, \qquad \bar{r} = \frac{\int_0^R r f(r)\,dr}{\int_0^R f(r)\,dr}.$$

3.2 Nonlinear Mapping
Nonlinear mapping techniques map the N data points onto points $\vec{Y}_i$ in a low-dimensional (usually two-dimensional) space such that the interpoint distances $d_{ij}$ of the mapped points reproduce the original distances $D_{ij}$ as well as possible. This is achieved by iteratively minimizing a mapping error of the form
$$E = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \Big(\frac{D_{ij} - d_{ij}}{d_{ij}}\Big)^2.$$
A modification of this algorithm is due to Manton (MANT76). The main disadvantage of these algorithms is the high computational load, so they are applicable only to a few hundred data points.
A much simpler but still powerful technique is the sequential nonlinear mapping (LEE74). It starts from the fact that if $\vec{Y}_i$ and $\vec{Y}_j$ are chosen to preserve the original distance
$$D_{ij} = d_{ij},$$
it is always possible to place a third point $\vec{Y}_k$ such as to preserve the distances to the points i, j:
$$D_{ik} = d_{ik}, \qquad D_{jk} = d_{jk}$$
(triangulation). Thus at least 2N - 3 of the N(N - 1)/2 interpoint distances can be preserved. Since the edge lengths of the MST are known to carry much information about the structure of the point set, it is only natural to use the N - 1 edge lengths of the MST as part of the distances to be preserved. Then still N - 2 further interpoint distances can be preserved. For this purpose Lee suggested (LEE74) to
- preserve the distance of all points to a fixed reference point, or
- preserve the distance of each point to its nearest point already mapped, or
- preserve the distance of each point to its farthest point already mapped.
A sketch of the triangulation step is given below. As with the other nonlinear mappings, the disadvantage is that the resulting transformation cannot be summarized by a few numbers and that adding a new data point demands a complete recomputation.
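As a minimal sketch, the triangulation step can be written as a circle intersection in the mapping plane; a full sequential mapping would call such a routine repeatedly, choosing the two reference points according to one of the strategies listed above. The function name and the fallback for non-intersecting circles are our own choices.

import numpy as np

def place_by_triangulation(Yi, Yj, d_ik, d_jk):
    # Place a new 2-D image point so that its distances to the already mapped points
    # Yi and Yj equal the original distances d_ik and d_jk (circle intersection).
    # If the circles do not intersect, the point is put on the line through Yi and Yj.
    Yi, Yj = np.asarray(Yi, float), np.asarray(Yj, float)
    d = np.linalg.norm(Yj - Yi)
    if d == 0:
        return Yi + np.array([d_ik, 0.0])
    a = (d_ik**2 - d_jk**2 + d**2) / (2 * d)   # distance from Yi along the axis Yi -> Yj
    h2 = d_ik**2 - a**2
    u = (Yj - Yi) / d                          # unit vector Yi -> Yj
    if h2 < 0:                                 # circles do not intersect: project on the line
        return Yi + a * u
    perp = np.array([-u[1], u[0]])             # unit vector perpendicular to Yi -> Yj
    return Yi + a * u + np.sqrt(h2) * perp     # one of the two intersection points

# toy usage: both original distances are reproduced exactly
Yk = place_by_triangulation([0, 0], [3, 0], d_ik=2.0, d_jk=2.5)
print(Yk, np.linalg.norm(Yk - [0, 0]), np.linalg.norm(Yk - [3, 0]))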
3.3 Comparison of Multidimensional Data Sets
A very interesting approach to the problem of two-sample tests for multidimensional data is due to Friedman and Rafsky (FRIE78). They generalize the Wald-Wolfowitz and the Smirnov tests. In the one-dimensional case, the Wald-Wolfowitz as well as the Smirnov test starts by sorting the pooled data in ascending order without regard to sample identity. The essential idea is now to use the MST as a generalization of this sorting to the multidimensional case. This is justified by two important properties of the MST:
1. The N nodes are connected by N - 1 edges.
2. The edges connect closely lying points.
The Wald-Wolfowitz test is then generalized as follows. The two samples of size m and n respectively are merged to give the pooled sample of size N = m + n, for which the MST is constructed. The number of runs R is defined by
$$R = \sum_{\text{edges } i} Z_i + 1,$$
where $Z_i = 1$ if edge i connects points from different samples and $Z_i = 0$ otherwise. It can be shown that under the null hypothesis (both samples drawn from the same distribution) and for large sample sizes the quantity W has a standard normal distribution:
$$W = \frac{R - E\{R\}}{\sigma\{R\}},$$
$$E\{R\} = \frac{2mn}{N} + 1,$$
$$\sigma^2\{R\} = \frac{2mn}{N(N-1)} \Big[\frac{2mn - N}{N} + \frac{C - N + 2}{(N-2)(N-3)}\,\big[N(N-1) - 4mn + 2\big]\Big].$$
Here C is the number of edge pairs sharing a common node. It should be noted that a multidimensional two-sample test can also be used to test factorization (FRIE73): the hypothesis to be tested is
$$f(x_1, x_2, \ldots, x_d) = f_1(x_1, \ldots, x_i)\, f_2(x_{i+1}, \ldots, x_d).$$
For this one compares the original sample with a sample derived from the original one by randomly exchanging the components $x_{i+1}, \ldots, x_d$ within the point set.
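A compact Python sketch of the generalized Wald-Wolfowitz test, following the formulas as reconstructed above and using scipy's MST routine (the function name is illustrative):

import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def friedman_rafsky_W(X, Y):
    # Generalized Wald-Wolfowitz (Friedman-Rafsky) two-sample statistic W (sketch).
    # X, Y: (m, d) and (n, d) arrays of points from the two samples.
    m, n = len(X), len(Y)
    N = m + n
    pooled = np.vstack([X, Y])
    labels = np.array([0] * m + [1] * n)
    mst = minimum_spanning_tree(distance_matrix(pooled, pooled))   # MST of pooled sample
    rows, cols = mst.nonzero()
    R = int(np.sum(labels[rows] != labels[cols])) + 1              # runs: cross-sample edges + 1
    deg = np.bincount(np.concatenate([rows, cols]), minlength=N)
    C = int(np.sum(deg * (deg - 1) // 2))                          # edge pairs sharing a node
    ER = 2.0 * m * n / N + 1.0
    varR = (2.0 * m * n / (N * (N - 1))) * (
        (2.0 * m * n - N) / N
        + (C - N + 2.0) / ((N - 2) * (N - 3)) * (N * (N - 1) - 4.0 * m * n + 2.0)
    )
    return (R - ER) / np.sqrt(varR)

rng = np.random.default_rng(0)
print(friedman_rafsky_W(rng.normal(0, 1, (60, 3)), rng.normal(0.5, 1, (80, 3))))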
Chapter 4
Cluster methods
In exploratory data analysis one is often confronted with the following problem: given a set of data, one has to find out whether there are groups of data such that members of a given group are similar and different groups are dissimilar. Obviously the meaning of the terms similar and dissimilar depends on the context in which the data analysis is performed.
Finding groups with the above stated properties has the following consequences:
- It shows that the data points build clusters, i.e. the data points do not occupy the available space randomly but are concentrated in some regions, so that the clusters are distinguishable from each other.
- A data reduction results, since the data set can be described by the features of the clusters.
- The investigator might try to interpret the resulting clusters, i.e. to explain the structure of the data in the context of the underlying process. The knowledge obtained this way can even be used to enter an iterative procedure of clustering (e.g. cutting some data, averaging or splitting some clusters, choosing another clustering criterion, etc.).
There is a great variety of algorithms for finding clusters in a given data set. One class (hierarchical clustering methods) does not aim at a unique solution but rather results in a set of solutions at different levels, such that two clusters of a lower level belong to one cluster at some higher level. Applications of hierarchical clustering algorithms in high energy physics are discussed in section 4.1. Section 4.2 deals with nonhierarchical clustering algorithms.
4.1 Hierarchical Methods
Hierarchical methods organize the data into a tree of groupings: merging algorithms start from the single data points and successively combine the most similar groups, while splitting algorithms start from the whole sample as one cluster at the highest level. In either case some measure is necessary to decide which groups are to be combined or which group has to be split.
One also needs some criterion to decide at which level the tree has to be cut to obtain the real clustering, since obviously not all the possible groupings depicted by the tree are of real interest.
We shall now describe in some detail the application of a splitting algorithm (4.1.1) as well as a merging algorithm (4.1.2).
4.1.1 Hierarchical Clustering using the MST
Minimal spanning trees (MST, see chapter 3) can be used for finding clusters (ZAHN71). Cutting one edge of the MST one is left with two unconnected parts. Proceeding in this way one derives a hierarchy of groupings as in fig. 4.1. The first application of this technique to high energy data was by Schotanus (SCHO78). The data consisted of 1000 events of the final state π⁺p → π⁺π⁰p at 5 GeV/c. The distance used in the construction of the MST was the Euclidean distance in the space of the effective-mass-squared variables m²(pπ⁺), m²(π⁺π⁰), …
The MST was cut at its most inconsistent edge, i.e. at the edge with the greatest ratio of edge length to the average edge length in its upward and downward neighbourhood, since such an edge corresponds to a density minimum.
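A simplified sketch of this MST-cutting procedure in Python: each edge is scored by its length relative to the mean length of the edges meeting its two endpoints (a depth-one version of the upward/downward neighbourhood), and the highest-scoring edges are cut. The names and the exact inconsistency measure are our own simplification, not Schotanus's definition.

import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(X, n_cuts=1):
    # Split the data by cutting the most 'inconsistent' MST edges.
    N = len(X)
    mst = minimum_spanning_tree(distance_matrix(X, X)).tocoo()
    edges = list(zip(mst.row, mst.col, mst.data))
    sums, counts = np.zeros(N), np.zeros(N)
    for i, j, w in edges:                       # mean length of edges attached to each node
        sums[[i, j]] += w
        counts[[i, j]] += 1
    mean_len = sums / np.maximum(counts, 1)
    ratio = [w / (0.5 * (mean_len[i] + mean_len[j])) for i, j, w in edges]
    keep = np.argsort(ratio)[:len(edges) - n_cuts]   # drop the n_cuts most inconsistent edges
    rows = [edges[k][0] for k in keep]
    cols = [edges[k][1] for k in keep]
    graph = csr_matrix(([1] * len(keep), (rows, cols)), shape=(N, N))
    return connected_components(graph, directed=False)[1]   # cluster labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(2, 0.2, (50, 2))])
print(np.bincount(mst_clusters(X, n_cuts=1)))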
Studying the resulting grouping it was concluded that edges in the overlap regions are rather shorter than those in neighbouring regions, so that the different clusters could not be separated from one another.
To achieve an improvement it was necessary to modify the criterion by taking into account additional information about the shape of the neighbourhood. For this the path collinearity was introduced. Path collinearity is a directional criterion which can be defined for any path through the MST; advantageously one uses a maximal path. The path collinearity at a certain point p with a leverage l_p is the angle between the straight lines connecting p with the points l_p places upward and downward along the path, respectively. The collinearity thus measures whether the data follow linear pieces (angle close to π) or show sharp bends. By means of the collinearity criterion the π⁺p → π⁺π⁰p data at 5 GeV/c could be separated into
a) a backward production region (baryon exchange),
b) Δ⁺⁺ resonance production,
c) ρ⁺ resonance production,
d) diffraction dissociation and Δ⁺ production,
e) elastic contamination from the reaction π⁺p → π⁺p.
The collinearity criterion is applicable if the data are essentially one-dimensional. It could be extended to higher dimensions by defining coplanarity etc., but it was suggested (SCHO76) that for this type of extension it would be better to analyze the full covariance matrix of the neighbourhood.
4.1.2 Selection of Jets by Hierarchical Clustering
High energy quarks and gluons produced in storage ring collisions manifest themselves as hadronic jets observed in the final state. As the total energy increases, QCD unambiguously predicts an increase of the jet multiplicity. Thus it becomes an important task to develop methods to recognize events of any jet multiplicity.
The most obvious feature of jets is the strong angular correlation, i.e. the appearance of a narrow cone built by the momenta of the particles belonging to a given jet. This quite naturally leads to the use of clustering algorithms for the study of jet events. For this the distance $d_{ik} = d(\vec{p}_i, \vec{p}_k)$ between the two particles i and k is defined such that small angles result in short distances. The first application of this idea is in (LANI81). For $d_{ik}$ they used
$$d_{ik} = \frac{1}{2}\Big(1 - \frac{\vec{p}_i \cdot \vec{p}_k}{|\vec{p}_i|\,|\vec{p}_k|}\Big).$$
The algorithm then combines (N - 1) times the two most similar (smallest $d_{ik}$) groups. For this it is necessary to state what the similarity of a combined group to the old ones is. If groups (= single particles at the beginning) i and k have been combined to give a group called m, the similarity $d_{ml}$ of the new group m to the remaining groups is chosen to be
$$d_{ml} = \min(d_{il}, d_{kl})$$
with l ≠ i and l ≠ k.
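A minimal sketch of such a jet-finding step in Python, using scipy's single-linkage clustering, which implements exactly the min-rule $d_{ml} = \min(d_{il}, d_{kl})$. Cutting the hierarchy at a fixed number of jets is our own choice for the example; in practice the cut level would be chosen from the data.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def angular_jet_clusters(momenta, n_jets=2):
    # Single-linkage clustering of three-momenta with d_ik = (1 - cos theta_ik) / 2.
    P = np.asarray(momenta, float)                    # (N, 3) three-momenta
    norms = np.linalg.norm(P, axis=1)
    cosines = (P @ P.T) / np.outer(norms, norms)
    D = 0.5 * (1.0 - np.clip(cosines, -1.0, 1.0))     # angular distance matrix
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="single")   # merge smallest d_ik first
    return fcluster(Z, t=n_jets, criterion="maxclust")

# toy event: two back-to-back groups of particles
rng = np.random.default_rng(3)
jet1 = rng.normal([1, 0, 0], 0.05, (10, 3))
jet2 = rng.normal([-1, 0, 0], 0.05, (10, 3))
print(angular_jet_clusters(np.vstack([jet1, jet2]), n_jets=2))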
4.2 Nonhierarchical Techniques
Nonhierarchical clustering techniques do not rely on a hierarchy of subsequent partitions of the data sample. They rather create in each iteration step a new assignment of the data points to the clusters according to some clustering criterion. We shall now describe in some detail three nonhierarchical clustering procedures, which have all been applied to high energy data:
- the nonparametric valley-seeking technique of Koontz and Fukunaga (KOON72),
- the cluster algorithm CLUCOV of Nowak and Schiller (NOWA75), and
- the interactive clustering technique of Gelsema (GELS74).
4.2.1 The Valley-Seeking Technique
We now describe the valley-seeking technique of Koontz and Fukunaga (KOON72) in a version designed for the analysis of many-particle final states in high energy physics. This technique has been applied to the reactions π⁺p → pπ⁺π⁺π⁻ at 8 and 16 GeV/c and π⁺p → pπ⁺π⁺π⁺π⁻π⁻ at 16 GeV/c.
The clustering criterion J is built from the distance $d_X(X_i, X_j)$ of the points in the observation space and a distance $d_\omega(w_i, w_j)$ between the class labels $w_i$ assigned to the points:
$$J = \sum_{i=1}^{N} \sum_{j=1}^{N} [d_\omega(w_i, w_j)]^2\, f_R[d_X(X_i, X_j)]$$
with
$$d_\omega(w_i, w_j) = \begin{cases} D,\ D > 0 & \text{for } w_i \ne w_j \\ 0 & \text{for } w_i = w_j \end{cases}$$
$$f_R[d_X(X_i, X_j)] = \begin{cases} 1 & \text{if } d_X(X_i, X_j) < R \\ 0 & \text{if } d_X(X_i, X_j) \ge R \end{cases}, \qquad R > 0.$$
Using the symmetry of $f_R$ with respect to $X_i$ and $X_j$ and $d_\omega(w_i, w_i) = 0$ for all i (a property of any metric), and assuming sufficiently small R, we obtain
$$J \simeq 2D^2 J_R, \qquad J_R = \sum_{i=1}^{N} \sum_{j=i+1}^{N} f_R[d_X(X_i, X_j)]\,[1 - \delta_{w_i w_j}].$$
$J_R$ assigns a nonzero penalty to each pair of vectors that are closer together than R but classified into different classes. Hence the main contributions to $J_R$ come from points near the boundary between two clusters.
Minimization of $J_R$ consequently forces the boundaries between the clusters to be drawn across regions of minimum density of data points. The name valley-seeking technique originates from this property.
This clustering criterion has the following advantages:
1. Computation: for a given classification, $J_R$ is determined by counting rather than by difficult calculations.
2. Storage: the storage requirement is mainly governed by the number of pairs of vectors that are closer together than R. This number can be kept small by choosing R sufficiently small.
3. The valley-seeking property makes the clustering criterion suitable for unsupervised classification.
Its disadvantages are:
1. Very distant clusters can receive the same label.
2. No account is taken of the inner structure of the clusters.
3. The cluster shape does not enter the clustering criterion.
As can be seen from fig. 4.2, pronounced V-shaped structures occur in practice which should be split at the edges.
Having defined a clustering criterion, it remains to choose an algorithm with which an optimum classification can be achieved efficiently. The minimization of the information loss $J_R$ is performed by an iterative reassignment of the data points to the classes (KOON72); a simplified sketch is given below.
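A simplified sketch of such an iterative reassignment in Python: each point is moved into the class that is most frequent among its neighbours within R, which does not increase the number of close cross-class pairs counted by $J_R$. This is meant as an illustration, not as the exact algorithm of (KOON72).

import numpy as np
from scipy.spatial import distance_matrix

def valley_seeking(X, labels, R, max_iter=50):
    # Iteratively reduce J_R (close pairs with different labels) by moving each point
    # into the class that is most frequent among its neighbours within R.
    X = np.asarray(X, float)
    labels = np.asarray(labels).copy()
    close = distance_matrix(X, X) < R
    np.fill_diagonal(close, False)
    for _ in range(max_iter):
        changed = False
        for i in range(len(X)):
            neigh = labels[close[i]]
            if neigh.size == 0:
                continue
            best = np.bincount(neigh).argmax()        # majority class in the R-neighbourhood
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:                               # stable assignment reached
            break
    return labels

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (80, 2)), rng.normal(3, 0.3, (80, 2))])
init = rng.integers(0, 2, len(X))                     # random initial classification
print(np.bincount(valley_seeking(X, init, R=0.8)))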
Figure 4.2: π⁺p → pπ⁺π⁰ (p_lab = 3.9 GeV/c): a) prism plot for Monte Carlo events (Lorentz-invariant phase space); b) prism plot for experimental data
4.2.2 The Cluster Algorithm CLUCOV
Analyzing three- and four-particle hadronic final states it was found (BRAU71) that clusters are generally ellipsoids with arbitrary orientation in phase space. This is also illustrated in fig. 4.2. The cluster algorithm CLUCOV was specially designed to meet this situation. A detailed description of the algorithm can be found in (NOWA75).
In the cluster algorithm CLUCOV the kth cluster $G^k$ is therefore characterized by the moments of order zero, one and two of the distribution of the $N^k$ points $X^m = (X_1^m, \ldots, X_{3n-5}^m)$ contained in this cluster. Let these points have their experimental weights $w^m$. Then the three moments are
- the number of points contained in the cluster $G^k$,
$$I^k = \sum_{m=1}^{N^k} w^m,$$
- the centroid $Q^k$ with components
$$Q_i^k = \frac{1}{I^k} \sum_{m=1}^{N^k} w^m X_i^m \quad \text{with } X^m \in G^k,$$
- and the covariance matrix $C^k$ with elements
$$C_{ij}^k = \frac{1}{I^k} \sum_{m=1}^{N^k} w^m (X_i^m - Q_i^k)(X_j^m - Q_j^k).$$
The eigenvectors of the covariance matrix point into the directions of the main axes of the ellipsoids by which the shape of the clusters is approximated. The eigenvalues of the covariance matrix give the lengths of the main axes, and the determinant measures the volume of the clusters.
All three moments enter the definition of the distance of a point m at $X^m$ from the kth cluster $G^k$. The number $I^k$ of points in cluster k is included in the distance measure as a linear weight factor, so that big clusters attract further points. The Euclidean distance $(X^m - Q^k)$ of a given point $X^m$ from the centroid $Q^k$ of the kth cluster enters the exponent of a gaussian containing also the covariance matrix $C^k$:
$$f_m^k = \frac{I^k}{\sqrt{(2\pi)^{3n-5}\,|C^k|}}\; \exp\Big[-\frac{1}{2}\,(X^m - Q^k)^T (C^k)^{-1} (X^m - Q^k)\Big].$$
Two clusters $G^k$ and $G^l$ are tested for merging by means of the measure
$$t = \frac{h_0}{\sqrt{h_k\, h_l}}.$$
The quantities $h_k$ and $h_l$ are the superpositions of the gaussians $f^k(X)$ and $f^l(X)$ of the clusters $G^k$ and $G^l$ evaluated at their centroids $Q^k$ and $Q^l$:
$$h_k = f^k(Q^k) + f^l(Q^k), \qquad h_l = f^k(Q^l) + f^l(Q^l),$$
and $h_0$ is the minimum of this superposition along the distance vector $(Q^k - Q^l)$ between the clusters:
$$h_0 = \min\,[f^k(X) + f^l(X)].$$
If for a pair of groups this measure t exceeds a limit $t_{\mathrm{merge}}$, a parameter of the merging procedure, the pair of groups is united into one group.
The compactness of the clusters is tested by subdividing the clusters with arbitrary hyperplanes through the cluster centroids. If the relative compactness of the two parts of a cluster is smaller than $t_{\mathrm{split}}$, a parameter of the splitting procedure, this cluster is split by the corresponding hyperplane.
Figure 4.3: Test results obtained with the cluster algorithm CLUCOV applied to two-dimensional data
Other measures of distance and compactness are possible within the cluster algorithm CLUCOV and can easily be implemented.
Finally we describe a starting procedure. The contents of all clusters are set to one, and all covariance matrices are set to the unit matrix. To find a starting set of cluster centroids it is demanded that no single point has a Euclidean distance of more than R (a parameter of the starting procedure) from at least one cluster center. This is achieved by the following procedure (sketched in code below):
1. Choose an arbitrary point as the first center.
2. Decide for each following point whether its distance to all existing cluster centers is greater than R. If yes, take this point as a new cluster center.
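A few lines of Python sketching this seeding step only; the full CLUCOV iteration with gaussian distances, merging and splitting is not reproduced here, and the function name is illustrative.

import numpy as np

def clucov_start_centres(X, R):
    # every data point must lie within a Euclidean distance R of at least one centre
    X = np.asarray(X, float)
    centres = [X[0]]                                   # 1. first point becomes the first centre
    for x in X[1:]:
        # 2. a point farther than R from all existing centres becomes a new centre
        if all(np.linalg.norm(x - c) > R for c in centres):
            centres.append(x)
    return np.array(centres)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
print(clucov_start_centres(X, R=1.0).shape)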
Fig. 4.3 demonstrates the capabilities of this algorithm on some two-dimensional test examples. For applications of this algorithm to many-particle hadronic final states see (HONE79, NAUM79).
4.2.3 Gelsema's Interactive Clustering Technique
This procedure, developed at CERN (GELS74), starts from the following considerations. Let the probability density distribution $h(X|b^*)$ of the points X in phase space with the distribution parameter vector $b^*$ consist of a mixture of M distributions,
$$h(X|b^*) = \sum_{k=1}^{M} p_k\, f(X|b_k),$$
and consider the information function
$$\rho(b, b^*) = \int \ln[h(X|b)]\; h(X|b^*)\, dX.$$
It can be shown (PATR72) that the vector b maximizing the information function corresponds to the asymptotic minimum-risk solution.
If $h(X|b^*)$ is a superposition of M nonoverlapping gaussians with relative weights $p_k$ and covariance matrices $C_k$, then maximizing $\rho(b, b^*)$ corresponds to maximizing (PATR72)
$$G(b) = \sum_{k=1}^{M} p_k\, \ln\frac{p_k}{|C_k|}.$$
As $|C_k|$ is related to the volume occupied by the category (or cluster) k, maximizing G(b) leads to that subdivision of the observation space which corresponds to maximum average probability density. A procedure that maximizes G(b) will therefore tend to locate clusters in the observation space.
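The following Python sketch evaluates G(b) for a given assignment of events to clusters, using the form written above; a more structured assignment gives a larger value. The function name and the toy comparison are illustrative.

import numpy as np

def gelsema_criterion(X, labels):
    # G(b) = sum_k p_k * ln(p_k / |C_k|) for the current assignment of events to clusters
    X = np.asarray(X, float)
    N = len(X)
    G = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        p_k = len(pts) / N                         # relative weight of cluster k
        C_k = np.cov(pts, rowvar=False)            # covariance matrix of cluster k
        G += p_k * np.log(p_k / np.linalg.det(C_k))
    return G

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(3, 0.2, (100, 2))])
good = np.array([0] * 100 + [1] * 100)             # assignment matching the two blobs
bad = rng.integers(0, 2, 200)                      # random assignment
print(gelsema_criterion(X, good), ">", gelsema_criterion(X, bad))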
The cluster procedure of Gelsema has the following general properties:
1. The number of clusters is fixed.
2. Initial cluster nuclei have to be defined on the basis of a priori knowledge.
3. Events may be left unclassified. This permits the treatment of clusters superimposed on a background.
The algorithm is interactive and works as follows. At the beginning, cluster nuclei have to be defined using some a priori knowledge. For each cluster nucleus k the fraction $p_k$ of events in this nucleus and the covariance matrix $C_k$ are calculated. This gives the starting value of G(b), qualifying the initial solution.
In each subsequent iteration step the events are tentatively assigned to each of the M existing clusters in turn and G(b) is recalculated. No updating of the classes is performed at this stage, but for every event the sequence of improvements
$$\Delta_i(b) = G_i(b) - G_{\mathrm{old}}(b)$$
for a tentative assignment to each class i = 1, ..., M is calculated and histogrammed separately for every class. The maximum improvement $\Delta_{\max}$ and the corresponding class number are stored. Events which really belong to class i will have larger values of $\Delta_i$.
The interaction between the data analyst and the program now consists of a visual inspection of these histograms on a display, leading to the definition of a threshold value $\tau_i$ of $\Delta_i$ above which events are assigned to class i. Now the clusters are updated. An event enters class i if both conditions
$$\Delta_{\max} = \Delta_i \quad \text{and} \quad \Delta_i > \tau_i$$
are satisfied. An event is omitted from class i if at least one of the conditions
$$\Delta_{\max} \ne \Delta_i \quad \text{or} \quad \Delta_i \le \tau_i$$
is satisfied.
A table of the number of reassignments to the clusters is displayed. If these numbers get small, the procedure has become stable and can be terminated. Applications of Gelsema's interactive clustering technique can be found in (BAUB77, VAIS76).
In order to achieve a meaningful separation, the peaks of high values of $\Delta_i$ in the histograms have to be well separated from the rest of the events. Then small changes in the cut values $\tau_i$ will not affect the final result. In such a case the procedure can even be run in an unsupervised way.
References
(ANDE73) M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, San Francisco, London 1973
(BAUB77) M. Baubillier et al., Multidimensional Analysis of the Reaction n + n at 9 GeV/c, and Multidimensional Analysis of the Reaction d + d at 9 GeV/c, both submitted to Intern. Conf. on High Energy Physics, Budapest 1977
(BECK76) L. Becker and H. Schiller, Possible Generalization of Yang Variables for the Study of Many Particle Final States, Berlin preprint PHE 76-23 (1976)
(BOET74) H. Böttcher et al., Nucl. Phys. B81 (1974) 365
(BRAN79) S. Brandt, H.D. Dahmen, Z. Phys. C1 (1979) 61
(BRAU71) J.E. Brau et al., Phys. Rev. Lett. 27 (1971) 1481
(BYCK73) E. Byckling and K. Kajantie, Particle Kinematics, J. Wiley and Sons (1973) p. 202
(BYER64) N. Byers and C.N. Yang, Rev. Mod. Phys. 36 (1964) 595
(CERN76) Topical Meeting on Multidimensional Data Analysis, CERN (1976)
(DORF80) J. Dorfan, SLAC-PUB-2623 (1980)
(DURA74) B.S. Duran, P.L. Odell, Cluster Analysis, Lecture Notes in Economics and Mathematical Systems, Springer Verlag 1974
(FRIE73) J.H. Friedman, SLAC-PUB-1358 (1973)
(FRIE74) J.H. Friedman, J.W. Tukey, IEEE Transactions on Computers, Vol. C-23 (1974) 881
(FRIE76) J.H. Friedman, CERN/DD/76/23, p. 19
(FRIE78) J.H. Friedman, L.C. Rafsky, SLAC-PUB-2116 (1978)
(FUKU71) K. Fukunaga, D.R. Olsen, IEEE Transactions on Computers, Vol. C-20 (1971) 176
(GELS74) E.S. Gelsema, Description of an Interactive Clustering Technique and its Applications, CERN/DD/74/16
(HONE79) R. Honecker et al., Ann. Physik Vol. 36 (1979) 199
(KITT76) W. Kittel, Progress in Multidimensional Analysis of High Energy Data, IVth International Winter Meeting on Fundamental Physics, Salardu (Spain) 1976