
MULTIDIMENSIONAL DATA ANALYSIS 1

Th. Naumann and H. Schiller

Institut für Hochenergiephysik der Akademie der Wissenschaften der DDR, Berlin-Zeuthen

1 This is a reprint from FORMULAE AND METHODS IN EXPERIMENTAL DATA EVALUATION published by the European Physical Society (Computational Physics Group) at CERN in January 1984

Contents

1 Introduction

2 Variables, Metrics, Similarity Measures

3 Some General Methods of Multidimensional Data Analysis
  3.1 Projection Pursuit
  3.2 Nonlinear Mapping
  3.3 Multidimensional Two Sample Test

4 Cluster methods
  4.1 Hierarchical Methods
    4.1.1 Hierarchical Clustering using the MST
    4.1.2 Selection of Jets by Hierarchical Clustering
  4.2 Nonhierarchical Techniques
    4.2.1 The Valley-Seeking Technique
    4.2.2 The Cluster Algorithm CLUCOV
    4.2.3 Gelsema's Interactive Clustering Technique
Chapter 1
Introduction
The necessity for the application of multidimensional data analysis in high energy physics arises from at least two reasons:
- Events of a given reaction are described by a (multiplicity dependent) large number of kinematical variables.
- The number of events is normally high, so that the application of statistical methods becomes essential.
Confronted with a large amount of multidimensional data, the physicist often chooses the traditional way of analyzing these data, i.e. to produce one- or two-dimensional projections of the data.
Low-dimensional projections of multidimensional data are often insufficient as they reveal only a small part of the information contained in the data. Multidimensional data analysis therefore aims to use the full experimental information.
If there exists a physical model of the measured process, the parameters of this model can be fitted to the data. Such parametric methods as partial wave analysis, analytical multichannel analysis or prism plot analysis rely heavily on a priori physical knowledge and will not be considered here.
Also beyond the scope of this survey are classification procedures which start from a known category structure within the data and sort the data points into categories defined beforehand. We describe methods to find structure in multidimensional data with minimum a priori knowledge of the underlying physics and data structure.
In chapter 2 the importance of a skilled choice of the variables spanning the multidimensional space and the definition of similarity in the case of cluster methods are discussed. Methods of lowering the dimensionality of the multidimensional space and of comparing multidimensional data sets are presented in chapter 3. When looking for an appropriate separation algorithm, the analyst has to take into account the computing demands of different algorithms. The hierarchical algorithms presented in chapter 4.1 are only applicable to relatively small numbers of data points, while the nonhierarchical methods of chapter 4.2 can also be applied to large data sets.
For theoretical discussion and applications of different methods of multidimensional data analysis we refer the reader to the review of Kittel (KITT76) and the proceedings of the meetings on high energy data analysis (CERN76, NIJM78). The reader especially interested in cluster analysis should consult the monographs of Anderberg (ANDE73) and Duran and Odell (DURA74).

Chapter 2
Variables, Metrics, Similarity Measures
Information and structure do not exist by themselves but only in the context of a given application. For example, the noise in a radio receiver contains no information for the ordinary listener, but it can well give some information on a defect in the receiver.
The existence of structure within multidimensional data sets depends largely on the variables that span the multidimensional space and on the metrics defined in this space. For classification procedures one furthermore has to decide on the meaning of similarity and dissimilarity of categories.
The importance of the choice of variables for the ultimate results of the analysis is well illustrated in fig. 2.1. Scaling one coordinate already suggests a new group structure within the four data points.

Figure 2.1: The effect of the scale of variables on the group structure

Leaving out relevant variables naturally makes a meaningful analysis impossible. Adding variables that are not very relevant to the purpose of the analysis but nevertheless induce a partition of the data is clearly misleading.
Another serious problem is the relative scale between variables of different origin. It is sometimes recommended to reduce all variables to standard form (zero mean and unit variance) at the beginning.
These considerations are meant as a warning: multidimensional data analysis cannot reveal some absolute information that preexists in the data. It can merely act as a heuristic tool to generate hypotheses on the structure of the data. The answers obtained from the analysis depend largely on the questions asked by the analyst.
Multidimensional data in high energy physics most frequently consist of the momentum and energy variables of many-particle final states. Therefore we shall consider the problem of the choice of variables in this context.
It would be desirable to find some physical requirements that make the choice of variables less arbitrary or even unique. One approach would be to look for a complete set of Lorentz-invariant variables which does not favour any of the final state particles.
The reaction

a + b → 1 + 2 + ... + n

can be described in terms of the following variables

(ai) = (p_a · p_i)
(bi) = (p_b · p_i)
(ik) = (p_i · p_k)

with i, k = 1, ..., n.
These variables are relativistically invariant and invariant under permutations of the incoming particles a, b and the final state particles i, k.
On the other hand they are directly related to momentum transfers and masses squared, so that they are relevant for the dynamics of the reaction to be investigated.
However, the number of invariants exceeds the number of independent variables, which is 3n − 5 for an n-particle final state at fixed total energy. To our knowledge there is no subset of variables that maintains the properties of Lorentz invariance and permutation symmetry. Therefore we have to resort to the somewhat weaker property of quasi-permutation invariance, which means permutation invariance up to a linear transformation. This leads to the demand for distance measures that are invariant under linear transformations.
Provided one has such a distance measure, a relativistically and quasi-permutationally invariant subset of variables should be good for a multidimensional analysis.
For a three-particle final state one can simply choose the four invariants

(a1), (a2)
(b1), (b2)

For four-particle final states, Yang (BYER64, BYCK73) proposed the following seven variables

(a1), (a2), (a3)
(b1), (b2), (b3)
and Δ_4(1234)

where Δ_4 is the Gram determinant defined by

\Delta_k(12 \ldots k) = \begin{vmatrix} (11) & \cdots & (1k) \\ \vdots & \ddots & \vdots \\ (k1) & \cdots & (kk) \end{vmatrix}

In order to generalize the Yang variables one has to take into account the kinematic constraints on the invariants. Such a generalization has been performed in (BECK76). For this purpose we define a quantity

Z_m^{(n)} = \Big[ (-1)^{m+1} \sum_{i_1 < \dots < i_m}^{n} \Delta_m(i_1, \dots, i_m) \Big]^{1/m}

This quantity is relativistically invariant and quasi-permutationally invariant with respect to the final state particles. Thus it can be used as a generalization of Δ_4 = Z_4^{(4)} for more than four particles in the final state. Furthermore, this invariant is constant for m = 1,

Z_1^{(n)} = \sum_{i=1}^{n} m_i^2

with m_i as the mass of particle i, so that only the Z_m^{(n)} with m > 1 can be incorporated in the variable sets.
For five particles in the final state we have the ten invariants

(a1), ..., (a4)
(b1), ..., (b4)

and Z_4^{(5)}, Z_3^{(5)}.

For six particles we need thirteen variables

(a1), ..., (a5)
(b1), ..., (b5)

and Z_4^{(6)}, Z_3^{(6)}, Z_2^{(6)}.

For a detailed analysis of many-particle final states the full set of (3n − 5) variables should be used. Experience has shown, however, that most of the global information on the reaction mechanisms is already contained in the (2n − 2) four-momentum products (BECK76)

(a1), ..., (a n−1)
(b1), ..., (b n−1)
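As a concrete illustration (added here, not part of the original text), the following sketch computes the four-momentum products (ai), (bi) and (ik) for a toy three-particle final state; the event and the function names are assumptions made purely for illustration.

import numpy as np

def minkowski_dot(p, q):
    """Four-momentum product p·q with metric (+,-,-,-); p, q given as (E, px, py, pz)."""
    return p[0] * q[0] - np.dot(p[1:], q[1:])

def invariant_products(p_a, p_b, finals):
    """Lorentz-invariant products (ai), (bi) and (ik) for an n-particle final state."""
    n = len(finals)
    ai = np.array([minkowski_dot(p_a, finals[i]) for i in range(n)])
    bi = np.array([minkowski_dot(p_b, finals[i]) for i in range(n)])
    ik = np.array([[minkowski_dot(finals[i], finals[k]) for k in range(n)] for i in range(n)])
    return ai, bi, ik

# assumed toy event (GeV): beam a, target b and three final-state four-momenta
p_a = np.array([5.0, 0.0, 0.0, 4.998])
p_b = np.array([0.938, 0.0, 0.0, 0.0])
finals = [np.array([2.0, 0.3, 0.1, 1.95]),
          np.array([1.5, -0.2, 0.0, 1.45]),
          np.array([2.438, -0.1, -0.1, 1.598])]
ai, bi, ik = invariant_products(p_a, p_b, finals)
print(ai, bi, np.diag(ik))   # the diagonal elements (ii) are the squared masses m_i^2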
We now turn our attention to the problem of metrics. A nonnegative real function d(X_i, X_j) is called a metric in a p-dimensional Euclidean space E_p if

1. d(X_i, X_j) ≥ 0 for all X_i and X_j in E_p
2. d(X_i, X_j) = 0 if and only if X_i = X_j
3. d(X_i, X_j) = d(X_j, X_i)
4. d(X_i, X_j) ≤ d(X_i, X_k) + d(X_k, X_j)

where X_i, X_j and X_k are any three vectors in E_p.
The most popular and commonly used metric is the Euclidean metric

d_2(X_i, X_j) = \Big[ \sum_{k=1}^{p} (X_{ki} - X_{kj})^2 \Big]^{1/2}

An absolute value norm

d_1(X_i, X_j) = \sum_{k=1}^{p} |X_{ki} - X_{kj}|

is computationally even cheaper.


A generalized Euclidean distance is the Mahalanobis metric (MAHA36)

D^2(X_i, X_j) = (X_i - X_j)^T C^{-1} (X_i - X_j).

The matrix C^{-1} is usually the inverse of the covariance matrix of a class of data points. The Mahalanobis distance has a very useful property: it is invariant under any nonsingular linear transformation. Thus it fits well to the Yang variables and their generalizations, which are only quasi-permutationally invariant, that is, invariant up to linear transformations. In chapter 4.2.2 a cluster algorithm is described which uses this metric together with the Yang variables.
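As a minimal numerical illustration (not from the original text), the sketch below computes Mahalanobis distances to the centroid of an assumed toy sample and checks the invariance under a nonsingular linear transformation.

import numpy as np

def mahalanobis_sq(x, y, cov_inv):
    """Squared Mahalanobis distance D^2 = (x - y)^T C^{-1} (x - y)."""
    diff = x - y
    return float(diff @ cov_inv @ diff)

rng = np.random.default_rng(0)
# toy sample of correlated 3-dimensional points standing in for invariant variables
sample = rng.multivariate_normal([1.0, 2.0, 0.5],
                                 [[1.0, 0.6, 0.0], [0.6, 2.0, 0.3], [0.0, 0.3, 0.5]], size=500)
cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
d2 = [mahalanobis_sq(x, sample.mean(axis=0), cov_inv) for x in sample]

# the same distances are obtained after an arbitrary nonsingular linear transformation
A = np.array([[2.0, 0.1, 0.0], [0.0, 1.0, -0.5], [0.3, 0.0, 1.0]])
mapped = sample @ A.T
cov_inv_m = np.linalg.inv(np.cov(mapped, rowvar=False))
d2_m = [mahalanobis_sq(x, mapped.mean(axis=0), cov_inv_m) for x in mapped]
print(np.allclose(d2, d2_m))   # True up to numerical precision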
For a general approach to the problem of similarity and dissimilarity of groups or clusters one should consult (DURA74). In practice the similarity measure or clustering criterion will be chosen within the context of the given problem.
The number of possible subdivisions of real data sets is astronomically high. Therefore it is impossible to find the best among all possible partitions for a given clustering criterion. Consequently, one also needs an algorithm for efficiently reaching an approximately optimum solution of the clustering problem.
Some clustering criteria and cluster procedures that have found application in high energy physics will be presented in chapter 4.

Chapter 3
Some General Methods of Multidimensional Data Analysis
Most experiments in high energy physics (but not only there) lead to multidimensional data. The methods to analyze such data are much less developed than for the one-dimensional case. In some cases one has a model attempting to describe the process from which the data derive. Then the methods of parametric density estimation such as maximum likelihood or moments can be used. If one is not so lucky as to have a model for the data, one will normally check whether all the dimensions are really needed to describe the data, i.e. one will look at the intrinsic dimensionality of the data. For this the principal component analysis can be used in the linear case. In the nonlinear case the generalized principal component analysis or the successive application of the principal component analysis (FRIE76, FUKU71) can be used.
In this section we shall describe some methods for exploratory data analysis which have proven to be of broad applicability.

3.1 Projection Pursuit

Projection pursuit (FRIE74) is a mapping technique which searches for the one- or two-dimensional projection exhibiting as much structure of the data as possible. First the one-dimensional case is discussed.
Let X_i (i = 1, ..., N) be the data set. Then the aim is to search for that direction a with |a| = 1 for which the so-called projection index I(a) is maximized. In (FRIE74) the following construction of I(a) is proposed:
I(a) = S(a) d(a)

with S(a) measuring the spread of the data (trimmed standard deviation),

S(a) = \Big[ \sum_{i=pN}^{(1-p)N} \frac{(X_i \cdot a - \bar{X}_a)^2}{(1-2p)N} \Big]^{1/2}

\bar{X}_a = \sum_{i=pN}^{(1-p)N} \frac{X_i \cdot a}{(1-2p)N}

supposing the X_i are ordered on their projected values X_i · a, and with d(a) being an average nearness function of the form

d(a) = \sum_{i=1}^{N} \sum_{j=1}^{N} f(r_{ij}) \, \Theta(R - r_{ij})

r_{ij} = |(X_i - X_j) \cdot a|

where Θ(η) = 1 for η > 0 and Θ(η) = 0 for η ≤ 0.
The function f(r) should be monotonically decreasing in [0, R]. The algorithm is insensitive to the special form of f(r), but

\bar{r} = \frac{\int_0^R r f(r)\, dr}{\int_0^R f(r)\, dr}

determines the size of the structure which is searched for. Finding the maximum of I(a) is a nonlinear problem which can be solved by standard procedures.
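The following sketch (added for illustration, not part of the original text) evaluates such a one-dimensional projection index for a given direction; the trimming fraction p, the cutoff R and the choice f(r) = 1 − r/R are assumptions.

import numpy as np

def projection_index(X, a, p=0.1, R=0.5):
    """I(a) = S(a) * d(a): trimmed spread times an average nearness term (illustrative)."""
    a = a / np.linalg.norm(a)
    proj = np.sort(X @ a)                        # projected values, ordered
    N = len(proj)
    trimmed = proj[int(p * N):int((1 - p) * N)]  # discard the tails
    S = trimmed.std()                            # trimmed standard deviation
    r = np.abs(proj[:, None] - proj[None, :])    # r_ij = |(X_i - X_j)·a|
    d = np.where(r < R, 1.0 - r / R, 0.0).sum()  # f(r) monotonically decreasing on [0, R]
    return S * d

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (200, 2)), rng.normal([3, 0], 0.3, (200, 2))])
# the direction separating the two blobs typically scores higher than the orthogonal one
print(projection_index(X, np.array([1.0, 0.0])), projection_index(X, np.array([0.0, 1.0])))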
If one aims to find two-dimensional projections, the generalization is straightforward: the projection is given by two orthogonal directions a, b (a · b = 0) and

S(a, b) = S(a) S(b)

r_{ij} = \big( [(X_i - X_j) \cdot a]^2 + [(X_i - X_j) \cdot b]^2 \big)^{1/2}

\bar{r} = \frac{\int_0^R r f(r)\, dr}{\int_0^R f(r)\, dr}

3.2 Nonlinear Mapping

The idea of mapping is to construct a low-dimensional (normally two-dimensional) map of the data points, preserving approximately the interpoint distances as they are in the original data set. Due to the extraordinary human gift for pattern recognition the investigator can then very competently detect the structure of the data, find clusters etc.
Let X_i (i = 1, ..., N) be the original points which are mapped to Y_i, and let

D_{ij} = |Y_i - Y_j|
d_{ij} = |X_i - X_j|

Then the mapping algorithm of Sammon (SAMM69) minimizes the error function

E(Y_1, \ldots, Y_N) = \frac{1}{N} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \Big( \frac{D_{ij} - d_{ij}}{d_{ij}} \Big)^2

A modification of this algorithm is due to Manton (MANT76). The main disadvantage of these algorithms is the high computational load, so they are applicable only to a few hundred data points.
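A minimal sketch of such a mapping (added here for illustration; it uses plain gradient descent with an assumed step size and starts from the first two coordinates, which is simpler than Sammon's original update):

import numpy as np

def sammon_like_map(X, n_iter=200, lr=0.05):
    """Minimize (1/N) sum_{i<j} ((D_ij - d_ij)/d_ij)^2 by gradient descent (sketch)."""
    N = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, 1.0)                     # diagonal is never used; avoid division by zero
    Y = X[:, :2].copy()                          # crude starting configuration
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        D = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(D, 1.0)
        w = (D - d) / (d ** 2 * D)               # derivative of the error w.r.t. D_ij (up to a factor 2)
        np.fill_diagonal(w, 0.0)
        Y -= lr * 2.0 * (w[:, :, None] * diff).sum(axis=1) / N
    return Y

X = np.random.default_rng(2).normal(size=(100, 5))   # toy 5-dimensional data
print(sammon_like_map(X).shape)                       # (100, 2) map approximately preserving distances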

A much simpler but still powerful technique is the sequential nonlinear mapping (LEE74). It starts from the fact that if Y_i and Y_j are chosen to preserve the original distance

D_{ij} = d_{ij}

it is always possible to place a third point Y_k such as to preserve the distances to the points i, j:

D_{ik} = d_{ik},  D_{jk} = d_{jk}

(triangulation). Thus at least 2N − 3 of the N(N − 1)/2 interpoint distances can be preserved. Since the edge lengths of the minimal spanning tree (MST) are known to carry much information about the structure of the point set, it is only natural to use the N − 1 edge lengths of the MST as part of the distances to be preserved. Then still N − 2 additional interpoint distances can be preserved. For this purpose Lee suggested (LEE74) to
- preserve the distance of all points to a fixed reference point, or
- preserve the distance of each point to its nearest point already matched, or
- preserve the distance to its farthest point already matched.
As with the other nonlinear mappings, the disadvantage is that the resulting transformation cannot be summarized with a few numbers and that adding a new data point demands a complete recomputation.

3.3 Multidimensional Two Sample Test

A very interesting approach to the problem of a two sample test for multidimensional data is due to Friedman and Rafsky (FRIE78). They generalize the Wald-Wolfowitz and the Smirnov tests. In the one-dimensional case the Wald-Wolfowitz as well as the Smirnov test start by sorting the pooled data in ascending order without regard to sample identity. The essential idea is now to use the MST as a generalization of this sorting. This is justified by two important properties of the MST:
1. The N nodes are connected by N − 1 edges.
2. The edges connect closely lying points.
The Wald-Wolfowitz test is then generalized as follows. The two samples of size m and n respectively are merged to give the pooled sample of size N = m + n, for which the MST is constructed. The number of runs R is defined by

R = \sum_{\rm edges} Z_i + 1

where Z_i = 1 if the edge links nodes from different samples and Z_i = 0 otherwise.
It can be shown that under the null hypothesis (both samples drawn from the same distribution) and for large sample sizes the quantity W has a standard normal distribution:

W = \frac{R - E\{R\}}{\sqrt{\sigma^2\{R\}}}

E\{R\} = \frac{2mn}{N} + 1

\sigma^2\{R\} = \frac{2mn}{N(N-1)} \Big[ \frac{2mn - N}{N} + \frac{C - N + 2}{(N-2)(N-3)} \big( N(N-1) - 4mn + 2 \big) \Big]
C is the number of edge pairs sharing a common node. It should be noted that a multidimensional two sample test can also be used to test factorization (FRIE73): the hypothesis to be tested is

f(x_1, x_2, \ldots, x_d) = f_1(x_1, \ldots, x_i) \cdot f_2(x_{i+1}, \ldots, x_d)

For this one compares the original sample with a sample derived from the original one by randomly exchanging the components x_{i+1}, ..., x_d within the point set.
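For illustration (not part of the original text), the sketch below computes the run count R on the MST of two pooled toy samples and the standardized statistic W, using the formulas quoted above:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix

def friedman_rafsky_W(A, B):
    """Generalized Wald-Wolfowitz statistic on the MST of the pooled sample (sketch)."""
    m, n = len(A), len(B)
    N = m + n
    pooled = np.vstack([A, B])
    labels = np.array([0] * m + [1] * n)
    mst = minimum_spanning_tree(distance_matrix(pooled, pooled)).tocoo()
    edges = list(zip(mst.row, mst.col))
    R = sum(labels[i] != labels[j] for i, j in edges) + 1       # number of runs
    deg = np.zeros(N)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    C = float((deg * (deg - 1) / 2).sum())                      # edge pairs sharing a common node
    ER = 2 * m * n / N + 1
    var = (2 * m * n / (N * (N - 1))) * ((2 * m * n - N) / N
          + (C - N + 2) / ((N - 2) * (N - 3)) * (N * (N - 1) - 4 * m * n + 2))
    return (R - ER) / np.sqrt(var)

rng = np.random.default_rng(3)
A = rng.normal(0.0, 1.0, (100, 3))
B = rng.normal(0.5, 1.0, (100, 3))      # shifted sample: W is expected to be clearly negative
print(friedman_rafsky_W(A, B))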


Chapter 4
Cluster methods
In exploratory data analysis one is often confronted with the following problem: given a set of data, one has to find out if there are groups of data such that members of a given group are similar and different groups are dissimilar. Obviously the meaning of the terms similar and dissimilar depends on the context in which the data analysis is performed.
Finding groups with the above stated properties has the following consequences:
- It shows that the data points build clusters, i.e. the data points do not occupy the available space randomly but are concentrated in some regions, so that the clusters are distinguishable from each other.
- A data reduction results, since the data set can be described by the features of the clusters.
- The investigator might try to interpret the resulting clusters, i.e. to explain the structure of the data in the context of the underlying process. The knowledge obtained this way can even be used to enter an iterative procedure of clustering (e.g. cutting some data, averaging or splitting some clusters, choosing another clustering criterion etc.).
There is a great variety of algorithms for finding clusters in a given data set. One class (hierarchical clustering methods) does not aim to find a unique solution but rather results in a set of solutions at different levels, such that two clusters of a lower level belong to one cluster at some higher level. Applications of hierarchical clustering algorithms in high energy physics are discussed in section 4.1. Section 4.2 deals with non-hierarchical clustering algorithms.

4.1 Hierarchical Methods

Hierarchical clustering methods applied to n data points result in (n − 1) possible groupings of points, normally represented as a tree as illustrated in fig. 4.1. At the lowest level each data point forms a group of its own, while at the highest level the whole data set is united in one group. The tree can be constructed either by successive merging of groups starting at the lowest level, or by successive splitting of groups starting at the highest level.

Figure 4.1: Example of a hierarchical tree (lowest level: n groups = the n data points; highest level: one group = the complete data set)

In either case some measure is necessary to decide which groups
are to be combined or which group has to be split.
One also needs some criterion to decide at which level the tree has to be cut to
obtain the real clustering since obviously not all the possible groupings depicted by
the tree are of real interest.
We shall now describe in some detail the application of a splitting algorithm
(4.1.1) as well as a merging algorithm (4.1.2).

4.1.1 Hierarchical Clustering using the MST

Minimal spanning trees (MST, see chapter 3) can be used for finding clusters (ZAHN71). Cutting one edge of the MST, one is left with two unconnected parts. Proceeding this way one derives a hierarchy of groupings as in fig. 4.1. The first application of this technique to high energy data was by Schotanus (SCHO78). The data consisted of 1000 events of the final state π⁺p → π⁺π⁰p at 5 GeV/c. The distance used in the construction of the MST was the Euclidean distance in the space of m²(pπ⁺), m²(π⁺π⁰), ...
The MST was cut at its inconsistent edge, i.e. at the edge with the greatest ratio of edge length to the average edge lengths of its upward and downward neighbourhoods, since this corresponds to a density minimum.
Studying the resulting grouping it was concluded that edges in the overlap regions are rather shorter than in neighbouring regions, so that different clusters could not be separated from one another.
To achieve an improvement it was necessary to modify the criterion by taking into account additional information about the shape of the neighbourhood. For this the path collinearity was introduced. Path collinearity is a directional criterion which can be defined for any defined path through the MST; advantageously one uses a maximal path. The path collinearity at a certain point p with a leverage l_p is the angle between the straight lines connecting p with the points l_p places upward and downward the path, respectively. It thus measures whether the data have linear pieces (collinearity angle close to 180°) or sharp bends. By means of the collinearity criterion the π⁺p → π⁺π⁰p data at 5 GeV/c could be separated into
a) a backward production region (baryon exchange),
b) Δ⁺⁺ resonance production,
c) ρ⁺ resonance production,
d) diffraction dissociation and Δ⁺ production,
e) elastic contamination from the reaction π⁺p → π⁺p.
The collinearity criterion is applicable if the data are essentially one-dimensional. It could be extended to higher dimensions by defining coplanarity etc., but it was suggested (SCHO76) that for this type of extension it would be better to analyze the full covariance matrix of the neighbourhood.
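A minimal sketch of this kind of MST clustering (added for illustration; here the inconsistent edge is simplified to the single longest MST edge, and the two-blob data are assumed):

import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial import distance_matrix

def mst_two_groups(X):
    """Build the MST, remove its longest edge and return the two resulting group labels."""
    mst = minimum_spanning_tree(distance_matrix(X, X)).tocoo()
    keep = np.argsort(mst.data)[:-1]              # drop the longest edge
    pruned = np.zeros((len(X), len(X)))
    pruned[mst.row[keep], mst.col[keep]] = mst.data[keep]
    _, labels = connected_components(pruned, directed=False)
    return labels

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.4, (150, 2)), rng.normal(3, 0.4, (150, 2))])
print(np.bincount(mst_two_groups(X)))             # roughly 150/150 for well separated blobs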

4.1.2 Selection of Jets by Hierarchical Clustering

High energy quarks and gluons produced in storage ring collisions manifest themselves as hadronic jets observed in the final state. As the total energy increases, QCD unambiguously predicts an increase of the jet multiplicity. Thus it becomes an important task to develop methods to recognize events of any jet multiplicity.
The most obvious feature of jets is the strong angular correlation, i.e. the appearance of a narrow cone built by the momenta of the particles belonging to a given jet. This quite naturally leads to the use of clustering algorithms for the study of jet events. For this the distance d_ik = d(p⃗_i, p⃗_k) between the two particles i and k is defined such that small angles result in short distances. The first application of this idea is in (LANI81). For d_ik they used

d_{ik} = \frac{1}{2} \Big( 1 - \frac{\vec{p}_i \cdot \vec{p}_k}{|\vec{p}_i| |\vec{p}_k|} \Big)

The algorithm then combines (N − 1) times the two most similar (smallest d_ik) groups. For this it is necessary to state what the similarity of a combined group to the old ones is. If groups (= particles at the beginning) i and k have been combined to give a group called m, the similarity d_ml of the new group m to the remaining groups is chosen to be

d_{ml} = \min(d_{il}, d_{kl})      with l ≠ i and l ≠ k.

Using this definition one arrives at the complete linkage.
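The sketch below (an illustration added here, not taken from LANI81) carries out this agglomerative grouping with the angular distance and the min-linkage update quoted above; the toy momenta and the stopping condition are assumptions.

import numpy as np

def angular_distance(p1, p2):
    """d_ik = (1/2)(1 - cos(theta_ik)) for two momentum vectors."""
    return 0.5 * (1.0 - np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2)))

def cluster_jets(momenta, n_jets):
    """Repeatedly merge the two most similar groups, updating d_ml = min(d_il, d_kl)."""
    n = len(momenta)
    members = {i: [i] for i in range(n)}                      # currently active groups
    d = np.full((n, n), np.inf)
    for i in range(n):
        for k in range(i + 1, n):
            d[i, k] = d[k, i] = angular_distance(momenta[i], momenta[k])
    while len(members) > n_jets:
        active = list(members)
        sub = d[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, k = active[a], active[b]                           # most similar pair of groups
        members[i] += members.pop(k)                          # merge group k into group i
        for l in members:
            if l != i:
                d[i, l] = d[l, i] = min(d[i, l], d[k, l])     # linkage update
    return list(members.values())

rng = np.random.default_rng(5)
axes = [np.array([1.0, 0.0, 0.0])] * 6 + [np.array([-1.0, 0.1, 0.0])] * 6
momenta = [ax * rng.uniform(1, 5) + rng.normal(0, 0.1, 3) for ax in axes]   # two toy "jets"
print(cluster_jets(momenta, n_jets=2))                        # particle indices grouped into two jets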


The cluster algorithm labels the particles according to their membership to a given group. But it does not answer the question of how many jets there are in the event, i.e. at which level the tree has to be cut to yield the real clustering. Thus for each event one has to make all possible hypotheses about the jet multiplicity and to decide which one is the most probable. In (LANI81) a straightforward generalization of the triplicity (BRAN79) was proposed that can be used in this decision procedure. It was concluded that
- the sketched clustering algorithm is well suited to find jets in multihadron final states,
- it is applicable to higher jet multiplicities,
- the particles are classified according to their membership to the jets.
Another approach, due to Dorfan (DORF80), uses

d_{ik} = \frac{\theta_{ik}^2}{|\vec{p}_i| |\vec{p}_k|}

with θ_ik being the angle between the momenta p⃗_i and p⃗_k.
Starting from this distance measure the MST (see chapter 3) is constructed, and inconsistent edges are cut if they are larger than R₁ times the median of the edge lengths. A detailed Monte Carlo study again leads to the conclusion that hadronic jets are reliably reproduced by the found clusters, which thereby allow for a meaningful study of the jet features.

4.2 Nonhierarchical Techniques

Nonhierarchical clustering techniques do not rely on a hierarchy of subsequent partitions of the data sample. They rather create in each iteration step a new assignment of the data points to the clusters according to some clustering criterion. We shall now describe in some detail three nonhierarchical clustering procedures, which have all been applied to high energy data:
- the nonparametric valley-seeking technique of Koontz and Fukunaga (KOON72),
- the cluster algorithm CLUCOV of Nowak and Schiller (NOWA75), and
- the interactive clustering technique of Gelsema (GELS74).

4.2.1 The Valley-Seeking Technique

We now describe the valley-seeking technique of Koontz and Fukunaga (KOON72) in a version designed for the analysis of many-particle final states in high energy physics.
Results of the application of this technique to the reactions

π⁺p → pπ⁺π⁺π⁻       at 8 and 16 GeV/c
π⁺p → pπ⁺π⁺π⁺π⁻π⁻   at 16 GeV/c

can be found in (BOET74).


The construction of the algorithm starts from the loss of information J which arises if one replaces N given data vectors [X_1, ..., X_N] by labels or cluster numbers [w_1, ..., w_N]:

J = \sum_{i=1}^{N} \sum_{j=1}^{N} f(X_i, X_j) \big[ d_X(X_i, X_j) - d_\Omega(w_i, w_j) \big]^2

Here d_X(X_i, X_j) denotes the distance between two vectors X_i and X_j, d_Ω is an appropriately defined metric for the distance between two classes or clusters, and the f(X_i, X_j) are weighting factors. The labels w_i can be integers from 1 to M (M < N) and denote the class to which X_i is assigned. The task of finding a meaningful partition of the data sample is now reformulated: one searches for the partition with minimal information loss J.
We start with the assumptions

d_Ω(w_i, w_j) = D (with D > 0) for w_i ≠ w_j, and d_Ω(w_i, w_j) = 0 for w_i = w_j

f(X_i, X_j) = f_R[d_X(X_i, X_j)] = 1 if d_X(X_i, X_j) < R, and 0 if d_X(X_i, X_j) ≥ R (with R > 0)

Using the symmetry of f_R with respect to X_i and X_j, the fact that d_Ω(w_i, w_i) = 0 for all i (a property of any metric), and assuming sufficiently small R, we obtain

J \approx 2 D^2 \sum_{i=1}^{N} \sum_{j=i+1}^{N} f_R[d_X(X_i, X_j)] (1 - \delta_{w_i w_j}) \equiv 2 D^2 J_R
J_R assigns a nonzero penalty to each pair of vectors that are closer together than R and classified into different classes. Hence the main contributions to J_R come from points near the boundary between two clusters.
Minimization of J_R consequently forces the boundaries between the clusters to be drawn across a region of minimum density of data points. The name valley-seeking technique originates from this property.
This clustering criterion has the following advantages:
1. Computation: For a given classification, J_R is determined by counting rather than by difficult calculations.
2. Storage: The storage requirement is mainly governed by the number of pairs of vectors that are closer together than R. This number can be kept small by choosing R sufficiently small.
3. The valley-seeking property makes the clustering criterion suitable for non-supervised classification.
Its disadvantages are:
1. Very distant clusters can receive the same label.
2. No account is taken of the inner structure of the clusters.
3. The cluster shape does not enter the cluster criterion.
As can be seen from fig. 4.2, pronounced V-shaped structures occur in practice which should be split at the edges.
Figure 4.2: π⁺p → pπ⁺π⁰ (p_lab = 3.9 GeV/c). a) Prism plot for Monte Carlo events (Lorentz-invariant phase space); b) prism plot for experimental data.

Having defined a clustering criterion, it remains to choose an algorithm for efficiently achieving an optimum classification. The minimization of the information loss J_R is performed by the following algorithm:
1. Choose an initial assignment of the N points to M classes.
2. For every point i count the number of points belonging to a given class within a certain distance R of X_i.
3. The point i is assigned to the class having the maximum number of points within R.
4. If any point is placed in a new class, return to step 2. Otherwise, stop.
M and R have to be determined empirically.
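A compact sketch of this relabelling loop (added here for illustration; the two-blob data and the values of R and M are assumptions):

import numpy as np

def valley_seeking(X, M, R, max_iter=100, seed=0):
    """Iteratively assign each point to the class most represented within distance R."""
    labels = np.random.default_rng(seed).integers(0, M, size=len(X))   # step 1
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    close = (dist < R) & (dist > 0)                  # pairs closer together than R
    for _ in range(max_iter):
        changed = False
        for i in range(len(X)):
            counts = np.bincount(labels[close[i]], minlength=M)
            if counts.sum() == 0:
                continue                             # isolated point keeps its label
            best = int(counts.argmax())              # steps 2 and 3
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:                              # step 4
            break
    return labels

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(4, 0.5, (200, 2))])
print(np.unique(valley_seeking(X, M=5, R=1.0), return_counts=True))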
For an application of the valley-seeking technique to the reaction π⁺p → pπ⁺π⁺π⁻ the following variables were chosen:

M(pπ⁺_f), M(pπ⁺_s), M(pπ⁻), M(π⁺_f π⁻), M(π⁺_s π⁻)

M denotes the invariant mass of the two particles in the bracket; the index f/s stands for the π⁺ with the greater and smaller longitudinal momentum, respectively.
As resonance production plays a decisive role in this reaction, this set of variables should permit to extract a significant part of the information contained in the data.
Inclusion of the four-momentum transfer variables t(p/p) and t(π⁺/π⁻) leads to a set of seven variables which completely describe this four-particle final state. Other sets of independent invariants can be derived from these variables by linear transformations.
As linear transformations do not change the results of the analysis, this choice of variables guarantees a high degree of generality of the procedure.
The number N of points X_i led to storage problems in this application. Therefore the algorithm was simplified as follows: N_A arbitrarily chosen points were submitted to the algorithm described above. The remaining N_B = N − N_A points were then assigned to the existing clusters according to steps 2 and 3 of the original algorithm. This improved the statistical significance of the existing clusters.
In this application a Euclidean metric was chosen for d_X(X_i, X_j).


The distance parameter R was chosen to be R = 0.455 GeV with M = 15 initial clusters. The following methodical results were obtained:
1. There exist clusters in phase space.
2. They correspond to dynamical mechanisms.
3. Several production mechanisms can contribute to one cluster.
4. Some mechanisms are well separated.
For detailed physical results see (BOET74).
The inclusion of the four-momentum transfers did not lead to a cleaner separation. Possibly these variables do not contain additional information, or this information could not be extracted because of statistical limitations.

4.2.2 The Cluster Algorithm CLUCOV

Analyzing three- and four-particle hadronic final states it was found (BRAU71) that clusters are generally ellipsoids with arbitrary orientation in phase space. This is also illustrated in fig. 4.2. The cluster algorithm CLUCOV was specially designed to meet this situation. A detailed description of the algorithm can be found in (NOWA75).
In the cluster algorithm CLUCOV the k-th cluster G^k is therefore characterized by the moments of order zero, one and two of the distribution of the N^k points X^m = (X_1^m, ..., X_{3n-5}^m) contained in this cluster. Let these points have their experimental weights w^m. Then the three moments are
- the number I^k of points X^m contained in the cluster G^k,

  I^k = \sum_{m=1}^{N^k} w^m

- the centroid Q^k of the cluster G^k,

  Q_i^k = \frac{1}{I^k} \sum_{m=1}^{N^k} w^m X_i^m      with X^m ∈ G^k

- the covariance matrix C^k of the cluster G^k,

  C_{ij}^k = \frac{1}{I^k} \sum_{m=1}^{N^k} w^m (X_i^m - Q_i^k)(X_j^m - Q_j^k)      with X^m ∈ G^k

The eigenvectors of the covariance matrix point into the directions of the main axes of the ellipsoids by which the shape of the clusters is approximated. The eigenvalues of the covariance matrix denote the lengths of the main axes, and the determinant measures the volume of the clusters.
All three moments enter the definition of the distance of a point m at X^m from the k-th cluster G^k.
The number I^k of points in cluster k is included in the distance measure as a linear weight factor, so that big clusters attract further points.
The Euclidean distance (X^m − Q^k) of a given point X^m from the centroid Q^k of the k-th cluster enters the exponent of a gaussian containing also the covariance matrix C^k:

f_m^k = \frac{I^k}{\sqrt{(2\pi)^{3n-5} |C^k|}} \exp\Big[ -\frac{1}{2} (X^m - Q^k)^T (C^k)^{-1} (X^m - Q^k) \Big]

Thus, each cluster builds its own metric.
In the directions of the main axes, distances are measured in units of the corresponding eigenvalue of the covariance matrix (which is suggested by the quadratic form in the exponent of the gaussian).
The determinant in the denominator of f_m^k favours compact clusters over voluminous clusters of the same content.
This distance measure is also invariant under linear transformations such as translations and rotations, and, when applying the Yang variables, also against permutations of the final state particles.
We now describe an iteration step of the algorithm. A starting procedure will be given later.
1. Calculate for all points X^m the distance measure f_m^k with respect to all existing clusters G^k.
2. Assign the point X^m to the cluster G^l with the biggest f_m^l.
3. If f_m^k is lower than a certain limit for all clusters (the limit is a parameter of the algorithm), assign X^m to a garbage cluster.
4. Update I^k, Q^k and C^k for all clusters. Go to 1.
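An illustrative sketch of one such iteration (added here; splitting and merging are omitted, and the cut value, the initial assignment and the toy data are assumptions):

import numpy as np

def clucov_weight(x, I, Q, C_inv, det_C, dim):
    """f_m^k = I^k / sqrt((2 pi)^dim |C^k|) * exp(-1/2 (x-Q)^T (C^k)^-1 (x-Q))."""
    diff = x - Q
    return I / np.sqrt((2 * np.pi) ** dim * det_C) * np.exp(-0.5 * diff @ C_inv @ diff)

def clucov_iteration(X, labels, n_clusters, f_min=1e-4, weights=None):
    """One CLUCOV-style pass: recompute the cluster moments, then reassign every point."""
    w = np.ones(len(X)) if weights is None else weights
    dim = X.shape[1]
    moments = []
    for k in range(n_clusters):
        sel = labels == k
        I = w[sel].sum()
        Q = np.average(X[sel], axis=0, weights=w[sel])
        C = np.cov(X[sel], rowvar=False, aweights=w[sel], bias=True)
        moments.append((I, Q, np.linalg.inv(C), np.linalg.det(C)))
    new_labels = np.empty(len(X), dtype=int)
    for m, x in enumerate(X):
        f = [clucov_weight(x, *mom, dim) for mom in moments]
        new_labels[m] = int(np.argmax(f)) if max(f) > f_min else -1   # -1 = garbage cluster
    return new_labels

rng = np.random.default_rng(7)
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 300),
               rng.multivariate_normal([4, 0], [[1.0, -0.6], [-0.6, 1.0]], 300)])
labels = (X[:, 0] > np.median(X[:, 0])).astype(int)    # crude starting assignment
for _ in range(10):
    labels = clucov_iteration(X, labels, n_clusters=2)
print(np.bincount(labels[labels >= 0]))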


In the algorithm CLUCOV it is also possible to split and merge clusters, that is, to change the number of clusters.
To achieve this, a measure t of the compactness of two clusters G^k and G^l is defined:

t = \frac{h_0}{\sqrt{h_k h_l}}

The quantities h_k and h_l are the superpositions of the gaussians f^k(X) and f^l(X) of the clusters G^k and G^l at their centroids Q^k and Q^l:

h_k = f^k(Q^k) + f^l(Q^k)      and      h_l = f^k(Q^l) + f^l(Q^l)

The quantity h_0 is the minimum of this superposition along the distance vector (Q^k − Q^l) between the clusters:

h_0 = \min \big[ f^k(X) + f^l(X) \big]

If for a pair of groups this measure t exceeds a limit t_merge, a parameter of the merging procedure, this pair of groups is united into one group.
The compactness of the clusters is tested by arbitrarily subdividing the clusters with hyperplanes through the cluster centroids. If the relative compactness of the two parts of a cluster is smaller than t_split, a parameter of the splitting procedure, this cluster is split by the corresponding hyperplane.

Figure 4.3: Test results obtained with the cluster algorithm CLUCOV applied to two-dimensional data

Other measures of distance and compactness are possible within the cluster algorithm CLUCOV and can easily be implemented.
Finally we describe a starting procedure. The contents of all clusters are set to one, and all covariance matrices are set to the unit matrix. To find a starting set of cluster centroids it is demanded that no single point has a Euclidean distance of more than R (a parameter of the starting procedure) from at least one cluster center. This is achieved by the following procedure:
1. Choose an arbitrary point as the first center.
2. Decide for each following point whether its distance to all existing cluster centers is greater than R. If yes, take this point as a new cluster center.
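A small sketch of this seeding step (added for illustration; R and the uniform toy data are assumptions):

import numpy as np

def starting_centers(X, R):
    """Greedy seeding: afterwards every point lies within R of at least one center."""
    centers = [X[0]]                                # 1. arbitrary first center
    for x in X[1:]:
        if min(np.linalg.norm(x - c) for c in centers) > R:
            centers.append(x)                       # 2. farther than R from all existing centers
    return np.array(centers)

X = np.random.default_rng(8).uniform(0, 10, (500, 2))
print(len(starting_centers(X, R=2.0)))              # number of initial cluster centers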
Fig. 4.3 demonstrates the capabilities of this algorithm on some two-dimensional test examples. For applications of this algorithm to many-particle hadronic final states see (HONE79, NAUM79).

4.2.3 Gelsema's Interactive Clustering Technique

This procedure, developed at CERN (GELS74), starts from the following considerations.
Let the probability density distribution h(X|b*) of the points X in phase space with the distribution parameter vector b* consist of a mixture of M distributions f(X|b_k) with weights p_k:

h(X|b^*) = \sum_{k=1}^{M} p_k f(X|b_k)

The task of clustering is now to find the distribution parameter vector b which is the best estimate of b* and is consistent with a set of observations from the density h(X|b*).
Now define the information function H(b, b*) as the expectation value of the natural logarithm of the mixture density,

H(b, b^*) = E[\ln h(X|b)] = \int \ln[h(X|b)] \, h(X|b^*) \, dX.

It can be shown (PATR72) that the vector b maximizing the information function corresponds to the asymptotic minimum-risk solution.
If h(X|b*) is a superposition of M non-overlapping gaussians with relative weights p_k and covariance matrices C_k, then maximizing H(b, b*) corresponds to maximizing (PATR72)

G(b) = \sum_{k=1}^{M} p_k \ln \frac{p_k}{\sqrt{|C_k|}}

As |C_k| is related to the volume occupied by the category (or cluster) k, maximizing G(b) leads to that subdivision of observation space which corresponds to maximum average probability density. A procedure that maximizes G(b) will therefore tend to locate clusters in the observation space.
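For illustration (added here, not part of the original text), the sketch below evaluates G(b) for a hard assignment of events to clusters; the toy data and assignments are assumptions.

import numpy as np

def G_of_assignment(X, labels, n_clusters):
    """G = sum_k p_k ln(p_k / sqrt(|C_k|)); label -1 marks unclassified events."""
    G = 0.0
    for k in range(n_clusters):
        sel = labels == k
        if sel.sum() < 2:
            continue                                 # need at least two events for C_k
        p_k = sel.sum() / len(X)
        det_C = np.linalg.det(np.cov(X[sel], rowvar=False))
        G += p_k * np.log(p_k / np.sqrt(det_C))
    return G

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.5, (200, 3)), rng.normal(3, 0.5, (200, 3))])
good = np.array([0] * 200 + [1] * 200)               # assignment following the true clusters
poor = rng.integers(0, 2, 400)                       # random assignment
print(G_of_assignment(X, good, 2), G_of_assignment(X, poor, 2))   # the good assignment gives the larger G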
The cluster procedure of Gelsema has the following general properties:
1. The number of clusters is fixed.
2. Initial cluster nuclei have to be defined on the basis of a priori knowledge.
3. Events may be left unclassified. This permits the treatment of clusters superimposed on a background.
The algorithm is an interactive one and works as follows. At the beginning, cluster nuclei have to be defined using some a priori knowledge. For each cluster nucleus k the fraction p_k of events in this nucleus and the covariance matrix C_k are calculated. This gives the starting value of G(b), qualifying the initial solution.
In each subsequent iteration step the events are tentatively assigned to each of the M existing clusters in turn and the corresponding G_i(b) is calculated. No updating of the classes is performed at this stage, but for every event the sequence of improvements

ΔG_i(b) = G_i(b) − G_old(b)

for a tentative assignment to all classes i = 1, ..., M is calculated and histogrammed separately for every class. The maximum improvement ΔG_max and the corresponding class number are stored. Events which really belong to class i will have larger values of ΔG_i.
The interaction between data analyst and program now consists in a visual inspection of these histograms on a display, leading to the definition of a threshold value T_i of ΔG_i above which events are assigned to class i. Now the clusters are updated.
An event enters class i if both conditions

ΔG_max = ΔG_i      and      ΔG_i > T_i

are satisfied.
An event is omitted from class i if at least one of the conditions

ΔG_max ≠ ΔG_i      or      ΔG_i ≤ T_i

is satisfied.
A table of the number of reassignments to the clusters is displayed. If these numbers get small, the procedure becomes stable and can be terminated. Applications of Gelsema's interactive clustering technique can be found in (BAUB77, VAIS76).
In order to achieve a meaningful separation, the peaks of high ΔG_i values in the histograms of ΔG_i have to be well separated from the rest of the events. Then small changes in the cut values T_i will not affect the final result. In such a case the procedure can even be run in an unsupervised way.


References
(ANDE73) M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, San Francisco, London 1973
(BAUB77) M. Baubillier et al., Multidimensional Analysis of the Reaction n + n at 9 GeV/c and Multidimensional Analysis of the Reaction d + d at 9 GeV/c, both submitted to Intern. Conf. on High Energy Physics, Budapest 1977
(BECK76) L. Becker and H. Schiller, Possible Generalization of Yang Variables for the Study of Many Particle Final States, Berlin preprint PHE 76-23 (1976)
(BOET74) H. Böttcher et al., Nucl. Phys. B81 (1974) 365
(BRAN79) S. Brandt, H.D. Dahmen, Z. Phys. C1 (1979) 61
(BRAU71) J.E. Brau et al., Phys. Rev. Lett. 27 (1971) 1481
(BYCK73) E. Byckling and K. Kajantie, Particle Kinematics, J. Wiley and Sons (1973) p. 202
(BYER64) N. Byers and C.N. Yang, Rev. Mod. Phys. 36 (1964) 595
(CERN76) Topical Meeting on Multidimensional Data Analysis, CERN (1976)
(DORF80) J. Dorfan, SLAC-PUB-2623 (1980)
(DURA74) B.S. Duran, P.L. Odell, Cluster Analysis, Lecture Notes in Economics and Mathematical Systems, Springer Verlag 1974
(FRIE73) J.H. Friedman, SLAC-PUB-1358 (1973)
(FRIE74) J.H. Friedman, J.W. Tukey, IEEE Transactions on Computers, Vol. C-23 (1974) 881
(FRIE76) J.H. Friedman, CERN/DD/76/23, p. 19
(FRIE78) J.H. Friedman, L.C. Rafsky, SLAC-PUB-2116 (1978)
(FUKU71) K. Fukunaga, D.R. Olsen, IEEE Transactions on Computers, Vol. C-20 (1971) 176
(GELS74) E.S. Gelsema, Description of an Interactive Clustering Technique and its Applications, CERN/DD/74/16
(HONE79) R. Honecker et al., Ann. Physik Vol. 36 (1979) 199
(KITT76) W. Kittel, Progress in Multidimensional Analysis of High Energy Data, IVth International Winter Meeting on Fundamental Physics, Salardu (Spain) 1976
(KOON72) W.L.G. Koontz and K. Fukunaga, IEEE Transactions on Computers, Vol. C-21 (1972) 171 and 967
(LANI81) K. Lanius, H.E. Roloff and H. Schiller, Z. Physik C, Particles and Fields 8 (1981) 251
(LEE74) R.C.T. Lee, J.R. Slagle and H. Blum, IEEE Computer Society Repository, R 74-230 (1974)
(MAHA36) P.C. Mahalanobis, On the Generalized Distance in Statistics, Proc. Natl. Inst. Sci. (India), Vol. 2 (1936) 49
(MANT76) N. Manton, Two Nonlinear Mapping Algorithms for Use in Multidimensional Data Analysis, Proc. Topical Conf. on Multidimensional Data Analysis, CERN 1976
(NAUM79) Th. Naumann and H. Schiller, Ann. Physik Vol. 36 (1979) 411
(NIJM78) Proceedings of the 3rd Topical Meeting on Multi-Dimensional Analysis of High Energy Data, Nijmegen (1978)
(NOWA75) W.D. Nowak and H. Schiller, Berlin preprint PHE 75-12 (1975)
(PATR72) E.A. Patrick, Fundamentals of Pattern Recognition, Prentice Hall Inc. 1972
(SAMM69) J.W. Sammon, IEEE Transactions on Computers, Vol. C-18 (1969) 401
(SCHO78) D.J. Schotanus, Topical Meetings on Multidimensional Data Analysis, CERN 1976 and Nijmegen 1978
(VAIS76) Ch. de la Vaissiere, Application of the CERN Interactive Cluster Analysis to Multibody pp Interactions, Topical Meeting on Multidimensional Data Analysis, CERN 1976
(ZAHN71) C.T. Zahn, IEEE Transactions on Computers, Vol. C-20 (1971) 68
