MIL Rules 4
June 2003
© 2003 Xin Xu
Abstract
to which they can best be applied. The empirical results presented in this thesis
show that they are competitive on standard benchmark datasets. Finally, we explore
some practical applications of MI learning, both existing and new ones.
This thesis makes three contributions: a new framework for MI learning, new MI
methods based on this framework and experimental results for new applications of
MI learning.
Acknowledgements
There are a number of people I want to thank for their help with my thesis.
First of all, my supervisor, Dr. Eibe Frank. I do not really know what I can say to express my gratitude. He contributed virtually all the ideas involved in this thesis. What is more important, he always provided me with his support for my work. When I was puzzled by a derivation or analysis of an algorithm, he was always the one to help me sort it out. His reviews of this thesis were always so detailed that any kind of mistake could be spotted, from errors of logic to grammatical slips. He even helped me with practical work and tools like LaTeX, gnuplot, shell scripts and so on. He truly provided me with much more than a supervisor usually does.
This thesis is dedicated to my supervisor, Dr. Eibe Frank.
Secondly, I would like to thank my project-mate, Nils Weidmann, a lot. I feel so lucky to have worked in the same research area as Nils. He shared with me many great ideas and provided me with much of his work, including datasets, which made my job much more convenient (and made me lazier). In fact Chapter 6 is the result of joint work with Nils Weidmann. He constructed the Mutagenesis datasets using the MI setting and in an ARFF file format [Witten and Frank, 1999] that allowed me to easily apply the MI methods developed in this thesis to them. He also kindly provided me with the photo files from the online photo library www.photonewzealand.com and the resulting MI datasets. Some of the experimental results in Chapter 6 are taken from his work on MI learning [Weidmann, 2003]. But he definitely helped me much more than that. When I met any difficulties during the write-up of my thesis, Nils was usually the first one I asked for help.
Thirdly, many thanks to Dr. Yong Wang. It was he who first introduced me to the empirical Bayes methods. He also spent so much precious time selflessly sharing with me his statistical knowledge and ideas, which inspired and benefited my study and research a lot. I am really grateful for this great help.
As a matter of fact, the Machine Learning (ML) family at the Computer Science Department here at the University of Waikato provided me with a superb environment for study and research. More precisely, I am grateful to our group leader Prof. Geoff Holmes, Prof. Ian Witten, Dr. Bernhard Pfahringer, Dr. Mark Hall, Gabi Schmidberger and Richard Kirkby (especially for helping me survive with WEKA). I once tried to name everybody working in the ML lab; when I kicked off my project we had only three people in the lab: Gabi, Richard and myself. Now there are so many people in the lab that I have eventually given up the attempt. But anyway, I would like to thank everyone in the ML lab for her/his help or concern regarding my work.
Help also came from outside the Computer Science Department. More specifically, I am thankful to Dr. Bill Bolstad and Dr. Ray Littler of the Statistics Department. Apart from the lectures they provided to me,² they also kindly answered lots of my (perhaps stupid) questions about statistics.
² I always regarded the comments from Ray at the weekly ML discussions as lectures to me.
As for the experiment and development environment used in this thesis, I heavily
relied on the WEKA workbench [Witten and Frank, 1999] to build the MI package.
My work would have been impossible without WEKA. As for the datasets, I would
like to thank Dr. Alistair Mowat for providing the kiwifruit datasets. I also want to
acknowledge the online photo gallery www.photonewzealand.com for their
(indirect) provision of a photo dataset through Nils.
Finally, thanks to my family for their consistent support of my project and research.
In fact their question of "haven't you finished your thesis yet?" has always been my motivation to push the progress of the thesis.
My research was funded by a postgraduate study award as part of Marsden Grant
01-UOW-019, from the Royal Society of New Zealand, and I am very grateful for
this support.
Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
2 Background
    2.2 Current Solutions
        2.2.1 1997-2000
        2.2.2 2000-Now
    2.3 A New Framework
    2.4 Methodology
3 A Heuristic Solution for Multiple Instance Problems
    3.1 Assumptions
    3.4 Interpretation
    3.5 Conclusions
4 Upgrading Single-instance Learners
    4.1 Introduction
    4.3 An Assumption-based Upgrade
    4.5 Experimental Results
    4.6 Related Work
    4.7 Conclusions
5
    5.1 Introduction
    5.5 Experimental Results
    5.6 Related Work
    5.7 Conclusions
6
    6.3 Image Categorization
    6.4 Conclusion
7 Algorithmic Details
    7.1 Numeric Optimization
8
    8.1 Conclusions
Appendices
A Java Classes for MI Learning
List of Figures

2.1 Data generation for (a) single-instance learning and (b) multiple-instance learning
3.3 Test errors on the artificial data of the wrapper method trained on masked and unmasked data
List of Tables

3.1 The best accuracies (and standard deviations) achieved by the wrapper method on the Musk datasets
6.2 Error rate estimates for the Mutagenesis datasets and standard deviations (if available)
Chapter 1
Introduction
Multiple instance (MI) learning has been a popular research topic in machine learning since it first appeared seven years ago in the pioneering work of Dietterich et al. [Dietterich, Lathrop and Lozano-Perez, 1997]. One of the reasons that it attracts so many researchers is perhaps the exotic conceptual setting that it presents. Most of machine learning follows the general rationale of learning from examples, and MI learning is no exception. But unlike the single-instance learning problem, which describes an example using one instance, the MI problem describes an example with multiple instances. However, there is still only one class label for each example. At this stage we try to avoid any special notation or terminology and simply give some rough ideas of the MI learning problem using real-world examples. Whilst some practical applications of MI learning will be discussed in detail in Chapter 6, we briefly describe them here to give some flavor of how MI problems were identified.
fit because the disease may originate in only one or few fruits in a batch. However,
even if some fruits indeed exhibit some minor symptoms, the majority of fruits in a
batch may be resistant to the disease, rendering the symptoms for the whole batch
negligible. Hence the batch can be negative even if some fruits are affected to a
small degree.
There are also some real-world problems that are not obviously MI problems. However, with some proper manipulation, one can model them as MI problems and generate MI datasets for them. The content-based image processing task is one of the most popular ones [Maron, 1998; Maron and Lozano-Perez, 1998; Zhang, Goldman, Yu and Fritts, 2002], while the stock market prediction task [Maron, 1998; Maron and Lozano-Perez, 1998] and the computer intrusion prediction task [Ruffo, 2001] are also among them.
The key to modeling content-based image processing as an MI problem is that only some parts of an image account for the key words that describe it. Therefore one can view each image as an example and small segments of that image as instances. Some measures are taken to describe the pixels of an image. Since there are various ways of fragmenting an image and at least two ways to measure the pixels (using RGB values or using Luminance-Chrominance values), there could be many configurations of the instances. The class label of an image is whether its content is about a certain concept, and the task is to predict the class label given a new image. When the title of an image is a single simple concept like "sunset" or "waterfall", the MI assumption is believed to be appropriate because only a small part of a positive image really accounts for its title while no parts of a negative image can account for the title. However, for more complicated concepts, it may be necessary to drop this assumption [Weidmann, 2003]. In this thesis, we consider this application and the others in more detail in Chapter 6.
We aimed to achieve these objectives to the extent we could. Although some objectives may be too big to fully accomplish, it is hoped that this thesis improves the
interpretation and understanding of MI problems.
¹ As a matter of fact, for some of these methods, it is actually claimed that they use the standard MI assumption stated above. We also noticed that some methods actually explicitly expressed assumptions that they never used.
Chapter 2
Background
Figure 2.1: Data generation for (a) single-instance learning and (b) multiple-instance learning [Dietterich et al., 1997].
described above. It still has examples with attributes and class labels; one example still has only one class label; and the task is still the inference of the relationship between attributes and class labels. The only difference is that every one of the examples is represented by more than one feature vector. If we regard one feature vector as an instance of an example, normal supervised learning has only one instance per example while MI learning has multiple instances per example, hence the name "Multiple Instance" learning.
The difference can best be depicted graphically as shown in Figure 2.1 [Dietterich et al., 1997]. In this figure, the "Object" is an example described by some attributes, the "Result" is the class label and the "Unknown Process" is the relationship. Figure 2.1(a) depicts the case of normal supervised learning while in 2.1(b) there are multiple instances in an example. We interpret the Unknown Process in Figure 2.1(b) as different from that in Figure 2.1(a) because the input is different. Note that the dashed and solid arrows representing the input of the process in (b) imply that only some of the input instances may be useful. Therefore while the Unknown Process in (a) is simply a classification problem, the Unknown Process in (b) is commonly viewed as a two-step process, with a first step consisting of a classification problem and a second step that is a selection process based on the first step and some assumptions.¹ However, as we shall see, there are methods that do not even consider the instance selection step and directly infer the output from the interactions between the input instances. But the assumption of a selection process has played such an influential role in the MI learning domain that virtually every paper in this area quotes it. We refer to this standard MI assumption simply as "the MI assumption" throughout this thesis for brevity. Let us briefly review what this assumption entails.
¹ We may like to call the selection process a "parametric" instance selection process because it is based on the model built in the classification process.
Within a two-class context, with class labels {positive, negative}, the MI assumption states that an example is positive if at least one of its instances is positive and negative if all of its instances are negative. Note that this assumption is based on instance-level class labels, thus the "Unknown Process" in Figure 2.1(b) has to consist of two steps: the first step provides the instances' class labels and the MI assumption is applied in the second step. However, we noticed that several MI methods do not actually follow the MI assumption (even though it is generally mentioned), and do not explicitly state which assumptions they use. As a matter of fact, we believe that this assumption may not be so essential to making accurate predictions. What matters is the combination of the model from the first step and the assumption used in the second step. Given a certain type of model for classification, it may be appropriate to explicitly drop the MI assumption and establish other assumptions in the second step. We will discuss this in Section 2.4.
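To make the contrast concrete, the following small sketch (my own illustration in Python, not code from the thesis) compares a bag-labeling rule under the MI assumption with a collective-style alternative that aggregates instance-level probabilities instead of selecting a single instance:

    from typing import Callable, List, Sequence

    Instance = Sequence[float]   # one feature vector
    Bag = List[Instance]         # an example is a bag of instances

    def bag_label_mi(bag: Bag,
                     is_positive: Callable[[Instance], bool]) -> bool:
        # MI assumption: a bag is positive iff at least one instance is positive.
        return any(is_positive(x) for x in bag)

    def bag_label_average(bag: Bag,
                          prob_positive: Callable[[Instance], float]) -> bool:
        # One possible alternative: threshold the average instance-level
        # class probability, so every instance contributes to the label.
        return sum(prob_positive(x) for x in bag) / len(bag) > 0.5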
The two-step paradigm itself is only one possibility for modeling the MI problem. We noticed that in general, when extending the number of instances of an example from one to many, we may have a potentially very large number of possibilities to model the relationship between the set of instances and their class label. For
2.2.1 1997-2000
The first MI algorithm stems from the pioneering paper by Dietterich et al. [1997], which also introduced the aforementioned Musk datasets. The APR algorithms [Dietterich et al., 1997] modeled the MI problem as a two-step process: a classification process that is applied to every instance and then a selection process based on the MI assumption. A single Axis-Parallel hyper-Rectangle (APR) is used as the pattern to be found in the classification process. As a parametric approach,² the objective of these methods is to find the parameters that, together with the MI assumption, can best explain the class labels of all the examples in the training data. In other words, they look for the parameters involved in the first step according to the observed class labels formed in the second step and the assumed mechanism between the two steps. There are standard algorithms that can build an APR (i.e. a single if-then rule). Unfortunately they do not take the MI assumption into account. Thus a special-purpose algorithm is needed. Several basic APR algorithms were proposed [Dietterich et al., 1997], but interestingly the best method for the Musk data was not among them. The best APR algorithm for the Musk data consists of an iterative process with two steps: the first step expands an APR from a positive seed instance using a "back-fitting" algorithm, and the second step selects useful features greedily based on some specifically designed measures. It turned out that the APR that best explains the training data does not generalize very well. Hence the objective was changed a little to produce the lowest generalization error, and kernel density estimation (KDE) was used to refine some of the parameters, namely the boundaries of the hyper-rectangle. These tuning steps resulted in an algorithm called "iterated-discrim APR", which gave good results on the Musk datasets.
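As a rough illustration only (this is not the iterated-discrim algorithm itself, whose back-fitting, feature selection and KDE steps are omitted), the two elementary operations involved can be sketched as follows: classifying a bag against an APR under the MI assumption, and widening an APR grown from a positive seed so that it covers one more chosen instance:

    import numpy as np

    def apr_classify(bag, lower, upper):
        # bag: (n_instances, n_features); lower/upper: APR bounds per feature.
        # Under the MI assumption a bag is positive iff at least one of its
        # instances lies inside the axis-parallel rectangle.
        inside = np.all((bag >= lower) & (bag <= upper), axis=1)
        return bool(inside.any())

    def expand_apr(lower, upper, instance):
        # Elementary growing step: widen the bounds just enough to cover
        # one additional instance (starting from a positive seed instance).
        return np.minimum(lower, instance), np.maximum(upper, instance)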
It is interesting that Dietterich et al. based all the algorithms on the combination of the APR pattern and the MI assumption and never considered dropping either of them even when difficulties were encountered. This might be due to their background knowledge, which made them believe an APR and the MI assumption are appropriate for the Musk data. However, using KDE may have already introduced a bias into the parameter estimates, which indicates that it might be more appropriate to model the Musk datasets in a different way.
In spite of the above observation, the combination of the APR and the MI assumption dominated the early stage of MI learning. Researchers from computational learning theory played an active role in MI learning before 2000. Some PAC al-
² In this case, the parameters to be found are the useful features and the bounds of the hyper-rectangle along these features.
ability of the class label of a test feature vector. According to statistical decision
theory, one should make the prediction based on whether the probability is over 0.5
in a two-class case.
But this is in single-instance supervised learning. In the MI domain, the first (and so far the only published) probabilistic model is the Diverse Density (DD) model proposed by Maron [Maron, 1998; Maron and Lozano-Perez, 1998]. As will be described in more detail in Chapters 4 and 7, DD was also heavily influenced by the combination of the APR and the MI assumption and by the two-step process. The DD method actually used the maximum binomial log-likelihood method, a statistical paradigm used by many normal single-instance learners like logistic regression, to search for the parameters in the first step. In particular, to model an APR in the first step, it used a radial form or a Gaussian-like form to model $\Pr(Y|X)$.
Because of the radial form, the pattern (or decision boundary) of the classification step is an Axis-Parallel hyper-Ellipse (APE) instead of a hyper-rectangle.³ In the second step, where we assume that we have already obtained each instance's class probability, we still need ways to model the process that decides the class label of an example.⁴ There were two ways in DD, namely the noisy-or model and the most-likely-cause model, which are both probabilistic ways to model the MI assumption. In the single-instance case, the process that decides the class label of each example (i.e. each instance) can be regarded as a one-stage Bernoulli process. Now, since we have multiple instances in an example, it seems natural to extend it to a multiple-stage Bernoulli process, with each instance's (latent) class label determined by its class probability in one stage.⁵ And this is exactly the noisy-or "generative model" in DD. As we shall see, according to [Maritz and Lwin, 1989], a very similar way of modeling was adopted by some statisticians as early as 1943 [von Mises, 1943]. However, we can also model the process as a one-stage Bernoulli process if we assume some way to "squeeze" the multiple probabilities (one per instance) into one probability. The most-likely-cause model in DD is of this kind. Either way, we can form a binomial log-likelihood function, either multi-stage or one-stage. The noisy-or model computes the probability of seeing all the stages negative and its complement, the probability of seeing at least one stage positive. The most-likely-cause model picks only one instance's probability in an example to form the binomial log-likelihood. It selects the instance within an example that has the highest probability of being positive. Both processes to generate a bag's class label are model-based⁶ and are compatible with the MI assumption. By maximizing the log-likelihood function we can find the parameters involved in the radial formulation.
consistency and efficiency, one can usually assure the correctness of the solution if
³ Note that the function for a rectangle is not differentiable whereas that of an ellipse is. But in the sense of classification decision making, they are very similar.
⁴ The process that decides an example's class label was called the "generative model" in DD.
⁵ Note that the normal binomial distribution formula $C_n^r p^r (1-p)^{n-r}$ does not apply here because in this Bernoulli process, the probability $p$ changes from stage to stage.
⁶ The processes are based on the radial formulation of $\Pr(Y|X)$.
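The two generative models are easy to state in code. Assuming the instance-level class probabilities have already been produced by the first (classification) step, a minimal sketch (my own illustration) is:

    import numpy as np

    def noisy_or(p):
        # Multi-stage Bernoulli process: the bag-level probability is the
        # probability that at least one instance-level "stage" is positive.
        return 1.0 - np.prod(1.0 - p)

    def most_likely_cause(p):
        # One-stage process: squeeze the instance probabilities into one by
        # picking the instance most likely to be positive.
        return float(np.max(p))

    p = np.array([0.1, 0.3, 0.8])   # instance-level class probabilities
    print(noisy_or(p))              # 0.874
    print(most_likely_cause(p))     # 0.8

Both satisfy the MI assumption in the limit: if every instance probability is 0 the bag's probability is 0, and if some instance probability is 1 it is 1.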
2.2.2 2000-Now
The new millennium saw the breaking of the MI assumption and the abandonment of APR-like formulations. Virtually no new methods (apart from a neural-network-based method) created since then use the MI assumption, although interestingly enough some of them were motivated based on this assumption. Hence it may not be fair to compare these methods with the methods developed before 2000 because they were based on different assumptions. However, one can usually compare them in the sense of verifying which assumptions and models are more appropriate for the real-world data.⁷
⁷ That is why we list it here and not among the new methods after 2000, although it was published in 2002.
The new methods created since 2000 all aim to upgrade single-instance learners to deal with MI data. The methods upgraded so far are decision trees [Ruffo, 2001], nearest neighbour [Wang and Zucker, 2000], neural networks [Ramon and Raedt, 2000], decision rules [Chevaleyre and Zucker, 2001] and support vector machines (SVMs) [Gartner, Flach, Kowalczyk and Smola, 2002]. Nevertheless, the techniques involved in some of these methods are significantly different from those used in their single-instance parents. We can categorize them into instance-based methods and metadata-based methods. The term "instance-based" denotes that a method tries to select some (or all) representative instances from an example and model these representatives for the example. The selection could be based on the MI assumption or, more often after 2000, not. The term "metadata-based" means that a method actually ignores the instances within an example. Instead it extracts some metadata from an example that is no longer related to any specific instances. The metadata-based approaches cannot possibly adhere to the MI assumption because the MI assumption must be associated with instance selection within an example. We briefly describe the post-2000 methods using this categorization, which is also the backbone of the framework we will discuss in the next section.
The instance-based approaches are the nearest neighbour technique, the neural network, the decision rule learner, and the SVM (based on a multi-instance kernel). The MI nearest neighbour algorithms [Wang and Zucker, 2000] introduce a measure that gives the distance between two examples, namely the Hausdorff distance. They basically regard the distance between two examples as the distance between the representatives within each example (one representative instance per example). The selection of the representative is based on the maximum or minimum of the distances between all the instances from the two examples. While it is not totally clear from the paper [Wang and Zucker, 2000] what the so-called Bayesian-KNN does in the testing phase, the Citation-KNN method definitely violates the MI assumption because it decides a test example's class label by the majority class of its nearest neighbours.
⁸ Since the output value is in $[0, 1]$, we can regard it as the probability of being positive.
⁹ There is no information on how a test example is classified in [Chevaleyre and Zucker, 2001].
dot products between instances from two examples. This can be combined with another non-linear kernel, e.g. an RBF kernel. This effectively assumes that the class label of an example is the true class label of all the instances inside it, and attempts to search for the hyperplane that can separate all (or most)¹⁰ of the training examples in an extended feature space (resulting from the RBF kernel function). Since this is done in the same way for both positive and negative examples, the MI assumption is not used in this method at all. Indeed, we observe that some methods, including SVMs, that do not model the probability $\Pr(Y|X)$ find the MI assumption very hard to apply, if not impossible. It would be very convenient for those methods to have other assumptions associated with the measure they attempt to estimate.
¹⁰ The regularization parameter C in SVMs will tolerate some errors in the training data.
The metadata-based approach is implemented in the MI decision tree learner RELIC and in the SVM based on a polynomial minimax kernel. This approach extracts some metadata from each example, and regards such metadata as the characteristics of the example. When a new example is seen, we can directly predict its class label from the metadata without knowing the class labels of the instances. Therefore each individual instance is not important in this approach; what matters is the underlying properties of the instances. Hence we cannot tell which instance is positive or negative because an example's class label is associated with the properties that are presented by the attribute values of all the instances. Hence this approach cannot possibly use the MI assumption.
The MI decision tree learner RELIC [Ruffo, 2001] is of this kind. At each node in the tree, RELIC partitions the examples according to the following method. For a nominal attribute with values 1, 2, ..., R, it assigns [...]. For a numeric attribute, given a threshold t, there are two subsets to be chosen: the subset less than t and the subset greater than t. An example is assigned to either subset based on two types of tests. The first type assigns an example [...] and otherwise to the subset greater than t. For the second type, it assigns an example [...]. The split is selected as in the single-instance tree learner C4.5 [Quinlan, 1993]. Note that for numeric attributes, RELIC looks for the best of the two types of tests simultaneously, so that only one type of test and one threshold will be selected for each numeric attribute.
The way that RELIC assigns examples to subsets means that it is equivalent to extracting some metadata, namely "minimax" values, from each example and applying the single-instance learner C4.5 to the transformed data. Since RELIC examines the attribute values along each dimension individually, such metadata of an example does not correspond to any specific instance inside it, although it is possible, but very unlikely, for one instance to match the minimax values of all the attributes simultaneously. Moreover, in the testing phase, we can directly tell an example's class label using the tree without obtaining the class labels of the instances. Thus the MI assumption obviously does not apply.
The SVM with a polynomial minimax kernel explicitly transforms the original feature space into a new feature space with twice the number of attributes. For each attribute in the original feature space, two new attributes are created in the transformed space: the minimal value and the maximal value. It then maps each example in the original feature space into an instance in the new space by finding the minimum and maximum value of each attribute over the instances in that example. Clearly some information is lost during the transformation, and this is significantly different from the two-step paradigm used in [Dietterich et al., 1997]. The MI assumption cannot possibly apply. Note that this is effectively the same as what RELIC does.
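The transformation that both RELIC (implicitly) and the minimax kernel (explicitly) rely on is very compact in code; the following sketch (hypothetical code for illustration) maps each bag to a single instance in the doubled feature space, after which any propositional learner can be applied:

    import numpy as np

    def minimax_features(bag):
        # bag: (n_instances, n_features). The new instance consists of the
        # per-attribute minima followed by the per-attribute maxima.
        return np.concatenate([bag.min(axis=0), bag.max(axis=0)])

    bags = [np.random.randn(5, 3), np.random.randn(8, 3)]    # two toy bags
    X_mono = np.vstack([minimax_features(b) for b in bags])  # shape (2, 6)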
The assumption of the metadata approach is that the classification of the examples is only related to the metadata (in this case the minimax values) of the examples, and that the transformation loses no (or little) information in terms of classification. The convenience of the approach is that it transforms the multi-instance problem into the common mono-instance one. In general, the metadata approach enables us to transform the original feature space into other feature spaces that facilitate single-instance learning. The other feature spaces are not necessarily the result of simple metadata extracted from the examples. They could be, for instance, a model space where we build a model (either a classification model or a clustering model) for the examples and transform each example into one instance according to, say, the count of its instances that can be explained by the model. Methods similar to this are actively being researched [Weidmann, 2003]. The validity of such transformations really depends on the background knowledge. If one believes that interactions or relationships between instances account for the classification of an example, then this approach may outperform methods that do not have such a sophisticated view of the problem. In this thesis, we refer to all the methods that transform the original feature space into another feature space as the metadata-based approach, no matter how complicated the extracted metadata may be.
In summary, the methods developed in the earlier stage of MI learning usually have
an APR-like formulation and hold the MI assumption whereas the methods developed later on often implicitly drop the MI assumption and are based on other types
of models.
[Figure 2.2: A framework for MI learning methods. Instance-based approaches split into those based on the MI assumption and those based on other assumptions; metadata-based approaches split into data-oriented ones (fixed metadata or random metadata) and model-oriented ones.]
used to form a prediction at the bag level. All current instance-based methods for MI learning amount to estimating the parameters of a function that enables them to predict the class label of unseen examples. Some of these methods are based on the MI assumption but some are not, which results in two sub-categories. We will also develop some more methods within the sub-category that is not based on the MI assumption. We will explicitly state our assumptions and present the generative models that the methods are based on.
The metadata-based approach has already been discussed in the last section. The metadata could either be directly extracted from the data, so-called "data-oriented" in the framework shown in Figure 2.2, or from some models built on the data, called "model-oriented".
The two-level classification method [Weidmann, Frank and Pfahringer, 2003; Weidmann, 2003] is the only published approach that is model-oriented. It builds a first-level model, using a single-instance (either supervised or unsupervised) learner, to describe the potential patterns in the whole instance space as the metadata. It then applies a second-level learner to build a model based on the extracted patterns. The second-level learner is a single-instance (supervised) learner. Thus the method effectively transforms the original instance space into a new (model-oriented) instance space to which single-instance learners can be applied.
For example, we can apply a clustering algorithm at the first level to the original instances and build a clustering model of the (original) instance space. Then we can construct a new single-instance dataset, with every attribute corresponding to one extracted cluster, and each instance corresponding to one example in the original data. The new instance's attribute values are the numbers of instances of the corresponding example that fall into each cluster. Finally we apply another single-instance learner, say a decision tree, to the new data to build a second-level model. At testing time, the first-level model is applied to the test example to extract the metadata (and generate a new instance), and the second-level learner classifies the test example according to the model built on the (training) metadata. A sketch of this procedure is given below. Of course there are many combinations of first and second-level learners, and they are not further described in this thesis. Interested readers should refer to [Weidmann, 2003] for more detail.
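The clustering variant just described can be sketched as follows; the concrete learners (k-means and a decision tree) are stand-ins chosen for illustration, not necessarily those used in [Weidmann, 2003]:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    def cluster_counts(bags, clusterer):
        # One new instance per bag: attribute k counts how many of the
        # bag's instances fall into cluster k of the first-level model.
        return np.array([np.bincount(clusterer.predict(b),
                                     minlength=clusterer.n_clusters)
                         for b in bags])

    train_bags = [np.random.randn(6, 4) for _ in range(10)]   # toy data
    train_y = np.random.randint(0, 2, size=10)

    first_level = KMeans(n_clusters=5, n_init=10).fit(np.vstack(train_bags))
    second_level = DecisionTreeClassifier().fit(
        cluster_counts(train_bags, first_level), train_y)

    # Testing: extract the metadata from the test bag, then classify it.
    test_bag = np.random.randn(7, 4)
    print(second_level.predict(cluster_counts([test_bag], first_level)))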
In the data-oriented sub-category, we can further specialize into "fixed metadata" and "random metadata" sub-categories. All of the current metadata-based methods (i.e. RELIC and the SVM based on a minimax kernel) are data-oriented. In addition, the metadata extracted is thought of as fixed and directly used to find the function mapping the metadata to the class label.
In the "random metadata" sub-category, we assume the data within each example is random and follows a certain distribution. Thus we can extract some low-order sufficient statistics to summarize the data and regard the statistics as the metadata. In this case, the metadata (statistics) have a sampling distribution parameterized by some parameters. If we further think of these parameters as governed by some distribution, the metadata is necessarily random, wandering not only around the parameters within a bag but from bag to bag as well. This is the thinking behind our new two-level distribution approach discussed in Chapter 5. It turns out that, when assuming independence between attributes, it constitutes an approximate way to upgrade the naive Bayes method to deal with MI data, which had not been tried in the MI learning domain. We also discovered the relationship between this method and the empirical Bayes methods in Statistics [Maritz and Lwin, 1989].
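In code, the metadata for this approach could be as simple as low-order statistics per attribute; in the following sketch, the sample mean and variance are my illustrative choice of low-order statistics:

    import numpy as np

    def low_order_stats(bag):
        # Summarize a bag by per-attribute sample means and variances. Under
        # the two-level view these statistics are themselves random: they
        # vary within a bag through sampling and across bags through the
        # bag-level parameters.
        return np.concatenate([bag.mean(axis=0), bag.var(axis=0, ddof=1)])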
2.4 Methodology
When we generalize the assumptions and approaches of MI learning, we find much flexibility within the framework described above. Nonetheless, we would still like to restrict our methods to some scope so that we may easily find theoretical justifications for them. We propose that it is desirable for an MI algorithm to have the following three properties:
1. The assumptions and generative models that the method is based on are clearly explained. Because of the richness of the MI setting, there could be many (essentially infinitely many) mechanisms that generate MI data. It is very unlikely that one method can deal with all of them. A method has to be based on some assumptions. Thus it is important to state the assumptions the method is based on and their feasibility.
2. The method should be consistent in training and testing. Note that consistency here means that a method should make a prediction for a new example in the same way as it builds a model on the training data. For example, if a method tries to build a model at the instance level and is based on a certain assumption other than the MI assumption, then it should also predict the class label of a test example using that assumption.
3. When the number of instances within each example reduces to one, the MI method degenerates into one of the popular single-instance learning algorithms. Although this is not such an important property compared to the first two, it is useful when we upgrade a single-instance learner, which is theoretically well founded, to deal with MI data.
Even though not all current MI methods have the above three properties, we aim to achieve them in this project. In order to do so we adopt the following methodology for this thesis.
1. We explicitly drop the MI assumption but state the corresponding new assumptions whenever new methods are created. The underlying generative
model of each new method will be explicitly provided so that one may clearly
understand what kind of problem the methods can solve.
2. We adopt a statistical decision theoretic point of view, that is, we always assume a joint probability distribution $\Pr(X, Y)$ over the feature variable $X$ (or other new variables introduced for MI learning) and the class variable $Y$,
An attribute will also be called a "feature" or a "dimension". There are many names for the algorithms in normal single-instance (or mono-instance) supervised learning, like "propositional learner" or "Attribute-Value (AV) learner". We will sometimes use them without distinction.
There is also some notation related to the joint probability distribution $\Pr(X, Y)$. In classification problems we are more concerned with the conditional (or marginal) probability of $Y$.
Chapter 3
A Heuristic Solution for Multiple Instance Problems
This chapter introduces new assumptions for MI learning, and we discard the standard MI assumption. We regard the class label of a bag as a property that is related to all the instances within that bag. We call this new assumption the "collective assumption" because the class label of a bag is now a collective property of all the corresponding instances. Why can the collective assumption be reasonable for practical problems like drug activity prediction? Because the feature variables (in this case measuring the conformations, or shapes, of a molecule) usually cannot absolutely explain the response variable (a molecule's activity), it is appropriate to model a probabilistic mechanism deciding the class label of a data point in the instance space. The collective assumption means that the probabilistic mechanisms of the instances within a bag are intrinsically related, although the relationship is unknown to us. Consider the drug activity prediction problem: a molecule's conformations are not arbitrary in the instance space but confined to certain areas. Thus if we assume that the mechanism determining the class label of a molecule is similar to the mechanism determining the (latent) class labels of all (or most) of the molecule's conformations, we may better explain the molecule's activity. Even if the activity of a molecule were truly determined by only one specific shape (or a very limited number of shapes) in the instance space, that shape would have a very small probability of being sampled. Together with some measurement errors, the samples of a molecule are very likely to wander around the true shape(s). Therefore it is more robust to model the collective class properties of the instances within a bag rather than those of some specific instance. We believe this is true for many practical MI datasets, including the musk drug activity datasets.
The collective assumption is a broad concept and there is great flexibility under this assumption. In Section 3.1 we present several possible options for constructing exact generative models based on this assumption. We further illustrate this with an artificial dataset generated by one exact generative model in Section 3.2. This generative model is strongly related to those used in Chapters 4 and 5. In fact, all the methods developed in this thesis are based on some form of the collective assumption. Section 3.3 presents a heuristic wrapper method for MI learning that is based on the collective assumption [Frank and Xu, 2003]. It will be shown that in some cases it can perform classification quite well even though it introduces some bias into the probability estimation. We interpret and analyze some properties of the heuristic in Section 3.4. Note that some of the material in this chapter has been published in [Frank and Xu, 2003].
3.1 Assumptions
Under the collective assumption, we have several options for building an exact generative model. We have to decide which options to take in order to generate a specific model. We found that answering the following questions is helpful in making the decision:
1. What is the class label property $C(Y|\cdot)$ to model? In single-instance learning we usually take $C(Y|X)$ to be related to the posterior probability function $\Pr(Y|X)$: we either use $\Pr(Y|X)$ itself or its logit transform, the log-odds function $\log\frac{\Pr(Y=1|X)}{\Pr(Y=0|X)}$. In MI learning, we can also model it at the bag level, i.e. we can build a function $C(Y|B)$ where $B$ are the bags. Thus we can also have $\Pr(Y|B) = \Pr(Y|X_1, X_2, \ldots, X_n)$ and $\log\frac{\Pr(Y=1|B)}{\Pr(Y=0|B)} = \log\frac{\Pr(Y=1|X_1, X_2, \ldots, X_n)}{\Pr(Y=0|X_1, X_2, \ldots, X_n)}$, where a bag $B = b$ has $n$ instances $x_1, x_2, \ldots, x_n$. Note that in this thesis we restrict ourselves to the same form of the property for instances and bags. For example, if we model $\Pr(Y|X)$ for the instances, we also model $\Pr(Y|B)$ for bags instead of the log-odds. We will show the reason for doing so in the answer to the next question.
2. How should we view the members of a bag, and what is the relationship between their class label properties and that of the bag?
Almost all current MI methods regard the members (instances) of a bag as a finite number of fixed and unrelated elements. However, we have a different point of view. We think of each bag as a population, which is continuous and generates instances in the instance space in a dense manner. What we have in the data for each bag are some samples randomly drawn from its population. This point of view actually relates all the instances to each other given a bag, that is, they are all governed by the specific distribution of the population of that bag. Every bag is unbounded, i.e., it ranges over the whole instance space. However, its distribution may be bounded, i.e. its instances may only be located in a small region of the instance space.¹ If one really thinks of a bag's elements as randomly selected from the whole instance space, we can still fit this into our view by modeling each bag as having a uniform distribution. Note that the distributions of different bags are different from each other and bags may overlap. Thus each point $x$ in the instance space (that is, an instance) will have a different density given different bags. Therefore, unlike normal single-instance learning, which assumes $\Pr(X)$, we have $\Pr(X|B)$ instead. We still regard each point in the instance space as having its own class label (that could be determined by either a deterministic or a probabilistic process) but this is unobservable in the data. What is observable is the class label of a bag, which is determined using the instances' (latent) class label properties.
¹ In fact if the bags are bounded, we can always think of their distributions as bounded ones.
Now how do we decide the class label of a bag given those of its instances? There are two options here: the population version and the sample version. Since we take the above perspective for a bag and we have already defined the class label property $C(Y|\cdot)$, one application of the collective assumption is to take the expected property of the population of each bag as the class property of that bag. Given a bag $b$, we calculate

$$C(Y|b) = E_{X|b}[C(Y|X)] = \int_X C(Y|x)\Pr(x|b)\,dx \qquad (3.1)$$
We simply regard the bag's class label property as the conditional expectation of the class property of all the instances given $b$. This is the population version. In the sample version, since the instances of a bag are obtained by random sampling, we can give each instance an equal weight and calculate the weighted average no matter what the distribution $\Pr(X|b)$ is. The population and the sample version are approximately the same when the in-bag sample size is large, but there may be a large difference if the sample size is small, for example, if only one instance is sampled per bag. In this case the sample version reduces to the single-instance case because an instance's (in this case also a bag's) class label is determined by its own class label property. However, the population version will still determine the instance's class label by the overall class property of its bag. Consequently it may be less desirable because it does not degenerate naturally to the single-instance case.
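With $n$ randomly sampled instances $x_1, \ldots, x_n$ in bag $b$, the sample version thus replaces the integral in Equation 3.1 by an equally weighted average:

$$C(Y|b) \approx \frac{1}{n} \sum_{j=1}^{n} C(Y|x_j)$$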
It is now clear why we choose consistent formulations of $C(Y|\cdot)$ for both bags and instances. Although there are other applications of the collective assumption, in this thesis we take this perspective.
the instances of each bag from this distribution. There are also two possibilities to model $\Pr(X|B)$: one is to model it indirectly using $\Pr(X)$, and the other is to model it directly.
The first option still assumes the existence of $\Pr(X)$; thus we are now in the same framework as normal single-instance learning where we assume both $\Pr(X)$ and $\Pr(Y|X)$. In this case we can write the distribution of a bag $B = b$ as

$$\Pr(x|b) = \begin{cases} \frac{\Pr(x)}{\int_{x \in b} \Pr(x)\,dx} & \text{if } x \in b; \\ 0 & \text{otherwise} \end{cases} \qquad (3.2)$$

Note that we abuse the notation $\Pr(\cdot)$ here because we really have a density function of $X$ instead of a probability if $X$ is numeric. To put it another way, the distribution of each bag is simply the normalized instance distribution $\Pr(X)$, restricted to the corresponding range. In both the population and the sample version of the generative model, we need Equation 3.2 to generate instances for one bag. In the population version we also need it to create the class label of a bag $b$, whereas in the sample version we do not need it because the class label of a bag is not related to a specific form of the density function of its instances.
The second option for modeling $\Pr(X|B)$ does not assume the existence of $\Pr(X)$. Indeed, if all we need is $\Pr(X|B)$, why should we still rely on the single-instance statistical learning paradigm? Given a bag $b$, we can directly model $\Pr(x|b)$ as some distribution, say a Gaussian, parameterized by some parameters. Now a bag is basically described by its parameters, because once the parameters of a bag are decided, that bag has been formed. Thus we need some mechanism to generate the parameters of the bags: we can conveniently regard the parameters themselves as distributed according to some hyper-distribution. The data generation process is exactly the same as in the first option, for both the population version and the sample version. Note that if we model $\Pr(X|B)$ directly, $\Pr(X)$ may or may not exist, depending on the specific distribution involved.
The above two options may coincide sometimes, as will be shown in Section 3.2, but in general they generate different data. Note that although the conditional density function $\Pr(X|B)$ must be specified in order to generate instances for each bag, it is not important for the instance-based learning algorithms that will be discussed (particularly in Chapter 4), because they are based on the sample version of the generative model, in which $\Pr(X|B)$ is not needed.
Now that we have answered the above questions, we are able to specify how to generate a bag of instances and how to generate the class label of that bag. Thus it is time to generate an MI dataset based on an exact generative model under the collective assumption. In the following section, we illustrate the above specifications via an artificial dataset, specifying the answers to the above questions as follows.
First, the class label property $C(Y|\cdot)$ is the posterior probability at both the instance and the bag level, i.e. $\Pr(Y=1|X) = \frac{1}{1+\exp(-X)}$, and, based on the sample version of the generative model, $\Pr(Y|B) = \frac{1}{n}\sum_{i=1}^{n}\Pr(Y|X_i)$, where $n$ is the number of instances in a bag. Second, we think of each bag as a hyper-rectangle in the instance space and assume the center (i.e., the middle-point of the hyper-rectangle) of each bag is uniformly distributed. Thus the bags are bounded and the instances for each bag are drawn from within the corresponding rectangle. Finally we take the first option for the last question and assume the existence of $\Pr(X)$.
The centers of the rectangles were uniformly distributed in $[-5, 5]$ for each of the two dimensions. The size of a rectangle in each dimension was chosen from 2 to 6 with equal probability (i.e. following a uniform distribution). Each rectangle was used to create a bag of instances. To this end we sampled $n$ instances from within a rectangle according to a uniform distribution. The value of $n$ was chosen at random for each bag. The instance-level class probability function was

$$\Pr(y = 1|x_1, x_2) = \frac{1}{1 + e^{-3x_1 - 3x_2}}$$
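A data generator following this description can be sketched as below; note that the range for the in-bag sample size $n$ and the discreteness of the rectangle sizes are assumptions on my part, since those details are garbled in the source:

    import numpy as np
    rng = np.random.default_rng(1)

    def make_bag():
        center = rng.uniform(-5, 5, size=2)    # centroid of the rectangle
        size = rng.integers(2, 7, size=2)      # side length in {2, ..., 6} (assumed discrete)
        n = int(rng.integers(1, 21))           # in-bag sample size (assumed range)
        x = rng.uniform(center - size / 2.0, center + size / 2.0, size=(n, 2))
        p_inst = 1.0 / (1.0 + np.exp(-3 * x[:, 0] - 3 * x[:, 1]))
        y = int(rng.random() < p_inst.mean())  # coin flip on the average probability
        return x, y

    bags = [make_bag() for _ in range(20)]     # a dataset like Figure 3.1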
Figure 3.1 shows a dataset with 20 bags that was generated according to this model. The black line in the middle is the instance-level decision boundary (i.e. where $\Pr(y=1|x_1,x_2) = 0.5$) and the sub-space on the right side has instances with a higher probability of being positive. A rectangle indicates the region used to sample points for the corresponding bag (and a dot indicates its centroid). The top-left corner of each rectangle shows the bag index, followed by the number of instances in the bag. Bags in gray belong to class "negative" and bags in black to class "positive". In this plot we mask the class labels of the instances with the class of the corresponding bag because only the bags' class labels are observable. Note that bags can be on the "wrong" side of the instance-level decision boundary because each bag was labeled by flipping a coin based on the average class probability of the instances in it.
Note that the bag-level decision boundary is not explicitly defined but the instance-level one is. However, if the number of instances in each bag goes to infinity (i.e. basically working with the population version), then the bag-level decision boundary is defined. Since the instance-level decision boundary is symmetric w.r.t. every rectangle, and so is $\Pr(X|b)$, the bag-level boundary depends only on the position of the centroid of each bag and is the same as the instance-level one. Thus the best
$$E_{X|b}[\Pr(Y|X)] = \int_X \Pr(Y|x)\Pr(x|b)\,dx = \int_X \Pr(Y, x|b)\,dx = \Pr(Y|b)$$

Thus we marginalize $X$ in the joint distribution $\Pr(X, Y)$ conditional on the existence of $b$ and get the (conditional) prior probability: the class probability of $b$, $\Pr(Y|b)$.
Table 3.1: The best accuracies (and standard deviations) achieved by the wrapper method on the Musk datasets (10 runs of stratified 10-fold cross-validation).

                                        Musk 1        Musk 2
    Bagging with Discretized PART       90.22±2.11    87.16±1.42
    RBF Support Vector Machine          89.13±1.15    87.16±2.14
    Bagging with Discretized C4.5       90.98±2.51    85.00±2.74
    AdaBoost.M1 with Discretized C4.5   89.24±1.66    85.49±2.73
    AdaBoost.M1 with Discretized PART   89.78±2.30    83.70±1.81
    Discretized PART                    84.78±2.51    87.06±2.16
    Discretized C4.5                    85.43±2.95    85.69±1.86
In the remainder of this chapter, we analyze a heuristic method based on the above
generative model. Although it does not find an exact solution, it performs very well
on the classification task based on this generative model. The method is very simple but the empirical performance on the Musk benchmark datasets is surprisingly
good, which may be due to the similarity between these datasets and our generative
model.
3.4 Interpretation
Recall that the wrapper method has two key features: (1) the way it assigns instance weights and class labels at training time, and (2) the probability averaging over a bag at prediction time. In the following we provide some explanation of why this makes sense.
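A minimal sketch of the wrapper follows, with logistic regression standing in for the base learners of Table 3.1 (any weight-aware propositional learner would do):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def wrapper_fit(bags, bag_labels):
        # Training: every instance receives its bag's class label and the
        # weight 1/n_i, then a propositional learner is run on the result.
        X = np.vstack(bags)
        y = np.concatenate([np.full(len(b), lab)
                            for b, lab in zip(bags, bag_labels)])
        w = np.concatenate([np.full(len(b), 1.0 / len(b)) for b in bags])
        return LogisticRegression().fit(X, y, sample_weight=w)

    def wrapper_predict_proba(model, bag):
        # Prediction: arithmetic average of the instances' class probabilities.
        return model.predict_proba(bag)[:, 1].mean()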
The wrapper method is an instance-based method and, like other methods in this category, it also tries to recover the instance-level probability. It is well known that many popular propositional learners aim to minimize an expected loss function over $X$ and $Y$, that is, $E_X E_{Y|X}(\mathrm{Loss}(\theta; Y))$, where $\theta$ is the parameter to be estimated and is usually involved in the probability function $\Pr(Y|X)$. As a matter of fact, all the single-instance learning schemes in Table 3.1 are within this category.³ Now,
² Some of the properties of the Musk datasets are summarized in Table 4.2 in Chapter 4.
³ PART and C4.5 use an entropy-based criterion to select the optimal split point in a certain region. This is to minimize the cross-entropy or deviance loss in a piece-wise fashion.
[Figure 3.2: Parameter estimates of the wrapper method on the artificial data (estimated coefficients 1 and 2 and the estimated intercept, plotted against the number of exemplars).]

Figure 3.3: Test errors on the artificial data of the wrapper method trained on masked and unmasked data.
$\Pr(x|b)$ is $1/n$ for each instance in the bag and zero otherwise. Therefore the conditional expectation of the loss function given the presence of a bag $b$ is simply $E_{X|b} E_{Y|X,b}(\mathrm{Loss}(\theta; Y)) = E_{X|b} E_{Y|X}(\mathrm{Loss}(\theta; Y))$, assuming conditional independence of $Y$ and $b$ given $X$. Plugging in the given $\Pr(x|b)$, the conditional expected loss is simply $\sum_j \frac{1}{n} E_{y_j|x_j}(\mathrm{Loss}(\theta; y_j))$, where $x_j$ and $y_j$ are the attribute vector and class label of the $j$th instance in $b$ respectively. We want to minimize this expected loss over all the bags, thus the final expected loss to be minimized is

$$\sum_i \frac{1}{N} \sum_j \frac{1}{n_i} E_{y_{ij}|x_{ij}}(\mathrm{Loss}(\theta; y_{ij})) \qquad (3.3)$$
where $N$ is the number of bags and $n_i$ the number of instances in bag $i$. Thus the weight $1/n$ of each instance $x$ serves as $\Pr(x|b)$, and the bag's weight $1/N$ is a constant outside the sum over all the bags and does not affect the minimization of the expected loss. Nonetheless, Equation 3.3 can never be realized because $y_{ij}$, i.e. the class label of each instance, is not observable. If it were observable, this formulation could be used to find the true $\theta$. However, under the collective assumption, if we assign the bag's class label $y_i$ to each of its instances, this may not be a bad approximation, as $y_i$ is related to all the $y_{ij}$. This is exactly what the wrapper method does at training time. The approximation makes the wrapper method a heuristic because it necessarily introduces bias into the probability estimates.
⁴ The reason the intercept estimate seems unbiased is that the true value is 0, and with a multiplicative bias the estimate is still 0.
We can also see from the artificial data that the less area each bag occupies in the instance space, the less bias the wrapper method has. When the ranges of bags become small, the bias is literally negligible. Hence if there are some restrictions on the range of a bag in the instance space, the wrapper method can work well. Indeed, we observed that on the Musk datasets the in-bag variances are very small for most of the bags, which may explain the feasibility of the wrapper method. Intuitively, this heuristic will work well if the true class probabilities $\Pr(Y|X)$ are similar for all the instances in a bag (under the above generative model), because we use the bags' class labels to approximate those of the corresponding instances. Therefore in general, as long as this condition is approximately correct, no matter by what means, the wrapper method can work.
Finally, what the wrapper method does at prediction time is reasonable assuming the above generative model. Nevertheless, it seems that the wrapper method does not use the assumption of a bag's class probability being the average of the instances' class probabilities at training time. In fact, by assigning the bags' class labels to their instances, it only assumes the general collective assumption but not any specific assumption regarding how the bags' class labels are created. Therefore we could use other methods at prediction time, such as taking the normalized geometric average of the instances' class probabilities within a bag as the bag's probability, as long as the method is consistent with the general collective assumption. However, according to our experience, taking the arithmetic average of the instances' probabilities for a bag is more robust on practical datasets like the Musk datasets. Thus we recommend this averaging for the wrapper method in general.
As explained, the wrapper method is only a heuristic method that can work well in practice under some conditions. At least for the specific generative model proposed in Section 3.2 of this chapter, it produces accurate classifications, although it is not good for probability estimation. In Chapter 4, we will present exact algorithms that can give accurate probability estimates for the same generative model. They are also based on normal single-instance learning schemes and aim to upgrade them to deal with MI data. The disadvantage of such an approach is that it heavily relies
3.5 Conclusions
In this chapter we first introduced a new assumption for MI learning, different from the MI assumption: the collective assumption. This assumption regards the class label of a bag as related to all the instances within the bag. We then showed some concrete applications of the collective assumption to generate MI data exactly. One of the applications is to take the averaged probability of all the instances in the same bag as the class probability of that bag.
Under the above exact generative model, we further assumed that the instances within the same bag have similar class probabilities. Consequently the probability of a bag is also similar to those of its instances. These assumptions allowed us to develop a heuristic wrapper method for MI problems. This method wraps around normal single-instance learning algorithms by (1) assigning the class label of a bag, together with instance weights, to the instances at training time, and (2) averaging the instances' class probabilities of a bag at testing time.
Assigning bags' class labels to their corresponding instances and averaging the instances' class probabilities are the application of the above two assumptions (the collective assumption and the similar-probability assumption). The instance weights and the wrapping scheme were motivated by the instance-level loss function (encoded in the single-instance learners) over all the bags. We also showed an artificial example where this wrapper method performs well for classification although its probability estimates are biased. Empirically, we found that this method works very well on the Musk benchmark datasets, in spite of its simplicity.
Chapter 4
Upgrading Single-instance Learners
Among the many solutions for tackling multiple instance (MI) learning problems, one approach has become increasingly popular: upgrading paradigms from normal single-instance learning to deal with MI data. The efforts described in this chapter also fall into this category. However, unlike most of the current algorithms within this category, we adopt an assumption-based approach that is grounded in statistical decision theory. Starting by analyzing the assumptions and the underlying generative models of MI problems, we provide a fairly general and justified framework for upgrading single-instance learners to deal with MI data. The key feature of this framework is the minimization of an expected bag-level loss function based on some assumptions. As an example we upgrade two popular single-instance learners, linear logistic regression and AdaBoost, and test their empirical performance. The assumptions and underlying generative models of these methods are explicitly stated.
4.1 Introduction
The motivating application for MI learning was the drug activity problem considered
by Dietterich et al. [1997]. The generative model for this problem was basically
regarded as a two-step process. Dietterich et al. assumed there is an Axis-Parallel
Rectangle (APR) in the instance space that accounts for the class label of each
instance (or each point in the instance space): each instance within the APR is
positive and all others negative. In the second step, a bag of instances is formed by
sampling (not necessarily randomly) from the instance space. The bag's class label
is determined by the MI assumption, i.e., a bag is positive if at least one of its
instances is positive (within the assumed APR) and negative otherwise. From this
perspective, Dietterich et al. proposed APR algorithms that attempt to find the best
APR under the MI assumption. They showed empirical results of the APR methods
on the Musk datasets, which represent a musk activity prediction problem.
In this chapter, we follow the same perspective as that in [Dietterich et al., 1997]
but with different assumptions. We adopt an approach based on statistical decision
theory and select assumptions that are well suited for upgrading each single-instance
learner.
The rest of the chapter is organized as follows. In Section 4.2 we explain the
underlying generative model we assume and show artificial MI data generated using
this model. In Section 4.3 we describe a general framework for upgrading normal
single-instance learners according to the generative model. We also provide two
examples of how to upgrade linear logistic regression and AdaBoost [Freund and
Schapire, 1996] within this framework. These methods have not been studied in the
MI domain before. Section 4.4 shows some properties of our methods on both
artificial and practical data. Some regularization techniques, as used in normal
single-instance learning, will also be introduced in an MI context. In Section 4.5 we
show that the methods presented in this chapter perform comparatively well on the
benchmark datasets, i.e. the Musk datasets. Section 4.6 summarizes related work
and Section 4.7 concludes this chapter.
$\Pr(Y|X)$. In the second step, based on the values of $\Pr(Y|X)$ of all the instances
within a bag, it assumes either a multi-stage (as in the noisy-or model) or a one-stage
(as in the most-likely-cause model) Bernoulli process to determine the class label of
a bag. Therefore DD amounts to finding one (or more) Axis-Parallel hyper-Ellipses
(APEs)¹ under the MI assumption.
It is natural to extend the above process of generating MI data to a more general
framework. Specifically, as in single-instance learning, we assume a joint distribution
over instances and class labels, from which the bag's class label is determined
based on some assumptions. In the APR algorithms [Dietterich et al., 1997],
the instance-level decision boundary pattern is modeled as an APR. We model the
log-odds function $\log\frac{\Pr(Y=1|X)}{\Pr(Y=0|X)}$, because many normal single-instance
learners aim to estimate it, and we are aiming to upgrade these algorithms.
can think of a bag
instances, we have
n
1X
P r(y jxi)
P r(Y jb) =
n i
(4.1)
or
log
n
P r(Y = 1jb) 1 X
P r(Y = 1jxi )
log
=
P r(Y = 0jb) n i=1 P r(Y = 0jxi )
8
<
):
where xi
Qn
1=n
[ i=1 P r(y=1
Qjxi)
P r(Y = 1jb) = [Qni=1 P r(y=1
j
xi )1=n +[ ni=1 P r(y=0jxi )1=n
Q
[ ni=1 P r(y=0
jxi )1=n
Q
P r(Y = 0jb) = [Qni=1 P r(y=1
1
=n
jxi ) +[ ni=1 P r(y=0jxi )1=n
(4.2)
$$\Pr(x|b) = \begin{cases}\dfrac{\Pr(x)}{\int_{x\in b}\Pr(x)\,dx} & \text{if } x\in b\\[1mm] 0 & \text{otherwise}\end{cases} \quad (4.3)$$
² Note that we abuse the term $\Pr(\cdot)$ here because for a numeric feature, what we have is a density
function of $X$ instead of the probability.
$$\Pr(x|b) = \begin{cases}\dfrac{1}{n} & \text{if } x\in b\\[1mm] 0 & \text{otherwise}\end{cases} \quad (4.4)$$
where $n$ is the number of instances inside $b$. This is true no matter what the distribution
$\Pr(X)$ is, as long as the instances are randomly sampled. Then, for any formulation
of the class label property of a bag, $C(Y|b)$, according to our collective assumption
we associate it with the instances by calculating the conditional expectation over
$X$, which results in Equation 3.1 discussed in the second question in Section 3.1 of
Chapter 3.
Given the finite sample within a bag, we replace $\Pr(x|b)$ in Equation 3.1 with that
in Equation 4.4 and use the sum instead of the integral. If $C(Y|\cdot)$ is $\Pr(Y|\cdot)$, then we
get Equation 4.1, which is the arithmetic average of the corresponding instance
probabilities. If $C(Y|\cdot)$ is $\log\frac{\Pr(Y=1|\cdot)}{\Pr(Y=0|\cdot)}$, then we obtain Equation 4.2, which is the
normalized geometric average of the instance probabilities. Note that in this model
introducing the bags imposes a new condition, so that the class labels of the
instances are masked by the collective class label.
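To make the two combination rules concrete, here is a small Python sketch (ours, not part of the thesis); the geometric variant is computed via the average instance-level log-odds for numerical stability:

import numpy as np

def bag_prob_arithmetic(p):
    # Equation 4.1: plain average of Pr(y=1 | x_i) over the bag
    return np.mean(p)

def bag_prob_geometric(p, eps=1e-12):
    # Equation 4.2: normalized geometric average, computed as the
    # average instance-level log-odds pushed through the sigmoid
    p = np.clip(p, eps, 1 - eps)
    log_odds = np.mean(np.log(p) - np.log(1 - p))
    return 1.0 / (1.0 + np.exp(-log_odds))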
To generate artificial data under this framework, we changed the distribution of $X$,
$\Pr(X)$, and used a different linear logistic model. Now the density function, along
each dimension, is a triangle distribution instead of a uniform distribution, and the
instance-level class probability is

$$\Pr(y=1|x_1,x_2) = \frac{1}{1+e^{-x_1-2x_2}}.$$

Thus the instance-level decision boundary pattern is still a hyperplane, but different
from the one modeled in the artificial data in Section 3.2. We changed these in order
to demonstrate that our framework can deal with any form of $\Pr(X)$ and $\Pr(Y|X)$,
as long as the correct family of models for $\Pr(Y|X)$
and the correct underlying assumption (in this case the collective assumption) are
chosen. Finally we took Equation 4.2 to calculate $\Pr(y|b)$. Again we labeled each
bag according to its class probability. The class labels of the instances are not
observable.
Now we have constructed a dataset based on the hyperplane (linear boundary)
plus collective assumption combination instead of the APR-like (quadratic
boundary) plus MI assumption combination used in the APR algorithms [Dietterich
et al., 1997] and DD [Maron, 1998].
Figure 4.1 shows a dataset with 20 bags that was generated according to this generative
model. As in Chapter 3, the black line in the middle is the instance-level
decision boundary (i.e. where the class probability equals 0.5), and the
right side has instances with higher probability of being positive. A rectangle indicates
the region used to sample points for the corresponding bag (and a dot indicates its
centroid). The top-left corner of each rectangle shows the bag index, followed by
the bag's class label.
In this section we first show how to solve the above problem in the artificial domain.
Since the generative model is a linear logistic model, we can upgrade linear
logistic regression to solve this problem exactly. Then we generalize the underlying
ideas to a general framework for upgrading single-instance learners based on some
assumptions, which also covers the APR algorithms [Dietterich et al., 1997] and
the DD algorithm [Maron, 1998]. Finally, within this framework, we also show how
to upgrade the AdaBoost algorithm [Freund and Schapire, 1996] to deal with MI data.
First we upgrade linear logistic regression together with the collective assumption
so that it can deal with MI data. Note that normal linear logistic regression
can no longer be applied here because the class labels of the instances are masked by
the collective class label of a bag. Suppose we knew exactly what the collective
assumption is, say, Equation 4.2; then we could first construct the probability
$\Pr(Y|b)$ using Equation 4.2, and estimate the parameters (the coefficients of the
attributes in this case) using the standard maximum binomial likelihood method. In
this way we fully recover the instance-level probability function in spite of the
fact that the class labels are masked. When a test bag is seen, we can calculate
the class probability $\Pr(Y|b_{test})$ according to the recovered probability estimate
and the same assumption we used at training time. The classification is based on
$\Pr(Y|b_{test})$. Mathematically, in the logistic model, $\Pr(Y=1|x) = \frac{1}{1+\exp(-\beta x)}$ and
$\Pr(Y=0|x) = \frac{1}{1+\exp(\beta x)}$, where $\beta$ is the parameter vector to be estimated. According to
Equation 4.2,
$$\begin{cases}
\Pr(Y=1|b) = \dfrac{[\prod_{i}^{n}\Pr(y=1|x_i)]^{1/n}}{[\prod_{i}^{n}\Pr(y=1|x_i)]^{1/n}+[\prod_{i}^{n}\Pr(y=0|x_i)]^{1/n}} = \dfrac{\exp(\frac{1}{n}\sum_i \beta x_i)}{1+\exp(\frac{1}{n}\sum_i \beta x_i)}\\[3mm]
\Pr(Y=0|b) = \dfrac{[\prod_{i}^{n}\Pr(y=0|x_i)]^{1/n}}{[\prod_{i}^{n}\Pr(y=1|x_i)]^{1/n}+[\prod_{i}^{n}\Pr(y=0|x_i)]^{1/n}} = \dfrac{1}{1+\exp(\frac{1}{n}\sum_i \beta x_i)}
\end{cases}$$
Then we model the class label determination process of each bag as a one-stage
Bernoulli process. Thus the binomial log-likelihood function is:

$$LL = \sum_{i=1}^{N}\Big[y_i\log\Pr(Y=1|b_i) + (1-y_i)\log\Pr(Y=0|b_i)\Big] \quad (4.5)$$
where $N$ is the number of bags. By maximizing the likelihood function in Equation 4.5
we can estimate the parameters $\beta$.³ Maximum likelihood estimates (MLEs)
are known to be asymptotically unbiased, as illustrated in Section 4.4. This formulation
is based on the assumption of Equation 4.2. In practice, it is impossible
to know the underlying assumptions, so other assumptions may also apply, for instance,
the assumption of Equation 4.1. In that case, the log-likelihood function
of Equation 4.5 remains unchanged, but the formulation of $\Pr(Y|b)$ changes accordingly.
Note that this approach decomposes $\Pr(X,Y)$ into $\Pr(Y|X)\Pr(X)$ and estimates
$\Pr(Y|X)$ directly; this category covers a wide range of single-instance learners.

³ The choice of numeric optimization methods used in this thesis is discussed in Chapter 7.
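Since, under Equation 4.2, the bag-level probability is a logistic function of the bag's mean instance vector, the MLE of Equation 4.5 can be sketched with an off-the-shelf optimizer (a Python illustration of ours under these assumptions; the actual implementation uses the optimizer described in Chapter 7):

import numpy as np
from scipy.optimize import minimize

def fit_mi_logistic_geom(bags, labels):
    # bags: list of (n_i, d) arrays; labels: 0/1 bag class labels.
    # Under Equation 4.2 with a linear logistic model, Pr(Y=1|b) is a
    # logistic function of the bag's mean instance vector.
    Xbar = np.array([b.mean(axis=0) for b in bags])
    Xbar = np.hstack([Xbar, np.ones((len(bags), 1))])  # intercept column
    y = np.asarray(labels, dtype=float)

    def neg_ll(beta):
        z = Xbar @ beta
        # negative binomial log-likelihood of Equation 4.5, written stably
        return np.sum(np.logaddexp(0.0, z) - y * z)

    return minimize(neg_ll, np.zeros(Xbar.shape[1]), method="BFGS").x

def predict_bag_prob(beta, bag):
    z = np.append(bag.mean(axis=0), 1.0) @ beta
    return 1.0 / (1.0 + np.exp(-z))   # Pr(Y=1 | b_test)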
1. Model the instance-level class probability $\Pr(Y|X)$, or a function of it, with a
single-instance learner $g(\cdot)$.
2. Construct a loss function at the bag level and take the expectation over all
the bags instead of over instances, i.e. the expected loss function is now defined
over bags under the chosen assumption (for DD, the MI assumption).

The APR algorithms [Dietterich et al., 1997], on the other hand,
directly minimize the misclassification error loss at the bag level, and the bounds of
the APR are the parameters to be estimated for $\Pr(Y|X)$. Since the APR algorithms
regard the instance-level classification as a deterministic process, $\Pr(Y=1|x)$ can
be regarded as 1 for any instance $x$ within the APR, and $\Pr(Y=1|x)=0$ otherwise.
1. Set $W_i = 1/N$, $i = 1, 2, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Set $W_{ij} \leftarrow W_i/n_i$, assign the bag's class label to each
       of its instances, and build an instance-level model $h_m(x_{ij}) \in \{-1, 1\}$.
   (b) Within the $i$th bag (with $n_i$ instances), compute the error rate $e_i \in [0,1]$
       by counting the number of misclassified instances within that bag,
       i.e. $e_i = \sum_j 1_{(h_m(x_{ij})\neq y_i)}/n_i$.
   (c) If $e_i < 0.5$ for all $i$'s, STOP iterations, go to step 3.
   (d) Compute $c_m = \arg\min_c \sum_i W_i\exp[(2e_i-1)c]$.
   (e) If $c_m \le 0$, STOP iterations, go to step 3.
   (f) Set $W_i \leftarrow W_i\exp[(2e_i-1)c_m]$ and renormalize so that $\sum_i W_i = 1$.
3. Return $\mathrm{sign}\big[\sum_i\sum_m c_m h_m(x_{test})\big]$.

Table 4.1: The upgraded AdaBoost algorithm for MI data.
Linear logistic regression and the quadratic formulation (as in DD) assume a limited
family of underlying patterns. There are more flexible single-instance learners, like
boosting and the support vector machine (SVM) algorithms, that can model larger
families of patterns. However, the general upgrading framework presented above
means that, under certain assumptions, one can model a wide range of decision
boundary patterns. Here we provide an example of how to upgrade the AdaBoost
algorithm into an MI learner based on the collective assumption. Intuitively, AdaBoost
can easily be wrapped around an MI algorithm (without changes to the AdaBoost
algorithm), but since there are not many weak MI learners available, we
are more interested in taking single-instance learners as the base classifiers of the
upgraded method.
AdaBoost originated in the computational learning theory domain [Freund and
Schapire, 1996], but received a statistical explanation later on [Friedman, Hastie
and Tibshirani, 2000]. It can be shown that it aims to minimize an exponential loss
function in a forward stagewise manner and ultimately estimates $\frac{1}{2}\log\frac{\Pr(Y=1|X)}{\Pr(Y=-1|X)}$
(based on an additive model) [Friedman et al., 2000]. Now, under the collective
assumption, we use Equation 4.2, which involves the $\{-1, 1\}$-valued class labels of
all the instances (rather than the max function,
which makes the optimization much harder). We first describe the upgraded AdaBoost
algorithm in Table 4.1 and then briefly explain the derivation.
The derivation follows exactly the same line of thinking as that in [Friedman et al.,
2000]. In the derivation below we regard the expectation sign $E$ as the sample
average instead of the population expectation. We are now looking for a function
over all the bags, $F(B)$, that minimizes $E_B E_{Y|B}[\exp(-yF(B))]$. For a bag $b$,

$$F(b) = \sum_{x_b\in b} F(x_b)/n. \quad (4.6)$$
In each iteration we search for the weak classifier $h(\cdot)$ that maximizes the weighted
(sample) expectation

$$E_W[y\,h(x_b)/n] = \sum_{i=1}^{N}\sum_{j}^{n_i}\Big[\frac{1}{n_i}W_i\Pr(y=1|b_i)h(x_{ij}) - \frac{1}{n_i}W_i\Pr(y=-1|b_i)h(x_{ij})\Big].$$

The solution is

$$h(x_{ij}) = \begin{cases}1 & \text{if } \frac{W_i}{n_i}\Pr(y=1|b_i) - \frac{W_i}{n_i}\Pr(y=-1|b_i) > 0\\ -1 & \text{otherwise.}\end{cases}$$
This formula simply means that we are looking for the function $h(\cdot)$ at the instance
level that matches the most probable class label of each bag. Since the class label
of each bag reflects its probability, we can assign the class label of a bag to its
instances. Then, with the weights $\frac{W_i}{n_i}$, we use
a single-instance learner to provide the value of $h(\cdot)$. This constitutes Step 2a in
the algorithm in Table 4.1. Note that in Chapter 3, we proposed a wrapper method
to apply normal single-instance learners to MI problems. There, it is a heuristic
to assign the class label of each bag to the instances pertaining to it because we
minimize the loss function within the underlying (instance-level) learner. Since the
underlying learners are at the instance level, they require the class label for each
instance instead of each bag. That is why it is a heuristic. Here we only use the
bag's class label. Therefore assigning the class label of a bag to its instances is
actually our aim here because we are trying to minimize a bag-level loss function.
Hence it is not a heuristic or approximation in this method.
Next, we search for the best constant $c_m$ with which to add the new function, i.e. we
minimize the objective function

$$E_B E_{Y|B}\{\exp[-(F(B)+c_m\,yf(B))]\} = \sum_i W_i\exp\Big[-c_m\frac{y_i\sum_j h(x_{ij})}{n_i}\Big] = \sum_i W_i\exp[(2e_i-1)c_m],$$

since $-y_if(b_i) = 2e_i - 1$. This relation
simply means that if the error rates within all the bags $e_i$ are less than 0.5, all the
bags will be correctly classified because all the $f(b_i)$ will have the same sign as $y_i$.
Thus we have reached zero error and no more boosting can be done, as in normal
AdaBoost. Therefore we check this in Step 2c.
The solution of this optimization problem may not have an easy analytical form.
However, it is a one-dimensional optimization problem. Hence the Newton family
of optimization techniques can find a solution in super-linear time [Gill, Murray
and Wright, 1981], and the computational cost is negligible compared to the time to
build the weak classifier. Therefore we simply search for $c_m$ using a Quasi-Newton
method in Step 2d. Note that in the single-instance case, $c_m$
in Step 2d will be exactly $\frac{1}{2}\log\frac{1-err_w}{err_w}$, where $err_w$ is the weighted error. Hence
the weight update will also be the same as in AdaBoost.
It is easy to see that to classify a test bag, we can simply regard $F(B)$ as the bag-level
log-odds function and take Equation 4.2 to make a prediction.
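For illustration, here is a minimal Python sketch of the algorithm in Table 4.1, with decision stumps as the weak learner and a bounded scalar search in place of the Quasi-Newton step (all names are ours; how the final model is treated when Step 2c fires is a convention we assume):

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeClassifier

def mi_adaboost(bags, labels, M=50):
    # bags: list of (n_i, d) arrays; labels: bag class labels in {-1, +1}.
    N = len(bags)
    sizes = np.array([len(b) for b in bags])
    starts = np.cumsum(sizes) - sizes
    X = np.vstack(bags)
    y_inst = np.repeat(labels, sizes)        # Step 2a: bag label -> instances
    W = np.full(N, 1.0 / N)                  # Step 1: uniform bag weights
    ensemble = []
    for m in range(M):
        w_inst = np.repeat(W / sizes, sizes) # Step 2a: instance weights W_i/n_i
        h = DecisionTreeClassifier(max_depth=1)  # a decision stump
        h.fit(X, y_inst, sample_weight=w_inst)
        miss = (h.predict(X) != y_inst).astype(float)
        e = np.array([miss[s:s + n].mean() for s, n in zip(starts, sizes)])  # Step 2b
        if np.all(e < 0.5):                  # Step 2c: zero bag-level error
            ensemble.append((1.0, h))        # assumed convention: keep h with weight 1
            break
        obj = lambda c: np.sum(W * np.exp((2 * e - 1) * c))
        c_m = minimize_scalar(obj, bounds=(-10, 10), method="bounded").x  # Step 2d
        if c_m <= 0:                         # Step 2e
            break
        ensemble.append((c_m, h))
        W = W * np.exp((2 * e - 1) * c_m)    # Step 2f: reweight and renormalize
        W /= W.sum()
    return ensemble

def classify_bag(ensemble, bag):
    # Step 3: sign of the summed, instance-averaged weak predictions
    return int(np.sign(sum(c * h.predict(bag).mean() for c, h in ensemble)))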
[Plot: estimated coefficient and intercept values against the number of exemplars.]
[Plot: test error rate (%) against the number of training bags.]

Figure 4.3: Test error of MILogisticRegressionGEOM and the MI AdaBoost algorithm on the artificial data.
[Plot: error (%) against the number of boosting iterations.]
The parameters are estimated using the maximum likelihood method. The consistency
of the MLE can be proven under fairly general conditions [Stuart, Ord and Arnold,
1999]. In this specific artificial data, we used a triangle distribution for the density of $X$.
However, we observed that the exact form of this density does not matter. This is
reasonable because our model only assumes random sampling from the distribution
corresponding to a bag; the exact form of $\Pr(X)$ does not matter.
MILogisticRegressionGEOM successfully recovers the instance-level class probability
function (and the underlying assumption also holds for the test data). Consequently
we expect it to achieve the optimal error rate on this data. As explained
before, the MI AdaBoost algorithm is also based on Equation 4.2, thus it is also
expected to perform well in this case. To test this, we first generated an independent
test dataset of 10000 bags, then generated training data (with different random seeds)
for different numbers of training bags. Decision stumps were used as the weak
learner and the maximal number of boosting iterations was set to 30. The test error
of both methods on the test data against different numbers of bags is plotted in
Figure 4.3. When the number of training bags increases, the error rate approaches
the best error rate and eventually both methods come close to perfect
performance. However, unlike MILogisticRegressionGEOM, MI AdaBoost cannot
achieve the exact best error rate, mainly because in this case we are approximating
a line (the decision boundary) using axis-parallel rectangles, based only on a finite
amount of data.
We introduce ridge regularization by adding a penalty term to Equation 4.5, where
$\lambda$ is the ridge parameter. Note that in ridged logistic regression,
since it is not invariant along each dimension, we need to standardize the data
to zero mean and unit standard deviation before fitting the model, and transform the
model back after fitting [Hastie et al., 2001]. The mean and standard deviation used
in the standardization in our MI logistic regression methods are estimated using the
weighted instance data, with each instance weighted by the inverse of the number
of instances in the corresponding bag. This is done because intuitively we would
like to have equal weight for every bag. Note that unlike some current MI methods
that change the data, we do not pre-process the data: the standardization is simply
part of the algorithm.
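In our notation, the ridged objective takes the standard penalized form (a sketch of the formula, with $\lambda$ the ridge parameter):

$$LL_{\lambda} = \sum_{i=1}^{N}\Big[y_i\log\Pr(Y=1|b_i) + (1-y_i)\log\Pr(Y=0|b_i)\Big] - \lambda\|\beta\|^2.$$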
In boosting with decision trees, both the tree size and the number of iterations determine
the degrees of freedom, and there are several ways to regularize them in single-instance
learning [Hastie et al., 2001]. Since C4.5 [Quinlan, 1993], which we use, does not
have an explicit option to specify how many nodes a tree will have, we use an alternative
way to shrink the tree size: we specify a larger minimal number of
(weighted) instances for the leaf nodes (the default setting in C4.5 is two). Together
with the restriction of the number of iterations, we can achieve a very coarse form
of regularization in our MI AdaBoost algorithm. By enlarging the minimal number
of leaf-node instances, and in turn shrinking the tree size, we effectively make the
tree learner weaker. We will only show experimental results for the MI AdaBoost
algorithm based on these regularized trees.
Property                  Musk 1   Musk 2
Number of bags            92       102
Number of attributes      166      166
Number of instances       476      6598
Number of positive bags   47       39
Number of negative bags   45       63
Average bag size          5.17     64.69
Median bag size           4        12
Minimum bag size          2        1
Maximum bag size          40       1044
Note that we do not include the EM-DD algorithm [Zhang and Goldman, 2002] here even
though it was reported to have the best performance on the Musk datasets. We do so for two reasons:
1. there were some errors in the evaluation process of EM-DD; 2. it can be shown via some theoretical
analysis and an artificial counter-example (Appendix D) that EM-DD cannot find a maximum
likelihood estimate (MLE) in general, due to the (tricky) likelihood function it aims to optimize.
Since the DD algorithm is a maximum likelihood method, the solution that EM-DD finds cannot be
trusted if it fails to find the MLE.
Methods                          Musk 1                Musk 2
                                 LOO     10CV          LOO     10CV
iterated-discrim APR with KDE    7.6     -             10.8    -
maxDD                            11.1    -             17.5    -
MULTINST                         23.3    -             16.0    -
MI Neural Networks (a)           -       12.0          -       18.0
SVM with the MI kernel           13.0    13.6±1.1      7.8     12.0±1.0
Citation-kNN                     7.6     -             13.7    -
MILogisticRegressionGEOM         13.04   14.13±2.23    17.65   17.74±1.17
MILogisticRegressionARITH        10.87   13.26±1.83    16.67   15.88±1.29
MI AdaBoost with 50 iterations   10.87   12.07±1.95    15.69   15.98±1.31

(a) It was not clear which evaluation method the MI Neural Network used. We put it into the 10CV
column just for convenience.

Table 4.3: Error rate estimates from either 10 runs of stratified 10-fold cross-validation or
leave-one-out evaluation. The standard deviation of the estimates (if available) is also shown.
the assumption. Simple patterns and assumptions together may be sufficient.
There are many methods that aim to upgrade single-instance learners to deal with
MI data, as described in Chapter 2. Although they are definitely not in the same
category as the methods in the first part, and their assumptions and generative models
are not totally clear, we include some of them in the second part. Because there
are too many to list here, we simply chose the ones with the best performance on
the Musk datasets: the SVM with the MI kernel [Gärtner et al., 2002] and an MI
K-Nearest-Neighbour algorithm, Citation-kNN [Wang and Zucker, 2000]. For the
SVM with the MI kernel, the evaluation is via leave-10(bags)-out, which is similar
to 10-fold CV because the total number of bags in either dataset is close to
100. Thus we put its results into the 10CV columns.
Finally, in the third part, we present results for the methods from this chapter. Since
we introduced regularization, as also used by some other methods like the SVM, there
is the possibility of tuning the regularization parameters. However, we avoided extensive
parameter tuning. In both MI logistic regression methods, we used a small, fixed
regularization parameter, and in MI AdaBoost we restricted ourselves
to 50 iterations to control the degrees of freedom. The base classifier used
was a regularized C4.5 decision tree, as described above.
model the relationship between them, namely the noisy-or model and the most-likely-cause
model. Both were aimed at fitting the MI assumption. The noisy-or
model regards the process of determining a bag's class probability (i.e. $\Pr(Y=1|b)$)
as a multi-stage Bernoulli process, with as many stages as there are
instances in the bag. In each stage, one decides the class label using the probability
$\Pr(y|x)$ of an (unused) instance in the bag. Roughly speaking, a bag will be labeled
positive if one sees at least one positive class label in this process and negative
otherwise. The most-likely-cause model regards the above process as one-stage. It
thus only picks one instance per bag. The selection of the instance is parametric
or model-based, that is, according to the radial formulation of $\Pr(Y|X)$ it uses the
instance with the maximal probability $\max_{x\in b}\{\Pr(y=1|x)\}$. The corresponding
combination based on the noisy-or model achieves 20.00%±1.47%, with the ridge
parameter set to 12 (the high value of the ridge parameter may already indicate the
unsuitability of the linear model).
As a matter of fact, the whole family of APR-like + MI assumption-based methods
is related to this chapter because their rationale is the same as that of DD.
The current methods that upgrade single-instance learners (e.g. the MI decision
tree learner RELIC [Ruffo, 2001], MI nearest neighbour algorithms [Wang and
Zucker, 2000], the MI decision rule learner NaiveRipperMI [Chevaleyre and Zucker,
2001] and the SVM with an MI kernel and a polynomial minimax kernel [Gärtner
et al., 2002]), on the other hand, are not so closely related to the framework presented
here because the line of thinking is quite different. The SVM with the MI
kernel [Gärtner et al., 2002] may also depend on the collective assumption.
3. Since there are a lot of mature techniques that can be directly applied to the
linear logistic model, like regularization or deviance-based feature selection,
we can apply them directly to the new model. For example, ridge methods
may further improve the performance if the instance-level decision boundary
cannot be modeled well by an APE.
4.7 Conclusions
This chapter described a general framework for upgrading single-instance learners
to deal with MI problems. Typically we require that the single-instance learners
model the posterior probability $\Pr(Y|X)$, and we construct a bag-level loss function
with some assumptions. By minimizing the expected
loss function at the bag level we can recover the instance-level probability function
$\Pr(Y|X)$. The prediction is based on the recovered function $\Pr(Y|X)$ and the
assumption used. This framework is quite general and justified thanks to the strong
theoretical basis of most single-instance learners. It also incorporates the background
knowledge involved (via the assumptions) and could serve as general-purpose
guidance for solving MI problems.
Within this framework, we upgraded linear logistic regression and AdaBoost based
on the collective assumption. We have also shown that these methods, together with
mild regularization, perform quite well on the Musk benchmark datasets.
For group-conditional single-instance learners that estimate the density $\Pr(X|Y)$
and then transform it to the posterior probability via Bayes' rule, a different treatment
is required; this is the subject of the next chapter.
Chapter 5
Learning with Two-Level
Distributions
5.1 Introduction
Many special-purpose MI algorithms can be found in the literature. However, we
observed that all these methods aim to directly model a function of $f(y|x_1,\ldots,x_n)$.
In this chapter we instead build group-conditional models and estimate their
parameters by maximum likelihood:

$$\hat\theta_y^{MLE} = \arg\max L = \arg\max\prod_j\Pr(b_j|\theta_y).$$
The bags $b_j$ are not directly related to $\theta_y$, as we discussed before. Instead, the instances $x_{jk}$ are governed
by an instance-level distribution parameterized by a parameter vector $\phi$, which in turn
is governed by a distribution parameterized by $\theta_y$. Since $\phi$ is a random variable, we
integrate it out in $L$. Mathematically,

$$L = \prod_j\Pr(b_j|\theta_y) = \prod_j\int\Pr(b_j,\phi|\theta_y)\,d\phi = \prod_j\int\Pr(b_j|\phi,\theta_y)\Pr(\phi|\theta_y)\,d\phi = \prod_j\int\Pr(b_j|\phi)\Pr(\phi|\theta_y)\,d\phi \quad (5.1)$$
The last step follows because the observed data (i.e. instances) relate to the bag-level parameter only
through $\phi$. Now, assuming the instances within
a bag are independent and identically distributed (i.i.d.) according to a distribution
parameterized by $\phi$, we have $\Pr(b_j|\phi) = \prod_i^{n_j}\Pr(x_{ji}|\phi)$, where $n_j$ is the number of instances in
the $j$th bag.
The complexity of the calculus would have stopped us here had we not assumed
independence between the attributes and convenient parametric forms, in which case the
likelihood factorizes:
$$L = \prod_{j=1}^{e}\int\Big[\prod_{i=1}^{n_j}\Pr(x_{ji}|\phi)\Big]\Pr(\phi|\theta_y)\,d\phi = \prod_{j=1}^{e}\prod_{k=1}^{m}B_{jk} \quad (5.2)$$
where $x_{jki}$ denotes the value of the $k$th dimension of the $i$th instance in the $j$th
exemplar, $\mu_k$ and $\sigma_k^2$ are the parameters for the $k$th dimension, and

$$B_{jk} = \int\Big[\prod_{i=1}^{n_j}\Pr(x_{jki}|\mu_k,\sigma_k^2)\Big]\Pr(\mu_k,\sigma_k^2|\theta_y)\,d\mu_k\,d\sigma_k^2,$$

with the within-bag Gaussian likelihood

$$\prod_{i=1}^{n_j}\Pr(x_{jki}|\mu_k,\sigma_k^2) = (2\pi\sigma_k^2)^{-n_j/2}\exp\Big[-\frac{S_{jk}^2+n_j(\bar x_{jk}-\mu_k)^2}{2\sigma_k^2}\Big] \quad (5.3)$$

where $\bar x_{jk} = \sum_{i=1}^{n_j}x_{jki}/n_j$ and $S_{jk}^2 = \sum_{i=1}^{n_j}(x_{jki}-\bar x_{jk})^2$. For the prior, we take the natural
conjugate form
$$\Pr(\mu_k,\sigma_k^2|\theta_y) = g(a_k,b_k,w_k)\,(\sigma_k^2)^{-\frac{b_k+3}{2}}\exp\Big[-\frac{a_k + \frac{(\mu_k-m_k)^2}{w_k}}{2\sigma_k^2}\Big] \quad (5.4)$$

where the normalizing constant is

$$g(a_k,b_k,w_k) = \frac{a_k^{b_k/2}\,2^{-\frac{b_k+1}{2}}}{\sqrt{\pi w_k}\,\Gamma(b_k/2)}.$$
Taking a closer look at the natural conjugate prior in Equation 5.4, it is straightforward
to see that $\mu_k$ follows a normal distribution with mean $m_k$ and variance $w_k\sigma_k^2$,
and $\frac{a_k}{\sigma_k^2}$ follows a Chi-squared distribution with $b_k$ degrees of freedom (d.f.). It is
then possible to write $B_{jk}$ explicitly:

$$B_{jk} = \int_0^{+\infty}\!\!\int_{-\infty}^{+\infty}\Big\{(2\pi\sigma_k^2)^{-n_j/2}\exp\Big[-\frac{S_{jk}^2+n_j(\bar x_{jk}-\mu_k)^2}{2\sigma_k^2}\Big]\,g(a_k,b_k,w_k)(\sigma_k^2)^{-\frac{b_k+3}{2}}\exp\Big[-\frac{a_k+\frac{(\mu_k-m_k)^2}{w_k}}{2\sigma_k^2}\Big]\Big\}\,d\mu_k\,d\sigma_k^2 \quad (5.5)$$
The integration is easy to do due to the form of the natural conjugate prior, resulting
in (the details of the calculation are given in Appendix C):

$$B_{jk} = \frac{\Gamma\big(\frac{b_k+n_j}{2}\big)\,a_k^{b_k/2}\,(1+n_jw_k)^{(b_k+n_j-1)/2}\,\pi^{-n_j/2}}{\Gamma\big(\frac{b_k}{2}\big)\,\big[(1+n_jw_k)(a_k+S_{jk}^2)+n_j(\bar x_{jk}-m_k)^2\big]^{\frac{b_k+n_j}{2}}} \quad (5.6)$$
Accordingly, the log-likelihood function is

$$LL = \log\Big[\prod_{j=1}^{e}\prod_{k=1}^{m}B_{jk}\Big] = \sum_{k=1}^{m}\sum_{j=1}^{e}\log B_{jk} \quad (5.7)$$
Each $B_{jk}$ involves a bag only through the bag size $n_j$, the sample mean $\bar x_{jk}$ and the sum
of squared errors $S_{jk}^2$; thus it turns out that this method extracts low-order
sufficient statistics, which are a kind of metadata, from each bag to estimate the
group-conditional hyper-parameters.
When a new bag is encountered, we simply extract these statistics from the bag. We
then compute the log-odds:

$$\log\frac{\Pr(Y=1|b_{test})}{\Pr(Y=0|b_{test})} = \log\frac{\Pr(b_{test}|\hat\theta_1^{MLE})\Pr(Y=1)}{\Pr(b_{test}|\hat\theta_0^{MLE})\Pr(Y=0)} \quad (5.8)$$

We estimate the prior $\Pr(Y)$ from the number of bags for each class in the training
data and classify $b_{test}$ according to whether the log-odds value is greater than zero.
In spite of its sound theoretical basis, the above method does not perform well on
the Musk datasets, mainly due to the unrealistic dependency of the variance of $\mu_k$
on $\sigma_k^2$ in the natural conjugate prior, as well as the restrictive assumptions of an
Inverse-Gamma distribution for $\sigma^2$ and a Gaussian for the instances within a bag. However,
regarding this as a metadata-based approach allows us to simplify it in order to drop some
assumptions. Here we make two simplifications.
The first simplification stems from Equation 5.3. It is relatively straightforward
to see that the likelihood within each bag is equivalent to the product of the sampling
distributions of the two statistics $\bar x_{jk}$ and $S_{jk}^2$.¹ If we think $\bar x_{jk}$ is enough for the
classification task, we can simply drop the second-order statistic $S_{jk}^2$. By
dropping $S_{jk}^2$ we can generalize this method to virtually any distribution within a
bag because, according to the central limit theorem, $\bar x_{jk}$ is approximately normally
distributed no matter how the $x_{jki}$'s are distributed. The second simplification is that we no longer model $\sigma_k^2$ as
drawn from an Inverse-Gamma distribution in Equation 5.4. Instead we regard it as
fixed and directly estimate $\hat\sigma_k^2$ from the data. Accordingly we do not need to estimate
$a_k$ and $b_k$ any more. To estimate $\sigma_k^2$, we give a weight of $\frac{1}{n}$ to each instance, where $n$
is the number of instances in the corresponding bag (because intuitively we regard
a bag as an object and thus should assign each bag the same weight). Based on the
weighted data, an unbiased estimate of $\sigma_k^2$ is

$$\hat\sigma_k^2 = \frac{\sum_j\big[\frac{1}{n_j}\sum_i(x_{jki}-\bar x_{jk})^2\big]}{e-\sum_j\frac{1}{n_j}},$$

where $e$ is the number of bags.

¹ For a Gaussian distribution, the sampling distributions are $\bar x_{jk}\sim N(\mu_k, \frac{\sigma_k^2}{n_j})$ and
$\frac{S_{jk}^2}{\sigma_k^2}\sim\chi^2(n_j-1)$. If we multiply these two sampling distributions, we will get the same form as in Equation 5.3,
differing only by a constant.
With these two simplifications, the dependency in the natural conjugate prior disappears.
We only have a Gaussian-Gaussian model within the integral in Equation 5.1,
which makes the calculus extremely simple. The resulting formula is again a Gaussian:

$$B_{jk} = \Big[2\pi\frac{w_kn_j+\sigma_k^2}{n_j}\Big]^{-1/2}\exp\Big[-\frac{n_j(\bar x_{jk}-m_k)^2}{2(w_kn_j+\sigma_k^2)}\Big] \quad (5.9)$$
Intuitively, this models a scenario where the means of the bags of the two classes wander
around two Gaussian centroids respectively, although with different variances, as shown in
Figure 5.1. We simply substitute Equation 5.6 in the log-likelihood function by Equation 5.9. The MLE
of the parameters is found using a numeric optimization procedure. However, since the search space is
two-dimensional (for convenience, we also search for $m_k$, although it
can be obtained directly) and has no local maxima, the search is very fast. This
constitutes our simplified TLD method: we call it TLDSimple, while we call
the former TLD method simply TLD.
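A minimal Python sketch of the TLDSimple computations (ours; the search for $m_k$ and $w_k$ is elided, and $\hat\sigma_k^2$ is the weighted unbiased estimate given above):

import numpy as np

def sigma2_hat(bags):
    # weighted unbiased estimate of sigma_k^2 (one value per dimension),
    # each instance weighted by 1/n_j so every bag carries equal weight
    num = sum(((b - b.mean(axis=0)) ** 2).sum(axis=0) / len(b) for b in bags)
    den = len(bags) - sum(1.0 / len(b) for b in bags)
    return num / den

def log_B(bag, m, w, sigma2):
    # log of Equation 5.9, summed over the dimensions k
    n = len(bag)
    xbar = bag.mean(axis=0)
    var = w + sigma2 / n            # variance of the bag mean
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (xbar - m) ** 2 / (2 * var))

def log_odds(bag, pos, neg, log_prior_ratio=0.0):
    # Equation 5.8; pos and neg are (m, w, sigma2) parameter tuples per class
    return log_B(bag, *pos) - log_B(bag, *neg) + log_prior_ratio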
Where the decision boundary is defined, it is a hyperplane, because we used the same variance
for both classes.² Also note that the decision boundary is only defined in terms of the true
parameters $\mu$, but not even in terms of the statistics (or metadata) $\bar x$. This is due to the extra
term $\frac{\sigma^2}{n_j}$ in the variance of $\bar x$. This term is distinct for each bag even if we have the
same $\sigma^2$ but a different number of instances per bag.
We generated the artificial data mainly to analyze the properties of our methods.

² If the Gaussian of either class had a different variance, then the decision boundary would be
quadratic instead of linear.
[Figure 5.2: estimated parameter values (sigma^2, w and m) for TLDSimple against the number of exemplars.]
In order to verify the (at least asymptotic) unbiasedness of both TLD methods,
we generated a one-dimensional dataset with 20 instances per bag using the exact
generative model assumed by them (the data generation using the exact generative
model assumed by TLD is described in Chapter 7 and Appendix A). The estimated
parameters (of one class) are shown in Figure 5.2 for TLDSimple and Figure 5.3
for TLD. In both figures, the solid lines denote the true parameter values. Note
that in TLDSimple,

$$\hat\sigma_k^2 = \frac{\sum_j\big[\frac{1}{n_j}\sum_i(x_{jki}-\bar x_{jk})^2\big]}{e-\sum_j\frac{1}{n_j}}$$

is not an MLE but is unbiased, since

$$E\Big[\sum_j\frac{1}{n_j}\sum_i(x_{jki}-\bar x_{jk})^2\Big] = \Big(e-\sum_j\frac{1}{n_j}\Big)\sigma_k^2.$$

All other estimates are MLEs. As can be seen, the estimated parameters converge
to the true parameters as the number of bags increases, thus this method is at least
asymptotically unbiased. We observe that the number of instances per bag does not
affect the convergence but only the rate of convergence: the more instances per
bag, the faster the convergence. A varying number of instances slows down the
convergence but results in similar behavior.
[Figure 5.3: estimated parameter values (a, b, w and m) for TLD against the number of exemplars.]
In the single-instance case, where each bag contains only one instance, $\sigma_k^2$ cannot be
estimated and can only be regarded as 0. Hence this method degrades
into naive Bayes in the single-instance case. If we had dropped the independence
assumption between the attributes, we would have obtained the discriminant analysis
method. Even in the multi-instance case, if $\sigma_k^2$ is small (or the bags are large), the
term $\frac{\sigma_k^2}{n_j}$ can be neglected in Equation 5.9. In that case, for all the bags in one class, the $\bar x_{jk}$'s
can be regarded as coming from the same Gaussian. Then we can use standard
single-instance techniques on the bag means.
The TLD approach works well when the $\mu_k$'s are normally distributed (i.e. $N(m_k, w_k)$). But if the normality does not hold
very well, we need some adjustment techniques. We present one simple technique
here. When a new bag $b_{test}$ is met, we calculate the log-odds function according to
Equation 5.8 and usually decide its class label based on whether the log-odds > 0 or
not. Now, we instead determine the class label of $b_{test}$ based on whether the log-odds > $c$
or not, where $c$ is an empirically
optimal cut-point value instead of 0. This technique was mentioned in the context
of discriminant analysis [Hastie et al., 2001] but it is obviously applicable in our
TLD approach and in naive Bayes.
The rationale of this technique is easy to see, and is illustrated in Figure 5.4 in one
dimension. Here we have one Gaussian and one standard Gamma distribution for
each class (solid lines). The dotted line plots the Gaussian estimated using the
mean and variance of the Gamma. Now the estimated decision boundary becomes
C' whereas it should be C. Since in classification we are only concerned about the
decision boundary, we do not need to adjust the density. By looking for the empirically
optimal cut-point, we can move it from C' back to C and improve prediction
performance.
More specifically, we look for the cut-point as follows. First, we calculate the log-odds
ratio for each of the training bags and sort them in ascending order. The
log-odds ratios of bags with class 1 should be greater than those of bags with class
0, so we choose the value that minimizes the number of misclassified training bags and use
that value as the cut-point. If there is a tie, we choose the value closest to 0.

Figure 5.4: An illustration of the rationale for the empirical cut-point technique.
This technique is not commonly used to reduce the potential negative impact of
the normality assumption; kernel density estimation or discretization of numeric
attributes are more often used instead. Nonetheless, we found that this technique
can also work pretty well in normal single-instance learning. We have tested
it in association with naive Bayes on some two-class datasets from the UCI repository
[Blake and Merz, 1998] within the experimental environment of the WEKA
workbench [Witten and Frank, 1999]. The results are shown in Table 5.1.
In Table 5.1 we use NB to denote naive Bayes. The first column is NB with
empirical cut-point (EC) selection, and this is the baseline for comparison. The
second, third and fourth columns are NB without any adjustment, NB with kernel
density estimation (KDE) and NB with discretization of numeric attributes (DISC)
respectively. The results were obtained using 100 runs of 10-fold cross-validation
Dataset           NB+EC          NB               NB+KDE           NB+DISC
breast-cancer     72.53( 0.96)   72.76( 0.68)     72.76( 0.68)     72.76( 0.68)
breast-cancer-W   95.87( 0.23)   96.07( 0.1 )     97.51( 0.11) v   97.17( 0.13) v
german-credit     74.75( 0.51)   75.07( 0.43)     74.5 ( 0.46)     74.38( 0.62)
heart-disease-C   82.58( 0.75)   83.4 ( 0.41)     84.02( 0.58) v   83.2 ( 0.64)
heart-disease-H   83.52( 0.71)   84.23( 0.52)     84.95( 0.39) v   84.12( 0.36)
heart-statlog     83.78( 0.74)   83.73( 0.61)     84.4 ( 0.59)     82.91( 0.67)
hepatitis         83.33( 1 )     83.71( 0.82)     84.76( 0.62) v   83.67( 1.14)
ionosphere        89.68( 0.54)   82.51( 0.45) *   91.83( 0.32) v   89.27( 0.45)
kr-vs-kp          87.85( 0.16)   87.8 ( 0.15)     87.8 ( 0.15)     87.8 ( 0.15)
labor             93.29( 2.16)   94   ( 1.92)     93.18( 1.36)     88.44( 1.85) *
mushroom          98.18( 0.03)   95.76( 0.04) *   95.76( 0.04) *   95.76( 0.04) *
sick              95.99( 0.09)   92.76( 0.12) *   95.78( 0.08) *   97.14( 0.08) v
sonar             72.08( 1.39)   67.88( 1.06) *   72.68( 1.13)     76.43( 1.51) v
vote              89.55( 0.63)   90.09( 0.15)     90.09( 0.15)     90.09( 0.15)
pima-diabetes     75.43( 0.46)   75.65( 0.37)     75.15( 0.37)     75.51( 0.74)
Summary (v/ /*)                  (0/11/4)         (5/8/2)          (3/10/2)

Table 5.1: Accuracy of naive Bayes with empirical cut-point selection (NB+EC) compared with
plain naive Bayes, naive Bayes with kernel density estimation, and naive Bayes with discretization.
(CV), with standard deviations specified in brackets. The confidence level used for
the comparison was 99.5%. We use a * sign to indicate 'significantly worse than',
whereas v means 'significantly better than'. We regard two results on a dataset
as significantly different if the difference is statistically significant at the 99.5%
confidence level according to the corrected resampled t-test [Nadeau and Bengio,
1999]. It can be seen that the EC technique is comparable to discretization and
worse than KDE in general. However, it can indeed improve the performance of the
original naive Bayes. Moreover, in cases where KDE cannot be used, like in our
TLD approach, EC can be a convenient option.
It turns out that in the Musk datasets the normality assumption for $\mu_k$ is a problem
and the EC technique can be applied. For instance, for the naive Bayes approximation
of TLDSimple without EC, the leave-one-out (LOO) error rates on the Musk 1 and
Musk 2 datasets are 13.04% and 17.65% respectively. With EC, the error rates are
10.87% and 14.71%, which is an improvement of about 3% on each dataset. Note
that in the naive Bayes approximation of TLDSimple we do not use KDE because
Methods                         Musk 1               Musk 2
                                LOO     10CV         LOO     10CV
SVM with Minimax Kernel         7.6     8.4±0.7      13.7    13.7±1.2
RELIC                           16.3    -            12.7    -
TLDSimple+EC                    14.13   16.96±1.86   9.80    15.88±2.56
naive Bayes Approximation+EC    10.87   15.11±2.32   14.71   17.35±1.85

Table 5.2: Error rate estimates from 10 runs of stratified 10-fold CV and LOO
evaluation. The standard deviation of the estimates (if available) is also shown.
we do not want to accurately estimate the density of $\bar x_{jk}$ but that of $\mu_k$. In this
approximation, we ignored the term $\frac{\sigma_k^2}{n_j}$ in the variance of $\bar x_{jk}$. Hence $\bar x_{jk}$ is only
an approximation of $\mu_k$. The EC technique is only concerned about the cut-point
instead of the whole density, and thus does not suffer from this approximation.
5.7 Conclusions
In this chapter we have proposed a two-level distribution (TLD) approach for solving
multiple instance (MI) problems. We built a model in a group (class)-conditional
manner, and think of the data as being generated by two levels of distributions:
the instance-level distribution and the bag-level distribution within each group.
It turns out that the modeling process can be based on low-order sufficient statistics,
thus we can regard it as one of the methods that are based on metadata extracted
from each bag.
Despite its simplicity, we have shown that a simple version of TLD learning and
its naive Bayes approximation perform quite well on the Musk benchmark datasets.
We have also shown the relationship of our approach with normal group-conditional
single-instance learners, other metadata-based MI methods, and the empirical Bayes
method. Finally, we pointed out the similarity between MI learning problems and
the meta-analysis problems from scientific research.
Chapter 6
Applications and Experiments
Property                  Friendly   Unfriendly
Number of bags            188        42
Number of attributes      7          7
Number of instances       10486      2132
Number of positive bags   125        13
Number of negative bags   63         29
Average bag size          55.78      50.76
Median bag size           56         46
Minimum bag size          28         26
Maximum bag size          88         86
Method                      Friendly      Unfriendly
Best of the TLC methods     9.31±1.28     18.33±1.61
MILogisticRegressionGEOM    15.74±0.91    16.67±0
MI AdaBoost                 13.62±1.31    18.57±2.19
Wrapper with Bagged PART    18.78±1.26    23.10±2.26
Best of ILP methods         18.0          -
other hand, can naturally deal with nominal attributes, and we apply it directly to
the Mutagenesis datasets. For the regularization in MI AdaBoost, we use the same
strategy as in the Musk datasets. The base classifier is an unpruned regularized
C4.5 decision tree [Quinlan, 1993] as in Chapter 4. The minimal number of leaf-node
instances was set to around twice the average bag size, i.e. 120 instances for
'Friendly' and 100 instances for 'Unfriendly'. Under these regularization conditions,
we then looked for a reasonable number of iterations. It turns out that the best
performance often occurs with one iteration on the 'unfriendly' data, which led us to
think that maybe decision trees are too strong for this dataset, and we should use
decision stumps instead. Hence we eventually used 200 iterations of regularized C4.5
on the 'friendly' dataset and 250 iterations of decision stumps on the 'unfriendly' dataset.
The small ridge parameter value used on the 'friendly' dataset indicates that, under
the collective assumption, there is indeed a quite strong linear pattern, while the
larger ridge parameter value used on the 'unfriendly' dataset may imply a weaker
linear pattern. This is consistent with the observation that the 'friendly' data can be
fit with linear regression whereas the 'unfriendly' data cannot. The performances of
MILogisticRegressionGEOM and MI AdaBoost are similar on the Mutagenesis datasets,
which is reasonable because they are based on the same assumption. As long as the
instance-level decision boundary pattern is close to linear under the geometric-average
probability assumption, their behaviors will be very similar. The wrapper method with
bagged PART is not as good as TLC, MILogisticRegressionGEOM and MI AdaBoost, but
still comparable to the best of the ILP methods. Besides, we did not do any parameter
tuning or use other improvement techniques like discretization here. Neither did we try
the wrapper around other single-instance classifiers. We conjecture that we might be
able to improve the accuracy considerably with these efforts because we believe the
underlying assumption of the wrapper method (that the class probabilities of all the
instances within a bag are very similar to each other) may hold reasonably well in
the drug activity prediction problems.

The overall empirical evidence seems to show that the collective assumption may
be at least as appropriate in the molecular chemistry domain as the MI assumption.
We can get good experimental results on problems of this kind using methods based
on the collective assumption. This empirical observation seems to contradict some
claims that the MI assumption is precisely the case in the domain of molecular
chemistry [Chevaleyre and Zucker, 2000]. However, it would be interesting to discuss
this with a domain expert.
[Table 6.3: properties of the three Kiwifruit datasets (Fruit 1, Fruit 2, Fruit 3). Each dataset
contains 72 bags (4560 instances in total; average bag size 63.33); the numbers of positive bags
are 23, 28 and 19 respectively, with 49, 44 and 53 negative bags.]
Method                      Fruit 1      Fruit 2      Fruit 3
Default accuracy            68.06        61.11        73.61
TLD+EC                      62.75±0      52.78±0      66.67±0
MILogisticRegressionARITH   71.81±1.47   65.69±1.97   73.61±0
MILogisticRegressionGEOM    71.11±2.34   64.03±2.31   73.61±0
MI AdaBoost                 73.19±1.74   63.61±2.84   73.61±0
Wrapper with bagged PART    73.06±0.97   61.81±2.87   71.81±0.67

Table 6.4: Accuracy estimates for the Kiwifruit datasets and standard deviations.
One important fact that is not shown in Table 6.3 is that there are some missing
values in the dataset. Typically, instance values are missing for a whole bag for a
specific attribute. We developed different strategies to deal with the missing values
for the methods developed in this thesis. MI AdaBoost has the advantage that the
base classifier usually has its own way to deal with missing values, hence they are not
a problem for it. For the TLD approach, since we assume attribute independence, we
simply skip bags with missing values for a certain attribute when collecting low-order
statistics for that attribute. Thus, in the log-likelihood function of Equation 5.7
in Chapter 5, we use fewer bags to estimate the hyper-parameters of that attribute. The
cost is that we have less data for the estimation. As for the MI logistic regression
algorithms, we use the usual way of logistic regression to tackle this difficulty, that
is, we substitute the missing values with values estimated from the other training
instances without missing values. Again, in order to be consistent with our convention,
we substitute the missing values with the weighted average of the other instances'
values of the concerned attribute. The weight of an instance is given by the inverse
of the number of (non-missing) instances of the bag it pertains to.
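A sketch of this imputation rule in Python (ours; missing values are assumed to be coded as NaN):

import numpy as np

def impute_weighted(bags):
    # Replace NaNs in each attribute by the weighted average of the
    # non-missing values; an instance's weight is the inverse of the number
    # of non-missing instances in its bag, so every bag counts equally.
    d = bags[0].shape[1]
    for k in range(d):
        num = den = 0.0
        for b in bags:
            ok = ~np.isnan(b[:, k])
            if ok.any():
                num += b[ok, k].sum() / ok.sum()  # the bag's mean, total weight 1 per bag
                den += 1.0
        fill = num / den
        for b in bags:
            b[np.isnan(b[:, k]), k] = fill
    return bags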
Now the three datasets can be tackled by all of the methods in this thesis. We
do not have other methods to compare against (apart from DD, as discussed below), so
we list the default accuracy of the data, i.e. the accuracy if we simply predict the
majority class, in the first line of Table 6.4 for comparison. However, we found
that these datasets are very hard to classify. For example, we tried DD (looking
for one target concept) with the noisy-or model and estimated its accuracy using
10 runs of 10-fold CV.¹

¹ Since DD does not have a mechanism to deal with missing values, we pre-processed the data,
substituting the missing values with the average of the non-missing attribute values. Note that this
process may give DD an edge in classification because the resulting training and testing instances
are different from the original ones and may provide more information for classification.
Property                  Photo
Number of bags            60
Number of attributes      15
Number of instances       540
Number of positive bags   30
Number of negative bags   30
Average bag size          9
Median bag size           9
Minimum bag size          9
Maximum bag size          9
The photo data used in this thesis was generated based on the single-blob with neighbours (SBN)
approach proposed by [Maron, 1998], although there are many other techniques that
can be used [Zhang et al., 2002]. More precisely, there are 15 attributes in the data.
The first three attributes are the averaged R, G, B values of one blob. The remaining
twelve dimensions are the differences of the R, G, B values between one specific
blob and the four neighbours (up, right, down and left) of that blob. As in [Maron,
1998], an image was transformed into 8x8 pixel regions and each 2x2 region was
regarded as a blob. There are 9 blobs that can possibly have 4 neighbours in each
image. Therefore the resulting data has 15 attributes and 9 instances per bag, fixed
for each bag. Details of the construction of the MI datasets from the images can be
found in [Weidmann, 2003], where some other methods have also been tried.
Table 6.5 lists some of the properties of the photo dataset we used. Since all the
Method                      Photo
Best of the TLC methods     81.67
TLDSimple+EC                69.17±2.64
MILogisticRegressionARITH   71.67±2.48
MILogisticRegressionGEOM    71.00±2.96
MI AdaBoost                 76.67±2.36
Wrapper with bagged PART    75±0

Table 6.6: Accuracy estimates for the Photo dataset and standard deviations.
dimensions describe RGB values, they are all numeric. The dataset is about the
concept 'mountains and blue sky'. There are 30 photos that contain both mountains
and blue sky, which are the positive bags, and 30 negative photos. The negative
photos could contain only sky, only mountains, or neither. Two examples from
[Weidmann, 2003] illustrate the class label properties of the bags. The photo shown
in Figure 6.2 is a positive photo, which contains both mountains and the sky, while
the one shown in Figure 6.3 is negative because it only contains the sky (and the
plains) but no mountains. Note that the classification task here is more difficult
than that in [Maron, 1998] and [Zhang et al., 2002] because their target objects
only involve one simple concept, say 'mountains', whereas we have a conjunctive
concept here, which is more complicated.
Table 6.6 lists the experimental results of some of the methods developed in this
thesis, obtained with 10 runs of 10-fold CV. Again we did not finely tune the user
parameters. In the MI logistic regression methods, we used a ridge parameter of 2
for both methods. In MI AdaBoost, we used an unpruned C4.5 tree as the base
classifier and the minimal instance number for the leaf nodes is (again) twice the
bag size (18 instances). We obtained the results in Table 6.6 using 300 iterations.
The base classifiers in the wrapper method were configured using the default settings.
We also list the best result of the TLC algorithms [Weidmann, 2003] for comparison.

For this particular dataset, the methods based on the collective assumption may be
appropriate. Because the objects 'mountains' and 'blue sky' are large objects that
usually occupy the whole image, the class label of an image is very likely related to
many of its instances.
6.4 Conclusion
In this chapter, we have explored some practical applications of MI learning, namely,
drug activity prediction, fruit disease prediction and content-based image categorization problems. We experimented with some of the methods developed in this
thesis on the datasets related to these problems. We believe that the collective assumption is widely applicable in the real world. This belief is supported by the
empirical evidence we observed on the practical problems discussed in this chapter.
Chapter 7
Algorithmic Details
In this chapter we present some algorithmic details that are crucial to some of the
methods developed and discussed in this thesis. These techniques relate to either
the algorithms or the artificial data generation process. Important features of some
algorithms are also discussed.
More specifically, Section 7.1 briefly describes the numeric optimization technique
used in some of the methods developed in this thesis. Section 7.2 describes in detail
how to generate the artificial datasets used in the previous chapters. Because the
instance-based methods we developed in Chapter 4, especially the MI linear logistic
regression methods, are quite good at attribute (feature) selection, we show some
more detailed results on attribute importance of the Musk datasets in Section 7.3. In
Section 7.4 we describe some algorithmic details of the TLD approach developed
in Chapter 5. Finally we analyze the Diverse Density [Maron, 1998] algorithm, and
describe what we discovered about this method in Section 7.5.
Some of the methods in this thesis require minimizing an objective function over
parameters $x$ with bound constraints. Transforming such problems into unconstrained
optimization problems via variable transformations of $x$ is not
appropriate [Gill, Murray and Wright, 1981]. Because the objective function may
not be defined outside the bound constraints, we also need a method that does not
require evaluating the function values there. Eventually we chose a Quasi-Newton
optimization procedure with BFGS updates and the active set method suggested
by [Gill et al., 1981] and [Gill and Murray, 1976], which is based on projected
gradients.
and (ii) is the distribution otherwise. The distribution in case (i) is a line distribution
whereas the distribution in case (ii) is a combination of two line distributions. The
cornerstone for sampling from both distributions is a line distribution, denoted by
$f(\cdot)$.¹ Suppose the line is within $[a, b)$. Then we first draw a point $p$ uniformly from
$[a, b)$. If $p$ is in the interval that has the greater density, $[a, (a+b)/2)$ in the case shown
in Figure 7.1(i), then we accept $p$. Otherwise we draw another uniform variate $v$ in
$[0, f(\frac{a+b}{2}))$. If $v > f(p)$ then we accept $a + b - p$ (i.e. we accept the point symmetric
to $p$ about the midpoint); otherwise we accept $p$.

¹ Because we assumed attribute independence, we generated variates for each dimension and then
combined them into a vector.
101
[Figure 7.1: (i) a single line distribution; (ii) a combination of two line distributions, with the
midpoint (a+b)/2 and a sampled point p marked.]
When a bag's center is close to 0, we will have the distribution shown in Figure 7.1(ii).
We can view it as a combination of two line distributions: one part
in $[a, 0)$ and the other in $[0, b)$. We first decide which part the variate falls into,
according to the probabilities that are equivalent to the areas under the two lines.²
For each line distribution we use the same technique as discussed above to draw
a random variate from it. This is how we draw data points for each bag from the
conditional density given in Equation 4.3 in Chapter 4, where $\Pr(x)$ is a triangle
distribution.
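A Python sketch of the acceptance trick for a single line distribution (ours; f is the linear density and is assumed to be higher on the left half, as in Figure 7.1(i)):

import numpy as np

def sample_line(f, a, b, rng):
    # one variate from a linear density f on [a, b); the density is assumed
    # higher on [a, (a+b)/2)
    mid = (a + b) / 2.0
    p = rng.uniform(a, b)
    if p < mid:
        return p                          # higher-density half: accept directly
    v = rng.uniform(0.0, f(mid))
    return a + b - p if v > f(p) else p   # otherwise possibly reflect about the midpoint

For example, with rng = np.random.default_rng(0) and the density f = lambda x: 2 * (1 - x) on [0, 1), repeated calls reproduce the triangular shape; for linear densities the reflection makes the accepted points exactly follow f.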
In the TLD method in Chapter 5, we need to generate $\sigma_k^2$ from an Inverse-Gamma
distribution (parameterized by $a_k$ and $b_k$) and $\mu_k$ from a Gaussian (parameterized
by $m_k$ and $w_k\sigma_k^2$).

² Note that the total areas under the two lines sum to 1.
constant, the probability remains unchanged, but we can arbitrarily change the value
of the scaling parameter $s$. This means that if we multiplied every data point
in one dimension by an arbitrary constant and rescaled the resulting $s$ accordingly,
the solution would not change.
The coefficients are taken relative to the maximal (absolute) coefficient value, so
the largest is 100%. The data was standardized using the weighted mean and
standard deviation, as described in Chapter 4. In linear logistic models, the meaning
of the coefficients of the attributes is straightforward: if we increase the value of an
attribute by one unit, the log-odds changes by the value of the corresponding
coefficient.
[Plots of relative importance (%) for each attribute f1-f166.]

Figure 7.2: Relative feature importance in the Musk datasets: the left plot is for
Musk1, and the right one for Musk2.
we have the probability of $\phi$, $\Pr(\phi|\theta_y^0)$, and the complete-data likelihood function
$L(B_j|\phi;\theta_y^0) = \Pr(B_j|\phi,\theta_y^0) = \Pr(B_j|\phi)$. Thus the integral can be regarded as
taking the expectation of the complete-data likelihood function over the unobservables
$\phi$; the formula of the expectation is exactly the same as the last line in Equation 5.1. Then
we maximize this expected likelihood function w.r.t. $\theta_y$, which is the M-step. Under some
regularity conditions, we can maximize the likelihood within the expectation sign. Hence
within the expectation sign, we take a Newton step from $\theta_y^0$ to get a new value $\theta_y^1$, that is
$\theta_y^1 = \theta_y^0 + E(H_L^{-1})E(g_L)$, where $H_L$ is the Hessian matrix (second derivative) of
the likelihood and $g_L$ is the Jacobian (first derivative) at the point $\theta_y^0$. The expectation
is, again, over $\phi$ and can be numerically evaluated now. This defines an iterative
solution of this maximum likelihood method, which was cited by the well-known
original paper of the EM algorithm [Dempster et al., 1977] as an early example of
the EM algorithm.
In the TLD method we also have Gamma functions to evaluate in Equation 5.6.
More precisely, we need to evaluate

$$g = \log\frac{\Gamma((b_k+n_j)/2)}{\Gamma(b_k/2)}.$$

Define $h = \lfloor n_j/2\rfloor$ and $s = \sum_{x=1}^{h}\log(b_k/2 + h - x)$. If $n_j$ is even, $g = s$
according to the well-known identity $\Gamma(y+1) = y\Gamma(y)$. But if $n_j$ is odd, we have
$g = s + \log\frac{\Gamma(b_k/2+1/2)}{\Gamma(b_k/2)}$, thus we still have a log-Gamma function to evaluate.
Moreover, since searching for the minimum of the negative log-likelihood function requires
the Jacobian and Hessian matrix of the function,³ we also need to evaluate
$\frac{d}{dy}\log\Gamma(y)$ and $\frac{d^2}{dy^2}\log\Gamma(y)$ in
this case. Fortunately there is an easy approximation for them [Artin, 1964]:

$$\frac{d}{dy}\log\Gamma(y) = -C - \frac{1}{y} + \sum_{i=1}^{+\infty}\Big(\frac{1}{i} - \frac{1}{y+i}\Big)$$

$$\frac{d^2}{dy^2}\log\Gamma(y) = \frac{1}{y^2} + \sum_{i=1}^{+\infty}\frac{1}{(y+i)^2}$$

where $C$ is Euler's constant, which cancels out when taking differences as in Equation 5.6.
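A Python sketch of the truncated series (ours; T is the number of terms retained):

import numpy as np

EULER_C = 0.5772156649015329  # Euler's constant C

def dlog_gamma(y, T=100000):
    # truncated series for d/dy log Gamma(y)
    i = np.arange(1, T + 1)
    return -EULER_C - 1.0 / y + np.sum(1.0 / i - 1.0 / (y + i))

def d2log_gamma(y, T=100000):
    # truncated series for d^2/dy^2 log Gamma(y)
    i = np.arange(1, T + 1)
    return 1.0 / y ** 2 + np.sum(1.0 / (y + i) ** 2)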
We would like to further analyze the model in TLD to get more insight into the
interpretation of the parameters in this method. We restrict our discussion to one
dimension so that we can discard the subscript $k$. The parameters $a$ and $b$ together
define the properties of $\sigma^2$. The parameter $m$, the mean of $\mu$, is quite independent of
the other parameters, whereas $w$ depends on both the variance of $\mu$ and the (expected
value of) $\sigma^2$. Thus it seems that the most difficult part of the interpretation comes
from $a$ and $b$.

³ Note that the Quasi-Newton method we used does not itself need the Hessian matrix. We provide
the Hessian to give better solutions in the case of the bound constraints.
Recall that $\frac{a}{\sigma^2}\sim\chi_b^2$, so the density of $\sigma^2$ is

$$dF(\sigma^2) = \frac{a^{b/2}\exp\big[-\frac{a}{2\sigma^2}\big]}{2^{b/2}\,\Gamma(b/2)}\,(\sigma^2)^{-\frac{b}{2}-1}\,d\sigma^2$$

and

$$E(\sigma^2) = \int_0^{\infty}\sigma^2\,\frac{a^{b/2}\exp\big[-\frac{a}{2\sigma^2}\big]}{2^{b/2}\,\Gamma(b/2)}\,(\sigma^2)^{-\frac{b}{2}-1}\,d\sigma^2.$$

By setting $y = a/(2\sigma^2)$, we have

$$E(\sigma^2) = \frac{a}{2\,\Gamma(\frac{b}{2})}\int_0^{\infty}y^{\frac{b}{2}-2}\exp(-y)\,dy.$$

If $b > 2$, the integral equals $\Gamma(\frac{b}{2}-1)$, so that $E(\sigma^2) = \frac{a}{b-2}$. Otherwise, if $b \le 2$,

$$E(\sigma^2) = \frac{a}{2\,\Gamma(\frac{b}{2})}\int_0^{\infty}y^{-1}\exp(-y)\,dy = \infty.$$

Therefore, if $b \le 2$, the distribution is proper but the mean does not exist; otherwise
the mean is $\frac{a}{b-2}$.
Hence as long as the ratio of $a$ and $b-2$ is the same, the mean of $\sigma^2$ is the same, and so is the LL function value given other parameters and the data. But note that the variance of $\sigma^2$ is not the same even if $a/(b-2)$ is a constant, because $Var(\sigma^2) = 2a^2/[(b-2)^2(b-4)]$ provided that $b > 4$. That is why in Figure 7.3 the values of $a$ and $b$ corresponding to the maximal LL values are not exactly linear, but very close to linear.
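As a small illustration of these moment formulas (a sketch following the derivation above, not code from the MI package), the prior mean and variance of $\sigma^2$ can be computed directly from $a$ and $b$, returning NaN when the moment does not exist:

    static double sigmaSqMean(double a, double b) {
        return (b > 2) ? a / (b - 2) : Double.NaN;   // mean exists only for b > 2
    }

    static double sigmaSqVar(double a, double b) {
        return (b > 4) ? 2 * a * a / ((b - 2) * (b - 2) * (b - 4))
                       : Double.NaN;                 // variance exists only for b > 4
    }

For example, $(a, b) = (2, 4)$ and $(4, 6)$ give the same mean 1, but in the first case the variance does not even exist, which is why the maximal-LL values of $a$ and $b$ are only approximately linear.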
This analysis is useful for performing feature selection using the TLD method. Even if two features have different $a_k$ and $b_k$ values but similar values of $\frac{a_k}{b_k-2}$, $w_k$, and $m_k$, they are still not useful for discrimination. Moreover, the parameter $w_k$ denotes the ratio $\frac{Var(\mu_k)}{\sigma_k^2}$. Given $Var(\mu_k)$, $w_k$ actually depends on $a_k$ and $b_k$. Experiments (on the artificial data) show that if we specify an independent $Var(\mu_k)$ and $b_k > 2$,
Unfortunately, we did find many such attributes in the Musk data, which may be a reason why the simplified model TLDSimple can work better than TLD on this data. On the other hand, when we applied TLD to the kiwifruit datasets described in Chapter 6, we did not observe this (adverse) phenomenon (i.e. all $b_k$'s $> 2$). As a result, TLD can be applied successfully and, as discussed in Chapter 6, it actually outperformed TLDSimple on this data.
In this section we briefly describe how we can view the Diverse Density (DD) algorithm [Maron, 1998] as a maximum binomial likelihood method, and how we should interpret its parameters.

The Diverse Density (DD) method was proposed by Maron [Maron, 1998; Maron and Lozano-Perez, 1998] and has EM-DD [Zhang and Goldman, 2002] as its successor, which claims to be an application of the EM algorithm to the most-likely-cause model of the DD method. However, we think that the EM-DD algorithm has problems finding a maximum likelihood estimate (MLE), as shown in Appendix D. Thus we are strongly skeptical about the validity of EM-DD's solution as an ML method. We only discuss the DD algorithm here.
The basic idea of DD is to find a point in the instance space on which as many positive bags as possible overlap, while no (or few) negative bags cover it. The diverse density of a point $x$ stands for the probability of this point being the true concept. Mathematically,

$$L(x|\mathcal{B}) = \prod_{1\le i\le n} Pr(x = t|B_i^+)\;\prod_{1\le j\le m} Pr(x = t|B_j^-) \qquad (7.1)$$
We believe this notation is confusing and may even disguise the essence of the DD method. Hence we would like to express this method using a different perspective. As is usually done in single-instance learning, we would like to explicitly represent the class labels as a variable $Y$. In MI, there is still a $Y$, not for each instance, but for each bag.
[Maron, 1998] proposes two approximate generative models, namely the noisy-or and the most-likely-cause models, to calculate $Pr(x = t|B_i)$:

Noisy-or model: DD borrows the noisy-or model from the Bayesian networks literature. Noisy-or calculates the probability of an event to happen in the face of a few causes, assuming that the probability of any cause failing to trigger this event is independent of any other causes. DD regards an instance as a cause and $t$ being the underlying concept as the event, hence, using the original notation,

$$Pr(x = t|B_i^+) = 1 - \prod_j\big[1 - Pr(x = t|B_{ij})\big]$$

and

$$Pr(x = t|B_i^-) = \prod_j\big[1 - Pr(x = t|B_{ij})\big].$$
we could first determine the class label of each instance (or stage) according to its class probability. Given this, what is the probability for all the instances (stages) to be negative if they are independent? This is a simple question and the solution is $\prod_{b=1}^{m}\big(1 - Pr(Y=1|x_b)\big)$. Consequently, $1 - \prod_{b=1}^{m}\big(1 - Pr(Y=1|x_b)\big)$ corresponds to the probability for at least one of them to be positive. This is a multiple-stage Bernoulli process view of the noisy-or model. The log-likelihood function over the bags is then
$$L = \sum_{B}\Big[Y\,\log\Big(1 - \prod_{b=1}^{m}\big(1 - Pr(Y=1|x_b)\big)\Big) + (1-Y)\,\log\prod_{b=1}^{m}\big(1 - Pr(Y=1|x_b)\big)\Big].$$
The most-likely-cause model, on the other hand, selects the instance with the highest probability to be positive in a bag. Therefore it literally degrades a bag into one instance, and the one-stage Bernoulli process is applied to determine the class label, as in the mono-instance case. The log-likelihood function is now
$$L = \sum_{B}\Big[Y\,\log\Big(\max_b Pr(Y=1|x_b)\Big) + (1-Y)\,\log\Big(1 - \max_b Pr(Y=1|x_b)\Big)\Big].$$
The most-likely-cause model also follows the standard MI assumption because, by selecting the instance in a negative bag that has the maximal probability to be positive and setting that probability (via the binomial model) to less than 0.5, it implies that the probability to be positive for every instance in a negative bag cannot be greater than 0.5.
This means that DD uses the binomial probability formulation, either one-stage or multiple-stage, to model the class probabilities of the bags. In general, we can separate the modeling into two levels if we introduce a new variable denoting the bags' class labels. Mathematically, at the bag level, assuming i.i.d. data,⁴ the marginal probability of the data is
$$\prod_{1\le k\le N} g(\theta; B_k)^{Y_{B_k}}\,\big(1 - g(\theta; B_k)\big)^{1 - Y_{B_k}} = \prod_{1\le i\le n} g(\theta; B_i^+)\;\prod_{1\le j\le m}\big(1 - g(\theta; B_j^-)\big). \qquad (7.4)$$

⁴ We assume that the class labels $Y_{B_i}$ of the bags $B_i$ are independent and identically distributed (i.i.d.) according to a binomial (or multinomial in a multi-class problem) distribution.
parameter standing for a point in the instance space. The second one is to use $\exp(-\|s(X_{kl} - p)\|^2)$, where $s$ is a diagonal matrix whose diagonal elements are the scaling parameters for the different dimensions. The last model is a variation of the second one that models a set of disjunctive concepts, each of which is the same as the second model. As a matter of fact, suppose we knew there are $D$ concepts to be found. Then the DD method would have $D$ sets of parameters (instead of one), $v_1, \dots, v_D$, where $v_d$, $d = 1, 2, \dots, D$, is a vector of parameters consisting of both point and scaling parameters ($p$ and $s$) for each dimension. In the process of searching for the values of the $p_d$'s and $s_d$'s, the probability of each $X_{kl}$ is associated with only one concept, namely the one that gives $X_{kl}$ the highest $Pr(Y = 1|X_{kl})$. In other words, $Pr(Y = 1|X_{kl})$ is calculated as $\max_d\{\exp(-\|s_d(X_{kl} - p_d)\|^2)\}$.
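The following sketch shows the radial instance-level probability and its disjunctive variant (all names are illustrative; x, p[d] and s[d] hold the attribute values, point parameters and scaling parameters respectively):

    static double radial(double[] x, double[] p, double[] s) {
        double dist = 0.0;
        for (int t = 0; t < x.length; t++) {
            double z = s[t] * (x[t] - p[t]);
            dist += z * z;                   // ||s(x - p)||^2
        }
        return Math.exp(-dist);
    }

    static double disjunctive(double[] x, double[][] p, double[][] s) {
        double best = 0.0;
        for (int d = 0; d < p.length; d++)
            best = Math.max(best, radial(x, p[d], s[d]));  // max over the D concepts
        return best;
    }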
The formulation of $Pr(Y_X|X_{kl})$, i.e. $f(\theta; X_{kl})$, is in a radial (i.e. Gaussian-like) form, with a center of $p_t$ and a dispersion of $\frac{1}{s_t}$ in the $t$th dimension (where $s_t$ is the $t$th diagonal element in $s$). The closer an instance is to $p$, the higher its probability to be positive. And the dispersion determines the decision boundary of the classification, i.e. the threshold of where $Pr(Y_X|X_{kl}) = 0.5$. In this respect DD is related to the axis-parallel hyper-rectangle (APR) [Dietterich et al., 1997] on the instance level. But an APR is not differentiable. In order to make it differentiable, DD essentially models the (instance-level) decision boundary as an axis-parallel hyper-ellipse (APE) instead of a hyper-rectangle. The semi-diameter of this APE along the $t$th dimension is $\frac{\sqrt{\log 2}}{s_t}$.
For example, in two dimensions, the boundary is

$$\exp[-s_1^2(x_1 - p_1)^2 - s_2^2(x_2 - p_2)^2] = 0.5 \;\Longrightarrow\; \frac{(x_1 - p_1)^2}{\log 2/s_1^2} + \frac{(x_2 - p_2)^2}{\log 2/s_2^2} = 1,$$

an axis-parallel ellipse, which corresponds to the second formulation in DD described above. The third model in DD models more than one such concept.
best results reported when searching for the threshold, which are 88.9% on Musk1 data and 82.5% on Musk2 data,⁵ but the computational expense is greatly reduced.

Apart from the interpretation of the parameters, a further difficulty is the max function inside the log-likelihood of Equation 7.3, which makes it not differentiable. The softmax function is used in DD, which is the standard way to make the max function differentiable. The EM-DD method [Zhang and Goldman, 2002] was proposed to overcome this difficulty as well.

⁵ Because [Maron, 1998] did not report how many runs of 10-fold CV were used, nor the standard deviation of the accuracies, we cannot perform a significance test to see whether the differences are significant.
Chapter 8
Conclusions and Future Work
The approach adopted in this thesis is a conservative one in the sense that it is
similar to existing methods of multiple instance learning. Much of the work is an
extension or a result derived from the statistical interpretation of current methods in
either single-instance or MI learning. Although some new MI methods have been
described in this thesis, we basically adopted a theoretical perspective similar to
that of the current methods. Hence much of the richness of the multiple instance
problems is left to be explored. In this chapter, we first summarize what has been
done in this thesis. Then we propose what could be done to more systematically
explore MI learning in a statistical classification context.
8.1 Conclusions
In this thesis, we first presented a framework for MI learning based on which we
summarized MI methods published in the literature (Chapter 2). We defined two
main categories of MI methods: instance-based and metadata-based approaches.
While instance-based methods focus on modeling the class probability of each instance and then combine the instance-level probabilities into bag-level ones, metadata-based methods extract metadata from each bag and model the metadata directly. Note that in the instance-based methods, the combination of the instance-level predictions into bag-level ones requires some assumptions.
The standard assumption that can be found in the literature is the MI assumption.
We proposed a new assumption instead of the MI assumption and called it the collective assumption. We also explained that some of the current MI methods have
implicitly used this assumption. Under the collective assumption, we developed
new methods that fall into two categories: bag-conditional and group-conditional
approaches.
A bag-conditional approach models the probability of a class given a bag of $n$ instances, $Pr(Y|X_1, \dots, X_n)$. A group-conditional approach, on the other hand, models the density of the instances given the class label. A single generative model per class would have been too simple to solve real-world problems. Instead,
we assumed that all instances from the same bag are from the same density while
different bags correspond to different (instance-level) densities. We then related the
parameters of these densities to one another by using a hyper-distribution (or bag-level distribution) on the parameters. This resulted in a two-level distribution (TLD)
solution (Chapter 5) to MI learning. This is essentially a metadata-based approach.
We discovered that this approach is an application of the empirical Bayes method
from statistics to the MI classification problem.
Then we explored some practical applications of MI learning, such as the drug activity prediction problem.
Normal single-instance learners are not expected to deal with this data because they cannot use
the information provided by the Bag ID attribute. Since this information resides
in the data, there is room to develop a new family of methods that can fully utilize
the bag information. Such new methods may outperform normal single-instance
learners on this type of problem.
We call such a problem a semi-MI problem because the second property of MI
problems is not satisfied. As shown above, much of the richness of multiple instance
learning cannot be applied. When classifying a test instance in semi-MI learning,
we can regard the rest of the instances within the same bag as an environment
for the classification. Even if the instance to be classified does not change, the
classification may change if the environment (i.e. other instances within the bag)
changes. Ignoring this contextual information may not give an accurate prediction.
Nevertheless, MI research seems to regard the semi-MI problem as the same setting as normal single-instance learning, and semi-MI problems do not appear to
Applications
We expect that multiple instance learning will keep attracting researchers, mainly
due to its prospective applications in various areas. However, one of the biggest
obstacles is the lack of fielded applications and publicly available datasets. More MI
datasets and practical applications would stimulate research in real world problems
for MI learning. In fact, we observed that there are many datasets in which instances
Appendix A
Java Classes for MI Learning
We have developed a multiple instance package using the Java programming language for this project. A schematic description is shown in Figure A.1. This package is directly derived or modified from the corresponding code in the WEKA workbench [Witten and Frank, 1999].¹ We put the programs for the MI experiment environment and MI methods directly into the MI package, some artificial data generation procedures into the MI.data sub-package, and some data visualization tools into the MI.visualize sub-package. Although the documentation of the programs is self-explanatory, we briefly describe some of them in the following.

¹ As a matter of fact, we copied some of the source code from the WEKA files.
[Figure A.1: The MI package (MI algorithms and experimental environment) and its sub-packages MI.data and MI.visualize.]
MI.MIDistributionClassifier : An interface for any MI classifier that can provide a class distribution given a test exemplar. It extends
MI.MIClassifier.
All the MI methods developed in this thesis as well as some other MI methods are
implemented in the MI package. More precisely, they are:
MI.DD : The Diverse Density algorithm with the noisy-or model [Maron,
1998] that looks for one target concept. Details are described in Chapter 7.
$m$, $w$ and the generated variance $\sigma^2$ (i.e. it generates $\mu$ from $N(m, w\sigma^2)$). Finally, it generates a bag of instances according to a Gaussian parameterized by $\mu$ and $\sigma^2$. The datasets generated by this class are used to show the estimation properties of TLD in Chapter 5.
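The generative process can be sketched as follows (a simplified, one-attribute version; the actual class in MI.data may differ in detail). Drawing $\sigma^2$ uses the fact that $a/\chi^2_b$ has the required prior density, which assumes $b$ is a positive integer here:

    import java.util.Random;

    public class TldBagGenerator {
        // Generates one bag: sigma^2 ~ a / chi^2_b, mu ~ N(m, w * sigma^2),
        // then bagSize instances from N(mu, sigma^2).
        public static double[] generateBag(double a, int b, double m, double w,
                                           int bagSize, Random rnd) {
            double chi2 = 0.0;
            for (int i = 0; i < b; i++) {
                double z = rnd.nextGaussian();
                chi2 += z * z;
            }
            double sigmaSq = a / chi2;
            double mu = m + rnd.nextGaussian() * Math.sqrt(w * sigmaSq);
            double[] bag = new double[bagSize];
            for (int i = 0; i < bagSize; i++)
                bag[i] = mu + rnd.nextGaussian() * Math.sqrt(sigmaSq);
            return bag;
        }
    }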
Appendix B
weka.core.Optimization
Zak, 1996], [Gill et al., 1981], [Dennis and Schnabel, 1983], etc. The basic Newton method proceeds as follows:

1. Initialization. Set the iteration counter $k=0$, get initial variable values $x_0$, and calculate the Jacobian (gradient) vector $g_0$ and the Hessian (second derivative) matrix $H_0$ using $x_0$.

2. Check $g_k$. If it converges to 0, then STOP.

3. Solve for the search direction $d_k$ from $H_k d_k = -g_k$. Alternatively, this step can be expressed as $d_k = -H_k^{-1} g_k$ if the Hessian matrix is invertible.
One can easily get the update in Step 4 below by a Taylor series expansion to the second order from $x_k$.
4. Take a step to get the new variable values $x_{k+1}$. Normally this is done as one Newton step, $x_{k+1} = x_k + d_k$; however, when the variable values are far from the function minimum, one Newton step may not guarantee a decrease of the objective function, in which case we search for a step length $\lambda$ along $d_k$ and set $x_{k+1} = x_k + \lambda d_k$.

5. Calculate the new gradient vector $g_{k+1}$ and the new Hessian matrix $H_{k+1}$ using $x_{k+1}$.

6. Set $k=k+1$. Go to 2.
One could search for the value of $\lambda$ that exactly minimizes the objective function. Such a line search is called an exact line search, which is computationally expensive. It was recommended that only a value of $\lambda$ that achieves a sufficient decrease of the objective function is needed [Dennis and Schnabel, 1983; Press et al., 1992]. Thus we use an inexact line search with backtracking and polynomial interpolations [Dennis and Schnabel, 1983; Press et al., 1992] in our project.
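A bare-bones version of such a line search, with only the sufficient-decrease (Armijo) criterion and plain halving instead of polynomial interpolation, might look as follows (Objective is a hypothetical interface, not part of WEKA):

    interface Objective { double valueAt(double[] x); }

    static double backtrack(Objective f, double[] x, double[] d, double[] g) {
        final double ALPHA = 1e-4;          // sufficient-decrease constant (assumed)
        double slope = 0.0;                 // directional derivative g^T d (negative)
        for (int i = 0; i < x.length; i++) slope += g[i] * d[i];
        double f0 = f.valueAt(x);
        double lambda = 1.0;                // start with the full Newton step
        while (lambda > 1e-10) {
            double[] trial = new double[x.length];
            for (int i = 0; i < x.length; i++) trial[i] = x[i] + lambda * d[i];
            if (f.valueAt(trial) <= f0 + ALPHA * lambda * slope) return lambda;
            lambda *= 0.5;                  // backtrack
        }
        return lambda;
    }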
Now we carry on with Quasi-Newton methods. The idea of Quasi-Newton methods is not to use the Hessian matrix, because it is expensive to evaluate and sometimes it is not even available. Instead, a symmetric positive definite matrix, say $B$, is used to approximate the Hessian (or the inverse of the Hessian). As a matter of fact, it can be shown that no matter what matrix is used, as long as it is symmetric positive definite and an appropriate update of the matrix is taken in each iteration, the search result will be the same! [Gill et al., 1981] Thus the key issue is how to update this matrix $B$. One of the most famous update methods is the variable metric method of Broyden-Fletcher-Goldfarb-Shanno (BFGS). There are, of course, many other update methods, but it was reported that the BFGS method works better with the inexact line search. Hence this method is preferred in practice. In summary, the only difference between the Quasi-Newton method and the Newton method concerns the Hessian matrix:
1. Initialization. Set the iteration counter $k=0$, get initial variable values $x_0$, calculate the Jacobian (gradient) vector $g_0$ using $x_0$, and set an initial symmetric positive definite matrix $B_0$.

2.–4. As in the Newton method above, with $B_k$ in place of $H_k$.

5. Update the matrix (see the sketch after this list):
$$B_{k+1} = B_k + \frac{\Delta g_k\,\Delta g_k^T}{\Delta g_k^T \Delta x_k} - \frac{B_k \Delta x_k \Delta x_k^T B_k}{\Delta x_k^T B_k \Delta x_k},$$
where $\Delta x_k = x_{k+1} - x_k$ and $\Delta g_k = g_{k+1} - g_k$.

6. Set $k=k+1$. Go to 2.
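For reference, the dense form of the Step 5 update can be sketched directly (this operates on B itself; as explained below, our implementation actually updates a factorization of B instead):

    // dx = x_{k+1} - x_k, dg = g_{k+1} - g_k
    static void bfgsUpdate(double[][] B, double[] dx, double[] dg) {
        int n = dx.length;
        double[] Bdx = new double[n];
        double dgDx = 0.0, dxBdx = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) Bdx[i] += B[i][j] * dx[j];
            dgDx += dg[i] * dx[i];
        }
        for (int i = 0; i < n; i++) dxBdx += dx[i] * Bdx[i];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                B[i][j] += dg[i] * dg[j] / dgDx - Bdx[i] * Bdx[j] / dxBdx;
    }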
Before we move on to the optimization with bound constraints, there are some details to be elaborated here. Since the update in Step 5 consists of two rank one corrections, by applying the Sherman-Morrison formula to $B_{k+1}$ twice in Step 5, $B_{k+1}^{-1}$ is readily available, and in the next iteration Step 3 is easy to carry out. Nevertheless, we will not adopt this strategy because of a minor but important practical issue involved.
Since the whole algorithm depends on the positive definiteness property of the matrix $B$ (otherwise the search direction will be wrong, and it can take much, much longer to find the right direction and the target!), it would be good to keep the positive definiteness during all iterations. But there are two cases where the update in Step 5 can result in a non-positive-definite $B$.

First, hereditary positive definiteness requires the curvature condition $\Delta g_k^T \Delta x_k > 0$, and this condition can be ensured in an exact line search [Gill et al., 1981]. When using an inexact line search, apart from the sufficient function decrease criterion, we should also impose this second condition on it. Thus we cannot use the line search in [Press et al., 1992]; instead we use the modified line search described in [Dennis and Schnabel, 1983] in Step 4.
Second, even if the hereditary positive-definiteness is theoretically guaranteed, the matrix $B$ can still lose positive-definiteness during the update due to rounding errors when the matrix is nearly singular, which is not uncommon in practice. Therefore we keep $B$ in the factorized form $B = LDL^T$ of its Cholesky factorization. As long as the matrix after a low rank update is theoretically positive definite, there exist algorithms that avoid rounding errors during the updates and ensure that all the diagonal elements of $D$ are positive [Gill, Golub, Murray and Saunders, 1974; Goldfarb, 1976]. This factorized version of the BFGS updates is the reason why we do not use $B_k^{-1}$ in Step 3: with a Cholesky factorization, the equation $B_k d_k = -g_k$ can be solved cheaply by backward and forward substitution. Note, however, that the BFGS update formula in Step 5 is not convenient if the Cholesky factorization of $B$ is maintained. Using $B_k \Delta x_k = -\lambda g_k$ (which holds because $\Delta x_k = \lambda d_k$ and $B_k d_k = -g_k$), the update can be rewritten as

$$B_{k+1} = B_k + \frac{\Delta g_k\,\Delta g_k^T}{\Delta g_k^T \Delta x_k} + \frac{g_k\, g_k^T}{g_k^T d_k}.$$
Note that this involves two rank one modifications, with coefficients $\frac{1}{\Delta g_k^T \Delta x_k} > 0$ and $\frac{1}{g_k^T d_k} < 0$ respectively. Hence the first update is a positive rank one update and the second one a negative rank one update. There is a direct rank two modification algorithm [Goldfarb, 1976], but for simplicity we implemented two rank one modifications using the C1 algorithm in [Gill et al., 1974] for the former update and the C2 algorithm, also in [Gill et al., 1974], for the latter one. Note that all these algorithms have $O(N^2)$ complexity.
In summary, we use a factorized version of the Quasi-Newton method to avoid the rounding errors and achieve positive-definiteness of $B$ during updates.
Note that with the factorization available, solving a linear system $Ax = b$ amounts to two triangular substitutions. For bound constraints the projection matrix is particularly simple: it is diagonal, with entries of 1 for free (i.e. not in the constraints) variables and 0s otherwise.
Next let us go further into the problem of optimization subject to linear inequality constraints. Among the applicable methods, we are interested in the one(s) that does not allow variables to take values over the bounds, because in our case the objective function is not defined there. Hence we use the active set method, which has this essential feature [Gill et al., 1981]. The idea of the active set method is to check the search step in each iteration such that, if some variables are about to violate the constraints, these constraints become the active set of constraints. Then, in later iterations, we use the projected gradient and the projected Hessian (or projected $B$) with respect to the active constraints to perform further steps. We will not dig deeply into this method because
in our case (i.e. for bound constraints), the task is especially easy. In each iteration
in the above Quasi-Newton method with BFGS updates, we simply test whether
a search step can cause a variable to go beyond the corresponding bound. If this
occurs, we fix this variable, i.e. treat it as a constant in later iterations, and use
only the free variables to carry out the search. Thus the main modification in the
above algorithm is in the line search in Step 4. We should use an upper bound for $\lambda$ for all possible variables. The upper bound is $\min_i\big(\frac{b_i - x_i}{d_i}\big)$ (where $b_i$ is the bound constraint for the $i$th variable $x_i$) if the direction $d_i$ of $x_i$ is an infeasible direction (i.e. if $d_i < 0$). This means that we always calculate the maximal step length that does not violate any of the inactive constraints and set this as the upper bound for the trial. Therefore this line search is called a safeguarded line search. This method can be readily extended to the case of two-sided bound constraints, i.e. $l \le x \le u$.
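The computation of this upper bound is straightforward; a sketch for one-sided (lower) bounds, with illustrative names:

    // Largest step length that keeps every variable inside its bound.
    static double maxStep(double[] x, double[] d, double[] bounds) {
        double upper = Double.POSITIVE_INFINITY;
        for (int i = 0; i < x.length; i++)
            if (d[i] < 0)                                  // infeasible direction
                upper = Math.min(upper, (bounds[i] - x[i]) / d[i]);
        return upper;
    }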
When a variable is freed again, we expand the gradient and the Hessian back to the corresponding higher dimensions, i.e. update the corresponding entries in $g$ and $B$ (basically, set them to the initial state for these originally fixed variables). Nonetheless, if the user does not provide the second derivative,² we only use the first order estimate of the Lagrange multiplier.
The above is a description of what we have done for the optimization subject to
bound constraints. For completeness, we write down the final algorithm in the
following, although it is basically just a repetition of the above description. Note that we use the superscript FREE to indicate a projected vector or matrix below (i.e. they only have entries corresponding to the free variables).
1. Initialization. Set the iteration counter $k=0$, get initial variable values $x_0$, calculate the Jacobian (gradient) vector $g_0$ using $x_0$, and compute the Cholesky factorization $L_0 D_0 L_0^T$ of a symmetric positive definite matrix $B_0$.

2. Check $g_k$. If it converges to 0, then test whether any fixed (or bound) variables can be released from their constraints, using both first and second order estimates of the Lagrange multipliers.

3. If no variable can be released, then STOP; otherwise release the variables and add the corresponding entries in $x_k^{FREE}$ (set to the bound values), $g_k^{FREE}$, $L_k^{FREE}$ and $D_k^{FREE}$ (if the $j$th variable is to be released, we set $l_{jj}$ and $d_{jj}$ to 1 and the other entries in the $j$th row/column to 0).

4. Solve for the search direction $d_k^{FREE}$ using backward and forward substitutions in $L_k^{FREE} D_k^{FREE} (L_k^{FREE})^T d_k^{FREE} = -g_k^{FREE}$.

5. Cast an upper bound on $\lambda$ and search for the best value of $\lambda$ along the direction $d_k^{FREE}$ using a safeguarded inexact line search with backtracking and polynomial interpolations. Set $x_{k+1}^{FREE} = x_k^{FREE} + \lambda\, d_k^{FREE}$.
² This is often the case, because one of the reasons why people use the Quasi-Newton method is that they do not need to provide the Hessian matrix.
6. If any variable is fixed at its bound constraint, delete its corresponding entries in $x_k^{FREE}$, $g_k^{FREE}$, $L_k^{FREE}$ and $D_k^{FREE}$.

7. Calculate the gradient vector $g_{k+1}^{FREE}$ using $x_{k+1}^{FREE}$. Set $\Delta x_k^{FREE} = x_{k+1}^{FREE} - x_k^{FREE}$ and $\Delta g_k^{FREE} = g_{k+1}^{FREE} - g_k^{FREE}$. Then the update is:

$$L_{k+1}^{FREE} D_{k+1}^{FREE} (L_{k+1}^{FREE})^T = L_k^{FREE} D_k^{FREE} (L_k^{FREE})^T + \frac{\Delta g_k^{FREE}\,(\Delta g_k^{FREE})^T}{(\Delta g_k^{FREE})^T \Delta x_k^{FREE}} + \frac{g_k^{FREE}\,(g_k^{FREE})^T}{(g_k^{FREE})^T d_k^{FREE}}.$$

We use the aforementioned C1 and C2 algorithms to perform the updates.

8. Set $k=k+1$. Go to 2.
Appendix C
Fun with Integrals
This Appendix is about the detailed derivation of the final equations for both TLD
and TLDSimple in Chapter 5.
For the TLD method, the integral to evaluate for attribute $k$ of bag $j$ is

$$B_{jk} = \int_0^{+\infty}\!\!\int_{-\infty}^{+\infty} (2\pi\sigma_k^2)^{-n_j/2}\, \exp\Big[-\frac{S_{jk}^2 + n_j(\bar{x}_{jk} - \mu_k)^2}{2\sigma_k^2}\Big]\; \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+3}{2}}}{2^{b_k/2}\,\sqrt{2\pi w_k}\;\Gamma(b_k/2)}\, \exp\Big[-\frac{a_k + \frac{(\mu_k - m_k)^2}{w_k}}{2\sigma_k^2}\Big]\, d\mu_k\, d\sigma_k^2$$

$$= \int_0^{+\infty}\!\!\int_{-\infty}^{+\infty} \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+n_j+3}{2}}\,\exp(-\frac{a_k}{2\sigma_k^2})}{(2\pi)^{\frac{n_j+1}{2}}\,2^{b_k/2}\,\sqrt{w_k}\;\Gamma(b_k/2)}\, \exp\Big[-\frac{w_k S_{jk}^2 + n_j w_k(\bar{x}_{jk} - \mu_k)^2 + (\mu_k - m_k)^2}{2 w_k \sigma_k^2}\Big]\, d\mu_k\, d\sigma_k^2.$$

Completing the square in $\mu_k$ gives

$$n_j w_k(\bar{x}_{jk} - \mu_k)^2 + (\mu_k - m_k)^2 = (1 + n_j w_k)\Big[\mu_k - \frac{n_j w_k \bar{x}_{jk} + m_k}{1 + n_j w_k}\Big]^2 + \frac{w_k n_j(\bar{x}_{jk} - m_k)^2}{1 + n_j w_k},$$
so that

$$B_{jk} = \int_0^{+\infty}\!\!\int_{-\infty}^{+\infty} \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+n_j+3}{2}}\,\exp(-\frac{a_k}{2\sigma_k^2})}{(2\pi)^{\frac{n_j+1}{2}}\,2^{b_k/2}\,\sqrt{w_k}\;\Gamma(b_k/2)}\, \exp\Big[-\frac{1}{2\sigma_k^2}\cdot\frac{n_j(\bar{x}_{jk} - m_k)^2 + S_{jk}^2(1 + n_j w_k)}{1 + n_j w_k}\Big]\, \exp\Big[-\frac{(\mu_k - M_k)^2}{2V_k}\Big]\, d\mu_k\, d\sigma_k^2,$$

where $M_k = \frac{n_j w_k \bar{x}_{jk} + m_k}{1 + n_j w_k}$ and $V_k = \frac{w_k \sigma_k^2}{1 + n_j w_k}$. Using the identity

$$\int_{-\infty}^{+\infty} (2\pi V_k)^{-\frac{1}{2}}\, \exp\Big[-\frac{(\mu_k - M_k)^2}{2V_k}\Big]\, d\mu_k = 1,$$

we integrate out $\mu_k$:

$$B_{jk} = \int_0^{+\infty} \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+n_j+3}{2}}\,\exp(-\frac{a_k}{2\sigma_k^2})}{(2\pi)^{\frac{n_j}{2}}\,2^{b_k/2}\,\sqrt{w_k}\;\Gamma(b_k/2)}\, \sqrt{\frac{w_k \sigma_k^2}{1 + n_j w_k}}\; \exp\Big[-\frac{1}{2\sigma_k^2}\cdot\frac{n_j(\bar{x}_{jk} - m_k)^2 + S_{jk}^2(1 + n_j w_k)}{1 + n_j w_k}\Big]\, d\sigma_k^2.$$
Now define

$$\rho = \frac{(1 + n_j w_k)(a_k + S_{jk}^2) + n_j(\bar{x}_{jk} - m_k)^2}{a_k(1 + n_j w_k)}$$

and set $y = \frac{\rho\, a_k}{2\sigma_k^2}$, so that the two exponentials combine into $\exp(-y)$. Because $\sigma_k^2 = \frac{\rho a_k}{2y}$ implies $d\sigma_k^2 = -\frac{\rho a_k}{2y^2}\, dy$, we set $\gamma = \frac{b_k + n_j}{2}$ and, re-arranging again, get

$$B_{jk} = -\int_{+\infty}^{0} \frac{a_k^{b_k/2}\, 2^{n_j/2}\,(\rho a_k)^{-\gamma}}{(2\pi)^{n_j/2}\,\sqrt{1 + n_j w_k}\;\Gamma(b_k/2)}\; y^{\gamma-1}\, \exp(-y)\, dy = \frac{a_k^{b_k/2}\, 2^{n_j/2}\,(\rho a_k)^{-\gamma}}{(2\pi)^{n_j/2}\,\sqrt{1 + n_j w_k}\;\Gamma(b_k/2)} \int_0^{+\infty} y^{\gamma-1}\, \exp(-y)\, dy.$$

Since $\int_0^{+\infty} y^{\gamma-1}\exp(-y)\, dy = \Gamma(\gamma)$, substituting $\rho$ back yields the closed form

$$B_{jk} = \frac{a_k^{b_k/2}\,(1 + n_j w_k)^{\frac{b_k + n_j - 1}{2}}\;\Gamma\big(\frac{b_k + n_j}{2}\big)}{\pi^{n_j/2}\;\Gamma\big(\frac{b_k}{2}\big)\;\big[(1 + n_j w_k)(a_k + S_{jk}^2) + n_j(\bar{x}_{jk} - m_k)^2\big]^{\frac{b_k + n_j}{2}}}.$$
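For numerical stability one would evaluate this closed form on the log scale. A sketch (lgamma stands for any log-Gamma routine, e.g. one based on the series in Chapter 7; it is not a built-in Java method):

    static double logBjk(double xbar, double S2, int n,
                         double a, double b, double w, double m) {
        double c = 1 + n * w;                                   // 1 + n_j * w_k
        double denom = c * (a + S2) + n * (xbar - m) * (xbar - m);
        return 0.5 * b * Math.log(a)
             + 0.5 * (b + n - 1) * Math.log(c)
             + lgamma(0.5 * (b + n)) - lgamma(0.5 * b)
             - 0.5 * n * Math.log(Math.PI)
             - 0.5 * (b + n) * Math.log(denom);
    }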
For TLDSimple, where $\sigma_k^2$ is treated as fixed and only the bag mean $\bar{x}_{jk}$ is modeled, the corresponding integral is

$$B_{jk} = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\frac{\sigma_k^2}{n_j}}}\, \exp\Big[-\frac{(\bar{x}_{jk} - \mu_k)^2}{2\,\frac{\sigma_k^2}{n_j}}\Big]\; \frac{1}{\sqrt{2\pi w_k}}\, \exp\Big[-\frac{(\mu_k - m_k)^2}{2w_k}\Big]\, d\mu_k$$

$$= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi V_k}}\, \exp\Big[-\frac{(\mu_k - M_k)^2}{2V_k}\Big]\; \Big[2\pi\,\frac{w_k n_j + \sigma_k^2}{n_j}\Big]^{-\frac{1}{2}}\, \exp\Big[-\frac{n_j(\bar{x}_{jk} - m_k)^2}{2(w_k n_j + \sigma_k^2)}\Big]\, d\mu_k,$$

where now $M_k = \frac{n_j w_k \bar{x}_{jk} + \sigma_k^2 m_k}{\sigma_k^2 + n_j w_k}$ and $V_k = \frac{w_k \sigma_k^2}{\sigma_k^2 + n_j w_k}$. With the identity

$$\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi V_k}}\, \exp\Big[-\frac{(\mu_k - M_k)^2}{2V_k}\Big]\, d\mu_k = 1,$$

we obtain

$$B_{jk} = \Big[2\pi\,\frac{w_k n_j + \sigma_k^2}{n_j}\Big]^{-\frac{1}{2}}\, \exp\Big[-\frac{n_j(\bar{x}_{jk} - m_k)^2}{2(w_k n_j + \sigma_k^2)}\Big].$$
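The TLDSimple form is even simpler to evaluate; on the log scale (a sketch, with illustrative names):

    static double logBjkSimple(double xbar, int n,
                               double m, double w, double sigmaSq) {
        double v = w * n + sigmaSq;                  // w_k * n_j + sigma_k^2
        return -0.5 * Math.log(2 * Math.PI * v / n)
               - n * (xbar - m) * (xbar - m) / (2 * v);
    }

This is just the log-density of the bag mean under $N(m_k,\, w_k + \sigma_k^2/n_j)$, which is what the derivation reduces to.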
Appendix D
Comments on EM-DD
EM-DD [Zhang and Goldman, 2002] was proposed to overcome the difficulty of
the optimization problem required to find the maximum likelihood estimate (MLE)
of the instance-level class probability parameters in Diverse Density (DD) [Maron,
1998]. Note that there are two approximate generative models proposed in DD to
construct the bag-level class probability from the instance-level ones: the noisy-or and the most-likely-cause model. In the noisy-or model, there is no difficulty
in optimizing the objective (log-likelihood) function while in the most-likely-cause
model, the objective function is not differentiable because of the maximum functions involved. DD used the softmax function to approximate the maximum function in order to facilitate gradient-based optimization, which is a standard way to
solve non-differentiable optimization problems. EM-DD, on the other hand, claims
to use an application of the EM algorithm [Dempster et al., 1977] to circumvent
the difficulty. Therefore EM-DD provides no improvement on the modeling process in DD, only on the optimization process. Since DD uses the standard way
to treat non-differentiable optimization problems [Lemarechal, 1989] and was supposed to find the MLE, why can EM-DD be such an improvement? Besides, we
have not found any case in the literature that EM can simplify a non-differentiable
log-likelihood function in conjunction with a gradient-based optimization method
(i.e. a Newton-type method) in the M-step. Can the standard EM algorithm be
applied to a non-differentiable log-likelihood function? These questions lead us to
be skeptical about the validity of EM-DD.
In the most-likely-cause model, the log-likelihood function is

$$L = \sum_{1\le i\le n}\log\Big(\max_b Pr(Y=1|x_{ib})\Big) + \sum_{1\le j\le m}\log\Big(1 - \max_b Pr(Y=1|x_{jb})\Big) \qquad (D.1)$$

because we select one instance from each bag (the most-likely-cause instance) to construct the log-likelihood. In the process of searching for the MLE $\hat{\theta}$, when the parameter value
changes, the most-likely-cause instance to be selected may also change. Thus,
the log-likelihood function may suddenly change forms when the parameter value
changes. More specifically, if we arbitrarily pick up one instance from each bag and
construct the log-likelihood function, we have one possible log-likelihood function
we call it one component function. If we change the instance in one bag,
we obtain another, different (unless the changed instance is identical to the old
one) component function. Obviously, if there are s1 instances in the 1st bag, s2
instances in the 2nd bag, ..., and $s_{m+n}$ instances in the $(m+n)$th bag (where $m$ and $n$ are the numbers of positive and negative bags respectively), there are $s_1 \times s_2 \times \cdots \times s_{m+n}$ possible component functions. And the true log-likelihood function is constructed using some of these component functions.¹

[Figure D.2: An illustrative example of the log-likelihood function in DD using the most-likely-cause model, plotting the log-likelihood against the parameter value.]

When the parameter value
$\theta$ falls into one range in the domain of $\theta$, the log-likelihood function is in the form of a certain component function. And if it falls into another range, then it becomes another component function. Although the true log-likelihood function changes its form for different domains of $\theta$, it is continuous at the point where the form of the function changes from one component function to another, because of the Euclidean distance ($L_2$ norm) used in the radial (or Gaussian-like) formulation of $Pr(Y|X)$.
¹ Note that not all of the component functions are used, because some instances (from different bags) will never be picked up simultaneously for any value of $\theta$.
If we ignore the (small) local maxima in each of the component functions, we can illustrate (a part of) the true likelihood function using curves like those in Figure D.2. Note that this figure does not rigorously describe the situation of maximizing the log-likelihood in Equation D.1; it only serves as an illustration. However, it does give us an idea of when EM-DD can work and when it cannot. Although some details may not be accurate (like the coordinates or the exact shape of the component functions), these factors do not affect the key ideas of the illustration.
There are three component functions in the plot, denoted by 1, 2 and 3. The dotted lines are the parts of the component functions not used in the true log-likelihood function. The solid line plots the true function. Again we only plot, in one dimension, the point parameter against the log-likelihood. The shapes of the component functions are different because we also incorporate a fixed value of the scaling parameter for each one of the component functions (i.e. different component functions have different scaling parameter values). Therefore, we simplify the optimization procedure as a steepest descent method where we first fix the scaling parameter and search for the best point parameter, then fix the point parameter and search for the optimal scaling parameter. In Figure D.2, we do not show how to search for the value of the scaling parameter. We assume that we are given the
Recall that selecting the instance with the maximal $Pr(Y = 1|X, f)$ (i.e. the minimal value of $Pr(Y = 0|X, f)$) in a negative bag has to result in a smaller log-likelihood value than selecting other instances in the same bag. Therefore the true log-likelihood function is often not the one with the greatest value among all the component functions, as illustrated in Figure D.2, as long as there is more than one instance in each negative bag.
In this example, the true function shifts between the components twice, and the
shifting points are indicated by small triangles and X and Y respectively. At
both points the log-likelihood function is still increasing but the newly-shifted component function value is less than the value of the old component function. Shifting
between components means that we should change the instances in each bag that
construct the log-likelihood function. In the new component function, a new set
of instances, one from each bag, are picked up. Obviously this true function is
non-differentiable at the two points when it shifts components, but it has a local
maximum at point D.
EM-DD starts from some initial value of the parameter, say $\theta^0$, and computes the initial log-likelihood

$$L^0 = \sum_{1\le i\le n}\log\Big(\max_b Pr(Y=1|x_{ib}, \theta^0)\Big) + \sum_{1\le j\le m}\log\Big(1 - \max_b Pr(Y=1|x_{jb}, \theta^0)\Big).$$
Then it cycles between the E-step and M-step as follows until convergence (suppose this is the $p$th iteration):

E-step: find the instance $x_i^*$ in each bag that has the greatest $Pr(Y=1|x, \theta^p)$.

M-step: set

$$\theta^{p+1} = \arg\max_\theta \Big\{\sum_{1\le i\le n}\log Pr(Y=1|x_i^*, \theta) + \sum_{1\le j\le m}\log\big(1 - Pr(Y=1|x_j^*, \theta)\big)\Big\},$$

where there are $n$ positive bags and $m$ negative bags. And then compute

$$L^{p+1} = \sum_{1\le i\le n}\log\Big(\max_b Pr(Y=1|x_{ib}, \theta^{p+1})\Big) + \sum_{1\le j\le m}\log\Big(1 - \max_b Pr(Y=1|x_{jb}, \theta^{p+1})\Big).$$
The convergence test is performed before each E-step (or after each M-step) via the comparison of the values of the log-likelihood function $L^p$ and $L^{p+1}$.
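In outline, the whole procedure looks as follows (pure pseudocode in Java syntax; selectMostLikely, maximize and trueLogLikelihood are hypothetical helpers standing in for the actual DD machinery):

    double L = trueLogLikelihood(theta, bags);               // L^0
    while (true) {
        Instance[] picked = selectMostLikely(theta, bags);   // E-step
        double[] thetaNew = maximize(picked);                // M-step
        double LNew = trueLogLikelihood(thetaNew, bags);     // L^{p+1}
        if (LNew <= L) break;       // the convergence test discussed below
        theta = thetaNew;
        L = LNew;
    }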
To give an illustration, we first apply the above algorithm to the artificial example shown in Figure D.2 and see what it could find given A as the start value for $\theta$. This example is deliberately set up so that we may see both the case in which EM-DD can work and the case in which it cannot. Starting from A, in the first M-step EM-DD finds point B as the maximum of Component 3 (assuming it also finds the optimal scaling parameters in the M-step, which determine the shape of the next component in Figure D.2). When it computes the log-likelihood $L^1$, it will necessarily return to the true log-likelihood function. In other words, it picks up the new set of instances according to $\theta^1$ (the X-axis coordinate of B/B′). As a result $L^1$ is the value of the true log-likelihood function at point B′.
In the second iteration, EM-DD first compares $L^1$ and $L^0$, i.e. the function values at A and B′. In this case, it indeed finds a better value of the parameter, so it continues. In the E-step it picks an instance in each bag according to $\theta^1$. The corresponding new component function is Component 1. In the M-step it will find the maximum of Component 1 at point C. Then it calculates $L^2$ as the true function value. This is actually point C′ on Component 2.
In the third iteration, EM-DD first compares $L^2$ and $L^1$, i.e. the function values at B′ and C′. However, this time it finds that it cannot increase the function value, so the algorithm stops and point B′ is returned as the solution, i.e. EM-DD will not be able to find point D, which is the true maximum of the log-likelihood function. If it kept searching, it would have found D. Nonetheless, to do this, it must break the convergence test, which is a crucial part of the proof of convergence for EM-DD [Zhang and Goldman, 2002]. Without this convergence test EM-DD is not
guaranteed to converge. In the standard EM, since the E-step also increases the log-likelihood, the convergence test only needs to check whether $L^p$ has stopped increasing.
The reason why the proof used in the standard EM algorithm does not apply to EM-DD is the special property of the unobservable data. In EM-DD, the unobservable data is $z_i$, an index for the $i$th bag that indicates which instance should be used in the log-likelihood function. This variable $z_i$ is not quite unobservable in this case because for each value of $\theta$, it is fixed (i.e. no probability distribution is needed) and observable in the data, although for different values of $\theta$ its value also changes. Therefore, if one insists on regarding it as a latent variable in EM, then given a certain parameter value $\theta^p$, the probability of $z_i$ is

$$Pr(z_i|\theta^p) = \begin{cases} 1 & \text{if } z_i = \arg\max_b Pr(Y=1|x_{ib}, \theta^p) \\ 0 & \text{otherwise.} \end{cases}$$

This probability function is very unusual, and still involves a max function, which is not smooth, and thus the proof in EM cannot apply to the expected log-likelihood function in the E-step in EM-DD.
As a matter of fact, as shown in Section D.2, in the E-step of EM-DD, the log-likelihood is very likely to decrease, in which case the algorithm has to stop.
Bibliography
Ahrens, J. and Dieter, U. [1974]. Computer methods for sampling from Gamma, Beta, Poisson and Binomial distributions. Computing, (12), 223–246.

Ahrens, J., Kohrt, K. and Dieter, U. [1983]. Algorithm 599: sampling from Gamma and Poisson distributions. ACM Transactions on Mathematical Software, 9(2), 255–257.

Artin, E. [1964]. The Gamma Function. New York, NY: Holt, Rinehart and Winston. Translated by M. Butler.

Auer, P. [1997]. On learning from multiple instance examples: empirical evaluation of a theoretical approach. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 21–29). San Francisco, CA: Morgan Kaufmann.

Bilmes, J. [1997]. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of Berkeley.

Blake, C. and Merz, C. [1998]. UCI repository of machine learning databases.

Blum, A. and Kalai, A. [1998]. A note on learning from multiple-instance examples. Machine Learning, 30(1), 23–30.

Bratley, P., Fox, B. and Schrage, L. [1983]. A Guide to Simulation. New York, NY: Springer-Verlag.

Breiman, L. [1996]. Bagging predictors. Machine Learning, 24(2), 123–140.

le Cessie, S. and van Houwelingen, J. [1992]. Ridge estimators in logistic regression. Applied Statistics, 41(1), 191–201.

Chevaleyre, Y. and Zucker, J.-D. [2000]. Solving multiple-instance and multiple-part learning problems with decision trees and decision rules. Application to the mutagenesis problem. Internal Report, University of Paris 6.

Chevaleyre, Y. and Zucker, J.-D. [2001]. A framework for learning rules from multiple instance data. In Proceedings of the Twelfth European Conference on Machine Learning (pp. 49–60). Berlin: Springer-Verlag.

Frank, E. and Witten, I. [1998]. Generating accurate rule sets without global optimization. In Proceedings of the Fifteenth International Conference on Machine Learning (pp. 144–151). San Francisco, CA: Morgan Kaufmann.

Frank, E. and Witten, I. [1999]. Making better use of global discretization. In Proceedings of the Sixteenth International Conference on Machine Learning (pp. 115–123). San Francisco, CA: Morgan Kaufmann.

Frank, E. and Xu, X. [2003]. Applying propositional learning algorithms to multi-instance data. Working Paper 06/03, Department of Computer Science, University of Waikato, New Zealand.
Freund, Y. and Schapire, R. [1996]. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (pp. 148–156). San Francisco, CA: Morgan Kaufmann.

Friedman, J., Hastie, T. and Tibshirani, R. [2000]. Additive logistic regression, a statistical view of boosting (with discussion). Annals of Statistics, 28, 307–337.

Gärtner, T., Flach, P., Kowalczyk, A. and Smola, A. [2002]. Multi-instance kernels. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 179–186). San Francisco, CA: Morgan Kaufmann.

Gill, P., Golub, G., Murray, W. and Saunders, M. [1974]. Methods for modifying matrix factorizations. Mathematics of Computation, 28(126), 505–535.

Gill, P. and Murray, W. [1976]. Minimization subject to bounds on the variables. Technical Report NPL Report NAC-72, National Physical Laboratory.

Gill, P., Murray, W. and Wright, M. [1981]. Practical Optimization. London: Academic Press.

Goldfarb, D. [1976]. Factorized variable metric methods for unconstrained optimization. Mathematics of Computation, 30(136), 796–811.

Hastie, T., Tibshirani, R. and Friedman, J. [2001]. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer-Verlag.

John, G. and Langley, P. [1995]. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 338–345). San Mateo, CA: Morgan Kaufmann.

Lemaréchal, C. [1989]. Nondifferentiable optimization. In Nemhauser, Rinnooy Kan and Todd (Eds.), Optimization, Volume 1 of Handbooks in Operations Research and Management Science, chapter VII (pp. 529–569). Amsterdam: North-Holland.

Long, P. and Tan, L. [1998]. PAC learning axis-aligned rectangles with respect to product distribution from multiple-instance examples. Machine Learning, 30(1), 7–21.

Maritz, J. and Lwin, T. [1989]. Empirical Bayes Methods (2 Ed.). London: Chapman and Hall.

Maron, O. [1998]. Learning from Ambiguity. PhD thesis, Massachusetts Institute of Technology, United States.

Maron, O. and Lozano-Perez, T. [1998]. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems, 10 (pp. 570–576). Cambridge, MA: MIT Press.

McLachlan, G. [1992]. Discriminant Analysis and Statistical Pattern Recognition. New York, NY: John Wiley & Sons, Inc.

McLachlan, G. and Krishnan, T. [1996]. The EM Algorithm and Extensions. New York, NY: John Wiley & Sons, Inc.

Minh, D. [1988]. Generating Gamma variates. ACM Transactions on Mathematical Software, 4(3), 261–266.
von Mises, R. [1943]. On the correct use of Bayes' formula. The Annals of Mathematical Statistics, 13, 156–165.

Nadeau, C. and Bengio, Y. [1999]. Inference for the generalization error. In Advances in Neural Information Processing Systems, Volume 12 (pp. 307–313). Cambridge, MA: MIT Press.

O'Hagan, A. [1994]. Bayesian Inference, Volume 2B of Kendall's Advanced Theory of Statistics. London: Edward Arnold.

Platt, J. [1998]. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges and A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press.

Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. [1992]. Numerical Recipes in C: The Art of Scientific Computing (2 Ed.). Cambridge, England: Cambridge University Press.

Quinlan, J. [1993]. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Ramon, J. and Raedt, L. D. [2000]. Multi instance neural networks. In Attribute-Value and Relational Learning: Crossing the Boundaries. Workshop at the Seventeenth International Conference on Machine Learning.

Ruffo, G. [2001]. Learning Single and Multiple Instance Decision Trees for Computer Security Applications. PhD thesis, Università di Torino, Italy.

Srinivasan, A., Muggleton, S., King, R. and Sternberg, M. [1994]. Mutagenesis: ILP experiments in a non-determinate biological domain. In Proceedings of the Fourth International Inductive Logic Programming Workshop (pp. 161–174).

Stuart, A., Ord, J. and Arnold, S. [1999]. Classical Inference and the Linear Model, Volume 2A of Kendall's Advanced Theory of Statistics. London: Arnold.
Vapnik, V. [2000]. The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.

Wang, J. and Zucker, J.-D. [2000]. Solving the multiple-instance problem: a lazy learning approach. In Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1119–1134). San Francisco, CA: Morgan Kaufmann.

Wang, Y. and Witten, I. [2002]. Modeling for optimal probability prediction. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 650–657). San Francisco, CA: Morgan Kaufmann.

Weidmann, N. [2003]. Two-level classification for generalized multi-instance data. Master's thesis, Albert-Ludwigs-Universität Freiburg, Germany.

Weidmann, N., Frank, E. and Pfahringer, B. [2003]. A two-level learning method for generalized multi-instance problems. In Proceedings of the Fourteenth European Conference on Machine Learning. To be published.

Witten, I. and Frank, E. [1999]. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, CA: Morgan Kaufmann.

Zhang, Q. and Goldman, S. [2002]. EM-DD: An improved multiple-instance learning technique. In Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference (pp. 1073–1080). Cambridge, MA: MIT Press.

Zhang, Q., Goldman, S., Yu, W. and Fritts, J. [2002]. Content-based image retrieval using multiple-instance learning. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 682–689). San Francisco, CA: Morgan Kaufmann.