
Statistical Learning

in Multiple Instance Problems


Xin Xu
A thesis submitted in partial fulfilment of
the requirements for the degree of
Master of Science
at the
University of Waikato

Department of Computer Science

Hamilton, New Zealand

June 2003

© 2003 Xin Xu

Abstract

Multiple instance (MI) learning is a relatively new topic in machine learning. It


is concerned with supervised learning but differs from normal supervised learning
in two points: (1) it has multiple instances in an example (and there is only one
instance in an example in standard supervised learning), and (2) only one class
label is observable for all the instances in an example (whereas each instance has its
own class label in normal supervised learning). In MI learning there is a common assumption regarding the relationship between the class label of an example and the unobservable class labels of the instances inside it. This assumption, which is called the MI assumption in this thesis, states that "an example is positive if at least one of its instances is positive, and negative otherwise."
In this thesis, we first categorize current MI methods into a new framework. According to our analysis, there are two main categories of MI methods: instance-based and metadata-based approaches. Then we propose a new assumption for MI learning, called the collective assumption. Although this assumption has been used in some previous MI methods, it has never been explicitly stated,¹ and this is the first time that it is formally specified. Using this new assumption we develop new algorithms, more specifically two instance-based methods and one metadata-based method. All of these methods build probabilistic models and thus implement statistical learning algorithms. The exact generative models underlying these methods are explicitly stated and illustrated so that one may clearly understand the situations to which they can best be applied. The empirical results presented in this thesis show that they are competitive on standard benchmark datasets. Finally, we explore some practical applications of MI learning, both existing and new ones.

This thesis makes three contributions: a new framework for MI learning, new MI methods based on this framework, and experimental results for new applications of MI learning.

¹ As a matter of fact, for some of these methods, it is actually claimed that they use the standard MI assumption stated above.


Acknowledgements

There are a number of people I want to thank for their help with my thesis.
First of all, my supervisor, Dr. Eibe Frank. I do not really know what I can say to
express my gratitude. He contributed virtually all the ideas involved in this thesis.
What is more important, he always provided me with his support for my work.
When I was puzzled by a derivation or analysis of an algorithm, he was always the
one to help me sort it out. His reviews of this thesis were always so detailed that any kind of mistake could be spotted, from errors in logic to grammar errors. He even helped me with practical work and tools like LaTeX, gnuplot, shell scripts and so on. He absolutely provided me with much more than a supervisor usually does.
This thesis is dedicated to my supervisor, Dr. Eibe Frank.
Secondly, I would like to thank my project-mate, Nils Weidmann, a lot. I feel
so lucky to have worked in the same research area as Nils. He shared with me
many great ideas and provided me with much of his work, including datasets, which
made my job much more convenient (and made me lazier). In fact Chapter 6
is the result of joint work with Nils Weidmann. He constructed the Mutagenesis datasets using the MI setting and in an ARFF file format [Witten and Frank,
1999] that allowed me to easily apply the MI methods developed in this thesis to
them. He also kindly provided me with the photo files from the online photo library
www.photonewzealand.com and the resulting MI datasets. Some of the experimental results in Chapter 6 are taken from his work on MI learning [Weidmann,
2003]. But he definitely helped me much more than that. When I met any difficulties during the write-up of my thesis, Nils was usually the first one I asked for help.
Thirdly, many thanks to Dr. Yong Wang. It was he who first introduced me to empirical Bayes methods. He also spent so much precious time selflessly sharing
with me his statistical knowledge and ideas, which inspired and benefited my study
and research a lot. I am really grateful for this great help.
As a matter of fact, the Machine Learning (ML) family at the Computer Science
Department here at the University of Waikato provided me with such a superb environment for study and research. More precisely, I am grateful to: our group leader
Prof. Geoff Holmes, Prof. Ian Witten, Dr. Bernhard Pfahringer, Dr. Mark Hall, Gabi
Schmidberger, Richard Kirkby (especially for helping me survive with WEKA). I
once tried to name everybody working in the ML lab when I kicked off my project; at that moment we had only three people in the lab: Gabi, Richard and myself. Now there are so many people in the lab, which has made me eventually give up that attempt. But anyway, I would like to thank everyone in the ML lab for her/his help or concerns regarding my work.
Help also came from outside the Computer Science Department. More specifically,
I am thankful to Dr. Bill Bolstad and Dr. Ray Littler of the Statistics Department. Apart from the lectures they provided to me,² they also kindly answered lots of my
(perhaps stupid) questions about Statistics.
As for the experiment and development environment used in this thesis, I heavily
relied on the WEKA workbench [Witten and Frank, 1999] to build the MI package.
My work would have been impossible without WEKA. As for the datasets, I would
like to thank Dr. Alistair Mowat for providing the kiwifruit datasets. I also want to
acknowledge the online photo gallery www.photonewzealand.com for their
(indirect) provision of a photo dataset through Nils.
Finally, thanks to my family for their consistent support of my project and research. In fact their question of "haven't you finished your thesis yet?" has always been my motivation to push the progress of the thesis.

² I always regarded the comments from Ray in the weekly ML discussions as lectures to me.
My research was funded by a postgraduate study award as part of Marsden Grant
01-UOW-019, from the Royal Society of New Zealand, and I am very grateful for
this support.


Contents

Abstract

Acknowledgements

List of Figures

List of Tables

1 Introduction
   1.1 Some Example Problems
   1.2 Motivation and Objectives
   1.3 Structure of this Thesis

2 Background
   2.1 Multiple Instance Problems
   2.2 Current Solutions
       2.2.1 1997-2000
       2.2.2 2000-Now
   2.3 A New Framework
   2.4 Methodology
   2.5 Some Notation and Terminology

3 A Heuristic Solution for Multiple Instance Problems
   3.1 Assumptions
   3.2 An Artificial Example Domain
   3.3 The Wrapper Heuristic
   3.4 Interpretation
   3.5 Conclusions

4 Upgrading Single-instance Learners
   4.1 Introduction
   4.2 The Underlying Generative Model
   4.3 An Assumption-based Upgrade
   4.4 Property Analysis and Regularization Techniques
   4.5 Experimental Results
   4.6 Related Work
   4.7 Conclusions

5 Learning with Two-Level Distributions
   5.1 Introduction
   5.2 The TLD Approach
   5.3 The Underlying Generative Model
   5.4 Relationship to Single-instance Learners
   5.5 Experimental Results
   5.6 Related Work
   5.7 Conclusions

6 Applications and Experiments
   6.1 Drug Activity Prediction
       6.1.1 The Musk Prediction Problem
       6.1.2 The Mutagenicity Prediction Problem
   6.2 Fruit Disease Prediction
   6.3 Image Categorization
   6.4 Conclusion

7 Algorithmic Details
   7.1 Numeric Optimization
   7.2 Artificial Data Generation and Analysis
   7.3 Feature Selection in the Musk Problems
   7.4 Algorithmic Details of TLD
   7.5 Algorithmic Analysis of DD

8 Conclusions and Future Work
   8.1 Conclusions
   8.2 Future Work

Appendices

A Java Classes for MI Learning
   A.1 The MI Package
   A.2 The MI.data Package
   A.3 The MI.visualize Package

B weka.core.Optimization

C Fun with Integrals
   C.1 Integration in TLD
   C.2 Integration in TLDSimple

D Comments on EM-DD
   D.1 The Log-likelihood Function
   D.2 The EM-DD Algorithm
   D.3 Theoretical Considerations

Bibliography

List of Figures

2.1 Data generation for single-instance and multiple-instance learning.
2.2 A framework for MI Learning.
3.1 An artificial dataset with 20 bags.
3.2 Parameter estimation of the wrapper method.
3.3 Test errors on artificial data of the MI wrapper method.
4.1 An artificial dataset with 20 bags.
4.2 Parameter estimation of the MILogisticRegressionGEOM on the artificial data.
4.3 Test error of MILogisticRegressionGEOM and the MI AdaBoost algorithm on the artificial data.
4.4 Error of the MI AdaBoost algorithm on the Musk1 data.
5.1 An artificial simplified TLD dataset with 20 bags.
5.2 Estimated parameters using the TLDSimple method.
5.3 Estimated parameters using the TLD method.
5.4 An illustration of the rationale for the empirical cut-point technique.
6.1 Accuracies achieved by MI algorithms on the Musk datasets.
6.2 A positive photo example for the concept of "mountains and blue sky".
6.3 A negative photo example for the concept of "mountains and blue sky".
7.1 Sampling from a normalized part-triangle distribution.
7.2 Relative feature importance in the Musk datasets.
7.3 Log-likelihood function expressed via parameters a and b.
A.1 A schematic description of the MI package.
D.1 A possible component function in one dimension.
D.2 An illustrative example of the log-likelihood function in DD using the most-likely-cause model.

List of Tables

3.1 Performance of the wrapper method on the Musk datasets.
4.1 The upgraded MI AdaBoost algorithm.
4.2 Properties of the Musk 1 and Musk 2 datasets.
4.3 Performance of some instance-based MI methods on the Musk datasets.
5.1 Performance of different versions of naive Bayes on some two-class datasets.
5.2 Performance of some metadata-based MI methods on the Musk datasets.
6.1 Properties of the Mutagenesis datasets.
6.2 Error rate estimates for the Mutagenesis datasets and standard deviations (if available).
6.3 Properties of the Kiwifruit datasets.
6.4 Accuracy estimates for the Kiwifruit datasets and standard deviations.
6.5 Properties of the Photo dataset.
6.6 Accuracy estimates for the Photo dataset and standard deviations.


Chapter 1
Introduction

Multiple instance (MI) learning has been a popular research topic in machine learning since it first appeared seven years ago in the pioneering work of Dietterich et al. [Dietterich, Lathrop and Lozano-Perez, 1997]. One of the reasons that it attracts so many researchers is perhaps the exotic conceptual setting that it presents. Most of machine learning follows the general rationale of "learning by examples" and MI learning is no exception. But unlike the single-instance learning problem, which describes an example using one instance, the MI problem describes an example with multiple instances. However, there is still only one class label for each
example. At this stage we try to avoid any special notation or terminology and
simply give some rough ideas of the MI learning problem using real-world examples. Whilst some practical applications of MI learning will be discussed in detail
in Chapter 6, we briefly describe them here to give some flavor of how MI problems
were identified.

1.1 Some Example Problems


The need for multiple-instance learning arises naturally in several practical learning problems, for instance, the drug activity prediction and fruit disease prediction
problems.


The drug activity prediction problem was first described in [Dietterich et al., 1997]. The potency of a drug is determined by the degree to which its molecules bind to a larger, target molecule. It is believed that the binding strength of a drug molecule is largely determined by its shape (or conformation). Unfortunately one molecule can have multiple shapes, obtained by rotating some of its internal bonds, and it is generally unknown which shape(s) determine the binding. However the potency of a specific molecule, which is either active or inactive, can be observed directly based on past experience. One can recognize this as a good instance of MI learning: each molecule is an example, with each of its shapes as one instance inside it. Thus we have to use multiple instances to represent an example. Dietterich et al. believed that an additional assumption is intuitively reasonable in this problem, namely that if one of the (sampled) shapes of a molecule is active, then the whole molecule is active, and otherwise it is inactive. We call this assumption the "standard MI assumption". Dietterich et
al. also provided two datasets associated with the drug activity prediction problem,
namely, the Musk datasets. The two datasets describe different musk molecules,
with some molecules overlapping between them. These two datasets are the only
publicly available real-world MI datasets up to now and are the standard benchmark
datasets in the MI domain. We use these datasets throughout this thesis.
The fruit disease prediction problem is another natural case for MI learning. When fruits are collected from different orchards, they are normally stored in the warehouse in batches, one batch per orchard. After some time (say, 3 to 5 months, which is the usual transportation duration), disease symptoms are found in some batches. Since the disease may be epidemic, often the whole batch of fruit is infected and exhibits similar symptoms. It would be good to predict which batch is disease-prone before shipment, based on some non-destructive measures taken for each fruit as soon as it is collected from the orchard. The task is to predict, given a new batch of fruit, whether it will be affected by a certain disease or not (some months later). This is another obvious MI problem. Each batch is an example, with every fruit inside it as an instance. In the training data, a batch is labeled "disease-prone" (or positive) if the symptoms are observable after the shipment, and otherwise "disease-free" (or negative). Here one may also think the standard MI assumption can fit because the disease may originate in only one or a few fruits in a batch. However,
even if some fruits indeed exhibit some minor symptoms, the majority of fruits in a
batch may be resistant to the disease, rendering the symptoms for the whole batch
negligible. Hence the batch can be negative even if some fruits are affected to a
small degree.
There are also some real-world problems that are not apparently MI problems.
However with some proper manipulation, one can model them as MI problems and
generate MI datasets for them. The content-based image processing task is one of
the most popular ones [Maron, 1998; Maron and Lozano-Perez, 1998; Zhang, Goldman, Yu and Fritts, 2002], while the stock market prediction task [Maron, 1998;
Maron and Lozano-Perez, 1998] and computer intrusion prediction task [Ruffo,
2001] are also among them.
The key to modeling content-based image processing as an MI problem is that only
some parts of an image account for the key words that describe it. Therefore one
can view each image as an example and small segments of that image as instances. Some measures are taken to describe the pixels of an image. Since there are various ways of fragmenting an image and at least two ways to measure the pixels (using RGB values or using luminance-chrominance values), there could be many configurations of the instances. The class label of an image indicates whether its content is about a certain concept and the task is to predict the class label given a new image. When the title of an image is a single simple concept like "sunset" or "waterfall", the MI assumption is believed to be appropriate because only a small part of a positive image really accounts for its title while no part of a negative image can account for the title. However, for more complicated concepts, it may be necessary to drop this assumption [Weidmann, 2003]. In this thesis, we consider this application and the others in more detail in Chapter 6.


1.2 Motivation and Objectives


In recent years, significant research effort has been put into MI learning and many approaches have been proposed to tackle MI problems. Although some very good results have been reported on the Musk datasets [Dietterich et al., 1997], the data-generating mechanisms in MI problems remain unclear, even for the well-studied Musk datasets. Thus the MI problem itself may need more attention. Moreover, the reasons why some methods perform well and the assumptions they are based on are usually not explicitly explained,¹ which leads to some confusion. Due to the scarcity of MI data, convincing tests and comparisons between MI algorithms are not possible. Hence it is not clear what kinds of MI problems a particular method can deal with. Finally, the relationship between different methods has not been studied enough. Recognizing the essence of and the connections between the various methods can sometimes inspire new solutions with well-founded theoretical justifications. Therefore we believe that the study of MI learning still requires stronger and clearer theoretical interpretation, as well as more practical and artificial datasets for the purpose of evaluation.
Therefore in order to capture the big picture of MI learning, we set up the following
three objectives:

1. To establish a general framework for MI methods.

2. To create some new MI methods within this framework that are strongly justified and make their assumptions explicit.

3. To perform experiments on a collection of real-world datasets.

We aimed to achieve these objectives to the extent we could. Although some objectives may be too big to fully accomplish, it is hoped that this thesis improves the
interpretation and understanding of MI problems.
¹ We also noticed that some methods actually explicitly expressed assumptions that they never used.


1.3 Structure of this Thesis

The rest of the thesis is organized in seven chapters as follows.


Chapter 2 provides some detailed background on MI problems, the standard MI
assumption and published solutions. We also introduce our perspective on the problem and a general framework to analyze current MI methods and create new solutions.
Chapter 3 presents a new MI assumption and a generative model according to the
framework introduced in Chapter 2. Then we propose a heuristic MI learning
method that wraps around normal single-instance methods. We also show that it
is suitable for practical classification problems.
In Chapter 4, we put forward some more specific generative models. Under these
models we formally derive a way to upgrade single-instance learners to deal with
MI data. The rationale for the upgraded algorithms is analogous to that of the
corresponding single-instance learners but based on some MI assumptions. As an
example, we upgrade two popular single-instance learning methods, linear logistic
regression and AdaBoost [Freund and Schapire, 1996], to tackle MI problems.
In Chapter 5, we develop a two-level distribution approach to tackle MI problems.
Again we explicitly state the underlying generative model this approach assumes.
It turns out that this method, in one of its simplest forms, can be regarded as an
approximate upgrade of the naive Bayes [John and Langley, 1995] method to the
MI setting.
The methods developed in Chapters 3, 4 and 5 are all incorporated in the solution
framework introduced in Chapter 2.
Chapter 6 elaborates on the three applications mentioned above and the corresponding datasets. It also presents experimental results of the new methods on these
datasets.


Chapter 7 describes the implementation of all the algorithms in this thesis in more
detail, and also the process that was used for generating the artificial data used
throughout this thesis.
Chapter 8 gives a summary and briefly describes future work. This concludes the
thesis.
There are some more implementation details in the Appendices. These details are
referenced whenever necessary.


Chapter 2
Background

2.1 Multiple Instance Problems


Multiple instance problems first arose in machine learning in a supervised learning
context [Dietterich et al., 1997]. Thus we briefly review supervised learning first.
This extends naturally to multiple instance learning.
Supervised learning, more specifically classification, involves a learning scheme
that takes a set of classified examples from which it is expected to learn a way of
classifying unseen examples [Witten and Frank, 1999]. Typically supervised learning has a training process that takes some examples described by a pre-defined set
of attributes (or a feature vector), and class labels (or responses), one for each example. Attributes can be nominal, ordinal, interval and ratio [Witten and Frank, 1999].
The task for training is to infer a relationship, normally represented as a function,
from the attributes to the class labels. If the class labels are nominal values, the task
is called "classification". It is called "regression" if the class is numeric. Here we are only concerned with the classification task. In this thesis, we only consider two-class classification problems. Nonetheless, it is straightforward to extend the methods to multi-class problems, e.g. using error-correcting output codes [Dietterich and Bakiri, 1995].
MI learning basically adopts the same setting as single-instance supervised learning

Figure 2.1: Data generation for (a) single-instance learning and (b) multiple-instance learning [Dietterich et al., 1997].

described above. It still has examples with attributes and class labels; one example still has only one class label; and the task is still the inference of the relationship between attributes and class labels. The only difference is that every one of the examples is represented by more than one feature vector. If we regard one feature vector as an instance of an example, normal supervised learning has only one instance per example while MI learning has multiple instances per example, hence the name "multiple instance" learning problem.
The difference is best depicted graphically, as shown in Figure 2.1 [Dietterich et al., 1997]. In this figure, the "Object" is an example described by some attributes, the "Result" is the class label and the "Unknown Process" is the relationship. Figure 2.1(a) depicts the case of normal supervised learning while in 2.1(b) there are multiple instances in an example. We interpret the "Unknown Process" in Figure 2.1(b) as different from that in Figure 2.1(a) because the input is different. Note that the dashed and solid arrows representing the input of the process in (b) imply
that only some of the input instances may be useful. Therefore while the "Unknown Process" in (a) is simply a classification problem, the "Unknown Process" in (b) is commonly viewed as a two-step process, with a first step consisting of a classification problem and a second step that is a selection process based on the first step and some assumptions.¹ However, as we shall see, there are methods that do not even consider the instance selection step and directly infer the output from the interactions between the input instances. But the assumption of a selection process has played such an influential role in the MI learning domain that virtually every paper in this area quotes it. We refer to this standard MI assumption simply as "the MI assumption" throughout this thesis for brevity. Let us briefly review what this assumption entails.

¹ We may like to call the selection process a "parametric" instance selection process because it is based on the model built in the classification process.
Within a two-class context, with class labels {positive, negative}, the MI assumption states that an example is positive if at least one of its instances is positive, and negative if all of its instances are negative. Note that this assumption is based on instance-level class labels, thus the "Unknown Process" in Figure 2.1(b) has to consist of two steps: the first step provides the instances' class labels and the MI assumption is applied in the second step. However, we noticed that several MI methods do not actually follow the MI assumption (even though it is generally mentioned), and do not explicitly state which assumptions they use. As a matter of fact, we believe that this assumption may not be so essential for making accurate predictions. What matters is the combination of the model from the first step and the assumption used in the second step. Given a certain type of model for classification, it may be appropriate to explicitly drop the MI assumption and establish other assumptions in the second step. We will discuss this in Section 2.4.
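Stated formally, writing Y for an example's class label and y_i for the (unobservable) class label of its i-th instance (illustrative notation, not yet the formal notation of Section 2.5), the MI assumption is:

    Y = \begin{cases} 1 & \text{if } \exists\, i : y_i = 1 \\ 0 & \text{if } y_i = 0 \text{ for all } i \end{cases}
    \qquad \text{or equivalently} \qquad Y = \max_i y_i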
The two-step paradigm itself is only one possibility to model the MI problem. We noticed that in general, when extending the number of instances of an example from one to many, we may have a potentially very large number of possibilities to model the relationship between the set of instances and their class label. For the instances within an example, there may be ambiguity, redundancy, interactions
and many more properties to exploit. The two-step model and the MI assumption
may be suitable to exploit ambiguity [Maron, 1998] but if we are interested in other
properties of the examples, other models and assumptions may be more convenient
and appropriate. In practice, we may need strong background knowledge to choose
the right way to model the problem a priori. However in most cases we actually
lack such knowledge. Consequently it is necessary that we try a variety of models
or assumptions in order to come up with an accurate representation of the data.

2.2 Current Solutions


As mentioned above, there are a variety of methods that have been developed to
tackle MI problems. Almost all of them are special-purpose algorithms that are either created for the MI setting or upgraded from the normal single-instance learners.
Unfortunately some of them do not provide sufficient explanation of the underlying
assumptions or generative models. Hence the working mechanisms of the methods remain unclear. The algorithms are discussed in a chronological order in the
following section and comments are provided whenever possible. We discuss the
algorithms developed before 2000 and those after 2000 separately. We did so because we observed that the post-2000 methods are based on different assumptions
than the pre-2000 methods.

2.2.1 1997-2000
The first MI algorithm stems from the pioneering paper by Dietterich et al. [1997],
which also introduced the aforementioned Musk datasets. The APR algorithms [Dietterich et al., 1997] modeled the MI problem as a two-step process: a classification process that is applied to every instance and then a selection process based on
the MI assumption. A single Axis-Parallel hyper-Rectangle (APR) is used as the
pattern to be found in the classification process. As a parametric approach,² the objective of these methods is to find the parameters that, together with the MI assumption, can best explain the class labels of all the examples in the training data. In other words, they look for the parameters involved in the first step according to the observed class labels formed in the second step and the assumed mechanism connecting the two steps. There are standard algorithms that can build an APR (i.e. a single if-then rule). Unfortunately they do not take the MI assumption into account. Thus a special-purpose algorithm is needed. Several basic APR algorithms were proposed [Dietterich et al., 1997], but interestingly the best method for the Musk data was not among them. The best APR algorithm for the Musk data consists of an iterative process with two steps: the first step expands an APR from a positive seed instance using a "back-fitting" algorithm, and the second step selects useful features greedily based on some specifically designed measures. It turned out that the APR that best explains the training data does not generalize very well. Hence the objective was changed a little to produce the lowest generalization error, and kernel density estimation (KDE) was used to refine some of the parameters, namely the boundaries of the hyper-rectangle. These tuning steps resulted in an algorithm called "iterated-discrim APR" that gave good results on the Musk datasets.

² In this case, the parameters to be found are the useful features and the bounds of the hyper-rectangle along these features.
It is interesting that Dietterich et al. based all the algorithms on the combination of the APR pattern and the MI assumption and never considered dropping either of them even when difficulties were encountered. This might be due to their background knowledge, which made them believe an APR and the MI assumption are appropriate for the Musk data. However, using KDE may have already introduced a bias into the parameter estimates, which indicates that it might be more appropriate to model the Musk datasets in a different way.
In spite of the above observation, the APR and MI assumption combination dominated the early stage of MI learning. Researchers from computational learning theory played an active role in MI learning before 2000. Some PAC algorithms were developed to look for the APR that Dietterich et al. defined [Long and Tan, 1998; Blum and Kalai, 1998]. They also proved that finding such an APR under the MI assumption is actually an NP-complete problem. Some more practical MI algorithms were also developed in this domain, such as MULTINST [Auer, 1997].
PAC learning theory tends to view classification as a deterministic process. Instances with class labels that violate the assumed concept are usually regarded as noisy. Indeed, coming up with the idea of varying noise rates for the examples and assuming a specific distribution of the number of instances per example, Auer [1997] ended up estimating the examples' misclassification errors and looking for an APR that minimizes this estimated error. Note that since MULTINST tries to calculate the expected number of instances per bag that fall into the hypothesis for the positive class (in this case an APR), it adheres closely to the MI assumption.
Another way to view the classification problem is from a probabilistic point of view, which is prevalently adopted by statisticians and researchers in the statistical learning domain [Hastie, Tibshirani and Friedman, 2001; Vapnik, 2000; McLachlan, 1992; Devroye, Gyorfi and Lugosi, 1996]. Normally one thinks of a joint distribution over the feature vector variable X and the class variable Y. By calculating the probability of Y conditional on X, Pr(Y|X), we can predict the probability of the class label of a test feature vector. According to statistical decision theory, one should make the prediction based on whether this probability is over 0.5 in a two-class case.
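In symbols, this standard two-class decision rule (generic notation, not specific to this thesis) reads:

    \hat{y}(x) = \begin{cases} \text{positive} & \text{if } \Pr(Y = \text{positive} \mid X = x) > 0.5 \\ \text{negative} & \text{otherwise} \end{cases}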
But this is in single-instance supervised learning. In the MI domain, the first (and so far the only published) probabilistic model is the Diverse Density (DD) model proposed by Maron [Maron, 1998; Maron and Lozano-Perez, 1998]. As will be described in more detail in Chapters 4 and 7, DD was also heavily influenced by the combination of the APR and the MI assumption and the two-step process. The DD method actually used the maximum binomial log-likelihood method, a statistical paradigm used by many normal single-instance learners like logistic regression, to search for the parameters in the first step. In particular, to model an APR in the first step, it used a radial form or a Gaussian-like form to model Pr(Y|X).

Because of the radial form, the pattern (or decision boundary) of the classification step is an Axis-Parallel hyper-Ellipse (APE) instead of a hyper-rectangle.³ In the second step, where we assume that we have already obtained each instance's class probability, we still need ways to model the process that decides the class label of an example.⁴ There were two ways in DD, namely the "noisy-or" model and the "most-likely-cause" model, which are both probabilistic ways to model the MI assumption. In the single-instance case, the process to decide the class label of each example (i.e. each instance) can be regarded as a one-stage Bernoulli process. Now, since we have multiple instances in an example, it seems natural to extend this to a multiple-stage Bernoulli process, with each instance's (latent) class label determined by its class probability in one stage.⁵ And this is exactly the "noisy-or" generative model in DD. As we shall see, according to [Maritz and Lwin, 1989], a very similar way of modeling was adopted by some statisticians as early as 1943 [von Mises, 1943]. However we can also model the process as a one-stage Bernoulli process if we assume some way to "squeeze" the multiple probabilities (one per instance) into one probability. The "most-likely-cause" model in DD is of this kind. Either way, we can form a binomial log-likelihood function, either multi-stage or one-stage. The noisy-or model computes the probability of seeing all the stages negative and its complement, the probability of seeing at least one stage positive. The most-likely-cause model picks only one instance's probability in an example to form the binomial log-likelihood: it selects the instance within an example that has the highest probability of being positive. Both processes to generate a bag's class label are model-based⁶ and are compatible with the MI assumption. By maximizing the
log-likelihood function we can find the parameters involved in the radial formulation of Pr(Y|X). This is how DD recovers the instance-level class probability, Pr(Y|X). With the virtues of the maximum likelihood (ML) method, namely its consistency and efficiency, one can usually assure the correctness of the solution if the underlying assumptions and generative model (described in [Maron, 1998]) are true. In the testing phase, the estimated parameters and the same Bernoulli process used in training provide a class probability for a test example. Once again we decide its class label using the probability threshold of 0.5.

³ Note that the function for a rectangle is not differentiable whereas that of an ellipse is. But in the sense of classification decision making, they are very similar.
⁴ The process to decide an example's class label was called the "generative model" in DD.
⁵ Note that the normal binomial distribution formula C(n,r) p^r (1-p)^(n-r) does not apply here because in this Bernoulli process, the probability p changes from stage to stage.
⁶ The processes are based on the radial formulation of Pr(Y|X).
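To make the noisy-or generative model concrete, here is a minimal sketch in Java of how a bag's class probability could be computed from a Gaussian-like (APE) instance model. The parameter names (center, scale) and the standalone API are our own illustration, not the actual DD implementation of [Maron, 1998]:

    /** Minimal sketch of the Diverse Density noisy-or model (illustrative API). */
    public class NoisyOrDD {

        // Gaussian-like instance model (axis-parallel hyper-ellipse):
        // Pr(y = 1 | x) = exp(-sum_k scale[k]^2 * (x[k] - center[k])^2)
        static double instanceProb(double[] x, double[] center, double[] scale) {
            double sum = 0.0;
            for (int k = 0; k < x.length; k++) {
                double d = x[k] - center[k];
                sum += scale[k] * scale[k] * d * d;
            }
            return Math.exp(-sum);
        }

        // Noisy-or: a bag is positive unless every instance fails to be positive,
        // i.e. Pr(Y = 1 | bag) = 1 - prod_i (1 - Pr(y_i = 1 | x_i)).
        static double bagProb(double[][] bag, double[] center, double[] scale) {
            double allNegative = 1.0;
            for (double[] x : bag) {
                allNegative *= 1.0 - instanceProb(x, center, scale);
            }
            return 1.0 - allNegative;
        }

        public static void main(String[] args) {
            double[][] bag = { {0.9, 1.1}, {5.0, 5.0} };      // a bag with two instances
            double[] center = {1.0, 1.0}, scale = {1.0, 1.0}; // toy parameter values
            // Close to 1, because one instance lies near the center:
            System.out.println(bagProb(bag, center, scale));
        }
    }

The most-likely-cause model would instead use only the single largest instance probability, max_i Pr(y_i = 1 | x_i), as the bag's probability.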
Within the most-likely-cause model, the log-likelihood function involves the max(·) function and becomes non-differentiable, making it hard to maximize the log-likelihood function. That is why EM-DD was proposed [Zhang and Goldman, 2002] some years later. EM-DD uses the EM algorithm [Dempster, Laird and Rubin, 1977] to overcome the non-differentiability in the optimization of the log-likelihood function. Thus in methodology, EM-DD is simply DD.⁷ Nonetheless, it can be shown (see Appendix D) by some theoretical analysis and an illustrative counter-example that such an attempt is generally a failure. The theoretical proof of the convergence of EM-DD is also problematic. As a result, EM-DD will not generally find a maximum likelihood estimate (MLE) of the parameters. Since DD is an ML method, EM-DD's solution will not be a correct one if it cannot find the MLE. EM-DD was found to produce a very good result on the Musk data but this was due to a flawed evaluation involving intensive parameter tuning on the test data. Not surprisingly, the EM-DD algorithm did not work better than DD on the content-based image retrieval task [Zhang, Goldman, Yu and Fritts, 2002].

⁷ That is why we list it here and not among the new methods after 2000, although it was published in 2002.

2.2.2 2000-Now
The new millennium saw the breaking of the MI assumption and the abandonment of APR-like formulations. Virtually no new methods (apart from a neural network-based method) created since then use the MI assumption, although interestingly enough some of them were motivated based on this assumption. Hence it may not be fair to compare these methods with the methods developed before 2000 because they were based on different assumptions. However, one can usually compare them in the sense of verifying which assumptions and models are more appropriate for the real-world data.
The new methods created since 2000 all aim to upgrade single-instance learners to deal with MI data. The methods upgraded so far are decision trees [Ruffo, 2001], nearest neighbour [Wang and Zucker, 2000], neural networks [Ramon and Raedt, 2000], decision rules [Chevaleyre and Zucker, 2001] and support vector machines (SVMs) [Gartner, Flach, Kowalczyk and Smola, 2002]. Nevertheless the techniques involved in some of these methods are significantly different from those used in their single-instance parents. We can categorize them into instance-based methods and metadata-based methods. The term "instance-based" denotes that a method tries to select some (or all) representative instances from an example and models these representatives for the example. The selection could be based on the MI assumption or, more often after 2000, not. The term "metadata-based" means that a method actually ignores the individual instances within an example. Instead it extracts some metadata from an example that is no longer related to any specific instances. The metadata-based approaches cannot possibly adhere to the MI assumption because the MI assumption must be associated with instance selection within an example. We briefly describe the post-2000 methods using this categorization, which is also the backbone of the framework we will discuss in the next section.
The instance-based approaches are the nearest neighbour technique, the neural network, the decision rule learner, and the SVM (based on a multi-instance kernel). The MI nearest neighbour algorithms [Wang and Zucker, 2000] introduce a measure that gives the distance between two examples, namely the Hausdorff distance. It basically regards the distance between two examples as the distance between the representatives within each example (one representative instance per example). The selection of the representative is based on the maximum or minimum of the distances between all the instances from the two examples. While it is not totally clear from the paper [Wang and Zucker, 2000] what the so-called "Bayesian-KNN" does in the testing phase, the "Citation-KNN" method definitely violates the MI assumption because it decides a test example's class label by the majority class of its nearest examples. Thus in general it does not classify an example based on whether at least one of its instances is positive or all of its instances are negative.
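For concreteness, a minimal Java sketch of the (maximal) Hausdorff distance between two bags follows; [Wang and Zucker, 2000] also use a minimal variant, which replaces the outer maximization with a minimization:

    /** Sketch of the maximal Hausdorff distance between two bags of instances. */
    public class HausdorffDistance {

        static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int k = 0; k < a.length; k++) {
                sum += (a[k] - b[k]) * (a[k] - b[k]);
            }
            return Math.sqrt(sum);
        }

        // Directed distance h(A, B) = max over a in A of (min over b in B of d(a, b)).
        static double directed(double[][] A, double[][] B) {
            double worst = 0.0;
            for (double[] a : A) {
                double nearest = Double.POSITIVE_INFINITY;
                for (double[] b : B) {
                    nearest = Math.min(nearest, euclidean(a, b));
                }
                worst = Math.max(worst, nearest);
            }
            return worst;
        }

        // Symmetric bag-level distance H(A, B) = max(h(A, B), h(B, A)).
        static double hausdorff(double[][] A, double[][] B) {
            return Math.max(directed(A, B), directed(B, A));
        }
    }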
On the other hand, MI neural networks [Ramon and Raedt, 2000] closely adhere to the MI assumption. They adopt the same two-step framework used in the APR method [Dietterich et al., 1997] described above. As a matter of fact, one may recognize that searching for parameters in the aforementioned two-step process is well suited to a two-level neural network architecture. A neural network is used to learn a pattern in the classification step, and a model-based instance selection method is applied in the second step. In the first step the family of patterns is not explicitly specified but implicitly defined by the complexity of the network constructed. In the second step, like the most-likely-cause model in DD [Maron, 1998], the neural network picks the instance with the highest output value in an example.⁸ Backpropagation is used to search for the parameter values. Therefore it can be said that this method is based on the MI assumption. Indeed, the reported results seem to be very similar to those of the DD algorithm on the Musk datasets.

⁸ Since the output value is in [0, 1], we can regard it as the probability of being positive.
NaiveRipperMI [Chevaleyre and Zucker, 2001] is a modification of the rule learner RIPPER [Cohen, 1995] with a different counting method. Instead of counting how many instances are covered by a hypothesis, it counts how many examples are covered: if at least one instance of an example is covered, the whole example is counted. Because positive and negative examples are treated the same way, this violates the MI assumption. In fact this method could find a hyper-rectangle that covers all negative examples but no positive ones. Since NaiveRipperMI [Chevaleyre and Zucker, 2001] placed so much emphasis on the MI assumption, we assume⁹ that it does what the MI assumption states in the testing procedure. However, this would mean that it is not consistent with what happens in the training procedure.

⁹ There is no information on how a test example is classified in [Chevaleyre and Zucker, 2001].
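The modified counting scheme can be sketched as follows, where the covers predicate is hypothetical and stands for whatever condition a rule tests on a single instance:

    // Bag-level coverage count: a bag counts as covered as soon as
    // one of its instances is covered by the rule (method sketch).
    static int coveredBags(double[][][] bags, java.util.function.Predicate<double[]> covers) {
        int count = 0;
        for (double[][] bag : bags) {
            for (double[] instance : bag) {
                if (covers.test(instance)) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }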
The SVM with the MI kernel [Gartner et al., 2002] also violates the MI assumption. The MI kernel simply replaces the standard dot product by the sum over all pairwise dot products between instances from two examples. This can be combined with another non-linear kernel, e.g. an RBF kernel. This effectively assumes that the class label of an example is the true class label of all the instances inside it, and attempts to search for the hyperplane that can separate all (or most of)¹⁰ the training examples in an extended feature space (because of the RBF kernel function). Since this is done the same way for both positive and negative examples, the MI assumption is not used in this method at all. Indeed, we observe that some methods, including SVMs, that do not model the probability Pr(Y|X) directly will find the MI assumption very hard to apply, if not impossible. It would be very convenient for those methods to have other assumptions associated with the measure they attempt to estimate.

¹⁰ The regularization parameter C in the SVM will tolerate some errors in the training data.
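For concreteness, a minimal Java sketch of such a set kernel, combining the pairwise sum with an RBF kernel on instance pairs (the gamma parameter is illustrative):

    /** Sketch of an MI set kernel: sum of pairwise instance kernels. */
    public class MISetKernel {

        // Standard RBF kernel between two instances.
        static double rbf(double[] a, double[] b, double gamma) {
            double sq = 0.0;
            for (int k = 0; k < a.length; k++) {
                sq += (a[k] - b[k]) * (a[k] - b[k]);
            }
            return Math.exp(-gamma * sq);
        }

        // k_MI(A, B) = sum over a in A and b in B of k(a, b).
        static double kernel(double[][] A, double[][] B, double gamma) {
            double sum = 0.0;
            for (double[] a : A) {
                for (double[] b : B) {
                    sum += rbf(a, b, gamma);
                }
            }
            return sum;
        }
    }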
The metadata-based approach is implemented in the MI decision tree learner RELIC and the SVM based on a polynomial minimax kernel. This approach extracts some metadata from each example, and regards such metadata as the characteristics of the examples. When a new example is seen, we can directly predict its class label with regard to the metadata without knowing the class labels of the instances. Therefore the individual instances are not important in this approach; what matters is the underlying properties of the instances. We cannot tell which instance is positive or negative because an example's class label is associated with the properties that are presented by the attribute values of all the instances. Hence this approach cannot possibly use the MI assumption.
The MI decision tree learner RELIC [Ruffo, 2001] is of this kind. At each node in the tree, RELIC partitions the examples according to the following method:

- For a nominal attribute with R values, say r where r = 1, 2, ..., R, it assigns an example to the r-th subset if there is at least one instance in the example whose value of this attribute is r.

- For a numeric attribute, given a threshold θ, there are two subsets to choose between: the subset less than θ and the subset greater than θ. It assigns an example to either subset based on one of two types of tests. The first type assigns an example to the subset less than θ if the minimum of this attribute's values over all the instances within the example is less than or equal to θ, and otherwise to the subset greater than θ. The second type assigns an example to the subset less than θ if the maximum of this attribute's values over all the instances within the example is less than or equal to θ, and otherwise to the subset greater than θ.

It then seeks the best θ according to the entropy measure, in the same way as the single-instance tree learner C4.5 [Quinlan, 1993]. Note that for numeric attributes, it evaluates the two types of tests simultaneously, so that only one type of test and one θ are selected for each numeric attribute.

The way that RELIC assigns examples to subsets means that it is equivalent to extracting some metadata, namely minimax values, from each example and applying
the single-instance learner C4.5 to the transformed data. Since RELIC examines
the attribute values along each dimension individually, such metadata of an example does not correspond to any specific instance inside it, although it is possible,
but very unlikely, to match the instances to the minimax values of all the attributes
simultaneously. Moreover, in the testing phase, we can directly tell an example's class label using the tree without getting the class labels of the instances. Thus the
MI assumption obviously does not apply.
The SVM with a polynomial minimax kernel explicitly transforms the original feature space into a new feature space with twice the number of attributes of the original one. For each attribute in the original feature space, two new attributes are created in the transformed space: the minimal value and the maximal value. It then maps each example in the original feature space to an instance in the new space by finding the minimum and maximum value of each attribute for the instances in that example. Clearly some information is lost during the transformation, and this is significantly different from the two-step paradigm used in [Dietterich et al., 1997]. The MI assumption cannot possibly apply. Note that this is effectively the same as what RELIC does.
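The shared minimax transformation can be sketched as follows: each bag becomes a single feature vector of per-attribute minima and maxima, to which a single-instance learner (C4.5 in RELIC's case, an SVM with a polynomial kernel in the other) can then be applied. This is a minimal illustrative sketch, not either system's actual code:

    /** Sketch of the minimax metadata transformation: one bag -> one instance. */
    static double[] minimaxFeatures(double[][] bag) {
        int d = bag[0].length;
        double[] features = new double[2 * d];  // [min_1..min_d, max_1..max_d]
        java.util.Arrays.fill(features, 0, d, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(features, d, 2 * d, Double.NEGATIVE_INFINITY);
        for (double[] instance : bag) {
            for (int k = 0; k < d; k++) {
                features[k] = Math.min(features[k], instance[k]);          // per-attribute minimum
                features[d + k] = Math.max(features[d + k], instance[k]);  // per-attribute maximum
            }
        }
        return features;
    }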
18

CHAPTER 2. BACKGROUND
The assumption of the metadata approach is that the classification of the examples is only related to the metadata (in this case the minimax values) of the examples, and that the transformation loses no (or little) information in terms of classification. The convenience of the approach is that it transforms the multi-instance problem into the common mono-instance one. In general, the metadata approach enables us to transform the original feature space into other feature spaces that facilitate single-instance learning. The other feature spaces are not necessarily the result of simple metadata extracted from the examples. They could be, for instance, a model space where we build a model (either a classification model or a clustering model) for the examples and transform each example into one instance according to, say, the count of its instances that can be explained by the model. Methods similar to this are actively being researched [Weidmann, 2003]. The validity of such transformations really depends on the background knowledge. If one believes that interactions or relationships between instances account for the classification of an example, then this approach may outperform methods that do not have such a sophisticated view of the problem. In this thesis, we refer to all the methods that transform the original feature space into another feature space as "metadata-based" approaches, no matter how complicated the extracted metadata may be.
In summary, the methods developed in the earlier stage of MI learning usually have
an APR-like formulation and hold the MI assumption whereas the methods developed later on often implicitly drop the MI assumption and are based on other types
of models.

2.3 A New Framework


As mentioned in the previous section, we can categorize all the current MI solutions
into a simple framework. As for the methods before 2000, it is quite obvious that
they belong to the instance-based approaches because they strictly adhered to the
MI assumption, which implies that one implicitly assumes a class label for each
instance. We now present a hierarchical framework giving our view of MI learning.
This is shown in Figure 2.2.

[Figure 2.2: A framework for MI learning. MI learning divides into instance-based approaches (based on either the MI assumption or other assumptions) and metadata-based approaches (model-oriented or data-oriented, the latter with either fixed or random metadata).]


It is now almost a convention that the MI assumption is associated with MI learning.
In this thesis, we generalize the definition of multiple instance learning and allow
the algorithm developers to plug in whatever assumption they believe reasonable.
Thus MI learning in our framework is based on generalizing the MI assumption.
As already discussed, we categorize MI methods into two categories: instance-based approaches and metadata-based approaches. We have already analyzed the current
solutions based on this distinction, and we will create more methods in this thesis
that belong to either one of the two categories. We can specialize the two categories
further.
In the instance-based approach, one normally estimates some parameters of a function mapping the feature variable X to the class variable Y. This function is then used to form a prediction at the bag level. All current instance-based methods for
MI learning amount to estimating the parameters of the function that enable them
to predict the class label of unseen examples. Some of these methods are based on
the MI assumption but some are not, which results in two sub-categories. We will
also develop some more methods within the sub-category that are not based on the
MI assumption. We will explicitly state our assumptions and present the generative
models that the methods are based on.
The metadata-based approach has already been discussed in the last section. The metadata could either be extracted directly from the data, called "data-oriented" in the framework shown in Figure 2.2, or from some models built on the data, called "model-oriented".
The two-level classification method [Weidmann, Frank and Pfahringer, 2003; Weidmann, 2003] is the only published approach that is model-oriented. It builds a
first-level model using a single-instance (either supervised or unsupervised) learner
to describe the potential patterns in the whole instance space as the metadata. It
then applies a second-level learner to determine the model based on the extracted
patterns. The second-level learner is a single-instance (supervised) learner. Thus it
effectively transforms the original instance space into a new (model-oriented) instance space that single-instance learners can be applied to.
For example, we can apply a clustering algorithm at the first level to the original instances and build a clustering model from the (original) instance space. Then we can construct a new single-instance dataset, with every attribute corresponding to one extracted cluster, and each instance corresponding to one example in the original data. The new instance's attribute values are the numbers of instances of the corresponding example that fall into each cluster. Finally we apply another single-instance learner, say a decision tree, to the new data to build a second-level model. At testing time, the first-level model is applied to the test example to extract the metadata (and generate a new instance), and the second-level learner is used to classify the test example according to the model built on the (training) metadata. Of course there are many combinations of first and second-level learners, and they are not further described in this thesis. Interested readers should refer to [Weidmann, 2003] for more detail.
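A minimal sketch of this model-oriented transformation, assuming a hypothetical assignCluster function obtained from a first-level clustering model with numClusters clusters:

    /** Sketch of model-oriented metadata: per-bag counts of instances per cluster. */
    static double[] clusterCounts(double[][] bag, int numClusters,
                                  java.util.function.ToIntFunction<double[]> assignCluster) {
        double[] counts = new double[numClusters];  // one attribute per cluster
        for (double[] instance : bag) {
            counts[assignCluster.applyAsInt(instance)]++;
        }
        return counts;  // becomes one instance in the new dataset for the second-level learner
    }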
In the data-oriented sub-category, we can further specialize into "fixed metadata" and "random metadata" sub-categories. All of the current metadata-based methods (i.e. RELIC and the SVM based on a minimax kernel) are data-oriented. In addition, the metadata extracted is thought of as fixed and directly used to find the function mapping the metadata to Y. However, we can regard the metadata as random and governed by some distributions, which results in another sub-category. Here

we assume the data within each example is random and follows a certain distribution. Thus we can extract some low-order sufficient statistics to summarize the
data and regard the statistics as the metadata. In this case, the metadata (statistics)
have a sampling distribution parameterized by some parameters. If we further think
of these parameters as governed by some distribution, the metadata is necessarily
random, not only wandering around the parameters within a bag but from bag to
bag as well. This is the thinking behind our new "two-level distribution" approach discussed in Chapter 5. It turns out that, when assuming independence between attributes, it constitutes an approximate way to upgrade the naive Bayes method to
deal with MI data, which has not been tried in the MI learning domain. We also
discovered the relationship between this method and the empirical Bayes methods
in Statistics [Maritz and Lwin, 1989].
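As a rough sketch of this "random metadata" view (and a preview of Chapter 5), one can summarize each bag by low-order statistics, here the per-attribute sample mean and variance, and learn on these summary vectors. The toy data, the choice of statistics, and the use of naive Bayes below are all our illustrative assumptions, not the actual method of Chapter 5.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
# Toy MI data: bag i has its own "true" center; the label loosely follows it.
centers = rng.uniform(-2, 2, size=(40, 2))
bags = [c + rng.normal(scale=0.5, size=(rng.integers(5, 15), 2)) for c in centers]
labels = (centers.sum(axis=1) + rng.normal(scale=0.5, size=40) > 0).astype(int)

# Metadata: low-order sufficient statistics (mean and variance per attribute).
# These statistics themselves vary from bag to bag, which is what the
# two-level distribution view models; here we simply learn on them directly.
def summarize(bag):
    return np.concatenate([bag.mean(axis=0), bag.var(axis=0)])

meta = np.array([summarize(b) for b in bags])
clf = GaussianNB().fit(meta, labels)
print(clf.predict(meta[:5]), labels[:5])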

2.4 Methodology
When we generalize the assumptions and approaches of MI learning, we find much flexibility within the framework described above. Nonetheless, we would still like to restrict our methods to some scope so that we can easily find theoretical justifications for them. We propose that it is desirable for an MI algorithm to have the following three properties:

1. The assumptions and generative models that the method is based on are clearly explained. Because of the richness of the MI setting, there could be many (essentially infinitely many) mechanisms that generate MI data. It is very unlikely that one method can deal with all of them. A method has to be based on some assumptions. Thus it is important to state the assumptions the method is based on and to discuss their feasibility.
2. The method should be consistent in training and testing. Note that consistency here means that a method should make a prediction for a new example in the same way as it builds a model on the training data.11 For example, if a method tries to build a model at the instance level and is based on a certain assumption other than the MI assumption, then it should also predict the class label of a test example using that assumption.
3. When the number of instances within each example reduces to one, the MI method degenerates into one of the popular single-instance learning algorithms. Although this property is not as important as the former two, it is useful when we upgrade a single-instance learner, which is theoretically well founded, to deal with MI data.

Even though not all current MI methods satisfy the above three properties, we aim to achieve them in this project. In order to do so we adopt the following methodology for this thesis.
1. We explicitly drop the MI assumption but state the corresponding new assumptions whenever new methods are created. The underlying generative model of each new method will be explicitly provided so that one may clearly understand what kind of problem the method can solve.

2. We adopt a statistical decision-theoretic point of view; that is, we always assume a joint probability distribution Pr(X, Y) over the feature variable X (or other new variables introduced for MI learning) and the class variable Y, as assumed by most single-instance learning algorithms. We base all our modeling purely on this distribution. Although we can end up with different methods by factorizing the joint distribution differently, their root is the same.

3. As the standard single-instance learners are mostly well justified and empirically verified, we are interested in creating new methods related to them. Even if we develop a new method in a totally different context, we try to show the relationship between this method and some single-instance learning algorithms whenever possible.
11 This is different from the notion of consistency in a statistical context.

2.5 Some Notation and Terminology


To avoid confusion, this section explicitly lists the common notation and some terminology that will be used in this thesis. Special notation and terminology will be introduced together with proper explanations where needed.

An example in the MI domain is also called a bag or an exemplar, and we will use all three terms in this thesis. Likewise an instance is sometimes called a feature vector or a point in the feature space. Every instance is regarded as a value of the feature variable X, whereas its class label is a value of the class variable Y. Y is also called the response variable or the group variable. An attribute will also be called a feature or a dimension. There are many names for the algorithms in normal single-instance (or mono-instance) supervised learning, like propositional learner or attribute-value (AV) learner. We will sometimes use these terms interchangeably.
There is also some notation related to the joint probability distribution Pr(X, Y). In classification problems we are mostly concerned with the conditional probability of Y, Pr(Y|X). We also call it the posterior probability or the point-conditional probability, because it is conditional on a certain point x. But we also have the conditional probability of X, Pr(X|Y), which we call the group-conditional probability. Of course we call Pr(X) and Pr(Y) the prior probabilities of X and Y respectively. Note that when X is numeric, we abuse the symbol Pr(.), because it is then a density function that we refer to. However, we will not distinguish this difference in the notation.

Chapter 3
A Heuristic Solution for Multiple Instance Problems

This chapter introduces new assumptions for MI learning, and we discard the standard MI assumption. We regard the class label of a bag as a property that is related to all the instances within that bag. We call this new assumption the collective assumption, because the class label of a bag is now a collective property of all the corresponding instances. Why can the collective assumption be reasonable for practical problems like drug activity prediction? Because the feature variables (in this case measuring the conformations, or shapes, of a molecule) usually cannot absolutely explain the response variable (a molecule's activity), it is appropriate to model the class label of a data point in the instance space as being decided by a probabilistic mechanism. The collective assumption means that the probabilistic mechanisms of the instances within a bag are intrinsically related, although the relationship is unknown to us. Consider the drug activity prediction problem: a molecule's conformations are not arbitrary points in the instance space but are confined to certain areas. Thus if we assume that the mechanism determining the class label of a molecule is similar to the mechanism determining the (latent) class labels of all (or most) of the molecule's conformations, we may better explain the molecule's activity. Even if the activity of a molecule were truly determined by only one specific shape (or a very limited number of shapes) in the instance space, that shape would have a very small
probability of being sampled. Together with some measurement errors, the samples of a molecule are very likely to wander around the true shape(s). Therefore it is more robust to model the collective class properties of the instances within a bag rather than those of some specific instance. We believe this is true for many practical MI datasets, including the musk drug activity datasets.
The collective assumption is a broad concept and there is great flexibility under this assumption. In Section 3.1 we present several options for building exact generative models based on this assumption. We further illustrate it with an artificial dataset generated by one exact generative model in Section 3.2. This generative model is strongly related to those used in Chapters 4 and 5. In fact, all the methods developed in this thesis are based on some form of the collective assumption. Section 3.3 presents a heuristic wrapper method for MI learning that is based on the collective assumption [Frank and Xu, 2003]. It will be shown that in some cases it can perform classification quite well, even though it introduces some bias into the probability estimation. We interpret and analyze some properties of the heuristic in Section 3.4. Note that some of the material in this chapter has been published in [Frank and Xu, 2003].

3.1 Assumptions
Under the collective assumption, we have several options to build an exact generative model. We have to decide which options to take in order to generate a specific model. We found that answering the following questions is helpful in making these decisions:

1. How to define the class label property of an instance and of a bag?

Suppose we use a function of the class variable Y, C(Y|.), to denote the class label property; then what is the exact form of C(Y|.)? In single-instance statistical learning, we usually model C(Y|X) to be related to the posterior probability function Pr(Y|X): we either use Pr(Y|X) itself or its logit transform, the log-odds function log[Pr(Y=1|X)/Pr(Y=0|X)]. In MI learning, we can also model the property at the bag level, i.e. we can build a function C(Y|B), where B denotes the bags. Thus we can also have Pr(Y|B) = Pr(Y|X_1, X_2, ..., X_n) and log[Pr(Y=1|B)/Pr(Y=0|B)] = log[Pr(Y=1|X_1, X_2, ..., X_n)/Pr(Y=0|X_1, X_2, ..., X_n)], where a bag B = b has n instances x_1, x_2, ..., x_n. Note that in this thesis we restrict ourselves to the same form of the property for instances and bags. For example, if we model Pr(Y|X) for the instances, we also model Pr(Y|B) for the bags instead of the log-odds. We will show the reason for doing so in the answer to the next question.
2. How to view the members of a bag, and what is the relationship between their class label properties and that of the bag?

Almost all current MI methods regard the members (instances) of a bag as a finite number of fixed and unrelated elements. However, we take a different point of view. We think of each bag as a population, which is continuous and generates instances in the instance space in a dense manner. What we have in the data for each bag are some samples randomly drawn from its population. This point of view actually relates all the instances of a bag to each other: they are all governed by the specific distribution of the population of that bag. Every bag is unbounded, i.e. it ranges over the whole instance space. However, its distribution may be bounded, i.e. its instances may only possibly be located in a small region of the instance space.1 If one really thinks of a bag's elements as randomly selected from the whole instance space, we can still fit this into our view by modeling that bag with a uniform distribution. Note that the distributions of different bags are different from each other, and bags may overlap. Thus each point x in the instance space (that is, an instance) will have a different density given different bags. Therefore, unlike normal single-instance learning, which assumes Pr(X), we have Pr(X|B) instead. We still regard each point in the instance space as having its own class label (which could be determined by either a deterministic or a probabilistic process), but this is unobservable in the data. What is observable is the class label of a bag, which is determined using the instances' (latent) class label properties.

1 In fact, if the bags are bounded, we can always think of their distributions as bounded ones.
Now, how do we decide the class label of a bag given those of its instances? There are two options here: the population version and the sample version. Since we take the above perspective for a bag and we have already defined the class label property C(Y|.), one application of the collective assumption is to take the expected property of the population of each bag as the class property of that bag. Given a bag b, we calculate

    C(Y|b) = E_{X|b}[C(Y|X)] = \int_X C(Y|x) Pr(x|b) dx        (3.1)

We simply regard the bag's class label property as the conditional expectation of the class properties of all the instances given b. This is the population version for determining the class label of a bag, C(Y|b). It is not related to any individual instance, but we must know Pr(x|b) exactly to calculate the integral. However, there is also a sample version of C(Y|b). If given b we can sample n_b instances from the instance space (in other words, there are n_b instances in b), then C(Y|b) = (1/n_b) \sum_{i=1}^{n_b} C(Y|x_i). Since the instances are drawn via random sampling, we can give each instance an equal weight and calculate the weighted average no matter what the distribution Pr(x|b) is. The population and the sample version are approximately the same when the in-bag sample size is large, but there may be a large difference if the sample size is small, for example if only one instance is sampled per bag. In this case the sample version reduces to the single-instance case, because an instance's (in this case also a bag's) class label is determined by its own class label property. The population version, however, will still determine the instance's class label by the overall class property of its bag. Consequently it may be less desirable because it does not degenerate naturally to the single-instance case.
It is now clear why we choose consistent formulations of C(Y|.) for both bags and instances: we regard C(Y|B) simply as the expected value of C(Y|X) conditional on the existence of a bag B = b. While there may be other applications of the collective assumption, in this thesis we take this perspective only. There are still two options, as discussed above, for generating MI data under the collective assumption: the sample version and the population version. In this thesis we use the sample version because it degenerates elegantly into single-instance supervised learning.
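A small numeric illustration of the sample version: assuming C(Y|.) = Pr(Y=1|.) and a hypothetical instance-level model instance_prob (the logistic function below is just a stand-in), the bag's class property is the unweighted average of the instance properties, and with a single instance per bag it reduces to the single-instance case, as noted above.

import numpy as np

def instance_prob(x):
    # Hypothetical instance-level posterior Pr(y=1|x); a logistic stand-in.
    return 1.0 / (1.0 + np.exp(-(x[..., 0] + x[..., 1])))

def bag_prob_sample_version(bag):
    # Sample version: equal weight 1/n_b for each of the n_b observed instances.
    return instance_prob(bag).mean()

bag = np.array([[0.5, 1.0], [-0.2, 0.3], [1.5, -0.5]])
print(bag_prob_sample_version(bag))  # average of the three instance probabilities

# With one instance per bag, this reduces to the single-instance case:
print(bag_prob_sample_version(bag[:1]) == instance_prob(bag[0]))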
3. How to model Pr(X|B)?

Whether using the population or the sample version, we need to know the conditional density of the instances, Pr(X|B), because we need to generate the instances of each bag from this distribution. There are two possibilities for modeling Pr(X|B): one is to model it indirectly using Pr(X), and the other is to model Pr(X|B) directly.

The first option still assumes the existence of Pr(X); thus we are in the same framework as normal single-instance learning, where we assume both Pr(X) and Pr(Y|X). What is new is that we introduce a new variable B to denote the bags and define Pr(X|B) in terms of Pr(X). In particular, we assume the distribution of each bag, Pr(X|B), is bounded and only occupies a limited area in the instance space. As a result we model the conditional density of the feature variable X given a bag B = b as

    Pr(x|b) = Pr(x) / \int_{x \in b} Pr(x) dx   if x \in b,  and 0 otherwise.        (3.2)

Note that we abuse the notation Pr(.) here because we really have a density function of X instead of a probability if X is numeric. To put it another way, the distribution of each bag is simply the normalized instance distribution Pr(X), restricted to the corresponding range. In both the population and the sample version of the generative model, we need Equation 3.2 to generate the instances of a bag. In the population version we also need it to create the class label of a bag b, whereas in the sample version we do not, because there the class label of a bag is not related to a specific form of the density function of its instances.
The second option for modeling Pr(X|B) does not assume the existence of Pr(X). Indeed, if all we need is Pr(X|B), why should we still rely on the single-instance statistical learning paradigm? Given a bag b, we can directly model Pr(x|b) as some distribution, say a Gaussian, parameterized by some parameters. Now a bag is basically described by its parameters, because once the parameters of a bag are decided, that bag has been formed. Thus we need some mechanism to generate the parameters of the bags: we can conveniently regard the parameters themselves as distributed according to some hyper-distribution. The data generation process is exactly the same as in the first option, for both the population version and the sample version. Note that if we model Pr(X|B) directly, Pr(X) may or may not exist, depending on the specific distribution involved.
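A minimal sketch of this second option, under the illustrative assumption that each bag's distribution is a Gaussian whose mean is itself drawn from a Gaussian hyper-distribution; all distributional choices and names below are ours:

import numpy as np

rng = np.random.default_rng(3)

def sample_bag(n_instances):
    # A bag is formed once its parameters are drawn from the hyper-distribution:
    # here the bag mean comes from a Gaussian hyper-distribution ...
    bag_mean = rng.normal(loc=0.0, scale=2.0, size=2)
    # ... and Pr(X|B) is modeled directly as a Gaussian around that mean.
    return rng.normal(loc=bag_mean, scale=0.5, size=(n_instances, 2))

bags = [sample_bag(rng.integers(1, 21)) for _ in range(10)]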
The above two options may sometimes coincide, as will be shown in Section 3.2, but in general they generate different data. Note that although the conditional density function Pr(X|B) must be specified in order to generate instances for each bag, it is not important for the instance-based learning algorithms that will be discussed (particularly in Chapter 4), because they are based on the sample version of the generative model, in which Pr(X|B) is not relevant to the class probability of the bags, Pr(Y|B). However, in Chapter 5 we pay much attention to Pr(X|B), as we develop a method that models it in a group-conditional manner (i.e. conditional on Y).

Now that we have answered the above questions, we are able to specify how to generate a bag of instances and how to generate the class label of that bag. Thus it is time to generate an MI dataset based on an exact generative model under the collective assumption. In the following section we illustrate the above specifications via an artificial dataset, answering the above questions as follows. First, the class label property C(Y|.) is the posterior probability at both the instance and the bag level; in other words, we model Pr(Y|X) and Pr(Y|B). Second, we think of each bag as a hyper-rectangle in the instance space and assume the center (i.e. the middle-point of the hyper-rectangle) of each bag is uniformly distributed. Thus the bags are bounded, and the instances for each bag are drawn from within the corresponding hyper-rectangle. We model Pr(Y|X) as a linear logistic model, i.e. Pr(Y=1|X) = 1 / (1 + exp(-βX)), and, based on the sample version of the generative model, Pr(Y|B) = (1/n) \sum_{i=1}^{n} Pr(Y|X_i), where n is the number of instances in a bag. Finally, we take the first option for the last question and assume the existence of Pr(X), which we model as a uniform distribution. Thus the conditional density in Equation 3.2 is simply a uniform distribution within the region that a bag occupies.

Figure 3.1: An artificial dataset with 20 bags.

3.2 An Artificial Example Domain


In this section, we consider an artificial domain with two attributes. More specifically, we created bags of instances by defining rectangular regions and sampling instances from within each region. First, we generated coordinates for the centroids of the rectangles according to a uniform distribution with a range of [-5, 5] for each of the two dimensions. The size of a rectangle in each dimension was chosen from 2 to 6 with equal probability (i.e. following a uniform distribution). Each rectangle was used to create a bag of instances. To this end we sampled n instances from within the rectangle according to a uniform distribution. The value of n was chosen from 1 to 20 with equal probability. Note that although we adopt the first option from the previous section to model Pr(X|B), this is identical to the second option if we assume a different uniform distribution for each bag and let the parameters of these uniform distributions be governed by another (hyper-) uniform distribution.
It remains to specify how to generate the class label of a bag. As mentioned before, our generative model assumes that the class probability of a bag is the average class probability of the instances within it, and this is what we used to generate the class labels for the bags. The instance-level class probability was defined by the following linear logistic model:

    Pr(y = 1 | x_1, x_2) = 1 / (1 + e^{-3 x_1 - 3 x_2})
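For concreteness, the following sketch reproduces this generative process (uniform centroids in [-5, 5], side lengths uniform in [2, 6], 1 to 20 uniformly sampled instances per rectangle, bag labels drawn from the average instance probability). The structure follows the description above; the random seed and function names are ours.

import numpy as np

rng = np.random.default_rng(0)

def instance_prob(x):
    # Instance-level model: Pr(y=1|x1,x2) = 1 / (1 + exp(-3*x1 - 3*x2)).
    return 1.0 / (1.0 + np.exp(-3.0 * x[:, 0] - 3.0 * x[:, 1]))

def make_bag():
    center = rng.uniform(-5, 5, size=2)      # uniformly distributed centroid
    size = rng.uniform(2, 6, size=2)         # side lengths of the rectangle
    low, high = center - size / 2, center + size / 2
    n = rng.integers(1, 21)                  # 1 to 20 instances per bag
    instances = rng.uniform(low, high, size=(n, 2))
    # Collective assumption, sample version: the bag probability is the average
    # instance probability; the label is then drawn from it ("coin flip").
    p_bag = instance_prob(instances).mean()
    label = int(rng.random() < p_bag)
    return instances, label

bags, labels = zip(*[make_bag() for _ in range(20)])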

Figure 3.1 shows a dataset with 20 bags that was generated according to this model. The black line in the middle is the instance-level decision boundary (i.e. where Pr(y = 1|x_1, x_2) = 0.5), and the sub-space on the right side has instances with a higher probability of being positive. A rectangle indicates the region used to sample points for the corresponding bag (and a dot indicates its centroid). The top-left corner of each rectangle shows the bag index, followed by the number of instances in the bag. Bags in gray belong to class "negative" and bags in black to class "positive". In this plot we mask the class labels of the instances with the class of the corresponding bag, because only the bags' class labels are observable. Note that bags can be on the "wrong" side of the instance-level decision boundary, because each bag was labeled by flipping a coin based on the average class probability of the instances in it.
Note that the bag-level decision boundary is not explicitly defined, but the instance-level one is. However, if the number of instances in each bag goes to infinity (i.e. basically working with the population version), then the bag-level decision boundary is defined. Since the instance-level decision boundary is symmetric w.r.t. every rectangle, and so is Pr(X|B), the bag-level decision boundary is defined in terms of the centroid of each bag and is the same as the instance-level one. Thus the best choice to classify a bag when given its entire population is simply to predict based on which side of the line 3 x_1 + 3 x_2 = 0 the bag's centroid lies. There is also another interesting property in this asymptotic situation. Since we define C(Y|X) as Pr(Y|X), we can plug this into the right-hand side of Equation 3.1 to calculate the conditional expectation of Pr(Y|X) given a bag b:

    E_{X|b}[Pr(Y|X)] = \int_X Pr(Y|x) Pr(x|b) dx

Assuming conditional independence between Y and b given x, this becomes

    = \int_X Pr(Y|x, b) Pr(x|b) dx
    = \int_X Pr(x, Y|b) dx
    = Pr(Y|b)

Thus we marginalize out X in the joint distribution Pr(X, Y), conditional on the existence of b, and obtain the (conditional) prior probability: the class probability of b, Pr(Y|b).

Chapter 4 presents an exact instance-based learning algorithm for this problem. It is quite obvious that the exact solution is instance-based, because only the instance-level probabilities are defined. However, this artificial problem can also be tackled with other methods. We may regard Pr(X|B) for each bag as a uniform distribution whose parameters are governed by some hyper-distribution, for example two Gaussian distributions with different means but the same variance, one for each class. Since, as mentioned before, the bag-level decision boundary is asymptotically linear, one can imagine that this type of model is not bad for this generative model in terms of classification performance. In particular, if we model one Gaussian centered in the top-right corner of Figure 3.1 and another Gaussian in the bottom-left corner, it would be quite a good approximation of the true generative model. This thinking underlies the approach presented in Chapter 5.
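A rough sketch of that intuition (our illustration only, not the actual method of Chapter 5): summarize each bag by its centroid, fit one Gaussian per class to the centroids, and classify a new bag by comparing the two class densities. The toy data and all modeling choices here are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def fit_class_gaussians(bags, labels):
    # Summarize each bag by the mean of its instances (its empirical centroid).
    centroids = np.array([b.mean(axis=0) for b in bags])
    labels = np.asarray(labels)
    models = {}
    for c in (0, 1):
        pts = centroids[labels == c]
        # One Gaussian per class over the bag centroids (toy: assumes both
        # classes are present; a small ridge keeps the covariance invertible).
        models[c] = multivariate_normal(pts.mean(axis=0),
                                        np.cov(pts.T) + 1e-6 * np.eye(2))
    return models

def classify_bag(models, bag):
    z = bag.mean(axis=0)
    return int(models[1].pdf(z) > models[0].pdf(z))

rng = np.random.default_rng(4)
bags = [rng.normal(loc=(s, s), scale=1.0, size=(8, 2)) for s in rng.choice([-2, 2], 40)]
labels = [int(b.mean() > 0) for b in bags]
models = fit_class_gaussians(bags, labels)
print(classify_bag(models, bags[0]), labels[0])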


Method                               Musk 1        Musk 2
Bagging with Discretized PART        90.22±2.11    87.16±1.42
RBF Support Vector Machine           89.13±1.15    87.16±2.14
Bagging with Discretized C4.5        90.98±2.51    85.00±2.74
AdaBoost.M1 with Discretized C4.5    89.24±1.66    85.49±2.73
AdaBoost.M1 with Discretized PART    89.78±2.30    83.70±1.81
Discretized PART                     84.78±2.51    87.06±2.16
Discretized C4.5                     85.43±2.95    85.69±1.86

Table 3.1: The best accuracies (and standard deviations) achieved by the wrapper method on the Musk datasets (10 runs of stratified 10-fold cross-validation).
In the remainder of this chapter, we analyze a heuristic method based on the above
generative model. Although it does not find an exact solution, it performs very well
on the classification task based on this generative model. The method is very simple but the empirical performance on the Musk benchmark datasets is surprisingly
good, which may be due to the similarity between these datasets and our generative
model.

3.3 The Wrapper Heuristic


In this section, we briefly summarize results for a simple wrapper that, in conjunction with appropriate single-instance learning algorithms, achieves high accuracy on the Musk benchmark datasets [Frank and Xu, 2003]. Consistent with the collective assumption, this method assigns every instance the class label of the bag it pertains to, so that a single-instance learner can learn from the data. Moreover, it has two special properties: (1) at training time, instances are assigned a weight inversely proportional to the size of the bag they belong to, so that each bag receives the same total weight, and (2) at prediction time, the class probability for a bag is estimated by averaging the class probabilities assigned to the individual instances in the bag. This method does not require any modification of the underlying single-instance learner, as long as it generates class probability estimates and can deal with instance weights.
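A minimal sketch of the wrapper, with scikit-learn's weighted logistic regression standing in for the WEKA learners used in the experiments (any learner that handles instance weights and produces class probabilities would do):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_wrapper(bags, labels):
    # (1) Each instance gets its bag's class label and weight 1/(bag size),
    # so every bag receives the same total weight.
    X = np.vstack(bags)
    y = np.concatenate([[label] * len(bag) for label, bag in zip(labels, bags)])
    w = np.concatenate([[1.0 / len(bag)] * len(bag) for bag in bags])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

def predict_bag(model, bag):
    # (2) The bag's class probability is the average of its instances'
    # predicted class probabilities.
    return model.predict_proba(bag)[:, 1].mean()

# Usage: model = train_wrapper(bags, labels); p = predict_bag(model, test_bag)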
Table 3.1 shows the results of the wrapper method on the musk activity prediction problem for some popular single-instance learners implemented in the WEKA workbench [Witten and Frank, 1999].2 All estimates were obtained using stratified 10-fold cross-validation (CV) repeated 10 times. The cross-validation runs were performed at the bag level.

Bagging [Breiman, 1996] and AdaBoost.M1 [Freund and Schapire, 1996] are standard single-instance ensemble methods. PART [Frank and Witten, 1998] and C4.5 [Quinlan, 1993] learn decision lists and decision trees respectively. The RBF Support Vector Machine is a support vector machine with a Gaussian kernel, trained using the sequential minimal optimization algorithm [Platt, 1998]. The "discretized" versions are the same algorithms run on discretized training data, using equal-width discretization [Frank and Witten, 1999]. For more details on these algorithms, and on how they can deal with weights and generate probability estimates, please check [Frank and Xu, 2003]. Suffice it to say that all these results are competitive with the best results achieved by other MI methods, as shown in Chapter 6.

3.4 Interpretation
Recall that the wrapper method has two key features: (1) the way it assigns instance weights and class labels at training time, and (2) the probability averaging for a bag at prediction time. We provide some explanation for why this makes sense in the following.

The wrapper method is an instance-based method and, like other methods in this category, it also tries to recover the instance-level probability function. It is well known that many popular propositional learners aim to minimize an expected loss function over X and Y, that is, E_X E_{Y|X}[Loss(θ, Y)], where θ is the parameter to be estimated, usually involved in the probability function Pr(Y|X). As a matter of fact, all the single-instance learning schemes in Table 3.1 are within this category.3

2 Some of the properties of the Musk datasets are summarized in Table 4.2 in Chapter 4.
3 PART and C4.5 use an entropy-based criterion to select the optimal split point in a certain region. This is to minimize the cross-entropy or deviance loss in a piece-wise fashion.
Figure 3.2: Parameter estimation of the wrapper method (estimated coefficients and intercept vs. the number of exemplars).

Figure 3.3: Test errors on the artificial data of the wrapper method trained on masked and unmasked data, compared with the model with the true parameters (test error rate vs. the number of training exemplars).

Now, given a concrete bag b of size n, as defined by Equation 3.2, the sample version of Pr(x|b) is 1/n for each instance in the bag and zero otherwise. Therefore the conditional expectation of the loss function given the presence of a bag b is simply E_{X|b} E_{Y|X,b}[Loss(θ, Y)] = E_{X|b} E_{Y|X}[Loss(θ, Y)], assuming conditional independence of Y on b given X. Plugging in Pr(x|b), the conditional expected loss is simply \sum_j (1/n) E_{y_j|x_j}[Loss(θ, y_j)], where x_j and y_j are the attribute vector and class label of the j-th instance in b respectively. We want to minimize this expected loss over all the bags; thus the final expected loss to be minimized is

    E_B E_{X|B}[E_{Y|X,B}(Loss(θ, Y))] = \sum_i (1/N) \sum_j (1/n_i) E_{y_ij|x_ij}[Loss(θ, y_ij)]        (3.3)

where N is the number of bags and n_i the number of instances in bag i. Thus the weight 1/n of each instance x serves as Pr(x|b), and the bag's weight 1/N is a constant outside the sum over all the bags, so it does not affect the minimization of the expected loss. Nonetheless, Equation 3.3 can never be realized, because y_ij, i.e. the class label of each instance, is not observable. If it were observable, this formulation could be used to find the true θ. However, under the collective assumption, if we assign the bag's class label y_i to each of its instances, this may not be a bad approximation, as y_i is related to all the y_ij. This is exactly what the wrapper method does at training time. The approximation makes the wrapper method a heuristic, because it necessarily introduces bias into the probability estimates.
This can be illustrated by running (weighted) linear logistic regression on the artificial data described in Section 3.2. As shown in Figure 3.2, asymptotically the estimates differ from the true parameters by a multiplicative constant.4 It seems that the objective of recovering the instance-level probability function exactly cannot be achieved with this method. Nevertheless, what we really want is classification performance, and it is well known that unbiased class probability estimates are not necessary to obtain accurate classifications. In this case, since the bias is a multiplicative constant for all the parameters, the correct decision boundary at the instance level can be recovered. As discussed in Section 3.2, asymptotically the bag-level decision boundary is the same as the instance-level one. Thus, given enough instances (normally 10 to 20) within a bag, the classification can be very accurate, because the classification decision is always correctly made. This is shown in Figure 3.3. We generated a hold-out dataset of 10000 independent test bags using the same generative model described in Section 3.2. Then we trained the wrapper method on both the masked data, i.e. where y_ij is not given and y_i is assigned to every instance instead, and the unmasked data, i.e. where y_ij is given for every instance. In the latter case linear logistic regression converges to the true function (convergence not shown in this thesis). At prediction time, we use the probability average of each bag, which is reasonable in this case because it is assumed in the generative model. Figure 3.3 shows that assigning a bag's class label to its instances does not harm classification performance. In fact, the wrapper method trained on the masked data predicts as well as the one trained on the unmasked data once 60 training bags are used. It achieves the best possible accuracy (shown as the bottom line) when trained on more than 120 training bags.

4 The reason the intercept estimate seems unbiased is that the true value is 0, and with a multiplicative bias the estimate is still 0.
The above observations were obtained in a specific artificial setting, which is much simpler than real-world problems. In general, the bias may not be a constant for all the parameters, so even the true decision boundary cannot be recovered. However, practical generative models may have factors that restrict, to some extent, the bias or its harmful effect on classification. For example, we observed from the artificial data that the less area each bag occupies in the instance space, the less bias the wrapper method has. When the ranges of the bags become small, the bias is literally negligible. Hence, if there are some restrictions on the range of a bag in the instance space, the wrapper method can work well. Indeed, we observed that on the Musk datasets the in-bag variances are very small for most of the bags, which may explain the feasibility of the wrapper method. Intuitively, this heuristic will work well if the true class probabilities Pr(Y|X) are similar for all the instances in a bag (under the above generative model), because we use the bags' class labels to approximate those of the corresponding instances. Therefore, in general, as long as this condition is approximately correct, no matter by what means, the wrapper method can work.
Finally, what the wrapper method does at prediction time is reasonable assuming the above generative model. Nevertheless, it seems that the wrapper method does not use the assumption that a bag's class probability is the average of its instances' class probabilities at training time. In fact, by assigning the bags' class labels to their instances, it only assumes the general collective assumption, not any specific assumption regarding how the bags' class labels are created. Therefore we could use other methods at prediction time, such as taking the normalized geometric average of the instances' class probabilities within a bag as the bag's probability, as long as the method is consistent with the general collective assumption. However, according to our experience, taking the arithmetic average of the instances' probabilities for a bag is more robust on practical datasets like the Musk datasets. Thus we recommend this method for the wrapper approach in general.
As explained, the wrapper method is only a heuristic that can work well in practice under some conditions. At least for the specific generative model proposed in Section 3.2 of this chapter, it produces accurate classifications, although it is not good for probability estimation. In Chapter 4, we will present exact algorithms that can give accurate probability estimates for the same generative model. They are also based on normal single-instance learning schemes and aim to upgrade them to deal with MI data. The disadvantage of such an approach is that it relies heavily on exact assumptions at training time. Thus significant modifications of the underlying propositional learning algorithms are inevitable. The wrapper method, on the other hand, is much simpler and more convenient.

3.5 Conclusions
In this chapter we first introduced a new assumption for MI learning, different from the MI assumption: the collective assumption. This assumption regards the class label of a bag as related to all the instances within the bag. We then showed some concrete applications of the collective assumption for generating MI data exactly. One of these applications is to take the averaged probability of all the instances in the same bag as the class probability of that bag.

Under the above exact generative model, we further assumed that the instances within the same bag have similar class probabilities. Consequently the probability of a bag is also similar to those of its instances. These assumptions allowed us to develop a heuristic wrapper method for MI problems. This method wraps around normal single-instance learning algorithms by (1) assigning the class label of a bag, and instance weights, to the instances at training time, and (2) averaging the instances' class probabilities within a bag at testing time.

Assigning bags' class labels to their corresponding instances and averaging the instances' class probabilities are applications of the above two assumptions (the collective assumption and the "similar probability" assumption). The instance weights and the wrapping scheme were motivated by the instance-level loss function (encoded in the single-instance learners) summed over all the bags. We also showed an artificial example where this wrapper method performs well for classification, although its probability estimates are biased. Empirically, we found that this method works very well on the Musk benchmark datasets, in spite of its simplicity.


Chapter 4
Upgrading Single-instance Learners

Among the many solutions to multiple instance (MI) learning problems, one approach has become increasingly popular: upgrading paradigms from normal single-instance learning to deal with MI data. The efforts described in this chapter also fall into this category. However, unlike most of the current algorithms within this category, we adopt an assumption-based approach grounded in statistical decision theory. Starting by analyzing the assumptions and the underlying generative models of MI problems, we provide a fairly general and justified framework for upgrading single-instance learners to deal with MI data. The key feature of this framework is the minimization of an expected bag-level loss function based on some assumptions. As examples, we upgrade two popular single-instance learners, linear logistic regression and AdaBoost, and test their empirical performance. The assumptions and underlying generative models of these methods are explicitly stated.

4.1 Introduction
The motivating application for MI learning was the drug activity problem considered by Dietterich et al. [1997]. The generative model for this problem was basically regarded as a two-step process. Dietterich et al. assumed there is an Axis-Parallel Rectangle (APR) in the instance space that accounts for the class label of each instance (or each point in the instance space). Each instance within the APR is positive and all others are negative. In the second step, a bag of instances is formed by sampling (not necessarily randomly) from the instance space. The bag's class label is determined by the MI assumption, i.e. a bag will be positive if at least one of its instances is positive (within the assumed APR) and negative otherwise. From this perspective, Dietterich et al. proposed APR algorithms that attempt to find the best APR under the MI assumption. They showed empirical results of the APR methods on the Musk datasets, which represent a musk activity prediction problem.
In this chapter, we follow the same perspective as [Dietterich et al., 1997] but with different assumptions. We adopt an approach based on statistical decision theory and select assumptions that are well suited to upgrading each single-instance learner.

The rest of the chapter is organized as follows. In Section 4.2 we explain the underlying generative model we assume and show artificial MI data generated using this model. In Section 4.3 we describe a general framework for upgrading normal single-instance learners according to the generative model. We also provide two examples of how to upgrade linear logistic regression and AdaBoost [Freund and Schapire, 1996] within this framework. These methods have not been studied in the MI domain before. Section 4.4 shows some properties of our methods on both artificial and practical data. Some regularization techniques, as used in normal single-instance learning, will also be introduced in the MI context. In Section 4.5 we show that the methods presented in this chapter perform comparatively well on the benchmark datasets, i.e. the Musk datasets. Section 4.6 summarizes related work and Section 4.7 concludes the chapter.


4.2 The Underlying Generative Model


The basis for the work presented in this chapter was an analysis of the generative model (implicitly) assumed by Dietterich et al. The result of this analysis was that the generative model is actually a two-step process. The first step is an instance-level classification process, and the second step determines the class label of a bag based on the first step and the MI assumption. As a matter of fact, the only probabilistic algorithm in MI learning, the Diverse Density (DD) algorithm [Maron, 1998], followed the same line of thinking. In the first step, DD assumes a radial (or Gaussian-like) formulation for the true posterior probability function Pr(Y|X). In the second step, based on the values of Pr(Y|X) of all the instances within a bag, it assumes either a multi-stage (as in the noisy-or model) or a one-stage (as in the most-likely-cause model) Bernoulli process to determine the class label of a bag. Therefore DD amounts to finding one (or more) Axis-Parallel hyper-Ellipses (APE)1 under the MI assumption.
It is natural to extend the above process of generating MI data to a more general framework. Specifically, as in single-instance learning, we assume a joint distribution over the feature variable X and the class (response) variable Y, Pr(X, Y) = Pr(Y|X) Pr(X). The posterior probability function Pr(Y|X) determines the instance-level decision boundary that we are looking for. However, in MI learning we introduce another variable B, denoting the bags, and what we really want is Pr(Y|B). Let us assume that, given a bag b and all its instances x_b, Pr(Y|b) is a function of Pr(Y|x_b), i.e. Pr(Y|b) = g(Pr(Y|x_b)). The form of g(.) is determined based on some assumptions. In the APR algorithms [Dietterich et al., 1997], the instance-level decision boundary pattern is modeled as an APR, and g(.) is an instance-selection function based on the MI assumption. In DD [Maron, 1998], the instance-level decision boundary pattern is an APE, and g(.) is either the noisy-or or the most-likely-cause model, both of which are due to the MI assumption. Therefore the combination of an APR-like pattern and the MI assumption is simply a special case of our framework.

1 Note that an APE is very similar to an APR, except that it is differentiable.
In this chapter we create new combinations within this framework, but with different assumptions. We change the decision boundary to other patterns and substitute the MI assumption with the collective assumption introduced in Chapter 3. We model the instance-level class label based on Pr(Y|X) or its logit transformation, i.e. the log-odds function log[Pr(Y=1|X)/Pr(Y=0|X)], because many normal single-instance learners aim to estimate these quantities, and we are aiming to upgrade those algorithms. We can think of a bag b as a certain area in the instance space. Then, given b with n instances, we have

    Pr(Y|b) = (1/n) \sum_{i=1}^{n} Pr(Y|x_i)        (4.1)

or

    log[Pr(Y=1|b)/Pr(Y=0|b)] = (1/n) \sum_{i=1}^{n} log[Pr(Y=1|x_i)/Pr(Y=0|x_i)],

which is equivalent to

    Pr(Y=1|b) = [\prod_{i=1}^{n} Pr(y=1|x_i)]^{1/n} / ( [\prod_{i=1}^{n} Pr(y=1|x_i)]^{1/n} + [\prod_{i=1}^{n} Pr(y=0|x_i)]^{1/n} )
    Pr(Y=0|b) = [\prod_{i=1}^{n} Pr(y=0|x_i)]^{1/n} / ( [\prod_{i=1}^{n} Pr(y=1|x_i)]^{1/n} + [\prod_{i=1}^{n} Pr(y=0|x_i)]^{1/n} )        (4.2)

where x_i \in b. Even though these assumptions are quite intuitive, we provide a formal derivation in the following.
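In code, the two collective formulations of Pr(Y|b) look as follows (the geometric version is computed via the average log-odds for numerical stability); inst_probs is assumed to hold Pr(y=1|x_i) for the instances of one bag:

import numpy as np

def bag_prob_arithmetic(inst_probs):
    # Equation 4.1: arithmetic average of the instance probabilities.
    return np.mean(inst_probs)

def bag_prob_geometric(inst_probs):
    # Equation 4.2: normalized geometric average, i.e. the bag-level
    # log-odds is the average of the instance-level log-odds.
    p = np.clip(np.asarray(inst_probs, dtype=float), 1e-12, 1 - 1e-12)
    log_odds = np.mean(np.log(p) - np.log(1 - p))
    return 1.0 / (1.0 + np.exp(-log_odds))

probs = [0.9, 0.6, 0.2]
print(bag_prob_arithmetic(probs), bag_prob_geometric(probs))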


As discussed in Section 3.1 of Chapter 3, we define two versions of generative models: the population version and the sample version. In the population version, the conditional density of the feature variable X given a bag B = b is2

    Pr(x|b) = Pr(x) / \int_{x \in b} Pr(x) dx   if x \in b,  and 0 otherwise.        (4.3)

And in the sample version,

    Pr(x|b) = 1/n   if x \in b,  and 0 otherwise,        (4.4)

where n is the number of instances inside b. This is true no matter what the distribution Pr(X) is, as long as the instances are randomly sampled. Then, for any formulation of the class label property of a bag C(Y|b), according to our collective assumption we associate it with the instances by calculating the conditional expectation over X, which results in Equation 3.1, discussed under the second question in Section 3.1 of Chapter 3.

2 Note that we abuse the term Pr(.) here because for a numeric feature, what we have is a density function of X instead of a probability.

In the sample version, we substitute Pr(x|b) in Equation 3.1 with that in Equation 4.4 and use a sum instead of the integral. If C(Y|.) is Pr(Y|.), then we get Equation 4.1, which is the arithmetic average of the corresponding instance probabilities. If C(Y|.) is log[Pr(Y=1|.)/Pr(Y=0|.)], then we obtain Equation 4.2, which is the normalized geometric average of the instance probabilities. Note that in this model introducing the bags B does not change the joint distribution Pr(X, Y). It only casts a new condition, so that the class labels of the instances are masked by the collective class label.


The above generative models are better illustrated by an artificial dataset, which will also be used in later sections. We consider an artificial domain with two independent attributes. More specifically, we used the same mechanism for generating artificial data as in Section 3.2 of Chapter 3, except that we changed the density of X, Pr(X), and used a different linear logistic model. Now the density function along each dimension is a triangle distribution instead of a uniform distribution:

    f(x) = 0.2 - 0.04 |x|

And the instance-level class probability was defined by the linear logistic model

    Pr(y = 1 | x_1, x_2) = 1 / (1 + e^{-x_1 - 2 x_2})

Thus the instance-level decision boundary pattern is still a hyperplane, but a different one from that modeled in the artificial data in Section 3.2. We changed these settings in order to demonstrate that our framework can deal with any form of Pr(X) and Pr(Y|X), as long as the correct family of Pr(Y|X) (in this case the linear logistic family) and the correct underlying assumption (in this case the collective assumption) are chosen. Finally, we took Equation 4.2 to calculate Pr(y|b). Again we labeled each bag according to its class probability. The class labels of the instances are not observable.
Now we have constructed a dataset based on the combination of a hyperplane (i.e. linear boundary) and the collective assumption, instead of the combination of an APR-like (i.e. quadratic boundary) pattern and the MI assumption used in the APR algorithms [Dietterich et al., 1997] and DD [Maron, 1998].

Figure 4.1: An artificial dataset with 20 bags.

Figure 4.1 shows a dataset with 20 bags that was generated according to this generative model. As in Chapter 3, the black line in the middle is the instance-level decision boundary (i.e. where Pr(y = 1|x_1, x_2) = 0.5), and the sub-space on the right side has instances with a higher probability of being positive. A rectangle indicates the region used to sample points for the corresponding bag (and a dot indicates its centroid). The top-left corner of each rectangle shows the bag index, followed by the number of instances in the bag. Bags in gray belong to class "negative" and bags in black to class "positive". Note that bags can be on the "wrong" side of the instance-level decision boundary, because each bag was labeled by flipping a coin based on the normalized geometric average class probability of the instances in it.

4.3 An Assumption-based Upgrade

In this section we first show how to solve the above problem in the artificial domain. Since the generative model is a linear logistic model, we can upgrade linear logistic regression to solve this problem exactly. Then we generalize the underlying ideas to a general framework for upgrading single-instance learners based on some assumptions, a framework that also includes the APR algorithms [Dietterich et al., 1997] and the DD algorithm [Maron, 1998]. Finally, within this framework, we also show how to upgrade the AdaBoost algorithm [Freund and Schapire, 1996] to deal with MI data.

First we upgrade linear logistic regression together with the collective assumption so that it can deal with MI data. Note that normal linear logistic regression can no longer be applied here, because the class labels of the instances are masked by the collective class label of a bag. Suppose we knew what exactly the collective assumption is, say Equation 4.2; then we could first construct the probability Pr(Y|b) using Equation 4.2 and estimate the parameters (the coefficients of the attributes in this case) using the standard maximum binomial likelihood method. In this way we fully recover the instance-level probability function in spite of the fact that the class labels are masked. When a test bag is seen, we can calculate its class probability Pr(Y|b_test) according to the recovered probability estimate and the same assumption we used at training time. The classification is based on Pr(Y|b_test). Mathematically, in the logistic model, Pr(Y=1|x) = 1/(1 + exp(-βx)) and Pr(Y=0|x) = 1/(1 + exp(βx)), where β is the parameter vector to be estimated. According to Equation 4.2, we construct:

    Pr(Y=1|b) = [\prod_i Pr(y=1|x_i)]^{1/n} / ( [\prod_i Pr(y=1|x_i)]^{1/n} + [\prod_i Pr(y=0|x_i)]^{1/n} ) = exp((1/n) \sum_i βx_i) / (1 + exp((1/n) \sum_i βx_i))
    Pr(Y=0|b) = [\prod_i Pr(y=0|x_i)]^{1/n} / ( [\prod_i Pr(y=1|x_i)]^{1/n} + [\prod_i Pr(y=0|x_i)]^{1/n} ) = 1 / (1 + exp((1/n) \sum_i βx_i))

Then we model the class label determination process of each bag as a one-stage Bernoulli process. Thus the binomial log-likelihood function is:

    LL = \sum_{i=1}^{N} [ Y_i log Pr(Y=1|b_i) + (1 - Y_i) log Pr(Y=0|b_i) ]        (4.5)

where N is the number of bags. By maximizing the likelihood function in Equation 4.5 we can estimate the parameters β. Maximum likelihood estimates (MLEs) are known to be asymptotically unbiased, as illustrated in Section 4.4. This formulation is based on the assumption of Equation 4.2. In practice it is impossible to know the underlying assumptions, so other assumptions may also apply, for instance the assumption of Equation 4.1. In that case, the log-likelihood function of Equation 4.5 remains unchanged, but the formulation of Pr(Y|b) is changed to Equation 4.1. We call the former method MILogisticRegressionGEOM and the latter MILogisticRegressionARITH in this chapter.
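A compact sketch of MILogisticRegressionARITH: the bag-level binomial log-likelihood of Equation 4.5 is maximized numerically, with Pr(Y|b) formed from the instance-level linear logistic model via Equation 4.1 (the GEOM variant would only change how the bag probability is formed). The optimizer choice (BFGS) and the intercept handling are our assumptions, not necessarily those of the implementation evaluated later.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, bags, labels):
    nll = 0.0
    for bag, y in zip(bags, labels):
        X = np.hstack([np.ones((len(bag), 1)), bag])      # add an intercept term
        inst_p = 1.0 / (1.0 + np.exp(-X @ beta))          # Pr(y=1|x) per instance
        p_bag = np.clip(inst_p.mean(), 1e-12, 1 - 1e-12)  # Equation 4.1
        nll -= y * np.log(p_bag) + (1 - y) * np.log(1 - p_bag)  # Equation 4.5
    return nll

def fit_mi_logistic(bags, labels, dim):
    # Linear pattern: a local search starting at zero suffices (see below).
    res = minimize(neg_log_likelihood, np.zeros(dim + 1), args=(bags, labels),
                   method="BFGS")
    return res.x

def predict_bag(beta, bag):
    X = np.hstack([np.ones((len(bag), 1)), bag])
    return (1.0 / (1.0 + np.exp(-X @ beta))).mean()  # same assumption at test time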
As usual, the maximization of the log-likelihood function is carried out via numeric optimization, because there is no analytical form of the solution in our model.3 Based on the rule of parsimony, we want as few parameters as possible. In the case of linear logistic regression, we only search for parameter values around zero. Thus the linear pattern means that only local optimization is needed, which saves us great computational cost. The radial formulation in DD [Maron, 1998], on the other hand, implies a complicated global optimization problem.

3 The choice of numeric optimization methods used in this thesis is discussed in Chapter 7.

In general, since many single-instance learners amount to minimizing the expected loss function E_X E_{Y|X}[Loss(X, θ)] in order to estimate the parameters θ in Pr(Y|X), we can upgrade any normal single-instance learner that factorizes Pr(X, Y) into Pr(Y|X) Pr(X) and estimates Pr(Y|X) directly. This category covers a wide range of single-instance learners, so the method is quite general. It involves four steps:
1. Based on whatever assumptions are believed appropriate, build a relationship between the class probability of a bag b, Pr(Y|b), and that of its instances x_b, Pr(Y|x_b), i.e. Pr(Y|b) = g(Pr(Y|x_b)). Since Pr(Y|X) is usually defined by the single-instance learner under consideration, the only thing to decide is g(.).

2. Construct a loss function at the bag level and take the expectation over all the bags instead of over the instances, i.e. the expected loss function is now E_B E_{Y|B}[Loss(B, θ)]. Note that the parameter vector θ is the instance-level parameter vector, because it determines Pr(Y|X) (or its transformations).

3. Minimize the bag-level loss function to estimate θ.

4. When given a new bag b_test, first calculate Pr(Y|x_test, θ̂) and then calculate Pr(Y|b_test) = g(Pr(Y|x_test, θ̂)) based on the same assumption used in Step 1. Then classify b_test according to whether Pr(Y|b_test) is above 0.5.

The negative binomial log-likelihood is a loss function (also called the deviance or cross-entropy loss). Thus the above two MI linear logistic regression methods fit into this framework. They use the linear logistic formulation for Pr(Y|X) and the collective assumption. The Diverse Density (DD) algorithm [Maron, 1998] also uses the maximum binomial likelihood method, thus it can easily be recognized as a member of this framework. It uses a radial formulation for Pr(Y|X) and the MI assumption. The APR algorithms [Dietterich et al., 1997], on the other hand, directly minimize the misclassification error loss at the bag level, and the bounds of the APR are the parameters to be estimated for Pr(Y|X). Since the APR algorithms regard the instance-level classification as a deterministic process, Pr(Y=1|x) can be regarded as 1 for any instance x within the APR, and Pr(Y=1|x) = 0 otherwise. Note that in this framework, significant changes to the underlying single-instance algorithms seem inevitable. This is in contrast to the heuristic wrapper method presented in Chapter 3.

1. Initialize the weight of each bag: W_i = 1/N, i = 1, 2, ..., N.

2. Repeat for m = 1, 2, ..., M:

   (a) Set W_ij = W_i / n_i, assign the bag's class label to each of its instances, and build an instance-level model h_m(x_ij) ∈ {-1, 1}.
   (b) Within the i-th bag (with n_i instances), compute the error rate e_i ∈ [0, 1] by counting the number of misclassified instances within that bag, i.e. e_i = \sum_j 1(h_m(x_ij) ≠ y_i) / n_i.
   (c) If e_i < 0.5 for all i's, STOP the iterations and go to Step 3.
   (d) Compute α_m = argmin_{α_m} \sum_i W_i exp[(2 e_i - 1) α_m].
   (e) If α_m ≤ 0, STOP the iterations and go to Step 3.
   (f) Set W_i <- W_i exp[(2 e_i - 1) α_m] and renormalize so that \sum_i W_i = 1.

3. Return sign[\sum_i \sum_m α_m h_m(x_test,i)], where i ranges over the instances x_test,i of the test bag.

Table 4.1: The upgraded MI AdaBoost algorithm.
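A sketch of the procedure in Table 4.1, with decision stumps as the instance-level learner and a bounded numeric line search for α_m; these concrete choices (the stump, the scalar optimizer, and the upper bound of 10 on α) are our assumptions:

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeClassifier

def mi_adaboost(bags, labels, n_rounds=10):
    y = np.asarray(labels) * 2 - 1                 # labels coded as {-1, 1}
    N = len(bags)
    W = np.full(N, 1.0 / N)                        # Step 1: one weight per bag
    models, alphas = [], []
    for _ in range(n_rounds):
        # Step 2a: bag's label and weight W_i/n_i assigned to its instances.
        X = np.vstack(bags)
        y_inst = np.concatenate([[yi] * len(b) for yi, b in zip(y, bags)])
        w_inst = np.concatenate([[Wi / len(b)] * len(b) for Wi, b in zip(W, bags)])
        h = DecisionTreeClassifier(max_depth=1).fit(X, y_inst, sample_weight=w_inst)
        # Step 2b: per-bag error rate e_i.
        e = np.array([np.mean(h.predict(b) != yi) for b, yi in zip(bags, y)])
        if np.all(e < 0.5):                        # Step 2c: every bag correct
            models.append(h); alphas.append(1.0)
            break
        # Step 2d: alpha_m = argmin sum_i W_i * exp[(2 e_i - 1) * alpha].
        obj = lambda a: np.sum(W * np.exp((2 * e - 1) * a))
        alpha = minimize_scalar(obj, bounds=(0, 10), method="bounded").x
        if alpha <= 0:                             # Step 2e
            break
        W = W * np.exp((2 * e - 1) * alpha)        # Step 2f: reweight and
        W /= W.sum()                               # renormalize the bag weights
        models.append(h); alphas.append(alpha)
    return models, alphas

def predict(models, alphas, bag):
    # Step 3: sign of the alpha-weighted vote, averaged over the bag's instances.
    score = sum(a * h.predict(bag).mean() for h, a in zip(models, alphas))
    return int(np.sign(score))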

Linear logistic regression and the quadratic formulation (as in DD) assume a limited family of underlying patterns. There are more flexible single-instance learners, like boosting and support vector machine (SVM) algorithms, that can model larger families of patterns. The general upgrading framework presented above means that, under certain assumptions, one can model a wide range of decision boundary patterns. Here we provide an example of how to upgrade the AdaBoost algorithm into an MI learner based on the collective assumption. Intuitively, AdaBoost could easily be wrapped around an MI algorithm (without changes to the AdaBoost algorithm), but since there are not many weak MI learners available, we are more interested in taking single-instance learners as the base classifiers of the upgraded method.

AdaBoost originated in the computational learning theory domain [Freund and Schapire, 1996], but received a statistical explanation later on [Friedman, Hastie and Tibshirani, 2000]. It can be shown that it aims to minimize an exponential loss function in a forward stagewise manner and ultimately estimates (1/2) log[Pr(Y=1|X)/Pr(Y=-1|X)] (based on an additive model) [Friedman et al., 2000].4 Now, under the collective assumption, we use Equation 4.2, because the underlying single-instance learner (i.e. normal AdaBoost) estimates the log-odds function. (We could also use the standard MI assumption here, but this would necessarily introduce the max function, which makes the optimization much harder.) We first describe the upgraded AdaBoost algorithm in Table 4.1 and then briefly explain its derivation. The notation used in Table 4.1 is as follows: N is the number of bags and we use the subscript i to denote the i-th bag, where i = 1, 2, ..., N. Suppose there are n_i instances within the i-th bag. Then we use the subscript j to refer to the j-th instance, where j = 1, 2, ..., n_i. Therefore x_ij denotes the j-th instance in the i-th bag.

4 Note that in AdaBoost, class labels are coded as {-1, 1} instead of {0, 1}.

The derivation follows exactly the same line of thinking as that in [Friedman et al., 2000]. In the derivation below we regard the expectation sign E as the sample average instead of the population expectation. We are looking for a function over all the bags, F(B), that minimizes E_B E_{Y|B}[exp(-y F(B))], where, given a bag b,

    F(b) = \sum_n F(x_b) / n.        (4.6)

We want to expand F(B) to F(B) + α f(B) in each iteration, with the restriction α > 0. First, given α > 0 and the current bag-level weights W_B = exp(-y F(B)), we search for the best f(B). After a second-order expansion of exp(-α y f(B)) about f(B) = 0, we are seeking the maximum of E_W[y f(B)]. If we had an MI base learner at hand, we could estimate f(B) directly. However, we are interested in wrapping around a single-instance learner; thus we expand f(B) for each bag b according to Equation 4.6: f(b) = \sum_n h(x_b) / n, where h(x_b) ∈ {-1, 1}. Now we are seeking the h(.) that maximizes

    E_W[y h(x_b)/n] = \sum_{i=1}^{N} \sum_{j=1}^{n_i} [ (W_i/n_i) Pr(y=1|b_i) h(x_ij) - (W_i/n_i) Pr(y=-1|b_i) h(x_ij) ]

The solution is

    h(x_ij) = 1   if (W_i/n_i) Pr(y=1|b_i) - (W_i/n_i) Pr(y=-1|b_i) > 0,   and -1 otherwise.

This formula simply means that we are looking for the function $h(\cdot)$ at the instance level such that, given a weight of $W_i/n_i$ for each instance, its value is determined by the probability $Pr(y|b_i)$. Note that this probability is the same for all the instances in the bag. Since the class label of each bag reflects its probability, we can assign the class label of a bag to its instances. Then, with the weights $W_i/n_i$, we use a single-instance learner to provide the value of $h(\cdot)$. This constitutes Step 2a in the algorithm in Table 4.1. Note that in Chapter 3, we proposed a wrapper method to apply normal single-instance learners to MI problems. There, assigning the class label of each bag to the instances pertaining to it is a heuristic, because we minimize the loss function within the underlying (instance-level) learner; since the underlying learners are at the instance level, they require a class label for each instance instead of each bag. That is why it is a heuristic. Here we only use the underlying instance-level learners to estimate $h(\cdot)$, and our objective is to estimate $h(\cdot)$ such that $yh(x_b)/n$ is maximized over all the (weighted) instances, where $y$ is the bag's class label. Therefore assigning the class label of a bag to its instances is actually our aim here, because we are trying to minimize a bag-level loss function. Hence it is not a heuristic or approximation in this method.

Next, if we average $y_i h(x_{ij})$ over the instances of each bag, we get $y_i f(b_i) = 1 - 2e_i$, where $e_i = \sum_j 1_{(h_m(x_{ij}) \neq y_i)}/n_i$; this is Step 2b in the algorithm. Then, given $yf(B) \in [-1, 1]$ (more precisely, $yf(B) \in \{-1, -1 + 1/n, -1 + 2/n, \ldots, 1 - 2/n, 1 - 1/n, 1\}$), we are looking for the best $\alpha > 0$. To do this we can directly optimize the objective function

$$E_B E_{Y|B}[\exp(-y(F(B) + \alpha f(B)))] = \sum_i W_i \exp\Big[-\alpha_m \frac{\sum_j y_i h(x_{ij})}{n_i}\Big] = \sum_i W_i \exp[(2e_i - 1)\alpha_m].$$

This constitutes Step 2d.


This objective function will not have a global minimum if all $e_i < 0.5$. The interpretation is analogous to that for mono-instance AdaBoost: in normal AdaBoost, the objective function has no global minimum if every instance is correctly classified based on the sign of $f(x)$ (i.e. zero misclassification error is achieved). Here we can classify a bag $b$ according to $f(b) = \sum_n h(x_b)/n$. Then $y_i f(b_i) = 1 - 2e_i$ simply means that if the error rates within all bags $e_i$ are less than 0.5, all the bags will be correctly classified, because all the $f(b_i)$ will have the same sign as $y_i$. Thus we have reached zero error and no more boosting can be done, as in normal AdaBoost. We check for this in Step 2c.
The solution to this optimization problem may not have an easy analytical form. However, it is a one-dimensional optimization problem, so the Newton family of optimization techniques can find a solution in super-linear time [Gill, Murray and Wright, 1981], and the computational cost is negligible compared to the time needed to build the weak classifier. Therefore we simply search for $\alpha$ using a Quasi-Newton method in Step 2d.
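For illustration, this one-dimensional line search is easy to reproduce with any off-the-shelf bounded optimizer. The following sketch is not code from the thesis: the function name and the upper bound are assumptions for illustration, and SciPy's bounded scalar minimizer stands in for a Quasi-Newton method. It minimizes the Step 2d objective $\sum_i W_i \exp[(2e_i - 1)\alpha]$:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def find_alpha(bag_errors, bag_weights, max_alpha=50.0):
        """Minimize sum_i W_i * exp((2*e_i - 1) * alpha) over alpha >= 0."""
        e = np.asarray(bag_errors)    # per-bag error rates e_i
        W = np.asarray(bag_weights)   # current bag-level weights W_i

        def objective(alpha):
            return np.sum(W * np.exp((2.0 * e - 1.0) * alpha))

        # max_alpha is an arbitrary safeguard: the objective has no global
        # minimum when all e_i < 0.5, but Step 2c rules that case out.
        result = minimize_scalar(objective, bounds=(0.0, max_alpha),
                                 method='bounded')
        return result.x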

Note that $\alpha$ is not necessarily positive. If it is negative, we can simply reverse the prediction of the weak classifier and obtain a positive $\alpha$ (this can be done automatically in Step 2f). However, we bow to the AdaBoost convention and restrict $\alpha$ to be positive. Thus we check this in Step 2e.


Finally, we update the bag-level weights in Step 2f according to the additive structure of $F(B)$, in the same way as in normal AdaBoost. Note that if a bag has more misclassified instances in it, it gets a higher weight in the next iteration, which is intuitively appealing. Another appealing property of this algorithm is that if there is only one instance per bag, i.e. the data is actually single-instance data, the algorithm naturally degrades to normal AdaBoost. To see why, note that the solution for $\alpha$ in Step 2d will then be exactly $\frac{1}{2}\log\frac{1 - err_w}{err_w}$, where $err_w$ is the weighted error; hence the weight update will also be the same as in AdaBoost. It is easy to see that to classify a test bag, we can simply regard $F(B)$ as the bag-level log-odds function and use Equation 4.2 to make a prediction.
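To make the control flow of Table 4.1 concrete, the following is a minimal sketch of the whole training loop and of the bag-level prediction rule. It is not the thesis implementation: we assume bags are given as NumPy arrays, labels are coded {-1, +1}, a scikit-learn decision stump serves as the weak learner, and the find_alpha helper sketched earlier performs the Step 2d line search.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_mi_adaboost(bags, bag_labels, n_iter=50):
        """bags: list of (n_i, d) arrays; bag_labels: array of -1/+1 labels."""
        N = len(bags)
        bag_labels = np.asarray(bag_labels)
        W = np.full(N, 1.0 / N)                       # Step 1: uniform bag weights
        X = np.vstack(bags)                           # all instances, stacked
        bag_idx = np.concatenate([np.full(len(b), i) for i, b in enumerate(bags)])
        y_inst = bag_labels[bag_idx]                  # bag label copied to instances
        classifiers, alphas = [], []
        for _ in range(n_iter):                       # Step 2
            # Step 2a: instance weights W_i / n_i, then an instance-level model
            w_inst = W[bag_idx] / np.bincount(bag_idx)[bag_idx]
            h = DecisionTreeClassifier(max_depth=1)
            h.fit(X, y_inst, sample_weight=w_inst)
            pred = h.predict(X)
            # Step 2b: within-bag error rates e_i
            e = np.array([np.mean(pred[bag_idx == i] != bag_labels[i])
                          for i in range(N)])
            if np.all(e < 0.5):                       # Step 2c: every bag correct
                break
            alpha = find_alpha(e, W)                  # Step 2d: line search
            if alpha <= 0:                            # Step 2e
                break
            classifiers.append(h)
            alphas.append(alpha)
            W = W * np.exp((2.0 * e - 1.0) * alpha)   # Step 2f: reweight bags
            W /= W.sum()
        return classifiers, alphas

    def predict_bag(classifiers, alphas, bag):
        """Step 3: sign of the alpha-weighted votes over the bag's instances."""
        votes = sum(a * h.predict(bag) for h, a in zip(classifiers, alphas))
        return int(np.sign(np.sum(votes)))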

Figure 4.2: Parameter estimation of MILogisticRegressionGEOM on the artificial data (estimated coefficients 1 and 2 and the estimated intercept, plotted against the number of exemplars).

Figure 4.3: Test error of MILogisticRegressionGEOM and the MI AdaBoost algorithm on the artificial data, together with the test error of the model with the true parameters, plotted against the number of training exemplars.

4.4 Property Analysis and Regularization Techniques


In this section we first show some properties of the upgraded MI learners. We use the artificial data generated according to the scheme in Section 4.2 to show some asymptotic properties of the probability estimates of an MI logistic regression algorithm. Next we show the test error of the MI AdaBoost algorithm on the artificial data to see whether it can classify MI data well. Finally, since the number of iterations is usually a property of interest in boosting algorithms, we plot this property of our MI AdaBoost algorithm against the training and test errors on the Musk1 dataset, and observe that its behavior is very similar to that of its single-instance predecessor. After analyzing these properties, we introduce some regularization methods for the above algorithms in order to increase their generalization power on practical datasets.

Because we know that the artificial data described in Section 4.2 is generated using a linear logistic model based on the normalized geometric average formulation from Equation 4.2, MILogisticRegressionGEOM is the natural candidate to deal with this specific MI dataset. Figure 4.2 shows the parameters estimated by this method as the number of bags increases. It can be seen that the estimates converge to the true parameters asymptotically: the more training bags, the more stable the estimates. This is a consequence of the fact that the class probability estimates of this algorithm are consistent (or asymptotically unbiased) by virtue of the maximum likelihood method.

The consistency of the MLE can be proven under fairly general conditions [Stuart, Ord and Arnold, 1999]. For this specific artificial dataset we used a triangle distribution for the density of $X$, as explained in Section 4.2. However, we observed that the exact form of this density does not matter. This is reasonable because our model only assumes random sampling from the distribution corresponding to a bag; the exact form of $Pr(X)$ is irrelevant.

Figure 4.4: Error of the MI AdaBoost algorithm on the Musk1 data (training error, training RRSE and test error against the number of boosting iterations).
MILogisticRegressionGEOM successfully recovers the instance-level class probability function (and the underlying assumption also holds for the test data); consequently we expect it to achieve the optimal error rate on this data. As explained before, the MI AdaBoost algorithm is also based on Equation 4.2, so it too is expected to perform well in this case. To test this, we first generated an independent test dataset of 10,000 bags, and then generated training data (with different random seeds) for different numbers of training bags. Decision stumps were used as the weak learner and the maximal number of boosting iterations was set to 30. The test error of both methods against the number of training bags is plotted in Figure 4.3. As the number of training bags increases, the error rate approaches the best achievable error rate, and eventually both methods achieve close to optimal performance. However, unlike MILogisticRegressionGEOM, MI AdaBoost cannot achieve the exact best error rate, mainly because in this case we are approximating a line (the decision boundary) using axis-parallel rectangles, based only on a finite amount of data.
Finally, people are often interested in how many iterations are needed in boosting algorithms, and this is also an issue for our MI AdaBoost algorithm. Cross-validation (CV) is a common way to find the best number of iterations. It is known that even after the training error ceases to decrease, it is still worthwhile to continue boosting in order to increase the confidence (or margin) of the classifier [Witten and Frank, 1999]. In the two-class case (as in the Musk datasets), increasing the margin is equivalent to reducing the estimated root relative squared error (RRSE) of the probability estimates. It turns out that this statement also holds for MI AdaBoost. As an example, we show the results of MI AdaBoost on the Musk1 dataset in Figure 4.4. The training error is reduced to zero after about 100 iterations, but the RRSE keeps decreasing until around 800 iterations. The test error, averaged over 10 runs of 10-fold CV, also reaches a minimum at around 800 iterations, namely 10.44% (standard deviation: 2.62%). After 800 iterations, the RRSE does not seem to improve any further and the test error rises due to overfitting. The error rate is quite low for the Musk1 dataset, as will be shown in Section 4.5. However, boosting on decision stumps is too slow for the Musk2 dataset: we observed that the RRSE does not settle down even after 8000 iterations. Therefore it is computationally too expensive to be used in practice, and we instead present results based on regularized decision trees as the weak classifier.
Regularization is commonly used in single-instance learning. It introduces a bias into the probability estimates in order to increase generalization performance by reducing the chance of overfitting. It works because the underlying assumptions of a learner rarely hold perfectly in practice. In MI learning, regularization can be naturally inherited from the corresponding single-instance learner if we upgrade one. For example, in the MI support vector machine (SVM) [Gärtner et al., 2002], there is a model complexity parameter that controls the amount of regularization. For non-standard methods like the APR algorithms, other techniques, such as kernel density estimation (KDE) [Dietterich et al., 1997], are adopted to achieve the effect of regularization. We can also impose regularization in our methods, as follows.
For single-instance linear logistic regression, a shrinkage method was proposed in [le Cessie and van Houwelingen, 1992]. We adopt this ridge regression method here. Thus we simply add an L2 penalty term $-\lambda\|\beta\|^2$ to the log-likelihood function in Equation 4.5, where $\lambda$ is the ridge parameter. Note that since ridged logistic regression is not invariant to the scale of each dimension, we need to standardize the data to zero mean and unit standard deviation before fitting the model, and transform the model back after fitting [Hastie et al., 2001]. The mean and standard deviation used for the standardization in our MI logistic regression methods are estimated from the weighted instance data, with each instance weighted by the inverse of the number of instances in the corresponding bag. This is done because intuitively we would like every bag to have equal weight. Note that unlike some current MI methods that change the data, we do not pre-process the data: the standardization is simply part of the algorithm.
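This bag-weighted standardization takes only a few lines of code. The sketch below (hypothetical names, assuming NumPy) weights each instance by $1/n_i$ so that every bag contributes equally to the estimated mean and standard deviation:

    import numpy as np

    def bag_weighted_standardize(bags):
        """Standardize instances using bag-weighted mean/std estimates."""
        X = np.vstack(bags)
        # each instance gets weight 1/n_i, so every bag has total weight 1
        w = np.concatenate([np.full(len(b), 1.0 / len(b)) for b in bags])
        mean = np.average(X, axis=0, weights=w)
        std = np.sqrt(np.average((X - mean) ** 2, axis=0, weights=w))
        return [(b - mean) / std for b in bags], (mean, std)

The returned (mean, std) pair is what would be used to transform the fitted model back to the original scale.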
In boosting with decision trees, both the tree size and the number of iterations determine the degrees of freedom, and there are several ways to regularize them in single-instance learning [Hastie et al., 2001]. Since we use C4.5 [Quinlan, 1993], which does not have an explicit option to specify how many nodes a tree may have, we use an alternative way to shrink the tree size: we specify a larger minimal number of (weighted) instances per leaf node (the default setting in C4.5 is two). Together with a restriction on the number of iterations, this achieves a rather coarse form of regularization in our MI AdaBoost algorithm. By enlarging the minimal number of instances per leaf, and in turn shrinking the tree size, we effectively make the tree learner weaker. We will only show experimental results for the MI AdaBoost algorithm based on these regularized trees.

4.5 Experimental Results


In this section we present experimental results for our MI algorithms on the Musk benchmark datasets [Dietterich et al., 1997]. As already discussed in Chapter 1, there are two overlapping datasets describing the musk activity prediction problem, namely the Musk1 and Musk2 data. Some of the properties of the datasets are summarized in Table 4.2.

                               Musk 1    Musk 2
    Number of bags                 92       102
    Number of attributes          166       166
    Number of instances           476      6598
    Number of positive bags        47        39
    Number of negative bags        45        63
    Average bag size             5.17     64.69
    Median bag size                 4        12
    Minimum bag size                2         1
    Maximum bag size               40      1044

Table 4.2: Properties of the Musk 1 and Musk 2 datasets.


Table 4.3 shows the performance of related MI methods on the Musk datasets, as well as that of our methods developed in this chapter. The evaluation is either by leave-one-out (LOO) or by 10-fold cross-validation (CV) at the bag level. In the first part we include some of the current solutions to MI learning problems. All the methods that depend on the "APR-like pattern and MI assumption" combination are shown. They are: the best of the APR algorithms, iterated-discrim APR with KDE [Dietterich et al., 1997], the Diverse Density algorithm [Maron, 1998]^5 and the MULTINST algorithm [Auer, 1997]. Although MI Neural Networks [Ramon and Raedt, 2000] do not model the APR-like pattern exactly, they depend heavily on the MI assumption and follow a line of thinking very similar to the above methods; the pattern that they model really depends on the complexity of the networks that are built. Thus we also include them. It can be seen that the models based on the "linear pattern and collective assumption" combination can also achieve comparably good results on the Musk datasets. Therefore we believe that in practice what really matters is the right combination of the decision boundary pattern and the assumption. Simple patterns and assumptions together may be sufficient.

^5 Note that we do not include the EM-DD algorithm [Zhang and Goldman, 2002] here, even though it was reported to have the best performance on the Musk datasets. We do so for two reasons: (1) there were some errors in the evaluation process of EM-DD; (2) it can be shown via some theoretical analysis and an artificial counter-example (Appendix D) that EM-DD cannot find a maximum likelihood estimate (MLE) in general, due to the (tricky) likelihood function it aims to optimize. Since the DD algorithm is a maximum likelihood method, the solution that EM-DD finds cannot be trusted if it fails to find the MLE.
    Methods                            Musk 1                  Musk 2
                                     LOO     10CV            LOO     10CV
    iterated-discrim APR with KDE     -       7.6             -      10.8
    maxDD                             -      11.1             -      17.5
    MULTINST                          -      23.3             -      16.0
    MI Neural Networks^a              -      12.0             -      18.0
    SVM with the MI kernel          13.0    13.6±1.1         7.8    12.0±1.0
    Citation-kNN                     7.6      -             13.7      -
    MILogisticRegressionGEOM       13.04   14.13±2.23      17.65   17.74±1.17
    MILogisticRegressionARITH      10.87   13.26±1.83      16.67   15.88±1.29
    MI AdaBoost with 50 iterations 10.87   12.07±1.95      15.69   15.98±1.31

^a It was not clear which evaluation method the MI Neural Networks used. We put the results into the 10CV columns just for convenience.

Table 4.3: Error rate estimates from either 10 runs of stratified 10-fold cross-validation or leave-one-out evaluation. The standard deviation of the estimates (if available) is also shown.
There are many methods that aim to upgrade single-instance learners to deal with MI data, as described in Chapter 2. Although they are definitely not in the same category as the methods in the first part, and their assumptions and generative models are not totally clear, we include some of them in the second part. Because there are too many to list here, we simply chose the ones with the best performance on the Musk datasets: the SVM with the MI kernel [Gärtner et al., 2002] and an MI k-nearest-neighbour algorithm, Citation-kNN [Wang and Zucker, 2000]. For the SVM with the MI kernel, the evaluation is via leave-10(bags)-out, which is similar to 10-fold CV because the total number of bags in either dataset is close to 100; thus we put its results into the 10CV columns.
Finally, in the third part, we present results for the methods developed in this chapter. Since we introduced regularization, as also used by some other methods like the SVM, there is the possibility of tuning the regularization parameters. However, we avoided extensive parameter tuning. In both MI logistic regression methods, we used a small, fixed regularization parameter $\lambda = 2$. In our MI AdaBoost algorithm, we restricted ourselves to 50 iterations to control the degrees of freedom. The base classifier used is an unpruned C4.5 decision tree that is nonetheless not fully expanded (as explained above): instead of using the default setting of a minimum of 2 instances per leaf, we set it to 2 bags per leaf. More specifically, we used the average number of instances per bag shown in Table 4.2 as the size of one bag, and thus used 10 (instances) for Musk1 and 120 (instances) for Musk2.
As can be seen, the performance of our methods is comparable to some of the best methods in MI learning, and compares favorably to most of the "APR-like pattern + MI assumption" family. More importantly, since the generative models and the assumptions of our methods are totally clear, it is easy to assess their applicability when new MI data becomes available. On the Musk datasets, logistic regression based on the arithmetic average of the instances' class probabilities seems to perform better than the version based on the normalized geometric average; thus the assumptions of the former method could be more realistic in this case. Finally, the overall performance of our methods is also comparable to that of the wrapper method described in Chapter 3, although slightly worse on the Musk2 data. Even though the methods in this chapter are exact solutions for the underlying generative model, they do not seem superior to the heuristic methods with their biased probability estimates. This is especially true when the assumptions of the heuristic hold reasonably well in reality, which seems to be the case for the Musk datasets.

4.6 Related Work


The Diverse Density (DD) algorithm [Maron, 1998] is the only probabilistic model for MI learning in the literature, and hence it is the work most closely related to ours. DD also models $Pr(Y|B)$ and $Pr(Y|X)$, and [Maron, 1998] proposed two ways to model the relationship between them, namely the noisy-or model and the most-likely-cause model. Both were aimed at fitting the MI assumption. The noisy-or model regards the process of determining a bag's class probability (i.e. $Pr(Y = 0, 1|B)$) as a Bernoulli process with $n$ independent stages, where $n$ is the number of instances in the bag. In each stage, one decides the class label using the probability $Pr(y|x)$ of an (unused) instance in the bag. Roughly speaking, a bag will be labeled positive if one sees at least one positive class label in this process, and negative otherwise. The most-likely-cause model regards the above process as having only one stage; it thus picks only one instance per bag. The selection of the instance is parametric, or model-based: according to the radial formulation of $Pr(Y|X)$, it uses the instance with the maximal probability $\max_{x \in b}\{Pr(y = 1|x)\}$ in a positive example, and the instance with the minimal probability $\min_{x \in b}\{Pr(y = 0|x)\}$ in a negative example, as the representative of the corresponding bag.
Indeed, one can recognize the similarity of our methods to DD, especially the linear logistic regression method: we simply take out the radial formulation and the noisy-or/most-likely-cause model, and plug in the linear logistic formulation and the geometric/arithmetic average model. It is natural to also try the Gaussian-like formulation together with the collective assumption (i.e. the geometric/arithmetic average model), or the linear + MI assumption (i.e. noisy-or) combination. However, these combinations do not work well on the Musk datasets according to our experiments. On the Musk1 dataset, for example, the Gaussian-like + collective assumption (arithmetic average of probabilities) combination has a 10x10-fold CV error rate of 19.02% ± 3.41%, and the linear + MI assumption (noisy-or) combination has 20.00% ± 1.47% with the ridge parameter set to $\lambda = 12$ (the high value of the ridge parameter may already indicate the unsuitability of the linear model).
As a matter of fact, the whole family of "APR-like pattern + MI assumption"-based methods is related to this chapter, because their rationale is the same as that of DD. The current methods that upgrade single-instance learners (e.g. the MI decision tree learner RELIC [Ruffo, 2001], MI nearest neighbour algorithms [Wang and Zucker, 2000], the MI decision rule learner NaiveRipperMI [Chevaleyre and Zucker, 2001] and the SVM with an MI kernel and a polynomial minimax kernel [Gärtner et al., 2002]), on the other hand, are not so closely related to the framework presented here, because their line of thinking is quite different. The SVM with the MI kernel [Gärtner et al., 2002] may also depend on the collective assumption, because the MI kernel essentially uses every instance within a bag (with equal weights). The assumptions of the other methods are not very clear, because these methods are purely heuristic and therefore hard to analyze.
Finally, we propose an improvement on DD. The radial (or Gaussian-like) formulation is not convenient for optimization because it induces too many local maxima in the log-likelihood function. If one really believes the quadratic formulation is reasonable, we suggest a linear logistic model with polynomial fitting of order 2 (i.e. adding quadratic terms). Together with the noisy-or model (or the most-likely-cause model), this would enable us to search only for local maxima of the log-likelihood. If we were to use every possible term in a polynomial expansion, we would have too many parameters to estimate, especially for the Musk datasets, in which we already have many attributes in the original instance space. To overcome this difficulty, we can assume no interactions between any two attributes, which is effectively what DD does. By deleting all interaction terms in the quadratic expansion, we only have twice the number of attributes' coefficients (a quadratic term and a linear term for each attribute) plus an intercept to estimate (a small sketch of this expansion is given after the list below). The number of parameters is roughly the same as in DD. However, this model has three major improvements over DD:

1. The optimization is now a local optimization problem, as mentioned before, which is much easier. Consequently, the computational cost is greatly reduced. The running time of this model will be very similar to that of the MI logistic regression methods, which according to our observations is very short. We believe it is even faster than EM-DD, regardless of EM-DD's validity (because EM-DD still needs many optimization trials with multiple starts), whereas we can ensure the correctness of the MLE solutions in this model;

2. The sigmoid function in the logistic model tends to make the log-likelihood function more gentle and smooth than the exponential function used in the radial model proposed by DD. As a matter of fact, the log-likelihood function in DD is discontinuous (the discontinuity occurs whenever the point variable's value equals the attribute value of an instance in a negative bag) due to the radial form of the probability formulation. Besides, the exponential function in the probability formulation only allows DD to deal with data values around one, which means that we need to pre-process the data on most occasions. We can avoid this inconvenience in the logistic model. Despite the functional differences, the effects of the two models are virtually the same. To see why, note that in our model the log-odds function can always be arranged into a form that describes an axis-parallel hyper-ellipse (APE) plus an intercept (bias) term; hence the instance-level decision boundaries of both models are exactly the same. But the gentleness of the sigmoid function in the logistic model (and the intercept term) may improve the prediction;

3. Since there are many mature techniques that can be directly applied to the linear logistic model, like regularization or deviance-based feature selection, we can apply them directly to the new model. For example, ridge methods may further improve the performance if the instance-level decision boundary cannot be modeled well by an APE.
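The feature expansion proposed above is trivial to realize; here is a small sketch (hypothetical name, assuming NumPy):

    import numpy as np

    def add_quadratic_terms(X):
        """Append x_k^2 for every attribute x_k, with no interaction terms.

        A d-dimensional instance space becomes 2d-dimensional, so a linear
        logistic model fitted to the result has 2d coefficients plus an
        intercept -- roughly the same number of parameters as DD."""
        X = np.asarray(X)
        return np.hstack([X, X ** 2])

A linear logistic model fitted to the expanded instances then has a log-odds function of the axis-parallel hyper-ellipse form discussed in improvement 2.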

4.7 Conclusions
This chapter described a general framework for upgrading single-instance learners to deal with MI problems. Typically we require that the single-instance learner models the posterior probability $Pr(Y|X)$ or a transformation of it. Then we can construct a bag-level loss function based on some assumptions. By minimizing the expected loss function at the bag level we can recover the instance-level probability function $Pr(Y|X)$. The prediction is based on the recovered function $Pr(Y|X)$ and the assumption used. This framework is quite general and justified thanks to the strong theoretical basis of most single-instance learners. It also incorporates background knowledge (through the assumptions involved) and could serve as general-purpose guidance for solving MI problems.
Within this framework, we upgraded linear logistic regression and AdaBoost based on the collective assumption. We have also shown that these methods, together with mild regularization, perform quite well on the Musk benchmark datasets.

For group-conditional single-instance learners that estimate the density $Pr(X|Y)$ and then transform it to $Pr(Y|X)$ via Bayes' rule (like naive Bayes or discriminant analysis), there appears to be no easy way to upgrade them directly. However, in the next chapter we explore some methods that follow the same line of thinking and are very similar to the group-conditional methods.

Chapter 5
Learning with Two-Level
Distributions

Multiple instance learning problems are commonly tackled in a point-conditional manner, i.e. the algorithms aim to directly model a function that, given a bag of $n$ instances $x_1, \ldots, x_n$, outputs a group (or class) label $y$. No methods developed so far build a model in a group-conditional manner. In this chapter we develop a group-conditional method, the "two-level distribution" approach, to handle MI data. Like the other approaches presented in this thesis, it essentially discards the standard MI assumption. However, in contrast to the other approaches, it derives a distributional property for each bag of instances. We describe its generative model and demonstrate on artificial data that it yields an asymptotically unbiased classifier. In spite of its simplicity, the empirical results of this approach on the Musk benchmark datasets are surprisingly good. Finally, we show the relationship of this approach to some single-instance group-conditional learners. We also uncover its relationship with empirical Bayes, an old statistical method, within a classification context, and the relationship of MI learning with meta-analysis.
5.1 Introduction
Many special-purpose MI algorithms can be found in the literature. However, we observed that all these methods aim to directly model a function $f(y|x_1, \ldots, x_n)$, where $y$ is the group (class) label of a bag of $n$ instances and $x_1, \ldots, x_n$ are the instances. From a statistical point of view $f(\cdot)$ is necessarily a probability function $Pr(\cdot)$, although there can be other interpretations. We refer to this approach as a point-conditional approach. In normal single-instance learning, there is a category of popular methods that model group-conditional probability distributions and then transform them into class probabilities using Bayes' rule. This category includes discriminant analysis [McLachlan, 1992], naive Bayes [John and Langley, 1995] and kernel density estimation [Hastie et al., 2001]. It is thus natural to ask whether it is possible to develop a group-conditional approach for MI algorithms.

In this chapter we present such a group-conditional approach, called the two-level distribution (TLD) approach. The underlying idea is simple: we extract distributional properties from each bag of instances for each class and try to discriminate between the classes (or groups) according to these distributional properties. Note that this approach essentially discards the standard MI assumption, because there is no instance selection within a bag. Instead, by deriving the distributional properties of a bag, we imply the collective assumption that was proposed earlier in this thesis.
This chapter is organized as follows. Section 5.2 describes the TLD approach in detail; it turns out that this approach is equivalent to extracting low-order sufficient statistics from each bag, which puts it into a broader framework. The underlying generative model of the approach is presented in Section 5.3, where we also show that the approach is asymptotically unbiased if the generative model is true. In Section 5.4 we show the relationship of the TLD approach to normal group-conditional single-instance learners: when assuming independence between attributes, one of our methods looks very similar to naive Bayes, and we thus present an (approximate) upgrade of naive Bayes to MI learning. In Section 5.5 we show experimental results for the TLD approach on the Musk benchmark datasets. Section 5.6 draws the connection to empirical Bayes (EB) methods in a classification context; it turns out that the TLD methods follow exactly the EB line of thinking. As an old statistical method, EB has many fielded applications in practice, especially in the field of meta-analysis in medical research; such applications may draw interest to the MI domain. Finally, Section 5.7 concludes this chapter.

5.2 The TLD Approach


First, let us consider normal single-instance learning. The essence of the group-conditional methods is to derive distributional properties within each group (plus the group priors), so that we can determine the frequency of a point in the instance space for each group. We classify a point according to the most frequently occurring group (class) label. We follow the same line of thinking in the multi-instance (MI) case. Conditional on the group, we first derive distributional properties for each bag, that is, on the instance level. Because the distributional properties differ from bag to bag even within one class, we need a second-level distribution to relate the instance-level distributions to one another. We refer to this second level as the "bag-level" distribution. That is why we call this approach the "two-level distribution" (TLD) approach.

The obvious question is how to derive distributions over distributions. Since commonly used distributions are usually parameterized by some parameters, we can think of these parameters as random and governed by some hyper-distribution. This is essentially a Bayesian perspective [O'Hagan, 1994]. These hyper-distributions are themselves parameterized by some hyper-parameters. In a full Bayesian approach, we would cast further distributions on the hyper-parameters. In this chapter, we simply regard the hyper-parameters as fixed and assume that they can be estimated from the data. Therefore our task is simply to estimate the hyper-parameters for each group (class). We show the formal derivation of how to estimate them in the following.
First, we introduce some notation. We denote the $j$th bag as $b_j$ for brevity. Formally, if $b_j$ has $n_j$ instances, then $b_j = \{x_{j1}, \ldots, x_{jk}, \ldots, x_{jn_j}\}$; $Y$ denotes the class variable. Then, given a class label $Y = y$ (in the two-class case $y = 0, 1$) and a bag $b_j$, we have the distribution $Pr(b_j|Y)$ for each class, which is parameterized with a fixed bag-level parameter $\delta^y$ (hence we simply write $Pr(b_j|Y)$ as $Pr(b_j|\delta^y)$). We estimate $\delta^y$ using the maximum likelihood method:

$$\delta^y_{MLE} = \mathrm{argmax}_{\delta^y}\, L = \mathrm{argmax}_{\delta^y} \prod_j Pr(b_j|\delta^y).$$

Here $L$ is the likelihood function $\prod_j Pr(b_j|\delta^y)$. Now, the instances in bag $b_j$ are not directly related to $\delta^y$, as discussed before. Instead, the instances $x_{jk}$ are governed by an instance-level distribution parameterized by a parameter vector $\theta$, which is in turn governed by a distribution parameterized by $\delta^y$. Since $\theta$ is a random variable, we integrate it out in $L$. Mathematically,

$$L = \prod_j Pr(b_j|\delta^y) = \prod_j \int Pr(b_j, \theta|\delta^y)\,d\theta = \prod_j \int Pr(b_j|\theta, \delta^y)\,Pr(\theta|\delta^y)\,d\theta$$

and, assuming conditional independence of $b_j$ and $\delta^y$ given $\theta$,

$$L = \prod_j \int Pr(b_j|\theta)\,Pr(\theta|\delta^y)\,d\theta. \qquad (5.1)$$

In Equation 5.1 we effectively marginalize out $\theta$ in order to relate the observations (i.e. the instances) to the bag-level parameter $\delta^y$. Now, assuming that the instances within a bag are independent and identically distributed (i.i.d.) according to a distribution parameterized by $\theta$, $Pr(b_j|\theta) = \prod_i^{n_j} Pr(x_{ji}|\theta)$, where $x_{ji}$ denotes the $i$th instance in the $j$th bag.

The complexity of the calculus would have stopped us here had we not assumed
independence between attributes. Thus, like naive Bayes, we assume independent attributes and reduce the integral in Equation 5.1 to several one-dimensional integrals (one for each attribute), which are much easier to solve. Now, assuming $m$ dimensions and $e$ bags (for one class), the likelihood function $L$ becomes

$$L = \prod_{j=1}^{e} \int\cdots\int \Big(\prod_{k=1}^{m}\Big[\prod_{i=1}^{n_j} Pr(x_{jki}|\theta_k)\Big]Pr(\theta_k|\delta_k^y)\Big)\,d\theta_1 \cdots d\theta_m = \prod_{j=1}^{e}\prod_{k=1}^{m} \int\Big[\prod_{i=1}^{n_j} Pr(x_{jki}|\theta_k)\Big]Pr(\theta_k|\delta_k^y)\,d\theta_k = \prod_{j=1}^{e}\prod_{k=1}^{m} B_{jk} \qquad (5.2)$$

where $x_{jki}$ denotes the value of the $k$th dimension of the $i$th instance in the $j$th exemplar, $\theta_k$ and $\delta_k$ are the parameters for the $k$th dimension, and

$$B_{jk} = \int\Big[\prod_{i=1}^{n_j} Pr(x_{jki}|\theta_k)\Big]Pr(\theta_k|\delta_k^y)\,d\theta_k.$$

There are many options for modeling $Pr(x_{jki}|\theta_k)$ and $Pr(\theta_k|\delta_k^y)$. Here we model $Pr(x_{jki}|\theta_k)$ as a Gaussian with parameters $\mu_k$ and $\sigma_k^2$:

$$\prod_{i=1}^{n_j} Pr(x_{jki}|\theta_k) = \prod_{i=1}^{n_j} Pr(x_{jki}|\mu_k, \sigma_k^2) = (2\pi\sigma_k^2)^{-n_j/2}\exp\Big[-\frac{S_{jk}^2 + n_j(\bar{x}_{jk} - \mu_k)^2}{2\sigma_k^2}\Big] \qquad (5.3)$$

where $\bar{x}_{jk} = \sum_{i=1}^{n_j} x_{jki}/n_j$ and $S_{jk}^2 = \sum_{i=1}^{n_j}(x_{jki} - \bar{x}_{jk})^2$. As is usually done in Bayesian statistics [O'Hagan, 1994], we conveniently model $Pr(\theta_k|\delta_k^y)$ as the corresponding natural conjugate form of the Gaussian distribution. The natural conjugate has four parameters, $a_k$, $b_k$, $w_k$ and $m_k$, and is given by:

$$Pr(\theta_k|\delta_k^y) = g(a_k, b_k, w_k)\,(\sigma_k^2)^{-\frac{b_k+3}{2}}\exp\Big[-\frac{a_k + \frac{(\mu_k - m_k)^2}{w_k}}{2\sigma_k^2}\Big] \qquad (5.4)$$

where

$$g(a_k, b_k, w_k) = \frac{a_k^{b_k/2}\,2^{-\frac{b_k+1}{2}}\,\pi^{-1/2}}{\sqrt{w_k}\,\Gamma(b_k/2)}.$$
Taking a closer look at the natural conjugate prior in Equation 5.4, it is straightforward to see that $\mu_k$ follows a normal distribution with mean $m_k$ and variance $w_k\sigma_k^2$, and that $\frac{a_k}{\sigma_k^2}$ follows a Chi-squared distribution with $b_k$ degrees of freedom (d.f.). It is Chi-squared because $\sigma_k^2$ is positive, and so is $\frac{a_k}{\sigma_k^2}$ given that $a_k > 0$. In the Bayesian literature [O'Hagan, 1994], $\sigma_k^2$ is said to follow an Inverse-Gamma distribution. The natural conjugate prior is quite reasonable, but the fact that the variance of $\mu_k$ is a multiple ($w_k$) of $\sigma_k^2$ is unrealistic and hard to interpret, although it brings considerable convenience to the calculus (in fact, we could not find a simple analytical form of the likelihood function without this dependency). We will drop this dependency later on when we simplify the model. Plugging Equation 5.3 and Equation 5.4 into the middle part of the likelihood function in Equation 5.2, we get $B_{jk}$:
the middle part of the likelihood function in Equation 5.2 we get Bjk :

Bjk =
=

Z hY
nj

P r(xjkijk ) P r(k jky ) dk

i=1
Z +1Z +1 (
0

(2k2 )

nj =2 exp

g (ak ; bk ; wk )(k2 )

2 + n (x
Sjk
j jk
2 2

bk +3

exp

k )2 i

(5.5)
2 i)

ak + (k wmk k )
2k2

dk dk2

The integration is easy to carry out due to the form of the natural conjugate prior, resulting in (the details of the calculation are given in Appendix C):

$$B_{jk} = \frac{a_k^{b_k/2}\,(1 + n_j w_k)^{(b_k + n_j - 1)/2}\,\Gamma\big(\frac{b_k + n_j}{2}\big)}{\big[(1 + n_j w_k)(a_k + S_{jk}^2) + n_j(\bar{x}_{jk} - m_k)^2\big]^{\frac{b_k + n_j}{2}}\,\pi^{n_j/2}\,\Gamma\big(\frac{b_k}{2}\big)} \qquad (5.6)$$

Thus, the log-likelihood is

$$LL = \log\Big[\prod_{j=1}^{e}\prod_{k=1}^{m} B_{jk}\Big] = \sum_{k=1}^{m}\sum_{j=1}^{e} \log B_{jk} \qquad (5.7)$$

where $B_{jk}$ is given in Equation 5.6 above. We maximize the log-likelihood $LL$ in Equation 5.7 using a constrained optimization procedure to obtain $\delta^y_{MLE}$: four parameters per attribute. Note that $LL$ only involves the sample mean $\bar{x}_{jk}$ and the sum of squared errors $S_{jk}^2$; thus it turns out that this method extracts low-order sufficient statistics, a kind of metadata, from each bag in order to estimate the group-conditional hyper-parameters.
When a new bag is encountered, we simply extract these statistics from the bag. We then compute the log-odds

$$\log\frac{Pr(Y=1|b_{test})}{Pr(Y=0|b_{test})} = \log\frac{Pr(b_{test}|\delta^1_{MLE})\,Pr(Y=1)}{Pr(b_{test}|\delta^0_{MLE})\,Pr(Y=0)}. \qquad (5.8)$$

We estimate the prior $Pr(Y)$ from the number of bags of each class in the training data and classify $b_{test}$ according to whether the log-odds value is greater than zero.
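For illustration, $\log B_{jk}$ from Equation 5.6 can be evaluated directly from the per-bag sufficient statistics. The sketch below (hypothetical names; scipy.special.gammaln is the log-Gamma function) computes the negative log-likelihood of Equation 5.7 for one attribute of one class, in a form that could be handed to a constrained optimizer such as scipy.optimize.minimize with the bounds $a_k, b_k, w_k > 0$:

    import numpy as np
    from scipy.special import gammaln

    def neg_log_likelihood(params, xbar, ssq, n):
        """Negative log-likelihood (Eq. 5.7) for one attribute of one class.

        params = (a, b, w, m); xbar, ssq and n are arrays holding each bag's
        sample mean, sum of squared errors and number of instances."""
        a, b, w, m = params
        log_B = (0.5 * b * np.log(a)
                 + 0.5 * (b + n - 1.0) * np.log(1.0 + n * w)
                 + gammaln(0.5 * (b + n))
                 - 0.5 * (b + n) * np.log((1.0 + n * w) * (a + ssq)
                                          + n * (xbar - m) ** 2)
                 - 0.5 * n * np.log(np.pi)
                 - gammaln(0.5 * b))
        return -np.sum(log_B)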
In spite of its sound theoretical basis, the above method does not perform well on the Musk datasets, mainly due to the unrealistic dependency of the variance of $\mu_k$ on $\sigma_k^2$ in the natural conjugate, as well as the restrictive assumptions of an Inverse-Gamma distribution for $\sigma^2$ and a Gaussian for the instances within a bag. However, regarding this as a metadata-based approach allows us to simplify it in order to drop some of the assumptions. Here we make two simplifications.
The first simplification stems from Equation 5.3. It is relatively straightforward to see that the likelihood within each bag is equivalent to the product of the sampling distributions of the two statistics $\bar{x}_{jk}$ and $S_{jk}$.^1 If we think $\bar{x}_{jk}$ is sufficient for the classification task, we can simply drop the second-order statistic $S_{jk}$. By dropping $S_{jk}$ we can generalize this method to virtually any distribution within a bag because, according to the central limit theorem, $\bar{x}_{jk} \sim N(\mu_k, \frac{\sigma_k^2}{n_j})$ no matter how the $x_{jki}$'s are distributed. The second simplification is that we no longer model $\sigma_k^2$ as drawn from an Inverse-Gamma distribution as in Equation 5.4. Instead we regard it as fixed and estimate $\hat\sigma_k^2$ directly from the data. Accordingly, we no longer need to estimate $a_k$ and $b_k$. To estimate $\sigma_k^2$, we give a weight of $\frac{1}{n}$ to each instance, where $n$ is the number of instances in the corresponding bag (because intuitively we regard a bag as one object and should thus assign each bag the same weight). Based on the weighted data, an unbiased estimate of $\sigma_k^2$ is

$$\hat\sigma_k^2 = \frac{\sum_j \frac{1}{n_j}\sum_i (x_{jki} - \bar{x}_{jk})^2}{e - \sum_j \frac{1}{n_j}}$$

where $e$ is the number of bags.

^1 For a Gaussian distribution, the sampling distributions are $\bar{x}_{jk} \sim N(\mu_k, \frac{\sigma_k^2}{n_j})$ and $\frac{S_{jk}^2}{\sigma_k^2} \sim \chi^2(n_j - 1)$. Multiplying these two sampling distributions gives the same form as in Equation 5.3, differing only by a constant.
With these two simplifications, the dependency in the natural conjugate prior disappears. We only have a Gaussian-Gaussian model within the integral in Equation 5.1, which makes the calculus extremely simple. The resulting formula is again a Gaussian:

$$B_{jk} = \Big[2\pi\,\frac{w_k n_j + \sigma_k^2}{n_j}\Big]^{-1/2}\exp\Big[-\frac{n_j(\bar{x}_{jk} - m_k)^2}{2(w_k n_j + \sigma_k^2)}\Big] \qquad (5.9)$$

Note that $w_k$ in the first TLD method denotes the ratio of the variance of $\mu_k$ to $\sigma_k^2$; here, since we have dropped this dependency, $w_k$ is simply the variance of $\mu_k$. Equation 5.9 means that $\bar{x}_{jk} \sim N(m_k, w_k + \frac{\sigma_k^2}{n_j})$. It basically tells us that the means of the bags of the two classes are scattered around two Gaussian centroids respectively, although with different variances, as shown in Figure 5.1. We simply substitute Equation 5.9 for Equation 5.6 in the log-likelihood function. The MLE of $m_k$ has a simple analytical form but that of $w_k$ does not, so we still need a numeric optimization procedure to search for a solution. However, since the search space is two-dimensional (for convenience, we also search for $\hat{m}_k$ even though its solution can be obtained directly) and has no local maxima, the search is very fast. This constitutes our simplified TLD method, which we call TLDSimple; we call the former TLD method simply TLD.
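A minimal sketch of TLDSimple for a single attribute follows (hypothetical names, assuming NumPy/SciPy, and an arbitrary starting point for the optimizer): $\hat\sigma_k^2$ comes from the weighted unbiased estimate above, and the Gaussian log-likelihood implied by Equation 5.9 is maximized over $(m_k, w_k)$.

    import numpy as np
    from scipy.optimize import minimize

    def sigma2_unbiased(bags_attr):
        """Weighted unbiased estimate of sigma_k^2 from 1-d per-bag arrays."""
        num = sum(np.sum((b - b.mean()) ** 2) / len(b) for b in bags_attr)
        den = len(bags_attr) - sum(1.0 / len(b) for b in bags_attr)
        return num / den

    def fit_tld_simple(xbar, n, sigma2):
        """MLE of (m_k, w_k) under xbar_jk ~ N(m_k, w_k + sigma2 / n_j)."""
        def neg_ll(params):
            m, w = params
            var = w + sigma2 / n               # per-bag variance of xbar
            return 0.5 * np.sum(np.log(2.0 * np.pi * var)
                                + (xbar - m) ** 2 / var)
        start = [np.mean(xbar), max(np.var(xbar) - np.mean(sigma2 / n), 1e-6)]
        res = minimize(neg_ll, start, bounds=[(None, None), (1e-9, None)])
        return res.x                           # (m_hat, w_hat)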

5.3 The Underlying Generative Model


The underlying generative model of the TLD approach is straightforward: we simply have distributions on two levels, and it is equally straightforward to generate data from it. First, we generate random data from the distribution parameterized by the fixed bag-level hyper-parameters. Then we regard these data as the instance-level parameters and generate instance-level random data according to the distribution parameterized by these instance-level parameters.

We can generate artificial data based on this generative model for both TLD and TLDSimple. The artificial data generated using the generative model assumed by the simplified TLD method is easy to visualize; an example is shown in Figure 5.1. More specifically, we generated data from two levels of Gaussians along two independent dimensions for two classes (black and grey in the graph). First, we identified two (bag-level) Gaussians, one for each class, with the same variance but different means. We then generated 10 data points from each Gaussian and regarded each of these points as the mean of an instance-level Gaussian. Finally, we specified a fixed variance (the same for both classes) for the instance-level Gaussians and generated 1 to 20 data points from each instance-level Gaussian to form a bag. The number of instances per bag is uniformly distributed.
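A sketch of this two-level sampling procedure (all names and parameter values are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(1)

    def generate_class(bag_mean, bag_var, inst_var, n_bags=10, dim=2):
        """Two-level sampling: draw each bag's mean, then its instances."""
        bags = []
        for _ in range(n_bags):
            mu = rng.normal(bag_mean, np.sqrt(bag_var), size=dim)
            n_inst = rng.integers(1, 21)       # 1 to 20 instances, uniformly
            bags.append(rng.normal(mu, np.sqrt(inst_var), size=(n_inst, dim)))
        return bags

    # two classes: different bag-level means, shared variances
    positive_bags = generate_class(bag_mean=1.0, bag_var=0.5, inst_var=0.2)
    negative_bags = generate_class(bag_mean=-1.0, bag_var=0.5, inst_var=0.2)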
In Figure 5.1 we plot the contours of both levels of Gaussians at three standard deviations. The dot inside each Gaussian is its mean, and we can observe its variance from the dispersion along each dimension. The straight line between the two bag-level Gaussians is the bag-level decision boundary. Note that unlike the methods that intend to recover an instance-based decision pattern, such as the APR algorithms [Dietterich et al., 1997], Diverse Density [Maron, 1998] or the other methods developed in this thesis, the instance-level decision boundary is not defined here. Instead, only the decision boundary for the true means (i.e. $\mu$) of the bags is well-defined: it is a hyperplane because we used the same variance for both classes.^2 Also note that this decision boundary is only defined in terms of the true parameters $\mu$, not even in terms of the statistics (or metadata) $\bar{x}$. This is due to the extra term $\frac{\sigma^2}{n_j}$ in the variance of $\bar{x}$, which is distinct for each bag even if we have the same $\sigma^2$ but different numbers of instances per bag.

We generated the artificial data mainly to analyze the properties of our methods.

^2 If the Gaussians of the two classes had different variances, the decision boundary would be quadratic instead of linear.
Figure 5.1: An artificial simplified-TLD dataset with 20 bags.

Figure 5.2: Estimated parameters ($\hat\sigma^2$, $\hat{w}$ and $\hat{m}$) of the TLDSimple method, plotted against the number of exemplars.

In order to verify the (at least asymptotic) unbiasedness of both TLD methods, we generated a one-dimensional dataset with 20 instances per bag using the exact generative model assumed by each of them (data generation using the exact generative model assumed by TLD is described in Chapter 7 and Appendix A). The estimated parameters (of one class) are shown in Figure 5.2 for TLDSimple and in Figure 5.3 for TLD. In both figures, the solid lines denote the true parameter values. Note that in TLDSimple,

$$\hat\sigma_k^2 = \frac{\sum_j \frac{1}{n_j}\sum_i (x_{jki} - \bar{x}_{jk})^2}{e - \sum_j \frac{1}{n_j}}$$

is not a maximum likelihood estimate (MLE), but it is obviously an unbiased estimate, as mentioned in Section 5.2, because

$$E\Big[\sum_j \frac{1}{n_j}\sum_i (x_{jki} - \bar{x}_{jk})^2\Big] = \Big(e - \sum_j \frac{1}{n_j}\Big)\sigma_k^2.$$

All other estimates are MLEs. As can be seen, the estimated parameters converge to the true parameters as the number of bags increases; thus the method is at least asymptotically unbiased. We observe that the number of instances per bag does not affect the convergence itself, but only its rate: the more instances per bag, the faster the convergence. A varying number of instances per bag slows down the convergence but results in similar behavior.
Figure 5.3: Estimated parameters ($\hat{a}$, $\hat{b}$, $\hat{w}$ and $\hat{m}$) of the TLD method, plotted against the number of exemplars.

In TLDSimple, the MLE of the variance of $\mu$ (i.e. $w$) is biased, but still asymptotically unbiased. The reason why we use the MLE instead of another estimate is its robustness: we have observed that when the number of bags is small, which is the case in the Musk datasets, other estimates of the variance are very likely to be wildly wrong, even negative. The MLE, on the other hand, is relatively stable even for a small sample size.

5.4 Relationship to Single-instance Learners


The relationship between the TLD approach and single-instance learners can best be explained based on the TLDSimple method. Note that when the number of instances per bag is reduced to one, Equation 5.9 degrades into the same Gaussian for all bags (or instances, in this case) of one class, with mean $m_k$ and variance $w_k$, because $\sigma_k^2$ cannot be estimated and can only be regarded as 0. Hence this method degrades into naive Bayes in the single-instance case. If we had dropped the independence assumption between attributes, we would have obtained the discriminant analysis method. Even in the multi-instance case, if $\sigma_k^2 \approx 0$ and/or $n_j \gg 0$, the term $\frac{\sigma_k^2}{n_j}$ can be neglected in Equation 5.9. In that case, for all the bags of one class, the $\bar{x}_{jk}$'s can be regarded as coming from the same Gaussian. Then we can use standard naive Bayes to simulate the TLDSimple method: we simply calculate the sample average over every bag along all dimensions and construct one instance per bag from these sample averages. Then we use naive Bayes to learn from this single-instance data. This gives an approximate upgrade of naive Bayes to the MI case. Such an approximation may be appropriate for the Musk datasets, because we observed that the within-bag variance is indeed very small for many bags.
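This approximate upgrade amounts to a one-line data transformation. Here is a sketch, assuming scikit-learn's GaussianNB as the single-instance learner (our own substitution for illustration, since the thesis experiments use the WEKA workbench) and hypothetical variable names:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def bags_to_means(bags):
        """Replace each bag by the per-attribute sample mean of its instances."""
        return np.vstack([b.mean(axis=0) for b in bags])

    # usage, with train_bags/train_labels/test_bags as placeholders:
    # model = GaussianNB().fit(bags_to_means(train_bags), train_labels)
    # predictions = model.predict(bags_to_means(test_bags))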
The above view of the TLD approach leads us to consider improvement techniques from naive Bayes for the TLDSimple method. In TLDSimple it is assumed that the $\mu_k$'s are normally distributed (i.e. $\mu_k \sim N(m_k, w_k)$). But if this normality assumption does not hold very well, we need some adjustment technique. We present one simple technique here. When a new bag $b_{test}$ arrives, we calculate the log-odds function according to Equation 5.8 and would usually decide its class label based on whether the log-odds are greater than 0 or not. Now, we instead determine the class label of $b_{test}$ based on whether the log-odds are greater than $v$ or not, where $v$ is a real number. We choose $v$ so as to minimize the classification error on the training data, and call $v$ the "cut-point". Thus we select an empirically optimal cut-point value instead of 0. This technique was mentioned in the context of discriminant analysis [Hastie et al., 2001], but it is obviously also applicable in our TLD approach and in naive Bayes.
The rationale for this technique is easy to see, and is illustrated in Figure 5.4 in one dimension. Here we have one Gaussian and one standard Gamma distribution, one for each class (solid lines). The dotted line plots the Gaussian estimated using the mean and variance of the Gamma. The estimated decision boundary thus becomes C' whereas it should be C. Since in classification we are only concerned with the decision boundary, we do not need to adjust the density: by looking for the empirically optimal cut-point, we can move the boundary from C' back to C and improve prediction performance.
Figure 5.4: An illustration of the rationale for the empirical cut-point technique.

More specifically, we look for the cut-point as follows. First, we calculate the log-odds ratio for each of the training bags and sort these values in ascending order; the log-odds ratios of bags of class 1 should be greater than those of bags of class 0. Second, we look for a split. Let $T$ denote the total number of training bags, and $S$ a split point. There will be $t$ bags with log-odds $\leq S$ and $(T - t)$ bags with log-odds $> S$. We count the number of bags of the first class among the first $t$ bags; this number is $p_1$. The number of bags of the second class among the remaining $(T - t)$ bags is $p_2$. Finally, we go through the log-odds of all bags from the smallest to the largest, find the log-odds value with the largest value of $(p_1 + p_2)$, and use that value as the cut-point. If there is a tie, we choose the value closest to 0.
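Once the log-odds are sorted, this search is linear in the number of bags. The following is a sketch of the procedure just described (hypothetical names; class labels coded 0/1, with class 0 assumed to lie at the low end of the log-odds scale):

    import numpy as np

    def empirical_cut_point(log_odds, labels):
        """Choose the cut-point v maximizing p1 + p2 on the training bags.

        Bags with log-odds <= v are predicted class 0, the rest class 1;
        ties are broken by taking the candidate closest to 0."""
        order = np.argsort(log_odds)
        lo = np.asarray(log_odds)[order]
        y = np.asarray(labels)[order]
        candidates = np.concatenate(([lo[0] - 1.0], lo))
        best_v, best_correct = 0.0, -1
        for v in candidates:
            t = np.searchsorted(lo, v, side='right')  # bags with log-odds <= v
            correct = np.sum(y[:t] == 0) + np.sum(y[t:] == 1)  # p1 + p2
            if correct > best_correct or (correct == best_correct
                                          and abs(v) < abs(best_v)):
                best_v, best_correct = v, correct
        return best_v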
This technique is not commonly used to reduce the potential negative impact of the normality assumption; kernel density estimation or discretization of numeric attributes are more often used instead. Nonetheless, we found that this technique can also work quite well in normal single-instance learning. We have tested it in combination with naive Bayes on some two-class datasets from the UCI repository [Blake and Merz, 1998], within the experimental environment of the WEKA workbench [Witten and Frank, 1999]. The results are shown in Table 5.1.
In Table 5.1 we use NB to denote naive Bayes. The first column is NB with empirical cut-point (EC) selection; this is the baseline for the comparison. The second, third and fourth columns are NB without any adjustment, NB with kernel density estimation (KDE), and NB with discretization of numeric attributes (DISC), respectively. The results were obtained using 100 runs of 10-fold cross-validation (CV), with standard deviations given in brackets. A '*' sign indicates "significantly worse than" the baseline, whereas 'v' means "significantly better than"; we regard two results on a dataset as significantly different if the difference is statistically significant at the 99.5% confidence level according to the corrected resampled t-test [Nadeau and Bengio, 1999].

    Dataset           NB+EC          NB              NB+KDE          NB+DISC
    breast-cancer     72.53(0.96)    72.76(0.68)     72.76(0.68)     72.76(0.68)
    breast-cancer-W   95.87(0.23)    96.07(0.1)      97.51(0.11) v   97.17(0.13) v
    german-credit     74.75(0.51)    75.07(0.43)     74.5(0.46)      74.38(0.62)
    heart-disease-C   82.58(0.75)    83.4(0.41)      84.02(0.58) v   83.2(0.64)
    heart-disease-H   83.52(0.71)    84.23(0.52)     84.95(0.39) v   84.12(0.36)
    heart-statlog     83.78(0.74)    83.73(0.61)     84.4(0.59)      82.91(0.67)
    hepatitis         83.33(1)       83.71(0.82)     84.76(0.62) v   83.67(1.14)
    ionosphere        89.68(0.54)    82.51(0.45) *   91.83(0.32) v   89.27(0.45)
    kr-vs-kp          87.85(0.16)    87.8(0.15)      87.8(0.15)      87.8(0.15)
    labor             93.29(2.16)    94(1.92)        93.18(1.36)     88.44(1.85) *
    mushroom          98.18(0.03)    95.76(0.04) *   95.76(0.04) *   95.76(0.04) *
    sick              95.99(0.09)    92.76(0.12) *   95.78(0.08) *   97.14(0.08) v
    sonar             72.08(1.39)    67.88(1.06) *   72.68(1.13)     76.43(1.51) v
    vote              89.55(0.63)    90.09(0.15)     90.09(0.15)     90.09(0.15)
    pima-diabetes     75.43(0.46)    75.65(0.37)     75.15(0.37)     75.51(0.74)
    Summary (v/ /*)                  (0/11/4)        (5/8/2)         (3/10/2)

Table 5.1: Performance of different versions of naive Bayes on some two-class datasets.

It can be seen that the EC technique is comparable to discretization and worse than KDE in general. However, it does improve on the performance of the original naive Bayes. Moreover, in cases where KDE cannot be used, as in our TLD approach, EC is a convenient option.
It turns out that in the Musk datasets the normality assumption for $\mu_k$ is a problem, and the EC technique applies. For instance, for the naive Bayes approximation of TLDSimple without EC, the leave-one-out (LOO) error rates on the Musk1 and Musk2 datasets are 13.04% and 17.65% respectively. With EC, the error rates are 10.87% and 14.71%, an improvement of about 3% on each dataset. Note that in the naive Bayes approximation of TLDSimple we do not use KDE, because we do not want to accurately estimate the density of $\bar{x}_{jk}$ but that of $\mu_k$: in this approximation we ignored the term $\frac{\sigma_k^2}{n_j}$ in the variance of $\bar{x}_{jk}$, hence $\bar{x}_{jk}$ is only an approximation of $\mu_k$. The EC technique is only concerned with the cut-point rather than the whole density, and thus does not suffer from this approximation.

    Methods                           Musk 1               Musk 2
                                    LOO     10CV         LOO     10CV
    SVM with Minimax Kernel          7.6    8.4±0.7      13.7    13.7±1.2
    RELIC                            -     16.3           -      12.7
    TLDSimple+EC                    14.13  16.96±1.86     9.80   15.88±2.56
    naive Bayes approximation+EC    10.87  15.11±2.32    14.71   17.35±1.85

Table 5.2: Error rate estimates from 10 runs of stratified 10-fold CV and LOO evaluation. The standard deviation of the estimates (if available) is also shown.

5.5 Experimental Results


There are no other directly related MI methods in the literature. The closest cousins may be other methods that also extract metadata from bags, namely the MI decision tree learner RELIC [Ruffo, 2001] and the support vector machine (SVM) with the minimax kernel [Gärtner et al., 2002]. Both methods can be viewed as extracting minimax metadata from each bag and applying a standard learner to the resulting single-instance data. Although it has been mentioned that some statistics could be extracted from bags for the purpose of modeling [Gärtner et al., 2002], no specific methods of this kind have been put forward in the MI learning domain.

We present experimental results for TLDSimple and its naive Bayes approximation (both used with the EC technique) on the Musk benchmark datasets in Table 5.2, together with published results for RELIC and the SVM method. Since these two metadata-based methods are among the best methods on the Musk datasets, we can see that the new methods are competitive with the state-of-the-art in MI learning.
In spite of its simplicity, TLDSimple performs surprisingly well on the Musk2 dataset, with a reasonably good result on Musk1. Since it is a distributional method, it is expected that more data improves its estimates, which is why its LOO performance is better than its 10-fold CV performance. Its naive Bayes approximation is equally simple and also performs quite well on these datasets. Hence the sound theoretical basis of these methods appears to bear fruit on practical datasets.

5.6 Related Work


As mentioned before, RELIC and the SVM with the minimax kernel are related to the TLD approach in the sense that they both extract metadata from the bags and model directly at the bag level. But there are no other multi-instance methods that model the MI problem in a group-conditional manner, nor are there any methods that try to extract low-order sufficient statistics. Therefore our approach is quite unique in the MI learning domain.

However, when developing the TLD methods, we discovered empirical Bayes (EB) [Maritz and Lwin, 1989], an old statistical method that follows exactly the same line of thinking as our approach. As a matter of fact, it is well suited to multi-instance data, and our approach is one of its applications to classification. In the EB literature, a more general EM-based solution procedure has also been proposed that can model arbitrary distributions on the two levels. In such a case, the integral may not be solvable, but via the EM algorithm it is possible to apply numeric integration techniques to make the maximum likelihood method feasible. This is actually one of the early applications of the EM algorithm [Dempster et al., 1977].
According to the EB literature [Maritz and Lwin, 1989], an early example of an EB application was an experiment with contaminated water [von Mises, 1943]. In this experiment, 3420 batches of water were collected, with five samples in each batch. Each sample was classified as either positive (contaminated) or negative (uncontaminated). The task was to estimate the probability of a sample being positive. One
may recognize this dataset as very similar to MI data. If we classified each batch
as positive or negative (according to some assumptions), and had some attributes to
describe the samples, we would end up with an MI dataset.
Today EB is an important method for solving the meta-analysis problem in medical (or other scientific) research, which has a similar rationale to MI learning. In scientific research, different scientists carry out experiments on the same topic but usually get different results. If we can construct some instances (with the same attributes) for each one of the experiments, and try to identify the homogeneity and heterogeneity of these experiments, this is actually an MI dataset: each experiment is a bag of instances, but there is only one class label per bag. MI learning solutions will be very useful here because the identification of homogeneity/heterogeneity of scientific experiments is an important objective in meta-analysis.
We present the relationship of the EB method and meta-analysis with MI learning in the hope that they may attract some interest in the field of MI research. We believe that this may help spawn fielded applications and more publicly available datasets, the scarcity of which is perhaps the most severe obstacle in research on MI learning today.

5.7 Conclusions
In this chapter we have proposed a two-level distribution (TLD) approach for solving multiple instance (MI) problems. We built a model in a group (class)-conditional
manner, and think of the data as being generated by two levels of distributions
the instance-level distribution and the bag-level distribution within each group.
It turns out that the modeling process can be based on low-order sufficient statistics,
thus we can regard it as one of the methods that are based on meta-data extracted
from each bag.
Despite its simplicity, we have shown that a simple version of TLD learning and
its naive Bayes approximation perform quite well on the Musk benchmark datasets. We have also shown the relationship of our approach to normal group-conditional single-instance learners, other metadata-based MI methods, and the empirical Bayes method. Finally, we pointed out the similarity between MI learning problems and meta-analysis problems from scientific research.
Chapter 6
Applications and Experiments

As mentioned in Chapter 1, we focus on three practical applications of MI learning: drug activity prediction, fruit disease prediction, and content-based image categorization. Throughout this chapter, the term "number of attributes" does not include the Bag-ID attribute and the class attribute in the datasets. When we refer to a "positive" bag in this chapter, we simply mean that its class label is 1; likewise "negative" for 0.

6.1 Drug Activity Prediction

The most famous application of MI learning is the drug activity prediction problem, first described in [Dietterich et al., 1997], which resulted in the first MI datasets, the Musk datasets. These are the only real-world MI datasets publicly available. While interested readers should refer to [Dietterich et al., 1997] for detailed background on the Musk datasets, we briefly describe them in Section 6.1.1. In Section 6.1.2 we discuss another molecule activity prediction problem: predicting the mutagenicity of molecules. Note that this problem was not originally an MI problem. However, with proper transformations we can use its data in MI learning.
Figure 6.1: Accuracies achieved by MI Algorithms on the Musk datasets.

6.1.1 The Musk Prediction Problem


There are two Musk datasets, each describing different molecules (some molecules are present in both of them). The task is to decide which molecules result in a musky smell. On average, Musk2 has more bags than Musk1, and each bag has more instances. Table 4.2 in Chapter 4 lists some key properties of the two datasets.
In both datasets, each molecule is represented as a bag, and the different shapes (or conformations) of that molecule are the instances in the bag. Since one molecule has multiple shapes, we have to use multiple instances to represent an example. The shape of a molecule is measured using a ray-based representation that results in 162 features. Together with 4 extra oxygen features, we have 166 attributes. Since every shape of a molecule (i.e. every instance) can be measured with these 166 attributes, we can construct an MI dataset. Both Musk datasets have the same attributes, as shown in Table 4.2 in Chapter 4.
The Musk datasets are the most popular ones in the MI domain. Virtually every MI algorithm developed so far has been tested on this problem, and so have the algorithms in this thesis. Although we have used these datasets throughout this thesis to
compare the performance of our methods to that of many other MI methods, we list
the accuracies again in Figure 6.1. Based on this figure we can gather some evidence about which assumption really holds in the Musk datasets by examining the accuracies of methods based on various assumptions. We use black bars to denote the accuracy on Musk1, and white ones for Musk2. We used the best accuracy an algorithm can achieve, regardless of the evaluation method used. The accuracy of the wrapper method developed in Chapter 3 ("MI Wrapper" in the figure) is that achieved
by wrapping around bagged PART with discretization. The MI AdaBoost method
shown in the figure has unpruned regularized C4.5 as the base classifier and uses
50 iterations, the same as described in Chapter 4. The two MI linear logistic regression methods ("MILogitRegGEOM" and "MILogitRegARITH" in the figure) are also
described in Chapter 4. TLDSimple and the naive Bayes (NB) approximation of
TLDSimple are described in Chapter 5.
We deliberately divide the algorithms into three categories: the first corresponds to instance-based methods that are consistent with the MI assumption; the second corresponds to the instance-based methods that we believe do not use the MI assumption; and the last corresponds to the metadata-based methods that cannot possibly use the MI assumption. As the figure shows, the methods that strongly depend on the MI assumption do not seem to benefit much from this background knowledge. Instead, the overall performance of the methods that are not based on this assumption is as good as that of the MI-assumption family on these datasets. Therefore the validity of the MI assumption as background knowledge for the Musk datasets appears to be questionable.

6.1.2 The Mutagenicity Prediction Problem


There is another drug activity prediction problem, namely the mutagenicity prediction problem [Srinivasan, Muggleton, King and Sternberg, 1994], which can also be represented as an MI problem. It originated from the inductive logic programming (ILP) domain.
                            Friendly    Unfriendly
Number of bags              188         42
Number of attributes        7           7
Number of instances         10486       2132
Number of positive bags     125         13
Number of negative bags     63          29
Average bag size            55.78       50.76
Median bag size             56          46
Minimum bag size            28          26
Maximum bag size            88          86

Table 6.1: Properties of the Mutagenesis datasets.

The original dataset is in a relational format, which is especially
suitable for ILP algorithms. However, we can transform the original dataset into an MI dataset in several ways [Chevaleyre and Zucker, 2000; Weidmann, 2003]. Here we consider a setting that represents each molecule as a set of bonds and the pairs of atoms connected by each of the bonds. After this transformation each molecule corresponds to a bag. But unlike the Musk datasets, where an instance describes the conformation of an entire molecule, each instance here denotes an atom-bond-atom fragment of a molecule. One such fragment is described using 7 attributes: 5 symbolic and 2 numeric. Interested readers may refer to [Weidmann, 2003] for more details on the construction of these MI datasets. The same construction method was used in other studies of MI learning [Chevaleyre and Zucker, 2001; Gärtner et al., 2002].
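Purely as an illustration of this representation, one atom-bond-atom instance might be encoded as follows; the field names and their exact semantics are hypothetical choices of ours and are not taken from the actual datasets:

```java
/** A sketch of one atom-bond-atom instance: 5 symbolic and 2 numeric
    attributes, as described above. All field names are illustrative only. */
public class AtomBondAtomInstance {
    String atom1Element;   // symbolic: element of the first atom (assumed)
    int    atom1Type;      // symbolic: atom type code of the first atom (assumed)
    String bondType;       // symbolic: type of the connecting bond (assumed)
    String atom2Element;   // symbolic: element of the second atom (assumed)
    int    atom2Type;      // symbolic: atom type code of the second atom (assumed)
    double atom1Charge;    // numeric: e.g. partial charge of the first atom (assumed)
    double atom2Charge;    // numeric: e.g. partial charge of the second atom (assumed)
}
```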
The original Mutagenesis dataset consists of 230 molecules: 188 molecules that can be fitted using linear regression and the remaining 42, which are more difficult to fit. These two subsets are called "friendly" and "unfriendly" respectively. The key properties of the two datasets, after the transformation, are shown in Table 6.1.
We evaluated some of the methods developed in this thesis on these two datasets.
Since the TLD approach can only deal with numeric attributes and there are some
nominal attributes in the Mutagenesis datasets, we cannot apply TLD methods to
them. We would normally consider transforming the nominal attributes to binary ones in this case.
Method                       Friendly      Unfriendly
Best of the TLC methods      9.31±1.28     18.33±1.61
MILogisticRegressionGEOM     15.74±0.91    16.67±0
MI AdaBoost                  13.62±1.31    18.57±2.19
Wrapper with Bagged PART     18.78±1.26    23.10±2.26
Best of ILP methods          18.0          -
SVM with RBF MI kernel       7.0           25.0

Table 6.2: Error rate estimates for the Mutagenesis datasets and standard deviations (if available).
However, the TLD methods involve estimating the in-bag variance, and this transformation often introduces zero variances for many attributes. Consequently the TLD approach cannot be applied even after such a transformation. Hence we only
applied the instance-based methods, i.e. the wrapper method, the MI linear logistic regression methods and MI AdaBoost. In contrast to the experiments with the Musk datasets, MILogisticRegressionGEOM outperforms MILogisticRegressionARITH, and thus we only present the results for MILogisticRegressionGEOM. Note that logistic regression cannot deal with nominal attributes either, but it can deal with the transformed binary attributes. Hence we transform a nominal attribute with $K$ values into $K$ binary attributes, each with values 0 and 1. As for the regularization, we (again) use a fixed ridge parameter $\lambda$ instead of CV. Here we use $\lambda = 0.05$ for the friendly data and $\lambda = 2$ for the unfriendly data. MI AdaBoost, on the other hand, can naturally deal with nominal attributes, and we apply it directly to
the Mutagenesis datasets. For the regularization in MI AdaBoost, we use the same
strategy as in the Musk datasets. The base classifier is an unpruned regularized C4.5 decision tree [Quinlan, 1993], as in Chapter 4. The minimal number of leaf-node instances was set to around twice the average bag size, i.e. 120 instances for Friendly and 100 instances for Unfriendly. Under these regularization conditions, we then looked for a reasonable number of iterations. It turned out that the best performance often occurs with one iteration on the unfriendly data, which led us to think that decision trees may be too strong for this dataset, and that we should use decision stumps instead. Hence we eventually used 200 iterations of regularized C4.5 on the friendly dataset and 250 iterations of decision stumps on the unfriendly
dataset. The estimated errors over 10 runs of 10-fold CV are shown in Table 6.2. Table 6.2 also shows the error estimates of the wrapper method described in Chapter 3
on these two datasets. The classifier used in the wrapper was Bagging [Breiman,
1996] based on the decision rule learner PART [Frank and Witten, 1998], both implemented in WEKA [Witten and Frank, 1999]. Bagged PART will also be used for
the experiments with the wrapper method on other applications in this chapter. We
used the default parameter setting of the implementations in WEKA, and did not do
any further parameter tuning.
It was reported that ILP learners achieve error rates ranging from 39.0% (FOIL) to 18.0% (RipperMI) on the friendly dataset using one run of 10-fold CV [Chevaleyre and Zucker, 2001]. The SVM with the MI kernel achieved 7.0% on Friendly and 25% on Unfriendly using 20 runs of random leave-10-out [Gärtner et al., 2002]. Note that, unlike for the Musk datasets, the experimental results of 10-fold CV and leave-10-out are not comparable here, because the number of bags is quite different from 100 in both datasets. In Friendly, leave-10-out allows the learner to use more training data, whereas in Unfriendly the training data in leave-10-out will be less. In addition, leave-10-out is not stratified, thus the 20 runs used for the SVM with the MI kernel [Gärtner et al., 2002] may suffer from high variance due to the fact that the classes are not evenly distributed in both datasets (see Table 6.1). The two-level classification method (TLC) [Weidmann, 2003], which is a model-oriented metadata-based approach, achieves 9.31% on Friendly and 18.33% on Unfriendly via 10 runs of 10-fold CV. In Table 6.2, we list the error rates of these methods, together with those of the methods developed in this thesis.
The standard deviation of MILogisticRegressionGEOM over the 10 runs is zero on the unfriendly dataset, indicating the stability of the results. The evaluation process allows us to compare the results of the methods developed in this thesis to those of the TLC method [Weidmann, 2003] and of the ILP learners [Chevaleyre and Zucker, 2001]. While comparable to the TLC method on the unfriendly dataset (actually it seems that MILogisticRegressionGEOM performs slightly better than TLC on this dataset), MILogisticRegressionGEOM and MI AdaBoost are worse than the TLC methods on
the friendly dataset. However, they seem to outperform the ILP learners. The small ridge parameter value $\lambda = 0.05$ used in MILogisticRegressionGEOM on the friendly dataset indicates that under the collective assumption there is indeed a quite strong linear pattern, while the larger ridge parameter value used on the unfriendly dataset may imply a weaker linear pattern. This is consistent with the observation that the friendly data can be fitted with linear regression whereas the unfriendly data cannot. The performances of MILogisticRegressionGEOM and MI
AdaBoost are similar on the Mutagenesis datasets, which is reasonable because they
are based on the same assumption. As long as the instance-level decision boundary
pattern is close to linear under the geometric-average probability assumption, their
behaviors will be very similar. The wrapper method with bagged PART is not as
good as TLC, MILogisticRegressionGEOM and MI AdaBoost, but still comparable
to the best of the ILP methods. Besides, we did not do any parameter tuning or
use other improvement techniques like discretization here. Neither did we try the
wrapper around other single-instance classifiers. We conjecture that we might be
able to improve the accuracy considerably with these efforts because we believe the
underlying assumption of the wrapper method (that the class probabilities of all the
instances within a bag are very similar to each other) may hold reasonably well in
the drug activity prediction problems.
The overall empirical evidence seems to show that the collective assumption may be at least as appropriate in the molecular chemistry domain as the MI assumption. We can get good experimental results on problems of this kind using methods based on the collective assumption. This empirical observation seems to contradict claims that the MI assumption is precisely the case in the domain of molecular chemistry [Chevaleyre and Zucker, 2000]. However, it would be interesting to discuss this with a domain expert.
                            Fruit 1    Fruit 2    Fruit 3
Number of positive bags     23         28         19
Number of negative bags     49         44         53
Number of bags              72         72         72
Number of attributes        23         23         23
Number of instances         4560       4560       4560
Average bag size            63.33      63.33      63.33
Median bag size             60         60         60
Minimum bag size            60         60         60
Maximum bag size            120        120        120

Table 6.3: Properties of the Kiwifruit datasets.

6.2 Fruit Disease Prediction
As discussed in Chapter 1, the fruit disease prediction problem provides an MI setting. The identification of this problem as an MI problem is due to Dr. Eibe Frank. In this problem, each bag is a batch of fruit, usually from one orchard. Every fruit within a batch is an instance. Some non-destructive measurements are taken for each fruit. The class labels are only observable at the bag level; in other words, either a whole batch of fruit is infected by a disease or not. Although there is no assumption explicitly stated for this problem, the collective assumption is easy to interpret here: if the whole batch of fruit is resistant to a certain disease, then the whole batch is disease-free.
We obtained one dataset consisting of 72 batches of kiwifruit. They were labeled according to three different diseases, resulting in three datasets with identical attributes, namely Fruit 1, 2 and 3. There are 23 non-destructive measurements describing each kiwifruit, resulting in 23 attributes, all numeric. The key properties are listed in Table 6.3. The table shows that the number of instances per bag in these datasets is more regular than in the datasets describing the drug activity prediction problem, and comparatively larger. Nonetheless, according to our observations, the number of instances per bag does not seem to be very important for classification. What is more important is the number of bags, which is scarce in this case.
Method                       Fruit 1       Fruit 2       Fruit 3
Default accuracy             68.06         61.11         73.61
TLD+EC                       62.75±0       52.78±0       66.67±0
MILogisticRegressionARITH    71.81±1.47    65.69±1.97    73.61±0
MILogisticRegressionGEOM     71.11±2.34    64.03±2.31    73.61±0
MI AdaBoost                  73.19±1.74    63.61±2.84    73.61±0
Wrapper with bagged PART     73.06±0.97    61.81±2.87    71.81±0.67

Table 6.4: Accuracy estimates for the Kiwifruit datasets and standard deviations.
One important fact that is not shown in Table 6.3 is that there are some missing values in the dataset. Typically, the values of a specific attribute are missing for all the instances of a whole bag. We developed different strategies to deal with the missing values for the methods developed in this thesis. MI AdaBoost has the advantage that the base classifier usually has its own way of dealing with missing values, hence they are not a problem for it. For the TLD approach, since we assume attribute independence, we simply skip bags with missing values for a certain attribute when collecting the low-order statistics for that attribute. Thus in the log-likelihood function of Equation 5.7 in Chapter 5, we use fewer bags to estimate the hyper-parameters of that attribute. The cost is that we have less data for the estimation. As for the MI logistic regression algorithms, we use the usual way of logistic regression to tackle this difficulty, that is, we substitute the missing values with values estimated from the training instances without missing values. In order to be consistent with our convention, we substitute the missing values with the weighted average of the other instances' values of the concerned attribute. The weight of an instance is given by the inverse of the number of (non-missing) instances of the bag it pertains to.
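The following sketch shows this imputation scheme under an array-based data representation of our own (using Double.NaN to mark a missing value is an assumption of the sketch, not the thesis's code):

```java
/** A sketch of the weighted-average imputation described above. bags[j][i][k]
    is attribute k of instance i in bag j; Double.NaN marks a missing value. */
public class WeightedImputation {

    /** Returns the replacement value for attribute k, assuming at least one
        bag has an observed value for this attribute. */
    public static double imputedValue(double[][][] bags, int k) {
        double weightedSum = 0, totalWeight = 0;
        for (double[][] bag : bags) {
            int nonMissing = 0;
            for (double[] inst : bag) if (!Double.isNaN(inst[k])) nonMissing++;
            if (nonMissing == 0) continue;     // this bag contributes nothing
            double w = 1.0 / nonMissing;       // instance weight: inverse bag count
            for (double[] inst : bag) {
                if (!Double.isNaN(inst[k])) {
                    weightedSum += w * inst[k];
                    totalWeight += w;
                }
            }
        }
        return weightedSum / totalWeight;
    }
}
```

Note that under this weighting each bag with observed values contributes a total weight of one, so large bags do not dominate the estimate.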
Now the three datasets can be tackled by all of the methods in this thesis. We do not have other methods to compare against (apart from DD, as discussed below), so we list the default accuracy on the data, i.e. the accuracy if we simply predict the majority class, in the first line of Table 6.4 for comparison. However, we found that these datasets are very hard to classify. For example, we tried DD (looking
for one target concept) with the noisy-or model and got an accuracy of 64.44% (±3.29%) over 10 runs of 10-fold CV and a leave-one-out (LOO) accuracy of 66.67% on the Fruit 1 dataset,¹ worse than the default accuracy. Thus the MI assumption does not seem
to hold very well for these datasets. Here we simply present some methods that can
give reasonable results on these datasets. We list the accuracy estimates for 10 runs
of 10-fold CV in Table 6.4.
As usual, we use the empirical cut-point (EC) technique together with the TLD method. We only show the results of TLD in Table 6.4 because TLDSimple did worse than TLD in this case. Perhaps in these datasets the mean of each bag is not enough to discriminate between the bags of the two classes. Table 6.4 shows that the TLD method, which encodes information about the first two sufficient statistics, cannot achieve the default accuracy either, although the accuracy estimates are extraordinarily stable. It could be the case that in these datasets higher-order sufficient statistics are required for discrimination. It may also be due to the lack of training data, as we observed that the LOO accuracy of TLD on the Fruit 1 dataset is 70.83%, which is much better than that shown in Table 6.4 and better than the default accuracy. We have seen in Chapter 5 that the TLD approach is a distributional method and may usually require many bags to provide reasonable estimates. In this case the sample size may be too small to give accurate estimates.
The instance-based methods perform reasonably well on the Fruit 1 and Fruit 2 datasets. It seems that the Fruit 1 dataset has a non-linear instance-level class decision boundary under the collective assumption, while the Fruit 2 dataset exhibits linearity. This is due to the observation that the methods specializing in fitting non-linear functions, like MI AdaBoost and the wrapper method based on bagged PART, perform better on the Fruit 1 data than the MI linear logistic regression methods, but worse on the Fruit 2 data. Again, we did not finely tune the parameters in any of the methods. In the two variations of MI logistic regression, we simply used $\lambda = 1$ for both methods and for all the datasets. In MI AdaBoost, we first tried unpruned C4.5 with a minimal leaf-node instance number of 120 (about twice the average bag size). The result for the Fruit 2 dataset presented in Table 6.4 is based on this base classifier running 100 iterations. However, on Fruit 1 we observed (again) that one iteration gave a good result for this base classifier, which may indicate that we should use a weaker base classifier. Thus for Fruit 1 we used 250 iterations of decision stumps, whose results are shown in Table 6.4. As for the wrapper method, we simply used the default settings of the wrapped single-instance learners, i.e. bagging and PART.

¹ Since DD does not have a mechanism to deal with missing values, we pre-processed the data, substituting the missing values with the average of the non-missing attribute values. Note that this process may give DD an edge in classification because the resulting training and testing instances are different from the original ones and may provide more information for classification.
The Fruit 3 dataset seems to be very random and hard to classify. In fact, no method can do better than the default accuracy. The MI linear logistic regression methods fail to find a linear instance-level decision boundary under the collective assumption. Any ridge parameter value greater than 1 only resulted in predicting the majority class, which appears to be the best accuracy achievable on this dataset. The same happened for MI AdaBoost: when based on decision stumps, any iteration number greater than 10 resulted in worse test accuracy than the default accuracy, although the training error could be improved with more iterations. Thus the best one can do on this dataset is to predict the majority class.

6.3 Image Categorization
Content-based image categorization involves classifying an image according to its content. For example, if we are looking for images of mountains, then any images containing mountains are supposed to be classified as positive and those without mountains as negative. The task is to predict the class label of an unseen image based on its content.
This problem can be represented as an MI problem with some special techniques. The key to such a representation is that we can segment an image into several pixel regions, usually called "blobs". A bag corresponds to an image, which has the blobs or some combinations of blobs as its instances. Again, one bag has only one class label. The remaining problem is how to generate features for the instances.
                            Photo
Number of bags              60
Number of attributes        15
Number of instances         540
Number of positive bags     30
Number of negative bags     30
Average bag size            9
Median bag size             9
Minimum bag size            9
Maximum bag size            9

Table 6.5: Properties of the Photo dataset.

Figure 6.2: A positive photo example for the concept of "mountains and blue sky".

Figure 6.3: A negative photo example for the concept of "mountains and blue sky".

The data used in this thesis was generated based on the single-blob with neighbours (SBN) approach proposed by [Maron, 1998], although there are many other techniques that could be used [Zhang et al., 2002]. More precisely, there are 15 attributes in the data. The first three attributes are the averaged R, G, B values of one blob. The remaining twelve dimensions are the differences of the R, G, B values between one specific blob and the four neighbours (up, right, down and left) of that blob. As in [Maron, 1998], an image was transformed into 8x8 pixel regions and each 2x2 region was regarded as a blob. There are 9 blobs that can possibly have 4 neighbours in each image. Therefore the resulting data has 15 attributes and 9 instances per bag, fixed for each bag. Details of the construction of the MI datasets from the images can be found in [Weidmann, 2003], where some other methods have also been tried.
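The following sketch shows one plausible reading of the SBN construction; the exact blob geometry (nine interior blob positions with neighbours offset by two grid cells) is our inference from the counts above, not code from [Maron, 1998] or [Weidmann, 2003]:

```java
/** A sketch of SBN feature extraction under assumed geometry: the image is an
    8x8 grid of pixel regions, a blob is a 2x2 block of regions, and each of the
    nine blob positions whose four neighbours still fit in the grid yields one
    15-attribute instance. rgb[y][x][c] holds channel c (R, G, B) of cell (x, y). */
public class SBNFeatures {

    // Mean of channel c over the 2x2 blob whose top-left cell is (x, y).
    static double blobMean(double[][][] rgb, int x, int y, int c) {
        return (rgb[y][x][c] + rgb[y][x + 1][c]
              + rgb[y + 1][x][c] + rgb[y + 1][x + 1][c]) / 4.0;
    }

    // One 15-attribute instance per valid blob position; 9 instances per image.
    public static double[][] extract(double[][][] rgb) {
        int[][] offsets = {{0, -2}, {2, 0}, {0, 2}, {-2, 0}}; // up, right, down, left
        double[][] bag = new double[9][15];
        int inst = 0;
        for (int y = 2; y <= 4; y++) {
            for (int x = 2; x <= 4; x++) {
                double[] f = bag[inst++];
                for (int c = 0; c < 3; c++) {
                    f[c] = blobMean(rgb, x, y, c);       // averaged R, G, B of the blob
                }
                for (int n = 0; n < 4; n++) {            // differences to the 4 neighbours
                    for (int c = 0; c < 3; c++) {
                        f[3 + 3 * n + c] = f[c]
                            - blobMean(rgb, x + offsets[n][0], y + offsets[n][1], c);
                    }
                }
            }
        }
        return bag;
    }
}
```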
Table 6.5 lists some of the properties of the Photo dataset we used.
Method                       Photo
Best of the TLC methods      81.67
TLDSimple+EC                 69.17±2.64
MILogisticRegressionARITH    71.67±2.48
MILogisticRegressionGEOM     71.00±2.96
MI AdaBoost                  76.67±2.36
Wrapper with bagged PART     75±0

Table 6.6: Accuracy estimates for the Photo dataset and standard deviations.

Since all the dimensions describe RGB values, they are all numeric. The dataset is about the
concept "mountains and blue sky". There are 30 photos that contain both mountains and blue sky, which are the positive bags, and 30 negative photos. The negative photos may contain only sky, only mountains, or neither. Two examples from [Weidmann, 2003] illustrate the class label properties of the bags: the photo shown in Figure 6.2 is positive because it contains both mountains and sky, while the one shown in Figure 6.3 is negative because it contains only sky (and plains) but no mountains. Note that the classification task here is more difficult than that in [Maron, 1998] and [Zhang et al., 2002], because their target objects only involve one simple concept, say "mountains", whereas we have a conjunctive concept here, which is more complicated.
Table 6.6 lists the experimental results of 10 runs of 10-fold CV for some of the methods developed in this thesis. Again, we did not finely tune the user parameters. In the MI logistic regression methods, we used a ridge parameter of 2 for both methods. In MI AdaBoost, we used an unpruned C4.5 tree as the base classifier, and the minimal instance number for the leaf-nodes is (again) twice the bag size (18 instances). We obtained the results in Table 6.6 using 300 iterations. The base classifiers in the wrapper method were configured using the default settings. We also list the best result for the TLC algorithms [Weidmann, 2003] for comparison.
For this particular dataset, the methods based on the collective assumption may sometimes be appropriate. Because "mountains" and "blue sky" are large objects that usually occupy the whole image, the class label of an image is very likely to be related to the RGB values of all the blobs in that image. However, if this is not the case for an image, the collective assumption may not be suitable. Moreover, the decision boundary in the instance space may be more sophisticated than linear or close-to-linear (like quadratic) patterns. This is because we take the RGB values as the attributes, and the colours of other objects may be very similar to those of the concept objects (i.e. "mountains" or "blue sky" in this case). To discriminate between the positive and negative images, we may need more complex decision boundary patterns. Therefore the TLD approach and the MI logistic regression methods, which model, either asymptotically or exactly, linear or quadratic decision boundaries, are not expected to give good accuracies on this task. MI AdaBoost and the wrapper method (based on appropriate base learners) are more flexible and may be more suitable for this problem.
The natural candidate learning technique for this problem is the TLC method [Weidmann, 2003], because it can deal with a very general family of concepts, including conjunctive concepts. The DD algorithm (results not shown), on the other hand, cannot deal with this dataset because, although it can be extended to deal with disjunctive concepts, it cannot handle conjunctive concepts.
As expected, the best of the TLC algorithms (based on one run of 10-fold CV) indeed performs best on this dataset. MI AdaBoost and the wrapper method seem to be better than the other methods developed in this thesis because they are able to model more sophisticated instance-level decision boundaries. We also notice that the result of the wrapper method is very stable on this dataset. TLDSimple performs better than TLD, so we only present the result for TLDSimple. It produces similar results to the MI linear logistic regression methods. These methods are not good on this task, as mentioned above, mainly because the instance-level decision boundary is unlikely to be linear (or close to linear), even if the collective assumption holds.

6.4 Conclusion
In this chapter, we have explored some practical applications of MI learning, namely,
drug activity prediction, fruit disease prediction and content-based image categorization problems. We experimented with some of the methods developed in this
thesis on the datasets related to these problems. We believe that the collective assumption is widely applicable in the real world. This belief is supported by the
empirical evidence we observed on the practical problems discussed in this chapter.


Chapter 7
Algorithmic Details

In this chapter we present some algorithmic details that are crucial to some of the
methods developed and discussed in this thesis. These techniques relate to either
the algorithms or the artificial data generation process. Important features of some
algorithms are also discussed.
More specifically, Section 7.1 briefly describes the numeric optimization technique
used in some of the methods developed in this thesis. Section 7.2 describes in detail
how to generate the artificial datasets used in the previous chapters. Because the
instance-based methods we developed in Chapter 4, especially the MI linear logistic
regression methods, are quite good at attribute (feature) selection, we show some
more detailed results on attribute importance of the Musk datasets in Section 7.3. In
Section 7.4 we describe some algorithmic details of the TLD approach developed
in Chapter 5. Finally we analyze the Diverse Density [Maron, 1998] algorithm, and
describe what we discovered about this method in Section 7.5.

7.1 Numeric Optimization
As already noted above, many methods in this thesis use a numeric technique to
solve an optimization problem whenever the solution of the problem is not in a simple analytical form. For example, when the maximum likelihood (ML) method is
used (in logistic regression and the TLD approach), we usually resort to numeric
optimization procedures to find the MLE. In some situations, we may have bound
constraints for the variables, i.e. optimization w.r.t. variables $x$ with constraints $x \geq C$ and/or $x \leq C$ for some constant $C$. It can be shown that transforming such problems into unconstrained optimization problems via variable transformations like $x = y^2 + C$ or $x = -y^2 + C$ (where $y$ is the new variable) may not be appropriate [Gill, Murray and Wright, 1981]. Because the objective function may
not be defined outside the bound constraints, we also need a method that does not require evaluating the function values there. Eventually we chose a Quasi-Newton optimization procedure with BFGS updates and the active set method suggested by [Gill et al., 1981] and [Gill and Murray, 1976]. It is based on the projected gradient method [Gill et al., 1981; Chong and Zak, 1996], which is primarily used to solve optimization problems with linear constraints, and which updates the orthogonal projection in each step according to the change of the active constraints. In
the case of bound constraints, the orthogonal projection of the gradient is very easy to calculate: it is simply the gradient of the free variables at that moment (i.e. the variables that are not at their bounds). Therefore the search strategy we adopt is quite similar to that of an unconstrained optimization [Dennis and Schnabel, 1983].
We implemented this optimization procedure in WEKA [Witten and Frank, 1999] as the class weka.core.Optimization. The details of the implementation are described in Appendix B, which also serves as formal documentation of this optimization class.
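The core of the projected-gradient idea for bound constraints can be sketched as follows; this is an illustrative fragment of our own, not the actual weka.core.Optimization implementation:

```java
/** A sketch of gradient projection for bound constraints: gradient components
    of variables held at an active bound are zeroed, so a descent step along the
    (negated) projected gradient only moves the free variables. */
public class ProjectedGradient {

    /** x: current point; g: gradient; lower/upper: bound constraints. */
    public static double[] project(double[] x, double[] g,
                                   double[] lower, double[] upper) {
        double[] projected = g.clone();
        for (int i = 0; i < x.length; i++) {
            // A descent step moves along -g; zero components that would
            // push a variable outside its active bound.
            boolean activeLower = x[i] <= lower[i] && g[i] > 0;
            boolean activeUpper = x[i] >= upper[i] && g[i] < 0;
            if (activeLower || activeUpper) {
                projected[i] = 0.0;  // variable is fixed by an active constraint
            }
        }
        return projected;
    }
}
```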
In order to enhance the efficiency of the optimization procedure and reduce the computational cost, we separate the variables as much as possible. Such a separation of the variables basically divides one optimization problem into several small sub-problems and applies the optimization procedure to each of the sub-problems. Hence in each (smaller) optimization problem, the number of variables to be searched is greatly reduced. Because we use a positive definite matrix to approximate the Hessian matrix in the Quasi-Newton method, a reduction of the number of variables reduces the sparsity of the matrix and leads to faster computation. In this thesis, we can sometimes separate the variables to make the optimization easier. For instance, when the parameters of different attributes are separable because we assume the independence of attributes in the likelihood function, we can conveniently divide the likelihood function into sub-functions, usually one for each dimension. Each of these sub-functions only involves a few parameters, and we only have to search for the maximum of each sub-function individually, which is a much simpler problem.

7.2 Artificial Data Generation and Analysis
In Chapters 4 and 5, we needed to generate some artificial datasets. To generate these datasets, we needed techniques for generating random variates from some specific distributions. The standard Java library provides us with routines to generate uniform and Gaussian variates, but we may need variates generated from other distributions. Specifically, we needed to generate variates in one dimension¹ from a normalized part-triangle distribution in Chapter 4, and from an Inverse-Gamma distribution in Chapter 5. We briefly describe the details here.
Recall that in Chapter 4, we specify the density of $X$ as a triangle distribution and draw the data points of one bag from this triangle distribution but within the range of the bag. Since we further normalize the density so that it sums (integrates) to one, we need to generate variates from a normalized part-triangle distribution. Figure 7.1 shows the two cases that can happen for this distribution: (i) is the distribution if the center of a bag is outside the interval $[-l/2, l/2]$, where $l$ is the length of the bag range, and (ii) is the distribution otherwise. The distribution in case (i) is a line distribution whereas the distribution in case (ii) is a combination of two line distributions. The cornerstone for sampling from both distributions is thus a line distribution, denoted by $f(\cdot)$. Suppose the line is within $[a, b]$. Then we first draw a point $p$ uniformly from $[a, b)$. If $p$ is in the interval that has the greater density, $[a, (a+b)/2)$ in the case shown in Figure 7.1(i), then we accept $p$. Otherwise we draw another uniform variate $v$ from $[0, f(\frac{a+b}{2}))$. If $v > f(p)$ then we accept $a + b - p$ (i.e. we accept the point symmetric to $p$ w.r.t. $\frac{a+b}{2}$), otherwise we accept $p$. In this way we can generate variates with a probability that is equivalent to the area under the line $f(\cdot)$.

Figure 7.1: Sampling from a normalized part-triangle distribution.

¹ Because we assumed attribute independence, we generated variates for each dimension and then combined them into a vector.

When a bag's center is close to 0, we will have the distribution shown in Figure 7.1(ii). We can view it as a combination of two line distributions: one part in $[a, 0)$ and the other in $[0, b)$. We first decide which part the variate falls into, according to the probabilities that are equivalent to the areas under the two lines.² We simply draw a uniform variate from $[0, 1)$, and if it is less than the area under the line in $[a, 0)$, we pick this line distribution, otherwise the other part (in $[0, b)$). For each line distribution we use the same technique as discussed above to draw a random variate from it. This is how we draw data points for each bag from the conditional density given in Equation 4.3 in Chapter 4, where $Pr(x)$ is a triangle distribution.

² Note that the total area under the two lines sums to 1.
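A minimal sketch of this sampling scheme, assuming an illustrative functional representation of the line density $f$ (all class and method names are our own), might look as follows:

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** A sketch of the sampling scheme just described; f is the linear density. */
public class PartTriangleSampler {
    private final Random rand = new Random();

    /** Draws one variate from a line density f on [a, b] (case (i)). */
    public double sampleLine(double a, double b, DoubleUnaryOperator f) {
        double mid = (a + b) / 2.0;
        double p = a + rand.nextDouble() * (b - a);          // p uniform in [a, b)
        // If p falls in the half with the greater density, accept it directly.
        if (f.applyAsDouble(p) >= f.applyAsDouble(mid)) {
            return p;
        }
        double v = rand.nextDouble() * f.applyAsDouble(mid); // v in [0, f((a+b)/2))
        return (v > f.applyAsDouble(p)) ? (a + b - p)        // mirrored point
                                        : p;
    }

    /** Case (ii): pick one of the two lines by its area, then sample from it.
        areaLeft is the area under f on [a, 0); the two areas sum to one. */
    public double sampleTwoLines(double a, double b, double areaLeft,
                                 DoubleUnaryOperator f) {
        return (rand.nextDouble() < areaLeft) ? sampleLine(a, 0, f)
                                              : sampleLine(0, b, f);
    }
}
```

Note that the acceptance step only compares density values, so it works for each sub-interval regardless of how the density is normalized.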
In the TLD method in Chapter 5, we need to generate $\sigma_k^2$ from an Inverse-Gamma distribution (parameterized by $a_k$ and $b_k$) and $\mu_k$ from a Gaussian (parameterized by $m_k$ and $w_k\sigma_k^2$). Then we further generate a bag of instances with a Gaussian parameterized by $\mu_k$ and $\sigma_k^2$. There is a standard routine in Java for generating a standard Gaussian variate $v$. Using the transformation $v\sigma + \mu$ we can get a variate following $N(\mu, \sigma^2)$. For the Inverse-Gamma distribution of $\sigma_k^2$, it means that $\frac{a_k}{\sigma_k^2}$ follows a Chi-squared distribution with $b_k$ degrees of freedom (d.f.). Also note that if $X \sim$ Chi-squared$(2v)$ with $2v$ d.f., then $Y = \frac{X}{2} \sim$ Gamma$(v)$. Therefore we first generate a variate $u \sim$ Gamma$(b_k/2)$ and then by a simple transformation we can get $\sigma_k^2$, because $\frac{a_k}{\sigma_k^2} = 2u \Rightarrow \sigma_k^2 = \frac{a_k}{2u}$.

Since there is no standard routine in Java to generate standard Gamma variates, we built one from scratch. There are many methods to generate standard Gamma variates [Bratley, Fox and Schrage, 1983; Ahrens and Dieter, 1974; Ahrens, Kohrt and Dieter, 1983]. We chose the one described in [Minh, 1988] and implemented it in the weka.core.RandomVariates class, which also includes routines for generating standard exponential and Erlang (Gamma with integer parameter) variates.
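The following sketch puts the whole generative process together for one attribute. For self-containedness it substitutes a Marsaglia-Tsang Gamma sampler for the [Minh, 1988] routine used in weka.core.RandomVariates (whose exact interface we do not reproduce here); all names are illustrative:

```java
import java.util.Random;

/** A sketch of the two-level generative process for a single attribute. */
public class TLDBagGenerator {
    private final Random rand = new Random();

    /** One bag: sigma^2 ~ Inverse-Gamma(a, b), mu ~ N(m, w*sigma^2),
        then bagSize instances ~ N(mu, sigma^2). */
    public double[] generateBag(double a, double b, double w, double m, int bagSize) {
        double u = sampleGamma(b / 2.0);   // u ~ Gamma(b/2), so a/sigma^2 ~ Chi-squared(b)
        double sigmaSq = a / (2.0 * u);    // sigma^2 = a/(2u), as derived above
        double mu = m + rand.nextGaussian() * Math.sqrt(w * sigmaSq);
        double[] bag = new double[bagSize];
        for (int i = 0; i < bagSize; i++) {
            bag[i] = mu + rand.nextGaussian() * Math.sqrt(sigmaSq);
        }
        return bag;
    }

    // Standard Gamma(shape) variate via Marsaglia & Tsang (not the thesis's method).
    private double sampleGamma(double shape) {
        if (shape < 1.0) {  // boost: Gamma(s) = Gamma(s+1) * U^(1/s)
            return sampleGamma(shape + 1.0) * Math.pow(rand.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0, c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x = rand.nextGaussian();
            double t = 1.0 + c * x;
            if (t <= 0) continue;
            double v = t * t * t;
            double uu = rand.nextDouble();
            if (Math.log(uu) < 0.5 * x * x + d - d * v + d * Math.log(v)) {
                return d * v;
            }
        }
    }
}
```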

7.3 Feature Selection in the Musk Problems
First of all, note that we always base our models on certain assumptions, and feature selection is no exception. In this section we discuss feature selection based on the instance-based methods discussed in Chapter 4, under the collective assumption outlined in Chapter 3. The feature selection is based on the explanatory power of the features for the instance-level class probabilities. In this sense, the feature selection discussed here is at the instance level, and is assumption-based. Different assumptions may lead to totally different interpretations of the features.
The Diverse Density (DD) algorithm [Maron, 1998] has also been applied to feature selection on the Musk datasets. It was recognized that the scaling parameters, one for each attribute, found by DD indicate the importance of the corresponding attributes: the greater the value of the scaling parameter, the more important the corresponding attribute. This explanation fits our understanding of the radial form of the instance-level probability function modeled by DD. The (inverse of the) scaling parameter controls the dispersion of the radial probability function. Intuitively, the larger the dispersion along one dimension, i.e. the smaller the scaling parameter, the flatter the class probability around 0.5. Hence such an attribute is less useful for discrimination (because a flat probability function implies low purity of the classes on the two sides of the decision boundary). On the contrary, the smaller
the dispersion, i.e. the larger the scaling parameter, the sharper the probability
function along this specific dimension. In other words, this dimension is more useful for discrimination.
However, such a dispersion can easily be scaled. In the probability function for one dimension, $\exp\left[-\frac{(x-p)^2}{s^{-2}}\right]$, if we multiply both the denominator and the numerator by a constant, the probability remains unchanged but we can arbitrarily change the value of the scaling parameter $s$. This means that if we multiplied every data point in one dimension by an arbitrary constant and found the resulting $s$ accordingly, we could effectively manipulate the scaling parameter. Therefore it is necessary to standardize the data before using the scaling parameters as a direct indicator of feature importance. Although [Maron, 1998] divided every data point (along all dimensions) in the Musk datasets by 100 in order to facilitate the optimization, which happened to alleviate the scaling problem, the data points from different dimensions are still on different scales. Thus it may be premature to directly use the scaling parameter values for feature selection.
Formally, in order to test the hypothesis of whether one parameter is significant (typically, significantly different from 0), we should really find out the sampling distribution of the parameter in question and estimate its standard error from the data. Then we can standardize it to test the significance. However, the (assumed) sampling distribution of the parameter concerned and its standard error are not easy to find, especially in MI datasets. Thus we think it intuitive to use the parameter values found for the standardized data as an indicator. We use this approach to assess the feature importance in the linear logistic regression model for the Musk datasets.
In Figure 7.2 we show the absolute values of the relative linear coefficients found by MILogisticRegressionARITH with a ridge parameter $\lambda = 2$ on the Musk datasets. The coefficients are taken relative to the maximal (absolute) coefficient value, thus the largest is 100%. The data was standardized using the weighted mean and standard deviation, as described in Chapter 4. In linear logistic models, the meaning of the coefficients of the attributes is straightforward.
Figure 7.2: Relative feature importance in the Musk datasets: the left plot is for Musk1, and the right one for Musk2. (The y-axis shows the relative coefficient (%); the x-axis lists the attributes.)
If we fix the value of all dimensions but one, and plot the probability function along that
one dimension, we will find the familiar shape of the sigmoid (or inverse sigmoid) function, and the absolute value of the coefficient controls the sharpness of the function around the value of 0.5. If the absolute value of the coefficient of one attribute is high, then the probability function is very sharp around the value of 0.5. It means that on both sides of the decision boundary the purity of the two classes is high and we can easily separate them with this attribute, and vice versa. Therefore it is intuitive to regard this as an indicator of feature importance. Note that in order to avoid the scaling problem mentioned above, we use the coefficients estimated from the standardized data (and do not include the intercept). The ridge method tends to shrink the coefficients to zero, and this is what we observed after transforming the coefficients back to the original scale. As can be seen in Figure 7.2, the relative coefficients estimated from both datasets do not differ as much between attributes as we might expect from the results presented in [Maron, 1998], although we do observe that some attributes are much more important than others, especially in the Musk1 dataset. If we set a threshold for the relative coefficients to select attributes, say 40% or 50%, we may indeed pick out only a minority of the attributes (based on the collective assumption and the linear logistic instance-level model).
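Such a thresholding step is straightforward to implement. The following sketch (with illustrative names) assumes the coefficients were estimated from standardized data, excludes the intercept, and that at least one coefficient is non-zero:

```java
/** A sketch of attribute selection by relative coefficient magnitude. */
public class CoefficientSelection {

    /** coefficients: linear coefficients from the standardized data (no intercept);
        thresholdPercent: e.g. 40 or 50. Returns a keep/drop flag per attribute. */
    public static boolean[] select(double[] coefficients, double thresholdPercent) {
        double max = 0;
        for (double c : coefficients) max = Math.max(max, Math.abs(c));
        boolean[] selected = new boolean[coefficients.length];
        for (int k = 0; k < coefficients.length; k++) {
            double relative = 100.0 * Math.abs(coefficients[k]) / max;
            selected[k] = relative >= thresholdPercent;  // keep "sharp" attributes
        }
        return selected;
    }
}
```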

7.4 Algorithmic Details of TLD


There are three important factors in the implementation of the TLD method developed in Chapter 5: the first is the integral in the model; the second is the constrained
optimization; and the third is the numeric evaluation of the Gamma function. The
second factor has already been addressed in Section 7.1 and we discuss the other
two in this section. In addition, we discuss the interpretation of the parameters involved in TLD in more detail.
For the specific integral calculus for Equations 5.6 and 5.9 in Chapter 5, we list
the solution in Appendix C. We then maximize the formula resulting from the
integration. However, in general, if we specify arbitrary instance and bag-level
distributions, the integration over the instance-level parameters is really hard, if not impossible, which may restrict the practical feasibility of this method. It was thus suggested in the EB literature that an EM algorithm be used for this method [Maritz and Lwin, 1989]. More specifically, using EM terminology, we regard the instance-level parameter $\theta$ as the unobservable data and the bag-level parameter $\theta_y$ as the observable data in Equation 5.1 of Chapter 5. Now given a specific value $\theta_y^0$, we have the probability of $\theta$, $Pr(\theta|\theta_y^0)$, and the complete-data likelihood function $L(B_j; \theta, \theta_y^0) = Pr(B_j|\theta, \theta_y^0) = Pr(B_j|\theta)$. Thus the integral can be regarded as taking the expectation of the complete-data likelihood function over the unobservables, i.e. $E[L(B_j; \theta, \theta_y^0)]$. This constitutes the E-step in EM, and the resulting formula of the expectation is exactly the same as the last line in Equation 5.1. Then we regard the observables $\theta_y$ as a random variable and maximize this expected likelihood function w.r.t. $\theta_y$, which is the M-step. Under some regularity conditions, we can maximize $L(B_j; \theta, \theta_y)$ within the expectation sign. Hence within the expectation sign, we take a Newton step from $\theta_y^0$ to get a new value $\theta_y^1$, that is $\theta_y^1 = \theta_y^0 + E(H_L^{-1})E(g_L)$, where $H_L$ is the Hessian matrix (second derivative) of the likelihood and $g_L$ is the Jacobian (first derivative) at the point $\theta_y^0$. The expectation is, again, over $\theta$ and can now be evaluated numerically. This defines an iterative solution of this maximum likelihood method, which was cited by the well-known original paper of the EM algorithm [Dempster et al., 1977] as an early example of the EM algorithm.
In the TLD method we also have a Gamma function to evaluate in Equation 5.6. More precisely, we need to evaluate:

$$ g = \log\frac{\Gamma((b_k + n_j)/2)}{\Gamma(b_k/2)} $$

Define $h = \lfloor n_j/2 \rfloor$ and $s = \sum_{x=1}^{h} \log(b_k/2 + h - x)$. If $n_j$ is even, $g = s$ according to the well-known identity $\Gamma(y+1) = y\Gamma(y)$. But if $n_j$ is odd, we have $g = s + \log\frac{\Gamma(b_k/2 + 1/2)}{\Gamma(b_k/2)}$, so we still have a log-Gamma function to evaluate. We used the implementation suggested by [Press, Teukolsky, Vetterling and Flannery, 1992] to evaluate $\log\Gamma(y)$. However, since the method we use to search for the minimum of the negative log-likelihood function requires the Jacobian and the Hessian matrix of the function,³ we also need to evaluate $\frac{d}{dy}\log\Gamma(y)$ and $\frac{d^2}{dy^2}\log\Gamma(y)$ in this case. Fortunately there is an easy approximation for them [Artin, 1964]:

$$ \frac{d}{dy}\log\Gamma(y) = -C - \frac{1}{y} + \sum_{i=1}^{+\infty}\left(\frac{1}{i} - \frac{1}{y+i}\right) $$

$$ \frac{d^2}{dy^2}\log\Gamma(y) = \frac{1}{y^2} + \sum_{i=1}^{+\infty}\frac{1}{(y+i)^2} $$

where $C$ is Euler's constant, which cancels out when taking differences as in Equation 5.6.

³ Note that the Quasi-Newton method we used does not itself need the Hessian matrix. We provide the Hessian to give better solutions in the case of the bound constraints.
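A direct (truncated) evaluation of these two series can be sketched as follows; the truncation length is an illustrative choice of ours, and since $C$ cancels in the differences we simply omit it:

```java
/** A sketch evaluating the series approximations above by simple truncation. */
public class LogGammaDerivatives {
    static final int TERMS = 100000; // illustrative truncation length

    /** Returns d/dy log Gamma(y) + C, i.e. the digamma function shifted by
        Euler's constant, which cancels in the differences of Equation 5.6. */
    static double digammaPlusC(double y) {
        double sum = -1.0 / y;
        for (int i = 1; i <= TERMS; i++) {
            sum += 1.0 / i - 1.0 / (y + i);
        }
        return sum;
    }

    /** Returns d^2/dy^2 log Gamma(y) (the trigamma function). */
    static double trigamma(double y) {
        double sum = 1.0 / (y * y);
        for (int i = 1; i <= TERMS; i++) {
            sum += 1.0 / ((y + i) * (y + i));
        }
        return sum;
    }
}
```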
We would like to further analyze the model in TLD to get more insight into the interpretation of the parameters in this method. We restrict our discussion to one dimension so that we can discard the subscript $k$. The parameters $a$ and $b$ together define the properties of $\sigma^2$. The parameter $m$, the mean of $\mu$, is quite independent of the other parameters, whereas $w$ depends on both the variance of $\mu$ and the (expected value of) $\sigma^2$. Thus it seems that the most difficult part of the interpretation comes from the parameters $a$ and $b$. If we fix the values of $w$ and $m$, and assume some reasonable values for the sample mean $\bar{x}$ and the sample variance $\frac{S^2}{n-1}$, we can get a log-likelihood (LL) function similar to that in Figure 7.3, which shows the value of the LL function we constructed associated with different values of $a$ and $b$, denoted by $LL(a, b)$.

Figure 7.3: Log-likelihood function expressed via the parameters $a$ and $b$.
by LL(a; b).
Although it looks flat, there are maximal points in LL, according to the contour plot on the $a$-$b$ plane. As a matter of fact, the maximal points seem to lie on a particular line. To explain this, note that $\sigma^2$ follows an Inverse-Gamma distribution; alternatively, $\frac{a}{\sigma^2}$ follows a Chi-squared distribution with $b$ degrees of freedom. Thus the density function of $\sigma^2$ is:

$$ dF(\sigma^2) = \frac{\left(\frac{a}{2\sigma^2}\right)^{b/2} \exp\left[-\frac{a}{2\sigma^2}\right]}{\sigma^2\,\Gamma(b/2)}\,d\sigma^2 $$

If we calculate the mean (i.e. the first moment), it is:

$$ E(\sigma^2) = \int_0^{\infty} \sigma^2\, \frac{\left(\frac{a}{2\sigma^2}\right)^{b/2} \exp\left[-\frac{a}{2\sigma^2}\right]}{\sigma^2\,\Gamma(b/2)}\,d\sigma^2 $$

By setting $y = a/(2\sigma^2)$, we have

$$ E(\sigma^2) = \frac{a}{\Gamma(\frac{b}{2})}\int_0^{\infty} y^{b/2}\exp(-y)\,\frac{1}{2y^2}\,dy = \frac{a}{2\Gamma(\frac{b}{2})}\int_0^{\infty} y^{\frac{b}{2}-2}\exp(-y)\,dy $$

and if $0 < \frac{b}{2} \neq 1$, we use integration by parts:

$$ E(\sigma^2) = \frac{a}{2\Gamma(\frac{b}{2})}\,\frac{1}{\frac{b}{2}-1}\left\{ y^{\frac{b}{2}-1}\exp(-y)\Big|_0^{\infty} + \int_0^{\infty} y^{\frac{b}{2}-1}\exp(-y)\,dy \right\} $$
$$ = \frac{a}{2\Gamma(\frac{b}{2})\left(\frac{b}{2}-1\right)}\left\{ y^{\frac{b}{2}-1}\exp(-y)\big|_{y\to\infty} - y^{\frac{b}{2}-1}\exp(-y)\big|_{y\to 0} + \Gamma\!\left(\tfrac{b}{2}\right) \right\} $$
$$ = \frac{a}{2\Gamma(\frac{b}{2})\left(\frac{b}{2}-1\right)}\left\{ 0 - y^{\frac{b}{2}-1}\exp(-y)\big|_{y\to 0} + \Gamma\!\left(\tfrac{b}{2}\right) \right\} $$

If $\frac{b}{2} > 1$, i.e. $b > 2$, then $y^{\frac{b}{2}-1}\exp(-y)\big|_{y\to 0} = 0$, so $E(\sigma^2) = \frac{a}{b-2}$. Otherwise, if $b < 2$, $y^{\frac{b}{2}-1}\exp(-y)\big|_{y\to 0} = \infty$ and thus $E(\sigma^2) = \infty$. If $b = 2$,

$$ E(\sigma^2) = \frac{a}{2\Gamma(\frac{b}{2})}\int_0^{\infty} y^{-1}\exp(-y)\,dy = \infty. $$

Therefore, if $b \leq 2$, the distribution is proper but the mean does not exist; otherwise the mean is $\frac{a}{b-2}$.
Hence as long as the ratio of $a$ and $b-2$ is kept constant, the (expected) value of $\sigma^2$ is the same, and so is the LL function value given the other parameters and the data. But note that the variance of $\sigma^2$ is not the same even if $a/(b-2)$ is constant, because $Var(\sigma^2) = 2a^2/[(b-2)^2(b-4)]$, provided that $b > 4$. That is why in Figure 7.3 the values of $a$ and $b$ corresponding to the maximal LL values are not exactly linear, but very close to linear.
This analysis is useful for performing feature selection using the TLD method. Even if two features have different $a_k$ and $b_k$ values but similar values of $\frac{a_k}{b_k - 2}$, $w_k$ and $m_k$, they are still not useful for discrimination. Moreover, the parameter $w_k$ denotes the ratio $\frac{Var(\mu_k)}{\sigma_k^2}$. Given $Var(\mu_k)$, $w_k$ actually depends on $a_k$ and $b_k$. Experiments (on the artificial data) show that if we specify an independent $Var(\mu_k)$ and $b_k > 2$, the TLD method will find $w_k$ as $Var(\mu_k)/\frac{a_k}{b_k - 2}$, which is reasonable according to the above analysis. However, if $b_k \leq 2$, whatever TLD finds will be uninterpretable. Unfortunately, we did find many such attributes in the Musk data, which may be a reason why the simplified model TLDSimple works better than TLD on this data. On the other hand, when we applied TLD to the kiwifruit datasets described in Chapter 6, we did not observe this (adverse) phenomenon (i.e. all $b_k$'s were greater than 2). As a result, TLD can be applied successfully and, as discussed in Chapter 6, it actually outperformed TLDSimple on this data.
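This analysis suggests a simple diagnostic, sketched below with illustrative names of our own: report $E(\sigma_k^2) = a_k/(b_k - 2)$ where it exists and flag attributes with $b_k \leq 2$ as uninterpretable:

```java
/** A sketch of a per-attribute diagnostic based on the analysis above. */
public class TLDAttributeCheck {

    /** a[k] and b[k] are the fitted hyper-parameters of attribute k. */
    public static void report(double[] a, double[] b) {
        for (int k = 0; k < a.length; k++) {
            if (b[k] > 2) {
                // The expected instance-level variance exists: a_k / (b_k - 2).
                System.out.printf("attribute %d: E(sigma^2) = %.4f%n",
                                  k, a[k] / (b[k] - 2));
            } else {
                System.out.printf("attribute %d: b_k <= 2, mean of sigma^2 undefined%n", k);
            }
        }
    }
}
```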

7.5 Algorithmic Analysis of DD

In this section we briefly describe how we can view the Diverse Density (DD) algorithm [Maron, 1998] as a maximum binomial likelihood method, and how we should interpret its parameters.
The Diverse Density (DD) method was proposed by Maron [Maron, 1998; Maron and Lozano-Perez, 1998] and has EM-DD [Zhang and Goldman, 2002] as its successor, which claims to be an application of the EM algorithm to the most-likely-cause model of the DD method. However, we think that the EM-DD algorithm has problems finding a maximum likelihood estimate (MLE), as shown in Appendix D. Thus we are strongly skeptical about the validity of EM-DD's solution as an ML method. We only discuss the DD algorithm here.
The basic idea of DD is to find a point $x$ in the instance space such that as many positive bags as possible overlap on the point but no (or few) negative bags cover it. The diverse density of a point $x$ stands for the probability of this point being the true concept. Mathematically,

$$ DD(x) = Pr(x = t \mid B_1^+, \cdots, B_n^+, B_1^-, \cdots, B_m^-) $$

where $B$ is one bag. The $+/-$ sign indicates a bag's class label and $t$ is the true concept. DD aims to maximize this probability with respect to $t$, to find the target concept $t$ that is "most likely to agree with the data" [Maron, 1998]. After some manipulations, this is equivalent to maximizing the following likelihood function:

$$ L(x|B) = \prod_{1 \leq i \leq n} Pr(x = t|B_i^+) \prod_{1 \leq j \leq m} Pr(x = t|B_j^-) \qquad (7.1) $$
We believe this notation is confusing and may even disguise the essence of the DD method. Hence we would like to express this method from a different perspective. As is usually done in single-instance learning, we would like to explicitly represent the class labels as a variable $Y$. In MI, there is still a $Y$, not for each instance, but for each bag. In general $Y = 1, 2, \cdots, K$ for $K$-class data, but the Musk datasets are two-class data. Here we say $Y = 0$ if the bag is negative and 1 if it is positive. In the above formulation this variable is disguised by the $+$ and $-$ signs, and we believe that $Pr(x = t|B_i^+)$ really means $Pr(Y = 1|B_i)$, and likewise $Pr(x = t|B_j^-)$ means $Pr(Y = 0|B_j)$.

[Maron, 1998] proposes two approximate generative models, namely the noisy-or and the most-likely-cause models, to calculate $\Pr(x = t \mid B_i^+)$ and $\Pr(x = t \mid B_i^-)$ in practical problems. They can be described as follows:

Noisy-or model: DD borrows the noisy-or model from the Bayesian networks literature. Noisy-or calculates the probability of an event to happen in the face of a few causes, assuming that the probability of any cause failing to trigger this event is independent of any other causes. DD regards an instance as a cause and $t$ being the underlying concept as the event, hence, using the original notation,

$$\Pr(x = t \mid B_i^-) = \prod_j \left[1 - \Pr(x = t \mid B_{ij}^-)\right]$$

and

$$\Pr(x = t \mid B_i^+) = 1 - \prod_j \left[1 - \Pr(x = t \mid B_{ij}^+)\right].$$

Most-likely-cause model: In the most-likely-cause model, only one instance in each bag is regarded as the cause for making a certain $x$ the true concept, say, the $z_i$th instance in the $i$th bag. $z_i$ is defined as

$$z_i = \arg\max_{z_i} \{\Pr(x = t \mid B_{iz_i})\}.$$

As before, we have the two complementary expressions

$$\Pr(x = t \mid B_i^-) = 1 - \max_j \Pr(x = t \mid B_{ij}^-)$$

and

$$\Pr(x = t \mid B_i^+) = \max_j \Pr(x = t \mid B_{ij}^+).$$


A well-known category of single-instance learners models the posterior probability of $Y$ given the feature data $X$, $\Pr(Y = k \mid X)$ ($k = 1, 2$), directly, e.g. logistic regression or entropy-based decision trees. Logistic regression models the process that determines an instance's class label as a one-stage Bernoulli process and hence fits a one-stage binomial distribution for $\Pr(Y = k \mid X)$. It then uses the ML method to estimate the parameters involved. According to our understanding, DD still uses the above single-instance paradigm. Each point (i.e. each instance) in the instance space has a probability of being positive and negative, and DD models the posterior probability of $Y$ directly. The difference is that we now have multiple instances associated with one class label.


In the noisy-or model, we model the process that determines the class label of a bag as an $m$-stage Bernoulli process, where $m$ is the number of instances inside the bag (each one of the instances corresponds to one stage). Suppose we knew the probability for each instance to be positive, $\Pr(Y = 1 \mid x_b)$ ($b = 1, 2, \ldots, m$); then we could first determine the class label of each instance (or stage) according to its class probability. Given this, what is the probability for all the instances (stages) to be negative if they are independent? This is a simple question and the solution is $\prod_{b=1}^{m} (1 - \Pr(Y = 1 \mid x_b))$. The complementary probability $1 - \prod_{b=1}^{m} (1 - \Pr(Y = 1 \mid x_b))$ corresponds to the probability for at least one of them to be positive. This is a probabilistic representation of the standard MI assumption. Now the log-likelihood function of this multi-stage Binomial probability is simply

$$L = \sum_B \left[ Y \log\left(1 - \prod_{b=1}^{m} (1 - \Pr(Y = 1 \mid x_b))\right) + (1 - Y) \log\left(\prod_{b=1}^{m} (1 - \Pr(Y = 1 \mid x_b))\right) \right] \qquad (7.2)$$

where there are $B$ bags in total. This is exactly what DD with the noisy-or model uses.

Within this multi-stage Bernoulli process, it is straightforward to write down other formulae if the conditions for the MI assumption change, for instance, "a bag is positive if at least $r$ ($1 < r < m$) instances are positive and negative otherwise", etc.
The most-likely-cause model, on the other hand, selects a representative of each bag based on $\Pr(Y = 1 \mid x_b)$. More specifically, $b = \arg\max_{b \in m}\{\Pr(Y = 1 \mid x_b)\}$, i.e. it selects the instance with the highest probability to be positive in a bag. Therefore it literally degrades a bag into one instance and the one-stage Bernoulli process is applied to determine the class label, as in the mono-instance case. The log-likelihood function is now

$$L = \sum_B \left[ Y \log\left(\max_{b \in m}\{\Pr(Y = 1 \mid x_b)\}\right) + (1 - Y) \log\left(1 - \max_{b \in m}\{\Pr(Y = 1 \mid x_b)\}\right) \right] = \sum_B \left[ Y \log\left(\max_{b \in m} \Pr(Y = 1 \mid x_b)\right) + (1 - Y) \log\left(\min_{b \in m} \Pr(Y = 0 \mid x_b)\right) \right] \qquad (7.3)$$

where there are $B$ bags in total. This is exactly what DD with the most-likely-cause model uses. The most-likely-cause model also follows the standard MI assumption because, by selecting an instance in a negative bag that has the maximal probability to be positive and setting that probability (via the binomial model) to less than 0.5, it implies that the probability to be positive for every instance in a negative bag cannot be greater than 0.5.
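To make the two models concrete, the following Java sketch (our own illustration, not code from DD or from our MI package) computes the bag-level probability under each model, and the corresponding binomial log-likelihood of Equations 7.2 and 7.3, from given instance-level probabilities $\Pr(Y = 1 \mid x_b)$:

    // A sketch of DD's two combining models; the instance-level probabilities
    // are assumed given (in DD they come from the radial model described below).
    public class BagProbability {

        // Noisy-or model: Pr(Y=1|B) = 1 - prod_b (1 - Pr(Y=1|x_b)).
        public static double noisyOr(double[] instProbs) {
            double allNegative = 1.0;
            for (double p : instProbs) allNegative *= (1.0 - p);
            return 1.0 - allNegative;
        }

        // Most-likely-cause model: Pr(Y=1|B) = max_b Pr(Y=1|x_b).
        public static double mostLikelyCause(double[] instProbs) {
            double max = 0.0;
            for (double p : instProbs) max = Math.max(max, p);
            return max;
        }

        // Binomial log-likelihood of Equations 7.2/7.3: bags[i] holds the
        // instance-level probabilities of bag i; y[i] is its class label (0 or 1).
        public static double logLikelihood(double[][] bags, int[] y, boolean useNoisyOr) {
            double logL = 0.0;
            for (int i = 0; i < bags.length; i++) {
                double p = useNoisyOr ? noisyOr(bags[i]) : mostLikelyCause(bags[i]);
                logL += y[i] * Math.log(p) + (1 - y[i]) * Math.log(1.0 - p);
            }
            return logL;
        }
    }

Note that the r-of-m variant mentioned above would only change the combining function, not the binomial log-likelihood around it.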
This means that DD uses the binomial probability formulation, either one-stage or multiple-stage, to model the class probabilities of the bags. In general, we can separate the modeling into two levels if we introduce a new variable denoting the bags, $B$. On the bag level we always have a one-stage binomial formulation for $\Pr(Y \mid B)$, and on the instance level we build a relationship between $\Pr(Y \mid B)$ and $\Pr(Y_X \mid X)$ where $Y_X$ is an instance's class label, which is unobservable from the data. Mathematically, at the bag level, assuming the data is i.i.d. (we assume that the class labels $Y_{B_i}$ of the bags $B_i$ are independent and identically distributed according to a binomial, or multinomial in a multi-class problem, distribution), the marginal probability



of $Y$ is a one-stage binomial distribution,

$$\Pr(Y \mid B_i) = \pi^{Y_{B_i}} (1 - \pi)^{1 - Y_{B_i}}$$

where $\pi = \Pr(Y = 1 \mid B_i)$. Normally we have a parametric model for $\pi$ with a parameter vector $\beta$ estimated using the data, i.e. $\pi = g(\beta; B_i)$ in this case. Thus we can regard the marginal probability of $Y$ as being parameterized by $\beta$. In other words, the likelihood function for $\beta$ is

$$L(\beta \mid Y_B) = \Pr(Y_{B_1}, \ldots, Y_{B_N} \mid B_1, \ldots, B_N; \beta) = \prod_{1 \le k \le N} g(\beta; B_k)^{Y_{B_k}} (1 - g(\beta; B_k))^{1 - Y_{B_k}},$$

and, assuming $n$ positive bags and $m$ negative bags,

$$L(\beta \mid Y_B) = \prod_{1 \le i \le n} g(\beta; B_i) \prod_{1 \le j \le m} (1 - g(\beta; B_j)). \qquad (7.4)$$

At the instance level, we build a relationship between $\Pr(Y \mid B_k)$ and $\Pr(Y_X \mid X_{kl})$ with $X_{kl} \in B_k$, i.e. $\Pr(Y \mid B_k) = h(\Pr(Y_X \mid X_{kl})) \Rightarrow g(\beta; B_k) = h(f(\beta; X_{kl}))$, where $f(\beta; X_{kl}) = \Pr(Y_X \mid X_{kl})$. In the noisy-or model, $h(f) = 1 - \prod_{l=1}^{n_k} (1 - f(\beta; X_{kl}))$. In the most-likely-cause model, $h(f) = \max_l \{f(\beta; X_{kl})\}$. One could plug in other $h(\cdot)$'s based on other assumptions believed to be true, but the binomial likelihood function in Equation 7.4 remains unchanged. This perspective on DD establishes its relationship with the MI methods described in Chapter 4.
The likelihood in Equation 7.4 is a generalization of Equations 7.2 and 7.3, and it is identical to the likelihood function in Equation 7.1 that was given in the original description of DD. Now if we think of the bags as fixed and estimate $\hat{\beta}$ by maximizing the likelihood function, it is easily recognized that DD is simply a maximum binomial likelihood method.
The last question to ask is how to establish an exact formula for the instance-level probability $\Pr(Y_X \mid X_{kl}) = f(\beta; X_{kl})$. [Maron, 1998] proposed three ways. One is to use $\exp(-\|X_{kl} - p\|^2)$, where $\|\cdot\|$ is the Euclidean norm and $p$ is the parameter standing for a point in the instance space. The second one is to use $\exp(-\|s(X_{kl} - p)\|^2)$, where $s$ is a diagonal matrix with diagonal elements that are the scaling parameters for the different dimensions. The last model is a variation of the second one that models a set of disjunctive concepts, each of which is the same as the second model. As a matter of fact, suppose we knew there are $D$ concepts to be found. Then the DD method would have $D$ sets of parameters (instead of one set of parameters) $v_1, v_2, \ldots, v_D$, where $v_d$, $d = 1, 2, \ldots, D$, is a vector of parameters consisting of both point and scaling parameters ($p$ and $s$) for each dimension. In the process of searching for the values of the $p_d$'s and $s_d$'s, the probability of each $X_{kl}$ is associated with only one concept: the one that makes $X_{kl}$ have the highest $\Pr(Y = 1 \mid X_{kl})$. In other words, $\Pr(Y = 1 \mid X_{kl})$ is calculated as $\max_d\{\exp(-\|s_d(X_{kl} - p_d)\|^2)\}$.
The formulation of $\Pr(Y_X \mid X_{kl})$ ($f(\beta; X_{kl})$) is in a radial (i.e. Gaussian-like) form, with a center of $p_t$ and a dispersion of $\frac{1}{s_t}$ in the $t$th dimension (where $s_t$ is the $t$th diagonal element in $s$). The closer an instance is to $p$, the higher its probability to be positive. And the dispersion determines the decision boundary of the classification, i.e. the threshold of when $\Pr(Y_X \mid X_{kl}) = 0.5$. It is similar to the axis-parallel hyper-rectangle (APR) [Dietterich et al., 1997] on the instance level. But APR is not differentiable. In order to make it differentiable, DD essentially models the (instance-level) decision boundary as an axis-parallel hyper-ellipse (APE) instead of a hyper-rectangle. The semi-axis of this APE along the $t$th dimension is $\frac{\sqrt{\log 2}}{s_t}$.

For example, in a two-dimensional space, the decision boundary is

$$\exp\left[-s_1^2 (x_1 - p_1)^2 - s_2^2 (x_2 - p_2)^2\right] = 0.5 \;\Rightarrow\; \frac{(x_1 - p_1)^2}{\log 2 / s_1^2} + \frac{(x_2 - p_2)^2}{\log 2 / s_2^2} = 1$$

where $p_1, p_2, s_1, s_2$ are the parameters to be estimated. We know this is an ellipse centered at $(p_1, p_2)$, with semi-axes $\frac{\sqrt{\log 2}}{s_1}$ and $\frac{\sqrt{\log 2}}{s_2}$ along the two axes. Any point within this ellipse should be classified as positive. This is exactly the second formulation in DD described above. The third model in DD models more than one APE using $f(\beta; X_{kl})$. No matter what the formulation for $f(\beta; X_{kl})$ is, note that $\beta$ is an instance-level parameter and DD aims to recover the instance-level probability in a structured form under the MI assumption. Hence we categorize the DD method as an instance-based approach.
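To illustrate, the following Java sketch (our own, hypothetical code) evaluates the radial instance-level model $f(\beta; x) = \exp(-\|s(x - p)\|^2)$ and classifies a bag with the noisy-or model at the 0.5 threshold, in the spirit of the suggestion made in the next paragraph:

    public class RadialInstanceModel {
        private final double[] p;  // the point parameter: the candidate target concept
        private final double[] s;  // the scaling parameter for each dimension

        public RadialInstanceModel(double[] p, double[] s) { this.p = p; this.s = s; }

        // Pr(Y_X = 1 | x) = exp(-||s(x - p)||^2), the radial form above.
        public double instanceProbability(double[] x) {
            double dist2 = 0.0;
            for (int t = 0; t < x.length; t++) {
                double z = s[t] * (x[t] - p[t]);
                dist2 += z * z;
            }
            return Math.exp(-dist2);
        }

        // Classify a bag as positive iff the noisy-or bag probability exceeds 0.5,
        // i.e. without searching for a separate threshold.
        public boolean classifyBag(double[][] bag) {
            double allNegative = 1.0;
            for (double[] x : bag)
                allNegative *= (1.0 - instanceProbability(x));
            return (1.0 - allNegative) > 0.5;
        }
    }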
Nonetheless, DD interprets the parameter vector $s$ purely as a scaling parameter and does not recognize it as related to the diameter of the decision boundary. As a result it never uses it for classifying a new exemplar (bag). Instead it tries to find new axis-parallel thresholds via an additional time-consuming optimization procedure. We regard this as unnecessary because the instance-level probability has already been recovered (parameterized by $p$ and $s$), so why not use it? We hence suggest that all the parameters are simply plugged into the noisy-or (or most-likely-cause) model to calculate the binomial probability of $Y_{B_{new}}$ for a new bag $B_{new}$. The classification is made depending on whether this probability is greater than 0.5. We have done some experiments with DD based on the noisy-or model but without searching for the threshold (i.e. using 0.5 as the threshold) and found that the accuracy over 10 runs of 10-fold cross-validation (CV) of the DD method is 87.07%±1.40% on the Musk1 data and 83.24%±2.29% on the Musk2 data. These are very similar to the best results reported when searching for the threshold, which are 88.9% on the Musk1 data and 82.5% on the Musk2 data (because [Maron, 1998] did not report how many runs of 10-fold CV were used, nor the standard deviation of the accuracies, we cannot do a significance test to see whether the differences are significant), but the computational expense is greatly reduced. The misunderstanding of the parameter $s$ may have also compromised the feature selection in DD, which was discussed in Section 7.3.


Finally we discuss the optimization problem in the ML method in DD. There are no difficulties in numerically maximizing the likelihood function with the noisy-or model: $L$ in Equation 7.2 can be maximized directly via a numeric optimization procedure. But in the most-likely-cause model, there are max functions in the likelihood of Equation 7.3, which makes it non-differentiable. The softmax function is used in DD, which is the standard way to make the max function differentiable.
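For completeness, one standard form of such a softmax approximation (we do not reproduce DD's exact parameterization here) replaces $\max_b v_b$ by the differentiable exponentially-weighted average

$$\operatorname{softmax}_\alpha(v_1, \ldots, v_m) = \frac{\sum_{b=1}^{m} v_b\, e^{\alpha v_b}}{\sum_{b=1}^{m} e^{\alpha v_b}},$$

which tends to $\max_b v_b$ as $\alpha \to +\infty$ and can therefore be fed to a gradient-based optimizer.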
The EM-DD method [Zhang and Goldman, 2002] was proposed to overcome the difficulty of non-differentiability and to make this model faster. However, as shown in Appendix D, it has problems finding the MLE.
Even with the noisy-or model, DD still has a difficult global optimization problem to solve due to the radial form of $\Pr(Y_X \mid X_{kl})$. The usual way to tackle the global optimization problem is to try different initial values when searching for the optimal value of the variables. [Maron, 1998] proposed a strategy that starts searching from the value of every instance within all the positive bags. However this strategy is computationally too expensive to be practical, especially on large datasets like Musk2. [Maron, 1998] also mentioned that, theoretically, starting from every instance within one positive bag can be enough to approximately find the point with the highest diverse density. We thus adopt the latter strategy in our implementation of DD (in the MI package, as described in Appendix A). More precisely, we picked the positive bag(s) with the largest size (we picked all of them if there was more than one), and tried every instance within the bag(s) as the start value of the search. In fact we observed that this strategy gives a higher accuracy on the Musk1 data than the strategy that starts with instance values from all the positive bags. However, the improvement proposed in Section 4.6 of Chapter 4, namely to change the formulation of $\Pr(Y_X \mid X_{kl})$, can help avoid this inconvenience in the optimization process.

In summary, we recognize DD as a parametric method that uses the maximum binomial likelihood method to recover the instance-level probability function in an APE form based on the MI assumption. Therefore it is a member of the "APR-like + MI assumption" family.


Chapter 8
Conclusions and Future Work

The approach adopted in this thesis is a conservative one in the sense that it is
similar to existing methods of multiple instance learning. Much of the work is an
extension or a result derived from the statistical interpretation of current methods in
either single-instance or MI learning. Although some new MI methods have been
described in this thesis, we basically adopted a theoretical perspective similar to
that of the current methods. Hence much of the richness of the multiple instance
problems is left to be explored. In this chapter, we first summarize what has been
done in this thesis. Then we propose what could be done to more systematically
explore MI learning in a statistical classification context.

8.1 Conclusions
In this thesis, we first presented a framework for MI learning based on which we summarized MI methods published in the literature (Chapter 2). We defined two main categories of MI methods: instance-based and metadata-based approaches. While instance-based methods focus on modeling the class probability of each instance and then combining the instance-level probabilities into bag-level ones, metadata-based methods extract metadata from each bag and model the metadata directly. Note that in the instance-based methods, the combination of the instance-level predictions into bag-level ones requires some assumptions.
The standard assumption that can be found in the literature is the MI assumption.
We proposed a new assumption instead of the MI assumption and called it the collective assumption. We also explained that some of the current MI methods have
implicitly used this assumption. Under the collective assumption, we developed
new methods that fall into two categories: bag-conditional and group-conditional
approaches.
A bag-conditional approach models the probability of a class given a bag of $n$ instances, $\Pr(Y \mid X_1, \ldots, X_n)$ (or some transformation of this probability). Under the collective assumption we can model it as some function $f[\cdot]$ of the point-conditional probability $\Pr(Y \mid X)$ (or a transformation of this probability), i.e.

$$\Pr(Y \mid X_1, \ldots, X_n) = f[\Pr(Y \mid X_i); \; i = 1, \ldots, n].$$

Because many single-instance learners model $\Pr(Y \mid X)$ (or a transformation of it), we can either wrap around them (Chapter 3) or upgrade them (Chapter 4) to enable them to deal with MI data; one concrete choice of $f$ is sketched below. The resulting methods are instance-based MI learners.
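As one concrete (and purely illustrative) choice of $f$, in the spirit of the geometric-mean combination used in Chapter 4, one could take the normalized geometric mean of the instance-level probabilities:

$$\Pr(Y = 1 \mid X_1, \ldots, X_n) = \frac{\prod_{i=1}^{n} \Pr(Y = 1 \mid X_i)^{1/n}}{\prod_{i=1}^{n} \Pr(Y = 1 \mid X_i)^{1/n} + \prod_{i=1}^{n} \Pr(Y = 0 \mid X_i)^{1/n}}.$$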
A group-conditional approach models the probability density of a bag of $n$ instances given a class, i.e. $\Pr(X_1, \ldots, X_n \mid Y)$, and then calculates $\Pr(Y \mid X_1, \ldots, X_n)$ based on $\Pr(X_1, \ldots, X_n \mid Y)$ and Bayes' rule. It is not obvious how to model the density $\Pr(X_1, \ldots, X_n \mid Y)$ directly. Under the collective assumption, we could have simply assumed $X_1, \ldots, X_n$ are from the same density. However, the generative model would have been too simple to solve real-world problems. Instead, we assumed that all instances from the same bag are from the same density while different bags correspond to different (instance-level) densities. We then related the parameters of these densities to one another by using a hyper-distribution (or bag-level distribution) on the parameters. This resulted in a two-level distribution (TLD) solution (Chapter 5) to MI learning. This is essentially a metadata-based approach. We discovered that this approach is an application of the empirical Bayes method from statistics to the MI classification problem.
Then we explored some practical applications of MI learning: drug activity prediction, fruit disease prediction and content-based image categorization (Chapter 6). We also performed some experiments on datasets from these practical domains and found that the methods developed in this thesis are competitive with published methods.
Finally we presented some important algorithmic details of the methods discussed in this thesis (Chapter 7). These include numeric optimization techniques, artificial data generation details, feature selection on the Musk datasets, algorithmic details of the TLD method, and the analysis of the DD algorithm [Maron, 1998]. As a by-product of this thesis, we also discovered the relationship between MI learning and the meta-analysis problem in statistics (Section 5.6 in Chapter 5), pointed out some errors in some of the current MI methods (Appendix D) and implemented some numeric optimization procedures for the WEKA workbench [Witten and Frank, 1999] (Chapter 7 and Appendix B).

8.2 Future Work


MI learning differs from single-instance learning in two ways: (1) it has multiple
instances in an example, and (2) only one class label is observable in the data for
each bag of instances. Although the name multiple instance seems to denote only
the first property, it has become a convention in MI learning that both should be
satisfied. Let us factorize these two ways into two steps, which may help us see a
direction for future work on MI learning with a statistical perspective.

Learning problems with multiple instances per bag


First, let us consider a problem simpler than the MI problem: we have multiple instances in an example, but each instance has its own class label. In other words, we construct the data as in single-instance learning, adding one more attribute named "Bag ID" that indicates which bag an instance is from. At testing time, a new bag of instances is given but each instance is to be classified individually. The reader might think that this is an uninteresting problem because we could apply single-instance learners directly to solve this problem by deleting the "Bag ID". However, this line of thinking may not be true. If the fact that some instances are from the same bag indeed provides us with some additional information about their class labels, none of the single-instance learners can perform well on this problem because they all ignore this extra information that implicitly resides in the data.
For example, suppose the posterior (class) probability of each instance is dominated by some parameter $\theta$, $\Pr(Y \mid X, \theta)$, where $X$ includes all the attributes except the "Bag ID" attribute. Now suppose $\theta$ changes from bag to bag, following a specific distribution. Then we have $\theta_1$ for the first bag and can generate its instances' class labels according to $\Pr(Y \mid X, \theta_1)$, $\theta_2$ for another bag, generating its instances' class labels based on $\Pr(Y \mid X, \theta_2)$, etc. Obviously normal single-instance learners are not expected to deal with this data because they cannot use the information provided by the "Bag ID" attribute. Since this information resides in the data, there is room to develop a new family of methods that can fully utilize the bag information. Such new methods may outperform normal single-instance learners on this type of problem.
We call such a problem a "semi-MI" problem because the second property of MI problems is not satisfied. As shown above, much of the richness of multiple instance learning already appears in semi-MI problems, where normal single-instance learning cannot be applied. When classifying a test instance in semi-MI learning, we can regard the rest of the instances within the same bag as an "environment" for the classification. Even if the instance to be classified does not change, the classification may change if the environment (i.e. the other instances within the bag) changes. Ignoring this contextual information may not give an accurate prediction. Nevertheless MI research seems to regard the semi-MI problem as the same setting as normal single-instance learning, and semi-MI problems do not appear to be actively researched. Almost all of the current MI methods and the methods in this thesis (except the TLD approach) have not thoroughly and systematically explored the extra information provided purely by the setting of multiple instances per example. Therefore we regard this as the first step to tackle MI problems in the future.
Note that even when there is only one instance per bag, i.e. the data degenerates into mono-instance data, methods that treat instances as degenerated bags may be totally different from normal single-instance learners. There is some work in normal single-instance learning that already has such a perspective. Such methods can be shown to have some asymptotically optimal properties [Wang and Witten, 2002]. The setting of multiple instances per bag is not restricted to the classification case. It can be extended naturally to regression and clustering, which may be more commonly seen in practical problems. Therefore we strongly advocate the study of semi-MI problems in the MI domain.

One class label for a bag


Once we have fully explored the richness of semi-MI problems, we can consider MI problems where the instances' labels are not observable. This can be, for example, based on some assumptions that relate a bag's class label to the corresponding instances' class labels. The MI assumption has been adopted by many current (instance-based) MI methods, and the collective assumption is adopted in this thesis. Future work is likely to focus on the assumptions made. The study of assumptions can follow two directions: (1) the creation and formulation of new assumptions, and (2) the interpretation and assessment of existing assumptions.
Note that the categorization in our framework described in Chapter 2 is actually highly related to the assumptions. In instance-based methods, the underlying assumptions are purely related to the (unobservable) instances' class labels, while in metadata-based methods the assumptions are, partly or solely, associated with the attribute values of the instances. Note that if the assumptions are no longer associated with the instances' (latent) class labels (as in metadata-based methods' generative models), the problem is not related to either single-instance or semi-MI learning because, whether the instances' class labels exist or not, the bags' class labels are generated by some procedure irrelevant to the instances' labels. In the future, more assumptions can be created within this framework. Usually the domain knowledge gives rise to these assumptions, and the assumptions reside outside the data. The prediction could benefit from incorporating some forms of background knowledge. A common way to incorporate background knowledge is to formulate it mathematically in the model; thus we are typically interested in the exact formulation of the assumptions involved.
The second avenue of future work regarding assumptions is to assess and interpret existing assumptions, using both domain knowledge and data. Currently the assessment of the validity of the assumptions on a specific dataset is performed via prediction accuracy on the data. However, there is sometimes a dilemma. On the one hand, methods based on seemingly sound domain knowledge may not perform well on corresponding datasets. On the other hand, methods that perform well on practical datasets may be based on some assumptions whose interpretation in the corresponding domain is not straightforward. Therefore we need to acquire both strong background knowledge and modeling skills to fully understand some assumptions. Such efforts may lead to breakthroughs in the understanding of the domain and in the understanding of the learning algorithms.

Applications
We expect that multiple instance learning will keep attracting researchers, mainly due to its prospective applications in various areas. However, one of the biggest obstacles is the lack of fielded applications and publicly available datasets. More MI datasets and practical applications would stimulate research on real-world problems for MI learning. In fact, we observed that there are many datasets in which instances are grouped into bags while each instance has its own class label. Hence semi-MI learning may actually be more promising in terms of applications in the real world.
On the whole, we regard research in MI learning as still being in its early stages. Much work on algorithm development, property analysis and practical applications remains to be done.


Appendix A
Java Classes for MI Learning

We have developed a multiple instance package using the Java programming language for this project. A schematic description is shown in Figure A.1. This package is directly derived or modified from the corresponding code in the WEKA workbench [Witten and Frank, 1999] (as a matter of fact, we copied some of the source code from the WEKA files). We put the programs for the MI experiment environment and MI methods directly into the MI package, some artificial data generation procedures into the MI.data sub-package, and some data visualization tools into the MI.visualize sub-package. Although the documentation of the programs is self-explanatory, we briefly describe some of them in the following.

A.1 The MI Package


We first developed an experiment environment for MI learning, which includes evaluation tools, some interfaces and related classes. Some of the classes are:

MI.MIEvaluation: A bag-level evaluation environment for MI algorithms.

MI.Exemplar: The class for storing an exemplar, or a bag. Each exemplar has multiple instances but only one class label.


[Figure A.1: A schematic description of the MI package: the MI package itself (MI algorithms and experimental environment), with its sub-packages MI.data (MI data generation procedures) and MI.visualize (MI data visualization tools).]

MI.Exemplars: The class holding a set of exemplars.

MI.MIClassifier: An interface for any MI classifier that can provide a prediction given a test exemplar.

MI.MIDistributionClassifier: An interface for any MI classifier that can provide a class distribution given a test exemplar. It extends MI.MIClassifier.
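To illustrate how these types fit together, here is a minimal Java sketch; the method names below are our own illustrative assumptions and not necessarily the package's exact API (each interface would live in its own source file):

    // A hypothetical sketch of the MI package's core interfaces.
    interface MIClassifier {
        void buildClassifier(Exemplars train) throws Exception;   // learn from a set of bags
        double classifyExemplar(Exemplar bag) throws Exception;   // predict a bag's class label
    }

    interface MIDistributionClassifier extends MIClassifier {
        double[] distributionForExemplar(Exemplar bag) throws Exception;  // bag-level class distribution
    }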

All the MI methods developed in this thesis as well as some other MI methods are
implemented in the MI package. More precisely, they are:

MI.MIWrapper : The wrapper method described in Chapter 3.

MI.MILRGEOM : MILogisticRegressionGEOM described in Chapter 4.

MI.MILRARITH : MILogisticRegressionARITH described in Chapter 4.

MI.MIBoost : MI AdaBoost described in Chapter 4.

MI.TLD : TLD described in Chapter 5.

MI.TLDSimple : TLDSimple described in Chapter 5.

MI.DD : The Diverse Density algorithm with the noisy-or model [Maron,
1998] that looks for one target concept. Details are described in Chapter 7.

A.2 The MI.data Package


The sub-package MI.data includes the classes that generate the artificial datasets
used in this thesis:

MI.data.MIDataPopulation : This class uses the population version


of the generative models described in Chapter 3 (not used in this thesis).

MI.data.MIDataSample: This class implements the sample version of the generative models described in Chapter 3. The artificial data generated by this procedure is used in Chapters 3 and 4, with different formulations of the density of the feature variable $X$, $\Pr(X)$.

MI.data.TLDData: This class generates the data according to what the method TLD models. In particular, given the parameters provided by the user, it first generates the variance of a bag, $\sigma^2$, according to an Inverse-Gamma distribution parameterized by the user parameters, and the mean of that bag, $\mu$, based on a Gaussian distribution parameterized with the user parameters $m$, $w$ and the generated variance $\sigma^2$ (i.e. it generates $\mu$ from $N(m, w\sigma^2)$). Finally, it generates a bag of instances according to a Gaussian parameterized by $\mu$ and $\sigma^2$. The datasets generated by this class are used to show the estimation properties of TLD in Chapter 5.

MI.data.TLDSimpleData : This class generates the data that fits what


the method TLDSimple models. The details of the data generation process
and the generated data are shown in Chapter 5. The generated datasets are
used to show the estimation properties of TLDSimple.

MI.data.BagStats : This is a class written by Nils Weidmann, with


some functionality added by myself, to summarize bag information for a
dataset. It was used for the descriptions of the datasets in Chapter 6.

A.3 The MI.visualize Package


In the sub-package MI.visualize, we developed some visualization tools for MI datasets. The key class is MI.visualize.MIExplorer, in which we implemented a data plot and a distribution plot for MI datasets. The data plot is used to provide a 2D visualization of a dataset, with the "Bag ID" of each instance clearly specified. Some modifications of this plot were used in Chapters 3, 4 and 5 to illustrate the artificial datasets. The distribution plot tries to capture the distributional properties within each bag, if possible. Thus it draws a distribution per bag using a kernel density approach. This type of plot was not used in this thesis.


Appendix B
weka.core.Optimization

This appendix serves as documentation of the weka.core.Optimization class. Interested readers or users of this class may find the description here helpful to understand the algorithm. In brief, the strategy we used is a Quasi-Newton method based on projected gradients, which is primarily used to solve optimization problems subject to linear inequality constraints. The details of the method are described in the following.

First of all, let us introduce the Newton method to solve an unconstrained optimization problem. We shall convince ourselves, without a proof, that the following procedure can find at least a local minimum if the objective function is smooth. The rigorous proof can be found in various optimization books like [Chong and Zak, 1996], [Gill et al., 1981], [Dennis and Schnabel, 1983], etc.

1. Initialization. Set iteration counter $k = 0$, get initial variable values $x_0$, calculate the Jacobian (gradient) vector $g_0$ and the Hessian (second derivative) matrix $H_0$ using $x_0$.

2. Check $g_k$. If it converges to 0, then STOP.

3. Solve for the search direction $d_k$: $H_k d_k = -g_k$. Alternatively this step can be expressed as $d_k = -H_k^{-1} g_k$.

4. Take a step to get new variable values $x_{k+1}$ (this update can be derived by a Taylor series expansion to the second order around $x_k$). Normally, this is done as one Newton step $d_k$; however, when the variable values are far from the function minimum, one Newton step may not guarantee a decrease of the objective function even if $d_k$ is a descent direction. Thus we are looking for a multiplier $\lambda$ such that $\lambda = \arg\min_\lambda f(x_k + \lambda d_k)$, where $f(\cdot)$ is the objective function to be minimized. The search for $\lambda$ is carried out using a line search and once it is done, $x_{k+1} = x_k + \lambda d_k$.

5. Calculate the new gradient vector $g_{k+1}$ and the new Hessian matrix $H_{k+1}$ using $x_{k+1}$.

6. Set $k = k+1$. Go to 2.
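Putting these steps together, a minimal one-dimensional sketch (our own illustration, with a crude interval-halving backtracking search in place of the inexact line search discussed below) looks as follows:

    import java.util.function.DoubleUnaryOperator;

    public class NewtonSketch {
        // f is the objective, g its first derivative, h its second derivative.
        public static double minimize(DoubleUnaryOperator f, DoubleUnaryOperator g,
                                      DoubleUnaryOperator h, double x, double tol) {
            while (Math.abs(g.applyAsDouble(x)) > tol) {              // Step 2: converged?
                double d = -g.applyAsDouble(x) / h.applyAsDouble(x);  // Step 3: Newton direction
                double lambda = 1.0;                                  // Step 4: backtracking
                while (f.applyAsDouble(x + lambda * d) >= f.applyAsDouble(x)
                       && lambda > 1e-12) {
                    lambda /= 2.0;
                }
                x += lambda * d;                                      // take the damped step
            }                                                         // Steps 5-6: recompute, loop
            return x;
        }
    }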

As a convention, $d_k$ is often referred to as the "Newton direction" or simply "direction", $\lambda$ as the "step length", and $\Delta x = \lambda d_k$ as the "Newton step" or simply "step". We adopt these terms here.
Note that in Step 4 above, we search for the exact value of $\lambda$ that minimizes the objective function. Such a line search is called an "exact" line search, which is computationally expensive. It has been recommended that only a value of $\lambda$ that can lead to a "sufficient decrease" in the objective function is needed. In other words, an "inexact" line search is preferable in practice, and more computational resources should be put into searching for the values of $x$ instead of $\lambda$ [Dennis and Schnabel, 1983; Press et al., 1992]. Thus we use an inexact line search with backtracking and polynomial interpolations [Dennis and Schnabel, 1983; Press et al., 1992] in our project.
Now we carry on with Quasi-Newton methods. The idea of Quasi-Newton methods is not to use the Hessian matrix because it is expensive to evaluate and sometimes it is not even available. Instead a symmetric positive definite matrix, say $B$, is used to approximate the Hessian (or the inverse of the Hessian). As a matter of fact, it can be shown that no matter what matrix is used, as long as it is symmetric positive definite and an appropriate update of the matrix is taken in each iteration, the search result will be the same [Gill et al., 1981]! Thus the key issue is how to update this matrix $B$. One of the most famous methods is a variable metric method called the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which uses a rank-two modification of the old $B$. There are, of course, many other update methods, but it was reported that the BFGS method works better with an inexact line search. Hence this method is preferred in practice. In summary, the only difference between the Quasi-Newton method and the Newton method concerns $B$ and $H$. Hence the major modifications of the above algorithm concern Step 5. A Quasi-Newton algorithm using BFGS updates can be described as follows:

1. Initialization. Set iteration counter $k = 0$, get initial variable values $x_0$, calculate the Jacobian (gradient) vector $g_0$ using $x_0$, and initialize a symmetric positive definite matrix $B_0$.

2. Check $g_k$. If it converges to 0, then STOP.

3. Solve for the search direction $d_k$: $B_k d_k = -g_k$. Alternatively this step can be expressed as $d_k = -B_k^{-1} g_k$.

4. Search for $\lambda$ using a line search and set $x_{k+1} = x_k + \lambda d_k$.

5. Calculate the gradient vector $g_{k+1}$ using $x_{k+1}$. Set $\Delta x_k = x_{k+1} - x_k$ and $\Delta g_k = g_{k+1} - g_k$. The BFGS update is:

$$B_{k+1} = B_k + \frac{\Delta g_k \Delta g_k^T}{\Delta g_k^T \Delta x_k} - \frac{B_k \Delta x_k \Delta x_k^T B_k}{\Delta x_k^T B_k \Delta x_k}$$

6. Set $k = k+1$. Go to 2.
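Written out directly on a dense matrix, the rank-two update of Step 5 takes only a few lines; the following sketch is our own illustration and omits the factorized (Cholesky) bookkeeping described below:

    public class BFGSUpdate {
        // In-place BFGS update: B += dg dg^T / (dg^T dx) - (B dx)(B dx)^T / (dx^T B dx),
        // where dx = x_{k+1} - x_k and dg = g_{k+1} - g_k.
        public static void update(double[][] B, double[] dx, double[] dg) {
            int n = dx.length;
            double[] Bdx = new double[n];
            double dgTdx = 0.0, dxTBdx = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) Bdx[i] += B[i][j] * dx[j];
                dgTdx += dg[i] * dx[i];
            }
            for (int i = 0; i < n; i++) dxTBdx += dx[i] * Bdx[i];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    B[i][j] += dg[i] * dg[j] / dgTdx - Bdx[i] * Bdx[j] / dxTBdx;
        }
    }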
Before we move on to the optimization with bound constraints, there are some details to be elaborated here.

First of all, if we apply the Sherman-Morrison formula [Chong and Zak, 1996] to $B_{k+1}$ twice in Step 5, $B_{k+1}^{-1}$ is readily available and in the next iteration Step 3 is easy to carry out. Nevertheless, we will not adopt this strategy because of a minor but important practical issue involved.

Since the whole algorithm depends on the positive definiteness property of the matrix $B$ (otherwise the search direction will be wrong, and it can take much, much longer to find the right direction and the target!), it would be good to keep the positive definiteness during all iterations. But there are two cases where the update in Step 5 can result in a non-positive-definite $B$.

First, the hereditary positive-definiteness is theoretically guaranteed iff $\Delta g_k^T \Delta x_k > 0$, and this condition can be ensured in an exact line search [Gill et al., 1981]. When using an inexact line search, apart from the sufficient function decrease criterion, we should also impose this second condition on it. Thus we cannot use the line search in [Press et al., 1992]; instead we use the modified line search described in [Dennis and Schnabel, 1983] in Step 4.

Second, even if the hereditary positive-definiteness is theoretically guaranteed, the matrix $B$ can still lose positive-definiteness during the update due to rounding errors when the matrix is nearly singular, which is not uncommon in practice. Therefore we keep a Cholesky factorization of $B$ during the iterations: $B = LDL^T$, where $L$ is a lower triangle matrix and $D$ is a diagonal matrix. The positive definiteness of $B$ can be guaranteed if the diagonal elements of $D$ are all positive. If the resulting matrix after a low rank update is theoretically positive definite, there exist algorithms that avoid rounding errors during the updates and ensure that all the diagonal elements of $D$ are positive [Gill, Golub, Murray and Saunders, 1974; Goldfarb, 1976]. This factorized version of the BFGS updates is the reason why we do not use $B_k^{-1}$ in Step 3: with a Cholesky factorization, the equation $B_k d_k = -g_k$ can be solved in $O(N^2)$ time (where $N$ is the number of variables) using forward and backward substitution, and hence $B_k^{-1}$ is no longer needed. The reader might notice that the BFGS update formula in Step 5 is not convenient if the Cholesky factorization of $B_k$ is involved. However, using the fact that $B_k \Delta x_k = \lambda B_k d_k = -\lambda g_k$, we can simplify the formula to:

$$B_{k+1} = B_k + \frac{\Delta g_k \Delta g_k^T}{\Delta g_k^T \Delta x_k} + \frac{g_k g_k^T}{g_k^T d_k}$$
Note that this involves two rank-one modifications, with coefficients $\frac{1}{\Delta g_k^T \Delta x_k} > 0$ and $\frac{1}{g_k^T d_k} < 0$ respectively. Hence the first update is a positive rank-one update and the second one a negative rank-one update. There is a direct rank-two modification algorithm [Goldfarb, 1976], but for simplicity we implemented two rank-one modifications using the C1 algorithm in [Gill et al., 1974] for the former update and the C2 algorithm, also in [Gill et al., 1974], for the latter one. Note that all these algorithms have $O(N^2)$ complexity.

In summary, we use a factorized version of the Quasi-Newton method to avoid the rounding errors and preserve the positive-definiteness of $B$ during updates. Note that the total complexity of using the Cholesky factorization is $O(N^2)$. If we did not use it, the computational cost would still be $O(N^2)$ due to the matrix multiplication. Therefore there is hardly any additional expense for computing the Cholesky factorization.


Finally, we reach the topic of optimization subject to bound constraints. We adopt basically the same strategy as described in [Gill and Murray, 1976] and [Gill et al., 1981]. It is fairly similar to the above unconstrained optimization method.

First we consider optimization subject to linear equality constraints $Ax = b$. It is an easy problem because it can actually be cast as an unconstrained optimization problem with a reduced dimensionality. A common method to solve this problem is the projected gradient method, in which the above Quasi-Newton method with BFGS updates remains virtually unchanged. We simply replace the gradient vector $g$ and the matrix $B$ by projected versions $Zg$ and $Z^T B Z$ respectively, where $Z$ is a projection matrix. There are various methods to calculate $Z$, and usually the orthogonal projection of $A$ (in the constraints) is taken [Chong and Zak, 1996; Gill et al., 1981]. Particularly if the constraints are bound constraints, $Z$ is typically easy to calculate because some variables become constants and do not affect the objective function any more. The projection matrix $Z$ is thus simply a vector with entries of 1 for free (i.e. not in the constraints) variables and 0s otherwise.
Next let us go further into the problem of optimization subject to linear inequality constraints $Ax \ge b$. There are several options to solve this kind of problem, but we are interested in the one(s) that does not allow variables to take values over the bounds, because in our case the objective function is not defined there. Hence we use the "active set" method, which has this essential feature [Gill et al., 1981]. The idea of the active set method is to check the search step in each iteration such that, if some variables are about to violate the constraints, these constraints become the "active set" of constraints. Then, in later iterations, the projected gradient and the projected Hessian (or $B$ in the Quasi-Newton case) corresponding to the inactive constraints are used to perform further steps. We will not dig deeply into this method because in our case (i.e. for bound constraints), the task is especially easy. In each iteration in the above Quasi-Newton method with BFGS updates, we simply test whether a search step can cause a variable to go beyond the corresponding bound. If this occurs, we fix this variable, i.e. treat it as a constant in later iterations, and use only the free variables to carry out the search. Thus the main modification in the above algorithm is in the line search in Step 4. We should use an upper bound for $\lambda$ over all possible variables. The upper bound is $\min_i\left(\frac{b_i - x_i}{d_i}\right)$ (where $b_i$ is the bound constraint for the $i$th variable $x_i$) if the direction $d_i$ of $x_i$ is an infeasible direction (i.e. if $d_i < 0$). This means that we always calculate the maximal step length that does not violate any of the inactive constraints and set this as the upper bound for the trial $\lambda$. Therefore this line search is called a "safeguarded" line search. This method can be readily extended to the case of two-sided bound constraints, i.e. $l \le x \le u$, which is now in the implementation in WEKA.
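As an illustration, the upper bound on $\lambda$ for the two-sided case can be computed as in the following sketch (our own code, not the WEKA implementation):

    public class StepBound {
        // Largest step length lambda for which x + lambda * d stays within [l, u].
        public static double maxStep(double[] x, double[] d, double[] l, double[] u) {
            double bound = Double.POSITIVE_INFINITY;
            for (int i = 0; i < x.length; i++) {
                if (d[i] < 0) {                        // moving towards the lower bound
                    bound = Math.min(bound, (l[i] - x[i]) / d[i]);
                } else if (d[i] > 0) {                 // moving towards the upper bound
                    bound = Math.min(bound, (u[i] - x[i]) / d[i]);
                }
            }
            return bound;
        }
    }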


Last but not least, there is a natural question to be asked regarding our method: "will any fixed variables be released some time? If so, when and how?". The answer is certainly yes. In our strategy, we only check the possibility of releasing fixed variables when the convergence of the gradient is detected. At that moment, we verify both the first and second order estimates of the Lagrange multipliers of all the fixed variables (where the function implementing the second derivatives is provided by the user). If they are consistent with each other, we regard the second order estimate as a valid one and check whether it is negative. The negativity of a valid Lagrange multiplier indicates non-optimality, hence the corresponding variables can be made free. If any fixed variables are to be released, then we project the gradient and the Hessian back to the corresponding higher dimensions, i.e. we update the corresponding entries in $g$ and $B$ (basically setting them to the initial state for these originally fixed variables). Nonetheless, if the user does not provide the second derivative (which is often the case, since one of the reasons why people use the Quasi-Newton method is that they do not need to provide the Hessian matrix), we only use the first order estimate of the Lagrange multiplier.
The above is a description of what we have done for the optimization subject to bound constraints. For completeness, we write down the final algorithm in the following, although it is basically just a repetition of the above description. Note we use the superscript FREE to indicate a projected vector or matrix below (i.e. they only have entries corresponding to the free variables).

1. Initialization. Set iteration counter $k = 0$, get initial variable values $x_0$, calculate the Jacobian (gradient) vector $g_0$ using $x_0$ and compute the Cholesky factorization of a symmetric positive definite matrix $B_0$ using a lower triangle unit matrix $L_0$ and a diagonal matrix $D_0$.

2. Check $g_k$. If it converges to 0, then test whether any fixed (or bound) variables can be released from their constraints, using both first and second order estimates of the Lagrange multipliers.

3. If no variable can be released, then STOP; otherwise release the variables and add corresponding entries in $x_k^{FREE}$ (set to the bound values), $g_k^{FREE}$ (set to the gradient values at the bound values), $L_k^{FREE}$, and $D_k^{FREE}$ (if the $j$th variable is to be released, we set $l_{jj}$ and $d_{jj}$ to 1 and the other entries in the $j$th row/column to 0).

4. Solve for the search direction $d_k^{FREE}$ using backward and forward substitution in $L_k^{FREE} D_k^{FREE} (L_k^{FREE})^T d_k^{FREE} = -g_k^{FREE}$.

5. Cast an upper bound on $\lambda$ and search for the best value of $\lambda$ along the direction $d_k^{FREE}$ using a safeguarded inexact line search with backtracking and polynomial interpolation. Set $x_{k+1}^{FREE} = x_k^{FREE} + \lambda d_k^{FREE}$.

6. If any variable is fixed at its bound constraint, delete its corresponding entries in $x_{k+1}^{FREE}$, $g_k^{FREE}$, $L_k^{FREE}$, and $D_k^{FREE}$.

7. Calculate the gradient vector $g_{k+1}^{FREE}$ using $x_{k+1}^{FREE}$. Set $\Delta x_k^{FREE} = x_{k+1}^{FREE} - x_k^{FREE}$ and $\Delta g_k^{FREE} = g_{k+1}^{FREE} - g_k^{FREE}$. Then the update is:

$$L_{k+1}^{FREE} D_{k+1}^{FREE} (L_{k+1}^{FREE})^T = L_k^{FREE} D_k^{FREE} (L_k^{FREE})^T + \frac{\Delta g_k^{FREE} (\Delta g_k^{FREE})^T}{(\Delta g_k^{FREE})^T \Delta x_k^{FREE}} + \frac{g_k^{FREE} (g_k^{FREE})^T}{(g_k^{FREE})^T d_k^{FREE}}$$

We use the aforementioned C1 and C2 algorithms to perform the updates.

8. Set $k = k+1$. Go to 2.


Appendix C
Fun with Integrals

This Appendix is about the detailed derivation of the final equations for both TLD
and TLDSimple in Chapter 5.

C.1 Integration in TLD


As discussed based on Equations 5.3, 5.4 and the second line of 5.6 in Chapter 5,

$$B_{jk} = \int_0^{+\infty}\!\!\int_{-\infty}^{+\infty} \left\{ (2\pi\sigma_k^2)^{-n_j/2} \exp\left[-\frac{S_{jk}^2 + n_j(\bar{x}_{jk}-\mu_k)^2}{2\sigma_k^2}\right] \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+3}{2}}}{2^{\frac{b_k+1}{2}}\sqrt{\pi w_k}\,\Gamma(b_k/2)} \exp\left[-\frac{a_k + \frac{(\mu_k - m_k)^2}{w_k}}{2\sigma_k^2}\right] \right\} d\mu_k\, d\sigma_k^2.$$

Re-arranging it, we get

$$B_{jk} = \int_0^{+\infty}\!\!\int_{-\infty}^{+\infty} \left\{ \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+n_j+3}{2}} \exp\left(-\frac{a_k}{2\sigma_k^2}\right)}{\pi^{n_j/2}\, 2^{\frac{b_k+n_j}{2}}\, \sqrt{2\pi w_k}\,\Gamma(b_k/2)} \exp\left(-\frac{1}{2 w_k \sigma_k^2}\left[w_k S_{jk}^2 + n_j w_k (\bar{x}_{jk}-\mu_k)^2 + (\mu_k-m_k)^2\right]\right) \right\} d\mu_k\, d\sigma_k^2.$$

Since

$$w_k S_{jk}^2 + n_j w_k(\bar{x}_{jk}-\mu_k)^2 + (\mu_k-m_k)^2 = (1+n_j w_k)\left[\mu_k - \frac{n_j w_k \bar{x}_{jk} + m_k}{1+n_j w_k}\right]^2 + \frac{w_k n_j (\bar{x}_{jk}-m_k)^2 + w_k S_{jk}^2 (1+n_j w_k)}{1+n_j w_k},$$

we can further re-arrange the above equation as

$$B_{jk} = \int_0^{+\infty}\!\!\int_{-\infty}^{+\infty} \left\{ \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+n_j+3}{2}} \exp\left(-\frac{a_k}{2\sigma_k^2}\right)}{\pi^{n_j/2}\, 2^{\frac{b_k+n_j}{2}}\, \sqrt{2\pi w_k}\,\Gamma(b_k/2)} \exp\left(-\frac{1}{2\sigma_k^2}\,\frac{n_j(\bar{x}_{jk}-m_k)^2 + S_{jk}^2(1+n_j w_k)}{1+n_j w_k}\right) \exp\left(-\frac{(\mu_k - M_k)^2}{2V_k}\right) \right\} d\mu_k\, d\sigma_k^2$$

where $M_k = \frac{n_j w_k \bar{x}_{jk} + m_k}{1+n_j w_k}$ and $V_k = \frac{w_k \sigma_k^2}{1+n_j w_k}$. Using the identity

$$\int_{-\infty}^{+\infty} (2\pi V_k)^{-\frac{1}{2}} \exp\left[-\frac{(\mu_k - M_k)^2}{2V_k}\right] d\mu_k = 1,$$

we integrate out $\mu_k$:

$$B_{jk} = \int_0^{+\infty} \left\{ \frac{a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+n_j+3}{2}} \exp\left(-\frac{a_k}{2\sigma_k^2}\right)}{\pi^{n_j/2}\, 2^{\frac{b_k+n_j}{2}}\, \sqrt{2\pi w_k}\,\Gamma(b_k/2)} \sqrt{\frac{2\pi w_k \sigma_k^2}{1+n_j w_k}}\, \exp\left(-\frac{1}{2\sigma_k^2}\,\frac{n_j(\bar{x}_{jk}-m_k)^2 + S_{jk}^2(1+n_j w_k)}{1+n_j w_k}\right) \right\} d\sigma_k^2.$$

Now we set $y = \frac{\kappa a_k}{2\sigma_k^2}$ and re-arrange again:

$$B_{jk} = \int_0^{+\infty} \left\{ \frac{\pi^{-n_j/2}\, a_k^{b_k/2}\,(\sigma_k^2)^{-\frac{b_k+n_j+2}{2}}}{2^{\frac{b_k+n_j}{2}}\sqrt{1+n_j w_k}\,\Gamma(b_k/2)}\, \exp(-y) \right\} d\sigma_k^2$$

where $\kappa = \left[(1+n_j w_k)(a_k + S_{jk}^2) + n_j(\bar{x}_{jk}-m_k)^2\right] \big/ \left[a_k(1+n_j w_k)\right]$. Because $\sigma_k^2 = \frac{\kappa a_k}{2y} \Rightarrow d\sigma_k^2 = -\frac{\kappa a_k}{2}\, y^{-2}\, dy$, we set $\alpha = \frac{b_k+n_j}{2}$ and get

$$B_{jk} = \int_{+\infty}^{0} \left\{ -\pi^{-n_j/2}\,\kappa^{-\alpha}\, \frac{a_k^{-n_j/2}}{\sqrt{1+n_j w_k}\,\Gamma(b_k/2)}\, y^{\alpha-1} \exp(-y) \right\} dy = \int_0^{+\infty} \left\{ \pi^{-n_j/2}\,\kappa^{-\alpha}\, \frac{a_k^{-n_j/2}}{\sqrt{1+n_j w_k}\,\Gamma(b_k/2)}\, y^{\alpha-1} \exp(-y) \right\} dy.$$

Since $\Gamma(\alpha) = \int_0^{+\infty} e^{-t}\, t^{\alpha-1}\, dt$, substituting with $t = y$, we get the well-known identity $\Gamma(\alpha) = \int_0^{+\infty} e^{-y}\, y^{\alpha-1}\, dy$ [Artin, 1964]. Hence the solution becomes:

$$B_{jk} = \pi^{-n_j/2}\,\kappa^{-\alpha}\,a_k^{-n_j/2}\,\frac{\Gamma(\alpha)}{\sqrt{1+n_j w_k}\,\Gamma\!\left(\frac{b_k}{2}\right)} = \frac{a_k^{b_k/2}\,(1+n_j w_k)^{(b_k+n_j-1)/2}\,\Gamma\!\left(\frac{b_k+n_j}{2}\right)}{\left[(1+n_j w_k)(a_k+S_{jk}^2) + n_j(\bar{x}_{jk}-m_k)^2\right]^{\frac{b_k+n_j}{2}}\,\pi^{n_j/2}\,\Gamma\!\left(\frac{b_k}{2}\right)}$$

This is the formula we got in Equation 5.6.

C.2 Integration in TLDSimple


In TLDSimple, we regard $\sigma_k^2$ as fixed and estimate it directly from the data. Therefore in Equation 5.2 we plug in a Gaussian-Gaussian formulation, that is, the $\bar{x}_{jk}$ of each bag has the sampling distribution of a Gaussian, $\bar{x}_{jk} \sim N(\mu_k, \frac{\sigma_k^2}{n_j})$ (according to the Central Limit Theorem), and $\mu_k$ further follows a Gaussian parameterized by $m_k$ and $w_k$, $N(m_k, w_k)$. $\sigma_k^2$ is now fixed. Hence Equation 5.6 now becomes:

$$B_{jk} = \int_{-\infty}^{+\infty} \left\{ \frac{1}{\sqrt{2\pi\frac{\sigma_k^2}{n_j}}} \exp\left[-\frac{(\bar{x}_{jk}-\mu_k)^2}{2\left(\frac{\sigma_k^2}{n_j}\right)}\right] \frac{1}{\sqrt{2\pi w_k}} \exp\left[-\frac{(\mu_k-m_k)^2}{2 w_k}\right] \right\} d\mu_k.$$

Re-arranging it, we have

$$B_{jk} = \left[2\pi\,\frac{w_k n_j + \sigma_k^2}{n_j}\right]^{-1/2} \exp\left[-\frac{n_j(\bar{x}_{jk}-m_k)^2}{2(w_k n_j + \sigma_k^2)}\right] \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi V_k}} \exp\left[-\frac{(\mu_k - M_k)^2}{2V_k}\right] d\mu_k$$


where $M_k = \frac{n_j w_k \bar{x}_{jk} + m_k \sigma_k^2}{\sigma_k^2 + n_j w_k}$ and $V_k = \frac{w_k \sigma_k^2}{\sigma_k^2 + n_j w_k}$. With the identity

$$\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi V_k}} \exp\left[-\frac{(\mu_k - M_k)^2}{2V_k}\right] d\mu_k = 1,$$

we integrate out $\mu_k$ and get




$$B_{jk} = \left[2\pi\,\frac{w_k n_j + \sigma_k^2}{n_j}\right]^{-1/2} \exp\left[-\frac{n_j(\bar{x}_{jk}-m_k)^2}{2(w_k n_j + \sigma_k^2)}\right]$$

This is Equation 5.9 from Chapter 5.
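Since Equation 5.9 involves nothing beyond elementary functions, it can be evaluated directly; the following Java sketch (our own illustration) computes $B_{jk}$ for one bag and one dimension, given the bag mean $\bar{x}_{jk}$, the bag size $n_j$, the fixed variance $\sigma_k^2$ and the hyper-parameters $m_k$ and $w_k$:

    public class TLDSimpleDensity {
        // B_jk = [2 pi (w n + sigma^2) / n]^{-1/2} * exp(-n (xbar - m)^2 / (2 (w n + sigma^2)))
        public static double bjk(double xbar, int n, double sigma2, double m, double w) {
            double s = w * n + sigma2;
            return Math.exp(-n * (xbar - m) * (xbar - m) / (2.0 * s))
                   / Math.sqrt(2.0 * Math.PI * s / n);
        }
    }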


Appendix D
Comments on EM-DD

EM-DD [Zhang and Goldman, 2002] was proposed to overcome the difficulty of the optimization problem required to find the maximum likelihood estimate (MLE) of the instance-level class probability parameters in Diverse Density (DD) [Maron, 1998]. Note that there are two approximate generative models proposed in DD to construct the bag-level class probability from the instance-level ones: the noisy-or and the most-likely-cause model. In the noisy-or model, there is no difficulty in optimizing the objective (log-likelihood) function, while in the most-likely-cause model, the objective function is not differentiable because of the maximum functions involved. DD used the softmax function to approximate the maximum function in order to facilitate gradient-based optimization, which is a standard way to solve non-differentiable optimization problems. EM-DD, on the other hand, claims to use an application of the EM algorithm [Dempster et al., 1977] to circumvent the difficulty. Therefore EM-DD provides no improvement on the modeling process in DD, only on the optimization process. Since DD uses the standard way to treat non-differentiable optimization problems [Lemarechal, 1989] and was supposed to find the MLE, why would EM-DD be such an improvement? Besides, we have not found any case in the literature where EM can simplify a non-differentiable log-likelihood function in conjunction with a gradient-based optimization method (i.e. a Newton-type method) in the M-step. Can the standard EM algorithm be applied to a non-differentiable log-likelihood function? These questions lead us to be skeptical about the validity of EM-DD.


In the following we first analyze the log-likelihood function that EM-DD aims to
maximize, then we present the EM-DD algorithm and an illustrative example to see
whether it can work on this function. Finally we point out a mistake in the proof
for EM-DD. We can also show that in general the monotonicity of EM-DD cannot
be proved, thus theoretically EM-DD is not guaranteed to work.

D.1 The Log-likelihood Function


EM-DD is based on the log-likelihood function constructed with the most-likely-cause model (Equation 7.3 in Chapter 7):

$$L = \sum_B \left[ Y \log\left(\max_{b \in m}\{\Pr(Y=1 \mid x_b)\}\right) + (1-Y)\log\left(1-\max_{b \in m}\{\Pr(Y=1 \mid x_b)\}\right) \right] = \sum_B \left[ Y \log\left(\max_{b \in m}\Pr(Y=1 \mid x_b)\right) + (1-Y)\log\left(\min_{b \in m}\Pr(Y=0 \mid x_b)\right) \right] \qquad (D.1)$$

The parameter vector $\theta$ determines our estimate of $\Pr(Y=1 \mid x_b)$, and we seek the value of $\theta$ that maximizes $L$, i.e. $\theta_{MLE}$. Given a certain value of $\theta$, we select one instance from each bag (the most-likely-cause instance) to construct the log-likelihood. In the process of searching for $\theta_{MLE}$, when the parameter value changes, the most-likely-cause instance to be selected may also change. Thus, the log-likelihood function may suddenly change forms when the parameter value changes. More specifically, if we arbitrarily pick one instance from each bag and construct the log-likelihood function, we have one possible log-likelihood function; we call it one "component function". If we change the instance in one bag, we obtain another, different (unless the changed instance is identical to the old one) component function. Obviously, if there are $s_1$ instances in the 1st bag, $s_2$ instances in the 2nd bag, ..., $s_{m+n}$ instances in the $(m+n)$th bag, then there are $s_1 \times s_2 \times \cdots \times s_{m+n}$ component functions available (where $m$ and $n$ are the number of positive and negative bags respectively).

[Figure D.1: A possible component function in one dimension.]

[Figure D.2: An illustrative example of the log-likelihood function in DD using the most-likely-cause model.]

And the true log-likelihood function is constructed using some of these component functions (note that not all of the component functions are used, because some instances from different bags will never be picked simultaneously for any value of $\theta$). When the parameter value falls into one range in the domain of $\theta$, the log-likelihood function is in the form of a certain component function. And if it falls into another range, then it becomes another component function. Although the true log-likelihood function changes its form for different domains of $\theta$, it is continuous at the point where the form of the function changes from one component function to another, because of the Euclidean distance (L2 norm) used in the radial (or Gaussian-like) formulation of $\Pr(Y \mid X)$; but it is no longer differentiable at that point.


In Figure D.1 we show the shape of a part of one possible component function in one dimension, i.e. we only have one attribute and fix the value of the "scaling" parameter in DD. Thus the only variable here is the "point" parameter. Note that there are three local maxima in this function, and the function is not continuous because the log-likelihood is undefined in two locations (actually the parameter value in any such location is equal to the attribute value of an instance in a negative bag). This is due to the radial formulation of $\Pr(Y \mid X)$ in DD. The usual way in both DD and


EM-DD to tackle this problem is to search for the parameter values using (different) multiple starts, in the hope that one start point can lead to the global maximum of the function. From now on, we assume that we can always find the (global) maximum for each component function, perhaps using multiple starts. Regardless of the local maxima, the shape of a component function is roughly quadratic. This is reasonable because at least for the parts of the function constructed using only the positive instances (in this appendix we call an instance in a positive bag a "positive" instance and an instance in a negative bag a "negative" instance), it is exactly quadratic. The negative instances only make the function discontinuous, as shown in the function's "shoulders" in Figure D.1 (actually the figure does not show the discontinuity of the function clearly: there should be no minimum on the shoulders; instead the function value goes to $-\infty$ there).

If we ignore the (small) local maxima in each of the component functions, we can illustrate (a part of) the true likelihood function using curves like those in Figure D.2. Note that this figure does not rigorously describe the situation of maximizing the log-likelihood in Equation D.1; it only serves as an illustration. However, it does give us an idea when EM-DD can work and when it cannot. Although some details may not be accurate (like the coordinates or the exact shape of the component functions), these factors do not affect the key ideas of the illustration.

There are three component functions in the plot, denoted by 1, 2 and 3. The dotted lines are the parts of the component functions not used in the true log-likelihood function. The solid line plots the true function. Again we only plot, in one dimension, the point parameter against the log-likelihood. The shapes of the component functions are different because we also incorporate a fixed value of the scaling parameter for each one of the component functions (i.e. different component functions have different scaling parameter values). Therefore, we simplify the optimization procedure as a "steepest descent" method, where we first fix the scaling parameter and search for the best point parameter, then fix the point parameter and search for the optimal scaling parameter. In Figure D.2, we do not show how to search for the value of the scaling parameter. We assume that we are given the


optimal scaling parameter value every time we start searching for the point parameter. We also assume that the steepest descent procedure does equally well as the Quasi-Newton method used in DD and EM-DD on this simple problem. We need to make all these assumptions to simplify the situation and enable us to see the essence of this (extremely complicated) optimization problem.

Note that the true function value can be smaller than the values of the component functions. To see why, let us look back at Equation D.1. Given a fixed parameter $\theta_f$, picking the instance with the maximal value of $\Pr(Y=1 \mid X, \theta_f)$ in a positive bag must result in a greater log-likelihood value than picking another instance in the same bag. On the contrary, selecting the instance with the maximal value of $\Pr(Y=1 \mid X, \theta_f)$ (i.e. the minimal value of $\Pr(Y=0 \mid X, \theta_f)$) in a negative bag has to result in a smaller log-likelihood value than selecting other instances in the same bag. Therefore the true log-likelihood function is often not the one with the greatest value among all the component functions, as illustrated in Figure D.2, as long as there is more than one instance in each negative bag.
In this example, the true function shifts between the components twice; the shifting points are indicated by small triangles and labeled "X" and "Y" respectively. At both points the log-likelihood function is still increasing, but the newly-shifted component function's value is less than the value of the old component function. Shifting between components means that we change the instances in each bag that construct the log-likelihood function: in the new component function, a new set of instances, one from each bag, is picked. Obviously this true function is non-differentiable at the two points where it shifts components, but it has a local maximum at point D.

D.2 The EM-DD Algorithm


Now we briefly sketch the EM-DD algorithm [Zhang and Goldman, 2002]. EM-DD starts with initial values of $\theta$, say $\theta_0$, and the initial log-likelihood

$$L_0 = \sum_{1 \le i \le n} \log\big[\max_b \{Pr(Y=1|x_{ib}; \theta_0)\}\big] + \sum_{1 \le j \le m} \log\big[1 - \max_b \{Pr(Y=1|x_{jb}; \theta_0)\}\big],$$

where there are $n$ positive bags and $m$ negative bags, and $\max_b$ ranges over the instances of the corresponding bag. Then it cycles between the E-step and the M-step as follows until convergence (suppose this is the $p$th iteration):

E-step: find the instance in each bag that has the greatest $Pr(Y|X; \theta_p)$, say the $z_i$th instance in the $i$th bag.

M-step: search for

$$\theta_{p+1} = \arg\max_\theta \Big\{ \sum_{1 \le i \le n} \log\big[Pr(Y=1|x_{iz_i}; \theta)\big] + \sum_{1 \le j \le m} \log\big[1 - Pr(Y=1|x_{jz_j}; \theta)\big] \Big\},$$

and then compute

$$L_{p+1} = \sum_{1 \le i \le n} \log\big[\max_b \{Pr(Y=1|x_{ib}; \theta_{p+1})\}\big] + \sum_{1 \le j \le m} \log\big[1 - \max_b \{Pr(Y=1|x_{jb}; \theta_{p+1})\}\big].$$

The convergence test is performed before each E-step (or, equivalently, after each M-step) by comparing the values of the log-likelihood function $L_p$ and $L_{p+1}$.
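To make the control flow concrete, the following is a minimal sketch of this loop, assuming a generic instance-level model prob(x, theta) standing in for $Pr(Y=1|x;\theta)$ and SciPy's minimize standing in for the Quasi-Newton search of the M-step; all names and defaults are illustrative, not from the EM-DD implementation.

import numpy as np
from scipy.optimize import minimize

def em_dd(pos_bags, neg_bags, prob, theta0, max_iter=100):
    # pos_bags, neg_bags: lists of 2-D arrays (one row per instance);
    # prob(x, theta): instance-level model for Pr(Y=1|x; theta).
    def true_loglik(theta):
        # Equation D.1: the max is taken over the instances in each bag.
        return (sum(np.log(max(prob(x, theta) for x in bag)) for bag in pos_bags)
              + sum(np.log(1.0 - max(prob(x, theta) for x in bag)) for bag in neg_bags))

    theta, L = np.asarray(theta0, dtype=float), true_loglik(theta0)
    for _ in range(max_iter):
        # E-step: pick the instance with the greatest Pr(Y=1|x; theta) in each bag.
        zp = [max(bag, key=lambda x: prob(x, theta)) for bag in pos_bags]
        zn = [max(bag, key=lambda x: prob(x, theta)) for bag in neg_bags]
        # M-step: maximize the component function fixed by this selection.
        def neg_component(t):
            return -(sum(np.log(prob(x, t)) for x in zp)
                   + sum(np.log(1.0 - prob(x, t)) for x in zn))
        theta_new = minimize(neg_component, theta).x
        L_new = true_loglik(theta_new)   # drops back to the true function
        if L_new <= L:                   # the convergence test criticized below
            break                        # keeps the previous theta
        theta, L = theta_new, L_new
    return theta, L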
To give an illustration, we first apply the above algorithm to the artificial example shown in Figure D.2 and see what it finds given A as the start value for $\theta$. This example is deliberately set up so that we may see both a case in which EM-DD works (the first iteration) and one in which it does not (the second iteration). As mentioned before, we assume the search procedure used in the M-step is steepest descent, a member of the gradient descent family. Although EM-DD uses a Quasi-Newton method, this does not matter much in this simple situation. Note that the M-step in EM-DD corresponds to searching for the maximum of one component function, because the instance selected in each bag is fixed, which means the M-step often operates above the true log-likelihood function.
In the E-step of the first iteration, EM-DD selects the set of instances as required. That is to say, it finds the correct component function: the curve drawn with the solid line. Then it calculates $L_0$ as the function value at point A. In the M-step it finds point B as the maximum of Component 3 (assuming it also finds the optimal scaling parameters in the M-step, which determine the shape of the next component in Figure D.2). When it computes the log-likelihood $L_1$, it necessarily returns to the true log-likelihood function. In other words, it picks up the new set of instances according to $\theta_1$ (the $x$-axis coordinate of B/B'). As a result, $L_1$ is the value of the log-likelihood function at point B'.
In the second iteration, EM-DD first compares $L_1$ and $L_0$, i.e. the function values at B' and A. In this case it indeed finds a better value of the parameter, so it continues. In the E-step it picks an instance in each bag according to $\theta_1$. The corresponding new component function is Component 1. In the M-step it finds the maximum of Component 1 at point C. Then it calculates $L_2$ as the true function value, which is actually point C' on Component 2.
In the third iteration, EM-DD first compares $L_2$ and $L_1$, i.e. the function values at B' and C'. This time it finds that it cannot increase the function value, so the algorithm stops and point B' is returned as the solution; that is, EM-DD is not able to find point D, which is the true maximum of the log-likelihood function. If it had kept searching, it would have found D. Nonetheless, to do this it would have to violate the convergence test, which is a crucial part of the proof of convergence for EM-DD [Zhang and Goldman, 2002]. Without this convergence test, EM-DD is not an application of EM but simply a stochastic search method for a combinatorial optimization problem.
Therefore, even if we assume that the global maximum of each component function can be found, EM-DD cannot find the maximum of the log-likelihood function. Note that we have simplified this example a lot: the objective function is concave and one-dimensional in this case. In reality there are many more dimensions (typically EM-DD searches for 2x166 parameters simultaneously on the Musk datasets), so the situation is much more complicated than in the above example. For instance, saddle points can occur, which is a case that EM cannot deal with anyway [McLachlan and Krishnan, 1996].
Note that in the above example we required EM-DD to start from point A. With a different start point for the search for $\theta$, it may find the maximum point D. Indeed, we observed that EM-DD depends heavily on multiple start points, not only for finding the global maximum but also for improving the chances of finding even a local maximum. In other words, it relies on good luck rather than on strong theoretical justifications.

D.3 Theoretical Considerations


In spite of the above counter-example, we cannot yet conclude that EM-DD is not a valid algorithm, because it was proved in [Zhang and Goldman, 2002] that the algorithm converges to a local maximum. This proof is analogous to that of the EM algorithm. However, it turns out that the most important part of the proof is missing.
The Expectation-Maximization (EM) algorithm [Dempster et al., 1977] was proposed to facilitate the maximum likelihood method when unobservable data is involved in the log-likelihood function. It is discussed in detail in a variety of articles and books [Bilmes, 1997; McLachlan and Krishnan, 1996]. Since the M-step necessarily increases the log-likelihood function, the key to proving the monotonicity of the EM algorithm is to prove that the E-step also increases the log-likelihood. The property of an increase in the log-likelihood function in the E-step is a consequence of Jensen's inequality and the concavity of the logarithmic function [Dempster et al., 1977; McLachlan and Krishnan, 1996]. That is why EM can also be viewed as a "Maximization-Maximization" procedure [McLachlan and Krishnan, 1996; Hastie et al., 2001]. Nonetheless, [Zhang and Goldman, 2002] does not provide any proof of the increase of the log-likelihood function (in Equation D.1) in the E-step. Instead it uses the convergence test (terminating the algorithm if $L_p \ge L_{p+1}$) to prevent the log-likelihood from decreasing. Note that in standard EM, since the E-step also increases the log-likelihood, the convergence test only tests whether $L_p = L_{p+1}$, involving no ">" sign.

The reason why the proof used in the standard EM algorithm does not apply to EM-DD lies in the special property of the unobservable data. In EM-DD, the unobservable data is $z_i$, an index for the $i$th bag that indicates which instance should be used in the log-likelihood function. This variable $z_i$ is not quite "unobservable" in this case, because for each value of $\theta$ it is fixed (i.e. no probability distribution is needed) and observable in the data, although for different values of $\theta$ its value changes. Therefore, if one insists on regarding it as a latent variable in EM, then given a certain parameter value $\theta_p$, the probability of $z_i$ is

$$Pr(z_i | \theta_p) = \begin{cases} 1 & \text{if } z_i = \arg\max_z Pr(Y=1|x_{iz}; \theta_p) \\ 0 & \text{otherwise} \end{cases}$$

This probability function is very unusual: it still involves a max function, which is not smooth, and thus the proof for EM cannot be applied to the expected log-likelihood function in the E-step of EM-DD.
As a matter of fact, as shown in Section D.2, in the E-step of EM-DD the log-likelihood is very likely to decrease, in which case the algorithm has to stop. Note that in Equation D.1, in a negative bag, $1 - \max_b\{Pr(Y=1|x_b)\}$ is equivalent to $\min_b\{1 - Pr(Y=1|x_b)\} = \min_b\{Pr(Y=0|x_b)\}$. Hence in the E-step, changing any of the negative instances used to construct a new log-likelihood function will always decrease the log-likelihood. In the extreme, if in one E-step the positive instances involved in the current log-likelihood function remain unchanged but some negative instances are changed, then the new log-likelihood is guaranteed to be lower than the current one. In that case it may be premature to halt the algorithm, as shown in the example in Section D.2. Therefore, unlike for EM, the monotonicity of the E-step in EM-DD cannot be proved in general, which is the major theoretical flaw in EM-DD.
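As a tiny numeric check of this point, consider a hypothetical one-dimensional setting (the Gaussian-like model and all numbers below are illustrative, not taken from the thesis experiments):

import numpy as np

def prob(x, theta):                     # Pr(Y=1|x; theta), Gaussian-like
    return np.exp(-(x - theta) ** 2)

bag = [0.0, 1.0]                        # one negative bag with two instances
theta = 0.6

# Component values: the bag's contribution if we commit to one instance.
comp = [np.log(1.0 - prob(x, theta)) for x in bag]    # approx. [-1.20, -1.91]

# The true contribution (Equation D.1) uses the max over the bag, i.e. the
# smallest value of 1 - Pr, so it equals the minimum of the component values.
true_val = np.log(1.0 - max(prob(x, theta) for x in bag))
assert np.isclose(true_val, min(comp))
# Hence, whenever an E-step switches the selected instance in a negative bag,
# that bag's contribution to the log-likelihood can only go down.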
The fact that the log-likelihood fails to increase after the first several iterations of EM-DD [Zhang and Goldman, 2002] is probably due to a decrease in the E-step. Moreover, it was also observed that it is often beneficial to allow NLDD (the negative log-likelihood of Diverse Density) to increase slightly [Zhang and Goldman, 2002]. We believe this is not solely because of local maxima (or minima for the negative log-likelihood): it may also allow the algorithm to keep searching where it would otherwise fail. Indeed, without the convergence requirement of EM (which EM-DD cannot meet), we can develop an algorithm that is guaranteed to find the solution of the MLE: we simply search for the global maximum in each of the component functions, either in a systematic (say, branch-and-bound) or stochastic manner, and pick the parameter with the highest true log-likelihood, as sketched below. This amounts to searching all the component functions involved in the log-likelihood function. However, this has nothing to do with EM, and the computational expense varies greatly from case to case; the worst-case cost can be very high.
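The following is a minimal sketch of this exhaustive alternative, assuming the same hypothetical prob(x, theta) model as before; it enumerates every possible instance selection (i.e. every component function), which makes the exponential worst-case cost obvious.

from itertools import product
import numpy as np
from scipy.optimize import minimize

def true_loglik(theta, pos_bags, neg_bags, prob):
    # Equation D.1 again, with the max over the instances in each bag.
    return (sum(np.log(max(prob(x, theta) for x in b)) for b in pos_bags)
          + sum(np.log(1.0 - max(prob(x, theta) for x in b)) for b in neg_bags))

def exhaustive_mle(pos_bags, neg_bags, prob, theta0):
    bags, npos = pos_bags + neg_bags, len(pos_bags)
    best_theta, best_L = None, -np.inf
    # Each selection of one instance index per bag is one component function;
    # there are prod(len(bag)) of them, hence the exponential worst case.
    for sel in product(*[range(len(b)) for b in bags]):
        def neg_component(t):
            return -sum(np.log(prob(bags[k][z], t)) if k < npos
                        else np.log(1.0 - prob(bags[k][z], t))
                        for k, z in enumerate(sel))
        theta = minimize(neg_component, np.asarray(theta0, dtype=float)).x
        L = true_loglik(theta, pos_bags, neg_bags, prob)  # score on the true function
        if L > best_L:
            best_theta, best_L = theta, L
    return best_theta, best_L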
The problem with EM-DD lies in its objective of using a normal gradient-based EM to solve a non-differentiable optimization problem. Non-differentiable optimization has been an active research topic in the optimization domain for some time [Lemaréchal, 1989]. One of the methods for dealing with a non-differentiable optimization problem is to transform the objective function into a differentiable one. The method used in DD of substituting the softmax function for the max function is of this kind. Although the softmax does not transform the function exactly, it approximates the true log-likelihood function precisely enough. In the case shown in Figure D.2, it would approximate the true function by a differentiable function; hence this differentiable function would look similar to the true log-likelihood function and have a maximum close enough to point D. Using a normal Newton-type method we can easily find this point.
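The smoothing idea is easy to demonstrate. The sketch below uses one common softmax form, an exponentially weighted mean that approaches the hard max as its parameter grows; this particular form and all names are assumptions for illustration, since the exact softmax variant used in DD is not reproduced here.

import numpy as np

def softmax(values, alpha=10.0):
    # Differentiable stand-in for max: approaches the hard max as alpha grows.
    v = np.asarray(values, dtype=float)
    w = np.exp(alpha * (v - v.max()))      # shift for numerical stability
    return float(np.sum(v * w) / np.sum(w))

probs = [0.10, 0.45, 0.80]                 # e.g. Pr(Y=1|x) for a bag's instances
print(max(probs))                          # 0.8    (non-differentiable in theta)
print(softmax(probs, alpha=10.0))          # approx. 0.79 (smooth, close to max)
print(softmax(probs, alpha=100.0))         # approx. 0.80 (closer still)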
In summary, because DD uses a sound maximization procedure whereas EM-DD's approach may not find an MLE, we are inclined to believe the statement in [Maron, 1998] that DD with the most-likely-cause model actually performs worse than with the noisy-or model on the Musk datasets, and we are skeptical about the good results reported for EM-DD [Zhang and Goldman, 2002] (especially considering that there are also problems with the evaluation procedure used in [Zhang and Goldman, 2002]).


Bibliography

Ahrens, J. and Dieter, U. [1974]. Computer methods for sampling from Gamma, Beta, Poisson and Binomial distributions. Computing, 12, 223-246.

Ahrens, J., Kohrt, K. and Dieter, U. [1983]. Algorithm 599: sampling from Gamma and Poisson distributions. ACM Transactions on Mathematical Software, 9(2), 255-257.

Artin, E. [1964]. The Gamma Function. New York, NY: Holt, Rinehart and Winston. Translated by M. Butler.

Auer, P. [1997]. On learning from multiple instance examples: empirical evaluation of a theoretical approach. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 21-29). San Francisco, CA: Morgan Kaufmann.

Bilmes, J. [1997]. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of Berkeley.

Blake, C. and Merz, C. [1998]. UCI repository of machine learning databases.

Blum, A. and Kalai, A. [1998]. A note on learning from multiple-instance examples. Machine Learning, 30(1), 23-30.

Bratley, P., Fox, B. and Schrage, L. [1983]. A Guide to Simulation. New York, NY: Springer-Verlag.

Breiman, L. [1996]. Bagging predictors. Machine Learning, 24(2), 123-140.

le Cessie, S. and van Houwelingen, J. [1992]. Ridge estimators in logistic regression. Applied Statistics, 41(1), 191-201.

Chevaleyre, Y. and Zucker, J.-D. [2000]. Solving multiple-instance and multiple-part learning problems with decision trees and decision rules. Application to the mutagenesis problem. Internal Report, University of Paris 6.

Chevaleyre, Y. and Zucker, J.-D. [2001]. A framework for learning rules from multiple instance data. In Proceedings of the Twelfth European Conference on Machine Learning (pp. 49-60). Berlin: Springer-Verlag.

Chong, E. and Żak, S. [1996]. An Introduction to Optimization. New York, NY: John Wiley & Sons, Inc.

Cohen, W. [1995]. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning (pp. 115-123). San Francisco, CA: Morgan Kaufmann.

Dempster, A., Laird, N. and Rubin, D. [1977]. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.

Dennis, J. and Schnabel, R. [1983]. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall, Inc.

Devroye, L., Györfi, L. and Lugosi, G. [1996]. A Probabilistic Theory of Pattern Recognition. New York, NY: Springer-Verlag.

Dietterich, T. and Bakiri, G. [1995]. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.

Dietterich, T., Lathrop, R. and Lozano-Pérez, T. [1997]. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2), 31-71.

Frank, E. and Witten, I. [1998]. Generating accurate rule sets without global optimization. In Proceedings of the Fifteenth International Conference on Machine Learning (pp. 144-151). San Francisco, CA: Morgan Kaufmann.

Frank, E. and Witten, I. [1999]. Making better use of global discretization. In Proceedings of the Sixteenth International Conference on Machine Learning (pp. 115-123). San Francisco, CA: Morgan Kaufmann.

Frank, E. and Xu, X. [2003]. Applying propositional learning algorithms to multi-instance data. Working Paper 06/03, Department of Computer Science, University of Waikato, New Zealand.

Freund, Y. and Schapire, R. [1996]. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (pp. 148-156). San Francisco, CA: Morgan Kaufmann.

Friedman, J., Hastie, T. and Tibshirani, R. [2000]. Additive logistic regression: a statistical view of boosting (with discussion). Annals of Statistics, 28, 307-337.

Gärtner, T., Flach, P., Kowalczyk, A. and Smola, A. [2002]. Multi-instance kernels. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 179-186). San Francisco, CA: Morgan Kaufmann.

Gill, P., Golub, G., Murray, W. and Saunders, M. [1974]. Methods for modifying matrix factorizations. Mathematics of Computation, 28(126), 505-535.

Gill, P. and Murray, W. [1976]. Minimization subject to bounds on the variables. Technical Report NPL Report NAC-72, National Physical Laboratory.

Gill, P., Murray, W. and Wright, M. [1981]. Practical Optimization. London: Academic Press.

Goldfarb, D. [1976]. Factorized variable metric methods for unconstrained optimization. Mathematics of Computation, 30(136), 796-811.

Hastie, T., Tibshirani, R. and Friedman, J. [2001]. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer-Verlag.

John, G. and Langley, P. [1995]. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 338-345). San Mateo, CA: Morgan Kaufmann.

Lemaréchal, C. [1989]. Nondifferentiable optimization. In Nemhauser, Rinnooy Kan and Todd (Eds.), Optimization, Volume 1 of Handbooks in Operations Research and Management Science, chapter VII (pp. 529-569). Amsterdam: North-Holland.

Long, P. and Tan, L. [1998]. PAC learning axis-aligned rectangles with respect to product distributions from multiple-instance examples. Machine Learning, 30(1), 7-21.

Maritz, J. and Lwin, T. [1989]. Empirical Bayes Methods (2 Ed.). London: Chapman and Hall.

Maron, O. [1998]. Learning from Ambiguity. PhD thesis, Massachusetts Institute of Technology, United States.

Maron, O. and Lozano-Pérez, T. [1998]. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems, 10 (pp. 570-576). Cambridge, MA: MIT Press.

McLachlan, G. [1992]. Discriminant Analysis and Statistical Pattern Recognition. New York, NY: John Wiley & Sons, Inc.

McLachlan, G. and Krishnan, T. [1996]. The EM Algorithm and Extensions. New York, NY: John Wiley & Sons, Inc.

Minh, D. [1988]. Generating Gamma variates. ACM Transactions on Mathematical Software, 4(3), 261-266.

von Mises, R. [1943]. On the correct use of Bayes' formula. The Annals of Mathematical Statistics, 13, 156-165.

Nadeau, C. and Bengio, Y. [1999]. Inference for the generalization error. In Advances in Neural Information Processing Systems, Volume 12 (pp. 307-313). Cambridge, MA: MIT Press.

O'Hagan, A. [1994]. Bayesian Inference, Volume 2B of Kendall's Advanced Theory of Statistics. London: Edward Arnold.

Platt, J. [1998]. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges and A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press.

Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. [1992]. Numerical Recipes in C: The Art of Scientific Computing (2 Ed.). Cambridge, England: Cambridge University Press.

Quinlan, J. [1993]. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Ramon, J. and Raedt, L. D. [2000]. Multi instance neural networks. In Attribute-Value and Relational Learning: Crossing the Boundaries. Workshop at the Seventeenth International Conference on Machine Learning.

Ruffo, G. [2001]. Learning Single and Multiple Instance Decision Trees for Computer Security Applications. PhD thesis, Università di Torino, Italy.

Srinivasan, A., Muggleton, S., King, R. and Sternberg, M. [1994]. Mutagenesis: ILP experiments in a non-determinate biological domain. In Proceedings of the Fourth International Inductive Logic Programming Workshop (pp. 161-174).

Stuart, A., Ord, J. and Arnold, S. [1999]. Classical Inference and the Linear Model, Volume 2A of Kendall's Advanced Theory of Statistics. London: Arnold.

Vapnik, V. [2000]. The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.

Wang, J. and Zucker, J.-D. [2000]. Solving the multiple-instance problem: a lazy learning approach. In Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1119-1134). San Francisco, CA: Morgan Kaufmann.

Wang, Y. and Witten, I. [2002]. Modeling for optimal probability prediction. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 650-657). San Francisco, CA: Morgan Kaufmann.

Weidmann, N. [2003]. Two-level classification for generalized multi-instance data. Master's thesis, Albert-Ludwigs-Universität Freiburg, Germany.

Weidmann, N., Frank, E. and Pfahringer, B. [2003]. A two-level learning method for generalized multi-instance problems. In Proceedings of the Fourteenth European Conference on Machine Learning. To be published.

Witten, I. and Frank, E. [1999]. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, CA: Morgan Kaufmann.

Zhang, Q. and Goldman, S. [2002]. EM-DD: an improved multiple-instance learning technique. In Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference (pp. 1073-1080). Cambridge, MA: MIT Press.

Zhang, Q., Goldman, S., Yu, W. and Fritts, J. [2002]. Content-based image retrieval using multiple-instance learning. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 682-689). San Francisco, CA: Morgan Kaufmann.
