Machine Learning and Knowledge Discovery in Databases
European Conference, ECML PKDD 2015
Porto, Portugal, September 7–11, 2015
Proceedings, Part I
Lecture Notes in Artificial Intelligence 9284
Editors
Annalisa Appice, University of Bari Aldo Moro, Bari, Italy
Carlos Soares, University of Porto - INESC TEC, Porto, Portugal
Pedro Pereira Rodrigues, University of Porto, Porto, Portugal
João Gama, University of Porto - INESC TEC, Porto, Portugal
Vitor Santos Costa, University of Porto - CRACS/INESC TEC, Porto, Portugal
Alípio Jorge, University of Porto - INESC TEC, Porto, Portugal
We are delighted to introduce the proceedings of the 2015 edition of the European
Conference on Machine Learning and Principles and Practice of Knowledge Discovery
in Databases, or ECML PKDD for short. This conference stems from the former ECML
and PKDD conferences, the two premier European conferences on, respectively,
Machine Learning and Knowledge Discovery in Databases. Originally independent
events, the two conferences were organized jointly for the first time in 2001. The
synergy between the two led to increasing integration, and eventually the two merged in
2008. Today, ECML PKDD is a world-wide leading scientific event that aims at
exploiting the synergies between Machine Learning and Data Mining, focusing on the
development and application of methods and tools capable of solving real-life
problems.
ECML PKDD 2015 was held in Porto, Portugal, during September 7–11. This was
the third time Porto hosted the major European Machine Learning event. In 1991, Porto
was host to the fifth EWSL, the precursor of ECML. More recently, in 2005, Porto was
host to a very successful ECML PKDD. We were honored that the community chose to
again have ECML PKDD 2015 in Porto, just ten years later. The 2015 ECML PKDD
was co-located with “Intelligent System Applications to Power Systems”, ISAP 2015, a
well-established forum for scientific and technical discussion, aiming at fostering the
widespread application of intelligent tools and techniques to the power system network
and business. Moreover, it was co-located, for the first time, with the Summer School
on “Data Sciences for Big Data.”
ECML PKDD traditionally combines the research-oriented extensive program of the
scientific and journal tracks, which aim at being a forum for high quality, novel
research in Machine Learning and Data Mining, with the more focused programs of the
demo track, dedicated to presenting real systems to the community, the PhD track,
which supports young researchers, and the nectar track, dedicated to bringing relevant
work to the community. The program further includes an industrial track, which brings
together participants from academia, industry, government, and non-governmental
organizations in a venue that highlights practical and real-world studies of machine
learning, knowledge discovery, and data mining. The industrial track of ECML PKDD
2015 has a separate Program Committee and separate proceedings volume. Moreover,
the conference program included a doctoral consortium, three discovery challenges,
and various workshops and tutorials.
The research program included five plenary talks by invited speakers, namely,
Hendrik Blockeel (University of Leuven and Leiden University), Pedro Domingos
(University of Washington), Jure Leskovec (Stanford University), Nataša Milić-Fray-
ling (Microsoft Research), and Dino Pedreschi (Università di Pisa), as well as one ISAP
+ECML PKDD joint plenary talk by Chen-Ching Liu (Washington State University).
Three invited speakers contributed to the industrial track: Andreas Antrup (Zalando and
University of Edinburgh), Wei Fan (Baidu Big Data Lab), and Hang Li (Noah’s Ark
Lab, Huawei Technologies).
Three discovery challenges were announced this year. They focused on “MoRe-
BikeS: Model Reuse with Bike rental Station data,” “On Learning from Taxi GPS
Traces,” and “Activity Detection Based on Non-GPS Mobility Data,” respectively.
Twelve workshops were held, providing an opportunity to discuss current topics in a
small and interactive atmosphere: “MetaSel - Meta-learning and Algorithm Selection,”
“Parallel and Distributed Computing for Knowledge Discovery in Databases,”
“Interactions between Data Mining and Natural Language Processing,” “New Frontiers
in Mining Complex Patterns,” “Mining Ubiquitous and Social Environments,”
“Advanced Analytics and Learning on Temporal Data,” “Learning Models over
Multiple Contexts,” “Linked Data for Knowledge Discovery,” “Sports Analytics,”
“BigTargets: Big Multi-target Prediction,” “DARE: Data Analytics for Renewable
Energy Integration,” and “Machine Learning in Life Sciences.”
Ten tutorials were included in the conference program, providing a comprehensive
introduction to core techniques and areas of interest for the scientific community:
“Similarity and Distance Metric Learning with Applications to Computer Vision,”
“Scalable Learning of Graphical Models,” “Meta-learning and Algorithm Selection,”
“Machine Reading the Web - Beyond Named Entity Recognition and Relation
Extraction,” “VC-Dimension and Rademacher Averages: From Statistical Learning
Theory to Sampling Algorithms,” “Making Sense of (Multi-)Relational Data,” “Col-
laborative Filtering with Binary, Positive-Only Data,” “Predictive Maintenance,”
“Eureka! - How to Build Accurate Predictors for Real-Valued Outputs from Simple
Methods,” and “The Space of Online Learning Problems.”
The main track received 380 paper submissions, of which 89 were accepted. Such a
high volume of scientific work required a tremendous effort by the Area Chairs, Pro-
gram Committee members, and many additional reviewers. We managed to collect
three highly qualified independent reviews per paper and one additional overall input
from one of the Area Chairs. Papers were evaluated on the basis of significance of
contribution, novelty, technical quality, scientific, and technological impact, clarity,
repeatability, and scholarship. The industrial, demo, and nectar tracks were equally
successful, attracting 42, 32, and 29 paper submissions, respectively.
For the third time, the conference used a double submission model: next to the
regular conference tracks, papers submitted to the Springer journals Machine Learning
(MACH) and Data Mining and Knowledge Discovery (DAMI) were considered for
presentation at the conference. These papers were submitted to the ECML PKDD 2015
special issue of the respective journals, and underwent the normal editorial process
of these journals. Those papers accepted for one of these journals were assigned a
presentation slot at the ECML PKDD 2015 conference. A total of 191 original
manuscripts were submitted to the journal track during this year. Some of these papers
are still being refereed. Of the fully refereed papers, 10 were accepted in DAMI and 15
in MACH, together with 4+4 papers from last year’s call, which were also scheduled
for presentation at this conference. Overall, this resulted in a total of 613 submis-
sions (to the scientific track, industrial track and journal track), of which 126 were
selected for presentation at the conference, making an overall acceptance rate of
about 21%.
Part I and Part II of the proceedings of the ECML PKDD 2015 conference contain
the full papers of the contributions presented in the scientific track, the abstracts of the
scientific plenary talks, and the abstract of the ISAP+ECML PKDD joint plenary talk.
Part III of the proceedings of the ECML PKDD 2015 conference contains the full
papers of the contributions presented in the industrial track, short papers describing the
demonstrations, the nectar papers, and the abstracts of the industrial plenary talks.
The scientific track program results from continuous collaboration between the
scientific tracks and the general chairs. Throughout we had the unfaltering support
of the Local Chairs, Carlos Ferreira, Rita Ribeiro, and João Moreira, who managed this
event in a thoroughly competent and professional way. We thank the Social Media
Chairs, Dunja Mladenić and Márcia Oliveira, for tweeting the new face of
ECML PKDD, and the Publicity Chairs, Ricardo Campos and Carlos Ferreira, for their
excellent work in spreading the news. The beautiful design and quick response time
of the web site is due to the work of our Web Chairs, Sylwia Bugla, Rita Ribeiro, and
João Rodrigues. The beautiful image on all the conference materials is based on the
logo designed by Joana Amaral and João Cravo, inspired by Porto landmarks. It has been
a pleasure to collaborate with the Journal, Industrial, Demo, Nectar, and PhD Track
Chairs. ECML PKDD would not be complete if not for the efforts of the Tutorial
Chairs, Fazel Famili, Mykola Pechenizkiy, and Nikolaj Tatti, the Workshop Chairs,
Stan Matwin, Bernhard Pfahringer, and Luís Torgo, and the Discovery Challenge
Chairs, Michel Ferreira, Hillol Kargupta, Luís Moreira-Matias, and João Moreira. We
thank the Awards Committee Chairs, Pavel Brazdil, Sašo Džeroski, Hiroshi Motoda,
and Michèle Sebag, for their hard work in selecting papers for awards. A special meta
thanks to Pavel: ECML PKDD at Porto is only possible thanks to you. We gratefully
acknowledge the work of the Sponsorship Chairs, Albert Bifet and André Carvalho, for
their key work. Special thanks go to the Proceedings Chairs, Michelangelo Ceci and
Paulo Cortez, for the difficult task of putting these proceedings together. We appreciate
the support of Artur Aiguzhinov, Catarina Félix Oliveira, and Mohammad Nozari
(U. Porto) for helping to check this front matter. We thank the ECML PKDD Steering
Committee for kindly sharing their experience, and particularly the General Steering
Committee Chair, Fosca Giannotti. The quality of ECML PKDD is only possible due to
the tremendous efforts of the Program Committee; our sincere thanks for all the great
work in improving the quality of these proceedings. Throughout, we relied on the
exceptional quality of the Area Chairs. Our most sincere thanks for their support, with a
special thanks to the members who contributed in difficult personal situations, and to
Paulo Azevedo for stepping in when the need was there. Last but not least, we would
like to sincerely thank all the authors who submitted their work to the conference.
Conference Co-chairs
João Gama University of Porto, INESC TEC, Portugal
Alípio Jorge University of Porto, INESC TEC, Portugal
Program Co-chairs
Annalisa Appice University of Bari Aldo Moro, Italy
Pedro Pereira Rodrigues University of Porto, CINTESIS, INESC TEC, Portugal
Vitor Santos Costa University of Porto, INESC TEC, Portugal
Carlos Soares University of Porto, INESC TEC, Portugal
Tutorial Chairs
Fazel Famili CNRC, Canada
Mykola Pechenizkiy TU Eindhoven, The Netherlands
Nikolaj Tatti Aalto University, Finland
Workshop Chairs
Stan Matwin Dalhousie University, NS, Canada
Bernhard Pfahringer University of Waikato, New Zealand
Luís Torgo University of Porto, INESC TEC, Portugal
PhD Chairs
Jaakko Hollmén Aalto University, Finland
Panagiotis Papapetrou Stockholm University, Sweden
Proceedings Chairs
Michelangelo Ceci University of Bari, Italy
Paulo Cortez University of Minho, Portugal
Sponsorship Chairs
Albert Bifet Huawei Noah’s Ark Lab, China
André Carvalho University of São Paulo, Brazil
Pedro Pereira Rodrigues University of Porto, Portugal
Publicity Chairs
Ricardo Campos Polytechnic Institute of Tomar, INESC TEC, Portugal
Carlos Ferreira Oporto Polytechnic Institute, INESC TEC, Portugal
Web Chairs
Sylwia Bugla INESC TEC, Portugal
Rita Ribeiro University of Porto, INESC TEC, Portugal
João Rodrigues INESC TEC, Portugal
Area Chairs
Paulo Azevedo University of Minho
Michael Berthold Universität Konstanz
Francesco Bonchi Yahoo Labs Barcelona
Henrik Boström University of Stockholm
Jean-François Boulicaut Institut National des Sciences Appliquées de Lyon, LIRIS
Pavel Brazdil University of Porto
André Carvalho University of São Paulo
Michelangelo Ceci Università degli Studi di Bari Aldo Moro
Program Committee
Additional Reviewers
Sponsors
Platinum Sponsors
BNP PARIBAS http://www.bnpparibas.com/
ONR Global www.onr.navy.mil/science-technology/onr-global.aspx
Gold Sponsors
Zalando https://www.zalando.co.uk/
HUAWEI http://www.huawei.com/en/
Silver Sponsors
Deloitte http://www2.deloitte.com/
Amazon http://www.amazon.com/
Bronze Sponsors
Xarevision http://xarevision.pt/
Farfetch http://www.farfetch.com/pt/
NOS http://www.nos.pt/particulares/Pages/home.aspx
Award Sponsors
Machine Learning http://link.springer.com/journal/10994
Data Mining and Knowledge Discovery http://link.springer.com/journal/10618
Deloitte http://www2.deloitte.com/
Lanyard Sponsor
KNIME http://www.knime.org/
Additional Supporters
INESC TEC https://www.inesctec.pt/
University of Porto, Faculdade de Economia http://sigarra.up.pt/fep/pt/web_page.inicial
Springer http://www.springer.com/
University of Porto http://www.up.pt/
Official Carrier
TAP http://www.flytap.com/
Abstracts of Invited Talks
Towards Declarative, Domain-Oriented
Data Analysis
Hendrik Blockeel
University of Leuven and Leiden University
Abstract. The need for advanced data analysis now pervades all areas of science,
industry and services. A wide variety of theory and techniques from statistics,
data mining, and machine learning is available. Addressing a concrete question or
problem in a particular application domain requires multiple non-trivial steps:
translating the question to a data analysis problem, selecting a suitable approach
to solve this problem, correctly applying that approach, and correctly interpreting
the results. In this process, specialist knowledge on data analysis needs to be
combined with domain expertise. As data analysis becomes ever more advanced,
this becomes increasingly difficult. In an ideal world, data analysis would be
declarative and domain-oriented: the user should be able to state the question,
rather than describing a solution procedure, and the software should decide how
to provide an answer. The user then no longer needs to be, or hire, a specialist in
data analysis for every step of the knowledge discovery process. This would
make data analysis easier, more efficient, and less error-prone. In this talk, I will
discuss contemporary research that is bringing the state of the art in data analysis
closer to that long-term goal. This includes research on inductive databases,
constraint-based data mining, probabilistic-logical modeling, and declarative
experimentation.
Pedro Domingos
University of Washington
Abstract. Big data makes it possible in principle to learn very rich probabilistic
models, but inference in them is prohibitively expensive. Since inference is
typically a subroutine of learning, in practice learning such models is very hard.
Sum-product networks (SPNs) are a new model class that squares this circle by
providing maximum flexibility while guaranteeing tractability. In contrast to
Bayesian networks and Markov random fields, SPNs can remain tractable even
in the absence of conditional independence. SPNs are defined recursively: an
SPN is either a univariate distribution, a product of SPNs over disjoint variables,
or a weighted sum of SPNs over the same variables. It’s easy to show that the
partition function, all marginals and all conditional MAP states of an SPN can
be computed in time linear in its size. SPNs have most tractable distributions as
special cases, including hierarchical mixture models, thin junction trees, and
nonrecursive probabilistic context-free grammars. I will present generative and
discriminative algorithms for learning SPN weights, and an algorithm for
learning SPN structure. SPNs have achieved impressive results in a wide variety
of domains, including object recognition, image completion, collaborative fil-
tering, and click prediction. Our algorithms can easily learn SPNs with many
layers of latent variables, making them arguably the most powerful type of deep
learning to date. (Joint work with Rob Gens and Hoifung Poon.)
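The abstract's recursive definition of an SPN is concrete enough to sketch. The snippet below is a minimal illustration in Python of that definition, not the speaker's implementation: Bernoulli leaves, product nodes over disjoint variables, and weighted sum nodes over the same variables, evaluated bottom-up in a single pass, which is why likelihoods, marginals, and the partition function cost time linear in the network size. The class names and the choice of Bernoulli leaves are our own assumptions for the example.

```python
# Minimal sketch of the recursive SPN definition quoted in the abstract (not the
# speaker's code): a node is a univariate leaf, a product over disjoint variables,
# or a weighted sum over the same variables. One bottom-up pass per query gives
# time linear in the number of edges.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Leaf:                      # univariate distribution; here a Bernoulli on one variable
    var: int
    p: float

    def value(self, x: Dict[int, int]) -> float:
        if self.var not in x:    # variable marginalized out: the leaf sums to 1
            return 1.0
        return self.p if x[self.var] == 1 else 1.0 - self.p


@dataclass
class Product:                   # children must cover disjoint sets of variables
    children: List

    def value(self, x) -> float:
        out = 1.0
        for child in self.children:
            out *= child.value(x)
        return out


@dataclass
class Sum:                       # weighted sum of children over the same variables
    weights: List[float]
    children: List

    def value(self, x) -> float:
        return sum(w * c.value(x) for w, c in zip(self.weights, self.children))


# A tiny mixture over two binary variables X0 and X1.
spn = Sum([0.6, 0.4],
          [Product([Leaf(0, 0.9), Leaf(1, 0.2)]),
           Product([Leaf(0, 0.1), Leaf(1, 0.7)])])

print(spn.value({0: 1, 1: 0}))   # joint probability P(X0 = 1, X1 = 0)
print(spn.value({0: 1}))         # marginal P(X0 = 1); X1 is summed out at its leaf
print(spn.value({}))             # partition function; 1.0 for normalized weights
```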
Bio. Pedro Domingos is Professor of Computer Science and Engineering at the Uni-
versity of Washington. His research interests are in machine learning, artificial intel-
ligence and data science. He received a PhD in Information and Computer Science
from the University of California at Irvine, and is the author or co-author of over 200
technical publications. He is a winner of the SIGKDD Innovation Award, the highest
honor in data science. He is an AAAI Fellow, and has received a Sloan Fellowship, an
NSF CAREER Award, a Fulbright Scholarship, an IBM Faculty Award, and best paper
awards at several leading conferences. He is a member of the editorial board of the
Machine Learning journal, co-founder of the International Machine Learning Society,
and past associate editor of JAIR. He was program co-chair of KDD-2003 and
SRL-2009, and has served on numerous program committees.
Mining Online Networks and Communities
Jure Leskovec
Stanford University
Abstract. The Internet and the Web fundamentally changed how we live our
daily lives as well as broadened the scope of computer science. Today the Web
is a ‘sensor’ that captures the pulse of humanity and allows us to observe
phenomena that were once essentially invisible to us. These phenomena include
the social interactions and collective behavior of hundreds of millions of people,
recorded at unprecedented levels of scale and resolution. Analyzing this data
offers novel algorithmic as well as computational challenges. Moreover, it offers
new insights into the design of information systems in the presence of complex
social feedback effects, as well as a new perspective on fundamental questions in
the social sciences.
Chen-Ching Liu
Washington State University
Nataša Milić-Frayling
Microsoft Research
Abstract. This presentation will shed light on user tracking and behavioural
targeting on the Web. Through empirical studies of cookie tracking practices,
we will take an alternative view of the display ad business by observing the
network of third party trackers that envelopes the Web. The practice begs a
question of how to resolve a dissonance between the current consumer tracking
practices and the vendors’ desire for consumers’ loyalty and trustful long-term
engagements. It also makes us aware of how computing designs and techniques,
inaccessible to individuals, cause imbalance in the knowledge acquisition and
enablement, disempowering the end-users.
Dino Pedreschi
University of Pisa
Abstract. My seminar discusses the novel questions that big data and social
mining allow to raise and answer, how a new paradigm for scientific explora-
tion, statistics and policy making is emerging, and the major scientific, tech-
nological and societal barriers to be overcome to realize this vision. I will focus
on concrete projects with telecom providers and official statistics bureau in Italy
and France aimed at measuring, quantifying and possibly predicting key
demographic and socio-economic indicators based on nation-wide mobile phone
data: the population of different categories of city users (residents, commuters,
visitors) in urban spaces, the inter-city mobility, the level of well-being and
economic development of geographical units at various scales.
Bio. Dino Pedreschi is a Professor of Computer Science at the University of Pisa, and a
pioneering scientist in mobility data mining, social network mining and
privacy-preserving data mining. He co-leads with Fosca Giannotti the Pisa KDD Lab -
Knowledge Discovery and Data Mining Laboratory, a joint research initiative of the
University of Pisa and the Information Science and Technology Institute of the Italian
National Research Council, one of the earliest research labs centered on data mining. His
research focus is on big data analytics and mining and their impact on society. He is a
founder of the Business Informatics MSc program at Univ. Pisa, a course targeted at the
education of interdisciplinary data scientists. Dino has been a visiting scientist at
Barabási Lab (Center for Complex Network Research) of Northeastern University, Boston
(2009-2010), and earlier at the University of Texas at Austin (1989-90), at CWI
Amsterdam (1993) and at UCLA (1995). In 2009, Dino received a Google Research
Award for his research on privacy-preserving data mining.
Abstracts of Journal Track Articles
A Bayesian Approach for Comparing Cross-Validated Algorithms
on Multiple Data Sets
Giorgio Corani and Alessio Benavoli
Machine Learning
DOI: 10.1007/s10994-015-5486-z
We present a Bayesian approach for making statistical inference about the accuracy (or
any other score) of two competing algorithms which have been assessed via
cross-validation on multiple data sets. The approach is constituted by two pieces. The
first is a novel correlated Bayesian t-test for the analysis of the cross-validation results
on a single data set which accounts for the correlation due to the overlapping training
sets. The second piece merges the posterior probabilities computed by the Bayesian
correlated t-test on the different data sets to make inference on multiple data sets.
It does so by adopting a Poisson-binomial model. The inferences on multiple data sets
account for the different uncertainty of the cross-validation results on the different data
sets. It is the first test able to achieve this goal. It is generally more powerful than the
signed-rank test if ten runs of cross-validation are performed, as it is anyway generally
recommended.
Outlier detection methods automatically identify instances that deviate from the
majority of the data. In this paper, we propose a novel approach for unsupervised
outlier detection, which re-formulates the outlier detection problem in numerical data as
a set of supervised regression learning problems. For each attribute, we learn a
predictive model which predicts the values of that attribute from the values of all other
attributes, and compute the deviations between the predictions and the actual values.
From those deviations, we derive both a weight for each attribute, and a final outlier
score using those weights. The weights help separate the relevant attributes from the
irrelevant ones, and thus make the approach well suited for discovering outliers
otherwise masked in high-dimensional data. An empirical evaluation shows that our
approach outperforms existing algorithms, and is particularly robust in datasets with
many irrelevant attributes. Furthermore, we show that if a symbolic machine learning
method is used to solve the individual learning problems, the approach is also capable
of generating concise explanations for the detected outliers.
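The attribute-wise recipe described above lends itself to a compact sketch. The code below is a rough illustration of the idea, not the authors' algorithm: the choice of scikit-learn's Ridge regressor, cross-validated predictions, and an RMSE-based attribute weighting are our own assumptions, made only to show how per-attribute deviations can be combined into a single outlier score.

```python
# Rough sketch of regression-based unsupervised outlier scoring (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict


def regression_outlier_scores(X: np.ndarray) -> np.ndarray:
    n, d = X.shape
    deviations = np.zeros((n, d))
    weights = np.zeros(d)
    for j in range(d):
        others = np.delete(X, j, axis=1)
        # out-of-sample predictions of attribute j from all other attributes
        pred = cross_val_predict(Ridge(), others, X[:, j], cv=5)
        deviations[:, j] = np.abs(X[:, j] - pred)
        rmse = np.sqrt(np.mean(deviations[:, j] ** 2))
        # attributes that the other attributes predict well get a higher weight
        weights[j] = 1.0 / (rmse + 1e-12)
    weights /= weights.sum()
    return deviations @ weights                     # one outlier score per instance


scores = regression_outlier_scores(np.random.rand(200, 8))
print(scores.argsort()[-5:])                        # indices of the 5 most outlying rows
```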
Defining appropriate distance measures among rankings is a classic area of study which
has led to many useful applications. In this paper, we propose a more general
abstraction of preference data, namely directed acyclic graphs (DAGs), and introduce a
measure for comparing DAGs, given that a vertex correspondence between the DAGs
is known. We study the properties of this measure and use it to aggregate and cluster a
set of DAGs. We show that these problems are NP-hard and present efficient methods
to obtain solutions with approximation guarantees. In addition to preference data, these
methods turn out to have other interesting applications, such as the analysis of a
collection of information cascades in a network. We test the methods on synthetic and
real-world datasets, showing that the methods can be used to, e.g., find a set of
influential individuals related to a set of topics in a network or to discover meaningful
and occasionally surprising clustering structure.
Graphs - such as friendship networks - that evolve over time are an example of data that
are naturally represented as binary tensors. Similarly to analysing the adjacency matrix
of a graph using a matrix factorization, we can analyse the tensor by factorizing it.
Unfortunately, tensor factorizations are computationally hard problems, and in
particular, are often significantly harder than their matrix counterparts. In case of
Boolean tensor factorizations - where the input tensor and all the factors are required to
be binary and we use Boolean algebra - much of that hardness comes from the
possibility of overlapping components. Yet, in many applications we are perfectly
happy to partition at least one of the modes. For instance, in the aforementioned
time-evolving friendship networks, groups of friends might be overlapping, but the time
points at which the network was captured are always distinct. In this paper we
investigate what consequences this partitioning has on the computational complexity
of the Boolean tensor factorizations and present a new algorithm for the resulting
clustering problem. This algorithm can alternatively be seen as a particularly
regularized clustering algorithm that can handle extremely high-dimensional observa-
tions. We analyse our algorithm with the goal of maximizing the similarity and argue
that this is more meaningful than minimizing the dissimilarity. As a by-product we
obtain a PTAS and an efficient 0.828-approximation algorithm for rank-1 binary
factorizations. Our algorithm for Boolean tensor clustering achieves high scalability,
high similarity, and good generalization to unseen data with both synthetic and
real-world data sets.
Consensus Hashing
Cong Leng and Jian Cheng
Machine Learning
DOI: 10.1007/s10994-015-5496-x
Hashing techniques have been widely used in many machine learning applications
because of their efficiency in both computation and storage. Although a variety of
hashing methods have been proposed, most of them make some implicit assumptions
about the statistical or geometrical structure of data. In fact, few hashing algorithms can
adequately handle all kinds of data with different structures. When considering hybrid
structure datasets, different hashing algorithms might produce different and possibly
inconsistent binary codes. Inspired by the successes of classifier combination and
clustering ensembles, in this paper, we present a novel combination strategy for
multiple hashing results, named Consensus Hashing (CH). By defining the measure of
consensus of two hashing results, we put forward a simple yet effective model to learn
consensus hash functions which generate binary codes consistent with the existing
ones. Extensive experiments on several large scale benchmarks demonstrate the overall
superiority of the proposed method compared with state-of-the-art hashing algorithms.
Community search is the problem of finding a good community for a given set of query
vertices. One of the most studied formulations of community search asks for a
connected subgraph that contains all query vertices and maximizes the minimum
degree. All existing approaches to min-degree-based community search suffer from
limitations concerning efficiency, as they need to visit (large part of) the whole input
graph, as well as accuracy, as they output communities quite large and not really
cohesive. Moreover, some existing methods lack generality: they handle only
single-vertex queries, find communities that are not optimal in terms of minimum
degree, and/or require input parameters. In this work we advance the state of the art on
community search by proposing a novel method that overcomes all these limitations: it
is in general more efficient and effective—one/two orders of magnitude on average, it
can handle multiple query vertices, it yields optimal communities, and it is
parameter-free. These properties are confirmed by an extensive experimental analysis
performed on various real-world graphs.
We study the problem of finding the Longest Common Sub-Pattern (LCSP) shared by
two sequences of temporal intervals. In particular we are interested in finding the LCSP
of the corresponding arrangements. Arrangements of temporal intervals are a powerful
way to encode multiple concurrent labeled events that have a time duration.
Discovering commonalities among such arrangements is useful for a wide range of
scientific fields and applications, as it can be seen by the number and diversity of the
datasets we use in our experiments. In this paper, we define the problem of LCSP and
prove that it is NP-complete by demonstrating a connection between graphs and
arrangements of temporal intervals, which leads to a series of interesting open
problems. In addition, we provide an exact algorithm to solve the LCSP problem, and
also propose and experiment with three polynomial time and space underapproximation
techniques. Finally, we introduce two upper bounds for LCSP and study their
suitability for speeding up 1-NN search. Experiments are performed on seven datasets
taken from a wide range of real application domains, plus two synthetic datasets.
We prove bounds on complexity measures of the hypothesis space for quadratic and
conic side knowledge, and show that these bounds are tight in a specific sense for the
quadratic case.
There has been a growing interest in mutual information measures due to their wide range
of applications in Machine Learning and Computer Vision. In this manuscript, we
present a generalized structured regression framework based on Sharma-Mittal
divergence, a relative entropy measure that is first addressed in the Machine Learning
community in this work. Sharma-Mittal (SM) divergence is a generalized mutual
information measure for the widely used Rényi, Tsallis, Bhattacharyya, and
Kullback-Leibler (KL) relative entropies. Specifically, we study Sharma-Mittal
divergence as a cost function in the context of the Twin Gaussian Processes, which
generalizes over the KL-divergence without computational penalty. We show
interesting properties of Sharma-Mittal TGP (SMTGP) through a theoretical analysis,
much larger universe of N(>>k) incomplete instances so as to learn the most accurate
classifier. We propose a principled framework which motivates a generally applicable
yet efficient meta-technique for choosing k such instances. Since we cannot know a
priori the classifier that will result from the completed dataset, i.e. the final classifier,
our method chooses the k instances based on a derived upper bound on the expectation
of the distance between the next classifier and the final classifier. We additionally
derive a sufficient condition for these two solutions to match. We then empirically
evaluate the performance of our method relative to the state-of-the-art methods on 4
UCI datasets as well as 3 proprietary e-commerce datasets used in previous studies. In
these experiments, we also demonstrate how close we are likely to be to the optimal
solution, by quantifying the extent to which our sufficient condition is satisfied. Lastly,
we show that our method is easily extensible to the setting where we have a
non-uniform cost associated with acquiring the missing information.
A knowledge base consisting of triples like (subject entity, predicate relation, object
entity) is a very important database for knowledge management. It is very useful for
human-like reasoning, query expansion, question answering (Siri), and other related AI
tasks. However, knowledge base often suffers from incompleteness due to a large
volume of increasing knowledge in the real world and a lack of reasoning capability. In
this paper, we propose a Pairwise-interaction Differentiated Embeddings (PIDE) model
to embed entities and relations in the knowledge base to low dimensional vector
representations and then predict the possible truth of additional facts to extend the
knowledge base. In addition, we present a probability-based objective function to
improve the model optimization. Finally, we evaluate the model by considering the
problem of computing how likely the additional triple is true for the task of knowledge
base completion. Experiments on the WordNet and Freebase datasets show the excellent
performance of our model and algorithm.
Nowadays, video surveillance systems are taking the first steps toward automation, in
order to ease the burden on human resources as well as to avoid human error. As the
underlying data distribution and the number of concepts change over time, the
conventional learning algorithms fail to provide reliable solutions for this setting.
Herein, we formalize a learning concept suitable for multi-camera video surveillance
and propose a learning methodology adapted to that new paradigm. The proposed
framework resorts to the universal background model to robustly learn individual
object models from small samples and to more effectively detect novel classes. The
individual models are incrementally updated in an ensemble based approach, with older
models being progressively forgotten. The framework is designed to detect and label
new concepts automatically. The system is also designed to exploit active learning
strategies, in order to interact wisely with the operator, requesting assistance on the
observations that are most ambiguous to classify. The experimental results obtained both on real and
synthetic data sets verify the usefulness of the proposed approach.
When we are investigating an object in a data set, which itself may or may not be an
outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we
identify the novel problem of mining outlying aspects on numeric data. Given a query
object o in a multidimensional numeric data set O, in which subspace is o most
outlying? Technically, we use the rank of the probability density of an object in a
subspace to measure the outlyingness of the object in the subspace. A minimal
subspace where the query object is ranked the best is an outlying aspect. Computing the
outlying aspects of a query object is far from trivial. A naïve method has to calculate
the probability densities of all objects and rank them in every subspace, which is very
Event detection has been one of the most important research topics in social media
analysis. Most of the traditional approaches detect events based on fixed temporal and
spatial resolutions, while in reality events of different scales usually occur simulta-
neously, namely, they span different intervals in time and space. In this paper, we
propose a novel approach towards multiscale event detection using social media data,
which takes into account different temporal and spatial scales of events in the data.
Specifically, we explore the properties of the wavelet transform, which is a
well-developed multiscale transform in signal processing, to enable automatic handling
of the interaction between temporal and spatial scales. We then propose a novel
algorithm to compute a data similarity graph at appropriate scales and detect events of
different scales simultaneously by a single graph-based clustering process. Further-
more, we present spatiotemporal statistical analysis of the noisy information present in
the data stream, which allows us to define a novel term-filtering procedure for the
proposed event detection algorithm and helps us study its behavior using simulated
noisy data. Experimental results on both synthetically generated data and real world
data collected from Twitter demonstrate the meaningfulness and effectiveness of the
proposed approach. Our framework further extends to numerous application domains
that involve multiscale and multiresolution data analysis.
Although count data are increasingly ubiquitous, surprisingly little work has employed
probabilistic graphical models for modeling count data. Indeed the univariate case has
been well studied, however, in many situations counts influence each other and should
not be considered independently. Standard graphical models such as multinomial or
Gaussian ones are often ill-suited, too, since they disregard either the infinite range
over the natural numbers or the potentially asymmetric shape of the distribution of
count variables. Existing classes of Poisson graphical models can only model negative
conditional dependencies or neglect the prediction of counts or do not scale well. To
ease the modeling of multivariate count data, we therefore introduce a novel family of
Poisson graphical models, called Poisson Dependency Networks (PDNs). A PDN
consists of a set of local conditional Poisson distributions, each representing the
probability of a single count variable given the others, that naturally facilitates a simple
Gibbs sampling inference. In contrast to existing Poisson graphical models, PDNs are
non-parametric and trained using functional gradient ascent, i.e., boosting. The
particularly simple form of the Poisson distribution allows us to develop the first
multiplicative boosting approach: starting from an initial constant value, alternatively a
log-linear Poisson model, or a Poisson regression tree, a PDN is represented as
products of regression models grown in a stage-wise optimization. We demonstrate on
several real world datasets that PDNs can model positive and negative dependencies
and scale well while often outperforming state-of-the-art, in particular when using
multiplicative updates.
This paper is about the exploitation of Lipschitz continuity properties for Markov
Decision Processes (MDPs) to safely speed up policy-gradient algorithms. Starting from
assumptions about the Lipschitz continuity of the state-transition model, the reward
function, and the policies considered in the learning process, we show that both the
expected return of a policy and its gradient are Lipschitz continuous w.r.t. policy
parameters. By leveraging such properties, we define policy-parameter updates that
guarantee a performance improvement at each iteration. The proposed methods are
empirically evaluated and compared to other related approaches using different
configurations of three popular control scenarios: the linear quadratic regulator, the
mass-spring-damper system and the ship-steering control.
We present a novel probabilistic clustering model for objects that are represented via
pairwise distances and observed at different time points. The proposed method utilizes
the information given by adjacent time points to find the underlying cluster structure
and obtain a smooth cluster evolution. This approach allows the number of objects and
clusters to differ at every time point, and no identification of the objects is needed.
Further, the model does not require the number of clusters to be specified in
advance – it is instead determined automatically using a Dirichlet
process prior. We validate our model on synthetic data showing that the proposed
method is more accurate than state-of-the-art clustering methods. Finally, we use our
dynamic clustering model to analyze and illustrate the evolution of brain cancer
patients over time.
One of the biggest setbacks in traditional frequent pattern mining is that overwhelm-
ingly many of the discovered patterns are redundant. A prototypical example of such
redundancy is a freerider pattern where the pattern contains a true pattern and some
additional noise events. A technique for filtering freerider patterns that has proved to be
efficient in ranking itemsets is to use a partition model where a pattern is divided into
two subpatterns and the observed support is compared to the expected support under
the assumption that these two subpatterns occur independently. In this paper we
develop a partition model for episodes, patterns discovered from sequential data. An
episode is essentially a set of events, with possible restrictions on the order of events.
Unlike with itemset mining, computing the expected support of an episode requires
surprisingly sophisticated methods. In order to construct the model, we partition the
episode into two subepisodes. We then model how likely the events in each subepisode
occur close to each other. If this probability is high—which is often the case if the
subepisode has a high support—then we can expect that when one event from a
subepisode occurs, then the remaining events also occur close by. This approach
increases the expected support of the episode, and if this increase explains the observed
support, then we can deem the episode uninteresting. We demonstrate in our
experiments that using the partition model can effectively and efficiently reduce the
redundancy in episodes.
Soft-max Boosting
Matthieu Geist
Machine Learning
DOI: 10.1007/s10994-015-5491-2
The standard multi-class classification risk, based on the binary loss, is rarely directly
minimized. This is due to (i) the lack of convexity and (ii) the lack of smoothness (and
even continuity). The classic approach consists in minimizing instead a convex
Diffusion magnetic resonance imaging data allows reconstructing the neural pathways
of the white matter of the brain as a set of 3D polylines. This kind of data sets provides
a means of study of the anatomical structures within the white matter, in order to detect
neurologic diseases and understand the anatomical connectivity of the brain. To the
best of our knowledge, there is still no effective or satisfactory method for
automatic processing of these data. Therefore, a manually guided visual exploration of
experts is crucial for the purpose. However, because of the large size of these data sets,
visual exploration and analysis has also become intractable. In order to make use of the
advantages of both manual and automatic analysis, we have developed a new visual
data mining tool for the analysis of human brain anatomical connectivity. With such a
tool, human and automatic algorithm capabilities are integrated in an interactive data
exploration and analysis process. A very important aspect to take into account when
designing this tool was to provide the user with comfortable interaction. For this
purpose, we tackle the scalability issue in the different stages of the system, including
the automatic algorithm and the visualization and interaction techniques that are used.
Contents – Part I
Research Track
Data Preprocessing
Deep Learning
Research Track
Rich Data
Inferring Unusual Crowd Events from Mobile Phone Call Detail Records . . . 474
Yuxiao Dong, Fabio Pinelli, Yiannis Gkoufas, Zubair Nabi,
Francesco Calabrese, and Nitesh V. Chawla
Fast Inbound Top-K Query for Random Walk with Restart . . . . . . . . . . . . . 608
Chao Zhang, Shan Jiang, Yucheng Chen, Yidan Sun, and Jiawei Han
Industrial Track
Nectar Track
The Evolution of Social Relationships and Strategies Across the Lifespan . . . 245
Yuxiao Dong, Nitesh V. Chawla, Jie Tang, Yang Yang, and Yang Yang
Contents – Part III
Demo Track
Classification, Regression and Supervised Learning
Data Split Strategies for Evolving Predictive Models
V.C. Raykar and A. Saha
1 Introduction
A common data mining task is to build a good predictive model which generalizes
well on future unseen data. Based on the annotated data collected so far the goal
for a machine learning practitioner is to search for the best predictive model
(known as supervised learning) and at the same time have a reasonably good
estimate of the performance (or risk) of the model on future unseen data. It is well
known that the performance of the model on the data used to learn the model
(training set) is an overly optimistic estimate of the performance on unseen data.
For this reason it is a common practice to sequester a portion of the data to assess
the model performance and never use it during the actual model building process.
When we are in a data rich situation a conventional textbook prescription (for
example refer to Chapter 7 in [6]) is to split the data into three parts: training
set, validation set, and test set (See Figure 1). The training set is used for
model fitting, that is, estimate the parameters of the model. The validation set
is used for model selection, that is, we use the performance of the model on
the validation set to select among various competing models (e.g. should we
use a linear classifier like logistic regression or a non-linear neural network) or
to choose the hyperparameters of the model (e.g. choosing the regularization
Fig. 1. Data splits for model fitting, selection, and assessment. The training split is used
to estimate the model parameters. The validation split is used to estimate prediction
error for model selection. The test split is used to estimate the performance of the final
chosen model.
parameter for logistic regression or the number of nodes in the hidden layer for
a neural network). The test set is then used for final model assessment, that is,
to estimate the performance of the estimated model.
However in practice searching for the best predictive model is often an itera-
tive and continuous process. A major bottleneck typically encountered in many
learning tasks is to collect the data and annotate them. Due to various con-
straints (either time or financial) very often the best model based on the data
available so far is deployed in practice. At the same time the data collection and
annotation process will continue so that the model can be improved at a later
stage. Once we have reasonably enough data we refit the model to the new data
to make it more accurate and then release this new model. Sometimes after the
model has been deployed in practice we find that the model does not perform
well on a new kind of data which we do not have in our current training set. So
we redirect our efforts into collecting more data on which our model fails. The
main contribution of this paper is to discuss problems encountered and propose
various workflows to manage the allocation of newly acquired data into different
sets in such dynamic model building/updating scenarios.
With the advent of increased computing power it is very easy to come up
with a model that performs best on the validation set by searching over an
extremely large range of diverse models. This procedure can lead to non-trivial
bias (or over-fitting to the validation set) in the estimated model parameters. It
is very likely that we found the best model on the validation set by chance. The
same applies to the testing set. One way to think of this is that every time we
use the test set to estimate the performance the dataset becomes less fresh and
can increase the risk of over-fitting. The proposed data allocation workflows are
designed with a particular emphasis on avoiding this bias.
has been learnt/estimated using the training data T . Let L(y, f(x)) be the loss
function 1 for measuring errors between the target response y and the prediction
from the learnt model f(x). The (conditional) test error, also referred to as
generalization error, is the prediction error over an independent test sample,
that is, Err_T = E_(x,y)[L(y, f(x)) | T], where (x, y) are drawn randomly from their
joint distribution. Since the training set T is fixed, the test error refers to the
error obtained with this specific training set. Assessment of this test error is very
important in practice since it gives us a measure of the quality of the ultimately
chosen model (referred to as model assessment) and also guides the choice of
learning method or model (also known as model selection). Typically our model
will also have tuning parameters (for example the regularization parameter in
lasso or the number of trees in random forest) and we write our predictions as
fθ (x). The tuning parameter θ varies the complexity of our model, and we wish
to find the value of θ that minimizes the test error. The training error is the
average loss over the entire training sample, that is, err = (1/n) Σ_{i=1}^{n} L(y_i, f_θ(x_i)).
Unfortunately training error is not a good estimate of the test error. A learning
method typically adapts to the training data, and hence the training error will
be an overly optimistic estimate of the test error. Training error consistently
decreases with model complexity, typically dropping to zero if we increase the
model complexity large enough. However, a model with zero training error is
overfit to the training data and will typically generalize poorly.
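A tiny self-contained illustration of this point (ours, not taken from the paper) uses polynomial regression, with the degree playing the role of the complexity parameter: training error keeps shrinking as the degree grows, while the error on held-out data eventually worsens.

```python
# Illustrative only: training error vs. held-out error as model complexity grows.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(0.0, 0.3, x.size)          # noisy ground truth
x_tr, y_tr, x_te, y_te = x[:50], y[:50], x[50:], y[50:]   # fixed train / held-out split

for degree in (1, 3, 9, 15):                               # degree acts as the tuning parameter
    coeffs = np.polyfit(x_tr, y_tr, degree)                # fit on the training split only
    mse = lambda xs, ys: float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
    print(f"degree {degree:2d}: train MSE {mse(x_tr, y_tr):.3f}  held-out MSE {mse(x_te, y_te):.3f}")
```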
If we are in a data-rich situation, the best approach to estimate the test error
is to randomly divide the dataset into three parts [2,4,6]: a training split T , a
validation split V, and a test split U. While it is difficult to give a general rule on
the split proportions a typical split suggested in [6] is to use 50% for training, and
25% each for validation and testing (see Figure 1). The training split T is used to
fit the model (i.e. estimate the parameters of the model for a fixed set of tuning
parameters). The validation split V is used to estimate prediction error for model
selection. We use the performance on the validation split to select among various
competing models or to choose the tuning parameters of the model. The test split
U is used to estimate the performance of the final chosen model. Ideally, the test
set should be sequestered and be brought out only at the end of the data analysis.
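As a minimal sketch of this split (the function name and the use of NumPy are our own choices, not the paper's), the 50%/25%/25% partition into T, V, and U can be produced from a single random permutation of the sample indices:

```python
# Illustrative three-way split into training (T), validation (V), and test (U) indices.
import numpy as np


def three_way_split(n_samples: int, train: float = 0.5, val: float = 0.25, seed: int = 0):
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_tr = int(train * n_samples)
    n_val = int(val * n_samples)
    return idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]   # T, V, U


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))                  # 500 250 250
```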
In this paper we specifically assume that we are in a data-rich situation, that
is we have a reasonably large amount of data. In data poor situations where we
do not have the luxury of reserving a separate test set, it does not seem possible
to estimate conditional error effectively, given only the information in the same
training set. A related quantity sometimes used in data poor situations is the
expected test error Err = E[ErrT ]. While the estimation of the conditional test
error ErrT will be our goal the expected test Err is more amenable to statistical
analysis, and most methods like cross-validation [15] and bootstrap [3] effectively
estimate the expected error [6].
1 Typical loss functions include the 0-1 loss (L(y, f(x)) = I(y ≠ f(x)), where I is
the indicator function) or the log-likelihood loss for classification, and the squared
error (L(y, f(x)) = (y − f(x))²) or the absolute error (L(y, f(x)) = |y − f(x)|) for
regression problems.
hospitals. These scans are then read by expert radiologists who mark the suspi-
cious locations. Ideally we would like our model to handle all possible variations—
different scanners, acquisition protocols, hospitals, and patients. Collecting such
data is a time consuming process. Typically we can have contracts with different
hospitals to collect and process data. Each hospital has a specific kind of scanners,
acquisition protocols, and patient demographics. While most learning methods
assume that data is randomly sampled from the population, in reality, due to various
constraints, data does not arrive in a random fashion. Based on the contracts
at the end of a year we have data from, say, around five hospitals, and the data from
another hospital may arrive a year later. Based on the data from five hospitals
we can deploy a model and later update the model when we acquire the data from
the other hospital.
These kinds of issues also arise in data mining competitions, which tradition-
ally operate in a similar setup. Kaggle [1], for example, is a platform for data
prediction competitions that allows organizations to post their data and have
it scrutinized by thousands of data scientists. The training set along with the
labels is released to the public to develop the predictive model. Another set for
which the labels have been withheld is used to track the performance of the com-
petitors on a public leader board. Very often it happens that the competitors
try to overfit the model on this leader board set. For this reason only a part of
this set is used for the leader board and the remaining data is used to decide the
final rankings. An important feature of our proposed workflows is that there is
a movement of data across different sets at regular intervals and this can help
avoid the competitors trying to overfit their models to the leader board.
Fig. 2. (a) Parallel dump workflow The new data is split into three parts (according to
the ratio β : γ : 1 − (β + γ)) and directly dumped into the existing training, validation,
and test splits. (b) Serial waterfall workflow A δ3 fraction of the validation set moves
to the training set, a δ2 part of the test set moves to the validation set, and a δ1 fraction
of the new data is allocated to the test set.
learn from our mistakes in the test set. This is especially useful in scenarios
where the data mining practitioners have to go back to their drawing
boards and re-design their model because it failed on a sequestered test set.
In the next section we describe two workflows: the parallel dump (§ 4.1) and the
serial waterfall (§ 4.2), each of which can address some of these objectives. In
§ 4.3 we describe the proposed hybrid workflow which can balance all the four
objectives described above.
The most obvious method is to split the new data into three parts and directly
dump each part into the existing test, validation, and training splits as shown
in Figure 2(a). One could use a 50% training-25% validation-25% test split as
earlier or any desired ratio β : γ : 1−(β +γ). The main advantage of this method
is that we have immediate access to the new data in the training set and the
model can be improved quickly. Also a sufficient portion of the new data goes
into the validation and the test set immediately. However, once the new data is
allocated we end up using the validation and test splits one more time
(the splits are no longer as fresh as the first time), thus leading to model
selection bias. In this workflow the splits are static and there is no movement
across the splits. As a result we do not have a chance to learn from mistakes in
the validation and testing set. Generally it may not be to our advantage to let
the test split keep growing without learning from the errors the model makes
on the test set. So it makes sense to move some part of the data from the test
set to either the training or validation set.
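A minimal sketch of this parallel dump step, operating on plain Python lists of sample indices (function name, defaults, and the list representation are ours, not the paper's): the new batch is shuffled, split in the ratio β : γ : 1 − (β + γ), and appended directly to the existing training, validation, and test splits.

```python
# Illustrative parallel dump: append shares of the new batch to the existing splits.
import random


def parallel_dump(train, val, test, new_batch, beta=0.5, gamma=0.25, seed=0):
    batch = list(new_batch)
    random.Random(seed).shuffle(batch)
    n = len(batch)
    n_train, n_val = int(beta * n), int(gamma * n)
    train.extend(batch[:n_train])                       # beta fraction -> training
    val.extend(batch[n_train:n_train + n_val])          # gamma fraction -> validation
    test.extend(batch[n_train + n_val:])                # remaining 1-(beta+gamma) -> test
    return train, val, test
```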
[Fig. 3(a) schematic: a 1 − α fraction of the new data enters the serial mode and an α fraction enters the parallel mode.]
Fig. 3. (a) The hybrid workflow There is a movement of data among different sets in
the serial mode and at the same time there is an immediate fresh influx of data from
the parallel mode. (b) The continuum workflow In this setting instead of having 3 sets
(training, testing, unseen) we will in theory have a continuum of sets (with samples
trickling to the lower levels), with the lowest one for training and the topmost one to
give the most unbiased estimate of performance.
In this workflow data keeps trickling from one level to the other as illustrated
in Figure 2(b). Once new data arrives, a δ3 fraction of the validation set moves
to the training set, a δ2 part of the test set moves to the validation set, and a
δ1 fraction of the new data is allocated to the test set. The training set always
keeps getting bigger, and once data moves to the training set it stays there
forever. This mode has the following advantages: (1) the test and validation sets
are always kept fresh. This avoids overfitting due to extensive model search
since the validation and test sets are always refreshed. (2) Since part of the data
from the validation and the test set eventually moves to the training set, we have a
chance to learn from the mistakes. The disadvantage of this serial workflow is
that the new data takes some time to move to the training set, depending on
how often we refresh the existing sets. This restricts us from exploiting the new
data as quickly as possible as it takes some time for the data to trickle to the
training set.
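A corresponding sketch of one serial waterfall refresh, under the same illustrative data structure as the previous snippet (a dict mapping split names to (X, y) arrays); what to do with the portion of the new batch that is not moved to the test set is left open in the text, so the snippet simply holds it back.

import numpy as np

def serial_waterfall(splits, new_X, new_y, d1=0.5, d2=0.3, d3=0.2, rng=None):
    """One refresh: move a d3 fraction of validation to training, a d2
    fraction of test to validation, and a d1 fraction of the new data to test."""
    rng = np.random.default_rng() if rng is None else rng

    def take(X, y, frac):
        n_move = int(frac * len(y))
        perm = rng.permutation(len(y))
        mv, keep = perm[:n_move], perm[n_move:]
        return (X[mv], y[mv]), (X[keep], y[keep])

    (mv_X, mv_y), splits["val"] = take(*splits["val"], d3)
    splits["train"] = (np.vstack([splits["train"][0], mv_X]),
                       np.concatenate([splits["train"][1], mv_y]))

    (mv_X, mv_y), splits["test"] = take(*splits["test"], d2)
    splits["val"] = (np.vstack([splits["val"][0], mv_X]),
                     np.concatenate([splits["val"][1], mv_y]))

    (mv_X, mv_y), held_back = take(new_X, new_y, d1)
    splits["test"] = (np.vstack([splits["test"][0], mv_X]),
                      np.concatenate([splits["test"][1], mv_y]))
    # the rest of the new batch is held back for a later refresh (our assumption)
    return splits, held_back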
Fig. 4. The ratio M/N = 1/(k√N + 1) as a function of N for two different values of k (k = 0.1 and k = 1.0).
A key idea in the serial mode is to move data from one level to another to avoid
bias due to multiple reuse. We derive a simple rule to decide how much of the
new data has to be moved to the test set to avoid the bias due to reuse of the
test set multiple times. The same can be used to move data from the test set
to the validation set to keep the validation set fresh. This analysis is based on
ideas in [14]. Our goal is not to get an exact expression but to get the nature of
dependence on the set size. We will consider a scenario where we have a test set
of N examples. At each reuse we supplement the test set with M new examples
² The parameter α decides the proportion of the incoming data that will go into the train and the test splits. Without any prior knowledge or assumption, a reasonable value of α is a constant fixed at 0.5. But α can be further made to vary every time a new batch of data arrives. For example, consider a batch scenario where for every incoming batch of data we can compute the similarity of the current batch with the previously seen batches (for example using the Kullback-Leibler (KL) divergence). The parameter α can then be made to vary based on the estimated KL divergence.
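As a purely illustrative reading of this footnote, the following sketch estimates a KL divergence between histograms of a one-dimensional feature for the new batch and the past data and maps it to α; both the histogram-based estimate and the exponential mapping (and its direction, shrinking α for dissimilar batches) are our assumptions, since the text only says that α can vary with the estimated divergence.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions (histograms)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def adaptive_alpha(new_batch, past_data, bins=20, alpha_max=0.5, scale=1.0):
    """Shrink alpha when the new batch looks different from the past data.
    The exponential mapping below is an illustrative choice, not from the paper."""
    lo = min(new_batch.min(), past_data.min())
    hi = max(new_batch.max(), past_data.max())
    p, _ = np.histogram(new_batch, bins=bins, range=(lo, hi))
    q, _ = np.histogram(past_data, bins=bins, range=(lo, hi))
    kl = kl_divergence(p.astype(float), q.astype(float))
    return alpha_max * np.exp(-scale * kl)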
Fig. 5. (a) Complete data: negative class (1000 samples) and positive class (1000 samples). (b) Start of data collection: training (500 samples), validation (300 samples), and test (200 samples). (c) End of data collection: training (950 samples), validation (570 samples), and test (380 samples).
and remove M of the oldest examples. Specifically we will prove the following
result (see the appendix for the proof):
After each reuse, if we supplement the test set (consisting of N samples) with M > N/(k√N + 1) new samples for some small constant k, then the bias due to test set reuse can be safely ignored.
Figure 4 plots the ratio M/N as a function of N for two values of k. We need
to replace a smaller fraction of the data when N is large. Small datasets lead to
a larger bias after each use and hence need to be substantially supplemented.
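In code, the rule amounts to a one-line computation of the smallest admissible M (a sketch; k is the constant from the statement above):

import math

def min_supplement(N, k=1.0):
    """Smallest integer M with M > N / (k * sqrt(N) + 1)."""
    return math.floor(N / (k * math.sqrt(N) + 1)) + 1

# e.g. min_supplement(100) == 10 and min_supplement(10000) == 100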
We first illustrate the advantages and the disadvantages of the three differ-
ent workflows on a two-dimensional binary classification problem shown in
Figure 5(a) with data sampled from a mixture of four Gaussians. The positive
class consists of 1000 examples from two Gaussians centered at [-1,1] and [1,-1].
The negative class consists of 1000 examples from a mixture of two Gaussians
centered at [1,1] and [-1,-1] respectively. We use a multi-layer neural network
with one hidden layer, trained via backpropagation. The number of units
in the hidden layer is considered as the tuning parameter and selected using the
validation split. In order to see the effect of having a non-representative set of
data during typical model building scenarios, we consider a scenario where at
the beginning of the model building process we have collected data only from
two Gaussians (positive class centered at [-1,1] and negative class centered at
[-1,-1]) as shown in Figure 5(b). Based on the data collected so far we will use
the validation split to tune the number of hidden units in the neural network,
the training split to train the neural network via back propagation, and the test
split to compute the misclassification error of the final trained model. Figure 5(b)
shows the decision boundary obtained for the trained neural network using such
splits at the start of the model building process. At each time step we add 50
new examples into the data pool, allocate the new data according to different
workflows (either the parallel dump, the serial waterfall, or the hybrid work-
flow) and repeat the model building process. The new data does not arrive in a
random fashion. We first sample data from the Gaussian centered at [1,1] and
then the data from the remaining Gaussian centered at [1,-1]. While this may
not be a fully realistic scenario it helps us to illustrate the different workflows.
One way to think of this is to visualize that each Gaussian represents data from
a different hospital/scanner and that the data collection may not be designed such that
data arrives in a random fashion. Figure 5(c) shows the final decision boundary
obtained when all the data has been acquired. Once we have all the data, all the
different workflows reach the same performance. Here we are interested in the
model performance and our estimate of it at different stages as new data arrives.
Actual Performance of The Model. Figure 6 plots the misclassification
error at each time point (until all the data has been used) for the parallel dump
(with parameters β = 0.5 and γ = 0.3 ), the serial waterfall (with δ parameters
automatically chosen), and the hybrid workflow (with α = 0.1). The error is
computed on the entire dataset³. If we had the entire dataset, the final model
should have an error of around 0.15. It can be seen that the parallel dump
workflow exploits the new data quickly and reaches this performance in around
20 time steps. The serial waterfall moves the data to the training set slowly and
achieves the same performance in around 40 time steps. The hybrid workflow
can be considered as a compromise between these two workflows and exploits all
the new data in around 25 time steps. The sharp drops in the curve occur when
the decision boundary changes abruptly because we have now started collecting
data from a new cluster.
Estimate of The Performance of The Model. We want to exploit as much
of the new data as quickly as possible for training our final predictive model.
However at the same time we want to keep the test set fresh in order to get
the most unbiased estimate of the performance of the final model. Note that
Figure 6 plotted the test error assuming that some oracle gave us the actual
data distribution. However in practice we do not have access to this distribution
and should use the existing data collected so far (more precisely the test split)
to also get a reasonably good estimate of the test error. Figure 7(a), (b), and (c)
compares the training error, test error, and the actual error for three different
workflows. The results are averaged over 100 repetitions and the standard devi-
ation is also shown. While the test error for all the three workflows approaches
the true error when all the data has been collected we want to track how the
test error changes during any stage of the model building process. While the
³ We have shown how to select the δ parameter automatically in the serial waterfall model. The other parameters are more of a design choice and have to be chosen based on the various constraints (time, financial, etc.).
Fig. 6. Comparison of workflows. The error (computed on the entire dataset) at each time point (until all the data has been used) for the parallel dump (with parameters β = 0.5 and γ = 0.3), the serial waterfall (with δ automatically chosen), and the hybrid workflow (with α = 0.1).
parallel dump workflow (see Figure 7(a)) gave us a predictive model quickly, the test error is highly optimistic (and close to the training error) and is close to the true error only at the end when we have access to all the data. The test error for the serial waterfall workflow (see Figure 7(b)) does not track the training error and is a better reflection of the risk of the model than the parallel dump workflow. These
variations in the test error can be explained because the composition of the test
set is continuously changed at each time step and it takes some time for the new
data to finally reach the training set. However the serial waterfall workflow can
be overly pessimistic and the proposed hybrid workflow (see Figure 7(c)) can be
a good compromise between the two—it can give a good model reasonably fast
and at the same time produce a reasonably good estimate of the performance
of the model. By varying the parameter α one can obtain a desired compromise
between the serial and the parallel workflow. Figure 7(d) compares the hybrid
workflow for two different values of the split parameters α = 0.1 and α = 0.5
with α = 0 corresponding to the serial workflow and α = 1 corresponding to the
parallel workflow. The value of the parameter α is more of a design choice and can be
chosen based on the domain and various constraints.
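A sketch of how the hybrid allocation could be composed from the two earlier snippets (parallel_dump and serial_waterfall are the hypothetical helpers defined above; the α-split of the incoming batch follows Figure 3(a)):

import numpy as np

def hybrid_workflow(splits, new_X, new_y, alpha=0.1, rng=None):
    """Send an alpha fraction of the new batch through the parallel dump and
    the remaining 1 - alpha fraction through the serial waterfall."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(len(new_y))
    n_par = int(alpha * len(new_y))
    par, ser = perm[:n_par], perm[n_par:]
    splits = parallel_dump(splits, new_X[par], new_y[par], rng=rng)
    splits, _held = serial_waterfall(splits, new_X[ser], new_y[ser], rng=rng)
    return splits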
Fig. 7. The estimated training error, test error, and the actual error (oracle) for (a)
the parallel dump, (b) the serial waterfall and (c) the hybrid workflow. (d) The effect
of varying the split parameter α.
of sentences one can construct various features which quantify the dissimilarity between two sentences. One of the most important sets of features is based on machine translation (MT) metrics. For example, the BLEU score [11], which measures n-gram overlap and is a widely used evaluation metric for MT systems, is an important feature for paraphrase detection. In our system we used a total of 14 such features and trained a binary decision tree classifier on the labeled data.
To collect the labeled data we show a pair of sentences to three in-house annotators and ask them to label the pair as either semantically equivalent or not. The sentences were taken from Wikipedia articles corresponding to a specified topic. Due to the overall project design the labeling proceeded one topic at a time. We currently have a labeled dataset of 715 sentence pairs from a total of 6 topics (56, 76, 88, 108, 140, and 247 sentence pairs for the six topics, respectively), of which 112 were annotated as semantically equivalent by a majority of the annotators. We analyse a situation where the data arrives one topic at a time. The new data is allocated into train, validation, and test splits according to the parallel dump, the serial waterfall, and the hybrid workflow. At each round a binary decision tree is trained using the train split, the decision tree parameters are chosen using the validation split, and the model performance is assessed using the test split.
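A minimal sketch of one such round, assuming scikit-learn, a precomputed feature matrix with one row of 14 MT-metric features per sentence pair, and the hypothetical parallel_dump helper sketched earlier; the depth grid used for tuning is illustrative.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def run_round(splits, topic_X, topic_y, max_depth_grid=(2, 3, 4, 5)):
    """Allocate one topic's sentence pairs, then train/tune/evaluate a tree."""
    splits = parallel_dump(splits, topic_X, topic_y)      # or serial/hybrid instead
    (Xtr, ytr), (Xva, yva), (Xte, yte) = (splits[s] for s in ("train", "val", "test"))

    best_tree, best_acc = None, -1.0
    for depth in max_depth_grid:                          # tune on the validation split
        tree = DecisionTreeClassifier(max_depth=depth).fit(Xtr, ytr)
        acc = accuracy_score(yva, tree.predict(Xva))
        if acc > best_acc:
            best_tree, best_acc = tree, acc

    test_error = 1.0 - accuracy_score(yte, best_tree.predict(Xte))
    return splits, best_tree, test_error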
Fig. 8. Paraphrase detection (see § 7). The misclassification error at each round (until all data has been used) on the oracle, train, and test splits for (a) the parallel dump (with parameters β = 0.6 and γ = 0.2), (b) the serial waterfall (with δ1 = 0.5, δ2 = 0.3, and δ3 = 0.2), and (c) the hybrid workflow (with α = 0.5). The oracle is a surrogate for the true model performance, evaluated on 30% of the entire original data.
Figure 8 shows the misclassification error on the train and the test splits as a function of the number of rounds; here each round refers to a point in time when we acquire new labeled data and data reallocation/movement happens. The results are averaged over 50 replications, where for each replication the order of the topics is randomly permuted. We would like to see how close the model performance assessed using the test split is to the true model performance on the entire data (which we call the oracle). Since we do not have access to the true data distribution we sequester 30% of the original data (which includes data from all topics) and use this as a surrogate for the true performance. The following observations can be made (see Figure 8): (1) The test error for all the three workflows approaches the true oracle error at steady state when a large amount of data has been collected. (2) However, in the early stages the performance of the model on the test split as assessed by the parallel workflow (Figure 8(a)) is relatively optimistic while that of the serial workflow (Figure 8(b)) is highly pessimistic. (3) The proposed hybrid workflow (see Figure 8(c)) estimates the test error much closer to the oracle error.
8 Related Work
There is not much related work in this area in the machine learning literature.
Most earlier research has focussed on settings with either unlimited data or finite
fixed data, while this paper proposes data flow strategies for finite but growing
datasets. The bias due to repeated use of the test set has been pointed out in
a few papers in the cross-validation setting [10,13]. However, the main focus of this paper is on data-rich situations where we estimate the prediction capability of a classifier on an independent test data set (called the conditional test error). In data-poor situations techniques like cross-validation [7] and bootstrapping [3] are widely used as a surrogate for this, but they can only estimate the expected
test error (averaged over multiple training sets).
There is a rich literature in the area of learning under concept drift [5] and
dataset shifts [12]. Concept drift primarily refers to an online supervised learn-
ing scenario when the relation between the input data and the target variable
changes over time. Dataset shift is a common problem in predictive modeling that
occurs when the joint distribution of inputs and outputs differs between training
and test stages. Covariate shift, a particular case of dataset shift, occurs when
only the input distribution changes. The other kinds of concept drift are prior probability shift, where the distribution over the true label y changes; sample selection bias, where the data distribution varies over time because of an unknown sample rejection bias; and source component shift, where the data stream can be thought of as originating from different unknown sources at different time points.
Various strategies have been proposed to correct for these shifts in test dis-
tribution [5,12]. In general the field of concept drift seeks to develop shift-aware
models that can capture these specific types of variations or a combination of
the different modes of variations or do model selection to assess whether dataset
shift is an issue in particular circumstances [9]. In our current work the focus is to investigate different kinds of dynamic workflows for assigning data into train, test, and validation splits to reduce the effect of bias in the scenario of a time-shifting dataset. In our setting the drift arises as a consequence of the data arriving in a non-i.i.d. fashion. Hence the present work mainly deals with source component shift of the datasets: for example, in the medical domain, where a classifier is built from training data obtained from various hospitals, each hospital may have a different machine with a different bias, each producing different ranges of values for the covariate x and possibly even the true label y.
We are mainly concerned with allocating the new data to the existing splits
and not with modifying any particular learning method to account for the shifts.
One of our motivations was not to make the allocation strategies model or data
distribution specific. We wanted to come up with strategies that can be used
with any model or data distribution. The main contribution of our work is to
propose these strategies and empirically analyze them. These kinds of strategies
have not been discussed in the concept drift literature.
9 Conclusions
We analysed three workflows for allocating new data into the existing training,
validation, and test splits. The parallel dump workflow splits the data into three
parts and directly dumps them into the existing splits. While it can exploit the
new data quickly to build a good model the estimate of the model performance
is optimistic especially when the new data does not arrive in a random fashion.
The serial waterfall workflow which trickles the data from one level to another
avoids this problem by keeping the test set fresh and prevents the bias due to
multiple reuse of the test set. However it takes a long time for the new data to
reach the training split. The proposed hybrid workflow which balances both the
workflows seems to be a good compromise—it can give a good model reasonably
fast and at the same time produce a reasonably good estimate of the model
performance.
where y_i is the true response, y_i^j = f_j(x_i) is the response predicted by the learnt model after the test set has been reused j times, and L is the loss function used
to measure the error. If the test set is used multiple times the final model will
overfit to the testing set. In other words the performance of the model on the
test set will be biased and will not reflect the true performance of the model.
Let Biasmax (Ej ) be the maximum possible bias in the estimate of the error
Ej caused due to multiple reuse of the test set. The first time the test set is
used the bias is zero, i.e., Biasmax (E1 ) = 0. Every subsequent use increases the
bias. The worst case scenario (perhaps the easiest for the developer) is when the developer directly observes the predictions y_i^j and learns a model on these predictions to match the desired response y_i for all examples in the test set. Since we have N examples in the test set, by reusing the test set N + 1 times one can actually drive the error E_{N+1} to 0, since we will have N unknowns to estimate (y_1, . . . , y_N) and N new tests. Hence Bias_max(E_{N+1}) = E_1. We will further
assume that at each test we lose one degree of freedom and hence approximate

\mathrm{Bias}_{\max}(E_j) = \frac{j-1}{N} E_1, \quad j = 1, \ldots, N.   (2)
Supplement the Test Set to Avoid the Bias: In order to avoid this bias our
strategy is to supplement the test set with M new examples and move the oldest
M examples to the lower validation set. We do this every time we reuse the test
set. If we keep doing this then in the long run we will have a set of N examples
which has been reused N/M times. Hence if we supplement the test set with M
examples at each reuse, the bias in the test set at steady state will be

\mathrm{Bias}_{\max}(E_\infty) = \frac{N-M}{NM} E_1,   (3)
where E∞ is the error in this steady state scenario.
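For completeness, one way to recover the rule quoted in the main text from eq. (3) is to require the steady-state bias to stay below (k/√N) E_1, i.e., below the scale of the statistical fluctuation of the error estimate itself; this particular threshold is our assumption, since the final step of the argument is not reproduced here:

\mathrm{Bias}_{\max}(E_\infty) = \frac{N-M}{NM}\, E_1 \;\le\; \frac{k}{\sqrt{N}}\, E_1
\;\Longleftrightarrow\; \sqrt{N}\,(N-M) \;\le\; k N M
\;\Longleftrightarrow\; M \;\ge\; \frac{N}{k\sqrt{N}+1}.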
References
1. Kaggle. www.kaggle.com
2. Chatfield, C.: Model uncertainty, data mining and statistical inference. Journal
of the Royal Statistical Society. Series A (Statistics in Society) 158(3), 419–466
(1995)
3. Efron, B., Tibshirani, R.: An introduction to the bootstrap. Chapman and Hall
(1993)
4. Faraway, J.: Data splitting strategies for reducing the effect of model selection on
inference. Computing Science and Statistics 30, 332–341 (1998)
5. Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on
concept drift adaptation. ACM Computing Surveys 46(4) (2014)
6. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Series in Statistics (2009)
7. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and
model selection. IJCAI 14, 1137–1145 (1995)
8. Madnani, N., Tetreault, J., Chodorow, M.: Re-examining machine translation met-
rics for paraphrase identification. In: Proceedings of 2012 Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL
2012), pp. 182–190 (2012)
9. Mohri, M., Rostamizadeh, A.: Stability bounds for non-i.i.d. processes. In:
Advances in Neural Information Processing Systems, vol. 20, pp. 1025–1032 (2008)
10. Ng, A.Y.: Preventing overfitting of crossvalidation data. In: Proceedings of the
14th International Conference on Machine Learning, pp. 245–253 (1997)
11. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pp. 311–318 (2002)
12. Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.):
Dataset Shift in Machine Learning. Neural Information Processing series. MIT
Press (2008)
13. Rao, R.B., Fung, G.: On the dangers of cross-validation. An experimental evalua-
tion. In: Proceedings of the SIAM Conference on Data Mining, pp. 588–596 (2008)
14. Samuelson, F.: Supplementing a validation test sample. In: 2009 Joint Statistical
Meetings (2009)
15. Stone, M.: Asymptotics for and against crossvalidation. Biometrika 64, 29–35
(1977)
Discriminative Interpolation for Classification
of Functional Data
1 Introduction
The choice of data representation is foundational to all supervised and unsuper-
vised analysis. The de facto standard in machine learning is to use feature repre-
sentations that treat data as n-dimensional vectors in Euclidean space and then
This work is partially supported by NSF IIS 1065081 and 1263011.
algorithms for real-valued functions over the real line expanded in wavelet bases
with extensions to higher dimensions following in a straightforward manner.
The motivation for a competing push-pull framework is built on several recent
successful efforts in metric learning [23,24,26,29] where Mahalanobis-style met-
rics are learned by incorporating optimization terms that promote purity of local
neighborhoods. The metric is learned using a kNN approach that is locally sen-
sitive to the k neighbors around each exemplar and optimized such that
the metric moves data with the same class labels closer in proximity (pulling
in good neighbors) and neighbors with differing labels from the exemplar are
moved out of the neighborhood (pushing away bad neighbors). As these are
kNN approaches, they can inherently handle nonlinear classification tasks and
have been proven to work well in many situations [16] due to their ability to con-
textualize learning through the use of neighborhoods. In these previous efforts,
the metric learning framework is well-suited for feature vectors in Rn . In our
current situation of working with functional data, we propose an analogue that
allows the warping of the function data to visually resemble others in the same
class (within a k-neighborhood) and penalize similarity to bad neighbors from
other classes. We call this gerrymandered morphing of functions based on local
neighborhood label characteristics discriminative interpolation. Figure 1 illus-
trates this neighborhood-based, supervised deforming on a three-class problem.
In Fig. 1(a), we see the original training curves from three classes, colored
magenta, green, and blue; each class has been sub-sampled to 30 curves for
display purposes. Notice the high variability among the classes which can lead
to misclassifications. Fig. 1(b) shows the effects of CDI post training. Now the
curves in each class more closely resemble each other, and ultimately, this leads
to better classification of test curves.
Learning and generalization properties have yet to be worked out for the CDI
framework. Here, we make a few qualitative comments to aid better understanding
of the formulation. Clearly, we can overfit during the training stage by forcing
the basis representation of all functions belonging to the same class to be iden-
tical. During the testing stage, in this overfitting regime, training is irrelevant
and the method devolves into a nearest neighbor strategy (using function dis-
tances). Likewise, we can underfit during the training stage by forcing the basis
representation of each function to be without error (or residual). During testing,
in this underfitting regime, the basis representation coefficients are likely to be
far (in terms of a suitable distance measure on the coefficients) from members in
each class since no effort was made during training to conform to any class. We
think a happy medium exists where the basis coefficients for each training stage
function strike a reasonable compromise between overfitting and underfitting—
or in other words, try to reconstruct the original function to some extent while
simultaneously attempting to draw closer to nearest neighbors in each class. Sim-
ilarly, during testing, the classifier fits testing stage function coefficients while
attempting to place the function pattern close to nearest neighbors in each class
with the eventual class label assigned to that class with the smallest compro-
mise value. From an overall perspective, CDI marries function reconstruction
with neighborhood gerrymandering (with the latter concept explained above).
To the best of our knowledge, this is the first effort to develop an FDA approach that leverages function properties to achieve kNN-margin-based learning
in a fully multi-class framework. Below, we begin by briefly covering the req-
uisite background on function spaces and wavelets (Section 2). Related works
are detailed in Section 3. This is followed by the derivation of the proposed
CDI framework in Section 4. Section 5 demonstrates extensive experimental
validations on several functional datasets and shows our method to be compet-
itive with other functional and feature-vector based algorithms—in many cases,
demonstrating the highest performance measures to date. The article concludes
with Section 6 where recommendations and future extensions are discussed.
Most FDA techniques are developed under the assumption that the given set of
labeled functional data can be suitably approximated by and represented in an
infinite dimensional Hilbert space H. Ubiquitous examples of H include the space
of square-integrable functions L2 ([a, b] ⊂ R) and square-summable series l2 (Z).
This premise allows us to transition the analysis from the functions themselves
to the coefficients of their basis expansion. Moving to a basis expansion also
allows us to seamlessly handle irregularly sampled functions, missing data, and
interpolate functions (the most important property for our approach). We now
provide a brief exposition of working in the setting of H. The reader is referred
to many suitable functional analysis references [17] for further details.
The infinite dimensional representation of f ∈ H comes from the fact that
there are a countably infinite number of basis vectors required to produce an
exact representation of f , i.e.
f(t) = \sum_{l=1}^{\infty} \alpha_l \phi_l(t)   (1)

\min_{\{\alpha_l\}} \sum_{i=1}^{m} \Big( f(t_i) - \sum_{l=1}^{d} \alpha_l \phi_l(t_i) \Big)^2

or in matrix form

\min_{\alpha} \| f - \phi \alpha \|^2,   (2)
f(t) = \sum_{k} \alpha_{j_0,k} \phi_{j_0,k}(t) + \sum_{j=j_0}^{\infty} \sum_{k} \beta_{j,k} \psi_{j,k}(t)   (5)
where t ∈ R, φ(x) and ψ(x) are the scaling (a.k.a. father) and wavelet (a.k.a.
mother) basis functions respectively, and αj0 ,k and βj,k are scaling and wavelet
basis function coefficients; the j-index represents the current resolution level
and the k-index the integer translation value. (The translation range of k can be
computed from the span of the data and basis function support size at the differ-
ent resolution levels. Also, when there is no need to distinguish between scaling
or wavelet coefficients, we simply let c = [α, β]^T.) The linear combination in
eq. (5) is known as a multiresolution expansion. The key idea behind multires-
olution theory is the existence of a sequence of nested subspaces V_j, j ∈ Z, such that

\cdots V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \cdots   (6)

and which satisfy the properties \bigcap_j V_j = \{0\} and \overline{\bigcup_j V_j} = L^2 (completeness). The
resolution increases as j → ∞ and decreases as j → −∞ (some references show
this order reversed due to the fact they invert the scale [7]). At any particular
level j + 1, we have the following relationship

V_j \oplus W_j = V_{j+1}   (7)
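To make eq. (2) concrete, here is a minimal NumPy sketch of fitting truncated basis coefficients to an irregularly sampled function by least squares; a cosine basis stands in for the wavelet basis purely for brevity.

import numpy as np

def fit_coefficients(t, f, basis_fns):
    """Least-squares fit of eq. (2): minimise ||f - Phi alpha||^2, where
    Phi[i, l] = phi_l(t_i) for the (possibly irregular) sample points t_i."""
    Phi = np.column_stack([phi(t) for phi in basis_fns])   # m x d design matrix
    alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)
    return alpha, Phi

# Example with a cosine basis standing in for a wavelet basis:
t = np.sort(np.random.rand(50))                  # irregular sample points in [0, 1]
f = np.sin(2 * np.pi * t) + 0.1 * np.random.randn(50)
basis = [(lambda x, l=l: np.cos(np.pi * l * x)) for l in range(8)]
alpha, Phi = fit_coefficients(t, f, basis)
reconstruction = Phi @ alpha                     # interpolant at the sample points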
3 Related Work
Several authors have previously realized the importance of taking the functional
aspect of the data into account. These include representing the data in a different
basis such as B-Splines [1], Fourier [6] and wavelets [3,4] or utilizing the contin-
uous aspects [2] and differentiability of the functional data [18]. Abraham et al.
[1] fit the data using B-Splines and then use the K-means algorithm to cluster
the data. Biau et al. [6] reduce the infinite dimension of the space of functions
to the first d dimensional coefficients of a Fourier series expansion of each func-
tion and then they apply a k-nearest neighborhood for classification. Antoniadis
et al. [3] use the wavelet transform to detect clusters in high dimensional func-
tional data. They present two different measures, a similarity and a dissimilarity,
that are then used to apply the k-centroid clustering algorithm. The similarity
measure is based on the distribution of energy across scales, generating a number of features, and the dissimilarity measure is based on wavelet-coherence tools.
In [4], Berlinet et al. expand the observations on a wavelet basis and then apply
a classification rule on the non-zero coefficients that is not restricted to the
k-nearest neighbors. Alonso et al. [2] introduce a classification technique that
uses a weighted distance that utilizes the first, second and third derivative of the
functional data. López-Pintado and Romo [18] propose two depth-based classifi-
cation techniques that take into account the continuous aspect of the functional
data. Though these previous attempts have demonstrated the utility of basis
expansions and other machine learning techniques on functional data, none of
them formulate a neighborhood, margin-based learning technique as proposed
by our current CDI framework.
In the literature, many attempts have been made to find the best neigh-
borhood and/or define a good metric to get better classification results
[10,19,24,26,29]. Large Margin Nearest Neighbor (LMNN) [8,26] is a locally
adaptive metric classification method that uses margin maximization to esti-
mate a local flexible metric. The main intuition of LMNN is to learn a metric
such that at least k of its closest neighbors are from the same class. It pre-defines
k neighbors and identifies them as target neighbors or impostors—the same class
or different class neighbors respectively. It aims at readjusting the margin around
the data such that the impostors are outside that margin and the k data points
inside the margin are of the same class. Prekopcsák and Lemire [19] classify time series data by learning a Mahalanobis distance, by taking the pseudoinverse of the covariance matrix, limiting the Mahalanobis matrix to a diagonal matrix, or by applying covariance shrinkage. They claim that these metric learning techniques
are comparable or even better than LMNN and Dynamic Time Warping (DTW)
[5] when one nearest neighbor is used to classify functional data. We show in the
experiments in Section 5 that our CDI method performs better. In [24], Trivedi
et al. introduce a metric learning algorithm by selecting the neighborhood based
on a gerrymandering concept; redefining the margins such that the majority of
the nearest neighbors are of the same class. Unlike many other algorithms, in
[24], the choice of neighbors is a latent variable which is learned at the same time
as it is learning the optimal metric—the metric that gives the best classification
accuracy. These neighborhood-based approaches are pioneering approaches that
inherently incorporate context (via the neighbor relationships) while remaining
competitive with more established techniques like SVMs [25] or deep learning
[13]. Building on their successes, we adapt a similar concept of redefining the class
margins through pushing away impostors and pulling target neighbors closer.
In this section, we introduce the CDI method for classifying functional data.
Before embarking on the detailed development of the formulation, we first pro-
vide an overall sketch.
In the training stage, the principal task of discriminative interpolation is
to learn a basis representation of all training set functions while simultaneously
pulling each function representation toward a set of nearest neighbors in the same
class and pushing each function representation away from nearest neighbors in
the other classes. This can be abstractly characterized as
E_{CDI}(\{c^i\}) = \sum_i \Big[ D_{rep}(f^i, \phi c^i) + \lambda \sum_{j \in N(i)} D_{pull}(c^i, c^j) - \mu \sum_{k \in M(i)} D_{push}(c^i, c^k) \Big]   (8)
where Drep is the representation error between the actual data (training set
function sample) and its basis representation, Dpull is the distance between the
coefficients of the basis representation in the same class and Dpush the distance
between coefficients in different classes (with the latter two distances often cho-
sen to be the same). The parameters λ and μ weigh the pull and push terms
respectively. N (i) and M(i) are the sets of nearest neighbors in the same class
and different classes respectively.
Upon completion of training, the functions belonging to each class have been
discriminatively interpolated such that they are more similar to their neighbors
in the same class. This contextualized representation is reflected by the coeffi-
cients of the wavelet basis for each of the curves. We now turn our attention to
classifying incoming test curves. Our labeling strategy focuses on selecting the
class that is able to best represent the test curve under the pulling influence of
its nearest neighbors in the same class. In the testing stage, the principal task
of discriminative interpolation is to learn a basis representation for just the test
set function while simultaneously pulling the function representation toward a
set of nearest neighbors in the chosen class. This procedure is repeated for all
classes with class assignment performed by picking that class which has the low-
est compromise between the basis representation and pull distances. This can
be abstractly characterized as
\hat{a} = \arg\min_a \min_c \Big[ D_{rep}(f, \phi c) + \lambda \sum_{k \in K(a)} D_{pull}(c, c^{k(a)}) \Big]   (9)
where K(a) is the set of nearest neighbors in class a of the incoming test pattern’s
coefficient vector c. Note the absence of the push mechanism during testing.
Further, note that we have to solve an optimization problem during the testing
stage since the basis representation of each function is a priori unknown (for
both training and testing).
where \{\phi_d\}_{d=1}^{\infty} form a complete, orthonormal system of H. As mentioned in Section 2, we approximate the discretized data f^i = \{f^i(t_j)\}_{1 \le j \le m} in a d-dimensional space. Let c^i = [c^i_1, c^i_2, \cdots, c^i_d]^T be the d × 1 vector of coefficients
6. c^i ← c^i − η ∇E^i_{interp}
7. End For
8. Until convergence
(* Obtained via cross validation (CV).)
\min_{c} \sum_{i=1}^{N} \Big[ \| f^i - \phi c^i \|^2 - \frac{\lambda}{\beta} \log \sum_{j \,\text{s.t.}\, y_i = y_j} e^{-\beta \| c^i - c^j \|^2} + \frac{\mu}{\beta} \log \sum_{r \,\text{s.t.}\, y_i \neq y_r} e^{-\beta \| c^i - c^r \|^2} \Big]   (16)
where β is a free parameter deciding the degree of membership. This will allow
curves to have a weighted vote in the CDI of f i (for example).
An update equation for the objective in eq. 16 can be found by taking the
gradient with respect to ci , which yields
\frac{\partial E}{\partial c^i} = -2 \Big( \phi^T f^i - \phi^T \phi c^i - \lambda \sum_{j \,\text{s.t.}\, y_i = y_j} M_{ij} (c^i - c^j) + \lambda \sum_{k \,\text{s.t.}\, y_k = y_i} M_{ki} (c^k - c^i) + \mu \sum_{r \,\text{s.t.}\, y_i \neq y_r} M_{ir} (c^i - c^r) - \mu \sum_{s \,\text{s.t.}\, y_s \neq y_i} M_{si} (c^s - c^i) \Big)   (17)
where

M_{ij} = \frac{\exp(-\beta \| c^i - c^j \|^2)}{\sum_{t \,\text{s.t.}\, y_i = y_t} \exp(-\beta \| c^i - c^t \|^2)}.   (18)
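Putting eqs. (17)-(18) together, here is a NumPy sketch of one batch gradient computation over all training coefficient vectors; the push memberships are assumed to be normalised over the opposite-class terms (mirroring eq. (18)), and excluding self-matches from the normalisation is a sketch choice rather than the authors' specification.

import numpy as np

def cdi_train_gradient(C, F, Phi, y, lam, mu, beta):
    """Gradient of the CDI training objective w.r.t. every c^i (eqs. (17)-(18)).
    C: (N, d) coefficient vectors, F: (N, m) sampled curves, Phi: (m, d) basis matrix."""
    N = C.shape[0]
    sq = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)            # ||c^i - c^j||^2
    W = np.exp(-beta * sq)
    same = (y[:, None] == y[None, :])
    np.fill_diagonal(same, False)                                  # exclude self-matches
    diff = ~(y[:, None] == y[None, :])
    M_pull = W * same / ((W * same).sum(1, keepdims=True) + 1e-12) # eq. (18)
    M_push = W * diff / ((W * diff).sum(1, keepdims=True) + 1e-12) # assumed analogue

    G = np.zeros_like(C)
    for i in range(N):
        pull = (M_pull[i, :, None] * (C[i] - C)).sum(0)            # sum_j M_ij (c^i - c^j)
        pull_rev = (M_pull[:, i, None] * (C - C[i])).sum(0)        # sum_k M_ki (c^k - c^i)
        push = (M_push[i, :, None] * (C[i] - C)).sum(0)            # sum_r M_ir (c^i - c^r)
        push_rev = (M_push[:, i, None] * (C - C[i])).sum(0)        # sum_s M_si (c^s - c^i)
        G[i] = -2 * (Phi.T @ F[i] - Phi.T @ Phi @ C[i]
                     - lam * pull + lam * pull_rev
                     + mu * push - mu * push_rev)                  # eq. (17)
    return G

# one gradient step over all training curves:  C -= eta * cdi_train_gradient(C, F, Phi, y, lam, mu, beta)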
We have estimated a set of coefficients which in turn give us the best approximation
to the training curves in a discriminative setting. We now turn to the testing stage.
In contrast to the feature vector-based classifiers, this stage is not straightforward.
When a test function appears, we don’t know its wavelet coefficients. In order to
determine the best set of coefficients for each test function, we minimize an objec-
tive function which is very similar to the training stage objective function. To test
membership in each class, we minimize the sum of the wavelet reconstruction error
and a suitable distance between the unknown coefficients and its nearest neigh-
bors in the chosen class. The test function is assigned to the class that yields the
minimum value of the objective function. This overall testing procedure is
formalized as
\arg\min_a \min_{\tilde{c}} E^a(\tilde{c}) = \arg\min_a \Big( \min_{\tilde{c}} \| \tilde{f} - \phi \tilde{c} \|^2 + \lambda \sum_{i \,\text{s.t.}\, y_i = a} M_i^a \| \tilde{c} - c^i \|^2 \Big)   (19)
where \tilde{f} is the test set function and \tilde{c} is its vector of reconstruction coefficients. M_i^a is the nearest-neighbor membership over the set of class a patterns. As before, the membership can be “integrated out” to get
E^a(\tilde{c}) = \| \tilde{f} - \phi \tilde{c} \|^2 - \frac{\lambda}{\beta} \log \sum_{i \,\text{s.t.}\, y_i = a} \exp(-\beta \| \tilde{c} - c^i \|^2).   (20)
This objective function can be minimized using methods similar to those used
during training. The testing stage algorithm comprises the following steps.
1. Solve \Gamma(a) = \min_{\tilde{c}} E^a(\tilde{c}) for every class a using the objective function gradient

\frac{\partial E^a}{\partial \tilde{c}} = -2 \phi^T \tilde{f} + 2 \phi^T \phi \tilde{c} + \lambda \sum_{i \,\text{s.t.}\, y_i = a} M_i^a (2\tilde{c} - 2 c^i)   (21)
where

M_i^a = \frac{\exp(-\beta \| \tilde{c} - c^i \|^2)}{\sum_{j \in N(a)} \exp(-\beta \| \tilde{c} - c^j \|^2)}.   (22)
2. Assign the label ỹ to f̃ by finding the class with the smallest value of Γ (a),
namely arg mina (Γ (a)).
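A compact NumPy sketch of this testing procedure; the initialisation from a plain least-squares fit, the fixed step size, and the iteration count are illustrative choices rather than the authors' settings.

import numpy as np

def cdi_classify(f_new, Phi, C, y, lam, beta, eta=0.01, n_steps=200):
    """For each class a, minimise E^a(c~) of eq. (20) by gradient descent
    (eqs. (21)-(22)) and return the class with the smallest Gamma(a)."""
    scores = {}
    for a in np.unique(y):
        Ca = C[y == a]                                    # training coefficients of class a
        c = np.linalg.lstsq(Phi, f_new, rcond=None)[0]    # initialise from a plain fit
        for _ in range(n_steps):
            w = np.exp(-beta * ((c - Ca) ** 2).sum(1))
            M = w / (w.sum() + 1e-12)                     # eq. (22)
            grad = (-2 * Phi.T @ f_new + 2 * Phi.T @ Phi @ c
                    + lam * (M[:, None] * (2 * c - 2 * Ca)).sum(0))   # eq. (21)
            c -= eta * grad
        w = np.exp(-beta * ((c - Ca) ** 2).sum(1))
        scores[a] = ((f_new - Phi @ c) ** 2).sum() - (lam / beta) * np.log(w.sum() + 1e-12)
    return min(scores, key=scores.get)                    # arg min_a Gamma(a)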
5 Experiments
In this section, we discuss the performance of the CDI algorithm using publicly
available functional datasets, also known as time series datasets from the “UCR
Time Series Data Mining Archive” [15]. The multi-class datasets are divided into
training and testing sets with detailed information such as the number of classes,
number of curves in each of the testing sets and training sets and the length
of the curves shown in Table 1. The datasets that we have chosen to run the experiments on range from 2 classes (the Gun Point dataset) up to 37 classes (the ADIAC dataset). Learning is also exercised under a considerable
mix of balanced and unbalanced classes, and minimal training versus testing
exemplars, all designed to rigorously validate the generalization capabilities of
our approach.
For comparison against competing techniques, we selected four other leading
methods based on reported results on the selected datasets. Three out of the
four algorithms are classification techniques based on support vector machines
(SVM) with extensions to Dynamic Time Warping (DTW). DTW has been shown to be a very promising similarity measure for functional data, supporting warping of functions to determine closeness. Gudmundsson et al. [11] demon-
strate the feasibility of the DTW approach to get a positive semi-definite kernel
for classification with SVM. The approach in Zhang et al. [28] is one of many that
use a vectorized method to classify functional data instead of using functional
properties of the dataset. They develop several kernels for SVM known as elastic kernels, the Gaussian elastic metric kernel (GEMK) to be exact, and introduce
several extensions to GEMK with different measurements. In [14], another SVM
Table 1. Functional Datasets. Datasets contain a good mix of multiple classes, class
imbalances, and varying number of training versus testing curves.
Dataset             λ      μ      |N|   |M|
Synthetic Control   2      0.01   10    5
Gun Point           0.1    0.001  7     2
ADIAC               0.8    0.012  5     5
Swedish Leaf        0.1    0.009  7     2
ECG200              0.9    0.01   3     3
Yoga                0.08   0.002  2     2
Coffee              0.1    0.005  1     1
Olive Oil           0.1    0.005  1     1
classification technique with a DTW kernel is employed but this time a weight is
added to the kernel to provide more flexibility and robustness to the kernel func-
tion. Prekopcsák et al. [19] do not utilize the functional properties of the data.
Instead they learn a Mahalanobis metric followed by standard nearest neighbors.
For brevity, we have assigned the following abbreviations to these techniques:
SD [11], SG [28], SW [14], and MD [19]. In addition to these published works,
we also evaluated a standard kNN approach directly on the wavelet coefficients
obtained from eq. 2, i.e. direct representation of functions in a wavelet basis
without neighborhood gerrymandering. This was done so that we can compre-
hensively evaluate if the neighborhood adaptation aspect of CDI truly impacted
generalization (with this approach abbreviated as kNN).
A k-fold cross-validation is performed on the training datasets to find the
optimal values for each of our free parameters (λ and μ being the most promi-
nent). Since the datasets were first standardized (mean subtraction, followed by
standard deviation normalization), the free parameters λ and μ became more
uniform across all the datasets. λ ranges over (0.05, 2.0) while μ ranges over (10^{-3}, 0.01). Table 2 has detailed information on the optimal parameters found for each of the datasets. In all our experiments, β is set to 1 and Daubechies 4 (DB4) at j_0 = 0 was used as the wavelet basis (i.e., only scaling functions used).
We presently do not investigate the effects of β on the classification accuracy
as we perform well with it set at unity. The comprehensive results are given in
Table 3, with the error percentage being calculated per the usual:
\text{Error} = 100 \times \frac{\#\text{ of misclassified curves}}{\text{Total number of curves}}.   (23)
The experiments show very promising results for the proposed CDI method
in comparison with the other algorithms, with our error rates as good or better in
most datasets. CDI performs best on the ADIAC dataset compared to the other
techniques, with an order of magnitude improvement over the current state-of-
the-art. This is a particularly difficult dataset having 37 classes where the class
sizes are very small, only ranging from 4 to 13 curves. Figure 2(a) illustrates all
original curves from the 37 classes which are very similar to each other. Hav-
ing many classes with only a few training samples in each presents a significant
Fig. 2. ADIAC dataset 37 classes - 300 Curves. The proposed CDI method is an order
of magnitude better than the best reported competitor. Original curves in (a) are
uniquely colored by class. The curves in (b) are more similar to their respective classes
and are smoothed by the neighborhood regularization—unique properties of the CDI.
Table 3. Classification Errors. The proposed CDI method achieves state-of-the-art per-
formance on half of the datasets, and is competitive in almost all others. The ECG200
exception, with kNN outperforming everyone, is discussed in the text.
classification challenge and correlates with why the competing techniques have a
high classification error. In Figure 2(b), we show how CDI brings curves within
the same class together making them more “pure” (such as the orange curves)
while also managing to separate classes from each other. Regularized, discrimi-
native learning also has the added benefit of smoothing the functions. The com-
peting SVM-based approaches suffer in accuracy due to the heavily unbalanced
classes. In some datasets (e.g. Swedish Leaf or Yoga) where we are not the leader,
we are competitive with the others. Comparison with the standard kNN resulted
in valuable insights. CDI fared better in 6 out of the 8 datasets, solidifying the
utility of our push-pull neighborhood adaptation (encoded by the learned μ and
λ), clearly showing CDI is going beyond vanilla kNN on the coefficient vectors.
However, it is interesting that in two of the datasets kNN beat not only CDI but all other competitors. For example, kNN obtained a perfect score on
ECG200. The previously published works never reported a simple kNN score on this dataset, but, as can happen, a simple method can often beat more advanced methods on particular datasets. Further investigation into this dataset showed that the test curves contained variability relative to the training curves, which may have contributed to the errors. Our error is on par with all other competing methods.
6 Conclusion
References
1. Abraham, C., Cornillon, P.A., Matzner-Lober, E., Molinari, N.: Unsupervised curve
clustering using B-splines. Scandinavian J. of Stat. 30(3), 581–595 (2003)
2. Alonso, A.M., Casado, D., Romo, J.: Supervised classification for functional data:
A weighted distance approach. Computational Statistics & Data Analysis 56(7),
2334–2346 (2012)
3. Antoniadis, A., Brossat, X., Cugliari, J., Poggi, J.M.: Clustering functional data
using wavelets. International Journal of Wavelets, Multiresolution and Information
Processing 11(1), 1350003 (30 pages) (2013)
4. Berlinet, A., Biau, G., Rouviere, L.: Functional supervised classification with
wavelets. Annale de l’ISUP 52 (2008)
5. Berndt, D., Clifford, J.: Using dynamic time warping to find patterns in time series.
In: AAAI Workshop on Knowledge Discovery in Databases, pp. 229–248 (1994)
6. Biau, G., Bunea, F., Wegkamp, M.: Functional classification in Hilbert spaces.
IEEE Transactions on Information Theory 51(6), 2163–2172 (2005)
7. Daubechies, I.: Ten Lectures on Wavelets. CBMS-NSF Reg. Conf. Series in Applied
Math., SIAM (1992)
8. Domeniconi, C., Gunopulos, D., Peng, J.: Large margin nearest neighbor classifiers.
IEEE Transactions on Neural Networks 16(4), 899–909 (2005)
9. Garcia, M.L.L., Garcia-Rodenas, R., Gomez, A.G.: K-means algorithms for func-
tional data. Neurocomputing 151(Part 1), 231–245 (2015)
10. Globerson, A., Roweis, S.T.: Metric learning by collapsing classes. In: Advances in
Neural Information Processing Systems, vol. 18, pp. 451–458. MIT Press (2006)
11. Gudmundsson, S., Runarsson, T., Sigurdsson, S.: Support vector machines and
dynamic time warping for time series. In: IEEE International Joint Conference on
Neural Networks (IJCNN), pp. 2772–2776, June 2008
12. Hall, P., Hosseini-Nasab, M.: On properties of functional principal components
analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy) 68(1), 109–126 (2006)
13. Hinton, G., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets.
Neural Computation 18(7), 1527–1554 (2006)
14. Jeong, Y.S., Jayaramam, R.: Support vector-based algorithms with weighted
dynamic time warping kernel function for time series classification. Knowledge-
Based Systems 75, 184–191 (2014)
15. Keogh, E., Xi, X., Wei, L., Ratanamahatana, C.: The UCR time series classi-
fication/clustering homepage (2011). http://www.cs.ucr.edu/eamonn/time series
data/
16. Keysers, D., Deselaers, T., Gollan, C., Ney, H.: Deformation models for image
recognition. IEEE Transactions on PAMI 29(8), 1422–1435 (2007)
17. Kreyszig, E.: Introductory Functional Analysis with Applications. John Wiley and
Sons (1978)
18. López-Pintado, S., Romo, J.: Depth-based classification for functional data.
DIMACS Series in Discrete Mathematics and Theoretical Comp. Sci. 72, 103
(2006)
19. Prekopcsák, Z., Lemire, D.: Time series classification by class-specific Mahalanobis
distance measures. Adv. in Data Analysis and Classification 6(3), 185–200 (2012)
20. Ramsay, J., Silverman, B.: Functional Data Analysis, 2nd edn. Springer (2005)
21. Rossi, F., Delannay, N., Conan-Guez, B., Verleysen, M.: Representation of func-
tional data in neural networks. Neurocomputing 64, 183–210 (2005)
22. Rossi, F., Villa, N.: Support vector machine for functional data classification.
Neurocomputing 69(7–9), 730–742 (2006)
23. Tarlow, D., Swersky, K., Charlin, L., Sutskever, I., Zemel, R.S.: Stochastic
k-neighborhood selection for supervised and unsupervised learning. In: Proc. of
the 30th International Conference on Machine Learning (ICML), vol. 28, no. 3,
pp. 199–207 (2013)
24. Trivedi, S., McAllester, D., Shakhnarovich, G.: Discriminative metric learning by
neighborhood gerrymandering. In: Advances in Neural Information Processing Sys-
tems (NIPS), vol. 27, pp. 3392–3400. Curran Associates, Inc. (2014)
25. Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995)
26. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large mar-
gin nearest neighbor classification. In: Advances in Neural Information Processing
Systems (NIPS), vol. 18, pp. 1473–1480. MIT Press (2006)
27. Yuille, A., Kosowsky, J.: Statistical physics algorithms that converge. Neural
Computation 6(3), 341–356 (1994)
28. Zhang, D., Zuo, W., Zhang, D., Zhang, H.: Time series classification using sup-
port vector machine with Gaussian elastic metric kernel. In: 20th International
Conference on Pattern Recognition (ICPR), pp. 29–32 (2010)
29. Zhu, P., Hu, Q., Zuo, W., Yang, M.: Multi-granularity distance metric learning via
neighborhood granule margin maximization. Inf. Sci. 282, 321–331 (2014)
Fast Label Embeddings
via Randomized Linear Algebra
1 Introduction
Recent years have witnessed the emergence of many multiclass and multilabel
datasets with increasing number of possible labels, such as ImageNet [12] and the
Large Scale Hierarchical Text Classification (LSHTC) datasets [25]. One could
argue that all problems of vision and language in the wild have extremely large
output spaces.
When the number of possible outputs is modest, multiclass and multilabel
problems can be dealt with directly (via a max or softmax layer) or with a
reduction to binary classification. However, when the output space is large, these
strategies are too generic and do not fully exploit some of the common properties
that these problems exhibit. For example, often the alternatives in the output
space have varying degrees of similarity between them so that typical examples
from similar classes tend to be closer1 to each other than to those from dissimilar classes.
More concretely, classifying an image of a Labrador retriever as a golden retriever
is a more benign mistake than classifying it as a rowboat.
Shouldn’t these problems then be studied as structured prediction problems,
where an algorithm can take advantage of the structure? That would be the case
if for every problem there was an unequivocal structure (e.g. a hierarchy) that
everyone agreed on and that structure was designed with the goal of being ben-
eficial to a classifier. When this is not the case, we can instead let the algorithm
uncover a structure that matches its own capabilities.
1 Or more confusable, by machines and humans alike.
In this paper we use label embeddings as the underlying structure that can
help us tackle problems with large output spaces, also known as extreme classifi-
cation problems. Label embeddings can offer improved computational efficiency
because the embedding dimension is much smaller than the dimension of the out-
put space. If designed carefully and applied judiciously, embeddings can also offer
statistical efficiency because the number of parameters can be greatly reduced
without increasing, or even reducing, generalization error.
1.1 Contributions
We motivate a particular label embedding defined by the low-rank approximation
of a particular matrix, based upon a correspondence between label embedding
and the optimal rank-constrained least squares estimator. Assuming realizability
and infinite data, the matrix being decomposed is the expected outer product
of the conditional label probabilities. In particular, this indicates two labels are
similar when their conditional probabilities are linearly dependent across the
dataset. This unifies prior work utilizing the confusion matrix for multiclass [5]
and the empirical label covariance for multilabel [41].
We apply techniques from randomized linear algebra [19] to develop an effi-
cient and scalable algorithm for constructing the embeddings, essentially via a
novel randomized algorithm for rank-constrained squared loss regression. Intu-
itively, this technique implicitly decomposes the prediction matrix of a model
which would be prohibitively expensive to form explicitly. The first step of our
algorithm resembles compressed sensing approaches to extreme classification
that use random matrices [21]. However our subsequent steps tune the embed-
dings to the data at hand, providing the opportunity for empirical superiority.
2 Algorithm Derivation
2.1 Notation
We denote vectors by lowercase letters x, y etc. and matrices by uppercase letters
W , Z etc. The input dimension is denoted by d, the output dimension by c and
the embedding dimension by k. For multiclass problems y is a one hot (row)
vector (i.e. a vertex of the c − 1 unit simplex) while for multilabel problems y is
a binary vector (i.e. a vertex of the unit c-cube). For an m × n matrix X ∈ Rm×n
we use ||X||F for its Frobenius norm, X † for the pseudoinverse, ΠX,L for the
projection onto the left singular subspace of X, and X1:k for the matrix resulting
by taking the first k columns of X. We use X ∗ to denote a matrix obtained by
solving an optimization problem over matrix parameter X. The expectation of
a random variable v is denoted by E[v].
2.2 Background
In this section we offer an informal discussion of randomized algorithms for
approximating the principal components analysis of a data matrix X ∈ Rn×d
with n examples and d features. For a very thorough and more formal discussion
see [19].
Algorithm 1 shows a recipe for performing randomized PCA. In both theory
and practice, the algorithm is insensitive to the parameters p and q as long as
they are large enough (in our experiments we use p = 20 and q = 1). We start
with a set of k+p random vectors and use them to probe the range of X^T X. Since principal eigenvectors can be thought of as “frequent directions” [28], the range of Ψ will tend to be more aligned with the space spanned by the top eigenvectors of X^T X. We compute an orthogonal basis for the range of Ψ and repeat the process q times. This can also be thought of as orthogonal (aka subspace) iteration for finding eigenvectors, with the caveat that we stop early (i.e., q is small). Once we are done and we have a good approximation for the principal subspace of X^T X, we optimize fully over that subspace and back out the solution. The last few steps are cheap because we are only working with a (k + p) × (k + p) matrix, and the largest bottleneck is either the computation of Ψ in a single-machine setting or the orthogonalization step if parallelization is employed. An important observation we use below is that X or X^T X need not be available explicitly; to run the algorithm we only need to be able to compute the result of multiplying with X^T X.
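A compact NumPy sketch of the recipe just described: probe with k+p random vectors, orthogonalise, run a few subspace iterations, then solve exactly in the small subspace and back the solution out; this follows the description above rather than reproducing Algorithm 1 verbatim.

import numpy as np

def randomized_pca(X, k, p=20, q=1, rng=None):
    """Approximate top-k principal components (right singular directions) of X."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    Q = rng.standard_normal((d, k + p))                 # random probe of range(X^T X)
    for _ in range(q + 1):
        Psi = X.T @ (X @ Q)                             # multiply by X^T X implicitly
        Q, _ = np.linalg.qr(Psi)                        # orthogonalise (subspace iteration)
    B = X @ Q                                           # project the data onto the subspace
    _, s, Vt = np.linalg.svd(B, full_matrices=False)    # exact solve on the small problem
    V = Q @ Vt.T                                        # back out to the original space
    return V[:, :k], s[:k] ** 2 / n                     # components and eigenvalue estimates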
\min_{W \in \mathbb{R}^{d \times c},\ \mathrm{rank}(W) \le k} \| Y - X W \|_F^2   (1)

where Y ∈ R^{n×c} and X ∈ R^{n×d} are the target and design matrices respectively. This is a special case of a more general problem studied by [14]; specializing
their result yields the solution W^* = X^†(Π_{X,L} Y)_k, where Π_{X,L} projects onto the left singular subspace of X, and (·)_k denotes the optimal Frobenius-norm rank-k approximation, which can be computed via SVD. The expression for W^* can be written in terms of the SVD Π_{X,L} Y = U Σ V^T, which, after simple algebra, yields W^* = X^†(Y V_{1:k}) V_{1:k}^T. This is equivalent to the following procedure:
1. Y V_{1:k}: Project Y down to k dimensions using the top right singular vectors of Π_{X,L} Y.
2. X X^†(Y V_{1:k}): Least squares fit the projected labels using X and predict them.
3. X X^†(Y V_{1:k}) V_{1:k}^T: Map the predictions to the original output space, using the transpose of the top right singular vectors of Π_{X,L} Y.
This motivates the use of the right singular vectors of ΠX,L Y as a label embed-
ding. The ΠX,L Y term can be demystified: it corresponds to the predictions of
the optimal unconstrained model,
\Pi_{X,L} Y = X Z^* \overset{\mathrm{def}}{=} \hat{Y}.
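On problems small enough to form everything explicitly, the embedding and the rank-constrained estimator of this section can be computed directly; the following NumPy sketch is ours, for exposition only.

import numpy as np

def exact_label_embedding(X, Y, k):
    """Right singular vectors of Pi_{X,L} Y = X X^+ Y = Yhat, truncated to k."""
    Z, *_ = np.linalg.lstsq(X, Y, rcond=None)    # unconstrained least squares
    Yhat = X @ Z                                 # Pi_{X,L} Y: predictions of that model
    _, _, Vt = np.linalg.svd(Yhat, full_matrices=False)
    V_k = Vt[:k].T                               # c x k label embedding
    W_k = Z @ V_k @ V_k.T                        # rank-k estimator X^+ (Y V_1:k) V_1:k^T
    return V_k, W_k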
2.4 Rembrandt
Our proposal is Rembrandt, described in Algorithm 2. In the previous section, we
motivated the use of the top right singular space of ΠX,L Y as a label embedding,
or equivalently, the top principal components of Y^T Π_{X,L} Y (leveraging the fact that the projection is idempotent). Using randomized techniques, we can decompose this matrix without explicitly forming it, because we can compute the product of Y^T Π_{X,L} Y with another matrix Q via equation 2. Algorithm 2 is a specialization
of randomized PCA to this particular form of the matrix multiplication operator.
Starting from a random label embedding which satisfies the conditions for random-
ized PCA (e.g., a Gaussian random matrix), the algorithm first fits the embedding,
outer products the embedding with the labels, orthogonalizes and repeats for some
number of iterations. Then a final exact eigendecomposition is used to remove the
additional dimensions of the embedding that were added to improve convergence.
Note that the optimization of 2 is over Rd×(k+p) , not Rd×c , although the result is
equivalent; this is the main computational advantage of our technique.
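A hedged sketch of the specialization just described: the projection Π_{X,L} is applied implicitly through a least-squares fit (the role we attribute to equation (2), which is not reproduced in this excerpt), and the helper name rembrandt_like_embedding is ours, not the paper's Algorithm 2.

```python
import numpy as np

def rembrandt_like_embedding(X, Y, k, p=20, q=1, seed=0):
    """Sketch: top-k right singular space of Pi_{X,L} Y, i.e. top principal
    components of Y^T Pi_{X,L} Y, computed without forming that matrix."""
    rng = np.random.default_rng(seed)
    c = Y.shape[1]

    def apply_M(Q):
        # Multiply (Y^T Pi_{X,L} Y) by Q: fit the projected labels Y Q with X,
        # predict them, then take the outer product with the labels.
        Z, *_ = np.linalg.lstsq(X, Y @ Q, rcond=None)
        return Y.T @ (X @ Z)

    Q = rng.standard_normal((c, k + p))       # random initial label embedding
    for _ in range(q + 1):
        Q, _ = np.linalg.qr(apply_M(Q))       # fit, outer-product, orthogonalize, repeat
    B = Q.T @ apply_M(Q)                      # small (k+p) x (k+p) eigenproblem
    evals, evecs = np.linalg.eigh((B + B.T) / 2)
    top = np.argsort(evals)[::-1][:k]         # drop the extra p oversampled dimensions
    return Q @ evecs[:, top]                  # c x k label embedding
```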
The connection to compressed sensing approaches to extreme classification
is now clear, as the random sensing matrix corresponds to the starting point of
the iterations in Algorithm 2. In other words, compressed sensing corresponds
to Algorithm 2 with q = 0 and p = 0, which results in a whitened random
projection of the labels as the embedding. Additional iterations (q > 0) and
oversampling (p > 0) improve the approximation of the top eigenspace, hence
the potential for improved performance. However when the model is sufficiently
flexible, an embedding matrix which ignores the training data might be superior
to one which overfits the training data.
3 Related Work
a similarity matrix which can ease the computational burden of this portion of
their procedure.
Another intriguing use of side information about the classes is to enable zero-
shot learning. To this end, several authors have exploited the textual nature
of classes in image annotation to learn an embedding over the classes which
generalizes to novel classes, e.g., [15] and [39]. Our embedding technique does
not address this problem.
[18] focus nearly exclusively on the statistical benefit of incorporating label
structure by overcoming the space and time complexity of large-scale one-
against-all classification via distributed training and inference. Specifically, they
utilize side information about the classes to regularize a set of one-against-all
classifiers towards each other. This leads to state-of-the-art predictive perfor-
mance, but the resulting model has high space complexity, e.g., terabytes of
parameters for the LSHTC [24] dataset we utilize in section 4.3. This neces-
sitates distributed learning and distributed inference, the latter being a more
serious objection in practice. In contrast, our embedding technique mitigates
space complexity and avoids model parallelism.
Our objective in equation (1) is highly related to that of partial least
squares [16], as Algorithm 2 corresponds to a randomized algorithm for PLS
if the features have been whitened.3 Unsurprisingly, supervised dimensionality
reduction techniques such as PLS can be much better than unsupervised dimen-
sionality reduction techniques such as PCA regression in the discriminative set-
ting if the features vary in ways irrelevant to the classification task [2].
Two other classical procedures for supervised dimensionality reduction are
Fisher Linear Discriminant [37] and Canonical Correlation Analysis [20]. For
multiclass problems these two techniques yield the same result [2,3], although for
multilabel problems they are distinct. Indeed, extension of FLD to the multilabel
case is a relatively recent development [42] whose straightforward implementa-
tion does not appear to be computationally viable for large number of classes.
CCA and PLS are highly related, as CCA maximizes latent correlation and PLS
maximizes latent covariance [2]. Furthermore, CCA produces equivalent results
to PLS if the features are whitened [40]. Therefore, there is no obvious statistical
reason to prefer CCA to our proposal in this context.
Regarding computational considerations, scalable CCA algorithms are avail-
able [30,32], but it remains open how to specialize them to this context to lever-
age the equivalent of equation (2); whereas, if CCA is desired, Algorithm 2 can
be utilized in conjunction with whitening pre-processing.
Text is one of the common input domains over which large-scale multiclass and
multilabel problems are defined. There has been substantial recent work on text
embeddings, e.g., word2vec [31], which (empirically) provide analogous statistical
and computational benefits despite being unsupervised. The text embedding
technique of [27] is a particularly interesting comparison because it is a variant
of Hellinger PCA which leverages sequential information. This suggests that
unsupervised dimensionality reduction approaches can work well when additional
³ More precisely, if the feature covariance is a rotation.
4 Experiments
Table 1. Data sets used for experimentation and times to compute an embedding.
In Table 1 we present some statistics about the datasets we use in this section,
as well as the times required to compute an embedding for each dataset. Unless
otherwise indicated, all timings presented in the experiments section are for a Matlab
implementation running on a standard desktop with dual 3.2 GHz Xeon
E5-1650 CPUs and 48 GB of RAM.
4.1 ALOI
ALOI is a color image collection of one-thousand small objects recorded for sci-
entific purposes [17]. The number of classes in this data set does not qualify as
extreme by current standards, but we begin with it as it will facilitate compar-
ison with techniques which in our other experiments are intractable on a single
machine. For these experiments we will consider test classification accuracy uti-
lizing the same train-test split and features from [8]. Specifically there is a fixed
train-test split of 90:10 for all experiments and the representation is linear in
128 raw pixel values.
Algorithm 2 produces an embedding matrix whose transpose is a squared-
loss optimal decoder. In practice, optimizing the decode matrix for logistic loss
as described in Algorithm 3 gives much better results. This is by far the most
computationally demanding step in this experiment, e.g., it takes 4 seconds to
compute the embedding but 300 seconds to perform the logistic regression. For-
tunately the number of features (i.e., embedding dimensionality) for this logistic
regression is modest so the second order techniques of [1] are applicable (in
particular, their Algorithm 1 with a simple modification to include accelera-
tion [4,33]). We determine the number of fit iterations for the logistic regression
by extracting a hold-out set from the training set and monitoring held-out loss.
We do not use a random feature map, i.e., φ in line 4 of Algorithm 3 is the
identity function.
[Figure: singular values versus rank (log-log scale).]
more tractable, but underperforms Rembrandt. Lomtree has the worst predic-
tion performance but the lowest computational overhead when the number of
classes is large.
At k = 50, there is no difference in quality between using the Rembrandt
(label) embedding and the PCA (feature) embedding. This is not surprising
considering the effective rank of the covariance matrix of ALOI is 70. For small
embedding dimensionalities, however, PCA underperforms Rembrandt as indi-
cated in Figure 1a. For larger numbers of output classes, where the embedding
dimension will be a small fraction of the number of classes by computational
necessity, we anticipate PCA regression will not be competitive.
Note that, in addition to better statistical performance, all of the “embed-
ding + LR” approaches have lower space complexity O(k(c + d)) than direct
logistic regression O(cd). For ALOI the savings are modest (255600 bytes vs.
516000 bytes) because the input dimensionality is only d = 128, but for larger
problems the space savings are necessary for feasible implementation on a single
commodity computer. Inference time on ALOI is identical for embedding and
direct approaches in practice (both achieving ≈ 170k examples/sec).
4.2 ODP
The Open Directory Project [13] is a public human-edited directory of the web
which was processed by [6] into a multiclass data set. For these experiments we
will consider test classification error utilizing the same train-test split, features,
and labels from [8]. Specifically there is a fixed train-test split of 2:1 for all
experiments, the representation of each document is a bag of words, and the unique
class assignment for each document is the most specific category associated with
the document.
The procedures are the same as in the previous experiment, except that we
do not compare to OAA or full logistic regression due to intractability on a single
machine.
The combination of Rembrandt and logistic regression is, to the best of
our knowledge, the best published result on this dataset. PCA logistic regression
has a performance gap compared to Rembrandt and logistic regression. The poor
performance of PCA logistic regression is doubly unsurprising, both for general
reasons previously discussed, and due to the fact that covariance matrices of text
data typically have a long plateau of weak spectral decay. In other words, for
text problems projection dimensions quickly become nearly equivalent in terms
of input reconstruction error, and common words and word combinations are
not discriminative. In contrast, Rembrandt leverages the spectral properties of
the prediction covariance of equation (3), rather than the spectral properties of
the input features.
Finally, we remark that although inference (i.e., finding the maximum
output) is linear in the number of classes, the constant factors are favorable
due to modern vectorized processors, and therefore it proceeds at ≈ 1700 examples/sec
for the embedding-based approaches.
4.3 LSHTC
The Large Scale Hierarchical Text Classification Challenge (version 4) was a
public competition involving multilabel classification of documents into approxi-
mately 300,000 categories [24]. The training and test files are available from the
Kaggle platform. The features are bag of words representations of each document.
Table 4. Embedding quality for LSHTC. k = 100 for all embedding strategies.
predictor baseline for this task. PLST [41] has performance close to Rembrandt
according to this metric, so the 3.2 average nonzero classes per example is appar-
ently enough for the approximation underlying PLST to be reasonable.
larger output space, larger embedding dimensionality, and the use of random
Fourier features.
5 Discussion
In this paper we identify a correspondence between rank constrained regression
and label embedding, and we exploit that correspondence along with randomized
matrix decomposition techniques to develop a fast label embedding algorithm.
To facilitate analysis and implementation, we focused on linear prediction,
which is equivalent to a simple neural network architecture with a single linear
hidden layer bottleneck. Because linear predictors perform well for text classi-
fication, we obtained excellent experimental results, but more sophistication is
required for tasks where deep architectures are state-of-the-art. Although the
analysis presented herein would not strictly be applicable, it is plausible that
replacing line 5 in Algorithm 2 with an optimization over a deep architecture
could yield good embeddings. This would be computationally beneficial as reduc-
ing the number of outputs (i.e., predicting embeddings rather than labels) would
mitigate space constraints for GPU training.
Our technique leverages the (putative) low-rank structure of the prediction
covariance of equation (3). For some problems a low-rank plus sparse assumption
might be more appropriate. In such cases combining our technique with L1
regularization, e.g., on a classification residual or on separately regularized direct
connections from the original inputs, might yield superior results.
Acknowledgments. We thank John Langford for providing the ALOI and ODP data
sets.
References
1. Agarwal, A., Kakade, S.M., Karampatziakis, N., Song, L., Valiant, G.: Least
squares revisited: Scalable approaches for multi-class prediction. In: Proceedings
of the 31st International Conference on Machine Learning, pp. 541–549 (2014)
2. Barker, M., Rayens, W.: Partial least squares for discrimination. Journal of chemo-
metrics 17(3), 166–173 (2003)
3. Bartlett, M.S.: Further aspects of the theory of multiple regression. In: Mathe-
matical Proceedings of the Cambridge Philosophical Society, vol. 34, pp. 33–40.
Cambridge Univ. Press (1938)
4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
5. Bengio, S., Weston, J., Grangier, D.: Label embedding trees for large multi-class
tasks. In: Advances in Neural Information Processing Systems, pp. 163–171 (2010)
6. Bennett, P.N., Nguyen, N.: Refined experts: improving classification in large tax-
onomies. In: Proceedings of the 32nd International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 11–18. ACM (2009)
7. Breiman, L., Friedman, J.H.: Predicting multivariate responses in multiple linear
regression. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology) 59(1), 3–54 (1997)
29. Lokhorst, J.: The lasso and generalised linear models. Tech. rep., University of
Adelaide, Adelaide (1999)
30. Lu, Y., Foster, D.P.: Large scale canonical correlation analysis with iterative least
squares. arXiv preprint arXiv:1407.4508 (2014)
31. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
32. Mineiro, P., Karampatziakis, N.: A randomized algorithm for CCA. arXiv preprint
arXiv:1411.3409 (2014)
33. Nesterov, Y.: A method of solving a convex programming problem with conver-
gence rate O(1/k2 ). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
34. Palatucci, M., Pomerleau, D., Hinton, G.E., Mitchell, T.M.: Zero-shot learning with
semantic output codes. In: Advances in Neural Information Processing Systems,
pp. 1410–1418 (2009)
35. Prabhu, Y., Varma, M.: Fastxml: a fast, accurate and stable tree-classifier for
extreme multi-label learning. In: Proceedings of the 20th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 263–272. ACM
(2014)
36. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In:
Advances in Neural Information Processing Systems, pp. 1177–1184 (2007)
37. Rao, C.R.: The utilization of multiple measurements in problems of biological
classification. Journal of the Royal Statistical Society. Series B (Methodological)
10(2), 159–203 (1948). http://www.jstor.org/stable/2983775
38. Schietgat, L., Vens, C., Struyf, J., Blockeel, H., Kocev, D., Džeroski, S.: Predicting
gene function using hierarchical multi-label decision tree ensembles. BMC Bioin-
formatics 11, 2 (2010)
39. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through
cross-modal transfer. In: Advances in Neural Information Processing Systems,
pp. 935–943 (2013)
40. Sun, L., Ji, S., Yu, S., Ye, J.: On the equivalence between canonical correlation
analysis and orthonormalized partial least squares. In: IJCAI, vol. 9, pp. 1230–1235
(2009)
41. Tai, F., Lin, H.T.: Multilabel classification with principal label space transforma-
tion. Neural Computation 24(9), 2508–2542 (2012)
42. Wang, H., Ding, C., Huang, H.: Multi-label linear discriminant analysis. In:
Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part VI. LNCS, vol.
6316, pp. 126–139. Springer, Heidelberg (2010)
43. Weinberger, K.Q., Chapelle, O.: Large margin taxonomy embedding for doc-
ument categorization. In: Advances in Neural Information Processing Systems,
pp. 1737–1744 (2009)
44. Weston, J., Bengio, S., Usunier, N.: Wsabie: scaling up to large vocabulary image
annotation. In: IJCAI, vol. 11, pp. 2764–2770 (2011)
45. Weston, J., Makadia, A., Yee, H.: Label partitioning for sublinear ranking. In:
Proceedings of the 30th International Conference on Machine Learning (ICML
2013), pp. 181–189 (2013)
Maximum Entropy Linear Manifold for Learning
Discriminative Low-Dimensional Representation
1 Introduction
A correct representation of the data, consistent with the problem and the classification
method used, is crucial for the efficiency of machine learning models. In
practice it is a very hard task to find a suitable embedding of many real-life objects
in the R^d space used by most algorithms. In particular, for natural language
processing [12], cheminformatics or even image recognition tasks it is still an
open problem. As a result, there is a growing interest in methods of representation
learning [8], suited for finding a better embedding of our data, which may
be further used for classification, clustering or other analysis purposes. Recent
years brought many success stories, such as dictionary learning [13] or deep learning [9].
Many of them look for a sparse [7], high-dimensional embedding which
simplifies linear separation at the cost of making visual analysis nontrivial. A dual
approach is to look for a low-dimensional linear embedding, which has the advantage
of easy visualization, interpretation and manipulation at the cost of a much weaker
(in terms of model complexity) space of transformations.
In this work we focus on the scenario where we are given a labeled dataset in
R^d and we look for a low-dimensional linear embedding which allows us
to easily distinguish each of the classes. In other words, we are looking for a
highly discriminative, low-dimensional representation of the given data.
Our basic idea follows from the observation [15] that density estimation is credible only in low-dimensional spaces. Consequently, we first project the data onto an arbitrary k-dimensional affine submanifold V (where k is fixed), and search for the V for which the estimated densities of the projected classes are orthogonal to each other, where the Cauchy-Schwarz Divergence is applied as a measure of discriminativeness of the projection; see Fig. 1 for an example of such a projection preserving the classes' separation. The work presented in this paper is a natural extension of our earlier results [6], where we considered the one-dimensional case. However, we would like to emphasize that the approach used needed a nontrivial modification. In the one-dimensional case we could identify subspaces with elements of the unit sphere in a natural way. For higher-dimensional subspaces such an identification is no longer possible.

Fig. 1. Visualization of the sonar dataset using Maximum Entropy Linear Manifold with k = 2.
To the authors' best knowledge the presented idea is novel, and has not been
considered before as a method of classification and data visualization. One
of its benefits is that it does not depend on affine rescaling of the
data, which is a rare feature among common classification tools. Interestingly,
we show that in its simple limiting one-class case we obtain the
classical PCA projection. Moreover, from the theoretical standpoint, the Cauchy-Schwarz
Divergence factor can be decomposed into a fitting term, bounding the
expected balanced misclassification error, and a regularizing term, simplifying the
resulting model. We compute its value and derivative so one can use first-order
optimization to find a solution, even though the true optimization should be performed
on a Stiefel manifold. Empirical tests show that such a method not only
in some cases improves the classification score over learning from raw data but,
more importantly, consistently finds a highly discriminative representation which
can be easily visualized. In particular, we show that the discriminativeness of the
resulting projections is much higher than that of many popular linear methods, including the recently
proposed GEM model [11]. For the sake of completeness we also include the full
source code of the proposed method in the supplementary material.
2 General Idea
In order to visualize a dataset in R^d we need to project it onto R^k for very small
k (typically 2 or 3). One can use either a linear transformation or some complex
embedding; however, choosing the second option in general leads to hard interpretability
of the results. Linear projections have the tempting characteristic of
being easy to understand (from both the theoretical perspective and the practical
implications of the obtained results) as well as being highly robust in further
applications of this transformation.
Fig. 2. Visualization of the MELM idea. For a given dataset X−, X+ we search through
various linear projections V and analyze how divergent their density estimates are,
in order to select the most discriminative one.
maximize_{V ∈ R^{d×k}}  IM(V^T X; X, Y)
subject to  ϕ_i(V),  i = 1, . . . , m.
There are many transformations which can achieve such results. For example,
the well-known Principal Component Analysis defines important information as
data scattering, so it looks for a V which preserves as much of the variance of X as
possible. MELM instead defines important information as class discrimination and solves
maximize_{V ∈ R^{d×k}}  D_cs(⟦V^T X_+⟧, ⟦V^T X_-⟧)
subject to  V^T V = I,
where D_cs(·, ·) denotes the Cauchy-Schwarz Divergence, a measure of how
divergent two given probability distributions are, and ⟦·⟧ denotes some density estimator
which, given samples, returns a probability distribution. The general idea is also
visualized in Fig. 2.
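As a toy illustration of this idea (not the authors' implementation), the snippet below estimates D_cs for a one-dimensional projection by evaluating Gaussian kernel density estimates of the two projected classes on a grid; scipy's gaussian_kde and its default bandwidth are stand-ins for the estimator specified later in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def cs_divergence_1d(Xp, Xn, v, grid_size=512):
    """Cauchy-Schwarz divergence between KDEs of two classes projected onto v (k = 1)."""
    v = v / np.linalg.norm(v)
    a, b = Xp @ v, Xn @ v
    grid = np.linspace(min(a.min(), b.min()) - 1, max(a.max(), b.max()) + 1, grid_size)
    f, g = gaussian_kde(a)(grid), gaussian_kde(b)(grid)
    dx = grid[1] - grid[0]
    ff, gg, fg = (f * f).sum() * dx, (g * g).sum() * dx, (f * g).sum() * dx
    return np.log(ff) + np.log(gg) - 2 * np.log(fg)   # log int f^2 + log int g^2 - 2 log int fg

# Two candidate projections of a toy two-class dataset.
rng = np.random.default_rng(0)
Xp = rng.normal([0, 0], 0.5, size=(200, 2))
Xn = rng.normal([3, 0], 0.5, size=(200, 2))
print(cs_divergence_1d(Xp, Xn, np.array([1.0, 0.0])))  # discriminative direction: high D_cs
print(cs_divergence_1d(Xp, Xn, np.array([0.0, 1.0])))  # uninformative direction: D_cs near 0
```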
3 Theory
We first discuss the one-class case, which has a mainly introductory character as
it shows a simplified version of our main idea.
Suppose that we have unlabeled data X ⊂ Rd and that we want to reduce
the dimension of the data (for example to visualize it, reduce outliers, etc.) to
k < d. One of the possible approaches is to use information theory and search
for such k-dimensional subspace V ⊂ Rd for which the orthogonal projection of
X onto V preserves as much information about X as possible.
One can clearly choose various measures of information. In our case, due
to computational simplicity, we have decided to use Renyi’s quadratic entropy,
which for the density f on Rk is given by
H_2(f) = − log ∫_{R^k} f^2(x) dx.
is realized for the first k orthonormal vectors given by the PCA and
H_2(⟦V^T X⟧_N) = (k/2) log(4π) + (1/2) log(det(V^T ΣV)).
In other words, we search for those V for which the value of det(V^T ΣV) is
maximized. Now, by the Cauchy interlacing theorem [2], the eigenvalues of V^T ΣV (ordered
decreasingly) are bounded above by the eigenvalues of Σ. Consequently, the
maximum is obtained when V consists of the orthonormal eigenvectors
of Σ corresponding to the biggest eigenvalues of Σ. However, this is exactly the
first k elements of the orthonormal basis constructed by PCA. The proof of the
second part is fully analogous.
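A quick numerical sanity check of this statement, using the H_2 formula above: the entropy of the projection through the top-k eigenvectors of Σ is never smaller than that of a random orthonormal projection (the covariance matrix and dimensions below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 2
A = rng.standard_normal((d, d))
Sigma = A @ A.T                                        # an arbitrary covariance matrix

def h2_gaussian_projection(V, Sigma):
    """H_2 of the Gaussian estimator of the data projected through V (formula above)."""
    k = V.shape[1]
    return 0.5 * k * np.log(4 * np.pi) + 0.5 * np.log(np.linalg.det(V.T @ Sigma @ V))

evals, evecs = np.linalg.eigh(Sigma)
V_pca = evecs[:, np.argsort(evals)[::-1][:k]]          # top-k eigenvectors (PCA directions)
V_rand, _ = np.linalg.qr(rng.standard_normal((d, k)))  # a random orthonormal V

assert h2_gaussian_projection(V_pca, Sigma) >= h2_gaussian_projection(V_rand, Sigma)
```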
¹ We identify those vectors with the linear space spanned by them.
² That is, for A ⊂ V we put ⟦A⟧_N = N(m_A, cov_A) : V → R_+.
Let us now proceed to the binary labeled data. Recall that D_cs can be equivalently
expressed in terms of Renyi's quadratic entropy (H_2) and Renyi's quadratic
cross entropy (H_2^×):

D_cs(V) = log ∫ ⟦V^T X_+⟧^2 + log ∫ ⟦V^T X_-⟧^2 − 2 log ∫ ⟦V^T X_+⟧ ⟦V^T X_-⟧.
Observation 1. Assume that the density estimator ⟦·⟧ does not change under
an affine change of the coordinate system³. One can show, by an easy modification
of the theorem by Czarnecki and Tabor [6, Theorem 4.1], that the maximum
of D_cs(·) is independent of an affine change of the data. Namely, for an arbitrary
affine invertible map M we have:
The regularizing term has a slightly different meaning than in most machine
learning models. Here it controls the number of disjoint regions which appear after
performing density-based classification in the projected space. In the one-dimensional
case it is the number of thresholds in the multithreshold linear classifier;
for k = 2 it is the number of disjoint curves defining the decision boundary, and so
on. Renyi's quadratic entropy is minimized when each class is as condensed as
possible (as we show in Theorem 1), intuitively resulting in a small number of
disjoint regions.
It is worth noting that, despite similarities, this is not the common classification
objective, which can be written as an additive loss function and a regularization
term

L(V) = Σ_{i=1}^{N} ℓ(V^T x_i, y_i, x_i) + Ω(V),
as the error depends on the relations between each pair of points instead of
each point independently. One can easily prove that there are no ℓ, Ω for which
³ This happens in particular for the kernel density estimation we apply in the paper.
D_cs(V) = L(V; ℓ, Ω). Such a choice of the objective function might seem to lack
connections with the optimization of any reasonable accuracy-related metric, as
those are based on point-wise loss functions. However, it turns out that D_cs
bounds the expected balanced accuracy⁴ similarly to how the hinge loss bounds the 0/1
loss. This can be formalized in the following way.
Theorem 2. The negative log-likelihood of balanced misclassification in a k-dimensional
linear projection of any non-separable densities f_± onto V is
bounded by half of the Renyi's quadratic cross entropy of these projections.
Proof. The likelihood of balanced misclassification over a k-dimensional hypercube
after projection through V equals

∫_{[0,1]^k} min{(V^T f_+)(x), (V^T f_-)(x)} dx.
As a result, we might expect that maximizing D_cs leads to the selection
of a projection which on the one hand maximizes the balanced accuracy over the
training set (minimizes the empirical error) and on the other fights overfitting
by minimizing the number of disjoint classification regions (minimizes model
complexity).
where G(V) = V^T V denotes the Gramian. We search for such V for which
the Cauchy-Schwarz Divergence is maximal. Recall that the scalar product in
the space of matrices is given by ⟨V_1, V_2⟩ = tr(V_1^T V_2).
⁴ Accuracy without class priors: BAC(TP, FP, TN, FN) = (1/2)(TP/(TP+FN) + TN/(TN+FP)).
There are basically two possible approaches one can apply: either search for
the solution in the set of orthonormal V which generate V, or allow all V with a
penalty function. The first method is possible5 , but does not allow use of most
of the existing numerical libraries as the space we work in is highly nonlinear.
This is the reason why we use the second approach which we describe below.
Since, as we have observed in the previous section, the result does not depend
on the affine transformation of data, we can restrict ourselves to the analogous formula
for the sets
VT X+ and VT X− ,
where V consists of linearly independent vectors. Consequently, we need to com-
pute the gradient of the function
where we consider the space consisting only of linearly independent vectors. Since
as a basis of the space V we can always take orthonormal vectors, the maximum
is realized for an orthonormal sequence, and therefore we can add a penalty term for
being a non-orthonormal sequence, which helps to avoid numerical instabilities:
D_cs(V) − ||V^T V − I||^2,
Besides the value MELM(·) we need the formula for its gradient ∇MELM(·). For
the second term we obviously have

∇||V^T V − I||^2 = 4VV^T V − 4V.
We consider the first term. Let us first provide the formula for the computa-
tion of the product of kernel density estimations of two sets.
Assume that we are given a set A ⊂ V (in our case A will be the projection of
X_± onto V), where V is k-dimensional. Then the formula for the kernel density
estimation with a Gaussian kernel is given by [15]:
⟦A⟧ = (1/|A|) Σ_{a∈A} N(a, Σ_A),

where Σ_A = (h_A^γ)^2 cov_A and (for γ being a scaling hyperparameter [6])

h_A^γ = γ (4/(k+2))^{1/(k+4)} |A|^{−1/(k+4)}.
⁵ And has the advantage of having a smaller number of parameters.
Now we need the formula for ∫⟦A⟧⟦B⟧, which is calculated [6] with the use of

∫ N(a, Σ_A)(x) N(b, Σ_B)(x) dx = N(a − b, Σ_A + Σ_B)(0).

Then we get

∫⟦A⟧⟦B⟧ = (1/(|A||B|)) Σ_{w ∈ A−B} N(w, Σ_A + Σ_B)(0)
        = (1/((2π)^{k/2} det^{1/2}(Σ_AB) |A||B|)) Σ_{w ∈ A−B} exp(−(1/2) w^T Σ_AB^{−1} w),

where

Σ_AB = Σ_A + Σ_B = γ^2 (4/(k+2))^{2/(k+4)} (|A|^{−2/(k+4)} cov_A + |B|^{−2/(k+4)} cov_B).
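The Gaussian identity used above can be checked numerically in the one-dimensional case; the means and variances below are arbitrary, and scipy's norm.pdf stands in for N(·, ·).

```python
import numpy as np
from scipy.stats import norm

a, b = 0.7, -1.2            # component means
sa2, sb2 = 0.3, 0.8         # component variances (Sigma_A, Sigma_B in 1-D)
x = np.linspace(-15, 15, 20001)
dx = x[1] - x[0]
# Left side: integral of the product of the two Gaussian densities.
lhs = (norm.pdf(x, a, np.sqrt(sa2)) * norm.pdf(x, b, np.sqrt(sb2))).sum() * dx
# Right side: N(a - b, Sigma_A + Sigma_B) evaluated at 0.
rhs = norm.pdf(0.0, a - b, np.sqrt(sa2 + sb2))
assert np.isclose(lhs, rhs)
```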
Observe that Σ_AB(V) and S_AB(V) are square symmetric matrices which represent
the properties of the projection of the data onto the space spanned by V.
We put

φ_AB(V) = 1 / ((2π)^{k/2} det^{1/2}(Σ_AB(V)) |A||B|),

thus

∇φ_AB(V) = −φ_AB(V) · Σ_AB · V · S_AB(V).
Consequently, to compute the final formula we need the gradient of the function
V → det(Σ_AB(V)), which, as one can easily verify, is given by the formula
To present the final form of the gradient of D_cs(V) we need the gradient of
the cross information potential

ip^×_{AB}(V) = φ_AB(V) Σ_{w ∈ A−B} ψ^w_{AB}(V),

∇ip^×_{AB}(V) = φ_AB(V) Σ_{w ∈ A−B} ∇ψ^w_{AB}(V) + (Σ_{w ∈ A−B} ψ^w_{AB}(V)) · ∇φ_AB(V).
Since

D_cs(V) = log(ip^×_{X_+X_+}(V)) + log(ip^×_{X_-X_-}(V)) − 2 log(ip^×_{X_+X_-}(V)),

we finally get

∇D_cs(V) = (1/ip^×_{X_+X_+}(V)) ∇ip^×_{X_+X_+}(V) + (1/ip^×_{X_-X_-}(V)) ∇ip^×_{X_-X_-}(V) − (2/ip^×_{X_+X_-}(V)) ∇ip^×_{X_+X_-}(V).

Given

MELM(V) = D_cs(V) − ||V^T V − I||^2,
∇MELM(V) = ∇D_cs(V) − (4VV^T V − 4V),
one can run any first-order optimization method to find vectors V spanning a k-dimensional
subspace V, representing a low-dimensional, discriminative manifold
of the input space.
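A compact sketch of such an optimization, assuming the Gaussian-KDE information potentials with the Silverman-type bandwidth given above (γ = 1 here) and a numerical gradient inside scipy's L-BFGS-B in place of the closed-form ∇MELM; this is an illustration, not the released implementation linked in the footnotes.

```python
import numpy as np
from scipy.optimize import minimize

def ip_cross(A, B, k, gamma=1.0):
    """Cross information potential of Gaussian KDEs of projected sets A, B (n x k)."""
    hA2 = (gamma * (4.0 / (k + 2)) ** (1.0 / (k + 4)) * len(A) ** (-1.0 / (k + 4))) ** 2
    hB2 = (gamma * (4.0 / (k + 2)) ** (1.0 / (k + 4)) * len(B) ** (-1.0 / (k + 4))) ** 2
    Sab = np.atleast_2d(hA2 * np.cov(A.T) + hB2 * np.cov(B.T))   # Sigma_AB
    Sinv, det = np.linalg.inv(Sab), np.linalg.det(Sab)
    W = A[:, None, :] - B[None, :, :]                            # all pairwise differences w = a - b
    quad = np.einsum('abi,ij,abj->ab', W, Sinv, W)
    return np.exp(-0.5 * quad).sum() / ((2 * np.pi) ** (k / 2) * np.sqrt(det) * len(A) * len(B))

def neg_melm(V_flat, Xp, Xn, k):
    V = V_flat.reshape(-1, k)
    Ap, An = Xp @ V, Xn @ V
    dcs = (np.log(ip_cross(Ap, Ap, k)) + np.log(ip_cross(An, An, k))
           - 2 * np.log(ip_cross(Ap, An, k)))
    penalty = np.linalg.norm(V.T @ V - np.eye(k)) ** 2            # ||V^T V - I||^2
    return -(dcs - penalty)                                       # minimize the negative of MELM(V)

def fit_melm(Xp, Xn, k=2, seed=0):
    d = Xp.shape[1]
    V0 = np.random.default_rng(seed).standard_normal((d, k))
    res = minimize(neg_melm, V0.ravel(), args=(Xp, Xn, k), method='L-BFGS-B')
    return res.x.reshape(d, k)
```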
As one can notice from the above equations, the computational complexity of
both the function evaluation and its gradient is quadratic in the training set
size. For big datasets this can be a serious bottleneck. One possible solution
is to use approximation schemes for the computation of the Cauchy-Schwarz
divergence, which are known to significantly reduce the computational time
without sacrificing accuracy [10]. Another option is to use an analogue of
stochastic gradient descent, where we define the function value on a random sample
of O(√N) points (resampled in each iteration) from each class, leading to linear
complexity; given that the training set is big enough, one can get theoretical
guarantees on the quality of the approximation [15]. Finally, it is possible to first
build a Gaussian Mixture Model (GMM) of each class distribution [17] and perform
the optimization on such a density estimator. The computational complexity would
then be reduced to constant time per iteration (due to fixing the number of components
during clustering), trading accuracy for speed.
5 Experiments
We use ten binary classification datasets from the UCI repository [1] and the libSVM
repository [3], which are briefly summarized in Table 1. These are moderate-size
problems.
Code was written in Python with the use of scikit-learn, numpy and scipy.
Besides MELM we use 8 other linear dimensionality reduction techniques,
namely: Principal Component Analysis (PCA), class PCA (cPCA6 ), two ellip-
soid PCA (2ePCA7 ), per class PCA (pPCA8 ), Independent Component Anal-
ysis (ICA), Factor Analysis (FA), Nonnegative Matrix Factorization (NMF9 ),
⁶ cPCA uses the sum of the per-class covariances, weighted by class sizes, instead of the whole-data covariance.
⁷ 2ePCA is cPCA without weights, so it is a balanced counterpart.
⁸ pPCA uses as V_i the first principal component of the i-th class.
⁹ In order to use NMF we first transform the dataset so it does not contain negative values.
As one can see in Fig. 4, for 8 out of 10 considered datasets one can expect to
find the maximum (with 5% error) after just 16 random starts. Obviously this
cannot be used as a general heuristic, as it is heavily dependent on the dataset
size and dimensionality as well as its discriminativeness. However, this experiment
¹⁰ forked at http://gist.github.com/lejlot/3ab46c7a249d4f375536
¹¹ http://github.com/gmum/melm
Fig. 3. Histograms of Dcs values obtained for each dataset during 500 random starts
using L-BFGS.
shows that for moderate-size problems (hundreds to thousands of samples with
dozens of dimensions) MELM can be optimized relatively easily even though it
is a rather complex function with possibly many local maxima.
It is worth noting that a truly complex optimization problem is given only by
the ionosphere dataset. One can refer to Table 1 to see that this is a very specific
problem where the positive class is located in a very low-dimensional linear manifold
(approximately 7-dimensional) while the negative class is scattered over nearly
4 times more dimensions.
We check how well MELM behaves when used in a classification pipeline.
There are two main reasons for such an approach: first, if the discriminative manifold
is low-dimensional, searching for it may boost the classification accuracy. Second,
even if it decreases the classification score as compared to non-linear methods applied
directly in the input space, the resulting model will be much simpler and more
robust. For comparison, consider training an RBF SVM in R^60 using 1000 data
points. It is a common situation that the SVM selects a large part of the dataset as
support vectors [16,18], meaning that the classification of a new point
requires roughly 500 · 60 = 30000 operations. At the same time, if we first embed
the space in a plane and fit an RBF SVM there, we will build a model with far fewer
support vectors (as a 2D decision boundary is generally not as complex as a
60-dimensional one), let's say 100, and consequently we will need 60 · 2 + 2 · 100 =
120 + 200 = 320 operations, two orders of magnitude fewer. The whole pipeline,
sketched in code below, is composed of:
1. Splitting the dataset into training X−, X+ and testing X̂−, X̂+ parts.
2. Finding a plane embedding matrix V ∈ R^{d×2} using the tested method.
3. Training a classifier cl on V^T X−, V^T X+.
4. Testing cl on V^T X̂−, V^T X̂+.
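A minimal scikit-learn sketch of this pipeline, with PCA standing in for the tested embedding method (swapping in MELM or any other V ∈ R^{d×2} is a one-line change); a single stratified split replaces the 5-fold cross-validation used in the experiments, and hyperparameter tuning is omitted.

```python
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

def evaluate_embedding_pipeline(X, y, seed=0):
    # 1. Split the dataset into training and testing parts.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    # 2. Find a plane embedding matrix V in R^{d x 2} (here: PCA as a placeholder).
    embedder = PCA(n_components=2).fit(X_tr)
    Z_tr, Z_te = embedder.transform(X_tr), embedder.transform(X_te)
    # 3. Train a classifier on the embedded training data.
    clf = SVC(kernel='rbf').fit(Z_tr, y_tr)
    # 4. Test the classifier on the embedded test data (BAC, as in Table 2).
    return balanced_accuracy_score(y_te, clf.predict(Z_te))
```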
Table 2 summarizes the BAC scores obtained by each method on each of the
considered datasets in 5-fold cross-validation. For the classifier module we used
an RBF SVM, KNN and KDE-based density classification. Each of them was fitted
using internal cross-validation to find the best parameters. The GEM and MELM γ
hyperparameters were also fitted. Reported results come from the best classifier.
significantly smaller degree than using PCA, cPCA, 2ePCA, pPCA, ICA, NMF,
FA or GEM. It is also worth noting that Factor Analysis, as the only method
which does not require orthogonality of the resulting projection vectors, did a really
bad job on the fourclass data even though these samples are just
two-dimensional.
As stated before, MELM is best suited for low-dimensional embeddings and
one of its main applications is supervised data visualization. However, in general
one can be interested in higher-dimensional subspaces. During preliminary
studies we tested the model's behavior up to k = 5 and the results were similar to the ones
reported in this paper (when compared to the same methods, with analogous
k). It is worth noting that methods like PCA also use a density estimator - one
big Gaussian fitted through maximum likelihood estimation. Consequently, even
though from a theoretical point of view MELM should not be used for k > 5 (due
to the curse of dimensionality [15]), it works fine as long as one uses a good density
estimator (such as a well-fitted GMM [17]).
6 Conclusions
In this paper we construct Maximum Entropy Linear Manifold (MELM), a
method of learning a discriminative low-dimensional representation which can
be used both for classification purposes and for a visualization preserving the
classes' separation. The proposed model has important theoretical properties, including
invariance to affine transformations, connections with PCA, as well as bounding
the expected balanced misclassification error. During evaluation we show that
for moderate-size problems MELM can be efficiently optimized using simple
first-order optimization techniques. The obtained results confirm that such an approach
leads to a highly discriminative transformation, better than those obtained by the 8
compared solutions.
Acknowledgments. The work has been partially financed by National Science Centre
Poland grant no. 2014/13/B/ST6/01792.
References
1. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.
ics.uci.edu/ml
2. Bhatia, R.: Matrix analysis, vol. 169. Springer Science & Business Media (1997)
3. Chang, C.C., Lin, C.J.: Libsvm: a library for support vector machines. ACM Trans-
actions on Intelligent Systems and Technology (TIST) 2(3), 27 (2011)
4. Cover, T.M., Thomas, J.A.: Elements of information theory, 2nd edn. Willey-
Interscience, NJ (2006)
5. Czarnecki, W.M.: On the consistency of multithreshold entropy linear classifier.
Schedae Informaticae (2015)
6. Czarnecki, W.M., Tabor, J.: Multithreshold entropy linear classifier: Theory and
applications. Expert Systems with Applications (2015)
7. Geng, Q., Wright, J.: On the local correctness of 1-minimization for dictionary
learning. In: 2014 IEEE International Symposium on Information Theory (ISIT),
pp. 3180–3184. IEEE (2014)
8. Goodfellow, I.J., et al.: Challenges in representation learning: a report on three
machine learning contests. In: Lee, M., Hirose, A., Hou, Z.-G., Kil, R.M. (eds.)
ICONIP 2013, Part III. LNCS, vol. 8228, pp. 117–124. Springer, Heidelberg (2013)
9. Hinton, G., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets.
Neural Computation 18(7), 1527–1554 (2006)
10. Jozefowicz, R., Czarnecki, W.M.: Fast optimization of multithreshold entropy lin-
ear classifier (2015). arXiv preprint arXiv:1504.04739
11. Karampatziakis, N., Mineiro, P.: Discriminative features via generalized eigenvec-
tors. In: Proceedings of the 31st International Conference on Machine Learning
(ICML 2014), pp. 494–502 (2014)
12. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In:
Advances in Neural Information Processing Systems (NIPS 2014), pp. 2177–2185
(2014)
13. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse
coding. In: Proceedings of the 26th Annual International Conference on Machine
Learning, pp. 689–696. ACM (2009)
14. Principe, J.C., Xu, D., Fisher, J.: Information theoretic learning. Unsupervised
Adaptive Filtering 1, 265–319 (2000)
15. Silverman, B.W.: Density estimation for statistics and data analysis, vol. 26. CRC
Press (1986)
16. Suykens, J.A., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J.:
Least squares support vector machines, vol. 4. World Scientific (2002)
17. Tabor, J., Spurek, P.: Cross-entropy clustering. Pattern Recognition 47(9),
3046–3059 (2014)
18. Wang, L.: Support Vector Machines: theory and applications, vol. 177. Springer
Science & Business Media (2005)
Novel Decompositions of Proper Scoring Rules
for Classification: Score Adjustment
as Precursor to Calibration
1 Introduction
Classifier evaluation is crucial for building better classifiers. Selecting the best
from a pool of models requires evaluation of models on either hold-out data or
through cross-validation with respect to some evaluation measure. An obvious
choice is the same evaluation measure which is later going to be relevant in the
model deployment context.
However, there are situations where the deployment measure is not neces-
sarily the best choice, as in model construction by optimisation. Optimisation
searches through the model space to find ways to improve an existing model
according to some evaluation measure. If this evaluation measure is simply the
error rate, then the model fitness space becomes discrete in the sense that there
are improvements only if some previously wrongly classified instance crosses the
decision boundary. In this case, surrogate losses such as quadratic loss, hinge
loss or log-loss enable SVMs, logistic regression or boosting to converge towards
better models.
¹ Note that the commonly used bias-variance decompositions apply to the loss of a learning algorithm, whereas we are studying the loss of a particular model.
² Our terminology here relates to epistemic and aleatoric uncertainty [10].
³ In some literature a classifier has been called calibrated when it is actually only adjusted, a confusion that we hope to help remove by giving a name for the latter.
Suppose now that the true class y is being sampled from a distribution q over
classes (i.e. q is a probability vector). We denote by s(p, q) the expected score
with rule φ on probability vector p with respect to the class label drawn according
to q:
s(p, q) := E_{Y∼q} φ(p, Y) = Σ_{j=1}^{k} φ(p, e_j) q_j,
In the particular case where q is equal to the true class label y, divergence is
equal to the proper scoring rule itself, i.e. d(p, y) = φ(p, y). In the following we
refer to proper scoring rules as d(p, y) because this makes the decompositions
more intuitive.
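To make these quantities concrete, the small sketch below computes s(p, q) for log-loss and Brier score and takes d(p, q) = s(p, q) − s(q, q), the usual divergence definition, which is consistent with the property d(p, y) = φ(p, y) stated above since s(y, y) = 0 for these two rules (the function names are ours).

```python
import numpy as np

def phi_logloss(p, j):     # score of estimate p when the true class is j
    return -np.log(p[j])

def phi_brier(p, j):
    e = np.zeros(len(p)); e[j] = 1.0
    return np.sum((p - e) ** 2)

def expected_score(phi, p, q):
    # s(p, q) = sum_j phi(p, e_j) q_j
    return sum(phi(p, j) * q[j] for j in range(len(q)))

def divergence(phi, p, q):
    # d(p, q) = s(p, q) - s(q, q); for a one-hot q = y this reduces to phi(p, y).
    return expected_score(phi, p, q) - expected_score(phi, q, q)

p, q = np.array([0.7, 0.3]), np.array([0.4, 0.6])
print(divergence(phi_brier, p, q))    # Brier divergence = ||p - q||^2 = 0.18
print(divergence(phi_logloss, p, q))  # log-loss divergence = KL(q || p)
```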
Proper scoring rules define the loss of a class probability estimator on a single
instance. In practice, we are interested in the performance of the model on test
data. Once the test data are fixed and known, the proper scoring rules provide
the performance measure as the average of instance-wise losses across the test
data. We refer to this as empirical loss. If the test data are drawn randomly from
a (potentially infinite) labelled instance space, then the performance measure can
be defined as the expected loss on a randomly drawn labelled instance. We refer
to this as expected loss.
Empirical loss can be thought of as a special case of expected loss with uni-
form distribution over the test instances and zero probability elsewhere. Indeed,
suppose that the generative model is uniformly randomly picking and outputting
one of the test instances. The empirical loss on the (original) test data and the
expected loss with this generative model are then equal. Therefore, all decom-
positions that we derive for the expected loss naturally apply to the empirical
loss as well, assuming that test data represent the whole population.
Next we introduce our notation in terms of random variables. Let X be a
random variable (a vector) representing the attributes of a randomly picked
instance, and Y = (Y1 , . . . , Yk ) be a random vector specifying the class of that
instance, where Yj = 1 if X is of class j, and Yj = 0 otherwise, for j = 1, 2, . . . , k.
Let now f be a fixed scoring classifier (or class probability estimator), then we
denote by S = (S1 , S2 , . . . , Sk ) = f (X) the score vector output by the classifier
on instance X. Note that S is now a random vector, as it depends on the random
variable X. The expected loss of S with respect to Y under the proper scoring
rule d is E[d(S, Y )].
Example 1. Consider a binary (k = 2) classification test set of 8 instances with
2 features, as shown in column X (i) of Table 1. Suppose the instances with
indices 1,2,3,5,6 are positives (class 1) and the rest are negatives (class 2). This
information is represented in column Y_1^{(i)}, where 1 means ‘class 1’ and 0 means
‘not class 1’.
Suppose we have two models, both predicting 0.9 as the probability of class
1 for the first 4 instances, but differing in the probability estimates for the remaining
4 instances, with 0.3 predicted by the first and 0.4 by the second model.
This information is represented in the columns labelled S_1^{(i)} for both models.
Table 1. Example dataset with 2 classes, with information shown for class 1 only.
The score for class 1 is S1 = 0.3X1 by Model 1 and S1 = 0.25X1 + 0.15 by Model 2,
whereas the optimal model is Q1 = 0.5X2 (or any other model which outputs 1 for first
two instances and 0.5 for the rest). Columns A+,1 , A∗,1 and C1 represent additively
adjusted, multiplicatively adjusted, and calibrated scores, respectively. The average of
each column is presented (mean), as well as log-loss (LL) and Brier score (BS) with
respect to the true labels (Y1 = 1 stands for class 1).
The second model is better according to both log-loss (0.684 < 0.717) and Brier
score (0.47 < 0.5). These can equivalently be considered either as empirical losses
(as they are averages over 8 instances) or as expected losses (if the generative
model picks one of the 8 instances uniformly randomly). The meaning of the
remaining columns in Table 1 will become clear in the following sections.
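The reported numbers can be reproduced directly from the description of the running example; the snippet below uses the class-1 scores and labels given above and recovers log-losses of roughly 0.717 and 0.684 and Brier scores of 0.5 and 0.47 (the two-component Brier convention is assumed here, matching the reported values).

```python
import numpy as np

y1 = np.array([1, 1, 1, 0, 1, 1, 0, 0])                        # instances 1,2,3,5,6 are class 1
s1_model1 = np.array([0.9, 0.9, 0.9, 0.9, 0.3, 0.3, 0.3, 0.3])
s1_model2 = np.array([0.9, 0.9, 0.9, 0.9, 0.4, 0.4, 0.4, 0.4])

def log_loss(s1, y1):
    return np.mean(-(y1 * np.log(s1) + (1 - y1) * np.log(1 - s1)))

def brier_score(s1, y1):   # sum over both class components, as the reported values indicate
    return np.mean((s1 - y1) ** 2 + ((1 - s1) - (1 - y1)) ** 2)

print(log_loss(s1_model1, y1), log_loss(s1_model2, y1))        # ~0.717, ~0.684
print(brier_score(s1_model1, y1), brier_score(s1_model2, y1))  # 0.5, 0.47
```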
For our running example the epistemic log-loss EL_LL for the two models is 0.198
and 0.164 (not shown in Table 1) and the (model-independent) irreducible log-loss
is IL_LL = 0.520, which (as expected) sum up to the total expected log-loss of
0.717 and 0.684, respectively (with the rounding effect in the last digit of 0.717).
For Brier score the decomposition for the two models is 0.5 = 0.125 + 0.375 and
0.47 = 0.095 + 0.375, respectively.
– Calibration Loss CL = E[d(S, C)] is the loss due to the difference between
the model output score S and the proportion of positives among instances
with the same output (calibrated score).
– Refinement Loss RL = E[d(C, Y )] is the loss due to the presence of instances
from multiple classes among the instances with the same estimate S.
For our running example the calibration loss for Brier score CL_BS for the two
models is 0.062 and 0.033 (not shown in Table 1) and the refinement loss is for
both equal to RL_BS = 0.438, which sum up to the total expected Brier scores
of 0.5 and 0.47, respectively (with the rounding effect in the last digit; we omit
this comment in the following cases). For log-loss the decomposition for the two
models is 0.717 = 0.090 + 0.628 and 0.684 = 0.056 + 0.628, respectively.
In practice, calibration has proved to be an efficient way of decreasing proper
scoring rule loss [2]. Calibrating a model means learning a calibration mapping
from the model output scores to the respective calibrated probability scores.
Calibration is simple to perform if the model has only a few possible output
⁵ Actually, in [4] the calibration-refinement decomposition is stated as E[s(S, Y)] = E[d(S, C)] + E[e(C)], but this can easily be shown to be equivalent to our statement.
scores, each covered by many training examples. Then the empirical class dis-
tribution among training instances with the same output scores can be used as
calibrated score vector. However, in general, there might be a single or even no
training instances with the same score vector as the model outputs on a test
instance. Then the calibration procedure needs to make additional assumptions
(inductive bias) about the shape of the calibration map, such as monotonicity
and smoothness.
Regardless of the method, calibration is almost never perfect. Even if per-
fectly calibrated on the training data, the model can suffer some calibration
loss on test data. In the next section we propose an adjustment procedure as a
precursor of calibration. Adjustment does not make any additional assumptions
and is guaranteed to decrease loss if the test class distribution is known exactly.
Ideal scores cannot be obtained in practice, and calibrated scores are hard to
obtain, requiring extra assumptions about the shape of the calibration map.
Here we propose two procedures which take as input the estimated scores and
output adjusted scores such that the mean matches with the given target class
distribution. As opposed to calibration, no labels are required for learning how to
adjust, only the scores and target class distribution are needed. We prove that
additive adjustment is guaranteed to decrease Brier score, and multiplicative
adjustment is guaranteed to decrease log-loss. In both cases we can decompose
the expected loss in a novel way.
4.1 Adjustment
Suppose we are given the class distribution of the test data, represented as a
vector π of length k, with non-negative entries and adding up to 1. It turns out
that if the average of the model output scores on the test data does not match
with the given distribution then for both log-loss and Brier score it is possible
to adjust the scores with guaranteed reduction of loss. First we define what we
mean by adjusted scores.
If the scores are not adjusted, then they can be adjusted using one of the
two following procedures.
Additive (score) adjustment is a procedure applying the following function α_+:

α_+(s) = (s_1 + b_1, . . . , s_k + b_k)   ∀s ∈ R^k,

where b is chosen such that α_+(S) is adjusted to π, i.e. b_j = π_j − E[S_j].
Multiplicative (score) adjustment is a procedure applying the following function α_*:

α_*(s) = (w_1 s_1, . . . , w_k s_k) / Σ_{j=1}^{k} w_j s_j,

where w_j are suitably chosen non-negative weights such that α_*(S) is adjusted
to π. It is not obvious that such weights exist because of the required renormalisation,
but the following theorem gives this guarantee.
Proof. All the proofs are in the Appendix and the extended proofs are available
at http://www.cs.bris.ac.uk/∼flach/Kull Flach ECMLPKDD2015 Supplementary.pdf .
For our running example in Table 1 the additively adjusted and multiplicatively
adjusted scores for class 1 are shown in columns A_{+,1}^{(i)} and A_{*,1}^{(i)}, respectively.
tively. The shift b for additive adjustment was (+0.025, −0.025) for Model 1 and
(−0.025, +0.025) for Model 2. The weights w for multiplicative adjustment were
(1.18, 1) for Model 1 and (1, 1.16) for Model 2. For example, for Model 1 the
scores (0.9, 0.1) (of first four instances) become (1.062, 0.1) after weighting and
(0.914, 0.086) after renormalising (dividing by 1.062 + 0.1 = 1.162). The average
score for class 1 becomes 0.625 for both additive and multiplicative adjustment
and both models, confirming the correctness of these procedures.
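A sketch of both adjustment procedures on the running example for two classes; the multiplicative weight for class 1 is found here by simple bisection rather than the paper's general k-class procedure, and it reproduces the shift (+0.025, −0.025) and the weight ≈ 1.18 reported for Model 1.

```python
import numpy as np

pi = np.array([0.625, 0.375])                       # test class distribution (5 of 8 positive)
S = np.array([[0.9, 0.1]] * 4 + [[0.3, 0.7]] * 4)   # Model 1 score vectors

# Additive adjustment: shift every score vector by b = pi - E[S].
b = pi - S.mean(axis=0)                             # -> (+0.025, -0.025)
A_add = S + b

# Multiplicative adjustment: reweight class 1 by w, renormalise, match E[A] to pi.
def mult_adjust(S, w):
    W = S * np.array([w, 1.0])
    return W / W.sum(axis=1, keepdims=True)

lo, hi = 1e-6, 1e6
for _ in range(100):                                # bisection on the class-1 weight
    w = (lo + hi) / 2
    lo, hi = (w, hi) if mult_adjust(S, w)[:, 0].mean() < pi[0] else (lo, w)
A_mult = mult_adjust(S, w)

print(b, round(w, 2))                               # [ 0.025 -0.025]  1.18
print(A_add[:, 0].mean(), A_mult[:, 0].mean())      # both class-1 means ≈ 0.625
```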
Each adjustment procedure is guaranteed not to increase its respective loss. The guarantee is due to the following novel loss-specific decompositions
(and the non-negativity of divergence):

E[d_BS(S, Y)] = E[d_BS(S, A_+)] + E[d_BS(A_+, Y)],
E[d_LL(S, Y)] = E[d_LL(S, A_*)] + E[d_LL(A_*, Y)],

where A_+ = α_+(S) and A_* = α_*(S) are obtained from the scores S using additive
and multiplicative adjustment, respectively. Note that the additive adjustment
procedure can produce values out of the range [0, 1], but Brier score is
defined for these as well. The decompositions follow from Theorem 4 in Section 5,
which provides a unified decomposition

E[d(S, Y)] = E[d(S, A)] + E[d(A, Y)]
under an extra assumption which links the adjustment method and the loss
measure. Due to this unification we propose the following terminology for the
losses:
– Adjustment Loss AL = E[d(S, A)] is the loss due to the difference between
the mean model output E[S] and the overall class distribution π := E[Y ].
This loss is zero if the scores are adjusted.
– Post-adjustment Loss P L = E[d(A, Y )] is the loss after adjusting the
model output with the method corresponding to the loss measure.
For our running example the adjustment log-loss AL_LL for the two models is
0.0021 and 0.0019 (not shown in Table 1) and the respective post-adjustment
losses PL_LL are 0.7154 and 0.6822, which sum up to the total expected log-loss
of 0.7175 and 0.6841, respectively. For Brier score the decomposition for the two
models is 0.5 = 0.00125 + 0.49875 and 0.47 = 0.00125 + 0.46875, respectively.
In practice, the class distribution is usually not given, and has to be estimated
from training data. Therefore, if the difference between the average output scores
and class distribution is small (i.e. adjustment loss is small), then the benefit of
adjustment might be subsumed by class distribution estimation errors. Experi-
ments about this remain as future work.
So far we have given three different two-term decompositions of expected
loss: epistemic loss plus irreducible loss, calibration loss plus refinement loss,
and adjustment loss plus post-adjustment loss. In the following section we show
that these can all be obtained from a single four-term decomposition, and provide
more terminology and intuition.
This theorem proves the decompositions of Section 3 but adds two more:
– Grouping Loss GL = E[d(C, Q)] is the loss due to many instances being
grouped under the same estimate S while having different true posterior
probabilities Q.
adjusted to the class distribution E[Y ] and the following quantity is a constant
(not a random variable), depending on i, j only:
Note that coherent adjustment might not exist for all proper scoring rules:
then the decompositions involving A do not work, falling back to Theorem 2.
Theorem 4 proves the decompositions in Section 4 and also provides the following
decompositions:
Table 3. The decomposed losses (left) and their values for model 1 of the running
example using log-loss (middle) and Brier score (right).
Decomposed losses (symbolic):
        S   A     C     Q     Y
  S     0   AL    CL    EL    L
  A         0     PCL   PEL   PL
  C               0     GL    RL
  Q                     0     IL
  Y                           0

Log-loss values (LL), Model 1:
        S   A*     C      Q      Y
  S     0   0.002  0.090  0.198  0.717
  A*        0      0.088  0.196  0.715
  C                0      0.108  0.628
  Q                       0      0.520
  Y                              0

Brier score values (BS), Model 1:
        S   A+     C      Q      Y
  S     0   0.001  0.062  0.125  0.5
  A+        0      0.061  0.124  0.499
  C                0      0.062  0.438
  Q                       0      0.375
  Y                              0
Table 3 provides numerical values for all 10 losses of Table 2 for Model 1 in our
running example data (Table 1). The 4-term decomposition proves that the num-
bers right above the main diagonal (AL, PCL, GL, IL) add up to the total loss
at the top right corner (L). All other decompositions can be checked numerically
from the table (taking into account the accumulating rounding errors).
7 Related Work
Proper scoring rules have a long history of research, with Brier score introduced
in 1950 in the context of weather forecasting [3], and the general presentation of
proper scoring rules soon after, see e.g. [11]. The decomposition of Brier score
into calibration and refinement loss (which were back then called reliability and
resolution) was introduced by Murphy [8] and was generalised for proper scoring
rules by DeGroot and Fienberg [5]. The decompositions with three terms were
introduced by Murphy [9] with uncertainty, reliability and resolution (Murphy
reused the same name for a different quantity), later generalised to all proper
scoring rules as well [4]. In our notation these can be stated as E[d(S, Y )] =
REL + UNC − RES = E[d(S, C)] + E[d(π, Y)] − E[d(π, C)]. This can easily be
proved by taking into account that the last term can be viewed as calibration
loss for constant estimator π but segmented in the same way as S.
In machine learning proper scoring rules are often treated as surrogate loss
functions, which are used instead of the 0-1 loss to facilitate optimisation [1]. An
important question in practice is which proper scoring rule to use. One possible
viewpoint is to assume a particular distribution over anticipated deployment
contexts and derive the expected loss from that assumption. Hernández-Orallo
et al. have shown that the Brier score can be derived from a particular additive
cost model [6].
8 Conclusions
This paper proposes novel decompositions of proper scoring rules. All presented
decompositions are sums of expected divergences between original scores S,
adjusted scores A, calibrated scores C, true posterior probabilities Q and true
labels Y . Each such divergence stands for one part of the total expected loss.
Calibration and refinement loss are known losses of this form; the paper proposes
names for the other 7 losses and provides the underlying intuition. In particular,
we have introduced adjustment loss, which arises from the difference
between the mean estimated scores and the true class distribution. While it is a part
of calibration loss, it is easier to eliminate or decrease than calibration loss.
We have proposed the first algorithms for additive and multiplicative adjustment,
which we prove to be coherent with (decomposing) Brier score and log-loss,
respectively. More algorithm development is needed for multiplicative adjust-
ment, as the current algorithm can sometimes fail to converge. An open question
is whether there are other, potentially better coherent adjustment procedures for
these losses. We hope that the proposed decompositions provide deeper insight
into the causes behind losses and facilitate development of better classification
methods, as knowledge about calibration loss has already delivered several cali-
bration methods, see e.g. [2].
References
1. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, Classification, and Risk
Bounds. Journal of the American Statistical Association 101(473), 138–156 (2006)
2. Bella, A., Ferri, C., Hernández-Orallo, J., Ramírez-Quintana, M.J.: On the effect
of calibration in classifier combination. Applied Intelligence 38(4), 566–585 (2012)
3. Brier, G.W.: Verification of forecasts expressed in terms of probability. Monthly
weather review 78(1), 1–3 (1950)
4. Bröcker, J.: Reliability, sufficiency, and the decomposition of proper scores. Quar-
terly Journal of the Royal Meteorological Society 135(643), 1512–1519 (2009)
5. De Groot, M.H., Fienberg, S.E.: The Comparison and Evaluation of Forecasters.
Journal of the Royal Statistical Society. Series D (The Statistician) 32(1/2), 12–22
(1983)
6. Hernández-Orallo, J., Flach, P., Ferri, C.: A unified view of performance met-
rics: translating threshold choice into expected classification loss. The Journal of
Machine Learning Research 13(1), 2813–2869 (2012)
7. Kull, M., Flach, P.A.: Reliability maps: a tool to enhance probability estimates
and improve classification accuracy. In: Calders, T., Esposito, F., Hüllermeier, E.,
Meo, R. (eds.) ECML PKDD 2014, Part II. LNCS, vol. 8725, pp. 18–33. Springer,
Heidelberg (2014)
8. Murphy, A.H.: Scalar and vector partitions of the probability score: Part I. Two-
state situation. Journal of Applied Meteorology 11(2), 273–282 (1972)
9. Murphy, A.H.: A new vector partition of the probability score. Journal of Applied
Meteorology 12(4), 595–600 (1973)
10. Senge, R., Bösner, S., Dembczynski, K., Haasenritter, J., Hirsch, O., Donner-
Banzhoff, N., Hüllermeier, E.: Reliable classification: Learning classifiers that dis-
tinguish aleatoric and epistemic uncertainty. Information Sciences 255, 16–29
(2014)
11. Winkler, R.L.: Scoring Rules and the Evaluation of Probability Assessors. Journal
of the American Statistical Association 64(327), 1073–1078 (1969)
Here we prove the theorems presented in the paper; extended proofs are available
at http://www.cs.bris.ac.uk/~flach/Kull_Flach_ECMLPKDD2015_Supplementary.pdf.
Proof of Theorem 1: If there are any zeros in the vector π, then we can set
the respective positions in the weight vector also to zero and solve the problem
with the remaining classes. Therefore, from now on we assume that all entries
in π are positive.

Let W denote the set of all non-negative (weight) vectors of length k with at
least one non-zero component. We introduce functions t_i : W → ℝ with
t_i(w) = E[w_i S_i / ∑_{j=1}^{k} w_j S_j]. Then we need to find w* such that
t_i(w*) = π_i for i = 1, . . . , k. For this we prove the existence of increasingly
better functions h_0, h_1, . . . , h_{k−1} : W → W such that for m = 0, . . . , k − 1
the function h_m satisfies t_i(h_m(w)) = π_i for i = 1, . . . , m for any w. Then
w* = h_{k−1}(w) is the desired solution, where w ∈ W is any weight vector, such
as the vector of all ones. Indeed, it satisfies t_i(w*) = π_i for i = 1, . . . , k − 1 and
hence also for i = k.

We choose h_0 to be the identity function and prove the existence of the other
functions h_m by induction. Let h_m for m < k − 1 be such that for any w the
vector h_m(w) does not differ from w in positions m + 1, . . . , k and t_i(h_m(w)) = π_i
for i = 1, . . . , m. For a fixed w it is now sufficient to prove the existence of w′
such that it does not differ from w in positions m + 2, . . . , k and t_i(w′) = π_i for
i = 1, . . . , m + 1. We search for such w′ among the vectors h_m(w[m + 1 : x]) with
x ∈ [0, ∞), where w[m + 1 : x] denotes the vector w with the element at position
m + 1 changed into x. The chosen form of w′ guarantees that it does not differ
from w in positions m + 2, . . . , k and that t_i(w′) = π_i for i = 1, . . . , m. It only
remains to choose x such that t_{m+1}(w′) = π_{m+1}. For this we note that for x = 0
we have t_{m+1}(h_m(w[m + 1 : 0])) = 0 because the weight at position m + 1 is zero.
In the limit x → ∞ we have t_{m+1}(h_m(w[m + 1 : x])) → 1 − ∑_{i=1}^{m} π_i because
the weight x at position m + 1 will dominate over the weights at positions
m + 2, . . . , k, whereas the weights at positions 1, . . . , m ensure that
t_i(h_m(w[m + 1 : x])) = π_i for i = 1, . . . , m. Since 0 < π_{m+1} < 1 − ∑_{i=1}^{m} π_i,
the intermediate value theorem guarantees that there exists x such that
t_{m+1}(h_m(w[m + 1 : x])) = π_{m+1}. By this we have proved the existence of a
suitable function h_{m+1}, completing the induction step and the proof.
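The proof above is constructive: each weight is adjusted in turn, and the intermediate value theorem guarantees a suitable value. The sketch below (our illustration, not the authors' algorithm; the repeated coordinate sweeps, the bisection and all constants are assumptions) turns this idea into a simple numerical procedure for multiplicative adjustment, assuming strictly positive scores.

import numpy as np

def multiplicative_adjust(S, pi, sweeps=100, tol=1e-9):
    # Find non-negative weights w such that the mean of the row-renormalised
    # scores w * S equals the target class distribution pi, i.e.
    # E[w_i S_i / sum_j w_j S_j] = pi_i, by coordinate-wise bisection.
    n, k = S.shape
    w = np.ones(k)

    def t(weights):                      # mean adjusted score per class
        A = weights * S
        return (A / A.sum(axis=1, keepdims=True)).mean(axis=0)

    for _ in range(sweeps):
        for i in range(k):
            lo, hi = 0.0, 1.0
            # grow the upper bracket until t_i reaches the target pi_i
            while t(np.r_[w[:i], hi, w[i + 1:]])[i] < pi[i] and hi < 1e12:
                hi *= 2.0
            for _ in range(60):          # bisection on coordinate i
                mid = 0.5 * (lo + hi)
                if t(np.r_[w[:i], mid, w[i + 1:]])[i] < pi[i]:
                    lo = mid
                else:
                    hi = mid
            w = np.r_[w[:i], 0.5 * (lo + hi), w[i + 1:]]
        if np.max(np.abs(t(w) - pi)) < tol:
            break
    return w / w.sum()

rng = np.random.default_rng(0)
S = rng.dirichlet(np.ones(3), size=500)            # n x k predicted class probabilities
pi = np.array([0.5, 0.3, 0.2])                     # target class distribution
w = multiplicative_adjust(S, pi)
A = (w * S) / (w * S).sum(axis=1, keepdims=True)   # multiplicatively adjusted scores
print(A.mean(axis=0))                              # approximately equals pi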
∑_{m=1}^{k} (A_m − δ_{mi})² − ∑_{m=1}^{k} (A_m − δ_{mj})² − ∑_{m=1}^{k} (S_m − δ_{mi})² + ∑_{m=1}^{k} (S_m − δ_{mj})² = const_{ij},

where all summands cancel except those with m = i and m = j. For m = i the
remaining terms are (A_i − 1)² − A_i² − (S_i − 1)² + S_i²; for additive adjustment
this equals the constant −2b_i due to A_i = S_i + b_i. A similar argument holds for
m = j, and as a result we have proved that requirement (1) holds and additive
adjustment is coherent with the Brier score.
Proof of Theorem 4: If none of V_1, V_2, V_3 is A, then the result follows from
Theorem 2. If V_1 = A, then the result follows from Theorem 2 with f^{NEW} = α ∘ f,
because then S^{NEW} = A, C^{NEW} = C and Q^{NEW} = Q. It remains to prove the
result for the case where V_1 = S and V_2 = A. Denote β_j = φ(A, e_1) − φ(A, e_j) −
φ(S, e_1) + φ(S, e_j) for j = 1, . . . , k; then the β_j are all constants. Now it is enough
to prove that the following quantity is zero:

E[∑_{j=1}^{k} (φ(S, e_j) − φ(A, e_j)) (V_{3,j} − A_j)] = E[∑_{j=1}^{k} (β_j + φ(S, e_1) − φ(A, e_1)) (V_{3,j} − A_j)]

= ∑_{j=1}^{k} β_j (E[V_{3,j}] − E[A_j]) + E[(φ(S, e_1) − φ(A, e_1)) (∑_{j=1}^{k} V_{3,j} − ∑_{j=1}^{k} A_j)].
1 Introduction
Most commonly, Bayesian network classifiers (BNCs) are implemented on today's
desktop computers, where double-precision floating-point numbers are used
for parameter representation and arithmetic operations. In these BNCs, inference
and classification are typically performed using the same precision for parameters
and operations, and the executed computations are considered exact. However,
there is a need for BNCs working with limited computational resources. Such
resource-constrained BNCs are important in domains such as ambient comput-
ing, on-satellite computations1 or acoustic environment classification in hearing
F. Pernkopf—This work was supported by the Austrian Science Fund (FWF) under
the project number P25244-N15.
¹ Computational capabilities on satellites are still severely limited due to power
constraints and restricted availability of hardware satisfying the demanding
requirements with respect to radiation tolerance.
aids, machine learning for prosthetic control, e.g. a brain implant to control
hand movements, amongst others. In all these applications, a trade-off between
accuracy and required computational resources is essential.
In this paper, we investigate BNCs with limited computational demands by
considering BNCs with reduced-precision parameters, i.e. fixed-point parameters
with limited precision.2 Using reduced-precision parameters is advantageous in
many ways, e.g. power consumption compared to full-precision implementations
can be reduced [20] and reduced-precision parameters enable one to implement
many BNCs in parallel on field programmable gate arrays (FPGAs), i.e. the cir-
cuit area requirements on the FPGA correlate with the parameter precision [9].
Our investigations are similar to those performed in digital signal-processing,
where reduced-precision implementations for digital signal processors are of great
importance [10]. Note that there is also increased interest in implementing other
machine learning models, e.g. neural networks, using reduced-precision parame-
ters/computations to achieve faster training and to facilitate the implementation
of larger models [2,18].
We are especially interested in learning the reduced-precision parameters using
as little computational resources as possible. To decide on how to perform this
learning, several questions should be answered. Should reduced-precision parameters
be learned in a pre-computation step in which we can exploit the full computational
power of today's computers? Or is it necessary to learn/adapt parameters
using reduced-precision arithmetic only? The answers to these questions depend
on the application of interest and identify several learning scenarios that are sum-
marized in Figure 1. In the following, we discuss these scenarios briefly:
² We are interested in fixed-point arithmetic and not in floating-point arithmetic,
because typically the implementation of fixed-point processing units requires fewer
resources than the implementation of floating-point processing units.
[Fig. 1 (recovered in part): learning scenarios organized by training precision (reduced vs. full) and testing precision; full-precision training with full-precision testing is the classical scenario, e.g. training and testing on PCs, whereas reduced-precision training with full-precision testing is potentially relevant for big-data applications.]
An example is a satellite-based system for remote sensing that tunes its parameters
according to changing atmospheric conditions.
2 Related Work
For undirected graphical models, approximate inference and learning using inte-
ger parameters has been proposed [16]. While undirected graphical models are
more amenable to integer approximations mainly due to the absence of sum-to-
one constraints, there are domains where probability distributions represented
by directed graphical models are desirable, e.g. in expert systems in the medical
domain.
P_B(C = c, X = x) = ∏_{i=0}^{L} ∏_{j∈val(X_i)} ∏_{h∈val(Pa(X_i))} (θ^i_{j|h})^{ν^i_{j|h}},   (2)

where ν^i_{j|h} = 1([c,x](X_i) = j and [c,x](Pa(X_i)) = h).³ We typically represent the BN
parameters in the logarithmic domain, i.e. w^i_{j|h} = log θ^i_{j|h},
w^i = (w^i_{j|h} | j ∈ val(X_i), h ∈ val(Pa(X_i))), and w = (w^0, . . . , w^L). In general,
we will interpret w as a vector whose elements are addressed as w^i_{j|h}. We define a
vector-valued function φ(c, x) of the same length as w, collecting the ν^i_{j|h}
analogously to the entries w^i_{j|h} in w. In that way, we can express the logarithm of (2) as
of fractional bits bf . The addition of two fixed-point numbers can be easily and
accurately performed, while the multiplication of two fixed-point numbers often
leads to overflows and requires truncation to achieve results in the same format.
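To make the fixed-point setting concrete, here is a minimal sketch (our illustration; the split into 4 integer and 6 fractional bits and the saturating behaviour are assumptions) showing why addition stays exact while multiplication needs a truncating shift back into the same format.

# Signed fixed-point format with b_i integer and b_f fractional bits.
B_I, B_F = 4, 6                   # assumed bit split
SCALE = 1 << B_F                  # one unit corresponds to 2**-B_F
LO, HI = -(1 << (B_I + B_F)), (1 << (B_I + B_F)) - 1   # representable integer range

def to_fixed(x: float) -> int:
    # Quantise a real number to the fixed-point grid (saturating).
    return max(LO, min(HI, round(x * SCALE)))

def to_float(q: int) -> float:
    return q / SCALE

def fx_add(a: int, b: int) -> int:
    # Addition is exact as long as the result stays in range.
    return max(LO, min(HI, a + b))

def fx_mul(a: int, b: int) -> int:
    # Multiplication doubles the number of fractional bits, so the result is
    # truncated (shifted right) to return to the same format.
    return max(LO, min(HI, (a * b) >> B_F))

a, b = to_fixed(1.375), to_fixed(-2.25)
print(to_float(fx_add(a, b)), to_float(fx_mul(a, b)))   # -0.875  -3.09375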
d_B(c^{(n)}, x^{(n)}) = P_B(c^{(n)}, x^{(n)}) / max_{c≠c^{(n)}} P_B(c, x^{(n)}),   (6)

and where the hinge loss function is denoted as min(γ, d_B(c^{(n)}, x^{(n)})). The
parameter γ > 1 controls the margin. In this way, the margin measures the
ratio of the likelihood of the n-th sample belonging to the correct class c^{(n)}
to the likelihood of belonging to the most likely competing class. The n-th
sample is correctly classified iff d_B(c^{(n)}, x^{(n)}) > 1.
where η is the learning rate, ∇_w(f)(a) denotes the gradient of f with respect
to w at a, and Π[w] denotes the ℓ2-norm projection of the parameter vector
w onto the set of normalized parameter vectors. Note that the gradient has a
simple form: it consists only of zeros and ones, where the ones are indicators of
active entries in the CPTs of sample (c, x). Furthermore, assuming normalized
parameters at time-step t, the direction of the gradient is always such that the
parameters wML,t+1 are super-normalized. Consequently, after (exact) projec-
tion the parameters satisfy the sum-to-one constraints.
We continue by analyzing the effect of using reduced-precision arithmetic on
the online learning algorithm. To this end, we performed the following experiment:
Assume that the projection can only be approximately performed. We simulate
the approximate projection by performing an exact projection and subsequently
adding quantization noise (this is similar to reduced-precision analysis in signal
processing [10]). We sample the noise from a Gaussian distribution with zero
mean and with variance σ 2 = q 2 /12, where q = 2−bf . For the satimage dataset
from the UCI repository [1] we construct BNCs with TAN structure. As initial
parameters we use rounded ML parameters computed from one tenth of the
training data. Then, we present the classifier further samples in an online manner
and update the parameters according to (9). During learning, we set the learning
rate η to η = η0/√(1 + t), where η0 is some constant (η0 is tuned by hand such that
the test set performance is maximized). The resulting classification performance
is shown in Figures 2a and 2b for the exact and the approximate projection,
respectively. One can observe that the algorithm does not properly learn using
the approximate projection. Thus, it seems crucial to perform the projections
rather accurately. To circumvent the need for accurate projections, we propose
a method that avoids computing a projection at all in the following.
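The following sketch (our reconstruction of the experiment described above; the single-CPT example, the renormalisation used in place of the exact ℓ2-norm projection, and all constants are assumptions) shows how an approximate projection can be simulated by adding quantisation noise of variance q²/12 after an exact projection.

import numpy as np

rng = np.random.default_rng(0)

def project(w_cpt):
    # Enforce the sum-to-one constraint sum_j exp(w_j) = 1 for one CPT.  For
    # this illustration we renormalise in probability space; the paper uses an
    # exact l2-norm projection of the parameter vector.
    p = np.exp(w_cpt)
    return np.log(p / p.sum())

b_f = 6
q = 2.0 ** (-b_f)                        # quantisation interval
sigma = np.sqrt(q ** 2 / 12.0)           # noise std, variance q^2 / 12

w = np.log(np.full(4, 0.25))             # one CPT with 4 states, uniform init
phi = np.array([0.0, 1.0, 0.0, 0.0])     # active-entry indicator of sample (c, x)
eta0, t = 0.1, 10
eta = eta0 / np.sqrt(1 + t)              # decaying learning rate

w_exact = project(w + eta * phi)                                  # exact projection
w_approx = w_exact + rng.normal(0.0, sigma, size=w_exact.shape)   # simulated approximate projection
print(w_exact, w_approx)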
Consider again the offline parameter learning case. ML parameters can be
computed in closed-form by computing relative frequencies, i.e.
θ^i_{j|h} = m^i_{j|h} / m^i_h,   (10)

where

m^i_{j|h} = ∑_{n=1}^{N} φ(c^{(n)}, x^{(n)})^i_{j|h},   and   m^i_h = ∑_j m^i_{j|h}.   (11)

This can be easily extended to online learning. Assume that the counts m^{i,t}_{j|h} at
time t are given and that a sample (c^t, x^t) is presented to the learning algorithm.
Then, the counts are updated according to

m^{i,t+1}_{j|h} = m^{i,t}_{j|h} + φ(c^t, x^t)^i_{j|h}.   (12)
[Fig. 2 panels: classification rate (65–95%, y-axis) vs. number of samples seen (0–20000, x-axis); curves for rounded ML (train/test) and the proposed method (train/test).]
Fig. 2. Classification performance of BNCs with TAN structure for satimage data in an
online learning scenario; (a) Online ML parameter learning with exact projection after
each parameter update, (b) online ML parameter learning with approximate projection
after each parameter update (see text for details), (c) proposed algorithm for online
ML parameter learning.
Exploiting these counts, the logarithm of the ML parameters θ^{i,t}_{j|h} at time t can
be computed as

w^{i,t}_{j|h} = log( m^{i,t}_{j|h} / m^{i,t}_h ),   (13)
L(i, j) = [ log_2(i/j) / q ]_R · q,   (14)

where [·]_R denotes rounding to the closest integer, q is the quantization
interval of the desired fixed-point representation, log_2(·) denotes the base-2
logarithm, and i and j are in the range 0, . . . , M − 1. Given sample (c^t, x^t),
the counts m^{i,t+1}_{j|h} and m^{i,t+1}_h are computed according to Algorithm 1 from the
counts m^{i,t}_{j|h} and m^{i,t}_h. To guarantee that the counts stay in range, the algorithm
identifies counters that reach their maximum value and halves these counters, as
well as all other counters corresponding to the same CPTs. This division by 2
can be implemented as a bitwise shift operation.
m^{i,t+1}_{j|h} ← m^{i,t}_{j|h} + φ(c^t, x^t)^i_{j|h}   ∀i, j, h    ▷ update counts
for i, j, h do
    if m^{i,t+1}_{j|h} = M then    ▷ maximum value of counter reached?
        m^{i,t+1}_{j|h} ← ⌊m^{i,t+1}_{j|h} / 2⌋   ∀j    ▷ halve counters of considered CPT (round down)
    end if
end for
return m^{i,t+1}_{j|h}

div ← 0
while m^{i,t}_h ≥ M do    ▷ ensure that index into lookup table is in range
    m^{i,t}_h ← ⌊m^{i,t}_h / 2⌋    ▷ halve and round down
    div ← div + 1
end while
w^{i,t}_{j|h} ← L(m^{i,t}_{j|h}, m^{i,t}_h)   ∀j    ▷ get log-probability from lookup table
while div > 0 and ∀j : w^{i,t}_{j|h} > (−2^{b_i} + 2^{−b_f}) + 1 do    ▷ revise index correction
    w^{i,t}_{j|h} ← w^{i,t}_{j|h} − 1   ∀j
    div ← div − 1
end while
return w^{i,t}_{j|h}
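A compact Python rendering of the two procedures above may help to see how the lookup table, the counter halving and the index correction interact (our sketch; the per-CPT array layout, the clamping of zero indices in the lookup table and the exact lower bound on w are assumptions).

import numpy as np

M = 1024                       # counters use 10 bits (b_i + b_f = 10)
B_I, B_F = 4, 6                # assumed split of the available bits
q = 2.0 ** (-B_F)              # quantisation interval

# Pre-computed M x M lookup table; entry (i, j) holds the quantised log2(i/j).
# Zero indices are clamped to 1 here, an illustrative simplification.
I, J = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
L = np.round(np.log2(np.maximum(I, 1) / np.maximum(J, 1)) / q) * q

def update_counts(m_cpt, j_active):
    # Algorithm 1 for a single CPT: increment the active count and halve all
    # counts of this CPT (rounding down) once a counter reaches M.
    m_cpt = m_cpt.copy()
    m_cpt[j_active] += 1
    if m_cpt[j_active] == M:
        m_cpt //= 2            # implementable as a bitwise shift
    return m_cpt

def log_params(m_cpt):
    # Algorithm 2 for a single CPT: read w_{j|h} ~ log2(m_{j|h} / m_h) from the
    # lookup table, halving m_h while it is out of range and compensating with
    # the 'div' correction afterwards.
    m_h, div = int(m_cpt.sum()), 0
    while m_h >= M:            # ensure the column index is in range
        m_h //= 2
        div += 1
    w = L[m_cpt, m_h]
    w_min = -(2 ** B_I) + 2 ** (-B_F) + 1    # assumed lower bound of the format
    while div > 0 and np.all(w > w_min):
        w = w - 1              # each halving of m_h shifts log2 by one
        div -= 1
    return w

m = np.array([3, 5, 1, 2])
m = update_counts(m, j_active=1)
print(log_params(m))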
∑_{n=1}^{N} log P_B(c^{(n)}, x^{(n)}) + λ ∑_{n=1}^{N} log min( γ, P_B(c^{(n)}, x^{(n)}) / max_{c≠c^{(n)}} P_B(c, x^{(n)}) ),   (15)

where the first sum is the ML term and the second sum is the MM (margin) term.
In this way, generative properties, e.g. the ability to marginalize over missing
features, are combined with good discriminative performance. This variant of
the MM objective can be easily written in the form
w^MM = arg max_w ∑_{n=1}^{N} φ(c^{(n)}, x^{(n)})^T w + λ ∑_{n=1}^{N} min( γ, min_{c≠c^{(n)}} (φ(c^{(n)}, x^{(n)}) − φ(c, x^{(n)}))^T w )   (16)
and, for simplicity, we will refer to this modified objective as the MM objective.
Note that there are implicit sum-to-one constraints in problem (16), i.e. any
feasible solution w must satisfy ∑_j exp(w^i_{j|h}) = 1 for all i and h. In the online
learning case, given sample (c, x), the parameters w^{MM,t+1} at time t + 1 are
computed from the parameters w^{MM,t} at time t as

w^{MM,t+1} = Π[ w^{MM,t} + η φ(c, x) + η λ g(c, x) ],   (17)

where

g(c, x) = 0                      if min_{c′≠c} (φ(c, x) − φ(c′, x))^T w ≥ γ,
g(c, x) = φ(c, x) − φ(c′, x)     otherwise, with c′ = arg min_{c′≠c} (φ(c, x) − φ(c′, x))^T w.   (18)
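A small sketch of the margin sub-gradient in Eq. (18) and of one projected update step of Eq. (17) (our illustration; the dictionary-based feature map and the abstract projection Pi are assumptions):

import numpy as np

def margin_subgradient(phi, w, c, gamma):
    # phi: dict mapping each class label to its feature vector phi(class, x);
    # w: current parameter vector; c: true class of the sample (c, x).
    others = [cc for cc in phi if cc != c]
    margins = {cc: (phi[c] - phi[cc]) @ w for cc in others}
    c_star = min(margins, key=margins.get)        # strongest competitor
    if margins[c_star] >= gamma:                  # margin already satisfied
        return np.zeros_like(w)
    return phi[c] - phi[c_star]

# One projected online MM update as in Eq. (17); Pi denotes the projection onto
# the sum-to-one constraints and is left abstract here:
#   w = Pi(w + eta * phi[c] + eta * lam * margin_subgradient(phi, w, c, gamma))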
5 Experiments
5.1 Datasets
In our experiments, we considered the following datasets.
1. UCI data [1]. This is in fact a large collection of datasets with small
to medium numbers of samples. Features are discretized as needed using the
algorithm proposed in [3]. If not stated otherwise, in case of the datasets chess,
letter, mofn-3-7-10, segment, shuttle-small, waveform-21, abalone, adult, car,
mushroom, nursery, and spambase, a test set was used to estimate the accuracy
of the classifiers. For all other datasets, classification accuracy was estimated
by 5-fold cross-validation. Information on the number of samples, classes and
features for each dataset can be found in [1].
2. USPS data [6]. This data set contains 11000 handwritten digit images
from zip codes of mail envelopes. The data set is split into 8000 images for
training and 3000 for testing. Each digit is represented as a 16 × 16 greyscale
image. These greyscale values are discriminatively quantized [3] and each pixel
is considered as feature.
3. MNIST Data [8]. This dataset contains 70000 samples of handwritten
digits. In the standard setting, 60000 samples are used for training and 10000
for testing. The digits, represented by grey-level images, were down-sampled by a
factor of two, resulting in a resolution of 14 × 14 pixels, i.e. 196 features.
5.2 Results
We performed experiments using M = 1024, i.e. we used counters with 10 bits
(bi + bf = 10). The splitting of the available bits into integer bits and frac-
tional bits was set using 10-fold cross-validation. Experimental results for BNCs
m^{i,t+1}_{j|h} ← m^{i,t}_{j|h} + φ(c^t, x^t)^i_{j|h}   ∀i, j, h    ▷ update counts (likelihood term)
for i, j, h do    ▷ ensure that parameters stay in range
    if m^{i,t+1}_{j|h} = M then
        m^{i,t+1}_{j|h} ← ⌊m^{i,t+1}_{j|h} / 2⌋   ∀j
    end if
end for
c′ ← strongest competitor of class c^t for features x^t
if (φ(c^t, x^t) − φ(c′, x^t))^T w < γ then
    m^{i,t+1}_{j|h} ← m^{i,t}_{j|h}   ∀i, j, h
    for k = 1, . . . , λ do    ▷ add up gradient in λ steps
        m^{i,t+1}_{j|h} ← m^{i,t+1}_{j|h} + φ(c^t, x^t)^i_{j|h}   ∀i, j, h    ▷ update counts (margin term)
        m^{i,t+1}_{j|h} ← m^{i,t+1}_{j|h} − φ(c′, x^t)^i_{j|h}   ∀i, j, h    ▷ update counts (margin term)
        for i, j, h do    ▷ ensure that parameters stay in range
            if m^{i,t+1}_{j|h} = 0 then
                m^{i,t+1}_{j|h} ← m^{i,t+1}_{j|h} + 1   ∀j
            end if
            if m^{i,t+1}_{j|h} = M then
                m^{i,t+1}_{j|h} ← ⌊m^{i,t+1}_{j|h} / 2⌋   ∀j
            end if
        end for
    end for
end if
return m^{i,t+1}_{j|h}
with NB and TAN structures are shown in Table 1 for the datasets described
above. All samples from the training set were presented to the proposed algo-
rithm twenty times in random order. The absolute reduction in classification
rate (CR) compared to the exact CR, i.e. using BNCs with the optimal double-
precision parameters, for the considered datasets is, with few exceptions, rela-
tively small. Thus the proposed reduced-precision computation scheme seems to
be sufficiently accurate to yield good classification performance while employ-
ing only range-limited counters and a lookup table of size M × M . Clearly, the
performance of the proposed method can be improved by using larger and more
accurate lookup tables and counters with larger bit-width.
For discriminative parameter learning, we selected the hyper-parameters λ ∈
{0, 1, 2, 4, 8, 16} and γ ∈ {0.25, 0.5, 1, 2, 4, 8} using 10-fold cross-validation. For
this setup, we observed the classification performance summarized in Table 1.
While the results are not as good as those of the exact MM solution, in terms of the
absolute reduction in CR, we can clearly observe an improvement in classification
performance using the proposed MM parameter learning method over the proposed
ML parameter learning method for many datasets. The performance of BNCs using
Table 1. Classification rate (CR) in % for ML and MM parameter learning with exact double-precision parameters (exact), the proposed reduced-precision method (prop.), and the absolute reduction in CR (abs.).

Dataset         Structure   ML: exact   prop.   abs.   MM: exact   prop.   abs.
USPS NB 86.89 86.34 0.55 93.91 93.17 0.74
TAN 91.39 90.05 1.34 93.01 93.50 −0.49
MNIST NB 82.88 80.61 2.26 93.11 93.00 0.11
TAN 90.49 87.92 2.57 93.49 93.83 −0.34
australian NB 85.92 85.48 0.44 87.24 85.63 1.61
TAN 81.97 84.46 −2.49 84.76 83.58 1.18
breast NB 97.63 97.48 0.15 97.04 97.63 −0.59
TAN 95.85 96.15 −0.30 96.00 94.52 1.48
chess NB 87.45 86.20 1.25 97.68 94.32 3.36
TAN 92.19 92.13 0.06 97.99 96.27 1.73
cleve NB 82.87 83.55 −0.68 82.53 80.84 1.69
TAN 79.09 80.47 −1.37 80.79 75.69 5.10
corral NB 89.16 89.22 −0.07 93.36 93.36 0.00
TAN 97.53 94.96 2.57 100.00 99.20 0.80
crx NB 86.84 86.22 0.62 86.06 86.68 −0.62
TAN 83.73 84.04 −0.31 84.20 83.58 0.62
diabetes NB 73.96 72.65 1.31 74.87 75.01 −0.14
TAN 73.83 73.44 0.39 74.35 71.73 2.62
flare NB 77.16 75.81 1.34 83.11 83.97 −0.86
TAN 83.59 79.46 4.13 83.30 83.20 0.10
german NB 74.50 72.90 1.60 75.30 73.80 1.50
TAN 72.60 71.80 0.80 72.60 72.10 0.50
glass NB 71.16 71.66 −0.50 70.61 71.08 −0.47
TAN 71.11 69.58 1.53 72.61 69.55 3.05
heart NB 81.85 82.96 −1.11 83.33 84.44 −1.11
TAN 81.48 81.11 0.37 81.48 81.48 0.00
hepatitis NB 89.83 89.83 0.00 92.33 88.67 3.67
TAN 84.83 87.33 −2.50 86.17 88.58 −2.42
letter NB 74.95 74.41 0.54 85.79 81.50 4.30
TAN 86.26 85.93 0.33 88.57 88.43 0.14
lymphography NB 84.23 85.71 −1.48 82.80 87.31 −4.51
TAN 82.20 82.86 −0.66 76.92 80.66 −3.74
nursery NB 89.97 89.63 0.35 93.03 93.05 −0.02
TAN 92.87 92.87 0.00 98.68 98.12 0.56
satimage NB 81.56 82.02 −0.45 88.41 86.96 1.45
TAN 85.85 86.40 −0.55 86.98 87.44 −0.47
segment NB 92.68 91.90 0.78 95.37 93.85 1.52
TAN 94.85 94.89 −0.04 95.76 95.63 0.13
shuttle NB 99.66 99.10 0.56 99.95 99.86 0.09
TAN 99.88 99.71 0.17 99.93 99.87 0.06
soybean-large NB 93.35 92.80 0.56 91.50 92.05 −0.55
TAN 91.14 89.12 2.02 91.87 92.61 −0.74
spambase NB 90.03 89.88 0.15 94.08 93.19 0.89
TAN 92.97 92.79 0.17 94.03 93.73 0.31
vehicle NB 61.57 61.93 −0.36 67.95 69.16 −1.21
TAN 71.09 68.91 2.18 69.88 69.16 0.72
vote NB 90.16 90.63 −0.47 94.61 94.61 0.00
TAN 94.61 94.60 0.01 95.31 94.60 0.71
waveform-21 NB 81.14 81.18 −0.04 85.14 84.16 0.98
TAN 82.52 82.20 0.32 83.48 83.94 −0.46
6 Discussions
We proposed online algorithms for learning BNCs with reduced-precision fixed-
point parameters using reduced-precision computations only. This facilitates the
utilization of BNCs in computationally constrained platforms, e.g. embedded-
and ambient-systems, as well as power-aware systems. The algorithms differ
from naive implementations of conventional algorithms by avoiding error-prone
parameter projections commonly used in gradient ascent/descent algorithms. In
experiments, we demonstrated that our algorithms yield parameters that achieve
classification performances close to that of optimal double-precision parameters
for many of the investigated datasets.
Our algorithms have similarities with a very simple method for learn-
ing discriminative parameters of BNCs known as discriminative frequency
estimates [19]. According to this method, parameters are estimated using a
perceptron-like algorithm, where parameters are updated by the prediction loss,
i.e. the difference of the class posterior of the correct class (which is assumed to
be 1 for the data in the training set) and the class posterior according to the
model using the current parameters.
References
1. Bache, K., Lichman, M.: UCI Machine Learning Repository (2013). http://archive.
ics.uci.edu/ml
2. Courbariaux, M., Bengio, Y., David, J.: Low precision arithmetic for deep learning.
CoRR abs/1412.7024 (2014). http://arxiv.org/abs/1412.7024
3. Fayyad, U.M., Irani, K.B.: Multi-Interval discretization of continuous-valued
attributes for classification learning. In: International Conference on Artificial Intel-
ligence (IJCAI), pp. 1022–1029 (2003)
4. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian Network Classifiers. Machine
Learning 29, 131–163 (1997)
5. Guo, Y., Wilkinson, D., Schuurmans, D.: Maximum Margin Bayesian Networks.
In: Uncertainty in Artificial Intelligence (UAI), pp. 233–242 (2005)
6. Hull, J.J.: A Database for Handwritten Text Recognition Research. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence (TPAMI) 16(5), 550–554
(1994)
7. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Tech-
niques. MIT Press (2009)
8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based Learning Applied
to Document Recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
9. Lee, D.U., Gaffar, A.A., Cheung, R.C.C., Mencer, O., Luk, W., Constan-
tinides, G.A.: Accuracy-Guaranteed Bit-Width Optimization. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems 25(10), 1990–2000
(2006)
10. Oppenheim, A.V., Schafer, R.W., Buck, J.R.: Discrete-time Signal Processing, 2nd
edn., Prentice-Hall Inc. (1999)
11. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann Publishers Inc. (1988)
12. Peharz, R., Tschiatschek, S., Pernkopf, F.: The most generative maximum margin
bayesian networks. In: International Conference on Machine Learning (ICML), vol.
28, pp. 235–243 (2013)
13. Pernkopf, F., Peharz, R., Tschiatschek, S.: Introduction to Probabilistic Graphical
Models, vol. 1, chap. 18, pp. 989–1064. Elsevier (2014)
14. Pernkopf, F., Wohlmayr, M., Mücke, M.: Maximum margin structure learning of
bayesian network classifiers. In: International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 2076–2079 (2011)
15. Pernkopf, F., Wohlmayr, M., Tschiatschek, S.: Maximum Margin Bayesian Net-
work Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI) 34(3), 521–531 (2012)
16. Piatkowski, N., Sangkyun, L., Morik, K.: The integer approximation of undirected
graphical models. In: International Conference on Pattern Recognition Applica-
tions and Methods (ICPRAM) (2014)
17. Roos, T., Wettig, H., Grünwald, P., Myllymäki, P., Tirri, H.: On Discriminative
Bayesian Network Classifiers and Logistic Regression. Journal of Machine Learning
Research 59(3), 267–296 (2005)
18. Soudry, D., Hubara, I., Meir, R.: Expectation backpropagation: parameter-free
training of multilayer neural networks with continuous or discrete weights. In:
Advances in Neural Information Processing Systems, pp. 963–971 (2014)
19. Su, J., Zhang, H., Ling, C.X., Matwin, S.: Discriminative parameter learning for
bayesian networks. In: International Conference on Machine Learning (ICML), pp.
1016–1023. ACM (2008)
20. Tong, J.Y.F., Nagle, D., Rutenbar, R.A.: Reducing Power by Optimizing the Nec-
essary Precision/Range of Floating-point Arithmetic. IEEE Transactions on Very
Large Scale Integration Systems 8(3), 273–285 (2000)
21. Tschiatschek, S., Cancino Chacón, C.E., Pernkopf, F.: Bounds for bayesian net-
work classifiers with reduced precision parameters. In: International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 3357–3361 (2013)
22. Tschiatschek, S., Paul, K., Pernkopf, F.: Integer bayesian network classifiers. In:
Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014, Part
III. LNCS, vol. 8726, pp. 209–224. Springer, Heidelberg (2014)
23. Tschiatschek, S., Pernkopf, F.: On Reduced Precision Bayesian Network Classifiers.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (to be
published)
24. Tschiatschek, S., Reinprecht, P., Mücke, M., Pernkopf, F.: Bayesian network classi-
fiers with reduced precision parameters. In: Flach, P.A., De Bie, T., Cristianini, N.
(eds.) ECML PKDD 2012, Part I. LNCS, vol. 7523, pp. 74–89. Springer, Heidelberg
(2012)
Predicting Unseen Labels Using Label
Hierarchies in Large-Scale Multi-label Learning
1 Introduction
Multi-label classification is an area of machine learning which aims to learn a
function that maps instances to a label space. In contrast to multiclass classifi-
cation, each instance is assumed to be associated with more than one label. One
of the goals in multi-label classification is to model the underlying structure of
the label space because in many such problems, the occurrences of labels are not
independent of each other.
Recent developments in multi-label classification can be roughly divided into
two bodies of research. One is to build a classifier in favor of statistical dependen-
cies between labels, and the other is devoted to making use of prior information
over the label space. In the former area, many attempts have been made to exploit
label patterns [6, 9,24]. As the number of possible configurations of labels grows
exponentially with respect to the number of labels, it is required for multi-label
classifiers to handle many labels efficiently [4] or to reduce the dimensionality of
a label space by exploiting properties of label structures such as sparsity [17] and
co-occurrence patterns [7]. Label space dimensionality reduction (LSDR) meth-
ods allow to make use of latent information on a label space as well as to reduce
computational cost. Another way of exploiting information on a label space is to
use its underlying structures as a prior. Many methods have been developed to
use hierarchical output structures in machine learning [27]. In particular, several
researchers have looked into utilizing the hierarchical structure of the label space
for improved predictions in multi-label classification [26,30,32].
Although extensive research has been devoted to techniques for utilizing
implicitly or explicitly given label structures, there remain the scalability issues
of previous approaches in terms of both the number of labels and documents
in large feature spaces. Consider a very large collection of scientific documents
covering a wide range of research interests. In an emerging research area, it can
be expected that the number of publications per year grows rapidly. Moreover,
new topics will emerge, so that the set of indexing terms, which has initially
been provided by domain experts or authors to describe publications with few
words for potential readers, will grow as well.
Interestingly, similar problems have been faced recently in a different domain,
namely representation learning [2]. In language modeling, for instance, a word
is traditionally represented by a K-dimensional vector where K is the number
of unique words, typically hundreds of thousands or several millions. Clearly, it
is desirable to reduce this dimensionality to a much smaller value d ≪ K. This
can, e.g., be achieved with a simple log-linear model [21], which can efficiently
compute a so-called word embedding, i.e., a lower-dimensional vector representation
for words. Another example of representation learning is a technique
for learning a joint embedding space of instances and labels [31]. This approach
maximizes the similarity between vector representations of instances and rele-
vant labels while projecting them into the same space.
Inspired by the log-linear model and the joint space embedding, we address
large-scale multi-label classification problems, in which both hierarchical label
structures are given a priori as well as label patterns occur in the training data.
The mapping functions in the joint space embedding method can be used to rank
labels for a given instance, so that relevant labels are placed at the top of the
ranking. In other words, the quality of such a ranking depends on the mapping
functions. As mentioned, two types of information on label spaces are expected to
help us to train better joint embedding spaces, so that the performance on unseen
data can be improved. We focus on exploiting such information so as to learn a
mapping function projecting labels into the joint space. The vector representa-
tions of labels by using this function will be referred to as label embeddings. While
label embeddings are usually initialized randomly, it will be beneficial to learn the
joint space embedding method taking label hierarchies into consideration when
label structures are known. To this end, we adopt the above-mentioned log-linear
model which has been successfully used to learn word embeddings.
Learning word embeddings relies fundamentally on the use of the context
information, that is, a fixed number of words surrounding that word in a sentence
2 Multi-label Classification
In multi-label classification, assuming that we are given a set of training examples
D = {(x_n, Y_n)}_{n=1}^{N}, our goal is to learn a classification function f : x → Y which
maps an instance x to a set of relevant labels Y ⊆ {1, 2, · · · , L}. All other labels
Ȳ = {1, 2, · · · , L} \ Y are called irrelevant. Often, it is sufficient, or even required,
to obtain a list of labels ordered according to some relevance scoring functions.
In hierarchical multi-label classification (HMLC) labels are explicitly orga-
nized in a tree usually denoting a is-a or composed-of relation. Several
approaches to HMLC have been proposed which replicate this structure with
a hierarchy of classifiers which predict the paths to the correct labels [5, 30,32].
Although there is evidence that exploiting the hierarchical structure in this way
has advantages over the flat approach [3, 5, 30], some authors unexpectedly found
that ignoring the hierarchical structure gives better results. For example, in [32]
it is claimed that, if a strong flat classification algorithm is used, this advantage vanishes.
Similarly, in [30] it was found that learning a single decision tree which predicts
probability distributions at the leaves outperforms a hierarchy of decision trees.
One of the reasons may be that hierarchical relations in the output space are
often not in accordance with the input space, as claimed by [15] and [32]. Our
proposed approach aims at overcoming this problem as it learns an embedding
space where similarities in the input, output and label hierarchies are jointly
respected.
3 Model Description
3.1 Joint Space Embeddings
Weston et al. [31] proposed an efficient online method to learn ranking functions
in a joint space of instances and labels, namely Wsabie. Under the assumption
that instances which have similar representation in a feature space tend to be
associated with similar label sets, we find joint spaces of both instances and labels
where the relevant labels for an instance can be separated from the irrelevant
ones with high probability.
Formally, consider an instance x of dimension D and a set of labels Y associ-
ated with x. Let φ(x) = Wx denote a linear function which projects the original
with the pairwise hinge loss function ℓ(x_n, y_i, y_j) = [m_a − u_i^T φ(x_n) + u_j^T φ(x_n)]_+,
where r_i(·) denotes the rank of label i for a given instance x_n, h(·) is a function
that maps this rank to a real number (to be introduced shortly in more detail),
Ȳ_n is the complement of Y_n, [x]_+ is defined as x if x > 0 and 0 otherwise,
Θ_F = {W, U} are the model parameters, and m_a is a real-valued parameter, namely
the margin. The relevance scores s(x) = [s_1(x), s_2(x), · · · , s_L(x)] of the labels for a
given instance x can be computed as s_i(x) = u_i^T φ(x) ∈ ℝ. Then, the rank of label
i with respect to an instance x can be determined based on the relevance scores,
r_i(x) = ∑_{j∈Ȳ, j≠i} [m_a − s_i(x) + s_j(x)]_+. It is prohibitively expensive to compute
such rankings exactly when L is large. We use instead an approximation to update
the parameters Θ_F, given by r_i(x) ≈ ⌊(L − |Y|)/T_i⌋, where ⌊·⌋ denotes the floor
function and T_i is the number of trials needed to sample an index j yielding an
incorrect ranking against label i, i.e. such that m_a − s_i(x) + s_j(x) > 0, during the
stochastic parameter update steps. Having an approximate rank r_i(x), we can obtain
a weighted ranking function h(r_i(x)) = ∑_{k=1}^{r_i(x)} 1/k, which is shown to be an
effective way of optimizing precision at the top of rankings.
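A hypothetical sketch (not the authors' implementation; the sampling loop and the toy scores are assumptions) of the sampled rank approximation and the weighting h(·) described above:

import numpy as np

def approximate_rank(scores, relevant, i, margin, rng):
    # scores: s_j(x) for all L labels; relevant: set of relevant labels Y;
    # i: the relevant label being updated.  Returns (estimated rank, violating j).
    L = len(scores)
    irrelevant = [j for j in range(L) if j not in relevant]
    for trials, j in enumerate(rng.permutation(irrelevant), start=1):
        if margin - scores[i] + scores[j] > 0:       # margin violated
            return (L - len(relevant)) // trials, j  # floor((L - |Y|) / T_i)
    return 0, None                                   # no violation found

def h(rank):
    # Weighted ranking function: sum of 1/k for k = 1..rank.
    return sum(1.0 / k for k in range(1, rank + 1))

rng = np.random.default_rng(0)
scores = rng.normal(size=50)
rank, j = approximate_rank(scores, relevant={3, 7}, i=3, margin=1.0, rng=rng)
print(rank, h(rank))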
[Figure: a document (Doc) and its relevant labels (y_i, y_k, y_n) shown together with ancestor labels (y_s) and descendant labels of y_i in the label hierarchy.]
where m_b is the margin, Z_A = |Y_n||S_A(i)|, and p(y_s | y_i, x_n) denotes the probability
of predicting an ancestor label s of a label i given i and an instance x_n for which
i is relevant. More specifically, the probability p(y_s | y_i, x_n) can be defined as

p(y_s | y_i, x_n) = exp(u_s^T û_i^{(n)}) / ∑_{v∈L} exp(u_v^T û_i^{(n)}),   (3)

where û_i^{(n)} = ½ (u_i + φ(x_n)) is the averaged representation of label i and the
n-th instance in the joint space. Intuitively, this regularizer forces labels that
share the same parent label to have vector representations that are as similar as
possible while keeping them separated from each other. Moreover, an instance
x has the potential to make good predictions on some labels even though they
do not appear in the training set, but only if their descendants are associated with
training instances.
Adding Ω to Eq. 1 results in the objective function of Wsabie H:

L(Θ_H; D) = ∑_{n=1}^{N} ∑_{i∈Y_n} ∑_{j∈Ȳ_n} h(r_i(x_n)) ℓ(x_n, y_i, y_j) + λ Ω(Θ_H)   (4)
Due to the high computational cost of computing gradients for the softmax
function in Eq. 3, we use the hierarchical softmax [22,23], which reduces the gradient
computation cost from O(L) to O(log L). Similar to [21], in order to make use of
the hierarchical softmax, a binary tree is constructed by Huffman coding, which
assigns a variable-length binary code to each label according to |S_D(·)|. Note
that by definition of the Huffman coding all L labels correspond to leaf nodes
in a binary tree, called the Huffman tree. Instead of computing L outputs, the
hierarchical softmax computes a probability of log L binary decisions over a
path from the root node of the tree to the leaf node corresponding to a target
label, say, yj in Eq. 3.
More specifically, let C(y) be a codeword of a label y by the Huffman coding,
where each bit can be either 0 or 1, and I(C(y)) be the number of bits in the
codeword for that label. Cl (y) is the l-th bit in y’s codeword. Unlike for softmax,
for computing the hierarchical softmax we use the output label representations
U as vector representations for inner nodes in the Huffman tree. The hierarchical
softmax is then given by
p(y_j | y_i) = ∏_{l=1}^{I(C(y_j))} σ( ⟦C_l(y_j) = 0⟧ · u′_{n(l,y_j)}^T u_i )   (5)

where σ(·) is the logistic function, ⟦·⟧ is 1 if its argument is true and −1 otherwise,
and u′_{n(l,y_j)} is the vector representation of the l-th node in the path from the
root node to the node corresponding to the label y_j in the Huffman tree. While
L inner products are required to compute the normalization term in Eq. 3, the
hierarchical softmax needs only I(C(·)) computations. Hence, the hierarchical softmax
allows substantial improvements in computing gradients if E[I(C(·))] ≪ L.
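A minimal sketch of Eq. (5) (ours; the Huffman code bits and the inner-node vectors are assumed to be given): the probability of label y_j is a product of logistic decisions along its Huffman-code path.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax(u_i, code_bits, inner_node_vecs):
    # u_i: embedding of the conditioning label y_i;
    # code_bits: bits C_l(y_j) of y_j's Huffman codeword;
    # inner_node_vecs: vectors u'_{n(l, y_j)} along the root-to-leaf path.
    p = 1.0
    for bit, u_node in zip(code_bits, inner_node_vecs):
        sign = 1.0 if bit == 0 else -1.0     # [[C_l(y_j) = 0]] is +1 or -1
        p *= sigmoid(sign * (u_node @ u_i))
    return p

d = 8
rng = np.random.default_rng(1)
u_i = rng.normal(size=d)
path = [rng.normal(size=d) for _ in range(3)]    # ~log L inner nodes
print(hierarchical_softmax(u_i, code_bits=[0, 1, 0], inner_node_vecs=path))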
4 Experimental Setup
Datasets. We benchmark our proposed method on two textual corpora con-
sisting of a large number of documents and with label hierarchies provided.
The RCV1-v2 dataset [19] is a collection of newswire articles. There are 103
labels and they are organized in a tree. Each label belongs to one of four major
categories. The original train/test split in the RCV1-v2 dataset consists of 23,149
training documents and 781,265 test documents. In our experiments, we switched
the training and the test data, and selected the top 20,000 words according to
the document frequency. We chose randomly 10,000 training documents as the
validation set.
The second corpus is the OHSUMED dataset [16] consisting of 348,565 sci-
entific articles from MEDLINE. Each article has multiple index terms known
as Medical Subject Headings (MeSH). In this dataset, the training set contains
articles from year 1987 while articles from 1988 to 1991 belong to the test set.
We map all MeSH terms in the OHSUMED dataset to 2015 MeSH vocabulary1
in which 27,483 MeSH terms are organized in a DAG hierarchy. Originally, the
OHSUMED collection consists of 54,710 training documents and 293,856 test
documents. Having removed all MeSH terms that do not appear in the 2015
MeSH vocabulary, we excluded all documents that have no label from the cor-
pus. To represent documents in a vector space, we selected unigram words that
¹ http://www.nlm.nih.gov/pubs/techbull/so14/so14_2015_mesh_avail.html
occur more than 5 times in the training set. These pre-processing steps left us
with 36,883 train documents and 196,486 test documents. Then, 10% of the
training documents were randomly set aside for the validation set. Finally, for
both datasets, we applied log tf-idf term-weighting and then normalized docu-
ment vectors to unit length.
Table 2. Ranking performance (AvgP and RL, ×100) on RCV1-v2 and OHSUMED.

                 RCV1-v2                                       OHSUMED
         BR     PLST   CPLST  CLRsvm  Wsabie  WsabieH     BR     PLST   Wsabie  WsabieH
AvgP   94.20   92.75   92.76   94.76   94.34    94.39    45.00   26.50   45.72    45.76
RL      0.46    0.78    0.76    0.40    0.44     0.44     4.48   15.06    4.09     3.72

false negatives for label l, respectively. Throughout the paper, we present the
evaluation scores of these measures multiplied by 100.
denotes a base learning rate, which decreases by a factor of 0.99 per epoch. We
implemented our proposed methods using a lock-free parallel gradient update
scheme [25], namely Hogwild!, in a shared-memory system, since the set of
parameters involved in each update is sparse even though the whole parameter
space is large. For BR and CLRsvm, LIBLINEAR [12] was used as the base learner,
and the regularization parameter C ∈ {10^{−2}, 10^0, 10^2, 10^4, 10^6} was chosen on the
validation sets.
5 Experimental Results
5.1 Learning All Labels Together
Table 2 compares our proposed algorithm, Wsabie H , with the baselines on the
benchmark datasets in terms of two ranking measures. It can be seen that
CLRsvm outperforms the others including Wsabie H on the RCV1-v2 dataset,
but the performance gap across all algorithms in our experiments is not large.
Even BR ignoring label relationship works competitively on this dataset. Also,
no difference between Wsabie and Wsabie H was observed. This is attributed to
characteristics of the RCV1-v2 dataset that if a label corresponding to one of
the leaf nodes in the label hierarchy is associated with an instance, (almost) all
nodes in a path from the root node to that node are also present, so that the
hierarchical information is implicitly present in the training data.
Let us now turn to the experimental results on the OHSUMED dataset which
are shown on the right-hand side of Table 2. Since the dataset consists of many
labels, as an LSDR approach, we include PLST only in this experiment because
Table 3. Zero-shot results (×100) on the modified datasets.

            RCV1-v2                         OHSUMED
          AvgP     RL    MiF   MaF       AvgP     RL    MiF   MaF
Wsabie    2.31   62.29   0.00  0.00      0.01   56.37   0.00  0.00
WsabieH   9.47   30.39   0.50  1.64      0.06   39.91   0.00  0.00
Over the last few years, there has been an increasing interest in zero-shot learn-
ing, which aims to learn a function that maps instances to classes or labels that
have not been seen during training. Visual attributes of an image [18] or textual
description of labels [13, 28] may serve as additional information for zero-shot
learning algorithms. In contrast, in this work, we focus on how to exploit label
hierarchies and co-occurrence patterns of labels to make predictions on such
unseen labels. The reason is that in many cases it is difficult to get additional
information for some specific labels from external sources. In particular, while
using a semantic space of labels’ textual description is a promising way to learn
vector representations of labels, sometimes it is not straightforward to find suit-
able mappings of specialized labels.
Table 3 shows the results of Wsabie against Wsabie H on the modified datasets
which do not contain any known label in the test set (cf. Sec. 4). As can be seen,
Wsabie H clearly outperforms Wsabie on both datasets across all measures except
for MiF and MaF on the OHSUMED dataset. Note that the key difference between
Wsabie H and Wsabie is the use of hierarchical structures over labels during the
training phase. Since the labels in the test set do not appear during training, Wsa-
bie can basically only make random predictions for the unknown labels. Hence,
the comparison shows that taking only the hierarchical relations into account
already enables a considerable improvement over the baseline. Unfortunately, the
effect is not substantial enough in order to be reflected w.r.t. MiF and MaF on
OHSUMED. Note, however, that a relevant, completely unknown label must be
ranked approximately as one of the top 4 labels out of 17,913 in order to count for
bipartition measures in this particular setting.
∑_{n=1}^{N} [ (1 − α)/Z_A ∑_{i∈Y_n} ∑_{j∈S_A(i)} log p(y_j | y_i) + α/Z_N ∑_{i∈Y_n} ∑_{k∈Y_n, k≠i} log p(y_k | y_i) ]   (6)
Fig. 2. Visualization of learned label embeddings by the log-linear model (Eq. 6): (left)
using only label co-occurrence patterns, α = 1; (middle) using a hierarchy as well as
co-occurrences, α = 0.5; (right) using only a hierarchy, α = 0.
the hidden activations to the output layer. Here, U and U′ correspond to the vector
representations of input labels and output labels, respectively. Like Eq. 3, we use the
hierarchical softmax instead of Eq. 7 to speed up pre-training of the label embeddings.
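The following rough sketch (ours, not the authors' code; it uses a plain softmax instead of the hierarchical softmax, toy data and an assumed chain-shaped ancestor map) illustrates how label embeddings can be pre-trained with the log-linear model of Eq. (6), mixing hierarchy edges and co-occurring labels via α.

import numpy as np

rng = np.random.default_rng(0)
L_labels, d, alpha = 100, 16, 0.5
U_in = 0.1 * rng.normal(size=(L_labels, d))     # input label embeddings U
U_out = 0.1 * rng.normal(size=(L_labels, d))    # output label embeddings U'

def sgd_pair(U_in, U_out, i, target, lr=0.05):
    # One softmax gradient-ascent step on log p(target | i).
    logits = U_out @ U_in[i]
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad_out = -np.outer(p, U_in[i])            # d log p / d U_out
    grad_out[target] += U_in[i]
    grad_in = U_out[target] - p @ U_out         # d log p / d U_in[i]
    U_out += lr * grad_out
    U_in[i] += lr * grad_in

def train_step(label_set, ancestors):
    # label_set: labels of one training document; ancestors: label -> ancestor list.
    for i in label_set:
        if rng.uniform() < alpha and len(label_set) > 1:     # co-occurrence term
            sgd_pair(U_in, U_out, i, rng.choice([k for k in label_set if k != i]))
        elif ancestors.get(i):                               # hierarchy term
            sgd_pair(U_in, U_out, i, rng.choice(ancestors[i]))

# Toy data: random label sets and a chain-shaped label hierarchy (assumptions).
docs = [rng.choice(L_labels, size=4, replace=False).tolist() for _ in range(200)]
ancestors = {l: [l - 1] for l in range(1, L_labels)}
for doc in docs:
    train_step(doc, ancestors)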
the middle in Fig. 2), we performed analogical reasoning on both the represen-
tations trained with the hierarchy and ones without the hierarchy, specifically,
regarding Therapy-Disorders/Diseases relationships (Table 4). As expected, the
label representations trained with the hierarchy appear clearly advantageous for
analogical reasoning compared to the ones trained without the hierarchy. To
be more specific, consider the first example, where we want to know what kinds
of therapies are effective on “Respiration Disorders” as the relationship between
“Diet Therapy” and “Cardiovascular Diseases”. When we perform such analog-
ical reasoning using learned embeddings with the hierarchy, the most probable
answers to this analogy question are therapies that can be used to treat “Respi-
ration Disorders” including nutritional therapies. Unlike the learned embeddings
with the hierarchy, the label embeddings without the hierarchy perform poorly.
In the bottom-right of Table 4, “Phobic Disorders” can be considered as a type
of anxiety disorders that occur commonly together with “Post-traumatic Stress
Disorders (PTSD)” rather than a treatment of it.
6.2 Results
The results on the modified zero-shot learning datasets in Table 5 show that
we can obtain substantial improvements by the pretrained label embeddings.
Please note that the scores obtained by using random label embeddings on the
left in Table 5 are the same as those of Wsabie and Wsabie H in Table 3. In
this experiment, we used very small base learning rates (i.e., η0 = 10^{−4}, chosen
by validation) for updating the label embeddings in Eq. 4 after they have been
initialized by the pretrained ones.
Table 6. Evaluation on the full test data of the OHSUMED dataset. Numbers in
parentheses are standard deviation over 5 runs. Subscript P denotes the use of pre-
trained label embeddings.
This means that our proposed method is trained in a way that maps a document
to some point in the label embedding space while the label embeddings themselves
hardly change. In fact, the pretrained label embeddings
have interesting properties shown in Section 6.1, so that Wsabie starts learning
at good initial parameter spaces. Interestingly, it was observed that some of
the unseen labels are placed at the top of rankings for test instances, so that
relatively higher scores of bipartition measures are obtained even for Wsabie. We
also performed an experiment on the full OHSUMED dataset. The experimental
results are given in Table 6. Wsabie HP, which combines pretrained label embeddings
with hierarchical label structures, is able to improve further, outperforming each of
the two extensions on its own across all measures.
7 Conclusions
We have presented a method that learns a joint space of instances and labels taking
hierarchical structures of labels into account. This method is able to learn repre-
sentations of labels, which are not presented during the training phase, by leverag-
ing label hierarchies. We have also proposed a way of pretraining label embeddings
from huge amounts of label patterns and hierarchical structures of labels.
We demonstrated the joint space learning method on two multi-label text cor-
pora that have different types of label hierarchies. The empirical results showed
that our approach can be used to place relevant unseen labels on the top of the
ranked list of labels. In addition to the quantitative evaluation, we also analyzed
label representations qualitatively via a 2D-visualization of label representa-
tions. This analysis showed that using hierarchical structures of labels allows us
to assess vector representations of labels by analogical reasoning. Further studies
should be carried out to make use of such regularities in label embeddings at
testing time.
References
1. Balikas, G., Partalas, I., Ngomo, A.N., Krithara, A., Paliouras, G.: Results of the
BioASQ track of the question answering lab at CLEF 2014. In: Working Notes for
CLEF 2014 Conference, pp. 1181–1193 (2014)
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new
perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence
35(8), 1798–1828 (2013)
3. Bi, W., Kwok, J.T.: Multilabel classification on tree- and DAG-structured hierar-
chies. In: Proceedings of the 28th International Conference on Machine Learning,
pp. 17–24 (2011)
4. Bi, W., Kwok, J.T.: Efficient multi-label classification with many labels. In: Proc.
of the 30th International Conference on Machine Learning, pp. 405–413 (2013)
5. Cesa-Bianchi, N., Gentile, C., Zaniboni, L.: Incremental algorithms for hierarchical
classification. Journal of Machine Learning Research 7, 31–54 (2006)
6. Chekina, L., Gutfreund, D., Kontorovich, A., Rokach, L., Shapira, B.: Exploiting
label dependencies for improved sample complexity. Machine Learning 91(1), 1–42
(2013)
7. Chen, Y.N., Lin, H.T.: Feature-aware label space dimension reduction for multi-
label classification. In: Advances in Neural Information Processing Systems,
pp. 1529–1537 (2012)
8. Crammer, K., Singer, Y.: A family of additive online algorithms for category rank-
ing. The Journal of Machine Learning Research 3, 1025–1058 (2003)
9. Dembczyński, K., Waegeman, W., Cheng, W., Hüllermeier, E.: On label depen-
dence and loss minimization in multi-label classification. Machine Learning
88(1–2), 5–45 (2012)
10. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learn-
ing and stochastic optimization. The Journal of Machine Learning Research 12,
2121–2159 (2011)
11. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. Advances
in Neural Information Processing Systems 14, 681–687 (2001)
12. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A
library for large linear classification. The Journal of Machine Learning Research 9,
1871–1874 (2008)
13. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov,
T.: Devise: A deep visual-semantic embedding model. Advances in Neural Infor-
mation Processing Systems 26, 2121–2129 (2013)
14. Fürnkranz, J., Hüllermeier, E., Loza Mencı́a, E., Brinker, K.: Multilabel classifica-
tion via calibrated label ranking. Machine Learning 73(2), 133–153 (2008)
15. Fürnkranz, J., Sima, J.F.: On exploiting hierarchical label structure with pairwise
classifiers. SIGKDD Explorations 12(2), 21–25 (2010)
16. Hersh, W., Buckley, C., Leone, T.J., Hickam, D.: Ohsumed: an interactive retrieval
evaluation and new large test collection for research. In: Proceedings of the 17th
Annual International ACM SIGIR Conference, pp. 192–201 (1994)
17. Hsu, D., Kakade, S., Langford, J., Zhang, T.: Multi-label prediction via compressed
sensing. In: Advances in Neural Information Processing Systems 22, vol. 22, pp.
772–780 (2009)
18. Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-
shot visual object categorization. IEEE Transactions on Pattern Analysis and
Machine Intelligence 36(3), 453–465 (2014)
19. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for
text categorization research. The Journal of Machine Learning Research 5, 361–397
(2004)
20. Loza Mencı́a, E., Fürnkranz, J.: Pairwise learning of multilabel classifications with
perceptrons. In: Proceedings of the International Joint Conference on Neural Net-
works, pp. 2899–2906 (2008)
21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. Advances in Neural
Information Processing Systems 26, 3111–3119 (2013)
22. Mnih, A., Hinton, G.E.: A scalable hierarchical distributed language model.
Advances in Neural Information Processing Systems 22, 1081–1088 (2009)
23. Morin, F., Bengio, Y.: Hierarchical probabilistic neural network language model.
In: Proceedings of the 10th International Workshop on Artificial Intelligence and
Statistics, pp. 246–252 (2005)
24. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label
classification. Machine Learning 85(3), 333–359 (2011)
25. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: A lock-free approach to paralleliz-
ing stochastic gradient descent. In: Advances in Neural Information Processing
Systems, pp. 693–701 (2011)
26. Rousu, J., Saunders, C., Szedmák, S., Shawe-Taylor, J.: Kernel-based learning of
hierarchical multilabel classification models. Journal of Machine Learning Research
7, 1601–1626 (2006)
27. Silla Jr, C.N., Freitas, A.A.: A survey of hierarchical classification across different
application domains. Data Mining and Knowledge Discovery 22(1–2), 31–72 (2011)
28. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through
cross-modal transfer. In: Advances in Neural Information Processing Systems,
pp. 935–943 (2013)
29. Tai, F., Lin, H.T.: Multilabel classification with principal label space transforma-
tion. Neural Computation 24(9), 2508–2542 (2012)
30. Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for
hierarchical multi-label classification. Machine Learning 73(2), 185–214 (2008)
31. Weston, J., Bengio, S., Usunier, N.: Wsabie: scaling up to large vocabulary image
annotation. In: Proceedings of the 22nd International Joint Conference on Artificial
Intelligence, pp. 2764–2770 (2011)
32. Zimek, A., Buchwald, F., Frank, E., Kramer, S.: A study of hierarchical and flat
classification of proteins. IEEE/ACM Transactions on Computational Biology and
Bioinformatics 7, 563–571 (2010)
Regression with Linear Factored Functions
1 Introduction
This paper introduces a novel regression algorithm which performs competitively
with Gaussian processes but yields linear factored functions (LFF). These have
outstanding properties such as analytical point-wise products and marginalization.
Regression is a well known problem, which can be solved by many non-linear
architectures like kernel methods (Shawe-Taylor and Cristianini 2004) or neural
networks (Haykin 1998). While these perform well, the estimated functions often
suffer a curse of dimensionality in later applications. For example, computing an
integral over a neural network or kernel function requires to sample the entire
input space. Applications like belief propagation (Pearl 1988) and reinforcement
learning (Kaelbling et al. 1996), on the other hand, face large input spaces and
require therefore efficient computations. We propose LFF for this purpose and
showcase their properties in comparison to kernel functions.
regression, Gaussian processes (GP, see Bishop 2006; Rasmussen and Williams
2006), on the other hand, require as many SV as training samples. Sparse versions
of GP thus aim for a small subset of SV. Some select this set based on
constraints similar to SVM (Tipping 2001; Vapnik 1995), while others try to
conserve the spanned linear function space (sparse GP, Csató and Opper 2002;
Rasmussen and Williams 2006). There exist also attempts to construct new SV by
averaging similar training samples (e.g. Wang et al. 2012).
Well chosen SV for regression are usually not sparsely concentrated on a deci-
sion boundary as they are for SVM. In fact, many practical applications report
that they are distributed uniformly in the input space (e.g. in Böhmer et al.
2013). Regression tasks restricted to a small region of the input space may tol-
erate this, but some applications require predictions everywhere. For example,
the value function in reinforcement learning must be generalized to each state.
The number of SV required to represent this function equally well in each state
grows exponentially in the number of input-space dimensions, leading to Bell-
man’s famous curse of dimensionality (Bellman 1957).
Kernel methods derive their effectiveness from linear optimization in a non-
linear Hilbert space of functions. Kernel-functions parameterized by SV are the
non-linear basis functions in this space. Due to the functional form of the kernel,
this can be a very ineffective way to select basis functions. Even in relatively
small input spaces, it often takes hundreds or thousands of SV to approximate a
function sufficiently. To alleviate the problem, one can construct complex kernels
out of simple prototypes (see a recent review in Gönen and Alpaydın 2011).
Diverging from all above arguments, this article proposes a more radical approach: to construct the non-linear basis functions directly during training, without the detour over kernel functions and support vectors. This poses two main challenges: to select a suitable function space and to regularize the optimization properly. The former is critical, as a small set of basis functions must be able to approximate any target function, but should also be easy to compute in practice.
We propose factored functions $\psi_i = \prod_k \psi_i^k \in \mathcal{F}$ as basis functions for regression, and call the linear combination of $m$ of those bases a linear factored function $f \in \mathcal{F}^m$ (LFF, Section 3). For example, generalized linear models (Nelder and Wedderburn 1972) and multivariate adaptive regression splines (MARS, Friedman 1991) are both LFF. There, computation remains feasible by using hinge functions $\psi_i^k(x_k) = \max(0, x_k - c)$ and by restricting the scope of each factored function $\psi_i$. In contrast, we assume the general case without restrictions on functions or scope.
Due to their structure, LFF can solve certain integrals analytically and allow
very efficient computation of point-wise products and marginalization. We show
that our LFF are universal function approximators and derive an appropriate
regularization term. This regularization promotes smoothness, but also retains
a high degree of variability in densely sampled regions by linking smoothness to the local density of training samples.
2 Regression
Let $\{x_t \in \mathcal{X}\}_{t=1}^n$ be a set of $n$ input samples, drawn i.i.d. from an input set $\mathcal{X} \subset \mathbb{R}^d$. Each so-called "training sample" is labeled with a real number $\{y_t \in \mathbb{R}\}_{t=1}^n$. Regression aims to find a function $f: \mathcal{X} \to \mathbb{R}$ that predicts the labels of all (previously unseen) test samples as well as possible. Labels may be afflicted by noise and $f$ must thus approximate the mean label of each sample, i.e., the function $\mu: \mathcal{X} \to \mathbb{R}$. It is important to notice that conceptually the noise is introduced by two (non-observable) sources: noisy labels $y_t$ and noisy samples $x_t$. The latter will play an important role for regularization. We define the conditional distribution $\chi$ of observable samples $x \in \mathcal{X}$ given the non-observable "true" samples $z \in \mathcal{X}$, which are drawn from a distribution $\xi$. In the limit of infinite samples, the least-squares cost function $C[f|\chi,\mu]$ can thus be written as
$$\lim_{n\to\infty}\ \inf_f\ \frac{1}{n}\sum_{t=1}^n \big(f(x_t) - y_t\big)^2 \;=\; \inf_f\ \int \xi(dz) \int \chi(dx|z)\ \big(f(x) - \mu(z)\big)^2 . \qquad (1)$$
The cost function $C$ can never be computed exactly, but it can be approximated using the training samples and assumptions about the unknown noise distribution $\chi$.
arbitrary functions.² Although each factored function $\psi_i$ is very restricted, a linear combination of them can be very powerful:
Corollary 1. Let $\mathcal{X}_k$ be a bounded continuous set and $\phi_j^k$ the $j$'th Fourier basis function over $\mathcal{X}_k$. In the limit of $m_k \to \infty$, $\forall k \in \{1,\dots,d\}$, it holds that $\mathcal{F}^\infty = L^2(\mathcal{X}, \vartheta)$.
Strictly, this holds in the limit of infinitely many basis functions $\psi_i$, but we will show empirically that there exist close approximations with a small number $m$ of factored functions. One can make similar statements for other bases $\{\phi_j^k\}_{j=1}^\infty$. For example, for Gaussian kernels one can show that the space $\mathcal{F}^\infty$ is in the limit equivalent to the corresponding reproducing kernel Hilbert space $\mathcal{H}$.
LFF offer some structural advantages over other universal function approximation classes like neural networks or reproducing kernel Hilbert spaces. Firstly, the inner product of two LFF in $L^2(\mathcal{X}, \vartheta)$ can be computed as a product of one-dimensional integrals. For some bases³, these integrals can be calculated analytically without any sampling. This could in principle break the curse of dimensionality for algorithms that have to approximate these inner products numerically. For example, input variables can be marginalized (integrated) out analytically (Equation 9 on Page 130). Secondly, the point-wise product of two LFF is an LFF as well⁴ (Equation 10 on Page 131). See Appendix A for details. These properties are very useful, for example in belief propagation (Pearl 1988) and factored reinforcement learning (Böhmer and Obermayer 2013).
3.2 Constraints
LFF have some degrees of freedom that can impede optimization. For example, the norm of $\psi_i \in \mathcal{F}$ does not influence the function $f \in \mathcal{F}^m$, as the corresponding linear coefficients $a_i$ can be scaled accordingly. We can therefore introduce the constraints $\|\psi_i\|_\vartheta = 1$, $\forall i$, without restricting the function class. The factorization of inner products (see Appendix A on Page 130) furthermore allows us to rewrite the constraints as $\|\psi_i\|_\vartheta = \prod_k \|\psi_i^k\|_{\vartheta_k} = 1$. This holds as long as the product is one, which exposes another unnecessary degree of freedom. To
² Examples are Fourier bases, Gaussian kernels, or hinge functions as in MARS.
³ E.g. Fourier bases for continuous, and Kronecker-delta bases for discrete variables.
⁴ One can use the trigonometric product-to-sum identities for Fourier bases or the Kronecker delta for discrete bases to construct LFF from a point-wise product without changing the underlying basis $\{\{\phi_i^k\}_{i=1}^{m_k}\}_{k=1}^d$.
finally make the solution unique (up to permutation), we define the constraints as $\|\psi_i^k\|_{\vartheta_k} = 1$, $\forall k, \forall i$. Minimizing some $C[f]$ w.r.t. $f \in \mathcal{F}^m$ is thus equivalent to
$$\inf_{f\in\mathcal{F}^m} C[f] \quad \text{s.t.} \quad \|\psi_i^k\|_{\vartheta_k} = 1\,,\ \forall k \in \{1,\dots,d\}\,,\ \forall i \in \{1,\dots,m\}\,. \qquad (3)$$
The cost function $C[f|\chi,\mu]$ of Equation 1 with the constraints in Equation 3 is equivalent to ordinary least squares (OLS) w.r.t. the linear parameters $\mathbf{a} \in \mathbb{R}^m$. However, the optimization problem is not convex w.r.t. the parameter space $\{B^k \in \mathbb{R}^{m_k\times m}\}_{k=1}^d$, due to the nonlinearity of products.
Instead of tackling the global optimization problem induced by Equation 3, we propose a greedy approximation algorithm. Here we optimize, at iteration $\hat\imath$, one linear basis function $\psi_{\hat\imath} =: g =: \prod_k g^k \in \mathcal{F}$, with $g^k(x_k) =: \mathbf{b}^{k\top} \boldsymbol\phi^k(x_k)$, at a time, to fit the residual $\mu - f$ between the true mean label function $\mu \in L^2(\mathcal{X},\vartheta)$ and the current regression estimate $f \in \mathcal{F}^{\hat\imath - 1}$, based on all $\hat\imath - 1$ previously constructed factored basis functions $\{\psi_i\}_{i=1}^{\hat\imath-1}$:
$$\inf_{g\in\mathcal{F}} C[f + g\,|\,\chi, \mu] \quad \text{s.t.} \quad \|g^k\|_{\vartheta_k} = 1\,,\ \forall k \in \{1,\dots,d\}\,. \qquad (4)$$
3.3 Regularization
Regression with any powerful function class requires regularization to avoid overfitting. Examples are weight decay for neural networks (Haykin 1998) or parameterized priors for Gaussian processes. It is, however, not immediately obvious how to regularize the parameters of an LFF, and we will therefore derive a regularization term from a Taylor approximation of the cost function in Equation 1.
We aim to enforce smooth functions, especially in those regions where our knowledge is limited due to a lack of training samples. This uncertainty can be expressed as the Radon-Nikodym derivative⁵ $\frac{d\vartheta}{d\xi}: \mathcal{X} \to [0,\infty)$ of our factored measure $\vartheta$ (see Appendix A) w.r.t. the sampling distribution $\xi$. Figure 1 demonstrates, for the example of a uniform distribution $\vartheta$, how $\frac{d\vartheta}{d\xi}$ reflects our empirical knowledge of the input space $\mathcal{X}$.
We use this uncertainty to modulate the sample noise distribution $\chi$ in Equation 1. This means that frequently sampled regions of $\mathcal{X}$ shall yield low variance, while scarcely sampled regions shall yield high variance. Formally, we assume $\chi(dx|z)$ to be a Gaussian probability measure over $\mathcal{X}$ with mean $z$ and a covariance matrix $\Sigma \in \mathbb{R}^{d\times d}$, scaled by the local uncertainty in $z$ (modeled as $\frac{d\vartheta}{d\xi}(z)$):
Fig. 1. We interpret the Radon-Nikodym derivative $\frac{d\vartheta}{d\xi}$ as an uncertainty measure for our knowledge of $\mathcal{X}$. Regularization enforces smoothness in uncertain regions.
⁵ Technically we have to assume that $\vartheta$ is absolutely continuous with respect to $\xi$. For "well-behaved" distributions $\vartheta$, like the uniform or Gaussian distributions we discuss in Appendix A, this is equivalent to the assumption that in the limit of infinite samples, each sample $z \in \mathcal{X}$ will eventually be drawn by $\xi$.
3.4 Optimization
Another advantage of the cost function $\tilde C[g]$ is that one can optimize one factor function $g^k$ of $g(x) = g^1(x_1)\cdot\ldots\cdot g^d(x_d) \in \mathcal{F}$ at a time, instead of a time-consuming joint optimization over all factors.
Proposition 3. If all but one factor function $g^k$ are considered constant, Equation 6 has an analytical solution. If $\{\phi_j^k\}_{j=1}^{m_k}$ is a Fourier base, $\sigma_k^2 > 0$, and $\vartheta_k$ is absolutely continuous w.r.t. $\xi$, then the solution is also unique.
Proof: see Appendix C on Page 133.
One can give similar guarantees for other bases, e.g. Gaussian kernels. Note that Proposition 3 does not state that the optimization problem has a unique solution in $\mathcal{F}$. Formal convergence statements are not trivial, and empirically the parameters of $g$ do not converge but evolve around orbits of equal cost instead. However, since the optimization of any $g^k$ cannot increase the cost, any sequence of improvements will converge to (and stay in) a local minimum. This implies a nested optimization approach, which is formulated in Algorithm 1 on Page 133:
– An inner loop that optimizes one factored basis function $g(x) = g^1(x_1)\cdot\ldots\cdot g^d(x_d)$ by selecting an input dimension $k$ in each iteration and solving Equation 6 for the corresponding $g^k$. A detailed derivation of the optimization steps of the inner loop is given in Appendix B on Page 131. The choice of $k$ influences the solution in a non-trivial way, and further research is needed to build up a rationale for any meaningful decision. For the purpose of this paper, we assume $k$ to be chosen randomly by permuting the order of updates.
  The computational complexity of the inner loop is $O(m_k^2 n + d^2 m_k m)$. Memory complexity is $O(d\, m_k m)$, or $O(d\, m_k n)$ with the optional cache speedup of Algorithm 1. The loop is repeated for random $k$ until the cost improvements of all dimensions $k$ fall below some small $\epsilon$.
⁶ Non-diagonal covariance matrices $\Sigma$ can be cast in this framework by projecting the input samples into the eigenspace of $\Sigma$ (thus diagonalizing the input) and using the corresponding eigenvalues $\lambda_k$ instead of the regularization parameters $\sigma_k^2$.
⁷ Each regularization term is measured w.r.t. the factored distribution $\vartheta$. We also tested the algorithm without consideration of the "uncertainty" $\frac{d\vartheta}{d\xi}$, i.e., by measuring each term w.r.t. $\xi$. As a result, regions outside the hypercube containing the training set were no longer regularized and predicted arbitrary (often extreme) values.
– After convergence of the inner loop in (outer) iteration $\hat\imath$, the new basis function is $\psi_{\hat\imath} := g$. As the basis has changed, the linear parameters $\mathbf{a} \in \mathbb{R}^{\hat\imath}$ have to be readjusted by solving the ordinary least squares problem
$$\mathbf{a} = (\Psi\Psi^\top)^{-1}\,\Psi\,\mathbf{y}\,, \quad \text{with}\ \ \Psi_{it} := \psi_i(x_t)\,,\ \forall i \in \{1,\dots,\hat\imath\}\,,\ \forall t \in \{1,\dots,n\}\,.$$
  We propose to stop the approximation when the newly found basis function $\psi_{\hat\imath}$ is no longer linearly independent of the current basis $\{\psi_i\}_{i=1}^{\hat\imath-1}$. This can, for example, be tested by checking whether the determinant $\det\!\big(\tfrac{1}{n}\Psi\Psi^\top\big) < \varepsilon$ for some small constant $\varepsilon$.
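The following NumPy sketch illustrates this outer-loop step under the assumption that the factored basis functions constructed so far are available as Python callables; the names basis_functions, X_train, and y_train are illustrative and not taken from the paper:

import numpy as np

def readjust_and_check(basis_functions, X_train, y_train, eps=1e-8):
    """Recompute the linear coefficients a by OLS and test the stopping criterion.

    basis_functions: list of callables psi_i, each mapping an array of
                     samples of shape (n, d) to an array of values (n,).
    Returns (a, stop), where stop becomes True once the newest basis
    function is (numerically) linearly dependent on the previous ones."""
    n = X_train.shape[0]
    # Psi[i, t] = psi_i(x_t)
    Psi = np.stack([psi(X_train) for psi in basis_functions])    # (m, n)
    G = Psi @ Psi.T / n                                          # 1/n Psi Psi^T
    # OLS readjustment: a = (Psi Psi^T)^{-1} Psi y  (the 1/n factors cancel)
    a = np.linalg.solve(G, Psi @ y_train / n)
    # Stopping test: a near-zero determinant of 1/n Psi Psi^T signals that the
    # newest basis function is no longer linearly independent of the others.
    stop = np.linalg.det(G) < eps
    return a, stop

Solving with the Gram matrix 1/n ΨΨᵀ leaves the OLS solution unchanged and reuses the quantity needed for the determinant test.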
4 Empirical Evaluation
In this section we will evaluate the novel LFF regression Algorithm 1, printed in
detail on Page 133. We will analyze its properties on low dimensional toy-data,
and compare its performance with sparse and traditional Gaussian processes
(GP, see Bishop 2006; Rasmussen and Williams 2006).
4.1 Demonstration
To showcase the novel Algorithm 1, we tested it on an artificial two-dimensional regression toy data set. The $n = 1000$ training samples were drawn from a noisy spiral and labeled with a sine. The variance of the Gaussian sample noise grew with the spiral as well:
$$x_t = 6\,\tfrac{t}{n}\begin{pmatrix}\cos\!\big(6\,\tfrac{t}{n}\,\pi\big)\\ \sin\!\big(6\,\tfrac{t}{n}\,\pi\big)\end{pmatrix} + \mathcal{N}\!\Big(0,\ \tfrac{t^2}{4n^2}\, I\Big)\,, \qquad y_t = \sin\!\big(4\,\tfrac{t}{n}\,\pi\big)\,, \qquad \forall t \in \{1,\dots,n\}\,. \qquad (7)$$
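A small NumPy sketch that generates this toy data set according to Equation 7 (variable names are illustrative):

import numpy as np

def spiral_toy_data(n=1000, seed=0):
    """Noisy spiral inputs with sine labels, as in Equation 7."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    angle = 6 * t / n * np.pi
    clean = 6 * (t / n)[:, None] * np.column_stack([np.cos(angle), np.sin(angle)])
    # noise variance t^2 / (4 n^2)  =>  standard deviation t / (2 n)
    noise = rng.normal(scale=(t / (2 * n))[:, None], size=(n, 2))
    x = clean + noise
    y = np.sin(4 * t / n * np.pi)
    return x, y

X_train, y_train = spiral_toy_data()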
Fig. 2. Two LFF functions learned from the same 1000 training samples (white circles).
The color inside a circle represents the training label. Outside the circles, the color
represents the prediction of the LFF function. The differences between both functions
are rooted in the randomized order in which the factor functions g k are updated.
However, the similarity of the sampled region indicates that poor initial choices can be
compensated by subsequently constructed basis functions.
Figure 2 shows one training set plotted over two learned⁸ functions $f \in \mathcal{F}^m$ with $m = 21$ and $m = 24$ factored basis functions, respectively. The regularization constants were in both cases $\sigma_k^2 = 0.0005$, $\forall k$. The differences between the functions stem from the randomized order in which the factor functions $g^k$ are updated.
Note that the sampled regions have similar predictions. Regions with strong
differences, for example the upper right corner, are never seen during training.
In all our experiments, Algorithm 1 always converged. Runtime was mainly
influenced by the input dimensionality (O(d2 )), the number of training sam-
ples (O(n)) and the eventual number of basis functions (O(m)). The latter was
strongly correlated with approximation quality, i.e., bad approximations con-
verged fast. Cross-validation was therefore able to find good parameters effi-
ciently and the resulting LFF were always very similar near the training data.
4.2 Evaluation
⁸ Here (and in the rest of the paper), each variable was encoded with 50 Fourier cosine bases. We tested other sizes as well. Few cosine bases effectively result in a low-pass filtered function, whereas every experiment with more than 20 or 30 bases behaved very similarly. We tested up to $m_k = 1000$ bases and did not experience over-fitting.
⁹ https://archive.ics.uci.edu/ml/index.html
Fig. 3. Mean and standard deviation within a 10-fold cross-validation of a) the toy data set with additional independent noise input dimensions and b) all tested UCI benchmark data sets. The stars mark significantly different distributions of RMSE over all folds in both a paired-sample t-test and a Wilcoxon signed rank test. Significance levels are: one star p < 0.05, two stars p < 0.005.
Table 1. 10-fold cross-validation RMSE for benchmark data sets with d dimensions
and n samples, resulting in m basis functions. The cross-validation took h hours.
5 Discussion
¹⁰ RMSE is not a common performance metric for GP, which represent a distribution of solutions. However, RMSE reflects the objective of regression and is well suited to compare our algorithm with the mean of a GP.
Acknowledgments. The authors thank Yun Shen and the anonymous reviewers for
their helpful comments. This work was funded by the German science foundation
(DFG) within SPP 1527 autonomous learning.
Let $\mathcal{X}_k$ denote the subset of $\mathbb{R}$ associated with the $k$'th variable of the input space $\mathcal{X} \subset \mathbb{R}^d$, such that $\mathcal{X} := \mathcal{X}_1 \times \ldots \times \mathcal{X}_d$. To avoid the curse of dimensionality in this space, one can integrate w.r.t. a factored probability measure $\vartheta$, i.e. $\vartheta(dx) = \prod_{k=1}^d \vartheta_k(dx_k)$, $\int \vartheta_k(dx_k) = 1$, $\forall k$. For example, $\vartheta_k$ could be uniform or Gaussian distributions over $\mathcal{X}_k$ and the resulting $\vartheta$ would be a uniform or Gaussian distribution over the input space $\mathcal{X}$.
A function $g: \mathcal{X} \to \mathbb{R}$ is called a factored function if it can be written as a product of one-dimensional factor functions $g^k: \mathcal{X}_k \to \mathbb{R}$, i.e. $g(x) = \prod_{k=1}^d g^k(x_k)$. We only consider factored functions $g$ that are square integrable w.r.t. the measure $\vartheta$, i.e. $g \in L^2(\mathcal{X}, \vartheta)$. Note that not all functions $f \in L^2(\mathcal{X}, \vartheta)$ are factored, though. Due to Fubini's theorem, the $d$-dimensional inner product between two factored functions $g, g' \in L^2(\mathcal{X}, \vartheta)$ can be written as the product of $d$ one-dimensional inner products:
$$\langle g, g'\rangle_\vartheta = \int \vartheta(dx)\, g(x)\, g'(x) = \prod_{k=1}^d \int \vartheta_k(dx_k)\, g^k(x_k)\, g'^k(x_k) = \prod_{k=1}^d \langle g^k, g'^k\rangle_{\vartheta_k}\,.$$
This trick can be used to solve the integrals at the heart of many least-squares algorithms. Our aim is to learn factored basis functions $\psi_i$. To this end, let $\{\phi_j^k: \mathcal{X}_k \to \mathbb{R}\}_{j=1}^{m_k}$ be a well-chosen¹¹ (i.e. universal) basis on $\mathcal{X}_k$, with the space of linear combinations denoted by $L_\phi^k := \{\mathbf{b}^\top\boldsymbol\phi^k \mid \mathbf{b} \in \mathbb{R}^{m_k}\}$. One can thus approximate the factor functions of $\psi_i$ in $L_\phi^k$, i.e., as linear functions
$$\psi_i^k(x_k) := \sum_{j=1}^{m_k} B_{ji}^k\, \phi_j^k(x_k) \in L_\phi^k\,, \qquad B^k \in \mathbb{R}^{m_k\times m}\,. \qquad (8)$$
Let $\mathcal{F}$ be the space of all factored basis functions $\psi_i$ defined by the factor functions $\psi_i^k$ above, and $\mathcal{F}^m$ be the space of all linear combinations of those $m$ factored basis functions (Equation 2).
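To make the representation concrete, the following NumPy sketch stores an LFF as a coefficient vector a and per-dimension matrices B^k (Equation 8), evaluates it on a batch of samples, and computes the inner product of two LFF as a product of one-dimensional inner products. The helper names and the choice of Fourier cosine bases on [0, 1] are assumptions made for this example, not prescriptions from the paper:

import numpy as np

def cosine_features(x, m_k):
    """phi_j(x) = cos((j-1) * pi * x) for j = 1..m_k, with x in [0, 1]."""
    j = np.arange(m_k)
    return np.cos(np.outer(x, j) * np.pi)                  # shape (n, m_k)

def lff_eval(a, Bs, X):
    """Evaluate f(x) = sum_i a_i * prod_k ( B^k[:, i] . phi^k(x_k) )."""
    n, d = X.shape
    prod = np.ones((n, len(a)))
    for k in range(d):
        prod *= cosine_features(X[:, k], Bs[k].shape[0]) @ Bs[k]   # (n, m)
    return prod @ a

def lff_inner_product(a, Bs, a2, Bs2, C):
    """<f, f'>_theta via products of one-dimensional inner products.

    C[k] is the one-dimensional basis covariance <phi^k, phi^k^T>_{theta_k};
    for cosine bases under the uniform measure it is diagonal."""
    M = np.ones((len(a), len(a2)))
    for k in range(len(Bs)):
        M *= Bs[k].T @ C[k] @ Bs2[k]                       # (m, m') per dimension
    return a @ M @ a2

Because each one-dimensional covariance C[k] can be precomputed (or given analytically), no d-dimensional integration is ever required.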
Marginalization of LFF can be performed analytically with Fourier bases $\phi_j^k$ and a uniform distribution $\vartheta$ (many other bases can be solved analytically as well):
$$\int \vartheta_l(dx_l)\, f(x) \;=\; \sum_{i=1}^m a_i \sum_{j=1}^{m_l} \underbrace{\langle \phi_j^l, 1\rangle_{\vartheta_l}}_{\text{mean of } \phi_j^l}\, B_{ji}^l \prod_{k\neq l} \psi_i^k \;\overset{\text{Fourier}}{=}\; \sum_{i=1}^m \underbrace{a_i\, B_{1i}^l}_{\text{new } a_i} \prod_{k\neq l} \psi_i^k\,. \qquad (9)$$
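Under the additional assumptions that the bases are Fourier cosine bases and ϑ is the uniform measure—so that only the constant basis function has a non-zero mean—Equation 9 reduces to folding the first row of B^l into the coefficients. A sketch continuing the representation above:

def lff_marginalize(a, Bs, l):
    """Integrate out input dimension l analytically (Equation 9).

    Assumes Fourier cosine bases and a uniform factored measure, so that
    <phi_j^l, 1> = 1 for j = 1 (constant basis) and 0 otherwise."""
    a_new = a * Bs[l][0, :]                  # new a_i = a_i * B^l_{1 i}
    Bs_new = [B for k, B in enumerate(Bs) if k != l]
    return a_new, Bs_new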
¹¹ Examples for continuous variables $\mathcal{X}_k$ are Fourier cosine bases $\phi_j^k(x_k) \sim \cos\!\big((j-1)\,\pi\, x_k\big)$ and Gaussian bases $\phi_j^k(x_k) = \exp\!\big(-\tfrac{1}{2\sigma^2}(x_k - s_j^k)^2\big)$. Discrete variables may be represented with Kronecker-delta bases $\phi_j^k(x_k = i) = \delta_{ij}$.
Using the trigonometric product-to-sum identity $\cos(x)\cdot\cos(y) = \tfrac{1}{2}\big(\cos(x-y) + \cos(x+y)\big)$, one can also compute the point-wise product between two LFF $f$ and $\bar f$ with a cosine-Fourier base (solutions for other Fourier bases are less elegant):
$$f(x)\cdot \bar f(x) \;\overset{\text{Fourier}}{=}\; \sum_{i,j=1}^{m,\bar m} \underbrace{a_i\,\bar a_j}_{\text{new } \tilde a_t} \prod_{k=1}^d \sum_{l=1}^{2m_k} \Big(\tfrac{1}{2}\sum_{q=1}^{l-1} B^k_{qi}\,\bar B^k_{(l-q)j} \,+\, \tfrac{1}{2}\sum_{q=l+1}^{m_k} B^k_{qi}\,\bar B^k_{(q-l)j}\Big)\,\phi^k_l(x_k)\,, \qquad (10)$$
where $t := (i-1)\,\bar m + j$, and $B^k_{ji} := 0$, $\forall j > m_k$, for both $f$ and $\bar f$. Note that this increases the number of basis functions to $\tilde m = m\,\bar m$, and the number of bases to $\tilde m_k = 2m_k$ for each respective input dimension. The latter can be counteracted by low-pass filtering, i.e., by setting $\tilde B^k_{ji} := 0$, $\forall j > m_k$.
the covariance matrices analytically before the main algorithm starts; e.g. Fourier cosine bases have $C^k_{ij} = \delta_{ij}$ and $\dot C^k_{ij} = (i-1)^2\pi^2\,\delta_{ij}$.
The approximated cost function in Equation 6 is
$$\tilde C[g] = \|g\|_\xi^2 - 2\,\langle g, \mu - f\rangle_\xi + \|\mu - f\|_\xi^2 + \sum_{k=1}^d \sigma_k^2\Big(\|\nabla_k g\|_\vartheta^2 + 2\,\langle \nabla_k g, \nabla_k f\rangle_\vartheta + \|\nabla_k f\|_\vartheta^2\Big)\,.$$
The non-zero gradients of all inner products of this equation w.r.t. the parameter vector $\mathbf{b}^k \in \mathbb{R}^{m_k}$ are
$$\frac{\partial}{\partial \mathbf{b}^k}\,\langle g, g\rangle_\xi = 2\,\Big\langle \boldsymbol\phi^k \cdot \prod_{l\neq k} g^l\,,\ \prod_{l\neq k} g^l \cdot \boldsymbol\phi^{k\top}\Big\rangle_\xi\, \mathbf{b}^k\,,$$
$$\frac{\partial}{\partial \mathbf{b}^k}\,\langle g, \mu - f\rangle_\xi = \Big\langle \boldsymbol\phi^k \cdot \prod_{l\neq k} g^l\,,\ \mu - f\Big\rangle_\xi\,,$$
$$\frac{\partial}{\partial \mathbf{b}^k}\,\langle \nabla_l g, \nabla_l g\rangle_\vartheta = \frac{\partial}{\partial \mathbf{b}^k}\Big(\langle \nabla_l g^l, \nabla_l g^l\rangle_{\vartheta_l}\, \underbrace{\prod_{s\neq l}\langle g^s, g^s\rangle_{\vartheta_s}}_{1}\Big) = 2\,\delta_{kl}\,\dot C^k \mathbf{b}^k\,,$$
$$R_{lk} := \frac{\partial}{\partial \mathbf{b}^k}\,\langle \nabla_l g, \nabla_l f\rangle_\vartheta = \begin{cases} \dot C^k B^k\Big(\mathbf{a} \odot \prod\limits_{s\neq k} B^{s\top} C^{s} \mathbf{b}^s\Big)\,, & \text{if } k = l\,,\\[2mm] C^k B^k\Big(\mathbf{a} \odot B^{l\top}\dot C^{l}\mathbf{b}^l \odot \prod\limits_{s\neq k,\,s\neq l} B^{s\top} C^{s} \mathbf{b}^s\Big)\,, & \text{if } k \neq l\,,\end{cases}$$
where $\odot$ and the products over $s$ denote element-wise multiplication of vectors. Setting the gradient of $\tilde C[g]$ to zero yields the unconstrained solution $g^k_{uc}$,
$$\mathbf{b}^k_{uc} = \underbrace{\Big(\Big\langle \boldsymbol\phi^k \cdot \prod_{l\neq k} g^l\,,\ \prod_{l\neq k} g^l \cdot \boldsymbol\phi^{k\top}\Big\rangle_\xi + \sigma_k^2\,\dot C^k\Big)^{-1}}_{\text{inverse of the regularized covariance matrix } \bar C^k}\Big(\Big\langle \boldsymbol\phi^k \cdot \prod_{l\neq k} g^l\,,\ \mu - f\Big\rangle_\xi - \sum_{l=1}^d R_{lk}\,\sigma_l^2\Big)\,. \qquad (11)$$
However, these parameters do not satisfy the constraint $\|g^k\|_{\vartheta_k} = 1$ and have to be normalized:
$$\mathbf{b}^k := \frac{\mathbf{b}^k_{uc}}{\|g^k_{uc}\|_{\vartheta_k}} = \frac{\mathbf{b}^k_{uc}}{\sqrt{\mathbf{b}^{k\top}_{uc}\, C^k\, \mathbf{b}^k_{uc}}}\,. \qquad (12)$$
The inner loop finishes when, for all $k$, the improvement¹² from $g^k$ to $g'^k$ drops below some very small threshold $\epsilon$, i.e. $\tilde C[g] - \tilde C[g'] < \epsilon$. Using $g^l = g'^l$, $\forall l \neq k$, one can calculate the left-hand side:
$$\tilde C[g] - \tilde C[g'] = \|g\|_\xi^2 - \|g'\|_\xi^2 - 2\,\langle g - g', \mu - f\rangle_\xi + \sum_{l=1}^d \sigma_l^2\Big(\underbrace{\|\nabla_l g\|_\vartheta^2}_{\mathbf{b}^{l\top}\dot C^l\mathbf{b}^l} - \underbrace{\|\nabla_l g'\|_\vartheta^2}_{\mathbf{b}'^{l\top}\dot C^l\mathbf{b}'^l} - 2\,\underbrace{\langle \nabla_l g' - \nabla_l g, \nabla_l f\rangle_\vartheta}_{(\mathbf{b}'^k - \mathbf{b}^k)^\top R_{lk}}\Big) \qquad (13)$$
$$\phantom{\tilde C[g] - \tilde C[g']} = 2\,\langle g' - g, \mu - f\rangle_\xi + \mathbf{b}^{k\top}\bar C^k\mathbf{b}^k - \mathbf{b}'^{k\top}\bar C^k\mathbf{b}'^k - 2\,(\mathbf{b}'^k - \mathbf{b}^k)^\top\sum_{l=1}^d R_{lk}\,\sigma_l^2\,.$$
$$\langle g, f\rangle_{\xi\chi} \;\approx\; \int \xi(dz)\bigg[\, g(z)\, f(z)\underbrace{\int \chi(dx|z)}_{1} + g(z)\underbrace{\int \chi(dx|z)\,(x - z)^\top}_{0 \text{ due to (Eq. 5)}}\nabla f(z) + \underbrace{\int \chi(dx|z)\,(x - z)^\top}_{0 \text{ due to (Eq. 5)}}\nabla g(z)\, f(z) + \nabla g(z)^\top \underbrace{\int \chi(dx|z)\,(x - z)(x - z)^\top}_{\frac{d\vartheta}{d\xi}(z)\cdot\Sigma \text{ due to (Eq. 5)}}\nabla f(z)\bigg] \;=\; \langle g, f\rangle_\xi + \sum_{k=1}^d \sigma_k^2\,\langle \nabla_k g, \nabla_k f\rangle_\vartheta\,.$$
Using this twice and the zero-mean assumption (Eq. 5), we can derive:
$$\inf_{g\in\mathcal{F}} C[f + g\,|\,\chi,\mu] \;\equiv\; \inf_{g\in\mathcal{F}} \int \xi(dz)\int \chi(dx|z)\,\Big( g^2(x) - 2\, g(x)\,\big(\mu(z) - f(x)\big)\Big)$$
$$=\; \inf_{g\in\mathcal{F}}\ \langle g, g\rangle_{\xi\chi} + 2\,\langle g, f\rangle_{\xi\chi} - 2\int \xi(dz)\,\mu(z)\int \chi(dx|z)\, g(x)$$
$$\approx\; \inf_{g\in\mathcal{F}}\ \langle g, g\rangle_\xi - 2\,\langle g, \mu - f\rangle_\xi + \sum_{k=1}^d \sigma_k^2\Big(\langle \nabla_k g, \nabla_k g\rangle_\vartheta + 2\,\langle \nabla_k g, \nabla_k f\rangle_\vartheta\Big)$$
$$\equiv\; \inf_{g\in\mathcal{F}}\ \big\|g - (\mu - f)\big\|_\xi^2 + \sum_{k=1}^d \sigma_k^2\,\big\|\nabla_k g + \nabla_k f\big\|_\vartheta^2 \;=:\; \tilde C[g]\,.$$
¹² Anything simpler does not converge, as the parameter vectors often evolve along chaotic orbits in $\mathbb{R}^{m_k}$.
References
Bellman, R.E.: Dynamic programming. Princeton University Press (1957)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer-Verlag New York
Inc, Secaucus (2006). ISBN 0387310738
Böhmer, W., Obermayer, K.: Towards structural generalization: Factored approx-
imate planning. ICRA Workshop on Autonomous Learning (2013). http://
autonomous-learning.org/wp-content/uploads/13-ALW/paper 1.pdf
Böhmer, W., Grünewälder, S., Nickisch, H., Obermayer, K.: Generating feature spaces
for linear algorithms with regularized sparse kernel slow feature analysis. Machine
Learning 89(1–2), 67–86 (2012)
Böhmer, W., Grünewälder, S., Shen, Y., Musial, M., Obermayer, K.: Construction
of approximation spaces for reinforcement learning. Journal of Machine Learning
Research 14, 2067–2118 (2013)
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin clas-
sifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning
Theory, pp. 144–152 (1992)
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Modeling wine preferences
by data mining from physicochemical properties. Decision Support Systems 47(4),
547–553 (2009)
Csató, L., Opper, M.: Sparse on-line Gaussian processes. Neural Computation 14(3),
641–668 (2002)
Friedman, J.H.: Multivariate adaptive regression splines. The Annals of Statistics
19(1), 1–67 (1991)
Gerritsma, J., Onnink, R., Versluis, A.: Geometry, resistance and stability of the delft
systematic yacht hull series. Int. Shipbuilding Progress 28, 276–297 (1981)
Gönen, M., Alpaydın, E.: Multiple kernel learning algorithms. Journal of Machine
Learning Research 12, 2211–2268 (2011). ISSN 1532–4435
Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall
(1998). ISBN 978-0132733502
Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. Journal
of Artificial Intelligence Research 4, 237–285 (1996)
Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. Journal of the Royal
Statistical Society, Series A, General 135, 370–384 (1972)
Pearl, J.: Probabilistic reasoning in intelligent systems. Morgan Kaufmann (1988)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT
Press (2006)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press (2004)
Tipping, M.E.: Sparse bayesian learning and the relevance vector machine. Journal of
Machine Learning Research 1, 211–244 (2001). ISSN 1532–4435
Tüfekci, P.: Prediction of full load electrical power output of a base load operated
combined cycle power plant using machine learning methods. International Journal
of Electrical Power & Energy Systems 60, 126–140 (2014)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer (1995)
Wang, Z., Crammer, K., Vucetic, S.: Breaking the curse of kernelization: Budgeted
stochastic gradient descent for large-scale svm training. Journal of Machine Learning
Research 13(1), 3103–3131 (2012). ISSN 1532–4435
Yeh, I.-C.: Modeling of strength of high performance concrete using artificial neural
networks. Cement and Concrete Research 28(12), 1797–1808 (1998)
Ridge Regression, Hubness,
and Zero-Shot Learning
1 Introduction
1.1 Background
In recent years, zero-shot learning (ZSL) [10,14,15,22] has been an active
research topic in machine learning, computer vision, and natural language
processing. Many practical applications can be formulated as a ZSL task:
drug discovery [15], bilingual lexicon extraction [7,8,20], and image labeling
[2,11,21,22,25], to name a few. Cross-lingual information retrieval [28] can also
be viewed as a ZSL task.
ZSL can be regarded as a type of (multi-class) classification problem, in the
sense that the classifier is given a set of known example-class label pairs (train-
ing set), with the goal to predict the unknown labels of new examples (test set).
However, ZSL differs from standard classification in that the labels for the test examples are not present in the training set. In standard settings, the classifier chooses, for each test example, a label among those observed in the training set, but this is not the case in ZSL. Moreover, the number of class labels can be huge in ZSL; indeed, in bilingual lexicon extraction, labels correspond to possible translation words, which can range over the entire vocabulary of the target language.
Obviously, such a task would be intractable without further assumptions.
Labels are thus assumed to be embedded in a metric space (label space), and their
distance (or similarity) can be measured in this space1 . Such a label space can be
built with the help of background knowledge or external resources; in image label-
ing tasks, for example, labels correspond to annotation keywords, which can be
readily represented as vectors in a Euclidean space, either by using corpus statis-
tics in a standard way, or by using the more recent techniques for learning word
representations, such as the continuous bag-of-words or skip-gram models [19].
After a label space is established, one natural approach would be to use a
regression technique on the training set to obtain a mapping function from the
example space to the label space. This function could then be used for mapping
unlabeled examples into the label space, where nearest neighbor search is carried
out to find the label closest to the mapped example. Finally, this label would be
output as the prediction for the example.
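A minimal NumPy sketch of this conventional pipeline, where columns of X are example vectors, columns of Y their label vectors, and Y_candidates holds the label vectors of the unseen test classes; the closed-form ridge solution is standard, while the function and variable names are assumptions made for this example:

import numpy as np

def ridge_map(X, Y, lam=1.0):
    """Fit M minimizing ||M X - Y||_F^2 + lam ||M||_F^2 (examples -> label space)."""
    c = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(c))

def predict_labels(M, X_test, Y_candidates):
    """Map test examples into label space and return the nearest candidate labels."""
    mapped = M @ X_test                                    # (d, n_test)
    # squared Euclidean distances between candidate labels and mapped examples
    d2 = (np.sum(mapped**2, axis=0)[None, :]
          - 2 * Y_candidates.T @ mapped
          + np.sum(Y_candidates**2, axis=0)[:, None])      # (n_labels, n_test)
    return np.argmin(d2, axis=0)                           # predicted label index per test example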
To find the mapping function, some researchers use the standard linear ridge
regression [7,8,20,22], whereas others use neural networks [11,21,25].
In the machine learning community, meanwhile, the hubness phenomenon
[23] is attracting attention as a new type of the “curse of dimensionality.” This
phenomenon is concerned with nearest neighbor methods in high-dimensional
space, and states that a small number of objects in the dataset, or hubs, may
occur as the nearest neighbor of many objects. The emergence of these hubs
will diminish the utility of nearest neighbor search, because the list of nearest
neighbors often contain the same hub objects regardless of the query object for
which the list is computed.
In this paper, we show that the interaction between the regression step in ZSL and the subsequent nearest neighbor step has a non-negligible effect on the prediction accuracy.
In ZSL, examples and labels are represented as vectors in high-dimensional
space, of which the dimensionality is typically a few hundred. As demonstrated
by Dinu and Baroni [8] (see also Sect. 6), when ZSL is formulated as a prob-
lem of ridge regression from examples to labels, “hub” labels emerge, which are
simultaneously the nearest neighbors of many mapped examples. This has the con-
sequence of incurring bias in the prediction, as these labels are output as the pre-
dicted labels for these examples. The presence of hubs is not necessarily disadvantageous in standard classification settings; there may be "good" hubs as well as "bad" hubs [23]. However, in typical ZSL tasks in which the label set is fine-grained and huge, hubs are nearly always harmful to the prediction accuracy.
Therefore, the objective of this study is to investigate ways to suppress hubs,
and to improve the ZSL accuracy. Our contributions are as follows.
1. We analyze the mechanism behind the emergence of hubs in ZSL, both with
ridge regression and ordinary least squares. It is established that hubness
occurs in ZSL not only because of high-dimensional space, but also because
¹ Throughout the paper, we assume both the example and label spaces are Euclidean.
ridge regression has conventionally been used in ZSL in a way that promotes
hubness. To be precise, the distributions of the mapped examples and the
labels are different such that hubs are likely to emerge.
2. Drawing on the above analysis, we propose using ridge regression to map
labels into the space of examples. This approach is contrary to that followed
in existing work on ZSL, in which examples are mapped into label space.
Our proposal is therefore to reverse the mapping direction.
As shown in Sect. 6, our proposed approach outperformed the existing approach in an empirical evaluation using both synthetic and real data.
3. In terms of contributions to the research on hubness, this paper is the first
to provide in-depth analysis of the situation in which the query and data
follow different distributions, and to show that the variance of data matters
to hubness. In particular, in Sect. 3, we provide a proposition in which the
degree of bias present in the data, which causes hub formation, is expressed
as a function of the data variance. In Sect. 4, this proposition serves as the main tool for analyzing hubness in ZSL.
Let $x$ be a point sampled from a distribution $X$ with zero mean. Then, the expected difference $\Delta$ between the squared distances from $y_1$ and $y_2$ to $x$, i.e.,
$$\Delta = E_X\big[\|x - y_2\|^2\big] - E_X\big[\|x - y_1\|^2\big] \qquad (1)$$
is given by
$$\Delta = \sqrt{2}\,\gamma\, d^{1/2}\, s^2\,. \qquad (2)$$
$$\Delta = \underbrace{E_X\big[\|x\|^2\big] + \|y_2\|^2}_{E_X[\|x - y_2\|^2]} - \underbrace{\big(E_X\big[\|x\|^2\big] + \|y_1\|^2\big)}_{E_X[\|x - y_1\|^2]} = \|y_2\|^2 - \|y_1\|^2 = \gamma\sigma\,. \qquad (3)$$
$$\frac{\Delta(\gamma, d, s_1)}{E_{X Y_1}\big[\|x - y\|^2\big]} \;<\; \frac{\Delta(\gamma, d, s_2)}{E_{X Y_2}\big[\|x - y\|^2\big]}\,,$$
where we wrote $\Delta$ explicitly as a function of $\gamma$, $d$, and $s$.
1. We first show in Sect. 4.1 that, with ridge regression (and ordinary least
squares as well), mapped observation data tend to lie closer to the origin
than the target responses do. Because the existing work formulates ZSL as
a regression problem that projects source objects into the target space, this
means that the norm of the projected source objects tends to be smaller
than that of target objects.
2. By combining the above result with the discussion of Sect. 3, we then argue
that placing source objects closer to the origin is not ideal from the perspec-
tive of reducing hubness. On the contrary, placing target objects closer to the
origin, as attained with the proposed approach, is more desirable (Sect. 4.2).
3. In Sect. 4.3, we present a simple additional argument against placing source
objects closer to the origin; if the data is unimodal, such a configuration
increases the possibility of another target object falling closer to the source
object. This argument diverges from the discussion on hubness, but again
justifies the proposed approach.
$$\Big\| A^\top \big(AA^\top + \lambda I\big)^{-1} A \Big\|_2 = \frac{\sigma^2}{\sigma^2 + \lambda} \;\le\; 1\,.$$
Substituting this inequality in (6) establishes the proposition.
Recall that if the data is centered, the matrix 2-norm can be interpreted as an indicator of the variance of the data along its principal axis. Proposition 2 thus indicates that the variance along the principal axis of the mapped observations $MA$ tends to be smaller than that of the responses $B$.
Furthermore, this tendency even persists in ordinary least squares with no penalty term (i.e., $\lambda = 0$), since $\|MA\|_2 \le \|B\|_2$ still holds in this case; note that $A^\top(AA^\top)^{-1}A$ is an orthogonal projection and its 2-norm is 1, but the inequality in (6) holds regardless. This tendency therefore cannot be completely eliminated by simply decreasing the ridge parameter $\lambda$ towards zero.
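A quick numerical illustration of this shrinkage on randomly generated data (a sketch, not an experiment from the paper):

import numpy as np

rng = np.random.default_rng(0)
c, d, n, lam = 50, 40, 500, 1.0
A = rng.normal(size=(c, n))             # observations (columns)
B = rng.normal(size=(d, n))             # responses (columns)

# Ridge solution of min_M ||M A - B||_F^2 + lam ||M||_F^2
M = B @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(c))

# Spectral norm of the mapped observations M A is smaller than that of B.
print(np.linalg.norm(M @ A, 2), "<=", np.linalg.norm(B, 2))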
In existing work on ZSL, A represents the (training) source objects X =
[x1 · · · xn ] ∈ Rc×n , to be mapped into the space of target objects (by projection
matrix M); and B is the matrix of labels for the training objects, i.e., B =
Y = [y1 · · · yn ] ∈ Rd×n . Although Proposition 2 is thus only concerned with the
training set, it suggests that the source objects at the time of testing, which are
not in X, are also likely to be mapped closer to the origin of the target space
than many of the target objects in Y.
We learned in Sect. 4.1 that ridge regression (and ordinary least squares) shrink
the mapped observation data towards the origin of the space, relative to the
response. Thus, in existing work on ZSL in which source objects X are projected
to the space of target objects Y , the norm of the mapped source objects is likely
to be smaller than that of the target objects.
The proposed approach, which was described in the beginning of Sect. 4,
follows the opposite direction: target objects Y are projected to the space of
source objects X. Thus, in this case, the norm of the mapped target objects is
expected to be smaller than that of the source objects.
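Continuing the earlier sketch, the proposed reversal only swaps the roles of the two matrices: ridge regression is fit from label vectors to example vectors, and the nearest neighbor search is then carried out in the example space (names again illustrative):

def ridge_map_reversed(X, Y, lam=1.0):
    """Fit M_rev minimizing ||M_rev Y - X||_F^2 + lam ||M_rev||_F^2
    (maps label vectors into the example space)."""
    d = Y.shape[0]
    return X @ Y.T @ np.linalg.inv(Y @ Y.T + lam * np.eye(d))

def predict_labels_reversed(M_rev, X_test, Y_candidates):
    """Return, for each test example, the candidate label whose mapped
    representation M_rev @ y is closest in the example space."""
    mapped_labels = M_rev @ Y_candidates                   # (c, n_labels)
    d2 = (np.sum(mapped_labels**2, axis=0)[:, None]
          - 2 * mapped_labels.T @ X_test
          + np.sum(X_test**2, axis=0)[None, :])            # (n_labels, n_test)
    return np.argmin(d2, axis=0)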
The question now is which of these configurations is preferable for the sub-
sequent nearest neighbor step, and we provide an answer under the following
assumptions: (i) The source space and the target space are of equal dimensions;
(ii) the source and target objects are isotropically normally distributed and inde-
pendent; and (iii) the projected data is also isotropically normally distributed,
except that the variance has shrunk.
Let $D_1 = N(0, s_1^2 I)$ and $D_2 = N(0, s_2^2 I)$ be two multivariate normal distributions, with $s_1^2 < s_2^2$. We compare two configurations of a source object $x$ and target objects $y$: (a) one in which $x \sim D_1$ and $y \sim D_2$, and (b) one in which $x' \sim D_2$ and $y' \sim D_1$; here, the primes in (b) were added to distinguish the variables in the two configurations.
[Figure 1: left panel shows configuration (a) with circles $x$ and $y$; the center panel shows configuration (a) together with the scaled circles $x''$ and $y''$; the right panel shows configuration (b) with circles $x'$ and $y'$.]
Fig. 1. Schematic illustration for Sect. 4.2 in two-dimensional space. The left and the right panels depict configurations (a) and (b), respectively, with the center panel showing both configuration (a) and the scaled version of configuration (b) in the same space. A circle represents a distribution, with its radius indicating the standard deviation. The radius of the circles for $x$ (on the left panel) and $y'$ (right panel) is $s_1$, whereas that of the circles for $y$ (left panel) and $x'$ (right panel) is $s_2$, with $s_1 < s_2$. Circles $x''$ and $y''$ are the scaled versions of $x'$ and $y'$ such that the standard deviation (radius) of $x''$ is equal to that of $x$, which makes the standard deviation of $y''$ equal to $s_3 = s_1^2/s_2$.
Thus, $x''$ follows $N(0, s_1^2 I)$, and $y''$ follows $N(0, (s_1^4/s_2^2)\, I)$. Since both $x$ in configuration (a) and $x''$ above follow the same distribution, it now becomes possible to compare the properties of $y$ and $y''$ in light of the discussion at the end of Sect. 3: In order to reduce hubness, the distribution with a smaller variance is preferred to the one with a larger variance, for a fixed distribution of the source $x$ (or equivalently, $x''$).
It follows that $y''$ is preferable to $y$, because the former has a smaller variance. As mentioned above, the nearest neighbor relation between the scaled variables, $y''$ against $x''$ (or equivalently $x$), is identical to that of $y'$ against $x'$ in configuration (b). Therefore, we conclude that configuration (b) is preferable to configuration (a), in the sense that the former is more likely to suppress hubs.
Finally, recall that the preferred configuration (b) models the situation of our proposed approach, which is to map target objects into the space of source objects.
5 Related Work
The first use of ridge regression in ZSL can be found in the work of Palatucci
et al. [22]. Ridge regression has since been one of the standard approaches to
ZSL, especially for natural language processing tasks: phrase generation [7] and
bilingual lexicon extraction [7,8,20]. More recently, neural networks have been
used for learning non-linear mapping [11,25]. All of the regression-based methods
listed above, including those based on neural networks, map source objects into
the target space.
ZSL can also be formulated as a problem of canonical correlation analysis
(CCA). Hardoon et al. [12] used CCA and kernelized CCA for image labeling. Lazaridou et al. [16] compared ridge regression, CCA, singular value decomposition, and neural networks in image labeling. In our experiments (Sect. 6), we
use CCA as one of the baseline methods for comparison.
Dinu and Baroni [8] reported the hubness phenomenon in ZSL. They pro-
posed two reweighting techniques to reduce hubness in ZSL, which are applicable
6 Experiments
We evaluated the proposed approach with both synthetic and real datasets. In
particular, it was applied to two real ZSL tasks: bilingual lexicon extraction and
image labeling.
The main objective of the following experiments is to verify whether our
proposed approach is capable of suppressing hub formation and outperforming
the existing approach, as claimed in Sect. 4.
where $N_k(i)$ is the number of times the $i$th target object appears among the top-$k$ closest target objects of the source objects, and the skewness is computed over the total number of test objects in $Y$. A large $N_k$ skewness value indicates the existence of target objects that frequently appear in the $k$-nearest neighbor lists of source objects; i.e., the emergence of hubs.
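A NumPy sketch of this hubness measure, assuming the N_k skewness is the usual third standardized moment of the N_k counts (the exact formula is not reproduced above) and using illustrative argument names:

import numpy as np

def nk_skewness(dist, k=10):
    """Skewness of the N_k distribution.

    dist[i, j] = distance between target object i and (mapped) source object j.
    N_k(i) counts how often target i appears among the k nearest targets of a
    source object; a large positive skewness signals hub targets."""
    nearest = np.argsort(dist, axis=0)[:k, :]                    # k nearest targets per source
    Nk = np.bincount(nearest.ravel(), minlength=dist.shape[0]).astype(float)
    return np.mean((Nk - Nk.mean())**3) / (Nk.std()**3)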
Image Labeling. The second real task is image labeling, i.e., the task of finding
a suitable word label for a given image. Thus, source objects X are the images
and target objects Y are the word labels.
We used the Animals with Attributes (AwA) dataset⁸, which consists of 30,475
images of 50 animal classes. For image representation, we used the DeCAF fea-
tures [9], which are the 4096-dimensional vectors constructed with convolutional
neural networks (CNNs). DeCAF is also available from the AwA website. To save
computational cost, we used random projection to reduce the dimensionality of
DeCAF features to 500.
As with the bilingual lexicon extraction experiment, label features (word
representations) were constructed with word2vec, but this time they were trained
⁴ We also conducted experiments with English as X and other languages as Y. The results are not presented here due to lack of space, but the same trend was observed.
⁵ https://sites.google.com/site/rmyeid/projects/polyglot
⁶ https://code.google.com/p/word2vec/
⁷ http://hlt.sztaki.hu/resources/dict/bylangpair/wiktionary 2013july/
⁸ http://attributes.kyb.tuebingen.mpg.de/
Table 1. Experimental results: MAP is the mean average precision. Acck is the accuracy of the k-nearest neighbor list. Nk is the skewness of the Nk distribution. A high Nk skewness indicates the emergence of hubs (smaller is better). The bold figure indicates the best performer for each evaluation criterion.
method MAP Acc1 Acc10 N1 N10
RidgeX→Y 21.5 13.8 36.3 24.19 12.75
RidgeX→Y + NICDM 58.2 47.6 78.4 13.71 7.94
RidgeY→X (proposed) 91.7 87.6 98.3 0.46 1.18
CCA 78.9 71.6 91.7 12.0 7.56
CCA + NICDM 87.6 82.3 96.5 0.96 2.58
method cs de fr ru ja hi
RidgeX→Y 1.7 1.0 0.7 0.5 0.9 5.3
RidgeX→Y + NICDM 11.3 7.1 5.9 3.8 10.2 21.4
RidgeY→X (proposed) 40.8 30.3 46.5 31.1 42.0 40.6
CCA 24.0 18.1 33.7 21.2 27.3 11.8
CCA + NICDM 30.1 23.4 39.7 26.7 35.3 19.3
method                    cs: Acc1 Acc10   de: Acc1 Acc10   fr: Acc1 Acc10   ru: Acc1 Acc10   ja: Acc1 Acc10   hi: Acc1 Acc10
RidgeX→Y 0.7 2.8 0.4 1.6 0.3 1.2 0.2 0.8 0.2 1.3 2.9 8.2
RidgeX→Y + NICDM 7.2 17.9 4.3 11.4 3.5 9.8 2.1 6.3 6.1 16.8 14.4 32.6
RidgeY→X (proposed) 31.5 54.5 21.6 43.0 36.6 58.6 21.9 43.6 31.9 56.3 31.1 55.4
CCA 17.9 32.7 12.9 25.2 27.0 41.7 15.2 28.8 20.2 37.3 7.4 18.9
CCA + NICDM 21.9 42.3 16.1 33.9 31.1 50.1 18.7 37.0 25.9 48.8 12.4 30.7
method                    cs: N1 N10   de: N1 N10   fr: N1 N10   ru: N1 N10   ja: N1 N10   hi: N1 N10
RidgeX→Y 50.29 23.84 43.00 24.37 67.79 35.83 95.05 35.36 62.12 22.78 23.75 10.84
RidgeX→Y + NICDM 41.56 20.38 39.32 20.82 57.18 25.97 89.08 30.70 57.57 21.62 20.33 9.21
RidgeY→X (proposed) 11.91 10.74 12.49 11.94 2.56 2.77 4.28 4.18 5.15 6.76 10.45 6.14
CCA 28.00 18.67 36.66 18.98 30.18 15.95 51.92 21.60 37.73 18.27 22.31 8.95
CCA + NICDM 25.00 17.13 32.94 17.65 25.20 14.65 42.61 20.72 34.66 13.16 22.00 8.46
on the English version of Wikipedia (as of March 4, 2015) to cover all AwA labels.
Except for the corpus, we used the same word2vec parameters as with bilingual
lexicon extraction.
Table 1 shows the experimental results. The trends are fairly clear: The proposed
approach, RidgeY→X , outperformed other methods in both MAP and Acck , over
all tasks. RidgeX→Y and CCA combined with NICDM performed better than
those with Euclidean distances, although they still lagged behind the proposed
method RidgeY→X even with NICDM.
The Nk skewness achieved by RidgeY→X was lower (i.e., better) than that of
compared methods, meaning that it effectively suppressed the emergence of hub
labels. In contrast, RidgeX→Y produced a high skewness which was in line with
its poor prediction accuracy. These results support the expectation we expressed
in the discussion in Sect. 4.
The results presented in the tables show that the degree of hubness (Nk )
for all tested methods inversely correlates with the correctness of the output
rankings, which strongly suggests that hubness is one major factor affecting the
prediction accuracy.
For the AwA image dataset, Akata et al. [2, the fourth row (CNN) and
second column (ϕw ) of Table 2] reported a 39.7% Acc1 score, using image
representations trained with CNNs, and 100-dimensional word representations
trained with word2vec. For comparison, our proposed approach, RidgeY→X , was
evaluated in a similar setting: We used the DeCAF features (which were also
trained with CNNs) without random projection as the image representation,
and 100-dimensional word2vec word vectors. In this setup, RidgeY→X achieved
a 40.0% Acc1 score. Although the experimental setups are not exactly identical
and thus the results are not directly comparable, this suggests that even linear
ridge regression can potentially perform as well as more recent methods, such as
Akata et al.’s, simply by exchanging the observation and response variables.
7 Conclusion
This paper has presented our formulation of ZSL as a regression problem of find-
ing a mapping from the target space to the source space, which opposes the way
in which regression has been applied to ZSL to date. Assuming a simple model in
which data follows a multivariate normal distribution, we provided an explana-
tion as to why the proposed direction is preferable, in terms of the emergence of
hubs in the subsequent nearest neighbor search step. The experimental results
showed that the proposed approach outperforms the existing regression-based
and CCA-based approaches to ZSL.
Future research topics include: (i) extending the analysis of Sect. 4 to cover
multi-modal data distributions, or other similarity/distance measures such as
References
1. Ács, J., Pajkossy, K., Kornai, A.: Building basic vocabulary across 40 languages.
In: Proceedings of the 6th Workshop on Building and Using Comparable Corpora,
pp. 52–58 (2013)
2. Akata, Z., Lee, H., Schiele, B.: Zero-shot learning with structured embeddings
(2014). arXiv preprint arXiv:1409.8403v1
3. Al-Rfou, R., Perozzi, B., Skiena, S.: Polyglot: Distributed word representations for
multilingual NLP. In: CoNLL 2013, pp. 183–192 (2013)
4. Bakir, G., Hofmann, T., Schölkopf, B., Smola, A.J., Taskar, B., Vishwanathan,
S.V.N. (eds.): Predicting Structured Data. MIT press (2007)
5. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: appli-
cations to image and text data. In: KDD 2001, pp. 245–250 (2001)
6. Dasgupta, S.: Experiments with random projection. In: UAI 2000, pp. 143–151
(2000)
7. Dinu, G., Baroni, M.: How to make words with vectors: phrase generation in dis-
tributional semantics. In: ACL 2014, pp. 624–633 (2014)
8. Dinu, G., Baroni, M.: Improving zero-shot learning by mitigating the hubness
problem. In: Workshop at ICLR 2015 (2015)
9. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E.,
Darrell, T.: DeCAF: a deep convolutional activation feature for generic visual
recognition (2013). arXiv preprint arXiv:1310.1531
10. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their
attributes. In: CVPR 2009, pp. 1778–1785 (2009)
11. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. In: NIPS 2013, pp. 2121–2129
(2013)
12. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: An
overview with application to learning methods. Neural Computation 16, 2639–2664
(2004)
13. Jegou, H., Harzallah, H., Schmid, C.: A contextual dissimilarity measure for accu-
rate and efficient image search. In: CVPR 2007, pp. 1–8 (2007)
14. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object
classes by between-class attribute transfer. In: CVPR 2009. pp. 951–958 (2009)
15. Larochelle, H., Erhan, D., Bengio, Y.: Zero-data learning of new tasks. In: AAAI
2008, pp. 646–651 (2008)
16. Lazaridou, A., Bruni, E., Baroni, M.: Is this a wampimuk? Cross-modal map-
ping between distributional semantics and the visual world. In: ACL 2014,
pp. 1403–1414 (2014)
17. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press (2008)
18. Mika, S., Schölkopf, B., Smola, A., Müller, K.R., Scholz, M., Rätsch, G.: Kernel
PCA and de-noising in feature space. In: NIPS 1998, pp. 536–542 (1998)
19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. In: Workshop at ICLR 2013 (2013)
20. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for
machine translation (2013). arXiv preprint arXiv:1309.4168
21. Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado,
G.S., Dean, J.: Zero-shot learning by convex combination of semantic embeddings.
In: ICLR 2014 (2014)
22. Palatucci, M., Pomerleau, D., Hinton, G., Mitchell, T.M.: Zero-shot learning with
semantic output codes. In: NIPS 2009, pp. 1410–1418 (2009)
23. Radovanović, M., Nanopoulos, A., Ivanović, M.: Hubs in space: Popular nearest
neighbors in high-dimensional data. Journal of Machine Learning Research 11,
2487–2531 (2010)
24. Schnitzer, D., Flexer, A., Schedl, M., Widmer, G.: Local and global scaling reduce
hubs in space. Journal of Machine Learning Research 13, 2871–2902 (2012)
25. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.Y.: Zero-shot learning through
cross-modal transfer. In: NIPS 2013, pp. 935–943 (2013)
26. Suzuki, I., Hara, K., Shimbo, M., Saerens, M., Fukumizu, K.: Centering similarity
measures to reduce hubs. In: EMNLP 2013, pp. 613–623 (2013)
27. Tomašev, N., Rupnik, J., Mladenić, D.: The role of hubs in cross-lingual supervised
document retrieval. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.)
PAKDD 2013, Part II. LNCS, vol. 7819, pp. 185–196. Springer, Heidelberg (2013)
28. Vinokourov, A., Shawe-Taylor, J., Cristianini, N.: Inferring a semantic representa-
tion of text via cross-language correlation analysis. In: NIPS 2002, pp. 1473–1480
(2002)
29. Weston, J., Chapelle, O., Vapnik, V., Elisseeff, A., Schölkopf, B.: Kernel depen-
dency estimation. In: NIPS 2002, pp. 873–880 (2002)
Solving Prediction Games with Parallel Batch
Gradient Descent
1 Introduction
In many security-related applications, the assumption that training data and
data at application time are identically distributed is routinely violated. For
instance, new malware is designed to bypass detection methods that its designers believe virus and malware scanners employ, and email spamming
tools allow their users to develop templates of randomized messages that pro-
duce a low spam score with current filters. In these examples, the party that
creates the predictive model and the data-generating party factor the possible
actions of their opponent into their decisions. This interaction can be modeled
as a prediction game in which one player controls the predictive model whereas
another exercises some control over the process of data generation.
Robust learning methods have been derived under the zero-sum assumption
that the loss of one player is the gain of the other, for several types of adversar-
ial feature transformations [7,8,11–13,17,23,24]. Settings in which both players
have individual cost functions—a fraudster’s profit is not the negative of an email
service provider’s goal of achieving a high spam recognition rate at close-to-zero
false positives—cannot adequately be modeled as zero-sum games.
When the learner has to act first and model parameters are disclosed to the
data generator, this non-zero-sum interaction can be modeled as a Stackelberg
M. Großhans—This work was done while the author was at the University of Pots-
dam, Germany.
generation model. We derive a method for finding the unique equilibrium point
in a way that can be parallelized in Section 4. Section 5 presents empirical results, and Section 6 concludes.
The theoretical costs of both players depend on the unknown test distribution; we will therefore resort to regularized, empirical costs based on the training sample. The empirical costs incurred by the predictive model $f_{\mathbf{w}}$ and the transformation $g_A$ are
$$\hat\theta_f(\mathbf{w}, A) = \frac{1}{n}\sum_{i=1}^n \ell_f\big(\mathbf{w}^\top g_A(\mathbf{x}_i),\, y_i\big) + \rho_f\,\Omega_f(\mathbf{w})\,,$$
$$\hat\theta_g(\mathbf{w}, A) = \frac{1}{n}\sum_{i=1}^n \ell_g\big(\mathbf{w}^\top g_A(\mathbf{x}_i),\, y_i\big) + \rho_g\,\Omega_g(A)\,.$$
which is not strongly convex in $A$ for every data matrix $X$. Hence, this regularizer can have multiple optima, which should be avoided. Therefore, we use the Frobenius norm of $A$ as the regularizer for the data generator; we use standard $\ell_2$ regularization for the learner:
$$\Omega_f(\mathbf{w}) = \|\mathbf{w}\|^2\,, \qquad (3)$$
$$\Omega_g(A) = \frac{1}{2}\,\|A\|_F^2 = \frac{1}{2}\sum_{i,j} a_{ij}^2\,. \qquad (4)$$
In this section, we analyze the prediction game between learner and data gen-
erator that we have introduced in the previous section. We will derive conditions
under which equilibrium points exist, and conditions under which an equilibrium
point is unique.
Known results on the existence of equilibrium points for prediction games [3]
do not apply to the data transformation model derived in Section 2: Equation 4
regularizes A because regularization of ||X − gA (X)|| would not be convex in A,
and therefore Assumption 3 of [3] is not met.
We will now study under which conditions the prediction game between learner
and data generator with the data transformation introduced above has at least
one equilibrium point. We start by formulating conditions on action spaces and
loss functions in the following assumption.
Assumption 1. The players' action sets $\mathcal{W}$ and $\mathcal{A}$ and loss functions $\ell_f$ and $\ell_g$ satisfy the following properties.
Theorem 1. Under Assumption 1 the game has at least one equilibrium point.
Proof. By Assumption 1, the loss functions $\ell_f(z_i, y_i)$ and $\ell_g(z_i, y_i)$ are continuous and convex in $z_i$ for any $y_i \in \mathcal{Y}$. Note that $z_i = \mathbf{w}^\top\mathbf{x}_i + \mathbf{w}^\top A\mathbf{x}_i$ is linear in $\mathbf{w} \in \mathbb{R}^m$ and linear in $A \in \mathbb{R}^{m\times m}$ for any $(\mathbf{x}_i, y_i) \in \mathcal{X}\times\mathcal{Y}$. Hence, for both $\nu \in \{f, g\}$, the sum of loss terms $\sum_{i=1}^n \ell_\nu(z_i, y_i)$ is jointly continuous in $(\mathbf{w}, A) \in \mathbb{R}^{m\times(m+1)}$ and convex in both $\mathbf{w} \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m\times m}$. Both regularizers $\Omega_f$ and $\Omega_g$ are jointly continuous in $(\mathbf{w}, A) \in \mathbb{R}^{m\times(m+1)}$. Additionally, $\Omega_f$ is strictly convex in $\mathbf{w} \in \mathbb{R}^m$ and $\Omega_g$ is strictly convex in $A \in \mathbb{R}^{m\times m}$.
Hence, both empirical cost functions $\hat\theta_f$ and $\hat\theta_g$ are jointly continuous in $(\mathbf{w}, A) \in \mathbb{R}^{m\times(m+1)}$. Additionally, $\hat\theta_f$ is strictly convex in $\mathbf{w} \in \mathbb{R}^m$ and $\hat\theta_g$ is strictly convex in $A \in \mathbb{R}^{m\times m}$. Therefore, by Theorem 4.3 in [2]—together with the fact that both action spaces are non-empty, compact, and convex—at least one equilibrium point exists.
is positive semi-definite.
Proof. Note that we can rewrite this matrix as a product of three matrices built from $XA^\top$, $\mathbf{w}^\top \otimes I_m$, and $\mathbf{w} \otimes I_m$. Furthermore,
$$(\lambda - a)\,\mathbf{w} = V\mathbf{x}\,, \qquad (7)$$
$$(\lambda - b)\,V = \mathbf{w}\mathbf{x}^\top \qquad (8)$$
hold for every eigenvector $(\mathbf{w}^\top, \mathbf{v}_1^\top, \dots, \mathbf{v}_m^\top)^\top$ with corresponding eigenvalue $\lambda$.
Firstly, assume that $\mathbf{w} = 0$ holds. By the definition of an eigenvector, the matrix $V$ must then be non-zero. By Equation 8, the corresponding eigenvalue is $\lambda = b$ and hence positive.
Now assume that $\mathbf{w} \neq 0$ holds. Then, by using Equation 9, it turns out that $(\lambda - a)(\lambda - b) = \mathbf{x}^\top\mathbf{x}$. Solving this equation for $\lambda$ results in the following two solutions:
$$\lambda_{1,2} = \frac{a + b}{2} \pm \sqrt{\frac{(a - b)^2}{4} + \mathbf{x}^\top\mathbf{x}}\,.$$
Therefore, matrix $M_2$ is positive semi-definite if and only if
$$\frac{a + b}{2} \;\ge\; \sqrt{\frac{(a - b)^2}{4} + \mathbf{x}^\top\mathbf{x}} \qquad (10)$$
holds, which is equivalent to the inequality
$$\frac{(a + b)^2}{4} \;\ge\; \frac{(a - b)^2}{4} + \mathbf{x}^\top\mathbf{x}\,.$$
Hence, the smallest eigenvalue is non-negative if and only if $a \cdot b \ge \mathbf{x}^\top\mathbf{x}$, which completes the proof.
We can now formulate assumptions under which a unique equilibrium point exists.
Assumption 2. For a given data matrix $X \in \mathbb{R}^{m\times n}$ and labels $\mathbf{y} \in \mathcal{Y}^n$, the players' action sets $\mathcal{W}$ and $\mathcal{A}$, loss functions $\ell_f$ and $\ell_g$, and regularization parameters $\rho_f, \rho_g$ satisfy the following properties.
1. The second derivatives of the loss functions are equal for all $y \in \mathcal{Y}$ and $z \in \mathbb{R}$: $\ell_f''(z, y) = \ell_g''(z, y)$.
2a. The regularization parameters satisfy
$$\rho_f\,\rho_g \;>\; \sup_{(\mathbf{w},A)\in\mathcal{W}\times\mathcal{A}} \bar{\mathbf{x}}_{(\mathbf{w},A,X,\mathbf{y})}^\top\, \bar{\mathbf{x}}_{(\mathbf{w},A,X,\mathbf{y})}\,.$$
$$d_f = \bar{\mathbf{w}} - \mathbf{w}\,, \qquad d_g = \bar A - A\,.$$
Inexact line search tries increasingly large values of the step size $t$ and performs an update by adding $t\,d_f$ to the learner's prediction model $\mathbf{w}$ and by adding $t\,d_g$ to the data generator's transformation matrix $A$. This procedure converges to the unique Nash equilibrium—von Heusinger and Kanzow discuss its convergence properties [16].
$$\hat\theta\big([\mathbf{w}_1, A_1], [\mathbf{w}_2, A_2]\big) = \hat\theta_f(\mathbf{w}_1, A_1) - \hat\theta_f(\mathbf{w}_2, A_1) + \hat\theta_g(\mathbf{w}_1, A_1) - \hat\theta_g(\mathbf{w}_1, A_2)\,. \qquad (13)$$
This function quantifies the cost savings that the learner could achieve by
unilaterally changing the model from w1 to w2 plus the cost savings that the
data generator could achieve by unilaterally changing the transformation from
A1 to A2 . Nikaido-Isoda function θ̂ is concave in (w2 , A2 ) because θ̂f and θ̂g
are convex, and the cost functions for (w2 , A2 ) enter the function as negatives.
For convex-concave Nikaido-Isoda functions, parameters [w∗ , A∗ ] are an equi-
librium point if and only if the Nikaido-Isoda function has a saddle point at
([w∗ , A∗ ], [w∗ , A∗ ]) [9]. The intuition behind this result is the following. An equi-
librium point (w∗ , A∗ ) satisfies Equations 5 and 6 by definition. By Equation 13,
θ̂([w∗ , A∗ ], [w∗ , A∗ ]) = 0. Equations 5 and 6 imply that θ̂([w, A], [w∗ , A∗ ]) is
positive and $\hat\theta([\mathbf{w}^*, A^*], [\mathbf{w}, A])$ is negative for $[\mathbf{w}, A] \neq [\mathbf{w}^*, A^*]$. When $\hat\theta$ is convex in $[\mathbf{w}_1, A_1]$ and concave in $[\mathbf{w}_2, A_2]$, this means that $(\mathbf{w}^*, A^*)$ is an equilibrium point if and only if $\hat\theta$ has a saddle point at position $([\mathbf{w}^*, A^*], [\mathbf{w}^*, A^*])$.
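A small sketch of how the Nikaido-Isoda value of Equation 13 can be evaluated, assuming the two empirical cost functions are available as Python callables (the callable and argument names are illustrative):

def nikaido_isoda(theta_f, theta_g, w1, A1, w2, A2):
    """Value of Equation 13: the joint cost savings obtainable by unilateral
    deviations from (w1, A1) to (w2, A2)."""
    return (theta_f(w1, A1) - theta_f(w2, A1)
            + theta_g(w1, A1) - theta_g(w1, A2))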
Saddle points of convex-concave functions can be located with the Arrow-
Hurwicz-Uzawa method [1]. We implement the method as an iterative procedure
with a constant stepsize t [20]. In each iteration j, the method computes the
gradient of θ̂ with respect to w1 , w2 , A1 and A2 , and performs a descent by
updating previous estimates:
$$(\mathbf{w}_1, A_1)^{(j+1)} = (\mathbf{w}_1, A_1)^{(j)} - t\,\nabla_{(\mathbf{w}_1, A_1)}\,\hat\theta\big([\mathbf{w}_1, A_1]^{(j)}, [\mathbf{w}_2, A_2]^{(j)}\big)\,,$$
$$(\mathbf{w}_2, A_2)^{(j+1)} = (\mathbf{w}_2, A_2)^{(j)} + t\,\nabla_{(\mathbf{w}_2, A_2)}\,\hat\theta\big([\mathbf{w}_1, A_1]^{(j)}, [\mathbf{w}_2, A_2]^{(j)}\big)\,.$$
The final estimator of the equilibrium point after $T$ iterations is the average of all iterates: $(\hat{\mathbf{w}}^*, \hat A^*) = \frac{1}{T}\sum_{j=1}^T (\mathbf{w}_1, A_1)^{(j)}$. For any convex-concave $\hat\theta$, this method converges towards a saddle point.
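A sketch of this iteration, assuming callables grad_1 and grad_2 that return the gradients of θ̂ with respect to its first and second argument pair, and a flat array representation of each parameter pair; all names are assumptions made for this example:

import numpy as np

def arrow_hurwicz_uzawa(grad_1, grad_2, p1_init, p2_init, t=1e-3, T=10000):
    """Constant-stepsize saddle-point iteration: descend in the first argument
    pair, ascend in the second, and average all iterates of the first pair."""
    p1, p2 = p1_init.copy(), p2_init.copy()
    running_sum = np.zeros_like(p1)
    for _ in range(T):
        g1, g2 = grad_1(p1, p2), grad_2(p1, p2)   # gradients at the current point
        p1, p2 = p1 - t * g1, p2 + t * g2         # simultaneous descent / ascent
        running_sum += p1
    return running_sum / T                        # averaged iterate as estimator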
Both the inexact line search method sketched in Section 4.1 and the Arrow-
Hurwicz-Uzawa method derived in Section 4.2 can be implemented in a batch-
parallel manner. To this end, the data is randomly partitioned into k batches
(Xi , yi ), where i = 1, . . . , k. In practice, rather than splitting the data into k dis-
joint partitions, it is advisable to split the data into a larger number of portions
but have some overlap between the portions. In the map step, k parallel nodes
perform gradient descent on their respective batch of training examples; in the
final reduce step, the k parameter vectors are averaged [26]. The execution time
of averaging k parameter vectors wi ∈ Rm is vanishingly small in comparison to
the execution time of the parallel gradient descent.
When $\mathbf{w}_1, \dots, \mathbf{w}_k$ are equilibrium points of the games given by the respective partitions of the sample, then the averaged vector $\bar{\mathbf{w}} = \frac{1}{k}\sum_{j=1}^k \mathbf{w}_j$ still cannot be guaranteed to be an equilibrium point of the game given by the entire sample. In fact, in the experimental study we will find example cases where this is not the case.
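A sketch of the batch-parallel scheme using Python's multiprocessing, where solve_game stands for any of the sequential equilibrium solvers above; its name and signature are assumptions made for this example:

import numpy as np
from multiprocessing import Pool

def parallel_equilibrium(solve_game, X, y, k=8, overlap=0.25, seed=0):
    """Map: solve the prediction game on k (optionally overlapping) random
    portions of the data. Reduce: average the resulting parameter vectors.
    solve_game(X_i, y_i) is an assumed per-batch solver returning w_i."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]                                  # samples are columns of X
    size = int(n / k * (1 + overlap))
    portions = [rng.choice(n, size=size, replace=False) for _ in range(k)]
    with Pool(k) as pool:
        ws = pool.starmap(solve_game, [(X[:, idx], y[idx]) for idx in portions])
    return np.mean(ws, axis=0)                      # averaged model (reduce step)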
[Figure 1: three panels — "Error Rate over Future Months" (x-axis: Month of Reception, y-axis: relative error rate w.r.t. LR; curves LR, NashGlobal, NashParam, BestResp), "Nikaido-Isoda during Optimization" (x-axis: execution time in seconds, y-axis: log-value of the Nikaido-Isoda function; curves ILS, AHU), and "Nikaido-Isoda Function" (x-axis: number of workers; curves ILS-Avg., ILS-Sgl.).]
Fig. 1. Relative error (with respect to logistic regression, LR) of classification models evaluated into the future (left). Value of the Nikaido-Isoda function over time for three different optimization algorithms (center) and parallelized models (right). Error bars show standard errors.
5 Experimental Results
The goal of this section is to explore the robustness and scalability of sequential
and parallelized methods that locate equilibrium points. We use a data set of
290,262 emails collected by an email service provider [3]. Each instance contains
the term frequency of 226,342 terms. We compute a PCA of the emails and
use the first 250 principal components as feature representation for most exper-
iments. The data set is sorted chronologically. Emails that have been received
over the final 12 months of the data collection period are held out for evaluation.
Emails received in the month before that are used for tuning of the regularization
parameters. Training emails are drawn from the remaining set of older emails.
This section compares the convergence rates of the inexact line search (ILS )
and Arrow-Hurwicz-Uzawa (AHU ) approaches to finding equilibrium points,
discussed in Sections 4.1 and 4.2, respectively.
In each repetition of the experiment, we sample 10,000 instances from the
training portion of the data. Here, we use the 500 first principal components
as feature representation. In each iteration of the optimization procedures, we
measure the Nikaido-Isoda function of the current pair of parameters and the
best possible reactions to these parameters—this function reaches zero at an
equilibrium point. Figure 1 (center) shows that the ILS procedure converges
very quickly. By contrast, AHU requires several orders of magnitude more time
before the Nikaido-Isoda function drops noticeably (not visible in the diagram);
we have not been able to observe convergence. Increasing the regularization
parameters by a factor of 100—which should make the optimization criterion
more convex—did not change these findings. We therefore excluded AHU from
further investigation.
[Figure 2 shows three panels titled "8 Nodes with 1/8-th of Data", plotting accuracy over the month of reception; curves LR-8-Avg., LR-1-Sgl., LR-8-Sgl. (left), ILS-8-Avg., ILS-1-Sgl., ILS-8-Sgl. (center), and LR-8-Avg. versus ILS-8-Avg. (right).]
Fig. 2. Accuracy of logistic regression (left) and equilibrium points (center). Error
rate of the aggregated equilibrium point relative to the error rate of aggregated logistic
regression models (right). Each of 8 nodes processes 1/8-th of the data. Error bars show
standard errors.
[Figure 3 shows three panels titled "8 Nodes with 1/4-th of Data", plotting accuracy over the month of reception; curves LR-8-Avg., LR-1-Sgl., LR-8-Sgl. (left), ILS-8-Avg., ILS-1-Sgl., ILS-8-Sgl. (center), and LR-8-Avg. versus ILS-8-Avg. (right).]
Fig. 3. Accuracy of logistic regression (left) and equilibrium points (center). Error
rate of the aggregated equilibrium point relative to the error rate of aggregated logistic
regression models (right). Each of 8 nodes processes 1/4-th of the data. Error bars show
standard errors.
[Figure 4 shows three panels titled "8 Nodes with Sq. Root of 1/8-th of Data", plotting accuracy over the month of reception; curves LR-8-Avg., LR-1-Sgl., LR-8-Sgl. (left), ILS-8-Avg., ILS-1-Sgl., ILS-8-Sgl. (center), and LR-8-Avg. versus ILS-8-Avg. (right).]
Fig. 4. Accuracy of logistic regression (left) and equilibrium points (center). Error
rate of the aggregated equilibrium point relative to the error rate of aggregated logistic
regression models (right). Each of 8 nodes processes √(1/8) of the data. Error bars show
standard errors.
earlier results on parallel stochastic gradient descent [18,26]. The same is true
for the equilibrium models found by ILS: Figures 2 (center, 1/8 of the data per node), 3 (center, 1/4 of the data per node), and 4 (center, √(1/8) of the data per
node) show that the averaged models ILS-8-Avg outperform the parallel models
ILS-8-Sgl. The sequentially trained model ILS-1-Sgl outperforms the averaged
models.
Figures 2 (right, 1/8 of the data), 3 (right, 1/4 of the data), and 4 (right, √(1/8) of the data) show the error rate of ILS-8-Avg relative to the error rate of LR-8-Avg, in analogy to Figure 1 (left). While in Figure 1 (left) the equilibrium
points have outperformed LR, the averaged model ILS-8-Avg tends to have a
similar error rate as LR-8-Avg. The averaged equilibrium parameters—while still
outperforming the equilibrium parameters trained on parallel batches—are no
longer more accurate than the averaged logistic regression models.
We investigate further why this is the case. Figure 1 (right) shows the value
of the Nikaido-Isoda function at the end of the batch optimization process for a
single model trained on 1/k of the data (ILS-Sgl), and the corresponding Nikaido-Isoda function value for the average of k models trained on 1/k of the data each (ILS-Avg). Surprisingly, the averaged parameter vectors have a higher function value, which means that they are further away from being equilibrium points than the individual models.
We can conclude that for this application (a) equilibrium points tend to be
more accurate than standard logistic regression models; (b) averaging parameter
vectors that have been trained on different batches of the data always leads to
more accurate models; but (c) averaging equilibrium points tends to lead to
model parameters that are no longer equilibrium points, and are therefore not
generally more accurate than standard logistic regression models.
6 Conclusion
It turns out that aggregates of equilibrium points are not equilibrium points
themselves; the Nikaido-Isoda function increases its value during the aggrega-
tion step. Therefore, aggregated logistic regression models are about as accurate
as aggregated equilibrium points. From a practical point of view, this implies
that searching for equilibrium points is advisable for adversarial applications as
long as training data, and not computation time, is the limiting factor. As the
sample size increases, the computation time needed to locate equilibrium points
on a single node becomes the limiting factor. For intermediate sample sizes, it
may still be possible (and advisable) to train a model on a single node using iid
learning. For even larger sample sizes, this becomes impossible. At this point,
aggregated batch-parallel gradient descent outperforms sequential optimization
using a subset of the data. At this point, however, aggregated equilibrium points
offer no advantage over aggregated models trained under the iid assumption.
Acknowledgment. This work was supported by the German Science Foundation DFG
under grant SCHE540/12-2.
References
1. Arrow, K., Hurwicz, L., Uzawa, H.: Studies in Linear and Non-Linear Program-
ming. Stanford University Press, Stanford (1958)
2. Basar, T., Olsder, G.J.: Dynamic Noncooperative Game Theory. Academic Press,
London/New York (1995)
3. Brückner, M., Kanzow, C., Scheffer, T.: Static prediction games for adversarial
learning problems. Journal of Machine Learning Research 12, 2617–2654 (2012)
4. Brückner, M., Scheffer, T.: Stackelberg games for adversarial learning problems.
In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (2011)
5. Bu, Y., Howe, B., Balazinska, M., Ernst, M.: Haloop: efficient iterative data pro-
cessing on large clusters. In: Proceedings of the VLDB Endowment, vol. 3 (2010)
6. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters.
In: Proceedings of the 6th Symposium on Operating System Design and Imple-
mentation (2004)
7. Dekel, O., Shamir, O.: Learning to classify with missing and corrupted features.
In: Proceedings of the International Conference on Machine Learning. ACM Press
(2008)
8. Dekel, O., Shamir, O., Xiao, L.: Learning to classify with missing and corrupted
features. Machine Learning 81(2), 149–178 (2010)
9. Flam, S., Ruszczynski, A.: Computing normalized equilibria in convex-concave
games. Working Papers 2006:9, Lund University, Department of
Economics (2006)
10. Geiger, C., Kanzow, C.: Theorie und Numerik restringierter Optimierungsauf-
gaben. Springer, Heidelberg (2002)
11. Ghaoui, L.E., Lanckriet, G.R.G., Natsoulis, G.: Robust classification with interval
data. Tech. Rep. UCB/CSD-03-1279, EECS Department, University of California,
Berkeley (2003)
12. Globerson, A., Roweis, S.T.: Nightmare at test time: robust learning by feature
deletion. In: Proceedings of the International Conference on Machine Learning.
ACM Press (2006)
13. Globerson, A., Teo, C.H., Smola, A.J., Roweis, S.T.: An adversarial view of
covariate shift and a minimax approach. In: Dataset Shift in Machine Learning,
pp. 179–198. MIT Press (2009)
14. Großhans, M., Sawade, C., Brückner, M., Scheffer, T.: Bayesian games for adversar-
ial regression problems. In: Proceedings of the International Conference on Machine
Learning (2013)
15. Hardt, M., Megiddo, N., Papadimitriou, C., Wooters, M.: Strategic classification.
Unpublished manuscript
16. von Heusinger, A., Kanzow, C.: Relaxation methods for generalized nash equi-
librium problems with inexact line search. Journal of Optimization Theory and
Applications 143(1), 159–183 (2009)
17. Lanckriet, G.R.G., Ghaoui, L.E., Bhattacharyya, C., Jordan, M.I.: A robust mini-
max approach to classification. Journal of Machine Learning Research 3, 555–582
(2002)
18. Mann, G., McDonald, R., Mohri, M., Silberman, N., Walker, D.: Efficient large-
scale distributed training of conditional maximum entropy models. In: Advances
in Neural Information Processing, vol. 22 (2009)
19. Nedic, A., Bertsekas, D., Borkar, V.: Distributed asynchronous incremental sub-
gradient methods. Studies in Computational Mathematics 8, 381–407 (2001)
20. Nedic, A., Ozdaglar, A.: Subgradient methods for saddle-point problems. Journal
of Optimization Theory and Applications 142(1), 205–228 (2009)
21. Nikaido, H., Isoda, K.: Note on noncooperative convex games. Pacific Journal of
Mathematics 5, 807–815 (1955)
22. Rosen, J.B.: Existence and uniqueness of equilibrium points for concave n-person
games. Econometrica 33(3), 520–534 (1965)
23. Teo, C.H., Globerson, A., Roweis, S.T., Smola, A.J.: Convex learning with invari-
ances. In: Advances in Neural Information Processing Systems. MIT Press (2007)
24. Torkamani, M., Lowd, D.: Convex adversarial collective classification. In:
Proceedings of the International Conference on Machine Learning (2013)
25. Weimer, M., Condie, T., Ramakrishnan, R.: Machine learning in scalops, a
higher order cloud computing language. In: NIPS 2011 Workshop on Parallel and
Large-scale Machine Learning (BigLearn) (2011)
26. Zinkevich, M., Weimer, M., Smola, A., Li, L.: Parallelized stochastic gradient
descent. In: Advances in Neural Information Processing, vol. 23 (2010)
Structured Regularizer for Neural Higher-Order
Sequence Models
1 Introduction
Overfitting is a common and challenging problem in machine learning. It occurs
when a learning algorithm overspecializes to training samples, i.e. irrelevant or
noisy information for prediction is learned or even memorized. Consequently,
the learning algorithm does not generalize to unseen data samples. This results
in large test error, while obtaining small training error. A common assumption
is that complex models are prone to overfitting, while simple models have lim-
ited predictive expressiveness. Therefore a trade-off between model complexity
and predictive expressiveness needs to be found. (This work was supported by the Austrian Science Fund (FWF) under the project number P25244-N15. Furthermore, we acknowledge NVIDIA for providing GPU computing resources.) Usually, a penalty term for
the model complexity is added to the training objective. This penalty term is
called regularization. Many regularization techniques have been proposed; in parameterized models, priors on individual weights or on groups of weights, such as the l1-norm and l2-norm, are commonly used.
Recently, dropout [12] and dropconnect [33] have been proposed as regulariza-
tion techniques for neural networks. During dropout training, input and hidden
units are randomly canceled. The cancellation of input units can be interpreted
as a special form of input noising and, therefore, as a special type of data aug-
mentation [18,32]. During dropconnect training, the connections between the
neurons are dropped [33]. Dropout and dropconnect can be interpreted as mix-
tures of neural networks with different structures. In this sense, dropout and
dropconnect have been interpreted as ensemble learning techniques. In ensemble learning, many different base classifiers are trained independently on the same prediction task. For testing, the predictions of the different classifiers are combined. In the dropout and dropconnect
approaches, the mixture of models is trained and utilized for testing. Recently,
pseudo-ensemble learning [1] has been suggested as a generalization of dropout
and dropconnect. In pseudo-ensemble learning, a mixture of child models gener-
ated by perturbing a parent model is considered.
We propose a generalization of pseudo-ensemble learning. We introduce a
mixture of models which share parts, e.g. neural sub-networks, called shared-
ensemble learning. The difference is that in shared-ensemble learning, there is no
parent model from which we generate child models. The models in the shared-
ensemble can be different, but share parts. This is in contrast to traditional
ensembles, which typically do not share parts. Based on shared-ensembles, we
derive a new regularizer as lower bound of the conditional likelihood of the
mixture of models. Furthermore, this regularizer can be expressed explicitly as
regularization term in the training objective. In this paper, we apply shared-
ensemble learning to linear-chain conditional random fields (LC-CRFs) [13] in
a sequence labeling task, derive a structured regularizer and demonstrate its
advantage in experiments. LC-CRFs are established models for sequence label-
ing [7,35], i.e. we assign some given input sequence x, e.g. a time series, to an
output label sequence y.
First-order LC-CRFs typically consist of transition factors, modeling the
relationship between two consecutive output labels, and local factors, mod-
eling the relationship between input observations (usually a sliding window
over input frames) and one output label. But CRFs are not limited to
such types of factors: Higher-order LC-CRFs (HO-LC-CRFs) allow for arbi-
trary input-independent (such factors depend on the output labels only) [35]
and input-dependent (such factors depend on both the input and output vari-
ables) higher-order factors [16,23]. That means both types of factors can include
more than two output labels. Clearly, the Markov order of the largest factor (on
the output side) dictates the order of LC-CRFs.
It is common practice to represent the higher-order factors by linear func-
tions, which can reduce the model’s expressiveness. To overcome this limitation, we represent the higher-order factors by multi-layer perceptron (MLP) networks.
2 Related Work
Dropout applied to the input has been formalized for some linear and log-linear
models [18,32]. Assuming a distribution of the dropout noise, an analytical form
of the dropout technique has been presented. The training objective has been
formulated as the expectation of the loss function under this distribution. Fur-
thermore, this objective has been reformulated as the original objective and an
additive explicit regularization.
As mentioned before, HO-LC-CRFs have been applied to sequence label-
ing in tagging tasks, in handwriting recognition [23,35] and large-scale machine
translation [16]. In these works, higher-order factors have not been modeled
by neural networks, which is the gap we fill. However, first-order factors have been modeled by neural networks before.
p_CRF(y|x) = (1/Z(x)) ∏_{t=1}^{T} ∏_{n=1}^{N} Φ_t(y_{t−n+1:t}; x),   (1)
Fig. 1. Factor graph of LC-CRFs using (a) input-dependent uni-gram features f^{1-1} and bi-gram transition features f^{2} (typical) and (b) additionally 3-gram features f^{3} as well as input-dependent features f^{2-2} and f^{3-3}.
F(w) = Σ_{j=1}^{J} log p(y^{(j)} | x^{(j)}),   (4)
Fig. 2. Three realizations of word-final /t/ in spontaneous Dutch. Left panel: Realization
of /rt/ in gestudeerd ’studied’. Middle panel: Realization of in leeftijd mag ‘age
is allowed’. Right panel: Realization of in want volgens ‘because according’ [28].
The weights are shared over time, i.e. w_k^{t,n} = w_k^{n}, as we use time-homogeneous features. To perform parameter learning using gradient ascent, all marginal posteriors of the form p(y_{t−n+1:t} | t, x^{(j)}) are required. These marginals can be efficiently
computed using the forward-backward algorithm [23,35]. The algorithm can be
easily extended to CRFs of order (N − 1) > 2. However, for simplicity and as
we are targeting GPU platforms, we choose another approach. As we describe
in more detail in Section 3.2, we compute the conditional log-likelihood by com-
puting just the forward recursion. Then we utilize back-propagation [27] as com-
mon in typical neural networks to compute their gradients. The conditional
likelihood, the forward recursion and the corresponding gradients are computed
using Theano [2], a mathematical expression compiler for GPUs and automatic
differentiation toolbox.
The main ingredient needed for applying the back-propagation algorithm is the
forward recursion and the computation of the normalization constant. For a
given input-output sequence pair (x, y), the forward recursion is given in terms
of quantities αt (yt−1:t ) that are updated according to
α_t(y_{t−1:t}) = Φ_t(y_t; x) Φ_t(y_{t−1:t}; x) Σ_{y_{t−2}} Φ_t(y_{t−2:t}; x) α_{t−1}(y_{t−2:t−1}).   (5)
The most probable label sequence can be found by the Viterbi algorithm gener-
alized to HO-LC-CRFs: The summation in the forward recursion is exchanged
by the maximum operation, i.e. quantities α̂t (yt−1:t ) are computed as
α̂_t(y_{t−1:t}) = Φ_t(y_t; x) Φ_t(y_{t−1:t}; x) max_{y_{t−2}} Φ_t(y_{t−2:t}; x) α̂_{t−1}(y_{t−2:t−1}).   (8)
At the end of the recursion, we identify the most probable state at the last
position and apply back-tracking. For details and for time complexities we refer
to [23,35].
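As an illustration, the following Python sketch implements the forward recursion of Equation (5) for dense potential arrays. The array layout, the boundary handling for the first two positions, and the function name are assumptions of this sketch rather than the authors' implementation (which works in Theano on the GPU); the Viterbi variant of Equation (8) is obtained by replacing the sum with a maximum, and a practical version would work in log-space for numerical stability.

import numpy as np

def ho_lc_crf_forward(phi1, phi2, phi3):
    """Sketch of the forward recursion in Eq. (5) for a HO-LC-CRF with unigram,
    bigram and 3-gram potentials: phi1[t, y], phi2[t, y_prev, y] and
    phi3[t, y_pp, y_prev, y] hold the non-negative factor values Phi_t(.; x).
    Returns the normalisation constant Z(x) for a sequence of length T >= 3."""
    T, K = phi1.shape
    # initialise alpha over the first two labels (boundary convention of this sketch)
    alpha = phi1[0][:, None] * phi1[1][None, :] * phi2[1]          # alpha[y0, y1]
    for t in range(2, T):
        # Eq. (5): sum over y_{t-2}, then multiply by the unigram and bigram factors
        inner = np.einsum('pqr,pq->qr', phi3[t], alpha)            # shape (K, K)
        alpha = phi1[t][None, :] * phi2[t] * inner                 # alpha[y_{t-1}, y_t]
    return alpha.sum()                                             # Z(x)

# usage sketch with random potentials for T = 5 positions and K = 4 labels
T, K = 5, 4
rng = np.random.default_rng(0)
Z = ho_lc_crf_forward(rng.random((T, K)), rng.random((T, K, K)), rng.random((T, K, K, K)))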
4 Structured Regularizer
As mentioned in the introduction, NHO-LC-CRFs have high predictive expres-
siveness but are prone to overfitting. To fully exploit the potential of these mod-
els, special regularization techniques must be applied. Therefore, on top of the
NHO-LC-CRF, we add a new structured regularizer. In the following, we derive
this regularizer in a quite general form based on an additive mixture of experts [3].
Our derivation is based on a single training sample (x, y), the generalization to
multiple samples is straightforward.
A mixture of models, i.e. additive mixture of experts, is defined as
log p(y|x) = log Σ_{M ∈ M} p(y, M|x),   (9)
where the first term in the logarithm is the conditional joint probability of y and the combination model M_{S_1,...,S_K}, and the sum is over the conditional joint probabilities of y and the sub-models in M_S = {M_{S_1}, . . . , M_{S_K}}. By the chain-
rule p(y, M |x) = p(y|x, M ) p(M |x) and Jensen’s inequality, we obtain a lower
bound to the log-likelihood as
log Σ_{M ∈ M} p(y|x, M) p(M|x) ≥ Σ_{M ∈ M} p(M|x) log p(y|x, M),   (11)
where Σ_{M ∈ M} p(M|x) = 1. By lower-bounding the log-likelihood we reformulated the additive mixture of experts as a product of experts, i.e. a summation in log-space. By assuming that the model prior satisfies p(M|x) = p(M), i.e. the prior is independent of the sample x, we obtain Σ_{M ∈ M} p(M) log p(y|x, M). To this end, we can rewrite our lower bound as
p(M_{S_1,...,S_K}) log p(y|x, M_{S_1,...,S_K}) + Σ_{M_S ∈ M_S} p(M_S) log p(y|x, M_S).   (12)
where (x^{(j)}, y^{(j)}) is the j-th training sample. The regularizers R_n(y|x) for the corresponding label sub-sequences are defined as
log R_n(y|x) = Σ_{t=n:T} log p_MLP(y_{t−n+1:t} | t, m, x),   (14)
This means the sub-models are MLP networks sharing the MLP features f^{m-n} with the NHO-LC-CRF, the combination model. The trade-off parameter λ balances the importance of the sequence model against the other sub-models.
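A hedged sketch of how the regularizer of Equation (14) and a λ-weighted combination could be computed is given below; the callable interface for the MLP sub-model and the exact weighting (Equation (13) is not reproduced above) are illustrative assumptions, not the authors' training objective.

import numpy as np

def log_regularizer_Rn(mlp_log_probs, y, n):
    """Sketch of Eq. (14): log R_n(y|x) sums, over positions t = n..T, the MLP
    sub-model's log-probability of the observed label n-gram y_{t-n+1:t}.
    mlp_log_probs is assumed to be a callable (t, label_ngram) -> log-probability."""
    T = len(y)
    return sum(mlp_log_probs(t, tuple(y[t - n:t])) for t in range(n, T + 1))

def regularized_objective(crf_log_lik, sub_log_liks, lam):
    """One plausible way (an assumption of this sketch) to let lambda balance the
    sequence model against the averaged sub-model terms."""
    return lam * crf_log_lik + (1.0 - lam) * np.mean(sub_log_liks)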
For testing we drop the regularizer and find the most probable sequence by
utilizing the Viterbi algorithm for NHO-LC-CRFs as described in Section 3.2.
5 Experiments
We evaluated the performance of the proposed models on the TIMIT phoneme
classification task. We compared isolated phone classification (without label
context information) with MLP networks to phone labeling with neural HO-
LC-CRFs. This comparison substantiates the effectiveness of joint-training of
neural higher-order factors. Furthermore, we show performance improvements
using our introduced structured regularizer during joint-training.
Table 2. Linear higher-order CRFs. All results with m = 1 and n = 1 already include
input-independent 2-gram factors.
m=n 1 +2 +3
dev 25.8 20.4 20.7
test 25.9 20.5 21.6
Table 3. Neural higher-order LC-CRF without and with our structured regularizer.
The order m = n of the higher order factors is examined. Furthermore, the effectiveness
of the new structured regularizer is demonstrated for factors up to order m = n = 3.
All results with m = 1 and n = 1 already include input-independent 2-gram factors.
In all experiments, the number of hidden units is H = 150 and one hidden layer.
λ and a margin to the baseline results without the regularizer, i.e. λ = 1.0,
which we indicated by a dotted line in Figure 3. By this additional tuning, we
improved the result further and achieved the best overall performance of 16.7%
with factors up to order m = n = 3 and a trade-off parameter of 0.1.
Fig. 3. Performance using the structured regularizer for various trade-off parameters
λ. The number of hidden units H ∈ {100, 150, 200} in the neural higher-order fac-
tors varies in the different plots. Baseline results without the regularizer λ = 1.0 are
indicated by a dotted line.
Fig. 4. Test performance for (left) various training set sample sizes with and with-
out our structured regularizer and (right) its zoomed presentation to illustrate the
effectiveness for small training set sample sizes.
Table 4. Summary of labeling results. Results marked by († ) are from [31], by (†† )
are from [29], by (††† ) are from [8], by (†††† ) are from [24], and by (*) are from [4].
Performance measure: Phone error rate (PER) in percent.
with 50, 100 and 200 hidden units as well as one and three input segments. We
achieved the best result with 100 hidden units and one segment as input. Fur-
thermore, hierarchical large-margin GMMs achieve a performance of 16.7% and
outperform most other referenced methods but exploit human-knowledge and
committee techniques. However, our best model, the HO-LC-CRF augmented by m-n-gram MLP factors and trained with our new structured regularizer, achieves a performance of 16.7% without human knowledge and outperforms most of the state-of-the-art methods.
6 Conclusion
We considered NHO-LC-CRFs for sequence labeling. While these models have
high predictive expressiveness, they are prone to overfitting due to their high
model complexity. To avoid overfitting, we applied a novel structured regular-
izer derived from the proposed shared-ensemble framework. We showed that this regularizer can be derived as a lower bound of the conditional likelihood of a mixture of models that share parts, e.g. neural sub-networks. We demonstrated the effectiveness of this structured regularizer in phoneme classification experiments. Furthermore, we experimentally confirmed the importance of non-linear representations in the form of neural higher-order factors in LC-CRFs, in contrast to linear ones.
In TIMIT phoneme classification, we achieved state-of-the-art phoneme error
rate of 16.7% using the NHO-LC-CRFs equipped with our proposed structured
regularizer. Future work includes testing of different types of neural networks.
References
1. Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In:
Advances in Neural Information Processing Systems, pp. 3365–3373 (2014)
2. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G.,
Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expres-
sion compiler. In: Proceedings of the Python for Scientific Computing Conference
(SciPy) (2010). Oral Presentation
3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
4. Chang, H.A., Glass, J.R.: Hierarchical large-margin Gaussian mixture models
for phonetic classification. In: Workshop on Automatic Speech Recognition and
Understanding (ASRU), pp. 272–277 (2007)
5. Do, T.M.T., Artières, T.: Neural conditional random fields. In: International
Conference on Artificial Intelligence and Statistics (AISTATS), pp. 177–184 (2010)
6. Gens, R., Domingos, P.: Learning the structure of sum-product networks. In: Inter-
national Conference on Machine Learning (ICML), vol. 28, pp. 873–880 (2013)
7. Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C.: Hidden conditional random
fields for phone classification. In: Interspeech, pp. 1117–1120 (2005)
8. Halberstadt, A.K., Glass, J.R.: Heterogeneous acoustic measurements for phonetic
classification. In: EUROSPEECH, pp. 401–404 (1997)
9. He, X., Zemel, R.S., Carreira-Perpiñán, M.A.: Multiscale conditional random fields
for image labeling. In: Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 695–703 (2004)
10. Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for
acoustic modeling in speech recognition. Signal Processing Magazine, pp. 82–97
(2012)
11. Hinton, G.E.: Products of experts. In: International Conference on Artificial Neural
Networks (ICANN), pp. 1–6 (1999)
12. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Improving neural networks by preventing co-adaptation of feature detectors. CoRR
abs/1207.0580 (2012)
13. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic
models for segmenting and labeling sequence data. In: International Conference on
Machine Learning (ICML), pp. 282–289 (2001)
14. Lafferty, J., Zhu, X., Liu, Y.: Kernel conditional random fields: representation and
clique selection. In: International Conference on Machine Learning (ICML), p. 64
(2004)
15. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann
machines. In: International Conference on Machine Learning (ICML), pp. 536–543
(2008)
16. Lavergne, T., Allauzen, A., Crego, J.M., Yvon, F.: From n-gram-based to crf-based
translation models. In: Workshop on Statistical Machine Translation, pp. 542–553
(2011)
17. Li, Y., Tarlow, D., Zemel, R.S.: Exploring compositional high order pattern poten-
tials for structured output learning. In: Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 49–56 (2013)
18. Maaten, L., Chen, M., Tyree, S., Weinberger, K.Q.: Learning with marginalized corrupted features. In: International Conference on Machine Learning (ICML), pp. 410–418 (2013)
19. van der Maaten, L., Welling, M., Saul, L.K.: Hidden-unit conditional random fields.
In: International Conference on Artificial Intelligence and Statistics (AISTATS),
pp. 479–488 (2011)
20. Peng, J., Bo, L., Xu, J.: Conditional neural fields. In: Neural Information Process-
ing Systems (NIPS), pp. 1419–1427 (2009)
21. Poon, H., Domingos, P.: Sum-product networks: a new deep architecture. In: Uncer-
tainty in Artificial Intelligence (UAI), pp. 337–346 (2011)
22. Prabhavalkar, R., Fosler-Lussier, E.: Backpropagation training for multilayer con-
ditional random field based phone recognition. In: International Conference on
Acoustics Speech and Signal Processing (ICASSP), pp. 5534–5537 (2010)
23. Qian, X., Jiang, X., Zhang, Q., Huang, X., Wu, L.: Sparse higher order condi-
tional random fields for improved sequence labeling. In: International Conference
on Machine Learning (ICML), pp. 849–856 (2009)
24. Ratajczak, M., Tschiatschek, S., Pernkopf, F.: Sum-product networks for struc-
tured prediction: context-specific deep conditional random fields. In: Interna-
tional Conference on Machine Learning (ICML): Workshop on Learning Tractable
Probabilistic Models Workshop (2014)
25. Ratajczak, M., Tschiatschek, S., Pernkopf, F.: Neural higher-order factors in con-
ditional random fields for phoneme classification. In: Interspeech (2015)
26. Roth, S., Black, M.J.: Fields of experts: a framework for learning image priors. In:
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 860–867
(2005)
27. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Parallel distributed processing:
Explorations in the microstructure of cognition, vol. 1. chap. Learning Internal
Representations by Error Propagation, pp. 318–362 (1986)
Versatile Decision Trees for Learning Over Multiple Contexts
1 Introduction
Supervised machine learning is typically concerned with learning a model using
training data and applying this model to new test data. An implicit assumption
made for successfully deploying a model is that both training and test data
follow the same distribution. However, the distribution of the attributes can
change, especially when the training data is gathered in one context, but the
model is deployed in a different context (e.g., the training data is collected in
one country but the predictions are required for another country). The presence
of such dataset shifts can harm the performance of a learned model. Different
kinds of dataset shift have been investigated in the literature [10]. In this work we
focus on shifts in continuous attributes caused by hidden transformations from one context to another. For instance, a diagnostic test may have different resolutions when produced by different laboratories, or the average temperature may change from city to city. In such cases, the distribution of one or more of the covariates in X changes. This problem is referred to as a covariate observation shift [7].
We address this problem in two steps. In the first step, we build Decision Trees
(DTs) using percentiles for each attribute to deal with covariate observation
shifts. In this proposal, if a certain percentage of the training data reaches a child node after applying a decision test, the decision thresholds in deployment are redefined in order to preserve the same percentage of deployment instances in that node. For example, if the learned threshold in a decision node corresponds to the 60th percentile of the training data, the updated threshold in deployment will be the 60th percentile of the deployment data.
The percentile approach assumes that the shift is caused by a monotonic
function preserving the ordering of attribute values but ignoring the scale. For
some shifts it may be more appropriate to assume a transformation from one
linear scale to another. We therefore develop a more general and versatile DT
that can choose between different strategies (percentiles, linear shifts or no shift)
to update the DT thresholds for each deployment context, according to the shifts
observed in the data.
The rest of the paper is organised as follows. Section 2 presents the dataset
shift problem and the existing approaches addressing it. In Section 3 we introduce
the use of percentiles and the versatile model proposed in our work. Section 4
presents the experiments performed to evaluate the versatile model on both
synthetic and non-synthetic dataset shifts, and Section 5 concludes the paper.
2 Dataset Shift
We start by making a distinction between the training and deployment contexts.
In the training context, a set of labelled instances is available for learning a
model. The deployment context is where the learned model is actually used
for predictions. These contexts are often different in some non-trivial way. For
instance, a model may be built using training data collected in a certain period of
time and in a particular country, and deployed to data in a future time and/or in
a different country. A model built in a training context may fail in a deployment
context due to different reasons: in the current paper we focus on performance
degradation caused by dataset shifts across contexts.
A simple solution to deal with shifts would be to train a new classifier for
each new deployment context. However, if there are not enough available labelled
instances in the new context, training a new model specific for that context
is then unfeasible as the model would not sufficiently generalise. Alternative
solutions have to be applied to reuse or adapt existing models, which will depend
on the kind of shift observed.
Shifts can occur in the input attributes, in the output or both. Dataset shift
happens when training and deployment joint distributions are different [10], i.e.:
P_tr(X, Y) ≠ P_dep(X, Y)   (1)
A shift can occur in the output, i.e., P_tr(Y) ≠ P_dep(Y), while the conditional probability distribution remains the same, P_tr(X|Y) = P_dep(X|Y). This is referred to in the literature as the prior probability shift and can be addressed in different ways (e.g., [5]).
In our work we are mainly concerned with situations where the marginal dis-
tribution of a covariate changes across contexts, i.e.: P_tr(X) ≠ P_dep(X). Given
a change in the marginal distribution, we can further define two different kinds
of shifts depending on whether the conditional distribution of the target also
changes between training and deployment. In the first case, the marginal distri-
bution of X changes, while the conditional probability of the target Y given X
remains the same:
P_tr(X) ≠ P_dep(X),  P_tr(Y|X) = P_dep(Y|X)   (2)
For instance, the smoking habits of a population may change over time due to
public initiatives but the probability of lung cancer given smoking is expected
to remain the same [12]. In the same problem, a labelled training set may be
collected biased to patients with bad smoking habits. Again, the marginal dis-
tribution in deployment may be different from training while the conditional
probability is the same. The above shift is referred to in the literature by different
terms such as simple covariate shift [12] or sample selection bias [15]. A com-
mon solution to deal with simple covariate shifts is to modify the training data
distribution by considering the deployment data distribution. A new model can
then be learned using the shift-corrected training data distribution. This strat-
egy is adopted by different authors using importance sampling which corrects
the training distribution using instance weights proportional to Pdep (X)/Ptr (X).
Examples of such solutions include Integrated Optimisation Problem IOP [3],
Kernel Mean Matching [6], Importance Weighted Cross Validation IWCV [14]
and others.
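The following sketch illustrates the general importance-weighting idea described above; it is not IOP, KMM, or IWCV themselves, and the density-ratio estimate via a simple kernel density estimator as well as the choice of a decision tree as base learner are simplifying assumptions of this sketch.

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.tree import DecisionTreeClassifier

def importance_weighted_tree(X_tr, y_tr, X_dep):
    """Weight each training instance by an estimate of P_dep(x)/P_tr(x) and retrain."""
    kde_tr = gaussian_kde(X_tr.T)                    # density estimate on training inputs
    kde_dep = gaussian_kde(X_dep.T)                  # density estimate on deployment inputs
    weights = kde_dep(X_tr.T) / np.maximum(kde_tr(X_tr.T), 1e-12)
    return DecisionTreeClassifier().fit(X_tr, y_tr, sample_weight=weights)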
In this paper we focus on the second kind of shift in which both the marginal
and the conditional distributions can change across contexts, i.e.:
P_tr(X) ≠ P_dep(X),  P_tr(Y|X) ≠ P_dep(Y|X)   (3)
This is a more difficult situation that can be hard to solve and requires additional assumptions. A suitable assumption in many situations is that there is a hidden transformation of the covariates Φ(X) for which the conditional probability is unchanged across contexts, i.e.:
P_tr(Y|X = x) = P_dep(Y|X = Φ(x))   (4)
In this work, we propose different strategies to build Decision Trees (DTs) in the
presence of covariate observation shifts. We make two main contributions. First,
we propose a novel approach to build DTs based on percentiles (see Sections 3.1
Fig. 1. Two types of models; on the left is the model using a fixed threshold while on
the right is the model using percentiles. For each deployment context, the decision tree
is deployed in such a way that the deployment instances are split to the leaves in the
same percentile amounts of 63% and 37%.
and 3.2). The basic idea is to learn a conventional DT and then to replace the
internal decision thresholds by percentiles, which can deal with monotonic shifts.
Secondly, we propose a more general Versatile Model (VM) that deploys differ-
ent strategies (including the percentiles) to update the DT thresholds for each
deployment context, according to the shifts observed in the data (see Section 3.3).
The shifts are identified by applying a non-parametric statistical test.
We consider an example using the diabetes dataset from the UCI repository,
which has 8 input attributes and 768 instances. Suppose we train a decision
stump and the discriminative attribute is the Plasma glucose concentration
attribute, which is a numerical attribute. Suppose the decision threshold is 127.5,
meaning that any patient with plasma concentration above 127.5 will be clas-
sified as diabetic, otherwise classified as non-diabetic as seen in Figure 1 (left).
If there is no shift in the attribute from training to deployment, the decision
node can be directly applied, i.e., the threshold 127.5 is maintained to split data
in deployment. However, if the attribute is shifted in deployment, the original
threshold may not work well.
In the current work, we propose to adopt the percentiles1 of continuous
attributes to update the decision thresholds for each deployment context. Back
to the example, instead of interpreting the data split in an absolute sense, we
will interpret it in terms of ranks: 37% of the training examples with the highest
values of Plasma reach the right child, while 63% of the training examples with
the lowest values of Plasma in turn reach the left child (see the right side of
Figure 1). We can say that the data was split at the 63rd percentile in training.
Given a batch set of instances in deployment to classify, the DT can apply the
same split rule: the 37% of the examples in deployment with the highest val-
ues of Plasma are associated to the right child, while 63% of the examples with
the lowest values of Plasma in deployment are associated to the left child. The
decision threshold in deployment is updated in such a way that the percentage
¹ The percentile is the value below which a given percentage of observations in a group is observed.
Let R_tr(th_tr) = 100 · n_left/n be the percentage of training instances in the left node, where n is the total number of training instances. Then, th_tr is the percentile associated to R_tr(th_tr) for the attribute X. In the above example: R_tr(127.5) = 63% and th_tr is the 63rd percentile of Plasma in the training data. Then the threshold adopted in deployment is defined as:
th_dep = R_dep^{-1}(R_tr(th_tr))   (6)
In the above equation, the threshold thdep is the attribute value in deployment
that, once adopted to split the deployment data, maintains the percentage of
instances in each child node: Rdep (thdep ) = Rtr (thtr ).
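A minimal sketch of this percentile-based threshold update, with illustrative function and variable names, is:

import numpy as np

def percentile_threshold(x_train, x_deploy, th_train):
    """Sketch of Eq. (6): preserve, in deployment, the percentage of instances
    that fell below the learned threshold in training."""
    r_tr = 100.0 * np.mean(x_train <= th_train)   # R_tr(th_tr)
    return np.percentile(x_deploy, r_tr)          # th_dep = R_dep^{-1}(R_tr(th_tr))

# For the Plasma example, r_tr would be 63 and th_dep the 63rd percentile of the
# Plasma values observed in the deployment batch.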
Fig. 2. Example of DT with percentiles when a shift is identified in the class distri-
bution. Part (a) illustrates the percentiles of each leaf for the training context, with
prior probability equal to 0.5. Part (b) illustrates the correction of the percentiles for
a new deployment context where the prior probability is 0.6. The correction is performed using the ratios 0.6/0.5 and 0.4/0.5 for the positive and negative instances, respectively (left side of (b)). The corrected number of instances expected at each leaf results in new estimated percentiles (right side of (b)).
The percentile rule can be adapted to additionally deal with shifts in the class
distribution across contexts. Figure 2 illustrates a situation where the prior prob-
ability of the positive class was 0.5 in training and then shifted to 0.6 in deploy-
ment. In Figure 2(a) we observe a certain number of positives and negatives
internally in each child node, which is used to derive the percentiles. If a shift is
expected in the target, the percentage of instances expected in deployment for
each child node may change as well. For instance, a higher percentage of instances
may be observed in the right node in deployment because the probability of pos-
itives has increased and the proportion of positives related to negatives in this
node is high. In our work, we estimate the percentage of instances in each child
node according to the class ratios between training and deployment.
Let P_tr^{c_l} and P_dep^{c_l} be the probability of class c_l in training and deployment, respectively. P_dep^{c_l} can be estimated using available labelled data in deployment or simply provided as input. There is a prior shift related to this class label when P_tr^{c_l} ≠ P_dep^{c_l}. For each instance belonging to c_l observed in training we expect to observe P_dep^{c_l}/P_tr^{c_l} instances of c_l in deployment. The number of instances associated to the left child node in deployment is then estimated by the following equation:
n̂_left = Σ_{c_l ∈ L} n_left^{c_l} · P_dep^{c_l}/P_tr^{c_l}   (7)
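The following sketch illustrates this correction; the dictionary-based interface is an illustrative assumption of the sketch.

def corrected_left_percentile(n_left, n_right, p_tr, p_dep):
    """Sketch of Eq. (7): re-estimate the percentage of deployment instances expected in
    the left child when the class prior shifts. All arguments are dicts mapping a class
    label to its training count in the left/right child and its prior in training/deployment."""
    n_hat_left = sum(n_left[c] * p_dep[c] / p_tr[c] for c in p_tr)
    n_hat_right = sum(n_right[c] * p_dep[c] / p_tr[c] for c in p_tr)
    return 100.0 * n_hat_left / (n_hat_left + n_hat_right)

# Figure 2 example: the prior of the positive class shifts from 0.5 to 0.6, so positive
# counts are scaled by 0.6/0.5 and negative counts by 0.4/0.5 before recomputing the
# percentile of each child node.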
In this section, we propose a versatile decision tree model that employs differ-
ent strategies to choose the decision threshold according to the shift observed in
deployment. Algorithm 1 presents the proposed versatile model, which receives as
input the original threshold applied on an attribute, the training and the deploy-
ment data of that attribute and returns the appropriate threshold to adopt in
deployment. This versatile model (VM) can be described in three steps:
1. First, a Kolmogorov-Smirnov (KS) test is applied to check whether the distribution of the attribute has shifted from training to deployment. If no shift is detected, the original threshold th_tr is kept.
2. If a shift is detected, we estimate a linear transformation (α, β) from the change in mean and standard deviation of the attribute in training and deployment (see Algorithm 2). We then apply the KS test again to compare
the distribution of the transformed attribute in deployment and the attribute
in the training data. If no shift is observed now, we assume that the linear
transformation applied was adequate. The versatile DT is then deployed with
a threshold thdep = α · thtr + β.
3. Finally, if the second test indicates that there is still a shift in the attribute
(i.e., the shift was not corrected using the linear transformation), then the
percentiles are deployed, assuming a non-linear monotonic shift. In this case the adopted threshold is th_dep = R_dep^{-1}(R_tr(th_tr)).
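Since Algorithms 1 and 2 are not reproduced above, the following sketch only approximates the three-step strategy; the two-sample KS test at a 0.05 level and moment matching for the linear transformation are illustrative choices.

import numpy as np
from scipy.stats import ks_2samp

def versatile_threshold(x_train, x_deploy, th_train, alpha_level=0.05):
    """Sketch of the VM threshold update: keep the original threshold if no shift is
    detected, otherwise try a linear (alpha, beta) correction, and fall back to the
    percentile rule if the shift persists."""
    if ks_2samp(x_train, x_deploy).pvalue >= alpha_level:
        return th_train                                  # 1. no shift detected
    a = np.std(x_deploy) / np.std(x_train)               # 2. linear transformation
    b = np.mean(x_deploy) - a * np.mean(x_train)
    if ks_2samp(a * x_train + b, x_deploy).pvalue >= alpha_level:
        return a * th_train + b                          #    th_dep = alpha*th_tr + beta
    r_tr = 100.0 * np.mean(x_train <= th_train)          # 3. percentile rule
    return np.percentile(x_deploy, r_tr)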
4 Experimental Results
The VM combines 3 strategies for defining the DT thresholds in deployment:
original thresholds, linear transformations, and monotonic transformations using
percentiles. In the experiments, each strategy was individually compared to the
VM, respectively named as Original Model (OM), (α, β) and Perc. Additionally,
(α, β) and the Percentile methods were combined with the KS test, referred in
the experiments as KS+(α, β) and KS+Perc, respectively. In the former, linear
transformation is applied to all shifted attributes, whereas, in the latter, Per-
centiles are utilised. In both approaches, the original model was applied if there
is no shift detected by the KS test.
The first set of experiments applies synthetic shifts to UCI datasets to analyse
the performance of the shift detection approach adopted by the VM. We inject
two types of shifts into the deployment data to test the VM: a non-linear mono-
tonic transformation and linear shifts with different degrees (see Sections 4.1
Table 1. Values used in the experiments for ϕ and γ in order to generate the synthetic
linear shifts.
ϕ γ Effect
0 0 unshifted data (original)
0 1 μdep shifted to right
0 -1 μdep shifted to left
1 0 stretch data
-1 0 compress data
1 1 μdep shifted to right and stretch the data
1 -1 μdep shifted to left and stretch the data
-1 1 μdep shifted to right and compress the data
-1 -1 μdep shifted to left and compress the data
and 4.2). In Section 4.3 we report on experiments with actual context changes
occurring in real-world datasets.
μ_dep = α · μ_tr + β
σ_dep = α · σ_tr
α = 2^ϕ
β = (1 − 2^ϕ) · μ_tr + γ · σ_tr
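A small sketch of how such a synthetic linear shift can be generated (the function name is illustrative) is:

import numpy as np

def synthetic_linear_shift(x, phi, gamma):
    """Apply x -> alpha*x + beta with the values above, which yields
    mu_dep = alpha*mu_tr + beta and sigma_dep = alpha*sigma_tr."""
    mu, sigma = x.mean(), x.std()
    alpha = 2.0 ** phi
    beta = (1.0 - 2.0 ** phi) * mu + gamma * sigma
    return alpha * x + beta

# e.g., (phi, gamma) = (0, 1) shifts the deployment mean one standard deviation to
# the right, while (1, 0) stretches the data, matching Table 1.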
Unshifted Data. Unsurprisingly, the original model is the best when there is no
shift from training to deployment, but the CD diagram demonstrates that the
Versatile Model is not significantly worse. Percentiles don’t work well in this
case, confirming the need for a multi-strategy approach.
Non-Linear Shift. Here the Versatile Model outperforms all other methods in
terms of the average ranks, but not significantly. Notice that, while the original
model performs worst, there are 2 datasets where the original model performs
best: in these datasets many attribute values are in the range [−1, 1] where the
cubic transformation has less effect.
Mixture Shift. The aim of this experiment was to test how well the Versatile
Model deals with a mixture of different shifts: one-third of the attributes was
shifted linearly, one-third non-linearly, and one-third remained unchanged. The
results demonstrate that the Versatile Model derives a clear advantage from the
ability of being able to distinguish between these different kinds of shift and
adapt its strategy.
Fig. 3. Critical Difference diagrams using pairwise comparisons for those experiments
where the Friedman test gives significance at 0.05.
Heart. Our next benchmark is the heart disease dataset. We split it into two
subsets according to gender: male and female. In this dataset there are 5 con-
tinuous attributes, 3 of them are indicated as shifted between gender according
to KS test, which are age, heart rate and serum cholesterol. Table 4 shows the
performance of versatile method against the original model, IOP and KMM. In
both contexts the VM has the best accuracy among all three methods including
the original model.
Bike Sharing. This dataset [8] contains the hourly and daily count of rental bikes
between years 2011 and 2012 in addition to weather information. It contains 4
continuous attributes: actual and apparent temperature in Celsius, humidity and
wind speed. The classification task is whether there is a demand in this period of
time or not. In order to evaluate the shift effects, we split the dataset as proposed
in [1] to obtain the 4 seasons datasets. According to KS, all these 4 attributes are
detected as shifted except in 3 cases. First, between Summer-Spring, wind speed
is not shifted. Second, in both Summer-Autumn and Autumn-Winter, humidity
is not shifted. The performance of Versatile Model and others are shown in
Table 5. Again we note the solid performance of the Versatile Model.
196 R. Al-Otaibi et al.
Table 2. Cross-validated classification accuracy for both unshifted, linear shift, non-
linear shift and mixture shift. The numbers between brackets are ranks. VM is the
Versatile Model, OM is the original model, α, β corresponds to a linear shift, Perc
corresponds to a percentile shift, and KS+. . . indicates that the Kolmogorov-Smirnov
test is used for detecting the shift.
AutoMPG. AutoMPG dataset [8] concerns the consumption in miles per gal-
lon of vehicle from 3 different regions: USA, Europe and Japan. It contains
4 numerical attributes: displacement, horsepower, weight and acceleration. All
these input attributes have been detected as shifted between regions using KS
test. This dataset has been binarised according to the mean value of the target.
Versatile Decision Trees for Learning Over Multiple Contexts 197
Table 3. Classification accuracy for Diabetes dataset. Symbols denote ethnic groups as
follows: African-American (AA), Asian (A), Caucasian (C), Hispanic (H). X-Y denotes
trained on X, deployed on Y.
A-AA A-C A-H AA-A AA-C AA-H C-A C-AA C-H H-A H-AA H-C
# shifted 6 5 4 6 5 4 5 5 4 4 4 4
VM 0.569 0.529 0.576 0.653 0.530 0.590 0.645 0.546 0.588 0.624 0.565 0.564
OM 0.574 0.538 0.554 0.642 0.526 0.587 0.641 0.566 0.595 0.642 0.562 0.563
IOP 0.526 0.499 0.547 0.500 0.494 0.463 0.520 0.488 0.469 0.519 0.509 0.452
KMM 0.467 0.499 0.419 0.352 0.530 0.474 0.647 0.557 0.580 0.400 0.442 0.507
Table 4. Classification accuracy for Heart dataset, with contexts by gender (F: Female,
M: Male).
M-F F-M
# shifted 3 3
VM 0.735 0.568
OM 0.712 0.557
IOP 0.703 0.500
KMM 0.724 0.540
Table 5. Classification accuracy for Bike Sharing dataset, with contexts by season
(Sp: Spring, S: Summer, A: Autumn, W: Winter).
Sp-S Sp-A Sp-W S-Sp S-A S-W A-Sp A-S A-W W-Sp W-S W-A
# shifted 3 4 4 3 3 4 4 3 3 4 4 3
VM 0.641 0.558 0.601 0.519 0.579 0.601 0.602 0.543 0.556 0.646 0.565 0.526
OM 0.538 0.468 0.544 0.607 0.547 0.612 0.574 0.521 0.528 0.718 0.657 0.558
IOP 0.489 0.468 0.533 0.635 0.510 0.657 0.585 0.534 0.522 0.658 0.630 0.510
KMM 0.559 0.468 0.522 0.635 0.521 0.651 0.585 0.521 0.589 0.690 0.521 0.521
Table 6. Classification accuracy for AutoMPG dataset, with contexts by origin (U:
USA, E: Europe, J:Japan).
U-E U-J E-U E-J J-U J-E
# shifted 4 4 4 3 4 3
VM 0.676 0.759 0.873 0.772 0.780 0.647
OM 0.544 0.607 0.670 0.746 0.747 0.691
IOP 0.558 0.493 0.400 0.417 0.600 0.441
KMM 0.558 0.582 0.600 0.582 0.400 0.485
We split the dataset as proposed in [1] to obtain the 3 regions datasets. The
performance of Versatile Model and others are shown in Table 6. The VM out-
performs all three methods and has only one loss against the original model.
Finally, we report the result of a Friedman test and post-hoc analysis on all
non-synthetic shifts. Figure 4 demonstrates that the Versatile Model outperforms
all others, significantly so except for the original model.
198 R. Al-Otaibi et al.
Fig. 4. Critical Difference diagram using pairwise comparisons for non synthetic shift.
Average ranks as follows: VM=1.671, OM=2.140, KMM=2.875 and IOP= 3.312. The
Friedman test gives significance at 0.05.
5 Conclusion
References
1. Ahmed, C.F., Lachiche, N., Charnay, C., Braud, A.: Reframing continuous input
attributes. In: 2014 IEEE 26th International Conference on Tools with Artificial
Intelligence (ICTAI), pp. 31–38, November 2014
2. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., Garcı́a, S.: Keel data-mining
software tool: Data set repository, integration of algorithms and experimental
analysis framework. Multiple-Valued Logic and Soft Computing 17(2–3), 255–287
(2011)
3. Bickel, S., Brückner, M., Scheffer, T.: Discriminative learning under covariate shift.
Journal of Machine Learning Research 10, 2137–2155 (2009)
4. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal
of Machine Learning Research 7, 1–30 (2006)
5. Elkan, C.: The foundations of cost-sensitive learning. In: Proceedings of the Seven-
teenth International Joint Conference on Artificial Intelligence, pp. 973–978 (2001)
6. Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., Schölkopf, B.:
Covariate shift by kernel mean matching. In: Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.) Dataset Shift in Machine Learning,
pp. 131–160. MIT Press (2009)
7. Kull, M., Flach, P.: Patterns of dataset shift. In: First International Workshop on
Learning over Multiple Contexts (LMCE) at ECML-PKDD 2014, Nancy, France,
September 2014
8. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
9. Moreno-Torres, J.G., Llorı́, X., Goldberg, D.E., Bhargava, R.: Repairing fractures
between data using genetic programming-based feature extraction: A case study in
cancer diagnosis. Inf. Sci. 222, 805–823 (2013)
10. Moreno-Torres, J.G., Raeder, T., Alaiz-Rodrı́guez, R., Chawla, N.V., Herrera, F.:
A unifying view on dataset shift in classification. Pattern Recognition 45(1), 521–530
(2012)
11. Moreno-Torres, J.G., Raeder, T., Aláiz-Rodrı́guez, R., Chawla, N.V., Herrera, F.:
Tackling dataset shift in classification: Benchmarks and methods. http://sci2s.ugr.
es/dataset-shift (accessed: March 30, 2015)
12. Storkey, A.J.: When training and test sets are different: characterising learning
transfer. In: Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.)
Dataset Shift in Machine Learning, chap. 1, pp. 3–28. MIT Press (2009)
13. Strack, B., DeShazo, J.P., Gennings, C., Olmo, J.L., Ventura, S., Cios, K.J.,
Clore, J.N.: Impact of hba1c measurement on hospital readmission rates: analysis
of 70,000 clinical database patient records. BioMed Research International 2014,
781670 (2014). http://europepmc.org/articles/PMC3996476
14. Sugiyama, M., Krauledat, M., Müller, K.R.: Covariate shift adaptation by importance
weighted cross validation. Journal of Machine Learning Research 8, 985–1005 (2007)
15. Zadrozny, B.: Learning and evaluating classifiers under sample selection bias. In:
International Conference on Machine Learning ICML 2004, pp. 903–910 (2004)
When is Undersampling Effective in Unbalanced
Classification Tasks?
1 Introduction
In several binary classification problems, the two classes are not equally repre-
sented in the dataset. For example, in fraud detection, fraudulent transactions
are normally outnumbered by genuine ones [5]. When one class is underrep-
resented in a dataset, the data is said to be unbalanced. In such problems,
typically, the minority class is the class of interest. Having few instances of one
class means that the learning algorithm is often unable to generalize the behav-
ior of the minority class well, hence the algorithm performs poorly in terms of
predictive accuracy [14].
When the data is unbalanced, standard machine learning algorithms that
maximise overall accuracy tend to classify all observations as majority class
instances. This translates into poor accuracy on the minority class (low recall),
which is typically the class of interest. Degradation of classification performance
is not only related to a small number of examples in the minority class in com-
parison to the number of examples in the majority classes (expressed by the
class imbalance ratio), but also to the minority class decomposition into small
sub-parts [19] (also known in the literature as small disjuncts [15]) and to the
overlap between the two classes [16] [3] [11] [10]. In these studies it emerges that
performance degradation is strongly caused by the presence of both unbalanced
class distributions and a high degree of class overlap. Additionally, in unbalanced
classification tasks, the performance of a classifier is also affected by the presence
of noisy examples [20] [2].
One possible way to deal with this issue is to adjust the algorithms them-
selves [14] [23] [7]. Here we will consider instead a data-level strategy known as
undersampling [13]. Undersampling consists in down-sizing the majority class by
removing observations at random until the dataset is balanced. In an unbalanced
problem, it is often realistic to assume that many observations of the majority
class are redundant and that by removing some of them at random the data
distribution will not change significantly. However the risk of removing relevant
observations from the dataset is still present, since the removal is performed
in an unsupervised manner. In practice, sampling methods are often used to
balance datasets with skewed class distributions because several classifiers have
empirically shown better performance when trained on balanced dataset [22] [9].
However, these studies do not imply that classifiers cannot learn from unbal-
anced datasets. For instance, other studies have also shown that some classifiers
do not improve their performances when the training dataset is balanced using
sampling techniques [4] [14]. As a result, the only way to know if sampling
helps the learning process is to run some simulations. Despite the popularity of
undersampling, we have to remark that there is not yet a theoretical framework
explaining how it can affect the accuracy of the learning process.
In this paper we aim to analyse the role of the two side-effects of undersam-
pling on the final accuracy. The first side-effect is that, by removing majority
class instances, we perturb the a priori probability of the training set and we
induce a warping in the posterior distribution [8,18]. The second is that the
number of samples available for training is reduced with an evident consequence
in terms of accuracy of the resulting classifier. We study the interaction between
these two effects of undersampling and we analyse their impact on the final rank-
ing of posterior probabilities. In particular, we show under which conditions an
undersampling strategy is recommended and expected to be effective in terms
of final classification accuracy.
Fig. 1. Undersampling: remove majority class observations until we have the same
number of instances in the two classes.
In what follows we use the label negative (resp. positive) to denote the label 0 (resp. 1). Suppose that the
training set (X , Y) of size N is unbalanced (i.e. the number N^+ of positive cases
is small compared to the number N^− of negative ones) and that rebalancing is
performed by undersampling. Let (X, Y ) ⊂ (X , Y) be the balanced sample of
(X , Y) which contains a subset of the negatives in (X , Y).
Let us introduce a random binary selection variable s associated to each
sample in (X , Y), which takes the value 1 if the point is in (X, Y ) and 0 otherwise.
We now derive how the posterior probability of a model learned on a balanced
subset relates to the one learned on the original unbalanced dataset, on the basis
of the reasoning presented in [17]. Let us assume that the selection variable s is
independent of the input x given the class y (class-dependent selection):

$$p(s = 1 \mid y, x) = p(s = 1 \mid y) \qquad (1)$$

where p(s = 1|y, x) is the probability that a point (x, y) is included in the bal-
anced training sample. Note that the undersampling mechanism has no impact
on the class-conditional distribution but that it perturbs the prior probability
(i.e. p(y|s = 1) ≠ p(y)).
Let the sign + denote y = 1 and − denote y = 0, e.g. p(+, x) = p(y = 1, x)
and p(−, x) = p(y = 0, x). From Bayes' rule we can write:

$$p(+ \mid x, s = 1) = \frac{p(s = 1 \mid +, x)\, p(+ \mid x)}{p(s = 1 \mid +, x)\, p(+ \mid x) + p(s = 1 \mid -, x)\, p(- \mid x)} \qquad (2)$$
Using condition (1) in (2) we obtain:

$$p(+ \mid x, s = 1) = \frac{p(s = 1 \mid +)\, p(+ \mid x)}{p(s = 1 \mid +)\, p(+ \mid x) + p(s = 1 \mid -)\, p(- \mid x)} \qquad (3)$$
Since undersampling keeps all the positive instances and removes only negatives, it corresponds to setting

$$p(s = 1 \mid +) = 1 \qquad (4)$$

and we obtain

$$\frac{p(+)}{p(-)} \le p(s = 1 \mid -) < 1 \qquad (5)$$
Note that if we set p(s = 1|−) = p(+)/p(−), we obtain a balanced dataset where
the number of positive and negative instances is the same. At the same time, if
we set p(s = 1|−) = 1, no negative instances are removed and no undersampling
takes place. Using (4), we can rewrite (3) as

$$p_s = p(+ \mid x, s = 1) = \frac{p(+ \mid x)}{p(+ \mid x) + p(s = 1 \mid -)\, p(- \mid x)} = \frac{p}{p + \beta(1 - p)} \qquad (6)$$
where β = p(s = 1|−) is the probability of selecting a negative instance with
undersampling, p = p(+|x) is the true posterior probability of class + in the
original dataset, and p_s = p(+|x, s = 1) is the true posterior probability of class
+ after sampling. Equation (6) quantifies the amount of warping of the posterior
probability due to undersampling. From it, we can derive p as a function of p_s:

$$p = \frac{\beta p_s}{\beta p_s - p_s + 1} \qquad (7)$$
The relation between p and p_s (parametric in β) is illustrated in Figure 2.
The top curve of Figure 2 refers to complete balancing, which corresponds to
β = p(+)/p(−) ≈ N^+/N^−, assuming that N^+/N^− provides an accurate estimate of the ratio
of the class priors. Undersampling maps two close and low values of p into two values p_s with a larger
distance. The opposite occurs for high values of p. In Section 3 we will show how
this has an impact on the ranking returned by estimations of p and p_s.
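The warping in Equation (6) and its inverse (7) are easy to verify numerically; the sketch below is a direct transcription of the two formulas (the function names are ours).

```python
import numpy as np

def warp_posterior(p, beta):
    """Equation (6): posterior after undersampling, p_s = p / (p + beta*(1 - p))."""
    p = np.asarray(p, dtype=float)
    return p / (p + beta * (1.0 - p))

def unwarp_posterior(p_s, beta):
    """Equation (7): recover p from p_s, p = beta*p_s / (beta*p_s - p_s + 1)."""
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Sanity check: warping followed by unwarping is the identity.
p = np.linspace(0.0, 1.0, 11)
beta = 0.1                      # e.g. p(+)/p(-) for a 10:1 imbalance
assert np.allclose(unwarp_posterior(warp_posterior(p, beta), beta), p)
```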
The previous section discussed the first consequence of undersampling, i.e. the
transformation of the original conditional distribution p into a warped condi-
tional distribution p_s according to Equation (6). The second consequence of
undersampling is the reduction of the training set size which inevitably leads
to an increase of the variance of the classifier. This section discusses how these
two effects interact and their impact on the final accuracy of the classifier, by
focusing in particular on the accuracy of the ranking of the minority class (typ-
ically the class of interest).
Undersampling transforms the original classification task (i.e. estimating the
conditional distribution p) into a new classification task (i.e. estimating the con-
ditional distribution p_s). In what follows we aim to assess whether and when
undersampling has a beneficial effect by changing the target of the estimation
problem.
Let us denote by p̂ (resp. p̂_s) the estimation of the conditional probability p
(resp. p_s). Assume we have two distinct test points having probabilities p_1 < p_2,
where Δp = p_2 − p_1 with Δp > 0. A correct classification aiming to rank the most
probable positive samples should rank p_2 before p_1, since the second test sample
has a higher probability of belonging to the positive class. Unfortunately the
values p_1 and p_2 are not known and the ranking has to rely on the estimated
values p̂_1 and p̂_2. For the sake of simplicity we will assume here that the estimator
of the conditional probability has the same bias and variance in the two test
points. This implies p̂_1 = p_1 + ε_1 and p̂_2 = p_2 + ε_2, where ε_1 and ε_2 are two
realizations of the random variable ε ∼ N(b, ν), where b and ν are the bias and
the variance of the estimator of p. Note that the estimation errors ε_1 and ε_2 may
induce a wrong ranking if p̂_1 > p̂_2.
What happens if instead of estimating p we decide to estimate p_s, as in
undersampling? Note that because of the monotone transformation (6), p_1 <
p_2 ⇒ p_{s,1} < p_{s,2}. Is the ranking based on the estimations of p_{s,1} and p_{s,2} more
accurate than the one based on the estimations of p_1 and p_2?
In order to answer this question let us suppose that the estimator of
p_s is also biased, but that its variance is larger given the smaller number of samples.
Then p̂_{s,1} = p_{s,1} + η_1 and p̂_{s,2} = p_{s,2} + η_2, where η ∼ N(b_s, ν_s), ν_s > ν, and
Δp_s = p_{s,2} − p_{s,1}.
Let us now compute the derivative of p_s w.r.t. p. From (6) we have:

$$\frac{dp_s}{dp} = \frac{\beta}{(p + \beta(1 - p))^2} \qquad (8)$$

corresponding to a concave function. Let λ be the value of p for which dp_s/dp = 1:

$$\lambda = \frac{\sqrt{\beta} - \beta}{1 - \beta}$$
It follows that

$$\beta \le \frac{dp_s}{dp} \le \frac{1}{\beta} \qquad (9)$$

and

$$1 < \frac{dp_s}{dp} < \frac{1}{\beta} \quad \text{when } 0 < p < \lambda,$$

while

$$\beta < \frac{dp_s}{dp} < 1 \quad \text{when } \lambda < p < 1.$$

In particular, for p = 0 we have dp_s = (1/β) dp, while for p = 1 it holds that dp_s = β dp.
Let us now suppose that the quantity Δp is small enough for the approximation
Δp_s ≈ (dp_s/dp) Δp to be accurate. We can then write the probability of obtaining a wrong
ranking without undersampling as

$$P(\epsilon_1 - \epsilon_2 > \Delta p) = 1 - \Phi\!\left(\frac{\Delta p}{\sqrt{2\nu}}\right) \qquad (10)$$

and with undersampling as

$$P(\eta_1 - \eta_2 > \Delta p_s) = 1 - \Phi\!\left(\frac{\Delta p_s}{\sqrt{2\nu_s}}\right) \qquad (11)$$
We can now say that a classifier learned after undersampling provides a better
ranking than a classifier learned on the unbalanced distribution when the probability
of a ranking error after undersampling is smaller than the corresponding probability
without undersampling; this requirement leads to condition (14).

Fig. 4. Left: dp_s/dp as a function of p. Right: dp_s/dp as a function of β.
Fig. 5. Non-separable case. (a) Class conditional distributions (thin lines) and the posterior distribution of the minority class (thicker line). (b) dp_s/dp (solid lines) and ν_s/ν (dotted lines): both terms of inequality (14) (solid: left-hand, dotted: right-hand term) for β = 0.1 and β = 0.4.
4 Experimental Validation
In this section we assess the validity of the condition (14) by performing a number
of tests on synthetic and real datasets.
Fig. 6. Separable case. (a) Class conditional distributions (thin lines) and the posterior distribution of the minority class (thicker line). (b) dp_s/dp (solid lines) and ν_s/ν (dotted lines): both terms of inequality (14) (solid: left-hand, dotted: right-hand term) for β = 0.1 and β = 0.4.
distribution. Figures 7(a) and 9(a) show the distributions of the testing
sets for the two tasks.
In order to compute the variance of p̂ and p̂_s in each test point, we generate
a training set (N = 1000) 1000 times and estimate the conditional probability
on the basis of the sample mean and covariance.
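The Monte Carlo procedure described above can be sketched as follows. The Gaussian class-conditional estimator, the synthetic data distribution, and the number of repetitions used here are illustrative assumptions and not the exact setup of the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_from_gaussians(x_test, X, y, prior_pos):
    # Fit one Gaussian per class by sample mean and covariance (an assumption
    # about the estimator; a small ridge keeps the covariance invertible).
    d = X.shape[1]
    Xp, Xn = X[y == 1], X[y == 0]
    lp = multivariate_normal(Xp.mean(0), np.cov(Xp.T) + 1e-6 * np.eye(d)).pdf(x_test)
    ln = multivariate_normal(Xn.mean(0), np.cov(Xn.T) + 1e-6 * np.eye(d)).pdf(x_test)
    return prior_pos * lp / (prior_pos * lp + (1 - prior_pos) * ln)

rng = np.random.default_rng(0)
x_test = np.array([[0.0, 0.0], [1.0, 1.0]])        # two illustrative test points
pi_pos, N, R = 0.05, 1000, 200                     # 5% positives, 200 repetitions
beta = pi_pos / (1 - pi_pos)                       # complete balancing

p_hat = np.empty((R, len(x_test)))                 # estimates on the full data
p_hat_s = np.empty((R, len(x_test)))               # estimates after undersampling
for r in range(R):
    y = (rng.random(N) < pi_pos).astype(int)
    X = np.where(y[:, None] == 1, 1.0, -1.0) + rng.normal(size=(N, 2))
    p_hat[r] = posterior_from_gaussians(x_test, X, y, pi_pos)
    keep = (y == 1) | (rng.random(N) < beta)       # keep all positives, beta of negatives
    p_hat_s[r] = posterior_from_gaussians(x_test, X[keep], y[keep], 0.5)

nu, nu_s = p_hat.var(axis=0), p_hat_s.var(axis=0)
print("nu_s / nu at the test points:", nu_s / nu)
```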
In Figure 7(b) (first task) we plot ν_s/ν (dotted line) and three percentiles
(0.25, 0.5, 0.75) of dp_s/dp vs. the rate of undersampling β. It appears that for at
least 75% of the testing points the term dp_s/dp is higher than ν_s/ν. In Figure 8(a)
the points surrounded by a triangle are those for which dp_s/dp > ν_s/ν holds
when β = 0.053 (balanced dataset). For such samples we expect that the ranking
returned by undersampling (i.e. based on p̂_s) is better than the one based on the
original data (i.e. based on p̂). The plot shows that undersampling is beneficial
in the region where the majority class is situated, which is also the area where
we expect to have low values of p. Figure 8(b) also shows that this region moves
towards the minority class when we undersample with β = 0.323 (90%
negatives, 10% positives after undersampling).
In order to measure the quality of the rankings based on p̂_s and p̂ we compute
the Kendall rank correlation of the two estimates with p, the true pos-
terior probability of the testing set, which defines the correct ordering. In Table 1
we show the ranking correlations of p̂_s (and p̂) with p for the samples where
condition (14) holds (first five rows) and where it does not (last five rows).
The results indicate that points for which condition (14) is satisfied indeed have
a better ranking with p̂_s than with p̂.
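As a concrete illustration of this measurement, the snippet below computes the Kendall correlation of two sets of estimates with the true posterior on a chosen subset of test points; the helper name and the toy data are ours.

```python
import numpy as np
from scipy.stats import kendalltau

def ranking_gain(p_true, p_hat, p_hat_s, mask):
    """K_s - K: Kendall correlation with the true posterior of the undersampled
    estimate minus that of the original estimate, on the points in `mask`
    (e.g. the points satisfying condition (14))."""
    tau_s, _ = kendalltau(p_true[mask], p_hat_s[mask])
    tau, _ = kendalltau(p_true[mask], p_hat[mask])
    return tau_s - tau

# Toy usage with synthetic estimates (illustration only).
rng = np.random.default_rng(0)
p = rng.random(500)
mask = rng.random(500) < 0.5
print(ranking_gain(p, p + rng.normal(0, 0.10, 500), p + rng.normal(0, 0.05, 500), mask))
```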
We repeated the experiments for the second task, which has a larger proportion
of positives (25%) (dataset 2 in Figure 9(a)).

Fig. 7. Left: distribution of the testing set where the positive samples account for 5%
of the total. Right: plot of dp_s/dp percentiles (25th, 50th and 75th) and of ν_s/ν (black dashed).

Fig. 8. Regions where undersampling should work. Triangles indicate the testing sam-
ples where the condition (14) holds for the dataset in Figure 7.

From Figure 9(b), plotting dp_s/dp and ν_s/ν as a function of β, it appears that only
the first two percentiles are above ν_s/ν. This means that fewer points of the testing
set satisfy condition (14). This is confirmed by the results in Table 2, where it appears
that the benefit due to undersampling is less significant than for the first task.
Fig. 9. Left: distribution of the testing set where the positive samples account for 25%
of the total. Right: plot of dp_s/dp percentiles (25th, 50th and 75th) and of ν_s/ν (black dashed).
In this section we assess the validity of condition (14) on a number of real
unbalanced binary classification tasks obtained by transforming some datasets
from the UCI repository [1] (Table 3).¹

¹ Transformed datasets are available at http://www.ulb.ac.be/di/map/adalpozz/imbalanced-datasets.zip
Table 3. Characteristics of the datasets: total number of samples N, number of positives N+, number of negatives N−, and class proportion N+/N.

Datasets        N      N+      N−     N+/N
ecoli 336 35 301 0.10
glass 214 17 197 0.08
letter-a 20000 789 19211 0.04
letter-vowel 20000 3878 16122 0.19
ism 11180 260 10920 0.02
letter 20000 789 19211 0.04
oil 937 41 896 0.04
page 5473 560 4913 0.10
pendigits 10992 1142 9850 0.10
PhosS 11411 613 10798 0.05
satimage 6430 625 5805 0.10
segment 2310 330 1980 0.14
boundary 3505 123 3382 0.04
estate 5322 636 4686 0.12
cam 18916 942 17974 0.05
compustat 13657 520 13137 0.04
covtype 38500 2747 35753 0.07
Figure 10 reports the difference between the Kendall rank correlation of p̂_s and
p̂, averaged over different levels of undersampling (proportions of majority vs.
minority: 90/10, 80/20, 60/40, 50/50). A higher difference means that p̂_s returns
a better ordering than p̂.

Fig. 10. Difference between the Kendall rank correlation of p̂_s and p̂ with p, namely K_s
and K, for points having condition (14) satisfied and not. K_s and K are calculated
as the mean of the correlations over all βs.

Also the estimation of ν_s/ν is a hard statistical problem, as is known in the
statistical literature on ratio estimation [12].
Fig. 11. Ratio between the number of samples satisfying condition (14) and all the
instances available in each dataset, averaged over all βs.
5 Conclusion
Undersampling has become the de facto strategy to deal with skewed distribu-
tions but, though easy to justify, it conceals two major effects: i) it increases
the variance of the classifier and ii) it produces warped posterior probabilities.
The first effect is typically addressed by the use of averaging strategies (e.g.
UnderBagging [21]) to reduce the variability, while the second requires the cali-
bration of the probabilities to the new priors of the testing set [18]. Despite the
popularity of undersampling for unbalanced classification tasks, it is not clear
how these two effects interact and when undersampling leads to better accuracy
in the classification task.
In this paper we analysed the interaction between undersampling
and the ranking error of the posterior probability. We derived condition (14),
under which undersampling can improve the ranking, and we showed that when it is
satisfied the posterior probability obtained after sampling returns a more accu-
rate ordering of the testing instances. To validate our claim we used first synthetic
and then real datasets, and in both cases we registered a better ranking with
undersampling when condition (14) was met. It is important to remark that this
condition shows that the beneficial impact of undersampling depends strongly
on the nature of the classification task (degree of unbalancedness and non-
separability) and on the variance of the classifier, and is consequently highly
dependent on the specific test point. We think that this result sheds light on the
reason why several discordant results have been obtained in the literature about
the effectiveness of undersampling in unbalanced tasks.
However, the practical use of this condition is not straightforward, since it
requires knowledge of the posterior probability and of the ratio of the variances
before and after undersampling. It follows that this result should be used mainly
as a warning against a naive use of undersampling in unbalanced tasks and
should instead suggest the adoption of adaptive selection techniques (e.g.
racing [6]) for a case-by-case use (and calibration) of undersampling.
References
1. Newman, D.J., Asuncion, A.: UCI machine learning repository (2007)
2. Anyfantis, D., Karagiannopoulos, M., Kotsiantis, S., Pintelas, P.: Robustness of
learning techniques in handling class noise in imbalanced datasets. In: Artifi-
cial intelligence and innovations 2007: From Theory to Applications, pp. 21–28.
Springer (2007)
3. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: Balancing Strategies and Class
Overlapping. In: Famili, A.F., Kok, J.N., Peña, J.M., Siebes, A., Feelders, A. (eds.)
IDA 2005. LNCS, vol. 3646, pp. 24–35. Springer, Heidelberg (2005)
4. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of sev-
eral methods for balancing machine learning training data. ACM SIGKDD Explo-
rations Newsletter 6(1), 20–29 (2004)
5. Dal Pozzolo, A., Caelen, O., Borgne, Y.-A.L., Waterschoot, S., Bontempi, G.:
Learned lessons in credit card fraud detection from a practitioner perspective.
Expert Systems with Applications 41(10), 4915–4928 (2014)
6. Dal Pozzolo, A., Caelen, O., Waterschoot, S., Bontempi, G.: Racing for Unbal-
anced Methods Selection. In: Yin, H., Tang, K., Gao, Y., Klawonn, F., Lee, M.,
Weise, T., Li, B., Yao, X. (eds.) IDEAL 2013. LNCS, vol. 8206, pp. 24–31. Springer,
Heidelberg (2013)
7. Domingos, P.: Metacost: a general method for making classifiers cost-sensitive. In:
Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 155–164. ACM (1999)
8. Elkan, C.: The foundations of cost-sensitive learning. In: International Joint Con-
ference on Artificial Intelligence, Citeseer, vol. 17, pp. 973–978 (2001)
9. Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning
from imbalanced data sets. Computational Intelligence 20(1), 18–36 (2004)
10. Garcı́a, V., Mollineda, R.A., Sánchez, J.S.: On the k-nn performance in a chal-
lenging scenario of imbalance and overlapping. Pattern Analysis and Applications
11(3–4), 269–280 (2008)
11. Garcı́a, V., Sánchez, J., Mollineda, R.A.: An Empirical Study of the Behavior of
Classifiers on Imbalanced and Overlapped Data Sets. In: Rueda, L., Mery, D.,
Kittler, J. (eds.) CIARP 2007. LNCS, vol. 4756, pp. 397–406. Springer, Heidelberg
(2007)
12. Hartley, H.O., Ross, A.: Unbiased ratio estimators (1954)
13. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering 21(9), 1263–1284 (2009)
14. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study.
Intelligent Data Analysis 6(5), 429–449 (2002)
15. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. ACM SIGKDD
Explorations Newsletter 6(1), 40–49 (2004)
16. Prati, R.C., Batista, G.E.A.P.A., Monard, M.C.: Class Imbalances versus Class
Overlapping: An Analysis of a Learning System Behavior. In: Monroy, R.,
Arroyo-Figueroa, G., Sucar, L.E., Sossa, H. (eds.) MICAI 2004. LNCS (LNAI),
vol. 2972, pp. 312–321. Springer, Heidelberg (2004)
17. Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset
shift in machine learning. The MIT Press (2009)
18. Saerens, M., Latinne, P., Decaestecker, C.: Adjusting the outputs of a classifier to
new a priori probabilities: a simple procedure. Neural Computation 14(1), 21–41
(2002)
19. Stefanowski, J.: Overlapping, rare examples and class decomposition in learning
classifiers from imbalanced data. In: Emerging Paradigms in Machine Learning,
pp. 277–306. Springer (2013)
20. Van Hulse, J., Khoshgoftaar, T.: Knowledge discovery from imbalanced and noisy
data. Data & Knowledge Engineering 68(12), 1513–1542 (2009)
21. Wang, S., Tang, K., Yao, X.: Diversity exploration and negative correlation learning
on imbalanced data sets. In: International Joint Conference on Neural Networks,
IJCNN 2009, pp. 3259–3266. IEEE (2009)
22. Weiss, G.M., Provost, F.: The effect of class distribution on classifier learning:
an empirical study. Technical report, Rutgers University (2001)
23. Zadrozny, B., Langford, J., Abe, N.: Cost-sensitive learning by cost-proportionate
example weighting. In: Data Mining, ICDM, pp. 435–442. IEEE (2003)
Clustering and Unsupervised Learning
A Kernel-Learning Approach to Semi-supervised
Clustering with Relative Distance Comparisons
1 Introduction
Clustering is the task of partitioning a set of data items into groups, or clusters.
However, the desired grouping of the data may not be sufficiently expressed
by the features that are used to describe the data items. For instance, when
clustering images it may be necessary to make use of semantic information about
the image contents in addition to some standard image features. Semi-supervised
clustering is a principled framework for combining such external information with
features. This information is usually given as labels about the pair-wise distances
between a few data items. Such labels may be provided by the data analyst, and
reflect properties of the data that are hard to express as an easily computable
function over the data features.
There are two commonly used ways to formalize such side information. The
first are must-link (ML) and cannot-link (CL) constraints. An ML (CL) con-
straint between data items i and j suggests that the two items are similar (dis-
similar), and should thus be assigned to the same cluster (different clusters).
The second way to express pair-wise similarities is through relative distance compar-
isons. These are statements that specify how the distances between some data
items relate to each other. The most common relative distance comparison task
asks the data analyst to specify which of the items i and j is closer to a third
item k. Note that unlike the ML/CL constraints, the relative comparisons do
not as such say anything about the clustering structure.
Given a number of similarity constraints, an efficient technique to implement
semi-supervised clustering is metric learning. The objective of metric learning
is to find a new distance function between the data items that takes both the
supplied features as well as the additional distance constraints into account.
Metric learning can be based on either ML/CL constraints or relative distance
comparisons. Both approaches have been studied extensively in the literature, and
much is known about the problem.
The method we discuss in this paper is a combination of metric-learning and
relative distance comparisons. We deviate from existing literature by eliciting
every constraint with the question
“Which one of the items i, j, and k is the least similar to the other two?”
The labeler should thus select one of the items as an outlier. Notably, we also
allow the labeler to leave the answer as unspecified. The main practical novelty
of this approach is in the capability to gain information also from comparisons
where the labeler has not been able to give a concrete solution. Some sets of three
items can be all very similar (or dissimilar) to each other, so that picking one
item as an obvious outlier is difficult. In the cases where the labeler gives a
"don't know" answer, it is still beneficial to use this answer in the metric-learning
process, as it provides a valuable cue, namely that the three displayed data items
are roughly equidistant.
We cast the metric-learning problem as a kernel-learning problem. The
learned kernel can be used to easily compute distances between data items,
even between data items that did not participate in the metric-learning training
phase and for which only the feature vectors are available. The use of relative com-
parisons, instead of hard ML/CL constraints, leads to learning a more accurate
metric that captures relations between data items at different scales. The learned
metric can be used for multi-level clustering, as well as other data-analysis tasks.
On the technical side, we start with an initial kernel K0 , computed using only
the feature vectors of the data items. We then formulate the kernel-learning task
as an optimization problem: the goal is to find the kernel matrix K that is the
closest to K0 and satisfies the constraints induced by the relative-comparison
labellings. To solve this optimization task we use known efficient techniques,
which we adapt for the case of relative comparisons.
More concretely, we make the following contributions:
2 Related Work
The idea of semi-supervised clustering was initially introduced by Wagstaff and
Cardie [2], and since then a large number of different problem variants and
methods have been proposed, the first ones being COP-Kmeans [3] and CCL [4].
Some of the later methods handle the constraints in a probabilistic framework.
For instance, the ML and CL constraints can be imposed in the form of a Markov
random field prior over the data items [5–7]. Alternatively, Lu [8] generalizes the
standard Gaussian process to include the preferences imposed by the ML and CL
constraints. Recently, Pei et al. [9] propose a discriminative clustering model that
uses relative comparisons and, like our method, can also make use of unspecified
comparisons.
The semi-supervised clustering setting has also been studied in the con-
text of spectral clustering, and many spectral clustering algorithms have been
extended to incorporate pairwise constraints [10,11]. More generally, these meth-
ods employ techniques for semi-supervised graph partitioning and kernel k-means
algorithms [12]. For instance, Kulis et al. [13] present a unified framework for
semi-supervised vector and graph clustering using spectral clustering and kernel
learning.
As stated in the Introduction, our work is based on metric learning. Most of
the metric-learning literature, starting with the work of Xing et al. [14], aims at
finding a Mahalanobis matrix subject to either ML/CL or relative distance con-
straints. Xing et al. [14] use ML/CL constraints, while Schultz and Joachims [15]
present a similar approach to handle relative comparisons. Metric learning often
requires solving a semidefinite optimization problem. This becomes easier if a Breg-
man divergence, in particular the log det divergence, is used to formulate the
optimization problem. Such an approach was first used for metric learning by
Davis et al. [16] with ML/CL constraints, and subsequently by Liu et al. [17]
likewise with ML/CL, as well as by Liu et al. [18] with relative comparisons.
Our algorithm also uses the log det divergence, and we extend the technique of
Davis et al. [16] to handle relative comparisons.
The metric-learning approaches can also be more directly combined with a
clustering algorithm. The MPCK-Means algorithm by Bilenko et al. [19] is one of
the first to combine metric learning with semi-supervised clustering and ML/CL
constraints. Xiang et al. [20] use metric learning, as well, to implement ML/CL
constraints in a clustering and classification framework, while Kumar et al. [21]
follow a similar approach using relative comparisons. Recently, Anand et al. [1]
As mentioned above, the first step of our approach is forming the initial kernel
K0 using the feature vectors X . We do this using a standard Gaussian kernel.
Details are provided in Section 4.2.
Next we show how the constraints implied by the distance comparison
sets C_neq and C_eq extend to a kernel space, obtained by a mapping Φ : D → R^m.
As usual, we assume that an inner product Φ(i)ᵀΦ(j) between items i and j in D
can be expressed by a symmetric kernel matrix K, that is, K_ij = Φ(i)ᵀΦ(j).
Moreover, we assume that the kernel K (and the mapping Φ) is connected to
the unknown distance function δ via the equation

$$\delta^2(i, j) = \|\Phi(i) - \Phi(j)\|^2 = K_{ii} + K_{jj} - 2K_{ij}. \qquad (4)$$

In other words, we explicitly assume that the distance function δ is in fact the
Euclidean distance in some unknown vector space. This is equivalent to assuming
that the evaluators base their distance-comparison decisions on some implicit
features, even if they might not be able to quantify these explicitly.
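For reference, Equation (4) can be evaluated for all pairs at once; a trivial sketch (the function name is ours):

```python
import numpy as np

def squared_distances_from_kernel(K):
    """delta^2(i, j) = K_ii + K_jj - 2 K_ij for all pairs, from Equation (4)."""
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2.0 * K
```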
Next, we discuss the constraint inequalities (Equations (1), (2), and (3))
in the kernel space. Let e_i denote the vector of all zeros with the value 1 at
position i. Equation (4) above can be expressed in matrix form as follows:

$$\delta^2(i, j) = (e_i - e_j)^{\top} K (e_i - e_j) = \mathrm{tr}\!\left(K (e_i - e_j)(e_i - e_j)^{\top}\right),$$

where tr(A) denotes the trace of the matrix A and we use the fact that K = Kᵀ.
Using the previous equation we can write Equation (1) as

$$\gamma\, \mathrm{tr}\!\left(K (e_i - e_j)(e_i - e_j)^{\top}\right) - \mathrm{tr}\!\left(K (e_i - e_k)(e_i - e_k)^{\top}\right) \le 0$$
$$\mathrm{tr}\!\left(\gamma K (e_i - e_j)(e_i - e_j)^{\top} - K (e_i - e_k)(e_i - e_k)^{\top}\right) \le 0$$
$$\mathrm{tr}\!\left(K \left(\gamma (e_i - e_j)(e_i - e_j)^{\top} - (e_i - e_k)(e_i - e_k)^{\top}\right)\right) \le 0$$
$$\mathrm{tr}\!\left(K\, C_{(i \leftarrow j \mid k)}\right) \le 0,$$

where C_{(i←j|k)} = γ(e_i − e_j)(e_i − e_j)ᵀ − (e_i − e_k)(e_i − e_k)ᵀ.
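The derivation above can be transcribed directly into code. The sketch below builds C_(i←j|k) and evaluates tr(KC); the function names are ours, and the default γ mirrors the value used later in the experiments.

```python
import numpy as np

def constraint_matrix(i, j, k, n, gamma=2.0):
    """C_(i<-j|k) = gamma*(e_i - e_j)(e_i - e_j)^T - (e_i - e_k)(e_i - e_k)^T,
    so that the inequality derived above reads tr(K C) <= 0."""
    e = np.eye(n)
    d_ij = (e[i] - e[j])[:, None]
    d_ik = (e[i] - e[k])[:, None]
    return gamma * d_ij @ d_ij.T - d_ik @ d_ik.T

def constraint_violation(K, C):
    """Positive value means the constraint tr(K C) <= 0 is violated."""
    return np.trace(K @ C)
```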
The log det divergence has many interesting properties, which make it ideal for
kernel learning. As with all Bregman divergences, the log det divergence is
convex with respect to its first argument. Moreover, it can be evaluated using
the eigenvalues and eigenvectors of the matrices K and K0 . This property can
be used to extend log det divergence to handle rank-deficient matrices [23], and
we will make use of this in our algorithm described in Section 4.
we seek a kernel matrix K by projecting the initial kernel matrix K_0 onto the
convex set obtained from the intersection of the constraint sets. The optimiza-
tion Problem (9) can be solved using the method of Bregman projections [23–25].
The idea is to consider one unsatisfied constraint at a time and project the matrix
so that the constraint becomes satisfied. Note that the projections are not orthogonal
and thus a previously satisfied constraint might become unsatisfied. However,
as stated before, the objective function in Problem (9) is convex and the method
is guaranteed to converge to the global minimum if all the constraints are visited
infinitely often (randomly or following a more structured procedure).
Let us consider the update rule for an unsatisfied constraint from C_neq. The
procedure for dealing with constraints from C_eq is similar. We first consider the
case of full-rank symmetric positive semidefinite matrices. Let K_t be the value
of the kernel matrix at step t. For an unsatisfied inequality constraint C, the
optimization problem becomes¹

$$\min_{K \succeq 0}\; D_{ld}(K, K_t) \quad \text{s.t.} \quad \mathrm{tr}(K\,C) \le 0, \qquad (11)$$

where D_ld denotes the log det divergence. Following standard derivations for computing gradient updates for Bregman pro-
jection [25], the solution of Equation (11) can be written as

$$K_{t+1} = (K_t^{-1} + \alpha C)^{-1}, \qquad (12)$$

where the projection parameter α is chosen so that the constraint is satisfied with equality, that is,

$$\mathrm{tr}\!\left((K_t^{-1} + \alpha C)^{-1} C\right) = 0. \qquad (13)$$
Equation (13) does not have a closed-form solution for α in general. However, we
exploit the fact that, for both types of constraints, the matrix C has rank 2, i.e.,
rank(C) = 2. Let K_t = GᵀG and W = G C Gᵀ, and therefore rank(W) = 2,
with eigenvalues η_2 ≤ 0 ≤ η_1 and |η_2| ≤ |η_1|. Solving Equation (13) for α gives

$$\frac{\eta_1}{1 + \alpha\eta_1} + \frac{\eta_2}{1 + \alpha\eta_2} = 0, \qquad (14)$$

and

$$\alpha^* = -\frac{1}{2}\,\frac{\eta_1 + \eta_2}{\eta_1 \eta_2} \ge 0. \qquad (15)$$
Substituting Equation (15) into (12) gives the following update equation for the
kernel matrix:

$$K_{t+1} = (K_t^{-1} + \alpha^* C)^{-1}. \qquad (16)$$

¹ We skip the subscript for notational simplicity.
Since C has rank 2, it can be factorized as C = U Vᵀ with U, V ∈ R^{n×2}; applying the
matrix-inversion (Woodbury) lemma to (16) then gives

$$K_{t+1} = K_t - K_t\, \alpha^* U \left(I + \alpha^* V^{\top} K_t U\right)^{-1} V^{\top} K_t = K_t - \Delta K_t, \qquad (17)$$

in which ΔK_t is the correction term applied to the current kernel matrix K_t. Calcu-
lating the update rule (17) is simpler since it only involves the inverse of a 2 × 2
matrix, rather than the n × n matrix in (16).
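A minimal sketch of one projection step, implementing Equations (13)–(16) under the assumption that K_t is full rank; error handling and the equality-constraint case are omitted, and the rank-2 form (17) would avoid the n × n inverse used here.

```python
import numpy as np

def bregman_project(K_t, C):
    """Project K_t onto {K : tr(K C) <= 0} under the log det divergence."""
    if np.trace(K_t @ C) <= 0:
        return K_t                                  # constraint already satisfied
    # Nonzero eigenvalues of W = G C G^T (with K_t = G^T G) equal those of C K_t.
    eig = np.linalg.eigvals(C @ K_t).real
    eta1, eta2 = eig[np.argsort(-np.abs(eig))[:2]]  # the two dominant eigenvalues
    alpha = -0.5 * (eta1 + eta2) / (eta1 * eta2)    # Equation (15)
    return np.linalg.inv(np.linalg.inv(K_t) + alpha * C)   # Equation (16)
```

A full learner would sweep over the unsatisfied constraints in C_neq ∪ C_eq with such projection steps until convergence, as described above.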
For a rank-deficient kernel matrix K_0 with rank(K_0) = r, we employ the
results of Kulis et al. [23], which state that for any column-orthogonal matrix Q
with range(K_0) ⊆ range(Q) (e.g., obtained by singular value decomposition of
K_0), we can first apply the transformation

$$M \mapsto \hat{M} = Q^{\top} M Q$$

to all the matrices and, after finding the kernel matrix K̂ satisfying all the
transformed constraints, obtain the final kernel matrix using the inverse
transformation

$$K = Q \hat{K} Q^{\top}.$$
Note that since log det preserves the matrix rank, the mapping is one-to-one
and invertible.
As a final remark, the kernel matrix learned by minimizing the log det
divergence subject to the set of constraints C_neq ∪ C_eq can also be extended to
handle out-of-sample data points, i.e., data points that were not present when
learning the kernel matrix. The inner product between a pair of out-of-sample
data points x, y ∈ R^d in the transformed kernel space can be written as

$$k(x, y) = k_0(x, y) + \mathbf{k}_x^{\top} \left(K_0^{\dagger} (K - K_0)\, K_0^{\dagger}\right) \mathbf{k}_y \qquad (18)$$

where k_0(x, y) and the vectors k_x = [k_0(x, x_1), …, k_0(x, x_n)]ᵀ and k_y =
[k_0(y, x_1), …, k_0(y, x_n)]ᵀ are formed using the initial kernel function.
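Equation (18) translates directly into code; `k0` stands for whatever initial kernel function was used to build K0 (the names are ours).

```python
import numpy as np

def out_of_sample_kernel(k0, X_train, K0, K, x, y):
    """Inner product between two unseen points x and y in the learned kernel
    space, following Equation (18)."""
    kx = np.array([k0(x, xi) for xi in X_train])
    ky = np.array([k0(y, xi) for xi in X_train])
    K0_pinv = np.linalg.pinv(K0)                     # K0^dagger
    return k0(x, y) + kx @ (K0_pinv @ (K - K0) @ K0_pinv) @ ky
```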
For the initial Gaussian kernel the bandwidth is chosen adaptively, with σ_ij = σ_i σ_j; this ensures a large bandwidth for sparse regions and
a small bandwidth for dense regions.
Clustering Method. After obtaining the kernel matrix K satisfying the set of
all relative and undetermined constraints, we can obtain the final clustering of
the points by applying any standard kernelized clustering method. In this paper,
we consider the kernel k-means because of its simplicity and good performance.
Generalization of the method to other clustering techniques such as kernel mean-
shift is straightforward.
5 Experimental Results
In this section, we evaluate the performance of the proposed kernel-learning
method, SKLR. As the under-the-hood clustering method required by SKLR, we
use standard kernel k-means with a Gaussian kernel and without any super-
vision (Equation (19), with its parameter set to 100). We compare SKLR to three different semi-
supervised metric-learning algorithms, namely, ITML [16], SKkm [1] (a variant
of SKMS with kernel k-means in the final stage), and LSML [18]. We select the
SKkm variant as Anand et al. [1] have shown that SKkm tends to produce more
accurate results than other semi-supervised clustering methods. Two of the base-
lines, ITML and SKkm, are based on pairwise ML/CL constraints, while LSML
uses relative comparisons. For ITML and LSML we apply k-means on the trans-
formed feature vectors to find the final clustering, while for SKkm and SKLR we
apply kernel k-means on the transformed kernel matrices.
To assess the quality of the resulting clusterings, we use the Adjusted Rand
(AR) index [26]. Each experiment is repeated 20 times and the average over all
executions is reported. For the parameter γ required by SKLR we use γ = 2. Our
implementation of SKLR is in MATLAB and the code is publicly available. For
the other three methods we use publicly available implementations.
Finally, we note that in this paper we do not report running-time results,
but all tested methods have comparable running times. In particular, the com-
putational overhead of our method is limited by the fact that the algorithm
only has to perform rank-2 matrix updates.
5.1 Datasets
We conduct the experiments on three different real-world datasets.
Vehicle: The dataset contains 846 instances from 4 different classes and is
available from the LIBSVM repository.
MIT Scene: The dataset contains 2688 outdoor images, each of size 256 ×
256, from 8 different categories: 4 natural and 4 man-made. We use the GIST
descriptors [27] as the feature vectors.
USPS Digits: The dataset contains 16 × 16 grayscale images of handwritten
digits, with 1100 instances from each class. The columns of each image
are concatenated to form a 256-dimensional feature vector.
The 2-class partitionings of our datasets required for the binary experiment
are defined as follows: For the vehicle dataset, we consider class 4 as one group
and the rest of the classes as the second group (an arbitrary choice). For the
MIT Scene dataset, we perform a partitioning of the data into natural vs.
man-made scenes. Finally, for the USPS Digits, we divide the data instances
into even vs. odd digits.
To generate the pairwise constraints for each dataset, we vary the number of
labeled instances from each class (from 5 to 19 with step-size of 2) and form all
possible ML constraints. We then consider the same number of CL constraints.
Note that for the binary case, we only have two classes for each dataset. To
compare with the methods that use relative comparisons, we consider an equal
number of relative comparisons and generate them by sampling two random
points from the same class and one point (outlier) from one of the other classes.
Note that for the relative comparisons, there is no need to restrict the points to
the labeled samples, as the comparisons are made in a relative manner.
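A sketch of this sampling procedure (the function name and the return format are illustrative):

```python
import numpy as np

def sample_relative_comparisons(y, n_comparisons, rng=None):
    """Draw (i, j, k) triples: i and j from one class, the outlier k from another.

    Assumes every class has at least two instances."""
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    triples = []
    for _ in range(n_comparisons):
        c = rng.choice(classes)
        other = rng.choice(classes[classes != c])
        i, j = rng.choice(np.flatnonzero(y == c), size=2, replace=False)
        k = rng.choice(np.flatnonzero(y == other))
        triples.append((int(i), int(j), int(k)))
    return triples
```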
Finally, in these experiments, we consider a subsample of both MIT Scene
and USPS Digits datasets by randomly selecting 100 data points from each
class, yielding 800 and 1000 data points, respectively.
The results for the binary and multi-class experiments are shown in Fig-
ures 1(a) and 1(b), respectively. We see that all methods perform equally with
no constraints. As constraints or relative comparisons are introduced the accu-
racy of all methods improves very rapidly. The only surprising behavior is the
one of ITML in the multi-class setting, whose accuracy drops as the number of
constraints increases. From the figures we see that SKLR outperforms all com-
peting methods by a large margin, for all three datasets and in both settings.
As discussed earlier, one of the main advantages of kernel learning with relative
comparisons is the feasibility of multi-resolution clustering using a single kernel
matrix. To validate this claim, we repeat the binary and multi-class experiments
described above. However, this time, we mix the binary and multi-class con-
straints and use the same set of constraints in both experimental conditions. We
evaluate the results by performing binary and multi-class clustering, as before.
Figures 1(c) and 1(d) illustrate the performance of different algorithms using
the mixed set of constraints. Again, SKLR produces more accurate clusterings,
especially in the multi-class setting. In fact, two of the methods, SKkm and ITML,
perform worse than the kernel k-means baseline in the multi-class setting. On
the other hand all methods outperform the baseline in the binary setting. The
reason is that most of the constraints in the multi-class setting are also relevant
to the binary setting, but not the other way around.
Figure 2 shows a visualization of the USPS Digits dataset using the SNE
method [28] in the original space, and the spaces induced by SKkm and SKLR. We
see that SKLR provides an excellent separation of the clusters that correspond to
even/odd digits as well as the sub-clusters that correspond to individual digits.
Fig. 1. Clustering accuracy measured with the Adjusted Rand index (AR) as a function of the number of constraints, for SKLR, LSML, SKkm, ITML, and Kkm. Rows correspond to different datasets: (1) Vehicle; (2) MIT Scene; (3) USPS Digits. Columns correspond to different experimental settings: (a) binary with separate constraints; (b) multi-class with separate constraints; (c) binary with mixed constraints; (d) multi-class with mixed constraints.
Fig. 2. Visualization of the USPS Digits dataset using SNE: (a) original space; (b) space
obtained by SKkm; (c) space obtained by our method, SKLR.
(Figure: AR vs. number of constraints on MIT Scene and USPS Digits under the binary, multi-class, binary (mixed), and multi-class (mixed) settings, for SKLR, LSML, SKkm, ITML, and Kkm.)
In this experiment we fix the number of relative comparisons (360, 720, and 900 for Vehicle,
MIT Scene, and USPS Digits, respectively) and then add some addi-
tional equality constraints (up to 200). The equality constraints are generated
by randomly selecting three data points, all from the same class or each from
a different class. The results are shown in Figure 4. As can be seen, considering
the equality constraints also improves performance, especially on the MIT
Scene and USPS Digits datasets. Note that none of the other methods can
handle this type of constraints.
(Figure 4: AR as a function of the number of additional equality constraints for the three datasets.)
6 Conclusion
We have devised a semi-supervised kernel-learning algorithm that can incorpo-
rate various types of relative distance constraints, and used the resulting kernels
for clustering. Our experiments show that our method outperforms, by a large
margin, competing methods that either use ML/CL constraints or use
relative constraints with different metric-learning approaches. Our method is com-
patible with existing kernel-learning techniques [1] in the sense that if ML and
CL constraints are available, they can be used together with relative compar-
isons. We have also proposed to interpret an "unsolved" distance comparison as
indicating that the interpoint distances are roughly equal. Our experiments suggest that
incorporating such equality constraints into the kernel-learning task can be advan-
tageous, especially in settings where it is costly to collect constraints.
For future work we would like to extend our method to incorporate more
robust clustering methods such as spectral clustering and mean-shift. Addition-
ally, a soft formulation of the relative constraints for handling possibly incon-
sistent constraints is straightforward; however, we leave it for future study.
References
1. Anand, S., Mittal, S., Tuzel, O., Meer, P.: Semi-supervised kernel mean shift clus-
tering. PAMI 36, 1201–1215 (2014)
2. Wagstaff, K., Cardie, C.: Clustering with instance-level constraints. In: ICML
(2000)
3. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering
with background knowledge. In: ICML (2001)
4. Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-
level constraints. In: ICML (2002)
5. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised
clustering. In: KDD (2004)
6. Basu, S., Banerjee, A., Mooney, R.J.: Active semi-supervision for pairwise con-
strained clustering. In: SDM (2004)
7. Lu, Z., Leen, T.K.: Semi-supervised learning with penalized probabilistic cluster-
ing. NIPS (2005)
8. Lu, Z.: Semi-supervised clustering with pairwise constraints: A discriminative
approach. In: AISTATS (2007)
9. Pei, Y., Fern, X.Z., Rosales, R., Tjahja, T.V.: Discriminative clustering with rela-
tive constraints. arXiv:1501.00037 (2014)
10. Lu, Z., Ip, H.H.S.: Constrained Spectral Clustering via Exhaustive and Efficient
Constraint Propagation. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV
2010, Part VI. LNCS, vol. 6316, pp. 1–14. Springer, Heidelberg (2010)
11. Lu, Z., Carreira-Perpiñán, M.: Constrained spectral clustering through affinity
propagation. In: CVPR (2008)
12. Dhillon, I.S., Guan, Y., Kulis, B.: A unified view of kernel k-means, spectral clus-
tering and graph cuts. Technical Report TR-04-25, University of Texas (2005)
13. Kulis, B., Basu, S., Dhillon, I., Mooney, R.: Semi-supervised graph clustering: a
kernel approach. Machine Learning 74, 1–22 (2009)
14. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.J.: Distance metric learning with
application to clustering with side-information. In: NIPS (2002)
15. Schultz, M., Joachims, T.: Learning a distance metric from relative comparisons.
In: NIPS (2003)
16. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric
learning. In: ICML (2007)
17. Liu, W., Ma, S., Tao, D., Liu, J., Liu, P.: Semi-supervised sparse metric learning
using alternating linearization optimization. In: KDD (2010)
18. Liu, E.Y., Guo, Z., Zhang, X., Jojic, V., Wang, W.: Metric learning from relative
comparisons by minimizing squared residual. In: ICDM (2012)
19. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning
in semi-supervised clustering. In: ICML (2004)
20. Xiang, S., Nie, F., Zhang, C.: Learning a mahalanobis distance metric for data
clustering and classification. Pattern Recognition 41, 3600–3612 (2008)
21. Kumar, N., Kummamuru, K.: Semisupervised clustering with metric learning using
relative comparisons. TKDE 20, 496–503 (2008)
22. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space anal-
ysis. PAMI 24, 603–619 (2002)
23. Kulis, B., Sustik, M.A., Dhillon, I.S.: Low-rank kernel learning with Bregman
matrix divergences. JMLR 10, 341–376 (2009)
24. Bregman, L.: The relaxation method of finding the common point of convex sets
and its application to the solution of problems in convex programming. USSR
Computational Mathematics and Mathematical Physics 7, 200–217 (1967)
25. Tsuda, K., Rätsch, G., Warmuth, M.: Matrix exponentiated gradient updates for
on-line learning and Bregman projection. JMLR 6, 995–1018 (2005)
26. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218
(1985)
27. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation
of the spatial envelope. IJCV 42, 145–175 (2001)
28. Hinton, G., Roweis, S.: Stochastic neighbor embedding. In: NIPS (2003)
Bayesian Active Clustering with Pairwise
Constraints
1 Introduction
Constraint-based clustering aims to improve clustering using user-provided pair-
wise constraints regarding similarities between pairs of instances. In particular,
a must-link constraint states that a pair of instances belong to the same cluster,
and a cannot-link constraint implies that two instances are in different clusters.
Existing work has shown that such constraints can be effective at improving
clustering in many cases [2, 4, 8, 16, 19, 20, 22, 24, 28]. However, most prior work
focuses on "passive" learning from constraints, i.e., instance pairs are randomly
selected to be labeled by a user. Constraints acquired in this random manner
may be redundant and waste labeling effort, which is typically
limited in real applications. Moreover, when the constraints are not properly
selected, they may even be harmful to the clustering performance as has been
revealed by Davidson et al. [7]. In this paper, we study the important problem
of actively selecting effective pairwise constraints for clustering.
Existing work on active learning of pairwise constraints for clustering has
mostly focused on neighbourhood-based methods [3, 12,14,17,25]. Such meth-
ods maintain a neighbourhood structure of the data based on the existing con-
straints, which represents a partial clustering solution, and they query pairwise
constraints to expand such neighborhoods. Other methods that do not rely on
such structure consider various criteria for measuring the utility of instance pairs.
For example, Xu et al. [26] propose to select constraints by examining the spec-
tral eigenvectors of the similarity matrix, and identify data points that are at
or close to cluster boundaries. Vu et al. [21] introduce a method that chooses
instance pairs involving points on the sparse regions of a k-nearest neighbours
graph. As mentioned by Xiong et al. [25], many existing methods often select
a batch of pairwise constraints before performing clustering, and they are not
designed for iteratively improving clustering by querying new pairs.
In this work, we study Bayesian active clustering with pairwise constraints
in an iterative fashion. In particular, we introduce a Bayesian clustering model
to find the clustering posterior given a set of pairwise constraints. At every
iteration, our task is: a) to select the most informative pair toward improving
current clustering, and b) to update the clustering posterior after the query is
answered by an oracle or a user. Our goal is to achieve the best possible clustering
performance with the minimum number of queries.
In our Bayesian clustering model, we use a discriminative logistic model to
capture the conditional probability of the cluster assignments given the instances.
The likelihood of observed pairwise constraints is computed by marginalizing
over all possible cluster assignments using message passing. We adopt a special
data-dependent prior that encourages large cluster separations. At every iter-
ation, the clustering posterior is represented by a set of samples (“particles”).
After obtaining a new constraint, the posterior is effectively updated with a
sequential Markov Chain Monte Carlo (MCMC) method (“particle filter”).
We present two information-theoretic criteria for selecting instance pairs to
query at each iteration: a) Uncertain, which chooses the most uncertain pair
based on current posterior, and b) Info, which selects the pair that maximizes
the information gain regarding current clustering. With the clustering posterior
maintained at every iteration, both objectives can be efficiently calculated.
We evaluate our method on benchmark datasets, and the results demonstrate
that our Bayesian clustering model is very effective at learning from a small
number of pairwise constraints, and our active clustering model outperforms
state-of-the-art active clustering methods.
2 Problem Statement
The goal of clustering is to find the underlying cluster structure in a dataset
X = [x1 , · · · , xN ] with xi ∈ Rd where d is the feature dimension. The unknown
cluster label vector Y = [y1 , · · · , yN ], with yi ∈ {1, · · · , K} being the cluster
label for xi , denotes the ideal clustering of the dataset, where K is the number of
clusters. In the active clustering setting studied here, we can acquire weak supervision,
i.e., pairwise constraints, by requesting an oracle to specify whether two instances
(xa , xb ) ∈ X × X belong to the same cluster. We represent the response of the
oracle as a pair label za,b ∈ {+1, −1}, with za,b = +1 representing that instance
xa and xb are in the same cluster (a must-link constraint), and za,b = −1 meaning
that they are in different clusters (a cannot-link constraint). We assume the cost
is uniform for different queries, and the goal of active clustering is to achieve the
best possible clustering with the least number of queries.
In this work, we consider sequential active clustering. In each iteration, we
select one instance pair to query the oracle. After getting the answer of the query,
we update the clustering model to integrate the supervision. With the updated
model, we then choose the best possible pair for the next query. So the task of
active clustering is an iterative process of posing queries and incorporating the new
information into the clustering.
An active clustering model generally has two key components: the clustering
component and the pair selection component. In every iteration, the task of the
clustering component is to identify the cluster structure of the data given the
existing constraints. The task of the pair selection component is to score each
candidate pair and choose the most informative pair to improve the clustering.
where the prior parameter θ = [λ, τ ]. The first term is the weighted Frobenius
norm of W . This term corresponds to the Gaussian prior with zero mean and
diagonal covariance matrix with λ as the diagonal elements, and it controls the
model complexity. The second term is the average negative entropy of the cluster
assignment variable Y . We use this term to encourage large separations among
clusters, as similarly utilized by [11] for semi-supervised classification problems.
The constant term normalizes the probability. Although it is unknown, inference
can be carried out by sampling from the unnormalized distribution (e.g., using
slice sampling [18]). We will discuss more details in Sec. 3.3.
With our model assumption, the conditional probability P(Z|Y) is fully fac-
torized based on the pairwise constraints. For a single pair (x_a, x_b), we define
the probability of z_{a,b} given cluster labels y_a and y_b as

$$P(z_{a,b} = +1 \mid y_a, y_b) = \begin{cases} 1 - \epsilon & \text{if } y_a = y_b \\ \epsilon & \text{if } y_a \neq y_b \end{cases}, \qquad
P(z_{a,b} = -1 \mid y_a, y_b) = 1 - P(z_{a,b} = +1 \mid y_a, y_b), \qquad (4)$$

where ε is a small number to accommodate (possible) labeling error. In the
case where no labeling error exists, ε allows for "soft constraints", meaning that
the model can make small errors on some pair labels and achieve large cluster
separations.
The likelihood of the observed constraints is obtained by marginalizing over the cluster labels of the constrained instances,

$$P(Z \mid W, X) = \sum_{Y_{\alpha(Z)}} P(Z \mid Y_{\alpha(Z)})\, P(Y_{\alpha(Z)} \mid W, X_{\alpha(Z)}), \qquad (5)$$

where α(Z) denotes the set of indices for all instances involved in Z.
The marginalization can be solved by performing sum-product message pass-
ing [15] on a factor graph defined by all the constraints. Specifically, the set of all
instances indexed by α(Z) defines the nodes of the graph, and P (Yα(Z) |W, Xα(Z) )
defines the node potentials. Each queried pair (xa , xb ) creates an edge, and the
edge potential is defined by P (za,b |ya , yb ). In this work, we require that the
graph formed by the constraints does not contain cycles, and message passing is
performed on a tree (or a forest, i.e., a collection of trees). Since inference
on trees is exact, the marginalization is computed exactly. Moreover, due to
the simple form of the edge potential (which is a simple modification to the
identity matrix as can be seen from (4)), the message passing can be performed
very efficiently. In fact, each message propagation only requires O(K) complexity
instead of O(K 2 ) as in the general case. Overall the message passing only takes
O(K|Z|), even faster than calculating the node potentials P (Yα(Z) |W, Xα(Z) ),
which takes O(dK|Z|).
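To make the O(K) claim concrete, the following sketch propagates one message across a constraint edge using the structure of the potential in (the reconstructed) Equation (4); the dense product is included only to check the algebra, and the function name is ours.

```python
import numpy as np

def pass_message(mu, z, eps=0.05):
    """Multiply an (unnormalized) message mu by the edge potential of a pair
    label z in {+1, -1}: (1 - eps) on the diagonal and eps elsewhere for z = +1,
    and the opposite for z = -1. The product collapses to a sum plus a scaled
    copy of mu, i.e. O(K) work instead of O(K^2)."""
    s = mu.sum()
    if z == +1:
        return eps * s + (1.0 - 2.0 * eps) * mu
    return (1.0 - eps) * s - (1.0 - 2.0 * eps) * mu

# Check against the explicit K x K potential for z = +1.
K, eps = 5, 0.05
mu = np.random.default_rng(0).random(K)
psi = np.full((K, K), eps) + (1 - 2 * eps) * np.eye(K)
assert np.allclose(psi @ mu, pass_message(mu, +1, eps))
```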
The goal of pair selection at iteration t is to select a pair (x_a^t, x_b^t) from a pool of unlabeled pairs U^t, and acquire the label
z_{a,b}^t from the oracle. We let U^1 ⊆ X × X be the initial pool of unlabeled pairs.
Then U^t = U^{t-1} \ (x_a^{t-1}, x_b^{t-1}) for 1 ≤ t ≤ T. Below we use Z_t = [z_{a,b}^1, …, z_{a,b}^t]
to denote the pair labels acquired in the first t iterations.
Selection Criteria. We use two entropy-based criteria to select the best pair
at each iteration. The first criterion, which we call Uncertain, is to select the
pair whose label is the most uncertain. That is, at the t-th iteration, we choose
the pair (x_a^t, x_b^t) that has the largest marginal entropy of z_{a,b}^t (over the posterior
distribution of W). A similar objective has been considered in prior work on distance metric learning
[27] and document clustering [14], where the authors propose different approaches
to compute or approximate the entropy objective.
The second criterion is a greedy objective adopted from active learning for
classification [6,10,13], which we call Info. The idea is to select the query (x_a^t, x_b^t)
that maximizes the marginal information gain about the model W. Computing this
objective requires the predictive probability of a candidate pair label under a sampled
model Ŵ, which can be written as

$$P(z_{a,b} \mid Z_t, \hat{W}, X) = \frac{P(z_{a,b} \cup Z_t \mid \hat{W}, X)}{P(Z_t \mid \hat{W}, X)}, \qquad (10)$$

where calculating both the numerator and the denominator is the same infer-
ence problem as (5) and can be solved similarly using message passing. In fact,
the message propagations for the two calculations are shared, except that a new
edge for z_{a,b} is introduced to the graph for P(z_{a,b} ∪ Z_t | Ŵ, X). So we can
calculate the two values by performing the message passing algorithm only once on
the graph of P(z_{a,b} ∪ Z_t | Ŵ, X), and record P(Z_t | Ŵ, X) in the intermediate step.
By definition, the conditional entropy is

$$H(z_{a,b} \mid W, Z_t, X, \theta) = \int P(\hat{W} \mid Z_t, X, \theta)\, H(z_{a,b} \mid Z_t, \hat{W}, X)\, d\hat{W}, \qquad (11)$$

where H(z_{a,b} | Z_t, Ŵ, X) is easy to compute once we know P(z_{a,b} | Z_t, Ŵ, X),
which has been obtained in (10).
Now the only obstacle in calculating the two entropies is to take the expecta-
tions over the posterior distribution P (W |Zt , X, θ) in (9) and (11). Here we
use sampling to approximate such expectations. We first sample W ’s from
P (W |Zt , X, θ) and then approximate the expectations with the sample means.
Directly sampling from the posterior at every iteration is doable but very ineffi-
cient. Below we describe a sequential MCMC sampling method (“particle filter”)
that effectively updates the samples of the posterior.
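Given the per-particle pair-label probabilities (computed, e.g., via Equation (10)), both criteria reduce to simple averages. The sketch below assumes that the Info score equals the mutual information between the pair label and W, i.e. the marginal entropy minus the expected conditional entropy, which is our reading of the two entropies in (9) and (11); the function names are ours.

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def pair_scores(P):
    """P[s, m] approximates P(z_m = +1 | Z_t, W_s, X) for particle W_s and
    candidate pair m. Returns the Uncertain score (marginal entropy of z_m)
    and the Info score (marginal entropy minus expected conditional entropy)."""
    marginal = P.mean(axis=0)                            # average over particles
    uncertain = binary_entropy(marginal)
    info = uncertain - binary_entropy(P).mean(axis=0)
    return uncertain, info

# Query selection: pick the pair maximizing the chosen criterion, e.g.
# uncertain, info = pair_scores(P); next_pair = np.argmax(info)
```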
The resample–move approach reuses the particles drawn
from P(W | Z_{t-1}, X, θ) and then performs just a few MCMC steps with these
particles to prevent degeneration [9].
Here we maintain S particles in each iteration. We denote by W_s^t, 1 ≤ s ≤ S,
the s-th particle in the t-th iteration. For initialization, we sample particles
{W_1^0, …, W_S^0} from the prior distribution P(W|X, θ) defined in (3) using slice
sampling [18]², an MCMC method that can uniformly draw samples from an
unnormalized density function. Since slice sampling does not require the target
distribution to be normalized, the unknown constant in the prior (3) can be
neglected here.
At iteration t, 1 ≤ t ≤ T, after a new pair label z_{a,b}^t is observed, we per-
form the following two steps to update the particles and get samples from
P(W | Z_t, X, θ).
(1) Resample. The first step is to resample from the particles {W_1^{t-1}, …, W_S^{t-1}}
obtained in the previous iteration for P(W | Z_{t-1}, X, θ). We observe that

$$P(W \mid Z_t, X, \theta) = P(W \mid z_{a,b}^t, Z_{t-1}, X, \theta) \propto P(z_{a,b}^t \mid Z_{t-1}, W, X)\, P(W \mid Z_{t-1}, X, \theta).$$

So each particle W_s^{t-1} is weighted by P(z_{a,b}^t | Z_{t-1}, W_s^{t-1}, X), which can be cal-
culated in the same way as (10).
(2) Move. In the second step, we start from each resampled particle and per-
form several slice sampling steps for the posterior

$$P(W \mid Z_t, X, \theta) \propto P(Z_t \mid W, X)\, P(W \mid X, \theta). \qquad (12)$$

Again P(Z_t | W, X) is calculated by message passing as in (5), and the unknown
normalizing constant in P(W | X, θ) can be ignored, since slice sampling does not
require the normalization constant.
The resample-move method avoids degeneration in the sequence of slice
sampling steps. After these two steps, we have updated the particles for
P (W |Zt , X, θ). Such particles are used to approximate the selection objectives
as described in Sec. 3.2, allowing us to select the next informative pair to query.
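A minimal Python sketch of this resample-move update, under stated assumptions: loglik_new (the reweighting term P(z_{a,b}^t | Z_{t−1}, W, X)) and log_posterior (the unnormalized log of (12)) are hypothetical placeholders for the message-passing computations, and a few random-walk Metropolis steps stand in for the slice sampling moves used in the paper.

import numpy as np

def resample_move(particles, loglik_new, log_posterior, n_mcmc=3, step=0.1, rng=None):
    # One update of the particle approximation to P(W | Z_t, X, theta).
    rng = np.random.default_rng() if rng is None else rng
    S = len(particles)

    # (1) Resample: weight each particle by the likelihood of the new pair label.
    logw = np.array([loglik_new(W) for W in particles])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    particles = [particles[i].copy() for i in rng.choice(S, size=S, p=w)]

    # (2) Move: a few MCMC steps per particle to avoid degeneracy
    # (random-walk Metropolis here, as a simple stand-in for slice sampling).
    for s in range(S):
        W, lp = particles[s], log_posterior(particles[s])
        for _ in range(n_mcmc):
            W_prop = W + step * rng.standard_normal(W.shape)
            lp_prop = log_posterior(W_prop)
            if np.log(rng.random()) < lp_prop - lp:
                W, lp = W_prop, lp_prop
        particles[s] = W
    return particles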
Note that the distribution P(W | Z_t, X, θ) is invariant to label switching, that is, permuting the column vectors of W = [W_{·,1}, · · · , W_{·,K}] does not change the probability P(W | Z_t, X, θ). This is because we cannot provide any prior on W that fixes a label order, nor do the obtained constraints provide any information about the label order. One concern is whether the label switching problem would reduce sampling efficiency and affect the pair selection, since P(W | Z_t, X, θ) has multiple modes corresponding to different label permutations. In fact, it does not cause an issue for the approximation of the integrals in (9) and (11), since the term P(z_{a,b} | Z_t, W, X, θ) is also invariant to label permutations. However, the label switching problem does make it difficult to obtain the Bayesian prediction of cluster labels from the distribution P(Y | Z_t, X, θ), so we instead employ the MAP solution W_map and predict cluster labels with P(Y | Z_t, W_map, X, θ). We describe this in the following section.
2 Here we use the implementation slicesample provided in the MATLAB toolbox.
where p_i = [p_{i1}, · · · , p_{iK}] with p_{ik} = P(y_i = k | W, x_i), q_i = [q_{i1}, · · · , q_{iK}] with q_{ik} = P(y_i = k | Z, W, x_i), and 1_k is a K-dimensional vector that contains 1 in the k-th dimension and 0 elsewhere. Here α(Z) again indexes all the instances involved in the constraints.

With the W_map solution to (13), we then find the MAP solution of the cluster labels Y from P(Y | Z, W_map, X). This is done in two cases. For the instances that are not involved in the constraints, the MAP of Y is simply the most probable assignment under P(Y | W_map, X). For the instances involved in the constraints, we need to find

max_{Y_{α(Z)}} P(Y_{α(Z)} | Z, W_map, X_{α(Z)}) ∝ P(Z | Y_{α(Z)}) P(Y_{α(Z)} | W_map, X_{α(Z)}) .
4 Experiments
In this section, we empirically examine the effectiveness of the proposed method.
In particular, we aim to answer the following questions:
– Is the proposed Bayesian clustering model effective at finding good clustering
solutions with a small number of pairwise constraints?
– Is the proposed active clustering method more effective than state-of-the-art
active clustering approaches?
Fig. 1. Clustering F-Measure versus the number of queries (#Query) on the Fertility, Parkinsons, Crabs, Sonar, Balance, Transfusion, Letters-IJ, and Digits-389 datasets, comparing BC+tree, ITML+tree, ITML, MPCKmeans+tree, and MPCKmeans.
allows for “soft constraints”. For the parameter λ, which controls the covariance of the Gaussian prior, we experimented with λ ∈ {1, 10, 100} and found that λ = 10 works uniformly well across all datasets, so we fix it as the default value. For each dataset, we maintain S = 2dK samples of the posterior at every iteration.
– Info+BC: The proposed active clustering model with the Info criterion (7).
– Uncertain+BC: The proposed active clustering model with the Uncertain
criterion (6).
– NPU+ITML: The NPU active selection strategy combined with ITML.
– NPU+MPCKmeans: The NPU method with MPCKmeans.
– MinMax+ITML: The MinMax active learning method combined with ITML.
– MinMax+MPCKmeans: The MinMax approach combined with MPCKmeans.
Figure 2 reports the performance of all active clustering methods with
increasing number of queries. We see that both Info+BC and Uncertain+BC
improve the clustering very quickly as more constraints are obtained, and they
outperform all baselines on most datasets. Moreover, Info+BC seems to be more
effective than Uncertain+BC in most cases. We hypothesize this is because Info
reduces the uncertainty of the model, which might be more appropriate for
improving the MAP solution of clustering than decreasing the maximum uncer-
tainty of the pair labels as Uncertain does.
To avoid crowding Fig. 2, we did not present the passive learning results of
our method BC+tree as a baseline in the same figure. The comparison between
active learning and passive learning for our method can be done by comparing
Uncertain+BC and Info+BC in Fig. 2 with BC+tree in Fig. 1. We see that both
our active learning approaches produce better performance than passive learning
on most datasets, demonstrating the effectiveness of our pair selection strategies.
We also notice that the performance of NPU or MinMax highly depends on
the clustering method in use. With different clustering methods, their behav-
iors are very different. In practice, it can be difficult to decide which clustering
algorithm should be used in combination with the active selection strategies to
ensure good clustering performance. In contrast, our method unifies the cluster-
ing and active pair selection model, and the constraints are selected to explicitly
reduce the clustering uncertainty and improve the clustering performance.
                                Query Iteration
Dataset          10        20        30        40        50         60
Fertility      0.4/0.0   0.6/0.1   0.9/0.1   2.7/1.9   4.2/14.3   10.8/32.0
Parkinsons     0.1/0.0   0.0/0.0   0.5/0.0   0.8/0.3   0.9/0.6     1.7/1.7
Crabs          0.6/0.0   0.2/0.0   0.0/0.0   0.1/0.3   0.2/0.6     0.4/1.5
Sonar          0.7/0.0   0.2/0.0   0.4/0.1   0.5/0.2   0.5/0.2     0.6/0.2
Balance        0.0/0.0   0.3/0.0   1.7/0.0   2.6/0.0   3.3/0.1     2.9/0.0
Transfusion    0.3/0.0   1.3/0.0   2.4/0.0   2.3/0.0   4.6/0.0     4.9/0.1
Letters-IJ     0.0/0.0   0.2/0.0   0.3/0.0   0.2/0.0   0.5/0.0     0.7/0.0
Digits-389     0.0/0.0   0.0/0.0   0.1/0.0   0.1/0.0   0.0/0.0     0.3/0.0
In addition, during our experiments we found that, for both criteria, the difference between the maximum objective value and the objective of the finally selected pair is often negligible. So even when some high-ranking pairs are dropped due to the acyclic graph structure restriction, the selected pair is still very informative. Overall, this enforcement does not have any significant negative impact on the final clustering results. It is interesting to note that the results in Sec. 4.2 suggest that such a graph structure restriction can in some cases improve the clustering performance.
5 Related Work
Prior work on active clustering for pairwise constraints has mostly focused on the
neighbourhood-based method, where a neighbourhood skeleton is constructed to
partially represent the underlying clusters, and constraints are queried to expand
such neighbourhoods. Basu et al. [3] first proposed a two-phase method, Explore
and Consolidate. The Explore phase incrementally builds K disjoint neighbor-
hoods by querying instance pairwise relations, and the Consolidate phase iter-
atively queries random points outside the neighborhoods against the existing
neighborhoods, until a must-link constraint is found. Mallapragada et al. [17]
proposed an improved version, which modifies the Consolidate stage to query
the most uncertain points using a MinMax objective. As mentioned by Xiong
et al. [25], these methods often select a batch of constraints before perform-
ing clustering, and they are not designed for iteratively improving clustering by
querying new constraints, as considered in this work.
Wang and Davidson [23], Huang et al. [14], and Xiong et al. [25] studied active clustering in an iterative manner. Wang and Davidson introduced an active spectral clustering method that iteratively selects the pair that maximizes the expected error reduction of the current model. This method is, however, restricted to two-cluster problems. Huang et al. proposed an active document clustering method that iteratively finds a probabilistic clustering solution using a language model and selects the most uncertain pair to query. But this method is
limited to the task of document clustering. Xiong et al. considered a similar iterative framework to Huang et al., but they query the most uncertain data point against the existing neighbourhoods, as opposed to the most uncertain pair in [14]. Xiong et al. only provide a query selection strategy and require a clustering method to learn from the constraints. In contrast, our method is a unified clustering and active pair selection model.
Finally, there are other methods that use various criteria to select pairwise constraints. Xu et al. [26] proposed to select constraints by examining the spectral eigenvectors of the similarity matrix in the two-cluster scenario. Vu et al. [21] proposed to select constraints involving points in the sparse regions of a k-nearest-neighbour graph. The works in [1,12] used ensemble approaches to select constraints. The scenarios considered in these methods are less closely related to the setting studied in this paper.
6 Conclusion
In this work, we studied the problem of active clustering, where the goal is to
iteratively improve clustering by querying informative pairwise constraints. We
introduced a Bayesian clustering method that adopted a logistic clustering model
and a data-dependent prior which controls model complexity and encourages
large separations among clusters. Instead of directly computing the posterior of
the clustering model at every iteration, our approach maintains a set of sam-
ples from the posterior. We presented a sequential MCMC method to efficiently
update the posterior samples after obtaining a new pairwise constraint. We intro-
duced two information-theoretic criteria to select the most informative pairs to
query at every iteration. Experimental results demonstrated the effectiveness of
the proposed Bayesian active clustering method over existing approaches.
References
1. Al-Razgan, M., Domeniconi, C.: Clustering ensembles with active constraints. In:
Okun, O., Valentini, G. (eds.) Applications of Supervised and Unsupervised Ensem-
ble Methods. SCI, vol. 245, pp. 175–189. Springer, Heidelberg (2009)
2. Baghshah, M.S., Shouraki, S.B.: Semi-supervised metric learning using pairwise
constraints. In: IJCAI, pp. 1217–1222 (2009)
3. Basu, S., Banerjee, A., Mooney, R.J.: Active semi-supervision for pairwise con-
strained clustering. In: SDM, pp. 333–344 (2004)
4. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised
clustering. In: KDD, pp. 59–68 (2004)
5. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning
in semi-supervised clustering. In: ICML, pp. 81–88 (2004)
6. Dasgupta, S.: Analysis of a greedy active learning strategy. In: NIPS, pp. 337–344
(2005)
7. Davidson, I., Wagstaff, K.L., Basu, S.: Measuring constraint-set utility for parti-
tional clustering algorithms. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.)
PKDD 2006. LNCS (LNAI), vol. 4213, pp. 115–126. Springer, Heidelberg (2006)
8. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric
learning. In: ICML, pp. 209–216 (2007)
9. Gilks, W.R., Berzuini, C.: Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(1), 127–146 (2001)
10. Golovin, D., Krause, A., Ray, D.: Near-optimal Bayesian active learning with noisy observations. In: NIPS, pp. 766–774 (2010)
11. Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In:
NIPS, pp. 281–296 (2005)
12. Greene, D., Cunningham, P.: Constraint selection by committee: an ensemble app-
roach to identifying informative constraints for semi-supervised clustering. In: Kok,
J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron,
A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 140–151. Springer, Heidelberg
(2007)
13. Houlsby, N., Huszár, F., Ghahramani, Z., Lengyel, M.: Bayesian active learning for
classification and preference learning. CoRR abs/1112.5745 (2011)
14. Huang, R., Lam, W.: Semi-supervised document clustering via active learning with
pairwise constraints. In: ICDM, pp. 517–522 (2007)
15. Kschischang, F.R., Frey, B.J., Loeliger, H.A.: Factor graphs and the sum-product
algorithm. IEEE Trans. Inform. Theory 47(2), 498–519 (2001)
16. Lu, Z., Leen, T.K.: Semi-supervised clustering with pairwise constraints: a discrim-
inative approach. In: AISTATS, pp. 299–306 (2007)
17. Mallapragada, P.K., Jin, R., Jain, A.K.: Active query selection for semi-supervised
clustering. In: ICPR, pp. 1–4 (2008)
18. Neal, R.M.: Slice sampling. Annals of Statistics 31(3), 705–741 (2003)
19. Nelson, B., Cohen, I.: Revisiting probabilistic models for clustering with pairwise
constraints. In: ICML, pp. 673–680 (2007)
20. Shental, N., Bar-Hillel, A., Hertz, T., Weinshall, D.: Computing Gaussian mixture models with EM using equivalence constraints. In: NIPS, pp. 465–472 (2003)
21. Vu, V.V., Labroche, N., Bouchon-Meunier, B.: An efficient active constraint selec-
tion algorithm for clustering. In: ICPR, pp. 2969–2972 (2010)
22. Wagstaff, K.L., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering
with background knowledge. In: ICML, pp. 577–584 (2001)
23. Wang, X., Davidson, I.: Active spectral clustering. In: ICDM, pp. 561–568 (2010)
24. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.J.: Distance metric learning with
application to clustering with side-information. In: NIPS, pp. 505–512 (2002)
25. Xiong, S., Azimi, J., Fern, X.Z.: Active learning of constraints for semi-supervised
clustering. IEEE Trans. Knowl. Data Eng. 26(1), 43–54 (2014)
26. Xu, Q., desJardins, M., Wagstaff, K.L.: Active constrained clustering by examining
spectral eigenvectors. In: Hoffmann, A., Motoda, H., Scheffer, T. (eds.) DS 2005.
LNCS (LNAI), vol. 3735, pp. 294–307. Springer, Heidelberg (2005)
27. Yang, L., Jin, R., Sukthankar, R.: Bayesian active distance metric learning. In:
UAI, pp. 442–449 (2007)
28. Yu, S.X., Shi, J.: Segmentation given partial grouping constraints. IEEE Trans.
Pattern Anal. Mach. Intell. 26(2), 173–183 (2004)
ConDist: A Context-Driven Categorical
Distance Measure
1 Introduction
Distance calculation between objects is a key requirement for many data mining
tasks like clustering, classification or outlier detection [13]. Objects are described
by a set of attributes. For continuous attributes, the distance calculation is well understood, and the Minkowski distance is most commonly used [2]. For categori-
cal attributes, defining meaningful distance measures is more challenging since
the values within such attributes have no inherent order [4]. The absence of
additional domain knowledge further complicates this task.
However, several methods exist to address this issue. Some are based on simple approaches, such as checking categorical values for equality and inequality, or on creating a new binary attribute for each categorical value [2]. An obvious draw-
back of these two approaches is that they cannot reflect the degree of similarity
Table 1. Example data set which describes nine people with three categorical attributes. The attributes height and weight have natural orders, whereas the attribute haircolor has no natural order. Height and weight are correlated with each other, while the attribute haircolor is uncorrelated with the other two attributes. ConDist uses such correlations between attributes to extract relevant information for distance calculation.
Fig. 1. This figure shows the conditional probability distributions (CPDs) of context
attributes weight and haircolor, given the different values of the target attribute height
based on Table 1. W stands for weight, C for haircolor and H for height. ConDist uses
the differences between CPDs of context attributes to calculate the distance of target
attribute values. Thus, weight can be used to calculate a meaningful distance between
the values of height, while haircolor will yield the same distance for all three target
attribute values.
2 Related Work
ConDist compares two objects based on the distances between their respective attribute values (see Equation (1)). Each of these attributes is weighted differently by an individual weighting factor w_X. This
section explains why these weights wX are necessary and how they are calculated.
The weight wX is especially necessary for data sets in which some attributes
depend on each other, while others do not: refer back to the example in Table 1.
The attribute distance measures d_X (Section 3.2) and the weighting scheme w_X (Section 3.3) use a notion of correlation for categorical attributes as well as a correlation-related impact factor. Both are defined here.
and describes a correlation measure which is normalized to the interval [0, 1].
The quality of possible conclusions from the given attribute Y to the target
attribute X can differ from the quality of conclusions from given attribute X
to target attribute Y . This aspect is considered in the asymmetric correlation
function cor(X|Y ) and allows us to always extract the maximum amount of
useful information for each target attribute X.
impact_X(Y) = 2 · cor(X|Y) · (1 − 0.5 · cor(X|Y)) .   (8)
4 Experiments
This section presents an experimental evaluation of ConDist in the context of
classification and clustering. We compared ConDist with DILCA [6], JiaCheung [7], Quang [8], and several distance measures presented in [4], namely Eskin,
Gambaryan, Occurrence Frequency (OF) and Overlap. For DILCA, we used the
non-parametric approach DILCARR as described in [6] and for JiaCheung we
set the threshold parameter β to 0.1 as recommended by the authors [7].
Table 2. Characteristics of the data sets. The column Correlation contains the average
correlation between each pair of attributes in the data set, calculated by the function
cor(X|Y ), see Equation (6). The value ranges from 0 if no correlation exists to 1 if all
attributes are perfectly correlated. The data sets are separated into three subsets, from highly correlated to uncorrelated, based on their average correlation.
Clustering. The hierarchical WARD algorithm [14] is used to evaluate the per-
formance of ConDist in the context of clustering. ConDist and its competitors
are used to calculate the initial distance matrix as input for WARD. For sim-
plification, the clustering process is terminated when the number of clusters is
equal to the number of classes in the data sets. Performance is measured by Nor-
malized Mutual Information (NMI) [12] which ranges from 0 for poor clustering
to 1 for perfect clustering with respect to the predefined classes.
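A minimal Python sketch of this evaluation protocol, assuming a precomputed ConDist distance matrix (the function name is illustrative; note that applying Ward linkage to a non-Euclidean precomputed matrix is a heuristic, mirroring the setup described above):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import normalized_mutual_info_score

def ward_nmi(dist_matrix, class_labels):
    # dist_matrix: symmetric (n x n) matrix of pairwise distances, e.g. from ConDist.
    # SciPy's Ward linkage is formally defined for Euclidean distances; using an
    # arbitrary precomputed matrix here is a heuristic.
    condensed = squareform(dist_matrix, checks=False)
    Z = linkage(condensed, method="ward")
    k = len(np.unique(class_labels))              # terminate at #classes clusters
    pred = fcluster(Z, t=k, criterion="maxclust")
    return normalized_mutual_info_score(class_labels, pred)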
Data Sets. For the evaluation of ConDist the multivariate categorical data sets
for classification from the UCI machine learning repository [10] are chosen. We exclude data sets with fewer than 25 objects (e.g., Balloons) or mainly binary
                              Threshold θ
Data Set        0.00   0.01   0.02   0.03   0.05   0.10   0.20   0.50   1.00
Soybean Large   91.74  91.74  91.79  91.80  91.82  89.75  89.36  89.63  91.30
Lymphography    83.36  83.36  83.30  83.24  83.01  81.99  82.01  81.24  81.26
Hayes-Roth      68.11  68.36  68.51  68.60  69.21  64.47  64.47  64.47  64.47
TicTacToe       99.99  99.99  99.99  99.98  94.74  94.74  94.74  94.74  94.74
Balance-Scale   77.35  78.66  78.66  78.66  78.66  78.66  78.66  78.66  78.66
Car             88.98  90.56  90.56  90.56  90.56  90.56  90.56  90.56  90.56
Average         84.92  85.45  85.47  85.47  84.67  83.36  83.30  83.22  83.50
Data Set               ConDist  DILCA  Eskin  JiaCheung  Gambaryan   OF    Overlap  Quang
Teaching Assistant E.   49.85   50.68  48.79    49.54      49.44    39.16   45.84   44.48
Soybean Large           91.79   91.48  89.83    89.45      87.18    89.61   91.30   92.01
Breast Cancer W.        96.13   95.55  95.67    95.08      92.84    72.47   95.25   96.28
Dermatology             96.76   97.97  94.91    97.39      91.69    61.12   95.90   96.64
Lymphography            83.30   82.09  79.17    83.95      80.72    72.77   81.26   81.53
Breast Cancer           73.85   72.94  73.18    74.30      74.55    68.32   74.06   70.45
Audiology Standard      66.44   64.80  63.24    60.95      66.16    51.87   61.27   55.56
Hayes-Roth              68.50   67.59  46.71    68.27      60.84    58.71   61.74   71.19
Post-Operative Patient  69.62   68.22  68.36    67.28      69.69    69.44   68.59   68.69
TicTacToe               99.99   90.65  94.74    99.93      98.25    76.80   94.74   99.65
Car                     90.56   90.25  90.03    90.01      90.25    87.83   90.56   88.25
Nursey                  94.94   92.61  93.29    93.32      93.24    94.65   94.94   94.72
Monks                   94.50   90.76  87.29    87.34      86.61    98.67   94.50   96.66
Balance-Scale           78.66   78.43  78.66    78.65      77.13    78.54   78.66   77.51
Average                 82.49   81.00  78.85    81.10      79.90    72.85   80.62   80.97
5 Discussion
5.1 Experiment 1 – Context Attribute Selection
Table 3 shows that many useful context attributes are discarded if threshold θ is
too high. This is especially the case for weakly correlated data sets, e.g. Hayes-
Roth and TicTacToe. For Hayes-Roth, the decrease of classification accuracy is
observed for θ > 0.05, and for TicTacToe the decrease of classification accuracy
is already observed for θ > 0.02. In contrast to this, if the threshold θ is too
low, independent context attributes are added which may contribute noise to
the distance calculation. This is especially the case for uncorrelated data sets,
Data Set                 ConDist  DILCA  Eskin  JiaCheung  Gambaryan   OF   Overlap  Quang
Teaching Assistant Eva.   .078    .085   .085     .085       .085    .060    .044    .042
Soybean Large             .803    .785   .758     .735       .772    .805    .793    .778
Breast Cancer Wisconsin   .785    .557   .749     .656       .601    .217    .621    .798
Mushroom Extended         .597    .597   .317     .223       .597    .597    .597    .245
Mushroom                  .594    .594   .312     .594       .594    .312    .594    .241
Dermatology               .855    .946   .832     .879       .863    .292    .847    .859
Lymphography              .209    .303   .165     .207       .163    .243    .226    .320
Soybean Small             .687    .690   .687     .701       .692    .690    .689    .692
Breast Cancer             .063    .068   .031     .074       .001    .002    .100    .001
Audiology-Standard        .661    .612   .623     .679       .620    .439    .568    .582
Hayes-Roth                .017    .027   .004     .012       .007    .166    .006    .029
Post-Operative Patient    .043    .017   .018     .025       .017    .032    .019    .033
TicTacToe                 .087    .003   .003     .082       .085    .001    .033    .039
Monks                     .001    .000   .000     .000       .000    .081    .001    .003
Balance-Scale             .083    .036   .064     .067       .064    .064    .083    .036
Car                       .062    .036   .150     .150       .150    .062    .062    .036
Nursey                    .048    .006   .037     .037       .037    .098    .048    .006
Average                   .334    .315   .284     .306       .315    .245    .314    .279
For highly correlated data sets, distance measures using context attributes outperform other distance measures. However, for those data sets no single best distance measure can be identified among the context-based distance measures.
For uncorrelated data sets, previous context-based distance measures
(DILCA, Quang and JiaCheung) achieved inferior results in comparison to
ConDist and non-context based distance measures. This is because, e.g., DILCA
and Quang use only context attributes for the distance calculation which results
in random distances if all context attributes are uncorrelated.
In contrast, ConDist achieved acceptable results because not only correlated
context attributes, but also the target attributes are considered. This effect is
also illustrated by the comparison between ConDist and Overlap. ConDist is
equal to Overlap if no correlated context attributes can be identified, see uncor-
related data sets (Monks, Balance-Scale, Nursey and Car ) in Table 4. However,
for weakly- and highly-correlated data sets, ConDist’s consideration of context
attributes turns into an advantage, leading to better results than Overlap. The
improvement of ConDist can be statistically confirmed by the Wilcoxon Signed-
Ranks Test (see Table 5).
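A short sketch of such a paired comparison in Python, using the per-data-set classification accuracies of ConDist and Overlap from the table in Section 4 (ties are handled by scipy's default zero-difference treatment):

from scipy.stats import wilcoxon

# Classification accuracies per data set (ConDist vs. Overlap, from the table above).
condist = [49.85, 91.79, 96.13, 96.76, 83.30, 73.85, 66.44, 68.50, 69.62,
           99.99, 90.56, 94.94, 94.50, 78.66]
overlap = [45.84, 91.30, 95.25, 95.90, 81.26, 74.06, 61.27, 61.74, 68.59,
           94.74, 90.56, 94.94, 94.50, 78.66]

stat, p_value = wilcoxon(condist, overlap)   # paired, non-parametric test
print(stat, p_value)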
6 Summary
Categorical distance calculation is a key requirement for many data mining tasks.
In this paper, we proposed ConDist, an unsupervised categorical distance mea-
sure based on the correlation between the target attribute and context attributes.
With this approach, we aim to compensate for the lack of inherent orders within
categorical attributes by extracting statistical relationships from the data set.
Our experiments show that ConDist is a generally usable categorical distance
measure. In the case of correlated data sets, ConDist is comparable to existing
context based categorical distance measures and superior to non-context based
categorical distance measures. In the case of weakly and uncorrelated data sets,
ConDist is comparable to non-context based categorical distance measures and
superior to context based categorical distance measures. The overall improve-
ment of ConDist can be statistically confirmed in the context of classification. In
the context of clustering, this improvement could not be statistically confirmed.
References
1. Ahmad, A., Dey, L.: A method to compute distance between two categorical val-
ues of same attribute in unsupervised learning for categorical data set. Pattern
Recognition Letters 28(1), 110–118 (2007)
2. Alamuri, M., Surampudi, B.R., Negi, A.: A survey of distance/similarity measures
for categorical data. In: Proc. of IJCNN, pp. 1907–1914. IEEE (2014)
3. Au, W.H., Chan, K.C., Wong, A.K., Wang, Y.: Attribute clustering for grouping,
selection, and classification of gene expression data. IEEE/ACM Transactions on
Computational Biology and Bioinformatics (TCBB) 2(2), 83–101 (2005)
4. Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data:
A comparative evaluation. In: Proc. SIAM Int. Conference on Data Mining,
pp. 243–254 (2008)
5. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The
Journal of Machine Learning Research 7, 1–30 (2006)
6. Ienco, D., Pensa, R.G., Meo, R.: Context-Based Distance Learning for Categorical
Data Clustering. In: Adams, N.M., Robardet, C., Siebes, A., Boulicaut, J.-F. (eds.)
IDA 2009. LNCS, vol. 5772, pp. 83–94. Springer, Heidelberg (2009)
7. Jia, H., Cheung, Y.M.: A new distance metric for unsupervised learning of cate-
gorical data. In: Proc. of IJCNN, pp. 1893–1899. IEEE (2014)
8. Le, S.Q., Ho, T.B.: An association-based dissimilarity measure for categorical data.
Pattern Recognition Letters 26(16), 2549–2557 (2005)
9. Lehmann, E., Romano, J.: Testing Statistical Hypotheses, Springer Texts in Statis-
tics. Springer (2005)
10. Lichman, M.: Uci machine learning repository (2013). http://archive.ics.uci.edu/ml
11. Schmidberger, G., Frank, E.: Unsupervised Discretization Using Tree-Based Density
Estimation. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.)
PKDD 2005. LNCS (LNAI), vol. 3721, pp. 240–251. Springer, Heidelberg (2005)
12. Strehl, A., Ghosh, J.: Cluster ensembles–a knowledge reuse framework for combining
multiple partitions. The Journal of Machine Learning Research 3 (2003)
13. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to data mining. Pearson Addison
Wesley Boston (2006)
14. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. Journal of
the American statistical association 58(301), 236–244 (1963)
15. Yu, L., Liu, H.: Feature selection for high-dimensional data: A fast correlation-based
filter solution. ICML 3, 856–863 (2003)
Discovering Opinion Spammer Groups
by Network Footprints
1 Introduction
Online reviews of products and services are an increasingly important source of
information for consumers. They are valuable since, unlike advertisements, they
reflect the testimonials of other, “real” consumers. While many positive reviews
can increase the revenue of a business, negative reviews can cause substantial
loss. As a result of such financial incentives, opinion spam has become a critical
issue [17], where fraudulent reviewers fabricate spam reviews to unjustly promote
or demote (e.g., under competition) certain products and businesses.
Opinion spam is surprisingly prevalent; one-third of consumer reviews on the Internet¹ and more than 20% of reviews on Yelp² are estimated to be fake. Despite being widespread, opinion spam remains a mostly open and challenging problem for at least two main reasons: (1) humans are incapable of distinguishing fake reviews based on text [25], which renders manual labeling extremely difficult
1 http://www.nytimes.com/2012/08/26/business/book-reviewers-for-hire-meet-a-demand-for-online-raves.html
2 http://www.businessinsider.com/20-percent-of-yelp-reviews-fake-2013-9
and hence supervised methods inapplicable, and (2) fraudulent reviewers are
often professionals, paid by businesses to write detailed and genuine-looking
reviews.
Since the seminal work by Jindal and Liu [17], opinion spam has been the
focus of research for the last 7-8 years (Section 5). Most existing work aims to detect individual spam reviews [12,17,20,21,25,26] or spammers [1,11,13,
18,22,26]. However, fraud/spam is often a collective act, where the involved
individuals cooperate in groups to execute spam campaigns. This way, they can
increase total impact (i.e., dominate the sentiments towards target products via
flooding deceptive opinions), split total effort, and camouflage (i.e., hide their
suspicious behaviors by balancing workload so that no single individual stands
out). Surprisingly, however, only a few efforts aim to detect group-level opinion
spam [23,27,28]. Moreover, most existing work employs supervised techniques [12,17,20,25] and/or utilizes side information, such as behavioral [17,22,23,27,28]
or linguistic [12,25] clues of spammers. The former is inadmissible, due to the
difficulty in obtaining ground truth labels. The latter, on the other hand, is not
adversarially robust; the spammers can fine-tune their language (e.g., usage of
superlatives, self-references, etc.) and behavior (e.g., login times, IPs, etc.) to
mimic genuine users as closely as possible and evade detection.
In this work, we propose a new unsupervised and scalable approach for detect-
ing opinion spammer groups solely based on their network footprints. At its
heart, our method consists of two key components:
In a nutshell, while the former observation implies (i) local diversity of node
importances in the neighborhood of a node, the latter implies (ii) distributional
similarity between node importances at local (i.e., neighborhood) and global
level (i.e., at large). The importance of nodes in a network can be captured by their centrality. There exists a long list of work on developing various measures of node centrality [10]. In this work, we consider two different measures: degree and Pagerank centrality, which respectively use local and global information to quantify node importances.³
In the following, we describe how we utilize each of the above insights to
measure the network footprints of spammer groups in review networks. More
precisely, a review network is a bipartite graph G that consists of n reviewer
nodes connected to m product nodes through review relations.
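A brief Python sketch of computing these two centralities on such a bipartite review network with networkx (the input format and function name are illustrative):

import networkx as nx

def reviewer_centralities(reviews):
    # reviews: iterable of (reviewer_id, product_id) pairs describing who reviewed what.
    G = nx.Graph()
    for u, p in reviews:
        G.add_node(("reviewer", u), bipartite=0)
        G.add_node(("product", p), bipartite=1)
        G.add_edge(("reviewer", u), ("product", p))
    degree = dict(G.degree())          # local importance
    pagerank = nx.pagerank(G)          # global importance
    reviewers = [n for n, d in G.nodes(data=True) if d.get("bipartite") == 0]
    return ({n[1]: degree[n] for n in reviewers},
            {n[1]: pagerank[n] for n in reviewers})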
3 We compute these measures based on the reviewer–product bipartite review network.
NFS(i) = 1 − [ f(H_deg(i))² + f(H_pr(i))² + f(KL_deg(i))² + f(KL_pr(i))² ] / 4   (1)
4 We use Laplace smoothing for empty buckets.
Algorithm 1. ComputeNFS
 1  Input: Reviewer–Product graph G = (V, E), degree threshold η
 2  Output: Network Footprint Score (NFS) of products with degree ≥ η
 3  Compute centrality c of each reviewer in G, c ∈ {degree, Pagerank}
 4  Create a list of buckets k = {0, 1, . . .}
 5  foreach reviewer j in G, 1 ≤ j ≤ n do   // Compute global histogram Q
 6     if c = degree then
 7        place j into bucket_k with a·b^(k−1) ≤ c_j < a·b^k, (a = 3, b = 3)
 8     else place j into bucket_k with a·b^(k−1) ≥ c_j > a·b^k, (a = .3, b = .3)
 9  forall the non-empty buckets k = {0, 1, . . .} do
10     q_k = |bucket_k| / n
11  foreach product i with deg(i) ≥ η do
12     Create a list of buckets k = {0, 1, . . .}
13     foreach neighbor (reviewer) j of product i do
14        if c = degree then
15           place j into bucket_k^(i) with a·b^(k−1) ≤ c_j < a·b^k, (a = 3, b = 3)
16        else place j into bucket_k^(i) with a·b^(k−1) ≥ c_j > a·b^k, (a = .3, b = .3)
17     forall the non-empty buckets k = {0, 1, . . .} do   // local histogram P^(i)
18        p_k^(i) = |bucket_k^(i)| / deg(i)
19     Compute entropy H_c(i) based on the p_k^(i)'s
20     K' = number of uncommon buckets where q_k ≠ 0 and p_k^(i) = 0
21     forall the buckets k where q_k ≠ 0 do   // local smoothed histogram P^(i)
22        if p_k^(i) = 0 then p_k^(i) = 1/(deg(i) + K')   // Laplace smoothing
23        else p_k^(i) = (p_k^(i) · deg(i)) / (deg(i) + K')   // Re-normalize
24     Compute divergence KL_c(P^(i) ‖ Q) based on the p_k^(i)'s and q_k's
25     Compute NFS of i by Eq. (1) based on H_c(i) and KL_c(P^(i) ‖ Q)
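A compact Python sketch of the per-product quantities used above (log-binned histograms, their entropy, and the smoothed KL divergence); the function names are illustrative and only the degree-style binning of lines 7/15 is shown:

import numpy as np
from collections import Counter

def log_bin(values, a=3.0, b=3.0):
    # Value v falls into bucket k with a*b^(k-1) <= v < a*b^k; values below a go to bucket 0.
    v = np.asarray(values, dtype=float)
    k = np.zeros(len(v), dtype=int)
    big = v >= a
    k[big] = np.floor(np.log(v[big] / a) / np.log(b)).astype(int) + 1
    return k

def histogram(bucket_ids, n):
    # Normalized histogram over non-empty buckets (lines 9-10 and 17-18).
    return {k: c / n for k, c in Counter(bucket_ids).items()}

def entropy(p):
    probs = np.array(list(p.values()))
    return float(-np.sum(probs * np.log(probs)))

def kl_divergence(p, q, n_local):
    # KL(P || Q) after Laplace-smoothing the local histogram P over Q's support
    # (lines 20-23); n_local corresponds to deg(i).
    missing = [k for k in q if p.get(k, 0.0) == 0.0]
    Kp = len(missing)
    p_s = {k: (p.get(k, 0.0) * n_local + (1.0 if k in missing else 0.0)) / (n_local + Kp)
           for k in q}
    return float(sum(p_s[k] * np.log(p_s[k] / q[k]) for k in q))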
5 Rather than an arbitrary top k, we adopt the mixture-modeling approach in [14], where NFS values are modeled as drawn from two distributions; Gaussian for normal values and exponential for outliers. Model parameters and assignment of points to distributions are learned by the EM algorithm. Top products are then the ones whose NFS values belong to the exponential distribution.
Algorithm 2. GroupStrainer
1 Input: p × u adjacency matrix A of 2-hop network of top products with highest
NFS values, similarity lower bound sL (default 0.5)
2 Output: User groups U out and their hierarchical structure
3 Create a list of similarity thresholds S = {0.95, 0.90, . . . , sL }
4 Set k1 = u, U 1 := {1, 2, . . . , u} → {1, 2, . . . , u} // init reviewer groups
5 for T = 1 to |S| do
6 Estimate LSH parameters r and b for threshold S(T )
7 //Step 1. Generate signatures
8 Init signature matrix M[i][j] ∈ R^(rb × k_T)
9 if T = 1 then // use Jaccard similarity, generate min-hash signatures
10 for i = 1 to rb do
11 πi ← generate random permutation (1 . . . p)
12 for j = 1 to kT do M [i][j] ← minv∈Nj πi (v)
13 else // use cosine similarity, generate random-projection signatures
14 for i = 1 to rb do
15 ri ← pick a random hyperplane ∈ Rp×1
16 for j = 1 to kT do M [i][j] ← sign(rsum(A(:, U T =j)/|U T =j|) · ri )
17 //Step 2. Generate hash tables
18 for h = 1 to b do
19 for j = 1 to kT do hash(M [(h − 1) · r + 1:h · r][j])
20 //Step 3. Merge clusters from hash tables
21 Build candidate groups g’s: union of clusters that hash to at least one same
bucket in all hash tables, i.e. {ci , cj } ∈ g if hashh (ci ) = hashh (cj ) for ∃h
22 foreach candidate group g do
23 foreach ci , cj ∈ g, i = j do
24 if sim(vi , vj ) ≥ S(T ) then //merge
25 g = g\{ci , cj } ∪ {ci ∪ cj }
26 kT = kT − 1, U T (ci ) = U T (cj ) = min(U T (ci ), U T (cj ))
27 kT +1 = kT , U T +1 = U T
28 return U out = U |S|+1 , evolving groups {U 1 , . . . , U |S| } to build nested hierarchy
different similarity functions. In this paper, we use two: min-hashing for Jaccard
similarity and random-projection for cosine similarity.
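As an illustration of these building blocks, here is a small Python sketch of min-hash signatures and the standard banding trick that turns signature agreement into hash-bucket collisions (a generic LSH sketch, not the exact implementation of Algorithm 2):

import numpy as np
from collections import defaultdict

def minhash_signatures(columns, n_hashes, n_rows, rng=None):
    # columns: list of sets of row indices (binary columns of A).  The probability
    # that two signatures agree in one row equals the columns' Jaccard similarity.
    rng = np.random.default_rng() if rng is None else rng
    sigs = np.empty((n_hashes, len(columns)), dtype=int)
    for i in range(n_hashes):
        perm = rng.permutation(n_rows)
        for j, col in enumerate(columns):
            sigs[i, j] = min(perm[v] for v in col)
    return sigs

def lsh_candidates(sigs, r, b):
    # Band the (r*b)-row signature matrix into b bands of r rows; columns hashing to
    # the same bucket in at least one band become candidate pairs for verification.
    candidates = set()
    for h in range(b):
        buckets = defaultdict(list)
        for j in range(sigs.shape[1]):
            buckets[tuple(sigs[h * r:(h + 1) * r, j])].append(j)
        for members in buckets.values():
            for x in range(len(members)):
                for y in range(x + 1, len(members)):
                    candidates.add((members[x], members[y]))
    return candidates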
The details of our GroupStrainer method are given in Algorithm 2, which consists of
three main steps: (1) generate LSH signatures (Lines 7-16), (2) generate hash
tables (17-19), and (3) merge clusters using hash buckets (20-24).
In the first iteration of step (1), i.e. T = 1, the clusters consist of individual,
binary columns of A, represented by vj ’s, 1 ≤ j ≤ u. As the similarity measure,
we use Jaccard similarity which is high for those columns (i.e., reviewers) with
many exclusive common neighbors (i.e., products). Min-hashing is designed to
capture the Jaccard similarity between binary vectors; that is, it can be shown
that the probability that the min-hash values (signatures) of two binary vectors
agree is equal to their Jaccard similarity (Lines 9-12). For T > 1, the clusters
consist of multiple columns. This time, we represent each cluster j by a length-p
real-valued vector vj in which the entries denote the density of each row, i.e.,
4 Evaluation
We first evaluate the performance of our method on synthetic datasets, as com-
pared to several existing methods. Our method consists of two steps: NFS com-
putation and GroupStrainer. The former tries to capture the targeted products,
and the latter focuses on detecting spammer groups through their collusion on
the target products. As such, we design separate data generators for each step
to best simulate these scenarios. In addition, we apply our method on two real-
world review datasets and show through detailed case analyses that it detects
many suspicious user groups. A summary of the datasets is given in Table 1.
Dataset   SCI = 1.0    SCI = 0.9    SCI = 0.8    SCI = 0.7    SCI = 0.6    SCI = 0.5    SCI = 0.4
 = 0      1.000/0.95   1.000/0.70   1.000/0.65   1.000/0.65   0.997/0.65   1.000/0.60   0.948/0.55
 = 0.2    0.994/0.65   0.997/0.55   1.000/0.60   0.995/0.60   0.998/0.55   0.990/0.60   0.980/0.50
 = 0.4    0.993/0.50   1.000/0.55   0.993/0.55   0.998/0.55   0.994/0.55   0.988/0.55   0.980/0.50
 = 0.6    0.989/0.55   0.998/0.50   0.991/0.50   0.991/0.55   0.996/0.50   0.995/0.50   0.984/0.45
 = 0.8    0.984/0.50   0.987/0.50   0.989/0.50   0.993/0.50   0.977/0.45   0.991/0.50   0.976/0.45
for CatchSync. In contrast, our approach outperforms others and achieves near-
perfect accuracy on all settings (ranking the spammers on top). These results
demonstrate the effectiveness and robustness of NFS.
iTunes Amazon
ID #P #U t, Dup Developer #P #U t, Dup Category, Author
1 5 31 s, c 51/154 all same 10 20 c, c 90/138 Books, all same
2 8 38 c, s 29/202 2 same 4 12 s, c 32/47 Books, all same
3 4 61 s, c 34/144 all inaccessible 7 9 c, c 44/60 Books, all same
4 4 17 c, s 0/68 1 inaccessible 7 19 s, c 5/88 Books, all same
5 5 102 c, s 8/326 different 23 42 c, c 2/468 Music, all same
6 6 50 s, c 4/173 all same 8 17 s, c 9/73 Books, 4/8 same
7 2 56 c, c 12/112 different 6 18 s, c 4/94 Movies&TV, all same
8 4 42 c, c 8/112 2 same
9 6 67 s, c 0/137 all same
Fig. 2 illustrates the scatter plot of Hdeg vs. KLdeg for products in iTunes,
where outliers with large NFS values are circled (Hpr vs. KLpr looks similar).
The adjacency matrix A of the 2-hop induced subnetwork on the outlier products
is shown in Fig. 3 (top). While the groups are not directly evident here, the
GroupStrainer output clearly reveals various colluding user groups as shown in
Fig. 3 (bottom). Statistics and properties of the groups are listed in Table 4.
Fig. 2. Degree entropy H_deg vs. KL-divergence KL_deg of products (dots) in iTunes. Top-left: distribution of NFS scores; circled points: outlier products by NFS.

We note that group#1 is the same 31 users spamming 5 products of the same developer with all-5 reviews, as was previously found in [1]. Our method finds other suspicious user groups; e.g., group#2 consists of 8 products, each receiving all their reviews on the same day from a subset of 38 users.
Interestingly, while
the time-stamp is the same (concentrated), the ratings are a bit diversified (not
all 5s but 3&4s as well)—potentially for camouflage. This behavior is observed
among other groups; while some groups concentrate on both time and ratings
(c,c), e.g., all 5 in one day, most groups aim to diversify in one of the aspects.
We also note the duplicate reviews across reviewers in the same group.
Similar results are obtained on Amazon, as shown in Fig. 4. We also provide
the summary statistics and characteristics of the 7 colluding groups in Table 4.
Fig. 3. (top) 2-hop induced network of top-ranking products by NFS in iTunes, (bottom) output of GroupStrainer with 9 discovered colluding user groups.
We find that the majority of the targeted products in our groups belong to the
Books category. This is not a freak occurrence, as the media has reported that
the authors of books get involved in opinion spam to gain popularity (see URL
in footnote1 on pg. 1). For example, group#1 consists of 20 users spamming
10 books. 19/20 users write their reviews on the exact same day. 15/20 have duplicate reviews across products; in total, 90/138 reviews have at least one
copy. Our dataset listed the same author for 9/10 books. Manual inspection of
the authors revealed that the 10th book also belonged to the same author but
was mis-indexed by Amazon.
5 Related Work
Opinion spam is one of the new forms of Web-based spam, emerging as a result
of review-based sites (TripAdvisor, Yelp, etc.) gaining popularity. We organize
related work into two: (i) spotting individual spam reviewers or fake reviews,
and (ii) detecting group spammers.
separate views. Relational models have also been explored [21], which correlate
reviews written by the same users and from the same IPs. Similarly, behav-
ioral methods have been developed to identify individual spam reviewers [13,22].
Other approaches use association based rule mining of rating patterns [18], or
temporal analysis [11]. There also exist network-based methods that spot both
suspicious reviewers and reviews [1,26]. Those infer a suspiciousness score for
each node/edge in the user-product or user-review-product network.
Detecting Group Spam. There exist only a few efforts that focus on group-level
opinion spam detection [23,27,28]. This is counter-intuitive, since spam/fraud
is often an organized act in which the involved individuals cooperate to reduce
effort and response time, increase total impact, and camouflage so that no sin-
gle individual stands out. All related work in this category define group-level
or pairwise spam indicators to score reviewer groups by suspiciousness. Their
indicators are several behavioral and linguistic features. In contrast, our work
solely uses network footprints without relying on side information (e.g., language,
time-stamps, etc.) that can be fine-tuned by spammers.
6 Conclusion
We proposed an unsupervised and scalable approach for spotting spammer
groups in online review sites solely based on their network footprints. Our
method consists of two main components: (1) NFS; a new measure that quan-
tifies the statistical distortions of well-studied properties in the review network,
and (2) GroupStrainer; a hierarchical clustering method that chips off colluding
groups from a subnetwork induced on target products with high NFS values.
We validated the effectiveness of our method on both synthetic and real-world
datasets, where we detected various groups of users with suspicious colluding
behavior.
Acknowledgments. The authors thank the anonymous reviewers for their useful
comments. This material is based upon work supported by the ARO Young Investigator
Program under Contract No. W911NF-14-1-0029, NSF CAREER 1452425, IIS 1408287
and IIP1069147, a Facebook Faculty Gift, an R&D grant from Northrop Grumman
Aerospace Systems, and Stony Brook University Office of Vice President for Research.
Any conclusions expressed in this material are of the authors’ and do not necessarily
reflect the views, either expressed or implied, of the funding parties.
References
1. Akoglu, L., Chandy, R., Faloutsos, C.: Opinion fraud detection in online reviews
by network effects. In: ICWSM (2013)
2. Akoglu, L., Faloutsos, C.: RTG: a recursive realistic graph generator using ran-
dom typing. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.)
ECML PKDD 2009, Part I. LNCS, vol. 5781, pp. 13–28. Springer, Heidelberg
(2009)
3. Akoglu, L., McGlohon, M., Faloutsos, C.: Oddball: spotting anomalies in weighted
graphs. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) AKDD 2010, Part
II. LNCS, vol. 6119, pp. 410–421. Springer, Heidelberg (2010)
4. Barabási, A.-L., Albert, R., Jeong, H.: Scale-free characteristics of random net-
works: the topology of the world-wide web. Physica A: Statistical Mechanics and
its Applications 281(1–4), 69–77 (2000)
5. Benczúr, A.A., Csalogány, K., Sarlós, T., Uher, M.: SpamRank - fully automatic link spam detection. In: AIRWeb, pp. 25–38 (2005)
6. Broder, A.: Graph structure in the web. Computer Networks 33(1–6), 309–320
(2000)
7. Chung, F.R.K., Lu, L.: The average distance in a random graph with given
expected degrees. Internet Mathematics 1(1), 91–113 (2003)
8. Dalvi, N., Domingos, P., Mausam, Sanghai, S., Verma, D.: Adversarial classifica-
tion. In: KDD, pp. 99–108 (2004)
9. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the inter-
net topology. In: SIGCOMM, pp. 251–262 (1999)
10. Faust, K.: Centrality in affiliation networks. Social Networks 19(2), 157–191 (1997)
11. Fei, G., Mukherjee, A., Liu, B., Hsu, M., Castellanos, M., Ghosh, R.: Exploiting
burstiness in reviews for review spammer detection. In: ICWSM (2013)
12. Feng, S., Banerjee, R., Choi, Y.: Syntactic stylometry for deception detection. In:
ACL (2012)
13. Feng, S., Xing, L., Gogar, A., Choi, Y.: Distributional footprints of deceptive prod-
uct reviews. In: ICWSM (2012)
14. Gao, J., Tan, P.-N.: Converting output scores from outlier detection algorithms to
probability estimates. In: ICDM (2006)
15. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hash-
ing. In: VLDB, pp. 518–529 (1999)
16. Jiang, M., Cui, P., Beutel, A., Faloutsos, C., Yang, S.: Catchsync: catching syn-
chronized behavior in large directed graphs. In: KDD, pp. 941–950 (2014)
17. Jindal, N., Liu, B.: Opinion spam and analysis. In: WSDM, pp. 219–230 (2008)
18. Jindal, N., Liu, B., Lim, E.-P.: Finding unusual review patterns using unexpected
rules. In: CIKM, pp. 1549–1552. ACM (2010)
19. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert
space. Contemporary Mathematics, vol. 26, pp. 189–206 (1984)
20. Li, F., Huang, M., Yang, Y., Zhu, X.: Learning to identify review spam. In: IJCAI
(2011)
21. Li, H., Chen, Z., Liu, B., Wei, X., Shao, J.: Spotting fake reviews via collective
positive-unlabeled learning. In: ICDM (2014)
22. Mukherjee, A., Kumar, A., Liu, B., Wang, J., Hsu, M., Castellanos, M., Ghosh,
R.: Spotting opinion spammers using behavioral footprints. In: KDD (2013)
23. Mukherjee, A., Liu, B., Glance, N.S.: Spotting fake reviewer groups in consumer
reviews. In: WWW (2012)
24. Newman, M.: Power laws, Pareto distributions and Zipf’s law. Contemporary
Physics 46(5), 323–351 (2005)
25. Ott, M., Choi, Y., Cardie, C., Hancock, J.T.: Finding deceptive opinion spam by
any stretch of the imagination. In: ACL, pp. 309–319 (2011)
26. Wang, G., Xie, S., Liu, B., Yu, P.S.: Review graph based online store review spam-
mer detection. In: ICDM, pp. 1242–1247 (2011)
27. Xu, C., Zhang, J.: Combating product review spam campaigns via multiple het-
erogeneous pairwise features. In: SDM. SIAM (2015)
28. Xu, C., Zhang, J., Chang, K., Long, C.: Uncovering collusive spammers in Chinese
review websites. In: CIKM, pp. 979–988. ACM (2013)
Gamma Process Poisson Factorization for Joint
Modeling of Network and Documents
1 Introduction
Social networks and other relational datasets often involve a large number of
nodes N with sparse connections between them. If the relationship is symmetric,
it can be represented compactly using a binary symmetric adjacency matrix
B ∈ {0, 1}^(N×N), where b_ij = b_ji = 1 if and only if nodes i and j are linked.
Often, the nodes in such datasets are also associated with “side information,”
such as documents read or written, movies rated, or messages sent by these nodes.
Integer-valued side information is commonly observed and can be naturally represented by a count matrix Y ∈ Z^(D×V), where Z = {0, 1, . . .}. For example, B
may represent a coauthor network and Y may correspond to a document-by-word
count matrix representing the documents written by all these authors. In another
example, B may represent a user-by-user social network and Y may represent a
user-by-item rating matrix that adds nuance and support to the network data.
Incorporating such side information can result in better community identification.
The negative binomial (NB) distribution m ∼ NB(r, p), with probability mass function (PMF) Pr(M = m) = Γ(m + r) / (m! Γ(r)) · p^m (1 − p)^r for m ∈ Z, can be augmented into a gamma-Poisson construction as m ∼ Pois(λ), λ ∼ Gamma(r, p/(1 − p)), where the gamma distribution is parameterized by its shape r and scale p/(1 − p). It can also be augmented under a compound Poisson representation as m = Σ_{t=1}^{l} u_t, u_t ∼iid Log(p), l ∼ Pois(−r ln(1 − p)), where u ∼ Log(p) is the logarithmic distribution [17]. Consequently, we have the following lemma.

Lemma 1. The conditional posterior of l given m and r is

Pr(l = j | m, r) = Γ(r) / Γ(m + r) · |s(m, j)| r^j ,   j = 0, 1, · · · , m,   (1)

where |s(m, j)| are unsigned Stirling numbers of the first kind. We denote this conditional posterior as (l | m, r) ∼ CRT(m, r), a Chinese restaurant table (CRT) count random variable, which can be generated via l = Σ_{n=1}^{m} z_n, z_n ∼ Bernoulli(r/(n − 1 + r)).
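A short sketch in Python of drawing a CRT variable exactly as constructed above (the function name is illustrative):

import numpy as np

def sample_crt(m, r, rng=None):
    # l = sum_{n=1}^{m} z_n with z_n ~ Bernoulli(r / (n - 1 + r)).
    rng = np.random.default_rng() if rng is None else rng
    n = np.arange(1, m + 1)
    return int(np.sum(rng.random(m) < r / (n - 1 + r)))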
Lemma 2. Let X = Σ_{k=1}^{K} x_k, x_k ∼ Pois(ζ_k) ∀k, and ζ = Σ_{k=1}^{K} ζ_k. If (y_1, · · · , y_K | X) ∼ Mult(X, ζ_1/ζ, · · · , ζ_K/ζ) and X ∼ Pois(ζ), then the following holds:

P(X, x_1, · · · , x_K) = P(X, y_1, · · · , y_K).   (2)

Lemma 3. If x_i ∼ Pois(m_i λ), λ ∼ Gamma(r, 1/c), then x = Σ_i x_i ∼ NB(r, p), where p = (Σ_i m_i)/(c + Σ_i m_i).

Lemma 4. If x_i ∼ Pois(m_i λ), λ ∼ Gamma(r, 1/c), then

(λ | {x_i}, r, c) ∼ Gamma( r + Σ_i x_i , 1/(c + Σ_i m_i) ).   (3)
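A quick Monte Carlo sanity check of Lemma 3 (the numeric values below are arbitrary test values, not from the paper): the gamma-Poisson mixture should match the NB(r, p) moments, mean r·p/(1 − p) and variance r·p/(1 − p)².

import numpy as np

rng = np.random.default_rng(0)
r, c, m = 2.5, 3.0, np.array([1.0, 2.0, 0.5])
M = m.sum()
lam = rng.gamma(shape=r, scale=1.0 / c, size=200_000)   # lambda ~ Gamma(r, 1/c)
x = rng.poisson(np.outer(lam, m)).sum(axis=1)           # x = sum_i Pois(m_i * lambda)
p = M / (c + M)
print(x.mean(), r * p / (1 - p))                        # empirical vs. NB mean
print(x.var(), r * p / (1 - p) ** 2)                    # empirical vs. NB variance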
Such side information can also enable more effective modeling of both sparsity and the dependence between network interactions and side information.
A large number of discrete latent variable models for count matrix factorization can be united under Poisson factor analysis (PFA) [43], which factorizes a count matrix Y ∈ Z^(D×V) under the Poisson likelihood as Y ∼ Pois(ΦΘ), where Φ ∈ R_+^(D×K) is the factor loading matrix or dictionary and Θ ∈ R_+^(K×V) is the factor score matrix. A wide variety of algorithms, although constructed with different motivations and for distinct problems, can all be viewed as PFA with different prior distributions imposed on Φ and Θ. For example, non-negative matrix factorization [9,20], with the objective of minimizing the Kullback-Leibler divergence between the count matrix Y and its factorization ΦΘ, is essentially PFA solved with maximum likelihood estimation. LDA [6] is equivalent to PFA, in terms of both block Gibbs sampling and variational inference, if Dirichlet distribution priors are imposed on both φ_k ∈ R_+^D, the columns of Φ, and θ_k ∈ R_+^V, the columns of Θ. The gamma-
Poisson model [8,32] is PFA with gamma priors on Φ and Θ. A family of negative
binomial (NB) processes, such as the beta-NB [7,43] and gamma-NB processes
[41,42], impose different gamma priors on {θvk }, the marginalization of which
leads to differently parameterized NB distributions to explain the latent counts.
Both the beta-NB and gamma-NB process PFAs are nonparametric Bayesian
models that allow K to grow without limits [16].
J-GPPF models both Y and B using Poisson factorization. As discussed
in [1], Poisson factorization has several practical advantages over other factor-
ization methods that use Gaussian assumptions (e.g. in [22]). First, zero-valued
observations could be efficiently processed during inference, so the model can
readily accommodate large, sparse datasets. Second, Poisson factorization is a
natural representation of count data. Additionally, the model allows mixed mem-
bership across an unbounded number of latent communities using the gamma process as a prior. The authors in [4] also use Poisson factorization to model
a binary interaction matrix. However, this is a parametric model and a KL-
divergence based objective is optimized w.r.t. the latent factors without using
any prior information. To model the binary observations of the network matrix
B, J-GPPF additionally uses a novel Poisson-Bernoulli (PoBe) link, discussed in
detail in Section 3, that transforms the count values from the Poisson factoriza-
tion to binary values. A similar transformation has also been used in the BigCLAM model [37], which builds on the work of [4]. This model was later extended to
include non-network information in the form of binary attributes [38]. Neither
BigCLAM nor its extension allows non-parametric modeling or imposing prior
structure on the latent factors, thereby limiting the flexibility of the models and
making the obtained solutions more sensitive to initialization. The collaborative
topic Poisson factorization (CTPF) framework proposed in [15] solves a different
problem where the objective is to recommend articles to users of similar interest.
CTPF is a parametric model and variational approximation is adopted to solve
the inference.
x_nm = Σ_{k_B} x_{nmk_B}, x_{nmk_B} ∼ Pois(λ_{nmk_B}). Thus each interaction pattern contributes a count, and the total latent count aggregates the countably infinite interaction patterns.
Unlike the usual approach that links the binary observations to latent Gaus-
sian random variables with a logistic or probit function, the above approach links
the binary observations to Poisson random variables. Thus, this approach trans-
forms the problem of modeling binary network interaction into a count modeling
problem, providing several potential advantages. First, it is more interpretable
because ρkB and φkB are non-negative and the aggregation of different interac-
tion patterns increases the probability of establishing a link between two nodes.
Second, the computational benefit is significant since the computational com-
plexity is approximately linear in the number of non-zeros S_B in the observed binary adjacency matrix B. This benefit is especially pertinent in many real-world datasets where S_B is significantly smaller than N².
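A small illustrative Python sketch of this Poisson-Bernoulli view of the network: a link is present exactly when the aggregated latent count is nonzero, so P(b_nm = 1) = 1 − exp(−Σ_{k_B} λ_{nmk_B}). The rate λ_{nmk_B} = ρ_{k_B} φ_{nk_B} φ_{mk_B} assumed below is consistent with the Gibbs updates given later, but the code is only a sketch, not the paper's implementation.

import numpy as np

def sample_network(phi, rho, rng=None):
    # phi: (N, K) non-negative community affiliations; rho: (K,) community weights.
    rng = np.random.default_rng() if rng is None else rng
    rate = (phi * rho) @ phi.T              # rate[n, m] = sum_k rho_k phi_nk phi_mk
    p_link = 1.0 - np.exp(-rate)            # P(b_nm = 1) under the PoBe link
    B = (rng.random(rate.shape) < p_link).astype(int)
    B = np.triu(B, k=1)                     # keep upper triangle, no self-links
    return B + B.T                          # symmetric adjacency matrix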
To model the matrix Y, its (d, w)-th entry y_dw is generated as:

y_dw ∼ Pois(ζ_dw),   ζ_dw = Σ_{k_Y} ζ_{Y,dwk_Y} + ε Σ_{k_B} ζ_{B,dwk_B},
ζ_{Y,dwk_Y} = r_{k_Y} θ_{dk_Y} β_{wk_Y},   ζ_{B,dwk_B} = ρ_{k_B} Σ_n Z_{nd} φ_{nk_B} ψ_{wk_B},

where Z_nd ∈ {0, 1} and Z_nd = 1 if and only if author n is one of the authors of paper d. One can consider ζ_dw as the affinity of document d for word w. This affinity is influenced by two different components, one of which comes from the network modeling. Without the contribution from network modeling, the joint model reduces to a gamma process Poisson matrix factorization model, in which the matrix Y is factorized in such a way that y_dw ∼ Pois(Σ_{k_Y} r_{k_Y} θ_{dk_Y} β_{wk_Y}). Here, Θ ∈ R_+^(D×∞) is the factor score matrix, β ∈ R_+^(V×∞) is the factor loading matrix (or dictionary), and r_{k_Y} signifies the weight of the k_Y-th factor. The number of latent factors, possibly smaller than both D and V, is inferred from the data.
In the proposed joint model, Y is also determined by the users participating
in writing the d-th document. We assume that the distribution over word counts for a document is a function of both its topic distribution and the charac-
teristics of the users associated with it. In the author-document framework, the
authors employ different writing styles and have expertise in different domains.
For example, an author from machine learning and statistics would use words like
“probability”, “classifiers”, “patterns”, “prediction” more often than an author
with an economics background. Frameworks such as author-topic model [23,30]
were motivated by a related concept. In the user-rating framework, the entries
in Y are also believed to be influenced by the interaction network of the users.
Such influence of the authors is modeled by the interaction of the authors in the
latent communities via the latent factors φ ∈ R_+^(N×∞) and ψ ∈ R_+^(V×∞), which
encodes the writing style of the authors belonging to different latent communi-
ties. Since an infinite number of network communities is maintained, each entry
y_dw is assumed to come from an infinite dimensional interaction. ρ_{k_B} signifies the interaction strength corresponding to the k_B-th network community. The contri-
butions of the interaction from all the authors participating in a given document
are accumulated to produce the total contribution from the networks in gener-
ating ydw . Since B and Y might have different levels of sparsity and the range
of integers in Y can be quite large, a parameter is required to balance the
contribution of the network communities in dictating the structure of Y. A low
value of forces disjoint modeling of B and Y, while a higher value implies
joint modeling of B and Y where information can flow both ways, from network
discovery to topic discovery and vice-versa. We present a thorough discussion of
the effect of in Section 4.1. To complete the generative process, we put Gamma
priors over σn , ςd , cB , cY and as:
Sampling of $\phi_{nk_B}$, $\rho_{k_B}$, $\theta_{dk_Y}$, $r_{k_Y}$ and ε: Sampling of these parameters follows from Lemma 4 and is given as follows:
$$(\phi_{nk_B}\mid -) \sim \mathrm{Gamma}\!\left(a_B + x_{n\cdot k_B} + y_{\cdot n\cdot k_B},\ \frac{1}{\sigma_n + \rho_{k_B}\big(\phi^{-n}_{k_B} + \epsilon|Z_n|\big)}\right), \qquad (12)$$
$$(\rho_{k_B}\mid -) \sim \mathrm{Gamma}\!\left(\frac{\gamma_B}{K_B} + x_{\cdot\cdot k_B} + y_{\cdot\cdot\cdot k_B},\ \frac{1}{c_B + \sum_n \phi_{nk_B}\phi^{-n}_{k_B} + \epsilon\sum_n |Z_n|\phi_{nk_B}}\right), \qquad (13)$$
$$(\theta_{dk_Y}\mid -) \sim \mathrm{Gamma}\!\left(a_Y + y_{d\cdot k_Y},\ \frac{1}{\varsigma_d + r_{k_Y}}\right), \qquad (r_{k_Y}\mid -) \sim \mathrm{Gamma}\!\left(\frac{\gamma_Y}{K_Y} + y_{\cdot\cdot k_Y},\ \frac{1}{c_Y + \theta_{\cdot k_Y}}\right), \qquad (14)$$
$$(\epsilon\mid -) \sim \mathrm{Gamma}\!\left(f_0 + \sum_{k=1}^{K_B} y_{\cdot\cdot\cdot k},\ \frac{1}{g_0 + q_0}\right), \qquad q_0 = \sum_{k=1}^{K_B}\rho_{k_B}\sum_{n=1}^{N}|Z_n|\phi_{nk_B}. \qquad (15)$$
The sampling of parameters φnkB and ρkB exhibits how information from the
count matrix Y influences the discovery of the latent network structure. The latent
counts from Y impact the shape parameters for both the posterior gamma distri-
bution of φnkB and ρkB , while Z influences the corresponding scale parameters.
Sampling of $\psi_{k_B}$: Since $y_{dnwk_B} \sim \mathrm{Pois}(\epsilon\rho_{k_B}\phi_{nk_B}\psi_{wk_B})$, using Lemma 2 we have $(y_{\cdot\cdot wk_B})_{w=1}^{V} \sim \mathrm{Mult}\big(y_{\cdot\cdot\cdot k_B}, (\psi_{wk_B})_{w=1}^{V}\big)$. Since the Dirichlet distribution is conjugate to the multinomial, the posterior of $\psi_{k_B}$ is also a Dirichlet distribution.
As discussed in Section 2.3, N-GPPF can be considered as the gamma process infinite edge partition model (EPM) proposed in [40], which is shown to model assortative networks well but not necessarily disassortative ones. Using the techniques developed in [40] to capture community-community interactions, it is relatively straightforward to extend J-GPPF to model disassortative networks. Another special case of J-GPPF appears when only the count matrix Y is modeled without using the contribution from the network matrix B. The generative model of C-GPPF is given in Table 3.
In Fig. 1(a), we show the computation time for generating one million samples from Gamma, Dirichlet (of dimension 50), multinomial (of dimension 50) and truncated Poisson distributions using the samplers available from the GNU Scientific Library (GSL) on an Intel 2127U machine with 2 GB of RAM and 1.90 GHz of processor base frequency. To highlight the average complexity of sampling from the Dirichlet and multinomial distributions, we further display another plot where the computation time is divided by 50 for these samplers only. One can see that, to draw one million samples, our implementation of the sampler for the truncated Poisson distribution takes the longest, though the difference from the Gamma sampler in GSL is not that significant.
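For reference, a zero-truncated Poisson draw can be obtained with a simple rejection loop; the sketch below is our own generic illustration (not the GSL-based sampler used for the timings above), exact but potentially slow for very small rates, which is one plausible reason a truncated Poisson draw can be the slowest of the samplers compared here.

```python
import numpy as np

def zero_truncated_poisson(rate, rng=None):
    """Draw x ~ Pois(rate) conditioned on x >= 1 by rejection (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    while True:
        x = rng.poisson(rate)
        if x > 0:
            return x
```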
4 Experimental Results
4.1 Experiments with Synthetic Data
We generate a synthetic network of size 60 × 60 (B) and a count data matrix of size 60 × 45 (Y). Each user in the network writes exactly one document, and a user and the corresponding document are indexed by the same row-index in B and Y respectively. To evaluate the quality of reconstruction in the presence of side-information and little network structure, we hold out 50% of the links and an equal number of non-links from B. This is shown in Fig. 1(b), where the links are represented by brown, the non-links by green, and the held-out data by deep blue. Clearly, the network consists of two groups. $Y \in \{0, 5\}^{60\times 45}$, shown in Fig. 1(c), is also assumed to have the same structure as B, where the zeros are represented by deep blue and the non-zeros by brown. The
performance of N-GPPF is displayed in Fig. 2(a). Evidently, there is not much structure visible in the discovered partition of B from N-GPPF, and that is reflected in the poor value of AUC in Fig. 3(a). The parameter ε, when fixed at a given value, plays an important role in determining the quality of reconstruction for J-GPPF. As ε → 0, J-GPPF approaches the performance of N-GPPF on B and we observe as poor a quality of reconstruction as in Fig. 2(a). When ε is increased and set to 1.0, J-GPPF departs from N-GPPF and performs much better in terms of both structure recovery and prediction on held-out data, as shown in Fig. 2(e) and Fig. 3(b). With ε = 10.0, perfect reconstruction and prediction are recorded, as shown in Fig. 2(i) and Fig. 3(c) respectively. In this synthetic example, Y is purposefully designed to reinforce the structure of B when most of its links and non-links are held out. However, in real applications, Y might not contain as much information, and the Gibbs sampler needs to find a suitable value of ε that can carefully glean information from Y.
There are a few more interesting observations from the experiment with synthetic data that characterize the behavior of the model and match our intuitions. In our experiment with synthetic data, KB = KY = 20 is used. Fig. 2(b) demonstrates the assignment of the users to the network communities, and Fig. 2(d) illustrates the assignment of the documents to the combined space of network communities and topics (with the network communities appearing before the topics in the plot). For ε = 0.001, we observe disjoint modeling of B and Y, with
two latent factors modeling Y and multiple latent factors modeling B without any clear assignment. As we increase ε, we start observing joint modeling of B and Y. For ε = 1.0, as Fig. 2(h) reveals, two of the network latent factors and two of the factors for count data together model Y, the contribution from the network factors being expectedly small. Fig. 2(f) shows how two of the exact same latent factors model B as well. Fig. 2(j) and Fig. 2(l) show how two of the latent factors corresponding to B dictate the modeling of both B and Y when ε = 10.0. This implies that the discovery of latent groups in B is dictated mostly by the information contained in Y. In all these cases, however, we observe perfect reconstruction of Y, as shown in Fig. 2(c), Fig. 2(g) and Fig. 2(k).
NIPS Authorship Network. This dataset contains the papers and authors
from NIPS 1988 to 2003. We took the 234 authors who published with the most
Fig. 3. (a) AUC with ε = 0.001, (b) AUC with ε = 1.0, (c) AUC with ε = 10.0.
other people and looked at their co-authors. After pre-processing and removing
words that appear less than 50 times, the number of users in the graph is 225 and
the total number of unique words is 1354. The total number of documents is 1165.
GoodReads Data. Using the Goodreads API, we collected a base set of users
with recent activity on the website. The friends and friends of friends of these
users were collected. Up to 200 reviews were saved per user, each consisting of
a book ID and a rating from 0 to 5. A similar dataset was used in [10]. After
pre-processing and removing words that appear less than 10 times, the number
of users in the graph is 84 and the total number of unique words is 189.
Fig. 4. (a) NIPS Data, (b) GoodReads Data.
5 Conclusion
We propose J-GPPF, which jointly factorizes the network adjacency matrix and the associated side information that can be represented as a count matrix. The model has the advantage of representing true sparsity in the adjacency matrix, in the latent group membership, and in the side information. We derived an efficient MCMC inference method, and compared our approach to several popular network algorithms that model the network adjacency matrix. Experimental results confirm the efficiency of the proposed approach in utilizing side information to improve the performance of network models.
References
1. Acharya, A., Ghosh, J., Zhou, M.: Nonparametric Bayesian factor analysis for dynamic count matrices. In: Proc. of AISTATS (to appear, 2015)
2. Airoldi, E.M., Blei, D.M., Fienberg, S.E., Xing, E.P.: Mixed membership stochastic
blockmodels. JMLR 9, 1981–2014 (2008)
27. Neal, R.M.: Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics (2000)
28. Palla, K., Ghahramani, Z., Knowles, D.A.: An infinite latent attribute model for
network data. In: Proc. of ICML, pp. 1607–1614 (2012)
29. Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-
based geolocation using language models on an adaptive grid. In: Proc. of EMNLP-
CoNLL, pp. 1500–1510 (2012)
30. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for
authors and documents. In: Proc. of UAI, pp. 487–494 (2004)
31. Salakhutdinov, R., Mnih, A.: Probabilistic matrix factorization. In: Proc. of NIPS
(2007)
32. Titsias, M.K.: The infinite Gamma-Poisson feature model. In: Proc. of NIPS (2008)
33. Walker, S.G.: Sampling the Dirichlet mixture model with slices. Communications
in Statistics Simulation and Computation (2007)
34. Wang, X., Mohanty, N., Mccallum, A.: Group and topic discovery from relations
and their attributes. In: Proc. of NIPS, pp. 1449–1456 (2006)
35. Wen, Z., Lin, C.: Towards finding valuable topics. In: Proc. of SDM, pp. 720–731
(2010)
36. Wolpert, R.L., Clyde, M.A., Tu, C.: Stochastic expansions using continuous dic-
tionaries: Lévy Adaptive Regression Kernels. Annals of Statistics (2011)
37. Yang, J., Leskovec, J.: Overlapping community detection at scale: a nonnegative
matrix factorization approach. In: Proc. of WSDM, pp. 587–596 (2013)
38. Yang, J., McAuley, J.J., Leskovec, J.: Community detection in networks with node
attributes. In: Proc. of ICDM, pp. 1151–1156 (2013)
39. Yoshida, T.: Toward finding hidden communities based on user profile. J. Intell.
Inf. Syst. 40(2), 189–209 (2013)
40. Zhou, M.: Infinite edge partition models for overlapping community detection and
link prediction. In: Proc. of AISTATS (to appear, 2015)
41. Zhou, M., Carin, L.: Augment-and-conquer negative binomial processes. In: Proc.
of NIPS (2012)
42. Zhou, M., Carin, L.: Negative binomial process count and mixture modeling. IEEE
Trans. Pattern Analysis and Machine Intelligence (2015)
43. Zhou, M., Hannah, L., Dunson, D., Carin, L.: Beta-negative binomial process and Poisson factor analysis. In: Proc. of AISTATS (2012)
44. Zhu, J.: Max-margin nonparametric latent feature models for link prediction. In:
Proc. of ICML (2012)
Generalization in Unsupervised Learning
1 Introduction
In this setting, one needs to resort to worst-case bounds. However, this cannot be done without introducing additional assumptions about the behaviour of A. For example, if one assumes that A chooses its output from a class of functions $\mathcal{F}$ such that the class of loss random variables $\Lambda: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ induced by $\mathcal{F}$, i.e. $\Lambda = \ell \circ \mathcal{F}$, is uniformly upper bounded by $c < \infty$ and $\mathrm{VCdim}(\Lambda) = h < \infty$, then with probability at least $1 - \eta$ there is a uniform concentration of $R_{\mathrm{EMP}}(A_S)$ around $R(A_S)$:
$$R(A_S) \le R_{\mathrm{EMP}}(A_S) + \frac{\tau c}{2}\left(1 + \sqrt{1 + \frac{4 R_{\mathrm{EMP}}(A_S)}{\tau c}}\right), \qquad (3)$$
The analogy between our definition of uniform β–stability and the uniform β–stability in supervised learning can be explained as follows. The uniform β–stability in [5] is in terms of $\ell(A_S, z)$ and $\ell(A_{S^{\setminus i}}, z)$, where z = (x, y), x is the input vector, and y is its expected output (or true label). Note that $\ell(A_S, z)$ can be written as $\ell(f_S(x), y)$, where $f_S$ is the hypothesis learned by A using the training set S. Similarly, $\ell(A_{S^{\setminus i}}, z)$ can be written as $\ell(f_{S^{\setminus i}}(x), y)$. Observe that the difference between $\ell(f_S(x), y)$ and $\ell(f_{S^{\setminus i}}(x), y)$ is in the hypotheses $f_S$ and $f_{S^{\setminus i}}$. Note also that in supervised learning, the loss measures the discrepancy between the expected output y and the estimated output $\hat{y} = f_S(x)$. In our unsupervised learning setting, the expected output is not available, and the loss measures the reconstruction error between x and $y \equiv A_S(x)$. Hence, we replace $\ell(A_S, z)$ by $\ell(x, A_S(x))$, and $\ell(A_{S^{\setminus i}}, z)$ by $\ell(x, A_{S^{\setminus i}}(x))$ to finally obtain Definition 1.
Note that the uniform β–stability of A with respect to ℓ is complementary to the Lipschitz continuity condition on ℓ. If A is uniformly β–stable, then a slight change in the input will result in a slight change in the output, resulting in a change in the loss bounded by β. The following corollary upper bounds the quantity $R_{\mathrm{EMP}}(A_S) - R(A_S)$ using the uniform β–stability of A.
Discussion. The generalization bounds in (4) and (5) directly follow from Theorem 12 in [5] for the regression case. The reason we consider A under the regression framework is due to our characterization of unsupervised learning algorithms, in which we consider the output $y \in \mathbb{R}^k$ to be a re-representation of the input $x \in \mathbb{R}^d$. This, in turn, defines the form of the external loss as $\ell: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$. This characterization is very similar to the multivariate regression setting, and hence our reliance on Theorem 12 in [5]. Note that if $\beta \propto \frac{1}{n}$, then the bounds in (4) and (5) will be tight.
Corollary 1 is interesting in our context for a few reasons. First, it defines
a generalization criterion for unsupervised learning algorithms in general: if A
³ $\beta_n \propto \frac{1}{n} \implies \beta_n = \frac{\kappa}{n}$, for some constant $\kappa > 0$.
is uniformly β–stable with respect to ℓ on S, then the bounds in (4) and (5) hold with high probability. Note that the bound in (4) is tighter than the one in (3). Second, the bounds for $R_{\mathrm{EMP}}$ and $R_{\mathrm{LOO}}$ are very similar. Various works have reported that $R_{\mathrm{EMP}}$ is an optimistically biased estimate for R, while $R_{\mathrm{LOO}}$ is almost an unbiased estimate [5,8,14]. Therefore, an advantage of uniform β–stability is that this discrepancy is mitigated. This also shows that stability-based bounds are more suitable for studying algorithms whose empirical error remains close to the LOO error.
Third, this result also shows that, to be uniformly stable, a learning algorithm needs to significantly depart from the empirical risk minimization principle that emphasizes the minimization of $R_{\mathrm{EMP}}$. That is, a stable algorithm A might exhibit a larger error during training, but this would be compensated by a decrease in the complexity of the learned function. This characteristic is exactly what defines the effect of regularization. Therefore, the choice of uniform stability allows one to consider a large class of unsupervised learning algorithms, including those formulated as regularized minimization of an internal loss.
$$\forall\, S \in \mathcal{X}^n: \quad \max_{i=1,\dots,n}\big|R_{\mathrm{EMP}}(A_S) - R_{\mathrm{EMP}}(A_{S^{\setminus i}})\big| \le \beta_n. \qquad (6)$$
This states that for a uniformly $\beta_n$–stable algorithm with respect to ℓ on S, the change in the empirical loss due to the exclusion of one sample from S is at most $\beta_n$. In the finite sample setting, this becomes:
$$\max_{i=1,\dots,n}\left|\frac{1}{n}\sum_{j=1}^{n}\ell\big(x_j, A_S(x_j)\big) - \frac{1}{n-1}\sum_{\substack{j=1\\ j\neq i}}^{n}\ell\big(x_j, A_{S^{\setminus i}}(x_j)\big)\right| \le \beta_n. \qquad (7)$$
the data set S. Also, recall from Definitions 1 and 2 that if βn ∝ 1/n, then the
generalization bounds in (4) and (5) will hold with high probability. These two
requirements raise the need for two procedures; one to estimate βn at increasing
values of n, and another one to model the relation between the estimated βn ’s
and the values of n. However, to consider these two procedures for assessing
A’s generalization, we need to introduce a further mild assumption on A. In
particular, we need to assume that A does not change its learning mechanism
as the sample size is increasing from n to n + 1 for any n ≥ 1. Note that if
A changes its learning mechanism based on the sample size, then A can have inconsistent trends of βn with respect to n, which makes it infeasible to obtain consistent confidence bounds for $R_{\mathrm{EMP}}(A_S) - R(A_S)$. Therefore, we believe that
our assumption is an intuitive one, and is naturally satisfied by most learning
algorithms.
local minima, or A has a tendency to overfit the data, learning using all of X will obscure such traits.
Our proposed procedure for estimating βn, depicted in Algorithm 1, addresses the above issues in the following ways. First, it is based on repeated random subsampling (with replacement) from the original data set S, similar in spirit to bootstrapping [10]. Second, for each subsample of size $n_t$, the procedure obtains an estimate of the empirical loss before and after holding out one random sample. After repeating this subsampling process m times, with $m \ll n$, the procedure obtains one estimate of βn, denoted by $\beta_{n_t}$, for sample size $n_t$. Note that $\beta_{n_t}$ is the median of the $B_j$'s, which increases the robustness of the estimate. This process is repeated τ times, and the final output of Algorithm 1 is the set of $\beta_{n_t}$'s for the increasing values of $n_t$.
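The sketch below captures our reading of this procedure (Algorithm 1 itself is not reproduced here, so this is not the authors' implementation; `algorithm` and `loss` are assumed black-box callables and `X` an n × d array of samples). For each subsample size $n_t$ it repeats the train / hold-one-out / retrain loop m times and returns the median absolute change in empirical loss as the estimate $\beta_{n_t}$.

```python
import numpy as np

def estimate_beta(X, algorithm, loss, sizes, m=100, rng=None):
    """Estimate beta_{n_t} for each subsample size n_t in `sizes` (sketch of the procedure)."""
    rng = rng or np.random.default_rng()
    betas = []
    for nt in sizes:                                   # increasing subsample sizes n_t
        B = []
        for _ in range(m):                             # m << n repeated subsamples
            idx = rng.choice(len(X), size=nt, replace=True)
            S = X[idx]
            A_S = algorithm(S)                         # train A on the subsample
            R = np.mean([loss(x, A_S(x)) for x in S])
            i = rng.integers(nt)                       # hold out one random sample
            S_i = np.delete(S, i, axis=0)
            A_Si = algorithm(S_i)                      # retrain A without it
            R_i = np.mean([loss(x, A_Si(x)) for x in S_i])
            B.append(abs(R - R_i))
        betas.append(np.median(B))                     # robust estimate of beta_{n_t}
    return np.array(betas)
```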
The proposed procedure is computationally intensive, yet it is efficient, scal-
able, and provides control over the accuracy of the estimates. First, note that the
proposed procedure is not affected by the fact that A is an unsupervised learning
algorithm. If A is a supervised learning algorithm, then assessing its generalization through uniform β–stability results will still require 2τm calls to A, as is the case for the unsupervised setting discussed here. Thus, the procedure does
not impose a computational overhead given the absence of the expected output,
and the black box assumption on A. Second, considering scalability for large
data sets, the procedure can be fully parallelized on multiple core architectures
and computing clusters [13]. Note that in each iteration j the processing steps
for each subsample are independent from all other iterations, and hence all m
subsamples can be processed in an embarrassingly parallel manner. Note also
that in each iteration, AS (Xj ) and AS \i (Xj ) can also be executed in parallel.
Parameter m and the sizes of the subsamples, $n_1, n_2, \dots, n_\tau$, control the tradeoff between computational efficiency and estimation accuracy. These parameters are user-specified and depend on the data and problem at hand, its size n, A's complexity, and the available computational resources. Parameter m needs to be large enough to reduce the variance in $\{R_1, \dots, R_m\}$ and $\{R'_1, \dots, R'_m\}$. However, increasing m beyond a certain value will not increase the accuracy of the estimated empirical loss. Reducing the variance in $\{R_1, \dots, R_m\}$ and $\{R'_1, \dots, R'_m\}$, in turn, encourages reducing the variance in $\{B_1, \dots, B_m\}$. Note that for any random variable Z with mean μ, median ν, and variance σ², $|\mu - \nu| \le \sigma$ with probability one. Therefore, in practice, increasing m reduces the variance of the $B_j$'s, thereby reducing the difference $|\beta_{n_t} - \mathbb{E}(B_j)|$. Observe that the operator $\max_{i=1,\dots,n_t}$ defined $\forall S \in \mathcal{X}^{n_t}$ in (6) is now replaced with the estimate $\beta_{n_t}$.
The output of Algorithm 1 is the set B of estimated $\beta_{n_t}$'s for the increasing values of $n_t$. In order to assess the stability of A, we need to observe whether $\beta_{n_t} = \kappa/n_t$ for some constant κ > 0. As an example, Figure 1 shows the trend of βn for k–Means clustering and principal component analysis (PCA) on two
Fig. 1. Left: Two synthetic data sets, (a) two normally distributed clouds of points
with equal variance and equal priors, and (d) two moons data points with equal priors.
Middle: The estimated βn (blue circles) from Algorithm 1 for k–Means clustering on
the two synthetic data sets. The fitted stability lines are shown in magenta. The slope
of the stability lines is indicated by w. Right: The estimated βn and stability lines for
PCA on the two synthetic data sets. The dispersion of βn ’s around the stability line
is reflected in the norm of the residuals for the stability line (not displayed). Note the
difference in the dispersion of points around the stability line for k–Means and PCA.
Note also that the stronger structure in the two moons data set is reflected in a smaller w
(compared to w for the two Gaussians) for both algorithms.
synthetic toy data sets. The blue circles in the middle and right figures are the estimated βn from Algorithm 1.⁵ Observe that βn decreases as n increases. To formally detect and quantify this decrease, a line is fitted to the estimated βn (shown in magenta); i.e. $\beta(n_t) = w\,n_t + \zeta$, where w is the slope of the line and ζ is the intercept. We call this line the Stability Line. The slope of the stability line indicates its steepness, which is an estimate of the decreasing rate of βn. For stable algorithms, w < 0, and |w| indicates the degree of stability of the algorithm. Note that w = tan θ, where θ is the angle between the stability line and the
⁵ In these experiments, m = 100, and $n_1, n_2, \dots, n_\tau$ were set to 0.5n, 0.51n, ..., 0.99n, n. The loss for k–Means is the sum of L1 distances between each point and its nearest centre, and for PCA, $\ell = \mathrm{tr}(C)$, where C is the data's sample covariance matrix.
abscissa, and $-\pi/2 < \theta < \pi/2$. For $0 \le \theta < \pi/2$, A is not stable. For $-\pi/2 < \theta < 0$, if θ approaches 0, then A is a less stable algorithm, while if θ approaches $-\pi/2$, then A is a more stable algorithm. Observe that in this setting, β is a function of n and w, and hence it can be denoted by β(n, w). Plugging β(n, w) into the inequalities of Corollary 1 gives generalization bounds expressed in terms of w, which allows comparing two algorithms A1 and A2 (with stability-line slopes w1 and w2) on a common data source:
1. A1 is similar to A2, denoted A1 = A2, if w1 ≈ w2.
2. A1 is better than A2, denoted A1 ≻ A2, if w1 < w2.
3. A1 is worse than A2, denoted A1 ≺ A2, if w1 > w2.
Fig. 2. First Column: Generalization assessment for LEM on two Gaussians (a,c) and
two moons (e,g), with different number of nearest neighbours (nn) for constructing the
data’s neighbourhood graph. Compare the slopes (w) for the stability lines and the
dispersion of points around it, and note the sensitivity of LEM to the number of nn.
The same follows for the two moons case (e,g). Note also the difference in the stability
lines (slope, and dispersion of estimated βn ’s) for LEM and PCA on the same data
sets. Second Column: Generalization assessment for LLE on two Gaussians (b,d)
and two moons (f,h) data sets, with different number of nn.
The comparison of the slopes w1 and w2 is done using hypothesis testing for the equality of slopes:
$$H_0: w_1 = w_2 \quad \text{vs.} \quad H_1: w_1 \neq w_2.$$
The experiments were run on four data sets from different domains: (i) two faces data sets, AR and CMUPIE, with (samples × features) 3236 × 2900 and 2509 × 2900, respectively; (ii) two image features data sets, Coil20 and Multiple Features (MFeat), with (samples × features) 1440 × 1024 and 2000 × 649, respectively, from the UCI Repository for Machine Learning [18].⁷ In all these experiments, the number of bootstraps m was set to 100, and the values for $n_1, n_2, \dots, n_\tau$ were set to 0.5n, 0.51n, 0.52n, ..., 0.99n, n, where n is the original size of the data set.
To apply the proposed generalization assessment, an external loss ℓ needs to be defined for each problem. k–Means minimizes the sum of L2 distances between each point and its nearest cluster centre; thus, a suitable external loss can be the sum of L1 distances. Note that the number of clusters k is assumed to be
known. Note also that in this setting, for each iteration j in Algorithm 1, the
initial k centres are randomly chosen and they remain unchanged after holding
out the random sample. That is, k–Means starts from the same initial centres
before and after holding out one sample.
For PCA, LEM and LLE, the loss functions are chosen as follows: $\ell = \mathrm{tr}(C)$ for PCA, $\ell = \mathrm{tr}(LL^{\top})$ for LEM, and $\ell = \mathrm{tr}(WW^{\top})$ for LLE, where C is the
data’s sample covariance matrix, L is the Laplacian matrix defined by LEM,
and W is the weighted affinity matrix defined by LLE. The number of nearest
neighbours for constructing the neighbourhood graph for LEM and LLE was
fixed to 30 to ensure that the neighbourhood graph is connected. Note that we
did not perform any model selection for the number of nearest neighbours to
simplify the experiments and the demonstrations.
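As a hedged illustration of these choices (our own reading of the text, not the authors' code; function names are assumptions), the external losses can be written directly from the quantities each algorithm produces:

```python
import numpy as np

def pca_loss(X):
    """l = tr(C), with C the sample covariance matrix of the data X (n x d)."""
    return np.trace(np.cov(X, rowvar=False))

def lem_loss(L):
    """l = tr(L L^T), with L the graph Laplacian built by LEM from the k-nn graph."""
    return np.trace(L @ L.T)

def lle_loss(W):
    """l = tr(W W^T), with W the weighted affinity (reconstruction-weight) matrix from LLE."""
    return np.trace(W @ W.T)
```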
Figure 3 shows the stability lines for k–Means clustering on the four real data
sets used in our experiments. For both faces data sets, AR and CMUPIE, the
stability lines have similar slopes despite the different sample size. However,
the dispersion of points around the stability line is bigger for CMUPIE than
it is for AR. Hypothesis testing for the equality of slopes (at significance level
α = 0.05) did not reject H0 (p–value = 0.92). For Coil20 and UCI Mfeat, the
slopes of stability lines differ by one order of magnitude (despite the different
sample size). Indeed, the hypothesis test in this case rejected H0 with a very
small p–value. Note that the estimated βn ’s for the four data sets do not show
a clear trend as is the case for the two Gaussians and the two moons data sets
in Figure 1. This behaviour is expected from k–Means on real high dimensional
data sets, and is in agreement with what is known about its sensitivity to the
initial centres and its convergence to local minima. For a better comparison,
observe the stability lines for PCA on the same data sets in Figures 4 and 5.
⁷ The AR and CMUPIE face data sets were obtained from http://www.face-rec.org/databases/.
Fig. 4. Generalization assessment for PCA, LEM and LLE using stability analysis on
two faces data sets: AR (a,b,c), and CMUPIE (d,e,f).
Figures 4 and 5 show the stability lines for the three dimensionality reduc-
tion algorithms; PCA, LEM and LLE, on the four real data sets used in our
experiments. Note that the magnitude of w for these experiments should not
be surprising given the scale of n and βnt . It can be seen that PCA shows a
better trend of the estimated βn ’s than LEM and LLE (for our choice of fixed
neighbourhood size). This trend shows that PCA has better stability (and hence
better generalization) than LEM and LLE on these data sets. Note that in this
setting, the slope for PCA stability line cannot be compared to that of LEM
(nor LLE) since the loss functions are different. However, we can compare the
slopes for each algorithm stability lines (separately) on the face data sets and
on the features data sets.
Hypothesis testing (α = 0.05) for PCA stability lines on AR and CMUPIE
rejects H0 in favour of H1 with a p–value = 0.0124. For Coil20 and UCI Mfeat,
the test did not reject H0 and the p–value = 0.9. For LEM, the test did not
reject H0 for the slopes of AR and CMUPIE, while it did reject H0 in favour of
H1 for Coil20 and UCI MFeat. A similar behaviour was observed for LLE.
In these experiments and the previous ones on k–Means clustering, note that no comparison of two algorithms was carried out on the same data set.
Fig. 5. Generalization assessment for PCA, LEM and LLE using stability analysis on
Coil20 (a,b,c), and UCI MFeat (d,e,f).
5 Concluding Remarks
In this paper we proposed a general criterion for generalization in unsupervised
learning that is analogous to the prediction error in supervised learning. We
also proposed a computationally intensive, yet efficient procedure to realize this
criterion on finite data sets, and extended it for comparing two different algo-
rithms on a common data source. Our preliminary experiments showed that, for algorithms from three different unsupervised learning problems, the proposed framework provided a unified mechanism and a unified interface to assess their generalization.
Acknowledgments. We would like to thank our Reviewers for their helpful comments
and suggestions, the Alberta Innovates Centre for Machine Learning and NSERC for
their support, and Frank Ferrie for additional computational support at McGill’s Centre
for Intelligent Machines.
Appendix
Hypothesis testing for the equality of slopes w1 and w2 of two regression lines $Y_1 = w_1 X_1 + \zeta_1$ and $Y_2 = w_2 X_2 + \zeta_2$, respectively, proceeds as follows. Let $S_1 = \{(x_1^i, y_1^i)\}_{i=1}^{n_1}$ and $S_2 = \{(x_2^j, y_2^j)\}_{j=1}^{n_2}$ be the two data sets used for estimating the lines defined by $w_1$ and $w_2$, respectively. Let $\{\hat{y}_1^1, \dots, \hat{y}_1^{n_1}\}$ and $\{\hat{y}_2^1, \dots, \hat{y}_2^{n_2}\}$ be the estimated predictions from each regression line. The null and alternative hypotheses of the test are:
$$H_0: w_1 = w_2 \quad \text{vs.} \quad H_1: w_1 \neq w_2.$$
The test statistic is
$$t = \frac{\hat{w}_1 - \hat{w}_2}{\sqrt{s_{w_1}^2 + s_{w_2}^2}}, \qquad \text{where } s_{w_k}^2 = \frac{e_k}{\sigma_k^2 (n_k - 1)},$$
$e_k = \sum_{i=1}^{n_k}(y_k^i - \hat{y}_k^i)^2/(n_k - 2)$, and $\sigma_k^2 = \mathrm{Var}(X_k)$, which can be replaced by the sample variance. For significance level α, we compute the probability of observing the statistic t under $T_r$, Student's t distribution with $r = n_1 + n_2 - 4$ degrees of freedom, given that $H_0$ is true; this is the P value of the test. If $P > \alpha$, then $H_0$ cannot be rejected. Otherwise, reject $H_0$ in favour of $H_1$.
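A compact sketch of the full comparison pipeline, fitting the stability line for each algorithm and applying the slope-equality test above, might look as follows (our own implementation of this standard test, with assumed function names; it is not the authors' code):

```python
import numpy as np
from scipy import stats

def stability_line(n_values, betas):
    """Least-squares fit of beta(n_t) = w * n_t + zeta."""
    w, zeta = np.polyfit(n_values, betas, 1)
    preds = w * np.asarray(n_values) + zeta
    return w, zeta, preds

def slope_variance(n_values, betas, preds):
    """s^2_{w_k} = e_k / (sigma_k^2 (n_k - 1)), with e_k the residual variance."""
    n = len(betas)
    e = np.sum((np.asarray(betas) - preds) ** 2) / (n - 2)
    sigma2 = np.var(n_values, ddof=1)
    return e / (sigma2 * (n - 1))

def test_equal_slopes(n1, b1, n2, b2, alpha=0.05):
    """Two-sided test of H0: w1 = w2 with r = n1 + n2 - 4 degrees of freedom."""
    w1, _, p1 = stability_line(n1, b1)
    w2, _, p2 = stability_line(n2, b2)
    t = (w1 - w2) / np.sqrt(slope_variance(n1, b1, p1) + slope_variance(n2, b2, p2))
    r = len(b1) + len(b2) - 4
    p_value = 2 * stats.t.sf(abs(t), df=r)
    return w1, w2, p_value, p_value > alpha   # last value True means H0 cannot be rejected
```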
References
1. Anandkumar, A., Hsu, D., Kakade, S.: A method of moments for mixture models
and hidden Markov models. CoRR abs/1203.0683 (2012)
2. Balle, B., Quattoni, A., Carreras, X.: Local loss optimization in operator models:
a new insight into spectral learning. In: Proceedings of the 29th International
Conference on Machine Learning, pp. 1879–1886 (2012)
3. Bartlett, P., Mendelson, S.: Rademacher and Gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research 3, 463–482 (2003)
4. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for data rep-
resentation. Neural Computation 15, 1373–1396 (2003)
5. Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learn-
ing Research 2, 499–526 (2002)
6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal
of Machine Learning Research 7, 1–30 (2006)
7. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition.
Springer, New York (1996)
8. Devroye, L., Wagner, T.: Distribution-free inequalities for the deleted and holdout
error estimates. IEEE Transactions on Information Theory 25(2), 202–207 (1979)
9. Dietterich, T.: Approximate statistical tests for comparing supervised classification
learning algorithms. Neural Computation 10(7), 1895–1923 (1998)
10. Efron, B.: Bootstrap methods: another look at the jackknife. Annals of Statistics
7, 1–26 (1979)
11. Hansen, L., Larsen, J.: Unsupervised learning and generalization. In: Proceedings
of the IEEE International Conference on Neural Networks, pp. 25–30 (1996)
12. Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov
models. In: Proceedings of the 22nd Conference on Learning Theory (2009)
13. Jordan, M.: On statistics, computation and scalability. Bernoulli 19(4), 1378–1390
(2013)
14. Kearns, M., Ron, D.: Algorithmic stability and sanity-check bounds for leave-
one-out cross-validation. In: Proceedings of the Conference on Learning Theory,
pp. 152–162. ACM (1999)
15. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and
model selection. In: Proceedings of the 14th International Joint Conference on
Artificial Intelligence, pp. 1137–1143 (1995)
16. Kutin, S., Niyogi, P.: Almost-everywhere algorithmic stability and generalization
error. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence,
pp. 275–282 (2002)
17. Mukherjee, S., Niyogi, P., Poggio, T., Rifkin, R.: Learning theory: stability is suf-
ficient for generalization and necessary and sufficient for consistency of empirical
risk minimization. Advances in Computational Mathematics 25, 161–193 (2006)
18. Newman, D., Hettich, S., Blake, C., Merz, C.: UCI Repository of Machine Learning
Databases (1998). http://www.ics.uci.edu/∼mlearn/MLRepository.html
19. Saul, L., Roweis, S.: Think globally, fit locally: Unsupervised learning of low dimen-
sional manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
20. Shalev-Shwartz, S., Shamir, O., Srebro, N., Sridharan, K.: Learnability, stability
and uniform convergence. Journal of Machine Learning Research 11, 2635–2670
(2010)
21. Song, L., Boots, B., Siddiqi, S., Gordon, G., Smola, A.: Hilbert space embeddings
of hidden Markov models. In: Proceedings of the 27th International Conference on
Machine Learning, pp. 991–998 (2010)
22. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Sussex (1998)
23. Vapnik, V.N.: An overview of statistical learning theory. IEEE Transactions on
Neural Networks 10(5), 988–999 (1999)
24. Xu, H., Mannor, S.: Robustness and generalization. Machine Learning 86(3),
391–423 (2012)
25. Xu, L., White, M., Schuurmans, D.: Optimal reverse prediction: a unified perspec-
tive on supervised, unsupervised and semi-supervised learning. In: Proceedings of
the International Conference on Machine Learning, vol. 382 (2009)
Multiple Incomplete Views Clustering
via Weighted Nonnegative Matrix Factorization
with L2,1 Regularization
Abstract. With the advance of technology, data often have multiple modalities or come from multiple sources. Multi-view clustering provides a natural way for generating clusters from such data. Although
multi-view clustering has been successfully applied in many applications,
most of the previous methods assumed the completeness of each view
(i.e., each instance appears in all views). However, in real-world appli-
cations, it is often the case that a number of views are available for
learning but none of them is complete. The incompleteness of all the
views and the number of available views make it difficult to integrate all
the incomplete views and get a better clustering solution. In this paper,
we propose MIC (Multi-Incomplete-view Clustering), an algorithm based
on weighted nonnegative matrix factorization with L2,1 regularization.
The proposed MIC works by learning the latent feature matrices for
all the views and generating a consensus matrix so that the difference
between each view and the consensus is minimized. MIC has several
advantages comparing with other existing methods. First, MIC incorpo-
rates weighted nonnegative matrix factorization, which handles the miss-
ing instances in each incomplete view. Second, MIC uses a co-regularized
approach, which pushes the learned latent feature matrices of all the
views towards a common consensus. By regularizing the disagreement
between the latent feature matrices and the consensus, MIC can be easily
extended to more than two incomplete views. Third, MIC incorporates
L2,1 regularization into the weighted nonnegative matrix factorization,
which makes it robust to noise and outliers. Fourth, an iterative optimization framework is used in MIC, which is scalable and proven to converge.
Experiments on real datasets demonstrate the advantages of MIC.
1 Introduction
With the advance of technology, real data often have multiple modalities or come from multiple sources. Such data are called multi-view data. Different views may emphasize different aspects of the data. Integrating multiple views may help
improve the clustering performance. For example, one news story may be reported by different news sources; user groups can be formed based on users' profiles, online social connections, transaction histories, or credit scores in an online shopping recommendation system; and one patient can be diagnosed with a certain disease based on different measures, including clinical, imaging, immunologic, serological and cognitive measures. Different from traditional data with a single view, such multi-view data commonly have the following properties:
1. Each view can have its own feature sets, and each view may emphasize dif-
ferent aspects. Different views share some consistency and complementary
properties. For example, in an online shopping recommendation system, a user's credit score has numerical features while the user's online social connections provide graph relational features. The credit score emphasizes the creditworthiness of the user, while the social connections emphasize the social life of the user.
2. Each view may suffer from incompleteness. Due to the nature of the data
or the cost of data collection, each available view may suffer from incom-
pleteness of information. For example, not all the news stories are covered
by all the news sources, i.e., each news source (view) cannot cover all the
news stories. Thus, all the views are incomplete.
3. There may be an arbitrary number of sources. In some applications, the
number of available views may be small, while in other applications, it may
be quite large.
The above properties raise two fundamental challenges for clustering multi-view data: how to integrate views with heterogeneous feature spaces, and how to handle the incompleteness of an arbitrary number of views.
Multi-view clustering [1,7] provides a natural way for generating clusters from
such data. A number of approaches have been proposed for multi-view clustering.
Existing multi-view clustering algorithms can be classified into two categories
according to [28], distributed approaches and centralized approaches. Distributed
approaches, such as [4,15,28] first cluster each view independently from the oth-
ers, using an appropriate single-view algorithm, and then combine the individual
clusterings to produce a final clustering result. Centralized approaches, such as
[1,5,24,38] make use of multiple representations simultaneously to mine hid-
den patterns from the data. In this paper, we mainly focus on the centralized
approaches.
Most of the previous studies on multi-view clustering focus on the first challenge. They are all based on the assumption that all of the views are complete, i.e., each instance appears in all views. Few of them address how to deal with the second challenge. Recently, several methods have been proposed that work on the incompleteness of the views [26,32,34]. They either require the completeness of at
least one base view or cannot be easily extended to more than two incomplete
views. However, in real-world applications, it is often the case that more than
two views are available for learning and none of them is complete. For exam-
ple, in document clustering, we can have documents translated into different
languages representing multiple views. However, we may not get all the doc-
uments translated into each language. Another example is medical diagnosis.
Although multiple measurements from a series of medical examinations may be
available for a patient, it is not realistic to have each patient complete all the
potential examinations, which may result in the incompleteness of all the views.
The incompleteness of all the views and the number of available views make it
difficult to directly integrate all the incomplete views and get a better clustering
solution.
In this paper, we propose MIC (Multi-Incomplete-view Clustering) to han-
dle the situation of multiple incomplete views by integrating the joint weighted
nonnegative matrix factorization and L2,1 regularization. Weighted nonnegative
matrix factorization [20] is a weighted version of nonnegative matrix factorization
[25], and has been successfully used in document clustering [35] and recommen-
dation system [16]. L2,1 norm of a matrix was first introduced in [9] as rotational
invariant L1 norm. Because of its robustness to noise and outliers, the L2,1 norm has been widely used in many areas [11,13,18,21]. By integrating weighted nonnegative matrix factorization and the L2,1 norm, MIC tries to learn a latent subspace where the features of the same instance from different views are co-regularized towards a common consensus, while increasing the robustness of the learned latent feature matrices. The proposed MIC method has several advantages compared with other state-of-the-art methods.
In this section, we will briefly describe the problem formulation. Then the back-
ground knowledge on weighted nonnegative matrix factorization will be intro-
duced.
where each row of M represents the instance presence for one view. Most of the previous methods on multi-view clustering assume the completeness of all the views: every view contains all the instances, i.e., M is an all-one matrix with $\sum_{j=1}^{N} M_{i,j} = N,\ i = 1, 2, \dots, n_v$. However, in most real-world situations, one instance may only appear in some of the views, which may result in the incompleteness of all the views. For each view, the data matrix $X^{(i)}$ will have a number of rows missing, i.e., $\sum_{j=1}^{N} M_{i,j} < N,\ i = 1, 2, \dots, n_v$.
Our goal is to cluster all the N instances into K clusters by integrating all
the nv incomplete views.
Weighted nonnegative matrix factorization [20] aims to factorize the data matrix X into two nonnegative matrices, while giving different weights to the reconstruction errors of different entries. We denote the two nonnegative matrix factors as $U \in \mathbb{R}_{+}^{N\times K}$ and $V \in \mathbb{R}_{+}^{M\times K}$. Here K is the desired reduced dimension. To facilitate discussions, we call U the latent feature matrix and V the basis matrix. The objective function for general weighted nonnegative matrix factorization can be formulated as below:
$$\min_{U,V}\ \big\|W * \big(X - UV^{T}\big)\big\|_F^2, \quad \text{s.t. } U \ge 0,\ V \ge 0, \qquad (1)$$
where $*$ denotes the element-wise product.
3 Multi-Incomplete-View Clustering
In this section, we present the Multi-Incomplete-view Clustering (MIC) frame-
work. We model the multi-incomplete-view clustering as a joint weighted non-
negative matrix factorization problem with L2,1 regularization. The proposed
MIC learns the latent feature matrices for each view and pushes them towards
a consensus matrix. Thus, the consensus matrix can be viewed as the shared
latent feature matrix across all the views. In the following, we will first describe
the construction of the objective function for the proposed method and derive
the solution to the optimization problem. Then the whole MIC framework is
presented.
We are given $n_v$ views $\{X^{(i)} \in \mathbb{R}^{N\times d_i},\ i = 1, 2, \dots, n_v\}$, where each of the views suffers from incompleteness, i.e., $\sum_{j=1}^{N} M_{i,j} < N$. With more than two incomplete
views, we cannot directly apply the existing methods to the incomplete data.
One simple solution is to fill the missing instances with average features first, and
then apply the existing multi-view clustering methods. However, this approach
depends on the quality of the filled instances. For small missing percentages, the quality of the information contained in the filled average features may be good. However, when the number of missing instances increases, the quality of the information contained in the filled average features may be bad or even misleading. Thus, simply filling the missing instances will not solve this problem.
Borrowing a similar idea from weighted NMF, we introduce a diagonal weight matrix $W^{(i)} \in \mathbb{R}^{N\times N}$ for each incomplete view i by
$$W^{(i)}_{j,j} = \begin{cases} 1 & \text{if the } i\text{-th view contains the } j\text{-th instance, i.e., } M_{j,i} = 1,\\ w_i & \text{otherwise.}\end{cases}$$
Note that $W^{(i)}_{j,j}$ indicates the weight of the j-th instance in view i, and $w_i$ is the weight of the filled average-feature instances for view i. In our experiments, $w_i$ is defined as the percentage of the available instances for view i:
$$w_i = \frac{\sum_{j=1}^{N} M_{j,i}}{N}.$$
It can be seen that $W^{(i)}$ gives lower weights to the missing instances than to the observed instances in the i-th view. For different views with different incomplete rates, the weights for the missing instances are also different. The diagonal weight matrices give higher weights to the missing instances from views with lower incomplete rates.
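A small sketch of this construction (our own illustration; function and variable names are assumptions) fills the missing rows of one view with the average feature vector and builds the corresponding diagonal weights:

```python
import numpy as np

def fill_and_weight(X, mask):
    """X: N x d view with arbitrary values in missing rows; mask: length-N 0/1 vector."""
    w_i = mask.mean()                                  # fraction of observed instances
    avg = X[mask == 1].mean(axis=0)                    # average feature vector of observed rows
    X_filled = np.where(mask[:, None] == 1, X, avg)    # fill missing rows with the average
    W_diag = np.where(mask == 1, 1.0, w_i)             # diagonal of W^(i)
    return X_filled, W_diag
```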
A simple objective function to combine multiple incomplete views can be:
$$\min_{\{U^{(i)}\},\{V^{(i)}\}} O = \sum_{i=1}^{n_v} \big\|W^{(i)}\big(X^{(i)} - U^{(i)}V^{(i)T}\big)\big\|_F^2 \quad \text{s.t. } U^{(i)} \ge 0,\ V^{(i)} \ge 0,\ i = 1, 2, \dots, n_v, \qquad (2)$$
where U(i) and V(i) are the latent feature matrix and basis matrix for the i-th
view.
However, Eq. (2) only decomposes the different views independently without
taking advantages of the relationship between the views. In order to make use
of the relation between different views, we push the latent feature matrices for the different views towards a common consensus by adding an additional term R to Eq. (2) that minimizes the disagreement between the different views and the common consensus:
$$\min_{\{U^{(i)}\},\{V^{(i)}\},U^*}\ \sum_{i=1}^{n_v}\Big[\big\|W^{(i)}\big(X^{(i)} - U^{(i)}V^{(i)T}\big)\big\|_F^2 + \alpha_i R\big(U^{(i)}, U^*\big)\Big] \quad \text{s.t. } U^* \ge 0,\ U^{(i)} \ge 0,\ V^{(i)} \ge 0,\ i = 1, 2, \dots, n_v, \qquad (3)$$
where $U^* \in \mathbb{R}_{+}^{N\times K}$ is the consensus latent feature matrix across all the views, and $\alpha_i$ is the trade-off parameter between the reconstruction error and the disagreement between view i and the consensus. In this paper, we define R as the square of the Frobenius norm of the weighted difference between the latent feature matrices, i.e., $R(U^{(i)}, U^*) = \|W^{(i)}(U^{(i)} - U^*)\|_F^2$; together with the robust $L_{2,1}$ penalty $\beta_i\|U^{(i)}\|_{2,1}$ on each latent feature matrix, this yields the overall objective O in Eq. 4.
3.2 Optimization
In the following, we give the solution to Eq. 4. For the sake of convenience, we treat both $\alpha_i$ and $\beta_i$ as positive in the derivation, and denote $\tilde{W}^{(i)} = W^{(i)T}W^{(i)}$. Since Eq. 4 is minimized with respect to $\{U^{(i)}\}$, $\{V^{(i)}\}$ and $U^*$, we adopt an alternating scheme that optimizes over one set of variables while fixing the others.
Fixing $\{U^{(i)}\}$ and $\{V^{(i)}\}$, minimize O over $U^*$. With $\{U^{(i)}\}$ and $\{V^{(i)}\}$ fixed, we need to minimize the following objective function:
$$J(U^*) = \sum_{i=1}^{n_v} \alpha_i \big\|W^{(i)}\big(U^{(i)} - U^*\big)\big\|_F^2 \quad \text{s.t. } U^* \ge 0. \qquad (5)$$
We take the derivative of the objective function J in Eq. 5 over $U^*$ and set it to 0:
$$\frac{\partial J}{\partial U^*} = \sum_{i=1}^{n_v}\Big(2\alpha_i \tilde{W}^{(i)} U^* - 2\alpha_i \tilde{W}^{(i)} U^{(i)}\Big) = 0, \qquad (6)$$
which gives the closed-form update $U^* = \big(\sum_{i=1}^{n_v}\alpha_i \tilde{W}^{(i)}\big)^{-1}\sum_{i=1}^{n_v}\alpha_i \tilde{W}^{(i)} U^{(i)}$; since the $\tilde{W}^{(i)}$ are diagonal with positive entries and the $U^{(i)}$ are nonnegative, the resulting $U^*$ is nonnegative.
Fixing $U^*$, minimize O over $\{U^{(i)}\}$ and $\{V^{(i)}\}$. With $U^*$ fixed, the computation of $U^{(i)}$ and $V^{(i)}$ does not depend on $U^{(i')}$ or $V^{(i')}$ for $i' \neq i$. Thus for each view i, we need to minimize the following objective function:
$$\min_{U^{(i)},V^{(i)}}\ \big\|W^{(i)}\big(X^{(i)} - U^{(i)}V^{(i)T}\big)\big\|_F^2 + \alpha_i \big\|W^{(i)}\big(U^{(i)} - U^*\big)\big\|_F^2 + \beta_i\big\|U^{(i)}\big\|_{2,1} \quad \text{s.t. } U^{(i)} \ge 0,\ V^{(i)} \ge 0. \qquad (8)$$
We iteratively update $U^{(i)}$ and $V^{(i)}$ using the following multiplicative updating rules, repeating the two steps until the objective function in Eq. 8 converges.
(1) Fixing $U^*$ and $V^{(i)}$, minimize O over $U^{(i)}$. For each $U^{(i)}$, we need to minimize the following objective function:
$$J\big(U^{(i)}\big) = \big\|W^{(i)}\big(X^{(i)} - U^{(i)}V^{(i)T}\big)\big\|_F^2 + \alpha_i \big\|W^{(i)}\big(U^{(i)} - U^*\big)\big\|_F^2 + \beta_i\big\|U^{(i)}\big\|_{2,1} \quad \text{s.t. } U^{(i)} \ge 0. \qquad (9)$$
The derivative with respect to $U^{(i)}$ is
$$\frac{\partial J}{\partial U^{(i)}} = -2\tilde{W}^{(i)}X^{(i)}V^{(i)} + 2\tilde{W}^{(i)}U^{(i)}V^{(i)T}V^{(i)} + 2\alpha_i\tilde{W}^{(i)}U^{(i)} - 2\alpha_i\tilde{W}^{(i)}U^* + \beta_i D^{(i)}U^{(i)}. \qquad (10)$$
Here $D^{(i)}$ is a diagonal matrix with the j-th diagonal element given by
$$D^{(i)}_{j,j} = \frac{1}{\big\|U^{(i)}_{j,:}\big\|_2}, \qquad (11)$$
where $U^{(i)}_{j,:}$ is the j-th row of matrix $U^{(i)}$, and $\|\cdot\|_2$ is the $L_2$ norm.
Using the Karush-Kuhn-Tucker (KKT) complementarity condition [3] for the nonnegativity of $U^{(i)}$, we get
$$\Big(-2\tilde{W}^{(i)}X^{(i)}V^{(i)} + 2\tilde{W}^{(i)}U^{(i)}V^{(i)T}V^{(i)} + 2\alpha_i\tilde{W}^{(i)}U^{(i)} - 2\alpha_i\tilde{W}^{(i)}U^* + \beta_i D^{(i)}U^{(i)}\Big)_{j,k} U^{(i)}_{j,k} = 0. \qquad (12)$$
Based on this equation, we can derive the updating rule for $U^{(i)}$:
$$U^{(i)}_{j,k} \leftarrow U^{(i)}_{j,k}\,\frac{\Big(\tilde{W}^{(i)}X^{(i)}V^{(i)} + \alpha_i\tilde{W}^{(i)}U^*\Big)_{j,k}}{\Big(\tilde{W}^{(i)}U^{(i)}V^{(i)T}V^{(i)} + \alpha_i\tilde{W}^{(i)}U^{(i)} + 0.5\,\beta_i D^{(i)}U^{(i)}\Big)_{j,k}}. \qquad (13)$$
(2) Fixing $U^{(i)}$ and $U^*$, minimize O over $V^{(i)}$. For each $V^{(i)}$, we need to minimize the following objective function:
$$J\big(V^{(i)}\big) = \big\|W^{(i)}\big(X^{(i)} - U^{(i)}V^{(i)T}\big)\big\|_F^2 \quad \text{s.t. } V^{(i)} \ge 0. \qquad (14)$$
Based on this equation, we can derive the updating rule for $V^{(i)}$:
$$V^{(i)}_{j,k} \leftarrow V^{(i)}_{j,k}\,\frac{\Big(X^{(i)T}\tilde{W}^{(i)}U^{(i)}\Big)_{j,k}}{\Big(V^{(i)}U^{(i)T}\tilde{W}^{(i)}U^{(i)}\Big)_{j,k}}. \qquad (17)$$
It is worth noting that, to prevent $V^{(i)}$ from having arbitrarily large values (which may lead to arbitrarily small values of $U^{(i)}$), it is common to put a constraint on each basis matrix $V^{(i)}$ [14], s.t. $\|V^{(i)}_{:,k}\|_1 = 1,\ \forall\, 1 \le k \le K$. However, the updated $V^{(i)}$ may not satisfy the constraint. We therefore normalize each column of $V^{(i)}$ to unit $L_1$ norm and rescale the corresponding column of $U^{(i)}$ by the same factor, which keeps the accuracy of the approximation $X^{(i)} \approx U^{(i)}V^{(i)T}$.
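Putting the pieces together, a compact sketch of the resulting alternating updates (written from Eqs. 6, 13 and 17 above, with scalar α and β shared across views as in the experiments; it is our own illustration rather than the authors' implementation, and all names are assumptions) is:

```python
import numpy as np

def mic(X, mask, K, alpha=0.01, beta=0.01, iters=100, eps=1e-10, rng=None):
    """X[i]: N x d_i view with missing rows already filled (e.g. by average features);
    mask[i]: length-N 0/1 vector marking which instances view i actually contains."""
    rng = rng or np.random.default_rng(0)
    nv, N = len(X), X[0].shape[0]
    w  = [m.mean() for m in mask]                                        # w_i
    Wt = [np.where(m == 1, 1.0, wi) ** 2 for m, wi in zip(mask, w)]      # diag of W~(i)
    U  = [rng.random((N, K)) for _ in range(nv)]
    V  = [rng.random((Xi.shape[1], K)) for Xi in X]
    for _ in range(iters):
        # consensus U*: closed form from setting the gradient in Eq. 6 to zero
        num = sum(alpha * Wt[i][:, None] * U[i] for i in range(nv))
        den = sum(alpha * Wt[i] for i in range(nv))[:, None] + eps
        Us = num / den
        for i in range(nv):
            D = 1.0 / (np.linalg.norm(U[i], axis=1, keepdims=True) + eps)    # D(i), Eq. 11
            # multiplicative update for U(i), Eq. 13
            numU = Wt[i][:, None] * (X[i] @ V[i]) + alpha * Wt[i][:, None] * Us
            denU = Wt[i][:, None] * (U[i] @ V[i].T @ V[i]) \
                   + alpha * Wt[i][:, None] * U[i] + 0.5 * beta * D * U[i] + eps
            U[i] *= numU / denU
            # multiplicative update for V(i), Eq. 17
            numV = X[i].T @ (Wt[i][:, None] * U[i])
            denV = V[i] @ (U[i].T @ (Wt[i][:, None] * U[i])) + eps
            V[i] *= numV / denV
            # normalize columns of V(i) to unit L1 norm and rescale U(i) accordingly
            s = V[i].sum(axis=0, keepdims=True) + eps
            V[i] /= s
            U[i] *= s
    return Us, U, V   # run k-means on the consensus U* to obtain the final clustering
```

Since each $W^{(i)}$ is diagonal, $\tilde{W}^{(i)}$ is stored here as a length-N vector of squared weights rather than an N × N matrix, and the final clustering is obtained by running k-means on the returned consensus matrix $U^*$.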
We compare the proposed MIC method with several state-of-the-art methods. The details of the comparison methods are as follows:
– MIC: MIC is the clustering framework proposed in this paper, which applies weighted joint nonnegative matrix factorization with L2,1 regularization. If not stated otherwise, the co-regularization parameter set {αi} and the robust parameter set {βi} are all set to 0.01 for all the views throughout the experiments.
– Concat: Feature concatenation is one straightforward way to integrate all
the views. We first fill the missing instances with the average features for each
view. Then we concatenate the features of all the views, and run k-means
directly on this concatenated view representation.
– MultiNMF: MultiNMF [27] is one of the most recent multi-view clustering
methods based on joint nonnegative matrix factorization. MultiNMF added
constraints to original nonnegative matrix factorization that pushes cluster-
ing solution of each view towards a common consensus.
– ConvexSub: The subspace-based multi-view clustering method developed
by [17]. In the experiments, we set β = 1 for all the views. We run the
ConvexSub method using a range of γ values as in the original paper, and
present the best results obtained.
– PVC: Partial multi-view clustering [26] is one of the state-of-the-art multi-view clustering methods that deals with incomplete views. PVC works by establishing a latent subspace where the instances corresponding to the same example in different views are close to each other. In our experiments, we set the parameter λ to 0.01 as in the original paper.
– CGC: CGC [6] is the most recent work that deals with many-to-many instance relationships, which can be used in the situation of incomplete views. In order to run the CGC algorithm, for every pair of incomplete views, we generate the mapping between the instances that appear in both views. In the experiments, the parameter λ is set to 1 as in the original paper.
It is worth noting that MultiNMF and ConvexSub are two recent methods for multi-view clustering. Both of them assume the completeness of all the available views. PVC is among the first works that do not assume the completeness of any view. However, PVC can only work with two incomplete views. For the sake of comparison, all the views are considered with equal importance in the evaluation of all the multi-view algorithms. The results are evaluated by two metrics, the normalized mutual information (NMI) and the accuracy (AC). Since we use k-means to get the clustering solution at the end of the algorithm, we run k-means 20 times and report the average performance.
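For reproducibility, the two metrics can be computed as in the sketch below (our own helper, not the authors' evaluation script; it uses scikit-learn for NMI and the common Hungarian-matching definition of clustering accuracy):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """AC: best one-to-one matching of predicted clusters to true labels, then accuracy."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(cost)             # Hungarian matching
    return -cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred), clustering_accuracy(y_true, y_pred)
```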
4.2 Dataset
In this paper, three different real-world datasets are used to evaluate the pro-
posed method MIC. Among the three datasets, the first one is handwritten digit
data, the second one is text data, while the last one is flower image data. The
important statistics of them are summarized in Table 2.
– Handwritten Digit Data (Digit): This dataset consists of features of handwritten numerals (0–9) extracted from a collection of Dutch utility maps [12]. The following feature spaces (views) with different vector-based features are available for the numerals: (1) 76 Fourier coefficients of the character shapes, (2) 216 profile correlations, (3) 64 Karhunen-Loève coefficients, (4) 240 pixel averages in 2 × 3 windows, (5) 47 Zernike moments. All these features are conventional vector-based features but in different feature spaces.
– 3-Source Text Data (3Sources)¹: It is collected from three online news sources: BBC, Reuters, and The Guardian, where each news source can be seen as one view of the news stories. In total there are 948 news articles
covering 416 distinct news stories from the period February to April 2009.
Of these distinct stories, 169 were reported in all three sources, 194 in two
sources, and 53 appeared in a single news source. Each story was manually
annotated with one of the six topical labels: business, entertainment, health,
politics, sport, technology.
– Oxford Flowers Data (Flowers): The Oxford Flower Dataset is composed
of 17 flower categories, with 80 images for each category [30]. Each image is
described by different visual features using color, shape, and texture. In this
paper, we use the χ2 distance matrices for different flower features (color,
shape, texture) as three different views.
Both Digit and Flowers data are complete. We randomly delete instances from
each view to make the views incomplete. To simplify the situation, we delete
the same number of instances for all the views, and run the experiment under
different incomplete percentages from 0% (all the views are complete) to 50% (all
the views have 50% of their instances missing). It is also worth noting that 3Sources is naturally incomplete. Also, since PVC can only work with two incomplete views, in order to compare PVC with the other methods, we take any two of the three
all the three incomplete views. The statistics of 3Sources data are summarized
in Table 3.
4.3 Results
The results for Digit data and Flower data are shown in Figs. 1-4. We report
the results for various incomplete rates (from 0% to 50% with 10% as interval).
Table 4 contains the results for 3Sources data.
1
http://mlg.ucd.ie/datasets/3sources.html
Figure (Digit data): NMI and AC versus incomplete rate for MIC, Concat, ConvexSub, MultiNMF and CGC.
From Figs. 1 and 2 for the Digit data, we can see that the proposed MIC method outperforms all the other methods in all the scenarios, especially for relatively large incomplete rates (about 12% higher than the other methods in NMI and about 20% higher in AC for incomplete rates of 30% and 40%). It is worth noting that when the incomplete rate is 0, CGC is the second best method in both NMI and AC, and is very close to MIC. However, as the incomplete rate increases, the performance of CGC drops quickly. One of the possible reasons is that CGC works on similarity matrices/kernels; as the incomplete rate increases, the estimated similarity/kernel matrices are not accurate. Also, as the incomplete rate increases, fewer instance mappings between views are available. Combining these two factors, the performance of CGC drops for incomplete views. We can also observe that for incomplete views (incomplete rate > 0), MultiNMF gives the second best performance (still at least 5% lower in NMI and at least 8% lower in AC).
In Table 4, we can also observe that the proposed method outperforms all the other methods in both NMI and AC. MultiNMF and ConvexSub perform the best among the compared baselines.
From Figs. 3 and 4 for the Flowers data, we can observe that in most of the cases MIC outperforms all the other methods. It is worth noting that when all the views are complete, the performances of ConvexSub and MultiNMF are almost the same as MIC. As the incomplete rate increases, MIC starts to show advantages over the other methods. However, when the incomplete rate is too large (e.g., 50%), the performance of MIC is almost the same as ConvexSub and MultiNMF.
There are two sets of parameters in the proposed method: {αi}, the trade-off parameters between reconstruction error and view disagreement, and {βi}, the trade-off parameters between reconstruction error and robustness. Here we explore the effects of the view-disagreement trade-off parameter and the robustness trade-off parameter on clustering performance. We first fix {βi} to 0.01 and run MIC with various {αi} values (from 10−7 to 100). Then we fix {αi} to 0.01 and run MIC with various {βi} values (from 10−7 to 100). Due to the limit of space, we only report
Figure (Flowers data): NMI and AC versus incomplete rate for MIC, Concat, ConvexSub, MultiNMF and CGC.
the results on the 3Sources data with all the three views in Fig. 5. From Fig. 5, we can find that MIC achieves stably good performance when αi is around 10−2 and βi is between 10−5 and 10−1.
Fig. 5. NMI and AC on the 3Sources data (all three views) as functions of the αi value and the βi value.
Figure: objective function value and accuracy versus iteration, for the Digit data with 10% incomplete rate and for the 3Sources data with all three views.
The convergence behaviour is shown above for the Digit data with 10% incomplete rate and the 3Sources data using all the three views. The blue solid line shows the value of the objective function and the red dashed line indicates the accuracy of the method. As can be seen, for the Digit data the algorithm converges after 30 iterations. For the 3Sources data, the algorithm converges after fewer than 10 iterations.
5 Related Work
There are two areas of related work upon which the proposed model is built. Multi-view learning [2,22,29] is proposed to learn from instances which have
multiple representations in different feature spaces. Specifically, Multi-view clus-
tering [1,28] is most related to our work. For example, [1] developed and studied
partitioning and agglomerative, hierarchical multi-view clustering algorithms for
text data. [23,24] are among the first works proposed to solve the multi-view
clustering problem via spectral projection. Linked Matrix Factorization [33] is
proposed to explore clustering of a set of entities given multiple graphs. Recently,
[34] proposed a kernel based approach which allows clustering algorithms to be
applicable when there exists at least one complete view with no missing data.
As far as we know, [26,32] are the only two works that do not require the com-
pleteness of any view. However, both of the methods can only work with two
incomplete views.
Nonnegative matrix factorization [25] is the second area related to our work. NMF has been successfully used in unsupervised learning [31,36], and different variants have been proposed in the last decade. For example, [8] proposed a three-factor NMF with orthogonality constraints for a rigorous clustering interpretation. [19] introduced sparsity constraints on the latent feature matrix, which yields sparser latent representations. [20] proposed a weighted version of NMF, which gives different weights to different entries in the data. Recently, [6,27] proposed to use NMF to cluster data from multiple views/sources. However, they cannot deal with multiple incomplete views. The proposed MIC uses weighted joint NMF to handle the incompleteness of the views and maintains robustness by introducing the L2,1 regularization.
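As a point of reference, the weighted-NMF building block can be sketched as follows. This is a minimal, illustrative numpy sketch of standard weighted multiplicative updates in the spirit of [20], with a per-instance weight that down-weights missing instances; it omits MIC's consensus matrix and L2,1 term, and all names and the weight value are our own choices, not the authors' implementation.

    import numpy as np

    def weighted_nmf(X, observed, k, missing_weight=0.01, n_iter=200, eps=1e-9):
        """Weighted NMF for one view: approximate X ~ U V while down-weighting
        columns (instances) that are missing in this view.
        X: (d, n) nonnegative matrix; observed: boolean array of length n."""
        d, n = X.shape
        w = np.where(observed, 1.0, missing_weight)        # per-instance weights
        W = np.tile(w, (d, 1))                             # expand to per-entry weights
        rng = np.random.default_rng(0)
        U = rng.random((d, k)) + eps
        V = rng.random((k, n)) + eps
        for _ in range(n_iter):
            WX = W * X
            U *= (WX @ V.T) / ((W * (U @ V)) @ V.T + eps)  # weighted multiplicative update for U
            V *= (U.T @ WX) / (U.T @ (W * (U @ V)) + eps)  # weighted multiplicative update for V
        return U, V

MIC additionally couples the per-view latent matrices to a consensus matrix and uses an L2,1-type loss for robustness, which this sketch leaves out.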
6 Conclusion
In this paper, we study the problem of clustering data with multiple incomplete views, where each view suffers from missing instances. Based on weighted NMF, the proposed MIC method learns the latent feature matrices for all the incomplete views and pushes them towards a common consensus. To achieve this goal, we use a joint weighted NMF algorithm that learns the latent feature matrix for each view while minimizing the disagreement between the latent feature matrices and the consensus matrix. By giving the missing instances of each view lower weights, MIC minimizes the negative influence of the missing instances. It also maintains robustness to noise and outliers by introducing the L2,1 regularization. Extensive experiments conducted on three datasets demonstrate the effectiveness of the proposed MIC method on data with multiple incomplete views compared with other state-of-the-art methods.
References
1. Bickel, S., Scheffer, T.: Multi-view clustering. In: ICDM, pp. 19–26 (2004)
2. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training.
In: COLT, New York, NY, USA, pp. 92–100 (1998)
3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press,
New York (2004)
4. Bruno, E., Marchand-Maillet, S.: Multiview clustering: a late fusion approach using
latent models. In: SIGIR. ACM, New York (2009)
5. Chaudhuri, K., Kakade, S.M., Livescu, K., Sridharan, K.: Multi-view clustering
via canonical correlation analysis. In: ICML, New York, NY, USA (2009)
6. Cheng, W., Zhang, X., Guo, Z., Wu, Y., Sullivan, P.F., Wang, W.: Flexible and
robust co-regularized multi-domain graph clustering. In: SIGKDD, pp. 320–328.
ACM (2013)
7. de Sa, V.R.: Spectral clustering with two views. In: ICML Workshop on Learning
with Multiple Views (2005)
8. Ding, C., Li, T., Peng, W., Park, H.: Orthogonal nonnegative matrix
T-factorizations for clustering. In: SIGKDD, pp. 126–135. ACM (2006)
9. Ding, C., Zhou, D., He, X., Zha, H.: R1-PCA: rotational invariant L1-norm princi-
pal component analysis for robust subspace factorization. In: ICML, pp. 281–288.
ACM (2006)
10. Ding, W., Wu, X., Zhang, S., Zhu, X.: Feature selection by joint graph sparse
coding. In: SDM, Austin, Texas, pp. 803–811, May 2013
11. Du, L., Li, X., Shen, Y.: Robust nonnegative matrix factorization via half-quadratic minimization. In: ICDM, pp. 201–210 (2012)
12. Duin, R.P.: Handwritten-Numerals-Dataset
13. Evgeniou, A., Pontil, M.: Multi-task Feature Learning. Advances in Neural Infor-
mation Processing Systems 19, 41 (2007)
14. Févotte, C.: Majorization-minimization algorithm for smooth itakura-saito non-
negative matrix factorization. In: ICASSP, pp. 1980–1983. IEEE (2011)
15. Greene, D., Cunningham, P.: A matrix factorization approach for integrating mul-
tiple data views. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J.
(eds.) ECML PKDD 2009, Part I. LNCS, vol. 5781, pp. 423–438. Springer, Heidel-
berg (2009)
16. Gu, Q., Zhou, J., Ding, C.: Collaborative filtering: weighted nonnegative matrix
factorization incorporating user and item graphs. In: SDM. SIAM (2010)
17. Guo, Y.: Convex subspace representation learning from multi-view data. In: AAAI,
Bellevue, Washington, USA (2013)
18. Huang, H., Ding, C.: Robust tensor factorization using R1 norm. In: CVPR,
pp. 1–8. IEEE (2008)
19. Kim, J., Park, H.: Sparse Nonnegative Matrix Factorization for Clustering (2008)
20. Kim, Y., Choi, S.: Weighted nonnegative matrix factorization. In: International
Conference on Acoustics, Speech and Signal Processing, pp. 1541–1544 (2009)
21. Kong, D., Ding, C., Huang, H.: Robust nonnegative matrix factorization using
L21-norm. In: CIKM, New York, NY, USA, pp. 673–682 (2011)
22. Kriegel, H.P., Kunath, P., Pryakhin, A., Schubert, M.: MUSE: multi-represented
similarity estimation. In: ICDE, pp. 1340–1342 (2008)
23. Kumar, A., Daume III, H.: A co-training approach for multi-view spectral cluster-
ing. In: ICML, New York, NY, USA, pp. 393–400, June 2011
24. Kumar, A., Rai, P., Daumé III, H.: Co-regularized multi-view spectral clustering.
In: NIPS, pp. 1413–1421 (2011)
25. Lee, D., Seung, S.: Learning the Parts of Objects by Nonnegative Matrix Factor-
ization. Nature 401, 788–791 (1999)
26. Li, S., Jiang, Y., Zhou, Z.: Partial multi-view clustering. In: AAAI, pp. 1968–1974
(2014)
27. Liu, J., Wang, C., Gao, J., Han, J.: Multi-view clustering via joint nonnegative
matrix factorization. In: SDM (2013)
28. Long, B., Philip, S.Y., (Mark) Zhang, Z.: A general model for multiple view unsu-
pervised learning. In: SDM, pp. 822–833. SIAM (2008)
29. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training.
In CIKM, pp. 86–93. ACM, New York (2000)
30. Nilsback, M.-E., Zisserman, A.: A visual vocabulary for flower classification. In:
CVPR, vol. 2, pp. 1447–1454 (2006)
31. Shahnaz, F., Berry, M., Pauca, V.P., Plemmons, R.: Document Clustering Using
Nonnegative Matrix Factorization. Information Processing & Management 42(2),
373–386 (2006)
32. Shao, W., Shi, X., Yu, P.: Clustering on multiple incomplete datasets via collective
kernel learning. In: ICDM (2013)
33. Tang, W., Lu, Z., Dhillon, I.S.: Clustering with multiple graphs. In: ICDM, Miami,
Florida, USA, pp. 1016–1021, December 2009
34. Trivedi, A., Rai, P., Daumé III, H., DuVall, S.L.: Multiview clustering with incom-
plete views. In: NIPS 2010: Workshop on Machine Learning for Social Computing,
Whistler, Canada (2010)
35. Wang, D., Li, T., Ding, C.: Weighted feature subset non-negative matrix factor-
ization and its applications to document understanding. In: ICDM (2010)
36. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix
factorization. In: SIGIR, pp. 267–273 (2003)
37. Zhang, X., Yu, Y., White, M., Huang, R., Schuurmans, D.: Convex sparse coding,
subspace learning, and semi-supervised extensions. In: AAAI (2011)
38. Zhou, D., Burges, C.: Spectral clustering and transductive learning with multiple
views. In: ICML, pp. 1159–1166. ACM, New York (2007)
Solving a Hard Cutting Stock Problem
by Machine Learning and Optimisation
1 Introduction
The Cutting Stock Problem (CSP) [1] is a well-known NP-complete optimization
problem in Operations Research. It arises in many industrial applications; a standard example is a paper mill. The mill produces rolls of paper of
a fixed width, but its customers require rolls of a lesser width. The problem is
to decide how many original rolls to make, and how to cut them, in order to
meet customer demands. Typically, the objective is to minimise waste, which
is leftover rolls or pieces of rolls. The problem can be modelled and solved by
Integer Linear Programming (ILP), and for large instances column generation
can be used.
We are working with a company on an industrial project and have encoun-
tered a hard optimisation problem. The application is commercially sensitive so
we cannot divulge details, but the problem can be considered as a variant of the
CSP. (We shall refer to “rolls” and “paper mills” but in fact the problem origi-
nates from another industry.) In this CSP variant, the choice of cutting pattern
is semi-automated so the user has only partial control over it via a “request”. A
request is a vector of continuous variables so there are infinitely many possibil-
ities, and their effect on the choice is complex. There are multiple paper mills,
and each can use only one request. The rolls made by the mills are of different
sizes even before they are cut. For each mill, either all or none of its rolls are cut.
There are also demands to be met and costs to be minimised. For this paper the
interest is in the application of machine learning techniques (multivariate dis-
tribution approximation and cluster analysis) to reduce this infinite nonlinear
problem to a finite linear problem that can be solved by standard optimisation
methods.
This paper is structured as follows. First, in Section 2 the cutting stock problem is described. Second, in Section 3, we define the framework associated with the particularly difficult cutting stock problem treated in this paper. We then propose an Integer Linear Program for this problem in Section 4 and give a brief overview of an alternative metaheuristic approach. The machine learning approach for the problem is described in Section 5. The approaches are evaluated on a real-life application in Section 6. Finally, conclusions are presented in Section 7.
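The model referred to in the following paragraph is not reproduced in this excerpt; written out with the notation defined there, the standard one-dimensional CSP ILP reads (our reconstruction, consistent with those definitions):

    \[
    \min_{x} \sum_{i=1}^{n} c_i x_i
    \quad \text{s.t.} \quad
    \sum_{i=1}^{n} a_{ij} x_i \ge d_j \;\; \forall j \in M,
    \qquad x_i \in \mathbb{Z}_{\ge 0} \;\; \forall i.
    \]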
where M is the set of roll types and dj is the demand for type j. There are n
cutting patterns and x is a vector of decision variables which state how many
times each pattern is used. The number of rolls of type j generated by pattern i
is aij . The objective function is to minimize the total cost, where ci is the cost
associated with pattern i. The costs depend on the specifications of the problem.
For instance, for some problems, such as the model described above, the costs are associated with the patterns used (e.g., some cutting machines incur a certain cost), while for other problems the costs are associated with the amount of leftover material (typically called waste if it cannot be sold in future orders), and so on.
For a literature review of cutting stock problems we recommend [2].
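To make the "standard optimisation software" remark concrete, the model just described can be stated in a few lines with an off-the-shelf ILP modeller; the sketch below uses PuLP, and the pattern matrix, costs, and demands are made-up placeholders rather than data from this paper.

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum

    # Toy data (illustrative only): 3 cutting patterns, 2 roll types.
    a = [[2, 1], [0, 3], [1, 2]]   # a[i][j]: rolls of type j produced by pattern i
    c = [5.0, 4.0, 6.0]            # c[i]: cost of using pattern i once
    d = [7, 9]                     # d[j]: demand for roll type j

    prob = LpProblem("cutting_stock", LpMinimize)
    x = [LpVariable(f"x_{i}", lowBound=0, cat="Integer") for i in range(len(a))]

    prob += lpSum(c[i] * x[i] for i in range(len(a)))            # minimise total cost
    for j in range(len(d)):                                      # meet every demand
        prob += lpSum(a[i][j] * x[i] for i in range(len(a))) >= d[j]

    prob.solve()
    print([v.value() for v in x])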
Variants of the CSP have been studied. The above problem is one-dimensional
but two or three dimensions might be necessary [3]. The problem might be multi-
stage, involving further processing after cutting [4], or might be combined with
other problems, e.g. [5]. Additional constraints might be imposed because of user
requirements. Widths might be continuous though restricted to certain ranges
of values.
3 Problem Formalization
As mentioned above, our CSP variant has several extra difficulties compared with the standard CSP, which make it hard to model and solve. Instead of rolls of a fixed
    Σ_{i=1}^{n} v_i² = 1

s.t.

    Σ_{j=1}^{J} Σ_{r=1}^{R} b_j A(v^j, σ_r) ≥ d    (2)

    b_j ∈ {0, 1}   ∀j    (3)

    v^j ∈ [0, 1]^n   ∀j    (4)
where bj is a binary variable that is set to one iff mill j’s rolls are cut.
This problem is very hard to solve because there are infinitely many possible requests v^j, and because A is not a simple function. A metaheuristic approach is possible, searching in the space of b_j and v^j variable assignments with Σ_j V_j b_j
s.t.

    Σ_{k=1}^{K} x_{jk} = b_j   ∀j    (5)

    Σ_{j=1}^{J} Σ_{k=1}^{K} c_{jk} x_{jk} ≥ d    (6)
where bj = 1 indicates that all mill j’s rolls are cut, and xjk = 1 indicates that
they are cut using request k. If mill j is not selected then bj = 0 which forces
xjk = 0 for k = 1 . . . K.
This ILP can be solved by standard optimisation software. But to make this
approach practical we must first select a finite set of requests ujk that adequately
covers all possible requests. More precisely, the possible sets of cut rolls cjk must
be adequately covered. This requires the generation of a finite set of vectors that
approximately cover an unknown multivariate probability distribution.
In this section we explain our approach to the problem of covering the unknown
multivariate distribution in the CSAWCSP. An illustration of our approach is
shown in Figure 1.
In scatter plot (a) the circle represents the hypersphere of possible requests v, with a small random selection of them shown as dots. Plot (b) shows
the result of applying algorithm A to a mill’s rolls using the different v, to obtain
a small set of c vectors. The space of c vectors might have a very different shape
to that of the v, as shown. As a consequence, a small random set of v might
correspond to a very non-random small set of c vectors, showing the inadequacy
of merely sampling a few requests.
Instead we sample a large number of v as shown in plot (c), with their corre-
sponding c shown in plot (d): this represents the use of Monte Carlo simulation
to approximate the distribution of the c. We then select a small number of c
via a k-medoids algorithm to approximately cover the estimated distribution,
highlighted in plot (f). Finally we use a record of which v corresponds to which
c to derive the non-random set of v highlighted in plot (e). Next we describe
these phases of our approach in more detail.
[Fig. 1: panels (a)–(f) illustrating the request sampling, simulation, and medoid selection steps described above.]
Unlike the k-means algorithm, partitioning is done around medoids (or exem-
plars) rather than centroids. This is vital for our problem because we require
a small set of c that are each generated from some known v. A medoid mk is
a data point in a cluster Ck which is most similar to all other points in that
cluster.
A k-medoids algorithm seeks to minimize the function
    Σ_{k=1}^{K} Σ_{i∈C_k} d(x_i, m_k)    (9)
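A minimal sketch of the Monte Carlo sampling plus k-medoids selection described above is given below. It uses a plain alternating k-medoids with Euclidean distance; the paper's experiments instead use the CLARA implementation in R, and the stand-in for the black-box algorithm A is entirely hypothetical.

    import numpy as np

    def k_medoids(C, k, n_iter=50, seed=0):
        """Plain alternating k-medoids on the rows of C (the simulated c vectors)."""
        rng = np.random.default_rng(seed)
        D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)   # pairwise distances
        medoids = rng.choice(len(C), size=k, replace=False)
        for _ in range(n_iter):
            labels = np.argmin(D[:, medoids], axis=1)                # assign to nearest medoid
            new_medoids = medoids.copy()
            for j in range(k):
                members = np.where(labels == j)[0]
                if len(members):
                    # the medoid is the member minimising total distance within its cluster
                    new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        return medoids

    rng = np.random.default_rng(1)
    V = rng.random((1000, 4))          # sampled requests (count reduced for the sketch)
    C = V ** 2                         # stand-in for applying the black-box algorithm A
    idx = k_medoids(C, k=25)
    selected_requests = V[idx]         # each chosen medoid c maps back to a known request v

Because medoids are actual data points, each selected c carries its originating v, which is exactly the property the approach relies on.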
6 Empirical Study

For an empirical study of our approach, we compare the solutions obtained by our approach over a range of numbers of medoids k with the lower optimality bounds (explained below) of several instances. For this purpose we used real data from our industrial partner. The total volume of the raw material analyzed is 1191.3 m³ for 8 mills (J = 8), with each mill's rolls partitioned into a maximum of 4 different types of products. We generated and solved 20 instances of random demands on a 2.3 GHz Intel Core i7 processor. Monte Carlo simulation and
clustering were done in Java and R (using the CLARA algorithm) respectively
on a 3.0 GHz Intel Xeon Processor with 8 GB of RAM. For solving the integer
linear programming model, we used the CPLEX 12.6 solver with a time cut-off
of 1 hour.
To approximate the unknown multivariate distribution, we generated 10, 000
random request vectors for each mill. Then, we obtained the same number of
corresponding cutting patterns by applying the algorithm A (see Section 3). For
8 mills, this process resulted in a total of 80,000 cutting patterns. The total time
required for generating all the cutting patterns was 2 hours and 30 minutes. Note
that, in time-sensitive applications, a lower number of random cutting patterns
could be generated to reduce this overhead. Next we applied k-medoids clustering
to cover the distribution.
For evaluating the effect of the number of medoids used to cover the dis-
tribution, we varied k from 25 to 200 in steps of 25 units. The time taken to
generate the clusters can be found in Table 1. In addition, Figure 2 shows the
total clustering times (for all the mills). Note that the clustering times increase
exponentially from 40.54 seconds for k = 25, to 2, 651.08 seconds (∼ 44 min-
utes) for k = 200. Thus for real-life problems it is very important to make an
appropriate selection of the parameter k (especially in on-line applications).
Once all the medoids were obtained we used them as input parameters for the
ILP model (see Section 4). Then we solved the 20 instances of random demands.
Furthermore, we also applied a relaxed ILP model with the objective of calculating the lower optimality bound of the instances analyzed. In this variation, we consider feasible any linear combination of cutting patterns. However, it might occur that the combination selected is not feasible for the real-life problem. For this reason, this measurement is a lower bound on optimality: in the latter case, the optimal solution is greater than this bound.
Time (sec)
Mill k = 25 k = 50 k = 75 k = 100 k = 125 k = 150 k = 175 k = 200
0 5.13 14.47 31.40 56.80 107.00 158.85 229.90 310.47
1 4.94 15.68 35.66 59.88 95.85 156.30 234.19 313.49
2 4.99 14.62 27.59 56.59 94.38 154.96 235.22 313.99
3 5.06 17.59 36.63 62.06 102.29 167.75 266.71 356.86
4 5.17 16.04 33.82 60.64 99.78 164.89 240.40 331.72
5 5.43 16.83 33.04 63.48 95.85 171.37 255.47 351.97
6 4.99 15.77 33.21 61.18 98.10 163.63 241.04 327.87
7 4.83 16.27 34.37 67.03 95.04 168.46 251.94 344.71
Total Time 40.54 127.27 265.72 487.66 788.29 1306.23 1954.87 2651.08
The lower optimality bound is very useful since it allows us to stop the search for a better solution once we have reached that bound (since it can then be ensured that the solution is optimal). For this reason, we incorporated this lower optimality bound into the model solved by CPLEX.
Figure 3 shows the percentage difference between the solutions obtained with our approach and the lower optimality bound. It can be observed that as k increases, the percentage difference decreases, following an inverse exponential function. This suggests that increasing the number of medoids is more effective when the original number of medoids is small. Furthermore, it can be observed that there is a "saturation" point beyond which it is not possible to further improve the quality of the solutions. For these instances, the saturation point is located at approximately k = 125, since higher values of k provide almost the same results. For this reason, for these instances, the best option is to select k = 125, since it is not worth spending more time computing higher values of k. Note that in this case the percentage difference from the lower optimality bound is ∼ 0.4%, which indicates that we succeeded in finding optimal and close-to-optimal solutions.
We would also like to comment on how these percentage differences translate into monetary savings. On average, for these instances, the raw material required to satisfy the demands when using k = 25 was almost €800 more expensive than when using k = 125. This illustrates the benefit, in the real-life application, of using the approach presented in this paper with a sufficiently large value of k.
In Table 2 and Figure 4 we show the mean times for solving the 20 instances over the selected range of values of k. Note that these times also increase in a non-linear fashion, from a solution time of 1.506 seconds for k = 25 to 407.475 seconds (∼ 7 minutes) for k = 200. We would like to point out that there is a correspondence between the saturation point in optimality and the saturation point in the computation times for solving the ILP model. Note that for values of k higher than 125, the increase in computation time is barely noticeable.
References
1. Kantorovich, L.V.: Mathematical methods of organizing and planning production.
Management Science 6(4), 366–422 (1960)
2. Cheng, C.H., Feiring, B.R., Cheng, T.C.E.: The cutting stock problem - a survey.
International Journal of Production Economics 36(3), 291–305 (1994)
3. Gilmore, P.C., Gomory, R.E.: Multistage Cutting Stock Problems of Two and More
Dimensions. Operations Research 13, 94–120 (1965)
4. Furini, F., Malaguti, E.: Models for the two-dimensional two-stage cutting stock
problem with multiple stock size. Computers & Operations Research 40(8),
1953–1962 (2013)
5. Hendry, L.C., Fok, K.K., Shek, K.W.: A cutting stock scheduling problem in the
copper industry. Journal of the Operational Research Society 47, 38–47 (1996)
6. Murphy, G., Marshall, H., Bolding, M.C.: Adaptive control of bucking on harvesters
to meet order book constraints. Forest Products Journal and Index 54(12), 114–121
(2004)
7. Dueck, G., Scheuer, T.: Threshold accepting: a general purpose optimization algo-
rithm appearing superior to simulated annealing. Journal of computational physics
90(1), 161–175 (1990)
8. Sawilowsky, S.S.: You think you’ve got trivials? Journal of Modern Applied Sta-
tistical Methods 2(1), 218–225 (2003)
9. Kroese, D.P., Taimre, T., Botev, Z.I.: Handbook of Monte Carlo Methods, Wiley
Series in Probability and Statistics. John Wiley and Sons, New York (2011)
10. Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R., Sellie, L.: On the
Learnability of Discrete Distributions. ACM Symposium on Theory of Computing
(1994)
11. Chakravarti, I.M., Laha, R.G., Roy, J.: Handbook of Methods of Applied Statistics,
vol. I. John Wiley and Sons, pp. 392–394 (1967)
12. Adams, C.R., Clarkson, J.A.: On definitions of bounded variation for functions
of two variables. Transactions of the American Mathematical Society 35, 824–854
(1933)
13. Kullback, S., Leibler, R.A.: On information and sufficiency. Annals of Mathematical
Statistics 22(1), 79–86 (1951). doi:10.1214/aoms/1177729694. MR 39968
14. Rubinstein, R.Y., Kroese, D.P.: Simulation and the Monte Carlo Method, 2nd edn.
John Wiley & Sons (2008)
15. de Amorim, R.C., Fenner, T.: Weighting features for partition around medoids
using the Minkowski metric. In: Proceedings of the 11th International Symposium
in Intelligent Data Analysis, pp. 35–44, October 2012
16. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K., Studer, M.,
Roudier: cluster: Cluster Analysis Extended Rousseeuw et al. R package version
2.0.1, January 2015. http://cran.r-project.org/web/packages/cluster/cluster.pdf
17. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley & Sons Inc, New York (1990)
18. Wei, C., Lee, Y., Hsu, C.: Empirical comparison of fast clustering algorithms for
large data sets. In: Proceedings of the 33rd Hawaii International Conference on
System Sciences (2000)
19. Nagpaul, P.S.: 7.1.2 Clustering Large Applications (CLARA). In: Guide
to Advanced Data Analysis using IDAMS Software. http://www.unesco.org/
webworld/idams/advguide/Chapt7 1 2.htm (access date: October 02, 2015)
Data Preprocessing
Markov Blanket Discovery
in Positive-Unlabelled and Semi-supervised Data
1 Introduction
Markov Blanket (MB) is an important concept that links two of the main activ-
ities of machine learning: dimensionality reduction and learning. Using Pellet
& Elisseef’s [15] wording “Feature selection and causal structure learning are
related by a common concept: the Markov blanket.”
Koller & Sahami [10] showed that the MB of a target variable is the optimal
set of features for prediction. In this context discovering MB can be useful for
eliminating irrelevant features or features that are redundant in the context
of others, and as a result plays a fundamental role in filter feature selection.
Furthermore, Markov blankets are important in learning Bayesian networks [14],
and can also play an important role in causal structure learning [15].
In most real-world applications, it is easier and cheaper to collect unlabelled examples than labelled ones, so transferring techniques from fully to partially labelled datasets is a key challenge. Our work shows how we can recover the MB around partially labelled targets. Since the main building block of the MB
Fig. 1. Toy Markov blanket example where: white nodes represent the target variable,
black ones the features that belong to the MB of the target and grey ones the features
that do not belong to the MB. In 1a we know the value of the target over all examples,
while in 1b the target is partially observed (dashed circle) meaning that we know its
value only in a small subset of the examples.
Learning the Markov blanket for each variable of the dataset, or in other
words inferring the local structure, can naturally lead to causal structure learning
[15]. Apart from playing a huge role in structure learning of a Bayesian network,
Markov blanket is also related to another important machine learning activity:
feature selection.
Koller & Sahami [10] published the first work about the optimality of Markov
blanket in the context of feature selection. Recently, Brown et al. [5] introduced a
unifying probabilistic framework and showed that many heuristically suggested
feature selection criteria, including Markov blanket discovery algorithms, can
be seen as iterative maximizers of a clearly specified objective function: the
conditional likelihood of the training examples.
Finally, there is another class of algorithms that try to control the size of the conditioning sets through a two-phase procedure: first identify parents and children, then identify spouses. The most representative algorithms are HITON [2] and the Max-Min Markov Blanket (MMMB) [22]. All of these algorithms assume faithfulness of the data distribution. As we have already seen, in all Markov blanket discovery algorithms the conditional test of independence (Alg. 1, Lines 5 and 11) plays a crucial role, and this is the focus of the next paragraph.
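Algorithm 1 itself is not reproduced in this excerpt; purely for orientation, a minimal sketch of the usual IAMB grow/shrink structure it refers to is given below (our own rendering, with cond_indep and assoc standing in for the conditional independence test and the association measure discussed next).

    def iamb(target, features, cond_indep, assoc):
        """Sketch of IAMB: grow a candidate Markov blanket, then shrink it.

        cond_indep(X, Y, Z) -> True if X is judged independent of Y given the set Z
        assoc(X, Y, Z)      -> association strength of X with Y given Z (e.g. 1 - p-value)
        """
        mb = set()
        changed = True
        while changed:                                   # growing phase
            changed = False
            candidates = [f for f in features if f not in mb]
            if not candidates:
                break
            best = max(candidates, key=lambda f: assoc(f, target, mb))
            if not cond_indep(best, target, mb):         # the conditional test (cf. Lines 5/11)
                mb.add(best)
                changed = True
        for f in list(mb):                               # shrinking phase
            if cond_indep(f, target, mb - {f}):
                mb.remove(f)
        return mb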
    G = 2N Σ_{x,y,z} p̂(x, y, z) ln [ p̂(x, y|z) / ( p̂(x|z) p̂(y|z) ) ] = 2N Î(X; Y |Z),    (2)

where Î(X; Y |Z) is the maximum likelihood estimate of the conditional mutual information between X and Y given Z [8].
Hypothesis Testing Procedure: Under the null hypothesis that X and Y are
statistically independent given Z, the G-statistic is known to be asymptotically
χ²-distributed, with ν = (|X| − 1)(|Y| − 1)|Z| degrees of freedom [1]. Knowing that, and using (2), we can calculate the p_{XY|Z} value as 1 − F(G), where F is the CDF of the χ²-distribution and G is the observed value of the G-statistic. The p-value represents the probability of obtaining a test statistic equal to or more extreme than the observed one, given that the null hypothesis holds. After calculating this value, we check whether it exceeds a significance level α. If p_{XY|Z} ≤ α, we reject the null hypothesis; otherwise we fail to reject it. This is the procedure we follow to take the decisions in Lines 5 and 11 of the IAMB algorithm (Algorithm 1). Furthermore, to choose the most strongly related feature in Line 4, we evaluate the p-values and choose the feature with the smallest one.
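A small sketch of this testing step, assuming the discrete data have already been aggregated into a (|X|, |Y|, |Z|) array of joint counts (the function name and layout are ours):

    import numpy as np
    from scipy.stats import chi2

    def g_test_conditional(counts):
        """G-test of X independent of Y given Z from an array of joint counts n(x, y, z)."""
        G = 0.0
        for z in range(counts.shape[2]):
            nz = counts[:, :, z].astype(float)
            n_z = nz.sum()
            if n_z == 0:
                continue
            expected = np.outer(nz.sum(axis=1), nz.sum(axis=0)) / n_z   # n(x,z) n(y,z) / n(z)
            obs, exp = nz[nz > 0], expected[nz > 0]
            G += 2.0 * np.sum(obs * np.log(obs / exp))
        df = (counts.shape[0] - 1) * (counts.shape[1] - 1) * counts.shape[2]
        p_value = chi2.sf(G, df)          # 1 - F(G)
        return G, p_value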
Different Types of Error: Following this testing procedure, two possible types
of error can occur. The significance level α defines the probability of Type I
error, or False Positive rate: the probability that the test will reject the null hypothesis when the null hypothesis is in fact true. The probability of a Type II error, or False Negative rate, denoted by β, is the probability that the test will fail to reject the null hypothesis when the alternative hypothesis is true and there is an actual effect in our data. The Type II error is closely related to the concept of the statistical power of a test, which is the probability that the test will reject the null hypothesis when the alternative hypothesis is true, i.e. power = 1 − β.
Power Analysis: With such a test, it is common to perform an a-priori power analysis [7], where we take a given sample size N, a required significance level α, and an effect size ω, and then compute the power of the statistical test to detect the given effect size. In order to do this we need a test statistic with a known distribution under the alternative hypothesis. Under the alternative hypothesis (i.e. when X and Y are dependent given Z), the G-statistic has a large-sample non-central χ² distribution [1, Section 16.3.5]. The non-centrality parameter (λ) of this distribution has the same form as the G-statistic, but with sample values replaced by population values, λ = 2N I(X; Y |Z). The effect size of the G-test can be naturally expressed as a function of the conditional mutual information: according to Cohen [7], the effect size (ω) is the square root of the non-centrality parameter divided by the sample size, so ω = √(2 I(X; Y |Z)).
Sample Size Determination: One important usage of a-priori power analy-
sis is sample size determination. In this prospective procedure we specify the
probability of Type I error (e.g. α = 0.05), the desired probability of Type II
error (e.g. β = 0.01 or power = 0.99) and the desired effect size that we want to
observe, and we can determine the minimum number of examples (N ) that we
need to detect that effect.
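A sketch of this prospective calculation using the non-central χ² distribution (the simple grid search over N is our own choice):

    from scipy.stats import chi2, ncx2

    def required_sample_size(omega, df, alpha=0.05, power=0.99, n_max=10**6):
        """Smallest N such that a G-test with df degrees of freedom detects an effect
        of size omega (non-centrality lambda = N * omega**2) with the desired power."""
        crit = chi2.ppf(1.0 - alpha, df)                    # rejection threshold under H0
        for n in range(10, n_max):
            if ncx2.sf(crit, df, n * omega ** 2) >= power:  # power = P(G > crit | H1)
                return n
        return None

    # e.g. effect size omega = sqrt(2 * I(X;Y|Z)) for an assumed conditional mutual information
    print(required_sample_size(omega=0.1, df=2))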
    p(s_P^+ | x, y^+) = p(s_P^+ | y^+)   ∀ x ∈ X.    (3)
Building upon this assumption, Sechidis et al. [19] proved that we can test
independence between a feature X and the unobservable variable Y, by simply
testing the independence between X and the observable variable SP , which can
be seen as a surrogate version of Y. While this assumption is sufficient for testing
independence and guarantees the same probability of false positives, it leads to
a less powerful test, and the probability of committing a false negative error is
increased by a factor which can be calculated using prior knowledge over p(y + ).
With our current work we extend this approach to test conditional independence.
    X ⊥⊥ Y | Z  ⇔  X ⊥⊥ S_P | Z.
[Figure: average number of variables falsely added to the MB on the alarm, insurance, barley, and hailfinder networks, using Y with N fully supervised examples versus S_P with N positive-unlabelled examples.]
With the following theorem we quantify the amount of power that we lose under the naive assumption that all unlabelled examples are negative. The proofs of the last two theorems are also available in the supplementary material. So, by using prior knowledge over p(y^+), we can use the naive test for sample size determination and decide the amount of data that we need in order to obtain performance similar to the unobservable fully supervised test.
Now we illustrate the last theorems again in the context of MB discovery. A direct consequence of Theorem 2 is that using S_P instead of Y results in a higher number of false negative errors. In the MB discovery context this results in a larger number of variables being falsely not added to the predicted blanket, since we assumed that the variables were independent when in fact they were dependent. To verify this conclusion experimentally, we again compare the blankets discovered using S_P instead of Y. As we see in Figure 3, the number of variables that were falsely not added is higher when we use S_P. This figure also verifies Theorem 3, where we see that the number of variables falsely removed when using the naive test G(X; S_P |Z) with increased sample size N/κ_P is the same as when using the unobservable test G(X; Y |Z) with N data.
Fig. 3. Verification of Theorems 2 and 3. This illustrates the average number of vari-
ables falsely not added to the MB and the 95% confidence intervals over 10 trials when
we use IAMB with Y and SP . 3a for total sample size N = 2000 and 3b for total sample
size N = 5000. In all the scenarios we label 5% of the total examples as positives.
For an overall evaluation of the blankets derived using S_P instead of Y, we use the F-measure, the harmonic mean of precision and recall, against the ground truth [17]. In Figure 4, we observe that assuming all unlabelled examples to be negative gives worse results than the fully supervised scenario, and that the difference between the two approaches gets smaller as we increase the sample size. Furthermore, using the correction factor κ_P to increase the sample size of the naive approach makes the two techniques perform similarly.
Fig. 4. Comparing the performance in terms of F -measure when we use IAMB with
Y and SP . 4a for total sample size N = 2000 and 4b for total sample size N = 5000.
In all the scenarios we label 5% of the total examples as positives.
In this section we will present two informative ways, in terms of power, to test
conditional independence in semi-supervised data. Then we will suggest an algo-
rithm for Markov blanket discovery where we will incorporate prior knowledge
to choose the optimal way for testing conditional independence.
So, using SP instead of Y is equivalent to making the assumption that all unla-
belled examples are negative, as we did in the positive-unlabelled scenario, while
using SN instead of Y is equivalent to assuming all unlabelled examples being
positive. In this section we will prove the versions of the three theorems we
presented earlier for both variables SP and SN in the semi-supervised scenario.
Firstly we will show that testing conditional independence by assuming the
unlabelled examples to be either positive or negative is a valid approach.
Theorem 4 (Testing conditional independence in SS data).
In the semi-supervised scenario, under the selected completely at random assump-
tion, a variable X is independent of the class label Y given a subset of features
Z if and only if X is independent of SP given Z and the same result holds for
SN : X ⊥ ⊥ Y |Z ⇔ X ⊥ ⊥ SP |Z and X ⊥ ⊥ Y |Z ⇔ X ⊥ ⊥ SN |Z.
Proof. Since the selected completely at random assumption holds for both
classes, this theorem is a direct consequence of Theorem 1.
The consequence of this assumption is that the derived conditional tests of independence are less powerful than the unobservable fully supervised test, as we prove with the following theorem.
Theorem 5 (Power of the SS conditional tests of independence).
In the semi-supervised scenario, under the selected completely at random assumption, when a variable X is dependent on the class label Y given a subset of features Z, X ⊥̸⊥ Y |Z, we have: I(X; Y |Z) > I(X; S_P |Z) and I(X; Y |Z) > I(X; S_N |Z).
Proof. Since the selected completely at random assumption holds for both
classes, this theorem is a direct consequence of Theorem 2.
Finally, with the following theorem we quantify the amount of power that we lose by assuming that the unlabelled examples are negative (i.e. using S_P) or positive (i.e. using S_N).
Theorem 6 (Correction factors for SS tests).
The non-centrality parameter of the conditional G-test can take the form:
Proof. Since the selected completely at random assumption holds for both
classes, this theorem is a direct consequence of Theorem 3.
When the opposing inequality holds, the most powerful choice is S_N. When equality holds, both approaches are equivalent. We can estimate p(s_P^+) and p(s_N^+) from the observed data and, using some prior knowledge over p(y^+), we can decide on the most powerful option. In Figure 5 we compare, in terms of F-measure, the Markov blankets derived when we use the most powerful and the least powerful choice. As we observe, incorporating prior knowledge as Corollary 1 describes and choosing to test with the most powerful option results in remarkably better performance than with the least powerful option.
Fig. 5. Comparing the performance in terms of F -measure when we use the unob-
servable variable Y and the most and least powerful choice between SP and SN . 5a
for sample size N = 2000 out of which we label only 100 positive and 100 negative
examples and 5b for sample size N = 5000 out of which we label only 250 positive and
250 negative examples.
² When the labelling depends directly on the class, eq. (4), we cannot obtain an unbiased estimator for this probability without further assumptions; more details in [16].
[Figures: F-measure of MB discovery using only the labelled data (listwise deletion), the semi-supervised data (pairwise deletion), and our approach of choosing the most powerful of S_P and S_N.]
Acknowledgments. The research leading to these results has received funding from
EPSRC Anyscale project EP/L000725/1 and the European Union’s Seventh Frame-
work Programme (FP7/2007-2013) under grant agreement no 318633. This work was
supported by EPSRC grant [EP/I028099/1]. Sechidis gratefully acknowledges the sup-
port of the Propondis Foundation.
³ Downloaded from http://www.bnlearn.com/bnrepository/
References
1. Agresti, A.: Categorical Data Analysis. Wiley Series in Probability and Statistics,
3rd edn. Wiley-Interscience (2013)
2. Aliferis, C.F., Statnikov, A., Tsamardinos, I., Mani, S., Koutsoukos, X.D.: Local
causal and Markov blan. induction for causal discovery and feat. selection for clas-
sification part I: Algor. and empirical eval. JMLR 11, 171–234 (2010)
3. Allison, P.: Missing Data. Sage University Papers Series on Quantitative Applica-
tions in the Social Sciences, 07–136 (2001)
4. Bacciu, D., Etchells, T., Lisboa, P., Whittaker, J.: Efficient identification of inde-
pendence networks using mutual information. Comp. Stats 28(2), 621–646 (2013)
5. Brown, G., Pocock, A., Zhao, M.J., Luján, M.: Conditional likelihood maximisa-
tion: a unifying framework for information theoretic feature selection. The Journal
of Machine Learning Research (JMLR) 13(1), 27–66 (2012)
6. Cai, R., Zhang, Z., Hao, Z.: BASSUM: A Bayesian semi-supervised method for
classification feature selection. Pattern Recognition 44(4), 811–820 (2011)
7. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Rout-
ledge Academic (1988)
8. Cover, T.M., Thomas, J.A.: Elements of information theory. J. Wiley & Sons (2006)
9. Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In:
ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2008)
10. Koller, D., Sahami, M.: Toward optimal feature selection. In: International Con-
ference of Machine Learning (ICML), pp. 284–292 (1996)
11. Lawrence, N.D., Jordan, M.I.: Gaussian processes and the null-category noise
model. In: Semi-Supervised Learning, chap. 8, pp. 137–150. MIT Press (2006)
12. Margaritis, D., Thrun, S.: Bayesian network induction via local neighborhoods. In:
NIPS, pp. 505–511. MIT Press (1999)
13. Mohan, K., Van den Broeck, G., Choi, A., Pearl, J.: Efficient algorithms for
bayesian network parameter learning from incomplete data. In: Conference on
Uncertainty in Artificial Intelligence (UAI) (2015)
14. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann Publishers Inc., San Francisco (1988)
15. Pellet, J.P., Elisseeff, A.: Using Markov blankets for causal structure learning. The
Journal of Machine Learning Research (JMLR) 9, 1295–1342 (2008)
16. Plessis, M.C.d., Sugiyama, M.: Semi-supervised learning of class balance under
class-prior change by distribution matching. In: 29th ICML (2012)
17. Pocock, A., Luján, M., Brown, G.: Informative priors for Markov blanket discovery.
In: 15th AISTATS (2012)
18. Rosset, S., Zhu, J., Zou, H., Hastie, T.J.: A method for inferring label sampling
mechanisms in semi-supervised learning. In: NIPS (2004)
19. Sechidis, K., Calvo, B., Brown, G.: Statistical hypothesis testing in positive unla-
belled data. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML
PKDD 2014, Part III. LNCS, vol. 8726, pp. 66–81. Springer, Heidelberg (2014)
20. Smith, A.T., Elkan, C.: Making generative classifiers robust to selection bias. In:
13th ACM SIGKDD Inter. Conf. on Knwl. Disc. and Data Min., pp. 657–666 (2007)
21. Tsamardinos, I., Aliferis, C.F.: Towards principled feature selection: relevancy, fil-
ters and wrappers. In: AISTATS (2003)
22. Tsamardinos, I., Aliferis, C.F., Statnikov, A.: Time and sample efficient discovery
of Markov blankets and direct causal relations. In: ACM SIGKDD (2003)
23. Yaramakala, S., Margaritis, D.: Speculative Markov blanket discovery for optimal
feature selection. In: 5th ICDM. IEEE (2005)
Multi-view Semantic Learning for Data
Representation
1 Introduction
In many real-world data analytic problems, instances are often described with
multiple modalities or views. It becomes natural to integrate multi-view repre-
sentations to obtain better performance than relying on a single view. A good
2 Related Work
In this section, we present a brief review of related work about NMF-based
subspace learning. Firstly, we describe the notations used throughout the paper.
X ≈ UV.
In recent years, many variants of the basic NMF model have been proposed.
We just list a few which are related to our work. One direction related to our
work is coupling label information to NMF [21], [25]. These works added discriminative constraints into NMF by regularizing the encoding matrix V with Fisher-style discriminative constraints. However, Fisher discriminant analysis assumes that the data of each class are approximately Gaussian distributed, a property that cannot always be satisfied in real-world applications. Our method adopts a nonparametric regularization scheme (i.e. regularization in neighborhoods) and
consequently can better model real-life data. Another related direction of NMF
is sparse NMF [8]. Sparseness constraints not only encourage local and compact
representations, but also improve the stability of the decomposition. Most pre-
vious works on sparse NMF employed L1 norm or ratio between L1 norm and
L2 norm to achieve sparsity on U and V. However, the story for our problem
is different since we have multiple views and the goal is to allow each latent
dimension to be associated with a subset of views. Therefore, L1,2 norm [9] is
used to achieve this goal.
There are also some extensions of NMF for multi-view data, e.g. clustering
[18], image annotation [11], graph regularized multi-view NMF [6] and semi-
supervised learning [10],[17]. Although [10] and [17] also exploited label infor-
mation, they incorporated label information as a factorization constraint on V,
i.e. reconstructing the label indicator matrix through multiplying V by a weight
matrix. Hence, those methods intrinsically imposed indirect affinity constraints
on encodings of labeled items in the latent subspace. On the contrary, our method
directly captures the semantic relationships between items in the latent subspace
through the proposed graph embedding framework. We will compare MvSL with
[10] in experiments.
However, standard unsupervised NMF fails to discover the semantic structure in the data. Next, we introduce our graph embedding framework for multi-view semantic learning.
    min_{V^l} (1/2) Σ_{i=1}^{N_l} Σ_{j=1}^{N_l} W^a_{ij} ||v^l_i − v^l_j||²₂ = min_{V^l} tr( V^l L^a (V^l)^T ),    (2)

    max_{V^l} (1/2) Σ_{i=1}^{N_l} Σ_{j=1}^{N_l} W^p_{ij} ||v^l_i − v^l_j||²₂ = max_{V^l} tr( V^l L^p (V^l)^T ),    (3)
where tr(·) denotes the trace of a matrix, and L^a = D^a − W^a is the graph Laplacian matrix for G^a, with the (i, i)-th element of the diagonal matrix D^a equal to Σ_{j=1}^{N_l} W^a_{ij} (L^p is defined analogously for G^p). Generally speaking, Eq. (2) means items belonging to
the same class should be near each other in the learned latent subspace, while
Eq. (3) tries to keep items from different classes as distant as possible. However,
with only the nonnegativity constraints, Eq. (3) would diverge. Note that there is an arbitrary scaling factor in the solutions to problem (1): for any invertible K × K matrix Q, we have U^(v) V = (U^(v) Q)(Q^{-1} V). Hence, without loss of generality, we add the constraints {V_kj ≤ 1, ∀k, j} to put an upper bound on (3).
where N_{k_a}(i) indicates the index set of the k_a nearest neighbors of item i within the same class, and

    W^p_{ij} = 1, if i ∈ N_{k_p}(j) or j ∈ N_{k_p}(i);  0, otherwise,    (5)

where N_{k_p}(i) indicates the index set of the k_p nearest neighbors of item i among the distinct classes. We can see from the definitions of W^a and W^p that G^a and
Gp intrinsically preserve item semantic relations in local neighborhoods. This
nonparametric scheme can better handle real-life datasets which often exhibit
non-Gaussian distribution.
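A small numpy sketch of constructing W^a, W^p and the corresponding Laplacians on the labelled items (plain Euclidean neighbours are used here purely for illustration; the paper instead assesses similarity with a learned combination of kernels, as described next):

    import numpy as np

    def affinity_graphs(X, y, ka=5, kp=5):
        """Within-class graph W_a and penalty graph W_p for labelled items.
        X: (n, d) feature matrix; y: length-n numpy array of labels."""
        n = len(y)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        Wa, Wp = np.zeros((n, n)), np.zeros((n, n))
        for i in range(n):
            same = np.where((y == y[i]) & (np.arange(n) != i))[0]
            diff = np.where(y != y[i])[0]
            for j in same[np.argsort(D[i, same])[:ka]]:
                Wa[i, j] = Wa[j, i] = 1.0        # i in N_ka(j) or j in N_ka(i)
            for j in diff[np.argsort(D[i, diff])[:kp]]:
                Wp[i, j] = Wp[j, i] = 1.0
        La = np.diag(Wa.sum(axis=1)) - Wa        # L^a = D^a - W^a
        Lp = np.diag(Wp.sum(axis=1)) - Wp
        return Wa, Wp, La, Lp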
The remaining question is how to estimate nearest neighbors, which is a
routine function for constructing Ga and Gp . However, since real-life datasets
could be diverse and noisy, a single feature may not be sufficient to characterize
the affinity relations among items. Hence, we propose to use multiple features
for assessing the similarity between data items. In particular, we develop a novel
Multiple Kernel Learning (MKL) [20],[5] method for this task.
where yi denotes the label of item i. For each pair of items, we require its
combined kernel function value to conform to the corresponding ideal kernel
value. This leads to the following least square loss
Summing l(i, j, η ) over all pairs of labeled items, we could get the optimization
objective. However, in reality we would get imbalanced classes: the numbers of
labeled items for different classes can be quite different. The item pairs con-
tributed by classes with much larger number of items will dominate the overall
loss. In order to tackle this issue, we normalize the contribution of each pair of
classes (including same-class pairs) by its number of item pairs. This is equivalent
to multiplying each l(i, j, η ) by a weight tij which is defined as follows
1
n2i
, if yi = yj
tij = 1 , (10)
2ni nj , otherwise
where ni denotes the number of items belonging to the class with label yi .
Therefore, the overall loss becomes i,j tij l(i, j, η ). To prevent overfitting, a L2
regularization term is added for η . The final optimization problem is formulated
as
Nl
min tij l(i, j, η ) + ληη 22
η
i,j=1
(11)
H
s.t. ηv = 1, ηv ≥ 0
v=1
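A sketch of solving (11) with a generic constrained optimiser is given below, under the assumption (suggested by the description above, since the loss itself is not shown in this excerpt) that l(i, j, η) is the squared difference between the combined kernel value Σ_v η_v K^(v)_ij and the ideal kernel value; all names are ours.

    import numpy as np
    from scipy.optimize import minimize

    def learn_kernel_weights(K_list, K_ideal, T, lam=0.1):
        """K_list: list of H (Nl x Nl) per-view kernels on labelled items,
        K_ideal: ideal kernel (1 for same-class pairs, 0 otherwise),
        T: matrix of pair weights t_ij from Eq. (10)."""
        H = len(K_list)
        K = np.stack(K_list)                                  # shape (H, Nl, Nl)

        def objective(eta):
            combined = np.tensordot(eta, K, axes=1)           # sum_v eta_v K^(v)
            return np.sum(T * (combined - K_ideal) ** 2) + lam * np.dot(eta, eta)

        res = minimize(objective, np.full(H, 1.0 / H),
                       bounds=[(0.0, None)] * H,
                       constraints=[{"type": "eq", "fun": lambda e: e.sum() - 1.0}],
                       method="SLSQP")
        return res.x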
4 Optimization
The joint optimization function in (13) is not convex over all variables
U(1) , ..., U(H) and V simultaneously. Thus, we propose a block coordinate
descent method [15] which optimizes one block of variables while keeping the
other block fixed. The procedure is depicted in Algorithm 1. For the ease of
representation, we define

    O(U^(1), ..., U^(H), V) = (1/2) Σ_{v=1}^{H} ||X^(v) − U^(v) V||²_F + α Σ_{v=1}^{H} ||U^(v)||_{1,2}
        + (β/2) [ tr( V^l L^a (V^l)^T ) − tr( V^l L^p (V^l)^T ) ].    (14)
When V is fixed, U(1) , ..., U(H) are independent with one another. Since the
optimization method is the same, here we just focus on an arbitrary view and
use X and U to denote respectively the data matrix and the basis matrix for
the view. The optimization problem involving U can be written as
    min_U  φ(U) := (1/2) ||X − UV||²_F + α ||U||_{1,2}
    s.t.   U_ik ≥ 0, ∀i, k.    (15)
Two terms of φ(U) are convex functions. The first term of φ(U) is differentiable,
and its gradient is Lipschitz continuous. Hence, an efficient convex optimization
method can be adopted. [12] presented a variant of Nesterov's first-order method suitable for solving (15). In this paper, we adopt the optimization method of [12] to update U; due to space limitations, we refer the reader to [12] for details.
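The accelerated method of [12] is not reproduced here; purely as orientation, a plain projected (sub)gradient sketch for (15) is shown below, assuming ||U||_{1,2} denotes the sum of the l2 norms of the columns of U (our reading of the column-wise sparsity-inducing norm); it is not the method actually used in the paper.

    import numpy as np

    def update_U(X, V, U, alpha=10.0, step=1e-3, n_iter=200, eps=1e-12):
        """Projected subgradient descent on phi(U) = 0.5*||X - U V||_F^2 + alpha*||U||_{1,2}, U >= 0."""
        for _ in range(n_iter):
            grad = (U @ V - X) @ V.T                              # gradient of the smooth term
            col_norms = np.linalg.norm(U, axis=0, keepdims=True)
            grad += alpha * U / np.maximum(col_norms, eps)        # subgradient of the column-norm sum
            U = np.maximum(U - step * grad, 0.0)                  # project onto the nonnegative orthant
        return U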
4.2 Optimizing V
When {U^(v)}_{v=1}^{H} are fixed, the subproblem for V can be written as

    min_V  ψ(V) := (1/2) Σ_{v=1}^{H} ||X^(v) − U^(v) V||²_F
        + (β/2) [ tr( V^l L^a (V^l)^T ) − tr( V^l L^p (V^l)^T ) ]    (16)
    s.t.   1 ≥ V_kj ≥ 0, ∀j, k.
The first term of ψ(V) can be expanded as

    (1/2) Σ_{v=1}^{H} { tr[(V^l)^T (U^(v))^T U^(v) V^l] − 2 tr[(V^l)^T (U^(v))^T X^(v),l]
        + tr[(V^u)^T (U^(v))^T U^(v) V^u] − 2 tr[(V^u)^T (U^(v))^T X^(v),u] } + const.

For convenience, let P = Σ_{v=1}^{H} (U^(v))^T U^(v) and Q^l = Σ_{v=1}^{H} (U^(v))^T X^(v),l; Q^u is defined similarly for the unlabeled part. Eq. (16) can then be transformed into

    min_V  (1/2) tr[(V^l)^T P V^l] − tr[(V^l)^T Q^l] + (1/2) tr[(V^u)^T P V^u] − tr[(V^u)^T Q^u]
        + (β/2) [ tr( V^l L^a (V^l)^T ) − tr( V^l L^p (V^l)^T ) ]    (17)
    s.t.   1 ≥ V_kj ≥ 0, ∀j, k.
    (1/2) tr[(V^l)^T P V^l] = (1/2) Σ_{j=1}^{N_l} (v^l_j)^T P v^l_j,    (19)

    (β/2) [ tr( V^l L^a (V^l)^T ) − tr( V^l L^p (V^l)^T ) ]
        = (β/2) Σ_{k=1}^{K} [ (v̄^l_k)^T (D^a + W^p) v̄^l_k − (v̄^l_k)^T (D^p + W^a) v̄^l_k ],    (20)
where v^l_j and v̄^l_k represent the j-th column vector and the k-th row vector of V^l, respectively. Each summand in Eqs. (19) and (20) is a quadratic function of a vector variable. Therefore, we can provide upper bounds for these summands:
    (v^l_j)^T P v^l_j ≤ Σ_{k=1}^{K} [ (P v^{l,t}_j)_k / V^{l,t}_{kj} ] (V^l_{kj})²,

    (v̄^l_k)^T (D^a + W^p) v̄^l_k ≤ Σ_{j=1}^{N_l} [ ((D^a + W^p) v̄^{l,t}_k)_j / V^{l,t}_{kj} ] (V^l_{kj})²,

    −(v̄^l_k)^T (D^p + W^a) v̄^l_k ≤ − Σ_{i,j} (D^p + W^a)_{ij} V^{l,t}_{ki} V^{l,t}_{kj} ( 1 + log [ V^l_{ki} V^l_{kj} / (V^{l,t}_{ki} V^{l,t}_{kj}) ] ),
where we use V^{l,t} to denote the value of V^l in the t-th iteration of the update algorithm, and v^{l,t}_j, v̄^{l,t}_k are its j-th column vector and k-th row vector, respectively. Note that V^l_{kj} can be viewed both as the k-th element of v^l_j and as the j-th element of v̄^l_k. The proofs of these bounds follow directly from Lemmas 1 and 2 in [19]. Aggregating the bounds for all the summands, we obtain the auxiliary function for O^l(V^l):
    G^l(V^{l,t}; V^l)
        = (1/2) Σ_{j=1}^{N_l} Σ_{k=1}^{K} [ ( (P v^{l,t}_j)_k + β ((D^a + W^p) v̄^{l,t}_k)_j ) / V^{l,t}_{kj} ] (V^l_{kj})²
        − (β/2) Σ_{k=1}^{K} Σ_{i,j} (D^p + W^a)_{ij} V^{l,t}_{ki} V^{l,t}_{kj} ( 1 + log [ V^l_{ki} V^l_{kj} / (V^{l,t}_{ki} V^{l,t}_{kj}) ] )
        − Σ_{j=1}^{N_l} Σ_{k=1}^{K} Q^l_{kj} V^l_{kj}.    (21)
    ∂G^l(V^{l,t}; V^l) / ∂V^l_{kj}
        = [ ( (P v^{l,t}_j)_k + β ((D^a + W^p) v̄^{l,t}_k)_j ) / V^{l,t}_{kj} ] V^l_{kj}
        − [ β ((D^p + W^a) v̄^{l,t}_k)_j / V^l_{kj} ] V^{l,t}_{kj} − Q^l_{kj}
Here v^l_j and v̄^l_k denote the j-th column vector and the k-th row vector of V^l, respectively. It is easy to verify that O^l(V^{l,t+1}) ≤ G^l(V^{l,t}; V^{l,t+1}) ≤ G^l(V^{l,t}; V^{l,t}) = O^l(V^{l,t}). Therefore, the update rule for V^l monotonically decreases Eq. (13). The case for V^u is simpler since we do not have the graph embedding terms:
    O^u(V^u) = (1/2) tr[(V^u)^T P V^u] − tr[(V^u)^T Q^u].    (24)

Similarly, the auxiliary function for O^u(V^u) can be derived:

    G^u(V^{u,t}; V^u) = (1/2) Σ_{j=1}^{N_u} Σ_{k=1}^{K} [ (P v^{u,t}_j)_k / V^{u,t}_{kj} ] (V^u_{kj})² − Σ_{j=1}^{N_u} Σ_{k=1}^{K} Q^u_{kj} V^u_{kj},    (25)

and the update rule can be obtained by setting the partial derivatives to 0:

    V^{u,t+1}_{kj} = min( 1, V^{u,t}_{kj} ( Q^u_{kj} − |Q^u_{kj}| ) / ( 2 (P v^{u,t}_j)_k ) ).    (26)
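For the unlabelled block, a minimal numpy sketch of this update is given below; it uses the ratio form obtained by setting the derivative of the auxiliary function (25) to zero and then clips at the upper bound, assuming (as holds here) that P and Q^u are entrywise nonnegative. The function name and loop structure are ours.

    import numpy as np

    def update_Vu(U_list, Xu_list, Vu, n_iter=100, eps=1e-12):
        """Multiplicative-style update for the unlabelled encoding block V^u."""
        P = sum(U.T @ U for U in U_list)                          # P   = sum_v U^(v)T U^(v)
        Qu = sum(U.T @ Xu for U, Xu in zip(U_list, Xu_list))      # Q^u = sum_v U^(v)T X^(v),u
        for _ in range(n_iter):
            Vu = Vu * Qu / np.maximum(P @ Vu, eps)                # ratio form from dG^u/dV^u = 0
            Vu = np.minimum(Vu, 1.0)                              # enforce the upper bound V_kj <= 1
        return Vu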
5 Experiment
In this section, we conduct the experiments on two real-world data sets to vali-
date the effectiveness of the proposed algorithm MvSL.
5.2 Baselines
• NMF [13].
• Feature concatenation (ConcatNMF): This method concatenates feature
vectors of different views to form a united representation and then applies
NMF.
• Multi-view NMF (MultiNMF): MultiNMF [18] is an unsupervised multi-
view NMF algorithm.
• Semi-supervised Unified Latent Factor method (SULF): SULF [10] is a semi-
supervised multi-view nonnegative factorization method which models par-
tial label information as a factorization constraint on Vl .
• Graph regularized NMF (GNMF): GNMF [2] is a manifold regularized ver-
sion of NMF. We extended it to the multi-view case and replaced the affinity
graph for approximating data manifolds with the within-class affinity graph
defined in Eq. (4) to make it a semi-supervised method on multi-view data.
Labeled Percentage (%)   NMF-b        ConcatNMF    MultiNMF     SULF         GNMF         MvSL
10                       61.55±1.08   63.04±1.67   63.69±1.52   67.93±1.92   68.93±1.77   70.56±1.21
20                       65.71±1.37   66.09±1.08   67.42±1.97   68.40±1.64   70.59±1.65   72.67±1.02
30                       67.30±0.27   68.40±1.91   69.16±1.52   70.05±1.48   71.80±1.24   74.78±1.34
40                       68.41±1.96   69.81±1.96   70.28±1.83   71.86±1.38   72.23±1.54   75.87±1.26
50                       70.44±1.72   70.75±2.03   71.81±1.47   72.78±1.44   73.78±1.75   77.33±0.79

Labeled Percentage (%)   NMF-b        ConcatNMF    MultiNMF     SULF         GNMF         MvSL
10                       24.56±0.98   27.41±0.83   26.26±0.95   27.47±1.03   28.03±1.17   30.92±0.44
20                       25.37±0.85   31.24±0.93   30.39±1.12   30.94±1.25   31.55±1.14   33.83±1.52
30                       26.09±0.71   32.47±0.80   31.85±0.87   33.13±0.87   34.15±0.51   35.80±0.68
40                       28.03±0.46   34.25±0.71   33.48±0.65   34.94±0.65   35.26±0.97   37.12±0.73
50                       28.06±0.28   35.08±0.48   34.33±0.56   36.32±0.56   36.61±0.57   38.16±0.65
In general, these results indicate that exploiting label information can lead to latent spaces with better discriminative structure. Secondly, from the comparison between the multi-view algorithms and the single-view algorithm (NMF), it is easy to see that multi-view algorithms are preferable for multi-view data. This is in accord with the results of previous multi-view learning work. Thirdly, MvSL and GNMF show superior performance over SULF. SULF models partial label information as a factorization constraint on Vl, which can be viewed as an indirect affinity constraint on the encodings of within-class items. In contrast, the graph embedding terms in MvSL and GNMF impose direct affinity constraints on item encodings and therefore can lead to more explicit semantic structure in the learned latent spaces. Finally, MvSL outperformed the baseline methods in all cases. The reason should be that MvSL not only directly exploits label information via the graph embedding framework, but also imposes L1,2-norm regularization on each U(v), which encourages each latent dimension to be associated with only a subset of views. These properties help to learn a clearer semantic latent space.
There are two essential parameters in the proposed method. β measures the importance of the semi-supervised part of MvSL (i.e. the graph embedding regularization terms), while α controls the degree of sparsity of the basis matrices. We investigate their influence on MvSL's performance by varying one while fixing the other. The classification results are shown in Figure 1 for MM2.0 and Reuters. We found that the general behavior of the two parameters was the same: when increasing the parameter from 0, the performance curves first went up and then went down. This indicates that, when assigned moderate weights, the sparseness and semi-supervised constraints indeed helped learn a better latent subspace. Based on these observations, we set α = 10, β = 0.02 for the experiments.
Fig. 1. Influence of different parameter settings on the performance of MvSL: (a) vary-
ing α while setting β = 0.02 , (b) varying β while setting α = 10
6 Conclusion
We have proposed Multi-view Semantic Learning (MvSL), a novel nonnegative latent representation learning algorithm for multi-view data. MvSL tries to learn a semantic latent subspace of items by exploiting both multiple views of items and partial label information. The partial label information is used to construct a graph embedding framework, which encourages items of the same category to be near each other and keeps items belonging to different categories as distant as possible in the latent subspace. Moreover, kernel alignment is used to estimate the pairwise similarity between items from the multi-view data, which further strengthens the graph embedding framework. Another novel property of MvSL is that it allows each latent dimension to be associated with a subset of views by imposing the L1,2-norm on each basis matrix U(v). Therefore, MvSL is able to learn flexible latent factor sharing structures, which can lead to more meaningful semantic latent subspaces. An efficient multiplicative iterative algorithm is developed to solve the proposed optimization problem. Classification experiments on two real-world data sets have demonstrated the effectiveness of our method.
Graph embedding is a general framework and different definitions of the
within class affinity graph Ga and the discriminative graph Gp can be employed.
How to propose more suitable similarity criteria with multi-view data is an
interesting direction for further study.
References
1. Amini, M., Usunier, N., Goutte, C.: Learning from multiple partially observed
views-an application to multilingual text categorization. In: Advances in Neural
Information Processing Systems, pp. 28–36 (2009)
2. Cai, D., He, X., Han, J., Huang, T.S.: Graph regularized nonnegative matrix fac-
torization for data representation. IEEE Transactions on Pattern Analysis and
Machine Intelligence 33(8), 1548–1560 (2011)
3. Chen, N., Zhu, J., Xing, E.P.: Predictive subspace learning for multi-view data: a
large margin approach. In: Advances in Neural Information Processing Systems,
pp. 361–369 (2010)
4. Han, Y., Wu, F., Tao, D., Shao, J., Zhuang, Y., Jiang, J.: Sparse unsupervised
dimensionality reduction for multiple view data. IEEE Transactions on Circuits
and Systems for Video Technology 22(10), 1485–1496 (2012)
5. He, J., Chang, S.-F., Xie, L.: Fast kernel learning for spatial pyramid matching.
In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008,
pp. 1–7. IEEE (2008)
6. Hidru, D., Goldenberg, A.: Equinmf: Graph regularized multiview nonnegative
matrix factorization (2014). arXiv preprint arXiv:1409.4018
7. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)
8. Hoyer, P.O.: Non-negative sparse coding. In: Proceedings of the 2002 12th IEEE
Workshop on Neural Networks for Signal Processing, pp. 557–565 (2002)
9. Jia, Y., Salzmann, M., Darrell, T.: Factorized latent spaces with structured spar-
sity. In: Advances in Neural Information Processing Systems, pp. 982–990 (2010)
10. Jiang, Y., Liu, J., Li, Z., Lu, H.: Semi-supervised unified latent factor learning with
multi-view data. Machine Vision and Applications 25(7), 1635–1645 (2014)
11. Kalayeh, M., Idrees, H., Shah, M.: NMF-KNN: Image annotation using weighted
multi-view non-negative matrix factorization. In: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 184–191 (2014)
12. Kim, J., Monteiro, R., Park, H.: Group sparsity in nonnegative matrix factoriza-
tion. In: SDM, pp. 851–862. SIAM (2012)
13. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix fac-
torization. Nature 401(6755), 788–791 (1999)
14. Li, H., Wang, M., Hua, X.-S.: Msra-mm 2.0: A large-scale web multimedia dataset.
In: IEEE International Conference on Data Mining Workshops, pp. 164–169. IEEE
(2009)
15. Lin, C.-J.: Projected gradient methods for nonnegative matrix factorization. Neural
computation 19(10), 2756–2779 (2007)
16. Liu, H., Wu, Z., Li, X., Cai, D., Huang, T.S.: Constrained nonnegative matrix
factorization for image representation. IEEE Transactions on Pattern Analysis and
Machine Intelligence 34(7), 1299–1311 (2012)
17. Liu, J., Jiang, Y., Li, Z., Zhou, Z.-H., Lu, H.: Partially shared latent factor learning
with multiview data (2014)
18. Liu, J., Wang, C., Gao, J., Han, J.: Multi-view clustering via joint nonnegative
matrix factorization. In: Proc. of SDM, vol. 13, pp. 252–260 (2013)
19. Sha, F., Lin, Y., Saul, L.K., Lee, D.D.: Multiplicative updates for nonnegative
quadratic programming. Neural Computation 19(8), 2004–2031 (2007)
20. Shawe-Taylor, N., Kandola, A.: On kernel target alignment. Advances in neural
information processing systems 14, 367 (2002)
21. Wang, Y., Jia, Y.: Fisher non-negative matrix factorization for learning local fea-
tures. In: Proc. Asian Conf. on Comp. Vision. Citeseer (2004)
22. Xia, T., Tao, D., Mei, T., Zhang, Y.: Multiview spectral embedding. IEEE Trans-
actions on Systems, Man, and Cybernetics, Part B: Cybernetics 40(6), 1438–1446
(2010)
23. Xu, C., Tao, D., Xu, C.: A survey on multi-view learning. arXiv preprint
arXiv:1304.5634 (2013)
24. Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., Lin, S.: Graph embedding and
extensions: a general framework for dimensionality reduction. IEEE Transactions
on Pattern Analysis and Machine Intelligence 29(1), 40–51 (2007)
25. Zafeiriou, S., Tefas, A., Buciu, I., Pitas, I.: Exploiting discriminant information in
nonnegative matrix factorization with application to frontal face verification. IEEE
Transactions on Neural Networks 17(3), 683–695 (2006)
Unsupervised Feature Analysis with Class
Margin Optimization
Sen Wang1 , Feiping Nie2 , Xiaojun Chang3(B) , Lina Yao4 , Xue Li1 ,
and Quan Z. Sheng4
1 School of ITEE, The University of Queensland, Brisbane, Australia
  sen.wang@uq.edu.au, xueli@itee.uq.edu.au
2 Center for OPTIMAL, Northwestern Polytechnical University, Shaanxi, China
  feiping.nie@gmail.com
3 Center for QCIS, University of Technology Sydney, Ultimo, Australia
  cxj273@gmail.com
4 School of CS, The University of Adelaide, Adelaide, Australia
  lina@cs.adelaide.edu.au, michael.sheng@adelaide.edu.au
1 Introduction
Over the past few years, data are more than often represented by high-
dimensional features in a number of research fields, e.g. data mining, computer
vision, etc. With the invention of so many sophisticated data representations, one problem has never lacked research attention: how to select the most distinctive features from high-dimensional data for subsequent learning tasks, e.g. classification? To answer this question, we take two points into account. First, the number of selected features should be smaller than the total number of features. Due to the lower-dimensional representation, the subsequent learning tasks undoubtedly benefit in terms of efficiency [31]. Second, the selected features should have more discriminant power than the original full feature set. Many previous works have proven that removing noisy and irrelevant features can improve discriminant power in most cases. In light of the advantages of feature selection, many new algorithms have flourished recently for various types of applications [29,32,33].
According to the types of supervision, feature selection can be generally
divided into three categories, i.e. supervised, semi-supervised, and unsupervised
feature selection algorithms. Representative supervised feature selection algo-
rithms include Fisher score [6], Relief [11] and its extension ReliefF [12], information gain [20], etc. [25]. Label information of training data points is utilized to
guide the supervised feature selection methods to seek distinctive subsets of fea-
tures with different search strategies, i.e. complete search, heuristic search, and
non-deterministic search. In the real world, class information is quite limited,
resulting in the development of semi-supervised feature selection methods [3,4],
in which both labeled and unlabeled data are utilized.
In unsupervised scenarios, feature selection is more challenging, since there is
no class information to use for selecting features. In the literature, unsupervised
feature selection can be roughly categorized into three groups, i.e. filter, wrap-
per, and embedded methods. Filter-based unsupervised feature selection methods
rank features according to some intrinsic properties of data. Then those features
with higher scores are selected for further learning tasks. The selection is independent of the subsequent process. For example, He et al. [8] assume that
data from the same class are often close to each other and use the locality pre-
serving power of data, also termed as Laplacian Score, to evaluate importance
degrees of features. In [30], a unified framework has been proposed for both
supervised and unsupervised feature selection schemes using a spectral graph.
Tabakhi et al. [23] have proposed an unsupervised feature selection method to
select the optimal feature subset in an iterative algorithm, which is based on
ant colony optimization. Wrapper-based methods, a more sophisticated approach, wrap learning algorithms and use the learned results to select distinctive subsets of features. In [15], for instance, the authors developed a model that selects relevant features using two backward stepwise selection algorithms without prior knowledge of the features. Normally, wrapper-based methods perform better than filter-based methods, since they exploit learning algorithms; the disadvantage is that their computation is more expensive. Embedded methods seek a trade-off between
them by integrating feature selection and clustering together into a joint frame-
work. Because clustering algorithms can provide pseudo labels that can reflect
the intrinsic information of data, some works [1,14,26] incorporate different clus-
tering algorithms in objective functions to select features.
Most of the existing unsupervised feature selection methods [8,9,14,19,24,30]
rely on a graph, e.g. graph Laplacian, to reflect intrinsic relationships among
data, labeled and unlabeled. When the number of data points is extremely large, the computational burden of constructing a graph Laplacian is significant.
Meanwhile, some traditional feature selection algorithms [6,8] neglect correla-
tions among features. The distinctive features are individually selected according
to the importance of each feature rather than taking correlations among fea-
tures into account. Recently, exploiting feature correlations has attracted much
research attention [5,17,18,27,28]. It has been proven that discovering feature correlation is beneficial to feature selection.
In this paper, we propose a graph-free method to select features by combining
Maximum Margin Criterion with feature correlation mining into a joint frame-
work. Specifically, on the one hand, the method aims to learn a feature coefficient matrix that linearly combines features to maximize the class margins: while increasing the separability of the entire transformed data by maximizing the total scatter, the proposed method also requires the distances between data points within the same class to be minimized after the linear transformation by the coefficient matrix. Since there is no class information to borrow from, K-means clustering is jointly embedded in the framework to provide pseudo labels. On the other hand, inspired by recent feature selection works that use sparsity-based models on the regularization term [4], the proposed algorithm learns sparse structural information of the coefficient matrix, with the goal of reducing noisy and irrelevant features by removing those features whose coefficients are zero.
The main contributions of this paper can be summarized as follows:
– The proposed method maximizes class margins in a framework that simultaneously considers the separability of the transformed data and the distances between transformed data within the same class.
Besides, a sparsity-based regularization model is jointly applied on the fea-
ture coefficient matrix to analyze correlations among features in an iterative
algorithm;
– K-means clustering is embedded into the framework to generate cluster labels, which are used as pseudo labels. Both maximizing class margins and learning sparse structures benefit from the generated pseudo labels during each iteration;
– Because the performance of K-means is dominated by its initialization, we propose a strategy to prevent our algorithm from rapidly converging to a local optimum, an issue largely ignored by most existing approaches that use K-means clustering. A theoretical proof of convergence is also provided.
– We have conducted extensive experiments over six benchmark datasets. The
experimental results show that our method has better performance than all
the compared unsupervised algorithms.
The rest of this paper is organized as follows: notations and definitions used throughout the paper are given in Section 2, and our method is presented in Section 3.
where $n_i$ is the number of data points in the $i$-th class, and $S_t = S_w + S_b$. Other notations and definitions will be explained when they are used.
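As a quick illustration of these quantities, the following is a minimal NumPy sketch that computes the total, within-class, and between-class scatter matrices under their standard definitions and checks the identity $S_t = S_w + S_b$; the function and variable names are ours, not the paper's.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Total, within-class, and between-class scatter of X (d x n)."""
    d, n = X.shape
    mean_all = X.mean(axis=1, keepdims=True)
    St = (X - mean_all) @ (X - mean_all).T                 # total scatter
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mean_c = Xc.mean(axis=1, keepdims=True)
        Sw += (Xc - mean_c) @ (Xc - mean_c).T              # within-class scatter
        Sb += Xc.shape[1] * (mean_c - mean_all) @ (mean_c - mean_all).T
    return St, Sw, Sb

# Sanity check of S_t = S_w + S_b on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 30))
labels = np.repeat(np.arange(3), 10)                       # 10 samples per class
St, Sw, Sb = scatter_matrices(X, labels)
assert np.allclose(St, Sw + Sb)
```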
3 Proposed Method
We aim to learn a transformation matrix which can project the data into a new space where the original data are more separable. PCA is the most popular approach to analyze the separability
of features. PCA aims to seek directions along which the transformed data have maximum variance. In other words, PCA maximizes the separability of the linearly transformed data by maximizing the covariance:
$$\max_{W} \sum_{i=1}^{n} \big(W^T(x_i - \bar{x})\big)^T \big(W^T(x_i - \bar{x})\big).$$
Without loss of generality, we assume the data have zero mean, i.e. $\bar{x} = 0$. Recalling the definition of the total scatter of the data, PCA is equivalent to maximizing the total scatter. However, if only the total scatter is considered as a separability measure, the within-class scatter might also be geometrically maximized along with the total scatter, which is not helpful for distinctive feature discovery. The representative model, LDA, solves this problem by maximizing the Fisher criterion:
$$\max_{W} \frac{W^T S_b W}{W^T S_w W}.$$
However, LDA and its variants require class information to construct the between-class and within-class scatter matrices
[2], which is not suitable for unsupervised feature selection. Before we give the
objective that can solve the aforementioned problem, we first look at a supervised
feature selection framework:
$$\max_{W} \; \sum_{i=1}^{n} (W^T x_i)^T (W^T x_i) \;-\; \alpha \sum_{i=1}^{c} \sum_{j=1}^{n_i} \big(W^T(x_j - \bar{x}_i)\big)^T \big(W^T(x_j - \bar{x}_i)\big) \;-\; \beta\,\Omega(W) \quad \text{s.t. } W^T W = I, \qquad (3)$$
where α and β are regularization parameters. In this framework, the first term is
to maximize the total scatter, while the second term is to minimize the within-
class scatter. The third part is a sparsity-based regularization term which con-
trols the sparsity of W. This model is quite similar to classical LDA-based methods. Since there is no class information in the unsupervised scenario, we need virtual labels to minimize the distances between data within the same class while maximizing the total separability at the same time. To achieve this goal, we
apply K-means clustering in our framework to replace the ground truth by gen-
erating cluster indicators of data. Given $c$ centroids $G = [g_1, \ldots, g_c] \in \mathbb{R}^{d' \times c}$,
the objective function of the traditional K-means algorithm aims to minimize
the following function:
$$\sum_{i=1}^{c} \sum_{y_j \in Y_i} (y_j - g_i)^T (y_j - g_i) \;=\; \sum_{i=1}^{n} (y_i - G u_i^T)^T (y_i - G u_i^T), \qquad (4)$$
Replacing the within-class scatter term in (3) with the K-means objective (4), applied to the transformed data $y_i = W^T x_i$, yields:
$$\max_{W} \; \sum_{i=1}^{n} (W^T x_i)^T (W^T x_i) \;-\; \alpha \sum_{i=1}^{n} (W^T x_i - G u_i^T)^T (W^T x_i - G u_i^T) \;-\; \beta\,\Omega(W) \quad \text{s.t. } W^T W = I, \qquad (5)$$
As mentioned above, the sparsity-based regularization term has been widely
used to find out correlated structures among features. The motivation behind
this is to exploit sparse structures of the feature coefficient matrix. By imposing
the sparse constraint, some of the rows of the feature coefficient matrix shrink
to zeros. Those features corresponding to non-zero coefficients are selected as
the distinctive subset of features. In this way, noisy and redundant features
can be removed. This sparsity-based regularization has been applied in various
problems. Inspired by the “shrinking to zero” idea, we utilize a sparsity model
to uncover the common structures shared by features. To achieve that goal, we
propose to minimize the $\ell_{2,p}$-norm of the coefficient matrix, $\|W\|_{2,p}$ $(0 < p < 2)$. From the definition of $\|W\|_{2,p}$ in (1), outliers and the negative impact of irrelevant $w^i$'s are suppressed by minimizing the $\ell_{2,p}$-norm. Note that p is a parameter that controls the degree of correlated structures among features: the lower p is, the more shared structures among features are expected to be exploited. After a number of optimization steps, the optimal feature coefficient matrix W can be obtained. Thus, we impose the $\ell_{2,p}$-norm on the regularization term and re-write the objective function in matrix form as follows:
$$\max_{W, G, U} \; \mathrm{Tr}(W^T S_t W) \;-\; \alpha \|W^T X - G U^T\|_F^2 \;-\; \beta \|W\|_{2,p} \quad \text{s.t. } W^T W = I, \qquad (6)$$
where U is an indicator matrix, $\mathrm{Tr}(\cdot)$ is the trace operator, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. Our proposed method integrates the Maximum
Margin Criterion and sparse regularization into a joint framework. Embedding
K-means into the framework not only minimizes the distances between within-
class data while maximizing total data separability, but also provides cluster
labels. The cluster centroids generated by K-means can further guide the sparse
structure learning on the feature coefficient matrix in each iterative step of our
solution, which will be explained in the next section. We name this method Unsupervised Feature analysis with Class Margin optimization (UFCM).
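To make the objective concrete, the following minimal NumPy sketch evaluates (6) for given $W$, $G$, and $U$. Since the paper's Eq. (1) is not reproduced in this excerpt, the row-wise definition $\|W\|_{2,p} = (\sum_i \|w^i\|_2^p)^{1/p}$ is assumed for the regularizer, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def l2p_norm(W, p):
    """Row-wise l_{2,p}-norm, assuming ( sum_i ||w^i||_2^p )^(1/p)."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.sum(row_norms ** p) ** (1.0 / p)

def ufcm_objective(W, G, U, X, St, alpha, beta, p):
    """Value of Eq. (6): Tr(W^T St W) - alpha*||W^T X - G U^T||_F^2 - beta*||W||_{2,p}.

    X is d x n, W is d x d', G is d' x c, U is an n x c cluster indicator matrix.
    """
    total_scatter_term = np.trace(W.T @ St @ W)
    kmeans_term = np.linalg.norm(W.T @ X - G @ U.T, ord='fro') ** 2
    return total_scatter_term - alpha * kmeans_term - beta * l2p_norm(W, p)
```

In the alternating optimization of the next section, this value should increase monotonically over the iterations, which is what Figure 5 later illustrates.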
4 Optimization
In this section, we present our solution to the objective function in (6). Since the
$\ell_{2,p}$-norm is used to exploit sparse structures, the objective function cannot be
solved in a closed form. Meanwhile, the objective function is not jointly convex
with respect to three variables, i.e. W, G, U . Thus, we propose to solve the
problem as follows.
We define a diagonal matrix D whose diagonal entries are given by:
$$D_{ii} = \frac{1}{\frac{2}{p}\,\|w^i\|_2^{2-p}}. \qquad (7)$$
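A small sketch of this reweighting matrix, following (7) as reconstructed above; the constant `eps` is our addition to guard against zero rows and is not part of the paper's formulation.

```python
import numpy as np

def reweighting_matrix(W, p, eps=1e-12):
    """Diagonal D with D_ii = 1 / ((2/p) * ||w^i||_2^(2-p)), cf. Eq. (7)."""
    row_norms = np.linalg.norm(W, axis=1) + eps   # eps avoids division by zero
    return np.diag(1.0 / ((2.0 / p) * row_norms ** (2.0 - p)))
```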
$\Rightarrow \; G = W^T X U (U^T U)^{-1}$
Substituting Equation (10) into Equation (8), we have:
Proof. Assume that, in the i-th iteration, the transformation matrix W and the cluster centroid matrix G have been derived as $W_i$ and $G_i$. In the (i + 1)-th
5 Experiments
In this section, experimental results will be presented together with related
analysis. We compare our method with seven approaches over six benchmark
datasets. Besides, we also conduct experiments to evaluate performance variations in different aspects, including the impact of the number of selected features, the validation of the feature correlation analysis, and parameter sensitivity analysis. Lastly, a convergence demonstration is shown.
– UMIST: UMIST, which is also known as the Sheffield Face Database, con-
sists of 564 images of 20 individuals. Each individual is shown in a variety
of poses from profile to frontal views.
– USPS [10]: This dataset collects 9,298 images of handwritten digits (0-9)
from envelopes by the U.S. Postal Service. All images have been normalized
to the same size of 16 × 16 pixels in gray scale.
– YaleB [7]: It consists of 2,414 frontal face images of 38 subjects. Differ-
ent lighting conditions have been considered in this dataset. All images are
reshaped into 32 × 32 pixels.
Pixel values are used as features. Details of the data sets used in this paper are summarized in Table 1.
Fig. 1. Performance variation results with respect to the number of selected features
using the proposed algorithm over three data sets, COIL20, MNIST, and USPS.
Fig. 2. Performance variation results with respect to β over three data sets, COIL20, MNIST, and USPS.
0.7813 (600 selected features) are observed on MNIST and USPS, respectively.
3) When all features are in use, the performance is worse than the best. Similar trends can also be observed on the other data sets. We conclude that our algorithm can select distinctive features.
To demonstrate that exploiting feature correlation is beneficial to performance, we conduct an experiment in which the parameters α and p are both fixed at 1 and β varies in the range [0, $10^{-3}$, $10^{-2}$, $10^{-1}$, 1, $10^{1}$, $10^{2}$, $10^{3}$]. The performance variation results with respect to different values of β are plotted in Figure 2. The experiment is conducted over three data sets, i.e. COIL20, MNIST, and USPS. From the results, we observe that the performance is relatively low when no correlation is exploited in the framework, i.e. β = 0. The performance always peaks at a certain point when a proper degree of sparsity is imposed on the regularization term. For example, the performance is only 0.6993 when β = 0 on COIL20, and increases to 0.7285 when β = $10^{1}$. Similar observations are obtained on the other data sets. We can conclude that sparse structure learning on the feature coefficient matrix contributes to the performance of our unsupervised feature selection method.
Fig. 3. Performance variations (ACC) under different combinations of α and p, with β fixed at $10^{-1}$, over COIL20, MNIST, and USPS.
Fig. 4. Performance variations (ACC) under different combinations of β and p, with α fixed at $10^{-1}$, over COIL20, MNIST, and USPS.
Fig. 5. Objective function values of our proposed objective function in (6) over three
data sets, COIL20, MNIST, and USPS.
how they influence performance. Firstly, we fix β = $10^{-1}$ and show the performance variations under different combinations of α and p in Figure 3. Secondly, α is fixed at $10^{-1}$, and the performance variation results with respect to different values of β and p are shown in Figure 4. Both α and β vary in the range [$10^{-3}$, $10^{-1}$, $10^{1}$, $10^{3}$], while p changes in [0.5, 1.0, 1.5]. We use only ACC as the metric.
To validate that our algorithm monotonically increases the objective function value in (6), we conduct a further experiment in which all parameters (α, β, and p) in (6) are fixed at 1. The objective function values and the corresponding iteration numbers are plotted in Figure 5. We take COIL20, MNIST, and USPS as examples; similar observations are obtained on the other data sets. From the figure, it can be seen that our algorithm converges to the optimum, usually within eight iterations, on all three data sets. We can then conclude that the proposed method is efficient and effective.
6 Conclusion
References
1. Cai, D., Zhang, C., He, X.: Unsupervised feature selection for multi-cluster data.
In: SIGKDD (2010)
2. Chang, X., Nie, F., Wang, S., Yang, Y.: Compound rank-k projections for bilinear
analysis. IEEE Trans. Neural Netw. Learning Syst. (2015)
3. Chang, X., Nie, F., Yang, Y., Huang, H.: A convex formulation for semi-supervised
multi-label feature selection. In: AAAI (2014)
4. Chang, X., Shen, H., Wang, S., Liu, J., Li, X.: Semi-supervised feature analysis
for multimedia annotation by mining label correlation. In: Tseng, V.S., Ho, T.B.,
Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014, Part II. LNCS, vol.
8444, pp. 74–85. Springer, Heidelberg (2014)
5. Chang, X., Yang, Y., Hauptmann, A.G., Xing, E.P., Yu, Y.: Semantic concept
discovery for large-scale zero-shot event detection. In: IJCAI (2015)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. John Wiley & Sons
(2012)
7. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.: From few to many: Illu-
mination cone models for face recognition under variable lighting and pose.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23(6),
643–660 (2001)
8. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: NIPS (2005)
9. Hou, C., Nie, F., Li, X., Yi, D., Wu, Y.: Joint embedding learning and sparse
regression: A framework for unsupervised feature selection. IEEE T. Cybernetics
44(6), 793–804 (2014)
10. Hull, J.J.: A database for handwritten text recognition research. IEEE Transactions
on Pattern Analysis and Machine Intelligence (TPAMI) 16(5), 550–554 (1994)
11. Kira, K., Rendell, L.A.: A practical approach to feature selection. In: IWML (1992)
12. Kononenko, I.: Estimating attributes: analysis and extensions of relief. In:
Bergadano, F., De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784, pp. 171–182.
Springer, Heidelberg (1994)
13. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
14. Li, Z., Yang, Y., Liu, J., Zhou, X., Lu, H.: Unsupervised feature selection using
nonnegative spectral analysis. In: AAAI (2012)
15. Maugis, C., Celeux, G., Martin-Magniette, M.L.: Variable selection for clustering
with gaussian mixture models. Biometrics 65(3), 701–709 (2009)
16. Nene, S.A., Nayar, S.K., Murase, H., et al.: Columbia object image library (coil-20).
Technical Report CUCS-005-96 (1996)
17. Nie, F., Huang, H., Cai, X., Ding, C.H.Q.: Efficient and robust feature selection
via joint l2, 1-norms minimization. In: NIPS (2010)
18. Nie, F., Huang, H., Cai, X., Ding, C.H.: Efficient and robust feature selection via
joint 2, 1-norms minimization. In: NIPS, pp. 1813–1821 (2010)
19. Qian, M., Zhai, C.: Robust unsupervised feature selection. In: IJCAI (2013)
20. Raileanu, L.E., Stoffel, K.: Theoretical comparison between the gini index and
information gain criteria. AMAI (2004)
21. Samaria, F.S., Harter, A.C.: Parameterisation of a stochastic model for human face
identification. In: IEEE Workshop on Applications of Computer Vision (1994)
22. Strehl, A., Ghosh, J.: Cluster ensembles–a knowledge reuse framework for combin-
ing multiple partitions. Journal of Machine Learning Research (JMLR) 3, 583–617
(2003)
23. Tabakhi, S., Moradi, P., Akhlaghian, F.: An unsupervised feature selection algo-
rithm based on ant colony optimization. Engineering Applications of Artificial
Intelligence 32, 112–123 (2014)
24. Wang, D., Nie, F., Huang, H.: Unsupervised feature selection via unified trace
ratio formulation and K -means clustering (TRACK). In: Calders, T., Esposito, F.,
Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014, Part III. LNCS, vol. 8726,
pp. 306–321. Springer, Heidelberg (2014)
25. Wang, S., Chang, X., Li, X., Sheng, Q.Z., Chen, W.: Multi-task support vector
machines for feature selection with shared knowledge discovery. Signal Processing
(December 2014)
26. Wang, S., Tang, J., Liu, H.: Embedded unsupervised feature selection. AAAI (2015)
27. Yang, Y., Shen, H.T., Ma, Z., Huang, Z., Zhou, X.: l2, 1-norm regularized discrim-
inative feature selection for unsupervised learning. In: IJCAI (2011)
28. Yang, Y., Zhuang, Y., Wu, F., Pan, Y.: Harmonizing hierarchical manifolds for
multimedia document semantics understanding and cross-media retrieval. IEEE
Transactions on Multimedia 10(3), 437–446 (2008)
29. Yang, Y., Ma, Z., Hauptmann, A.G., Sebe, N.: Feature Selection for Multime-
dia Analysis by Sharing Information Among Multiple Tasks. IEEE TMM 15(3),
661–669 (2013)
30. Zhao, Z., Liu, H.: Spectral feature selection for supervised and unsupervised learn-
ing. In: ICML (2007)
31. Zhu, X., Huang, Z., Yang, Y., Shen, H.T., Xu, C., Luo, J.: Self-taught dimensional-
ity reduction on the high-dimensional small-sized data. Pattern Recognition 46(1),
215–229 (2013)
32. Zhu, X., Suk, H.-I., Shen, D.: Matrix-similarity based loss function and feature
selection for alzheimer’s disease diagnosis. In: IEEE CVPR, pp. 3089–3096 (2014)
33. Zhu, X., Suk, H.-I., Shen, D.: Discriminative feature selection for multi-class
alzheimer’s disease classification. In: MLMI, pp. 157–164 (2014)
Data Streams and Online Learning
Ageing-Based Multinomial Naive Bayes
Classifiers Over Opinionated Data Streams
1 Introduction
Nowadays, we experience an increasing interest in word-of-mouth communication in social media, including opinion sharing [16]. A vast amount of voluntary and bona fide feedback accumulates, referring to products, persons, events, etc. Opinionated information is valuable for consumers, who benefit from the experiences of other consumers in order to make better buying decisions [13], but also for vendors, who can get insights on what customers like and dislike [18]. The extraction of such insights requires a proper analysis of the opinionated data.
In this work, we address the issue of polarity learning over opinionated
streams. The accumulating opinionated documents are subject to different forms
of drift: the subjects discussed change, the attitude of people towards specific
2 Related Work
Stream mining algorithms typically assume that the most recent data are the
most informative, and thus employ different strategies to downgrade old, obsolete
data. In a recent survey [6], Gama et al. discuss two forgetting mechanisms: (i)
abrupt forgetting where only recent instances, within a sliding window, contribute
to the model, and (ii) gradual forgetting where all instances contribute to the
model but with a weight that is regulated by their age. In the context of our
study, the forgetting strategy also affects the vocabulary – the feature space. In
particular, if a set of documents is deleted (abrupt forgetting), all words that
are in them but in none of the more recent documents are also removed from the
feature space. This may harm the classifier, because such words may re-appear
soon after their removal. Therefore, we opt for gradual forgetting.
Another approach for selecting features/ words for polarity classification in
a dynamic environment is presented in [11]. In this approach, selection does not
(Footnote 1: An exception is our own prior work [24,25].)
rely on the age of the words but rather on their usefulness, defined as their contri-
bution to the classification task. Usefulness is used in [11] as a selection criterion,
when the data volume is high, but is also appropriate for streams with recur-
ring concepts. Concept recurrence is studied e.g. in [8] (where meta-classifiers
are trained on data referring to a given concept), and in [12] (where a concept
is represented by a data bucket, and recurrence refers to similar buckets). Such
methods can be beneficial for opinion stream classification, if all encountered
opinionated words can be linked to a reasonably small number of concepts that
do recur. In our study, we do not pose this requirement; word polarity can be
assessed in our MNB model, without linking the words to concepts.
Multinomial Naive Bayes (MNB) [14] is a popular classifier due to its sim-
plicity and good performance, despite its assumption on the class-conditional
independence of the words [5,22]. Its simplicity and easy online maintenance make it particularly appealing for data streams. As pointed out in Section
1, MNB is particularly appropriate for adaptation to an evolving vocabulary. In
[24,25], we present functions that recompute the class probabilities of each word.
In this study, we use different functions, as explained in the last paragraph of
this section.
Bermingham et al. [2] compared the performance of Support Vector
Machines (SVM) and MNB classifiers on microblog data and reviews (not
streams) and showed that MNB performs well on short-length, opinion-rich
microblog messages (rather than on long texts). In [10], popular classification algorithms such as MNB, Random Forest, Bayesian Logistic Regression, and SVMs trained with sequential minimal optimization were studied for classification in Twitter streams, building classifiers on different samples. Across the tested classifiers, MNB showed the best performance on all applied data sets.
In [3], MNB has been compared to Stochastic Gradient Descend (SGD) and
Hoeffding Trees for polarity classification on streams. Their MNB approach is
incremental, i.e., it accumulates information on class appearances and word-in-class appearances over the stream; however, it does not forget anything. Their
experiments showed that MNB had the largest difficulty in dealing with drifts in
the stream population, although its performance in times of stability was very
good. Regarding runtime, MNB was the fastest model due to its simplicity in
predictions but also due to the easy incorporation of new instances in the model.
The poor performance of MNB in times of change was also observed in [21], and
triggered our ageing-based MNB approach.
Closest to our approach is our earlier work [24]. There, MNB is at the core of a polarity learner that uses two adaptation techniques: i) a forward adaptation
technique that selects “useful” instances from the stream for model update and
ii) a backward adaptation technique that downgrades the importance of old
words from the model based on their age. There are two differences between
that earlier method (and our methods that build upon it, e.g. [25]) and the
work proposed here. First, the method in [24] is semi-supervised: after receiving
an initial seed of labeled documents, it relies solely on the labels it derives
from the learner. Backward adaptation is not performing well in that scenario,
presumably because the importance of the words in the initial seed of documents
diminishes over time. Furthermore, in [24], the ageing of the word-class counts
is based directly upon the age of the original documents containing the words.
The word-class counts are weighted locally, i.e., within the documents containing
the words, and the final word-class counts are aggregations of these local scores.
In our current work, we do not monitor the age of the documents. Rather, we treat the words as first-class objects, which age with time. Therefore, the ageing
of a word depends solely on the last time the word has been observed in some
document from the stream.
3 Basic Concepts
We observe a stream S of opinionated documents arriving at distinct timepoints
t0 , . . ., ti , . . .; at each ti a batch of documents might arrive. The definition of
the batch depends on the application per se: i) one can define the batch at the
instance level, i.e., a fixed number of instances is received at each timepoint or ii)
at the temporal level, i.e., the batch consists of the instances arriving within each
time period, e.g. on a daily basis if day is the considered temporal granularity.
A document $d \in S$ is represented by the bag-of-words model, and for each word $w_i \in d$ its frequency $f_i^d$ is also stored.
Our goal is to build a polarity classifier for predicting the polarity of newly arriving documents. As is typical in streams, the underlying population might undergo changes over time, referred to in the literature as concept drift. The changes are caused by two factors: i) changes in the sentiment of existing words (for example, words have different sentiment in different contexts, e.g. the word "heavy" is negative for a camera, but positive for a solid wood piece of furniture); ii) new words might appear over time and old words might become obsolete (for example, new topics emerge all the time in the news and some topics are not mentioned anymore). The drift in the population might be gradual or drastic; the latter is also referred to as concept shift in the literature. A stream classifier should be able to adapt to drift while maintaining good predictive power. Besides the quality of predictions, another important factor for a stream classifier is fast adaptation to the underlying evolving stream population.
where P (c) is the prior probability of class c, P (wi |c) is the conditional proba-
bility that word wi belongs to class c and fid is the number of occurrences of wi
in document d. These probabilities are typically estimated based on a dataset
D with class labels (training set); we indicate the estimates from now on by a
“hat” as in P̂ .
The class prior P (c) is easily estimated as the fraction of the set of training
documents belonging to class c, i.e.,:
$$\hat{P}(c) = \frac{N_c}{|D|} \qquad (2)$$
forgotten, neither words nor class prior information. To make MNBs adaptable
to change, we introduce the notion of time (Section 4.1) and we show how such
a “temporal” model can be used for polarity learning in a stream environment
(Sections 4.2, 4.3).
The weight of the document d decays over time based on the time period elapsing
from the arrival of d and the decay factor λ. The higher the value of λ, the lower the impact of historical data compared to recent data. The ageing factor λ is critical for the ageing process, as it determines the contribution of old data to the model and how fast old data are forgotten. Another way of thinking of λ is that $\frac{1}{\lambda}$ is the period for an instance to lose half of its original weight. For example, if λ = 0.5 and timestamps correspond to days, this means that $\frac{1}{0.5} = 2$ days after its observation an instance will lose 50% of its weight. For λ = 0 there is no ageing and therefore the classifier is equivalent to
the accumulative MNB (cf. Section 3.1).
The timestamp of the document from the stream is “transferred” to its com-
ponent words and finally, to the MNB model. In particular, each word-class pair
(w, c) entry in the model is associated with
– the last observation time, tlo , which represents the last time that the word
w has been observed in the stream in a document of class c.
The $t_{lo}$ entry indicates how recent the last observation of word w in class c is.
Similarly, each class entry in the model is associated with a last observation
timestamp indicating the last time that the class was observed in the stream.
Based on the above, the ageing-based MNB model consists of the temporally annotated class prior counts $(N_c, t_{lo}^c)$ and class-conditional word counts $(N_{ic}, t_{lo}^{ic})$.
Hereafter, we focus on how such a temporal model can be used for predic-
tion while being maintained online. We distinguish between a normal fading
MNB approach (Section 4.2) and an aggressive/drastic fading MNB approach
(Section 4.3).
In order to predict the class label of a new document d arriving from the stream
S at timepoint t, we employ the ageing-based version of MNB: in particular, the
temporal information associated with the class prior counts and the class condi-
tional word counts is incorporated in the class prior estimation P̂ (c) and in the
class conditional word probability estimation P̂ (wi |c), i.e., in Equations 2 and 3,
respectively.
The updated temporal class prior for class $c \in C$ at timepoint t is given by:
$$\hat{P}^t(c) = \frac{N_c^t \cdot e^{-\lambda \cdot (t - t_{lo}^c)}}{|S^t|} \qquad (4)$$
The counts $N_{ic}$ are increased based on the frequency of $w_i$ in d, i.e., $N_{ic} \mathrel{+}= f_{w_i}^d$. The temporal counterpart of $N_c$ is updated w.r.t. the arrival time t of d, i.e., $t_{lo}^c = t$. Similarly, the temporal counterparts of all word-class combination counts $N_{ic}$ for words in d are updated, i.e., $t_{lo}^{ic} = t$. We refer to this method as fadingMNB hereafter.
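The following Python sketch illustrates the fadingMNB bookkeeping described above: per-class and per-word-class counts are stored together with their last observation times and are faded by $e^{-\lambda(t - t_{lo})}$ at prediction time (Eq. (4) for the class prior). The Laplace smoothing of the word probabilities is our placeholder, since the exact estimator for $\hat{P}(w_i|c)$ is not reproduced in this excerpt.

```python
import math
from collections import defaultdict

class FadingMNB:
    """Sketch of an ageing-based multinomial naive Bayes classifier."""

    def __init__(self, lam=0.5):
        self.lam = lam
        self.n_docs = 0                         # |S^t|: documents seen so far
        self.class_count = defaultdict(float)   # N_c
        self.class_tlo = {}                     # t_lo^c
        self.word_count = defaultdict(float)    # N_ic, keyed by (word, class)
        self.word_tlo = {}                      # t_lo^ic
        self.vocab = set()

    def _fade(self, t, t_lo):
        return math.exp(-self.lam * (t - t_lo))

    def update(self, doc, c, t):
        """doc: dict mapping word -> frequency; c: class label; t: timestamp."""
        self.n_docs += 1
        self.class_count[c] += 1.0
        self.class_tlo[c] = t
        for w, f in doc.items():
            self.word_count[(w, c)] += f
            self.word_tlo[(w, c)] = t
            self.vocab.add(w)

    def predict(self, doc, t):
        best_class, best_score = None, -math.inf
        v = max(len(self.vocab), 1)
        for c in self.class_count:
            # Temporal class prior as in Eq. (4)
            prior = self.class_count[c] * self._fade(t, self.class_tlo[c]) / self.n_docs
            score = math.log(prior + 1e-12)
            faded_total = sum(self.word_count[(w, c)] * self._fade(t, self.word_tlo[(w, c)])
                              for w in self.vocab if (w, c) in self.word_count)
            for w, f in doc.items():
                n_wc = self.word_count.get((w, c), 0.0)
                if n_wc > 0:
                    n_wc *= self._fade(t, self.word_tlo[(w, c)])
                p_wc = (n_wc + 1.0) / (faded_total + v)   # Laplace smoothing (placeholder)
                score += f * math.log(p_wc)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Keeping the model as dictionaries keyed by (word, class) leaves the vocabulary open-ended, which matches the adaptation to an evolving vocabulary emphasized in Section 2.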
5 Experiments
We use the TwitterSentiment dataset [19], introduced in [9]. The dataset was
collected by querying the Twitter API for messages between April 6, 2009 and
June 25, 2009. The query terms belong to different categories, such as com-
panies (e.g. query terms “aig”, “at&t”), products (e.g. query terms “kindle2”,
“visa card”), persons (e.g. “warren buffet”, “obama”), events (e.g. “indian elec-
tion”, “world cup”). Evidently, the stream is very heterogeneous. It has not been
labeled manually. Rather, the authors of [9] derived the labels with a Maximum
Entropy classifier that was trained on emoticons.
We preprocessed the dataset as in our previous work [21], including the following steps: (i) dealing with negation (e.g. replacing "not good" with "not_good", "not pretty" with "ugly", etc.), (ii) dealing with colloquial language (e.g. converting "luv" to "love" and "youuuuuuuuu" to "you"), (iii) elimination of superfluous words (e.g. Twitter signs like @ or #), stopwords (e.g. "the", "and"), special characters and numbers, and (iv) stemming (e.g. "fishing" and "fisher" were mapped to their root word "fish").
The final stream consists of 1,600,000 opinionated tweets, 50% of which are
positive and 50% negative (two classes). The class distribution changes over
time.
How should the temporal granularity of such a stream be chosen? In Figure 1, we show the tweets at different levels of temporal granularity: weeks (left), days (center), hours (right). The weekly-aggregated stream (Figure 1, left) consists of 12 distinct weeks (the x-axis shows the week of the year). Both classes are present up to week 25, but after that only instances of the negative class appear. In the middle of the figure, we see the same data aggregated at the day level: there are 49 days (the horizontal axis denotes the day of the year). Up
to day 168, we see positive and negative documents; the positive class (green)
is overrepresented. But towards the end of the stream the class distribution
changes and the positive class disappears. We see a similar behavior in the
hourly-aggregated stream (Figure 1, right), where the x-axis depicts the hour
(of the year). In Figure 1, we see that, independent of the aggregation level, the amount of data received at each timepoint varies: there are high-traffic time
points, like day 157 or week 23 and low-traffic ones, like day 96 or week 15. Also,
there are “gaps” in the monitoring period. For example, in the daily-aggregated
stream there are several 1-day gaps like day 132 but also “bigger gaps” like 10
days of missing observations between day 97 and day 107.
The time granularity affects the ageing mechanism, since all documents asso-
ciated with the same time unit (e.g. day) have the same age/weight. In the
following, we experiment with all three levels of granularity.
The two most popular methods for evaluating classification algorithms are hold-
out evaluation and prequential evaluation. Their fundamental difference is in
the order in which they perform training and testing and the ordering of the
dataset [7]. In hold-out evaluation, the current model is evaluated over a sin-
gle independent hold-out test set. The hold-out set is the same over the whole
course of the stream. For our experiments, the hold-out set consists of 30% of
all instances randomly selected from the stream. In prequential evaluation, each
instance from the stream is first used for testing and then for training the model.
This way, the model is updated continuously based on new instances.
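As an illustration, a test-then-train loop of the kind described above can be written as follows; the `predict`/`update` interface is that of the classifier sketch given earlier and is an assumption, not part of the evaluation protocol itself.

```python
def prequential_accuracy(model, stream, eval_window=1000):
    """Test-then-train evaluation; yields accuracy over consecutive evaluation windows."""
    correct, seen = 0, 0
    for doc, label, t in stream:               # stream yields (bag-of-words, label, timestamp)
        if model.predict(doc, t) == label:     # test first ...
            correct += 1
        model.update(doc, label, t)            # ... then train on the same instance
        seen += 1
        if seen % eval_window == 0:
            yield correct / eval_window
            correct = 0
```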
To evaluate the quality of the different classifiers, we employ accuracy and kappa [23] over an evaluation window, evalW. Accuracy is the percentage of correct classifications in the evaluation window. Bifet et al. [3] use the kappa statistic defined as
$$k = \frac{p_0 - p_c}{1 - p_c},$$
where $p_0$ is the accuracy of the studied classifier and $p_c$ is the accuracy of the chance classifier. Kappa lies between -1 and 1.
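For completeness, kappa over an evaluation window can be computed as below, where the chance accuracy $p_c$ is estimated from the observed class frequencies of the true and predicted labels; this is a common choice, and the exact estimator used in [3] may differ.

```python
from collections import Counter

def kappa_statistic(y_true, y_pred):
    """Kappa k = (p0 - pc) / (1 - pc) over one evaluation window."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    pc = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p0 - pc) / (1.0 - pc) if pc < 1.0 else 0.0
```

When only one class remains in the window, $p_c$ approaches 1 and kappa collapses to zero, which matches the behavior discussed below.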
The left part of the accuracy plots on Figure 2 shows that accumula-
tiveMNB has the best accuracy, followed closely by fadingMNB, while aggres-
siveFadingMNB has slightly inferior performance. In the left part of the plots
on kappa (Figure 3), the relative performance is the same, but the performance
inferiority of accumulativeMNB is more apparent.
In the right part of the accuracy plots on Figure 2, i.e., after the dras-
tic change in the class distribution, we see that accumulativeMNB experiences
a slight performance drop, while the accuracy of our two algorithms ascends
rapidly to 100% (the two curves coincide). The intuitive explanation for the
inferior performance of accumulativeMNB is that it remembers all past data, so
it cannot adapt to the disappearance of the positive class. The proposed fad-
ingMNB and aggressiveFadingMNB, on the contrary, manage to recover after
the change.
The right part of the kappa plots (Figure 3) gives a different picture: the performance of all three algorithms drops to zero after the concept shift (the three curves coincide). This is due to the nature of kappa: it juxtaposes the performance of the classifier with that of a random classifier; as soon as there is only one class in the data (here: the negative one), no classifier can be better
than the random classifier. Since the accuracy plots and the kappa plots show
the same trends before the drastic concept change, and since the accuracy plots
reflect the behavior of the classifiers after the change much better than kappa
does, we concentrate on accuracy as evaluation measure hereafter.
the λ increases, its performance drops. This is expected, since low values imply that the classifier gradually forgets at a moderate pace, whereas high values mean that the past is forgotten very fast. The behavior of fadingMNB is the opposite: for very small values of λ there is no actual ageing and therefore the performance is low (and similar to accumulativeMNB); as λ increases, fadingMNB exploits the ageing of the old data, so its performance improves.
The results for a much larger λ, λ = 1.0, are shown in the right part of
Figure 7. A λ = 1.0 means that an instance loses half its weight after 1 hour, 1
day, 1 week for the hourly, daily, weekly aggregated stream respectively. The
aggressiveFadingMNB performs worse when there is no drift in the stream,
because the classifier forgets too fast. This fast forgetting though allows the
classifier to adapt fast in times of change, i.e., when drift occurs after instance
1,341,000. Among the different streams, in times of stability the aggressive fading classifier on the hourly aggregated stream shows the worst performance, followed closely by the daily and then the weekly aggregated streams. In times of change, however, the behavior is the opposite, with the hourly aggregated stream showing the best adaptation rate because it retains almost no memory of the past. Regarding fadingMNB, daily and weekly aggregation show the best performance in times of stability, followed by the hourly aggregated stream. In times of drift, on the other hand, the hourly aggregation adapts the fastest, followed by the daily and weekly aggregated streams.
To summarize, the value of λ affects the performance of the classifier over
the whole course of the stream. In times of drifts in the stream, a larger λ is
preferable as it allows for fast adaptation to the new concepts appearing in the
stream. In times of stability though, a smaller λ is preferable as it allows the
classifier to exploit already learned concepts in the stream. The selection of λ works in collaboration with the stream granularity: at a very fine granularity (such as a time unit of one hour), λ can be higher than at a coarser granularity (such as a time unit of one week).
References
1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for projected cluster-
ing of high dimensional data streams. In: Proceedings of the 30th International
Conference on Very Large Data Bases (VLDB), Toronto, Canada (2004)
2. Bermingham, A., Smeaton, A.F.: Classifying sentiment in microblogs: Is brevity an
advantage? In: Proceedings of the 19th ACM International Conference on Informa-
tion and Knowledge Management, CIKM 2010, pp. 1833–1836. ACM, New York
(2010)
3. Bifet, A., Frank, E.: Sentiment knowledge discovery in twitter streaming data.
In: Pfahringer, B., Holmes, G., Hoffmann, A. (eds.) DS 2010. LNCS, vol. 6332,
pp. 1–15. Springer, Heidelberg (2010)
4. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving
data stream with noise. In: Proceedings of the 6th SIAM International Conference
on Data Mining (SDM), Bethesda, MD (2006)
5. Domingos, P., Pazzani, M.: On the optimality of the simple bayesian classifier
under zero-one loss. Mach. Learn. 29(2–3), 103–130 (1997)
6. Gama, J.A., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on
concept drift adaptation. ACM Comput. Surv. 46(4), 44:1–44:37 (2014)
7. Gama, J.: Knowledge Discovery from Data Streams, 1st edn. Chapman &
Hall/CRC (2010)
8. Gama, J., Kosina, P.: Recurrent concepts in data streams classification. Knowl.
Inf. Syst. 40(3), 489–507 (2014)
9. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant
supervision. In: Processing, pp. 1–6 (2009). http://www.stanford.edu/alecmgo/
papers/TwitterDistantSupervision09.pdf
10. Gokulakrishnan, B., Priyanthan, P., Ragavan, T., Prasath, N., Perera, A.S.: Opin-
ion mining and sentiment analysis on a twitter data stream. In: Proceedings of
2012 International Conference on Advances in ICT for Emerging Regions (ICTer),
ICTer 2012, pp. 182–188. IEEE (2012)
11. Guerra, P.C., Meira, Jr., W., Cardie, C.: Sentiment analysis on evolving social
streams: how self-report imbalances can help. In: Proceedings of the 7th ACM
International Conference on Web Search and Data Mining, WSDM 2014,
pp. 443–452. ACM, New York (2014)
12. Lazarescu, M.: A multi-resolution learning approach to tracking concept drift and
recurrent concepts. In: Gamboa, H., Fred, A.L.N. (eds.) PRIS, p. 52. INSTICC
Press (2005)
13. Liu, Y., Yu, X., An, A., Huang, X.: Riding the tide of sentiment change: Sentiment
analysis with evolving online reviews. World Wide Web 16(4), 477–496 (2013)
14. McCallum, A., Nigam, K.: A comparison of event models for naive bayes
text classification. In: AAAI-98 Workshop on Learning for Text Categorization,
pp. 41–48. AAAI Press (1998)
15. Ntoutsi, E., Zimek, A., Palpanas, T., Kröger, P., Kriegel, H.-P.: Density-based
projected clustering over high dimensional data streams. In: Proceedings of the
12th SIAM International Conference on Data Mining (SDM), Anaheim, CA,
pp. 987–998 (2012)
16. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr.
2(1–2), 1–135 (2008)
17. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using
machine learning techniques. In: Proceedings of the ACL-02 Conference on Empir-
ical Methods in Natural Language Processing, vol. 10, pp. 79–86. EMNLP, ACL,
Stroudsburg (2002)
18. Plaza, L., Carrillo de Albornoz, J.: Sentiment Analysis in Business Intelligence: A
survey, pp. 231–252. IGI-Global (2011)
19. Sentiment140: Sentiment140 - a Twitter sentiment analysis tool. http://help.
sentiment140.com/
20. Sinelnikova, A.: Sentiment analysis in the Twitter stream. Bachelor thesis, LMU,
Munich (2012)
21. Sinelnikova, A., Ntoutsi, E., Kriegel, H.P.: Sentiment analysis in the twitter
stream. In: 36th Annual Conf. of the German Classification Society (GfKl 2012),
Hildesheim, Germany (2012)
22. Turney, P.D.: Thumbs up or thumbs down?: semantic orientation applied to unsu-
pervised classification of reviews. In: Proceedings of the 40th Annual Meeting on
Association for Computational Linguistics, ACL 2002, pp. 417–424. Association
for Computational Linguistics, Stroudsburg (2002)
23. Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: The kappa
statistic. Family Medicine 37(5), 360–363 (2005)
24. Zimmermann, M., Ntoutsi, E., Spiliopoulou, M.: Adaptive semi supervised opinion
classifier with forgetting mechanism. In: Proceedings of the 29th Annual ACM
Symposium on Applied Computing, SAC 2014, pp. 805–812. ACM, New York
(2014)
25. Zimmermann, M., Ntoutsi, E., Spiliopoulou, M.: Discovering and monitoring prod-
uct features and the opinions on them with OPINSTREAM. Neurocomputing 150,
318–330 (2015)
Drift Detection Using Stream Volatility
David Tse Jung Huang1(B) , Yun Sing Koh1 , Gillian Dobbie1 , and Albert Bifet2
1 Department of Computer Science, University of Auckland, Auckland, New Zealand
  {dtjh,ykoh,gill}@cs.auckland.ac.nz
2 Huawei Noah's Ark Lab, Hong Kong, China
  bifet.albert@huawei.com
1 Introduction
Mining data that change over time from fast changing data streams has become a
core research problem. Drift detection discovers important distribution changes
from labeled classification streams and many drift detectors have been pro-
posed [1,5,8,10]. A drift is signaled when the monitored classification error devi-
ates from its usual value past a certain detection threshold, calculated from a
statistical upper bound [6] or a significance technique [9]. The current drift detec-
tors monitor only some form of mean and variance of the classification errors
and these errors are used as the only basis for signaling drifts. Currently the
detectors do not consider any previous trends in data or drift behaviors. Our
proposal incorporates previous drift trends to extend and improve the current
drift detection process.
In practice there are many scenarios such as traffic prediction where incorpo-
rating previous data trends can improve the accuracy of the prediction process.
For example, consider a user using Google Map at home to obtain a fastest route
to a specific location. The fastest route given by the system will be based on
how congested the roads are at the current time (prior to leaving home) but is
unable to adapt to situations like upcoming peak hour traffic. The user could be
directed to take the main road that is not congested at the time of look up, but
may later become congested due to peak hour traffic when the user is en route.
In this example, combining data such as traffic trends throughout the day can
help arrive at a better prediction. Similarly, using historical drift trends, we can
derive more knowledge from the stream and when this knowledge is used in the
drift detection process, it can improve the accuracy of the predictions.
Fig. 1. Comparison of the current drift detection process vs. our proposed design
The main contribution of this paper is the concept of using historical drift
trends to estimate the probability of expecting a drift at each point in the stream,
which we term the expected drift probability. We propose two approaches to
derive this probability: Predictive approach and Online approach. Figure 1 illus-
trates the comparison of the current drift detection process against our overall
proposed design. The Predictive approach uses Stream Volatility [7] to derive
a prediction of where the next drift point is likely to occur. Stream Volatility
describes the rate of changes in a stream and using the mean of the rate of the
changes, we can make a prediction of where the next drift point is. This predic-
tion from Stream Volatility then indicates periods of time where a drift is less
likely to be discovered (e.g. if the next drift point is predicted to be 100 steps
later, then we can assume that drifts are less likely to occur during steps farther
away from the prediction). At these times, the Predictive approach will have a
low expected drift probability. The predictive approach is suited for applications
where the data have some form of cyclic behavior (i.e. occurs daily, weekly, etc.)
such as the monitoring of oceanic tides, or daily temperature readings for agricul-
tural structures. The Online approach estimates the expected drift probability
by first training a model using previous non-drifting data instances. This model
represents the state of the stream when drift is not occurring. We then compare
how similar the current state of the stream is against the trained model. If the
current state matches the model (i.e. current state is similar to previous non-
drifting states), then we assume that drift is less likely to occur at this current
point and derive a low expected drift probability. The Online approach is better
suited for fast changing, less predictive applications such as stock market data.
We apply the estimated expected drift probability in the state-of-the-art detec-
tor ADWIN [1] by adjusting the detection threshold (i.e. the statistical upper
bound). When the expected drift probability is low, the detection threshold is
adapted and increased to accommodate the estimation. Through experimenta-
tion, we offer evidence that using our two new approaches with ADWIN, we achieve significantly fewer false positives.
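To give a flavour of the Predictive approach, the sketch below predicts the next drift point as the last drift plus the mean inter-drift interval (the stream volatility) and maps the distance to that prediction onto an expected drift probability. The Gaussian shape and the `width` parameter are purely illustrative choices of ours, not the paper's actual formulation.

```python
import math

def expected_drift_probability(drift_points, current_step, width=0.25):
    """Illustrative Predictive approach: low probability far from the predicted next drift."""
    if len(drift_points) < 2:
        return 1.0                                     # no trend yet: do not dampen detection
    intervals = [b - a for a, b in zip(drift_points, drift_points[1:])]
    mean_interval = sum(intervals) / len(intervals)    # mean rate of change in the stream
    predicted_next = drift_points[-1] + mean_interval
    distance = abs(current_step - predicted_next)
    return math.exp(-0.5 * (distance / (width * mean_interval)) ** 2)
```

A detector could then inflate its detection threshold whenever this value is low, which is the role the expected drift probability plays in the ADWIN-based design described above.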
The paper is structured as follows: in Section 2 we discuss the relevant
research. Section 3 details the formal problem definition and preliminaries. In
Section 4 our method is presented and we also discuss several key elements and
contributions. Section 5 presents our extensive experimental evaluations, and
Section 6 concludes the paper.
2 Related Work
Drift Detection: One way of describing a drift is as a statistically significant shift
in the distribution of a sample of data, from an initially homogeneous distribution
to a different data distribution. Gama et al. [4] present a comprehensive survey
on drift detection methods and point out that techniques
generally fall into four categories: sequential analysis, statistical process control
(SPC), monitoring two distributions, and contextual.
The Cumulative Sum [9] and the Page-Hinkley Test [9] are sequential analysis
based techniques. They are both memoryless but their accuracy heavily depends
on the required parameters, which can be difficult to set. Gama et al. [5] adapted
the SPC approach and proposed the Drift Detection Method (DDM), which
works best on data streams with sudden drift. DDM monitors the error rate
and the variance of the classifying model of the stream. When no changes are
detected, DDM works like a lossless learner constantly enlarging the number of
stored examples, which can lead to memory problems.
More recently Bifet et al. [1] proposed ADaptive WINdowing (ADWIN) based
on monitoring distributions of two subwindows. ADWIN is based on the use
of the Hoeffding bound to detect concept change. The ADWIN algorithm was
shown to outperform the SPC approach and provides rigorous guarantees on false
positive and false negative rates. ADWIN maintains a window (W ) of instances
at a given time and compares the mean difference of any two subwindows (W0
of older instances and W1 of recent instances) from W . If the mean difference is
statistically significant, then ADWIN removes all instances of W0 considered to
represent the old concept and only carries W1 forward to the next test. ADWIN
uses a variation of exponential histograms and a memory parameter to limit
the number of hypothesis tests.
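To make the mechanism concrete, the following is a minimal, naive sketch of the two-subwindow comparison just described. It scans every split of the window in O(|W|) time and uses a simplified Hoeffding-style threshold; the exponential-histogram compression and the exact Bonferroni-corrected bound of [1] are omitted, and the function names are ours.

import math

def hoeffding_eps(n0, n1, delta):
    # Simplified Hoeffding-style threshold for the difference of two sample
    # means over subwindows of sizes n0 and n1 (not the exact bound of [1]).
    m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic mean of the subwindow sizes
    return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / delta))

def adwin_like_check(window, delta=0.05):
    """Return the cut index if some split of `window` (a list of 0/1 errors or
    values in [0, 1]) shows a significant mean difference, otherwise None."""
    n = len(window)
    for cut in range(1, n):
        w0, w1 = window[:cut], window[cut:]
        mu0, mu1 = sum(w0) / len(w0), sum(w1) / len(w1)
        if abs(mu0 - mu1) >= hoeffding_eps(len(w0), len(w1), delta):
            return cut   # instances before `cut` (the old concept) would be dropped
    return None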
Stream Volatility: Introduced in [7], stream volatility describes the rate of
changes in data distribution. In the context of this paper, we employ the idea of
stream volatility to help derive a prediction of when the next change point will
occur in the Predictive approach.
In [7] the authors describe volatility detection as the discovery of a shift in
stream volatility. A volatility detector was developed with a particular focus
on finding the shift in stream volatility using a relative variance measure. The
proposed volatility detector consists of two components: a buffer and a reservoir.
The buffer is used to store recent data and the reservoir is used to store an
overall representative sample of the stream. A volatility shift is observed when
the variance between the buffer and the reservoir is past a significance threshold.
3 Preliminaries
Let us frame the problem of drift detection and analysis more formally. Let
S1 = (x1 , x2 , ..., xm ) and S2 = (xm+1 , ..., xn ) with 0 < m < n represent two
samples of instances from a stream with population means μ1 and μ2 respec-
tively. The drift detection problem can be expressed as testing the null hypoth-
esis H0 that μ1 = μ2 , i.e. the two samples are drawn from the same distribution
against the alternate hypothesis H1 that they are drawn from different distri-
butions with μ1 ≠ μ2. In practice the underlying data distribution is unknown
and a test statistic based on sample means is constructed by the drift detector.
If the null hypothesis is accepted incorrectly when a change has occurred then
a false negative has occurred. On the other hand if the drift detector accepts
H1 when no change has occurred in the data distribution then a false posi-
tive has occurred. Since the population mean of the underlying distribution is
unknown, sample means need to be used to perform the above hypothesis tests.
The hypothesis tests can be restated as the following. We accept hypothesis H1
whenever P r(|μ̂1 − μ̂2| ≥ ε) ≤ δ, where the parameter δ ∈ (0, 1) controls
the maximum allowable false positive rate, and ε is the test statistic used to
model the difference between the sample means and is a function of δ.
For example, a stream with drift points at times t = 50, t = 100, and t = 150 will
have a mean volatility value of 50. The μvolatility is then used to provide an
estimate of a relative position of where the next drift point, denoted t_drift.next,
is likely to occur. In other words, t_drift.next = t_drift.previous + μvolatility,
where t_drift.previous is the location of the previous signaled drift point and
t_drift.previous < t_drift.next.
The expected drift probability at a point tx in the stream is denoted as φtx.
We use the t_drift.next prediction to derive φtx at each time point tx in the stream
between the previous drift and the next predicted drift point, t_drift.previous <
tx < t_drift.next. When tx is distant from t_drift.next, the probability φtx is smaller,
and as tx progresses closer to t_drift.next, φtx is progressively increased.
We propose two variations of deriving φtx based on next drift prediction: one
based on the sine function and the other based on sigmoid function. Intuitively
the sine function will assume that the midpoint of previous drift and next drift is
the least likely point to observe a drift whereas the sigmoid function will assume
that immediately after a drift, the probability of observing a drift is low until
the stream approaches the next drift prediction.
The calculation of φtx using the sine and the sigmoid functions at a point tx,
where t_drift.previous < tx < t_drift.next, is defined as:

φ_tx^sin = 1 − sin(π · t_r)   and   φ_tx^sigmoid = 1 − (1 − t_r) / (0.01 + |1 − t_r|)
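The following minimal Python sketch illustrates the Predictive approach's probability under two assumptions of ours that are not spelled out in the text above: t_r is taken to be the relative position of tx between the previous drift and the predicted next drift, and the next-drift prediction is t_drift.previous + μvolatility as stated earlier.

import math

def predict_next_drift(prev_drift, mean_volatility):
    # t_drift.next = t_drift.previous + mu_volatility
    return prev_drift + mean_volatility

def expected_drift_probability(t_x, prev_drift, next_drift, variant="sin"):
    # t_r: assumed relative position of t_x between the previous and the
    # predicted next drift (the paper's exact definition is not reproduced here).
    t_r = (t_x - prev_drift) / float(next_drift - prev_drift)
    if variant == "sin":
        return 1.0 - math.sin(math.pi * t_r)
    return 1.0 - (1.0 - t_r) / (0.01 + abs(1.0 - t_r))   # sigmoid variant

# Example: previous drift at t = 1000, mean volatility 500 -> prediction at 1500.
nxt = predict_next_drift(1000, 500)
phi_mid = expected_drift_probability(1250, 1000, nxt, "sin")
# phi is lowest (0) at the midpoint for the sine variant.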
The Online approach estimates the expected drift probability by first training a
model using previous non-drifting data. This model represents the state of the
stream when drift is not occurring. We then compare how similar the current
state of the stream is against the trained model. If the current state matches
the model (i.e. current state is similar to previous non-drifting states), then we
assume that drift is less likely to occur at this current point and derive a low
expected drift probability. Unlike the Predictive approach, which predicts where
a drift point might occur, the Online approach approximates the unlikelihood
of a drift occurring by comparing the current state of the stream against the
trained model representing non-drifting states.
The Online approach uses a sliding block B of size b that keeps the most
recent binary input values. The mean of the binary contents in the block
at time tx is given as μBtx, where the block B contains the values
v_{x−b}, · · · , v_x of the transactions t_{x−b}, · · · , t_x. The simple moving average of the
previous n inputs is also maintained: (v_{x−b−n} + · · · + v_{x−b})/n. The state
of the stream at any given time tx is represented using the Running Magnitude
Difference, denoted as γ, and given by: γtx = μBtx − MovingAverage.
We collect a set of γ values γ1 , γ2 , · · · , γx using previous non-drifting data
to build a Gaussian training model. The Gaussian model will have the mean μγ
and the variance σγ of the training set of γ values. The set of γ values reflects
the stability of the mean of the binary inputs. A stream in its non-drifting states
will have a set of γ values that tend to a mean of 0.
To calculate the expected drift probability in the running stream at a point
tx, the estimation φtx is derived by comparing the Running Magnitude Difference
γtx against the trained Gaussian model, with α as the threshold. The probability
calculation is given as:

φ_tx^online = 1 if f(γtx, μγ, σγ) ≤ α   and   φ_tx^online = 0 if f(γtx, μγ, σγ) > α

where

f(γtx, μγ, σγ) = (1/√(2π)) · exp( −(γtx − μγ)² / (2σγ²) )
For example, using previous non-drifting data we build a Gaussian model with
μγ = 0.0 and σγ = 0.05. If α = 0.1 and we observe that the current state is γtx =
0.1, then the expected drift probability is 1 because f (0.1, 0, 0.05) = 0.054 ≤ α.
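A small sketch of the Online approach's estimation, mirroring the definitions above; the function and variable names are ours, and the score f deliberately omits a 1/σ normalisation so that it reproduces the worked example f(0.1, 0, 0.05) ≈ 0.054.

import math

def running_magnitude_difference(block, moving_average):
    # gamma_tx = mean of the sliding block B minus the simple moving average
    return sum(block) / len(block) - moving_average

def f_gamma(gamma, mu_gamma, sigma_gamma):
    # Unnormalised Gaussian score, matching the paper's worked example.
    return (1.0 / math.sqrt(2.0 * math.pi)) * math.exp(
        -((gamma - mu_gamma) ** 2) / (2.0 * sigma_gamma ** 2))

def phi_online(gamma, mu_gamma, sigma_gamma, alpha=0.1):
    # phi = 1 when the current state looks unlike the non-drifting model
    # (low Gaussian score), phi = 0 otherwise.
    return 1.0 if f_gamma(gamma, mu_gamma, sigma_gamma) <= alpha else 0.0

# Worked example from the text: mu = 0.0, sigma = 0.05, gamma = 0.1 -> f ~ 0.054 <= 0.1
assert phi_online(0.1, 0.0, 0.05, alpha=0.1) == 1.0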
In a stream environment, the i.i.d. property of incoming random variables
in the stream is generally assumed. Although at first glance it may appear that
a trained Gaussian model is not suitable to be used in this setting, the central
limit theorem provides justification. In drift detection a drift primarily refers to
the real concept drift, which is a change in the posterior distributions of data
p(y|X), where X is the set of attributes and y is the target class label. When the
distribution of X changes, the class y might also change, affecting the predictive
power of the classifier and signaling a drift. Since the Gaussian model is trained
using non-drifting data, we assume that the collected γ value originates from
the same underlying distribution and remains stable in the non-drifting set of
data. Although the underlying distribution of the set of γ values is unknown,
the Central Limit Theorem justifies that the mean of a sufficiently large num-
ber of random samples will be approximately normally distributed regardless
of the underlying distribution. Thus, we can effectively approximate the set of
γ values with a Gaussian model. To confirm that the Central Limit Theorem
is valid in our scenario, we have generated sets of non-drifting supervised two
class labeled streams using the rotating hyperplane generator with the set of
numeric attributes X generated from different source distributions such as uni-
form, binomial, exponential, and Poisson. The sets of data are then run through
a Hoeffding Tree Classifier to obtain the binary classification error inputs and
the set of γ values are gathered using the Online approach. We plot each set of
the γ values and demonstrate that the distribution indeed tends to Gaussian as
justified by the Central Limit Theorem in Figure 2.
In Sections 4.1 and 4.2 we have described two approaches to calculating the
expected drift probability φ and in this section we show how to apply the dis-
covered φ in the detection process of ADWIN.
ADWIN relies on using the Hoeffding bound with Bonferroni correction [1]
as the detection threshold. The Hoeffding bound provides the guarantee that a drift
is signalled with probability at most δ (a user-defined parameter): P r(|μ_W0 −
μ_W1| ≥ ε) ≤ δ, where μ_W0 is the sample mean of the reference window of data,
W0, and μ_W1 is the sample mean of the current window of data, W1. The value ε
is a function of the δ parameter and is the test statistic used to model the difference
between the sample means of the two windows. Essentially, when the difference in
sample means between the two windows is greater than the test statistic ε, a drift
will be signaled. ε is given by the Hoeffding bound with Bonferroni correction
as:

ε_hoeffding = √( (2/m) · σ²_W · ln(2/δ′) ) + (2/(3m)) · ln(2/δ′)

where m = 1 / (1/|W0| + 1/|W1|) and δ′ = δ / (|W0| + |W1|).
We incorporate the expected drift probability φ into ADWIN and propose an
Adaptive bound which adjusts the detection threshold of ADWIN in reaction to
the probability of seeing a drift at different times tx across the stream. When φ
is low, the Adaptive bound (detection threshold) is increased to accommodate
the estimation that drift is less likely to occur.
The ε of the Adaptive bound is derived as follows:

ε_adaptive = (1 + β · (1 − φ)) · ( √( (2/m) · σ²_W · ln(2/δ′) ) + (2/(3m)) · ln(2/δ′) )
where β is a tension parameter that controls the maximum allowable adjustment,
usually set below 0.5. A comparison of the Adaptive bound using the Predic-
tive approach versus the original Hoeffding bound with Bonferroni correction is
shown in Figure 3. In such cases, we can see that by using the Adaptive bound
derived from the Predictive approach, we reduce the number of false positives
that would have otherwise been signaled by Hoeffding bound.
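Assuming the reconstruction of ε_hoeffding and ε_adaptive given above, a small sketch of how the Adaptive bound inflates the detection threshold when φ is low might look as follows (the function names and argument layout are ours).

import math

def eps_hoeffding(n0, n1, sigma2_w, delta):
    # Hoeffding bound with Bonferroni correction, as reconstructed above:
    # m is the harmonic mean of the subwindow sizes, delta' = delta / (n0 + n1).
    m = 1.0 / (1.0 / n0 + 1.0 / n1)
    delta_p = delta / (n0 + n1)
    return (math.sqrt((2.0 / m) * sigma2_w * math.log(2.0 / delta_p))
            + (2.0 / (3.0 * m)) * math.log(2.0 / delta_p))

def eps_adaptive(n0, n1, sigma2_w, delta, phi, beta=0.1):
    # When the expected drift probability phi is low, the threshold is inflated
    # by up to a factor (1 + beta), so eps_adaptive >= eps_hoeffding always holds.
    return (1.0 + beta * (1.0 - phi)) * eps_hoeffding(n0, n1, sigma2_w, delta)

# A drift is signalled when |mu_W0 - mu_W1| >= eps_adaptive(...).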
The Adaptive bound is based on adjusting the Hoeffding bound and main-
tains similar statistical guarantees as the original Hoeffding bound. We know
that the Hoeffding bound provides guarantee that a drift is signaled with prob-
ability at most δ:
P r(|μ_W0 − μ_W1| ≥ ε) ≤ δ

and since ε_adaptive ≥ ε_hoeffding, it follows that P r(|μ_W0 − μ_W1| ≥ ε_adaptive) ≤ δ.
We use the Hoeffding bound as a warning level: a true drift will pass the Hoeffding
bound and then the Adaptive bound, while a false positive might pass the Hoeffding
bound but not the Adaptive bound. When a drift
is signaled by the Adaptive bound, the mean volatility value is updated using
the drift point found when Hoeffding bound was first passed. The addition of
the Hoeffding as the warning level resolves the issue that drift points found by
the Adaptive bound might influence future volatility predictions.
The β value is a tension parameter used to control the degree at which the
statistical bound is adjusted based on drift expectation probability estimation φ.
Setting a higher β value will increase the magnitude of adjustment of the bound.
One sensible way to set the β parameter is to base its value on the confidence of
the φ estimation. If the user is confident in the φ estimation, then setting a high
β value (e.g. 0.5) will significantly reduce the number of false positives while still
detecting all the real drifts. If the user is less confident in the φ estimation, then
β can be set low (e.g. 0.1) to make sure drifts of significant value are picked up.
An approach for determining the β value is: β = (P r(a) − P r(e)) / (2 · (1 − P r(e))),
where P r(a) is the confidence of the φ estimation and P r(e) is the confidence
of estimation by chance. In most instances P r(e) should be set at 0.6, as any
estimation with confidence lower than 0.6 can be considered a poor estimation.
In practice, setting β to 0.1 reduces the number of false positives found by
50–70% when incorporating our design into ADWIN, while maintaining similar
true positive rates and detection delays compared to not using our design.
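Assuming the parenthesisation reconstructed above, the β rule can be written directly; with full confidence in the φ estimation, Pr(a) = 1 and Pr(e) = 0.6 give the maximum β = 0.5.

def tension_beta(pr_a, pr_e=0.6):
    # beta = (Pr(a) - Pr(e)) / (2 * (1 - Pr(e))), floored at 0 for poor estimations
    return max(0.0, (pr_a - pr_e) / (2.0 * (1.0 - pr_e)))

# tension_beta(1.0) == 0.5, tension_beta(0.8) == 0.25, tension_beta(0.6) == 0.0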
5 Experimental Evaluation
In this test we compare the false positives found when using only the ADWIN
detector against those found when using our Predictive and Online approaches on top of ADWIN.
For this test we replicate the standard false positive test used in [1]. A stream of
100,000 bits with no drifts generated from a Bernoulli distribution with μ = 0.2
is used. We vary the δ parameter and the β tension parameter and run 100
iterations for all experiments. The Online approach is run with a 10-fold cross
validation. We use α = 0.1 for the Online approach.
Online Approach
δ Hoeffding β = 0.1 β = 0.2 β = 0.3 β = 0.4 β = 0.5
0.05 0.0014 0.0006 0.0005 0.0005 0.0005 0.0005
0.1 0.0031 0.0014 0.0012 0.0011 0.0011 0.0011
0.3 0.0129 0.0073 0.0063 0.0059 0.0058 0.0058
The results are shown in Table 1 and we observe that both the sine function
and sigmoid function with the Predictive approach are effective at reducing the
number of false positives in the stream. In the best case scenario the number
of false positives was reduced by 93%. Even with a small β value of 0.1, we still
observe an approximately 50-70% reduction in the number of false positives. For
the Online approach we observe around a 65% reduction.
In the true positive test we test the accuracy of the three different settings at
detecting true drift points. In addition, we look at the detection delay associated
with the detection of the true positives. For this test we replicate the true positive
test used in [1]. Each stream contains one drift at different points of volatility,
with varying magnitudes of drift; the drift is induced with different slope values
over a period of 2000 steps. For each set of parameter values, the experiments
are run over 100 iterations using ADWIN as the drift detector with δ = 0.05.
The Online approach was run with a 10-fold cross validation.
We observed that for all slopes, the true positive rate for using all differ-
ent settings (ADWIN only, Predictive Sine, Predictive Sigmoid, and Online) is
100%. There was no notable difference between using ADWIN only, Predictive
Online Approach
Slope Hoeffding β = 0.1 β = 0.2 β = 0.3 β = 0.4 β = 0.5
0.0001 882±(181) 941±(180) 975±(187) 1001±(200) 1015±(211) 1033±(220)
0.0002 571±(113) 597±(116) 611±(123) 620±(130) 629±(136) 632±(140)
0.0003 441±(83) 460±(90) 469±(94) 472±(96) 475±(100) 476±(101)
0.0004 377±(71) 389±(73) 394±(74) 398±(79) 398±(79) 399±(80)
approach, and Online approach in terms of accuracy. The results for the associ-
ated detection delay on gradual drift stream are shown in Table 2. We note that
the Sine and Sigmoid functions yielded similar results and we only present one
of them here. We see that the detection delays remain stable between ADWIN
only and the Predictive approach as this test assumes an accurate next drift pre-
diction from volatility and does not have any significant variations. The Online
approach observed a slightly higher delay due to the nature of the approach
(within one standard deviation of the Hoeffding bound delay). An increase
in delay when β is varied is observed only in the Online approach.
The false negative test examines whether drifts are missed when the drift prediction is incorrect. Hence, the exper-
iments are carried out on the Predictive approach and not the Online approach.
For this experiment we generate streams with 100,000 bits containing exactly
one drift at a pre-specified location before the presumed drift location (100,000).
We experiment with 3 locations at steps 25,000 (the 1/4 point), 50,000 (the 1/2
point), and 75,000 (the 3/4 point). The streams are generated with different drift
slopes modelling both gradual and abrupt drift types. We feed the Predictive
approach a drift prediction at 100,000. We use ADWIN with δ = 0.05.
In Table 3 we show the detection delay results for varying β and drift
types/slopes when the drift is located at the 1/4 point. We observe from the
table that as we increase the drift slope, the delay decreases. This is because
a drift of a larger magnitude is easier to detect and thus found faster. As we
increase β we can see a positive correlation with delay. This is an apparent
tradeoff with adapting to a tougher bound. In most cases the increase in delay
associated with an unpredicted drift is still acceptable taking into account the
Sine Function
Adaptive Bound
Slope Hoeffding β = 0.1 β = 0.2 β = 0.3 β = 0.4 β = 0.5
0.4 107±(37) 116±(39) 129±(42) 139±(43) 154±(47) 167±(51)
0.6 52±(12) 54±(11) 57±(11) 61±(12) 64±(14) 67±(15)
0.8 27±(10) 28±(11) 32±(14) 39±(16) 44±(15) 50±(12)
0.0001 869±(203) 923±(193) 972±(195) 1026±(200) 1090±(202) 1151±(211)
0.0002 556±(121) 593±(117) 634±(106) 664±(109) 692±(116) 727±(105)
0.0003 434±(89) 463±(91) 488±(84) 514±(83) 531±(80) 557±(75)
0.0004 367±(71) 384±(76) 403±(73) 420±(70) 439±(71) 457±(69)
Sigmoid Function
Adaptive Bound
Slope Hoeffding β = 0.1 β = 0.2 β = 0.3 β = 0.4 β = 0.5
0.4 108±(37) 121±(40) 136±(42) 156±(46) 176±(51) 196±(56)
0.6 52±(12) 54±(12) 57±(11) 61±(12) 64±(14) 67±(15)
0.8 27±(10) 28±(11) 32±(14) 39±(16) 44±(15) 50±(12)
0.0001 869±(203) 937±(200) 1013±(198) 1091±(201) 1177±(216) 1259±(221)
0.0002 556±(121) 605±(110) 657±(110) 695±(115) 738±(103) 776±(102)
0.0003 434±(89) 474±(87) 508±(87) 535±(78) 567±(76) 593±(76)
0.0004 367±(71) 390±(75) 415±(71) 442±(70) 471±(65) 492±(61)
magnitude of false positive reductions and the assumption that unexpected drifts
should be less likely to occur when volatility predictions are relatively accurate.
Table 4 compares the delay when the drift is induced at different points during
the stream. It can be seen that the sine function does have a slightly higher delay
when the drift is at the 1/2 point. This can be traced back to the sine function
where the mid-point is the peak of the offset. In general the variations are within
a reasonable range and the differences are not significant.
This dataset is obtained from the Stream Data Mining Repository1 . It contains
the hourly power supply from an Italian electricity company from two sources.
The measurements from the sources form the attributes of the data and the class
label is the hour of the day from which the measurement is taken. The drifts in
this dataset are primarily the differences in power usage between different seasons
where the hours of daylight vary. We note that because real-world datasets do
not have ground truths for drift points, we are unable to report true positive
rates, false positive rates, and detection delay. The main objective is to compare
the behavior of ADWIN only versus our designs on real-world data and show
that they do find drifts at similar locations.
Fig. 4. Power Supply Dataset (drift points are shown with red dots)
Figure 4 compares the drifts found by ADWIN only against those found by our
Predictive and Online approaches. We observe that the drift points are
fired at similar locations to the ADWIN only approach.
In this study we apply our design to incremental classifiers in data streams.
We experiment with both synthetically generated data and real-world datasets.
First, we generate synthetic data with the commonly used SEA concepts gener-
ator introduced in [11]. Second, we use Forest Covertype and Airlines real-world
datasets from MOA website2 . Note that the ground truth for drifts is not avail-
able for the real-world datasets. In all of the experiments we run VFDT [3], an
1 http://www.cse.fau.edu/~xqzhu/stream.html
2 moa.cms.waikato.ac.nz/datasets/
incremental tree classifier, and compare the prediction accuracy and learning
time of the tree using three settings: VFDT without drift detection, VFDT with
ADWIN, and VFDT with our approaches and Adaptive bound. In the no drift
detection setting, the tree learns throughout the entire stream. In the other set-
tings, the tree is rebuilt using the next n instances when a drift is detected. For
SEA data n = 10000 and for real-world n = 300. We used a smaller n for the
real-world datasets because they contain fewer instances and more drift points.
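The evaluation protocol just described can be sketched as a simple test-then-train loop. The `make_classifier` factory and the `predict`/`learn`/`add` method names below are placeholders for an incremental learner such as VFDT and a drift detector such as ADWIN with the Adaptive bound; they do not refer to a real library API.

def prequential_with_rebuild(stream, make_classifier, detector, n_rebuild=10000):
    """Test-then-train loop: on each detected drift, the tree is rebuilt
    using the next n_rebuild instances (n = 10000 for SEA, 300 for real data)."""
    clf = make_classifier()
    correct, total, rebuild_left = 0, 0, 0
    for x, y in stream:
        if rebuild_left == 0:                  # normal operation: test, then train
            y_hat = clf.predict(x)
            correct += int(y_hat == y)
            total += 1
            if detector.add(int(y_hat != y)):  # feed the 0/1 error; True means drift
                clf = make_classifier()        # discard the old model
                rebuild_left = n_rebuild
        else:                                  # rebuilding phase: train only
            rebuild_left -= 1
        clf.learn(x, y)
    return correct / max(total, 1)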
The synthetic data stream is generated from MOA [2] using the SEA concepts
with 3 drifts evenly distributed at 250k, 500k, and 750k in a 1M stream. Each
section of the stream is generated from one of the four SEA concept functions.
We use δ = 0.05 for ADWIN, Sine function for Predictive approach and β = 0.5
for both approaches and α = 0.1 for Online approach. Each setting is run over
30 iterations and the observed results are shown in Table 5.
The results show that by using ADWIN only, the overall accuracy of the
classifier is improved. There is also a reduction in the learning time because only
parts of the stream are used for learning the classifier as opposed to no drift
detection where the full stream is used. Using our Predictive approach and the
Online approach showed a further reduction in learning time and an improvement
in accuracy. An important observation is the reduction in the number of drifts
detected in the stream and an example of drift points is shown in Table 6. We
discovered that the Predictive and Online approaches found fewer false positives.
Table 5. Incremental Classifier Performance Comparisons
Table 6. SEA sample: induced drifts at 250k, 500k, and 750k (false positives colored)
ADWIN 19167 102463 106367 250399 407807 413535 432415 483519 489407 500223 739423 750143
Predict. 19167 102463 106399 250367 500255 750239
Online 19167 251327 500255 750143
The reduction in the number of drifts detected means that the user does not
need to react to unnecessary drift signals. In the real-world dataset experiments,
we generally observe a similar trend to the synthetic experiments. Overall the
classifier’s accuracy is improved when our approaches are applied. Using ADWIN
only yields the highest accuracy, however, it is only marginally higher than our
approaches while using our approaches the number of drifts detected is reduced.
With real-world datasets, we unfortunately do not have the ground truths and
cannot report variance in the accuracy and number of drifts detected, but
the eliminated drifts using our approaches did not have apparent effects on the
accuracy of the classifier and thus are more likely to be false positives or less
significant drifts. Although the accuracy results are not statistically worse or
better, we observe a reduction in the number of drifts detected. In scenarios
where drift signals incur high costs of action, having a lower number of detected
drifts while maintaining similar accuracy is in general more favorable.
References
1. Bifet, A., Gavaldà, R.: Learning from time-changing data with adaptive windowing.
In: SIAM International Conference on Data Mining (2007)
2. Bifet, A., Holmes, G., Pfahringer, B., Read, J., Kranen, P., Kremer, H., Jansen, T.,
Seidl, T.: MOA: a real-time analytics open source framework. In: Gunopulos, D.,
Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011, Part III.
LNCS, vol. 6913, pp. 617–620. Springer, Heidelberg (2011)
3. Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the
6th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 71–80 (2000)
4. Gama, J.A., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on
concept drift adaptation. ACM Computing Surveys 46(4), 44:1–44:37 (2014)
5. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detec-
tion. In: Bazzan, A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171,
pp. 286–295. Springer, Heidelberg (2004)
6. Hoeffding, W.: Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association 58, 13–29 (1963)
7. Huang, D.T.J., Koh, Y.S., Dobbie, G., Pears, R.: Detecting volatility shift in
data streams. In: 2014 IEEE International Conference on Data Mining (ICDM),
pp. 863–868 (2014)
8. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Proceed-
ings of the 30th International Conference on VLDB, pp. 180–191. VLDB Endow-
ment (2004)
9. Page, E.: Continuous inspection schemes. Biometrika, 100–115 (1954)
10. Pears, R., Sakthithasan, S., Koh, Y.S.: Detecting concept change in dynamic data
streams - A sequential approach based on reservoir sampling. Machine Learning
97(3), 259–293 (2014)
11. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale clas-
sification. In: Proceedings of the 7th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2001, pp. 377–382. ACM, New York
(2001)
Early Classification of Time Series as a Non
Myopic Sequential Decision Making Problem
1 Introduction
In many applications, it is natural to acquire the description of an object
incrementally, with new measurements arriving sequentially. This is the case
in medicine, when a patient undergoes successive examinations until it is deter-
mined that enough evidence has been acquired to decide with sufficient certainty
the disease he/she is suffering from. Sometimes, the measurements are not con-
trolled and just arrive over time, as when the behavior of a consumer on a web
site is monitored on-line in order to predict what ad to put on his/her screen.
where xi1 , . . . , xit is the sequence of t measurements so far that must be classified
to either class −1 or class +1. As the number of measurements t increases, this
ratio is compared to two thresholds set according to the required error of the
first kind α (false positive error ) and error of the second kind β (false negative
error ). The main difficulty lies in the estimation of the conditional probabilities
P (xi1 , . . . , xit | y). (See also [4], a modern implementation of this idea.)
A prominent limitation of this general approach is that the cost of delaying
the decision is not taken into account. More recent techniques include the two
components of the cost of early classification problems: the cost associated with
the quality of the prediction and the cost of the delay before a prediction is made
about the incoming sequence. However, most of them compute an optimal deci-
sion time from the learning set, which is then applied to any incoming example
whatever its characteristics are. The decision is therefore not adaptive, since
the delay before making a prediction is independent of the input sequence.
The originality of the method presented here is threefold. First, the prob-
lem of early classification of time series is formalized as a sequential decision
problem involving the two costs: quality and delay of the prediction. Second, the
method is adaptive, in that the properties of the incoming sequence are taken
into account to decide what is the optimal time to make a prediction. And third,
in contrast to the usual sequential decision making techniques, the algorithm
presented is not myopic. At each time step, it computes what is the optimal
expected time for a decision in the future, and it is only if this expected time
is the current time that a decision is made. A myopic procedure would only
look at the current situation and decide whether it is time to stop asking for
more data and make a decision or not. It would never try to estimate in advance
the best time to make the prediction. The capacity of conjecturing when in the
future an optimal prediction should be made with regard to the quality and
delay of the prediction is however important and offers valuable opportunities
compared to myopic sequential decisions. Indeed, when the prediction is about
the breakdown of an equipment or about the possible failure of an organ in a
patient, this forecast capacity allows one to make preparations for thwarting as
best as possible the breakdown or failure, rather than reacting in haste at the
last moment.
The paper is organized as follows. We first review some related work in
Section (2). The formal statement of the early classification problem (Section
(3)) leads to a generic sequential decision making meta algorithm. Our early
decision making approach proposed here and its optimal decision rule are formalized
in Section (4). In Section (5), we propose one simple implementation of this meta
algorithm to illustrate our approach. Experiments and results on synthetic data
as well as on real data are presented and discussed in Section (6). The conclusion,
in Section (7), underlines the promising features of the approach presented and
discusses future works.
1: x_t ← ∅
2: t ← 0
3: while (¬Trigger(x_t, h_t)) do        /* wait for an additional measurement */
4:    x_t ← Concat(x_t, x_t)            /* a new measurement is added at the end of x_t */
5:    t ← t + 1
6:    if (Trigger(x_t, h_t) || t = T) then
7:       ŷ ← h_t(x_t)                   /* predict the class of x_t and exit the loop */
8:    end if
9: end while
In the framework outlined above, we suppose that the training set S has been
used in order to learn a series of hypotheses ht (t ∈ {1, . . . , T }), each hypothesis
ht being able to classify examples of length t: xt = x1 , x2 , . . . , xt .
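A minimal Python transcription of the meta-algorithm above; the `trigger` function, the per-length classifiers `h[t]` (indexed by length t = 1, . . . , T) and the measurement source are placeholders supplied by the caller.

def early_classify(measurements, h, trigger, T):
    """Sequential decision loop: extend x_t one measurement at a time and
    predict with the length-t classifier h[t] as soon as trigger fires,
    or when the series is complete (t == T)."""
    x_t = []
    t = 0
    while t < T:
        x_t.append(measurements[t])        # Concat(x_t, new measurement)
        t += 1
        if trigger(x_t, h[t]) or t == T:
            return h[t](x_t), t            # prediction and decision time
    return h[T](x_t), T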
Then, the various existing methods for early classification of time series can
be categorized according to the Trigger function, which decides when to stop
measuring additional information and output a prediction ht (xt ) for the class
of xt .
Several papers that are openly motivated by the problem of early classifi-
cation turn out indeed to be concerned with the problem of classifying from
incomplete sequences rather than with the problem of optimizing a tradeoff
between the precision of the prediction and the time it is performed. (see for
instance [5] where clever classification schemes are presented, but no explicit
cost for delaying the decision is taken into account). Therefore there is stricto
sensu no Trigger function used in these algorithms.
In [6], the Trigger function relies on an estimate of the earliest time at which
the prediction ht (xt ) should be equal to the one that would be made if the com-
plete example xT was known: hT (xT ). The so-called minimum prediction length
(MPL) is introduced, and is estimated using a one nearest neighbor classifier.
In a related work [7,8], the Trigger function is based on a very similar idea.
It outputs true when the probability that the assigned label ht (xt ) will match
the one that would be assigned using the complete time series hT (xT ) exceeds
some given threshold. To do so, the authors developed a quadratic discriminant
analysis that estimates a reliability bound on the classifier’s prediction at each
time step.
In [9], the Trigger function outputs true if the classification function ht has
a sufficient confidence in its prediction. In order to estimate this confidence level,
the authors use an ensemble method whereby the level of agreement is translated
into a confidence level.
In [10], an early classification approach relying on uncertainty estimations
is presented. It extends the early distinctive shapelet classification (EDSC) [11]
method to provide an uncertainty estimation for each class at each time step.
Thus, an incoming time series is labeled at each time step with the class that
has the minimum uncertainty at that time. The prediction is triggered once a
user-specified uncertainty threshold is met.
It is remarkable that even if the earliness of the decision is mentioned as a
motivation in these papers, the decision procedures themselves do not take it
explicitly into account. They instead evaluate the confidence or reliability of the
current prediction(s) in order to decide if the time is ripe for prediction, or if it
seems better to wait one more time step. In addition, the procedures are myopic
in that they do not look further than the current time to decide whether a prediction
should be made.
In this paper, we present a method that explicitly balances the expected gain
in the precision of the decision at all future time steps with the cost of delaying
the decision. In that way, the optimizing criterion is explicitly a function of both
aspects of the early decision problem, and, furthermore, it allows one to estimate,
and update if necessary, the future optimal time step for the decision.
We can now define a cost function f associated with the decision problem.
f(xt) = Σ_{y∈Y} P(y|xt) Σ_{ŷ∈Y} P(ŷ|y, xt) Ct(ŷ|y) + C(t)     (1)
However, this formulation of the decision problem requires that one be able
to compute the conditional probabilities P (y|xt ) and P (ŷ|y, xt ). The first one
is unknown, otherwise there would be no learning problem in the first place.
The second one is associated with a given classifier, and is equally difficult to
estimate.
Short of being able to estimate these terms, one can fall back on the expec-
tation of the cost for any sequence (hence the function now denoted f (t)):
f(t) = Σ_{y∈Y} P(y) Σ_{ŷ∈Y} P(ŷ|y) Ct(ŷ|y) + C(t)     (3)
From the training set S, it is indeed easy to compute the a priori probabilities
P (y) and the conditional probabilities P (ŷ|y), which are nothing else than the
confusion matrix associated with the considered classifier. One gets then the
optimal time for prediction as:
t∗ = ArgMin_{t∈{1,...,T}} f(t)
This can be computed before any new incoming sequence, and, indeed, t∗ is
independent of the input sequence. Of course, this is intuitively unsatisfactory
as one could feel, regarding a new sequence, very confident (resp. not confident)
in his/her prediction way before (resp. after) the prescribed time t∗ . If such is
the case, it seems foolish to make the prediction exactly at time t∗ . This is why
we propose an adaptive approach.
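Before moving to the adaptive criterion, the fixed-time rule of Equation (3) can be sketched as follows; the dictionaries of priors and per-length confusion matrices, estimated from the training set, are assumed inputs, and a typical choice would be mis_cost(ŷ, y) = 1 if ŷ ≠ y else 0 with time_cost(t) = d · t. The names are ours.

def fixed_cost(t, priors, confusion, mis_cost, time_cost):
    """f(t) = sum_y P(y) sum_yhat P(yhat|y) C_t(yhat|y) + C(t)  (Eq. 3)."""
    cost = time_cost(t)
    for y, p_y in priors.items():
        for y_hat, p_hat in confusion[t][y].items():
            cost += p_y * p_hat * mis_cost(y_hat, y)
    return cost

def best_fixed_time(T, priors, confusion, mis_cost, time_cost):
    # t* = argmin_{t in 1..T} f(t): the same decision time for every sequence.
    return min(range(1, T + 1),
               key=lambda t: fixed_cost(t, priors, confusion, mis_cost, time_cost))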
For each time step t ∈ {1, . . . , T}, a classifier h_t is trained using a learning set
S′. One can then estimate the associated confusion matrix P_t(ŷ|y, c_k) for each
cluster c_k and classifier h_t over a distinct learning set S″.
When a new input sequence xt of length t is considered, it is compared to
each cluster ck (of complete time series) and is given a membership probability
P(ck|xt) for each of them (as detailed in Section (5)). In a way, this compares
the input sequence to all families of its possible continuations.
Given that, at time t, T − t measurements are still missing on the incoming
sequence, it is possible to compute the expected cost of classifying xt at each
future time step τ ∈ {0, . . . , T − t}:
fτ(xt) = Σ_{ck∈C} P(ck|xt) Σ_{y∈Y} Σ_{ŷ∈Y} P_{t+τ}(ŷ|y, ck) C(ŷ|y) + C(t + τ)     (4)
Perhaps not apparent at first, this equation expresses two remarkable properties.
First, it is computable, which was not the case of Equation (1). Indeed, each
of the terms P (ck |xt ) and Pt+τ (ŷ|y, ck ) can now be estimated through frequencies
observed in the training data (see Figure (1)). Second, the cost depends on the
incoming sequence because of the use of the probability memberships P (ck |xt ).
It is therefore not computed beforehand, once and for all.
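A sketch of the adaptive criterion, following Equation (4) as reconstructed above; `membership[c]` stands for P(ck|xt) and `confusion[t+τ][c][y][ŷ]` for P_{t+τ}(ŷ|y, ck), both assumed to be estimated beforehand, and the names are ours.

def expected_cost(tau, t, membership, confusion, mis_cost, time_cost):
    """f_tau(x_t) of Eq. (4) for one incoming sequence at current length t."""
    cost = time_cost(t + tau)
    for c, p_c in membership.items():
        for y, row in confusion[t + tau][c].items():
            for y_hat, p_hat in row.items():
                cost += p_c * p_hat * mis_cost(y_hat, y)
    return cost

def best_horizon(t, T, membership, confusion, mis_cost, time_cost):
    # tau* = argmin_{tau in 0..T-t} f_tau(x_t); a prediction is made when tau* == 0.
    return min(range(0, T - t + 1),
               key=lambda tau: expected_cost(tau, t, membership, confusion,
                                             mis_cost, time_cost))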
Fig. 2. The first curve represents an incoming time series xt . The second curve rep-
resents the expected cost fτ (xt ) given xt , ∀τ ∈ {0, . . . , T − t}. It shows the balance
between the gain in the expected precision of the prediction and the cost of waiting
before deciding. The minimum of this tradeoff is expected to occur at time τ . New
measurements can modify the curve of the expected cost and the estimated τ .
In addition, the fact that the expected cost fτ (xt ) can be computed for each
of the remaining τ time steps allows one to forecast what should be the optimal
horizon τ∗ for the classification of the input sequence (see Figure (2)):
τ∗ = ArgMin_{τ∈{0,...,T−t}} fτ(xt).
Of course, these costs, and the expected optimal horizon τ∗, can be re-
evaluated when a new measurement is made on the incoming sequence. At any
time step t, if the optimal horizon τ∗ = 0, then the sequential decision process
stops and a prediction is made about the class of the input sequence xt using
the classifier h_t: ŷ = h_t(xt).
Returning to the general framework outlined for the early classification prob-
lem in Section (3), the proposed function that triggers a prediction for the incom-
ing sequence is given in Algorithm (2).
5 Implementation
Section (3) has outlined the general framework for the early classification prob-
lem while Section (4) has presented our proposed approach where the problem is
cast as a sequential decision problem with three properties: (i) both the quality of
the prediction and the delay before prediction are taken into account in the total
criterion to be optimized, (ii) the criterion is adaptive in that it depends upon
the incoming sequence xt , and (iii) the proposed solution leads to a non myopic
scheme where the system forecasts the expected optimal horizon τ ∗ instead of
just deciding that now is, or is not, the time to make a prediction.
In order to implement the proposed approach, choices have to be made about:
1. The type of classifiers used. For each time step t ∈ {1, . . . , T }, the input
dimension of the classifier is t.
2. The clustering method, which includes the technique (e.g. k-means), the dis-
tance used (e.g. the Euclidean distance, the time warping distance, ...), and
the number of clusters that are looked for.
3. The method for computing the membership probabilities P(ck|xt); one possible choice is sketched below.
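Since the paper's own formula for P(ck|xt) is not reproduced in this extract, the following is only one common, assumed choice: a softmax over negative Euclidean distances between xt and each cluster centroid truncated to length t.

import math

def membership_probabilities(x_t, centroids, scale=1.0):
    """One possible choice for P(c_k | x_t): a softmax over negative Euclidean
    distances between x_t and each centroid truncated to length t."""
    t = len(x_t)
    dists = {k: math.sqrt(sum((a - b) ** 2 for a, b in zip(x_t, c[:t])))
             for k, c in centroids.items()}
    weights = {k: math.exp(-scale * d) for k, d in dists.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}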
6 Experiments
Our experiments aimed at checking the validity of the proposed method and at
exploring its capacities for various conditions. To this end, we devised controlled
experiments with artificial data sets for which we could vary the control param-
eters: difference between the two target classes, noise level, number of different
time series shapes in each class and the cost of waiting before decision C(t).
We also applied the method to the real data set TwoLeadECG from UCR Time
Series Classification/Clustering repository [13].
The constant b is used to set a general trend, for instance either ascending
(b > 0) or descending (b < 0), while the first term a sin(ωi t + phase) provides
Fig. 3. Subgroups of sequences generated for classes y = +1 and y = −1, when the
trend parameter b = −0.08 or b = +0.08, and the noise level ε(t) = 0.5.
a shape for this particular family of time series. The last term is a noise factor
that makes the overall prediction task more or less difficult.
For instance, Figure (3) shows a set of time series (one for each shape) where:
– b = −0.08 or b = +0.08
– a = 5 and phase = 0
– ω1 = 10 or ω2 = 10.3 (here, there are 2 groups of time sequences per class)
– ε(t) is a gaussian term of mean = 0 and standard deviation = 0.5
– T = 50
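The exact generative equation is not reproduced in this extract; the sketch below assumes the natural reading of the description above, namely x(t) = a · sin(ω_i · t + phase) + b · t + ε(t), with the parameter values listed for Figure (3). The time scaling of ω is an additional assumption of this sketch.

import math
import random

def generate_series(a=5.0, omega=10.0, phase=0.0, b=0.08, noise_std=0.5, T=50):
    """One synthetic sequence: a sinusoidal shape a*sin(omega*t + phase), a
    linear trend b*t (ascending for b > 0, descending for b < 0) and Gaussian
    noise eps(t) with the given standard deviation."""
    return [a * math.sin(omega * t + phase) + b * t + random.gauss(0.0, noise_std)
            for t in range(1, T + 1)]

# Classes +1 / -1 and their subgroups would then be obtained by varying b and
# omega as in the parameter list above.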
In this particular setting, it is apparent that it is easy to mix up the two
classes y = −1 and y = +1 until intermediate values of t. However, the wait-
ing cost C(t) may force the system to make a decision before there are enough
measurements to make a reasonably sure guess on the class y.
In our experiments, the training set S contained 2,500 examples, and the
testing set T contained 1000 examples, equally divided into the two classes
y = −1 and y = +1. (Note: in the case of imbalanced classes, it is easy to compensate
for this by modifying the misclassification cost function Ct(ŷ|y).) Each class was
made of several subgroups: K−1 ones for class −1 and K+1 ones for class +1.
The misclassification costs were set as: C(ŷ|y) = 1, ∀ ŷ ≠ y, and the time cost
function C(t) = d × t, where d ∈ {0.01, 0.05, 0.1}.
We varied the waiting cost C(t), the noise level ε(t), the trend parameter b, and the number of subgroups in each class.
The results for various combinations of these parameters are shown in Table
(1) as obtained on the time series of the testing set. It reports τ , the average of
computed optimal times of decision and its associated standard deviation σ(τ ).
Table 1. Experimental results as a function of the waiting cost C(t) = {0.01, 0.05, 0.1} ×
t, the noise level ε(t) and the trend parameter b.
C(t)  ε(t)   τ̄   σ(τ )  AUC    τ̄   σ(τ )  AUC    τ̄   σ(τ )  AUC
0.2 9.0 2.40 0.99 9.0 2.40 0.99 10.0 0.0 1.00
0.5 13.0 4.40 0.98 13.0 4.40 0.98 15.0 0.18 1.00
0.01 1.5 24.0 10.02 0.98 32.0 2.56 1.00 30.0 12.79 0.99
5.0 26.0 7.78 0.84 30.0 18.91 0.87 30.0 19.14 0.88
10.0 38.0 18.89 0.70 48.0 1.79 0.74 46.0 5.27 0.75
15.0 23.0 15.88 0.61 32.0 13.88 0.64 29.0 17.80 0.62
20.0 7.0 8.99 0.52 11.0 11.38 0.55 4.0 1.22 0.52
0.2 8.0 2.00 0.98 8.0 2.00 0.98 9.0 0.0 1.00
0.5 10.0 2.80 0.96 8.0 4.0 0.98 14.0 0.41 0.99
0.05 1.5 5.0 0.40 0.68 20.0 0.42 0.95 14.0 4.80 0.88
5.0 8.0 3.87 0.68 6.0 1.36 0.64 5.0 0.50 0.65
10.0 4.0 0.29 0.56 4.0 0.25 0.56 4.0 0.34 0.57
15.0 4.0 0.0 0.54 4.0 0.25 0.56 4.0 0.0 0.55
20.0 4.0 0.0 0.52 4.0 0.0 0.52 4.0 0.0 0.52
0.2 6.0 0.80 0.95 7.0 1.60 0.94 8.0 0.40 0.96
0.5 6.0 0.80 0.84 9.0 2.40 0.93 10.0 0.0 0.95
0.10 1.5 4.0 0.0 0.67 5.0 0.43 0.68 6.0 0.80 0.74
5.0 4.0 0.07 0.64 4.0 0.05 0.64 4.0 0.11 0.64
10.0 4.0 0.0 0.56 48.0 1.79 0.74 4.0 0.22 0.56
15.0 4.0 0.0 0.55 4.0 0.0 0.55 4.0 0.0 0.55
20.0 4.0 0.0 0.52 11.0 11.38 0.55 4.0 0.0 0.52
Additionally, the Area Under the ROC Curve (AUC) evaluates the quality of the
prediction at the optimal decision time τ computed by the system.
Globally, one can see that when the noise level is low (ε ≤ 1.5) and the
waiting cost is low too (C(t) = ct × t, with ct ≤ 0.05), the system is able to reach
a high level of performance by waiting longer as the noise level increases.
When the waiting cost is high (C(t) = 0.1 × t), on the other hand, the system
takes a decision earlier at the cost of a somewhat lower prediction performance.
Indeed, with rising levels of noise, the system decides that it is not worth waiting
and makes a prediction early on, often at the earliest possible moment, which
was set to 4 in our experiments1 .
More specifically:
1 Below 4 measurements, the classifiers are not effective.
The above results, in Table (1) and Table (2), aggregate the measures on
the whole testing set. It is interesting to look as well at individual behaviors.
For instance, Figure (4) shows the expected costs fτ(x_t^1) and fτ(x_t^2) for two
different incoming sequences x_t^1 and x_t^2, for each of the potentially remaining τ
time steps. First, one can notice the overall shape of the cost function fτ (xt ) with
a decrease followed by a rise. Second, the dependence on the incoming sequence
appears clearly, with different optimal decision times. This confirms that the algorithm
takes into account the peculiarities of the incoming sequence.
In order to test the ability of the method to solve real problems, we have carried out
experiments using the real data set TwoLeadECG from the UCR repository.
This data set contains 1162 ECG signals all together, that we randomly and
disjointedly re-sampled and split into a training set of 70% of examples and the
remainder for the test set. Each signal is composed of 81 data points representing
the electrical activity of the heart from two different leads. The goal is to detect
an abnormal activity in the heart. Our experiments show that it is indeed possible
to make an informed decision before all measurements are made.
Table 2. Experimental results as a function of the noise level ε(t), the trend parameter
b, and the number of subgroups k+1 and k−1 in each class. The waiting cost C(t) is
fixed to 0.01.
(k+1 , k−1 )  ε(t)   τ̄   σ(τ )  AUC    τ̄   σ(τ )  AUC    τ̄   σ(τ )  AUC
0.2 9.0 2.40 0.99 9.0 2.40 0.99 10.0 0.0 1.00
0.5 13.0 4.40 0.98 13.0 4.40 0.98 15.0 0.18 1.00
(3,2) 1.5 24.0 10.02 0.98 32.0 2.56 1.00 30.0 12.79 1.00
5.0 26.0 7.78 0.84 30.0 18.90 0.87 30.0 19.14 0.88
10.0 38.0 18.89 0.70 48.0 1.79 0.74 46.0 5.27 0.75
15.0 23.0 15.88 0.61 32.0 13.88 0.64 29.0 17.80 0.62
20.0 7.0 8.99 0.52 11.0 11.38 0.55 4.0 1.22 0.52
0.2 7.0 2.47 0.86 7.0 2.15 0.89 7.0 3.00 0.85
0.5 11.0 5.10 0.87 10.0 4.87 0.88 14.0 7.07 0.91
(3,5) 1.5 20.0 12.69 0.85 18.0 11.80 0.87 26.0 16.33 0.89
5.0 44.0 4.75 0.83 46.0 2.81 0.87 38.0 11.49 0.81
10.0 42.0 6.34 0.67 39.0 7.59 0.68 25.0 8.57 0.61
15.0 28.0 5.99 0.58 32.0 6.51 0.59 19.0 10.12 0.58
20.0 17.0 11.72 0.50 13.0 10.72 0.56 17.0 5.93 0.55
Fig. 4. For two different incoming sequences (top figure), the expected costs (bottom
figure) are different. The minima have different values and occur at different instants.
These differences confirm that deciding to make a prediction depends on the incoming
sequence. (Here, b = 0.05, C(t) = 0.01 × t and ε = 1.5).
Table 3. Experimental results on real data as a function of the waiting cost C(t).
Since the costs involving quality and delay of decision are not provided with
this data set, we arbitrarily set these costs to C(ŷ|y) = 1, ∀ ŷ ≠ y, and C(t) = d × t,
where d ∈ {0.01, 0.05, 0.1}. The question here is whether the method is able to
make reliable predictions early and provide reasonable results.
Table (3) reports the average optimal decision time τ̄ over the test time
series, its associated standard deviation σ(τ), and the performance of the pre-
diction (AUC). It is remarkable that a very good performance, as measured by the
AUC, can be obtained from a limited set of measurements: e.g. 22 out of 81 if
C(t) = 0.01 × t, 24 out of 81 if C(t) = 0.05 × t, and 10 out of 81 if C(t) = 0.1 × t.
We therefore see that the baseline solution proposed here is able to (1) adapt
to each incoming sequence and (2) to predict an estimated optimal time of
prediction that yields very good prediction performances while controlling the
cost of delay.
7 Conclusion
The problem of online decision making has been known for decades, but numer-
ous new applications in medicine, electric grid management, automatic trans-
portation, and so on, give a new impetus to research works in this area. In this
paper, we have formalized a generic framework for early classification methods
that underlines two critical parts: (i) the optimization criterion that governs the
T rigger boolean function, and (ii) the manner by which the current information
about the incoming time sequence is taken into account.
Within this framework, we have proposed an optimization criterion that bal-
ances the expected gain in the classification cost in the future with the cost of
delaying the decision. One important property of this criterion is that it can be
computed at each time step for all future instants. This prediction of the future
gains is updated given the current observation and is therefore never certain,
but this yields a non myopic sequential decision process.
In this paper, we have sought to determine the baseline properties of our
proposed framework. Thus, we have used simple techniques such as: (i) clustering of
time series in order to compare the incoming time sequence to known shapes
from the training set, (ii) a simple formula to estimate the membership proba-
bility P (ck |xt ), and (iii) not optimized classifiers, here: naïve Bayes or a simple
implementation of Multi-Layer Perceptrons.
In this baseline setting, it is a remarkable feat that the experiments exhibit
such a close fit with the desirable properties for an early decision classification
algorithm, as stated in Section 6. The system indeed controls the decision time
References
1. DeGroot, M.H.: Optimal statistical decisions, vol. 82. John Wiley & Sons (2005)
2. Berger, J.O.: Statistical decision theory and Bayesian analysis. Springer Science &
Business Media (1985)
3. Wald, A., Wolfowitz, J.: Optimum character of the sequential probability ratio
test. The Annals of Mathematical Statistics, 326–339 (1948)
4. Sochman, J., Matas, J.: Waldboost-learning for time constrained sequential detec-
tion. In: IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, CVPR 2005, vol. 2, pp. 150–156. IEEE (2005)
5. Ishiguro, K., Sawada, H., Sakano, H.: Multi-class boosting for early classification
of sequences. Statistics 28, 337–407 (2000)
6. Xing, Z., Pei, J., Philip, S.Y.: Early prediction on time series: A nearest neighbor
approach. In: IJCAI, pp. 1297–1302. Citeseer (2009)
7. Anderson, H.S., Parrish, N., Tsukida, K., Gupta, M.: Early time-series classifica-
tion with reliability guarantee. Sandia Report (2012)
8. Parrish, N., Anderson, H.S., Gupta, M.R., Hsiao, D.Y.: Classifying with confidence
from incomplete information. J. of Mach. Learning Research 14, 3561–3589 (2013)
9. Hatami, N., Chira, C.: Classifiers with a reject option for early time-series classi-
fication. In: 2013 IEEE Symposium on Computational Intelligence and Ensemble
Learning (CIEL), pp. 9–16. IEEE (2013)
10. Ghalwash, M.F., Radosavljevic, V., Obradovic, Z.: Utilizing temporal patterns for
estimating uncertainty in interpretable early decision making. In: Proceedings of
the 20th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 402–411. ACM (2014)
11. Xing, Z., Pei, J., Philip, S.Y., Wang, K.: Extracting interpretable features for early
classification on time series. In: SDM, vol. 11, pp. 247–258. SIAM (2011)
12. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation
of cluster analysis. J. of computational and applied mathematics 20, 53–65 (1987)
13. Keogh, E., Xi, X., Wei, L., Ratanamahatana, C.A.: The UCR time series classifica-
tion/clustering homepage (2006). www.cs.ucr.edu/~eamonn/time_series_data/
Ising Bandits with Side Information
1 Introduction
Many domains encounter a problem in collecting annotated training data due
to the difficulty and cost of requiring the efforts of human annotators, while
abundant unlabelled data come for free. What makes the problem more chal-
lenging is that the data might often exhibit complex interactions that violate the inde-
pendent and identically distributed assumption of the data generation process.
In such domains, it is imperative that learning techniques can learn from unla-
belled data and the rich interaction-based structure of the data. Learning from
unlabelled and a few labelled data falls under the purview of semi-supervised
learning. Coupling it with an encoding of the data interdependencies as a graph
results in the attractive problem of learning on graphs.
Often, interesting applications are tied to such problems with rich underly-
ing structure. For example, consider the system of online advertising; serving
advertisements on web pages in an incremental fashion. The web pages can be
represented as vertices in the graph with the links as edges. At a given time t,
the current active concept for the learner to choose the action and receive the
feedback instantaneously. In their problem formulation under the Ising graph
setting, the algorithm tries to pick the action (the label of the concept), which is
NP-hard. In contrast, we focus on the low temperature setting, where our actions
lie on the edges, and are not the labels of the vertices. The computation of the
marginal at the vertices is guided by the labels seen so far and the minimal cut.
We approximate the labelling of the entire graph rather than predicting the spin
configuration of a single vertex using the “cut” as the regularizer that dominates
the action selection. The contextual bandits work on online clustering of ban-
dits [9] deals with finding groups or clusters of bandits in graphs. They have
a stochastic assumption of a linear function for reward generation. Similarity
is revealed by the parameter vector that is inferred over time. In contrast, we
use the similarity over edges to determine the “cut” which in-turns guides the
action selection process in adversarial settings. Their work extends to running
a contextual bandit for every node, whereas ours is a single bandit algorithm,
where the context information is captured in the “cut”. The work of Castro et
al. [7] of edge bandits is similar in the sense that the bandits lie on the edges.
However, instead of direct rewards of action selection, rewards are a difference
in the values of the vertices. Further, their setting is stochastic instead of
the adversarial one. In Spectral bandits [18], the actions are the nodes, while
there is a smooth Laplacian graph function for the rewards. We discuss later the
limitations of Laplacian based methods for graph labelling. Further, they do not
consider the Ising model that we study. The seminal work on semi-supervised
graph label prediction can be found in [6], where a minimum label-separating
cut is used for prediction. Laplacian-based methods, which encourage neighbouring
nodes connected by an edge to share similar values, are widely studied in the
semi-supervised and manifold learning problems [5,11,12,19,20]. Typically, this
information is captured by the semi-norm induced by the Laplacian of the graph.
Essentially, the smoothness of the labelling is ensured by the “cut”. The “cut”
is the number of edges with disagreeing labels. Then, the norm induced by the
Laplacian can be considered as the regularizer. However, there are limitations in
these methods with increasing unlabelled data [1,16]. Here, we also use “cut” as
the regularization measure over an Ising model distribution of the values over the
vertices of the graph at low temperatures. We simultaneously find the partition
using the “min-cut” and then sample the actions from the relevant partition.
When u ∈ {0, 1}^n, an edge (i, j) is “cut” when u_i ≠ u_j, and ψG(u) is
the number of “cut” edges.
y = argmin{u^T L u : u ∈ R^n, u_r = y_r, r = 1, . . . , l}.
The prediction is made by using ŷi = sgn(yi ) [13]. The rationale behind mini-
mizing the cut is that it enables neighbouring vertices to have similarly valued labels.
With p → 1, the prediction problem is reduced to predicting using the label
consistent minimum cut.
field P(u) = exp(−βE(u))/Z, where Z is the partition function and β is the inverse
temperature or the uncertainty in the model. There are multiple limitations
of the quadratic energy minimization technique. This model is not applicable
for p → 1 in the limit. Not only is the computation slow, the mistake bounds
obtained are not the best. Further, in our problem, we relax the values of the
labels such that u : V → [0, 1]. With p → 1, the energy function is equivalent
to the one that finds the minimum cut. Further, when p → 1 using (2)
results in the minimization of a non-strongly convex function per trial that is
not differentiable. Also interesting is that Laplacian-based methods are
limited when unlabelled data are abundant [16]. Hence, we are interested in
the Markov random field applicable here with discrete states also known as the
Ising model. At low temperatures, the Ising probability distribution over the
labellings of a graph G is defined by:
P_T^G(u) ∝ exp( −(1/T) · ψG(u) ).     (3)
where T is the temperature, u is the labelling over the vertices of G and ψG (u)
is the complexity of the labelling or the “cut-size”. The probabilistic Ising model
encodes the uncertainty about the labels of the vertices and at low temperatures
favours labellings that minimise the number of edges whose vertices have different
labels, as in (2) with p = 1. If the vertex–label pairs seen so far are given
by $Z_t = \big((j_1, y_1), \dots, (j_t, y_t)\big)$ with $(j, y) \in V(G) \times \{0,1\}$,
then the marginal probability of the label of the vertex v being y conditioned
on Zt is given by: PTG (uv = y|Zt ) = PTG (uv = y|uj1 = y1 , . . . , ujt = yt ). At low
temperatures and in the limit of zero temperature T → 0, the marginal favours
the labelling that is consistent with the labelling seen so far and the minimum
cut. Such label conditioning or label consistency in the context of graph labelling
has been extensively studied [11,12,15]. In this paper, we are only interested in
the low temperature setting of the Ising model as the environment in which the
player functions. However, at low temperatures, the minimum cut is still not
unique.
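To make the cut-size regulariser concrete, the following small sketch (our own Python illustration, not code from the paper) counts the cut edges of a labelling and evaluates the unnormalised low-temperature Ising weight exp(−ψ_G(u)/T) for a toy graph given as an edge list.

```python
import math

def cut_size(edges, labels):
    """Number of edges whose endpoints carry different labels (the "cut" psi_G(u))."""
    return sum(1 for (i, j) in edges if labels[i] != labels[j])

def ising_weight(edges, labels, T):
    """Unnormalised low-temperature Ising weight exp(-psi_G(u) / T) of a labelling."""
    return math.exp(-cut_size(edges, labels) / T)

# a 4-cycle with one disagreeing vertex: exactly two edges are cut
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: 1, 1: 1, 2: 1, 3: 0}
print(cut_size(edges, labels))              # 2
print(ising_weight(edges, labels, T=0.5))   # small weight: disagreement is penalised
```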
As with any sequential prediction game, the MAB is played between the learner
and the environment and proceeds in a series of rounds t = 1, . . . , n. At every
time instance t, the forecaster chooses an action It from the set of actions or
arms at ∈ A, where A is the action set with K actions. When sampling an arm,
the learner suffers a loss lt that the adversary chooses in a randomized way. The
forecaster receives the loss for the selected action only in the bandit setting.
The objective of the forecaster is to minimize the regret given by the difference
between the incurred cumulative loss on the sequence played and the optimal
cumulative loss with respect to the best possible action in hindsight. The decision
making process depends on the history of actions sampled and losses received
up until time t − 1. The notion of regret is expressed as expected (average)
regret and pseudo regret, where pseudo regret is the weaker notion because of
the comparison with the optimal action in expectation. For the adversarial case,
it is given by:
$$R_n = \mathbb{E}\left[\sum_{t=1}^{n} l_{I_t,t}\right] - \min_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{n} l_{i,t}\right]. \qquad (4)$$
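The regret in (4) is easy to compute empirically once the loss matrix and the played sequence are known; the sketch below (with a purely hypothetical random loss matrix) contrasts the incurred cumulative loss with the best fixed arm in hindsight.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 1000, 5
losses = rng.random((n, K))          # l_{i,t}: loss of arm i at round t (hypothetical)
played = rng.integers(0, K, size=n)  # I_t: the arms actually played

incurred = losses[np.arange(n), played].sum()   # cumulative loss of the played sequence
best_fixed = losses.sum(axis=0).min()           # best single arm in hindsight
regret = incurred - best_fixed
print(regret)
```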
2.4 Formulation
We consider an undirected graph G = (V (G), E(G)) where the elements of E
are called edges that form an unordered pair between the unique elements of V
that are called vertices. We assume a unit weight on every edge. The number of
vertices in the graph is denoted by N. The vertices of the graph are associated
with partially unknown concept values or labels si that are gradually revealed,
while the bandits lie on the edges in E(G) to form the action set A with cardi-
nality |K|. We assume a κ-connected graph, with κ the maximum value such that
each vertex has at least κ neighbours. Vertices i and j are neighbours if there
is an edge/action connecting them. Note that the number of rounds n ≤ |K|. In our
case, n is equal to the number of vertices queried by the environment with unknown
labels. A vertex is randomly selected by the environment at every round t; in our
case, the queried vertex is given by xi where i ∈ NN . In our example, the queried
vertex could represent the request to place an advert on the product website the
user currently visits. More specifically, the connections in our graph not only
capture the explicit connections between vertices given by locality, but our ban-
dits or edges also capture the implicit connections between the values of vertices
that are possibly differently labelled. In our case, the labels are relaxed such that
the label for the i-th vertex is denoted by si ∈ {−1, 1}.
At the start we are given the labels of a small subset of observed vertices,
$s^o \in S^o \subset V(G)$. The labels of the unlabelled vertices $s^u \in S^u \subset V(G) \setminus L$, with
$S = S^o \cup S^u$, are revealed sequentially, one at the end of each round,
as side information. We assume that there are at least two vertices labelled at
the start, one in each category. The learning algorithm plays the online bandit
game where the adversary at each trial reveals the loss of the selected action
and the label of a randomly selected vertex. The goal of the learner is to be
able to predict the label of the randomly selected vertex and then sample the
appropriate action given the prediction.
that has auxiliary variables introduced such that there is one variable for every
vertex v and one flow variable fij for every edge. Since we have an undirected graph,
we assume a directed edge in each direction for every undirected edge. Hence
we have two flow variables per edge in the graph. Essentially, the free variables
in the optimization are the unlabelled vertices $s_i^u$, $s_j^u$ and the flows $f_{ij}$ in each
direction. The total flow across all the edges will be our maximum flow for
this low temperature Ising model. The formulation in Fig.1 below is what the
learner follows to find the minimum cut ψG . The output from the computation
is a directed graph with the value of flow at every edge and the labelling of
the vertices consistent with the labels seen so far; w(i,j) is the cost variable
of the LP. The sum of the flows is the maximum flow in the Ising model at
low temperatures. We fix one of the labelled vertices as a source, and one as
target, each with different labels. We assume a unit capacity on every edge.
The constraints in Fig. 1 ensure that the capacity constraint on $f_{(i,j)}$ and the conservation
constraint on $s_i - s_j$ are adhered to, i.e., the flow into any vertex v other than the source
and target is equal to the flow out of v. The largest amount of flow that can pass
through any edge is at most 1, as we have unit capacity on every edge. We know
that the cost of the maximum flow is equal to the capacity of the minimum
cut. The minimum cut obtained as a solution to the optimization problem is an
integer.
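The label-consistent minimum cut can be obtained with any off-the-shelf max-flow solver; the sketch below uses networkx (an assumed dependency, not the authors' MATLAB implementation), with unit capacities and one directed arc per direction as described above.

```python
import networkx as nx

def label_consistent_min_cut(edges, source, target):
    """Min cut separating the two labelled vertices on a unit-capacity graph.

    Returns the cut value and the induced vertex partition (source side / target side).
    """
    G = nx.DiGraph()
    for (i, j) in edges:
        G.add_edge(i, j, capacity=1.0)   # one directed arc per direction,
        G.add_edge(j, i, capacity=1.0)   # mirroring the two flow variables per edge
    cut_value, (source_side, target_side) = nx.minimum_cut(G, source, target)
    return cut_value, source_side, target_side

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
value, S, T = label_consistent_min_cut(edges, source=0, target=2)
print(value, sorted(S), sorted(T))
```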
4 Experiments
In our experiments, we compare three competitor algorithms with our algorithm
IsingBandits: LabProp [19,20], Exp3 [3], and Exp4 [3].
subject to:
  f_{(i,j)} ≥ 0    (6)
  s_i − s_j ≤ f_{(i,j)}    (7)
  −1 ≤ s_i ≤ 1    (8)
Return: min-cut c*; flows f; consistent partition S

Parameters: graph G; η ∈ R+
Input: trial sequence H = (x_1, −1), (x_2, 1), (x_3, s_3), . . . , (x_t, s_t)
Initialization: p_1 is the initial distribution over A such that p_1 = (1/|K|, 1/|K|, . . . , 1/|K|);
  initial cut-size c = ∞; active partition distribution r_1 = p_1
for t = 1, . . . , n do
  Receive: x_t ∈ N_N
  (c*, f, S) = ComputeMaxFlow(s^o, s^u, H, G)
  if c ≠ c* then   % the cut has changed
    (E(R), E(J), r_t, j_t) = SelectPartition(x_t, p_t, S, A)
  Assign: q_t to be the distribution over the Ising bandits w.r.t. x_t, such that
    Σ_{i=1}^{|E(R)|} q_{i,t} = r_t. For any t, p_t = r_t ∪ j_t
  Play: I_t drawn from q_t
  Receive: loss z_t; side information s_t
  Compute: estimated loss z̃_{i,t} = (z_{i,t}/q_{i,t}) · 1{I_t = i}
end
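For illustration, the importance-weighted loss estimate and exponential-weights update used in the listing above can be sketched as follows (Exp3-style, restricted to the currently active edge set; the SelectPartition and ComputeMaxFlow subroutines of the paper are not reproduced, and all names here are hypothetical).

```python
import numpy as np

def play_round(weights, active, eta, loss_fn):
    """One bandit round on the active edge set: sample, observe, importance-weight, update."""
    q = np.zeros_like(weights)
    q[active] = weights[active] / weights[active].sum()   # distribution over active arms only
    arm = np.random.choice(len(weights), p=q)
    loss = loss_fn(arm)                                    # only the played arm's loss is revealed
    est = np.zeros_like(weights)
    est[arm] = loss / q[arm]                               # unbiased importance-weighted estimate
    weights[active] *= np.exp(-eta * est[active])          # exponential-weights update
    return arm, loss, weights

K = 8
weights = np.ones(K)
active = np.array([0, 2, 3, 5])                            # edges in the current partition
arm, loss, weights = play_round(weights, active, eta=0.1,
                                loss_fn=lambda a: np.random.rand())
```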
Exp3 and Exp4 are from the same family of algorithms for bandits in the adversarial
setting. Exp4 is the contextual bandit algorithm, the closest competitor to Ising from
the contextual perspective. The experts or contexts in Exp4 for our problem
setting are a number of possible labellings. Note that the number of experts
selected for prediction has a bearing on the performance of the algorithm. In
our experiments, we fixed the number of experts to 10. In reality, even at low
temperatures for the model we consider, the set of all possible labellings is expo-
nential in size. LabProp [19,20] is the implementation where the state-of-the-art
graph Laplacian based labelling procedure is used to optimize the labelling con-
sistent with the labels seen so far. For all of the above algorithms, we use our
own implementation in MATLAB. Since online experiments are extremely time
consuming when processing one data point at a time, we averaged each set
of experiments over five trials, except for ISOLET, where we average over ten trials.
The datasets that we use are the standardized UCI datasets namely the USPS
and the ISOLET datasets. All datasets are nearly class-balanced in our experiments,
to ensure a fair class distribution and to avoid degenerate majority-vote cases
where the class with the majority always wins.
We design our experiments to test the action selection algorithm under a num-
ber of different criteria of graph creation: balanced labels, varying degree of
connectedness, varying sizes of initial labels and noise. The parameters that are
varied across the experiments are graph size indicated by N , labels available as
L, connectivity K, noise levels nse.
In the set of experiments with ISOLET, we chose to build the graph from
the first 30 speakers in Isolet1 that forms a graph of 1560 vertices of 52 spoken
letters (each letter spoken twice) by 30 speakers. The concept classes that are
sampled are the first 13 letters of the English alphabet as one concept vs. the next
13 letters as the other concept. We build a 3 nearest neighbour graph from the
Euclidean distance matrix constructed using the pairwise distances between the
examples (spoken letters). In order to ensure that the graph is connected for
such low connectivity, we sample an MST for each graph and always maintain
the MST edges in the graph. The MST uses the Euclidean distances as weights.
The same underlying graph is used across trials. The edges or connections form
the bandits. The available side information is sampled randomly such that the
two classes are balanced over the entire graph size.
In the USPS experiments, we randomly sample a different graph for each
trial. While sampling the vertices of the graph, we ensure that vertices are selected
equally from each concept class. We use a variety of concept classes: 1 vs. 2, 2
vs. 3 and 4 vs. 7. We use the pairwise Euclidean distance as the weights for
the MST construction. All the sampled graphs maintain the MST edges. In
all the experiments on the datasets, the unweighted minimum spanning tree
(MST) and the K = 3 nearest-neighbour graph had their edge sets unioned to create the
resultant graph. The motivation is that most empirical experiments had shown
competitive performance of the algorithms at K = 3, while the MST guaranteed
connectivity in the graph. Besides, MST-based graphs are sparse in general,
enabling computationally efficient completion of the experiments. All the
experiments were carried out on quad-core notebooks (2.30 GHz) with 8 GB or 16 GB of RAM.
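The graph construction described above, the union of an unweighted 3-NN graph with the Euclidean MST, can be reproduced with standard scientific-Python tools; this sketch is an assumption of this presentation rather than the authors' MATLAB code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

def knn_union_mst_edges(X, k=3):
    """Edge set of the k-NN graph unioned with the Euclidean MST (guarantees connectivity)."""
    D = squareform(pdist(X))                          # pairwise Euclidean distances
    knn = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    mst = minimum_spanning_tree(D)                    # MST weighted by Euclidean distance
    adj = ((knn + knn.T + mst + mst.T) > 0)           # symmetrise and take the union
    rows, cols = adj.nonzero()
    return {(int(i), int(j)) for i, j in zip(rows, cols) if i < j}

X = np.random.rand(50, 10)
edges = knn_union_mst_edges(X, k=3)                   # unweighted edges used as the bandit arms
```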
4.5 Results
Fig. 4. Results on torus graph generated from Squares image with equal number of
neighbours K = 4, N = 3600, L = 250.
Our dataset experiments begin with the USPS 2 vs. 3 experiment with con-
nectivity K = 3, available labels L = 8, and number of data points N = 1000.
In Fig. 5 below, algorithms Ising and LabProp are very competitive when side
information about more than half of the dataset is obtained. When the side
information is very limited at the beginning of the game, LabProp outperforms
Ising.
In Fig. 6 below, we test the behaviour of the algorithms with varying degree of
connectivity. We vary the parameter K over a range to check how well the cluster
size affects the performance. It is known from the graph-labelling literature that
the behaviour deteriorates with increasing K. Here, we see that Ising outperforms
LabProp for lower values of K, while LabProp wins for higher K.
In our experiments over the dataset ISOLET, we sample the graph from
ISOLET 1. In Fig. 7, we observe that with K = 3 and L = 128, Ising out-
performs LabProp throughout. The overall regret achieved in ISOLET is higher
than the regret achieved in USPS as ISOLET is a noisy dataset.
The following set of experiments in Fig. 8 and Fig. 9 test the robustness of
our methods in the presence of balanced noise. The noise parameter nse is varied
over the percentage range nse = 10, 20, 30, 40. When the noise is, say, x percent, we
randomly eliminate the actions/edges in the graph (from existing connections)
for which the noise is less than x percent, and add a balanced equal number of
new actions (connections) to the graph. We see that the performance of Ising
is the most robust across the various noise levels. LabProp suffers with noise as it is
heavily dependent on connectivity, and underperforms in comparison to Exp4 and
Exp3. In contrast, Ising uses the connectivity only as side information, and its
action selection is unaffected by the introduction of noise. As the noise level
increases, the performance of all the algorithms decreases uniformly.
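As we read the noise protocol above, a fraction nse of existing edges is removed at random and replaced by an equal number of fresh random connections; a hedged sketch of such a perturbation (the authors' exact sampling scheme may differ) is:

```python
import random

def perturb_edges(edges, n_vertices, nse, seed=0):
    """Remove a fraction `nse` of edges at random and add an equal number of new ones."""
    rng = random.Random(seed)
    edges = set(edges)
    n_remove = int(round(nse * len(edges)))
    removed = set(rng.sample(sorted(edges), n_remove))
    kept = edges - removed
    added = set()
    while len(added) < n_remove:                       # draw balanced replacement connections
        i, j = rng.sample(range(n_vertices), 2)
        e = (min(i, j), max(i, j))
        if e not in kept and e not in added:
            added.add(e)
    return kept | added
```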
Fig. 8. USPS 1 vs. 2 Robustness Experiments with noise levels 10% and 20%
Fig. 9. USPS 1 vs. 2 Robustness Experiments with noise levels 30% and 40%
5 Conclusion
There are real life scenarios where a core minimal subset of connections in a net-
work is responsible for partitioning the graph. Such a core group could be a focus
of targeted advertising or content-recommendation as that can have maximum
influence on the network with a potential to go viral. Typically, there is a lot
of available information in such settings that is potentially usable for detecting
the changing partitioning set. We address such advertising and content recom-
mendation challenges by casting the problem as an online Ising graph model of
bandits with side information. We use the notion of cut-size as a regularity mea-
sure in the model to identify the partition and play the bandits game. The best
case behaviour of the algorithm when there is a single partition is equivalent to
the standard adversarial MAB. We show a polynomial algorithm where the label
consistent “cut-size” can guide the sampling procedure. Further, we motivate a
linear time exact algorithm for computing the max flow that also respects the
label consistency. An interesting effect of the algorithm is that as long as the
cut-size does not change, the learner keeps playing the same partition on the
active action set (size smaller than the actual action set). The regret is then
bounded by the number of times the cut changes during the entire game. This
can be proven analytically, which we would like to pursue as future work.
References
1. Alamgir, M., von Luxburg, U.: Phase transition in the family of p-resistances. In:
Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q.
(eds.) NIPS, pp. 379–387 (2011)
2. Amin, K., Kearns, M., Syed, U.: Graphical models for bandit problems (2012).
arXiv preprint arXiv:1202.3782
3. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: Gambling in a rigged casino:
the adversarial multi-armed bandit problem. In: Proceedings of the 36th Annual
Symposium on Foundations of Computer Science, 1995, pp. 322–331. IEEE (1995)
4. Belkin, M., Matveeva, I., Niyogi, P.: Regularization and semi-supervised learning
on large graphs. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI),
vol. 3120, pp. 624–638. Springer, Heidelberg (2004)
5. Belkin, M., Niyogi, P.: Semi-supervised learning on Riemannian manifolds. Mach.
Learn. 56(1–3), 209–239 (2004)
6. Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph min-
cuts. In: ICML, pp. 19–26 (2001)
7. Di Castro, D., Gentile, C., Mannor, S.: Bandits with an edge. In: CoRR,
abs/1109.2296 (2011)
8. Ford, L.R., Fulkerson, D.R.: Maximal Flow through a Network. Canadian Journal
of Mathematics 8, 399–404 (1956). http://www.rand.org/pubs/papers/P605/
9. Gentile, C., Li, S., Zappella, G.: Online clustering of bandits (2014). arXiv preprint
arXiv:1401.8257
10. Gentile, C., Orabona, F.: On multilabel classification and ranking with bandit
feedback. The Journal of Machine Learning Research 15(1), 2451–2487 (2014)
11. Herbster, M.: Exploiting cluster-structure to predict the labeling of a graph. In:
Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI),
vol. 5254, pp. 54–69. Springer, Heidelberg (2008)
12. Herbster, M., Lever, G.: Predicting the labelling of a graph via minimum
p-seminorm interpolation. In: Proceedings of the 22nd Annual Conference on
Learning Theory (COLT 2009) (2009)
13. Herbster, M., Lever, G.: Predicting the labelling of a graph via minimum
p-seminorm interpolation. In: COLT (2009)
14. Herbster, M., Lever, G., Pontil, M.: Online prediction on large diameter graphs.
In: Advances in Neural Information Processing Systems, pp. 649–656 (2009)
15. Herbster, M., Pontil, M., Wainer, L.: Online learning over graphs. In: Proceedings
of the 22nd international conference on Machine learning ICML 2005, pp. 305–312.
ACM, New York (2005)
16. Nadler, B., Srebro, N., Zhou, X.: Statistical analysis of semi-supervised learning:
the limit of infinite unlabelled data. In: NIPS, pp. 1330–1338 (2009)
17. Trevisan, L.: Lecture 15, CS261: Optimization (2011). http://theory.stanford.edu/~trevisan/cs261/lecture15.pdf
18. Valko, M., Munos, R., Kveton, B., Kocák, T.: Spectral bandits for smooth graph
functions. In: 31th International Conference on Machine Learning (2014)
19. Zhu, X., Ghahramani, Z.: Towards semi-supervised classification with Markov random fields. Tech. Rep. CMU-CALD-02-106, Carnegie Mellon University (2002)
20. Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML, pp. 912–919 (2003)
Refined Algorithms for Infinitely Many-Armed
Bandits with Deterministic Rewards
1 Introduction
We consider a statistical learning problem in which the learning agent faces a
large pool of possible items, or arms, each associated with a numerical value
which is unknown a priori. At each time step the agent chooses an arm, whose
exact value is then revealed and considered as the agent’s reward at this time
step. The goal of the learning agent is to maximize the cumulative reward,
or, more specifically, to minimize the cumulative n-step regret (relative to the
largest value available in the pool). At every time step, the agent should decide
between sampling a new arm (with unknown value) from the pool, or sampling
a previously sampled arm with a known value. Clearly, this decision represents
the exploration vs. exploitation trade-off in the classic multi-armed bandit model.
Our model assumes that the number of available arms in the pool is unlimited,
and that the value of each newly observed arm is an independent sample from
a common probability distribution. We study two variants of the basic model:
the retainable arms case, in which the learning agent can return to any of the
previously sampled arms (with known value), and the case of non-retainable
arms, where previously sampled arms are lost if not immediately reused.
This model falls within the so-called infinitely many-armed framework, studied
in [3,4,6,7,10,11]. In most of these works, which are further elaborated on
below, the observed rewards are stochastic and the arms are retainable. Here,
we continue the work in [7] that assumes that the potential reward of each arm
is fixed and precisely observed once that arm is chosen. This simpler framework
allows us to obtain sharper bounds, which focus on the basic issue of the sample
size required to estimate the maximal value in the pool. At the same time, the
assumption that the reward is deterministic may be relevant in various appli-
cations, such as parts inspection, worker selection, and communication channel
selection. For this model, a lower bound on the regret and fixed time horizon
algorithms that attain this lower bound up to logarithmic terms were presented
in [7]. In the present paper, we propose algorithms that attain the same order
as the lower bound (with no additional logarithmic terms) under a fairly general
assumption on the tail of the probability distribution of the value. We further
demonstrate that these bounds may not be achieved without this assumption.
Furthermore, for the case where the time horizon is not specified, we provide an
anytime algorithm that also attains the lower bound under similar conditions.
As mentioned above, several papers have studied a similar model with
stochastic rewards. A lower bound on the regret was first provided in [3], for
the case of Bernoulli arms, with the arm values (namely the expected rewards)
distributed uniformly on the interval [0, 1]. For a known value distribution, algo-
rithms that attain the same regret order as that lower bound are provided in
[3,6,10], and an algorithm which attains that bound exactly under certain condi-
tions is provided in [4]. In [11], the model was analyzed under weaker conditions
that involve the form of the tail of the value distribution which is assumed known;
however, significantly, the maximal value need not be known a priori. A lower
bound and algorithms that achieve it up to logarithmic terms were developed
for this case. The assumptions in the present paper are milder, in the sense that
the tail distribution is not restricted in its form and only an upper bound on
this tail is assumed rather than an exact match. Our work also addresses the case
of non-retainable arms, which has not been considered in the above-mentioned
papers.
In a broader perspective, the present model may be compared to the
continuum-armed bandit problem studied in [1,5,9]. In this model the arms
are chosen from a continuous set, and the arm values satisfy some continuity
properties over this set. In the model discussed here, we do not assume any
regularity conditions across arms. The non-retainable arms version of our model
is reminiscent of the classical secretary problem, see for example [8] and [2] for
extensive surveys. In the secretary problem, the learning agent interviews job
candidates sequentially, and wishes to maximize the probability of hiring the
best candidate in the group. Our model considers the cumulative reward (or
regret) as the performance measure.
The paper proceeds as follows. In the next section we present our model and
the associated lower bound developed in [7]. Section 3 presents our algorithms
and regret bounds for the basic model (with known time horizon and retainable
arms). The extensions to anytime algorithms and the case of non-retainable
arms are presented in Section 4. Some numerical experiments which compare
the performance of the proposed algorithms to previous ones are described in
Section 5, followed by concluding remarks.
where r(t) is the reward obtained at time t, namely, the value of the arm chosen
at time t.
The following notations will be used in this paper:
– µ is a generic random variable with distribution function F .
– For $0 \le \epsilon \le 1$, let
$$D_0(\epsilon) = \inf\big\{D \ge 0 : P(\mu \ge \mu^* - D) \ge \epsilon\big\}.$$
¹ If the support of µ is a single interval, then $D_0(\epsilon)$ is continuous. In that case,
definition (2) reduces to the equation $nD_0(\epsilon) = 1/\epsilon$ which, by monotonicity, has a
unique solution for n large enough.
– Furthermore, let $D(\epsilon)$ denote a given upper bound on the tail function $D_0(\epsilon)$,
and let $\epsilon^*(n)$ be defined similarly to $\epsilon_0^*(n)$ with $D_0(\epsilon)$ replaced by $D(\epsilon)$,
namely,
$$\epsilon^*(n) = \sup\left\{\epsilon \in [0,1] : nD(\epsilon) \le \frac{1}{\epsilon}\right\}. \qquad (3)$$
Note that $\epsilon^*(n) \le \epsilon_0^*(n)$. Since $D_0(\epsilon)$ is a non-decreasing function, we assume,
without loss of generality, that $D(\epsilon)$ is also a non-decreasing function.
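Because $nD(\epsilon)$ is non-decreasing while $1/\epsilon$ is decreasing, $\epsilon^*(n)$ in (3) can be computed numerically by bisection; the sketch below is purely illustrative and uses a linear tail bound D(eps) = A*eps (the β = 1 case of Example 1 below) as a stand-in.

```python
def eps_star(n, D, tol=1e-12):
    """Largest eps in [0, 1] with n * D(eps) <= 1 / eps, for non-decreasing D (bisection)."""
    if n * D(1.0) <= 1.0:              # the constraint already holds at eps = 1
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n * D(mid) <= 1.0 / mid:
            lo = mid
        else:
            hi = mid
    return lo

A = 1.0                                 # hypothetical tail constant
D = lambda eps: A * eps                 # beta = 1 instance of Example 1
print(eps_star(10_000, D))              # roughly 1 / sqrt(n * A) = 0.01
```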
In the following sections, we shall assume that the upper bound $D(\epsilon)$ on the tail
function $D_0(\epsilon)$ is known to the learning agent, and that it satisfies the following
growth property.
Assumption 1
$$D(\epsilon) \le M\,D(\epsilon_0)\,\alpha^{\epsilon/\epsilon_0}$$
for every $0 < \epsilon_0 \le \epsilon \le 1$ and constants $M > 1$ and $1 \le \alpha < e$.
A general class of distributions that satisfies Assumption 1 is given in the
following example, which will further serve us throughout the paper.
Example 1. Suppose that $P(\mu \ge \mu^* - \epsilon) = \Theta(\epsilon^{\beta})$ for $\epsilon > 0$ small enough, where
$\beta > 0$. This is the case considered in [11]. Then $D_0(\epsilon) = \Theta(\epsilon^{1/\beta})$, and for $D(\epsilon) =
A\epsilon^{1/\beta}$, where $\beta > 0$ and $A > 0$, it can be obtained that $D(\epsilon) \le M\,D(\epsilon_0)\,\alpha^{\epsilon/\epsilon_0}$,
where $1 < \alpha < e$, $M = \lambda^{1/\beta}\alpha^{-\lambda}$ and $\lambda = \frac{1}{\beta\ln(\alpha)}$. Hence, in this case Assumption 1
holds. Note that $\beta = 1$ corresponds to a uniform probability distribution, which
is the case considered in [3] and [4] for $\mu^* = 1$.
Remark 1. Assumption 1 can be extended to any upper bound $\bar\alpha$ on the value
of $\alpha$ (instead of e). In that case, a proper modification of the algorithms below
leads to upper bounds that are larger by a constant multiplicative factor of $\ln(\bar\alpha)$.
However, as the assumption above covers most cases of interest, for simplicity
of presentation we do not go further into this extension. We note that the
algorithms presented here do not use the values of α and M.
For the case in which the tail function $D_0(\epsilon)$ itself is known to the learning
agent, the following lower bound on the expected regret was established in [7].
Theorem 1. The n-step regret is lower bounded by
$$\mathrm{regret}(n) \ge \frac{\mu^* - E[\mu]}{16}\,(1 - \delta_n)\,\frac{1}{\epsilon_0^*(n)}, \qquad (4)$$
where $\epsilon_0^*(n)$ satisfies (2), and $\delta_n = 1 - 2\exp\left(-\frac{(\mu^* - E[\mu])^2}{8}\,\epsilon_0^*(n)\right)$.
Corollary 1. Let $D(\epsilon)$ be an upper bound on the tail function $D_0(\epsilon)$ such that
$$\frac{D(\epsilon)}{D_0(\epsilon)} \le L < \infty, \qquad \forall\, 0 \le \epsilon \le 1.$$
Then, the n-step regret is lower bounded by
$$\mathrm{regret}(n) \ge \frac{\mu^* - E[\mu]}{16L}\,(1 - \delta_n)\,\frac{1}{\epsilon^*(n)}, \qquad (5)$$
where $\epsilon^*(n)$ satisfies (3) and $\delta_n$ is as defined in Theorem 1.
Proof: Let
$$\epsilon_L^*(n) = \sup\left\{\epsilon \in [0,1] : n\,\frac{D(\epsilon)}{L} \le \frac{1}{\epsilon}\right\}. \qquad (6)$$
Then, for every $0 \le \epsilon_1 \le 1$ such that $\epsilon_L^*(n) < \epsilon_1$, by (6) and the assumed
condition of the Corollary, it follows that $\frac{1}{\epsilon_1} < n\,\frac{D(\epsilon_1)}{L} \le nD_0(\epsilon_1)$. Therefore, by
Equation (2), $\epsilon_0^*(n) < \epsilon_1$. Thus,
$$\epsilon_0^*(n) \le \epsilon_L^*(n). \qquad (7)$$
Now, we need to compare $\epsilon_L^*(n)$ to $\epsilon^*(n)$. Let $L\epsilon^*(n) < \epsilon_2$. Since the tail function
is non-decreasing, it follows that $\frac{L}{\epsilon_2} < nD\!\left(\frac{\epsilon_2}{L}\right) \le nD(\epsilon_2)$, so that $\frac{1}{\epsilon_2} < n\,\frac{D(\epsilon_2)}{L}$.
Hence, $\epsilon_L^*(n) < \epsilon_2$, and
$$\epsilon_L^*(n) \le L\epsilon^*(n). \qquad (8)$$
Equations (7) and (8) imply that $\epsilon_0^*(n) \le L\epsilon^*(n)$, or $\frac{1}{L\epsilon^*(n)} \le \frac{1}{\epsilon_0^*(n)}$. By substituting
in Equation (4), the Corollary is obtained.
The upper bound obtained in the above Theorem is of the same order as the
lower bound in Equation (5). Note that the values of M and α in Assumption 1
are not used in the algorithm, but only appear in the regret bound.
Example 1 (continued). For $\beta = 1$ ($\mu$ uniform on $[a, b]$), Assumption 1 holds
for any $\alpha \in [1, e)$, with $M = \lambda^{1/\beta}\alpha^{-\lambda}$, where $\lambda = \frac{1}{\beta\ln(\alpha)}$, and $\frac{1}{\epsilon^*(n)} = \sqrt{n(b-a)}$.
Therefore, for $\beta = 1$, with the optimized choice of $\alpha = 1.47$, we obtain
$\mathrm{regret}(n) < 4.1\,\sqrt{n(b-a)} + 1$.
Now, by Assumption 1,
$$\Delta_0^{N,\epsilon^*(n)} \le \sum_{i=1}^{N} M\,D(\epsilon^*(n))\,\alpha^i e^{1-i} < M\frac{\alpha e}{e - \alpha}\,D(\epsilon^*(n)). \qquad (14)$$
Therefore, by (10),
$$\mathrm{regret}(n) \le \frac{1}{\epsilon^*(n)} + 1 + nM\frac{\alpha e}{e - \alpha}\,D(\epsilon^*(n)) \le \left(1 + M\frac{\alpha e}{e - \alpha}\right)\frac{1}{\epsilon^*(n)} + 1.$$
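Algorithm 1 itself is not reproduced in this excerpt, but the bound above has an explore-then-exploit flavour; the following sketch, which samples on the order of 1/ε*(n) fresh arms and then replays the best deterministic value found, is our illustration of the model and not necessarily the authors' exact rule.

```python
import numpy as np

def explore_then_exploit(n, eps_star, draw_arm):
    """Sample ~1/eps_star new arms (deterministic values), then replay the best one."""
    n_explore = min(n, int(np.ceil(1.0 / eps_star)))
    values = [draw_arm() for _ in range(n_explore)]   # each reveal counts as one round's reward
    best = max(values)
    rewards = values + [best] * (n - n_explore)       # retainable arms: keep pulling the best
    return float(np.sum(rewards)), best

rng = np.random.default_rng(1)
total, best = explore_then_exploit(n=10_000, eps_star=0.01,
                                   draw_arm=lambda: rng.uniform(0.0, 0.99))
```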
For the case that Assumption 1 does not hold, we provide an example for
which the regret is larger than the lower bound presented in Equation (4) by a
logarithmic term.
Example 2. Suppose that $P(\mu \ge \mu^* - \epsilon) = -\frac{1}{\ln(\epsilon)}$. Then $D_0(\epsilon) = e^{-1/\epsilon}$, and it
follows that $\frac{1}{\ln(n)} \le \epsilon_0^*(n) \le \frac{2}{\ln(n)}$.
Take $\epsilon_0 = \frac{\epsilon}{2}$. Then, for any $\alpha > 1$ and $M > 0$, for $\epsilon$ small enough we obtain
$\frac{D(\epsilon)}{D(\epsilon_0)} = e^{1/\epsilon_0 - 1/\epsilon} = e^{1/\epsilon} > M\alpha^2 = M\alpha^{\epsilon/\epsilon_0}$. Hence, Assumption 1 does not hold.
Lemma 1. For the case considered in Example 2, the best regret which can be
achieved is larger by a multiplicative logarithmic factor ($\ln(n)$) than the lower
bound presented in Equation (4).
Proof: Let N stand for the number of sampled arms; then, one can find that
where $\Delta(N) = E[\mu^* - V_N(1)]$. To bound the second term of Equation (15), note
that, for any $N \le \frac{1}{\epsilon}$,
$$\Delta(N) \ge \sum_{i=1}^{N} D_0(\epsilon i)\,P\big(D_0(\epsilon(i+1)) \ge \mu^* - V_N(1) > D_0(\epsilon i)\big) = \sum_{i=1}^{N} D_0(\epsilon i)\big(\Delta^{N,\epsilon}(i) - \Delta^{N,\epsilon}(i+1)\big) \triangleq \tilde\Delta(N),$$
where
$$\Delta^{N,\epsilon}(i) = P\big(\mu^* - V_N(1) > D_0(\epsilon i)\big).$$
By the fact that $D_0(\epsilon)$ is continuous, it follows that
and
Noting that $e^{-1} \ge (1 - \epsilon)^{1/\epsilon} \ge \exp\left(-1 - \frac{\epsilon}{1-\epsilon}\right)$ for $\epsilon \in (0, 1]$, we obtain for the
choice of $\epsilon = \frac{1}{N}$ that
$$\Delta^{N,\frac{1}{N}}(i) - \Delta^{N,\frac{1}{N}}(i+1) \ge e^{-i}\,\beta_N^i,$$
where $\beta_N^i = e^{-\frac{i^2}{N-i}} - e^{-1}$.
Now, since $D_0(\epsilon i) = D_0(\epsilon)\,e^{\frac{1}{\epsilon}\frac{i-1}{i}}$, again for the choice of $\epsilon = \frac{1}{N}$, it follows that
$$\tilde\Delta(N) \ge \sum_{i=1}^{N} D_0\!\left(\tfrac{1}{N}\right) e^{N - \frac{N}{i}}\,e^{-i}\,\beta_N^i \ge \sqrt{N}\,D_0\!\left(\tfrac{1}{N}\right) e^{N - 2\sqrt{N}}\,\beta_N^{\sqrt{N}}.$$
Therefore, since $D_0\!\left(\tfrac{1}{N}\right) = e^{-N}$, for $N \ge 3$ we obtain that
$$\mathrm{regret}(n) = N\,E[\mu] + (n - N)\,\sqrt{N}\,e^{-2\sqrt{N}}\,\beta_N^{\sqrt{N}},$$
where $A = \frac{E[\mu]}{5}$. But, since $\epsilon_0^*(n) \le \frac{2}{\ln(n)}$, the order of the regret is larger by a
logarithmic factor than the lower bound on the regret of Equation (4).
4 Extensions
In this section we discuss two extensions of the basic model, the first is the case
in which the time horizon is not specified, leading to an anytime algorithm, and
the second is the non-retainable arms model.
arms is increasing gradually, the upper bound on the regret obtained here is
worse than that obtained in the case of known time horizon. However, we show
in Corollary 2 that it is of the same order, under an additional condition.
We note that applying the standard doubling trick to Algorithm 1 does not
serve our purpose here, as it would add a logarithmic factor to the regret bound.
In the following Theorem we provide an upper bound on the regret achieved
by the proposed Algorithm.
As $\frac{1}{\epsilon^*(n)} \ge \frac{1}{\epsilon^*(t)}$ for $t \le n$, it follows that, in the worst case, the bound in
Equation (16) is larger than the lower bound in Equation (5) by a logarithmic
term. However, as shown in the following corollary, under reasonable conditions
on the tail function $D(\epsilon)$, the bound in Equation (16) is of the same order as
the lower bound in Equation (5).
where $f(n) = \left(\frac{n+1}{n}\right)^{\frac{1}{1+\gamma}}$; note that $f(n) \to 1$ asymptotically.
Example 1 (continued). When $D(\epsilon) = \Theta(\epsilon^{1/\beta})$, it follows that $\frac{1}{\epsilon^*(t)} =
\Theta\big(t^{\frac{\beta}{1+\beta}}\big)$. Therefore, the condition of Corollary 2 holds.
Proof of Corollary 2: Under the assumed condition, $\bar B_1 t^{\bar\gamma} \le \frac{1}{\epsilon^*(t)} \le \bar B_2 t^{\bar\gamma}$,
where $\bar B_1 = B_1^{\frac{1}{1+\gamma}}$, $\bar B_2 = B_2^{\frac{1}{1+\gamma}}$ and $\bar\gamma = \frac{1}{1+\gamma}$. Therefore,
$$\sum_{t=2}^{n}\frac{1}{t\,\epsilon^*(t)} \le \sum_{t=2}^{n+1}\frac{1}{(t-1)\,\epsilon^*(t)} \le \sum_{t=2}^{n+1}\frac{2}{t\,\epsilon^*(t)} \le \frac{2\bar B_2}{\bar\gamma}\,(n+1)^{\bar\gamma} \le \frac{2\bar B_2 f(n)}{\bar B_1\,\bar\gamma}\,\frac{1}{\epsilon^*(n)}.$$
Proof of Theorem 3: Recall the notation $V_N(1)$ for the value of the best arm
found by sampling N different arms. We bound the regret by
$$\mathrm{regret}(n) \le \mathbb{E}\left[1 + \sum_{t=2}^{n}\Big(I(E_t) + (\mu^* - V_t(1))\,I(\bar E_t)\Big)\right],$$
where $E_t = \left\{m_t < \frac{1}{\epsilon^*(t)} + 1\right\}$, and $I(\cdot)$ is the indicator function.
Since $\frac{1}{\epsilon^*(t)} + 1$ is a monotone increasing function it follows that
$$1 + \sum_{t=2}^{n} I(E_t) \le \frac{1}{\epsilon^*(n)} + 2.$$
Recall that $\Delta(t) = E[\mu^* - V_t(1)]$; then, since $V_t(1)$ is non-decreasing, by Equations (11)-(14),
we obtain that
$$\mathbb{E}\left[\sum_{t=2}^{n}(\mu^* - V_t(1))\,I(\bar E_t)\right] \le \sum_{t=2}^{n}\Delta\!\left(\frac{1}{\epsilon^*(t)} + 1\right) \le \sum_{t=2}^{n} M\frac{\alpha e}{e - \alpha}\,D(\epsilon^*(t)).$$
Theorem 4. Under Assumptions 1 and 2, for every n > 1, the regret of Algorithm 3
is upper bounded by
$$\mathrm{regret}(n) \le \left(1 + \ln^{\frac{\tau}{1+\tau}}(n)\left(CM\frac{\alpha e}{e - \alpha} + \frac{2}{\ln(2)}\right)\right)\frac{1}{\epsilon^*(n)} + \frac{2\ln^{\frac{\tau}{1+\tau}}(n)}{\ln(2)} + 1 \qquad (18)$$
Proof: For $N \ge 1$, recall that $V_N(1)$ stands for the value of the best arm
found by sampling N different arms and that $\Delta(N) = E[\mu^* - V_N(1)]$. Clearly,
where the random variable $Y(V)$ is the number of arms sampled until an arm
with a value larger or equal to V is sampled. The second term in Equation (19)
can be bounded similarly to the second term in Equation (10). Namely, since
$N \ge \ln^{-\frac{1}{1+\tau}}(n)\,\frac{1}{\epsilon^*(n)}$,
$$\Delta^{N,\,\ln^{\frac{1}{1+\tau}}(n)\,\epsilon^*(n)} \le \Delta_0^{N,\,\ln^{\frac{1}{1+\tau}}(n)\,\epsilon^*(n)},$$
Let $\bar M$ be such that $\epsilon_{\bar M}$ is the first element in the sequence which is larger or
equal to one, and set $\epsilon_{\bar M} = 1$. Then, since $D(\epsilon_{i+1}) = D(2\epsilon_i) = \gamma_{i+1}\ \forall i \ge 1$, and
$E[Y(V)]$ is non-decreasing in V, we obtain that
Then, by the expected value of a Geometric distribution, Equation (21), and the
fact that $\gamma_i = D(\epsilon_i)$, we obtain that
$$E[Y(\mu^* - \gamma_i)] = \frac{1}{\epsilon_i}.$$
Also, since $\gamma_{i+1} = D(2\epsilon_i)$, it follows that
$$\Phi_N \le \sum_{i=1}^{\bar M} 2N \le 2N\,\frac{\ln(n)}{\ln(2)} \le \frac{2\ln^{\frac{\tau}{1+\tau}}(n)}{\ln(2)}\left(\frac{1}{\epsilon^*(n)} + 1\right). \qquad (23)$$
By combining Equations (19), (20), (22) and (23) the claimed bound in Equation (18)
is obtained.
We note that a combined model which considers the anytime problem for the
non-retainable arms case can be analyzed by similar methods. However, we do
not consider this variant here.
5 Experiments
We next investigate numerically the algorithms presented in this paper, and
compare them to the relevant algorithms from [7,11]. Recall that the present
deterministic model was only studied in [7], while the model considered in [11]
is similar in its assumptions to the present one, in that only the form of the
tail function (rather than the exact value distribution) is assumed known. Since the
algorithms in [11] are analyzed only for the case of Example 1 (i.e., $D(\epsilon) = \Theta(\epsilon^{1/\beta})$),
we adhere to this model with several values of β for our experiments.
The maximal value is taken μ∗ = 0.99, but is not known to the learning agent.
In addition, since the algorithms presented in [11] were designed for the
stochastic model, they apply the UCB-V policy on the sampled set of arms.
Here, we eliminate this stage, which is evidently redundant for the deterministic
model considered here.
Table 1. Average regret for the retainable arms model for the known time horizon
case.
Time Horizon
β = 0.9 β=1 β = 1.1
Algorithm 4 × 104 7 × 104 10 × 104 4 × 104 7 × 104 10 × 104 4 × 104 7 × 104 10 × 104
UCB-F-10 574 740 870 1022 1350 1612 1376 1847 2227
UCB-F-0.2 1043 1410 1778 1043 1410 1778 1129 1445 1764
KT&RA 423 578 705 568 787 970 738 1035 1276
Algorithm 1 242 307 360 287 388 460 381 515 626
In Figure 1, we present the average regret of 200 runs vs. the time horizon
for β = 0.9, β = 1 and β = 1.1. The empirical standard deviation is smaller
than 5% in all of our results. As shown in Figure 1 and detailed in Table 1, the
performance of Algorithm 1 is significantly better than the other algorithms.
Fig. 1. Average regret (y-axis) vs. the time horizon (x-axis) for β = 0.9, β = 1 and
β = 1.1.
For the retainable arms model and unspecified time horizon, we compare Algo-
rithm 2 with the UCB-AIR algorithm presented in [11]. Since these algorithms
are identical for β ≥ 1, we run this experiment for β = 0.7, β = 0.8 and β = 0.9.
In Figure 2 we present the average regret of 200 runs vs. the time. It is shown
in Figure 2 and detailed in Table 2 that the average regret of Algorithm 2 is
Fig. 2. Average regret (y-axis) vs. time (x-axis) for β = 0.7, β = 0.8 and β = 0.9.
Table 2. Average regret for the retainable arms model for the unknown time horizon
case.
Time Horizon
β = 0.7 β = 0.8 β = 0.9
Algorithm 4 × 104 7 × 104 10 × 104 4 × 104 7 × 104 10 × 104 4 × 104 7 × 104 10 × 104
UCB-AIR 414 667 808 440 589 710 486 646 771
Algorithm 2 264 341 402 305 389 414 542 642 1764
Fig. 3. Average regret (y-axis) vs. the time horizon (x-axis) of KT&NA and Algorithm 3, for β = 0.9, β = 1 and β = 1.1.
smaller and increases more slowly than that of the UCB-AIR algorithm. Here the
empirical standard deviation is smaller than 7% in all of our results.
Time Horizon
β = 0.7 β = 0.8 β = 0.9
Algorithm 4 × 104 7 × 104 10 × 104 4 × 104 7 × 104 10 × 104 4 × 104 7 × 104 10 × 104
KT&NA 12800 16460 19760 13670 18800 22760 18850 24950 30170
Algorithm 3 418 741 983 509 646 791 674 1048 1277
References
1. Auer, P., Ortner, R., Szepesvári, C.: Improved rates for the stochastic continuum-
armed bandit problem. In: Bshouty, N.H., Gentile, C. (eds.) COLT. LNCS (LNAI),
vol. 4539, pp. 454–468. Springer, Heidelberg (2007)
2. Babaioff, M., Immorlica, N., Kempe, D., Kleinberg, R.: Online auctions and gen-
eralized secretary problems. ACM SIGecom Exchanges 7(2), 1–11 (2008)
3. Berry, D.A., Chen, R.W., Zame, A., Heath, D.C., Shepp, L.A.: Bandit problems
with infinitely many arms. The Annals of Statistics, 2103–2116 (1997)
4. Bonald, T., Proutiere, A.: Two-target algorithms for infinite-armed bandits
with Bernoulli rewards. In: Advances in Neural Information Processing Systems,
pp. 2184–2192 (2013)
5. Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C.: X-armed bandits. Journal of
Machine Learning Research 12, 1655–1695 (2011)
6. Chakrabarti, D., Kumar, R., Radlinski, F., Upfal, E.: Mortal multi-armed bandits.
In: Advances in Neural Information Processing Systems, pp. 273–280 (2009)
7. David, Y., Shimkin, N.: Infinitely many-armed bandits with unknown value distri-
bution. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD
2014, Part I. LNCS, vol. 8724, pp. 307–322. Springer, Heidelberg (2014)
8. Freeman, P.: The secretary problem and its extensions: A review. International
Statistical Review, 189–206 (1983)
9. Kleinberg, R., Slivkins, A., Upfal, E.: Multi-armed bandits in metric spaces. In:
Proceedings of the 40th Annual ACM Symposium on Theory of Computing,
pp. 681–690. ACM (2008)
10. Teytaud, O., Gelly, S., Sebag, M.: Anytime many-armed bandits. In: CAP (2007)
11. Wang, Y., Audibert, J.-Y., Munos, R.: Algorithms for infinitely many-armed ban-
dits. In: Advances in Neural Information Processing Systems, pp. 1729–1736 (2009)
Deep Learning
An Empirical Investigation of Minimum
Probability Flow Learning Under Different
Connectivity Patterns
1 Introduction
where the sum in the first term is over the dataset, D, and the sum in the
second term is over the entire domain of x. The first term has the effect of
pushing the parameters in a direction that decreases the energy surface of the
model at the training data points, while the second term increases the energy of
all possible states. Since the second term is intractable for all but trivial models,
we cannot, in practice, accommodate for every state of x, but rather resort to
sampling. We call states in the sum in the first term positive particles and those
in the second term negative particles, in accordance with their effect on the
likelihood (opposite their effect on the energy). Thus, the intractability of the
second term becomes a problem of negative particle selection (NPS).
The most famous approach to NPS is Contrastive Divergence (CD) [4], which
is the centre-piece of unsupervised neural network learning in energy-based mod-
els. “CD-k” proposes to sample the negative particles by applying a Markov
chain Monte Carlo (MCMC) transition operator k times to each data state.
This is in contrast to taking an unbiased sample from the distribution by apply-
ing the MCMC operator a large number of times until the distribution reaches
equilibrium, which is often prohibitive for practical applications. Much research
has attempted to better understand this approach and the reasoning behind its
success or failure [6,12], leading to many variations being proposed from the
perspective of improving the MCMC chain. Here, we take a more general app-
roach to the problem of NPS, in particular, through the lens of the Minimum
Probability Flow (MPF) algorithm [11].
MPF works by introducing a continuous dynamical system over the model’s
distribution, such that the equilibrium state of the dynamical system is the dis-
tribution used to model the data. The objective of learning is to minimize the
flow of probability from data states to non-data states after infinitesimal evolu-
tion under the model’s dynamics. Intuitively, the less a data vector evolves under
the dynamics, the closer it is to an equilibrium point; or from our perspective,
the closer the equilibrium distribution is to the data. In MPF, NPS is replaced
by a more explicit notion of connectivity between states. Connected states are
ones between which probability can flow under the dynamical system. Thus,
rather than attempting to approximate an intractable function (as in CD-k), we
run a simple optimization over an explicit, continuous dynamics, and actually
never have to run the dynamics themselves.
Interestingly, MPF and CD-k have gradients with remarkably similar form.
In fact, the CD-k gradients can be seen as a special case of the MPF gradients
- that is, MPF provides a generalized form which reduces to CD-k under a
special dynamics. Moreover, MPF provides a consistent estimator for the model
parameters, while CD-k as typically formalized is an update heuristic that can
sometimes do bizarre things like go in circles in parameter space [6]. Thus, in one
aspect, MPF solves the problem of contrastive divergence by re-conceptualizing
it as probability flow under an explicit dynamics, rather than the convenient but
biased sampling of an intractable function. The challenge thus becomes one of
how to design the dynamical system.
This paper makes the following contributions. First, we provide an expla-
nation of MPF that begins from the familiar territory of CD-k, rather than
the less familiar grounds of the master equation. While familiar to physicists,
the master equation is an apparent obscurity in machine learning, due most
likely to its general intractability. Part of the attractiveness of MPF is the way
it circumvents that intractability. Second, we derive a generalized form for the
MPF transition matrix, which defines the dynamical system. Third, we provide
where θ = {W, b, c} are the parameters of the model. The marginalized proba-
bility over visible variables is formulated from the Boltzmann distribution,
$$p(v;\theta) = \frac{p^*(v;\theta)}{Z(\theta)} = \frac{1}{Z(\theta)}\sum_{h}\exp\left(\frac{-1}{\tau}E(v,h;\theta)\right) \qquad (3)$$
such that $Z(\theta) = \sum_{v,h}\exp\left(\frac{-1}{\tau}E(v,h;\theta)\right)$ is a normalizing constant and τ is the
thermodynamic temperature. We can marginalize over the binary hidden states
in Equation 2 and re-express in terms of a new energy F(v),
$$F(v;\theta) = -\log\sum_{h}\exp\left(\frac{-1}{\tau}E(v,h)\right) \qquad (4)$$
$$\phantom{F(v;\theta)} = \frac{1}{\tau}\sum_{i}^{D} v_i b_i - \sum_{j=1}^{H}\log\left(1 + \exp\left(\frac{1}{\tau}\Big(c_j + \sum_{i}^{D} v_i W_{i,j}\Big)\right)\right) \qquad (5)$$
$$p(v;\theta) = \frac{\exp\big(-F(v;\theta)\big)}{Z(\theta)} \qquad (6)$$
Following physics, this form of the energy is better known as a free energy,
as it expresses the difference between the average energy and the entropy of a
distribution, in this case, that of p(h|v). Defining the distribution in terms of
free energy as p(v; θ) is convenient since it naturally copes with the presence of
latent variables.
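A direct numpy transcription of the free energy in Equations (5)-(6) is given below (τ = 1 by default; the parameter shapes are assumptions of this sketch and not taken from the authors' released code).

```python
import numpy as np

def free_energy(v, W, b, c, tau=1.0):
    """Free energy F(v; theta) of a Bernoulli RBM, following Equation (5)."""
    linear = v @ b / tau                                    # (1/tau) * sum_i v_i b_i
    hidden = np.log1p(np.exp((c + v @ W) / tau)).sum(-1)    # sum_j log(1 + exp((c_j + sum_i v_i W_ij)/tau))
    return linear - hidden

D, H = 6, 4                        # visible / hidden sizes (hypothetical)
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(D, H)), rng.normal(size=D), rng.normal(size=H)
v = rng.integers(0, 2, size=D).astype(float)
print(free_energy(v, W, b, c))     # p(v) is then proportional to exp(-F(v)), Equation (6)
```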
¹ https://github.com/jiwoongim/minimum_probability_flow_learning
where j are the data states and i are the non-data states and Γij is the
probability flow rate from state j to state i. Note that each state is a full vector
of variables, and we are theoretically enumerating all states. ṗi is the rate of
change of the probability of state i, that is, the difference between the probability
flowing out of any state j into state i and the probability flowing out of state i
to any other state j at time t. We can re-express ṗi in a simple matrix form as
$$\dot{\mathbf p} = \Gamma\,\mathbf p \qquad (9)$$
by setting $\Gamma_{ii} = -\sum_{j \ne i}\Gamma_{ji}$. We note that if the transition matrix Γ is
ergodic, then the model has a unique stationary distribution.
This is a common model for exploring statistical mechanical systems, but it
is unwieldy in practice for two reasons, namely, the continuous time dynamics,
and exponential size of the state space. For our purposes, we will actually find
the former an advantage, and the latter irrelevant.
The objective of MPF is to minimize the KL divergence between the data
distribution and the distribution after evolving an infinitesimal amount of time
under the dynamics:
MPF does not propose to actually simulate these dynamics. There is, in fact,
no need to, as the problem formulation reduces to a rather simple optimization
problem with no intractable component. However, we must provide a means
for computing the matrix coefficients Γij . Since our target distribution is the
distribution defined by the RBM, we require Γ to be a function of the energy,
or more particularly, the parameters of the energy function.
A sufficient (but not necessary) means to guarantee that the distribution
p∞ (θ) is a fixed point of the dynamics is to choose Γ to satisfy detailed balance,
that is
$$\Gamma_{ji}\,p_i^{(\infty)}(\theta) = \Gamma_{ij}\,p_j^{(\infty)}(\theta). \qquad (11)$$
The following theorem provides a general form for the transition matrix such
that the equilibrium distribution is that of the RBM:
Theorem 1.¹ Suppose $p_j^{(\infty)}$ is the probability of state j and $p_i^{(\infty)}$ is the probability
of state i. Let the transition matrix be
$$\Gamma_{ij} = g_{ij}\exp\left(\frac{o(F_i - F_j) + 1}{2}\,(F_j - F_i)\right) \qquad (12)$$
such that o(·) is any odd function, where gij is the symmetric connectivity
between the states i and j. Then this transition matrix satisfies detailed balance
in Equation 11.
The proof is provided in Appendix A.1. The transition matrix proposed by [11]
is thus the simplest case of Theorem 1, found by setting o(·) = 0 and gij = gji :
$$\Gamma_{ij} = g_{ij}\exp\left(\tfrac{1}{2}\big(F_j(\theta) - F_i(\theta)\big)\right). \qquad (13)$$
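The detailed-balance property (11) for the rates (13) is easy to verify numerically on a toy model by enumerating all states; in the sketch below the "free energies" are arbitrary random numbers, used only to exercise the identity.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D = 3
states = np.array(list(itertools.product([0, 1], repeat=D)), dtype=float)
F = rng.normal(size=len(states))                 # arbitrary "free energies", one per state

# symmetric 1-bit-flip connectivity g_ij
g = np.array([[1.0 if (si != sj).sum() == 1 else 0.0 for sj in states] for si in states])

# transition rates Gamma_ij = g_ij * exp((F_j - F_i) / 2), Equation (13)
Gamma = g * np.exp((F[None, :] - F[:, None]) / 2.0)

p = np.exp(-F) / np.exp(-F).sum()                # equilibrium distribution p_i ∝ exp(-F_i)
lhs = Gamma.T * p[:, None]                       # Gamma_ji * p_i
rhs = Gamma * p[None, :]                         # Gamma_ij * p_j
print(np.allclose(lhs, rhs))                     # True: detailed balance (11) holds
```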
Given a form for the transition matrix, we can now evaluate the gradient of
J(θ)
It can be shown that score matching is a special case of MPF in continuous state
spaces, where the connectivity function is set to connect all states within a small
Euclidean distance r in the limit of r → 0 [11]. For simplicity, in the case of a
discrete state space (Bernoulli RBM), we can fix the Hamming distance to one
instead, and consider that data states are connected to all other states 1-bit flip
away:
$$g_{ij} = \begin{cases} 1, & \text{if states } i, j \text{ differ by a single bit flip} \\ 0, & \text{otherwise} \end{cases} \qquad (14)$$
1-bit flip connectivity gives us a sparse Γ with $2^D D$ non-zero terms (rather than
a full $2^{2D}$), and may be seen as NPS where the only negative particles are those
which are 1-bit flip away from data states. Therefore, we only ever evaluate
$|D|\,D$ terms from this matrix, making the formulation tractable. This was the
only connectivity function pursued in [11] and is a natural starting point for the
approach.
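Under 1-bit-flip connectivity the MPF objective reduces to a sum over the |D|·D single-flip neighbours of the data; a minimal numpy sketch of this objective for a Bernoulli RBM (using the free-energy form (5) with τ = 1, and hypothetical parameter shapes) is:

```python
import numpy as np

def free_energy(V, W, b, c):
    # Equation (5) with tau = 1; V has shape (n_samples, D)
    return V @ b - np.log1p(np.exp(c + V @ W)).sum(-1)

def mpf_objective_1flip(X, W, b, c):
    """1-bit-flip MPF objective: (1/|D|) sum_{x in D} sum_d exp((F(x) - F(flip(x, d))) / 2)."""
    total = 0.0
    Fx = free_energy(X, W, b, c)
    for d in range(X.shape[1]):
        Xn = X.copy()
        Xn[:, d] = 1.0 - Xn[:, d]                  # flip bit d of every data vector
        total += np.exp(0.5 * (Fx - free_energy(Xn, W, b, c))).sum()
    return total / X.shape[0]

X = (np.random.rand(20, 16) > 0.5).astype(float)   # toy binary "data"
W, b, c = np.random.randn(16, 8) * 0.01, np.zeros(16), np.zeros(8)
print(mpf_objective_1flip(X, W, b, c))
```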
$$= \frac{1}{|D|}\sum_{j\in D}\exp\left(\tfrac{1}{2}\big(F_j(x;\theta) + \log g_j\big)\right)\sum_{i\notin D}\exp\left(-\tfrac{1}{2}\big(F_i(x;\theta) + \log g_i\big)\right) \qquad (16)$$
where $\left(\frac{g_j}{g_i}\right)^{1/2}$ is a scaling term required to counterbalance the difference between
gi and gj . The independence in the connectivity function allows us to factor
all the j terms in 15 out of the inner sum, leaving us with a product of sums,
something we could not achieve with 1-bit flip connectivity since the connection
to state i depends on it being a neighbor of state j. Note that, intuitively,
learning is facilitated by connecting data states to states that are probable under
the model (i.e. to contrast the divergence). Therefore, we can use p(v; θ) to
approximate gi . In practice, for each iteration n of learning, we need the gi and
gj terms to act as constants with respect to updating θ, and thus we sample
them from p(v; θn−1 ). We can then rewrite the objective function as J(θ) =
JD (θ)JS (θ)
$$J_D(\theta) = \frac{1}{|D|}\sum_{x\in D}\exp\left(\tfrac{1}{2}\big(F(x;\theta) - F(x;\theta^{n-1})\big)\right) \qquad (17)$$
$$J_S(\theta) = \frac{1}{|S|}\sum_{x'\in S}\exp\left(-\tfrac{1}{2}\big(F(x';\theta) - F(x';\theta^{n-1})\big)\right) \qquad (18)$$
where S is the sampled set from p(v; θn−1 ), and the normalization terms in log gj
and log gi cancel out. Note we use the θn−1 notation to refer to the parameters
at the previous iteration, and simply θ for the current iteration.
5 Experiments
We conducted the first empirical study of MPF under different types of connec-
tivity as discussed in Section 4. We compared our results to CD-k with vary-
ing values for K. We analyzed the MPF variants based on training RBMs and
assessed them quantitatively and qualitatively by comparing the log-likelihoods
of the test data and samples generated from model. For the experiments, we
denote the 1-bit flip, factorized, and persistent methods as MPF-1flip, FMPF,
and PMPF, respectively. The goals of these experiments are to
1. Compare among MPF algorithms under different connectivities; and
2. Compare between MPF and CD-k.
In our experiments, we considered the MNIST and CalTech Silhouette
datasets. MNIST consists of 60,000 training and 10,000 test images of size 28
× 28 pixels containing handwritten digits from the classes 0 to 9. The pixels in
MNIST are binarized based on thresholding. From the 60,000 training examples,
we set aside 10,000 as validation examples to tune the hyperparameters in our
models. The CalTech Silhouette dataset contains the outlines of objects from the
CalTech101 dataset, which are centred and scaled on a 28 × 28 image plane and
rendered as filled black regions on a white background creating a silhouette of
each object. The training set consists of 4,100 examples, with at least 20 and at
most 100 examples in each category. The remaining instances were split evenly
between validation and testing2 . Hyperparameters such as learning rate, number
of epochs, and batch size were selected from discrete ranges and chosen based on
a held-out validation set. The learning rate for FMPF and PMPF were chosen
from the range [0.001, 0.00001] and the learning rate for 1-bit flip was chosen
from the range [0.2, 0.001].
2
More details on pre-processing the CalTech Silhouettes can be found in http://
people.cs.umass.edu/∼marlin/data.shtml
Table 1. Experimental results on MNIST using 11 RBMs with 20 hidden units each.
The average training and test log-probabilities over 10 repeated runs with random
parameter initializations are reported.
Method Average log Test Average log Train Time(s) Batchsize
CD1 -145.63 ± 1.30 -146.62 ± 1.72 831 100
PCD -136.10 ± 1.21 -137.13 ± 1.21 2620 300
MPF-1flip -141.13 ± 2.01 -143.02 ± 3.96 2931 75
CD10 -135.40 ± 1.21 -136.46 ± 1.18 17329 100
FMPF10 -136.37 ± 0.17 -137.35 ± 0.19 12533 60
PMPF10 -141.36 ± 0.35 -142.73 ± 0.35 11445 25
FPMPF10 -134.04 ± 0.12 -135.25 ± 0.11 22201 25
CD15 -134.13 ± 0.82 -135.20 ± 0.84 26723 100
FMPF15 -135.89 ± 0.19 -136.93 ± 0.18 18951 60
PMPF15 -138.53 ± 0.23 -139.71 ± 0.23 13441 25
FPMPF15 -133.90 ± 0.14 -135.13 ± 0.14 27302 25
CD25 -133.02 ± 0.08 -134.15 ± 0.08 46711 100
FMPF25 -134.50 ± 0.08 -135.63 ± 0.07 25588 60
PMPF25 -135.95 ± 0.13 -137.29 ± 0.13 23115 25
FPMPF25 -132.74 ± 0.13 -133.50 ± 0.11 50117 25
Fig. 1. Samples generated from the training set. Samples in each panel are generated
by RBMs trained under different paradigms as noted above each image.
In our first experiment, we trained eleven RBMs on the MNIST digits. All RBMs
consisted of 20 hidden units and 784 (28×28) visible units. Due to the small num-
ber of hidden variables, we calculated the exact value of the partition function
by explicitly summing over all visible configurations. Five RBMs were learned
by PCD1, CD1, CD10, CD15, and CD25. Seven RBMs were learned by 1 bit
flip, FMPF, and FPMPF3 . Block Gibbs sampling is required for FMPF-k and
FPMPF-k similar to CD-k training, where the number of steps is given by k.
The average log test likelihood values of RBMs with 20 hidden units are
presented in Table 1. This table gives a sense of the performance under dif-
ferent types of MPF dynamics when the partition function can be calculated
exactly. We observed that PMPF consistently achieved a higher log-likelihood
than FMPF. MPF with 1 bit flip was very fast but gave poor performance
3
FPMPF is the composition of the FMPF and PMPF connectivities.
Table 2. Experimental results on MNIST using 11 RBMs with 200 hidden units each.
The average estimated training and test log-probabilities over 10 repeated runs with
random parameter initializations are reported. Likelihood estimates are made with
CSL [2] and AIS [8].
CSL AIS
Method Avg. log Test Avg. log Train Avg. log Test Avg. log Train Time(s) Batchsize
CD1 -138.63 ± 0.48 -138.70 ± 0.45 -98.75 ± 0.66 -98.61 ± 0.66 1258 100
PCD1 -114.14 ± 0.26 -114.13 ± 0.28 -88.82 ± 0.53 -89.92 ± 0.54 2614 100
MPF-1flip -179.73 ± 0.085 -179.60 ± 0.07 -141.95 ± 0.23 -142.38 ± 0.74 4575 75
CD10 -117.74 ± 0.14 -117.76 ± 0.13 -91.94 ± 0.42 -92.46 ± 0.38 24948 100
FMPF10 -115.11 ± 0.09 -115.10 ± 0.07 -91.21 ± 0.17 -91.39 ± 0.16 24849 25
PMPF10 -114.00 ± 0.08 -113.98 ± 0.09 -89.26 ± 0.13 -89.37 ± 0.13 24179 25
FPMPF10 -112.45 ± 0.03 -112.45 ± 0.03 -83.83 ± 0.23 -83.26 ± 0.23 24354 25
CD15 -115.96 ± 0.12 -115.21 ± 0.12 -91.32 ± 0.24 -91.87 ± 0.21 39003 100
FMPF15 -114.05 ± 0.05 -114.06 ± 0.05 -90.72 ± 0.18 -90.93 ± 0.20 26059 25
PMPF15 -114.02 ± 0.11 -114.03 ± 0.09 -89.25 ± 0.17 -89.85 ± 0.19 26272 25
FPMPF15 -112.58 ± 0.03 -112.60 ± 0.02 -83.27 ± 0.15 -83.84 ± 0.13 26900 25
CD25 -114.50 ± 0.10 -114.51 ± 0.10 -91.36 ± 0.26 -91.04 ± 0.25 55688 100
FMPF25 -113.07 ± 0.06 -113.07 ± 0.07 -90.43 ± 0.28 -90.63 ± 0.27 40047 25
PMPF25 -113.70 ± 0.04 -113.69 ± 0.04 -89.21 ± 0.14 -89.79 ± 0.13 52638 25
FPMPF25 -112.38 ± 0.02 -112.42 ± 0.02 -83.25 ± 0.27 -83.81 ± 0.28 53379 25
In our second set of experiments, we trained RBMs with 200 hidden units. We
trained them exactly as described in Section 5.1. These RBMs are able to gen-
erate much higher-quality samples from the data distribution, however, the par-
tition function can no longer be computed exactly.
In order to evaluate the model quantitatively, we estimated the test log-
likelihood using the Conservative Sampling-based Likelihood estimator (CSL)
[2] and annealed importance sampling (AIS) [8]. Given well-defined conditional
probabilities P (v|h) of a model and a set of latent variable samples S collected
from a Markov chain, CSL computes
$$\log \hat f(v) = \log\operatorname{mean}_{h\in S} P(v|h). \qquad (19)$$
The advantage of CSL is that sampling latent variables h instead of v has the
effect of reducing the variance of the estimator. Also, in contrast to annealed
importance sampling (AIS) [8], which tends to overestimate, CSL is much more
conservative in its estimates. However, most of the time, CSL is far off from the
Fig. 2. Samples generated from the training set. Samples in each panel are generated
by RBMs trained under different paradigms as noted above each image.
true estimator, so we bound our negative log-likelihood estimate from above and
below using both AIS and CSL.
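Equation (19) is straightforward to implement once the RBM conditional P(v|h) is available; a hedged sketch with Bernoulli conditionals and a numerically stable log-mean-exp follows (shapes and names are assumptions of this sketch).

```python
import numpy as np
from scipy.special import expit, logsumexp

def csl_log_likelihood(v, H_samples, W, b):
    """CSL estimate log f(v) = log mean_{h in S} P(v | h) for a Bernoulli RBM.

    v: (D,) binary test vector; H_samples: (S, H) hidden samples from a Markov chain;
    W: (D, H) weights; b: (D,) visible biases.
    """
    pre = b + H_samples @ W.T                                  # (S, D) visible pre-activations
    p = expit(pre)                                             # P(v_d = 1 | h)
    log_pvh = (v * np.log(p + 1e-12) + (1 - v) * np.log(1 - p + 1e-12)).sum(axis=1)
    return logsumexp(log_pvh) - np.log(len(H_samples))         # log of the mean over samples
```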
Table 2 demonstrates the test log-likelihood of various RBMs with 200 hid-
den units. The ranking of the different training paradigms with respect to per-
formance was similar to what we observed in Section 5.1 with PMPF emerging
as the winner. However, contrary to the first experiment, we observed that MPF
with 1 bit flip did not perform well. Moreover, FMPF and PMPF both tended to
give higher test log-likelihoods than CD-k training. Smaller batch sizes worked
better with MPF when the number of hiddens was increased. Once again, we
observed smaller variances compared to CD-k with both forms of MPF, espe-
cially with FMPF. We noted that FMPF and PMPF always have smaller vari-
ance compared to CD-k. This implies that FMPF and PMPF are less sensitive to
random weight initialization. Figure 2 shows initial data and generated samples
after running 100 Gibbs steps for each RBM. PMPF clearly produces samples
that look more like digits.
dynamics always begin closer to equilibrium, and hence converge more quickly.
Figure 3 shows initial data and generated samples after running 100 Gibbs steps
for each RBM on Caltech28 dataset.
6 Conclusion
MPF is an unsupervised learning algorithm that can be employed off-the-shelf
to any energy-based model. It has a number of favourable properties but has not
seen application proportional to its potential. In this paper, we first expounded
on MPF and its connections to CD-k training, which allowed us to gain a bet-
ter understanding and perspective to CD-k. We proved a general form for the
transition matrix such that the equilibrium distribution converges to that of an
RBM. This may lead to future extensions of MPF based on the choice of o(·) in
Equation 12.
One of the merits of MPF is that the choice of designing a dynamic system
by defining a connectivity function is left open as long as it satisfies the fixed
point equation. Additionally, it should scale similarly to CD-k and its variants
when increasing the number of visible and hidden units. We thoroughly explored
three different connectivity structures, noting that connectivity can be designed
inductively or by sampling. Finally, we showed empirically that MPF, and in
particular, PMPF, outperforms CD-k for training generative models. Until now,
RBM training was dominated by methods based on CD-k including PCD; how-
ever, our results indicate that MPF is a practical and effective alternative.
References
1. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I.J., Bergeron, A.,
Bouchard, N., Bengio, Y.: Theano: new features and speed improvements. In: Deep
Learning and Unsupervised Feature Learning NIPS 2012 Workshop (2012)
2. Bengio, Y., Yao, L., Cho, K.: Bounding the test log-likelihood of generative mod-
els. In: Proceedings of the International Conference on Learning Representations
(ICLR) (2013)
3. Besag, J.: Statistical analysis of non-lattice data. The Statistician 24, 179–195
(1975)
4. Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
Neural Computation 14, 1771–1880 (2002)
5. Hyvärinen, A.: Estimation of non-normalized statistical models by score matching.
Journal of Machine Learning Research 6, 695–709 (2005)
6. MacKay, D.J.C.: Failures of the one-step learning algorithm (2001). http://
www.inference.phy.cam.ac.uk/mackay/abstracts/gbm.html, unpublished Techni-
cal Report
7. Marlin, B.M., de Freitas, N.: Asymptotic efficiency of deterministic estimators for
discrete energy-based models: ratio matching and pseudolikelihood. In: Proceedings
of the Uncertainty in Artificial Intelligence (UAI) (2011)
8. Salakhutdinov, R., Murray, I.: On the quantitative analysis of deep belief networks.
In: Proceedings of the International Conference of Machine Learning (ICML)
(2008)
9. Smolensky, P.: Information processing in dynamical systems: foundations of
harmony theory. In: Parallel Distributed Processing: Volume 1: Foundations,
pp. 194–281. MIT Press (1986)
10. Sohl-Dickstein, J.: Persistent minimum probability flow. Tech. rep., Redwood Center
for Theoretical Neuroscience (2011)
11. Sohl-Dickstein, J., Battaglino, P., DeWeese, M.R.: Minimum probability flow learn-
ing. In: Proceedings of the International Conference of Machine Learning (ICML)
(2011)
12. Sutskever, I., Tieleman, T.: On the convergence properties of contrastive diver-
gence. In: Proceedings of the AI & Statistics (AI STAT) (2009)
13. Tieleman, T., Hinton, G.E.: Using fast weights to improve persistent contrastive
divergence. In: Proceedings of the International Conference of Machine Learning
(ICML) (2009)
14. Tierney, L.: Markov chains for exploring posterior distributions. Annals of Statistics
22, 1701–1762 (1994)
such that o(·) is any odd function, where g_{ij} is the symmetric connectivity
between the states i and j. Then this transition matrix satisfies detailed balance
in Equation 11.
Proof. By cancelling out the partition function, the detailed balance Equation
11 can be formulated as
1 Introduction
Recently, deep neural networks have achieved great success in hard AI tasks
[2,12,14,19], mostly relying on back-propagation as the main way of performing
credit assignment over the different sets of parameters associated with each layer
of a deep net. Back-propagation exploits the chain rule of derivatives in order
to convert a loss gradient on the activations over layer l (or time t, for recurrent
nets) into a loss gradient on the activations over layer l − 1 (respectively, time
t − 1). However, as we consider deeper networks (e.g., the recent best ImageNet
competition entrants [20] with 19 or 22 layers), longer-term dependencies, or
stronger non-linearities, the composition of many non-linear operations becomes
more strongly non-linear. To make this concrete, consider the composition of
many hyperbolic tangent units. In general, this means that derivatives
obtained by back-propagation are becoming either very small (most of the time)
or very large (in a few places). In the extreme (very deep computations), one
would get discrete functions, whose derivatives are 0 almost everywhere, and
2 Target Propagation
Although many variants of the general principle of target propagation can be
devised, this paper focuses on a specific approach, which is based on the ideas
presented in an earlier technical report [4] and is described in the following.
where hi is the state of the i-th hidden layer (where hM corresponds to the
output of the network and h0 = x) and fi is the i-th layer feed-forward mapping,
defined by a non-linear activation function si (e.g. the hyperbolic tangents or
the sigmoid function) and the weights Wi of the i-th layer. Here, for simplicity
of notation, the bias term of the i-th layer is included in Wi. We refer to the
subset of network parameters defining the mapping between the i-th and the
j-th layer (0 ≤ i < j ≤ M) as θ_W^{i,j} = {W_k, k = i + 1, . . . , j}. Using this notion,
we can write h_j as a function of h_i depending on parameters θ_W^{i,j}, that is, we can
write h_j = h_j(h_i; θ_W^{i,j}).
Given a sample (x, y), let L(h_M(x; θ_W^{0,M}), y) be an arbitrary global loss function
measuring the appropriateness of the network output h_M(x; θ_W^{0,M}) for the
target y, e.g. the MSE or cross-entropy for binomial random variables. Then,
the training objective corresponds to adapting the network parameters θ_W^{0,M} so
as to minimize the expected global loss E_p{L(h_M(x; θ_W^{0,M}), y)} under the data
distribution p(x, y). For i = 1, . . . , M − 1 we can write

L(h_M(x; θ_W^{0,M}), y) = L(h_M(h_i(x; θ_W^{0,i}); θ_W^{i,M}), y)    (2)
to emphasize the dependency of the loss on the state of the i-th layer.
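As a plain illustration of this notation (an illustrative sketch, not the authors' code; the tanh activation, the toy layer sizes and the squared-error loss are our assumptions), the forward computation h_i = s_i(W_i h_{i−1}) and the global loss can be written as:

```python
import numpy as np

def forward(x, weights, s=np.tanh):
    """Compute h_0 = x and h_i = s(W_i h_{i-1}) for i = 1..M; returns all layer states."""
    hs = [x]
    for W in weights:                 # weights[i-1] plays the role of W_i (the bias is omitted here)
        hs.append(s(W @ hs[-1]))
    return hs

def global_loss(x, y, weights):
    """Squared error between the network output h_M(x; theta) and the target y."""
    h_M = forward(x, weights)[-1]
    return np.sum((h_M - y) ** 2)

rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.5, (4, 3)), rng.normal(0, 0.5, (2, 4))]   # a toy 3-4-2 network
print(global_loss(rng.normal(size=3), rng.normal(size=2), weights))
```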
Training a network with back-propagation corresponds to propagating error
signals through the network to calculate the derivatives of the global loss with
respect to the parameters of each layer. Thus, the error signals indicate how
the parameters of the network should be updated to decrease the expected loss.
However, in very deep networks with strong non-linearities, error propagation
could become useless in lower layers due to exploding or vanishing gradients, as
explained above.
Then, Wi can be updated locally within its layer via stochastic gradient descent,
where ĥi is considered as a constant with respect to Wi . That is
W_i^{(t+1)} = W_i^{(t)} − η_{f_i} ∂L_i(ĥ_i, h_i)/∂W_i = W_i^{(t)} − η_{f_i} (∂L_i(ĥ_i, h_i)/∂h_i) (∂h_i(x; θ_W^{0,i})/∂W_i),    (5)

where η_{f_i} is a layer-specific learning rate.
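For a single tanh layer with the layer-local loss L_i(ĥ_i, h_i) = ||f_i(h_{i−1}) − ĥ_i||^2, the update of Equation (5) can be sketched as follows (our illustrative code; the specific choice of L_i and of the activation is an assumption):

```python
import numpy as np

def local_update(W_i, h_prev, h_hat_i, lr=0.01):
    """One SGD step on L_i(h_hat_i, h_i) = ||f_i(h_prev) - h_hat_i||^2 with f_i = tanh(W_i h_prev).
    h_hat_i is treated as a constant, so the gradient stays inside layer i."""
    h_i = np.tanh(W_i @ h_prev)
    dL_dpre = 2.0 * (h_i - h_hat_i) * (1.0 - h_i ** 2)   # chain rule within the layer only
    grad_W = np.outer(dL_dpre, h_prev)
    return W_i - lr * grad_W

rng = np.random.default_rng(0)
W_i = local_update(rng.normal(0, 0.5, (4, 3)), rng.normal(size=3), rng.normal(size=4))
```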
Note that, in this context, derivatives can be used without difficulty, because
they correspond to computations performed inside a single layer; the problems
with severe non-linearities observed for back-propagation only arise when the
chain rule is applied through many layers. This motivates target prop-
agation methods to serve as alternative credit assignment in the context of a
composition of many non-linearities.
However, it is not directly clear how to compute a target that guarantees a
decrease of the global loss (that is how to compute a ĥi for which equation (3)
holds) or that at least leads to a decrease of the local loss Li of the next layer,
that is
Li+1 (ĥi+1 , fi+1 (ĥi )) < Li+1 (ĥi+1 , fi+1 (hi )) . (6)
Proposing and validating answers to this question is the subject of the rest of
this paper.
Clearly, in a supervised learning setting, the top layer target should be directly
driven from the gradient of the global loss
ĥ_M = h_M − η̂ ∂L(h_M, y)/∂h_M,    (7)

where η̂ is usually a small step size. Note that if we use the MSE as the global loss
and η̂ = 0.5, we get ĥ_M = y.
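To spell this out (assuming the convention L(h_M, y) = ||h_M − y||^2, i.e. without a 1/2 factor): ∂L/∂h_M = 2(h_M − y), so Equation (7) with η̂ = 0.5 gives ĥ_M = h_M − 0.5 · 2(h_M − y) = y.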
But how can we define targets for the intermediate layers? In the previous
technical report [4], it was suggested to take advantage of an “approximate
inverse”. To formalize this idea, suppose that for each fi we have a function gi
such that
fi (gi (hi )) ≈ hi or gi (fi (hi−1 )) ≈ hi−1 . (8)
Then, choosing
ĥi−1 = gi (ĥi ) (9)
would have the consequence that (under some smoothness assumptions on f and
g) minimizing the distance between hi−1 and ĥi−1 should also minimize the loss
Li of the i-th layer. This idea is illustrated in the left of Figure 1. Indeed, if
the feed-back mappings were the perfect inverses of the feed-forward mappings
(gi = fi^{-1}), one gets fi(ĥi−1) = fi(gi(ĥi)) = ĥi, so that the loss Li(ĥi, fi(ĥi−1)) vanishes.
But choosing g to be the perfect inverse of f may require heavy computation and
can be unstable, since there is no guarantee that fi^{-1} applied to a target would yield
a value that is in the domain of fi−1. An alternative approach is to learn an
approximate inverse gi , making the fi / gi pair look like an auto-encoder. This
suggests parametrizing gi as follows:
where s̄i is a non-linearity associated with the decoder and Vi the matrix of
feed-back weights of the i-th layer. With such a parametrization, it is unlikely
that the auto-encoder will achieve zero reconstruction error. The decoder could
be trained via an additional auto-encoder-like loss at each layer
L_i^{inv} = ||g_i(f_i(h_{i−1})) − h_{i−1}||_2^2.    (12)
Changing Vi based on this loss makes gi closer to fi^{-1}. By doing so, it also makes
fi(ĥi−1) = fi(gi(ĥi)) closer to ĥi, and thus also contributes to the decrease
of Li(ĥi, fi(ĥi−1)). But we do not want to estimate an inverse mapping only for
the concrete values we see during training but for a region around these values,
to facilitate the computation of gi(ĥi) for targets ĥi which have never been seen before.
For this reason, the loss is modified by noise injection

L_i^{inv} = ||g_i(f_i(h_{i−1} + ε)) − (h_{i−1} + ε)||_2^2,   ε ∼ N(0, σ),    (13)
which makes fi and gi approximate inverses not just at hi−1 but also in its
neighborhood.
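A minimal sketch of this inverse-training step (our code, not the authors'; sigmoid mappings without biases, a fixed noise level and plain SGD are assumptions):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def inverse_update(V_i, W_i, h_prev, sigma=0.1, lr=0.01, rng=np.random.default_rng(0)):
    """One SGD step on L_i^inv = ||g_i(f_i(h_prev + eps)) - (h_prev + eps)||^2, eps ~ N(0, sigma),
    updating only the feedback weights V_i (f_i and g_i are sigmoid layers here)."""
    h_noisy = h_prev + rng.normal(0.0, sigma, size=h_prev.shape)
    f = sigmoid(W_i @ h_noisy)               # forward mapping f_i
    g = sigmoid(V_i @ f)                     # learned approximate inverse g_i
    dL_dpre = 2.0 * (g - h_noisy) * g * (1.0 - g)
    return V_i - lr * np.outer(dL_dpre, f)
```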
Fig. 1. (left) How to compute a target in the lower layer via difference target prop-
agation. fi (ĥi−1 ) should be closer to ĥi than fi (hi−1 ). (right) Diagram of the back-
propagation-free auto-encoder via difference target propagation.
greatly improves the stability of the optimization. This holds for vanilla target
propagation if gi = fi^{-1}, because
Although the condition is not guaranteed to hold for vanilla target propagation
if gi ≠ fi^{-1}, for difference target propagation it holds by construction, since
Theorem 2.³ Let the target for layer i − 1 be given by Equation (15), i.e.
ĥ_{i−1} = h_{i−1} + g_i(ĥ_i) − g_i(h_i). If ĥ_i − h_i is sufficiently small, f_i and g_i are
differentiable, and the corresponding Jacobian matrices J_{f_i} and J_{g_i} satisfy that
the largest eigenvalue of (I − J_{f_i} J_{g_i})^T (I − J_{f_i} J_{g_i}) is less than 1, then we have

The third condition in the above theorem is easily satisfied in practice, because g_i
is learned to be the inverse of f_i and makes g_i ◦ f_i close to the identity mapping,
so that (I − J_{f_i} J_{g_i}) becomes close to the zero matrix, which means that the
largest eigenvalue of (I − J_{f_i} J_{g_i})^T (I − J_{f_i} J_{g_i}) is also close to 0.
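Putting these pieces together, the targets for all layers can be computed in a short top-down pass; the sketch below is ours (sigmoid feedback mappings, a squared-error global loss and a fixed step size η̂ are assumptions), not the authors' implementation.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def compute_targets(hs, y, Vs, eta_hat=0.5):
    """Given forward states hs = [h_0, ..., h_M], return targets [None, h_hat_1, ..., h_hat_M]
    using difference target propagation: h_hat_{i-1} = h_{i-1} + g_i(h_hat_i) - g_i(h_i)."""
    M = len(hs) - 1
    targets = [None] * (M + 1)
    targets[M] = hs[M] - eta_hat * 2.0 * (hs[M] - y)      # Eq. (7) with a squared-error global loss
    for i in range(M, 1, -1):                             # propagate targets down to layer 1
        g = lambda h: sigmoid(Vs[i - 1] @ h)              # feedback mapping g_i of layer i
        targets[i - 1] = hs[i - 1] + g(targets[i]) - g(hs[i])
    return targets
```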
where sig is the element-wise sigmoid function, W the weight matrix and b the
bias vector of the input units. The reconstruction is given by the decoder
³ The proof can be found in the appendix.
with c being the bias vector of the hidden units. And the reconstruction loss is
where the last equality follows from f(ẑ) = f(x) = h. As a target loss for the
hidden layer, we can use L_f = ||f(x + ε) − ĥ||_2^2, where ĥ is considered a
constant; the loss can also be augmented by a regularization term to yield a
contractive mapping.
3 Experiments
Fig. 2. Mean training cost (left) and train/test classification error (right) with target
propagation and back-propagation using continuous deep networks (tanh) on MNIST.
Error bars indicate the standard deviation over 10 independent runs with the same
optimized hyper-parameters and different initial weights.
The results are shown in Figure 2. We obtained a test error of 1.94% with
target propagation and 1.86% with back propagation. The final negative log-
likelihood on the training set was 4.584 × 10−5 with target propagation and
1.797 × 10−5 with back propagation. We also trained the same network with
rectifier linear units and got a test error of 3.15% whereas 1.62% was obtained
with back-propagation. It is well known that this nonlinearity is advantageous
for back-propagation [11], while it seemed to be less appropriate for this imple-
mentation of target propagation.
L_2^{inv} = ||g_2(f_2(h_1 + ε)) − (h_1 + ε)||_2^2,   ε ∼ N(0, σ).    (27)
If the feed-forward mapping is discrete, back-propagated gradients become 0 and
useless when they cross the discretization step. So we compare target propagation
to two baselines. As a first baseline, we train the network with back-propagation
and the straight-through estimator [5], which is biased but was found to work
well, and simply ignores the derivative of the step function (which is 0 or infinite)
in the back-propagation phase. As a second baseline, we train only the upper
layers by back-propagation, while not changing the weights W1 which are affected
by the discretization, i.e., the lower layers do not learn.
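For reference, the first baseline can be sketched as follows (illustrative code): the forward pass uses a hard threshold, while the backward pass simply passes the incoming gradient through as if the step function were the identity.

```python
import numpy as np

def step_forward(pre):
    """Discrete feed-forward non-linearity: a hard threshold at zero."""
    return (pre > 0.0).astype(float)

def step_backward_straight_through(grad_out):
    """Straight-through estimator: ignore the step's true derivative (0 almost everywhere)
    and pass the gradient through unchanged; biased, but usable in practice."""
    return grad_out

h = step_forward(np.array([-0.3, 0.2, 1.5]))                        # forward: [0., 1., 1.]
grad_pre = step_backward_straight_through(np.array([0.1, -0.4, 0.7]))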
The results on the training and test sets are shown in Figure 3. The training
error for the first baseline (straight-through estimator) does not converge to zero
(which can be explained by the biased gradient) but generalization performance
is fairly good. The second baseline (fixed lower layer) surprisingly reached zero
training error, but did not perform well on the test set. This can be explained
by the fact that it cannot learn any meaningful representation at the first layer.
Target propagation however did not suffer from this drawback and can be used
to train discrete networks directly (training signals can pass the discrete region
successfully). Though the training convergence was slower, the training error
did approach zero. In addition, difference target propagation also achieved good
results on the test set.
Fig. 3. Mean training cost (top left), mean training error (top right) and mean test
error (bottom left) while training discrete networks with difference target propaga-
tion and the two baseline versions of back-propagation. Error bars indicate standard
deviations over 10 independent runs with the same hyper-parameters and different
initial weights. Diagram of the discrete network (bottom right). The output of h1 is
discretized because signals must be communicated from h1 to h2 through a long cable,
so binary representations are preferred in order to conserve energy. With target prop-
agation, training signals are also discretized through this cable (since feedback paths
are computed by bona-fide neurons).
Another interesting class of models that vanilla back-propagation cannot deal with
is stochastic networks with discrete units. Recently, stochastic networks have
attracted attention [3,5,21] because they are able to learn a multi-modal con-
ditional distribution P (Y |X), which is important for structured output predic-
tions. Training networks of stochastic binary units is also biologically motivated,
since they resemble networks of spiking neurons. Here, we investigate whether
one can train networks of stochastic binary units on MNIST for classification
using target propagation. Following [18], the network architecture was 784-200-
200-10 and the hidden units were stochastic binary units with the probability of
turning on given by a sigmoid activation
δh^p_{i−1} = δh^p_i · ∂h^p_i/∂h^p_{i−1} ≈ sig′(W_i h_{i−1}) W_i^T δh^p_i.    (29)
L_i^{inv} = ||g_i(f_i(h_{i−1} + ε)) − (h_{i−1} + ε)||_2^2,   ε ∼ N(0, σ),    (31)
Table 1. Mean test error on MNIST for stochastic networks. The first row shows the
results of our experiments averaged over 10 trials. The second row shows the results
reported in [18]. M corresponds to the number of samples used for computing output
probabilities. We used M = 1 during training and M = 100 for the test set.
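The evaluation protocol of Table 1 can be sketched as follows (our illustrative code; the sigmoid hidden probabilities follow the description above, while the softmax output layer and the weight layout are assumptions):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def stochastic_forward(x, Ws, rng):
    """One forward pass where each hidden unit fires with probability sigmoid(W h)."""
    h = x
    for W in Ws[:-1]:
        p = sigmoid(W @ h)
        h = (rng.random(p.shape) < p).astype(float)   # stochastic binary units
    logits = Ws[-1] @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()                                # softmax class probabilities

def predict(x, Ws, M=100, rng=np.random.default_rng(0)):
    """Average the output distribution over M stochastic passes (M=1 in training, M=100 at test)."""
    return np.mean([stochastic_forward(x, Ws, rng) for _ in range(M)], axis=0)
```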
3.4 Auto-Encoder
We trained a denoising auto-encoder with 1000 hidden units with difference tar-
get propagation as described in Section 2.4 on MNIST. As shown in Figure 4
stroke-like filters can be obtained by target propagation. After supervised fine-
tuning (using back-propagation), we got a test error of 1.35%. Thus, by training
an auto-encoder with target propagation one can learn a good initial representa-
tion, which is as good as the one obtained by regularized auto-encoders trained
by back-propagation on the reconstruction error.
4 Conclusion
We introduced a novel optimization method for neural networks, called target
propagation, which was designed to overcome drawbacks of back-propagation
and is biologically more plausible. Target propagation replaces training signals
based on partial derivatives by targets which are propagated based on an auto-
encoding feedback loop. Difference target propagation is a linear correction for
this imperfect inverse mapping, which is effective in making target propagation
work in practice. Our experiments show that target propagation performs comparably
to back-propagation on ordinary deep networks and denoising auto-encoders.
Moreover, target propagation can be used directly on networks with discretized
transmission between units and reaches state-of-the-art performance for
stochastic neural networks on MNIST.
References
1. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I.J., Bergeron, A.,
Bouchard, N., Bengio, Y.: Theano: new features and speed improvements. In: Deep
Learning and Unsupervised Feature Learning NIPS 2012 Workshop (2012)
2. Bengio, Y.: Learning deep architectures for AI. Now Publishers (2009)
3. Bengio, Y.: Estimating or propagating gradients through stochastic neurons. Tech.
Rep. Universite de Montreal (2013). arXiv:1305.2982
4. Bengio, Y.: How auto-encoders could provide credit assignment in deep networks
via target propagation. Tech. rep. (2014). arXiv:1407.7906
5. Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients
through stochastic neurons for conditional computation (2013). arXiv:1308.3432
6. Bengio, Y., Thibodeau-Laufer, E., Yosinski, J.: Deep generative stochastic net-
works trainable by backprop. In: ICML 2014 (2014)
7. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization.
J. Machine Learning Res. 13, 281–305 (2012)
8. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G.,
Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expres-
sion compiler. In: Proceedings of the Python for Scientific Computing Conference
(SciPy), oral Presentation, June 2010
9. Carreira-Perpinan, M., Wang, W.: Distributed optimization of deeply nested sys-
tems. In: AISTATS 2014, JMLR W&CP, vol. 33, pp. 10–19 (2014)
10. Erhan, D., Courville, A., Bengio, Y., Vincent, P.: Why does unsupervised pre-
training help deep learning? In: JMLR W&CP: Proc. AISTATS 2010, vol. 9,
pp. 201–208 (2010)
11. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: JMLR
W&CP: Proceedings of the Fourteenth International Conference on Artificial Intel-
ligence and Statistics (AISTATS 2011), April 2011
12. Hinton, G., Deng, L., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke,
V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine 29(6), 82–97
(2012)
13. Konda, K., Memisevic, R., Krueger, D.: Zero-bias autoencoders and the benefits
of co-adapting features. Under review on International Conference on Learning
Representations (2015)
14. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convo-
lutional neural networks. In: NIPS 2012 (2012)
15. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images.
Master’s thesis, University of Toronto (2009)
16. LeCun, Y.: Learning processes in an asymmetric threshold network. In: Fogelman-
Soulié, F., Bienenstock, E., Weisbuch, G. (eds.) Disordered Systems and Biological
Organization, pp. 233–240. Springer-Verlag, Les Houches (1986)
17. LeCun, Y.: Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de
Paris VI (1987)
18. Raiko, T., Berglund, M., Alain, G., Dinh, L.: Techniques for learning binary
stochastic feedforward neural networks. In: NIPS Deep Learning Workshop 2014
(2014)
19. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. Tech. rep. (2014). arXiv:1409.3215
20. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. Tech. rep. (2014).
arXiv:1409.4842
21. Tang, Y., Salakhutdinov, R.: A new learning algorithm for stochastic feedforward
neural nets. In: ICML 2013 Workshop on Challenges in Representation Learning
(2013)
22. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learn-
ing 4 (2012)
23. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denois-
ing autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res. 11 (2010)
Appendix
A Proof of Theorem 1
Proof. Given a training example (x, y) the back-propagation update is given by
δW_i^{bp} = −∂L(x, y; θ_W^{0,M})/∂W_i = −J_{f_{i+1}}^T · · · J_{f_M}^T (∂L/∂h_M) (s_i(h_{i−1}))^T,

where J_{f_k} = ∂h_k/∂h_{k−1} = W_k · S_k(h_{k−1}), k = i + 1, . . . , M. Here S_k(h_{k−1}) is a diagonal
matrix with each diagonal element being an element-wise derivative, and J_{f_k} is
and similarly

||vec(δW_i^{tp})||_2 ≤ η̂ ||v||_2 ||J^{−1}||_2 ||l||_2 + ||o(η̂)||_2 ||v||_2 ,

where ||J^T||_2 and ||J^{−1}||_2 are matrix Euclidean norms, i.e. the largest singular
value of (J_{f_M} · · · J_{f_{i+1}})^T, which is λ_max, and the largest singular value of
(J_{f_M} · · · J_{f_{i+1}})^{−1}, which is 1/λ_min (λ_min is the smallest singular value of
(J_{f_M} · · · J_{f_{i+1}})^T; because each f_k is invertible, all the smallest singular values
of the Jacobians are larger than 0). Finally, if η̂ is sufficiently small, the angle α
between vec(δW_i^{bp}) and vec(δW_i^{tp}) satisfies:
cos(α) = ⟨vec(δW_i^{bp}), vec(δW_i^{tp})⟩ / (||vec(δW_i^{bp})||_2 · ||vec(δW_i^{tp})||_2)
≥ (η̂ ||v||_2^2 ||l||_2^2 − ⟨J^T l, o(η̂)⟩ ||v||_2^2) / ((||v||_2 λ_max ||l||_2)(η̂ ||v||_2 (1/λ_min) ||l||_2 + ||o(η̂)||_2 ||v||_2))
= (1 + ⟨−J^T l, o(η̂)⟩/(η̂ ||l||_2^2)) / (λ_max/λ_min + λ_max ||o(η̂)||_2/(η̂ ||l||_2))
= (1 + Δ_1(η̂)) / (λ_max/λ_min + Δ_2(η̂))
B Proof of Theorem 2
Proof. Let e = ĥi − hi . Applying Taylor’s theorem twice, we get
ĥ_i − f_i(ĥ_{i−1}) = ĥ_i − f_i(h_{i−1} + g_i(ĥ_i) − g_i(h_i)) = ĥ_i − f_i(h_{i−1} + J_{g_i} e + o(||e||_2))
= ĥ_i − f_i(h_{i−1}) − J_{f_i}(J_{g_i} e + o(||e||_2)) − o(||J_{g_i} e + o(||e||_2)||_2)
= ĥ_i − h_i − J_{f_i} J_{g_i} e − o(||e||_2) = (I − J_{f_i} J_{g_i}) e − o(||e||_2)

||ĥ_i − f_i(ĥ_{i−1})||_2^2 = ((I − J_{f_i}J_{g_i}) e − o(||e||_2))^T ((I − J_{f_i}J_{g_i}) e − o(||e||_2))
= e^T (I − J_{f_i}J_{g_i})^T (I − J_{f_i}J_{g_i}) e − o(||e||_2)^T (I − J_{f_i}J_{g_i}) e
  − e^T (I − J_{f_i}J_{g_i})^T o(||e||_2) + o(||e||_2)^T o(||e||_2)
= e^T (I − J_{f_i}J_{g_i})^T (I − J_{f_i}J_{g_i}) e + o(||e||_2^2)
≤ λ ||e||_2^2 + |o(||e||_2^2)|    (A-1)
where o(||e||_2^2) is the scalar value resulting from all terms depending on o(||e||_2),
and λ is the largest eigenvalue of (I − J_{f_i}J_{g_i})^T (I − J_{f_i}J_{g_i}). If e is sufficiently
small to guarantee |o(||e||_2^2)| < (1 − λ)||e||_2^2, then the left-hand side of Equation (A-1) is
less than ||e||_2^2, which is just ||ĥ_i − h_i||_2^2.
Online Learning of Deep Hybrid Architectures
for Semi-supervised Categorization
Alexander G. Ororbia II (B) , David Reitter, Jian Wu, and C. Lee Giles
1 Introduction
When it comes to collecting information from unstructured data sources, the
challenge is clear for any information harvesting agent: to recognize what is
relevant and to categorize what has been found. For applications such as web
crawling, models such as the competitive Support Vector Machine are often
trained on labeled datasets [6]. However, as the target distribution (such as
that of information content from the web) evolves, these models quickly become
outdated and require re-training on new datasets. Simply put, while unlabeled
data is plentiful, labeled data is not [28]. While incremental approaches such as
co-training [15] have been employed to face this challenge, they require careful,
time-consuming feature-engineering (to construct multiple views of the data).
To minimize the human effort in gathering data and facilitate scalable learn-
ing, a model capable of generalization with only a few labeled examples and vast
quantities of easily-acquired unlabeled data is highly desirable. Furthermore, to
avoid feature engineering, this model should exploit the representational power
afforded by deeper architectures, which have seen success in domains such as com-
puter vision and speech recognition. Such a multi-level model could learn feature
abstractions, arguably capturing higher-order feature relationships in an efficient
2 Related Work
two model candidates, building on principles and successes of previous work: the
Stacked Boltzmann Expert Network (SBEN) and the Hybrid Stacked Denoising
Autoencoders model (HSDA). Furthermore, we introduce the idea of layer-wise
ensembling, a simple prediction scheme we shall describe in Section 3.3 to utilize
layer-wise information learnt by these models.
p(y, x, h) = e^{−E(y,x,h)} / Z,  with  p(y, x) = (1/Z) Σ_h e^{−E(y,x,h)}    (1)

where Z = Σ_{(y,x,h)} e^{−E(y,x,h)} is the partition function meant to ensure that the
value assignment is a valid probability distribution. Noting that e_y = (1_{i=y})_{i=1}^C
is the one-hot vector encoding of y, the model's energy function may be defined as
It is often not possible to compute p(y, x, h) or the marginal p(y, x) due to the
intractable partition function. However, we may leverage block Gibbs sampling to
draw samples of the HRBM’s latent variable layer given the current state of the
visible layer (composed of x and ey ) and vice versa, owing to the graphical model’s
bipartite structure (i.e., no intra-layer connections). This yields implementable
equations for conditioning on various layers of the model as follows:
p(h|y, x) = Π_j p(h_j|y, x),  with  p(h_j = 1|y, x) = σ(c_j + U_{jy} + Σ_i W_{ji} x_i)    (3)

p(x|h) = Π_i p(x_i|h),  with  p(x_i = 1|h) = σ(b_i + Σ_j W_{ji} h_j)    (4)

p(y|h) = exp(d_y + Σ_j U_{jy} h_j) / Σ_{y*} exp(d_{y*} + Σ_j U_{jy*} h_j)    (5)
where σ(v) = 1/(1 + e^{−v}). Furthermore, to perform classification directly using the
HRBM, one uses the model's free energy function F(y, x) to compute the conditional

p(y|x) = e^{−F(y,x)} / Σ_{y′ ∈ {1,··· ,C}} e^{−F(y′,x)}    (6)

where −F(y, x) = d_y + Σ_j log(1 + exp(c_j + U_{jy} + Σ_i W_{ji} x_i)).
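Classification with Equation (6) then amounts to comparing the C negative free energies; a minimal NumPy sketch follows (our code, with the assumed shapes W: hidden×visible, U: hidden×classes, c: hidden biases, d: class biases):

```python
import numpy as np

def neg_free_energy(x, y, W, U, c, d):
    """-F(y, x) = d_y + sum_j log(1 + exp(c_j + U_{jy} + (W x)_j))."""
    return d[y] + np.sum(np.logaddexp(0.0, c + U[:, y] + W @ x))

def hrbm_predict_proba(x, W, U, c, d):
    """p(y|x) from Eq. (6): a softmax over the negative free energies of the C classes."""
    scores = np.array([neg_free_energy(x, y, W, U, c, d) for y in range(d.shape[0])])
    scores -= scores.max()                    # for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```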
The hybrid model is trained leveraging a supervised, compound objective loss
function that balances a discriminative objective Ldisc and generative objective
Lgen , defined as follows:
L_disc(D_train) = − Σ_{t=1}^{|D_train|} log p(y_t|x_t)    (7)        L_gen(D_train) = − Σ_{t=1}^{|D_train|} log p(y_t, x_t)    (8)
where Dtrain = {(yt , xt )}, the labeled training dataset. The gradient for Ldisc
may be computed directly, following the general form
∂ log p(y_t|x_t)/∂θ = −E_{h|y_t,x_t}[∂E(y_t, x_t, h)/∂θ] + E_{y,h|x_t}[∂E(y, x, h)/∂θ]    (9)
implemented via direct formulation (see [20] for details) or a form of Dropping, such
as Drop-Out or Drop-Connect [37]. The generative gradient follows the form
∂ log p(y_t, x_t)/∂θ = −E_{h|y_t,x_t}[∂E(y_t, x_t, h)/∂θ] + E_{y,x,h}[∂E(y, x, h)/∂θ]    (10)
and, although intractable for any (yt , xt ), is approximated via contrastive diver-
gence [17], where the intractable second expectation is replaced by a point esti-
mate using one Gibbs sampling step (after initializing the Markov Chain at the
training sample).
In the semi-supervised context, where Dtrain is small but a large, unlabeled
dataset Dunlab is available, the HRBM can be further extended to train with
an unsupervised objective Lunsup , where negative log-likelihood is optimized
according to
L_unsup(D_unlab) = − Σ_{t=1}^{|D_unlab|} log p(x_t).    (11)
The gradient for Lunsup can be simply computed using the same contrastive diver-
gence update for Lgen but incorporating an extra step at the beginning by sam-
pling from the model’s current estimate of p(y|u) for an unlabeled sample u. This
form of the generative update could be viewed as a form of self-training or Entropy
Regularization [23]. The pseudo-code for the online procedure for computing the
generative gradient (either labeled or unlabeled example) for a single HRBM is
shown in Algorithm 1.
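In the same spirit (Algorithm 1 itself is not reproduced here), a rough CD-1 sketch for a single unlabeled example might look as follows; sampling the pseudo-label, using a mean-field class vector in the negative phase, and the learning rate are our assumptions rather than a transcription of the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_unlabeled_update(x, W, U, b, c, d, lr=0.01):
    """One approximate CD-1 step on L_unsup for an unlabeled x: sample a pseudo-label y from
    p(y|x), then apply a generative CD-1 update as if (y, x) were a labeled pair."""
    C = d.shape[0]
    # pseudo-label from the model's current p(y|x) (Eq. 6)
    scores = np.array([d[y] + np.sum(np.logaddexp(0.0, c + U[:, y] + W @ x)) for y in range(C)])
    p_y = np.exp(scores - scores.max()); p_y /= p_y.sum()
    y = rng.choice(C, p=p_y)
    e_y = np.eye(C)[y]
    # positive phase
    h_pos = sigmoid(c + U[:, y] + W @ x)
    h_smp = (rng.random(h_pos.shape) < h_pos).astype(float)
    # negative phase: one block-Gibbs step back to the visibles, then to the hiddens
    x_neg = sigmoid(b + W.T @ h_smp)                     # mean-field reconstruction of x
    y_scores = d + U.T @ h_smp
    p_y_neg = np.exp(y_scores - y_scores.max()); p_y_neg /= p_y_neg.sum()
    h_neg = sigmoid(c + U @ p_y_neg + W @ x_neg)
    # gradient ascent on the approximate log-likelihood
    W += lr * (np.outer(h_pos, x) - np.outer(h_neg, x_neg))
    U += lr * (np.outer(h_pos, e_y) - np.outer(h_neg, p_y_neg))
    b += lr * (x - x_neg)
    c += lr * (h_pos - h_neg)
    d += lr * (e_y - p_y_neg)
```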
To train a fully semi-supervised HRBM, one composes the appropriate multi-
objective function using a simple weighted summation:
Lsemi(Dtrain, Dunlab) = γ Ldisc(Dtrain) + α Lgen(Dtrain) + β Lunsup(Dunlab)    (12)
where α and β are coefficient handles designed to explicitly control the effects
that the generative gradients have on the HRBM’s learning procedure. We intro-
duced the additional coefficient γ as a means to also directly control the effect of
Fig. 1. Architecture of the SBEN model. The flow of data through the system is indi-
cated by the numbered arrows. Given a sample x, the dash-dotted arrow indicates
obtaining an estimated label by using the current layer's conditional via Equation 6
(i.e., Step #1, dash-dotted red arrow). The latent representation is computed using
this proxy label and the data vector via Equation 3 (i.e., Step #2, dashed green arrow).
This procedure is repeated recursively, replacing x with h_n.
the frozen latent representations of the one below, generated by using lower level
expert’s inputs and predictions.
The generative objectives (for both unlabeled and labeled samples) of our
model can be viewed as a form of data-dependent regularization acting on the
discriminative learning gradient of each layer. One key advantage of SBEN train-
ing is that each layer’s discriminative progress may be tracked directly, since each
layer-wise expert is capable of direct classification using Equation 6 to compute
the conditional p(y|hbelow ). Note that setting the number of hidden layers equal
to 1 recovers the original HRBM architecture (a 1 -SEBN). One may notice some
similarity with the partially supervised, layer-wise procedure of [3] where a sim-
ple softmax classifier was loosely coupled with each RBM of a DBN. However,
this only served as a temporary mechanism for pre-training whereas the SBEN
leverages the more unified framework of the HRBM during and after training.
Note that inputs to the SBEN, like the DBN, can be trivially extended [29,40].
h = f_θ(x̃) = σ(W x̃ + b)    (13)        z = g_{θ′}(h) = σ(W′ h + b′)    (14)
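A compact sketch of this auto-encoding layer (our code; the Gaussian corruption process and the squared reconstruction error are assumptions consistent with the description):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dae_reconstruction_loss(x, W, b, W_dec, b_dec, noise=0.1, rng=np.random.default_rng(0)):
    """Corrupt x, encode it as h = sigmoid(W x_tilde + b) (Eq. 13), decode it as
    z = sigmoid(W_dec h + b_dec) (Eq. 14), and return the squared reconstruction error."""
    x_tilde = x + rng.normal(0.0, noise, size=x.shape)   # one possible corruption process
    h = sigmoid(W @ x_tilde + b)
    z = sigmoid(W_dec @ h + b_dec)
    return np.sum((z - x) ** 2)
```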
ultimately improve final predictive performance. This scheme exploits our model’s
inherent layer-wise discriminative ability, which stands as an alternative to cou-
pling helper classifiers as in [3] or the "companion objectives" of [22] used to address
potential exploding gradients in deep convolutional networks for object detection.
4 Experimental Results
⁵ Model implementations were computationally verified for correctness when applicable. Since
discriminative objectives entailed using an automatic differentiation framework, we checked
gradient validity via finite difference approximation.
⁶ For the SVM, λ was varied in the interval [0.0001, 0.5], while the learning rate for all other
models was varied in [0.0001, 0.1]. For the HRBM, SBEN, and HSDA, β was explored in the
interval [0.05, 0.1] and α in [0.075, 1.025]. The threshold p̄ was varied in [0.0, 1.0], and the
number of latent layers N for deeper architectures was explored in [2, 5], where we denote
the optimal number with the prefix "N-".
precision, recall, and F-Measure, where F-Measure was chosen to be the harmonic
mean of precision and recall, F 1 = 2(precision · recall)/(precision + recall).
Since the creation of training, validation, and unlabeled subsets was con-
trolled through a seeded random sampling without replacement process, the
procedure described above constitutes a single trial. For the Stanford OCR and
CAPTCHA datasets, the results we report are 10-trial averages with a single
standard deviation from the mean, where each trial used a unique seed value.
On all of the datasets we experimented with, ranging from vision-based tasks to
text classification, in the case when all samples are available a priori, we observe
that hybrid incremental architectures have, in general, lower error than
non-hybrid ones. In the CAPTCHA experiment (Table 1), we
observed that both the SBEN and HSDA models reduced prediction error over
the SVM by nearly 30% and 22%, respectively. Furthermore, both models
consistently improved over the error of the HRBM, with the SBEN model reducing
error by ∼ 12%. In the OCR dataset (Table 2), we see the SBEN improving over
Fig. 2. Online error (y-axis) of 3-SBEN, 3-HSDA, & HRBM (or 1-SBEN) evaluated
every 100 time steps (x-axis). Each curve reported is a 4-trial mean of the lowest
validation error model.
the HRBM by more than 16% and the SVM by more than 22%. In this case,
the HSDA only marginally improves over the SVM model (∼6%) and performs
on par with the HRBM; we attribute this weaker performance to a coarse search
through a meta-parameter space window as opposed to an exhaustive grid search.
For the WEBKB problem, we see a ∼57% improvement in error for the HSDA
and ∼58% for the SBEN over the MaxEnt model, which itself slightly outperformed
the SVM (Table 3). Note that the rectifier network is competitive; however, in
both image-based experiments, the SBEN model outperforms it by more than
11% on CAPTCHA and nearly 14% on OCR.
In the online learning setting, samples from Dunlab may not all be available at once
but instead arrive at a given rate in a stream, one at each time instant (we
chose to experiment with one example presented per iteration and constant
access to a labeled set of size |Dtrain| = 500). In order to train a deep architecture in
this setting, while still exploiting the efficiency of a greedy, layer-wise approach,
one may remove the “freezing” step of Algorithm 2 and train all layers dis-
jointly in an incremental fashion as opposed to a purely bottom-up approach.
Using the same sub-routines as depicted in Algorithm 2, this procedure may
be implemented as shown in Algorithm 3, effectively using a single bottom-up
pass to modify model parameters. This approach adapts the training of hybrid
architectures, such as the SBEN and HSDA, to the online learning setting.
As evidenced by Fig. 2, it is possible to train the layer-wise experts of a multi-
level hybrid architecture simultaneously and still obtain a gain in generalization
performance over a non-linear, shallow model such as the HRBM. The HRBM
settles at an online error of 0.356 whereas the 5-HSDA reaches an error of 0.327
and the 5-SBEN an error of 0.319 in a 10,000 iteration sweep. Online error was
evaluated by computing classification error on the next 1,000 unseen samples
generated by the CAPTCHA process.
While the simultaneous greedy training used in this experiment allows a deep
hybrid model to be constructed as a whole when faced with a data stream, we
note that instability may occur in the form of “shifting representations”. This
is where an upper level model is dynamically trained on a latent representation
of a lower-level model that has not yet settled since it has not yet seen enough
samples from the data distribution.
5 Conclusions
We developed two hybrid models, the SBEN and the HSDA, and their training
algorithms in the context of incremental, semi-supervised learning. They com-
bine efficient greedy, layer-wise construction of deeper architectures with a multi-
objective learning approach. We balance the goal of learning a generative model
of the data with extracting discriminative regularity to perform useful classifi-
cation. More importantly, the framework we describe facilitates more explicit
control over the multiple objectives involved. Additionally, we presented a ver-
tical aggregation scheme, layer-wise ensembling, for generating predictions that
exploit discriminative knowledge acquired at all levels of abstraction defined by
the architecture’s hierarchical form. Our framework allows for explicit control
over generative and discriminative objectives as well as a natural scheme for
tracking layer-wise learning.
Models were evaluated in two problem settings: optical character recognition
and text categorization. We compared results against shallow models and found
that our hybrid architectures outperform the others in all datasets investigated. We
found that the SBEN performed the best, improving classification error by as much
as 58% (compared to Maximum Entropy on WEBKB). Furthermore, we found that
improvement in performance holds when hybrid learning is adapted to an online
setting (relaxing the purely bottom-up framework in Section 3.1). We observe that
we are able to improve error while significantly minimizing the number of required
labeled samples (as low as 2% of total available data in some cases).
The hybrid deep architectures presented in this paper are not without potential
limitations. First, there is the danger of “shifting representations” if using Algo-
rithm 3 for online learning. To combat this, samples could be pooled into mini-
batch matrices before computing gradients, reducing some of the noise of online
error-surface descent. Alternatively, all layer-wise experts could be extended tem-
porally to Conditional RBM-like structures, potentially improving performance as
in [43]. Second, additional free parameters were introduced that require tuning, cre-
ating a more challenging model selection process for the human user. This may be
alleviated with a parallelized, automated approach; however, a model that adapts
its objective weights during the learning process would be better, altering its hyper-
parameters in response to error progress on data subsets. Our frameworks may
be augmented with automatic latent unit growth for both auto-encoder [42] and
Boltzmann-like variants [10] or perhaps improved by “tying” all layer-wise expert
outputs together in a scheme like that in [11].
The models presented in this paper offer promise in the goal of incremen-
tally building powerful models that reduce expensive labeling and feature engi-
neering effort. They represent a step towards ever-improving models that adapt
to “in-the-wild” samples, capable of more fully embracing the “...unreasonable
effectiveness of data” [16].
References
1. Bengio, Y.: Deep learning of representations for unsupervised and transfer learning.
Journal of Machine Learning Research-Proceedings Track (2012)
2. Bengio, Y., Courville, A.C., Vincent, P.: Unsupervised feature learning and deep
learning: Review and new perspectives (2012). CoRR abs/1206.5538
3. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. Advances in Neural Information Processing Systems (2007)
4. Bengio, Y., LeCun, Y.: Scaling learning algorithms towards AI. Large-Scale Kernel
Machines 34, 1–41 (2007)
5. Calandra, R., Raiko, T., Deisenroth, M.P., Pouzols, F.M.: Learning deep
belief networks from non-stationary streams. In: Villa, A.E.P., Duch, W.,
Érdi, P., Masulli, F., Palm, G. (eds.) ICANN 2012, Part II. LNCS,
vol. 7553, pp. 379–386. Springer, Heidelberg (2012)
6. Caragea, C., Wu, J., Williams, K., Das, S., Khabsa, M., Teregowda, P., Giles, C.L.:
Automatic identification of research articles from crawled documents. In: Web-
Scale Classification: Classifying Big Data from the Web, co-located with WSDM
(2014)
27. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network
acoustic models. In: Proc. ICML, vol. 30 (2013)
28. Masud, M.M., Woolam, C., Gao, J., Khan, L., Han, J., Hamlen, K.W., Oza, N.C.:
Facing the reality of data stream classification: Coping with scarcity of labeled data.
Knowledge and Information Systems 33(1), 213–244 (2012)
29. Nair, V., Hinton, G.E.: Rectified linear units improve Restricted Boltzmann
Machines. In: Proc. 27th International Conference on Machine Learning (ICML
2010), pp. 807–814 (2010)
30. Ranzato, M.A., Szummer, M.: Semi-supervised learning of compact document repre-
sentations with deep networks. In: Proc. 25th International Conference on Machine
Learning, pp. 792–799. ACM (2008)
31. Salakhutdinov, R., Hinton, G.: Semantic Hashing. International Journal of Approxi-
mate Reasoning 50(7), 969–978 (2009)
32. Sarikaya, R., Hinton, G., Deoras, A.: Application of Deep Belief Networks for natural
language understanding. IEEE/ACM Transactions on Audio, Speech, and Language
Processing 22(4), 778–784 (2014)
33. Schapire, R.E.: The strength of weak learnability. Machine Learning 5(2), 197–227
(1990)
34. Schmah, T., Hinton, G.E., Small, S.L., Strother, S., Zemel, R.S.: Generative versus
discriminative training of RBMs for classification of fMRI images. In: Advances in
Neural Information Processing Systems, pp. 1409–1416 (2008)
35. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: Primal estimated sub-
gradient solver for SVM. Mathematical Programming 127(1), 3–30 (2011)
36. Sun, X., Li, C., Xu, W., Ren, F.: Chinese microblog sentiment classification based on
deep belief nets with extended multi-modality features. In: 2014 IEEE International
Conference on Data Mining Workshop (ICDMW), pp. 928–935 (2014)
37. Tomczak, J.M.: Prediction of breast cancer recurrence using classification Restricted
Boltzmann Machine with dropping (2013). arXiv preprint arXiv:1308.6324
38. Tomczak, J.M., Ziba, M.: Classification restricted Boltzmann machine for compre-
hensible credit scoring model. Expert Systems with Applications 42(4), 1789–1796
(2015)
39. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. Journal of Machine Learning Research 11, 3371–3408 (2010)
40. Welling, M., Rosen-zvi, M., Hinton, G.E.: Exponential family harmoniums with
an application to information retrieval. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.)
Advances in Neural Information Processing Systems, vol. 17, pp. 1481–1488. MIT
Press (2005)
41. Zhang, J., Tian, G., Mu, Y., Fan, W.: Supervised deep learning with auxiliary net-
works. In: Proc. 20th ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining pp. 353–361. ACM (2014)
42. Zhou, G., Sohn, K., Lee, H.: Online incremental feature learning with denoising
autoencoders. In: Proc. 15th International Conference on Artificial Intelligence and
Statistics, pp. 1453–1461 (2012)
43. Zhou, J., Luo, H., Luo, Q., Shen, L.: Attentiveness detection using continu-
ous restricted Boltzmann machine in e-learning environment. In: Wang, F.L.,
Fong, J., Zhang, L., Lee, V.S.K. (eds.) ICHL 2009. LNCS, vol. 5685, pp. 24–34.
Springer, Heidelberg (2009)
Scoring and Classifying with Gated
Auto-Encoders
1 Introduction
Representation learning algorithms are machine learning algorithms which
involve the learning of features or explanatory factors. Deep learning techniques,
which employ several layers of representation learning, have achieved much
recent success in machine learning benchmarks and competitions; however, most
of these successes have been achieved with purely supervised learning methods
and have relied on large amounts of labeled data [10,22]. Though progress has
been slower, it is likely that unsupervised learning will be important to future
advances in deep learning [1].
The most successful and well-known example of non-probabilistic unsuper-
vised learning is the auto-encoder. Conceptually simple and easy to train via
backpropagation, various regularized variants of the model have recently been
proposed [20,21,25] as well as theoretical insights into their operation [6,24].
In practice, the latent representation learned by auto-encoders has typically
been used to solve a secondary problem, often classification. The most common
setup is to train a single auto-encoder on data from all classes and then task a
classifier with discriminating among the classes. However, this contrasts with the way
probabilistic models have typically been used in the past: in that literature, it is
more common to train one model per class and use Bayes’ rule for classification.
There are two challenges to classifying using per-class auto-encoders. First, up
until very recently, it was not known how to obtain the score of data under
an auto-encoder, meaning how much the model “likes” an input. Second, auto-
encoders are non-probabilistic, so even if they can be scored, the scores do not
integrate to 1 and therefore the per-class models need to be calibrated.
Kamyshanska and Memisevic have recently shown how scores can be com-
puted from an auto-encoder by interpreting it as a dynamical system [7].
Although the scores do not integrate to 1, they show how one can combine
the unnormalized scores into a generative classifier by learning class-specific
normalizing constants from labeled data.
In this paper we turn our interest towards a variant of auto-encoders which
are capable of learning higher-order features from data [15]. The main idea is to
learn relations between pixel intensities rather than the pixel intensities them-
selves by structuring the model as a tri-partite graph which connects hidden
units to pairs of images. If the images are different, the hidden units learn how
the images transform. If the images are the same, the hidden units encode within-
image pixel covariances. Learning such higher-order features can yield improved
results on recognition and generative tasks.
We adopt a dynamical systems view of gated auto-encoders, demonstrating
that they can be scored similarly to the classical auto-encoder. We adopt the
framework of [7] both conceptually and formally in developing a theory which
yields insights into the operation of gated auto-encoders. In addition to the
theory, we show in our experiments that a classification model based on gated
auto-encoder scoring can outperform a number of other representation learning
architectures, including classical auto-encoder scoring. We also demonstrate that
scoring can be useful for the structured output task of multi-label classification.
2 Gated Auto-Encoders
In this section, we review the gated auto-encoder (GAE). Due to space con-
straints, we will not review the classical auto-encoder. Instead, we direct the
reader to the reviews in [8,15] with which we share notation. Similar to the clas-
sical auto-encoder, the GAE consists of an encoder h(·) and decoder r(·). While
the standard auto-encoder processes a datapoint x, the GAE processes input-
output pairs (x, y). The GAE is usually trained to reconstruct y given x, though
it can also be trained symmetrically, that is, to reconstruct both y from x and
x from y. Intuitively, the GAE learns relations between the inputs, rather than
representations of the inputs themselves1 . If x = y, for example, they represent
sequential frames of a video, intuitively, the mapping units h learn transforma-
tions. In the case that x = y (i.e. the input is copied), the mapping units learn
pixel covariances.
In the simplest form of the GAE, the M hidden (mapping) units are given
by a basis expansion of x and y. However, this leads to a parameterization
that is at least quadratic in the number of inputs and is thus prohibitively
large. Therefore, in practice, x, y, and h are projected onto matrices or ("latent
¹ Relational features can be mixed with standard features by simply adding connections that are not gated.
Note that the parameters are usually shared between the encoder and decoder.
The choice of whether to apply a nonlinearity to the output, and the specific
form of objective function will depend on the nature of the inputs, for example,
binary, categorical, or real-valued. Here, we have assumed real-valued inputs for
simplicity of presentation, therefore, Eqs. 2 and 3 are bi-linear functions of h
and we use a squared-error objective:
J = (1/2) ||r(y|x) − y||^2 .    (4)
We can also constrain the GAE to be a symmetric model by training it to
reconstruct both x given y and y given x [15]:
J = (1/2) ||r(y|x) − y||^2 + (1/2) ||r(x|y) − x||^2 .    (5)
The symmetric objective can be thought of as the non-probabilistic analogue
of modeling a joint distribution over x and y as opposed to a conditional [15].
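To make the reconstruction function concrete, the following is a rough sketch of a factored GAE with linear outputs and the symmetric objective of Eq. 5; the factored parameterization follows the usual form in the literature on gated models [15] and is an assumption here, not a transcription of the authors' code.

```python
import numpy as np

def gae_reconstruct(x, y, Wx, Wy, Wh):
    """Reconstruction r(y|x) of a factored gated auto-encoder (a sketch under the assumed
    parameterization): factor projections, sigmoid mapping units, linear output."""
    fx, fy = Wx @ x, Wy @ y                        # factor projections of the two inputs
    h = 1.0 / (1.0 + np.exp(-(Wh @ (fx * fy))))    # mapping units (sigmoid)
    return Wy.T @ (fx * (Wh.T @ h))                # linear output for real-valued y

def symmetric_gae_loss(x, y, Wx, Wy, Wh):
    """Symmetric objective of Eq. 5: reconstruct y from x and x from y with shared parameters."""
    ry = gae_reconstruct(x, y, Wx, Wy, Wh)
    rx = gae_reconstruct(y, x, Wy, Wx, Wh)         # swap the roles of x and y
    return 0.5 * np.sum((ry - y) ** 2) + 0.5 * np.sum((rx - x) ** 2)
```

Whether an output non-linearity is applied depends on the nature of the inputs, as noted above; real-valued inputs with linear outputs are assumed in this sketch.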
In [7], the authors showed that data could be scored under an auto-encoder
by interpreting the model as a dynamical system. In contrast to the probabilis-
tic views based on score matching [6,21,24] and regularization, the dynamical
systems approach permits scoring under models with either linear (real-valued
data) or sigmoid (binary data) outputs, as well as arbitrary hidden unit activa-
tion functions. The method is also agnostic to the learning procedure used to
train the model, meaning that it is suitable for the various types of regularized
auto-encoders which have been proposed recently. In this section, we demon-
strate how the dynamical systems view can be extended to the GAE.
Similar to [7], we will view the GAE as a dynamical system with the vector field
defined by
F (y|x) = r(y|x) − y.
The vector field represents the local transformation that y|x undergoes as a
result of applying the reconstruction function r(y|x). Repeatedly applying the
reconstruction function to an input y|x → r(y|x) → r(r(y|x)|x) → · · · →
r(r · · · r(y|x)|x) yields a trajectory whose dynamics, from a physics perspective,
can be viewed as a force field. At any point, the potential force acting on a point
is the gradient of some potential energy (negative goodness) at that point. In
this light, the GAE reconstruction may be viewed as pushing pairs of inputs x, y
in the direction of lower energy.
Our goal is to derive the energy function, which we call a scoring function,
and which measures how much a GAE “likes” a given pair of inputs (x, y) up to
normalizing constant. In order to find an expression for the potential energy, the
vector field must be able to be written as the derivative of a scalar field [7]. To
check this, we can appeal to Poincaré's integrability criterion: for some open,
simply connected set U, a continuously differentiable function F : U → R^m
defines a gradient field if and only if its Jacobian is symmetric, i.e. ∂F_j/∂y_i = ∂F_i/∂y_j for all i, j.
The vector field defined by the GAE indeed satisfies Poincaré’s integrability cri-
terion; therefore it can be written as the derivative of a scalar field. A derivation
is given in the Supplementary Material, Section 1.1. This also applies to the
GAE with a symmetric objective function (Eq. 5) by setting the input as ξ|γ
such that ξ = [y; x] and γ = [x; y] and following the exact same procedure.
r(y|x) − y = ∇E.
Hence, by integrating out the trajectory of the GAE (x, y), we can measure the
energy along a path. Moreover, the line integral of a conservative vector field
Note that if we consider the conditional GAE where we reconstruct y given x only, this
yields
E_σ(y|x) = Σ_k log(1 + exp(W^H (W_{k·}^Y y · W_{k·}^X x))) − ||y||^2/2 + const.    (9)
Then the energy function of the cAE with dynamics r(x|y) − x is equivalent to
the free energy of a covariance RBM up to a constant:

E(x, x) = Σ_k log(1 + exp(W^H (W^X x)^2 + b)) − ||x||^2/2 + const.    (10)
The proof is given in the Supplementary Material, Section 2.2. We can extend
this analysis to the mcAE by using the above theorem and the results from [7].
Corollary 1. The energy function of a mcAE and the free energy of a Mean-
covariance RBM (mcRBM) with Gaussian-distributed visibles and Bernoulli-
distributed hiddens are equivalent up to a constant. The energy of the mcAE is:
E = Σ_k log(1 + exp(−W^H (W^X x)^2 − b)) + Σ_k log(1 + exp(W x + c)) − ||x||^2 + const    (11)
where Bi is a learned bias for class i. The bias terms take the role of calibrating
the unnormalized energies. Note that we can similarly combine the energies from
a symmetric gated auto-encoder where x = y (i.e. a covariance auto-encoder)
and apply Eq. 12. If, for each class, we train both a covariance auto-encoder and
a classical auto-encoder (i.e. a “mean” auto-encoder) then we can combine both
sets of unnormalized energies as follows
P_mcAE(z_i|x) = exp(E_i^M(x) + E_i^C(x) + B_i) / Σ_j exp(E_j^M(x) + E_j^C(x) + B_j),    (13)
where EiM (x) is the energy which comes from the “mean” (standard) auto-
encoder trained on class i and EiC (x) the energy which comes from the “covari-
ance” (gated) auto-encoder trained on class i. We call the classifiers in Eq. 12
and Eq. 13 “Covariance Auto-encoder Scoring” (cAES) and “Mean-Covariance
Auto-encoder Scoring” (mcAES), respectively.
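Once the per-class energies are available, the classifier itself reduces to a softmax over calibrated scores; a minimal sketch follows (our code; the per-class energy functions are assumed to be supplied as callables, and the toy quadratic energies in the usage example are purely illustrative).

```python
import numpy as np

def aes_predict_proba(x, class_energies, biases):
    """Eq. (13)-style classifier: P(z_i|x) = exp(E_i(x) + B_i) / sum_j exp(E_j(x) + B_j),
    where class_energies[i] is the unnormalized energy of the model trained on class i
    and biases[i] = B_i is the learned calibration constant for that class."""
    scores = np.array([E(x) for E in class_energies]) + biases
    scores -= scores.max()                     # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# usage with toy quadratic "energies" standing in for per-class (gated) auto-encoder scores
toy_energies = [lambda x, m=m: -0.5 * np.sum((x - m) ** 2) for m in (0.0, 1.0)]
print(aes_predict_proba(np.array([0.2, 0.1]), toy_energies, biases=np.zeros(2)))
```

For mcAES, each callable would return the sum E_i^M(x) + E_i^C(x) of the mean and covariance energies.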
The training procedure is summarized as follows:
The dominant application of deep learning approaches to vision has been the
assignment of images to discrete classes (e.g. object recognition). Many applica-
tions, however, involve “structured outputs” where the output variable is high-
dimensional and has a complex, multi-modal joint distribution. Structured out-
put prediction may include tasks such as multi-label classification where there
are regularities to be learned in the output, and segmentation, where the output
is as high-dimensional as the input. A key challenge to such approaches lies in
Table 1. Classification error rates on the Deep Learning Benchmark dataset. SAA3
stands for three-layer Stacked Auto-encoder. SVM and RBM results are from [24],
DEEP and GSM are results from [15], and AES is from [7].
DATA   SVM (RBF)   RBM   DEEP (SAA3)   GSM   AES   cAES   mcAES
developing models that are able to capture complex, high level structure like
shape, while still remaining tractable.
Though our proposed work is based on a deterministic model, we have shown
that the energy, or scoring function of the GAE is equivalent, up to a constant, to
that of a conditional RBM, a model that has already seen some use in structured
prediction problems [12,18].
GAE scoring can be applied to structured output problems as a type of
“post-classification” [17]. The idea is to let a naïve, non-structured classifier
make an initial prediction of the outputs in a fast, feed-forward manner, and
then allow a second model (in our case, a GAE) to clean up the outputs of the first
model. Since GAEs can model the relationship between input x and structured
output y, we can initialize the output with the output of the naïve model, and
then optimize its energy function with respect to the outputs. Input x is held
constant throughout the optimization.
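The refinement loop just described can be sketched as follows (our code; the numerically estimated gradient, tolerance check and step size are assumptions; in practice an analytic gradient of the GAE energy would be used):

```python
import numpy as np

def refine_output(energy, x, y0, lr=0.1, tol=1e-5, max_iter=100, h=1e-4):
    """Iteratively refine a naive prediction y0 by gradient steps on the GAE energy E(y|x),
    following the update in the algorithm excerpted below: y_{t+1} = y_t - lr * grad_y E(y_t|x).
    `energy` is any callable E(y, x); the gradient is estimated by central differences here."""
    y = y0.copy()
    e_prev = energy(y, x)
    for _ in range(max_iter):
        grad = np.array([(energy(y + h * e, x) - energy(y - h * e, x)) / (2 * h)
                         for e in np.eye(len(y))])
        y = y - lr * grad
        e_now = energy(y, x)
        if abs(e_prev - e_now) <= tol:         # stop once the energy change is below tolerance
            break
        e_prev = e_now
    return y
```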
Li et al. [12] recently proposed Compositional High Order Pattern Potentials,
a hybrid of Conditional Random Fields (CRFs) and Restricted Boltzmann
Machines. The RBM provides a global shape prior to the locally-connected CRF.
Adopting the idea of learning structured relationships between outputs, we
propose an alternate approach in which the inputs of the GAE are
not (x, y) but (y, y). In other words, the post-classification model is a covari-
ance auto-encoder. The intuition behind the first approach is to use a GAE to
learn the relationship between the input x and the output y, whereas the second
method aims to learn the correlations between the outputs y.
We denote our two proposed methods GAE_XY and GAE_Y². GAE_XY corresponds
to a GAE, trained conditionally, whose mapping units directly model the
relationship between input and output, and GAE_Y² corresponds to a GAE which
models correlations between output dimensions. GAE_XY defines E(y|x), while
GAE_Y² defines E(y|y) = E(y). They differ only in terms of the data vectors
that they consume. The training and test procedures are detailed in Algorithm 1.
y_0 = f(x_test)    (16)
6: while E(y_{t+1}|x) − E(y_t|x) > ε or t ≤ max. iter. do
7:    Compute ∇_{y_t} E
8:    Update y_{t+1} = y_t − λ ∇_{y_t} E
9: where ε is the tolerance rate with respect to the convergence of the optimization.
² In our experiments, we used the cross-entropy loss function for loss1 and loss2.
Scoring and Classifying with Gated Auto-Encoders 543
Fig. 1. Covariance matrices for the multi-label datasets: Yeast, Scene, MTurk, and
MajMin.
Table 2. Error rate on multi-label datasets. As in previous work, we report the mean
across 10 repeated runs with different random weight initializations.
Method Yeast Scene MTurk MajMin
LogReg 20.16 10.11 8.10 4.34
HashCRBM∗ 20.02 8.80 7.24 4.24
MLP 19.79 8.99 7.13 4.23
GAES_XY 19.27 6.83 6.59 3.96
GAES_Y² 19.58 6.81 6.59 4.29
6 Conclusion
as [7], we showed that the GAE could be scored according to an energy function.
From this perspective, we demonstrated the equivalency of the GAE energy to
the free energy of a FCRBM with Gaussian visible units, Bernoulli hidden units,
and sigmoid hidden activations. In the same manner, we also showed that the
covariance auto-encoder can be formulated in a way such that its energy function
is the same as the free energy of a covariance RBM, and this naturally led to
a connection between the mean-covariance auto-encoder and mean-covariance
RBM. One interesting observation is that Gaussian-Bernoulli RBMs have been
reported to be difficult to train [3,9], and the success of training RBMs is highly
dependent on the training setup [26]. Auto-encoders are an attractive alternative,
even when an energy function is required.
Structured output prediction is a natural next step for representation learn-
ing. The main advantage of our approach compared to other popular approaches
such as Markov Random Fields, is that inference is extremely fast, using a
gradient-based optimization of the auto-encoder scoring function. In the future,
we plan on tackling more challenging structured output prediction problems.
References
1. Bengio, Y., Thibodeau-Laufer, É.: Deep generative stochastic networks trainable
by backprop (2013). arXiv preprint arXiv:1306.1091
2. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classi-
fication. Pattern Recognition 37, 1757–1771 (2004)
3. Cho, K.H., Ilin, A., Raiko, T.: Improved learning of Gaussian-Bernoulli restricted
Boltzmann machines. In: Honkela, T. (ed.) ICANN 2011, Part I. LNCS, vol. 6791,
pp. 10–17. Springer, Heidelberg (2011)
4. Droniou, A., Sigaud, O.: Gated autoencoders with tied input weights. In: ICML
(2013)
5. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: NIPS
(2002)
6. Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data
generating distribution. In: ICLR (2013)
7. Kamyshanska, H., Memisevic, R.: On autoencoder scoring. In: ICML, pp. 720–728
(2013)
8. Kamyshanska, H., Memisevic, R.: The potential energy of an auto-encoder. IEEE
Transactions on Pattern Analysis and Machine Intelligence 37(6), 1261–1273
(2014)
9. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep.,
Department of Computer Science, University of Toronto (2009)
10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: NIPS (2012)
11. Larochelle, H., Erhan, D., Courville, A., Bergstra, J., Bengio, Y.: An empirical
evaluation of deep architectures on problems with many factors of variation. In:
ICML (2007)
12. Li, Y., Tarlow, D., Zemel, R.: Exploring compositional high order pattern potentials
for structured output learning. In: CVPR (2013)
13. Mandel, M.I., Eck, D., Bengio, Y.: Learning tags that vary within a song. In:
ISMIR (2010)
14. Mandel, M.I., Ellis, D.P.W.: A web-based game for collecting music metadata.
Journal of New Music Research 37, 151–165 (2008)
15. Memisevic, R.: Gradient-based learning of higher-order image features. In: ICCV
(2011)
16. Memisevic, R., Zach, C., Hinton, G., Pollefeys, M.: Gated softmax classification.
In: NIPS (2010)
17. Mnih, V., Hinton, G.E.: Learning to detect roads in high-resolution aerial images.
In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part VI. LNCS,
vol. 6316, pp. 210–223. Springer, Heidelberg (2010)
18. Mnih, V., Larochelle, H., Hinton, G.E.: Conditional restricted Boltzmann machines
for structured output prediction. In: UAI (2011)
19. Ranzato, M., Hinton, G.E.: Modeling pixel means and covariances using factorized
third-order Boltzmann machines. In: CVPR (2010)
20. Rifai, S.: Contractive auto-encoders: explicit invariance during feature extraction.
In: ICML (2011)
21. Swersky, K., Ranzato, M., Buchman, D., Freitas, N.D., Marlin, B.M.: On autoen-
coders and score matching for energy based models. In: ICML, pp. 1201–1208
(2011)
22. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions (2014). arXiv
preprint arXiv:1409.4842
23. Taylor, G.W., Hinton, G.E.: Factored conditional restricted Boltzmann machines
for modeling motion style. In: ICML, pp. 1025–1032 (2009)
24. Vincent, P.: A connection between score matching and denoising auto-encoders.
Neural Computation 23(7), 1661–1674 (2010)
25. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.: Extracting and composing
robust features with denoising autoencoders. In: ICML (2008)
26. Wang, N., Melchior, J., Wiskott, L.: Gaussian-binary restricted Boltzmann
machines on modeling natural image statistics. Tech. rep., Institut für Neuroinformatik,
Ruhr-Universität Bochum, Bochum, 44780, Germany (2014)
Sign Constrained Rectifier Networks
with Applications to Pattern Decompositions
1 Introduction
Deep rectifier networks have achieved great success in object recognition
[4,8,10,18], face verification [14,15], speech recognition [3,6,12], and handwritten
digit recognition [2]. However, the lack of understanding of the roles of the
hidden layers makes deep rectifier networks difficult to interpret for tasks
of discriminant factor analysis and pattern structure analysis. Towards a clearer
understanding of the success of deep rectifier networks, a recent work [1] provides
a constructive proof of the universal classification power of two-hidden-layer
rectifier networks. For binary classification, the proof uses the first hidden
layer to make the pattern sets convexly separable. The second hidden layer is
then used to achieve linear separability, and finally a linear classifier is used to
separate the patterns. Although this strategy can be used in constructive proofs,
it cannot be used to analyse a learnt rectifier network, since the strategy might
not hold for a network learnt empirically from data. Fortunately, this paper will show
that such a strategy can be verified if additional sign constraints are imposed on
the weights of the output layer and on those of the top hidden layer. A fundamental
result of this paper is that a pair of pattern sets can be separated by a
single-hidden-layer rectifier network with non-negative output layer weights and
non-positive bias if and only if one of the pattern sets is disjoint from the convex
hull of the other. With this fundamental result, this paper introduces sign con-
strained rectifier networks (SCRN) and proves that the two-hidden-layer SCRNs
are capable of separating any two disjoint pattern sets. SCRN can automatically
learn a rectifier network classifier which achieves convex separability and lin-
ear separability in the first and second hidden layers respectively. For any pair
of disjoint pattern sets, a two-hidden-layer SCRN can be used to decompose
one of the pattern sets into several subsets each convexly separable from the
entire other pattern set; and for any pair of convexly separable pattern sets, a
single-hidden-layer SCRN can be used to decompose one of the pattern sets into
several subsets each linearly separable from the entire other pattern set. SCRN
thus can be used to analyse the pattern structures and the discriminant factors
of different patterns.
Compared to traditional unconstrained rectifier networks, SCRN is more
interpretable and convenient for tasks of discriminant factor analysis and pat-
tern structure analysis. It can help in initializing or refining traditional
rectifier networks. The outliers and the non-crucial points of the decomposed
subsets can be identified. Classification accuracy can thus be improved by training
after removal of outliers, while training efficiency can be improved by removing
the non-crucial training patterns, especially when the original training size is
large.
Notations: Throughout the paper, we use capital letters to denote matrices, lower
case letters for scalars, and bold lowercase letters for vectors. For instance, we
use w_i to denote the i-th column of a matrix W, and b_i to denote the i-th
element of a vector b. For any integer m, we use [m] to denote the integer set
from 1 to m, i.e., [m] ≜ {1, 2, · · · , m}. We use I to denote the identity matrix
with proper dimensions, 0 a vector with all elements being 0, and 1 a vector
with all elements being 1. W ⪰ 0 and b ⪰ 0 denote that all elements of W and
b are non-negative, while W ⪯ 0 and b ⪯ 0 denote that all elements of W and
b are non-positive. Given a finite number of points x_i (i ∈ [m]) in R^n, a convex
combination x of these points is a linear combination of these points in which all
coefficients are non-negative and sum to 1. The convex hull of a set X, denoted
by CH(X), is the set of all convex combinations of the points in X.
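Since convex separability (one set disjoint from the convex hull of the other) is central to everything that follows, the sketch below shows one way to test it numerically by posing convex hull membership as a linear feasibility problem. It is an illustration only: the function names are ours and SciPy's linprog is assumed to be available.

import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, X):
    """Is x a convex combination of the rows of X?  Checks feasibility of
    X^T lam = x, 1^T lam = 1, lam >= 0 with a linear program."""
    m = X.shape[0]
    A_eq = np.vstack([X.T, np.ones((1, m))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs")
    return res.status == 0               # feasible => x lies in CH(X)

def disjoint_from_hull(X_pos, X_neg):
    """True when no point of X_pos lies in CH(X_neg), the condition used below."""
    return not any(in_convex_hull(x, X_neg) for x in X_pos)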
The rest of this paper is organised as follows. In Section 2, the categories of
separable pattern sets are described with a brief review on the disjoint convex hull
decomposition models of patterns [1]. In Section 3, we address the formulation
Fig. 1. Illustration of the categories of pattern sets (Left: linear separable; Middle:
convexly separable; Right: convexly inseparable). Best viewed in color.
2) Convexly Separable Pattern Sets: The two pattern sets have a disjoint decom-
position convex hull model with min(L1 , L2 ) = 1, i.e., CH(X1 ) ∩ X2 = ∅ or
CH(X2 ) ∩ X1 = ∅. These pattern sets are referred to as convexly separable
because there exists a convex region which can separate one class from the
other;
3) Disjoint but Convexly Inseparable Pattern Sets: X1 and X2 have no common
points, X1 ∩ X2 = ∅, and all their disjoint convex hull decompositions satisfy
min(L1 , L2 ) > 1.
Figure 1 demonstrates the three categories of pattern sets. There exists a
hyperplane to separate the linearly separable patterns, and the discriminant fac-
tor can be characterized by the geometrically interpretable linear classifiers (i.e.,
the separating hyperplanes). However, patterns are rarely linearly separable in
practice, and nonlinear classifiers are required to separate the patterns. Existing
nonlinear classification methods such as kernel methods [13] and deep rectifier
network methods [4,8,10,18] are not geometrically interpretable due to the non-
linear transformations induced by kernels or hidden layers. In this paper, we
investigate the methods to decompose the convexly inseparable pattern sets into
convexly separable pattern subsets, and the methods to decompose the convexly
separable pattern sets into linearly separable subsets, so that pattern structures
and discriminant factors can be analysed through the decomposed pattern sub-
sets by linear classifiers.
f{G(x)} > 0, ∀ x ∈ X_+,
f{G(x)} < 0, ∀ x ∈ X_−.   (7)

w_i^T x_i^+ + b_i > 0,
w_i^T x + b_i < 0, ∀ x ∈ X_−.   (8)

Denote
W = [w_1, w_2, · · · , w_{n_+}],   b = [b_1, b_2, · · · , b_{n_+}]^T,   z = max(0, W^T x + b).   (9)
Then we have
Z_− ≜ {z = max(0, W^T x + b) : x ∈ X_−} = {0},
Z_+ ≜ {z = max(0, W^T x + b) : x ∈ X_+} ⊂ {z : 1^T z > γ_min, z_i ≥ 0 ∀ i ∈ [n_+]; z_j > 0, ∃ j ∈ [n_+]},   (10)
where
γ_min ≜ min_{x ∈ X_+} 1^T max(0, W^T x + b) > 0.   (11)
The middle subfigure of Fig. 2 provides an example of Z_+ (in red) and Z_−
(in green), the transformed pattern sets of the convexly separable pattern
sets shown in the left subfigure of Fig. 2.
For a single-hidden-layer binary classifier f(x), as described in (4), if we
choose β = −1 and a = (2/γ_min) 1 ⪰ 0, then
f(x) = (2/γ_min) 1^T max(0, W^T x + b) − 1
satisfies
f(x) ≥ 1 > 0, ∀ x ∈ X_+,
f(x) = −1 < 0, ∀ x ∈ X_−,   (12)
which imply that X+ and X− can be separated by a sign-constrained single-
hidden-layer binary classifier.
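This sufficiency construction is easy to reproduce numerically. The sketch below builds f(x) = (2/γ_min) 1^T max(0, W^T x + b) − 1 from hyperplanes (W, b) that are assumed to already satisfy (8); finding such hyperplanes is outside the scope of this sketch, and the helper name is ours.

import numpy as np

def sign_constrained_separator(W, b, X_pos):
    """Build the classifier of Eq. (12) from (W, b) satisfying Eq. (8):
    each column w_i separates one positive point from all negative points."""
    hidden = lambda x: np.maximum(0.0, W.T @ x + b)     # rectifier hidden layer
    gamma_min = min(hidden(x).sum() for x in X_pos)     # Eq. (11); positive under (8)
    # Output weights a = (2/gamma_min) * 1 are non-negative, bias beta = -1 is non-positive.
    return lambda x: (2.0 / gamma_min) * hidden(x).sum() - 1.0

# On X_pos the returned f gives values >= 1; on the negative set every hidden
# unit is inactive, so f evaluates to exactly -1, matching Eq. (12).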
(Necessity). Suppose that X_+, X_− can be separated by a sign-constrained
single-hidden-layer binary classifier, that is, there exist a ⪰ 0, β ≤ 0, and W, b
such that f(x), as defined in (4), satisfies (5). Next, we will prove the convexity
of the set {x : f(x) < 0} and show that f(x) < 0 holds for all x in the convex
hull of X_−.
Let z_0, z_1 be two arbitrary real vectors of the same dimension and let
z_λ = λz_1 + (1 − λ)z_0 be their convex combination. Since max(0, ·) is convex
elementwise,
max(0, z_λ) ⪯ λ max(0, z_1) + (1 − λ) max(0, z_0),   (13)
we have
a^T max(0, z_λ) ≤ λ a^T max(0, z_1) + (1 − λ) a^T max(0, z_0), ∀ λ ∈ [0, 1]   (14)
for any a ⪰ 0.
In particular, let z_λ = W^T x_λ + b with x_λ = λx_1 + (1 − λ)x_0. Then we have
f(x_λ) = a^T max(0, z_λ) + β
 ≤ λ (a^T max(0, z_1) + β) + (1 − λ) (a^T max(0, z_0) + β)   (15)
 = λ f(x_1) + (1 − λ) f(x_0), ∀ λ ∈ [0, 1],
and therefore
f (xλ ) < 0, ∀ λ ∈ [0, 1] (16)
if and only if
f(x_λ) < 0, for λ ∈ {0, 1}.   (17)
Hence {x : f (x) < 0} is a convex set, and thus
f (x) < 0, ∀ x ∈ CH(X− ) (18)
follows from f (x) < 0, ∀ x ∈ X− . Note that f (x) > 0 for all x ∈ X+ (from (5)).
So X+ and CH(X− ) are separable and thus CH(X− ) ∩ X+ = ∅, which completes
the proof.
The following Lemma shows the capacity of sign-constrained single-hidden-
layer classifiers in decomposing the positive pattern set into several subsets so
that each subset is linearly separable from the negative pattern set.
Lemma 2. Let X_+ and X_− be two convexly separable pattern sets with
X_+ ∩ CH(X_−) = ∅, and let f(x), as defined in (4) with m hidden nodes and satisfying
a ⪰ 0, β ≤ 0, be one of their sign-constrained single-hidden-layer separators.
Define
f_I(x) ≜ Σ_{i∈I} a_i (w_i^T x + b_i) + β   (19)
and
X_+^I ≜ {x : f_I(x) > 0, x ∈ X_+}   (20)
for any subset I ⊂ [m]. Then we have
X_+ = ⋃_{I⊂[m]} X_+^I   (21)
and
CH(X_+^I) ∩ CH(X_−) = ∅,   (22)
i.e., X_+^I and X_− are linearly separable, and furthermore, f_I(x) is their linear
separator satisfying
f_I(x) > 0, ∀ x ∈ X_+^I,
f_I(x) < 0, ∀ x ∈ X_−.   (23)
Before proving this Lemma, we give an example about the subsets of [m] and
explain the decomposed subsets of the positive pattern set with Fig 2. When
m = 2, [m] has three subsets: I1 = {1}, I2 = {2} and I3 = {1, 2}. In the
example of Fig 2, two hidden nodes are used and the positive pattern set (in
red color) can be decomposed into three subsets, each associated with one of
the three (extended) lines of the separating boundary in the right subfigure of
Fig 2, and the middle line of the boundary is associated with I3 . Note that
the decomposed subsets have overlaps. The number of the subsets is determined
by the number of hidden nodes. For compact decompositions and meaningful
discriminant factor analysis, small numbers of hidden nodes are preferable. The
significance of this Lemma is in the discovery that a single-hidden-layer SCRN can
decompose convexly separable pattern sets into linearly separable subsets, so
that the discriminant factors of convexly separable patterns can be analysed
through the linear classifiers between one pattern set (labelled negative) and the
subsets of the other pattern set (labelled positive).
Proof: From a ⪰ 0, it follows that
f_I(x) ≤ Σ_{i∈[m]} a_i max(0, w_i^T x + b_i) + β = f(x), ∀ x ∈ R^n,   (24)
and consequently
f_I(x) < 0, ∀ I ⊂ [m], x ∈ X_−.   (25)
Then (23) follows straightforwardly from the above inequality and the definition
of X_+^I. Since f_I(x) is a linear classifier satisfying (23), f_I(x) is a linear
separator of X_+^I and X_−, and (22) holds consequently.
To complete the proof, it remains to prove (21). Let x ∈ X_+ be any pattern
with positive label and let I ⊂ [m] be the index set such that w_i^T x + b_i > 0 for all
i ∈ I and w_i^T x + b_i ≤ 0 for all i ∉ I. Then f_I(x) = f(x) > 0 and thus x ∈ X_+^I.
This proves that any element of X_+ is in X_+^I for some I ⊂ [m]. Hence (21) is true
and the proof is completed.
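The decomposition in Lemma 2 can be read off directly from a trained sign-constrained single-hidden-layer network by grouping positive patterns according to their sets of active hidden units, exactly as in the proof. A minimal sketch (our own variable names; the hidden-layer weights W and biases b are assumed given):

import numpy as np
from collections import defaultdict

def decompose_positive_set(X_pos, W, b):
    """Group positive patterns by their active hidden units I = {i : w_i^T x + b_i > 0}.
    Each group corresponds to a subset X_+^I that Lemma 2 shows is linearly
    separable from the negative set by the linear classifier f_I."""
    groups = defaultdict(list)
    for x in X_pos:
        pre = W.T @ x + b                        # hidden pre-activations
        I = tuple(np.flatnonzero(pre > 0))       # active index set for this pattern
        groups[I].append(x)
    return groups                                 # maps I -> list of patterns in X_+^I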
Theorem 3. For any two disjoint pattern sets, namely X_+ and X_−, in R^n, there
exists a sign-constrained two-hidden-layer binary classifier f{G(x)}, as defined
in (6) and satisfying a ⪰ 0, β ≤ 0, W ⪯ 0, b ⪰ 0, such that f{G(x)} > 0 for all
x ∈ X_+ and f{G(x)} < 0 for all x ∈ X_−.
Proof: Let
X_+ = ⋃_{i=1}^{L_1} X_+^i,   X_− = ⋃_{j=1}^{L_2} X_−^j   (26)
be disjoint convex hull decompositions of the two pattern sets. Applying Lemma 1
to each pair (X_−, X_+^i), with X_− labelled positive and X_+^i labelled negative,
yields w_i ⪰ 0 and b_i ≤ 0 (i ∈ [L_1]) such that
g_i(x) > 0, ∀ x ∈ X_−,
g_i(x) < 0, ∀ x ∈ X_+^i,   (29)
where
g_i(x) ≜ w_i^T max(0, V^T x + c) + b_i.   (30)
Note that X− is treated as the pattern set with positive labels when applying
Lemma 1, and X+i is treated as the pattern set with negative labels correspond-
ingly.
Let W = [w1 , w2 , · · · , wL1 ], b = [b1 , b2 , · · · , bL1 ]T and consider the transfor-
mation
z = G(x) ≜ max(0, −W^T max(0, V^T x + c) − b),   (31)
where −W and −b, instead of W and b, are used in the above transformation
so that the responses of the negative patterns in X− are 0.
Denote
Z_− ≜ {z : z = G(x), x ∈ X_−} = {0},
Z_+ ≜ {z : z = G(x), x ∈ X_+} ⊂ {z : 1^T z > γ_min, z_i ≥ 0 ∀ i ∈ [L_1]; z_j > 0, ∃ j ∈ [L_1]},   (32)
where
γ_min ≜ min_{x ∈ X_+} 1^T G(x) > 0.   (33)
Let a = (2/γ_min) 1 ⪰ 0, β = −1 and f(z) ≜ a^T z + β. Then f(z) ≥ 1 > 0 for
z ∈ Z_+ and f(z) = −1 < 0 for z ∈ Z_−; or equivalently, f{G(x)} = a^T G(x) + β
satisfies f{G(x)} > 0 for x ∈ X_+ and f{G(x)} < 0 for x ∈ X_−. Note that
−W ⪯ 0, −b ⪰ 0, a ⪰ 0, β ≤ 0, so f{G(x)} is a sign-constrained two-hidden-layer
binary classifier, and the proof is completed.
Theorem 4. Let X_+, X_− be two disjoint pattern sets and let f{G(x)}, as defined
in (6) and satisfying a ⪰ 0, β ≤ 0, W ⪯ 0, b ⪰ 0, be one of their sign-constrained
two-hidden-layer binary separators with m top hidden nodes and satisfying (7).
Define
f_I{G(x)} ≜ Σ_{i∈I} a_i [w_i^T G(x) + b_i] + β,
X_+^I ≜ {x : f_I{G(x)} > 0, x ∈ X_+}   (35)
for any subset I ⊂ [m]. Then we have
X_+ = ⋃_{I⊂[m]} X_+^I   (36)
and
CH(X_+^I) ∩ X_− = ∅,   (37)
i.e., X_+^I and X_− are convexly separable, and furthermore, f_I{G(x)} is their
single-hidden-layer separator satisfying
and therefore
fI {G(x)} < 0, ∀ x ∈ X− (40)
which implies (38), together with the definition of X+I in (35). Hence, fI {G(x)}
is a single-hidden-layer separator of X− and X+I . Next, we show that −fI {G(x)}
can be described as a sign-constrained single-hidden-layer separator of X− and
X+I if they are labelled positive and negative respectively.
Let
g_I(x) ≜ −f_I{G(x)} = â^T G(x) + β̂_I = â^T max(0, V^T x + c) + β̂_I,   (41)
where
â = −Σ_{i∈I} a_i w_i ⪰ 0,   β̂_I = −Σ_{i∈I} a_i b_i − β.   (42)
From (40) and the definition of X_+^I in (35), g_I satisfies
g_I(x) > 0, ∀ x ∈ X_−,
g_I(x) < 0, ∀ x ∈ X_+^I.   (43)
Note that â ⪰ 0. If X_+^I is not empty, β̂_I must be non-positive, and g_I(x) is
thus a sign-constrained single-hidden-layer separator of X_− and X_+^I. Then, by
Lemma 1, we have (37). Note that X_− corresponds to the positive set while X_+^I
corresponds to the negative set when applying Lemma 1, with g_I being one of
their sign-constrained single-hidden-layer classifiers.
Now it remains to prove (36). It suffices to prove that, for any x ∈ X_+, there
exists I ⊂ [m] such that x ∈ X_+^I. Let x be a member of X_+ and let I ⊂ [m], I ≠ ∅,
be the index set such that w_i^T G(x) + b_i > 0 for all i ∈ I and w_i^T G(x) + b_i ≤ 0 for
all i ∉ I. Then f_I{G(x)} = f{G(x)} > 0 and thus x is in X_+^I.
Theorem 4 states that the positive pattern set can be decomposed into several
subsets by a two-hidden-layer SCRN, namely
X_+ = ⋃_{i=1}^{t} X_+^i,
so that each X_+^i and X_− are linearly separable. Hence, one can investigate the
discriminant factors of the two patterns by using the linear classifiers of these subsets
of the patterns. With the decomposed subsets, one can investigate the pattern
structures. The numbers of the subsets are determined by the numbers of hidden
nodes in the top hidden layers of the two-hidden-layer SCRNs and the numbers
of the hidden nodes of the single-hidden-layer SCRNs. To find compact pattern
structures and meaningful discriminant factors, small number of hidden nodes
is preferable. Since the convex hulls of the subsets can be represented by the
convex hull of a small set of boundary points, and the interior points are not
crucial for training rectifier networks (similar to the training points other than
support vectors in support vector machine (SVM) training [13,16]), one can use
the proposed SCRN to find these non-crucial training patterns and remove them
to speed up the training of the unconstrained rectifier networks. One can
also use SVMs to separate the decomposed subsets of the patterns, identify
the outliers, and remove them to improve classification accuracy.
For multiple classes of pattern sets, multiple times of SCRN training and
analysis are required, each time labelling one class positive and all the other
classes negative.
6 Discussion
In this paper, we have shown that, with sign constraints on the weights of the
output layer and on those of the top hidden layer, two-hidden-layer SCRNs are
capable of separating any two finite training patterns as well as decomposing one
of them into several subsets so that each subset is convexly separable from the
other pattern set; and single-hidden-layer SCRNs are capable of separating any
pair of convexly separable pattern sets as well as decomposing one of them into
several subsets so that each subset is linearly separable from the other pattern
set.
The proposed SCRN not only enables pattern and feature analysis for knowl-
edge discovery but also provides insights on how to improve the efficiency
and accuracy of training the rectifier network classifiers. Potential applications
include: 1) Feature analysis such as in health and production management of
precision livestock farming [17], where one needs to identify the key features asso-
ciated with diseases (e.g. hock burn of broiler chickens) on commercial farms,
using routinely collected farm management data [5]. 2) User-supervised neural
network training. Since each decomposed subset of the pattern sets via the pro-
posed SCRN is convexly separable from the other pattern set, one can visualize
the clusters of each pattern set, identify the outliers, and check the separating
boundaries. By removing the outliers and adjusting the boundaries, users could
manage to improve the training and/or validation performance. By removing the
non-crucial points (i.e., interior points) of the convex hull of each decomposed
subset, users could speed up the training, especially for large training sets.
Related Works: This work on the universal classification power of the proposed
SCRN is related to [7,9,11], which address the universal approximation power
of deep neural networks for functions or for probability distributions. This work
is also closely related to [1], which proves that any multiple pattern sets can
be transformed to be linearly separable by two hidden layers, with additional
distance preserving properties. In this paper, we prove that any two disjoint
pattern sets can be separated by a two-hidden-layer rectifier network with addi-
tional sign constraints on the weights of the output layer and on those of the top
hidden layer. The significance of SCRN lies in the fact that, through two hidden
layers, it can decompose one of any pair of pattern sets into several subsets so
that each subset is convexly separable from the entire other pattern set; and
through a single hidden layer, it can decompose one of the convexly separable
pair of pattern sets into several subsets so that each subset is linearly separable
from the entire other pattern set. This decomposition can be used to analyse
pattern sets and identify the discriminative features for different patterns. Our
effort to make the deep rectifier network interpretable is related to [18], which
manages to visualize and understand the learnt convolutional neural network
features. The visualization technique developed in [18] has been shown useful to
improve classification accuracy as evidenced by the state-of-the-art performance
in object classification.
Limitations and Future Works: This paper focuses on the theoretical devel-
opment of interpretable rectifier networks aiming to understand the learnt hid-
den layers and pattern structures. Future works are required to develop learning
algorithms for SCRN and demonstrate their applications to image classification
on popular databases.
References
1. An, S., Boussaid, F., Bennamoun, M.: How can deep rectifier networks achieve lin-
ear separability and preserve distances? In: Proceedings of The 32nd International
Conference on Machine Learning, pp. 514–523 (2015)
2. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for
image classification. In: 2012 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 3642–3649. IEEE (2012)
3. Deng, L., Li, J., Huang, J.T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G.,
He, X., Williams, J., et al.: Recent advances in deep learning for speech research
at Microsoft. In: 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 8604–8608. IEEE (2013)
4. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-
level performance on ImageNet classification. arXiv preprint arXiv:1502.01852
(2015)
5. Hepworth, P.J., Nefedov, A.V., Muchnik, I.B., Morgan, K.L.: Broiler chickens can
benefit from machine learning: support vector machine analysis of observational
epidemiological data. Journal of the Royal Society Interface 9(73), 1934–1942
(2012)
6. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups. IEEE
Signal Processing Magazine 29(6), 82–97 (2012)
7. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are uni-
versal approximators. Neural Networks 2(5), 359–366 (1989)
8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
9. Le Roux, N., Bengio, Y.: Deep belief networks are compact universal approxima-
tors. Neural Computation 22(8), 2192–2207 (2010)
10. Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. arXiv
preprint arXiv:1409.5185 (2014)
11. Montufar, G., Ay, N.: Refinements of universal approximation results for deep
belief networks and restricted Boltzmann machines. Neural Computation 23(5),
1306–1319 (2011)
12. Seide, F., Li, G., Yu, D.: Conversational speech transcription using context-
dependent deep neural networks. In: Interspeech, pp. 437–440 (2011)
13. Shawe-Taylor, J., Cristianini, N.: Kernel methods for pattern analysis. Cambridge
University Press (2004)
14. Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint
identification-verification. In: Advances in Neural Information Processing Systems,
pp. 1988–1996 (2014)
15. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-
level performance in face verification. In: 2014 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1701–1708. IEEE (2014)
16. Vapnik, V.N.: Statistical learning theory, vol. 2. Wiley, New York
(1998)
17. Wathes, C., Kristensen, H.H., Aerts, J.M., Berckmans, D.: Is precision livestock
farming an engineer’s daydream or nightmare, an animal’s friend or foe, and a
farmer’s panacea or pitfall? Computers and Electronics in Agriculture 64(1), 2–10
(2008)
18. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks.
In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part I.
LNCS, vol. 8689, pp. 818–833. Springer, Heidelberg (2014)
Aggregation Under Bias: Rényi Divergence
Aggregation and Its Implementation
via Machine Learning Markets
1 Introduction
Aggregation of predictions from different agents or algorithms is becoming
increasingly necessary in distributed, large scale or crowdsourced systems. Much
previous focus is on aggregation of classifiers or point predictions. However,
aggregation of probabilistic predictions is also of particular importance, espe-
cially where quantification of risk matters, generative models are required or
2 Background
Aggregation methods have been studied for some time, and have been discussed
in a number of contexts. Aggregation methods differ from ensemble approaches
(see e.g. [9]), as the latter also involve some control over the form of the individuals
within the ensemble: with aggregation, the focus is entirely on the method
of combination; there is no control over the individual agent beliefs. In addition,
most aggregation methods focus on aggregating hard predictions (classifications,
mean predictive values etc.) [4,10]. Some, but not all of those are suitable for
3 Problem Statement
access to their predictions for a small validation set of our own data which we
know relates to our domain of interest (the distribution of which we denote by
PG ). We consider the case where it may be possible that the data individual
agents see are different in distribution (i.e. biased) with respect to our domain
of interest.
Our objective is to minimize the negative log likelihood for a model P for
future data generated from an unknown data generating distribution P^G. This
can be written as desiring arg min_P KL(P^G || P), where KL denotes the KL-
Divergence. However in an aggregation scenario, we do not have direct access
to data that can be used to choose a model P by a machine learning method.
Instead we have access to beliefs Pi from i = 1, 2, . . . , NA other agents, which do
have direct access to some data, and we must use those agent beliefs Pi to form
our own belief P .
We have no control over the agents’ beliefs Pi , but we can expect that the
agents have learnt Pi using some learning algorithm with respect to data drawn
from individual data distributions PiG . Hence agents will choose Pi with low
KL(Pi ||PiG ) with respect to their individual data, drawn from PiG . For example
agents can choose their own posterior distributions Pi with respect to the data
they observe.
We also assume that each PiG is ‘close’ to the distribution P G we care about.
Where we need to be specific, we use the measure KL(PiG ||P G ) as the measure
of closeness, which is appropriate if PiG is obtained by sample selection bias [26]
from P G . In this case KL(PiG ||P G ) gives a standardized expected log acceptance
ratio, which is a measure of how the acceptance rate varies across the data
distribution. Lower KL divergence means lower variation in acceptance ratio
and Pi is closer to P . The simplest case is to assume KL(PiG ||P G ) = 0 ∀i, which
implies an unbiased data sample.
By contrast, a logarithmic opinion pool is given by
P(y|·) = (1/Z(w)) ∏_{j=1}^{N_A} P_j(y|·)^{w_j},
where w_j ≥ 0 ∀j, and where we use the P(y|·) notation to reflect that this applies to
both conditional and unconditional distributions. The logarithmic opinion pool
is more problematic to work with due to the required computation of the nor-
malization constant, which is linear in the number of states. Again the value
of w can be obtained using a maximum-entropy or a gradient-based optimizer.
Others (see e.g. [16]) have used various approximate schemes for log opinion
pools when the state space is a product space.
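For concreteness, the two pooling rules discussed above can be written in a few lines of Python for discrete beliefs; this is an illustrative sketch with our own variable names, and the weights are assumed non-negative.

import numpy as np

def linear_pool(P, w):
    """Linear opinion pool: weighted mixture of agent beliefs.
    P has shape (num_agents, num_states); w are non-negative weights."""
    w = np.asarray(w, dtype=float)
    return (w[:, None] * P).sum(axis=0) / w.sum()

def log_pool(P, w):
    """Logarithmic opinion pool: weighted geometric combination, renormalized
    by the constant Z(w) obtained by summing over all states."""
    logq = (np.asarray(w, dtype=float)[:, None] * np.log(P)).sum(axis=0)
    q = np.exp(logq - logq.max())        # subtract the max for numerical stability
    return q / q.sum()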
Weighted Divergence aggregation is very general but we need to choose a
particular form of divergence. In this paper we analyse the family of Rényi
divergences for weighted divergence aggregation. This choice is motivated by
two facts:
– Rényi divergence aggregators satisfy maximum entropy arguments for the
aggregator class under highly relevant assumptions about the biases of indi-
vidual agents.
– Rényi divergence aggregators are implemented by machine learning markets,
and hence can result from autonomous self interested decision making by the
individuals contributing different predictors without centralized imposition.
Hence this approach can incentivize agents to provide their best information
for aggregation.
In much of the analysis that follows we will drop the conditioning (i.e. write
P (y) rather than P (y|x)) for the sake of clarity, but without loss of generality
as all results follow through in the conditional setting.
The Rényi divergence has two relevant special cases: lim_{γ→1} (1/γ) D_γ^R(P||Q) =
KL(P||Q), and lim_{γ→0} (1/γ) D_γ^R(P||Q) = KL(Q||P) (which can be seen via
L'Hôpital's rule). We assume the value of the Rényi divergence at γ = 1 is
defined by KL(P||Q) via analytical continuation.
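The γ → 1 limit is easy to check numerically. The sketch below assumes the standard Rényi divergence D_γ(P||Q) = (γ − 1)^{-1} log Σ_y P(y)^γ Q(y)^{1−γ}; the paper's D_γ^R may be scaled differently, so treat this only as an illustration of the limiting behaviour.

import numpy as np

def renyi(P, Q, gamma):
    """Standard Renyi divergence of order gamma for discrete distributions."""
    return np.log(np.sum(P**gamma * Q**(1.0 - gamma))) / (gamma - 1.0)

def kl(P, Q):
    return np.sum(P * np.log(P / Q))

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])
# As gamma approaches 1 the Renyi divergence approaches KL(P||Q).
print(renyi(P, Q, 1.0 + 1e-6), kl(P, Q))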
Definition 2 (Weighted Rényi Divergence Aggregation). The weighted
Rényi divergence aggregation is a weighted divergence aggregation given by (1),
where each divergence D(P_i, Q) = γ_i^{-1} D^R_{γ_i}[P_i || Q].
Note that each component i in (1) can have a Rényi divergence with an indi-
vidualized parameter γi . Sometimes we will assume that all divergences are the
same, and refer to a single γ = γi ∀i used by all the components.
P(y) = (1/Z) Σ_i w_i γ_i^{-1} [ P_i(y)^{γ_i} P(y)^{1−γ_i} / Σ_{y'} P_i(y')^{γ_i} P(y')^{1−γ_i} ]   (3)
where the w_i are given non-negative weights, Z = Z({γ_i}) = Σ_i w_i γ_i^{-1} is a
normalisation constant, and {γ_i} is the set of Rényi divergence parameters.
Setting the derivative of the constrained objective with respect to P(y) to zero gives
one equation for each of the optimum values of P(y). Multiply each equation by P(y) and find
Z = Σ_j w_j γ_j^{-1} by summing over all equations. Rearrange to obtain the result.
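Equation (3) suggests a simple fixed-point scheme for computing the aggregate: repeatedly substitute the current P(y) into the right-hand side. The sketch below does exactly that for discrete distributions; it is our own illustration (the paper mentions direct and stochastic-gradient optimization instead), and convergence is not guaranteed by the sketch.

import numpy as np

def renyi_aggregate(P_agents, w, gammas, iters=200):
    """Fixed-point iteration on Eq. (3).
    P_agents: (num_agents, num_states) array of agent beliefs P_i,
    w: non-negative weights, gammas: per-agent Renyi parameters."""
    w = np.asarray(w, dtype=float)
    g = np.asarray(gammas, dtype=float)
    Z = np.sum(w / g)                                        # normalisation constant
    P = np.full(P_agents.shape[1], 1.0 / P_agents.shape[1])  # start from uniform
    for _ in range(iters):
        terms = P_agents ** g[:, None] * P[None, :] ** (1.0 - g[:, None])
        terms /= terms.sum(axis=1, keepdims=True)            # per-agent normalisation in (3)
        P = (w / g) @ terms / Z
        P /= P.sum()                                         # guard against numerical drift
    return P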
In the next section we show that Rényi divergence aggregation provides the
maximum entropy distribution for combining together agent distributions where
the belief of each agent is subject to a particular form of bias. Two consequences
that are worth alerting the reader to ahead of that analysis are:
1. If all agents form beliefs on data drawn from the same (unbiased) distribution
then the maximum entropy distribution is of the form of a log opinion pool.
2. If all agents form beliefs on unrelated data then the maximum entropy dis-
tribution is of the form of a linear opinion pool.
have access to data drawn from P G , but instead the data seen by the individual
agents was biased, and instead sampled from distribution PiG . In aggregating
the agent beliefs, we neither know the target distribution P G , nor any of the
individual bias distributions PiG , but model them with P and Qi respectively.
As far as the individual agents are concerned they train and evaluate their
methods on their individual data, unconcerned that their domains were biased
with respect to the domain we care about. We can think of this scenario as
convergent dataset shift [26], where there is a shift from the individual train-
ing to a common test scenario. The result is that we are given information
regarding the test log likelihood performance for each P_i in its own domain:
Σ_y P_i^G(y) log P_i(y) = a_i.
The individual agent data is biased, not unrelated, and so we make the
assumption that the individual distributions PiG are related to P in some way. We
assume that KL(PiG ||P G ) is subject to some bound (and call this the nearness
constraint). As mentioned in the Problem Statement this is a constraint on the
standardized expected log acceptance ratio, under an assumption that PiG is
derived from P G via a sample selection bias.
Given this scenario, a reasonable ambition is to find maximum entropy dis-
tributions Qi to model PiG that capture the performance of the individual dis-
tributions Pi , while at the same time being related via an unknown distribution
P . As we know the test performance, we write this as the constraints:
Σ_y Q_i(y) log P_i(y) = a_i.   (5)
The nearness constraints¹ for Q_i are written as
KL(Q_i || P) ≤ A_i   (6)
⇒ Σ_y Q_i(y) log ( Q_i(y) / P(y) ) ≤ A_i for some P,   (7)
encoding that our model Q_i for P_i^G must be near to the model P for P^G. That
is, the KL divergence between the two distributions must be bounded by some
value A_i.
Given these constraints, the maximum entropy (minimum negative entropy)
Lagrangian optimisation can be written as arg min_{{Q_i},P} L({Q_i}, P), where
L({Q_i}, P) = Σ_i Σ_y Q_i(y) log Q_i(y) + Σ_i b_i (1 − Σ_y Q_i(y))
  − Σ_i λ_i ( Σ_y Q_i(y) log P_i(y) − a_i ) + c (1 − Σ_y P(y))
  + Σ_i ρ_i ( Σ_y Q_i(y) log ( Q_i(y) / P(y) ) − A_i + s_i ).   (8)
¹ We could work with a nearness penalty of the same form rather than a nearness
constraint. The resulting maximum entropy solution would be of the same form.
Interim Summary. We have shown that the Rényi divergence aggregator is not
an arbitrary choice of aggregating distribution. Rather, it is the maximum entropy
aggregating distribution when the individual agent distributions are expected to
be biased by a sample selection mechanism.
6 Implementation
Rényi divergence aggregators can be implemented with direct optimization,
stochastic gradient methods, or using a variational optimization for the sum of
weighted divergences, which is described here. The weighted Rényi Divergence
objective given by Definition 2 can be lower bounded using
Σ_i w_i D(P_i, Q) ≥ Σ_{i,y} ( w_i γ_i / (γ_i − 1) ) Q_i(y) log ( P_i(y)^{γ_i} Q(y)^{1−γ_i} / Q_i(y) )   (11)
7 Experiments
To test the practical validity of the maximum entropy arguments, the following
three tasks were implemented.
Algorithm 1. Generate test data for agents with different biases, and test
aggregation methods.
Select a target discrete distribution P ∗ (.) over K values. Choose NA , the number of
agents.
Sample IID a small number NV a of values from the target distribution to get a
validation set DV a
Sample IID a large number N of values {yn ; n = 1, 2, 3, . . . , N } from the target
distribution to get the base set D from which agent data is generated.
Sample bias probabilities fi (y) for each agent to be used as a rejection sampler.
for annealing parameter β = 0 TO 4 do
for each agent i do
Anneal f_i to get f_i^*(y) = f_i(y)^β / max_y f_i(y)^β.
For each data point y_n, reject it with probability (1 − f_i^*(y_n)).
Collect the first 10000 unrejected points, and set Pi to be the resulting empirical
distribution.
This defines the distribution Pi for agent i given the value of β.
end for
Find aggregate P (.) for different aggregators given agent distributions Pi and an
additional P0 corresponding to just the uniform distribution, using the validation
dataset DV a for any parameter estimation.
Evaluate the performance of each aggregator using the KL Divergence between
the target distribution P ∗ (.) and the aggregate distribution P (.): KL(P ∗ ||P ).
end for
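The agent-specific biasing step of Algorithm 1 (annealed rejection sampling from the base set) is straightforward to simulate. The sketch below is our own rendering with assumed variable names, working on integer-coded samples.

import numpy as np

def biased_agent_distribution(base_samples, f_i, beta, n_keep=10000, rng=None):
    """Anneal the bias probabilities, rejection-sample the base data and return
    the empirical distribution of the first n_keep accepted points (one agent)."""
    rng = np.random.default_rng() if rng is None else rng
    f_star = f_i ** beta / np.max(f_i ** beta)        # annealed acceptance probabilities
    accept = rng.random(len(base_samples)) < f_star[base_samples]
    kept = base_samples[accept][:n_keep]
    counts = np.bincount(kept, minlength=len(f_i)).astype(float)
    return counts / counts.sum()                      # empirical distribution P_i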
the data sizes were sufficient for use on student machines, and that submis-
sion files were suitable for uploading (this is the reason for the 6 bit grayscale
representation).
The problem was to provide a probabilistic prediction on the next pixel y
given information from previous pixels in a raster scan. The competitor’s per-
formance was measured by the perplexity on a public set at submission time,
but the final ranked ordering was on a private test set. We chose as agent dis-
tributions the 269 submissions that had perplexity greater than that given by
a uniform distribution and analysed the performance of a number of aggrega-
tion methods for the competition: weighted Rényi divergence aggregators, sim-
ple averaging of the top submissions (with an optimized choice of number),
and a form of heuristic Bayesian model averaging via an annealed likelihood:
P(y|·) ∝ Σ_j P_j(y|·) (P(j|D_tr))^α, where α is an aggregation parameter choice.
The weighted Rényi divergence aggregators were optimized using stochastic gra-
dient methods, until the change between epochs became negligible. The valida-
tion set (20, 000 pixels) is used for learning the aggregation parameters. The test
set (also 20, 000 pixels) is only used for the test results.
Results. For Task 1, Figure 1(a) shows the test performance on different biases
for different values of log(γi ) in (10), where all γi are taken to be identical
and equal to γ. Figure 1(b) shows how the optimal value of γ changes, as the
bias parameter β changes. Parameter optimization was done using a conjugate
gradient method. The cost of optimization for Rényi mixtures is comparable
to that of log opinion pools. For Task 2, Figure 2(a) shows the performance
on the Bach chorales with 10 agents, with the implementation described in the
Implementation section. Again in this real data setting, the Rényi mixtures show
improved performance.
The two demonstrations show that when agents received a biased subsample
of the overall data then Rényi-mixtures perform best as an aggregation method,
in that they give the lowest KL divergence. As the bias increases, so the optimal
value of γ increases. In the limit that the agents see almost the same data from
the target distribution, Rényi-mixtures with small γ perform the best, and are
indistinguishable from the γ = 0 limit. Rényi mixtures are equivalent to log
opinion pools for γ → 0.
For Task 3, all agents see unbiased data and so we would expect log opin-
ion pools to be optimal. The perplexity values as a function of η = 1/γ for all
the methods tested on the test set can be seen in Figure 2(b). The parameter-
based pooling methods perform better than simple averages and all forms of
heuristic model averaging as these are inflexible methods. There is a significant
performance benefit of using logarithmic opinion pooling over linear pooling,
and weighted Rényi divergence aggregators interpolate between the two opin-
ion pooling methods. This figure empirically supports the maximum entropy
arguments.
Fig. 1. (a) Task 1: Plot of the KL divergence against log γ for one dataset with β = 0
(lower lines, blue) through to β = 4 (upper lines, red) in steps of 0.5. Note that,
unsurprisingly, more bias reduces performance. However, the optimal value of γ (lowest
KL) changes as β changes. For low values of β the performance of γ = 0 (log opinion
pools) is barely distinguishable from other low γ values. Note that using a log opinion
pool (low γ) when there is bias produces a significant hit on performance. (b) Task
1: Plot of the optimal γ (defining the form of Rényi mixture) for different values of β
(determining the bias in the generated datasets for each agent). The red (upper) line is
the mean, the blue line the median and the upper and lower bars indicate the 75th and
25th percentiles, all over 100 different datasets. For β = 0 (no bias) we have optimal
aggregation with lower γ values, approximately corresponding to a log opinion pool.
As β increases, the optimal γ gets larger, covering the full range of Rényi Mixtures.
Fig. 2. (a) Task 2: test log probability results (relative to the log probability for a
mixture) for the Bach chorales data for different values of γ, indicating the benefit
of Rényi mixtures over linear (γ = 1) and log (γ = 0) opinion pools. Error bars are
standard errors over 10 different allocations of chorales to agents prior to training. (b)
Task 3: perplexity on the test set of all the compared aggregation methods against
η = 1/γ. For each method, the best performance is plotted. Log opinion pools perform
best as suggested by the maximum entropy arguments, and are statistically significantly
better than the linear opinion pool (p = 8.0 × 10^{−7}). All methods perform better than
the best individual competition entry (2.963).
with γ_i = η_i^{-1} (generalising (10) in [28]). This shows the isoelastic market aggre-
gator linearly mixes together components that are implicitly a weighted product
of the agent belief and the final solution. Simple comparison of this market equi-
librium with the Rényi Divergence aggregator (3) shows that the market solution
and the Rényi divergence aggregator are of exactly the same form.
We conclude that a machine learning market implicitly computes a Rényi
divergence aggregation via the actions of individual agents. The process of
obtaining the market equilibrium is a process for building the Rényi Diver-
gence aggregator, and hence machine learning markets provide a method of
implementation of weighted Rényi divergence aggregators. The benefit of mar-
ket mechanisms for machine learning is that they are incentivized. There is no
assumption that the individual agents behave cooperatively, or that there is an
overall controller who determines agents’ actions. Simply, if agents choose to
maximize their utility (under myopic assumptions) then the result is weighted
Rényi Divergence aggregation.
In general, equilibrium prices are not necessarily straightforward to compute,
but the algorithm in the implementation section provides one such method. As
this iterates computing an interim P (corresponding to a market price) and an
interim Qi corresponding to agent positions given that price, the mechanism
in this paper can lead to a form of tâtonnement algorithm with a guaranteed
market equilibrium – see e.g. [6].
The direct relationship between the risk averseness parameter for the isoelas-
tic utilities and the bias-controlling parameter of the Rényi mixtures (γ_i = η_i^{-1})
provides an interpretation of the isoelastic utility parameter: if agents know
they are reasoning with respect to a biased belief, then an isoelastic utility is
warranted, with a choice of risk averseness that is dependent on the bias.
In [28] the authors show, on a basket of UCI datasets, that market aggre-
gation with agents having isoelastic utilities performs better than simple linear
opinion pools (markets with log utilities) and products (markets with exponential
utilities) when the data agents see is biased. As such markets implement Rényi
mixtures, this provides additional evidence that Rényi mixtures are appropriate
when combining biased predictors.
9 Discussion
When agents are training and optimising on different datasets than one another,
log opinion pooling is no longer a maximum entropy aggregator. Instead, under
certain assumptions, the weighted Rényi divergence aggregator is the maximum
entropy solution, and tests confirm this practically. The weighted Rényi diver-
gence aggregator can be implemented using isoelastic machine learning markets.
Though there is some power in providing aggregated prediction mechanisms
as part of competition environments, there is the additional question of the
competition mechanism itself. With the possibility of using the market-based
aggregation mechanisms, it would be possible to run competitions as prediction
market or collaborative scenarios [1], instead of as winner-takes-all competitions.
This alternative changes the social dynamics of the system and the player incentives,
and so its benefits remain an open problem. We recognize the
importance of such an analysis as an interesting direction for future work.
References
1. Abernethy, J., Frongillo, R.: A collaborative mechanism for crowdsourcing predic-
tion problems. In: Advances in Neural Information Processing Systems 24 (NIPS
2011) (2011)
2. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.
ics.uci.edu/ml
3. Barbu, A., Lay, N.: An introduction to artificial prediction markets for classification
(2011). arXiv:1102.1465v3
4. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
5. Chen, Y., Wortman Vaughan, J.: A new understanding of prediction markets via
no-regret learning. In: Proceedings of the 11th ACM Conference on Electronic
Commerce (2010)
6. Cole, R., Fleischer, L.: Fast-converging tatonnement algorithms for the market
problem. Tech. rep., Dept. Computer Science. Dartmouth College (2007)
7. Dani, V., Madani, O., Pennock, D., Sanghai, S., Galebach, B.: An empirical com-
parison of algorithms for aggregating expert predictions. In: Proceedings of the
Conference on Uncertainty in Artificial Intelligence (UAI) (2006)
8. Dietrich, F.: Bayesian group belief. Social Choice and Welfare 35(4), 595–626
(2010)
9. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F.
(eds.) MCS 2000. LNCS, vol. 1857, pp. 1–5. Springer, Heidelberg (2000)
10. Domingos, P.: Why does bagging work? a Bayesian account and its implications.
In: Proceedings KDD (1997)
11. Everingham, M., et al.: The 2005 PASCAL visual object classes challenge. In:
Quiñonero-Candela, J., Dagan, I., Magnini, B., d’Alché-Buc, F. (eds.) MLCW 2005.
LNCS (LNAI), vol. 3944, pp. 117–176. Springer, Heidelberg (2006)
12. Garg, A., Jayram, T., Vaithyanathan, S., Zhu, H.: Generalized opinion pooling.
AMAI (2004)
13. Goldbloom, A.: Data prediction competitions - far more than just a bit of fun. In:
IEEE International Conference on Data Mining Workshops (2010)
14. Green, K.: The $1 million Netflix challenge. Technology Review (2006)
15. van Hateren, J.H., van der Schaaf, A.: Independent component filters of natural
images compared with simple cells in primary visual cortex. Proceedings: Biological
Sciences 265(1394), 359–366 (1998)
16. Heskes, T.: Selecting weighting factors in logarithmic opinion pools. In: Advances
in Neural Information Processing Systems 10 (1998)
17. Kahn, J.M.: A generative Bayesian model for aggregating experts’ probabilities.
In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence,
pp. 301–308. AUAI Press (2004)
18. Lay, N., Barbu, A.: Supervised aggregation of classifiers using artificial prediction
markets. In: Proceedings of ICML (2010)
19. Maynard-Reid, P., Chajewska, U.: Aggregating learned probabilistic beliefs. In:
Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence,
pp. 354–361. Morgan Kaufmann Publishers Inc. (2001)
20. Ottaviani, M., Sørensen, P.: Aggregation of information and beliefs in prediction
markets, fRU Working Papers (2007)
21. Pennock, D., Wellman, M.: Representing aggregate belief through the competitive
equilibrium of a securities market. In: Proceedings of the Thirteenth Conference
on Uncertainty in Artificial Intelligence, pp. 392–400 (1997)
22. Bell, R.M., Koren, Y., Volinsky, C.: All together now: A perspective on the NET-
FLIX PRIZE. Chance 24 (2010)
23. Rubinstein, M.: An aggregation theorem for securities markets. Journal of Finan-
cial Economics 1(3), 225–244 (1974)
24. Rubinstein, M.: Securities market efficiency in an Arrow-Debreu economy. Ameri-
can Economic Review 65(5), 812–824 (1975)
25. Rubinstein, M.: The strong case for the generalised logarithmic utility model as
the premier model of financial markets. Journal of Finance 31(2), 551–571 (1976)
26. Storkey, A.: When training and test sets are different: Characterising learning
transfer. In: Lawrence, C.S.S. (ed.) Dataset Shift in Machine Learning, chap. 1,
pp. 3–28. MIT Press (2009). http://mitpress.mit.edu/catalog/item/default.asp?
ttype=2&tid=11755
27. Storkey, A.: Machine learning markets. In: Proceedings of Artificial Intelligence and
Statistics, vol. 15. Journal of Machine Learning Research W&CP (2011). http://
jmlr.csail.mit.edu/proceedings/papers/v15/storkey11a/storkey11a.pdf
28. Storkey, A., Millin, J., Geras, K.: Isoelastic agents and wealth updates in machine
learning markets. In: Proceedings of ICML 2012 (2012)
29. West, M.: Bayesian aggregation. Journal of the Royal Statistical Society 147,
600–607 (1984)
30. Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992).
http://www.sciencedirect.com/science/article/pii/S0893608005800231
Distance and Metric Learning
Higher Order Fused Regularization for
Supervised Learning with Grouped Parameters
1 Introduction
Various regularizers for supervised learning have been proposed, aiming at pre-
venting a model from overfitting and at making estimated parameters more
interpretable [1,3,16,30,31]. Least absolute shrinkage and selection opera-
tor (Lasso) [30] is one of the most well-known regularizers that employs the
1 norm over a parameter vector as a penalty. This penalty enables a sparse esti-
mation of parameters that is robust to noise in situations with high-dimensional
data. However, Lasso does not explicitly consider relationships among param-
eters. Recently, structured regularizers have been proposed to incorporate aux-
iliary information about structures in parameters [3]. For example, the Fused
Lasso proposed in [31] can incorporate the smoothness encoded with a similar-
ity graph defined over the parameters into its penalty.
While such a graph representation is useful to incorporate information about
pairwise interactions of variables (i.e. the second-order information), we often
encounter situations where there exist possibly overlapping groups that consist
of more than two parameters. For example, we might work on parameters that
correspond to words expressing the same meaning, music pieces in the same
genre, and books released in the same year. Based on such auxiliary informa-
tion, we naturally suppose that a group of parameters would provide similar
functionality in a supervised learning problem and thus take similar values.
In this paper, we propose Higher Order Fused (HOF) regularization that
allows us to employ such prior knowledge about the similarity on groups of
parameters as a regularizer. We define the HOF penalty as the Lovász extension
of a submodular higher-order potential function, which encourages parameters in
a group to take similar estimated values when used as a regularizer. Our penalty
has effects not only on such variations of estimated values in a group but also
on supports over the groups. That is, it could detect whether a group is effective
for a problem, and utilize only effective ones by solving the regularized estima-
tion. Moreover, our penalty is robust to noise of the group structure because it
encourages an effective part of parameters within the group to have the same
value and allows the rest of the parameters to have different estimated values.
The HOF penalty is defined as a non-smooth convex function. Therefore, a
forward-backward splitting algorithm [7] can be applied to solve the regularized
problem with the HOF penalty, where the calculation of a proximity operator [22]
is a key for efficiency. Although it is not straightforward to develop an efficient
way of solving the proximity operator for the HOF penalty, due to its inseparable
form, we develop an efficient network flow algorithm based
on [14] for calculating the proximity operator.
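As a reminder of how the pieces fit together, the generic forward-backward splitting iteration used to solve the regularized problem looks as follows. Here prox is a placeholder for prox_{γΩ_ho}, which the paper computes with a parametric network-flow algorithm, and the step size is assumed to be chosen appropriately for the smooth loss; the function names are ours.

import numpy as np

def forward_backward(grad_loss, prox, beta0, step, n_iter=500):
    """Proximal-gradient (forward-backward splitting) iteration:
    a gradient step on the smooth loss, then the proximity operator of the penalty."""
    beta = beta0.copy()
    for _ in range(n_iter):
        beta = prox(beta - step * grad_loss(beta), step)   # forward (gradient) then backward (prox) step
    return beta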
Note that Group Lasso (GL) [34] is also known as a class of regularizers to
use explicitly a group structure of parameters. However, while our HOF penalty
encourages smoothness over parameters in a group, GL encourages parameters
to be sparse in a group-wise manner.
In this paper, we conduct experiments on regression with both synthetic and
real-world data. In the experiments with the synthetic data, we investigate the
comparative performance of our method in two settings of overlapping and non-
overlapping groups. In the experiments with the real-world data, we first test
the predictive performance on the average rating of each item (such as a movie
or book) from the set of users who watched or read the item, given user demographic
groups. We then test the predictive performance on a rating value predicted from
a review text, given semantic and positive-negative word groups.
The rest of this paper is organized as follows. In Section 2, we introduce
regularized supervised learning and the forward-backward splitting algorithm.
In Section 3, we propose the Higher Order Fused regularizer. In Section 4, we derive
an efficient flow algorithm for solving the proximity operator of HOF. In Section
5, we review work related to our method. In Section 6, we conduct experiments
to compare our method with existing regularizers. We conclude this paper and
discuss future work in the final section.
f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T),   (3)

c_0^k, c_1^k ∈ R^d_{≥0}, where c_{0,i}^k and c_{1,i}^k take non-negative values for i ∈ g_k and are 0 otherwise (i ∈ V).   (6)
and hence,
β_{j_i} [ f_ho(U_i) − f_ho(U_{i−1}) ] =
  β_{j_i} c_1^k({j_i})                                   (1 ≤ i < s)
  β_{j_s} [ θ_max^k − (θ_1^k + c_1^k(U_{s−1})) ]          (i = s)
  0                                                       (s < i < t)
  β_{j_t} [ θ_0^k + c_0^k(V \ U_t) − θ_max^k ]             (i = t)
  −β_{j_i} c_0^k({j_i})                                   (t < i ≤ d),   (11)
where c_1^k(U_i) = Σ_{l ∈ {j_1, ··· , j_i}} c_{1,l}^k and c_0^k(V \ U_i) = Σ_{l ∈ {j_{i+1}, ··· , j_d}} c_{0,l}^k. As a
result, we obtain Ω_ho(β) by summing all of these terms, following the definition of the Lovász
extension in Eq. (4).
Although the penalty Ω_ho(β) includes many hyperparameters (such as c_0^k, c_1^k,
θ_0^k, θ_1^k and θ_max^k), it would be convenient to use the same values of θ_0^k, θ_1^k, θ_max^k
for different g_k ∈ G, and constant values for the non-zero elements in c_0^k and c_1^k.
4 Optimization
4.1 Proximity Operator via Minimum-Norm-Point Problem
From the definition, the HOF penalty belongs to the class of lower semicontinuous
convex functions but is non-smooth. To attain a solution of the regularized problem,
we define the proximity operator as:
prox_{γΩ_ho}(β̂) = argmin_{β ∈ R^d} Ω_ho(β) + (1/(2γ)) ‖β̂ − β‖_2^2,   (12)
¹ The higher order potential f_ho(S) can always be transformed by excluding the con-
stant terms θ_0 and θ_1 and by accordingly normalizing c_0 and c_1, respectively.
(Figure: (a) values of the set functions; (b) parameters β_{j_i} sorted in decreasing order. Plot values omitted.)
function under submodular constraints [23] that can be solved via parametric
optimization. We introduce a parameter α ∈ R_{≥0} and define a set function
h_α(S) = h(S) − α 1(S), ∀ S ⊆ V, where 1(S) = Σ_{i∈S} 1. When h is a non-
decreasing submodular function, there exists a set of r + 1 (≤ d) subsets S* =
{S_0 ⊂ S_1 ⊂ · · · ⊂ S_r}, where S_j ⊆ V, S_0 = ∅, and S_r = V, and there
are r + 1 subintervals Q_j of α: Q_0 = [0, α_1), Q_1 = [α_1, α_2), · · · , Q_r = [α_r, ∞),
such that, for each j ∈ {0, 1, · · · , r}, S_j is the unique maximal minimizer of
h_α(S) for all α ∈ Q_j [23]. The optimal minimizer of Eq. (14), t* = (t*_1, t*_2, · · · , t*_d), is
then determined as:
t*_i = [ f_ho(S_{j+1}) − f_ho(S_j) ] / 1(S_{j+1} \ S_j),   ∀ i ∈ (S_{j+1} \ S_j),   j = 0, 1, · · · , r − 1.   (15)
Lemma 2. Set η = max_{i=1,···,d} max{0, h(V \ {i}) − h(V)}; then h + η·1 is a non-decreasing submodular function.
and then apply Lemma 1 to obtain a solution of the original problem. Because
Eq. (16) is a specific form of a min-cut problem, the problem can be solved
efficiently.
Theorem 1. The problem in Eq. (16) is equivalent to a minimum s/t-cut problem
defined as in Figure 2.
Proof. The cost function in Eq. (16) is a sum of modular and submodular
functions, because the higher order potential can be transformed into a second
order submodular function. Therefore, this cost function is an F² energy func-
tion [19] that is known to be "graph-representative". In Figure 2, the groups
of parameters are represented with hyper nodes u^k_1, u^k_0 that correspond to each
group, and with capacities of edges between the hyper nodes and the ordinary nodes v_i ∈ V.
These structures are not employed in [32]. Edges connected to the source and sink nodes
correspond to the input parameters, as in [32]. We can obtain a solution of the s/t min-cut
problem via graph cut algorithms. We employ the efficient parametric flow
algorithm of [14], which runs in O(d|E| log(d²/|E|)) time in the worst case,
where |E| is the number of edges of the graph in Figure 2.
Fig. 2. A minimum s/t-cut problem for Problem (16). Given a graph G = (V, E) for the
HOF penalty, the capacities of the edges are defined as: c(s, u^k_1) = θ^k_max − θ^k_1, c(u^k_1, v_i) = c^k_{1,i},
c(v_i, u^k_0) = c^k_{0,i}, c(u^k_0, t) = θ^k_max − θ^k_0, c(s, v_i) = z_i − (γ − α) if z_i > γ − α, and
c(v_i, t) = (γ − α) − z_i if z_i < γ − α. Nodes u^k_1 and u^k_0, k = (1, ···, K), are hyper nodes
that correspond to the groups, and s, t, and v_i are the source node, sink node, and parameter nodes,
respectively.
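To make the construction concrete, here is a short Python sketch (our own illustration, with hypothetical function and variable names) that builds the graph of Fig. 2 for a single group and solves the minimum s/t-cut with networkx's generic max-flow routine instead of the parametric flow algorithm of [14] used in the paper.

```python
import networkx as nx

def min_cut_for_group(z, c0, c1, theta0, theta1, theta_max, gamma, alpha):
    # Build the s/t graph of Fig. 2 for one group k and solve the minimum cut
    # with an off-the-shelf max-flow routine; z, c0, c1 are length-d sequences.
    G = nx.DiGraph()
    G.add_edge('s', 'u1', capacity=theta_max - theta1)   # c(s, u1^k)
    G.add_edge('u0', 't', capacity=theta_max - theta0)   # c(u0^k, t)
    for i, (zi, c0i, c1i) in enumerate(zip(z, c0, c1)):
        G.add_edge('u1', i, capacity=c1i)                # c(u1^k, v_i)
        G.add_edge(i, 'u0', capacity=c0i)                # c(v_i, u0^k)
        if zi > gamma - alpha:
            G.add_edge('s', i, capacity=zi - (gamma - alpha))
        elif zi < gamma - alpha:
            G.add_edge(i, 't', capacity=(gamma - alpha) - zi)
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, 's', 't')
    return cut_value, source_side
```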
5 Related Work
6 Experiments
In this section, we compared our proposed method with existing methods on lin-
ear regression problems² using both synthetic and real-world data. We employed
ordinary least squares (OLS), Lasso³, Sparse Group Lasso (SGL) [20],
and Generalized Fused Lasso (GFL) as comparison methods. We added the
ℓ1 penalty of Lasso to GFL and to our proposed method by utilizing the property
prox_{Ω_Lasso + Ω} = prox_{Ω_Lasso} ∘ prox_Ω [10]. With GFL, we encoded groups of param-
eters by constructing cliques that connect all pairs of parameters in each group.
² The numbers of variables and features are equal (m = d).
³ We used the MATLAB built-in implementations of OLS and Lasso.
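The composition property prox_{Ω_Lasso + Ω} = prox_{Ω_Lasso} ∘ prox_Ω can be implemented directly, as in the sketch below (our own illustration; the identity map is only a hypothetical placeholder for the HOF prox): the structured prox is applied first and its result is then soft-thresholded.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximity operator of the l1 (Lasso) penalty with threshold t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_lasso_plus_omega(z, gamma, lam, prox_omega):
    # Composition rule: apply the prox of the structured penalty Omega first,
    # then soft-threshold the result (prox of the l1 penalty) [10].
    return soft_threshold(prox_omega(z, gamma), gamma * lam)

# Usage with an identity map as a hypothetical placeholder for the HOF prox.
z = np.array([0.3, -1.2, 0.05, 2.0])
print(prox_lasso_plus_omega(z, gamma=0.5, lam=0.2, prox_omega=lambda v, g: v))
```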
Table 1. Average RMSE and their standard deviations with synthetic data. Hyper-
parameters were selected by 10-fold cross validation. Values in bold typeface are
statistically better (p < 0.01) than those in normal typeface, as indicated by a paired
t-test.
(a) non-overlapping groups
N Proposed SGL GFL Lasso OLS
30 0.58 ± 0.32 174.40 ± 75.60 0.48 ± 0.29 189.90 ± 74.50 208.80 ± 119.00
50 0.56 ± 0.14 119.70 ± 40.80 0.57 ± 0.14 115.30 ± 54.10 260.40 ± 68.80
70 0.40 ± 0.19 128.10 ± 39.90 0.40 ± 0.19 125.00 ± 48.10 313.20 ± 42.40
100 0.47 ± 0.13 120.40 ± 42.00 0.47 ± 0.13 112.80 ± 45.60 177.10 ± 68.90
150 0.51 ± 0.08 106.80 ± 22.00 0.51 ± 0.08 79.40 ± 20.90 1.08 ± 0.13
methods had no ability to fuse parameters. In the second setting with over-
lapping groups, our proposed method showed superior performance to SGL,
GFL, Lasso, and OLS. When N < D, the existing methods suffered from overfitting;
however, our proposed method showed small errors even when N = 30. GFL showed
low performance in this setting because its graph cannot represent the groups.
Examples of parameters estimated in one experiment (N = 30) are shown in
Figures 3 and 4. In this situation (N < D), the number of observations was less
than the number of features; therefore, the parameter estimation problems
became underdetermined systems. From Figure 3, we confirmed that our
proposed method and GFL successfully recovered the true parameters by utiliz-
ing the group structure. From Figure 4, we confirmed that our proposed method
was able to recover the true parameters with overlapping groups. This is because
our proposed method can represent overlapping groups appropriately. GFL produced
an imperfect result because it employs a pairwise representation that
cannot describe groups.
We conducted two settings of experiments with real-world data sets. In the first
setting, we predicted the average rating of each item (movie or book) from the set of
users who watched or read the item. We used publicly available real-world data pro-
vided by MovieLens100k, EachMovie, and Book-Crossing⁴. We utilized group
structures of users, for example, age, gender, occupation, and country, as auxiliary
⁴ http://grouplens.org
[Figure 3: five panels (Proposed, Sparse Group Lasso, Generalized Fused Lasso, Lasso, Ordinary Least Squares) plotting estimated parameter values against the parameter index 1-100.]
Fig. 3. Estimated parameters from synthetic data with five non-overlapping groups.
Circles and blue lines correspond to estimated and true parameter values, respectively.
[Figure 4: five panels (Proposed, Sparse Group Lasso, Generalized Fused Lasso, Lasso, Ordinary Least Squares) plotting estimated parameter values against the parameter index 1-100.]
Fig. 4. Estimated parameters from synthetic data with five overlapping groups. Circles
and blue lines correspond to estimated and true parameter values, respectively.
information. The MovieLens100k data contained movie rating records with three
types of groups: ages, genders, and occupations. The EachMovie data con-
sisted of movie rating records with two types of groups: ages and genders.
We used the 1,000 users who watched movies most frequently. The Book-Crossing data was
made up of book rating records with two types of groups: ages and coun-
tries. We eliminated users and books whose total reading counts were less than 30
from the Book-Crossing data. Summaries of the real-world data are shown in Table 2.
To check the performance of each method, we varied the number of training data
N. c^k_{0,i} and c^k_{1,i} were set to the same value, which was 1.0 if the i-th item belonged
to the k-th group and 0.0 otherwise. In each experiment, the other hyperparameters
were selected by 10-fold cross validation in the same manner as in the previous experi-
ments.
The results are summarized in Table 3. With the MovieLens100k data, our
proposed method showed the best performance for all settings of N because
it was able to utilize the groups as auxiliary information for parameter estimation.
When N = 1600, SGL and GFL also showed competitive performance. With
the EachMovie and Book-Crossing data, we confirmed that our proposed model
showed the best performance. SGL and Lasso showed competitive performance
for some settings of N. With the EachMovie and Book-Crossing data sets, the esti-
mated parameters were mostly sparse; therefore, SGL and Lasso showed compet-
itive performance.
Next, we conducted another type of experiment employing the Yelp data⁵.
The task of this experiment was to predict a rating value from a review text.
We randomly extracted reviews and used the 1,000 most frequently occurring
words, where stop words were eliminated using a list⁶. We employed two
types of groups of words. We obtained 50 semantic groups of words by applying
k-means to semantic vectors of words. The semantic vectors were learned from
the GoogleNews data by word2vec⁷. We also utilized a positive-negative word
dictionary⁸ to construct positive and negative groups of words [29]. Other
settings were the same as for the MovieLens100k data.
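To illustrate how such semantic word groups can be formed, the following sketch clusters word vectors with k-means; it is our own example, and the random vectors are only a hypothetical stand-in for the pre-trained GoogleNews word2vec embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for 1,000 pre-trained 300-dimensional word vectors;
# in practice these would be loaded from the GoogleNews word2vec embeddings.
rng = np.random.default_rng(0)
word_vectors = rng.standard_normal((1000, 300))

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(word_vectors)

# Each semantic group g_k collects the indices of the words in cluster k.
semantic_groups = [np.where(cluster_ids == k)[0] for k in range(50)]
```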
The results are shown in Table 4. We confirmed that our proposed method
showed significant improvements over the other existing methods with the Yelp data.
GFL also showed competitive performance when the number of training data
was N = 1,000. The semantic groups constructed by k-means have no overlap, and
overlap appeared only between the semantic and positive-negative groups. When
⁵ http://www.yelp.com/dataset_challenge
⁶ https://code.google.com/p/stop-words
⁷ https://code.google.com/p/word2vec/
⁸ http://www.lr.pi.titech.ac.jp/~takamura/pndic_en.html
Table 3. Average RMSE and their standard deviations with real-world data sets.
Hyperparameters were selected by 10-fold cross validation. Values in bold typeface
are statistically better (p < 0.01) than those in normal typeface, as indicated by a
paired t-test.
(a) MovieLens100k
N Proposed SGL GFL Lasso OLS
200 0.30 ± 0.02 0.32 ± 0.02 0.33 ± 0.02 1.11 ± 0.71 3.81 ± 1.97
400 0.28 ± 0.02 0.32 ± 0.02 0.33 ± 0.02 0.82 ± 0.31 2718.00 ± 6575.00
800 0.27 ± 0.02 0.31 ± 0.02 0.33 ± 0.02 0.54 ± 0.21 134144.00 ± 370452.00
1200 0.27 ± 0.03 0.32 ± 0.03 0.33 ± 0.03 0.48 ± 0.31 4.19 ± 2.97
1600 0.27 ± 0.07 0.30 ± 0.09 0.31 ± 0.09 0.44 ± 0.45 1.01 ± 0.81
(b) EachMovie
N Proposed SGL GFL Lasso OLS
200 0.86 ± 0.03 0.86 ± 0.02 0.92 ± 0.02 1.24 ± 0.15 2.15 ± 1.17
400 0.83 ± 0.03 0.85 ± 0.02 0.90 ± 0.03 1.17 ± 0.09 3.20 ± 1.66
800 0.81 ± 0.03 0.84 ± 0.02 0.89 ± 0.03 1.09 ± 0.06 14.30 ± 14.70
1200 0.80 ± 0.05 0.84 ± 0.05 0.88 ± 0.05 1.06 ± 0.07 2479.00 ± 9684.00
1500 0.79 ± 0.09 0.83 ± 0.07 0.87 ± 0.09 1.01 ± 0.12 29.90 ± 29.60
(c) Book-Crossing
N Proposed SGL GFL Lasso OLS
200 0.71 ± 0.02 0.73 ± 0.02 0.82 ± 0.02 0.92 ± 0.14 3.98 ± 0.83
400 0.70 ± 0.02 0.72 ± 0.02 0.82 ± 0.02 0.79 ± 0.03 66.60 ± 109.20
800 0.68 ± 0.02 0.70 ± 0.02 0.81 ± 0.02 0.71 ± 0.02 34.00 ± 27.70
1200 0.67 ± 0.04 0.71 ± 0.04 0.82 ± 0.04 0.70 ± 0.03 551.00 ± 1532.00
1700 0.64 ± 0.07 0.68 ± 0.07 0.78 ± 0.08 0.66 ± 0.06 1.18 ± 0.12
Table 4. Linear regression problems with the Yelp data (D = 1,000) and G = 52 groups (50
semantic groups and two positive and negative groups). Means and standard deviations
of the loss on the test data are shown. Parameters were selected by 10-fold cross
validation. Bold typeface indicates a significant difference according to a t-test (p < 0.01).
N Proposed SGL GFL Lasso OLS
1000 1.23 ± 0.02 3.31 ± 0.20 1.24 ± 0.01 1.62 ± 0.11 135.60 ± 211.00
2000 1.20 ± 0.02 1.58 ± 0.05 1.23 ± 0.01 1.27 ± 0.02 1.61 ± 0.06
3000 1.13 ± 0.02 1.34 ± 0.07 1.22 ± 0.01 1.18 ± 0.03 1.35 ± 0.07
5000 1.10 ± 0.02 1.18 ± 0.03 1.22 ± 0.01 1.12 ± 0.02 1.18 ± 0.03
(a) Positive group. (b) Negative group. (c) Positive dominant group. (d) Negative dominant group.
Fig. 5. Estimated parameters of four semantic groups of words. Blue and red cor-
respond to positive and negative estimated parameters, respectively. The sizes of the words
correspond to the absolute values of the estimated parameters.
7 Conclusion
We proposed a structured regularizer named Higher Order Fused (HOF) reg-
ularization in this paper. The HOF regularizer exploits groups of parameters as a
penalty in regularized supervised learning. We defined the HOF penalty as the
Lovász extension of a robust higher order potential, namely the robust P^n poten-
tial. Because the penalty is a non-smooth and non-separable convex function, we
provided the proximity operator of the HOF penalty. We also derived a flow algo-
rithm to calculate the proximity operator efficiently, by showing that the robust
P^n potential is graph-representative. We conducted experiments on linear regres-
sion problems with both synthetic and real-world data and confirmed that our
proposed method showed significantly higher performance than existing struc-
tured regularizers. We also showed that our proposed method can incorporate
groups properly by utilizing the robust higher-order representation.
We provided the proximity operator of the HOF penalty but only applied it
to linear regression problems in this paper. The HOF penalty can also be applied to
other supervised or unsupervised learning problems, including classification and
learning to rank, as well as to other application fields, including signal processing
and relational data analysis.
References
1. Bach, F.R.: Structured sparsity-inducing norms through submodular functions. In:
Proc. of NIPS, pp. 118–126 (2010)
2. Bach, F.R.: Shaping level sets with submodular functions. In: Proc. of NIPS,
pp. 10–18 (2011)
3. Bach, F.R., Jenatton, R., Mairal, J., Obozinski, G.: Structured sparsity through
convex optimization. Statistical Science 27(4), 450–468 (2012)
4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
5. Chaux, C., Combettes, P.L., Pesquet, J.C., Wajs, V.R.: A variational formulation
for frame-based inverse problems. Inverse Problems 23(4), 1495 (2007)
6. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal process-
ing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering,
pp. 185–212. Springer (2011)
7. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward split-
ting. Multiscale Modeling & Simulation 4(4), 1168–1200 (2005)
8. Edmonds, J.: Submodular functions, matroids, and certain polyhedra. In: Combi-
natorial Structures and their Applications, pp. 69–87 (1970)
9. Friedman, J., Hastie, T., Tibshirani, R.: A note on the group lasso and a sparse
group lasso. arXiv preprint arXiv:1001.0736 (2010)
10. Friedman, J., Hastie, T., Höfling, H., Tibshirani, R., et al.: Pathwise coordinate
optimization. The Annals of Applied Statistics 1(2), 302–332 (2007)
11. Fujishige, S.: Submodular functions and optimization, vol. 58. Elsevier (2005)
12. Fujishige, S., Hayashi, T., Isotani, S.: The minimum-norm-point algorithm applied
to submodular function minimization and linear programming. Technical report,
Research Institute for Mathematical Sciences Preprint RIMS-1571, Kyoto Univer-
sity, Kyoto, Japan (2006)
13. Fujishige, S., Patkar, S.B.: Realization of set functions as cut functions of graphs
and hypergraphs. Discrete Mathematics 226(1), 199–210 (2001)
14. Gallo, G., Grigoriadis, M.D., Tarjan, R.E.: A fast parametric maximum flow algo-
rithm and applications. SIAM Journal on Computing 18(1), 30–55 (1989)
15. Jacob, L., Obozinski, G., Vert, J.P.: Group lasso with overlap and graph lasso. In:
Proc. of ICML, pp. 433–440 (2009)
16. Jenatton, R., Audibert, J.Y., Bach, F.: Structured variable selection with sparsity-
inducing norms. The Journal of Machine Learning Research 12, 2777–2824 (2011)
17. Koh, K., Kim, S.J., Boyd, S.P.: An interior-point method for large-scale l1-
regularized logistic regression. Journal of Machine Learning Research 8(8),
1519–1555 (2007)
18. Kohli, P., Ladicky, L., Torr, P.H.S.: Robust higher order potentials for enforcing
label consistency. International Journal of Computer Vision 82(3), 302–324 (2009)
19. Kolmogorov, V., Zabin, R.: What energy functions can be minimized via graph
cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence 26(2),
147–159 (2004)
20. Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections. Arizona
State University (2009)
21. Lovász, L.: Submodular functions and convexity. In: Mathematical Programming
the State of the Art, pp. 235–257. Springer (1983)
22. Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilber-
tien. CR Acad. Sci. Paris Sér. A Math. 255, 2897–2899 (1962)
23. Nagano, K., Kawahara, Y.: Structured convex optimization under submodular con-
straints. In: Proc. of UAI, pp. 459–468 (2013)
24. Nagano, K., Kawahara, Y., Aihara, K.: Size-constrained submodular minimization
through minimum norm base. In: Proc. of ICML, pp. 977–984 (2011)
25. Nesterov, Y.E.: A method of solving a convex programming problem with conver-
gence rate O(1/k2 ). Soviet Mathematics Doklady 27, 372–376 (1983)
26. Nesterov, Y.E.: Smooth minimization of non-smooth functions. Mathematical Pro-
gramming 103(1), 127–152 (2005)
27. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal
algorithms. Physica D: Nonlinear Phenomena 60(1), 259–268 (1992)
28. Suykens, J.A., Vandewalle, J.: Least squares support vector machine classifiers.
Neural Processing Letters 9(3), 293–300 (1999)
29. Takamura, H., Inui, T., Okumura, M.: Extracting semantic orientations of words
using spin model. In: Proc. of ACL, pp. 133–140. Association for Computational
Linguistics (2005)
30. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), 267–288 (1996)
31. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smooth-
ness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 67(1), 91–108 (2005)
32. Xin, B., Kawahara, Y., Wang, Y., Gao, W.: Efficient generalized fused lasso
with its application to the diagnosis of Alzheimer's disease. In: Proc. of AAAI,
pp. 2163–2169 (2014)
33. Yuan, L., Liu, J., Ye, J.: Efficient methods for overlapping group lasso. In: Proc.
of NIPS, pp. 352–360 (2011)
34. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped vari-
ables. Journal of the Royal Statistical Society: Series B (Statistical Methodology)
68(1), 49–67 (2006)
35. Zhang, X., Burger, M., Osher, S.: A unified primal-dual algorithm framework based
on bregman iteration. Journal of Scientific Computing 46(1), 20–46 (2011)
36. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Jour-
nal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2),
301–320 (2005)
Joint Semi-supervised Similarity Learning
for Linear Classification
1 Introduction
Many researchers have used the underlying geometry of the data to improve
classification algorithms, e.g., by learning Mahalanobis distances instead of the
standard Euclidean distance, thus paving the way for a new research area termed
metric learning [5,6]. While most of these studies have based their approaches on
distance learning [3,9,10,22,24], similarity learning has also attracted grow-
ing interest [2,12,16,20], the rationale being that the cosine similarity should
in some cases be preferred over the Euclidean distance. More recently, [1] have
proposed a complete framework to relate similarities with a classification algo-
rithm making use of them. This general framework, which can be applied to any
2 Related Work
optimization problem. The approaches that have received the most attention
in this field involve learning a Mahalanobis distance, defined as d_A(x, x') =
√((x − x')ᵀ A (x − x')), in which the distance is parameterized by the symmetric
and positive semi-definite (PSD) matrix A ∈ R^{d×d}. One nice feature of this type
of approach is its interpretability: the Mahalanobis distance implicitly corre-
sponds to computing the Euclidean distance after linearly projecting the data to
a different (possibly lower-dimensional) feature space. The PSD constraint on A ensures that d_A
is a pseudo-metric. Note that the Mahalanobis distance reduces to the Euclidean
distance when A is set to the identity matrix.
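The projection interpretation can be checked numerically. The following NumPy sketch (ours, not taken from the paper) factors A = LᵀL and verifies that d_A coincides with the Euclidean distance after projecting with L.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = rng.standard_normal((3, d))   # linear projection to a 3-dimensional space
A = L.T @ L                       # a PSD Mahalanobis matrix

x, x2 = rng.standard_normal(d), rng.standard_normal(d)
diff = x - x2
d_A = np.sqrt(diff @ A @ diff)              # Mahalanobis distance
d_proj = np.linalg.norm(L @ x - L @ x2)     # Euclidean distance after projection
assert np.isclose(d_A, d_proj)
```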
Mahalanobis distances were used for the first time in metric learning in [25].
In this study, they aim to learn a PSD matrix A so as to maximize the sum of dis-
tances between dissimilar instances, while keeping the sum of distances between
similar instances small. Eigenvalue decomposition procedures are used to ensure
that the learned matrix is PSD, which makes the computations intractable for
high-dimensional spaces. In this context, LMNN [23,24] is one of the most widely-
used Mahalanobis distance learning methods. The constraints they use are pair-
and triplet-based, derived from each instance’s nearest neighbors. The optimiza-
tion problem they solve is convex and has a special-purpose solver. The algorithm
works well in practice, but is sometimes prone to overfitting due to the absence
of regularization, especially when dealing with high dimensional data. Another
limitation is that enforcing the PSD constraint on A is computationally expen-
sive. One can partly get around this latter shortcoming by making use of specific
solvers or using information-theoretic approaches, such as ITML [9]. This work
was the first one to use LogDet divergence for regularization, and thus pro-
vides an easy and cheap way for ensuring that A is a PSD matrix. However, the
learned metric A strongly depends on the initial value A0 , which is an important
shortcoming, as A0 is handpicked.
The following metric learning methods use a semi-supervised setting, in order
to improve the performance through the use of unlabeled data. LRML [14,15]
learns Mahalanobis distances with manifold regularization using a Laplacian
matrix. Their approach is applied to image retrieval and image clustering. LRML
performs particularly well compared to fully supervised methods when side infor-
mation is scarce. M-DML [28] uses a similar formulation to that of LRML,
with the distinction that the regularization term is a weighted sum using multi-
ple metrics, learned over auxiliary datasets. SERAPH [19] is a semi-supervised
information-theoretic approach that also learns a Mahalanobis distance. The
metric is optimized to maximize the entropy over labeled similar and dissimilar
pairs, and to minimize it over unlabeled data.
However, learning Mahalanobis distances faces two main limitations. The
first one is that enforcing the PSD and symmetry constraints on A, beyond
the cost it induces, often rules out natural similarity functions for the problem
at hand. Secondly, although one can experimentally notice that state-of-the-
art Mahalanobis distance learning methods give better accuracy than using the
Euclidean distance, no theoretical guarantees are provided to establish a link
between the quality of the metric and the behavior of the learned classifier. In
this context, [20,21] propose to learn similarities with theoretical guarantees for
the kNN-based algorithm making use of them, on the basis of the perceptron algo-
rithm presented in [11]. The performance of the classifier obtained is competitive
with those of ITML and LMNN. More recently, [1] introduced a theory for learn-
ing with so-called (ε, γ, τ)-good similarity functions based on non-PSD matrices.
This was a first step toward establishing generalization guarantees for a linear classi-
fier learned with such similarities. Their formulation is equivalent
to a relaxed L1-norm SVM [29]. The main limitation of this approach is, however,
that the similarity function K is predefined, and [1] do not provide any learn-
ing algorithm to design (ε, γ, τ)-good similarities. This problem was addressed
by [4], who optimize the (ε, γ, τ)-goodness of a bilinear similarity function under
Frobenius norm regularization. The learned metric is then used to build a good
global linear classifier. Moreover, their algorithm comes with a uniform stability
proof [8] which allows them to derive a bound on the generalization error of
the associated classifier. However, despite good results in practice, one limita-
tion of this framework is that it requires dealing with strongly convex objective
functions.
Recently, [13] extended the theoretical results of [4]. Using the Rademacher
complexity (instead of the uniform stability) and Khinchin-type inequalities,
they derive generalization bounds for similarity learning formulations that are
regularized w.r.t. more general matrix-norms including the L1 and the mixed
L(2,1) -norms. Moreover, they show that such bounds for the learned similarities
can be used to upper bound the true risk of a linear SVM. The main distinction
between this approach and our work is that we propose a method that jointly
learns the metric and the linear separator at the same time. This allows us to
make use of the semi-supervised setting presented by [1] to learn well with only
a small amount of labeled data.
their own label than to random reasonable examples x of the other label. It
also expresses the tolerated margin violations in an averaged way: this allows
for more flexibility than pair- or triplet-based constraints. The second condition
sets the minimum mass of reasonable points one must consider (greater than
τ ). Notice that no constraint is imposed on the form of the similarity function.
Definition 1 is used to learn a linear separator:
Theorem 1. [1] Let K be an (ε, γ, τ)-good similarity function in hinge loss for a
learning problem P. For any ε₁ > 0 and 0 < δ < γε₁/4, let S = {x_1, x_2, . . . , x_{d_u}}
be a sample of d_u = (2/τ)( log(2/δ) + 16 log(2/δ)/(ε₁γ)² ) landmarks drawn from P. Con-
sider the mapping φ^S : X → R^{d_u}, φ^S_i(x) = K(x, x_i), i ∈ {1, . . . , d_u}. With
probability 1 − δ over the random sample S, the induced distribution φ^S(P) in
R^{d_u} has a separator achieving hinge loss at most ε + ε₁ at margin γ.
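As an illustration of the landmark mapping φ^S used in Theorem 1, the sketch below (our own, with hypothetical names) builds the empirical similarity features on which a linear separator can then be learned; the plain bilinear similarity with A = I is only a placeholder for K.

```python
import numpy as np

def landmark_map(X, landmarks, K):
    # phi^S(x)_i = K(x, x_i) for each landmark x_i, as in Theorem 1.
    return np.array([[K(x, xi) for xi in landmarks] for x in X])

# Hypothetical placeholder similarity: the bilinear K_A(x, x') = x^T A x' with A = I.
K = lambda x, xp: float(x @ xp)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
landmarks = X[rng.choice(100, size=15, replace=False)]   # d_u = 15 landmarks
Phi = landmark_map(X, landmarks, K)                       # shape (100, 15)
```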
We now extend this framework to jointly learn the similarity and the sep-
arator in a semi-supervised way. Let S be a sample set of d_l labeled exam-
ples (x, y) ∈ Z = X × {−1; +1} and d_u unlabeled examples. Furthermore, let
K_A(x, x') be a generic (ε, γ, τ)-good similarity function, parameterized by the
matrix A ∈ R^{d×d}. We assume that K_A(x, x') ∈ [−1; +1] and that ‖x‖_2 ≤ 1,
but all our developments and results can directly be extended to any bounded
similarities and datasets. Our goal here is to find the matrix A and the global
separator α ∈ R^{d_u} that minimize the empirical loss (in our case, the hinge loss)
over a finite sample S, with some guarantees on the generalization error of the
associated classifier. To this end, we propose a generalization of Problem (2)
based on a joint optimization of the metric and the global separator:

min_{α,A}  Σ_{i=1}^{d_l} [ 1 − Σ_{j=1}^{d_u} α_j y_i K_A(x_i, x_j) ]_+  + λ‖A − R‖    (4)
s.t.  Σ_{j=1}^{d_u} |α_j| ≤ 1/γ    (5)
D^k_KL = log(σ_1/σ_2) + (1/2)( σ_1²/σ_2² − σ_2²/σ_1² ) + (μ_2 − μ_1)²/σ_2²,    1 ≤ k ≤ d,

and the matrix R corresponds to diag(D^1_KL, D^2_KL, ···, D^d_KL).
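Since the per-feature divergence above is reconstructed from a garbled source, the sketch below only illustrates one plausible instantiation of building R: it fits class-conditional Gaussians per feature and uses the standard closed-form KL divergence between univariate Gaussians; the function names and this formula choice are our assumptions, not necessarily the authors' exact definition.

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # Standard KL divergence between N(mu1, s1^2) and N(mu2, s2^2),
    # evaluated elementwise over the d features (an assumed instantiation).
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kl_diag_regularizer(X_pos, X_neg):
    # R = diag(D_KL^1, ..., D_KL^d) from per-feature Gaussian fits of the two classes.
    mu1, s1 = X_pos.mean(axis=0), X_pos.std(axis=0) + 1e-8
    mu2, s2 = X_neg.mean(axis=0), X_neg.std(axis=0) + 1e-8
    return np.diag(kl_gauss(mu1, s1, mu2, s2))
```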
Lastly, once A and α have been learned, the associated (binary) classifier
takes the form given in Eq. (3).
Examples: Using some classical norm properties and the Frobenius inner prod-
uct, we can show that:
– The bilinear similarity K^1_A(x, x') = xᵀAx' is (0, 1)-admissible, that is,
|K^1_A(x, x')| ≤ ‖x x'ᵀ‖ · ‖A‖;
Note that we will make use of these two similarity functions K^1_A and K^2_A in our
experiments. For any B, A ∈ R^{n×d} and any matrix norm ‖·‖, its dual norm
‖·‖_* is defined, for any B, by ‖B‖_* = sup_{‖A‖≤1} Tr(BᵀA), where Tr(·) denotes
the trace of a matrix. Denote X_* = sup_{x,x'∈X} ‖x x'ᵀ‖_*.
Let us now rewrite the minimization problem (4) with a more general
notation of the loss function:

min_{α,A}  (1/d_l) Σ_{i=1}^{d_l} ℓ(A, α, z_i = (x_i, y_i)) + λ‖A − R‖,    (7)

s.t.  Σ_{j=1}^{d_u} |α_j| ≤ 1/γ    (8)
the Rademacher average over F, where S = {z_i : i ∈ {1, . . . , n}} are independent
random variables distributed according to some probability measure and {σ_i : i ∈
{1, . . . , n}} are independent Rademacher random variables, that is, Pr(σ_i = 1) =
Pr(σ_i = −1) = 1/2.
The Rademacher average w.r.t. the dual matrix norm is then defined as:

R_{d_l} := E_{S,σ} [ sup_{x̃∈X} ‖ (1/d_l) Σ_{i=1}^{d_l} σ_i y_i x_i x̃ᵀ ‖_* ]
Inequality (10) comes from the 1-Lipschitzness of the hinge loss; Inequality (11)
comes from Constraint (8), ‖A‖ ≤ d and the (β, c)-admissibility of K_A. Applying
McDiarmid's inequality to the term sup_{A,α} [E_S(A, α) − E(A, α)], with proba-
bility 1 − δ, we have:

sup_{A,α} [E_S(A, α) − E(A, α)] ≤ E_S [ sup_{A,α} (E_S(A, α) − E(A, α)) ] + ((β + cX_* d)/γ) √(2 ln(1/δ) / d_l).

In order to bound the gap between the true loss and the empirical loss, we now
need to bound the expectation term of the right-hand side of the above equation.
5 Experiments
The state of the art in metric learning is dominated by algorithms designed
to work in a purely supervised setting. Furthermore, most of them optimize a
metric adapted to kNN classification (e.g. LMNN, ITML), while our work is
designed for finding a global linear separator. For these reasons, it is difficult
to propose a totally fair comparative study. In this section, we first evaluate
the effectiveness of problem (4) with constraints (5) and (6) (JSL, for Joint
Similarity Learning) with different settings. Secondly, we extensively compare
it with state-of-the-art algorithms from different categories (supervised, kNN-
oriented). Lastly, we study the impact of the quantity of available labeled data
on our method. We conduct the experimental study on 7 classic datasets taken
from the UCI Machine Learning Repository1 , both binary and multi-class. Their
characteristics are presented in Table 1. These datasets are widely used for metric
learning evaluation.
Table 2. Average accuracy (%) with confidence interval at 95%, 5 labeled points per
class, 15 unlabeled landmarks.

Sim.    Reg.    Balance    Ionosphere  Iris       Liver      Pima       Sonar      Wine
K^1_A   I-L1    85.2±3.0   85.6±2.4    76.8±3.2   63.3±6.2   71.0±4.1   72.9±3.6   91.9±4.2
        I-L2    85.1±2.9   85.6±2.6    76.8±3.2   63.1±6.3   71.0±4.0   73.2±3.8   91.2±4.5
        KL-L1   84.9±2.9   85.0±2.6    77.3±2.7   63.9±5.5   71.0±4.0   72.9±3.6   90.8±4.7
        KL-L2   85.2±3.0   85.8±3.3    76.8±3.2   62.9±6.4   71.3±4.3   74.2±3.8   90.0±5.4
K^2_A   I-L1    87.2±2.9   87.7±2.6    78.6±4.6   64.7±5.6   75.1±3.5   73.9±5.7   80.8±9.5
        I-L2    86.8±3.0   87.7±2.8    75.9±5.7   64.3±5.4   75.6±3.6   74.8±5.8   80.8±8.6
        KL-L1   87.2±2.9   87.3±2.4    78.6±4.6   62.9±5.6   75.0±3.7   75.5±6.2   79.6±11.8
        KL-L2   87.1±2.7   85.8±3.3    79.1±5.4   64.9±5.9   75.6±3.4   77.1±5.2   79.6±9.7
Table 3. Average accuracy (%) with confidence interval at 95%, all points used as
landmarks.

Sim.    Reg.    Balance    Ionosphere  Iris       Liver      Pima       Sonar      Wine
K^1_A   I-L1    85.8±2.9   88.8±2.5    74.5±3.1   65.5±4.5   71.4±3.8   70.3±6.6   85.8±5.0
        I-L2    85.8±2.9   87.7±2.7    74.5±3.5   64.7±5.5   71.7±4.1   68.7±6.7   84.6±5.5
        KL-L1   85.6±3.1   87.9±3.4    75.0±3.5   65.3±4.9   71.6±4.2   70.3±6.8   85.4±5.3
        KL-L2   85.1±3.1   88.5±3.7    75.9±3.4   65.1±4.8   72.1±4.2   71.9±6.7   86.5±6.0
K^2_A   I-L1    85.9±2.3   90.4±2.2    71.8±6.1   67.3±3.5   73.1±3.5   72.9±4.2   81.5±8.4
        I-L2    86.2±2.5   90.6±2.2    73.2±6.6   68.6±3.3   73.3±3.2   73.2±4.2   82.7±9.0
        KL-L1   85.8±2.6   89.4±2.0    72.7±5.5   67.5±3.8   73.8±3.5   71.0±4.1   80.0±7.4
        KL-L2   85.9±2.4   89.6±2.2    74.5±6.2   68.4±3.6   73.1±3.8   72.3±4.8   80.0±11.5
bilinear (cosine-like) similarities of the form K^1_A(x, x') = xᵀAx' and similarities
derived from the Mahalanobis distance, K^2_A(x, x') = 1 − (x − x')ᵀA(x − x').
For the regularization term, R is either set to the identity matrix (JSL-I), or
to the approximation of the Kullback-Leibler divergence (JSL-KL) discussed in
Section 3. As mentioned above, these two settings correspond to different prior
knowledge one can have on the problem. In both cases, we consider L1 and L2
regularization norms. We thus obtain 8 settings, that we compare in the situation
where few labeled points are available (5 points per class), using a small amount
(15 instances) of unlabeled data or a large amount (the whole training set) of unla-
beled data. The results of the comparisons are reported in Tables 2 and 3.
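For concreteness, a minimal NumPy sketch of the two similarity functions used in the experiments is given below; the identity initialisation of A mirrors the JSL-I setting, and the helper names are ours rather than the authors'.

```python
import numpy as np

def K1(A, x, xp):
    # Bilinear (cosine-like) similarity K_A^1(x, x') = x^T A x'.
    return float(x @ A @ xp)

def K2(A, x, xp):
    # Similarity derived from the Mahalanobis distance:
    # K_A^2(x, x') = 1 - (x - x')^T A (x - x').
    d = x - xp
    return float(1.0 - d @ A @ d)

rng = np.random.default_rng(0)
A = np.eye(4)                       # identity, as in the JSL-I regularization setting
x, xp = rng.standard_normal(4), rng.standard_normal(4)
print(K1(A, x, xp), K2(A, x, xp))
```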
As one can note from Table 2, when only 15 points are used as landmarks, K^2_A
obtains better results in almost all cases, the difference being more impor-
tant on Iris, Pima, and Sonar. The noticeable exception to this better behavior
of K^2_A is Wine, for which cosine-like similarities outperform Mahalanobis-based
similarities by more than 10 points. A similar result was also presented in [21].
The difference between the use of the L1 or L2 norms is not as marked, and there
is no strong preference for one or the other, even though the L2 norm leads to
slightly better results on average than the L1 norm. Regarding the regularization
matrix R, again, the difference is not strongly marked, except maybe on Sonar.
Fig. 1. Average accuracy (%) with confidence interval at 95%, 5 labeled points per
class, 15 unlabeled landmarks.
Table 4. Average accuracy (%) over all datasets with confidence interval at 95%.
Fig. 2. Average accuracy w.r.t. the number of labeled points with 15 landmarks.
methods are tested using the similarity based on the Mahalanobis distance, we
use the Euclidean distance for BBS to ensure fairness.
Figure 1 presents the average accuracy per dataset obtained with 5 labeled
points per class. In this setting, JSL outperforms the other algorithms on 5
out of 7 datasets and has similar performance on another. The exception is
the Wine dataset, where none of the JSL settings yields competitive results. As
stated before, this is easily explained by the fact that cosine-like similarities are better
adapted to this dataset. Even though JSL-15 and JSL-all perform the same
when averaged over all datasets, the difference between them is marked on some
datasets: JSL-15 is considerably better on Iris and Sonar, while JSL-all signifi-
cantly outperforms JSL-15 on Ionosphere and Liver. Averaged over all datasets
(Table 4), JSL obtains the best performance in all configurations with a limited
amount of labeled data, which is precisely the setting that our method is
designed for. The values in bold are significantly better than the rest of their
respective columns, as confirmed by a one-sided Student's t-test for paired samples
with a significance level of 5%.
6 Conclusion
In this paper, we have studied the problem of learning similarities in the situation
where few labeled (and potentially few unlabeled) data are available. To do so,
we have developed a semi-supervised framework, extending the (ε, γ, τ)-good framework
of [1], in which the similarity function and the classifier are learned at the same
time. To our knowledge, this is the first time that such a framework has been provided.
The joint learning of the similarity and the classifier enables one to benefit
from unlabeled data for both the similarity and the classifier. We have also
shown that the proposed method is theoretically well founded, as we derived a
Rademacher-based bound on the generalization error of the learned parameters.
Lastly, the experiments we have conducted on standard metric learning datasets
show that our approach is indeed well suited for learning with few labeled data,
and outperforms state-of-the-art metric learning approaches in that situation.
Acknowledgments. Funding for this project was provided by a grant from Région
Rhône-Alpes.
References
1. Balcan, M.-F., Blum, A., Srebro, N.: Improved guarantees for learning via similarity
functions. In: COLT, pp. 287–298. Omnipress (2008)
2. Bao, J.-P., Shen, J.-Y., Liu, X.-D., Liu, H.-Y.: Quick asymmetric text similarity
measures. ICMLC 1, 374–379 (2003)
3. Baoli, L., Qin, L., Shiwen, Y.: An adaptive k-nearest neighbor text categorization
strategy. ACM TALIP (2004)
4. Bellet, A., Habrard, A., Sebban, M.: Similarity learning for provably accurate
sparse linear classification. In: ICML, pp. 1871–1878 (2012)
5. Bellet, A., Habrard, A., Sebban, M.: A survey on metric learning for feature vectors
and structured data. arXiv preprint arXiv:1306.6709 (2013)
6. Bellet, A., Habrard, A., Sebban, M.: Metric Learning. Synthesis Lectures on Arti-
ficial Intelligence and Machine Learning. Morgan & Claypool Publishers (2015)
7. Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification : a survey of some
recent advances. ESAIM: Probability and Statistics 9, 323–375 (2005)
8. Bousquet, O., Elisseeff, A.: Stability and generalization. JMLR 2, 499–526 (2002)
9. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric
learning. In: ICML, pp. 209–216. ACM, New York (2007)
10. Diligenti, M., Maggini, M., Rigutini, L.: Learning similarities for text documents
using neural networks. In: ANNPR (2003)
11. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algo-
rithm. Machine Learning 37(3), 277–296 (1999)
12. Grabowski, M., Szalas, A.: A Technique for Learning Similarities on Complex
Structures with Applications to Extracting Ontologies. In: Szczepaniak, P.S.,
Kacprzyk, J., Niewiadomski, A. (eds.) AWIC 2005. LNCS (LNAI), vol. 3528,
pp. 183–189. Springer, Heidelberg (2005)
13. Guo, Z.-C., Ying, Y.: Guaranteed classification via regularized similarity learning.
CoRR, abs/1306.3108 (2013)
14. Hoi, S.C.H., Liu, W., Chang, S.-F.: Semi-supervised distance metric learning for
collaborative image retrieval. In: CVPR (2008)
15. Hoi, S.C.H., Liu, W., Chang, S.-F.: Semi-supervised distance metric learning for
collaborative image retrieval and clustering. TOMCCAP 6(3) (2010)
16. Hust, A.: Learning Similarities for Collaborative Information Retrieval. In:
Machine Learning and Interaction for Text-Based Information Retrieval Workshop,
TIR 2004, pp. 43–54 (2004)
17. Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and
Processes. Springer, New York (1991)
18. Nicolae, M.-I., Sebban, M., Habrard, A., Gaussier, É., Amini, M.: Algorithmic
robustness for learning via (ε, γ, τ)-good similarity functions. CoRR, abs/1412.6452
(2014)
19. Niu, G., Dai, B., Yamada, M., Sugiyama, M.: Information-theoretic semi-
supervised metric learning via entropy regularization. In: ICML. Omnipress (2012)
20. Qamar, A.M., Gaussier, É.: Online and batch learning of generalized cosine simi-
larities. In: ICDM, pp. 926–931 (2009)
21. Qamar, A.M., Gaussier, É., Chevallet, J., Lim, J.: Similarity learning for nearest
neighbor classification. In: ICDM, pp. 983–988 (2008)
22. Shalev-Shwartz, S., Singer, Y., Ng, A.Y.: Online and batch learning of pseudo-
metrics. In: ICML. ACM, New York (2004)
23. Weinberger, K., Saul, L.: Fast solvers and efficient implementations for distance
metric learning. In: ICML, pp. 1160–1167. ACM (2008)
24. Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neighbor
classification. JMLR 10, 207–244 (2009)
25. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with
application to clustering with side-information. NIPS 15, 505–512 (2002)
26. Xu, H., Mannor, S.: Robustness and generalization. In: COLT, pp. 503–515 (2010)
27. Xu, H., Mannor, S.: Robustness and generalization. Machine Learning 86(3),
391–423 (2012)
28. Zha, Z.-J., Mei, T., Wang, M., Wang, Z., Hua, X.-S.: Robust distance metric learn-
ing with auxiliary knowledge. In: IJCAI, pp. 1327–1332 (2009)
29. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm support vector machines. In:
NIPS, page 16. MIT Press (2003)
Learning Compact and Effective Distance
Metrics with Diversity Regularization
Pengtao Xie(B)
1 Introduction
In data mining and machine learning, learning a proper distance metric is of vital
importance for many distance based tasks and applications, such as retrieval [22],
clustering [18] and classification [16]. In retrieval, a better distance measure can
help find data entries that are more relevant with the query. In k-means based
clustering, data points can be grouped into more coherent clusters if the distance
metric is properly defined. In k-nearest neighbor (k-NN) based classification, to
find better nearest neighbors, the distances between data samples need to be
appropriately measured. All these tasks rely heavily on a good distance measure.
Distance metric learning (DML) [3,16,18] takes pairs of data points which are
labeled either as similar or dissimilar and learns a distance metric such that
similar data pairs will be placed close to each other while dissimilar pairs will
be separated apart. While formulated in various ways, most DML approaches
choose to learn a Mahalanobis distance (x − y)T M (x − y), where x, y are d-
dimensional feature vectors and M ∈ Rd×d is a positive semidefinite matrix to be
2 Related Works
Many works [3,5,8,16,18,21] have been proposed for distance metric learning.
Please see [15,19] for an extensive review. There are many problem settings
regarding the form of distance supervision, the type of distance metric to be
learned and the learning objective. Among them, the most common setting
[3,5,18] is given data pairs labeled either as similar or dissimilar, learning a
Mahalanobis distance metric, such that similar pairs will be placed close to each
other and dissimilar pairs will be separated apart. As first formulated in [18], a
Mahalanobis distance metric is learned under similarity and dissimilarity con-
straints by minimizing the distances of similar pairs while separating dissimilar
pairs with a certain margin. Guillaumin et al [5] proposed to use logistic discrim-
inant to learn a Mahalanobis metric from a set of labelled data pairs, with the
goal that positive pairs have smaller distances than negative pairs. Kostinger
et al [6] proposed to learn Mahalanobis metrics using likelihood test, which
defines the Mahalanobis matrix to be the difference of covariance matrices of
two Gaussian distributions used for modeling dissimilar pairs and similar pairs
respectively. Ying and Li [21] developed an eigenvalue optimization framework
for learning a Mahalanobis metric, which is shown to be equivalent to minimizing
the maximal eigenvalue of a symmetric matrix.
Some works take other forms of distance supervision such as class labels
[14], rankings [10], triple constraints [13], time series alignments [8] to learn
distance metrics for specific tasks, such as k-nearest neighbor classification [16],
ranking [10], time series aligning [8]. Globerson and Roweis [4] assumed the class
label for each sample is available and proposed to learn a Mahalanobis matrix
for classification by collapsing all samples in the same class to a single point
and pushing samples in other classes infinitely far away. Weinberger et al [16]
proposed to learn distance metrics for k-nearest neighbor classification with the
goal that the k-nearest neighbors always belong to the same class while samples
from different classes are separated by a large margin. This method also requires
the presence of class labels of all samples. Trivedi et al [14] formulated the
problem of metric learning for k nearest neighbor classification as a large margin
structured prediction problem, with a latent variable representing the choice
of neighbors and the task loss directly corresponding to classification error. In
this paper, we focus on the pairwise similarity/dissimilarity constraints which
are considered as the most common and natural supervision of distance metric
learning. Other forms of distance supervision, together with the corresponding
specific-purpose tasks, will be left for future study.
To avoid overfitting, various methods have been proposed to regularize dis-
tance metric learning. Davis et al [3] imposed a regularization that the Maha-
lanobis distance matrix should be close to a prior matrix and the Bregman
divergence is utilized to measure how close two matrices are. Ying and Li
[20] and Niu et al [11] utilized a mixed-norm regularization to encourage the
sparsity of the projection matrix. Qi et al. [12] used ℓ1 regularization to learn
sparse metrics for high dimensional problems with small sample size. Qian et
al [13] applied dropout to regularize distance metric learning. In this paper,
In this section, we begin with reviewing the DML problem proposed in [18] and
reformulate it as a latent space model using ideas introduced in [16]. Then we
present how to diversify DML.
Distance metric learning represents a family of models and has various formula-
tions regarding the distance metric to learn, the form of distance supervision and
how the objective function is defined. Among them, the most popular setting is:
(1) distance metric: Mahalanobis distance (x − y)T M (x − y), where x and y are
feature vectors of two data instances and M is a symmetric and positive semidef-
inite matrix to be learned; (2) the form of distance supervision: pairs of data
instances labeled either as similar or dissimilar; (3) learning objective: to learn a
distance metric to place similar points as close as possible and separate dissimi-
lar points apart. Given a set of pairs labeled as similar S = {(x_i, y_i)}_{i=1}^{|S|} and a
set of pairs labeled as dissimilar D = {(x_i, y_i)}_{i=1}^{|D|}, DML learns a Mahalanobis
To this end, we see that the DML problem can be interpreted as a latent space
modeling problem. The goal is to seek a latent space where the squared Euclidean
distances of similar data pairs are small and those of dissimilar pairs are large.
The latent space is characterized by the projection matrix A.
is defined as arccos( |a_i · a_j| / (‖a_i‖ ‖a_j‖) ). A larger θ(a_i, a_j) indicates that a_i and a_j are
more different from each other. Given the pairwise angles, the diversity measure
Ω(A) is defined as Ω(A) = Ψ (A) − Π(A), where Ψ (A) and Π(A) are the mean
and variance of all pairwise angles. The mean Ψ (A) measures how these factors
are different from each other on the whole and the variance Π(A) is intended
to encourage these factors to be evenly different from each other. The larger
Ω(A) is, the more diverse these latent factors are. And Ω(A) attains the global
maximum when the factors are orthogonal to each other.
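As an illustration (ours, not the authors' code), the diversity measure can be computed directly from the rows of A; the small `eps` guard against zero-norm factors is our own addition.

```python
import numpy as np

def diversity(A, eps=1e-12):
    # Omega(A) = mean - variance of the pairwise angles
    # theta(a_i, a_j) = arccos(|a_i . a_j| / (||a_i|| ||a_j||)) between the rows of A.
    norms = np.linalg.norm(A, axis=1) + eps
    cos = np.abs(A @ A.T) / np.outer(norms, norms)
    iu = np.triu_indices(A.shape[0], k=1)          # all pairs i < j
    angles = np.arccos(np.clip(cos[iu], -1.0, 1.0))
    return angles.mean() - angles.var()
```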
Using this diversity measure to regularize the latent factors, we define a
Diversified DML (DDML) problem as:

min_A  (1/|S|) Σ_{(x,y)∈S} ‖Ax − Ay‖² − λΩ(A)    (3)
s.t.  ‖Ax − Ay‖² ≥ 1, ∀(x, y) ∈ D

where λ > 0 is a tradeoff parameter between the distance loss and the diversity
regularizer. The term −λΩ(A)¹ in the new objective function encourages the
latent factors in A to be diverse. λ plays an important role in balancing the
fitness of A to the distance loss Σ_{(x,y)∈S} ‖Ax − Ay‖² and its diversity. Under
a small λ, A is learned to best minimize the distance loss and its diversity is
ignored. Under a large λ, A is learned with high diversity, but may not be well
fitted to the distance loss and hence may lose the capability to properly measure
distances. A proper λ needs to be chosen to achieve the optimal balance.
3.3 Optimization
In this section, we present an algorithm to solve the problem defined in Eq.(3),
which is summarized in Algorithm 1. First, we adopt a strategy similar to [16]
¹ Note that a negative sign is used here because the overall objective function is to be minimized but Ω(A) is intended to be maximized.
which can be optimized with the subgradient method. Fixing g, the problem defined
over A is

min_A  (1/|S|) Σ_{(x,y)∈S} ‖diag(g)A(x − y)‖² − λΩ(A)
       + (1/|D|) Σ_{(x,y)∈D} max(0, 1 − ‖diag(g)A(x − y)‖²)    (8)
s.t.  ‖A_i‖ = 1, ∀i = 1, ···, k
4 Experiments
4.1 Datasets
Our method DDML contains two key parameters — the number k of latent
factors and the tradeoff parameter λ — both of which were tuned using 5-fold
cross validation. We compared with 6 baseline methods, which were selected
according to their popularity and the state of the art performance. They are: (1)
Euclidean distance (EUC); (2) Distance Metric Learning (DML) [18]; (3) Large
Margin Nearest Neighbor (LMNN) metric learning [16]; (4) Information Theoret-
ical Metric Learning (ITML) [3]; (5) Distance Metric Learning with Eigenvalue
Optimization (DML-eig) [21]; (6) Information-theoretic Semi-supervised Met-
ric Learning via Entropy Regularization (Seraph) [11]. Parameters of the base-
line methods were tuned using 5-fold cross validation. Some methods, such as
ITML, achieve better performance on lower-dimensional representations which
are obtained via Principal Component Analysis. The number of leading principal
components was selected via 5-fold cross validation.
4.3 Retrieval
We first applied the learned distance metrics for retrieval. To evaluate the effec-
tiveness of the learned metrics, we randomly sampled 100K similar pairs and
100K dissimilar pairs from 20-News test set, 50K similar pairs and 50K dissim-
ilar pairs from 15-Scenes test set, 100K similar pairs and 100K dissimilar pairs
from 6-Activities test set and used the learned metrics to judge whether these
pairs were similar or dissimilar. If the distance was below some threshold t,
the pair was regarded as similar; otherwise, the pair was regarded as dissimilar.
We used average precision (AP) to evaluate the retrieval performance.
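As an illustration of this evaluation protocol (our own sketch, not the paper's code), the pairs can be scored by their learned distance and summarized with average precision from scikit-learn; sweeping the threshold t is equivalent to moving along this distance-based ranking. The projection matrix A and the pair containers are assumed names.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def pair_retrieval_ap(A, sim_pairs, dis_pairs):
    # Score each pair by its (negated) learned distance ||Ax - Ay|| and compute
    # average precision for recovering the similar pairs.
    def dists(pairs):
        return np.array([np.linalg.norm(A @ x - A @ y) for x, y in pairs])
    labels = np.concatenate([np.ones(len(sim_pairs)), np.zeros(len(dis_pairs))])
    scores = -np.concatenate([dists(sim_pairs), dists(dis_pairs)])
    return average_precision_score(labels, scores)
```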
Tables 2, 3, and 4 show the average precision under different numbers k of latent
factors on the 20-News, 15-Scenes, and 6-Activities datasets, respectively. As shown in
these tables, DDML with a small k can achieve retrieval precision that is com-
parable to DML with a large k. For example, on the 20-News dataset (Table 2),
with 10 latent factors, DDML is able to achieve a precision of 76.7%, which can-
not be achieved by DML with even 900 latent factors. As another example, on
the 15-Scenes dataset (Table 3), the precision obtained by DDML with k = 10
is 82.4%, which is largely better than the 80.8% precision achieved by DML
with k = 200. Similar behavior is observed on the 6-Activities dataset (Table 4).
This demonstrates that, with diversification, DDML is able to learn a distance
metric that is as effective as (if not more effective than) DML, but is much more
compact than DML. Such a compact distance metric greatly facilitates retrieval
efficiency. Performing retrieval on 10-dimensional latent representations is much
easier than on representations with hundreds of dimensions. It is worth noting
that the retrieval efficiency gain comes without sacrificing the precision, which
allows one to perform fast and accurate retrieval. For DML, increasing k con-
sistently increases the precision, which corroborates that a larger k would make
the distance metric to be more expressive and powerful. However, k cannot be
arbitrarily large, otherwise the distance matrix would have too many parameters
that lead to overfitting. This is evidenced by how the precision of DDML varies
as k increases.
Table 5 presents the comparison with the state of the art distance metric
learning methods. As can be seen from this table, our method achieves the
best performances across all three datasets. The Euclidean distance does not
4.4 Clustering
The second task we study is to apply the learned distance metrics for k-means
clustering, where the number of clusters was set to the number of categories in
each dataset and k-means was run 10 times with random initialization of the
centroids. Following [2], we used two metrics to measure the clustering perfor-
mance: accuracy (AC) and normalized mutual information (NMI). Please refer
to [2] for their definitions.
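A hedged sketch of this evaluation protocol is given below (ours, not the authors' code). It assumes integer class labels coded 0, ..., n_clusters−1 and computes accuracy via the optimal cluster-to-class assignment, which is one common definition consistent with [2], together with NMI from scikit-learn.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_scores(Z, labels, n_clusters, n_runs=10):
    # Run k-means n_runs times on the latent representations Z (e.g. Z = X A^T)
    # and report mean accuracy (best cluster-to-class mapping) and mean NMI.
    # Assumes integer labels coded 0, ..., n_clusters - 1.
    accs, nmis = [], []
    for seed in range(n_runs):
        pred = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit_predict(Z)
        cm = np.zeros((n_clusters, n_clusters), dtype=int)
        for p, t in zip(pred, labels):
            cm[p, t] += 1
        row, col = linear_sum_assignment(-cm)       # optimal cluster-to-class matching
        accs.append(cm[row, col].sum() / len(labels))
        nmis.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(accs)), float(np.mean(nmis))
```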
Tables 6, 8, and 10 show the clustering accuracy on the 20-News, 15-Scenes, and
6-Activities test sets, respectively, under various numbers of latent factors k. Tables 7,
9, and 11 show the normalized mutual information on the 20-News, 15-Scenes, and 6-
Activities test sets, respectively. These tables show that the clustering performance
achieved by DDML under a small k is much better than that of DML under a much
larger k. For instance, DDML can achieve 33.4% accuracy on the 20-News dataset
(Table 6) with 10 latent factors, which is much better than the 28.4% accuracy
obtained by DML with 900 latent factors. As another example, the NMI obtained
by DDML on the 15-Scenes dataset (Table 9) with k = 10 is 46.7%, which
is largely better than the 41.6% NMI achieved by DML with k = 200. This
again corroborates that the diversity regularizer can enable DDML to learn
compact and effective distance metrics, which significantly reduce computational
complexity while preserving the clustering performance.
Tables 12 and 13 present the comparison of DDML with the state-of-the-art
methods on clustering accuracy and normalized mutual information. As can
be seen from these two tables, our method outperforms the baselines in most
cases, except that the accuracy on the 20-News dataset is worse than that of the Seraph
method. Seraph performs very well on the 20-News and 15-Scenes datasets, but its
performance is poor on the 6-Activities dataset. DDML achieves consistently good
performance across all three datasets.
[Figure 1: three panels, each plotting average precision against the tradeoff parameter λ (×10⁻³), with one curve per number of latent factors k.]
Fig. 1. Sensitivity of DDML to the tradeoff parameter λ on (a) 20-News dataset (b)
15-Scenes dataset (c) 6-Activities dataset
[Figure 2: three panels, each plotting average precision against the number of latent factors k, with one curve per tradeoff parameter λ.]
Fig. 2. Sensitivity of DDML to the number of latent factors k on (a) 20-News dataset
(b) 15-Scenes dataset (c) 6-Activities dataset
4.5 Classification
We also apply the learned metrics for k-nearest neighbor classification, which
is another algorithm that largely depends on a good distance measure. For each
test sample, we find its k nearest neighbors in the training set and use the class
labels of the nearest neighbors to classify the test sample. Tables 14, 16, and 18
show the 3-NN classification accuracy on the 20-News, 15-Scenes, and 6-Activities
datasets. Tables 15, 17, and 19 show the 10-NN classification accuracy on the 20-
News, 15-Scenes, and 6-Activities datasets. Similar to retrieval and clustering,
DDML with a small k can achieve classification accuracy that is comparable to
or better than DML with a large k. Tables 20 and 21 present the comparison
of DDML with the state-of-the-art methods on 3-NN and 10-NN classification
accuracy. As can be seen from these two tables, our method outperforms the
baselines in most cases, except that the accuracy on the 20-News dataset is worse
than that of the Seraph method.
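For reference, a minimal sketch of k-NN classification in the learned latent space (with a projection matrix A such that M = AᵀA) might look as follows; this is an illustration under assumed names, not the authors' implementation.

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(A, X_train, y_train, X_test, y_test, k=3):
    # k-NN in the learned latent space: Euclidean distances between projected
    # points A x realize the learned metric with M = A^T A.
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train @ A.T, y_train)
    return clf.score(X_test @ A.T, y_test)
```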
We study the sensitivity of DDML to the two key parameters: the tradeoff parameter
λ and the number of latent factors k. Figure 1 shows how the retrieval average
precision (AP) varies as λ increases on the 20-News, 15-Scenes, and 6-Activities
datasets, respectively. The curves correspond to different values of k. As can be seen from
the figure, initially increasing λ improves AP. The reason is that a larger λ
5 Conclusions
References
1. Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L.: Human activity
recognition on smartphones using a multiclass hardware-friendly support vector
machine. In: Ambient Assisted Living and Home Care, pp. 216–223. Springer (2012)
2. Cai, D., He, X., Han, J.: Locally consistent concept factorization for docu-
ment clustering. IEEE Transactions on Knowledge and Data Engineering 23(6),
902–913 (2011)
3. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric
learning. In: Proceedings of the 24th International Conference on Machine Learn-
ing, pp. 209–216. ACM (2007)
4. Globerson, A., Roweis, S.T.: Metric learning by collapsing classes. In: Advances in
Neural Information Processing Systems, pp. 451–458 (2005)
5. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? metric learning approaches
for face identification. In: IEEE International Conference on Computer Vision,
pp. 498–505. IEEE (2009)
6. Kostinger, M., Hirzer, M., Wohlhart, P., Roth, P.M., Bischof, H.: Large scale metric
learning from equivalence constraints. In: IEEE Conference on Computer Vision
and Pattern Recognition, pp. 2288–2295. IEEE (2012)
7. Kwok, J.T., Adams, R.P.: Priors for diversity in generative latent variable models.
In: Advances in Neural Information Processing Systems, pp. 2996–3004 (2012)
8. Lajugie, R., Garreau, D., Bach, F., Arlot, S.: Metric learning for temporal sequence
alignment. In: Advances in Neural Information Processing Systems, pp. 1817–1825
(2014)
9. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyra-
mid matching for recognizing natural scene categories. In: IEEE Conference on
Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178. IEEE (2006)
10. Lim, D., Lanckriet, G., McFee, B.: Robust structural metric learning. In: Pro-
ceedings of The 30th International Conference on Machine Learning, pp. 615–623
(2013)
11. Niu, G., Dai, B., Yamada, M., Sugiyama, M.: Information-theoretic semisupervised
metric learning via entropy regularization. Neural Computation, 1–46 (2012)
12. Qi, G.-J., Tang, J., Zha, Z.-J., Chua, T.-S., Zhang, H.-J.: An efficient sparse metric
learning in high-dimensional space via l1-penalized log-determinant regularization.
In: Proceedings of the 26th Annual International Conference on Machine Learning,
pp. 841–848. ACM (2009)
13. Qian, Q., Hu, J., Jin, R., Pei, J., Zhu, S.: Distance metric learning using dropout:
a structured regularization approach. In: Proceedings of the 20th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 323–332.
ACM (2014)
14. Trivedi, S., Mcallester, D., Shakhnarovich, G.: Discriminative metric learning by
neighborhood gerrymandering. In: Advances in Neural Information Processing
Systems, pp. 3392–3400 (2014)
15. Wang, F., Sun, J.: Survey on distance metric learning and dimensionality reduction in data mining. Data Mining and Knowledge Discovery, 1–31 (2014)
16. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large mar-
gin nearest neighbor classification. In: Advances in Neural Information Processing
Systems, pp. 1473–1480 (2005)
17. Xie, P., Deng, Y., Xing, E.P.: Diversifying restricted boltzmann machine for docu-
ment modeling. In: ACM SIGKDD Conference on Knowledge Discovery and Data
Mining (2015)
18. Xing, E.P., Jordan, M.I., Russell, S., Ng, A.Y.: Distance metric learning with
application to clustering with side-information. In: Advances in Neural Information
Processing Systems, pp. 505–512 (2002)
19. Liu, Y., Rong, J.: Distance metric learning: A comprehensive survey. Michigan State University, vol. 2 (2006)
20. Ying, Y., Huang, K., Campbell, C.: Sparse metric learning via smooth optimiza-
tion. In: Advances in Neural Information Processing Systems, pp. 2214–2222 (2009)
21. Ying, Y., Li, P.: Distance metric learning with eigenvalue optimization. The Journal
of Machine Learning Research 13(1), 1–26 (2012)
22. Zhang, P., Zhang, W., Li, W.-J., Guo, M.: Supervised hashing with latent factor
models. In: SIGIR (2014)
23. Zou, J.Y., Adams, R.P.: Priors for diversity in generative latent variable models.
In: Advances in Neural Information Processing Systems, pp. 2996–3004 (2012)
Scalable Metric Learning for Co-Embedding
1 Introduction
The goal of metric learning is to learn a distance function that is tuned to a target
task. For example, a useful distance between person images would be significantly
different when the task is pose estimation versus identity verification. Since
many machine learning algorithms rely on distances, metric learning provides
an important alternative to hand-crafting a distance function for specific prob-
lems. For a single modality, metric learning has been well explored (Xing et al.
2002; Globerson and Roweis 2005; Davis et al. 2007; Weinberger and Saul 2008,
2009; Jain et al. 2012). However, for multi-modal data, such as comparing text
and images, metric learning has been less explored, consisting primarily of a slow
semi-definite programming approach (Zhang et al. 2011) and local alternating
descent approaches (Xie and Xing 2013).
Concurrently, there is a growing literature that tackles co-embedding prob-
lems, where multiple sets or modalities are embedded into a common space to
improve prediction performance, reveal relationships and enable zero-shot learn-
ing. Current approaches to these problems are mainly based on deep neural
networks (Ngiam et al. 2011; Srivastava and Salakhutdinov 2012; Socher et al.
2013a, b; Frome et al. 2013) and simpler non-convex objectives (Chopra et al.
2005; Larochelle et al. 2008; Weston et al. 2010; Cheng 2013; Akata et al. 2013).
Unlike metric learning, the focus of this previous work has been on exploring
heterogeneous data, but without global optimization techniques. This disconnect
An erratum to this chapter is available at DOI: 10.1007/978-3-319-23528-8 45
2 Metric Learning
The goal of metric learning is to learn a distance function between data instances
that helps solve prediction problems. To obtain task-specific distances without
extensive manual design, supervised metric learning formulations attempt to
exploit task-specific information to guide the learning process. For example, to
recognize individual people in images a distance function needs to emphasize
certain distinguishing features (such as hair color, etc.), whereas to recognize
person-independent facial expressions in the same data, different features should
be emphasized (such as mouth shape, etc.).
Suppose one has a sample of t observations, xi ∈ X , and a feature map
φ : X → Rn . Then a training matrix φ(X) = [φ(x1 ), . . . , φ(xt )] ∈ Rn×t can be
For example, in large margin nearest neighbor learning, one might want to min-
imize
\[
L\big(\phi(X)^{\top} C\,\phi(X)\big) \;=\; \sum_{(i,j)\in S} d_C(x_i, x_j) \;+\; \sum_{(i,j,k)\in R} \big[\,1 + d_C(x_i, x_j) - d_C(x_i, x_k)\,\big]_{+}
\]
where S is a set of “should link” pairs, and R provides a set of triples (i, j, k)
specifying that if (i, j) ∈ S then xk should have a different label than xi .
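As a concrete illustration (not taken from the text), the following sketch evaluates the large-margin objective above for a given metric matrix C, a set S of should-link index pairs, and a set R of triples (i, j, k); the names are illustrative and C is assumed positive semi-definite.

```python
import numpy as np

def d_C(C, xi, xj):
    """Squared distance under the metric C: (xi - xj)^T C (xi - xj)."""
    diff = xi - xj
    return diff @ C @ diff

def large_margin_loss(C, X, S, R):
    """Pull term over should-link pairs plus hinge term over the triples in R."""
    pull = sum(d_C(C, X[i], X[j]) for (i, j) in S)
    push = sum(max(0.0, 1.0 + d_C(C, X[i], X[j]) - d_C(C, X[i], X[k]))
               for (i, j, k) in R)
    return pull + push
```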
Although supervised metric learning has typically been used for classification,
it can also be applied to other settings where distances between data points are
useful, such as for kernel regression or ranking. Interestingly, the applicability of
metric learning can be extended well beyond the framework (2) by additionally
observing that co-embedding elements from different sets can also be expressed
as a joint metric learning problem.
For co-embedding, assume we are given two sets of data objects X and Y with
feature maps φ(x) ∈ Rn and ψ(y) ∈ Rm respectively. Without loss of generality,
we assume that the number of samples from Y, ty , is no more than t, the number
of samples from X ; that is, ty ≤ t. The goal is to map the elements x ∈ X and
y ∈ Y from each set into a common Euclidean space.³
A standard approach is to consider linear maps into a common d dimensional
space where U ∈ Rd×n and V ∈ Rd×m are parameters. To provide decision thresh-
olds two dummy items can also be embedded from each space, parameterized by
u0 and v0 respectively. Figure 1 depicts this standard co-embedding set-up as
a neural network, where the trainable parameters, U , V , u0 and v0 , are in the
first layer. The inputs to the network are the feature representations φ(x) ∈ Rn
and ψ(y) ∈ Rm . The first hidden layer, the embedding layer, linearly maps input
to embeddings in a common d dimensional space via:
The second hidden layer, the co-embedding layer, computes the distance func-
tion between embeddings, d(x, y), and decision thresholds, t1 (x) and t2 (y):
The output layer nonlinearly combines the association scores and thresholds
to predict targets. For example, in a multi-label classification problem, given
an element x ∈ X , its association to each y ∈ Y can be determined via:
³ The extension to more than two sets can be achieved by considering tensor representations.
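To make the set-up concrete, the sketch below (not from the paper) implements the forward pass just described: linear first-layer maps U and V, dummy embeddings u0 and v0 used for the decision thresholds, and squared Euclidean distances in the co-embedding layer. Since the threshold and output equations are not reproduced above, the exact way t1(x) and t2(y) are formed here is an assumption made for illustration.

```python
import numpy as np

def co_embed_forward(phi_x, psi_y, U, V, u0, v0):
    """First layer: embed both modalities into a common d-dimensional space.
    Second layer: association distance plus two threshold scores."""
    zx = U @ phi_x                        # embedding of x in R^d
    zy = V @ psi_y                        # embedding of y in R^d
    d_xy = np.sum((zx - zy) ** 2)         # association distance d(x, y)
    t1 = np.sum((zx - v0) ** 2)           # threshold t1(x), via the dummy embedding v0
    t2 = np.sum((zy - u0) ** 2)           # threshold t2(y), via the dummy embedding u0
    return d_xy, t1, t2

def associate(phi_x, psi_y, U, V, u0, v0):
    """Illustrative output rule: declare x and y associated if d(x, y) beats its thresholds."""
    d_xy, t1, t2 = co_embed_forward(phi_x, psi_y, U, V, u0, v0)
    return d_xy < min(t1, t2)
```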
Duan et al. (2012) developed a similar algorithm for domain adaptation, which learned a matrix C ⪰ 0 instead of U and V; however, they considered a less general setting that, for example, included neither thresholds nor general losses. Furthermore, their formulation leads to a non-convex optimization problem.
4 Algorithm
Given the formulation (6), we consider how to solve it efficiently. First note that the objective can be written, using L(C) = Σᵢ Lᵢ(f(X,Y)^⊤ C f(X,Y)), as
\[
\min_{C \in \mathbb{R}^{p\times p},\; C \succeq 0} f(C), \qquad \text{where } f(C) = L(C) + \beta\,\operatorname{tr}(C). \tag{7}
\]
Proof. Part 1: First, form the Lagrangian of (7), given by L(C) + β tr(C) − tr(SC) with S ⪰ 0, and consider the KKT conditions. Considering the gradient with respect to u_i yields −S u_i − 2ξ_i u_i = 0, which implies (−S)u_i = 2ξ_i u_i; that is, u_i is an eigenvector of −S corresponding to eigenvalue λ_i = 2ξ_i > 0.
Proof. First assume condition (i) holds and argue by contradiction. Assume QQ^⊤ is not a global optimum of (7), and let u₁ ∈ R^p be as defined in Proposition 1. Then f(QQ^⊤ + βu₁u₁^⊤) < f(QQ^⊤) for a sufficiently small β > 0. Furthermore, since rank(Q) < d, there exists an orthogonal matrix V ∈ R^{d×d} such that QV has a zero column. Let Q_α be the matrix obtained from QV by replacing this zero column by αu₁, with α = √β. Then lim_{α→0} Q_α V^⊤ = QVV^⊤ = Q. Moreover, since u₁ is orthogonal to the columns of Q, it is also orthogonal to the columns of QV, so Q_α V^⊤(Q_α V^⊤)^⊤ = QV(QV)^⊤ + α²u₁u₁^⊤ = QQ^⊤ + βu₁u₁^⊤. Therefore, f(Q_α Q_α^⊤) = f(QQ^⊤ + βu₁u₁^⊤) < f(QQ^⊤) for Q_α ∈ R^{p×d}, hence Q is not a local optimum of f.
Next assume (ii). Since Q is a critical point of f(QQ^⊤), ∇f(QQ^⊤)Q = 0. Since Q has rank p, the null-space of ∇f(QQ^⊤) is of dimension p, yielding that ∇f(QQ^⊤) = 0. Since QQ^⊤ ⪰ 0 and f is convex, C = QQ^⊤ is an optimum of (7).
⁴ One can create problems where adding single columns improves performance, but we observe in our experiments that the proposed approach is more effective in practice.
ciently exploits the local algorithm. The convergence analysis of Zhang et al. (2012) does not include local training. In practice, we find that solely using boosting (with the top eigenvector as the weak learner) without local optimization results in much slower convergence.
Corollary 1 implies that ILA solves (7) when the local optimizer avoids saddle points.
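As an illustration of how such an outer loop might look, the sketch below keeps a factor Q with C = QQ^⊤, calls an abstract local optimizer on Q ↦ f(QQ^⊤), checks the eigenvalue certificate suggested by Proposition 1, and otherwise mixes the top eigenvector u₁ of −∇f(QQ^⊤) into the factor before re-optimizing. This is a sketch under assumptions, not the authors' implementation; in particular, appending u₁ as an extra rescaled column is one concrete choice for combining it with Q.

```python
import numpy as np

def ila(grad_f, local_opt, Q0, a=1.0, b=1.0, max_outer=50, tol=1e-8):
    """Sketch of an ILA-style loop: local low-rank optimization plus eigenvector escape steps.

    grad_f(C)    -- gradient of the convex objective f at C (symmetric p x p matrix)
    local_opt(Q) -- locally minimizes Q -> f(Q Q^T) starting from Q, returning the new Q
    """
    Q = local_opt(Q0)
    for _ in range(max_outer):
        C = Q @ Q.T
        eigvals, eigvecs = np.linalg.eigh(-grad_f(C))   # spectrum of -grad f(C)
        if eigvals[-1] <= tol:                          # no positive eigenvalue left:
            return Q                                    # the KKT certificate holds, stop
        u1 = eigvecs[:, [-1]]                           # top eigenvector of -grad f(C)
        Q_init = np.hstack([np.sqrt(a) * Q, np.sqrt(b) * u1])  # escape direction added
        Q = local_opt(Q_init)                           # warm-started local re-optimization
    return Q
```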
Corollary 2. Suppose the local optimizer always finds a local optimum, where
d is the number of columns in Q. Then ILA stops with a solution to (7) in line 12
with rank(Q)<d or d = p. If, in addition, the local optimizer is nice, this happens
for d > d∗ .
Proof. Let Q_m and U_m denote the matrix Q and U in ILA when line 10 is executed the m-th time, and let Q_init,m denote Q_init obtained from Q_m. Note that Q_init,m = √a Q_m + √b U_m, and Q_{m+1} is obtained from Q_init,m via local optimization in line 12. Furthermore, let C_m = Q_m Q_m^⊤ and C_init,m = Q_init,m Q_init,m^⊤ = a_m C_m + b_m U_m U_m^⊤.
If C_m is not a global optimum of (7), then f(C_init,m) < f(C_m) by Proposition 1. Furthermore, we assume that the local optimizer in line 12 cannot increase the function value f of C_init,m, hence f(C_{m+1}) ≤ f(C_init,m), and consequently f(C_{m+1}) < f(C_m). Note that since L(C_m) ≥ 0, we have ‖Q_m‖²_F = tr(C_m) ≤ f(C_0)/β, thus the entries of C_m are uniformly bounded for all m. Therefore, (C_m)_m has a convergent subsequence; denote its limit point by C̄. We will show that C̄ is an optimal solution of (7) by verifying the KKT conditions (10) with S = ∇f(C̄). First notice that C̄ is positive semi-definite, and ∇f(C̄)C̄ = 0 by continuity, since ∇f(C_m)C_m = ∇f(QQ^⊤)QQ^⊤ = 0. Thus, we only need to verify that ∇f(C̄) is positive semi-definite.
To show the latter, we first apply Lemma 1 (provided in the appendix) to obtain a lower bound on ILA's progress:
\[
\begin{aligned}
f(C_{m+1}) &\le f(C_{\mathrm{init},m+1}) = f(aC_m + bU_m U_m^{\top}) \le f(C_m + \hat b\,U_m U_m^{\top})\\
&\le f(C_m) + \operatorname{tr}\!\big((\hat b\,U_m U_m^{\top})^{\top}\nabla f(C_m)\big) + \tfrac{\nu}{2}\,\rho(\hat b\,U_m U_m^{\top})^2\\
&= f(C_m) + \hat b\,\operatorname{tr}\!\big(U_m^{\top}\nabla f(C_m)\,U_m\big) + \tfrac{\nu \hat b^2}{2} \qquad (11)
\end{aligned}
\]
for any \(\hat b \ge 0\), where the last equality holds since \(U_m U_m^{\top}\) has \(k_m\) eigenvalues equal to 1 and \(p - k_m\) equal to 0, with \(k_m\) the number of columns of \(U_m\). Now consider
\[
\hat b = -\frac{\operatorname{tr}\!\big(U_m^{\top}\nabla f(C_m)\,U_m\big)}{\nu} = \frac{\operatorname{tr}\!\big(U_m\Lambda_m U_m^{\top}\big)}{\nu} = \frac{1}{\nu}\sum_{i=1}^{k_m}\lambda_{m,i},
\]
which gives
\[
f(C_m) - f(C_{m+1}) \ge \frac{\nu}{2}\hat b^2 = \frac{1}{2\nu}\Big(\sum_{i=1}^{k_m}\lambda_{m,i}\Big)^2 \ge \frac{\lambda_{m,1}^2}{2\nu}.
\]
Fig. 2. Comparing the run time in minutes (y-axis) of linear versus exponential strate-
gies in ILA as data dimension (x-axis) is increased. Left shows t = 250, middle shows
t = 1000, and right shows t = 2000.
the rank of the solution, we generated synthetic data X ∈ Rn×t from a standard
normal distribution, systematically increasing the data dimension from n = 1 to
n = 1000 and increasing the sample sizes from t = 250 to t = 2000. The training
objective was set to
the calibrated pairwise label ranking method of Fürnkranz et al. (2008) with
SVM and LOG, respectively; and CC(SMO) and CC(LOG), a chain of SVM
classifiers and a chain of logistic regression classifiers for multi-label classifica-
tion by Read et al. (2011). The results in Tables 2–4 are averaged over 10 splits
and demonstrate comparable performance to the best competitors consistently
in all three criteria for all data sets.
Next, to also investigate the properties of the local optima achieved, we ran local optimization from 1000 random initializations of Q at successive values for d, using β = 1. The values of the local optima we observed are plotted in Figure 3 as a function of d.⁵ As expected, the local optimizer always achieves the globally optimal value when d ≥ d*. Interestingly, for d < d* we see that the initially wide diversity of local optimum values contracts quickly to a singleton, with values approaching the global minimum before reaching d = d*. Although not displayed in the graphs, other useful properties can be observed. First, for d ≥ d*, the global optimum is achieved by local optimization under random initialization, but not with initialization to any of the critical points of smaller
⁵ Note that Q is not unique, since C = QQ^⊤ is invariant to the transform QR for orthonormal R.
Fig. 4. F1 measure achieved by ILA on test data with an increasing number of columns
(optimal rank is 84 in this case).
Fig. 5. Training objectives for β ∈ {0.01, 0.1, 1} as a function of the rank of C, where
the optimal ranks are 105, 84 and 62 respectively.
by the squared distance between the user and tag embeddings, and between the item and tag embeddings: d(x, y, z) := d(x, y) + d(z, y) = ‖σ − τ‖² + ‖ρ − τ‖². Given this definition, tags can be predicted from a given user-item pair (x, z) via
\[
\hat T(x, y, z) = \begin{cases} 1 & \text{if } d(x, y, z) \text{ is among the smallest five } d(x, \cdot, z)\\ -1 & \text{otherwise.}\end{cases}
\]
The training problem can be expressed as metric learning by exploiting a construction reminiscent of Section 3: the embedding vectors can conceptually be stacked in the matrix factor Q = [σ, τ, ρ], which defines the inverse covariance C = QQ^⊤. To learn C, we use the same loss proposed by Rendle and Schmidt-Thieme (2009), regularized by the Frobenius norm over σ, τ and ρ (which again corresponds to trace regularization of C), yielding the convex training problem
\[
\min_{C \succeq 0}\; \beta\,\operatorname{tr}(C) + \sum_{x,z}\;\sum_{y \in \mathrm{tag}(x,z)}\;\sum_{\bar y \notin \mathrm{tag}(x,z)} L\big(d_C(x, z, \bar y) - d_C(x, z, y)\big). \tag{14}
\]
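A minimal sketch of this prediction rule (illustrative only; the embedding matrices and names are assumptions): given the embedding sigma_u of a user, rho_i of an item, and a matrix Tau whose rows are tag embeddings, score every tag and keep the five closest.

```python
import numpy as np

def predict_tags(sigma_u, rho_i, Tau, top=5):
    """Score tags by d(x, y, z) = ||sigma - tau||^2 + ||rho - tau||^2 and
    return the indices of the `top` smallest distances (the predicted tags)."""
    d = np.sum((Tau - sigma_u) ** 2, axis=1) + np.sum((Tau - rho_i) ** 2, axis=1)
    return np.argsort(d)[:top]
```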
the local solutions become even more apparent: interestingly, the local minima
approach the training global minimum at ranks much smaller than the opti-
mum. These results further support the effectiveness of metric learning and the
potential for ILA to solve these problems much more efficiently than standard
semi-definite programming approaches.
8 Conclusion
We have demonstrated a unification of co-embedding and metric learning that
enables a new perspective on several machine learning problems while expand-
ing the range of applicability for metric learning methods. Additionally, by using
recent insights from semi-definite programming theory, we developed a fast local
optimization algorithm that is able to preserve global optimality while signifi-
cantly improving the speed of existing methods. Both the framework and the effi-
cient algorithm were investigated in different contexts, including metric learning,
multi-label classification and multi-relational prediction—demonstrating their
generality. The unified perspective and general algorithm show that a surpris-
ingly large class of problems can be tackled from a simple perspective, while
exhibiting a local-global property that can be usefully exploited to achieve faster
training methods.
References
Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for attribute-
based classification. In: IEEE Conference on Computer Vision and Pattern Recog-
nition (2013)
Bach, F., Mairal, J., Ponce, J.: Convex sparse matrix factorizations. CoRR (2008)
Bordes, A., Glorot, X., Weston, J., Bengio, Y.: Joint learning of words and meaning
representations for open-text semantic parsing. In: Proceedings AISTATS (2012)
Bordes, A., Weston, J., Usunier, N.: Open question answering with weakly supervised
embedding models. In: European Conference on Machine Learning (2014)
Burer, S., Monteiro, R.D.C.: A nonlinear programming algorithm for solving semidefi-
nite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)
Chechik, G., Shalit, U., Sharma, V., Bengio, S.: An online algorithm for large scale
image similarity learning. In: Neural Information Processing Systems (2009)
Cheng, L.: Riemannian similarity learning. In: Internat. Conference on Machine Learn-
ing (2013)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with
application to face verification. In: IEEE Conf. on Computer Vision and Pattern
Recogn. (2005)
Davis, J., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning.
In: International Conference on Machine Learning (2007)
Duan, L., Xu, D., Tsang, I.: Learning with augmented features for heterogeneous
domain adaptation. In: International Conference on Machine Learning (2012)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: A
deep visual-semantic embedding model. In: Neural Information Processing Systems
(2013)
Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via calibrated label ranking. Machine Learning 73(2), 133–153 (2008)
Garreau, D., Lajugie, R., Arlot, S., Bach, F.: Metric learning for temporal sequence
alignment. In: Neural Information Processing Systems (2014)
Globerson, A., Chechik, G., Pereira, F., Tishby, N.: Euclidean embedding of co-
occurrence data. Journal of Machine Learning Research 8, 2265–2295 (2007)
Globerson, A., Roweis, S.T.: Metric learning by collapsing classes. In: NIPS (2005)
Haeffele, B., Vidal, R., Young, E.: Structured low-rank matrix factorization: Optimality,
algorithm, and applications to image processing. In: Proceedings ICML (2014)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference and Prediction, 2nd edn. Springer (2009)
Huang, Z., Wang, R., Shan, S., Chen, X.: Learning euclidean-to-riemannian metric for
point-to-set classification. In: IEEE Conference on Computer Vision and Pattern
Recogn. (2014)
Jain, P., Kulis, B., Davis, J.V., Dhillon, I.S.: Metric and kernel learning using a linear
transformation. Journal of Machine Learning Research 13, 519–547 (2012)
Jäschke, R., Marinho, L.B., Hotho, A., Schmidt-Thieme, L., Stumme, G.: Tag rec-
ommendations in social bookmarking systems. AI Communications 21(4), 231–247
(2008)
Journée, M., Bach, F.R., Absil, P.-A., Sepulchre, R.: Low-rank optimization on the cone
of positive semidefinite matrices. SIAM Journal on Optimization 20(5), 2327–2351
(2010)
Kulis, B.: Metric learning: A survey. Foundat. and Trends in Mach. Learn. 5(4), 287–
364 (2013)
Kulis, B., Saenko, K., Darrell, T.: What you saw is not what you get: Domain adapta-
tion using asymmetric kernel transforms. In: Proceedings CVPR (2011)
Larochelle, H., Erhan, D., Bengio, Y.: Zero-data learning of new tasks. In: AAAI (2008)
Mirzazadeh, F., Guo, Y., Schuurmans, D.: Convex co-embedding. In: AAAI (2014)
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning.
In: International Conference on Machine Learning (2011)
Platt, J.C.: Sequential minimal optimization: A fast algorithm for training support
vector machines. Technical report, Advances in Kernel Methods (1998)
Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classi-
fication. Machine Learning 85(3), 333–359 (2011)
Rendle, S., Schmidt-Thieme, L.: Factor models for tag recommendation in bibsonomy.
In: ECML/PKDD Discovery Challenge (2009)
Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with neural tensor networks for
knowledge base completion. In: Advances in Neural Information Processing Systems
(2013a)
Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-
modal transfer. In: Advances in Neural Information Processing Systems, pp. 935–943
(2013b)
Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep boltzmann machines.
In: Advances in Neural Information Processing Systems (2012)
Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neighbor
classification. Journal of Machine Learning Research 10, 207–244 (2009)
Weinberger, K., Saul, L.K.: Fast solvers and efficient implementations for distance
metric learning. In: International Conference on Machine Learning (2008)
Weston, J., Bengio, S., Usunier, N.: Large scale image annotation: learning to rank
with joint word-image embeddings. Machine Learning 81(1), 21–35 (2010)
Xie, P., Xing, E.: Multi-modal distance metric learning. In: Proceedings IJCAI (2013)
Xing, E., Ng, A., Jordan, M., Russell, S.: Distance metric learning with application to
clustering with side-information. In: Neural Information Processing Systems (2002)
Yamanishi, Y.: Supervised bipartite graph inference. In: Proceedings NIPS (2008)
Zhai, X., Peng, Y., Xiao, J.: Heterogeneous metric learning with joint graph regu-
larization for cross-media retrieval. In: AAAI Conference on Artificial Intelligence
(2013)
Zhang, H., Huang, T.S., Nasrabadi, N.M., Zhang, Y.: Heterogeneous multi-metric
learning for multi-sensor fusion. In: International Conference on Information Fusion
(2011)
Zhang, X., Yu, Y., Schuurmans, D.: Accelerated training for matrix-norm regulariza-
tion: A boosting approach. In: Neural Information Processing Systems (2012)
A An Auxiliary Lemma
Proof. Define h(η) = f(C + ηS) for η ∈ [0, 1]. Note that h(0) = f(C), h(1) = f(C + S), and h′(η) = tr(S^⊤∇f(C + ηS)) for any η ∈ (0, 1). Then
where the first inequality holds by the Cauchy–Schwarz inequality, and the second by the Lipschitz condition on ∇f. Reordering the inequality establishes the lemma.
Large Scale Learning and Big Data
Adaptive Stochastic Primal-Dual Coordinate
Descent for Separable Saddle Point Problems
1 Introduction
where g(x) is a proper convex function, φ*(·) is the convex conjugate of a convex function φ(·), and K ∈ R^{d×q} is a matrix. Many machine learning tasks reduce to solving a problem of this form [3,6]. As a result, this saddle point problem has been widely studied [1,2,4,5,14,16].
One important subclass of the general convex-concave saddle point problem is the case where g(x) or φ*(y) exhibits an additive separable structure. We say φ*(y) is separable when φ*(y) = (1/n) Σ_{i=1}^n φ*_i(y_i), with y_i ∈ R^{q_i} and Σ_{i=1}^n q_i = q. Separability for g(·) is defined likewise. To keep the notation consistent with the machine learning applications discussed later, we introduce a matrix A and let K = (1/n)A. We then partition the matrix A into n column blocks A_i ∈ R^{d×q_i}, i = 1, …, n, so that Ky = (1/n) Σ_{i=1}^n A_i y_i, resulting in a problem of the form
\[
\min_{x\in\mathbb{R}^d}\;\max_{y\in\mathbb{R}^q}\; L(x, y) = g(x) + \frac{1}{n}\sum_{i=1}^{n}\big(\langle x, A_i y_i\rangle - \phi_i^*(y_i)\big) \tag{2}
\]
for φ*(·) separable. We call any problem of the form (1) where g(·) or φ*(·) has separable structure a Separable Convex Concave Saddle Point (Sep-CCSP) problem. Eq. (2) gives the explicit form for when φ*(·) is separable.
In this work, we further assume that each φ*_i(y_i) is γ-strongly convex and g(x) is λ-strongly convex, i.e.,
\[
\phi_i^*(y_i') \ge \phi_i^*(y_i) + \nabla\phi_i^*(y_i)^{T}(y_i' - y_i) + \frac{\gamma}{2}\|y_i' - y_i\|_2^2, \quad \forall\, y_i, y_i' \in \mathbb{R}^{q_i},
\]
\[
g(x') \ge g(x) + \nabla g(x)^{T}(x' - x) + \frac{\lambda}{2}\|x' - x\|_2^2, \quad \forall\, x, x' \in \mathbb{R}^{d},
\]
where we use ∇ to denote both the gradient of a smooth function and a subgradient of a non-smooth function. When strong convexity cannot be satisfied, a small strongly convex perturbation can be added to make the problem satisfy the assumption [15].
One important instantiation of the Sep-CCSP problem in machine learning
is the regularized empirical risk minimization (ERM, [3]) of linear predictors,
\[
\min_{x\in\mathbb{R}^d}\; J(x) = \frac{1}{n}\sum_{i=1}^{n}\phi_i(a_i^{T}x) + g(x). \tag{3}
\]
Comparing with the general form, we note that the matrix Ai in (2) is now a
vector ai . For solving the general saddle point problem (1), many primal-dual
algorithms can be applied, such as [1,2,4,5,16]. In addition, the saddle point
Chambolle and Pock [1] proposed a first-order primal-dual method for the CCSP problem (1). We refer to this algorithm as PDCP. The PDCP update in the (t+1)-th iteration is as follows:
\[
y^{t+1} = \arg\min_{y}\; \phi^*(y) - \langle \bar x^{t}, Ky\rangle + \frac{1}{2\sigma}\|y - y^{t}\|_2^2 \tag{6}
\]
\[
x^{t+1} = \arg\min_{x}\; g(x) + \langle x, Ky^{t+1}\rangle + \frac{1}{2\tau}\|x - x^{t}\|_2^2 \tag{7}
\]
\[
\bar x^{t+1} = x^{t+1} + \theta(x^{t+1} - x^{t}). \tag{8}
\]
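Written with standard proximal operators, one PDCP iteration is short; the sketch below is generic and not tied to this paper, and it assumes that prox_g and prox_phi_conj (the proximal maps of g and φ*) are available in closed form for the problem at hand.

```python
import numpy as np

def pdcp_step(x, x_bar, y, K, prox_phi_conj, prox_g, sigma, tau, theta=1.0):
    """One PDCP iteration: dual step (6), primal step (7), extrapolation (8).

    prox_phi_conj(v, sigma) = argmin_y phi*(y) + ||y - v||^2 / (2 sigma)
    prox_g(v, tau)          = argmin_x g(x)    + ||x - v||^2 / (2 tau)
    """
    y_new = prox_phi_conj(y + sigma * (K.T @ x_bar), sigma)   # Eq. (6), after completing the square
    x_new = prox_g(x - tau * (K @ y_new), tau)                # Eq. (7), after completing the square
    x_bar_new = x_new + theta * (x_new - x)                   # Eq. (8), extrapolation
    return x_new, x_bar_new, y_new
```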
all the dual coordinates in each iteration for the Sep-CCSP problem, which will be computationally intensive for large-scale (high-dimensional) problems.
SPDC [15] can be viewed as a stochastic variant of the batch method PDCP for handling the Sep-CCSP problem. However, SPDC uses a conservative constant stepsize for the primal and dual updates. Neither PDCP nor SPDC considers the structure of the matrix K; both apply a constant stepsize to all coordinates of the primal and dual variables, which can limit their convergence performance in practice.
Based on this observation, we exploit the structure of the matrix K (i.e., (1/n)A) and propose an adaptive stepsize rule for efficiently solving the Sep-CCSP problem. A better linear convergence rate can be obtained when φ*_i(·) and g(·) are strongly convex. Our algorithm is elaborated in the following section.
For those coordinates in blocks not selected, i ∉ S_t, we simply keep y_i^{t+1} = y_i^t. By exploiting the structure of the individual A_i, we configure the stepsize parameter of the proximal term, σ_i, adaptively as
\[
\sigma_i = \frac{1}{2R_i}\sqrt{\frac{n\lambda}{m\gamma}}, \tag{10}
\]
where R_i = ‖A_i‖₂ = √(μ_max(A_i^⊤A_i)), with ‖·‖₂ the spectral norm of a matrix and μ_max(·) denoting the maximum singular value of a matrix.
Our step size is different from the one used in SPDC [15], where R is a constant, R = max{‖a_i‖₂ : i = 1, …, n} (since SPDC only considers the ERM problem, the matrix A_i there is a feature vector a_i).
Remark. Intuitively, R_i in AdaSPDC can be understood as the coupling strength between the i-th dual variable block and the primal variable, measured by the spectral norm of the matrix A_i. A smaller coupling strength allows us to use a larger stepsize for the current dual variable block without worrying too much about its influence on the primal variable, and vice versa. Compared with SPDC, our use of an adaptive coupling strength for the chosen coordinate block directly results in a larger step size and thus helps to improve the convergence speed.
\[
r^{t+1} = r^{t} + \frac{1}{n}\sum_{j\in S_t} A_j\big(y_j^{t+1} - y_j^{t}\big). \tag{13}
\]
– Compared with SPDC, our method uses an adaptive step size to obtain faster convergence (as shown in Theorem 1), while the whole algorithm does not incur any extra computational cost. As demonstrated in the experiments in Section 4, in many cases AdaSPDC provides significantly better performance than SPDC.
– Since, in each iteration, a number of block coordinates can be chosen and updated independently (with independent evaluation of the individual step sizes), the method directly enables parallel processing, and hence use on modern computing clusters. The ability to select an arbitrary number of blocks helps to make use of the available computational resources as effectively as possible.
3: for t = 0, 1, . . . , T − 1 do
4: Randomly pick m dual coordinate blocks from {1, . . . , n} as indices set St , with
the probability of each block being selected equal to m/n.
5: According to the selected subset St , compute the adaptive parameter configura-
tion of σi , τ t and θt using Eq. (10), (12) and (15), respectively.
6: for each selected block in parallel do
7: Update the dual variable block using Eq.(9).
8: end for
9: Update primal variable using Eq.(11).
10: Extrapolate primal variable block using Eq.(14).
11: Update the auxiliary variable r using Eq.(13).
12: end for
where (x*, y*) is the optimal saddle point, ν_i = (1/(4σ_i) + γ)/m, ν′_i = (1/(2σ_i) + γ)/m, and ‖y^T − y*‖²_ν = Σ_{i=1}^n ν_i ‖y_i^T − y_i*‖²₂.
Since the proof of the above is technical, we provide it in the full version of this paper [17].
In our proof, given the proposed parameter θ^t, the critical point for obtaining a sharper linear convergence rate than SPDC is that we configure τ^t and σ_i as in Eq. (12) and Eq. (10) so as to guarantee the positive definiteness of the following matrix in the t-th iteration,
\[
P = \begin{pmatrix} \frac{m}{2\tau^t} I & -A_{S_t} \\ -A_{S_t}^{T} & \frac{1}{2}\,\operatorname{diag}(\sigma_{S_t})^{-1} \end{pmatrix}, \tag{17}
\]
where A_{S_t} = […, A_i, …] ∈ R^{d×mq_i} and diag(σ_{S_t}) = diag(…, σ_i I_{q_i}, …) for i ∈ S_t. However, we found that the parameter configuration guaranteeing the positive definiteness of P is not unique; other valid parameter configurations exist besides the one proposed in this work. We leave further investigation of other potential parameter configurations as future work.
4 Empirical Results
In this section, we apply AdaSPDC to several regularized empirical risk minimization problems. The experiments compare our method, AdaSPDC, with other competitive stochastic optimization methods, including SDCA [13], SAG [12], and SPDC with uniform and non-uniform sampling [15]. In order to provide a fair comparison with these methods, in each iteration only one dual coordinate (or data instance) is chosen, i.e., we run all the methods sequentially. To obtain results that are independent of the practical implementation of the algorithms, we measure performance in terms of objective suboptimality w.r.t. the number of effective passes over the entire data set.
Each experiment is run 10 times and the average results are reported to show statistical consistency. We present all the experimental results we obtained for each application.
n = 1000 i.i.d. training points {a_i, b_i}_{i=1}^n are generated in the following manner, where a ∈ R^d with d = 1000, and the elements of the vector x are all ones. The covariance matrix Σ is set to be diagonal with Σ_jj = j^{-2}, for j = 1, …, d. Ridge regression then solves the following optimization problem,
\[
\min_{x\in\mathbb{R}^d}\; J(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\big(a_i^{T}x - b_i\big)^2 + \frac{\lambda}{2}\|x\|_2^2. \tag{18}
\]
By employing the conjugate dual of the quadratic loss (cf. Eq. (4)), we can reformulate ridge regression as the following Sep-CCSP problem,
\[
\min_{x\in\mathbb{R}^d}\;\max_{y\in\mathbb{R}^n}\; \frac{\lambda}{2}\|x\|_2^2 + \frac{1}{n}\sum_{i=1}^{n}\Big(\langle x, y_i a_i\rangle - \frac{1}{2}y_i^2 - b_i y_i\Big). \tag{19}
\]
It is easy to see that g(x) = (λ/2)‖x‖²₂ is λ-strongly convex, and φ*_i(y_i) = ½y_i² + b_i y_i is 1-strongly convex.
Thus, for ridge regression, the dual update in Eq. (9) and the primal update in Eq. (11) of AdaSPDC have the closed-form solutions
\[
y_i^{t+1} = \frac{1}{1 + 1/\sigma_i}\Big(\langle \bar x^{t}, a_i\rangle - b_i + \frac{1}{\sigma_i}y_i^{t}\Big), \quad \text{if } i \in S_t,
\]
\[
x^{t+1} = \frac{1}{\lambda + 1/\tau^t}\Bigg(\frac{1}{\tau^t}x^{t} - \Big(r^{t} + \frac{1}{m}\sum_{j\in S_t} a_j\big(y_j^{t+1} - y_j^{t}\big)\Big)\Bigg).
\]
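To make these updates concrete, here is a small sketch (not the authors' code) of one epoch of the closed forms above. The per-block stepsizes sigma, the primal stepsize tau and the extrapolation parameter theta are assumed to be supplied, since Eqs. (12) and (15) are not reproduced here, and they are kept fixed inside the loop; the block-wise details of the extrapolation step are likewise simplified to a full-vector update.

```python
import numpy as np

def adaspdc_ridge_epoch(A, b, x, y, r, lam, sigma, tau, theta, m, rng):
    """One pass of the ridge-regression updates above.

    A: (n, d) matrix with rows a_i; b: (n,) targets; y: (n,) dual variables;
    r: running average (1/n) * A.T @ y; sigma: (n,) per-coordinate dual stepsizes.
    """
    n, _ = A.shape
    x_bar = x.copy()
    for _ in range(n // m):
        S = rng.choice(n, size=m, replace=False)               # sampled dual coordinate block
        y_new_S = (A[S] @ x_bar - b[S] + y[S] / sigma[S]) / (1.0 + 1.0 / sigma[S])
        delta = y_new_S - y[S]
        u = r + (A[S].T @ delta) / m                           # corrected estimate of (1/n) A^T y^{t+1}
        x_new = (x / tau - u) / (lam + 1.0 / tau)              # closed-form primal update
        x_bar = x_new + theta * (x_new - x)                    # extrapolation
        r = r + (A[S].T @ delta) / n                           # auxiliary average, Eq. (13)
        y[S], x = y_new_S, x_new
    return x, y, r
```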
[Figure plots: objective suboptimality J(x^t) − J(x^*) versus the number of passes for SDCA, SAG, SPDC, SPDCnonUniform and AdaSPDC; panels (c) and (d) correspond to λ = 10⁻⁵ and λ = 10⁻⁶.]
Fig. 1. Ridge regression with synthetic data: comparison of convergence performance w.r.t. the number of passes. Problem size: d = 1000, n = 1000. We evaluate the convergence performance using the objective suboptimality J(x^t) − J(x^*).
We now compare the performance of our method AdaSPDC with other competitive methods on several real-world data sets. Our experiments focus on freely available benchmark data sets for binary classification, whose detailed information is listed in Table 1. The w8a, covertype and url data are obtained from the LIBSVM collection¹. The quantum and protein data sets are obtained from KDD Cup 2004². For all the datasets, each sample takes the form (a_i, b_i), with a_i the feature vector and b_i the binary label −1 or 1. We add a bias term to the feature vector for all the datasets. We aim to minimize a regularized empirical risk of the following form
\[
J(x) = \frac{1}{n}\sum_{i=1}^{n}\phi_i(a_i^{T}x) + \frac{\lambda}{2}\|x\|_2^2 \tag{20}
\]
¹ http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/binary.html
² http://osmot.cs.cornell.edu/kddcup/datasets.html
[Figure: objective suboptimality J(x^t) − J(x^*) versus the number of passes for SDCA, SAG, SPDC, SPDCnonUniform and AdaSPDC on the w8a, covtype, url, quantum and protein datasets.]
[Figure: objective suboptimality J(x^t) − J(x^*) versus the number of passes for SDCA, SAG, SPDC, SPDCnonUniform and AdaSPDC on the w8a, covertype, url, quantum and protein datasets.]
Logistic loss: φ_i(z) = log(1 + exp(−b_i z)), whose conjugate dual has the form
It is also easy to obtain that φ*_i(y_i) is γ-strongly convex with γ = 4. Note that for the logistic loss, the dual update in Eq. (9) does not have a closed-form solution; we can start from some initial solution and apply several steps of Newton's update to obtain a more accurate solution.
During the experiments, we observe that the performance of SAG is very sensitive to the stepsize choice. To obtain the best results for SAG, we try different stepsizes in the interval [1/(16L), 1/L] and report the best result for each dataset, where L is the Lipschitz constant of φ_i(a_i^⊤x), 1/(16L) is the theoretical stepsize choice for SAG, and 1/L is the suggested empirical choice [12]. For the smooth hinge loss, L = max_i{‖a_i‖², i = 1, …, n}, and for the logistic loss, L = 4 max_i{‖a_i‖², i = 1, …, n}.
Fig. 2 and Fig. 3 depict the performance of the different methods with the smooth hinge loss and the logistic loss, respectively. We compare all these methods for different values of λ ∈ {10⁻⁵, 10⁻⁶, 10⁻⁷}. Generally, our method AdaSPDC performs consistently better than, or at least comparably to, the other methods, and it performs especially well for tasks with a small regularization parameter λ. For some datasets, such as covertype and quantum, SPDC with non-uniform sampling decreases the objective faster than the other methods in early epochs; however, it cannot achieve results comparable to the other methods in later epochs, which might be caused by its conservative stepsize.
References
1. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems
with applications to imaging. Journal of Mathematical Imaging and Vision 40(1),
120–145 (2011)
2. Esser, E., Zhang, X., Chan, T.: A general framework for a class of first order
primal-dual algorithms for convex optimization in imaging science. SIAM Journal
on Imaging Sciences 3(4), 1015–1046 (2010)
3. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning, vol. 2.
Springer (2009)
4. He, B., Yuan, X.: Convergence analysis of primal-dual algorithms for a saddle-point
problem: from contraction perspective. SIAM Journal on Imaging Sciences 5(1),
119–149 (2012)
5. He, Y., Monteiro, R.D.: An accelerated hpe-type algorithm for a class of composite
convex-concave saddle-point problems. Optimization-online preprint (2014)
6. Jacob, L., Obozinski, G., Vert, J.P.: Group lasso with overlap and graph lasso. In:
Proceedings of the 26th Annual International Conference on Machine Learning,
pp. 433–440. ACM (2009)
7. Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87.
Springer (2004)
8. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization
problems. SIAM Journal on Optimization 22(2), 341–362 (2012)
9. Ouyang, Y., Chen, Y., Lan, G., Pasiliao Jr., E.: An accelerated linearized alter-
nating direction method of multipliers. SIAM Journal on Imaging Sciences 8(1),
644–681 (2015)
10. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate
descent methods for minimizing a composite function. Mathematical Programming
144(1–2), 1–38 (2014)
11. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data opti-
mization. Mathematical Programming, 1–52 (2012)
12. Schmidt, M., Roux, N.L., Bach, F.: Minimizing finite sums with the stochastic
average gradient. arXiv preprint arXiv:1309.2388 (2013)
13. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for reg-
ularized loss. The Journal of Machine Learning Research 14(1), 567–599 (2013)
14. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimiza-
tion. submitted to SIAM Journal on Optimization (2008)
15. Zhang, Y., Xiao, L.: Stochastic primal-dual coordinate method for regularized
empirical risk minimization. In: International Conference of Machine Learning
(2015)
16. Zhu, M., Chan, T.: An efficient primal-dual hybrid gradient algorithm for total
variation image restoration. UCLA CAM Report, pp. 08–34 (2008)
17. Zhu, Z., Storkey, A.J.: Adaptive stochastic primal-dual coordinate descent for sep-
arable saddle point problems. arXiv preprint arXiv:1506.04093 (2015)
Hash Function Learning via Codewords
1 Introduction
With the explosive growth of web data including documents, images and videos,
content-based image retrieval (CBIR) has attracted plenty of attention over the
past years [1]. Given a query sample, a typical CBIR scheme retrieves samples
stored in a database that are most similar to the query sample. The similarity
is gauged in terms of a pre-specified distance metric and the retrieved samples
are the nearest neighbors of the query point w.r.t. this metric. However, exhaus-
tively comparing the query sample with every other sample in the database may
be computationally expensive in many current practical settings. Additionally,
most CBIR approaches may be hindered by the sheer size of each sample; for
example, visual descriptors of an image or a video may number in the thousands.
Furthermore, storage of these high-dimensional data also presents a challenge.
Considerable effort has been invested in designing hash functions transform-
ing the original data into compact binary codes to reap the benefits of a poten-
tially fast similarity search; note that hash functions are typically designed to
preserve certain similarity qualities between the data. For example, approximate
nearest neighbors (ANN) search [2] using compact binary codes in Hamming space was shown to achieve sub-linear search time. Storage of the binary codes is, obviously, also much more efficient.
itself apart from past approaches in two major ways. First, it uses a set of Hamming space codewords that are learned during training in order to capture the intrinsic similarities between the data's hash codes, so that same-class data are grouped together. Unlabeled data also contribute to the adjustment of codewords by leveraging the inter-sample dissimilarities of their generated hash codes, as measured by the Hamming metric. Due to these codeword-specific characteristics, a major advantage offered by *SHL is that it can naturally handle supervised, unsupervised and even semi-supervised hash learning tasks using a single formulation. The latter ability readily allows *SHL to perform transductive hash learning.
In Sec. 2, we provide *SHL’s formulation, which is mainly motivated by an
attempt to minimize the within-group Hamming distances in the code space
between a group’s codeword and the hash codes of data. With regards to the
hash functions, *SHL adopts a kernel-based approach. The aforementioned for-
mulation eventually leads to a minimization problem over the codewords as
well as over the Reproducing Kernel Hilbert Space (RKHS) vectors defining
the hash functions. A quite noteworthy aspect of the resulting problem is that
the minimization over the latter parameters leads to a set of Support Vector
Machine (SVM) problems, according to which each SVM generates a single bit
of a sample’s hash code. In lieu of choosing a fixed, arbitrary kernel function, we
use a simple Multiple Kernel Learning (MKL) approach (e.g. see [21]) to infer
a good kernel from the data. We need to note here that Self-Taught Hashing
(STH) [10] also employs SVMs to generate hash codes. However, STH differs
significantly from *SHL; its unsupervised and supervised learning stages are
completely decoupled, while *SHL uses a single cost function that simultane-
ously accommodates both of these learning paradigms. Unlike STH, SVMs arise
naturally from the problem formulation in *SHL.
Next, in Sec. 3, an efficient Majorization-Minimization (MM) algorithm is
showcased that can be used to optimize *SHL’s framework via a Block Coordi-
nate Descent (BCD) approach. The first block optimization amounts to train-
ing a set of SVMs, which can be efficiently accomplished by using, for example,
LIBSVM [22]. The second block optimization step addresses the MKL parameters,
while the third one adjusts the codewords. Both of these steps are computation-
ally fast due to the existence of closed-form solutions.
Finally, in Sec. 5 we demonstrate the capabilities of *SHL on a series of
comparative experiments. The section focuses on supervised hash learning
problems in the context of CBIR, since the majority of hash learning approaches
address this paradigm. We also included some preliminary transductive hash
learning results for *SHL as a proof of concept. Remarkably, when compared
to other hashing methods on supervised learning hash tasks, *SHL exhibits the
best retrieval accuracy for all the datasets we considered. Some clues to *SHL’s
superior performance are provided in Sec. 4.
2 Formulation
In what follows, [·] denotes the Iverson bracket, i.e., [predicate] = 1, if the predi-
cate is true, and [predicate] = 0, if otherwise. Additionally, vectors and matrices
are denoted in boldface. All vectors are considered column vectors and ·^T denotes transposition. Also, for any positive integer K, we define N_K ≜ {1, …, K}.
Central to hash function learning is the design of functions transforming data to compact binary codes in a Hamming space so as to fulfill a given machine learning task. Consider the Hamming space H_B ≜ {−1, 1}^B, which implies B-bit hash codes. *SHL addresses multi-class classification tasks with an arbitrary set X as sample space. It does so by learning a hash function h : X → H_B and a set of G labeled codewords μ_g, g ∈ N_G (each codeword representing a class), so that the hash code of a labeled sample is mapped close to the codeword corresponding to the sample's class label; proximity is measured via the Hamming distance. Unlabeled samples are also able to contribute to learning both the hash function and the codewords, as will be demonstrated in the sequel. Finally, a test sample is classified according to the label of the codeword closest to the sample's hash code.
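A small sketch of this classification rule (illustrative; the array names are assumptions): for ±1 codes the Hamming distance to a codeword equals (B − ⟨h, μ⟩)/2, so choosing the closest codeword amounts to maximizing the inner product.

```python
import numpy as np

def classify_by_codeword(hash_codes, codewords, codeword_labels):
    """Assign each B-bit hash code (entries in {-1, +1}) the label of the codeword
    closest in Hamming distance; ties resolve to the first maximizer."""
    scores = hash_codes @ codewords.T          # (N, G) inner products; larger = closer
    return codeword_labels[np.argmax(scores, axis=1)]
```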
In *SHL, the hash code for a sample x ∈ X is eventually computed as h(x) ≜ sgn(f(x)) ∈ H_B, where the signum function is applied component-wise. Furthermore, f(x) ≜ [f_1(x) … f_B(x)]^T, where f_b(x) ≜ ⟨w_b, φ_b(x)⟩_{H_b} + β_b with w_b ∈ Ω_{w_b} ≜ {w_b ∈ H_b : ‖w_b‖_{H_b} ≤ R_b}, R_b > 0, and β_b ∈ R for all b ∈ N_B. In the previous definition, H_b is a RKHS with inner product ⟨·,·⟩_{H_b}, induced norm ‖w_b‖_{H_b} ≜ √(⟨w_b, w_b⟩_{H_b}) for all w_b ∈ H_b, associated feature mapping φ_b : X → H_b and reproducing kernel k_b : X × X → R, such that k_b(x, x′) = ⟨φ_b(x), φ_b(x′)⟩_{H_b} for all x, x′ ∈ X. Instead of a priori selecting the kernel functions k_b, MKL [21] is employed to infer the feature mapping for each bit from the available data. Specifically, it is assumed that each RKHS H_b is formed as the direct sum of M common, pre-specified RKHSs H_m, i.e., H_b = ⊕_m θ_{b,m} H_m, where θ_b ≜ [θ_{b,1} … θ_{b,M}]^T ∈ Ω_θ ≜ {θ ∈ R^M : θ ⪰ 0, ‖θ‖_p ≤ 1, p ≥ 1}, ⪰ denotes the component-wise ≥ relation, ‖·‖_p is the usual l_p norm in R^M, and m ranges over N_M. Note that, if each preselected RKHS H_m has associated kernel function k_m, then it holds that k_b(x, x′) = Σ_m θ_{b,m} k_m(x, x′) for all x, x′ ∈ X.
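The per-bit kernel combination and the sign-based encoding translate directly into code. The sketch below is illustrative rather than the authors' implementation; the base kernels are assumed to be given as callables, and the responses f_b(x) are assumed to have been computed already.

```python
import numpy as np

def combined_kernel(theta_b, base_kernels, X1, X2):
    """k_b(x, x') = sum_m theta_{b,m} k_m(x, x') for one bit b.

    base_kernels: list of M callables k_m(X1, X2) -> Gram matrix; theta_b: (M,) weights.
    """
    return sum(t * k(X1, X2) for t, k in zip(theta_b, base_kernels))

def hash_codes(F):
    """h(x) = sgn(f(x)) applied component-wise; F is (N, B) with entries f_b(x_n)."""
    H = np.sign(F)
    H[H == 0] = 1          # map the boundary case f_b(x) = 0 to +1
    return H.astype(int)
```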
Now, assume a training set of size N consisting of labeled and unlabeled samples, and let N_L and N_U be the index sets of these two subsets, respectively. Let also l_n, for n ∈ N_L, be the class label of the n-th labeled sample. By adjusting its parameters, which are collectively denoted as ω, *SHL attempts to reduce the distortion measure
where
\[
\gamma_{g,n} \triangleq \begin{cases} [g = l_n] & n \in N_L \\ \big[g = \arg\min_{g'} \bar d\big(f(x_n), \mu_{g'}\big)\big] & n \in N_U \end{cases} \tag{3}
\]
It turns out that Ē, which constitutes the model's loss function, can be efficiently minimized by a three-step algorithm, which is delineated in the next section.
3 Learning Algorithm
The next proposition allows us to minimize Ē as defined in Eq. (2) via a MM
approach [23], [24].
Proposition 1. For any *SHL parameter values ω and ω′, it holds that
\[
\bar E(\omega) \le \bar E(\omega\,|\,\omega') \triangleq \sum_{g}\sum_{n}\gamma_{g,n}\,\bar d\big(f(x_n), \mu_g\big) \tag{4}
\]
\[
\gamma_{g,n} \triangleq \begin{cases} [g = l_n] & n \in N_L \\ \big[g = \arg\min_{g'} \bar d\big(f(x_n), \mu_{g'}\big)\big] & n \in N_U \end{cases} \tag{5}
\]
\[
\cdots \;+\; \frac{1}{2}\sum_{m}\frac{\|w_{b,m}\|_{H_m}^2}{\theta_{b,m}}, \qquad b \in N_B \tag{6}
\]
Proof. After eliminating the hinge function in Prob. (6) with the help of slack variables ξ_{g,n}^b, we obtain the following problem for the first block minimization:
\[
\min_{w_{b,m},\,\beta_b,\,\xi_{g,n}^{b}}\; C\sum_{g}\sum_{n}\gamma_{g,n}\,\xi_{g,n}^{b} + \frac{1}{2}\sum_{m}\frac{\|w_{b,m}\|_{H_m}^2}{\theta_{b,m}}
\]
\[
\text{s.t.}\quad \xi_{g,n}^{b} \ge 0,\qquad \xi_{g,n}^{b} \ge 1 - \Big(\sum_{m}\langle w_{b,m}, \phi_m(x_n)\rangle_{H_m} + \beta_b\Big)\mu_{g,b} \tag{8}
\]
where n is the training sample index. By defining ξ_b ∈ R^{NG} to be the vector containing all the ξ_{g,n}^b's, η_b ≜ [η_{b,1}, η_{b,2}, …, η_{b,N}]^T ∈ R^N and μ_b ≜ [μ_{1,b}, μ_{2,b}, …, μ_{G,b}]^T ∈ R^G, the vectorized version of Prob. (8), in light of Eq. (9), becomes
\[
\min_{\eta_b,\,\xi_b,\,\beta_b}\; C\gamma^{T}\xi_b + \frac{1}{2}\eta_b^{T}K_b\eta_b
\]
\[
\text{s.t.}\quad \xi_b \succeq 0,\qquad \xi_b \succeq 1_{NG} - (\mu_b \otimes K_b)\eta_b - (\mu_b \otimes 1_N)\beta_b \tag{10}
\]
where γ and K_b are defined in Prop. 3. From the previous problem's Lagrangian L, one obtains
\[
\frac{\partial L}{\partial \xi_b} = 0 \;\Rightarrow\; \lambda_b = C\gamma - \alpha_b,\qquad 0 \preceq \alpha_b \preceq C\gamma \tag{11}
\]
\[
\frac{\partial L}{\partial \beta_b} = 0 \;\Rightarrow\; \alpha_b^{T}(\mu_b \otimes 1_N) = 0 \tag{12}
\]
\[
\frac{\partial L}{\partial \eta_b} = 0 \;\overset{\exists K_b^{-1}}{\Rightarrow}\; \eta_b = K_b^{-1}(\mu_b \otimes K_b)^{T}\alpha_b \tag{13}
\]
where α_b and λ_b are the dual variables for the two constraints in Prob. (10). Utilizing Eq. (11), Eq. (12) and Eq. (13), the quadratic term of the dual problem becomes
\[
(\mu_b \otimes K_b)K_b^{-1}(\mu_b \otimes K_b)^{T}
= (\mu_b \otimes K_b)(1 \otimes K_b^{-1})(\mu_b \otimes K_b)^{T}
= (\mu_b \otimes I_{N\times N})(\mu_b^{T} \otimes K_b)
= (\mu_b \mu_b^{T}) \otimes K_b \tag{14}
\]
and
\[
\begin{aligned}
(\mu_b \mu_b^{T}) \otimes K_b
&= \big[(\operatorname{diag}(\mu_b)\,1_G)(\operatorname{diag}(\mu_b)\,1_G)^{T}\big] \otimes K_b\\
&= \big[\operatorname{diag}(\mu_b)\,(1_G 1_G^{T})\,\operatorname{diag}(\mu_b)\big] \otimes \big[I_N K_b I_N\big]\\
&= \big[\operatorname{diag}(\mu_b) \otimes I_N\big]\big[(1_G 1_G^{T}) \otimes K_b\big]\big[\operatorname{diag}(\mu_b) \otimes I_N\big]\\
&= \big[\operatorname{diag}(\mu_b \otimes 1_N)\big]\big[(1_G 1_G^{T}) \otimes K_b\big]\big[\operatorname{diag}(\mu_b \otimes 1_N)\big]\\
&= D_b\big[(1_G 1_G^{T}) \otimes K_b\big]D_b \tag{15}
\end{aligned}
\]
The first equality stems from the identity diag(v) 1 = v for any vector v, while the third one stems from the mixed-product property of the Kronecker product. Also, the identity diag(v ⊗ 1) = diag(v) ⊗ I yields the fourth equality. Note that D_b is defined as in Prop. 3. Taking into account Eq. (14) and Eq. (15), we reach the dual form stated in Prop. 3.
Given that γ_{g,n} ∈ {0, 1}, one can now easily recognize that Prob. (7) is an SVM training problem, which can be conveniently solved using software packages such as LIBSVM. After solving it, one can compute the quantities ⟨w_{b,m}, φ_m(x)⟩_{H_m}, β_b and ‖w_{b,m}‖²_{H_m}, which are required in the next step.
Second block minimization: Having optimized over the SVM parameters, one can now optimize the cost function of Prob. (6) with respect to the MKL parameters θ_b as a single block, using the closed-form solution mentioned in Prop.
\[
\theta_{b,m} = \frac{\|w_{b,m}\|_{H_m}^{2/(p+1)}}{\Big(\sum_{m'}\|w_{b,m'}\|_{H_{m'}}^{2p/(p+1)}\Big)^{1/p}}, \qquad m \in N_M,\; b \in N_B. \tag{16}
\]
Third block minimization: Finally, one can now optimize the cost function of Prob. (6) with respect to the codewords by mere substitution, as shown below.
¹ A MATLAB implementation of our framework is available at https://github.com/yinjiehuang/StarSHL
\[
\hat{\mathfrak{R}}_N(\Psi \circ \mathcal{F}) \le L\,\hat{\mathfrak{R}}_N(\mathcal{F}) \tag{18}
\]
where ∘ stands for function composition, \(\hat{\mathfrak{R}}_N(\mathcal{G}) \triangleq \frac{1}{N}\mathbb{E}_{\sigma}\big[\sup_{g\in\mathcal{G}}\sum_{n}\sigma_n g(x_n, l_n)\big]\) is the empirical Rademacher complexity of a set G of functions, {x_n, l_n} are i.i.d. samples, and the σ_n are i.i.d. random variables taking values ±1 with Pr{σ_n = ±1} = 1/2.
To show the main theoretical result of our paper with the help of the previous
lemma, we will consider the sets of functions
{μ_l}_{l=1}^{G}, μ_l ∈ H_B, and any δ > 0, with probability 1 − δ it holds that:
\[
er(f, \mu_l) \le \hat{er}(f, \mu_l) + \frac{2r}{\rho B\sqrt{N}}\sum_{b} R_b + \sqrt{\frac{\log\frac{1}{\delta}}{2N}} \tag{21}
\]
Then, from Theorem 3.1 of [27] and Eq. (22), ∀ψ ∈ Ψ, ∃δ > 0, with probability at least 1 − δ, we have:
\[
er(f, \mu_l) \le \hat{er}(f, \mu_l) + 2\mathfrak{R}_N(\Psi) + \sqrt{\frac{\log\frac{1}{\delta}}{2N}} \tag{23}
\]
where \(\mathfrak{R}_N(\Psi)\) is the Rademacher complexity of Ψ. From Lemma 1, the following inequality between empirical Rademacher complexities is obtained
\[
\hat{\mathfrak{R}}_N(\Psi) \le \frac{1}{B\rho}\,\hat{\mathfrak{R}}_N\big(\|\bar{\mathcal{F}}_{\mu}\|_1\big) \tag{24}
\]
where \(\bar{\mathcal{F}}_{\mu} \triangleq \{(x, l) \mapsto [f_1(x)\mu_{l,1}, \dots, f_B(x)\mu_{l,B}]^{T},\, f \in \bar{\mathcal{F}}\text{ and }\mu_{l,b} \in \{\pm 1\}\}\). The right side of Eq. (24) can be upper-bounded as follows
\[
\begin{aligned}
\hat{\mathfrak{R}}_N\big(\|\bar{\mathcal{F}}_{\mu}\|_1\big)
&= \frac{1}{N}\mathbb{E}_{\sigma}\sup_{f\in\bar{\mathcal{F}},\,\{\mu_{l_n}\}\in\mathbb{H}_B}\sum_{n}\sigma_n\sum_{b}\big|\mu_{l_n,b}f_b(x_n)\big|\\
&= \frac{1}{N}\mathbb{E}_{\sigma}\sup_{f\in\bar{\mathcal{F}}}\sum_{n}\sigma_n\sum_{b}\big|f_b(x_n)\big|\\
&= \frac{1}{N}\mathbb{E}_{\sigma}\sup_{w_b\in H_b,\,\|w_b\|_{H_b}\le R_b,\,|\beta_b|\le M_b}\sum_{n}\sigma_n\sum_{b}\big|\langle w_b, \phi_b(x_n)\rangle_{H_b} + \beta_b\big|\\
&= \frac{1}{N}\mathbb{E}_{\sigma}\sup_{w_b\in H_b,\,\|w_b\|_{H_b}\le R_b,\,|\beta_b|\le M_b}\sum_{n}\sigma_n\sum_{b}\big|\langle w_b, \operatorname{sgn}(\beta_b)\phi_b(x_n)\rangle_{H_b} + |\beta_b|\big|\\
&= \frac{1}{N}\mathbb{E}_{\sigma}\sup_{|\beta_b|\le M_b}\sum_{b}\Big[R_b\sqrt{\sigma^{T}K_b\sigma} + |\beta_b|\sum_{n}\sigma_n\Big]\\
&= \frac{1}{N}\mathbb{E}_{\sigma}\sum_{b}R_b\sqrt{\sigma^{T}K_b\sigma}
\;\overset{\text{Jensen's Ineq.}}{\le}\; \frac{1}{N}\sum_{b}R_b\sqrt{\mathbb{E}_{\sigma}\{\sigma^{T}K_b\sigma\}}\\
&= \frac{1}{N}\sum_{b}R_b\sqrt{\operatorname{trace}\{K_b\}} \;\le\; \frac{r}{\sqrt{N}}\sum_{b}R_b \qquad (25)
\end{aligned}
\]
so that
\[
\mathfrak{R}_N(\Psi) \le \frac{r}{\rho B\sqrt{N}}\sum_{b}R_b \tag{26}
\]
The final result is obtained by combining Eq. (23) and Eq. (26).
It can be observed that, minimizing the loss function of Prob. (6), in essence,
also reduces the bound of Eq. (21). This tends to cluster same-class hash codes
around the correct codeword. Since samples are classified according to the label
of the codeword that is closest to the sample’s hash code, this process may lead
to good recognition rates, especially when the number of samples N is high, in
which case the bound becomes tighter.
5 Experiments
5.1 Supervised Hash Learning Results
In this section, we compare *SHL to other state-of-the-art hashing algorithms:
Kernel Supervised Learning (KSH) [15], Binary Reconstructive Embedding
(BRE) [6], single-layer Anchor Graph Hashing (1-AGH) and its two-layer ver-
sion (2-AGH) [17], Spectral Hashing (SPH) [16] and Locality-Sensitive Hashing
(LSH) [3].
Five datasets were considered: Pendigits and USPS from the UCI Repository,
as well as Mnist, PASCAL07 and CIFAR-10. For Pendigits (10, 992 samples, 256
features, 10 classes), we randomly chose 3, 000 samples for training and the rest
for testing; for USPS (9, 298 samples, 256 features, 10 classes), 3000 were used for
training and the remaining for testing; for Mnist (70, 000 samples, 784 features,
10 classes), 10, 000 for training and 60, 000 for testing; for CIFAR-10 (60, 000
samples, 1, 024 features, 10 classes), 10, 000 for training and the rest for testing;
finally, for PASCAL07 (6878 samples, 1, 024 features after down-sampling the
images, 10 classes), 3, 000 for training and the rest for testing.
For all the algorithms used, average performances over 5 runs are reported
in terms of the following two criteria: (i) retrieval precision of s-closest hash
codes of training samples; we used s = {10, 15, . . . , 50}. (ii) Precision-Recall
(PR) curve, where retrieval precision and recall are computed for hash codes
within a Hamming radius of r ∈ NB .
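Criterion (i) is straightforward to compute from the hash codes alone; a minimal sketch (with illustrative names) is given below.

```python
import numpy as np

def top_s_precision(query_codes, query_labels, db_codes, db_labels, s=10):
    """Mean fraction of same-class items among the s database codes closest in
    Hamming distance to each query code; codes are assumed to have +/-1 entries."""
    sims = query_codes @ db_codes.T                    # larger inner product = smaller Hamming distance
    top = np.argsort(-sims, axis=1)[:, :s]             # indices of the s nearest database codes
    hits = db_labels[top] == query_labels[:, None]     # same-class indicator matrix
    return hits.mean()
```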
[Plots: top-s retrieval precision (s = 10) versus the number of bits, top-s retrieval precision versus the number of top s, and the Precision-Recall curve, for *SHL, KSH, LSH, SPH, BRE, 1-AGH and 2-AGH.]
Fig. 1. The top s retrieval results and Precision-Recall curve on Pendigits dataset over
*SHL and 6 other hashing algorithms. (view in color)
The following *SHL settings were used: SVM's parameter C was set to 1000; for MKL, 11 kernels were considered: 1 normalized linear kernel, 1 normalized polynomial kernel and 9 Gaussian kernels. For the polynomial kernel, the bias was set to 1.0 and its degree was chosen as 2. For the bandwidth σ of the Gaussian kernels the following values were used: [2⁻⁷, 2⁻⁵, 2⁻³, 2⁻¹, 1, 2¹, 2³, 2⁵, 2⁷].
[Plots: top-s retrieval precision and Precision-Recall curves for *SHL and the six competing hashing methods.]
Fig. 2. The top s retrieval results and Precision-Recall curve on USPS dataset over
*SHL and 6 other hashing algorithms. (view in color)
[Plots: top-s retrieval precision and Precision-Recall curves for *SHL and the six competing hashing methods.]
Fig. 3. The top s retrieval results and Precision-Recall curve on Mnist dataset over
*SHL and 6 other hashing algorithms. (view in color)
[Plots: top-s retrieval precision and Precision-Recall curves for *SHL and the six competing hashing methods.]
Fig. 4. The top s retrieval results and Precision-Recall curve on CIFAR-10 dataset
over *SHL and 6 other hashing algorithms. (view in color)
Fig. 5. The top s retrieval results and Precision-Recall curve on PASCAL07 dataset
over *SHL and 6 other hashing algorithms. (view in color)
Regarding the MKL constraint set, a value of p = 2 was chosen. For the remain-
ing approaches, namely KSH, SPH, AGH, BRE, parameter values were used
according to recommendations found in their respective references. All obtained
results are reported in Fig. 1 through Fig. 5.
We clearly observe that *SHL performs best among all the algorithms consid-
ered. For all the datasets, *SHL achieves the highest top-10 retrieval precision.
Especially for the non-digit datasets (CIFAR-10, PASCAL07), *SHL achieves
significantly better results. As for the PR-curve, *SHL also yields the largest
areas under the curve. Although noteworthy results were reported in [15] for
KSH, in our experiments *SHL outperformed it across all datasets. Moreover,
we observe that supervised hash learning algorithms, except BRE, perform bet-
ter than unsupervised variants. BRE may need a longer bit length to achieve
better performance, as implied by Fig. 1 and Fig. 3. Additionally, it is worth pointing out that *SHL performed remarkably well for short bit lengths across all datasets.
It must be noted that AGH also yielded good results, compared with other
unsupervised hashing algorithms, perhaps due to the anchor points it utilizes as
side information to generate hash codes. With the exception of *SHL and KSH,
the remaining approaches exhibit poor performance for the non-digit datasets
we considered.
When varying the top-s number between 10 and 50, once again with the
exception of *SHL and KSH, the performance of the remaining approaches
deteriorated in terms of top-s retrieval precision. KSH performs slightly worse when s increases, while *SHL's performance remains robust for CIFAR-10 and PASCAL07. It is worth mentioning that the two-layer AGH exhibits better
robustness than its single-layer version for datasets involving images of digits.
Finally, Fig. 6 shows some qualitative results for the CIFAR-10 dataset. In con-
clusion, in our experimentation, *SHL exhibited superior performance for every
code length we considered.
[Fig. 6: qualitative retrieval results on CIFAR-10 for *SHL, KSH, LSH, SPH, BRE, 1-AGH, and 2-AGH. Fig. 7: inductive versus transductive accuracy as a function of the number of bits on the Vowel and Letter datasets.]
test samples for the Letter. Each scenario was run 20 times and the code length
(B) varied from 4 to 15 bits. The results are shown in Fig. 7 and reveal the
potential merits of the transductive *SHL learning mode across a range of code
lengths.
6 Conclusions
In this paper we considered a novel hash learning framework with two
main advantages. First, its Majorization-Minimization (MM)/Block Coordinate
Descent (BCD) training algorithm is efficient and simple to implement. Sec-
ondly, this framework is able to address supervised, unsupervised, and even semi-supervised learning tasks in a unified fashion. In order to show the mer-
its of the method, we performed a series of experiments involving 5 benchmark
datasets. In these experiments, a comparison of *Supervised Hash Learning (*SHL) with 6 other state-of-the-art hashing methods shows *SHL to be highly competitive.
References
1. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and
trends of the new age. ACM Computing Surveys 40(2), 5:1–5:60 (2008)
2. Torralba, A., Fergus, R., Weiss, Y.: Small codes and large image databases for
recognition. In: Proceedings of Computer Vision and Pattern Recognition, pp. 1–8
(2008)
3. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via
hashing. In: Proceedings of the International Conference on Very Large Data Bases,
pp. 518–529 (1999)
4. Kulis, B., Jain, P., Grauman, K.: Fast similarity search for learned metrics. IEEE
Transactions on Pattern Analysis and Machine Intelligence 31(12), 2143–2157
(2009)
5. Salakhutdinov, R., Hinton, G.: Semantic hashing. International Journal of Approx-
imate Reasoning 50(7), 969–978 (2009)
6. Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In:
Proceedings of Advanced Neural Information Processing Systems, pp. 1042–1050
(2009)
7. Norouzi, M., Fleet, D.J.: Minimal loss hashing for compact binary codes. In: Pro-
ceedings of the International Conference on Machine Learning, pp. 353–360 (2011)
8. Yu, C.N.J., Joachims, T.: Learning structural SVMs with latent variables. In: Pro-
ceedings of the International Conference on Machine Learning, pp. 1169–1176
(2009)
674 Y. Huang et al.
9. Mu, Y., Shen, J., Yan, S.: Weakly-supervised hashing in kernel space. In: Proceed-
ings of Computer Vision and Pattern Recognition, pp. 3344–3351 (2010)
10. Zhang, D., Wang, J., Cai, D., Lu, J.: Self-taught hashing for fast similarity search.
In: Proceedings of the International Conference on Research and Development in
Information Retrieval, pp. 18–25 (2010)
11. Strecha, C., Bronstein, A., Bronstein, M., Fua, P.: LDAHash: Improved matching
with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine
Intelligence 34(1), 66–78 (2012)
12. Shakhnarovich, G., Viola, P., Darrell, T.: Fast pose estimation with parameter-
sensitive hashing. In: Proceedings of the International Conference on Computer
Vision, pp. 750–757 (2003)
13. Baluja, S., Covell, M.: Learning to hash: Forgiving hash functions and applications.
Data Mining and Knowledge Discovery 17(3), 402–430 (2008)
14. Li, X., Lin, G., Shen, C., van den Hengel, A., Dick, A.R.: Learning hash func-
tions using column generation. In: Proceedings of the International Conference on
Machine Learning, pp. 142–150 (2013)
15. Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with ker-
nels. In: Proceedings of Computer Vision and Pattern Recognition, pp. 2074–2081
(2012)
16. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: Proceedings of Advanced
Neural Information Processing Systems, pp. 1753–1760 (2008)
17. Liu, W., Wang, J., Kumar, S., Chang, S.F.: Hashing with graphs. In: Proceedings
of the International Conference on Machine Learning, pp. 1–8 (2011)
18. Gong, Y., Lazebnik, S.: Iterative quantization: A procrustean approach to learning
binary codes. In: Proceedings of Computer Vision and Pattern Recognition, pp.
817–824 (2011)
19. Wang, J., Kumar, S., Chang, S.F.: Sequential projection learning for hashing with
compact codes. In: Proceedings of the International Conference on Machine Learn-
ing, pp. 1127–1134 (2010)
20. Wang, J., Kumar, S., Chang, S.F.: Semi-supervised hashing for large-scale
search. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(12),
2393–2406 (2012)
21. Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A.: lp-norm multiple kernel learning.
Journal of Machine Learning Research 12, 953–997 (2011)
22. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011). Software
available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm
23. Hunter, D.R., Lange, K.: A tutorial on MM algorithms. The American Statistician
58(1), 30–37 (2004)
24. Hunter, D.R., Lange, K.: Quantile regression via an MM algorithm. Journal of
Computational and Graphical Statistics 9(1), 60–77 (2000)
25. Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In:
Helmbold, D.P., Williamson, B. (eds.) COLT/EuroCOLT 2001. LNCS (LNAI),
vol. 2111, pp. 416–426. Springer, Heidelberg (2001)
26. List, N., Simon, H.U.: SVM-optimization and steepest-descent line search. In: Pro-
ceedings of the Conference on Computational Learning Theory (2009)
27. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning.
The MIT Press (2012)
28. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)
HierCost: Improving Large Scale Hierarchical
Classification with Cost Sensitive Learning
1 Introduction
Categorizing entities according to a hierarchy of general to specific classes is a
common practice in many disciplines. It can be seen as an important aspect
of various fields such as bioinformatics, music genre classification, image clas-
sification and more importantly document classification [18]. Often the data is
curated manually, but with exploding sizes of databases, it is becoming increas-
ingly important to develop automated methods for hierarchical classification of
entities.
Several classification methods have been developed over the past several years
to address the problem of Hierarchical Classification (HC). One straightforward
approach is to simply use multi-class or binary classifiers to model the relevant
classes and disregard the hierarchical information. This methodology has been
called flat classification scheme in HC literature [18]. While flat classification can
be competitive, an important research direction is to improve the classification
performance by incorporating the hierarchical structure of the classes in the
learning algorithm. Another simple data decomposition approach trains local
classifiers for each of the classes defined according to the hierarchy, such that the
trained model can be used in a top-down fashion to take the most relevant path
in testing. This top-down approach trains each classifier on a smaller dataset and
learning algorithm for w_n. For the training example (x_i, l_i) we set y_i^n = 1 iff l_i = n and y_i^n = −1 otherwise. γ(a, b) denotes the graph distance between classes a, b ∈ N in the hierarchy, which is defined as the number of edges in the undirected path between nodes a and b. We use c_i^n to denote the cost of example i in training the model for class n. To simplify the notation, in some places we drop the superscript explicitly indicating the class, and use y_i, c_i and w in place of y_i^n, c_i^n and w_n, where the class is implicitly understood to be n. L is used to denote a generic loss function. In the current work, the logistic loss function is used, which is defined as L(y, f(x)) = log(1 + exp(−y f(x))).
\min_{w_1,\ldots,w_{|N|}} \ \frac{1}{2} \sum_{n \in N} \|w_n - w_{\pi(n)}\|_2^2 + C \sum_{n \in T} \sum_{i=1}^{N} L\big(y_i^n, w_n^T x_i\big) \qquad (1)
where π(n) represents the parent of class n according to the provided hierarchy. The loss function L has been modeled as logistic loss or hinge loss. Note that the loss is defined only on the terminal nodes T; the non-terminal nodes N − T are introduced only as a means to facilitate regularization. Since
the weights associated with different classes are coupled in the optimization
problem, Gopal et al. [9] used a distributed implementation of block coordinate
descent where each block of variables corresponds to wn for a particular class
n. The model weights are learned similarly to standard LR or SVM for the leaf
nodes n ∈ T , with the exception that the weights are shrunk towards parents
instead of towards the zero vector by the regularizer. For an internal non-leaf node, the weight updates are averages over the nodes connected to it according to the hierarchy, i.e., its parent and children.
The kind of regularization discussed above can be compared to the formula-
tions proposed in transfer and multi-task learning (MTL) literature [7], where
externally provided task relationships can be utilized to constrain the jointly
learned model weights to be similar to each other. In the case of HC, the task
relationships are explicitly provided as hierarchical relationships. However, one
significant difference between the application of this regularization between HC
and MTL is that the sets of examples in MTL for different tasks are, in gen-
eral, disjoint. Whereas, in the case of HC, the examples which are classified
as positive for one class are negative for all other classes except those which
belong to the ancestors of that class. Therefore, even though these models impose
similarity between siblings indirectly through the parent, when their respective
models are trained, the negative and positive examples are flipped. Hence, the
opposing forces for examples and regularization are acting simultaneously during
the learning of these models. However, due to the regularization strength being
imposed by the hierarchy, the net effect is that the importance of misclassifying
the examples coming from nearby classes is down-weighted. This insight can be
directly incorporated into the learning algorithm by defining the loss of nearby
negative examples for a class, where ”near” is defined with respect to the hierar-
chy, to be less severe than the examples which are farther. This yields a simple
cost sensitive classification method where the misclassification cost is directly
proportional to the distance between the nodes of the classes, which is the key
contribution of our work. With respect to prediction, there are only two classes
for each trained model, but the misclassification costs of negative examples are
dependent on the nodes from which they originate.
In this framework for HC, we essentially decouple the learning for each node
of the hierarchy and train the model for each one independently, which renders the method scalable. Instead of jointly formulating the learning of model
parameters for all the classes, we turn the argument around from that of regular-
izing the model weights towards those of the neighboring models to that of rescaling the loss of each example depending on the class relationships. A similar argument was
made in the case of multi-task transfer learning by Saha et al. [16], where, in
place of joint regularization of model weights, as is typically done in multi-task
learning [8], they augment the target tasks with examples from source tasks.
However, the losses for the two sets of examples are scaled differently.
4 Methods
As shown in some previous works [1,9], the performance of flat classification has
been found to be very competitive, especially for large-scale problems. Although the top-down classification method is efficient in training, it fares poorly with
respect to classification performance due to error propagation. Hence, in this
work, we extend the flat classification methodology to deal with HC. We use
the one-vs-all approach for training, where for each class n, to which examples
are assigned, we learn a classification model with weights w_n. Note that it is
unnecessary to train the models for non-terminal classes, as they only serve
as virtual labels in the hierarchy. Once the models for each terminal class are
trained, we perform prediction for an input example x as per (2).
The essential idea is to formulate the learning algorithm such that the mis-
predictions on negative examples coming from nearby classes are treated as
being less severe. We encode this assumption through cost sensitive classification.
Standard regularized binary classification models, such as SVMs and Logistic
Regression, minimize an objective function consisting of loss and regularization
terms as shown in (3).
\min_{w} \ \underbrace{\sum_{i=1}^{N} L\big(y_i, f(x_i \mid w)\big)}_{\text{loss}} + \rho \underbrace{R(w)}_{\text{regularizer}} \qquad (3)
where w denotes the learned model weights. The loss function L, which is gen-
erally a convex approximation of zero-one loss, measures how well the model fits
the training examples. Here, each example is considered to be equally important.
As per the arguments made previously, we modify the importance of correctly
classifying examples according to the hierarchy using example based misclassifi-
cation costs. For models such as logistic regression, incorporating example based
costs into the learning algorithm is simply a matter of scaling the loss by a con-
stant positive value. Assuming that the classifier is being learned for class n, we
can write the cost sensitive objective function as shown in (4).
\min_{w} \ \sum_{i=1}^{N} c_i\, L\big(y_i, w^T x_i\big) + \rho R(w) \qquad (4)
Here, ci denotes the cost associated with misclassification of the ith example.
Although this scaling works for the smooth loss function of Logistic Regression,
it is not as straightforward in the case of non-smooth loss functions such as hinge
loss [12]. Therefore, using the formulation given in (4), for each of the models,
we can formulate the objective function for class n as a cost sensitive logistic
regression problem where the cost of the example xi for the binary classifier of
class n depends on how far the actual label li ∈ T is from n, according to the
hierarchy. Additionally, to deal with the issue of rare categories, we can also
increase the cost of the positive examples for data-sparse classes thus mitigating
the effects of highly skewed datasets. Since our primary motivation is to argue
in favor of using hierarchical cost sensitive learning instead of more expensive
regularization models, we only concentrate on logistic loss, which is easier to
handle from an optimization perspective than non-smooth losses such as the SVM's
hinge loss. The central issue, now, is that of defining the appropriate costs for
the positive and negative examples based on the distance of the examples from
the true class according to the hierarchy and the number of examples available
for training the classifiers. In the following section we discuss the selection of
costs for negative and positive examples.
Hierarchical Cost. Hierarchical costs impose the requirement that the mis-
classification of negative examples that are farther away from the training class
according to the hierarchy should be penalized more severely. Encoding this
assumption, we define the following instantiations of hierarchical costs. We
assume the class under consideration is denoted by n.
Tree Distance (TrD): In (5), we define the cost of negative examples as the
undirected graphical distance, γ (n, li ), between the class n and li , the class label
of example xi . We call this cost Tree Distance (TrD). We define γi ≡ γ (n, li )
and γmax = maxj∈T γj . Since dissimilarity increases with increasing γi , the cost
is a monotonically increasing function of γi .
c_i = \begin{cases} \gamma_{\max} & l_i = n \\ \gamma_i & l_i \neq n \end{cases} \qquad (5)

c_i = \begin{cases} \alpha_{\max} + 1 & l_i = n \\ \alpha_{\max} - \alpha_i + 1 & l_i \neq n \end{cases} \qquad (6)
Exponentiated Tree Distance (ExTrD): Finally, in some cases, especially for deep
hierarchies, the tree distances can be large, and therefore, in order to shrink the
values of the cost into a smaller range, we define ExTrD in (7), where k > 1 can be tuned according to the hierarchy. Through tuning we found that on our dataset,
c_i = \begin{cases} k^{\gamma_{\max}} & l_i = n \\ k^{\gamma_i} & l_i \neq n \end{cases} \qquad (7)
In all these cases, we set the cost of the positive class to the maximum cost of
any example.
Imbalance Cost. In certain cases, especially for large scale hierarchical text
classification, some classes are extremely small with respect to the number of pos-
itive examples available for training. In these cases, the learned decision bound-
ary might favor the larger classes. Therefore, to deal with this imbalance in the
class distributions, we increase the cost of misclassifying rare classes. This has
the effect of mitigating the influence of skew in the data distributions of abun-
dant and rare classes. We call the cost function incorporating this as Imbalance
Cost (IMB), which is given in (8). We noticed that using a cost such as the inverse of the class size diminishes the performance. Therefore, we use a squashing function inspired by the logistic function f(x) = L/[1 + \exp(-k(x - x_0))], which does not severely disadvantage very large classes.

c_i = 1 + \frac{L}{1 + \exp\big(|N_i - N_0|_+\big)} \qquad (8)
where |a|+ = max (a, 0) and Ni is the number of examples belonging to class
denoted by li . The value of ci lies in the range (1, L/2 + 1). We can use a tunable
parameter N0 , which can be intuitively interpreted as the number of examples
required to build a “good” model, above which increasing the cost does not have
a significant effect or might adversely affect the classification performance. In
our experiments, we used N0 = 10 and L = 20.
In order to combine the Hierarchical Costs with the Imbalance Costs, we
simply multiply the contributions of both the costs. We also experimented with
several other hierarchical cost variants, which are not discussed here due to space
constraints.
4.2 Optimization
Since we are dealing with large scale classification problems, we need an efficient
optimization method which relies only on the first order information to solve the
learning problem given in (9).
\min_{w} \ f(w) = \sum_{i=1}^{N} c_i \log\big(1 + \exp(-y_i w^T x_i)\big) + \rho \|w\|_2^2 \qquad (9)
Since the cost values ci are predefined positive scalars, we can adapt any method
used to solve the standard regularized Logistic Regression (LR). We use acceler-
ated gradient descent due to its efficiency and simplicity. The ordinary gradient
descent method has a convergence rate of O (1/k), where k is the number of iter-
ations. The accelerated gradient method improves the convergence rate to O(1/k^2)
by additionally using the gradient information from the previous iteration [14].
The complete algorithm to solve the cost-sensitive binary logistic regression is
provided in Algorithm 1. We describe the notations and expressions used in the algo-
rithm below.
N is the number of examples; X ∈ R^{N×D} denotes the data matrix; y ∈ {±1}^N is the binary label vector for all examples; ρ ∈ R_+ is the regularization parameter; c = (c_1, c_2, . . . , c_N) ∈ R_+^N denotes the cost vector, where c_i is the cost for example i; w ∈ R^D denotes the weight vector learned by the classifier; f(w) denotes the objective function value, given in (10).
\hat{f}_\lambda(u, w), described in (12), is the quadratic approximation of f(u) at w using
approximation constant or step size λ. The appropriate step size in each iteration
is found using line search.
\hat{f}_\lambda(u, w) = f(w) + (u - w)^T \nabla f(w) + \frac{1}{2\lambda} \|u - w\|_2^2 \qquad (12)
The second issue that we must deal with is the definition of cost based on
hierarchical distances and class sizes. With respect to the training of a class n,
an example xi might be associated with multiple labels l1 , l2 . . . , lK . In this case
the tree distance γi is not uniquely defined. Hence, we must aggregate the values
of γ(n, l_1), . . . , γ(n, l_K). One strategy is to use the average of these values, but we found that taking the minimum works a little better. Similarly, we can use the minimum of the number of common ancestors to all target labels for the NCA cost.
Finally, since an example is associated with multiple class labels, the class
size N_i of the example is also not uniquely defined. In this case as well, we use the effective size, i.e., the minimum size out of all the labels associated
with xi for our IMB cost. It also makes intuitive sense, because we are trying
to upweight the rare classes, and the rarest class should be given precedence in
terms of the cost definition.
5 Experimental Evaluations
5.1 Datasets
The details of the datasets used for our experimental evaluations are provided
in Table 1. CLEF [6] is a dataset comprising medical images annotated with
hierarchically organized Image Retrieval in Medical Applications (IRMA) codes.
The task is to predict the IRMA codes from image features. Images are described
with 80 features extracted using a technique called local distribution of edges.
T denotes the set of class labels, P_t and R_t are the precision and recall values for class t ∈ T, and P and R are the overall precision and recall values for all the classes taken together. Micro-F1 gives equal weight to all the examples and therefore favors classes with a larger number of examples. In the case of single-label classification, Micro-F1 is equivalent to accuracy. Macro-F1 gives equal weight
³ http://www.wipo.int/classifications/ipc/en/
⁴ http://lshtc.iit.demokritos.gr/
to all the classes irrespective of their size. Hence, the performance on the smaller
categories is also taken into consideration.
Set based measures do not consider the distance of misclassification with
respect to the true label of the example, but in general, it is reasonable to
assume in most cases that misclassifications that are closer to the actual class
are less severe than misclassifications that are farther from the true class with
respect to the hierarchy. Hierarchical measures, therefore, take the distances
between the actual and predicted class into consideration. The hierarchical mea-
sures, described in eqs. (15) to (17), are Hierarchical Precision (hP ), Hierarchi-
cal Recall (hR), and their harmonic mean, Hierarchical F1 (hF1 ) respectively.
These are hierarchical extensions of standard precision and recall scores. Tree-
induced Error (T E) [19], given in (18), measures the average hierarchical distance
between the actual and predicted labels.
hP = \frac{\sum_{i=1}^{N} |A(l_i) \cap A(\hat{l}_i)|}{\sum_{i=1}^{N} |A(\hat{l}_i)|} \qquad (15)

hR = \frac{\sum_{i=1}^{N} |A(l_i) \cap A(\hat{l}_i)|}{\sum_{i=1}^{N} |A(l_i)|} \qquad (16)

hF_1 = \frac{2 \cdot hP \cdot hR}{hP + hR} \qquad (17)

TE = \frac{1}{N} \sum_{i=1}^{N} \gamma(\hat{l}_i, l_i) \qquad (18)
where \hat{l}_i and l_i are the predicted and true labels of example i, respectively, and γ(a, b) is the graph distance between a and b according to the hierarchy. A(l) denotes the set that includes the node l and all its ancestors except the root node. For TE lower values are better, whereas for all other measures higher values are better.
For multi-label classification, where each l_i is a set of micro-labels, we redefine the graph distance and the ancestor set as: \gamma_{ml}(l_i, \hat{l}_i) = |\hat{l}_i|^{-1} \sum_{a \in \hat{l}_i} \min_{b \in l_i} \gamma(a, b) and A_{ml}(l) = \cup_{a \in l} A(a).
For all the experiments, the regularization parameter is tuned using a validation
set. The model is trained for a range of values 10^k, with appropriate values for
k selected depending on the dataset. Using the best parameter selected on vali-
dation set, we retrained the models on the entire training set and measured the
performance on a held out test set. The source code implementing the methods
discussed in this paper is available on our website⁵. The experiments were performed on Dell C8220 compute nodes with dual Intel Xeon E5-2670 8-core CPUs and 64 GB memory.
⁵ http://cs.gmu.edu/~mlbio/HierCost/
5.5 Results
The evaluations on Dmoz 2010 and Dmoz 2012 datasets are blind and the pre-
dictions have to be uploaded to LSHTC website in order to obtain the scores.
For Dmoz 2012, Tree Error is not available, and for Dmoz 2010, hF_1 is not available. For HRLR, we do not have access to the predictions; hence, we
could only report the values for Micro-F1 and Macro-F1 scores from [9]. Sta-
tistical significance tests compare the results of HierCost with LR. These tests
could not be performed on LSHTC datasets due to non-availability of the true
labels on test sets. As seen in Table 4, HierCost improves upon the baseline LR
results as well as the results reported in [9], in most cases, especially the Macro-
F1 scores. The results of HierCost are better on most measures. TD performs worst on average on the set-based measures. In fact, TD is competitive only on the Dmoz 2012 dataset; on the rest, its results are much worse than those of the flat LR classifier and its hierarchical extensions. On the hierarchical measures, however, TD outperformed the flat classifiers on some datasets.
In Table 5, we report the run-times comparisons of TD, LR and HierCost.
We trained the models in parallel for different classes and computed the sum of
run-times for each training instance. In theory, the run-times of LR and Hier-
Cost should be equivalent, because they solve similar optimization problems.
However, minor variations in the run-times were observed because of the variations in the optimal regularization penalties, which influence the convergence of the optimization algorithm. The run-times of the flat methods were significantly worse than those of TD, which is efficient to train but at a considerable loss in classification performance. Although we do not measure the training times of HRLR, based on experience from a similar problem [3], the recursive model takes between 3 and 10 iterations to converge. In each iteration, the models for all the terminal labels need to be trained; hence, each iteration is about as expensive as
a single run of LR. In addition, the distributed recursive models require commu-
nication between the training machines which incurs an additional overhead.
6 Conclusions
In this paper, we have argued that methods that extend flat classification using hierarchical regularization can be viewed in a complementary way as weighting the losses on the negative examples depending on the dissimilarity between the positive and negative classes. The approach proposed in this paper incorporates this insight directly into the loss function by scaling the loss according to the dissimilarity between the classes with respect to the hier-
archy, thus obviating the need for recursive regularization and iterative model
training. At the same time, this approach also makes parallelization trivial. Our
method also mitigates the adverse effects of imbalance in the training data by up-
weighting the loss for examples from smaller classes, thus, significantly improv-
ing their classification results. Our experimental results show that the proposed
method is able to efficiently incorporate hierarchical information by transform-
ing the hierarchical classification problem into an example based cost sensitive
classification problem. In future work, we would like to evaluate the benefits of
cost sensitive classification using large margin classifiers such as support vector
machines.
Acknowledgement. This work was supported by NSF IIS-1447489 and NSF career
award to Huzefa Rangwala (IIS-1252318). The experiments were run on ARGO
(http://orc.gmu.edu).
References
1. Babbar, R., Partalas, I., Gaussier, E., Amini, M.R.: On flat versus hierarchical clas-
sification in large-scale taxonomies. In: Advances in Neural Information Processing
Systems, pp. 1824–1832 (2013)
2. Cai, L., Hofmann, T.: Hierarchical document categorization with support vector
machines. In: ACM International Conf. on Information & Knowledge Management,
pp. 78–87 (2004)
3. Charuvaka, A., Rangwala, H.: Approximate block coordinate descent for large scale
hierarchical classification. In: ACM SIGAPP Symposium on Applied Computing
(2015)
4. Chen, J., Warren, D.: Cost-sensitive learning for large-scale hierarchical classifi-
cation. In: ACM International Conf. on Information & Knowledge Management,
pp. 1351–1360 (2013)
5. Dekel, O., Keshet, J., Singer, Y.: Large margin hierarchical classification. In: Inter-
national Conf. on Machine Learning, p. 27 (2004)
6. Dimitrovski, I., Kocev, D., Loskovska, S., Džeroski, S.: Hierarchical annotation of
medical images. Pattern Recognition 44(10), 2436–2449 (2011)
7. Evgeniou, T., Micchelli, C., Pontil, M.: Learning multiple tasks with kernel meth-
ods. Journal of Machine Learning Research 6(1), 615–637 (2005)
8. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: ACM SIGKDD Inter-
national Conf. on Knowledge Discovery and Data Mining, pp. 109–117 (2004)
9. Gopal, S., Yang, Y.: Recursive regularization for large-scale classification with
hierarchical and graphical dependencies. In: ACM SIGKDD International Conf.
on Knowledge Discovery and Data Mining, pp. 257–265 (2013)
10. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for
text categorization research. The Journal of Machine Learning Research 5, 361–397
(2004)
11. Liu, T.Y., Yang, Y., Wan, H., Zeng, H.J., Chen, Z., Ma, W.Y.: Support vector
machines classification with a very large-scale taxonomy. ACM SIGKDD Explo-
rations Newsletter 7(1), 36–43 (2005)
12. Masnadi-Shirazi, H., Vasconcelos, N.: Risk minimization, probability elicitation,
and cost-sensitive SVMs. In: International Conf. on Machine Learning, pp. 759–766
(2010)
13. McCallum, A., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classifi-
cation by shrinkage in a hierarchy of classes. In: International Conf. on Machine
Learning, vol. 98, pp. 359–367 (1998)
14. Nesterov, Y.: Introductory lectures on convex optimization, vol. 87. Springer
Science & Business Media (2004)
15. Partalas, I., Kosmopoulos, A., Baskiotis, N., Artieres, T., Paliouras, G.,
Gaussier, E., Androutsopoulos, I., Amini, M.R., Galinari, P.: LSHTC: A benchmark
for large-scale text classification. arXiv preprint arXiv:1503.08581 (2015)
16. Saha, B., Gupta, S., Phung, D., Venkatesh, S.: Multiple task transfer learning with
small sample sizes. Knowledge and Information Systems, 1–28 (2015)
17. Shahbaba, B., Neal, R.M., et al.: Improving classification when a class hierarchy
is available using a hierarchy-based prior. Bayesian Analysis 2(1), 221–237 (2007)
18. Silla Jr., C.N., Freitas, A.A.: A survey of hierarchical classification across different
application domains. Data Mining and Knowledge Discovery 22(1–2), 31–72 (2011)
19. Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for
classification tasks. Information Processing & Management 45(4), 427–437 (2009)
20. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. Journal of Machine Learning
Research 6, 1453–1484 (2005)
21. Yang, Y.: A study of thresholding strategies for text categorization. In: ACM SIGIR
Conf. on Research and Development in Information Retrieval, pp. 137–145 (2001)
22. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: ACM
SIGIR Conf. on Research and Development in Information Retrieval, pp. 42–49
(1999)
Large Scale Optimization with Proximal
Stochastic Newton-Type Gradient Descent
where αk is the step size at the k-th iteration. Under standard assumptions the
sub-optimality achieved on iteration k of the ProxFG method with a constant
step size is given by
f(x_k) - f(x^*) = O\Big(\frac{1}{k}\Big).
When f is strongly-convex, the error satisfies [11]
f(x_k) - f(x^*) = O\Big(\Big(\frac{L - \mu_g}{L + \mu_h}\Big)^k\Big),
where L is the Lipschitz constant of f(x), and μ_g and μ_h are the convexity parameters of g(x) and h(x), respectively. These notations will be detailed in Section 1.1. This results in a linear convergence rate, which is also known as a geometric or exponential rate because the error is cut by a fixed fraction on each iteration.
Unfortunately, the ProxFG methods can be unappealing when n is large
or huge because its iteration cost scales linearly in n. When the number of
components n is very large, then each iteration of (2) will be very expensive
since it requires computing the gradients for all the n component functions gi ,
and also their average.
To overcome this problem, researchers proposed the proximal stochastic gra-
dient descent methods (ProxSGD), whose main appeal is that they have an
iteration cost which is independent of n, making them suited for modern prob-
lems where n may be very large. The basic ProxSGD method for optimizing (1),
uses iterations of the form
x_k = \mathrm{prox}_{\alpha_k h}\big(x_{k-1} - \alpha_k \nabla g_{i_k}(x_{k-1})\big), \qquad (3)
where at each iteration an index ik is sampled uniformly from the set {1, ..., n}.
The randomly chosen gradient ∇gik (xk−1 ) yields an unbiased estimate of the
true gradient ∇g(xk−1 ) and one can show under standard assumptions that, for
a suitably chosen decreasing step-size sequence {αk }, the ProxSGD iterations
have an expected sub-optimality for convex objectives of [1]
E[f(x_k)] - f(x^*) = O\Big(\frac{1}{\sqrt{k}}\Big)
and an expected sub-optimality for strongly-convex objectives of
E[f(x_k)] - f(x^*) = O\Big(\frac{1}{k}\Big).
In these rates, the expectations are taken with respect to the selection of the ik
variables.
Besides these first-order methods, there is another group of methods, called
proximal Newton-type methods, which converge much faster, but need more
memory and computation to obtain the second order information about the
objective function. These methods are always limited to small-to-medium scale
problems that require a high degree of precision. For optimizing (1), proximal
Newton-type methods [2] that incorporate second order information use itera-
tions of the form x_{k+1} ← x_k + Δx_k, where Δx_k is obtained by
\Delta x_k = \arg\min_{d \in R^p} \ \nabla g(x_k)^T d + \frac{1}{2} d^T H_k d + h(x_k + d), \qquad (4)
where Hk denotes an approximation to ∇2 g(xk ). According to the strategies
for choosing H_k, we obtain different methods, such as the proximal Newton method
(ProxN) when we choose Hk to be ∇2 g(xk ); proximal quasi-Newton method
(ProxQN) when we build an approximation to ∇2 g(xk ) using changes mea-
sured in ∇g according to a quasi-Newton strategy [2]. Indeed, comparing (4) with (2), it can be seen that ProxN is ProxFG with scaled proximal mappings.
Based on the related background introduced above, now we can describe our
approaches and findings. The primary contribution of this work is the proposal
and analysis of a new algorithm that we call the proximal stochastic Newton-type
gradient (PROXTONE, pronounced /prok stone/) method, a stochastic variant
of the ProxN method. The PROXTONE method has a low iteration cost like that of ProxSGD methods, but achieves convergence rates like those of the ProxFG method stated above. The PROXTONE iterations take the form x_{k+1} ← x_k + t_k Δx_k, where Δx_k is obtained by
\Delta x_k \leftarrow \arg\min_{d} \ d^T(\nabla_k + H_k x_k) + \frac{1}{2} d^T H_k d + h(x_k + d), \qquad (5)

here \nabla_k = \frac{1}{n}\sum_{i=1}^{n} \nabla_k^i, H_k = \frac{1}{n}\sum_{i=1}^{n} H_k^i, and at each iteration a random index j and corresponding H_{k+1}^j is selected; then we set

\nabla_{k+1}^i = \begin{cases} \nabla g_i(x_{k+1}) - H_{k+1}^i x_{k+1} & \text{if } i = j, \\ \nabla_k^i & \text{otherwise,} \end{cases}

and H_{k+1}^i \leftarrow H_k^i \ (i \neq j).
That is, like the ProxFG and ProxN methods, the steps incorporate a gradi-
ent with respect to each function; but, like the ProxSGD method, each iteration
only computes the gradient with respect to a single example and the cost of
the iterations is independent of n. Despite the low cost of the PROXTONE
iterations, we show in this paper that the PROXTONE iterations have a linear
convergence rate for strongly-convex objectives, like the ProxFG method. That
is, by having access to j and by keeping a memory of the approximations of the Hessian matrix of the objective function, this iteration achieves a
faster convergence rate than is possible for standard ProxSGD methods.
Besides PROXTONE, there are a large variety of approaches available to
accelerate the convergence of ProxSGD methods, and a full review of this
immense literature would be outside the scope of this work. Several recent
works considered various special cases of (1) and developed algorithms that enjoy a linear convergence rate, such as ProxSDCA [8], MISO [3], SAG [7],
ProxSVRG [11], SFO [10], and ProxN [2]. All these methods converge with an
exponential rate in the value of the objective function, except ProxN, which achieves superlinear rates of convergence for the solution; however, it is a batch-mode method. Shalev-Shwartz and Zhang's ProxSDCA [8,9] considered the case where the component functions have the form g_i(x) = φ_i(a_i^T x) and the Fenchel conjugate functions of φ_i and h can be computed efficiently. Schmidt et al.'s SAG [7] and Sohl-Dickstein et al.'s SFO [10] considered the case where h(x) ≡ 0.
Different from the above related methods, our PROXTONE is an extension of SFO and ProxN to a proximal stochastic Newton-type method for solving the more general nonsmooth (compared to ProxSDCA, SAG, and SFO) class
of problems defined in (1). PROXTONE makes connections between two com-
pletely different approaches. It achieves a linear convergence rate not only in the
value of the objective function, but also for the solution. We now outline the rest
of the study. Section 2 presents the main algorithm and gives an equivalent form for ease of analysis. Section 3 states the assumptions underlying
our analysis and gives the main results; we first give a linear convergence rate
in function value (weak convergence) that applies for any problem, and then
give a strong linear convergence rate for the solution under some additional conditions. We report some experimental results in Section 4 and provide
concluding remarks in Section 5.
and

\nabla_k + H_k x_k = \frac{1}{n}\sum_{i=1}^{n}\big[\nabla g_i(x_{\theta_{i,k}}) + (x_k - x_{\theta_{i,k-1}})^T H_{\theta_{i,k}}^i\big]. \qquad (11)

3: Sample j from {1, 2, . . . , n}, and update the quadratic models or surrogate functions:

g_j^{k+1}(x) = g_j(x_{k+1}) + (x - x_{k+1})^T \nabla g_j(x_{k+1}) + \frac{1}{2}(x - x_{k+1})^T H_{k+1}^j (x - x_{k+1}), \qquad (13)

while leaving all other g_i^{k+1}(x) unchanged: g_i^{k+1}(x) \leftarrow g_i^k(x) \ (i \neq j); and G^{k+1}(x) = \frac{1}{n}\sum_{i=1}^{n} g_i^{k+1}(x).
4: until stopping conditions are satisfied.
Output: x_k.
g_i^k(x) = g_i(x_{\theta_{i,k}}) + (x - x_{\theta_{i,k}})^T \nabla g_i(x_{\theta_{i,k}}) + \frac{1}{2}(x - x_{\theta_{i,k}})^T H_{\theta_{i,k}}^i (x - x_{\theta_{i,k}}), \qquad (14)

here \theta_{i,k} is a random variable which has the following conditional probability distribution in each iteration:

P(\theta_{i,k} = k \mid j) = \frac{1}{n} \quad \text{and} \quad P(\theta_{i,k} = \theta_{i,k-1} \mid j) = 1 - \frac{1}{n}, \qquad (15)
and H_{\theta_{i,k}}^i is any positive definite matrix, which possibly depends on x_{\theta_{i,k}}. Then
at each iteration the search direction is found by solving the subproblem (12).
One of the crucial ideas of this algorithm is that the component function to be
used for updating the search direction at each iteration is chosen randomly. This
allows the function to be selected very quickly. After the component function g_j(x) is selected, it is updated by (13), while all other g_i^{k+1}(x) are left unchanged.
Arguably, the most important feature of this method is that the regularized
quadratic model (12) incorporates second order information in the form of a
positive definite matrix H_k^i. This is key because, at each iteration, the user has complete freedom over the choice of H_k^i. A few suggestions for the choice of H_k^i include: the simplest option is H_k^i = I, in which case no second-order information is employed; H_k^i = ∇²g_i(x_k) provides the most accurate second-order information, but it is (potentially) more computationally expensive to work with.
3 Convergence Analysis
In this section we provide convergence theory for the PROXTONE algorithm.
Under the standard assumptions, we now state our convergence result.
Theorem 1. Suppose ∇g_i(x) is Lipschitz continuous with constant L_i > 0 for i = 1, . . . , n, L_i I \preceq mI \preceq H_k^i \preceq MI for all i = 1, . . . , n, k ≥ 1, and h(x) is strongly convex with μ_h ≥ 0. Let L_{max} = \max\{L_1, . . . , L_n\}; then the PROXTONE iterations satisfy, for k ≥ 1:

E[f(x_k)] - f^* \le \frac{M + L_{max}}{2}\Big[\frac{1}{n}\cdot\frac{M + L_{max}}{2\mu_h + m} + \Big(1 - \frac{1}{n}\Big)\Big]^k \|x^* - x_0\|^2. \qquad (16)
The idea of the proof is closely related to that of MISO by Mairal [3], and for completeness we give a simplified version in the appendix.
We have the following remarks regarding the above result:
At this moment, we see that the expected quality of the output of PROX-
TONE is good. However, in practice we are not going to run this method many
times on the same problem. What is the probability that our single run can give
Corollary 1. Suppose the assumptions in Theorem 1 hold. Then for any ε > 0 and δ ∈ (0, 1), we have

Prob\big(f(x_k) - f(x^*) \le \varepsilon\big) \ge 1 - \delta,
Based on Theorem 1 and its proof, we give a deeper and stronger result: PROXTONE also achieves a linear convergence rate for the solution.
Theorem 2. Suppose ∇g_i(x) and ∇²g_i(x) are Lipschitz continuous with constants L_i > 0 and K_i > 0, respectively, for i = 1, . . . , n, and h(x) is strongly convex with μ_h ≥ 0. Let L_{max} = \max\{L_1, . . . , L_n\} and K_{max} = \frac{1}{n}\sum_{i=1}^{n} K_i. If H_{\theta_{i,k}}^i = ∇²g_i(x_{\theta_{i,k}}) and L_i I \preceq mI \preceq H_k^i \preceq MI, then PROXTONE converges exponentially to x^* in expectation:

E[\|x_{k+1} - x^*\|] \le \Big(\frac{K_{max} + 2L_{max}}{m}\cdot\frac{M + L_{max}}{2\mu_h + m} + \frac{2L_{max}}{m}\Big)\Big[\frac{1}{n}\cdot\frac{M + L_{max}}{2\mu_h + m} + \Big(1 - \frac{1}{n}\Big)\Big]^{k-1} \|x^* - x_0\|^2.
Corollary 2. Suppose the assumptions in Theorem 2 hold. Then for any ε > 0 and δ ∈ (0, 1), we have

Prob\big(\|x_{k+1} - x^*\| \le \varepsilon\big) \ge 1 - \delta,
4 Numerical Experiments
The technique proposed in this paper has wide applications: it can be used for least-squares regression, the Lasso, the elastic net, and logistic regression. Furthermore, the principle of PROXTONE can also be applied to nonconvex optimization problems, such as the training of deep convolutional networks.
In this section we present the results of some numerical experiments to illus-
trate the properties of the PROXTONE method. We focus on the sparse regular-
ized logistic regression problem for binary classification: given a set of training
examples (a1 , b1 ), . . . , (an , bn ) where ai ∈ Rp and bi ∈ {+1, −1}, we find the
optimal predictor x ∈ Rp by solving
\min_{x \in R^p} \ \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-b_i a_i^T x)\big) + \lambda_1 \|x\|_2^2 + \lambda_2 \|x\|_1,
where λ1 and λ2 are two regularization parameters. We set
g_i(x) = \log\big(1 + \exp(-b_i a_i^T x)\big) + \lambda_1 \|x\|_2^2, \qquad h(x) = \lambda_2 \|x\|_1, \qquad (17)

and λ_1 = 10^{-4}, λ_2 = 10^{-4}.
In this situation, the subproblem (12) becomes a lasso problem, which can be effectively and accurately solved by proximal algorithms [6].
We used some publicly available data sets. The protein data set was obtained from the KDD Cup 2004¹; the covertype data set was obtained from LIBSVM Data².
¹ http://osmot.cs.cornell.edu/kddcup
² http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
5 Conclusions
This paper introduces a proximal stochastic method called PROXTONE for
minimizing regularized finite sums. For nonsmooth and strongly convex prob-
lems, we show that PROXTONE not only enjoys the same linear rates as those
of MISO, SAG, ProxSVRG and ProxSDCA, but also showed that the solution
of this method converges in exponential rate too. There are some directions
that the current study can be extended. In this paper, we have focused on the
theory and the convex experiments of PROXTONE; it would be meaningful to
also make clear the implementation details and do the numerical evaluation to
nonconvex problems [10]. Second, combine with randomized block coordinate
method [4] for minimizing regularized convex functions with a huge number of
varialbes/coordinates. Moreover, due to the trends and needs of big data, we
are designing distributed/parallel PROXTONE for real life applications. In a
broader context, we believe that the current paper could serve as a basis for
examining proximal stochastic methods that employ second-
order information.
Appendix
In this Appendix, we give the proofs of the two theorems.
A Proof of Theorem 1
Since in each iteration of PROXTONE we have (14) and (15), it follows that
E[\|x^* - x_{\theta_{i,k}}\|^2] = \frac{1}{n} E[\|x^* - x_k\|^2] + \Big(1 - \frac{1}{n}\Big) E[\|x^* - x_{\theta_{i,k-1}}\|^2]. \qquad (18)
Since 0 \preceq H_{\theta_{i,k}}^i \preceq MI and ∇²g_i^k(x) = H_{\theta_{i,k}}^i, by Theorem 2.1.6 of [5] and the assumption, ∇g_i^k(x) and ∇g_i(x) are Lipschitz continuous with constants M and L_i respectively, and further ∇g_i^k(x) − ∇g_i(x) is Lipschitz continuous with constant M + L_i for i = 1, . . . , n. This together with (7) yields

\big|[g_i^k(x) - g_i(x)] - [g_i^k(y) - g_i(y)] - \nabla[g_i^k(y) - g_i(y)]^T(x - y)\big| \le \frac{M + L_i}{2}\|x - y\|^2.
Applying the above inequality with y = x_{\theta_{i,k}}, and using the fact that \nabla g_i^k(x_{\theta_{i,k}}) = \nabla g_i(x_{\theta_{i,k}}) and g_i^k(x_{\theta_{i,k}}) = g_i(x_{\theta_{i,k}}), we have

|g_i^k(x) - g_i(x)| \le \frac{M + L_i}{2}\|x - x_{\theta_{i,k}}\|^2.
Summing over i = 1, . . . , n yields

[G^k(x) + h(x)] - [g(x) + h(x)] \le \frac{1}{n}\sum_{i=1}^{n}\frac{M + L_i}{2}\|x - x_{\theta_{i,k}}\|^2. \qquad (19)
Then by the Lipschitz continuity of ∇g_i(x) and the assumption L_i I \preceq mI \preceq H_k^i, we have

g_i(x) \le g_i(x_{\theta_{i,k}}) + \nabla g_i(x_{\theta_{i,k}})^T(x - x_{\theta_{i,k}}) + \frac{L_i}{2}\|x - x_{\theta_{i,k}}\|^2
\le g_i(x_{\theta_{i,k}}) + (x - x_{\theta_{i,k}})^T \nabla g_i(x_{\theta_{i,k}}) + \frac{1}{2}(x - x_{\theta_{i,k}})^T H_{\theta_{i,k}}^i (x - x_{\theta_{i,k}}) = g_i^k(x),
and thus, summing over i yields g(x) \le G^k(x), and further, by the optimality of x_{k+1}, we have

f(x_{k+1}) \le G^k(x_{k+1}) + h(x_{k+1}) \le G^k(x) + h(x) \le f(x) + \frac{1}{n}\sum_{i=1}^{n}\frac{M + L_i}{2}\|x - x_{\theta_{i,k}}\|^2. \qquad (20)
Since mI \preceq H_{\theta_{i,k}}^i and ∇²g_i^k(x) = H_{\theta_{i,k}}^i, by Theorem 2.1.11 of [5], g_i^k(x) is m-strongly convex. Since G^k(x) is the average of the g_i^k(x), G^k(x) + h(x) is (m + μ_h)-strongly convex, and we have

f(x_{k+1}) + \frac{m + \mu_h}{2}\|x - x_{k+1}\|^2 \le G^k(x_{k+1}) + h(x_{k+1}) + \frac{m + \mu_h}{2}\|x - x_{k+1}\|^2
\le G^k(x) + h(x)
= f(x) + [G^k(x) + h(x) - f(x)]
\le f(x) + \frac{1}{n}\sum_{i=1}^{n}\frac{M + L_i}{2}\|x - x_{\theta_{i,k}}\|^2.
We have
\frac{\mu_h}{2}\|x_{k+1} - x^*\|^2 \le E[f(x_{k+1})] - f^* \le \frac{1}{n}\sum_{i=1}^{n}\frac{M + L_{max}}{2} E[\|x^* - x_{\theta_{i,k}}\|^2] - \frac{m + \mu_h}{2} E[\|x^* - x_{k+1}\|^2].
thus
\|x_{k+1} - x^*\|^2 \le \frac{M + L_{max}}{2\mu_h + m}\cdot\frac{1}{n}\sum_{i=1}^{n} E[\|x^* - x_{\theta_{i,k}}\|^2]. \qquad (21)
then we have

E\Big[\frac{1}{n}\sum_{i=1}^{n}\|x^* - x_{\theta_{i,k}}\|^2\Big] = \frac{1}{n}E[\|x_k - x^*\|^2] + \Big(1 - \frac{1}{n}\Big)E\Big[\frac{1}{n}\sum_{i=1}^{n}\|x^* - x_{\theta_{i,k-1}}\|^2\Big]
\le \Big[\frac{1}{n}\cdot\frac{M + L_{max}}{2\mu_h + m} + \Big(1 - \frac{1}{n}\Big)\Big]E\Big[\frac{1}{n}\sum_{i=1}^{n}\|x^* - x_{\theta_{i,k-1}}\|^2\Big]
\le \Big[\frac{1}{n}\cdot\frac{M + L_{max}}{2\mu_h + m} + \Big(1 - \frac{1}{n}\Big)\Big]^k E\Big[\frac{1}{n}\sum_{i=1}^{n}\|x^* - x_{\theta_{i,0}}\|^2\Big]
\le \Big[\frac{1}{n}\cdot\frac{M + L_{max}}{2\mu_h + m} + \Big(1 - \frac{1}{n}\Big)\Big]^k \|x^* - x_0\|^2.

Thus we have E[f(x_{k+1})] - f^* \le \frac{M + L_{max}}{2}\Big[\frac{1}{n}\cdot\frac{M + L_{max}}{2\mu_h + m} + \Big(1 - \frac{1}{n}\Big)\Big]^k \|x^* - x_0\|^2.
B Proof of Theorem 2
We first examine the relations between the search directions of ProxN and
PROXTONE.
By (4), (5), and Fermat's rule, \Delta x_k^{ProxN} and \Delta x_k are also the solutions to
and
Since the ProxN method converges q-quadratically (cf. Theorem 3.3 of [2]),
\|x_{k+1} - x^*\| \le \|x_k + \Delta x_k^{ProxN} - x^*\| + \|\Delta x_k - \Delta x_k^{ProxN}\| \le \frac{K_{max}}{m}\|x_k - x^*\|^2 + \|\Delta x_k - \Delta x_k^{ProxN}\|. \qquad (25)
Thus from (24) and (25), we have almost surely that
\|x_{k+1} - x^*\| \le \frac{K_{max}}{m}\|x_k - x^*\|^2 + \frac{L_{max}}{2mn}\sum_{i=1}^{n}\|x_{\theta_{i,k-1}} - x_k\|^2
\le \frac{K_{max}}{m}\|x_k - x^*\|^2 + \frac{L_{max}}{mn}\sum_{i=1}^{n} 2\|x_{\theta_{i,k-1}} - x^*\|^2 + \frac{L_{max}}{mn}\sum_{i=1}^{n} 2\|x^* - x_k\|^2.
which yields

\|x_{k+1} - x^*\| \le \Big(\frac{K_{max} + 2L_{max}}{m}\cdot\frac{M + L_{max}}{2\mu_h + m} + \frac{2L_{max}}{m}\Big)\Big[\frac{1}{n}\cdot\frac{M + L_{max}}{2\mu_h + m} + \Big(1 - \frac{1}{n}\Big)\Big]^k \|x^* - x_0\|^2.
References
1. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for con-
vex optimization: a survey. Optimization for Machine Learning 2010, 1–38 (2011)
2. Lee, J., Sun, Y., Saunders, M.: Proximal Newton-type methods for convex opti-
mization. In: Advances in Neural Information Processing Systems, pp. 836–844
(2012)
3. Mairal, J.: Optimization with first-order surrogate functions. arXiv preprint
arXiv:1305.3120 (2013)
4. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization
problems. SIAM Journal on Optimization 22(2), 341–362 (2012)
5. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course.
Kluwer, Boston (2004)
6. Parikh, N., Boyd, S.: Proximal algorithms. Foundations and Trends in Optimiza-
tion 1(3), 123–231 (2013)
7. Schmidt, M., Roux, N.L., Bach, F.: Minimizing finite sums with the stochastic
average gradient. arXiv preprint arXiv:1309.2388 (2013)
8. Shalev-Shwartz, S., Zhang, T.: Proximal stochastic dual coordinate ascent. arXiv
preprint arXiv:1211.2717 (2012)
9. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for reg-
ularized loss. The Journal of Machine Learning Research 14(1), 567–599 (2013)
10. Sohl-Dickstein, J., Poole, B., Ganguli, S.: Fast large-scale optimization by uni-
fying stochastic gradient and quasi-Newton methods. In: Proceedings of the 31st
International Conference on Machine Learning (ICML 2014), pp. 604–612 (2014)
11. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive vari-
ance reduction. arXiv preprint arXiv:1403.4699 (2014)
Erratum to: Bayesian Active Clustering
with Pairwise Constraints
Erratum to:
Chapter 15 in: A. Appice et al. (Eds.)
Machine Learning and Knowledge Discovery
in Databases
DOI: 10.1007/978-3-319-23528-8 15
(i) In the original version, Figure 2 was wrong. The correct figure is shown on
the next page.
[Corrected Fig. 2: F-Measure versus number of queries (#Query) on the Fertility, Parkinsons, Crabs, Sonar, Balance, Transfusion, Letters-IJ, and Digits-389 datasets, comparing Info+BC, Uncertain+BC, NPU+ITML, NPU+MPCKmeans, MinMax+ITML, and MinMax+MPCKmeans.]
Erratum to:
Chapter 39 in: A. Appice et al. (Eds.)
Machine Learning and Knowledge Discovery
in Databases
DOI: 10.1007/978-3-319-23528-8 39
(i) In the original version, the corresponding author and e-mail IDs were missing; they should read as follows:
Acknowledgments. This work was supported in part by the Alberta Innovates Tech-
nology Futures (AITF) and Natural Sciences and Engineering Research Council of
Canada (NSERC).
Erratum to: Predicting Unseen Labels
Using Label Hierarchies in Large-Scale
Multi-label Learning
Erratum to:
Chapter 7 in: A. Appice et al. (Eds.)
Machine Learning and Knowledge Discovery
in Databases
DOI: 10.1007/978-3-319-23528-8 7
Acknowledgements. The authors would like to thank the anonymous reviewers for
their feedback. This work has been supported by the German Institute for Educational
Research (DIPF) under the Knowledge Discovery in Scientific Literature (KDSL) pro-
gram, and the German Research Foundation as part of the Research Training Group
“Adaptive Preparation of Information from Heterogeneous Sources” (AIPHES) under grant No. GRK 1994/1.