
Integrating distance metric learning and cluster-level constraints in semi-supervised clustering

Bruno Magalhães Nogueira
Faculty of Computing, Federal University of Mato Grosso do Sul
Av. Costa e Silva, s/n, Campo Grande, MS, Brazil
Email: bruno@facom.ufms.br

Yuri Karan Benevides Tomas
Faculty of Computing, Federal University of Mato Grosso do Sul
Av. Costa e Silva, s/n, Campo Grande, MS, Brazil
Email: iruynarak@gmail.com

Ricardo Marcondes Marcacini
Federal University of Mato Grosso do Sul
Av. Capitão Olinto Mancini, 1662, Três Lagoas, MS, Brazil
Email: ricardo.marcacini@ufms.br

Abstract—Semi-supervised clustering has been widely explored in the last years. In this paper, we present HCAC-ML (Hierarchical Confidence-based Active Clustering with Metric Learning), an innovative approach for this task which employs distance metric learning through cluster-level constraints. HCAC-ML is based on the HCAC algorithm, a state-of-the-art algorithm for hierarchical semi-supervised clustering that uses an active learning approach for inserting cluster-level constraints. These constraints are presented to a variation of the ITML (Information-Theoretic Metric Learning) algorithm to learn a Mahalanobis-like distance function. We compared HCAC-ML with other semi-supervised clustering algorithms on 26 different datasets. Results indicate that HCAC-ML outperforms the other algorithms in most of the scenarios, especially when the number of constraints is small. This makes HCAC-ML useful in practical applications.

I. INTRODUCTION

Semi-supervised clustering has emerged as an interesting alternative in the last years. These algorithms improve the clustering quality through external knowledge conveyed in the form of constraints. These constraints are used to guide the clustering process and can be directly derived from the original data (using partially labeled data) or provided by a user, trying to adapt clustering results to his/her expectations [1].

Constraints employed in semi-supervised clustering algorithms may refer to instances, clusters or both. Instance-level constraints may indicate whether two instances must or must not belong to the same cluster [2], as in pairwise must-link and cannot-link constraints, or whether a given instance is nearer to a second one than to a third one [3]. Cluster-level constraints, on the other hand, introduce information about groups of objects, indicating, for example, whether two clusters must or must not be merged [4], [5], split [6] or even removed [7]. When referring to both instances and clusters, constraints may indicate, for example, whether an instance must or must not belong to a given cluster [8]. While instance-level constraints are easier for a user to provide, cluster-level constraints may be more effective, since they refer to a group of instances and can therefore convey more information [5].

Semi-supervised clustering algorithms may also be classified according to the way they deal with the constraints [9]: constraint-based and distance-based. Constraint-based algorithms modify the objective function of clustering algorithms in order to respect the provided constraints. Distance-based algorithms, on the other hand, employ the constraints in order to learn a new distance metric to group instances. Constraint-based methods, in general, have a lower computational cost. However, distance-based methods are better suited to contexts containing clusters with different shapes and densities [10].

Existing distance-based clustering algorithms employ instance-level constraints [11], [12], [10], [13]. These algorithms present good results in clustering quality. However, they require relatively large amounts of user intervention to obtain good results. Given this context, one interesting alternative not explored in the literature is the usage of cluster-level constraints and distance metric learning in the same algorithm. To the best of our knowledge, there is no such algorithm in the literature. Since cluster-level constraints can carry more information than instance-level constraints, fewer user interventions may be needed to achieve good clustering performance. Moreover, employing metric learning approaches helps to overcome problems such as clusters of different shapes and to fit different kinds of data.

In this sense, in this work we present HCAC-ML (Hierarchical Confidence-based Active Clustering with Metric Learning), a semi-supervised hierarchical clustering algorithm that uses cluster-level constraints and distance metric learning. The core of the algorithm is based on the HCAC algorithm [5], which outperforms other semi-supervised algorithms in various scenarios. HCAC uses cluster-level constraints posed by the user along an agglomerative clustering process. Reported results indicate that HCAC achieves good results even when the number of constraints is small, which is especially interesting in practical applications. Unlike other semi-supervised algorithms, HCAC explores hierarchical clustering, which has large practical importance. The data structure provided by hierarchical clustering algorithms offers a visualization of the data at different levels of abstraction [14], which facilitates the comprehension and the navigation over the data collection. Moreover, hierarchical clustering algorithms do not require prior knowledge about the number of clusters, which is an important drawback of partitional clustering algorithms.

In HCAC-ML, we adapt the ITML (Information-Theoretic Metric Learning) algorithm [15] in order to learn a Mahalanobis distance [16] using the cluster-level constraints obtained by HCAC. Each user intervention generates one cluster-level constraint, which is further derived into several instance-level constraints to serve as input to the ITML algorithm. Thus, HCAC-ML requires fewer user interventions to achieve good clustering performance. Our experimental evaluation shows that HCAC-ML can efficiently use the information provided by the user and outperforms other algorithms with few user interventions in different scenarios.

This paper is organized as follows. In the next section, we present some related work on semi-supervised clustering and distance metric learning. Then, in Section III we present the HCAC-ML algorithm and describe its three basic steps. In Section IV we present our experimental methodology and report the results. Finally, in Section V we present our conclusions and point to some future work.

II. RELATED WORK

Feature learning and distance metric learning have been extensively explored in the last years. Feature learning aims at obtaining a new set of features with smaller dimension that preserves some characteristics of the data [17], [18]. Distance metric learning, on the other hand, focuses on learning specific distance metrics for an application, given that the concept of similarity is subjective and may not be captured by conventional distance metrics [19].

In this work, we focus on distance metric learning approaches, which are more appropriate for the context of semi-supervised clustering. Existing algorithms for clustering applications learn distance metrics based on string-edit distance [20], Kullback–Leibler divergence [11], Euclidean distance [4] and Mahalanobis distance [21], [22]. The last one is the simplest and most common approach and consists in treating the problem of obtaining a similarity function as learning a Mahalanobis distance of the form d(x, y) = ||x − y||²_A = (x − y)^T A (x − y), where A is a parameter (covariance) matrix. The values of the A matrix are modified along the iterations in order to optimize the adequacy of the distance function to the constraints. Due to its simplicity and versatility, in this work we focus on learning Mahalanobis-like distance metrics instead of other approaches.

In [22], the Mahalanobis distance is learned through must-link and cannot-link constraints, which are used to indicate whether two elements are near or not. The authors address this problem as a convex optimization problem, in order to obtain results without local optimum problems. In [21], pairwise constraints are also used, and a gradient descent approach is employed to optimize the Mahalanobis weight matrix according to the given constraints.

The MPC-KMeans algorithm [10] also learns a Mahalanobis distance function. This algorithm employs pairwise constraints to learn a distance metric in a KMeans setting [23]. The constraints are relaxed, allowing the violation of some constraints in cases where this would result in clusters with higher cohesion. For each broken constraint, a penalty is added to the objective function. In MPC-KMeans, a distance metric is learned for each cluster, obtaining more cohesive clusters and being more adaptive to different shapes. This algorithm follows the EM approach. In the E-step, each object is assigned to a cluster such that the sum of the distance of this object to the centroid of its cluster and the penalty for the possible violation of constraints for that assignment is minimized. Next, in the M-step, a distance learning process is performed for each cluster, in order to learn the distance that best fits that data.

An algorithm-independent approach is provided by the Information-Theoretic Metric Learning (ITML) algorithm [15]. ITML learns the covariance matrix A used in the Mahalanobis distance parameterization guided by a set of pairwise must-link and cannot-link constraints and two thresholds u and l. According to the ITML approach, pairs of objects involved in must-link constraints must have pairwise distance below u, while objects involved in cannot-link constraints must have pairwise distance above l. Due to this procedure, ITML does not rely on clustering results to adjust the covariance matrix, unlike other distance-based algorithms.

All of these related works achieve good results in clustering quality. Most of them, however, are algorithm-driven and are harder to adapt to employ cluster-level constraints. Due to its adaptability and algorithm independence, we adapted the ITML algorithm to work with the cluster-level constraints provided by the HCAC algorithm. The combination of HCAC and ITML generated the HCAC-ML algorithm, which is explained in detail in the next section.

III. HCAC-ML: HIERARCHICAL CONFIDENCE-BASED ACTIVE CLUSTERING WITH METRIC LEARNING

In this work, we propose the Hierarchical Confidence-based Active Clustering with Metric Learning (HCAC-ML) algorithm, an algorithm based on the HCAC algorithm [5] that performs metric learning through an adaptation of the ITML algorithm [15]. The HCAC-ML algorithm can be described in three basic steps, as presented in Figure 1: (i) hierarchical confidence-based clustering; (ii) metric learning through ITML; and (iii) unsupervised clustering. Each of these steps is detailed in the next sections.

A. Hierarchical Confidence-based Clustering

In this first step, we cluster the dataset following the HCAC algorithm [5]. The choice for HCAC is due to its good performance in different scenarios: it outperforms other hierarchical clustering algorithms when the number of user interventions is small, which shows that the HCAC active learning approach and the posed cluster-level constraints are efficient. This step aims at obtaining the cluster-level constraints to be used in the metric learning procedure of the next step (ITML).

The pseudocode of the HCAC procedure can be observed in Algorithm 1. HCAC performs an agglomerative clustering procedure and asks for user intervention to pose cluster-level constraints when a low-confidence cluster merge is detected.
Fig. 1. Steps of the HCAC-ML algorithm.
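All three steps revolve around the parameterized Mahalanobis form d(x, y) = (x − y)^T A (x − y) discussed in Section II: A starts as the identity matrix (plain squared Euclidean distance) and is later adjusted by the metric-learning step. The snippet below is only an illustration of this form in Python/NumPy; the example matrix is arbitrary and is not a matrix learned by HCAC-ML.

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance d_A(x, y) = (x - y)^T A (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ A @ diff)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
A_identity = np.eye(2)                    # identity matrix: ordinary squared Euclidean distance
A_example  = np.array([[2.0, 0.5],        # arbitrary positive-definite matrix standing in for
                       [0.5, 1.0]])       # the output of the metric-learning step

print(mahalanobis_sq(x, y, A_identity))   # 5.0
print(mahalanobis_sq(x, y, A_example))    # 4.0
```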

Algorithm 1: Hierarchical Confidence-based Active Clustering procedure.
Input:  X = {x1, x2, ..., xn}: dataset of size n
        δ: confidence threshold
        µ: number of pairs of clusters to be shown to the user
        ε: maximum number of user interventions
        dist(·,·): distance function
Output: C: set of clusters generated by the algorithm
        Ψ: set of constraints posed by the user
 1 begin
 2   for ∀ xi ∈ X do
 3     ci ← xi
 4   end
 5   for ∀ ci ∈ C do
 6     for ∀ cj ∈ C, cj ≠ ci do
 7       DMi,j ← dist(ci, cj)
 8     end
 9   end
10   int ← 0; new ← n + 1
11   while |DM| ≠ 1 do
12     minDist ← DMi,j : DMi,j = min DMx,y, cx ≠ cy
13     secMinDist ← DMr,s : DMr,s = min DMk,l, cl ∈ {ci, cj}, ck ∉ {ci, cj}
14     θ ← secMinDist − minDist
15     if θ ≤ δ and int ≤ ε then
16       sortedNeighbors ← nearestNeighbors(ci, cj, µ)
17       (cnew, ψint) ← userChoice(sortedNeighbors, ci, cj)
18       int ← int + 1
19     end
20     else
21       cnew ← merge(ci, cj)
22     end
23     updateMatrix(DM, cnew)
24     for ∀ p ∈ DM, p ≠ new do
25       DMnew,p ← average(DMp,i, DMp,j)
26     end
27   end
28 end

The confidence of a merge is related to the distance between the elements of the proposed merge and the other elements near them. In essence, if a pair of elements is close but far from the other elements of the dataset, the merging confidence is high. On the other hand, if this pair of elements is also very near to other elements in the dataset, it might be advisable to ask for user intervention to validate this merge. Experimental evaluations show that this concept is useful in detecting cluster borders [5].

Let us consider a distance function dist(·,·) between data instances in a dataset, which is used to produce the distance matrix DM. Along the clustering iterations, as instances and clusters are merged, the distances between the formed clusters and the remaining elements are updated in the DM matrix according to the UPGMA approach [24]. The natural merge (unsupervised merge) in each step of an agglomerative hierarchical clustering process involves the nearest pair of elements ci and cj, with a distance DMi,j. The confidence θ of this merge is calculated as the difference between DMi,j and DMr,s, where DMr,s = min DMk,l, k ≠ l, (k, l) ≠ (i, j), k ∈ {i, j} ⊕ l ∈ {i, j}. In practical terms, low-confidence merges are those where the confidence θ is below a threshold δ, which is defined a priori. To determine the value of δ, we perform a calibration procedure by running an unsupervised clustering process and storing the confidence value of each cluster merge in an ordered list Conf. The value of δ can then be found by picking Conf_ε, where ε is the desired number of interventions.

Once a low confidence is detected, the user intervention is required to pose a cluster-level constraint. In this intervention, the user is asked to point out the best cluster merge in that given step. A pool of µ pairs of clusters, where µ is defined a priori, is presented to the user. This pool is assembled in the nearestNeighbors function and contains the nearest pair of clusters (ci, cj) and the set containing the µ − 1 pairs (cx, cy) corresponding to the best unsupervised cluster merges involving ci or cj (i.e., (cx, cy) ≠ (ci, cj) and x ∈ {i, j} ⊕ y ∈ {i, j}). Each of these pairs of clusters is presented to the user using some cluster summarization technique, such as word clouds in text clustering or parallel coordinates in other applications, in order to make it easier for the user to comprehend these clusters.
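The low-confidence test described above can be illustrated with a small, self-contained sketch. It only computes θ = secMinDist − minDist over a square distance matrix of the current clusters and compares it with δ; the pool construction, the user query and the matrix updates of Algorithm 1 are omitted, and the function name and example values are ours, not the authors' code.

```python
import numpy as np

def merge_confidence(DM):
    """Return the closest pair (i, j) and its merge confidence theta = secMinDist - minDist."""
    n = DM.shape[0]
    iu = np.triu_indices(n, k=1)
    best = np.argmin(DM[iu])                     # closest pair: the candidate unsupervised merge
    i, j = int(iu[0][best]), int(iu[1][best])
    min_dist = DM[i, j]
    # Second-smallest distance among pairs that involve exactly one of the two clusters.
    sec_min = min(DM[k, l] for k in range(n) for l in range(k + 1, n)
                  if (k in (i, j)) != (l in (i, j)))
    return (i, j), sec_min - min_dist

DM = np.array([[0.0, 1.0, 1.1, 4.0],
               [1.0, 0.0, 3.5, 4.2],
               [1.1, 3.5, 0.0, 3.8],
               [4.0, 4.2, 3.8, 0.0]])
delta = 0.5                                      # confidence threshold chosen a priori
(i, j), theta = merge_confidence(DM)
print((i, j), round(theta, 2), theta <= delta)   # (0, 1) 0.1 True -> ask the user to validate
```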
The number of user interventions must not exceed a predefined value ε. Once this threshold ε is reached, the clustering process resumes in an unsupervised way. Each of the t = 1, ..., ε user interventions posed in the userChoice function generates a constraint ψt = (CSt, CDt), where CSt = (cl, cm, DMi,j) contains the pair (cl, cm) selected to be merged instead of the pairs of clusters in the set CDt, together with the distance between the nearest pair of clusters (ci, cj). The set CDt is composed of the pairs of clusters in the set sortedNeighbors whose distance is smaller than DMl,m.

An example of this procedure is illustrated in Figure 2. In this figure, we assume that we have two underlying clusters represented by the two sets of points. Each of these points corresponds to a subcluster. Assuming that this is a low-confidence merge and we show the set of pairs of clusters {(A, B), (B, D), (A, C), (A, E)} for the user to choose the next cluster merge, the user should select the pair (A, C), since it appears to be the best option (the nearest pair of subclusters belonging to the same underlying cluster).

Then, it is intuitive to assume that, in the new distance function to be learned, the distance between the pair (A, C) should be smaller than the distance between the non-selected pairs. Thus, the distance between the nearest pair of clusters is taken as the u (upper bound) parameter for this similarity constraint. Similarly, we assume that the distance between the non-selected pairs (A, B) and (B, D) should be greater than the distance between (A, C), which is taken as the l (lower bound) parameter for this dissimilarity constraint.

Fig. 2. Generating similarity and dissimilarity constraints with the constraints provided by HCAC. In the example, the cluster distances are d(A,B) = 1.0, d(B,D) = 1.3, d(A,C) = 1.5 and d(A,E) = 2.3, and the user intervention yields the cluster similarity constraint CS = {(A, C, 1.0)} and the cluster dissimilarity constraints CD = {(A, B, 1.5), (B, D, 1.5)}.

Once we have sets of cluster similarity (CS) and cluster dissimilarity (CD) constraints among pairs of clusters, we have to break each of these constraints into sets of instance-level constraints. Each CSt set generates a set ISt containing instance-level similarity constraints. Analogously, each CDt set generates an instance-level dissimilarity set IDt. Both ISt and IDt are used as input for the ITML algorithm in the next step of HCAC-ML.

The procedure of transforming cluster-level constraints into instance-level constraints is illustrated in Figure 3. In this figure, we assume that A = {a1, a2} and B = {b1, b2, b3} are clusters composed of sets of elements. As we have a dissimilarity constraint between A and B, we pose a dissimilarity constraint between every pair (ax, by), ax ∈ A and by ∈ B.

Fig. 3. Generating instance-level constraints from the cluster-level constraints provided by HCAC. In the example, CD(A, B, 1.5) → {ID(a1, b1, 1.5), ID(a1, b2, 1.5), ID(a1, b3, 1.5), ID(a2, b1, 1.5), ID(a2, b2, 1.5), ID(a2, b3, 1.5)}.

Finally, we assemble the sets IS and ID, such that IS = IS1 ∪ IS2 ∪ ... ∪ ISε and ID = ID1 ∪ ID2 ∪ ... ∪ IDε. The IS and ID sets are used as input in the metric learning step, presented in the next section.

B. Metric Learning

Once the HCAC procedure finishes, the instance-level constraints generated by the ε cluster-level constraints posed by the user are used to learn a Mahalanobis distance through an ITML approach [15].

The pseudocode of the modified ITML procedure adapted for HCAC-ML is presented in Algorithm 2. It is possible to observe that this procedure relies on a set IS of instance similarity constraints and a set ID of instance dissimilarity constraints; sets U and L of upper and lower distance bounds; a slack parameter γ; and an input covariance matrix A0. The sets IS, ID, U and L are generated in the previous step, derived from the HCAC cluster-level constraints.

Instead of using single upper bound and lower bound values, as in the original ITML algorithm, in HCAC-ML we have sets of values. These values are obtained for each constraint and reflect the hierarchical nature of the clustering algorithm. Since the distances are smaller for cluster merges in the lower levels of the cluster hierarchy, the upper and lower bounds of constraints posed in these levels are also smaller. The usage of different values for these boundaries leads to a faster convergence of ITML, since it may have more than one constraint for the same pair of instances.

The matrix A0 is the identity matrix, which makes the Mahalanobis distance equal to the Euclidean distance. The parameter γ controls the intensity of the adjustment of the covariance matrix A. This parameter regularizes the trade-off between satisfying the constraints and minimizing the distance between A and A0, and varies in the interval [0, 1].
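Before moving to the modified ITML procedure, the expansion of Figures 2 and 3 can be written down directly. The sketch below is an illustrative reading of that step, assuming clusters are given as lists of instance indices; the function name and tuple format are ours, not the authors' implementation. Each cluster-level similarity constraint is expanded into instance-level pairs bounded above by u, and each cluster-level dissimilarity constraint into instance-level pairs bounded below by l.

```python
from itertools import product

def expand_constraint(cluster_a, cluster_b, bound):
    """Instance-level pairs between two clusters, each annotated with the same distance bound."""
    return [(x, y, bound) for x, y in product(cluster_a, cluster_b)]

# Example from Fig. 3: clusters A = {a1, a2} and B = {b1, b2, b3} as instance indices.
A_cluster = [0, 1]          # a1, a2
B_cluster = [2, 3, 4]       # b1, b2, b3

# CD(A, B, 1.5): cluster-level dissimilarity constraint with lower bound l = 1.5.
ID_t = expand_constraint(A_cluster, B_cluster, 1.5)
print(ID_t)
# [(0, 2, 1.5), (0, 3, 1.5), (0, 4, 1.5), (1, 2, 1.5), (1, 3, 1.5), (1, 4, 1.5)]

# A cluster-level similarity constraint such as CS(A, C, 1.0) would be expanded in the same
# way, with u = 1.0 used as the upper bound of every generated instance-level pair.
```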
Algorithm 2: Modified Information-Theoretic Metric Learning procedure.
Input:  X: data matrix of size d × n, where d is the number of attributes and n is the number of examples
        IS: set of instance similarity constraints
        ID: set of instance dissimilarity constraints
        U, L: sets of upper and lower bounds of distance for each constraint
        A0: input covariance matrix
        γ: slack parameter
        c: constraint indexing function
Output: A: learned covariance matrix for the Mahalanobis distance
 1 A ← A0
 2 λi,j ← 0, ∀ i, j
 3 if (i, j) ∈ IS then
 4   ξc(i,j) ← Ui,j
 5 else
 6   ξc(i,j) ← Li,j
 7 end
 8 for ∀ IS(i, j) ∈ IS and ID(i, j) ∈ ID do
 9   Get IS(i, j) ∈ IS or ID(i, j) ∈ ID
10   p ← (xi − xj)^T A (xi − xj)
11   if (i, j) ∈ ID then
12     δ ← Li,j
13   else
14     δ ← −Li,j
15   end
16   α ← min(λi,j, (δ/2)(1/p − γ/ξc(i,j)))
17   β ← δα / (1 − δαp)
18   ξc(i,j) ← γ ξc(i,j) / (γ + δα ξc(i,j))
19   λi,j ← λi,j − α
20   A ← A + β A (xi − xj)(xi − xj)^T A
21 end

At each iteration of the ITML procedure, one must-link or cannot-link constraint is used to adjust the A matrix. Unlike the original ITML algorithm, which performs multiple iterations over the constraints, HCAC-ML performs a single iteration. This is possible due to the relatively large number of pairwise instance-level constraints derived from the cluster-level constraints. Performing multiple iterations should improve the convergence, but it also implies an additional computational cost.

C. Unsupervised Clustering

Once the covariance matrix is adjusted, HCAC-ML uses this new covariance matrix to calculate a new distance matrix DM using the Mahalanobis distance. The matrix DM is used as input for the unsupervised clustering algorithm Average-Link (UPGMA) [24]. The result of the Average-Link algorithm is the output of HCAC-ML.
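A minimal sketch of this final step, assuming A is the positive-definite matrix produced by the metric-learning step (here replaced by an arbitrary example matrix): SciPy's pdist accepts a callable metric, so the squared Mahalanobis form used throughout the paper can be plugged in directly and the resulting condensed distance matrix fed to plain average linkage (UPGMA).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])   # toy dataset
A = np.array([[1.5, 0.2],
              [0.2, 1.0]])          # stand-in for the covariance matrix learned in step B

# New distance matrix DM under d_A(x, y) = (x - y)^T A (x - y), in condensed form.
dm = pdist(X, metric=lambda x, y: (x - y) @ A @ (x - y))

# Plain average-linkage (UPGMA) clustering over DM; the dendrogram Z is the output of HCAC-ML.
Z = linkage(dm, method="average")
print(squareform(dm).round(2))
print(fcluster(Z, t=2, criterion="maxclust"))    # e.g. a flat 2-cluster cut of the hierarchy
```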


IV. EXPERIMENTAL EVALUATION

In this paper, we compared HCAC-ML with other state-of-the-art semi-supervised hierarchical clustering algorithms: HCAC [5]; HCAC-LC [25] (Hierarchical Confidence-based Active Clustering with Limited Constraints, a variation of HCAC that only poses constraints when cluster sizes are greater than 2); Constrained Complete Link (CCL) [4]; and a hierarchical clustering algorithm that employs pairwise must-link and cannot-link constraints [2], [26]. Also, we compared these algorithms with an unsupervised clustering algorithm (UPGMA [24]), which is used as a baseline.

A. Evaluation Methodology

For the comparison of these algorithms, we used 26 datasets. From these datasets, 13 were real-world numerical datasets from the UCI repository (http://archive.ics.uci.edu/ml/datasets.html) and from the MULAN repository (http://mulan.sourceforge.net/datasets.html). A description of these datasets can be observed in Table I. The remaining 13 datasets were artificially generated bi-dimensional datasets (available at http://sites.labic.icmc.usp.br/bmnogueira/artificial.html), varying the number of clusters in each dataset from 2 to 30. All datasets are perfectly balanced, with 30 examples in each cluster. Each cluster is formed by the combination of two normal distributions (one for the x-axis and the other for the y-axis), separated by a constant distance, and is therefore well shaped.

TABLE I
DESCRIPTION OF THE REAL-WORLD NUMERICAL DATASETS USED IN THE EXPERIMENTS.

Dataset                    # Examples   # Classes
Breast Cancer Wisconsin       683           2
Ecoli                         336           8
Emotions                      593          27
Glass                         214           6
Haberman                      306           2
Image Segmentation            210           7
Ionosphere                    351           2
Iris                          150           3
Pima                          768           2
Vertebral Column              310           3
Vowel                         990          10
Wine                          178           3
Zoo                           101           7

We have used objective measures to simulate the human interaction in the semi-supervised algorithms. These measures model the user's behaviour through the labels provided with the datasets to automatically answer the queries posed by the algorithms. In HCAC, HCAC-LC and HCAC-ML we used the entropy measure [27] in order to select the best cluster merge. Entropy is an easily calculated and naive measure that reflects the "purity" of a cluster. We assumed that the user would choose the cluster merge that generates the cluster with the smallest possible entropy. For the algorithm using pairwise constraints, we inserted pairwise constraints for randomly chosen pairs of instances. As suggested by [26], if the elements belong to the same class, a must-link constraint was added and the distance between this pair was set to zero. Otherwise, a cannot-link constraint was added and the distance was set to infinity. Finally, in the CCL algorithm, it was established that the roots of the next proposed merge have to be merged if they present an entropy equal to or lower than 0.2. This value was obtained through a preliminary tuning procedure varying the entropy in the interval [0.2, 0.8].

In HCAC-ML we have tested different values for the slack parameter γ, varying from 0.1 to 1.0 in increments of 0.1. We have tried different numbers of user interventions in the clustering process, varying the number of desired interventions ε over 1%, 5%, 10%, 20%, 30%, ..., 100% of the number of merges in the agglomerative clustering process (which is equal to the number of instances in the dataset minus one). HCAC-based algorithms are allowed to ask for user intervention in the first ε low-confidence merges. For the CCL algorithm, a distance threshold t was calibrated for each dataset so that posing queries in merges involving distances between clusters larger than t does not exceed ε queries. Finally, in the algorithm using pairwise constraints, ε random pairs of objects were drawn and the adequate constraint was imposed.

In the case of the HCAC and HCAC-LC algorithms, we have also tested two different numbers of pairs of elements in the pool: 5 and 10. In HCAC-ML, we tested with only 5 pairs in the pool, since preliminary evaluations show that this algorithm is not very sensitive to the pool size.

In the evaluation, we have used a variation of stratified 10-fold cross validation, dividing the dataset into 10 subsets. We repeated the clustering procedure 10 times, leaving one fold out of the dataset at each iteration. Each resulting clustering was evaluated through the FScore measure [28], [29], which is very adequate for hierarchical clustering. The FScore for a class Ki is the maximum value of FScore obtained at any cluster Cj of the hierarchy, which can be calculated according to Equation 1:

    F(Ki, Cj) = (2 · R(Ki, Cj) · P(Ki, Cj)) / (R(Ki, Cj) + P(Ki, Cj))        (1)

where R(Ki, Cj) is the recall for the class Ki in the cluster Cj, defined as nij / |Ki| (nij is the number of elements of Cj that belong to Ki), and P(Ki, Cj) is the precision, defined as nij / |Cj|. The FScore value for a clustering is calculated by the weighted average of the FScore for each class, as shown in Equation 2:

    FScore = Σ_{i=1..k} (ni / n) · F(Ki)        (2)

The final FScore value for a given dataset is the average of the FScore values for each of the 10 folds. The non-parametric Wilcoxon [30] statistical test was used to detect statistical significance in the differences of the algorithms' performance, considering an α of 0.05. The test was applied to compare HCAC-ML against each of the other algorithms.

B. Results and Discussion

In order to illustrate the algorithms' performance in terms of FScore values, in Figure 4 we present a compilation of the final FScore for each method in each configuration. For the sake of space limitation, we report the FScore results of 6 datasets.

In these results, it is possible to observe that HCAC-ML can achieve good results in various scenarios, especially when the number of constraints is small (between 1% and 20% of user intervention). Also, the performance of HCAC-ML does not vary much as the number of interventions increases. This is due to the fact that, when posing a larger number of cluster-level constraints, the derived pairwise constraints become repeated and redundant. In practical applications, these are interesting features, since we want to ask the user as few questions as possible.

HCAC-ML also presents good performance in datasets containing more underlying clusters (classes). In these datasets, more cluster borders are present and more information is needed to correctly separate them. In such scenarios, metric learning is useful for clustering tasks since it distorts the feature space, making it easier to separate clusters.

It is also possible to notice that, when the number of constraints is larger, the performance of all algorithms is very similar. Thus, in a hypothetical case in which one has "infinite" user availability, one can choose the clustering method with the smallest computational cost.

The methods' results are summarized with the Wilcoxon statistical test and are presented in Table II. The symbol ▲ indicates that HCAC-ML statistically outperforms the compared method; △ indicates that it outperforms the compared method with no statistical significance; ▽ indicates that HCAC-ML is outperformed by the compared algorithm with no statistical significance; and ▼ indicates that HCAC-ML is outperformed by the compared algorithm with statistical significance. In parentheses, we have the number of datasets in which HCAC-ML performs better and worse than the compared algorithm.

By analyzing these results, it is possible to observe that HCAC-ML outperforms all other methods in the interval between 1% and 20% of user intervention in most of the datasets. As discussed before, this is an important feature for practical applications. As expected, HCAC-ML outperforms the instance-level based algorithms (Pairwise and CCL) and our unsupervised baseline algorithm in most of the comparisons. This shows that the information acquired in the HCAC step is efficient and appropriately inserted.

HCAC-ML is outperformed by HCAC and HCAC-LC in scenarios where the number of constraints inserted is large and/or the number of pairs in the pool of candidates is larger (10 pairs of clusters). These scenarios are not very relevant, since asking the user in almost 100% of the cluster merges is infeasible. Moreover, presenting more pairs to the user intuitively leads to achieving better results. However, the larger the number of options, the more cognitive effort is demanded from the user.
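For reference, the FScore of Equations 1 and 2, used in Figure 4 and throughout this section, can be computed with a few lines of Python. The sketch below is self-contained and uses made-up toy data; classes and hierarchy clusters are represented as sets of instance indices.

```python
def fscore(classes, clusters):
    """Hierarchical FScore (Equations 1 and 2); classes and clusters are sets of instance indices."""
    n = sum(len(K) for K in classes)
    total = 0.0
    for K in classes:
        best = 0.0
        for C in clusters:                       # every cluster of the hierarchy, at any level
            nij = len(K & C)
            if nij == 0:
                continue
            recall, precision = nij / len(K), nij / len(C)
            best = max(best, 2 * recall * precision / (recall + precision))   # Equation 1
        total += (len(K) / n) * best             # weighted by class size, Equation 2
    return total

classes  = [{0, 1, 2}, {3, 4}]                          # toy ground-truth classes
clusters = [{0, 1}, {2, 3, 4}, {0, 1, 2, 3, 4}]         # toy clusters taken from a dendrogram
print(round(fscore(classes, clusters), 3))              # 0.8
```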
Fig. 4. FScore results. In each plot, the perimeter shows the amount of user intervention (in %) and the radius shows the FScore.

TABLE II
RESULTS OF THE STATISTICAL COMPARISON OF HCAC-ML AGAINST OTHER METHODS.
(▲: HCAC-ML statistically better; △: better, not significant; ▽: worse, not significant; ▼: statistically worse. In parentheses: number of datasets in which HCAC-ML performs better/worse.)

  %   HCAC LC 5   HCAC LC 10   HCAC 5     HCAC 10    PAIRWISE   CCL        AVERAGE
  1   ▲(20/3)     ▲(20/3)      ▲(21/3)    ▲(21/3)    ▲(19/4)    ▲(22/3)    ▲(23/3)
  5   ▲(18/5)     △(18/6)      ▲(20/4)    ▲(20/5)    ▲(20/4)    ▲(21/4)    ▲(23/3)
 10   ▲(19/6)     △(18/7)      ▲(20/4)    ▲(17/4)    ▲(18/5)    ▲(20/6)    ▲(22/3)
 20   △(13/6)     △(14/7)      △(16/6)    △(15/5)    △(18/7)    ▲(19/6)    ▲(22/4)
 30   △(11/7)     ▽(9/10)      △(13/7)    △(14/8)    △(18/6)    △(17/7)    ▲(22/2)
 40   △(9/8)      ▽(8/14)      △(12/8)    △(11/9)    △(18/6)    △(14/8)    ▲(22/2)
 50   ▽(7/13)     ▼(6/16)      ▽(11/13)   △(11/10)   △(17/8)    △(12/7)    ▲(22/3)
 60   ▽(7/15)     ▼(4/16)      ▽(11/12)   ▽(9/13)    ▽(14/9)    ▽(9/11)    ▲(23/2)
 70   ▼(4/16)     ▼(5/18)      ▽(9/12)    ▽(8/13)    ▽(15/9)    ▽(7/12)    ▲(23/3)
 80   ▼(5/17)     ▼(3/19)      ▽(9/14)    ▼(6/17)    △(15/11)   ▽(7/16)    ▲(23/3)
 90   ▼(3/21)     ▼(2/22)      ▼(7/15)    ▼(5/18)    △(13/9)    ▽(5/16)    ▲(24/2)
100   ▼(2/22)     ▼(2/22)      ▼(6/18)    ▼(5/18)    △(12/10)   ▼(6/18)    ▲(23/2)
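Pairwise comparisons such as those summarized in Table II can be reproduced with SciPy's implementation of the Wilcoxon signed-rank test [30] at α = 0.05, pairing the per-dataset FScore values of HCAC-ML and a competing method. The values below are invented solely to illustrate the call.

```python
from scipy.stats import wilcoxon

# Invented per-dataset FScore values for HCAC-ML and one competing method (one value per dataset).
hcac_ml = [0.81, 0.74, 0.90, 0.66, 0.78, 0.83, 0.71, 0.88]
other   = [0.76, 0.70, 0.85, 0.69, 0.72, 0.80, 0.65, 0.84]

stat, p_value = wilcoxon(hcac_ml, other)
print(p_value, p_value < 0.05)   # difference considered statistically significant if p < 0.05
```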

V. CONCLUSION

In this paper, we present Hierarchical Confidence-based Active Clustering with Metric Learning (HCAC-ML), a hierarchical clustering algorithm that learns a Mahalanobis distance using cluster-level constraints. HCAC-ML is based on the HCAC algorithm, which uses cluster-level constraints posed through an active learning approach based on the confidence of a cluster merge. Each cluster-level constraint acquired with HCAC is derived into a set of instance-level pairwise similarity and dissimilarity constraints. These pairwise constraints are presented to a variation of the ITML algorithm, which learns the covariance matrix for a new Mahalanobis distance.

Experimental results using 26 datasets show that HCAC-ML outperforms other state-of-the-art semi-supervised hierarchical clustering algorithms. In particular, HCAC-ML has a strong performance when the number of constraints provided by the user is small. This makes HCAC-ML useful in practical applications where user availability is limited and expensive.

HCAC-ML has the drawback of being applicable only to scenarios where it is possible to employ a Mahalanobis distance. In this sense, it cannot be directly applied in contexts where the Mahalanobis distance is not adequate, such as text clustering applications and other sparse contexts. In these scenarios, an extra preprocessing step is necessary; for example, in textual datasets, one must normalize the data using the data versors [31]. Also, HCAC-ML presents the additional computational cost of requiring an extra (unsupervised) clustering step, performed after learning the distance metric, when compared to other semi-supervised clustering algorithms. However, this is an essential characteristic of distance metric learning algorithms. Given the good results achieved by HCAC-ML and considering that this additional step does not modify the complexity magnitude of the method, this additional cost may be considered not relevant.
In future work, we intend to report the performance variation of the method when performing multiple iterations of the ITML step instead of using a single-pass approach. Moreover, we intend to apply HCAC-ML to larger datasets, including textual datasets. Finally, we intend to compare the HCAC-ML performance with other distance metric learning algorithms. Given that existing algorithms perform partitional clustering, we must prune the dendrogram to be able to compare the performance of HCAC-ML with these methods.

ACKNOWLEDGMENT

The authors would like to acknowledge the financial support of the Coordination for the Improvement of Higher Education Personnel (CAPES), the Brazilian Council for Scientific and Technological Development (CNPq) and the Foundation for the Support and Development of Education, Science and Technology of Mato Grosso do Sul (FUNDECT - Siafem 25907, Process 147/2016).

REFERENCES

[1] S. Dasgupta and V. Ng, "Which clustering do you want? Inducing your ideal clustering with minimal feedback," Journal of Artificial Intelligence Research, vol. 39, pp. 581–632, 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1946417.1946430
[2] K. Wagstaff and C. Cardie, "Clustering with instance-level constraints," in ICML '00: Proceedings of the 17th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 1103–1110.
[3] N. Kumar, K. Kummamuru, and D. Paranjpe, "Semi-supervised clustering with metric learning using relative comparisons," in ICDM '05: Proceedings of the 5th IEEE International Conference on Data Mining. Washington, DC, USA: IEEE, 2005, pp. 693–696.
[4] D. Klein, S. D. Kamvar, and C. D. Manning, "From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering," in ICML '02: Proceedings of the 19th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers, 2002, pp. 307–314.
[5] B. M. Nogueira, A. M. Jorge, and S. O. Rezende, "HCAC: Semi-supervised hierarchical clustering using confidence-based active learning," in DS '12: Proceedings of the 15th International Conference on Discovery Science, ser. Lecture Notes in Computer Science, J.-G. Ganascia, P. Lenca, and J.-M. Petit, Eds. Springer Berlin Heidelberg, 2012, vol. 7569, pp. 139–153.
[6] M.-F. Balcan and A. Blum, "Clustering with interactive feedback," in ALT '08: Proceedings of the 19th International Conference on Algorithmic Learning Theory. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 316–328.
[7] Y. Huang and T. M. Mitchell, "Text clustering with extended user feedback," in SIGIR '06: Proceedings of the 29th ACM Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2006, pp. 413–420.
[8] A. Dubey, I. Bhattacharya, and S. Godbole, "A cluster-level semi-supervision model for interactive clustering," in ECML PKDD '10: Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part I. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 409–424. [Online]. Available: http://dl.acm.org/citation.cfm?id=1888258.1888292
[9] R. Huang and W. Lam, "An active learning framework for semi-supervised document clustering with language modeling," Data and Knowledge Engineering, vol. 68, no. 1, pp. 49–67, 2009.
[10] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," in ICML '04: Proceedings of the 21st International Conference on Machine Learning. New York, NY, USA: ACM, 2004, pp. 81–88.
[11] D. Cohn, R. Caruana, and A. McCallum, "Semi-supervised clustering with user feedback - Technical Report TR2003-1892," Cornell University, Tech. Rep., 2003.
[12] S. Basu, M. Bilenko, and R. J. Mooney, "A probabilistic framework for semi-supervised clustering," in KDD '04: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2004, pp. 59–68. [Online]. Available: http://doi.acm.org/10.1145/1014052.1014062
[13] V.-V. Vu, N. Labroche, and B. Bouchon-Meunier, "Improving constrained clustering with active query selection," Pattern Recognition, vol. 45, no. 4, pp. 1749–1758, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320311004407
[14] R. Gil-García and A. Pons-Porrata, "Hierarchical star clustering algorithm for dynamic document collections," in CIARP '08: Proceedings of the 13th Iberoamerican Congress on Pattern Recognition. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 187–194.
[15] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proceedings of the 24th International Conference on Machine Learning, ser. ICML '07. New York, NY, USA: ACM, 2007, pp. 209–216. [Online]. Available: http://doi.acm.org/10.1145/1273496.1273523
[16] P. C. Mahalanobis, "On the generalized distance in statistics," Proc. of the National Institute of Sciences (India), vol. 2, pp. 49–55, 1936.
[17] Z. Zhang, M. Zhao, and T. W. S. Chow, "Binary- and multi-class group sparse canonical correlation analysis for feature extraction and classification," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2192–2205, Oct 2013.
[18] Z. Zhang, S. Yan, and M. Zhao, "Pairwise sparsity preserving embedding for unsupervised subspace learning and classification," IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4640–4651, Dec 2013.
[19] B. Kulis, "Metric learning: A survey," Foundations and Trends in Machine Learning, vol. 5, no. 4, pp. 287–364, 2012.
[20] M. Bilenko and R. J. Mooney, "Adaptive duplicate detection using learnable string similarity measures," in KDD '03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2003, pp. 39–48. [Online]. Available: http://doi.acm.org/10.1145/956750.956759
[21] H.-j. Kim and S.-g. Lee, "An effective document clustering method using user-adaptable distance metrics," in SAC '02: Proceedings of the 9th ACM Symposium on Applied Computing. New York, NY, USA: ACM, 2002, pp. 16–20. [Online]. Available: http://doi.acm.org/10.1145/508791.508796
[22] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," in Advances in Neural Information Processing Systems 15, vol. 15. Cambridge, MA: MIT Press, 2003, pp. 505–512.
[23] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the V Berkeley Symposium on Mathematical Statistics and Probability, L. M. L. Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281–297.
[24] R. R. Sokal and C. D. Michener, "A statistical method for evaluating systematic relationships," University of Kansas Scientific Bulletin, vol. 28, pp. 1409–1438, 1958.
[25] B. M. Nogueira, "Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections," Ph.D. dissertation, Instituto de Ciências Matemáticas e de Computação, 2013.
[26] I. Davidson and S. S. Ravi, "Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results," Data Mining and Knowledge Discovery, vol. 18, no. 2, pp. 257–282, 2009.
[27] C. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 4, pp. 379–423, 1948.
[28] B. Larsen and C. Aone, "Fast and effective text mining using linear-time document clustering," in KDD '99: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 1999, pp. 16–22. [Online]. Available: http://doi.acm.org/10.1145/312129.312186
[29] R. M. Aliguliyev, "Performance evaluation of density-based clustering methods," Information Sciences, vol. 179, no. 20, pp. 3583–3602, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025509002564
[30] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://dx.doi.org/10.2307/3001968
[31] C. D. Manning, P. Raghavan, and H. Schütze, "Language models for information retrieval," in An Introduction to Information Retrieval. Cambridge University Press, 2008, ch. 12.
