A Hash-Based Co-Clustering Algorithm For Categorical Data
Article history: Received 23 February 2015; Revised 16 July 2016; Accepted 17 July 2016; Available online 20 July 2016.

Keywords: Co-clustering; Categorical data; Data mining; Text mining; Biclustering

Abstract

Cluster analysis, or clustering, refers to the analysis of the structural organization of a data set. This analysis is performed by grouping together objects of the data that are more similar among themselves than to objects of different groups. The sampled data may be described by numerical features or by a symbolic representation, known as categorical features. These features often require a transformation into numerical data in order to be properly handled by clustering algorithms. The transformation usually assigns a weight to each feature, calculated by a measure of importance (e.g., frequency, mutual information). A problem with this weight assignment is that the values are calculated with respect to the whole set of objects and features. This may pose a problem when a subset of the features has a high degree of importance for one subset of objects but a lower degree for another. One way to deal with this problem is to measure the importance of each subset of features only with respect to a subset of objects. This is known as co-clustering, which, similarly to clustering, is the task of finding subsets of objects and features that present a higher similarity among themselves than to other subsets of objects and features. As one might notice, this task has a higher complexity than traditional clustering and, if not properly dealt with, may present a scalability issue. In this paper we propose a novel co-clustering technique, called HBLCoClust, with the objective of extracting a set of co-clusters from a categorical data set, without the guarantees of an enumerative algorithm, but with the benefit of scalability. This is done by using a probabilistic clustering algorithm, named Locality Sensitive Hashing, together with the enumerative algorithm named InClose. The experimental results are competitive when applied to labeled categorical data sets and text corpora. Additionally, it is shown that the extracted co-clusters can be of practical use to expert systems such as Recommender Systems and Topic Extraction.

© 2016 Elsevier Ltd. All rights reserved.
1. Introduction

The abundance of data being collected nowadays demands a set of tools to automatically extract useful information from them. If the data are partially labeled, one possible piece of information to be extracted is a mathematical model that can deduce the label from a set of measured variables; this characterizes supervised learning. On the other hand, if the data are unlabeled, the information can be extracted by modeling a group structure of the objects that may describe the generating process of the data or may give a summarization of the information contained in it. This is referred to as unsupervised learning and it is commonly studied by means of clustering algorithms.

Data clustering refers to the task of dividing a data set into subsets of objects that are more similar to each other than to the remaining elements of the set. There is a wide range of applications, such as segmenting a surveyed population for market purposes (Morgan & Sonquist, 1963), image quantization (Feng, Chen, & Ye, 2007), frequent patterns of gene expressions (de França & Von Zuben, 2010) and many more.

In order to accomplish such a task, the objects of the data set are described by a set of features measured during the data collection. Each feature can be represented by a numerical quantity, such as the height of a person or the amount of gas measured in a car tank, or by a descriptive characteristic or category, such as the gender of a person or the research topics they are interested in.

Objects described by numerical features can be conveniently represented as numerical vectors and can be naturally compared to each other by using distance metrics. It makes sense, for instance, to say that one person has twice the height of another.

On the other hand, categorical features lack these properties and should be transformed into numerical features in order to be compared among themselves. For example, describing one person as male and another as female does not imply that one is more than the other.
Some similarity metrics were proposed to quantify the difference between objects with categorical features (Boriah, Chandola, & Kumar, 2008), mostly based on the matching features between two objects. One example is the Jaccard metric: given the sets of features of two objects, it calculates the ratio between the cardinality of the intersection of the two sets and the cardinality of their union.

One problem that must be dealt with when using categorical data is the high dimensionality. Since most clustering algorithms require a vectorial representation of the objects, the categorical features are usually represented as a binary vector, with every position indicating whether the object has a given feature or not. For example, if one measured feature is whether a person is male or female, it would be represented by a 2-dimensional vector.

But some categorical features may span tens, hundreds or even thousands of vector dimensions. When describing a song by its genre, each object would be represented as a vector with more than 1500 dimensions. This high dimension exponentially increases the search space and can cause a loss of precision in the similarity metrics. This is called the Curse of Dimensionality (Har-Peled, Indyk, & Motwani, 2012).

This problem can be dealt with by reducing the dimensionality of the objects while preserving the similarity relationship between them. Probabilistic Dimension Reduction (Har-Peled et al., 2012) is the family of algorithms that exploits the probability that two similar objects will be considered equal when a subset of features is randomly sampled and used for comparison.

One of these algorithms, called Minhashing (Broder, 1997; Zamora, Mendoza, & Allende, 2016), approximates the Jaccard Index between two objects. It relies on the fact that the probability that the first non-zero position of a random permutation of the feature vector is the same for two objects is equal to the Jaccard Index between them. The algorithm generates a smaller, dense representation of the data with this information.

The reduction of dimensionality has two drawbacks when applied prior to the clustering procedure: i) the compact representation may hide some seemingly unimportant features that could be used to describe a smaller group and, ii) the new set of features loses its interpretability, since there is no clear relationship between the original set and the reduced set.

A more direct approach is to perform the data clustering in a two-way manner, by finding the clusters of objects conditioned to a subset of features. This defines the family of algorithms known as co-clustering.

Data Co-Clustering (Dhillon, Mallela, & Modha, 2003; de França, 2012; Gao & Akoglu, 2014; Labiod & Nadif, 2011), also known as biclustering, tries to find subsets of objects and features that maximize the similarity of the selected objects when considering the chosen features. It exploits the fact that a given object may belong to different categories when viewed by different aspects of its description. For example, a given news text may report a story about the finances of a football team. This document will have terms that are related to sports and other terms that relate to economy. So, this object can be assigned to the group of sports-related documents and to the group of economy-related documents, depending on the selected set of words.

This technique allows for many relaxations of the constraints imposed by traditional clustering.¹ For example, each object may belong to more than one group, given a different subset of features. Additionally, one feature may be used to define different groups, associated with different subsets of features.

Moreover, since each group is explicitly defined by a subset of features, the reason for grouping together a subset of objects can be easily explained, thus improving the interpretability of the model.

In many situations these relaxations can benefit the cluster analysis. When a cluster analysis is performed, it is expected that the natural grouping of the data is related to the intended labeling of the objective of the study. But this expectation may not hold true. For example, when trying to classify a set of animals, the clustering algorithm may correctly identify that lions, deer and horses belong to the same class, but may incorrectly classify sea lions, octopuses and tuna as belonging to the same class if their common traits are prevalent. The co-clustering of this data set would still find these groups but, additionally, would find other groups relating sea lions with mammals, tuna with other fishes and octopuses with invertebrates.

Another possibility regards the topic extraction of a textual data set. In this task it is sought to infer the set of words that describes the topic of each document, based on the analysis of the whole data set. The usual techniques applied to such a task require prior knowledge of how many different topics there are in the corpus and then try to find the features responsible for the generative process of each cluster of documents. Again, the restriction of a fixed generative process for each document (i.e., the generative process of the only cluster it belongs to) may fail to acknowledge a second or third topic inside the text. With a co-clustering algorithm, the document can be grouped with different sets of documents regarding different topics and, as this procedure also highlights the features used to create each cluster, it makes the topic of each cluster explicit.

Finally, in Recommender Systems a clustering algorithm would group together users with similar tastes. But then again, only the prevalent common taste of each set of users will be taken into account. If, for example, a given user rates positively many comedy movies and just a few action movies, they would likely be grouped together with other users that share a taste for comedy. The co-clustering algorithm, on the other hand, could also assign them to a group of users that like action movies. Besides, the taste of each user could be described by the combined set of features of every group they belong to.

But this flexibility comes with a price: the number of groups that can be found is usually large, given all the possible combinations of subsets of objects and features. Some of these groups may be irrelevant to the subject of analysis, rendering a burden to the post-analysis procedure.

Some co-clustering algorithms coped with this problem by reintroducing some of the constraints of classical clustering, such as the search for a pre-specified number of clusters and the assignment of each object to only one group (Dhillon et al., 2003; Labiod & Nadif, 2011). These algorithms retain only the explicit description of which features were used as part of the clustering process.

Despite these difficulties, there are some co-clustering algorithms capable of dealing with such flexibility, the most recent being HBLCoClust (de França, 2012) and CoClusLSH (Gao & Akoglu, 2014). They both have in common the use of a probabilistic dimension reduction technique, named Locality Sensitive Hashing (LSH), used to find promising regions with a high probability of containing co-clusters.

In the original HBLCoClust algorithm, after the search for the promising regions, a graph partitioning technique, called METIS (Karypis & Kumar, 2012), was used in order to generate meaningful results, but constraining the algorithm to a pre-defined number of clusters.²

¹ Note: these relaxations are not present in every co-clustering algorithm.
² It is also worth mentioning that the newest versions of METIS did not compile correctly in some systems, thus making the HBLCoClust inaccessible to many users.
Fig. 1. Example of different quality measures: (a) constant bicluster, (b) constant rows, (c) constant columns, (d) coherent values (rows or columns should exhibit high
correlation).
This paper proposes a reformulation of the HBLCoClust algorithm that does not depend on external algorithms, while maintaining the scalability and automatically determining the number of groups to return. The general idea is to pre-process the set of features, eliminating those possessed by a minority of the objects; then the LSH algorithm is applied to the pre-processed data set in order to identify promising regions to be explored. These regions are then explored by an enumerative algorithm, called InClose, returning every co-cluster contained inside the given regions. Finally, the co-clusters sharing a percentage of their elements are merged in order to reduce the number of groups to be analyzed.

The experimental results will evaluate the groups generated by HBLCoClust through the consistency of the grouped objects according to their labels, the information conveyed by the subset of features of each group when considering each subset of objects, and the scalability with respect to the number of elements in the data set.

Additionally, some practical implications of the interpretability of the groups will be illustrated in order to show the usefulness of Data Co-Clustering in different applications.

In Section 2 the co-clustering definition will be described in more detail, as well as some of its practical applications. Section 3 explains the proposed algorithm with all of its key aspects. Next, in Section 4, a complete set of experiments will be performed in order to assess how well the proposed algorithm performs on real-world scenarios against other similar co-clustering algorithms. Finally, Section 5 will give final comments on this work along with some future perspectives.

2. Co-clustering

A classical application of co-clustering is the analysis of textual data. These data sets are composed of a set of documents described by their tokenized terms. A co-cluster of such a data set would represent a subset of documents that exclusively share a subset of these terms. Hopefully, this subset of terms describes the topic of the set of documents.

Formally, given the set O of objects and the set F of observed features, a Co-Cluster C of a subset O′ ⊂ O, a subset F′ ⊂ F and a relation R ⊂ O × F can be described as:

C(O′, F′) = {O′ × F′ | o ∈ O′, f ∈ F′ ∧ (o, f) ∈ R},   (1)

where O′ and F′ can also be denoted as the objects and features clusters, respectively.

The relation R contains all the tuples (o, f) present in the data set. Notice, though, that the constraint that every pair of object and feature of the co-cluster must exist in R may lead to a large set of very small groups.

So, in order to minimize the number of groups while maximizing the size of such groups, the previous equation can be reformulated as:

C(O′, F′) = {O′ × F′ | |R′| ≥ ρ · |O′ × F′|},   (2)

where |·| is the cardinality of a set and

R′ = {(o, f) ∈ R | o ∈ O′, f ∈ F′}.   (3)

Notice that there is no restriction on whether an object or a feature must belong to only one group. There are other possible variations of this problem, such as the (k, l)-Co-Clustering, which seeks to partition the data into k object clusters and l feature clusters by maximizing the number of non-null elements of the submatrices induced by each combination of k and l. Specifically, in this work the focus will be on the formulation given in Eq. (2).
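As a small illustration of the definitions above, the sketch below tests whether a candidate pair (O′, F′) satisfies the density constraint of Eq. (2) for a relation R stored as a set of (object, feature) tuples; setting rho = 1.0 recovers the exact requirement of Eq. (1). The code and all of its names are illustrative only, not part of the original implementation.

# Minimal sketch of the co-cluster definition in Eqs. (1)-(3).
# R is the relation: a set of (object, feature) tuples present in the data.
# A pair (objs, feats) is accepted when the sub-relation R' covers at least
# a fraction rho of the |objs| x |feats| sub-matrix (Eq. 2); rho = 1.0
# corresponds to the dense definition of Eq. (1).

def is_cocluster(objs, feats, R, rho=1.0):
    if not objs or not feats:
        return False
    # R' = {(o, f) in R | o in objs and f in feats}   (Eq. 3)
    r_prime = sum(1 for o in objs for f in feats if (o, f) in R)
    return r_prime >= rho * len(objs) * len(feats)    # Eq. (2)

# Toy usage: three documents described by tokenized terms.
R = {("d1", "ball"), ("d1", "team"), ("d2", "ball"),
     ("d2", "team"), ("d3", "team"), ("d3", "market")}
print(is_cocluster({"d1", "d2"}, {"ball", "team"}, R))             # True (fully dense)
print(is_cocluster({"d1", "d2", "d3"}, {"ball", "team"}, R, 0.8))  # True under rho = 0.8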
3.1. Locality sensitive hashing

The Jaccard Index between two objects described by sets of features is given by:

J(o1, o2) = |o1 ∩ o2| / |o1 ∪ o2|,   (4)

where o1 and o2 are the two objects being compared. The Jaccard Index varies from 0 to 1, with the latter meaning that the two objects are equal.

Given a predefined order for the set of features, the Minhash algorithm generates a random permutation π of this set and takes the first feature describing each object as its representative feature. The probability that two objects have the same representative feature is given by:

P(o1^π = o2^π) = |o1 ∩ o2| / |o1 ∪ o2|,   (5)

which is equal to the Jaccard Index.

So, the expected value of the Jaccard Index is estimated by averaging the number of collisions of two objects over k independent random permutations. Since generating random permutations can be computationally expensive, the permutation is approximated by a universal hash function such as the one proposed in Carter and Wegman (1979):

h(x) = (a · x + b) mod P,   (6)

where a and b are randomly chosen numbers from a uniform distribution, and P is a large prime number. The variable x is the value to be hashed, i.e., a number associated with the original index in the ordered set of features. Notice that this prime number should be at least as large as the number of features. This hash function will map each original index to an index in the range [0, P].

The Minhash of an object j for permutation i is simply the feature index x that minimizes the hash function hi(x):

mh_i(O_j) = arg min_x {h_i(x) | ∀x ∈ O_j}.   (7)

Given a data set of n objects, each containing an average of m̄ observed features out of m possible features, and k hash functions, the complexity of this algorithm is O(n · m̄ · k), with m̄ ≪ m on sparse data sets.

This idea was extended in Har-Peled et al. (2012) as a scalable algorithm for the Nearest-Neighbor problem named Locality Sensitive Hashing (LSH). This algorithm creates p hash signatures for each object by grouping together a sequence of k Minhashes. Each signature represents a bucket, and the probability that two objects o1 and o2 will collide into the same bucket is given by:

P_collision(o1, o2) = 1 − (1 − J^k)^p,   (8)

where J is their Jaccard Index, k is the number of minhashes used to build one hash signature and p is the number of hash signatures.

The values of k and p can be adjusted to maximize the probability of grouping together the objects with a desired similarity J.

It is interesting to notice that each bucket groups together a subset of objects and is described by a subset of k common features. So, the LSH can be used to find a variable number of co-clusters with k features. We will show in the next sections that these co-clusters can be used to limit the search space prior to the application of an enumerative algorithm.

3.2. Enumerative algorithm for co-clustering

In mathematics, there is a very similar field of study called Formal Concept Analysis (FCA) (Andrews, 2011). In this field, the data set is described as a Formal Context composed of a triple C = (O, F, R), where O stands for the set of objects, F is the set of features and R ⊆ O × F is the binary relationship of the objects with respect to the set of features. If a given object o ∈ O contains the feature f ∈ F, then (o, f) ∈ R. In FCA, the objective is to find pairs (O′, F′), called formal concepts of the context C, such that:

• every object in O′ has every feature in F′,
• for every object o ∈ O with o ∉ O′, there is at least one feature f ∈ F′ such that (o, f) ∉ R,
• for every feature f ∈ F with f ∉ F′, there is at least one object o ∈ O′ such that (o, f) ∉ R.

In other words, this technique seeks a maximal subset of objects O′ and subset of features F′ that satisfies Eq. (1).

In Andrews (2009) the enumerative algorithm InClose was proposed, and it was further improved in Andrews (2011). This algorithm guarantees the exact set of formal concepts (i.e., co-clusters) without generating the same solution more than once (i.e., avoiding redundant search).

For the sake of brevity and clarity, the InClose algorithm will be explained following the notation used in the previous sections (i.e., formal concepts will be called co-clusters from now on) and with a small adaptation that will be explained later in this section.

The general idea of the algorithm is to recursively enumerate the co-clusters by following the lexicographical order of the feature set F, expanding the current co-cluster until it becomes maximal and branching the algorithm to explore new candidate solutions.

Initially, the algorithm starts with the co-cluster (O′, F′) where O′ = O, the entire set of objects, and F′ = {f}, a subset containing the first feature of F when following the lexicographical order.

By following the lexicographical order of the features in F, the algorithm sequentially creates a new set F′′ = F′ ∪ {f′} for every feature f′ ∈ F that is positioned after f. For each F′′ it generates a new set O′′ that satisfies Eq. (1).

After this step, the subset F′ is merged with every generated set F′′ whose corresponding O′′ = O′; in other words, F′ will contain every feature that maintains the integrity of the set O′, maximizing the co-cluster.

The remaining pairs (O′′, F′′) are passed to a recursive call of the algorithm, together with the feature f′ that generated them, as long as (O′′, F′′) is canonical with respect to f′.

A co-cluster (O′, F′) is considered canonical with respect to a feature f if and only if there is no feature f′ < f that can be inserted into F′ such that Eq. (1) is still satisfied. This verification is required to avoid redundancy.

The InClose algorithm is summarized in Algorithm 1. Notice that two other parameters are included in this algorithm: minObjs and minFeats, which are used to discard co-clusters smaller than these thresholds.

3.3. Hash-based linear co-clustering algorithm

The Hash-Based Linear Co-Clustering algorithm was initially proposed in de França (2012) as a scalable algorithm to find k co-clusters from a categorical data set. This algorithm can be divided into three simple steps:

1. Find a co-cluster set C composed of a set of tuples (O′, F′) by applying the LSH algorithm. The set O′ is composed of the objects inside a bucket and the set F′ is the set of k features used as a hash signature.
2. Maximize the co-clusters in C by inserting features in F′ while satisfying Eq. (1) and then inserting objects in O′ obeying the same constraint.
3. Induce a graph from the set C where the vertices are the objects and the edges connect two objects that belong to the same co-cluster. After that, apply the graph partitioning algorithm METIS (Karypis & Kumar, 2012) to find a set of k co-clusters.

The third step is needed because the second step generates a high number of small co-clusters when dealing with a sparse data set. But this step introduced theoretical and technical drawbacks.
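Before detailing the reformulated algorithm, a minimal Python sketch of the minhashing and bucketing procedure of Section 3.1 may help make the hashing concrete. It builds p signatures of k minhashes each, using the universal hash of Eq. (6); it assumes objects are given as sets of integer feature indices, and all function and variable names are illustrative, not taken from the original implementation.

import random
from collections import defaultdict

P = 2_147_483_647  # a large prime, at least as large as the number of features (Eq. 6)

def make_hashes(count, seed=0):
    """Generate `count` universal hash functions h(x) = (a*x + b) mod P (Eq. 6)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(count)]

def minhash(obj, a, b):
    """Feature index hash minimized over the object's features (Eq. 7)."""
    return min((a * x + b) % P for x in obj)

def lsh_buckets(objects, k=2, p=4, seed=0):
    """Group objects into buckets keyed by signatures of k minhashes, p times.

    Two objects with Jaccard index J share at least one bucket with
    probability 1 - (1 - J**k)**p (Eq. 8)."""
    hashes = make_hashes(p * k, seed)
    buckets = defaultdict(set)
    for name, obj in objects.items():
        for i in range(p):
            sig = tuple(minhash(obj, a, b) for a, b in hashes[i * k:(i + 1) * k])
            buckets[(i, sig)].add(name)
    return buckets

# Toy usage: objects described by integer feature indices.
objects = {"o1": {0, 1, 2, 3}, "o2": {0, 1, 2, 4}, "o3": {7, 8, 9}}
buckets = lsh_buckets(objects, k=2, p=8)
together = sum(1 for members in buckets.values() if {"o1", "o2"} <= members)
print(together)  # o1 and o2 (Jaccard 0.6) are expected to share several buckets; o3 stays apart

In the co-clustering setting, each bucket key plays the role of the k-feature signature described above, and its members form the candidate object set of a promising region.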
Algorithm 1: InClose.
  input : (global) data set of objects, features and relations (O, F, R); minimum cardinality for the cluster of objects, minObjs, and for the cluster of features, minFeats.
  input : (local) initial subset of objects O′ and features F′, and current feature f.
  output: set of co-clusters C.

  /* Insert new features in lexicographical order. */
  Insertions ← {(f′, O′′) | f′ ∈ F ∧ f < f′ ∧ |O′′| ≥ minObjs, with O′′ = O′ ∩ {o ∈ O | (o, f′) ∈ R}};
  /* Generate the maximal co-cluster with the fixed set O′. */
  F′ ← F′ ∪ {f′ | (f′, O′′) ∈ Insertions ∧ O′′ = O′};
  /* List of candidates to be recursively expanded. */
  Candidates ← {(O′′, F′ ∪ {f′}, f′) | (f′, O′′) ∈ Insertions ∧ Canonical(O′′, F′)};
  if |F′| ≥ minFeats then
      /* The operator ++ means concatenation. */
      return (O′, F′) ++ [InClose(O′′, F′′, f′) | (O′′, F′′, f′) ∈ Candidates];
  else
      return [InClose(O′′, F′′, f′) | (O′′, F′′, f′) ∈ Candidates];

Regarding the theoretical drawbacks, the complexity of the first two steps is O(n), with n representing the cardinality of the set of relations R. The third step introduces a complexity proportional to O(m³) (Kernighan & Lin, 1970), with m representing the number of vertices of the induced graph or, in this case, the cardinality of the set of objects O. Also, this step introduces a number of additional parameters that should be adjusted, on top of the inherent parameters of HBLCoClust. Besides, it constrains the co-clusters to a k-co-clustering formulation, with a pre-specified k.

About the technical drawbacks, there were some issues with the compilation process under some operating systems that prevented the use of HBLCoClust, and the reproduction of the original experiments was compromised.

In order to eliminate the dependency on an external implementation and the constraint of defining the number of clusters, we now propose a new version of HBLCoClust that follows these six steps:

1. Remove rare features: remove any feature f ∈ F such that |{o | (o, f) ∈ R}| ≤ τ · |O|, for a specified 0 ≤ τ ≤ 1 [optional].
2. Find the promising regions with LSH: apply the LSH algorithm to find a set of 2 · p candidate tuples (O′, F′) with k features, by bucketing the objects with hash signatures built from the feature set and, also, bucketing the features with hash signatures built from the object set.
3. Create subsets of the original data set: expand each candidate (O′, F′), creating a set of promising regions (O′′, F′′) such that O′′ = {o ∈ O | (o, f) ∈ R, ∀f ∈ F′} and F′′ = {f ∈ F | (o, f) ∈ R, ∀o ∈ O′}.
4. Enumerate inside the promising regions: apply the InClose algorithm to every promising region, thus creating the co-clusters set C = {(O′, F′)}. Duplicated regions can be avoided by storing the already explored regions in a hash table.
5. Expand the co-clusters by allowing sparseness: sequentially insert objects and features into every co-cluster such that |{(o, f) ∉ R | o ∈ O′ ∧ f ∈ F′}| ≤ τ_sparse · |O′| · |F′|, for a specified 0 ≤ τ_sparse ≤ 1 [optional].
6. Merge related co-clusters: merge any two co-clusters that share the same subset of features.

In the first step, the algorithm optionally removes any feature that appears in less than a given percentage of the objects O. This step avoids that, during the next stage, the LSH algorithm samples one of these rare features to compose the hash signature, which could create a bucket containing a single object, thus generating an uninteresting region.

The second step is similar to the original algorithm. It first generates p · k random hash functions and then creates p buckets for every object by grouping k keys generated by the hash functions (see Section 3.1). Additionally, it creates another set of p · k random hash functions and then creates p buckets for every feature by grouping the keys generated from the hash functions applied to the objects set. This step generates a candidate set of regions to be explored.

For every generated region, a new subset of the original data set is created with the subset of objects that have a relationship with every feature of the specified region and, similarly, the subset of features that are related to every object of the region. This creates a sparse subset that contains at least one co-cluster (as defined by (O′, F′)).

These regions are then used by the enumerative algorithm InClose to find the complete set of co-clusters. Since this algorithm only enumerates co-clusters that satisfy Eq. (1), the next step optionally inserts objects and features that do not violate Eq. (2). The order of insertion is defined by the number of missing values each element introduces to the co-cluster.

Finally, since every co-cluster was generated from a constrained region of the original data set and further expanded by the introduction of sparseness, some co-clusters may share the same set of features but with a different set of objects. For this reason, those co-clusters sharing the same set of features are merged together.

The HBLCoClust algorithm requires a total of six parameters, some of them optional: the minimum relative frequency of features (τ, optional), the number of buckets (p) and number of hash functions per bucket (k), the minimum cardinality of the objects and features subsets (minObjs and minFeats), required by InClose, and the maximum allowed percentage of missing values (τ_sparse, optional). The sensitivity of these parameters will be shown in the next section.

The pseudo-algorithm for HBLCoClust is depicted in Algorithm 2, followed by some auxiliary routines in Algorithms 3 and 4.

One important thing to notice regarding the proposed algorithm is that the first three steps have a linear complexity with respect to the cardinality of the relations set R, while the fourth step is exponential with respect to the number of clusters contained inside the regions created in step 3. The last two steps are quadratic with respect to the average co-cluster cardinality. We will show in the next section empirical evidence that the whole algorithm scales linearly with respect to the cardinality of R for the tested data sets.

3.4. Literature review

In the literature, the most similar algorithm to HBLCoClust is the one proposed in Gao and Akoglu (2014), called CoClusLSH. This algorithm iteratively applies the LSH algorithm to the set of objects and to the set of features alternately, merging the groups defined by the buckets using an entropic metric. The authors tested CoClusLSH on a diverse set of real-world data, presenting competitive numerical results measured by purity, mutual information and number of clusters.
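As a companion to Algorithm 1, the sketch below shows a simplified, Close-by-One style enumeration of the dense co-clusters (formal concepts) of a relation R, including the canonicity test that prevents the same co-cluster from being generated twice. It omits the minObjs/minFeats thresholds and the incremental optimizations of the actual InClose implementation, and every name in it is illustrative rather than taken from the paper.

def enumerate_coclusters(objects, features, R):
    """Simplified Close-by-One style enumeration of all dense co-clusters
    (formal concepts) of the relation R, in the spirit of Algorithm 1."""
    order = {f: i for i, f in enumerate(features)}   # lexicographical order of F

    def intent(objs):
        # Maximal feature set shared by every object in objs.
        return frozenset(f for f in features if all((o, f) in R for o in objs))

    found = []

    def recurse(objs, feats, start):
        found.append((objs, feats))
        for j in range(start, len(features)):
            f = features[j]
            if f in feats:
                continue
            new_objs = frozenset(o for o in objs if (o, f) in R)
            if not new_objs:
                continue
            new_feats = intent(new_objs)   # close the candidate co-cluster
            # Canonicity test: no feature placed before f may have been added,
            # otherwise this co-cluster is generated from another branch.
            if all(g in feats for g in new_feats if order[g] < j):
                recurse(new_objs, new_feats, j + 1)

    all_objs = frozenset(objects)
    recurse(all_objs, intent(all_objs), 0)
    return found

# Toy usage: o1 and o2 share {f1, f2}; o3 has {f2, f3}.
objects = ["o1", "o2", "o3"]
features = ["f1", "f2", "f3"]
R = {("o1", "f1"), ("o1", "f2"), ("o2", "f1"), ("o2", "f2"), ("o3", "f2"), ("o3", "f3")}
for objs, feats in enumerate_coclusters(objects, features, R):
    print(sorted(objs), sorted(feats))   # three dense co-clusters, each printed once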
Table 3
Statistics of obtained Co-Clusters set for the Categorical data sets.

               #          Objs.   Feats.   Size
  HBLCoClust   27.07      1.00    0.80     83.07
  CoClusLSH    37.00      1.00    1.00     52.00
  InClose      67.00      0.81    0.75     114.00
  SpecCo       7.00       1.00    −−       −−
  Soybean S    #          Objs.   Feats.   Size
  HBLCoClust   13.80      1.00    57.78    83.13
  CoClusLSH    10.00      1.00    1.00     225.00
  InClose      225.00     1.00    70.83    121.00
  SpecCo       4.00       1.00    −−       −−
  Soybean L    #          Objs.   Feats.   Size
  HBLCoClust   42.40      1.00    0.74     205.73
  CoClusLSH    20.00      1.00    1.00     1089.00
  InClose      6470.00    0.98    0.93     109.00
  SpecCo       19.00      1.00    −−       −−
  House Votes  #          Objs.   Feats.   Size
  HBLCoClust   23.30      1.00    0.85     1173.80
  CoClusLSH    18.00      1.00    1.00     734.00
  InClose      124371     0.95    1.00     225.00
  SpecCo       2.00       1.00    −−       −−

Table 4
Obtained results for the categorical data sets.

Table 5
Statistics of obtained Co-Clusters set for the Textual data sets.

               #          Objs.   Feats.   Size
  HBLCoClust   219.20     1.00    0.60     2807.23
  SpecCo       3.00       1.00    −−       −−
  Multi5       #          Objs.   Feats.   Size
  HBLCoClust   1054.97    1.00    0.18     1051.23
  SpecCo       5.00       1.00    −−       −−
  Multi10      #          Objs.   Feats.   Size
  HBLCoClust   2819.33    1.00    0.16     661.00
  SpecCo       10.00      1.00    −−       −−

Table 6
Obtained results for the Textual data sets.

  Classic3     Purity   NMI    PMI
  HBLCoClust   0.86     0.14   0.20
  SpecCo       0.86     0.73   −−
  Multi5       Purity   NMI    PMI
  HBLCoClust   0.91     0.18   0.37
  SpecCo       0.59     0.53   −−
  Multi10      Purity   NMI    PMI
  HBLCoClust   0.82     0.14   0.33
  SpecCo       0.57     0.55   −−
Table 7
Terms extracted from the feature clusters of the Multi5 data set.
sport.baseball reds, houston, standings, cincinnati, colorado, mets, scores, marlins, including, milwaukee, oakland, pirates, city, expos, los, west, indians,
minnesota, ocf, rangers, joseph, white, angels, texas, giants, toronto, pittsburgh, phillies, cardinals, cubs, atlanta, mariners, orioles, mlb,
lost, braves, louis, detroit, teams, athletics, streak, hernandez, san, boston, cleveland, dodgers, sox, seattle, astros, blue, diego, jays,
jtchern, rockies, twins, brewers, tigers, red, francisco, philadelphia, kansas, yesterday, royals, california, padres, berkeley, league, chicago,
florida, angeles, april, montreal, yankees, baltimore, york
motorcycles handlebars, motorcycle, speed, countersteering, foward, handle, faq, awful, turns, debating, ummm, uiuc, happens, turning, pushing, fgc,
convert, unb, explain, duke, unbvm, acpub, slack, zkcl, infante, cbr, eric, cso, csd, methinks, push
politics.mideast later, muslims, ohanus, vol, turkish, involved, roads, argic, hand, muslim, armenian, document, russian, armenians, including, army, sahak,
proceeded, serdar, soul, killed, among, children, published, blood, appressian, mountain, often, exists, turks, armenia, general, soviet,
serve, escape, genocide, melkonian, ways, extermination, passes, closed
sci.space six, rigel, aurora, wings, mary, dfrf, alaska, dryden, spin, military, facility, speak, unknown, digex, prb, flight, pilot, fly, kotfr, air, nsmca, pat,
edwards, mig, wire, fighter, shafer
comp.graphics plot, recommend, pascal, hidden, routines, object, basic, cost, address, bob, cad, mac, info, frame, short, offer, animation, price, sites,
across, package, low, building, directory, removal, documentation, robert, built, recommendations, libraries, various, tasks, shading, fast,
files, code, objects, tools, handle, demo, library, contact, book
Table 8
Profile generated for one of the users of the MovieLens data set.
Movies Jaws, Back to the Future, Twelve Monkeys, Dumb & Dumber, ...
Like disaster, infidelity, horse, gunfight, USA, automobile, hospital, bathroom, jealousy, racism, elevator, fight, beer, male-nudity, helicopter, impalement,
good-versus-evil, outer-space, murder, washington-d.c., fire, shot-to-death, los-angeles-california, independent-film, small-town, train, drunkenness,
one-man-army, baby, teenage-boy, lifting-someone-into-the-air, redemption, f-word, photograph, tough-guy, gangster, main-character-dies
Dislike second-part, beaten-to-death, haunted-by-the-past, Washington,DistrictofColumbia,USA, cell-phone, vengeance, bulletproof-vest, obsession, book,
die-hard-scenario
PMI is a very popular metric for assessing the quality of Topic Modeling algorithms. These algorithms search for a set of words of a corpus that correctly describes the topic of each document. A PMI value around 0.5 is considered good for the newsgroup data sets, as reported in Anandkumar, Valluvan et al. (2013), while the most well-known algorithm for Topic Modeling, Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2001), obtains a value around 0.2.
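For reference, one common way of computing such a coherence score is to average the PMI over all pairs of a topic's top terms, estimating the probabilities from document-level co-occurrence counts. The sketch below follows this formulation only; the exact estimator and reference corpus used in the evaluations above are not specified here, and all names are illustrative.

import math
from itertools import combinations

def topic_pmi(top_words, documents, eps=1e-12):
    """Average point-wise mutual information over all pairs of a topic's top words.

    Probabilities are estimated from document-level co-occurrence counts; this is
    one common formulation, not necessarily the exact one used in the evaluation."""
    docs = [set(d) for d in documents]
    n_docs = len(docs)

    def p(*words):
        return sum(all(w in d for w in words) for d in docs) / n_docs

    scores = []
    for wi, wj in combinations(top_words, 2):
        joint = p(wi, wj)
        if joint > 0:
            scores.append(math.log(joint / (p(wi) * p(wj) + eps)))
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: three tokenized documents and a candidate topic word set.
documents = [["ball", "team", "league"], ["ball", "team"], ["market", "stock"]]
print(topic_pmi(["ball", "team"], documents))   # positive: the two words co-occur often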
To illustrate the quality of the obtained subsets of features, Table 7 reports the terms belonging to the feature sets of 5 different clusters, one from each topic of the Multi5 data set, chosen at random. It is easy to see that most of these terms are meaningful words revolving around the corresponding topic.
Figs. 3-5 show the sensitivity analysis of the parameters of the algorithm. Notice that the y-axis of all the plots is fixed between [0.6, 1.0] in order to highlight the significance of the variations, while the right axis varies according to the results, in order to verify the tendency of generating more or fewer clusters.

In Fig. 3 we can see that the parameters minObjs and minFeats do not seem to affect the Purity of the clusters found by the algorithm. On the other hand, the more restricted the number of objects and features, the smaller the number of clusters found. This means that the algorithm will maintain the quality of the clusters regardless of how many of them are found.

In Fig. 4, we can see the sensitivity analysis for the parameters pertaining to the LSH algorithm. We can still observe that the Purity is not much affected by these parameters either. The number of clusters seems to be more affected by the number of buckets: more buckets mean more possibilities for finding different groups, thus leading to a higher number of clusters at the final stage.

Finally, in Fig. 5, we can see that the pre-processing stage, controlled by τ, affects only the number of clusters found: the higher the cutoff, the fewer features remain in the data set and the fewer clusters can be found. The sparseness parameter, on the other hand, does affect the Purity and the number of clusters. When this parameter is set to 1, it will only find dense clusters and, thus, the purity tends to be maximized, but it is less likely to find a few larger clusters. When the parameter is set to 0, the whole data set can be assigned to the same cluster, thus minimizing the Purity and returning a single cluster.

5. Conclusion

In this paper a new algorithm for the co-clustering of categorical data was presented. This algorithm consists of six sequential steps and scales linearly with the number of non-empty entries of the data set. This is achieved by using a probabilistic algorithm to
approximate the partial similarity between objects, thus creating seed co-clusters. These seed co-clusters are used to define regions of the data set to be searched by an enumerative algorithm called InClose, finally merging the co-clusters found inside these different regions.

The experiments performed on categorical and textual data showed that this algorithm is at least on par with other co-clustering algorithms from the literature, and many times significantly better. Most noticeably, the algorithm was capable of selecting subsets of features that maximized the point-wise mutual information, leading to a meaningful set of features describing each cluster.

Some examples were provided illustrating the practical implications of having an explicit subset of features. In one of these examples, the features of the text data clusters were used to describe the topic of a group of documents. In another example, the descriptive features were used to generate a profile of what a given user likes or dislikes in a movie.

Despite the good results, the proposed algorithm has a tendency to divide the data sets into a larger set of clusters when compared to other algorithms, though this may imply a more natural segmentation of the data. Also, the number of parameters may require an additional effort to adjust in some situations.

For future research, we intend to propose heuristic algorithms to automatically suggest the values of some of the required parameters by using entropic metrics and knowledge extracted from basic statistics of the data set. This will be performed together with a more detailed sensitivity analysis.

Additionally, we will explore in more detail the idea of using co-clustering algorithms for Topic Modeling and Recommender Systems, as well as some other possible applications.

References

Anandkumar, A., Valluvan, R., et al. (2013). Learning loopy graphical models with latent variables: Efficient methods and guarantees. The Annals of Statistics, 41(2), 401-435.
Andrews, S. (2009). In-close, a fast algorithm for computing formal concepts. In International conference on conceptual structures.
Andrews, S. (2011). In-close2, a high performance formal concept miner. In Conceptual structures for discovering knowledge (pp. 50-62). Springer.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2001). Latent dirichlet allocation. In Advances in neural information processing systems (pp. 601-608).
Boriah, S., Chandola, V., & Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. Red, 30(2), 3.
Broder, A. Z. (1997). On the resemblance and containment of documents. In Compression and complexity of sequences 1997. Proceedings (pp. 21-29). IEEE.
Carter, J., & Wegman, M. N. (1979). Universal classes of hash functions. Journal of Computer and System Sciences, 18(2), 143-154.
de Castro, P. A. D., de França, F. O., Ferreira, H. M., Coelho, G. P., & Von Zuben, F. J. (2010). Query expansion using an immune-inspired biclustering algorithm. Natural Computing, 1-24.
Cheng, Y., & Church, G. M. (2000). Biclustering of expression data. In Proceedings of the 8th international conference on intelligent systems for molecular biology (pp. 93-103).
Coelho, G. P., de França, F. O., & Von Zuben, F. J. (2009). Multi-objective biclustering: When non-dominated solutions are not enough. Journal of Mathematical Modelling and Algorithms.
de Castro, P. A. D., de França, F. O., Ferreira, H. M., & Von Zuben, F. J. (2007). Applying biclustering to perform collaborative filtering. In Proceedings of the 7th international conference on intelligent systems design and applications (pp. 421-426). Rio de Janeiro, Brazil.
Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 89-98). ACM.
Feng, H.-M., Chen, C.-Y., & Ye, F. (2007). Evolutionary fuzzy particle swarm optimization vector quantization learning scheme in image compression. Expert Systems with Applications, 32(1), 213-222.
de França, F. O., Coelho, G. P., & Von Zuben, F. J. (2008). bicACO: An ant colony inspired biclustering algorithm. In Ant colony optimization and swarm intelligence (pp. 401-402). Springer.
de França, F. O., Coelho, G. P., & Von Zuben, F. J. (2013). Predicting missing values with biclustering: A coherence-based approach. Pattern Recognition, 46(5), 1255-1266.
de França, F. O. (2012). Scalable overlapping co-clustering of word-document data. In Machine learning and applications (ICMLA), 2012 11th international conference on: Vol. 1 (pp. 464-467). IEEE.
de França, F. O., & Von Zuben, F. J. (2010). Finding a high coverage set of 5-biclusters with swarm intelligence. In Evolutionary computation (CEC), 2010 IEEE congress on (pp. 1-8). IEEE.
de França, F. O., & Von Zuben, F. J. (2011). Extracting additive and multiplicative coherent biclusters with swarm intelligence. In Evolutionary computation (CEC), 2011 IEEE congress on (pp. 632-638). IEEE.
Funk, S. (2006). Netflix update: Try this at home.
Gao, T., & Akoglu, L. (2014). Fast information-theoretic agglomerative co-clustering. In H. Wang, & M. Sharaf (Eds.), Databases theory and applications. Lecture notes in computer science: Vol. 8506 (pp. 147-159). Springer International Publishing.
Har-Peled, S., Indyk, P., & Motwani, R. (2012). Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1), 321-350.
Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association (JASA), 67(337), 123-129.
Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 230-237). ACM.
Karypis, G., & Kumar, V. (2012). METIS - serial graph partitioning and fill-reducing matrix ordering.
Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2), 291-307.
Labiod, L., & Nadif, M. (2011). Co-clustering for binary and categorical data with maximum modularity. In ICDM (pp. 1140-1145).
Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the twelfth international conference on machine learning (pp. 331-339).
Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In Machine learning: ECML-98 (pp. 4-15). Springer.
Mirkin, B. (1996). Mathematical classification and clustering. Nonconvex optimization and its applications. Springer.
Mitra, S., & Banka, H. (2006). Multi-objective evolutionary biclustering of gene expression data. Pattern Recognition, 39, 2464-2477.
Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302), 415-434.
Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577-8582.
Symeonidis, P., Nanopoulos, A., & Manolopoulos, Y. (2008). Providing justifications in recommender systems. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 38(6), 1-1272.
Zamora, J., Mendoza, M., & Allende, H. (2016). Hashing-based clustering in high dimensional data. Expert Systems with Applications, 62, 202-211.