Unit 4 - Data Warehousing and Mining
Topics to be discussed
• Similarity and Distance Measures
• Hierarchical Algorithms
• Partitioned Algorithms
• Clustering Large Databases
• Clustering with Categorical Attributes
Cluster Analysis
• Cluster analysis or simply clustering is the process of partitioning a set of data
objects (or observations) into subsets. Each subset is a cluster, such that objects
in a cluster are similar to one another, yet dissimilar to objects in other clusters.
The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
• A cluster is a collection of data objects that are similar to one another within the
cluster and dissimilar to objects in other clusters. Because of this, a cluster of data
objects can be treated as an implicit class; in this sense, clustering is sometimes
called automatic classification.
• A critical difference from classification is that clustering can find the
groupings automatically, without relying on predefined class labels.
Similarity or Dissimilarity (Distance) Measures
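Common dissimilarity measures for numeric data are the Euclidean and Manhattan distances, both special cases of the Minkowski distance. The sketch below is illustrative plain Python; the function names are my own, not from any library.

from math import dist  # built-in Euclidean distance (Python 3.8+)

def manhattan(x, y):
    # L1 (city-block) distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, h):
    # Lh distance; h=1 gives Manhattan, h=2 gives Euclidean
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x, y = (1.0, 2.0), (4.0, 6.0)
print(dist(x, y))          # 5.0 (Euclidean)
print(manhattan(x, y))     # 7.0
print(minkowski(x, y, 2))  # 5.0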
Requirements for clustering
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shapes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• Capability of clustering high-dimensional data
• Constraint-based clustering
• Interpretability and usability
Comparison aspects of clustering methods
• Partitioning criteria
• Separation of clusters
• Similarity measure
• Clustering space
Types of algorithms for Clustering
Partitioning algorithms
• Given “n” objects, a partitioning method constructs “k” partitions, where k <= n
• Exclusive cluster separation: each object must fall into exactly one cluster
• Distance-based partitioning: an initial partition is created; iterative
relocation techniques are then applied to improve it
• The general criterion of a good partitioning is that objects in the same
cluster are “close” or related to each other, whereas objects in
different clusters are “far apart” or very different.
Partitioning algorithm: K-means
• An objective function is used to assess the partitioning quality so that
objects within a cluster are similar to one another but dissimilar to objects
in other clusters. That is, the objective function aims for high intracluster
similarity and low intercluster similarity.
• A centroid-based partitioning technique uses the centroid of a cluster, Ci , to
represent that cluster. Conceptually, the centroid of a cluster is its center
point. The centroid can be defined in various ways such as by the mean or
medoid of the objects (or points) assigned to the cluster.
• The difference between an object p ∈ Ci and ci, the representative of the
cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance
between two points x and y.
• The quality of cluster Ci can be measured by the within-cluster variation, which is
the sum of squared errors between all objects in Ci and the centroid ci, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2

• where E is the sum of the squared error for all objects in the data set; p is the
point in space representing a given object; and ci is the centroid of cluster Ci (both
p and ci are multidimensional).
• The distance of each object from its centroid is squared, and all these squared
distances are summed.
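As a concrete check of this definition, here is a minimal plain-Python sketch; the clusters dictionary is made-up toy data.

from math import dist  # Euclidean distance, Python 3.8+

def sse(clusters):
    # Sum of squared Euclidean distances from each point to its centroid.
    # `clusters` maps each centroid to the list of points assigned to it.
    return sum(dist(p, c) ** 2 for c, points in clusters.items() for p in points)

clusters = {(1.0, 1.0): [(0.0, 1.0), (2.0, 1.0)],
            (5.0, 5.0): [(5.0, 4.0), (5.0, 6.0)]}
print(sse(clusters))  # each point is at distance 1 from its centroid, so E = 4.0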
K-means clustering algorithm
• The k-means algorithm defines the centroid of a cluster as the mean value of the
points within the cluster.
• First, it randomly selects k of the objects in D, each of which initially represents a
cluster mean or center. For each of the remaining objects, an object is assigned to
the cluster to which it is the most similar, based on the Euclidean distance between
the object and the cluster mean. The k-means algorithm then iteratively improves
the within-cluster variation.
• For each cluster, it computes the new mean using the objects assigned to the cluster
in the previous iteration. All the objects are then reassigned using the updated
means as the new cluster centers. The iterations continue until the assignment is
stable, that is, the clusters formed in the current round are the same as those
formed in the previous round.
The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative
relocation.
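The steps above translate directly into code. Below is a minimal plain-Python sketch of k-means on tuples of numbers; production code would normally use a library such as scikit-learn instead.

import random

def dist2(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    centers = random.sample(points, k)        # randomly pick k objects as initial means
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each object to its nearest mean
            i = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[i].append(p)
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]   # recompute each mean
        if new_centers == centers:            # stop when the assignment is stable
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
print(kmeans(points, 2))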
Grid-based Method: STING
• STING (STatistical INformation Grid) divides the spatial area into rectangular
cells at multiple levels of resolution, forming a hierarchical structure in which
each cell at a high level is partitioned into smaller cells at the next lower level.
• The statistical parameters of higher-level cells can easily be computed from the parameters of
the lower-level cells.
• These parameters include the following: the attribute-independent parameter, count; and the
attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max
(maximum), and the type of distribution that the attribute value in the cell follows such as
normal, uniform, exponential, or none (if the distribution is unknown).
• Here, the attribute is a selected measure for analysis such as price for house objects.
• When the data are loaded into the database, the parameters count, mean, stdev, min, and max
of the bottom-level cells are calculated directly from the data.
• The value of distribution may either be assigned by the user, if the distribution type
is known beforehand, or obtained by hypothesis tests such as the χ² test.
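The roll-up of child-cell parameters into a parent cell uses the standard formulas for combining counts, means, and standard deviations. A minimal sketch, assuming population (not sample) standard deviations and a made-up tuple layout:

from math import sqrt

def combine_cells(children):
    # children: list of (count, mean, stdev, min, max) tuples, one per child cell
    n = sum(c[0] for c in children)                                   # counts add up
    mean = sum(c[0] * c[1] for c in children) / n                     # weighted mean
    ex2 = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in children) / n   # E[x^2]
    stdev = sqrt(max(ex2 - mean ** 2, 0.0))                           # sqrt of variance
    return (n, mean, stdev,
            min(c[3] for c in children),                              # min of mins
            max(c[4] for c in children))                              # max of maxes

print(combine_cells([(2, 1.0, 0.0, 1.0, 1.0), (2, 3.0, 0.0, 3.0, 3.0)]))
# (4, 2.0, 1.0, 1.0, 3.0)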
• “How is this statistical information useful for query answering?” The statistical
parameters can be used in a top-down, grid-based manner as follows. First, a layer within
the hierarchical structure is determined from which the query-answering process is to
start.
• This layer typically contains a small number of cells. For each cell in the current layer, we
compute the confidence interval (or estimated probability range) reflecting the cell’s
relevancy to the given query. The irrelevant cells are removed from further
consideration.
• Processing of the next lower level examines only the remaining relevant cells. This
process is repeated until the bottom layer is reached.
• At this time, if the query specification is met, the regions of relevant cells that satisfy the
query are returned. Otherwise, the data that fall into the relevant cells are retrieved and
further processed until they meet the query’s requirements.
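The drill-down just described is easy to express as a loop. In this sketch, children and relevant are hypothetical callbacks; a real implementation would derive relevance from each cell's confidence interval with respect to the query.

def answer_query(top_cells, children, relevant, is_bottom):
    # Keep only the cells of the starting layer that look relevant to the query
    cells = [c for c in top_cells if relevant(c)]
    while cells and not is_bottom(cells[0]):
        # Examine only the children of the remaining relevant cells
        cells = [child for cell in cells
                       for child in children(cell)
                       if relevant(child)]
    return cells  # relevant bottom-layer cells: the regions answering the query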
Density based Method: CLIQUE
• CLIQUE (CLustering In QUEst) is a simple grid-based method for finding
density-based clusters in subspaces. CLIQUE partitions each dimension
into nonoverlapping intervals, thereby partitioning the entire embedding
space of the data objects into cells.
• It uses a density threshold to identify dense cells and sparse ones. A cell is
dense if the number of objects mapped to it exceeds the density
threshold.
• CLIQUE performs clustering in two steps. In the first step, CLIQUE
partitions the d-dimensional data space into nonoverlapping rectangular
units, identifying the dense units among these. CLIQUE finds dense cells
in all of the subspaces.
CLIQUE – Phase 1
• To do so, CLIQUE partitions every dimension into intervals, and
identifies intervals containing at least ‘l’ points, where ‘l’ is the density
threshold. CLIQUE then iteratively joins two k-dimensional dense
cells, c1 and c2, in subspaces (Di1, Di2, …, Dik) and (Dj1, Dj2, …, Djk)
respectively, provided Di1 = Dj1, …, Dik-1 = Djk-1, and c1 and c2 share
the same intervals in those dimensions.
• The join operation generates a new (k+1)-dimensional candidate cell c
in space (Di1, …, Dik-1, Dik, Djk).
• CLIQUE checks whether the number of points in c passes the density
threshold. The iteration terminates when no candidates can be
generated or no candidate cells are dense.
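A minimal sketch of this Apriori-like join, with a made-up cell representation (a dense cell is a frozen {dimension: interval} mapping); checking the new candidate's density against the data is left out:

from itertools import combinations

def join_candidates(dense_k):
    # Join pairs of k-dimensional dense cells that agree on k-1 dimensions
    # (including the intervals) to form (k+1)-dimensional candidate cells.
    candidates = set()
    for c1, c2 in combinations(dense_k, 2):
        d1, d2 = dict(c1), dict(c2)
        shared = set(d1) & set(d2)
        if (len(shared) == len(d1) - 1
                and all(d1[dim] == d2[dim] for dim in shared)):
            merged = {**d1, **d2}
            if len(merged) == len(d1) + 1:   # exactly one new dimension from each
                candidates.add(frozenset(merged.items()))
    return candidates

dense_2 = {frozenset({0: 3, 1: 5}.items()),   # interval 3 on dim 0, interval 5 on dim 1
           frozenset({0: 3, 2: 7}.items())}   # interval 3 on dim 0, interval 7 on dim 2
print(join_candidates(dense_2))               # one 3-D candidate on dims {0, 1, 2}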
CLIQUE – Phase 2
• In the second step, CLIQUE uses the dense cells in each subspace to
assemble clusters, which can be of arbitrary shape. The idea is to apply
the Minimum Description Length (MDL) principle to use the maximal
regions to cover connected dense cells, where a maximal region is a
hyperrectangle where every cell falling into this region is dense, and the
region cannot be extended further in any dimension in the subspace.
• It starts with an arbitrary dense cell, finds a maximal region covering
the cell, and then works on the remaining dense cells that have not yet
been covered. The greedy method terminates when all dense cells are
covered.
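A minimal 2-D sketch of the greedy cover (real CLIQUE works on cells in an arbitrary subspace; dense here is a made-up set of (row, col) grid cells):

def grow_maximal_region(seed, dense):
    # Grow a rectangle [r0..r1] x [c0..c1] from seed while every covered
    # cell stays dense; stop when no side can be extended any further.
    r0, r1, c0, c1 = seed[0], seed[0], seed[1], seed[1]
    extended = True
    while extended:
        extended = False
        for dr0, dr1, dc0, dc1 in ((-1, 0, 0, 0), (0, 1, 0, 0),
                                   (0, 0, -1, 0), (0, 0, 0, 1)):
            nr0, nr1, nc0, nc1 = r0 + dr0, r1 + dr1, c0 + dc0, c1 + dc1
            cells = {(r, c) for r in range(nr0, nr1 + 1)
                            for c in range(nc0, nc1 + 1)}
            if cells <= dense:                # the enlarged region is still all dense
                r0, r1, c0, c1 = nr0, nr1, nc0, nc1
                extended = True
    return r0, r1, c0, c1

def cover_dense_cells(dense):
    # Start from an arbitrary uncovered dense cell, grow a maximal region,
    # mark what it covers, and repeat until every dense cell is covered.
    uncovered, regions = set(dense), []
    while uncovered:
        r0, r1, c0, c1 = grow_maximal_region(next(iter(uncovered)), dense)
        uncovered -= {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}
        regions.append((r0, r1, c0, c1))
    return regions

print(cover_dense_cells({(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)}))
# maximal regions covering all dense cells (the result depends on seed order)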
Evaluation of clustering
• Clustering evaluation assesses the feasibility of clustering analysis on a
data set and the quality of the results generated by a clustering
method.
• The major tasks of cluster evaluation are:
• Assessing clustering tendency. In this task, for a given data set, we
assess whether a nonrandom structure exists in the data. Blindly
applying a clustering method on a data set will return clusters;
however, the clusters mined may be misleading. Clustering analysis on
a data set is meaningful only when there is a nonrandom structure in
the data.
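One common way to test for nonrandom structure is the Hopkins statistic, which compares nearest-neighbor distances within the data against those of uniformly generated points. A minimal 2-D sketch; under the convention used here the result is roughly 0.5 for uniform data and near 0 for clustered data.

import random
from math import dist

def hopkins(points, m=10, seed=0):
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    def nn(q, exclude=None):
        # distance from q to its nearest neighbor among the data points
        return min(dist(q, p) for p in points if p is not exclude)
    w = sum(nn(p, exclude=p) for p in rng.sample(points, m))   # within the data
    uniform = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
               for _ in range(m)]
    u = sum(nn(q) for q in uniform)                            # from random points
    return w / (u + w)   # ~0.5 for uniform data; near 0 for clustered data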
Cluster evaluation
• Determining the number of clusters in a data set. Some algorithms, such as k-
means, require the number of clusters as an input parameter.
Moreover, the number of clusters can be regarded as an interesting and
important summary statistic of a data set. Therefore, it is desirable to
estimate this number even before a clustering algorithm is used to derive
detailed clusters.
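One simple heuristic for estimating this number is the elbow method: run a clustering algorithm for a range of k values and pick the k beyond which the within-cluster SSE stops dropping sharply. A sketch, assuming scikit-learn is available:

from sklearn.cluster import KMeans

def sse_curve(X, k_max=10):
    # within-cluster sum of squared errors (inertia) for k = 1 .. k_max
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, k_max + 1)]

X = [[1, 1], [1.5, 2], [8, 8], [9, 9]]
print(sse_curve(X, k_max=3))   # plot this curve and pick k at the "elbow"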
• Measuring clustering quality. After applying a clustering method on a data
set, we want to assess how good the resulting clusters are. A number of
measures can be used. Some methods measure how well the clusters fit the
data set, while others measure how well the clusters match the ground truth,
if such truth is available. There are also measures that score clusterings and
thus can compare two sets of clustering results on the same data set.
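For example, the silhouette coefficient is a widely used intrinsic measure (no ground truth needed), while the adjusted Rand index is an extrinsic measure that compares a clustering against ground-truth labels. A sketch using scikit-learn (assumed available; the data is made up):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X = [[1, 1], [1.5, 2], [8, 8], [9, 9]]           # toy data
true_labels = [0, 0, 1, 1]                       # hypothetical ground truth
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Intrinsic: how well the clusters fit the data, in [-1, 1]; higher is better
print(silhouette_score(X, labels))

# Extrinsic: agreement with the ground truth; 1.0 means perfect agreement
print(adjusted_rand_score(true_labels, labels))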
End of Unit 4