
Data Warehousing and Mining – Unit 4
Topics to be discussed
• Similarity and Distance Measures
• Hierarchical Algorithms
• Partitioning Algorithms
• Clustering Large Databases
• Clustering with Categorical Attributes
Cluster Analysis
• Cluster analysis or simply clustering is the process of partitioning a set of data
objects (or observations) into subsets. Each subset is a cluster, such that objects
in a cluster are similar to one another, yet dissimilar to objects in other clusters.
The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
• A cluster is a collection of data objects that are similar to one another within the
cluster and dissimilar to objects in other clusters. A cluster of data objects can
therefore be treated as an implicit class; in this sense, clustering is sometimes
called automatic classification.
• A critical difference from classification is that clustering can automatically
discover the groupings, without relying on predefined class labels.
Similarity or Dissimilarity (Distance) Measures
Requirements for clustering

• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shapes
• Requirements for domain knowledge to determine input parameters
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• Capability of clustering high-dimensionality data
• Constraint-based clustering
• Interpretability and usability
Comparison aspects of clustering methods

• Partitioning criteria
• Separation of clusters
• Similarity measure
• Clustering space
Types of algorithms for Clustering
Partitioning algorithms
• Given “n” objects, construct “k” partitions, where k ≤ n
• Exclusive cluster separation: each object must fall into one cluster
• Distance-based partitioning: an initial partition is created; iteratively
relocation techniques are attempted
• The general criterion of a good partitioning is that objects in the same
cluster are “close” or related to each other, whereas objects in
different clusters are “far apart” or very different.
Partitioning algorithm: K-means
• An objective function is used to assess the partitioning quality so that
objects within a cluster are similar to one another but dissimilar to objects
in other clusters. That is, the objective function aims for high intracluster
similarity and low intercluster similarity.
• A centroid-based partitioning technique uses the centroid of a cluster, Ci , to
represent that cluster. Conceptually, the centroid of a cluster is its center
point. The centroid can be defined in various ways such as by the mean or
medoid of the objects (or points) assigned to the cluster.
• The difference between an object p ∈ Ci and ci, the representative of the
cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance
between two points x and y.
• The quality of cluster Ci can be measured by the within-cluster variation, which is
the sum of squared errors between all objects in Ci and the centroid ci, defined as

E = Σi=1..k Σp∈Ci dist(p, ci)²
• where E is the sum of the squared error for all objects in the data set; p is the
point in space representing a given object; and ci is the centroid of cluster Ci (both
p and ci are multidimensional).
• The distance of each object from its centroid is squared, and all these squared
distances are summed.
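
To make the two definitions above concrete, here is a minimal Python sketch of dist(x, y) and the within-cluster variation E (the sample points, cluster assignments, and centroids are made-up illustrations, not from the text):

import math

def dist(x, y):
    # Euclidean distance between two points given as equal-length tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def within_cluster_variation(clusters, centroids):
    # E = sum over clusters Ci of the squared distance of each p in Ci to ci
    return sum(dist(p, c) ** 2
               for points, c in zip(clusters, centroids)
               for p in points)

# Tiny made-up example: two 2-D clusters and their centroids
clusters = [[(1, 1), (2, 1)], [(8, 8), (9, 9)]]
centroids = [(1.5, 1.0), (8.5, 8.5)]
print(within_cluster_variation(clusters, centroids))  # 1.5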
K-means clustering algorithm
• The k-means algorithm defines the centroid of a cluster as the mean value of the
points within the cluster.
• First, it randomly selects k of the objects in D, each of which initially represents a
cluster mean or center. For each of the remaining objects, an object is assigned to
the cluster to which it is the most similar, based on the Euclidean distance between
the object and the cluster mean. The k-means algorithm then iteratively improves
the within-cluster variation.
• For each cluster, it computes the new mean using the objects assigned to the cluster
in the previous iteration. All the objects are then reassigned using the updated
means as the new cluster centers. The iterations continue until the assignment is
stable, that is, the clusters formed in the current round are the same as those
formed in the previous round.
The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative
relocation.
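
A compact Python sketch of this iterative relocation procedure (a plain k-means loop on 2-D points; the random seed, sample data, and iteration cap are illustrative assumptions, not part of the text):

import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # 1. Randomly pick k objects as the initial cluster means
    means = rng.sample(points, k)
    for _ in range(max_iter):
        # 2. Assign every object to the cluster with the closest mean (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        # 3. Recompute each mean from the objects assigned to it in this round
        new_means = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else m
                     for pts, m in zip(clusters, means)]
        if new_means == means:      # assignment is stable: stop
            break
        means = new_means
    return clusters, means

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, means = kmeans(data, k=2)
print(means)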
K-medoids

• The sensitivity of k-means to outliers is addressed by this algorithm: instead of
the mean, each cluster is represented by one of its actual objects, called a medoid,
and the partitioning minimizes the sum of the dissimilarities between each object
and the medoid of its cluster.

• Partitioning Around Medoids (PAM) is a popular realization of k-medoids: it
iteratively swaps a medoid with a non-medoid object whenever the swap reduces
the total dissimilarity, so a distant outlier cannot pull a cluster representative
toward it the way it pulls a mean.
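
A minimal PAM-style sketch of this swap idea (greedy medoid swapping; the Manhattan distance, the naive initialization, and the toy data are illustrative assumptions):

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def total_cost(points, medoids):
    # Sum, over all objects, of the distance to their closest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])              # naive initial choice of k medoids
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                # Swap medoid i with non-medoid p only if it lowers the total cost
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True
    return medoids

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (25, 25)]   # (25, 25) acts as an outlier
print(pam(data, k=2))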
Hierarchical methods
• A hierarchical clustering method works by grouping data objects into
a hierarchy or “tree” of clusters.
• Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
• The two approaches to hierarchical clustering:
• Agglomerative clustering
• Divisive clustering
Agglomerative clustering
• An agglomerative hierarchical clustering method uses a bottom-up
strategy. It typically starts by letting each object form its own cluster
and iteratively merges clusters into larger and larger clusters, until all
the objects are in a single cluster or certain termination conditions are
satisfied. The single cluster becomes the hierarchy’s root.
• For the merging step, it finds the two clusters that are closest to each
other (according to some similarity measure), and combines the two
to form one cluster. Because two clusters are merged per iteration,
where each cluster contains at least one object, an agglomerative
method requires at most n iterations.
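
A naive Python sketch of this bottom-up merging, using the minimum (single-link) distance between clusters; the recorded merge order is a rough stand-in for a dendrogram, and the data and linkage choice are illustrative:

def euclid(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def single_link(c1, c2):
    # Distance between two clusters = distance between their closest members
    return min(euclid(p, q) for p in c1 for q in c2)

def agglomerative(points):
    clusters = [[p] for p in points]            # every object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them
        (i, j) = min(((i, j) for i in range(len(clusters))
                             for j in range(i + 1, len(clusters))),
                     key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges                                # the merge order sketches the dendrogram

for step in agglomerative([(1, 1), (1, 2), (5, 5), (5, 6), (9, 9)]):
    print(step)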
Divisive hierarchical clustering method
• A divisive hierarchical clustering method employs a top-down
strategy. It starts by placing all objects in one cluster, which is the
hierarchy’s root.
• It then divides the root cluster into several smaller subclusters, and
recursively partitions those clusters into smaller ones.
• The partitioning process continues until each cluster at the lowest
level is coherent enough—either containing only one object, or the
objects within a cluster are sufficiently similar to each other.
Agglomerative Nesting (AGNES) & Divisive Analysis (DIANA)
Dendrogram for hierarchical clustering
Distance measures in algorithmic methods
• Four widely used measures for the distance between clusters are as follows, where
|p − p'| is the distance between two objects or points p and p'; mi is the mean of
cluster Ci; and ni is the number of objects in Ci. They are also known as linkage
measures.
• Minimum distance: dmin(Ci, Cj) = min over p ∈ Ci, p' ∈ Cj of |p − p'|
• Maximum distance: dmax(Ci, Cj) = max over p ∈ Ci, p' ∈ Cj of |p − p'|
• Mean distance: dmean(Ci, Cj) = |mi − mj|
• Average distance: davg(Ci, Cj) = (1 / (ni · nj)) · Σp∈Ci Σp'∈Cj |p − p'|
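
A small Python sketch of the four linkage measures listed above, evaluated on two toy clusters (the data are made up, and |p − p'| is taken to be the Euclidean distance):

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mean(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def d_min(ci, cj):   # minimum (single-link) distance
    return min(dist(p, q) for p in ci for q in cj)

def d_max(ci, cj):   # maximum (complete-link) distance
    return max(dist(p, q) for p in ci for q in cj)

def d_mean(ci, cj):  # distance between the cluster means
    return dist(mean(ci), mean(cj))

def d_avg(ci, cj):   # average of all pairwise distances
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

ci, cj = [(0, 0), (1, 0)], [(4, 0), (6, 0)]
print(d_min(ci, cj), d_max(ci, cj), d_mean(ci, cj), d_avg(ci, cj))  # 3.0 6.0 4.5 4.5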
Hierarchical clustering using single and complete linkages
• When an algorithm uses the maximum distance, dmax(Ci ,Cj), to measure the distance
between clusters, it is sometimes called a farthest-neighbor clustering algorithm.
• If the clustering process is terminated when the maximum distance between nearest
clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.
• By viewing data points as nodes of a graph, with edges linking nodes, we can think of each
cluster as a complete subgraph, that is, with edges connecting all the nodes in the clusters.
• The distance between two clusters is determined by the most distant nodes in the two
clusters.
• Farthest-neighbor algorithms tend to minimize the increase in the diameter of the clusters at
each iteration. If the true clusters are rather compact and of approximately equal size, the
method will produce high-quality clusters. Otherwise, the clusters produced can be
meaningless.
BIRCH: Multiphase Hierarchical Clustering
using Clustering Features
• Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed
for clustering a large amount of numeric data by integrating hierarchical
clustering (at the initial microclustering stage) and other clustering methods such
as iterative partitioning (at the later macroclustering stage).
• It overcomes the two difficulties in agglomerative clustering methods: (1)
scalability and (2) the inability to undo what was done in the previous step.
• BIRCH uses the notions of clustering feature to summarize a cluster, and
clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed and scalability in
large or even streaming databases, and also make it effective for incremental and
dynamic clustering of incoming objects.
Clustering Feature
• Consider a cluster of n d-dimensional data objects or points. The clustering feature (CF) of the
cluster is a 3-D vector summarizing information about clusters of objects. It is defined as
CF={n, LS, SS}
• LS is the linear sum of the n data points
• SS is the sum of squares of the data points
• A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering.
• By definition, a nonleaf node in a tree has descendants or “children.” The nonleaf nodes store
sums of the CFs of their children, and thus summarize clustering information about their
children. A CF-tree has two parameters: branching factor, B, and threshold, T. The branching
factor specifies the maximum number of children per nonleaf node. The threshold parameter
specifies the maximum diameter of subclusters stored at the leaf nodes of the tree. These
two parameters implicitly control the resulting tree’s size.
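
A minimal sketch of a clustering feature and its additivity, assuming SS is kept as a scalar sum of squares: merging two CFs simply adds their components, and the centroid and a radius can be derived from n, LS, and SS alone. The class and method names are illustrative, not BIRCH's actual data structures:

import math

class CF:
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss   # count, linear sum, sum of squares

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # Average spread of members around the centroid, derivable from n, LS, SS
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

cf = CF.from_point((1, 1)).merge(CF.from_point((3, 1)))
print(cf.n, cf.ls, cf.ss, cf.centroid(), cf.radius())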
• Given a limited amount of main memory, an important consideration in
BIRCH is to minimize the time required for input/output (I/O). BIRCH
applies a multiphase clustering technique: A single scan of the data set
yields a basic, good clustering, and one or more additional scans can
optionally be used to further improve the quality. The primary phases are
• Phase 1: BIRCH scans the database to build an initial in-memory CF-tree,
which can be viewed as a multilevel compression of the data that tries to
preserve the data’s inherent clustering structure.
• Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf
nodes of the CF-tree, which removes sparse clusters as outliers and groups
dense clusters into larger ones.
• For Phase 1, the CF-tree is built dynamically as objects are inserted. Thus,
the method is incremental.
• An object is inserted into the closest leaf entry (subcluster). If the diameter
of the subcluster stored in the leaf node after insertion is larger than the
threshold value, then the leaf node and possibly other nodes are split.
• After the insertion of the new object, information about the object is passed
toward the root of the tree. The size of the CF-tree can be changed by
modifying the threshold.
• If the size of the memory that is needed for storing the CF-tree is larger than
the size of the main memory, then a larger threshold value can be specified
and the CF-tree is rebuilt.
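
A rough sketch of the leaf-insertion test just described: a new object is absorbed into the closest leaf entry only if the entry's radius stays within the threshold T, otherwise a new entry (and possibly a split) is needed. The (n, LS, SS) tuple layout and the radius formula are illustrative assumptions:

import math

def absorb(entry, p, threshold):
    # Tentatively add point p to a leaf entry (n, LS, SS); keep the result only if
    # the entry's radius stays within threshold T, otherwise signal a new entry.
    n, ls, ss = entry
    n2, ls2, ss2 = n + 1, [a + b for a, b in zip(ls, p)], ss + sum(x * x for x in p)
    c = [x / n2 for x in ls2]
    radius = math.sqrt(max(ss2 / n2 - sum(x * x for x in c), 0.0))
    return (n2, ls2, ss2) if radius <= threshold else None

entry = (2, [4, 2], 12)                        # summarizes the points (1, 1) and (3, 1)
print(absorb(entry, (2, 2), threshold=1.5))    # absorbed: radius stays small
print(absorb(entry, (9, 9), threshold=1.5))    # None: would exceed T, new entry needed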
Clustering Feature Tree
• The rebuild process is performed by building a new tree from the leaf
nodes of the old tree. Thus, the process of rebuilding the tree is done
without the necessity of rereading all the objects or points. This is
similar to the insertion and node split in the construction of B+-trees.
Therefore, for building the tree, the data has to be read just once.
• Some heuristics and methods have been introduced to deal with
outliers and improve the quality of CF-trees by additional scans of the
data. Once the CF-tree is built, any clustering algorithm, such as a
typical partitioning algorithm, can be used with the CF-tree in Phase
2.
Chameleon: Multiphase hierarchical
clustering using Dynamic modeling
• Chameleon is a hierarchical clustering algorithm that uses dynamic
modeling to determine the similarity between pairs of clusters. In
Chameleon, cluster similarity is assessed based on (1) how well
connected objects are within a cluster and (2) the proximity of
clusters. That is, two clusters are merged if their interconnectivity is
high and they are close together. Thus, Chameleon does not depend
on a static, user-supplied model and can automatically adapt to the
internal characteristics of the clusters being merged. The merge
process facilitates the discovery of natural and homogeneous clusters
and applies to all data types as long as a similarity function can be
specified.
Chameleon
Probabilistic hierarchical
clustering
Partitioning methods
• Given a set of n objects, a partitioning method constructs k partitions of
the data, where each partition represents a cluster and k ≤ n. That is, it
divides the data into k groups such that each group must contain at least
one object.
• In other words, partitioning methods conduct one-level partitioning on
data sets. The basic partitioning methods typically adopt exclusive cluster
separation. That is, each object must belong to exactly one group. This
requirement may be relaxed, for example, in fuzzy partitioning
techniques.
• Most partitioning methods are distance-based. They use an iterative
relocation technique that moves objects between clusters so that each
object ends up in the cluster whose representative is “nearest” to it.
• Traditional partitioning methods can be extended for subspace clustering,
rather than searching the full data space.
• Achieving global optimality in partitioning-based clustering is often computationally
prohibitive, potentially requiring an exhaustive enumeration of all the possible
partitions.
• Instead, most applications adopt popular heuristic methods, such as greedy
approaches like the k-means and the k-medoids algorithms, which progressively
improve the clustering quality and approach a local optimum.
• These heuristic clustering methods work well for finding spherical-shaped clusters in
small- to medium-size databases.
• To find clusters with complex shapes and for very large data sets, partitioning-based
methods need to be extended.
Clustering for Large
Databases – Density based
Clustering
Clustering for Large Databases – Density
based Clustering
• “How can we find dense regions in density-based clustering?”
• The density of an object o can be measured by the number of objects
close to o. DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) finds core objects, that is, objects that have dense
neighborhoods. It connects core objects and their neighborhoods to
form dense regions as clusters.
Clustering for Large Databases – Density
based Clustering
• “How does DBSCAN quantify the neighborhood of an object?”
• A user-specified parameter ε > 0 is used to specify the radius of the
neighborhood we consider for every object.
• The ε-neighborhood of an object o is the space within a radius ε
centered at o.
• An object is a core object if the ε-neighborhood of the object contains
at least MinPts objects. Core objects are the pillars of dense regions.
Clustering for Large Databases – Density
based Clustering
• Given a set, D, of objects, we can identify all core objects with respect to
the given parameters, ε and MinPts. The clustering task is therein reduced
to using core objects and their neighborhoods to form dense regions,
where the dense regions are clusters.
• For a core object q and an object p, we say that p is directly density-
reachable from q (with respect to ε and MinPts) if p is within the ε-
neighborhood of q.
• Clearly, an object p is directly density-reachable from another object q if
and only if q is a core object and p is in the ε-neighborhood of q. Using the
directly density-reachable relation, a core object can “bring” all objects
from its ε-neighborhood into a dense region.
Clustering for Large Databases – Density
based Clustering
• “How can we assemble a large dense region using small dense regions
centered by core objects?”
• In DBSCAN, p is density-reachable from q (with respect to ε and MinPts in
D) if there is a chain of objects p1, ..., pn, such that p1 = q, pn = p, and
pi+1 is directly density-reachable from pi with respect to ε and MinPts, for
1 ≤ i < n, pi ∈ D.
• Note that density-reachability is not an equivalence relation because it is
not symmetric. If both o1 and o2 are core objects and o1 is density-
reachable from o2, then o2 is density-reachable from o1. However, if o2 is
a core object but o1 is not, then o1 may be density-reachable from o2,
but not vice versa.
• Two objects o1 and o2 are density-connected (with respect to ε and MinPts in D)
if there is an object o ∈ D such that both o1 and o2 are density-reachable from o.
• We can use the closure of density-connectedness to find connected
dense regions as clusters. Each closed set is a density-based cluster. A
subset C ⊆ D is a cluster if
• (1) for any two objects o1, o2 ∈ C, o1 and o2 are density-connected;
and
• (2) there does not exist an object o ∈ C and another object o' ∈ (D − C)
such that o and o' are density-connected.
• “How does DBSCAN find clusters?” Initially, all objects in a given data set D are
marked as “unvisited.” DBSCAN randomly selects an unvisited object p, marks p as
“visited,” and checks whether the ε-neighborhood of p contains at least MinPts
objects.
• If not, p is marked as a noise point. Otherwise, a new cluster C is created for p, and
all the objects in the neighborhood of p are added to a candidate set, N.
• DBSCAN iteratively adds to C those objects in N that do not belong to any cluster. In
this process, for an object p’ in N that carries the label “unvisited,” DBSCAN marks it
as “visited” and checks its neighborhood.
• If the neighborhood of p’ has at least MinPts objects, those objects in the
neighborhood of p’ are added to N.
• DBSCAN continues adding objects to C until C can no longer be expanded, that is, N
is empty. At this time, cluster C is completed, and thus is output.
• To find the next cluster, DBSCAN randomly selects an unvisited object
from the remaining ones. The clustering process continues until all
objects are visited.
• If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²).
• With appropriate settings of the user-defined parameters ε and
MinPts, the algorithm is effective in finding arbitrary-shaped clusters.
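
A compact pure-Python sketch of the DBSCAN procedure described above, using a quadratic neighborhood search (no spatial index); ε, MinPts, and the sample points are illustrative:

def region_query(points, i, eps):
    # Indices of all points within the eps-neighborhood of point i (including i itself)
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                 # not a core object
            continue
        cluster_id += 1                       # start a new cluster around core object i
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:                          # grow the cluster from the core object
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id        # border object claimed by this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core object: keep expanding
                queue.extend(j_neighbors)
    return labels

data = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (25, 25)]
print(dbscan(data, eps=2.0, min_pts=3))       # two clusters plus one noise point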
Grid Based Methods
• The grid-based clustering approach uses a multiresolution grid data
structure. It quantizes the object space into a finite number of cells
that form a grid structure on which all of the operations for clustering
are performed.
• The main advantage of the approach is its fast processing time, which
is typically independent of the number of data objects, yet dependent
on only the number of cells in each dimension in the quantized space.
• Two methods: STING and CLIQUE
Grid Based Methods - STING
• STING is a grid-based multiresolution clustering technique in which the embedding
spatial area of the input objects is divided into rectangular cells.
• The space can be divided in a hierarchical and recursive way.
• Several levels of such rectangular cells correspond to different levels of resolution and
form a hierarchical structure.
• Each cell at a high level is partitioned to form a number of cells at the next lower
level.
• Statistical information regarding the attributes in each grid cell, such as the mean,
maximum, and minimum values, is precomputed and stored as statistical parameters.
• These statistical parameters are useful for query processing and for other data
analysis tasks.
Grid Based Method: STING

• The statistical parameters of higher-level cells can easily be computed from the parameters of
the lower-level cells.
• These parameters include the following: the attribute-independent parameter, count; and the
attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max
(maximum), and the type of distribution that the attribute value in the cell follows such as
normal, uniform, exponential, or none (if the distribution is unknown).
• Here, the attribute is a selected measure for analysis such as price for house objects.
• When the data are loaded into the database, the parameters count, mean, stdev, min, and max
of the bottom-level cells are calculated directly from the data.
• The value of distribution may either be assigned by the user if the distribution type is known
beforehand or obtained by hypothesis tests such as the χ² (chi-square) test.
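
A small sketch of how a higher-level cell's parameters could be derived from its child cells, in the spirit of the bullets above; the dictionary layout and single-attribute assumption are illustrative, not STING's actual storage format:

import math

def combine_cells(children):
    # Each child cell carries: count, mean, stdev, min, max (for one attribute)
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # Recover the parent's E[x^2] from each child's mean and stdev, then its stdev
    ex2 = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2) for c in children) / n
    return {
        "count": n,
        "mean": mean,
        "stdev": math.sqrt(max(ex2 - mean ** 2, 0.0)),
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

child_a = {"count": 10, "mean": 100.0, "stdev": 5.0, "min": 90.0, "max": 110.0}
child_b = {"count": 30, "mean": 200.0, "stdev": 10.0, "min": 180.0, "max": 220.0}
print(combine_cells([child_a, child_b]))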
• “How is this statistical information useful for query answering?” The statistical
parameters can be used in a top-down, grid-based manner as follows. First, a layer within
the hierarchical structure is determined from which the query-answering process is to
start.
• This layer typically contains a small number of cells. For each cell in the current layer, we
compute the confidence interval (or estimated probability range) reflecting the cell’s
relevancy to the given query. The irrelevant cells are removed from further
consideration.
• Processing of the next lower level examines only the remaining relevant cells. This
process is repeated until the bottom layer is reached.
• At this time, if the query specification is met, the regions of relevant cells that satisfy the
query are returned. Otherwise, the data that fall into the relevant cells are retrieved and
further processed until they meet the query’s requirements.
Density based Method: CLIQUE
• CLIQUE (CLustering In QUEst) is a simple grid-based method for finding
density-based clusters in subspaces. CLIQUE partitions each dimension
into nonoverlapping intervals, thereby partitioning the entire embedding
space of the data objects into cells.
• It uses a density threshold to identify dense cells and sparse ones. A cell is
dense if the number of objects mapped to it exceeds the density
threshold.
• CLIQUE performs clustering in two steps. In the first step, CLIQUE
partitions the d-dimensional data space into nonoverlapping rectangular
units, identifying the dense units among these. CLIQUE finds dense cells
in all of the subspaces.
CLIQUE – Phase 1
• To do so, CLIQUE partitions every dimension into intervals, and
identifies intervals containing at least l points, where l is the density
threshold. CLIQUE then iteratively joins two k-dimensional dense
cells, c1 and c2, in subspaces (Di1, Di2, ..., Dik) and (Dj1, Dj2, ..., Djk), provided
they share their first k−1 dimensions (Di1 = Dj1, ..., Dik−1 = Djk−1) and c1 and c2
have the same intervals in those shared dimensions.
• The join operation generates a new (k+1)-dimensional candidate cell c
in the space (Di1, ..., Dik−1, Dik, Djk).
• CLIQUE checks whether the number of points in c passes the density
threshold. The iteration terminates when no candidates can be
generated or no candidate cells are dense.
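
A rough 2-D sketch of this phase-1 idea: bin each dimension into fixed-width intervals, keep the dense 1-D intervals, and then test which joined 2-D candidate cells are dense. The interval width, the density threshold l, and the data are illustrative assumptions, and the join is simplified to the 2-D case:

from collections import Counter
from itertools import product

def dense_units(points, width=2.0, l=3):
    dims = len(points[0])
    # 1-D dense intervals per dimension: interval index = floor(value / width)
    per_dim = [Counter(int(p[d] // width) for p in points) for d in range(dims)]
    dense_1d = [{i for i, c in cnt.items() if c >= l} for cnt in per_dim]
    # Join 1-D dense intervals into 2-D candidate cells and keep the dense ones
    dense_2d = set()
    for i0, i1 in product(dense_1d[0], dense_1d[1]):
        count = sum(1 for p in points
                    if int(p[0] // width) == i0 and int(p[1] // width) == i1)
        if count >= l:
            dense_2d.add((i0, i1))
    return dense_1d, dense_2d

data = [(1, 1), (1.5, 1.2), (0.5, 0.8), (9, 9), (9.5, 8.7), (8.8, 9.2), (1, 9)]
print(dense_units(data))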
CLIQUE – Phase 2
• In the second step, CLIQUE uses the dense cells in each subspace to
assemble clusters, which can be of arbitrary shape. The idea is to apply
the Minimum Description Length (MDL) principle to use the maximal
regions to cover connected dense cells, where a maximal region is a
hyperrectangle where every cell falling into this region is dense, and the
region cannot be extended further in any dimension in the subspace.
• It starts with an arbitrary dense cell, finds a maximal region covering
the cell, and then works on the remaining dense cells that have not yet
been covered. The greedy method terminates when all dense cells are
covered.
Evaluation of clustering
• Clustering evaluation assesses the feasibility of clustering analysis on a
data set and the quality of the results generated by a clustering
method.
• The major tasks of cluster evaluation are:
• Assessing clustering tendency. In this task, for a given data set, we
assess whether a nonrandom structure exists in the data. Blindly
applying a clustering method on a data set will return clusters;
however, the clusters mined may be misleading. Clustering analysis on
a data set is meaningful only when there is a nonrandom structure in
the data.
Cluster evaluation
• Determining the number of clusters in a data set. A few algorithms, such as k-
means, require the number of clusters in a data set as the parameter.
Moreover, the number of clusters can be regarded as an interesting and
important summary statistic of a data set. Therefore, it is desirable to
estimate this number even before a clustering algorithm is used to derive
detailed clusters.
• Measuring clustering quality. After applying a clustering method on a data
set, we want to assess how good the resulting clusters are. A number of
measures can be used. Some methods measure how well the clusters fit the
data set, while others measure how well the clusters match the ground truth,
if such truth is available. There are also measures that score clusterings and
thus can compare two sets of clustering results on the same data set.
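
As one concrete example of such a quality measure (not named in the slides), here is a minimal sketch of the silhouette coefficient for a single object; values near 1 indicate the object sits well inside its cluster, values near −1 indicate it is likely misplaced. The toy clusters are made up, and the object's own cluster is assumed to have at least two members:

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(point, own_cluster, other_clusters):
    # a = mean distance to the other members of the point's own cluster
    a = sum(dist(point, p) for p in own_cluster if p != point) / (len(own_cluster) - 1)
    # b = mean distance to the members of the closest other cluster
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

c1, c2 = [(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8)]
print(silhouette((1, 1), c1, [c2]))   # close to 1: (1, 1) is well placed in c1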
End of Unit 4
