
Unit 4: Cluster Analysis

2 Marks / 4 Marks Questions


1. What is clustering? (refer to the 12-mark questions below)
2. What are the requirements of clustering? (refer to the 12-mark questions below)
3. State the categories of clustering methods. (refer to the 12-mark questions below)
4. Difference between K-Means and K-Medoids algorithms:

● Cluster center: K-Means uses centroids (means of points) as cluster centers; K-Medoids uses medoids (actual data points).
● Outliers: K-Means is sensitive to outliers because the mean is influenced by extreme values; K-Medoids is more robust since it uses actual points (medoids).
● Efficiency: K-Means is efficient for large datasets due to simpler computations; K-Medoids is slower because it must compute distances between all pairs of points.
● Computational complexity: K-Means is O(nk); K-Medoids is O(n²k).
● Cluster shape: K-Means works well for spherical clusters; K-Medoids works better for non-spherical and complex clusters.
● Data types: K-Means only works with numerical data (mean calculation); K-Medoids can also work with non-numeric data.
● Resources: K-Means requires fewer computational resources; K-Medoids requires more because of the distance calculations.
● Suitability: K-Means is not ideal when clusters are not globular; K-Medoids is better when clusters are non-spherical or contain noise.

5. What do you mean by Hierarchical Clustering? (refer to the questions below)
6. What do you mean by Agglomerative Clustering? (refer to the questions below)
7.​ What do you mean by Divisive Clustering?
●​ A top-down hierarchical clustering method.
●​ Starts with all data points in one cluster.
●​ Recursively splits clusters into smaller ones.
● Splitting is based on a chosen split criterion (e.g., splitting at the median or mean).
●​ Continues until each cluster contains a single data point.
●​ Unlike Agglomerative, which merges clusters, Divisive splits them.
●​ Produces a dendrogram showing the split hierarchy.
● Example: Start with the dataset {10, 20, 30, 40, 50, 60} and split at the median value (35) into two clusters (a code sketch follows below).
●​ Advantage: Does not require predefining the number of clusters; produces a
hierarchy of clusters for flexible analysis.
●​ Disadvantage: Computationally expensive, as it needs to check and split at each
level recursively.
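Illustrative sketch (not from the source PDF): a minimal Python implementation of the median-split idea on the example dataset above; the split rule and stopping condition are simplifying assumptions.

```python
# Minimal sketch of divisive (top-down) clustering by recursive median splits.
# Assumes 1-D numeric data and a simple "split at the median" rule; real divisive
# methods use more elaborate split criteria.

def divisive_split(points):
    """Recursively split a sorted list of values at its median until singletons remain."""
    points = sorted(points)
    if len(points) <= 1:
        return [points]                       # singleton cluster: stop splitting
    median = (points[len(points) // 2 - 1] + points[len(points) // 2]) / 2
    left = [p for p in points if p <= median]
    right = [p for p in points if p > median]
    if not left or not right:                 # degenerate split: stop
        return [points]
    return divisive_split(left) + divisive_split(right)

print(divisive_split([10, 20, 30, 40, 50, 60]))
# First split at median 35 -> {10, 20, 30} and {40, 50, 60}, then further splits.
```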

8. Both k-means and k-medoids algorithms can perform effective clustering.
(a) Illustrate the strength and weakness of k-means in comparison with the k-medoids algorithm.
(b) Illustrate the strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES).
(a) Strength and Weakness of K-Means vs. K-Medoids:
● K-Means strengths: fast and efficient for large datasets (O(nk)); works well for spherical clusters; lower computational cost than K-Medoids.
● K-Means weaknesses: sensitive to outliers (the mean is influenced by extreme values); assumes spherical clusters, which is limiting.
● K-Medoids strengths: more robust to outliers since it uses actual points (medoids); can work with non-numeric and categorical data; can better handle non-spherical and complex shapes.
● K-Medoids weaknesses: slower because it must compute distances between all pairs of points; higher computational complexity (O(n²k)).
(b) Strength and Weakness of K-Means/K-Medoids vs. Hierarchical Clustering (e.g., AGNES):
● K-Means/K-Medoids strengths: more efficient and faster for large datasets; K-Means is faster (O(nk)), while K-Medoids is more robust to outliers.
● K-Means/K-Medoids weaknesses: K-Means is sensitive to the initial centroids and assumes spherical clusters; K-Medoids becomes slow on larger datasets.
● Hierarchical clustering strengths: does not require the number of clusters to be predefined; provides a hierarchy (dendrogram) that shows cluster relationships.
● Hierarchical clustering weaknesses: computationally expensive (O(n²)); cannot handle large datasets efficiently; can be affected by noisy data.

9. Give an example of how specific clustering methods may be integrated, for example, where one clustering algorithm is used as a preprocessing step for another.
❖​ BIRCH integrates two clustering methods in a two‐phase process:
➢​ Phase 1 (Hierarchical preprocessing): Build an in-memory CF-tree by
inserting each data point into the closest leaf entry, splitting nodes when
the diameter threshold T is exceeded.
➢​ Phase 2 (Partitioning refinement): Apply a partitioning algorithm (e.g.
K-Means) to the CF-tree’s leaf entries (each entry summarizing a
subcluster).
❖​ Benefit: CF-tree compression drastically reduces the number of objects K-Means
must process, improving speed and memory usage.
❖​ Example: Given a large numeric dataset, BIRCH first creates a CF-tree (with
Branching Factor B and threshold T) that summarizes thousands of points into a
few hundred CF entries; then K-Means clusters these CF entries to produce the
final clusters.
❖ Other integration (mentioned in PDF): CHAMELEON first partitions a k-nearest-neighbor graph of the data into many small subclusters, then merges them agglomeratively using dynamic modeling (relative interconnectivity and closeness), combining strengths of both methods.
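A hedged sketch of this two-phase idea using scikit-learn's Birch and KMeans; the synthetic data, threshold, and branching factor below are illustrative assumptions, not values from the source.

```python
# Sketch of the BIRCH -> K-Means integration described above (scikit-learn).
from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)

# Phase 1 (hierarchical preprocessing): build the CF-tree; each leaf entry summarizes a subcluster.
birch = Birch(threshold=0.8, branching_factor=50, n_clusters=None).fit(X)
summaries = birch.subcluster_centers_           # a few hundred entries instead of 10,000 points
print("CF leaf entries:", len(summaries))

# Phase 2 (partitioning refinement): run K-Means on the compressed summaries.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(summaries)

# Label each original point with the cluster of its nearest CF summary.
labels = kmeans.labels_[pairwise_distances_argmin(X, summaries)]
```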
12 Marks Questions
1.​ What do you mean by Clustering? Explain the requirements used in
Clustering.
●​ Cluster: A group of similar data objects that are distinct from objects in other
groups.
●​ Cluster Analysis: Identifies similarities in data and groups similar objects together.
●​ Unsupervised Learning: No predefined classes; learning is based on patterns in data
rather than labeled examples.
●​ Applications:
○​ Used for understanding data distribution.
○​ Serves as a preprocessing step for other algorithms.
Example of Cluster Analysis:
●​ Scenario: A retail company segments customers based on annual income and
spending behavior.
●​ Dataset: Includes customer income and spending scores.
●​ Clustering Method: K-Means (K=3)
●​ Process:
○​ Assign initial cluster centers.
○​ Group customers based on proximity to centroids.
○​ Recalculate centroids and repeat until stable clusters form.
●​ Result:
○​ Cluster 1: High-income, low-spenders.
○​ Cluster 2: Low-income, high-spenders.
○​ Cluster 3: Middle-income, moderate-spenders.
Requirements for Clustering (Based on PDF content):
●​ Partitioning Requirement: Divides dataset into clusters where intra-cluster
similarity is high and inter-cluster similarity is low.
●​ Hierarchical Requirement: Organizes clusters into a tree or nested structure
(dendrogram).
●​ Density-Based Requirement: Identifies clusters of arbitrary shape based on
regions of higher density.
●​ Grid-Based Requirement: Divides data space into a finite number of cells that
form a grid structure.
●​ Evaluation Requirement: Measures clustering quality using methods like Sum of
Squared Error (SSE), Silhouette Coefficient, etc.​
2.​ Briefly describe and give examples of each of the following approaches
to clustering: partitioning methods, hierarchical methods, density-based
methods and grid-based methods.
1. Partitioning Methods:
●​ Partitioning methods divide the dataset into a predefined number of clusters.
●​ K-Means Algorithm: Divides the data into k clusters by minimizing the variance
within each cluster.
●​ Example: A retail company segments customers based on income and spending
behavior using K-Means (K=3).
●​ K-Medoids (PAM): Similar to K-Means but uses actual data points (medoids) instead
of centroids.
●​ Example: Segmenting customers into high-income, low-income, and middle-income
groups.
●​ CLARA (Clustering Large Applications): Works on samples from large datasets,
applies PAM, and chooses the best clustering configuration.
2. Hierarchical Methods:
●​ Hierarchical methods build a tree-like structure (dendrogram) of clusters.
●​ Types:
●​ Agglomerative (Bottom-Up): Starts with individual data points as clusters and
merges the closest clusters.
●​ Example: Merging clusters based on proximity until all points form one cluster.
●​ Divisive (Top-Down): Starts with all points in one cluster and recursively splits it
into smaller clusters.
●​ Example: Splitting a cluster into smaller ones based on median values.
●​ Advantages: No need to predefine the number of clusters.
●​ Disadvantages: Computationally expensive for large datasets.
3. Density-Based Methods:
●​ Clustering based on density (local cluster criterion), such as density-connected
points.
●​ DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups
points that are closely packed while marking outliers as noise.
● Example: Clustering days with similar weather conditions (e.g., temperature, humidity).
●​ OPTICS: Similar to DBSCAN but produces an augmented reachability plot for
better visual understanding of density.
●​ Advantages: Can discover clusters of arbitrary shape and handle noise.
4. Grid-Based Methods:
●​ Methods that use a multi-resolution grid data structure to partition the data space.
●​ STING (Statistical Information Grid): Divides data space into hierarchical
rectangular grids and uses statistical information for clustering.​
Example: Clustering based on temperature and humidity data from weather
stations.
● CLIQUE (Clustering In QUEst): A density-based, grid-based clustering method that finds clusters in subspaces of high-dimensional data.
●​ Example: Clustering users based on their behavior patterns in a multi-dimensional
space.
●​ Advantages: Efficient for large datasets; works well in high-dimensional spaces.
●​ Disadvantages: Fixed grid structure; may not detect clusters if grid cells are too
large or too small.
3.​ Explain in detail about Hierarchical Clustering.
Hierarchical clustering organizes data into a tree of nested clusters (a dendrogram), which is useful when the number of clusters is not known in advance.
Types
• Agglomerative (bottom-up): start with each point as its own cluster and iteratively merge the two closest clusters until one remains.
• Divisive (top-down): start with all points in one cluster and recursively split clusters (e.g., at the median) until each contains a single point.
Agglomerative Steps ​
1.​ Initialize N clusters (one per data point)
2.​ Find the pair of clusters with minimum distance and merge (N→N–1)
3.​ Recompute distances between the new cluster and all others
4.​ Repeat step 2–3 until one cluster remains
5.​ Cut the dendrogram at desired level to form final clusters
Divisive Steps ​
1.​ Begin with all points in one cluster
2.​ Choose a split criterion (e.g. median) to divide into two subclusters
3.​ Recursively apply splitting to subclusters until singleton clusters
Linkage (distance) methods decide how to measure inter-cluster distance​
• Single‐link: minimum distance between any two members​
• Complete‐link: maximum distance between any two members​
• Average‐link: average pairwise distance​
• Centroid: distance between cluster centroids​
• Medoid: distance between cluster medoids (most centrally located points)
Dendrogram ​
• Tree-like plot: leaves = data points; internal nodes = merges​
• Y-axis shows merge distance; cutting at height H yields clusters​
Applications ​
• Market segmentation: group customers by purchasing behavior​
• Document clustering: group texts/articles by topic​
• Healthcare: group patients by symptoms or genetic profiles
Advantages / Drawbacks ​
• No need to predefine k; produces a dendrogram; supports non-spherical clusters; works
on small datasets​
• Computationally expensive (O(n²)); not suited to very large datasets; sensitive to noise
and outliers
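A small illustrative sketch of agglomerative clustering with SciPy (linkage/fcluster); the six 2-D points and the cut level are assumptions for demonstration.

```python
# Illustrative agglomerative clustering with SciPy; the six 2-D points are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[1, 2], [2, 2], [1.5, 1.8], [8, 9], [8.5, 9.5], [9, 9]])

Z = linkage(points, method='average')            # average-link inter-cluster distance
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 flat clusters
print(labels)                                    # e.g. [1 1 1 2 2 2]: left group vs right group
# dendrogram(Z) draws the merge tree: leaves = points, y-axis = merge distance
```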
4.​ Explain in detail about the partitional Clustering method.
Partitioning method: divide a database D of n objects into k clusters to minimize the sum
of squared distances (centroid or medoid) ​
• Given k, find a partition that optimizes this criterion​
• Exact (global) optimum by enumeration is infeasible ⇒ use heuristic methods
1. K-Means Clustering
●​ Algorithm Steps
1.​ Choose k and select k instances at random as initial centroids
2.​ Assign each object to the nearest centroid (e.g. Euclidean distance)
3.​ Recompute each centroid as the mean of its assigned points
4.​ Repeat steps 2–3 until no change in assignments
●​ Metric: Sum of Squared Error (SSE); aim to minimize SSE
●​ Advantages: simple; easy to implement
●​ Disadvantages: sensitive to initialization; high cost for large n
● Choosing k: “elbow” method: plot SSE versus k and pick the point where the curve bends (see the sketch below).
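A short sketch of the elbow method using scikit-learn, where inertia_ is the SSE; the synthetic blobs and the range of k are illustrative assumptions.

```python
# Elbow-method sketch: plot SSE (sklearn's inertia_) against k and look for the bend.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

sse = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)          # inertia_ = within-cluster sum of squared errors

plt.plot(list(ks), sse, marker='o')
plt.xlabel('k'); plt.ylabel('SSE'); plt.title('Elbow method')
plt.show()                           # the "elbow" should appear near k = 4
```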
2. K-Medoids (PAM) ​
●​ Algorithm Steps
1.​ Initialize: select k actual data points as medoids
2.​ Assignment: assign each non-medoid to the nearest medoid (e.g. Manhattan distance)
3. Update: for each medoid/non-medoid pair, swap them if the swap reduces the total dissimilarity (sum of distances)
4.​ Repeat until no swap reduces cost
●​ Robustness: uses actual data points ⇒ less sensitive to outliers than K-Means
●​ Disadvantages:​
• Not ideal for non-spherical clusters and costly.​
• Random initialization can yield different results
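A compact, illustrative PAM sketch of the swap loop above (Manhattan distance, exhaustive swap search); it assumes a small in-memory dataset and random initial medoids, so it is a teaching aid rather than a production implementation.

```python
# Compact PAM (k-medoids) sketch with Manhattan distance and exhaustive swaps.
import numpy as np

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise Manhattan distances
    medoids = list(rng.choice(len(X), size=k, replace=False))  # step 1: random initial medoids
    cost = lambda meds: dist[:, meds].min(axis=1).sum()        # total dissimilarity
    best = cost(medoids)
    improved = True
    while improved:                                            # repeat until no swap reduces cost
        improved = False
        for i in range(k):
            for h in range(len(X)):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]    # swap medoid i with non-medoid h
                if cost(trial) < best:
                    medoids, best, improved = trial, cost(trial), True
    labels = dist[:, medoids].argmin(axis=1)                   # assign each point to nearest medoid
    return np.array(medoids), labels, best

X = np.array([[2., 3], [3, 3], [3, 2], [6, 5], [8, 8], [5, 7]])
medoid_idx, labels, total_cost = pam(X, k=2)
```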
3. CLARA (Clustering Large Applications) - apply PAM to large datasets via sampling
Algorithm Steps
a.​ Draw a random sample of size s from the full dataset
b.​ Apply PAM on the sample to find k medoids
c.​ Compute total cost of these medoids on the entire dataset
d. Repeat steps a–c for several samples; select the medoids with the lowest overall cost
Drawbacks:​
• If samples miss important regions or outliers, clusters may be suboptimal​
• Still computationally expensive due to multiple PAM runs
5.​ Explain in detail about Clustering methods with an example.
Partitioning a database D of n objects into k clusters such that the sum of squared errors
(SSE) within clusters is minimized​
– Each cluster is represented by a centroid (mean) in K-Means
• K-Means Algorithm Steps ​
1.​ Select k (number of clusters)
2.​ Initialize: choose k data points at random as initial centroids
3.​ Assignment: assign each object to the nearest centroid (e.g., Euclidean distance)
4.​ Update: recompute each centroid as the mean of all points assigned to it
5.​ Repeat steps 3–4 until no object changes its assigned cluster
• Metric
SSE (Sum of Squared Error): SSE = Σᵢ₌₁..k Σ_{x ∈ Cᵢ} ‖x − mᵢ‖², where Cᵢ is the i-th cluster and mᵢ its centroid
– The algorithm seeks to minimize SSE
Convergence and Choice of k​
– Converges when assignments no longer change​
– “Elbow method”: plot SSE vs k and pick k at the elbow point
• Advantages and Disadvantages ​
– Advantages: simple; easy to implement​
– Disadvantages: sensitive to initialization; may converge to local minimum; expensive for
large datasets
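A from-scratch sketch of the steps above, reporting SSE at convergence; the sample points and k are assumptions, and empty clusters are not handled.

```python
# From-scratch K-Means sketch following steps 1-5 above.
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 2: random initial centroids
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                  # step 3: assign to nearest centroid
        # note: empty clusters are not handled in this sketch
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        sse = ((X - new_centroids[labels]) ** 2).sum()             # SSE = sum of squared errors
        if np.allclose(new_centroids, centroids):                  # step 5: stop when assignments stabilize
            break
        centroids = new_centroids                                  # step 4: recompute centroids
    return centroids, labels, sse

X = np.array([[2., 3], [3, 3], [3, 2], [6, 5], [8, 8], [5, 7]])
centroids, labels, sse = kmeans(X, k=2)
```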
6.​ Describe each of the following clustering algorithms in terms of the
following criteria:
(1) shapes of clusters that can be determined;
(2) input parameters that must be specified; and
(3) limitations: k-means, k-medoids, DBSCAN.
K-Means​
• Shapes of clusters: Finds compact, roughly spherical (convex) clusters by minimizing
within-cluster variance ​
• Input parameters - k (number of clusters) & Choice of distance metric (e.g. Euclidean)​
• Limitations​
Sensitive to initial centroids (different runs may yield different clusters)​
Cannot handle non-spherical or varying-density clusters​
Sensitive to outliers (they distort centroids)​
High computational cost for large n ​
K-Medoids (PAM)​
• Shapes of clusters: Finds compact clusters around actual data points (medoids); still
favors roughly spherical/compact shapes ​
• Input parameters - k (number of medoids) & Choice of distance metric (e.g. Manhattan or
Euclidean)​
• Limitations​
Not suitable for arbitrary-shaped clusters (uses compactness criterion)​
Random medoid initialization can lead to different results​
More computationally expensive than k-means (swap evaluations)​
Memory/storage intensive; does not scale well to very large datasets ​
DBSCAN​
• Shapes of clusters - Discovers clusters of arbitrary shape (density-connected regions)
and identifies noise ​
• Input parameters - ε (Epsilon): radius for neighborhood & MinPts: minimum number of
points required to form a dense region (core point) ​
• Limitations​
Sensitive to choice of ε and MinPts.​
Difficulty handling data with widely varying density​
Performance degrades in high-dimensional spaces (“curse of dimensionality”) ​
Problem Solving / Case Studies
1.​ Suppose that the data mining task is to cluster points (with (x, y)
representing location) into three clusters, where the points are A1(20,
100), A2(4, 10), B1(6, 9), B2(8, 10), B3(6, 4), C1(1, 2), C2(4, 9),
C3(3,5). The distance function is Euclidean distance. Suppose initially we
assign A1, B1, and C1 as the center of each cluster, respectively. Use
the k-means algorithm to show only
i.​ The three cluster centers after the first round of
execution, and
ii.​ The final three clusters.
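A hedged scikit-learn sketch for this exercise, fixing the initial centers to A1, B1, and C1; the hand-checked values in the comments follow from carrying out the assignment steps by hand.

```python
# Sketch for this exercise: k-means with the initial centers fixed to A1, B1, C1.
# A run capped at one iteration gives the centers after round 1; an uncapped run
# gives the final clusters.
import numpy as np
from sklearn.cluster import KMeans

# A1, A2, B1, B2, B3, C1, C2, C3
X = np.array([[20, 100], [4, 10], [6, 9], [8, 10], [6, 4], [1, 2], [4, 9], [3, 5]], dtype=float)
init = np.array([[20, 100], [6, 9], [1, 2]], dtype=float)        # A1, B1, C1

round1 = KMeans(n_clusters=3, init=init, n_init=1, max_iter=1).fit(X)
print("centers after round 1:\n", round1.cluster_centers_)
# Hand-checking the first assignment gives (20, 100), (5.6, 8.4), (2, 3.5).

final = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print("final cluster labels:", final.labels_)
# Hand-checking suggests the final clusters are {A1}, {A2, B1, B2, C2}, {B3, C1, C3}.
```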
2.​ Use an example to show why the k-means algorithm may not find the
global optimum, that is, optimizing the within-cluster variation.
Given Data Points:​
(2,3), (3,3), (6,5), (8,8), (3,2), (5,7)
Number of clusters (k): 2
Step 1: Initial random centroids chosen: (2,3) and (8,8)
Step 2: First Iteration - Assign clusters based on distance to centroids:
a.​ Cluster 1: (2,3), (3,3), (3,2)
b.​ Cluster 2: (6,5), (8,8), (5,7)
Step 3: Calculate new centroids:
a. New Centroid 1: (2.67, 2.67)
b. New Centroid 2: (6.33, 6.67)
Step 4: Reassign points based on updated centroids:
c.​ Cluster 1: (2,3), (3,3), (3,2)
d.​ Cluster 2: (6,5), (8,8), (5,7)
Step 5: Convergence achieved (no change in cluster membership).
Important Observation:
●​ Though convergence is achieved, K-Means may have converged to a local
minimum, not the global minimum.
●​ This happens because the initial centroids were randomly chosen, leading the
algorithm to a solution that might not represent the best (globally optimal) cluster
configuration.
●​ Global optimum could be different if a better set of initial centroids were chosen.
Conclusion:
K-Means is sensitive to the initial selection of centroids.
It can get stuck in a local minimum, giving a suboptimal clustering result even if the
algorithm converges.
Running K-Means multiple times with different random initializations can help in
approaching a better solution.​
4.​ Explain the BIRCH clustering algorithm in detail. Given the dataset:
(1,2), (2,2), (1.5,1.8), (8,9), (8.5,9.5), (9,9), (25,80). Build a CF
(Clustering Feature) tree with a threshold of 1.5 (assume each CF node
can hold max 2 entries). Show how the tree is updated after each
insertion.
Clustering Feature (CF):​
A CF is a compact representation of a cluster.​
It is a triplet:
●​ N = Number of points
●​ LS = Linear Sum (sum of all the points)
●​ SS = Square Sum (sum of squares of all the points)
CF Tree:
●​ A height-balanced tree storing the clustering features.
●​ Each non-leaf node contains entries pointing to child nodes.
●​ Each leaf node holds CF entries.
●​ Each node has a max number of entries it can hold (here max=2).
●​ New points are inserted by traversing from the root down to the leaves,
merging with the closest CF entry if possible (based on threshold distance).
Threshold (T):
●​ Max radius allowed for a CF entry.
●​ If adding a point would cause the cluster's radius to exceed T = 1.5, we
create a new CF entry instead of merging.
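An illustrative sketch of the CF arithmetic behind this exercise; the radius formula used, R = sqrt(SS/N − ‖LS/N‖²), is the standard BIRCH definition and is assumed here. The full tree structure with node splits is not modeled.

```python
# Sketch of the CF arithmetic used when building the tree above. CFs are additive:
# absorbing a point just adds to (N, LS, SS). The radius test against T = 1.5
# decides whether a point is absorbed or starts a new entry.
import numpy as np

def make_cf(point):
    p = np.asarray(point, dtype=float)
    return {"N": 1, "LS": p.copy(), "SS": float(p @ p)}

def absorb(cf, point):
    p = np.asarray(point, dtype=float)
    return {"N": cf["N"] + 1, "LS": cf["LS"] + p, "SS": cf["SS"] + float(p @ p)}

def radius(cf):
    centroid = cf["LS"] / cf["N"]
    return float(np.sqrt(max(cf["SS"] / cf["N"] - centroid @ centroid, 0.0)))

T = 1.5
cf = make_cf((1, 2))
for p in [(2, 2), (1.5, 1.8)]:
    trial = absorb(cf, p)
    cf = trial if radius(trial) <= T else make_cf(p)   # absorb, or start a new CF entry
print(cf["N"], cf["LS"], round(radius(cf), 3))          # (1,2), (2,2), (1.5,1.8) fit in one entry
```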
5.​ Compare and evaluate clustering techniques: DBSCAN, BIRCH, and
CLIQUE. Use a small dataset and explain how each would form clusters,
handle noise, and scale with dimensionality. Include diagrams where
appropriate.

● Cluster shape: DBSCAN finds arbitrary shapes; BIRCH favors spherical clusters; CLIQUE finds arbitrary shapes within subspaces.
● Noise handling: DBSCAN detects and labels noise; BIRCH is sensitive to noise; CLIQUE treats sparse regions as noise.
● Input parameters: DBSCAN needs ε (Epsilon) and MinPts; BIRCH needs a branching factor (B) and diameter threshold (T); CLIQUE needs a grid size and density threshold.
● Scalability: DBSCAN has moderate scalability; BIRCH is highly scalable (linear time); CLIQUE is highly scalable and suited to high dimensions.
● Dimensionality handling: DBSCAN is poor in high dimensions; BIRCH works mainly on low-dimensional numeric data; CLIQUE is excellent for high-dimensional subspaces.
● Cluster formation method: DBSCAN expands clusters by density from core points; BIRCH inserts points into a CF-tree based on a threshold; CLIQUE merges dense units in grids.
● Diagram reference (in the source PDF): DBSCAN example on page 130; BIRCH CF-tree on page 97; CLIQUE grids on page 176.
Theory / Concept Questions
1.​ Differentiate between agglomerative and divisive hierarchical clustering.
Illustrate each method with a simple example.

● Approach: Agglomerative is bottom-up; Divisive is top-down.
● Starting point: Agglomerative starts with each object in its own cluster; Divisive starts with all objects in one cluster.
● Operation: Agglomerative merges clusters iteratively; Divisive splits clusters iteratively.
● Criterion: Agglomerative is based on nearest-neighbor linkage; Divisive is based on a best-separation criterion.
● Dendrogram construction: Agglomerative builds the dendrogram by merging clusters; Divisive builds it by splitting clusters.
● Termination: Agglomerative stops when all objects form one cluster; Divisive stops when each object is its own cluster.
● Usage: Agglomerative is the more commonly used method; Divisive is less commonly used.
● Number of clusters: Neither requires pre-specifying the number of clusters.
● Distance handling: Agglomerative uses single-link, complete-link, average-link, or centroid methods; Divisive uses median- or distance-based splitting.
● Complexity: Agglomerative has time complexity O(n²); Divisive is more computationally expensive.
● Example: Agglomerative merges (P2, P4) and (P1, C1) based on distance; Divisive splits {10, 20, 30, 40, 50, 60} by the median value.
● End state: Agglomerative forms one big cluster at the end; Divisive ends with singleton clusters.
● Noise: Agglomerative is sensitive to noise and outliers; Divisive can separate outliers early during splits.
● Dendrogram direction: Agglomerative dendrograms grow upward; Divisive dendrograms grow downward.
● Visualization: Agglomerative cluster formation is visualized by horizontal cuts on the dendrogram; Divisive by vertical splits.
2.​ Explain the DBSCAN clustering algorithm. Given the following set of 2D
points, perform one iteration of DBSCAN with ε = 2 and MinPts = 2.
Show the core points and clusters formed. Points: (1,2), (2,2), (2,3),
(8,7), (8,8), (25,80)
DBSCAN = Density-Based Spatial Clustering of Applications with Noise.
Groups together dense regions; separates low-density regions as outliers.
Needs two parameters: ε (epsilon - neighborhood radius) and MinPts (minimum number of
points in ε).
Points classified into:
●​ Core Points (≥ MinPts within ε)
●​ Border Points (near core points but < MinPts themselves)
●​ Noise Points (neither core nor border)
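A hedged scikit-learn sketch for this exercise (ε = 2, MinPts = 2); labels of −1 mark noise.

```python
# Sketch for the exercise above using scikit-learn's DBSCAN with eps = 2, min_samples = 2.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
db = DBSCAN(eps=2, min_samples=2).fit(X)

print("labels:", db.labels_)                     # expect two clusters, with (25, 80) as noise (-1)
print("core point indices:", db.core_sample_indices_)
```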
3.​ Write a short note on BIRCH and STING clustering techniques.
Highlight their key features and differences.
BIRCH Clustering Technique:
●​ BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clusters
large datasets by first generating a compact summary called a CF (Clustering
Feature) tree​.
●​ Phases:
○​ Phase 1: Scans database to build an initial in-memory CF-tree.
○​ Phase 2: Applies any clustering algorithm (e.g., K-Means) on CF tree leaves​.
●​ Clustering Feature (CF):
○​ CF = (N, LS, SS)
■​ N = Number of data points
■​ LS = Linear sum of data points
■​ SS = Square sum of data points​.
●​ CF-Tree:
○​ A height-balanced tree storing clustering features.
○​ Parameters: Branching factor (B) and Threshold (T)​.
●​ Advantages:
○​ Scales linearly with the number of data points.
○​ Finds good clustering with a single scan and improves with few additional
scans.
●​ Disadvantages:
○​ Handles only numeric data.
○​ Sensitive to insertion order.
○​ Works better for spherical shaped clusters​.
STING Clustering Technique:
●​ STING (Statistical Information Grid) is a grid-based clustering technique that
divides the data space into hierarchical rectangular cells​.
●​ Process:
○​ Divides data into multi-level grids (top-down).
○​ Statistical parameters (mean, variance, distribution type) are computed for
each cell​.
○​ Uses confidence intervals to filter cells as you move down the hierarchy.
●​ Advantages:
○​ Fast and scalable due to precomputed statistics.
○​ Efficient for large spatial datasets.
○​ Easy to parallelize and incrementally update.
○​ O(K) time complexity, where K = number of bottom-level cells​.
●​ Disadvantages:
○​ Cannot detect diagonal cluster boundaries (only horizontal/vertical).
○​ Fixed grid resolution can cause loss of accuracy​.
●​ Example Use Case:
○​ Weather data clustering (Temperature vs Humidity)​.
Key Differences between BIRCH and STING:

● BIRCH builds a CF-tree from the data points; STING divides the data space into hierarchical grids.
● BIRCH suits numeric, relatively low-dimensional data; STING suits spatial and high-dimensional data.
● BIRCH is a hierarchical, distance-based method; STING is a grid-based, statistical method.
● BIRCH needs explicit thresholds (e.g., a diameter threshold); STING works with precomputed statistical measures.
● BIRCH handles incremental data; STING's clustering is query-independent.
● BIRCH is sensitive to the input order; STING's fixed resolution can impact accuracy.
● BIRCH is effective for spherical clusters; STING is effective for density-based clusters.
● BIRCH uses a two-phase process (CF-tree plus an external clustering method); STING is purely grid-based.
4.​ Compare partitioning methods and hierarchical methods in clustering.
Provide examples where each method is most suitable.

● Partitioning methods divide the data into k clusters; hierarchical methods build a hierarchy of clusters.
● Partitioning requires a predefined number of clusters (k); hierarchical methods do not.
● Partitioning optimizes an objective function (e.g., SSE); hierarchical methods merge or split clusters based on distance measures.
● Examples: K-Means and K-Medoids (partitioning) versus Agglomerative and Divisive (hierarchical).
● Partitioning works well on large datasets; hierarchical methods are more suitable for small datasets.
● Partitioning algorithms are fast and efficient; hierarchical methods are computationally expensive (O(n²)).
● Partitioning is sensitive to initial centroids (for K-Means); hierarchical methods are sensitive to noise and outliers.
● Partitioning produces a flat cluster structure; hierarchical methods produce a nested structure (dendrogram).
● Partitioning is poor for clusters of arbitrary shape; hierarchical methods can capture complex cluster shapes.
● Partitioning is easy to implement and scale; hierarchical methods are harder to implement for large data.
● Final clusters are independent in partitioning; in hierarchical clustering they are nested (dependent).
● Partitioning suits applications like market segmentation; hierarchical clustering suits applications like phylogenetic tree construction.
● Partitioning handles overlapping clusters poorly; hierarchical methods handle overlapping and nested clusters better.
● Partitioning does not capture cluster history; hierarchical clustering captures the merging/splitting history.
● Partitioning can reassign points during iterations; hierarchical merges or splits cannot be undone.
● Example where suitable: customer segmentation in retail (partitioning); creating a taxonomy of animal species (hierarchical).
5.​ Describe the working of the CLIQUE algorithm. What makes it suitable
for high-dimensional data clustering?
Working of the CLIQUE Algorithm :
●​ Nature: CLIQUE is both grid-based and density-based clustering.
●​ Partitioning the Space:
○​ Divides each dimension into equal-length intervals.
○​ Forms a multi-dimensional grid of rectangular, non-overlapping units.
●​ Density Calculation: A unit (grid cell) is marked dense if the fraction of total points
inside exceeds a threshold.
●​ Cluster Formation: A cluster is a maximal set of connected dense units within a
subspace.
●​ Subspace Clustering: Finds clusters in low-dimensional subspaces, not necessarily
the full space.
●​ Major Steps :
○​ Partition data space into grid cells.
○​ Identify dense units (based on a density threshold).
○​ Merge adjacent dense units to form clusters.
○​ Identify clusters across various subspaces using the Apriori principle.
○​ Generate a minimal description for each cluster (covering maximal connected
dense regions).
What Makes CLIQUE Suitable for High-Dimensional Data :
Efficient subspace clustering: Automatically identifies relevant subspaces where clusters
exist, avoiding the full-dimensional space.
Scalability: Scales linearly with the size of input data.
Insensitive to input order: Does not depend on how data is ordered.
Flexible to data distribution: Does not presume any fixed data distribution (e.g., normal
or uniform).
Grid-Based Advantages: Reduces computation by working on summarized grid information
instead of raw points.
Strength: Particularly good for high-dimensional datasets where only a few dimensions are
relevant for clusters.​
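An illustrative sketch of CLIQUE's grid/density step in the full 2-D space only; the subspace search via the Apriori principle is omitted, and the grid width, density threshold, and synthetic data are assumptions.

```python
# Sketch of the grid partitioning, dense-unit identification, and dense-unit merging
# steps described above, restricted to 2-D for brevity.
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((2, 2), 0.3, (100, 2)), rng.normal((8, 8), 0.3, (100, 2))])

interval = 1.0                 # partition each dimension into equal-length intervals
tau = 0.05                     # a unit is dense if it holds more than 5% of all points

cells = Counter(map(tuple, np.floor(X / interval).astype(int)))
dense = {c for c, n in cells.items() if n / len(X) > tau}

# Merge adjacent dense units (sharing a face) into clusters via a simple flood fill.
clusters, seen = [], set()
for start in dense:
    if start in seen:
        continue
    stack, comp = [start], set()
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        comp.add(c)
        stack += [(c[0] + dx, c[1] + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if (c[0] + dx, c[1] + dy) in dense]
    clusters.append(comp)
print(f"{len(dense)} dense units forming {len(clusters)} clusters")
```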

6. Various Partitional Clustering Techniques. (Refer above questions)
7. Various Cluster Evaluation Measures / Techniques. (Refer above questions)
8.​ What is model-based clustering? Explain its advantages and give an
example of how it differs from density-based approaches.
Model-Based Clustering
●​ Assumes data are generated by a probabilistic model (generative approach) rather
than purely by density or distances ​.
●​ Represents each cluster by a statistical distribution (e.g. Gaussian) with parameters
learned from the data.
How It Works
1.​ Specify a mixture model: assume data come from a mixture of M distributions
(e.g., Gaussians).
2.​ Estimate parameters (μ, σ², weights): using Expectation–Maximization to
maximize the likelihood of observing the data under the model.
3.​ Assign points to clusters: either by maximum posterior probability (hard) or by
posterior probabilities (soft).
Advantages ​
●​ Clear objective: optimizes a well-defined likelihood function.
●​ Soft clustering: yields membership probabilities, capturing overlap.
●​ Handles missing/partial data: probabilistic framework can accommodate
incomplete observations.
●​ Flexible cluster shapes: can model non-spherical clusters by choosing appropriate
distributions.
● Scalable: the cost of each EM iteration is linear in the number of points (roughly O(nk) per iteration).
Example Contrast with Density-Based (DBSCAN)
●​ Model-Based (Gaussian Mixture): fits Gaussian components to data, learns
means/variances; clusters arise from parametric model.
●​ Density-Based (DBSCAN): finds clusters as density-connected regions using ε and
MinPts; no underlying probability model; assigns points as core, border, or noise ​.
●​ Key Difference: model-based uses statistical assumptions and likelihood
maximization; density-based relies on local point density and connectivity.
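A short sketch of model-based clustering with a Gaussian mixture fitted by EM (scikit-learn); the synthetic data are an assumption, and predict_proba illustrates the soft memberships mentioned above.

```python
# Gaussian mixture fitted by EM; predict_proba gives the soft (posterior) memberships.
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=[1.0, 2.5], random_state=3)

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
hard = gmm.predict(X)              # maximum-posterior (hard) assignment
soft = gmm.predict_proba(X)        # posterior membership probabilities (soft clustering)
print(gmm.means_)                  # learned component means
print(soft[:3].round(3))           # points near the overlap get split probabilities
```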
Unit 5: Outliers and Statistical Approaches in Data Mining

2 Marks / 4 Marks Questions


1.​ What do you mean by Outlier Detection?
●​ Outlier detection is identifying data objects that deviate significantly from
the general behavior of the data.
●​ Outliers may be generated by different mechanisms compared to normal
objects.
●​ Outliers are not random noise; they often carry important or interesting
information.
●​ Applications include fraud detection, network intrusion detection,
customer segmentation, and medical diagnostics​

2.​ Give an application example where global outliers, contextual outliers and
collective outliers are all interesting. What are the attributes, and what
are the contextual and behavioral attributes? How is the relationship
among objects modeled in collective outlier detection?
Scenario: Intrusion Detection in a Computer Network.
●​ Global Outlier: A single computer sending a very unusual data packet.
●​ Contextual Outlier: A normal packet becomes suspicious based on the time
or network load (e.g., sending large files during maintenance hours).
●​ Collective Outlier: A group of computers suddenly sending denial-of-service
(DoS) packets simultaneously​.
Attributes:
●​ Contextual Attributes: Time of access, network location.
●​ Behavioral Attributes: Packet size, type of activity.
Modeling Relationships in Collective Outlier Detection:
●​ Background knowledge like distance measures or similarity metrics among
multiple data objects is essential to detect collective outliers​.
3.​ Give an application example of where the border between "normal
objects" and outliers is often unclear, so that the degree to which an
object is an outlier has to be well estimated.
Scenario: Temperature Measurement.
●​ A temperature of 80°F (26.67°C) could be normal during summer but an
outlier in winter​.
●​ The distinction depends heavily on context (season, location).
●​ The border between normal and outlier behavior is often a gray area,
making exact thresholding difficult.​

4. In outlier detection by semi-supervised learning, what is the advantage of using objects without labels in the training data set?
Semi-supervised outlier detection is designed for scenarios where most of the data
is unlabeled.
A small labeled set (normal data) is used together with unlabeled data to train a
model for normal behavior​.
Objects without labels help:
●​ Expand training data without manual labeling.
●​ Improve model generalization by learning patterns from normal nearby
objects.
●​ Detect outliers as those deviating from the learned model of normal data​.
12 Marks Questions
1.​ Discuss about Outlier Detection.
●​ Outliers are data objects that deviate considerably from other data points.
●​ They are generated by a different mechanism compared to normal data
points.
●​ Outlier detection is crucial for fraud detection, network intrusion, medical
diagnostics, and fault detection.
●​ Outliers are categorized into three types:
○​ Global Outliers: Objects significantly different from the entire
dataset.
○​ Contextual Outliers: Objects that are outliers in a specific context
(time, location).
○​ Collective Outliers: A group of objects that together form an outlier.
●​ Statistical methods assume a distribution (e.g., normal distribution) and
identify points that significantly deviate.
●​ Distance-based methods identify outliers based on distance from their
neighbors.
●​ Density-based methods detect outliers in low-density regions compared to
neighbors.
●​ Deviation-based methods find outliers as deviations from expected
behavior.
●​ In high-dimensional data, detecting outliers becomes more complex because
distances lose meaning (curse of dimensionality).
●​ Supervised methods use labeled data to classify outliers.
●​ Unsupervised methods find outliers without prior labels based on clustering
or density analysis.
●​ Outlier detection faces challenges like evolving patterns (concept drift),
noise, and lack of ground truth.
●​ Outlier detection algorithms must balance between false positives and false
negatives.
●​ Outlier scores are assigned to rank how strongly a data point deviates.
●​ Example Applications:
○​ Credit card fraud detection: Transactions deviating from usual
spending.
○​ Network intrusion detection: Sudden spikes in data traffic patterns.
2.​ Why is outlier mining important? Briefly describe the different
approaches behind statistical-based outlier detection, distance-based
outlier detection, and deviation-based outlier detection.
●​ Outlier detection is critical in many real-world applications like fraud detection,
intrusion detection, fault detection, and medical diagnostics.
●​ Outliers often represent interesting and critical information — rare events that
differ from the norm.
●​ Outliers could indicate errors, novel events, or important phenomena.
●​ Early detection helps in security systems (e.g., network intrusions) and financial
systems (e.g., fraudulent transactions).
●​ Identifying outliers leads to better decision-making in healthcare, banking, and
industrial systems.
●​ Helps improve model robustness by removing misleading or noisy data points.
●​ In scientific research, outliers can lead to new discoveries rather than errors.
●​ Important for quality control in manufacturing (detect defective items).
●​ Helps in customer segmentation by identifying unusual customer behaviors.
●​ Essential for credit risk modeling and insurance claim analysis.
●​ Detects anomalous behaviors in IoT, cybersecurity, and sensor networks.
●​ Important for early warning systems in weather, earthquakes, and health.
●​ Helps uncover hidden patterns that standard methods might miss.
●​ Outliers can expose fraudulent or criminal activity early.
●​ Conclusion: Mining outliers ensures data integrity, security, and discovery of critical
insights.
Statistical-Based Outlier Detection​
●​ Assumes the data follows a known statistical distribution (e.g., Gaussian/Normal).
●​ Outliers are defined as objects that deviate significantly from the expected
statistical distribution.
●​ Common techniques:
○​ Z-score Method: Data points with a Z-score beyond ±3 are outliers​.
○​ Grubb’s Test: Detects univariate outliers based on t-distribution​.
○​ Mahalanobis Distance: For multivariate outlier detection​.
●​ Mixture models (like Gaussian Mixture Model) are used to model multiple
distributions​.
●​ Advantages: Clear mathematical foundation; suitable when distribution is known.
●​ Limitation: Assumes correct distribution model; sensitive to assumption violations.
Distance-Based Outlier Detection​
●​ Outliers are identified based on their distance to neighboring points.
●​ Key Idea: Normal points are close to many neighbors; outliers are far away​.
●​ Methods:​
○​ k-Nearest Neighbors (k-NN): Points with large distance to the k-th
neighbor are outliers​.
○​ DBSCAN: Points that do not belong to any dense cluster are flagged as
outliers​.
●​ Pros: Non-parametric; no assumption about data distribution.
●​ Cons: Choice of parameters like "k" and "ε" (radius) is critical; computationally
intensive for large datasets.
Deviation-Based Outlier Detection​
●​ Detects outliers by identifying deviations from expected behavior instead of
clustering or distance.
●​ Focuses on finding patterns that deviate from the majority without assuming
statistical distribution​.
●​ Examples:
○​ Sequential pattern mining to detect deviations in time-series data.
○​ Model normal behavior first, then detect deviations.
●​ Useful when patterns are complex and not easily captured by distance or statistical
methods.
●​ Handles situations where outliers are relative to sequences or behavioral changes.
Problem Solving / Case Studies
1.​ Using a small dataset, demonstrate how statistical methods like Z-score
can be used to detect outliers. Dataset: 12, 13, 11, 12, 13, 14, 90.
Z = (x − mean) / standard deviation
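A minimal sketch applying the Z-score formula to this dataset; note that with only seven values the extreme point inflates the standard deviation, so a cutoff of |z| > 2 (an assumption here) is used instead of the usual 3.

```python
# Z-score outlier check on the dataset above.
import numpy as np

data = np.array([12, 13, 11, 12, 13, 14, 90], dtype=float)
z = (data - data.mean()) / data.std()          # population standard deviation
for x, zx in zip(data, z):
    print(f"{x:5.1f}  z = {zx:+.2f}")
outliers = data[np.abs(z) > 2]
print("outliers:", outliers)                   # 90 stands far from the rest
```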
2. You are working on a financial fraud detection system. Explain how unsupervised outlier detection methods can be applied. Dataset: $100, $120, $110, $105, $98, $10,000, $115. Use the Z-score method to identify potential outliers.
● No need for labeled fraud examples.
● Transactions not fitting any cluster are flagged.
● Distant transactions are treated as suspicious.
● Low-density transactions are marked as outliers.
● Normal spending patterns are learned automatically.
● Detects new fraud types without retraining.
Theory / Concept Questions
1.​ What are outliers? Explain the challenges in detecting outliers in
high-dimensional data with examples.
●​ Outliers are data objects that deviate significantly from the normal pattern​.
●​ They appear as if generated by a different mechanism​.
●​ Outliers are different from noise; noise is random error or variance​.
●​ Example 1: A customer normally spending ₹10,000 suddenly spends ₹10,00,000​.
●​ Example 2: A student scoring 5 or 99 while others score between 40-80​.
●​ Example 3: A temperature sensor recording 100°C when normal range is 20°C-30°C​.
●​ Applications: Credit card fraud detection, customer segmentation, medical
analysis​.
Challenges in Detecting Outliers in High-Dimensional Data​​
●​ High Dimensionality: In large feature datasets (like finance/medical), defining
outliers becomes complex​.
●​ Concept Drift: In real-time systems (e.g., stock market), normal patterns change,
so past outliers might become normal​.
●​ No Ground Truth: It's unclear if unusual values are real anomalies or meaningful
patterns​.
●​ Imbalanced Data: Outliers are rare; models may ignore them as noise​.
●​ Masked Outliers: Outliers may blend inside normal clusters, making detection more
difficult​.
●​ Modeling Normal Objects Properly: Difficult to enumerate all normal behaviors​.
●​ Border is a Gray Area: The difference between normal and outlier is often
ambiguous​.
●​ Application-Specific Variations: Choice of distance measures depends on specific
application needs​.
●​ Handling Noise: Random errors can hide real outliers​.
●​ Understandability: Need to justify why an object is an outlier​.
●​ Specifying Degree of Outlierness: Quantifying how unlikely the outlier is becomes
necessary​.
● Dimensionality Curse: Distances lose meaning in high-dimensional space.
●​ Feature Irrelevance: Irrelevant attributes can confuse models​.
●​ Visualization Challenges: Difficult to visualize and understand high-dimensional
outliers​.
●​ Resource Intensity: Outlier detection in high dimensions is computationally
expensive​.​
2.​ Describe the differences between supervised and unsupervised outlier
detection methods. Give one example scenario for each.
● Labels: Supervised detection requires labeled data (normal and outliers); unsupervised detection does not.
● Framing: Supervised detection is treated as a classification problem; unsupervised detection as a clustering or distance-based problem.
● Training: Supervised models are trained on past known outliers; unsupervised methods detect patterns without prior knowledge of outliers.
● Techniques: Supervised methods use decision trees and random forests; unsupervised methods use clustering algorithms like DBSCAN and K-Means.
● Separation: Supervised SVMs (Support Vector Machines) separate normal and outlier points; in unsupervised detection, points that fit no cluster are flagged as outliers.
● When to use: Supervised detection is effective when good labels are available; unsupervised detection is useful when labels are unavailable.
● Accuracy: Supervised accuracy is high if the labeled data is good; unsupervised accuracy depends on the quality of the clustering.
● Labeling effort: Supervised detection needs expert labeling for training; unsupervised detection needs no human labeling.
● Example techniques: Decision Trees and SVM (supervised) versus DBSCAN and K-Means (unsupervised).
● Prediction: A supervised model can classify future instances once trained; an unsupervised method detects anomalies by measuring deviation from clusters.
● Sensitivity: Supervised detection is sensitive to training data quality; unsupervised detection is sensitive to the distance measure and cluster density.
● Typical applications: Supervised detection suits medical anomaly detection; unsupervised detection suits credit card fraud detection without labeled fraud.
● Maintenance: Supervised models require retraining when new outliers appear; unsupervised methods adapt automatically as new patterns form.
● Limitations: Supervised detection is less effective when labels are missing or incomplete; unsupervised detection suits dynamic, evolving datasets.
● Example scenario: A bank uses past fraud transactions to train a decision tree (supervised); fraudulent credit card transactions are detected as transactions that do not fit into spending clusters (unsupervised).
3.​ Statistical Data Mining Approaches.
●​ Regression Analysis: Predicts a continuous outcome and models relationships
between dependent and independent variables​.
●​ Generalized Linear Models (GLM): Extends linear regression to allow response
variables that have error distribution models other than a normal distribution​.
●​ Analysis of Variance (ANOVA): Tests whether there are significant differences
between the means of three or more groups​.
●​ Mixed Effect Models: Models where data have both fixed effects (main factors)
and random effects (natural groupings like doctors, schools)​.
●​ Discriminant Analysis: A technique used for classifying data into predefined
categories and for dimensionality reduction​.
●​ Factor Analysis: Reduces a large number of variables into fewer factors that
explain most of the variability in the data​.
●​ Survival Analysis: Deals with modeling time until an event occurs; handles censored
data like unfinished observations​.
●​ Grubb’s Test: Detects univariate outliers in a normally distributed dataset by
comparing a test statistic to a critical value​.
●​ Detection of Multivariate Outliers: Uses Mahalanobis Distance (MDist) to flag
objects significantly deviating from the mean in multi-dimensional data​.
●​ Mixture of Parametric Distributions: Models data with multiple distributions like
Gaussian Mixtures to detect low-probability points​.
●​ Histogram-Based Outlier Detection (Non-Parametric): Defines outliers based on
rare, low-frequency bins in histograms​.
●​ Scientific Data Mining: Combines scientific computing techniques with statistical
data mining for domains like healthcare and social sciences​.
General Steps in Statistical Data Mining:
●​ Collect data
●​ Preprocess (clean, normalize)
●​ Apply statistical modeling (e.g., regression, clustering)
●​ Interpret results​.
Handling High Dimensional Data: Statistical methods like Principal Component
Analysis (PCA) and Factor Analysis help manage high dimensions​.
Application Examples:
●​ Financial Risk Modeling
●​ Medical Diagnosis
●​ Customer Behavior Analysis
●​ Reliability Testing​.
Applications
1.​ Explain how data mining is applied in intrusion detection systems. Give an
example of detecting anomalies in login patterns.
●​ Intrusion Detection Systems (IDS) monitor network or system activities for
malicious activities.
●​ Data mining enhances IDS through pattern discovery techniques.
●​ Association rule mining is used to find relationships between system attributes
indicative of intrusions.
●​ Classification techniques like decision trees are used to detect known attacks.
●​ Clustering detects new or unknown attacks by finding unusual patterns.
●​ Outlier detection identifies anomalies that deviate significantly from normal
behavior.
●​ Data mining helps in building accurate user behavior profiles.
●​ Real-time intrusion detection is improved using stream data analysis.
●​ Distributed data mining enables analysis across multiple network locations.
●​ Mining frequent patterns helps identify common intrusion behaviors.
●​ Mining rare patterns assists in detecting previously unknown attacks.
●​ IDS use supervised learning for signature-based detection.
●​ IDS use unsupervised learning for anomaly-based detection.
●​ Ensemble methods combine multiple models to improve detection accuracy.
●​ Example: Repeated failed login attempts, unusual time of access, or login from an
unfamiliar device are detected as anomalies by mining login data.​

2.​ Discuss the role of data mining in recommender systems. How does it
personalize suggestions based on user behavior?
●​ Recommender systems predict user preferences using data mining.
●​ Association rule mining discovers products/items frequently bought together.
●​ Clustering groups similar users based on preferences.
●​ Classification models predict if a user will like an item.
●​ Collaborative filtering recommends items based on user similarity.
●​ Content-based filtering recommends items similar to those the user liked before.
●​ Hybrid methods combine collaborative and content-based approaches.
●​ Data mining captures hidden patterns in user interactions.
●​ Mining clickstream data helps personalize web recommendations.
●​ Mining purchase history improves product recommendations.
●​ User demographic data is mined to predict preferences.
●​ Mining user feedback (ratings, reviews) helps refine recommendations.
●​ Trend analysis finds evolving user interests over time.
●​ Anomaly detection identifies unusual behaviors for fine-tuning.
●​ Systems personalize suggestions by matching user behaviors with discovered
patterns.

3. Design a data mining solution for a recommender system. Explain collaborative vs content-based filtering. Build a simple user-item matrix. Demonstrate a recommendation using collaborative filtering. Explain how outlier handling improves recommendation accuracy.
●​ Collect user data: ratings, purchase history, browsing history.
●​ Build a user-item interaction matrix (ratings or binary purchase).
●​ Apply collaborative filtering: recommend items liked by similar users.
●​ Apply content-based filtering: recommend similar items based on item features.
●​ Collaborative filtering uses user-user or item-item similarity.
●​ Content-based filtering uses item attributes to match user profiles.
●​ Hybrid approach merges collaborative and content-based results.
● Simple User-Item Matrix Example: (see the illustrative matrix in the sketch below)
●​ Using collaborative filtering, recommend Item C to User 1 (similar users liked it).
●​ Similarity measures like cosine similarity are used in collaborative filtering.
●​ Content features like genre, price, brand are used in content-based filtering.
●​ Outlier detection identifies fake ratings or abnormal behavior.
●​ Removing outliers (fake users, bots) improves recommendation reliability.
●​ Outliers skew user similarity computation; detecting them preserves accuracy.
●​ Final recommendation is improved by filtering anomalies before prediction.
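An illustrative sketch of collaborative filtering on a small, hypothetical user-item matrix (the users, items, and ratings are assumptions, not the matrix from the source); cosine similarity weights the neighbours' ratings.

```python
# Cosine-similarity collaborative filtering on a tiny, hypothetical user-item matrix.
import numpy as np
from numpy.linalg import norm

items = ["Item A", "Item B", "Item C", "Item D"]
R = np.array([            # rows = User 1..3, columns = items, ratings 1-5, 0 = unrated
    [5, 4, 0, 1],         # User 1 has not rated Item C yet
    [5, 5, 4, 1],
    [1, 2, 5, 4],
], dtype=float)

cos = lambda a, b: a @ b / (norm(a) * norm(b) + 1e-9)
sims = np.array([cos(R[0], R[u]) for u in range(1, 3)])     # User 1 vs Users 2 and 3

# Predict User 1's rating for Item C as the similarity-weighted average of the others.
col = items.index("Item C")
pred = (sims @ R[1:, col]) / sims.sum()
print(f"similarities to User 1: {sims.round(2)}, predicted rating for Item C: {pred:.2f}")
# User 2 (very similar to User 1) liked Item C, so Item C is recommended to User 1.
```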
4.​ Explain briefly how data is analyzed by data mining in the Finance
Sector.
●​ Data mining is used for credit scoring and risk assessment.
●​ Customer segmentation is performed based on transaction behavior.
●​ Fraud detection is achieved by anomaly detection techniques.
●​ Stock market prediction uses time series analysis.
●​ Loan default prediction is modeled using classification algorithms.
●​ Clustering helps to identify similar investment patterns.
●​ Rule mining finds association between different financial products.
●​ Text mining is used to analyze financial news sentiment.
●​ Data mining automates large-scale financial data processing.
●​ Regression models forecast revenue and financial metrics.
●​ Decision trees identify key factors affecting investment returns.
●​ Neural networks model complex financial behavior.
●​ Portfolio optimization is performed using predictive analytics.
●​ Anti-money laundering uses clustering and outlier detection.
●​ Bankruptcy prediction models are built using data mining techniques.​
