Clustering
UNIT-1
Clustering
TOPICS TO BE COVERED
Clustering
Reinforcement Learning
Decision Tree Learning
Bayesian Networks
Support Vector Machine
Genetic Algorithm
Issues in Machine Learning
Data Science Vs Machine Learning
Clustering: An Unsupervised Learning
UNSUPERVISED LEARNING:
1. CLUSTERING
2. ASSOCIATION RULE
Clustering: An Unsupervised Learning
● Clustering is the task of grouping a set of objects so that objects in the
same group (called a cluster) are more similar to each other than to
those in other clusters.
● It makes unlabeled data easier to understand and work with.
● The machine learns the attributes and trends by itself, without any provided
input-output mapping.
● Clustering algorithms extract patterns and inferences from the data
objects and then group them into suitable discrete classes
(clusters).
Clustering: Applications in Biology
Clustering: Other Applications
❖ Google News Clustering.
● Clustering Quality
➔ Inter-cluster distance => maximized
➔ Intra-cluster distance => minimized
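The two quality criteria above can be illustrated with a minimal sketch. It assumes Euclidean distance, measures intra-cluster distance as the mean pairwise distance inside one cluster, and measures inter-cluster distance as the distance between centroids. These are illustrative choices, not the only possible definitions, and the function names and sample points are hypothetical.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))


def intra_cluster_distance(cluster):
    """Mean pairwise Euclidean distance between points of one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def inter_cluster_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))


# Hypothetical sample data: two compact, well-separated groups.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = intra_cluster_distance(cluster_a)
inter_ab = inter_cluster_distance(cluster_a, cluster_b)
print(f"intra(a) = {intra_a:.3f}, inter(a, b) = {inter_ab:.3f}")
```

A good clustering of this data keeps the intra-cluster value small relative to the inter-cluster value.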
Partitioning Algorithms
● Partitioning Method: Construct a partition of a database D of m objects into a set of k clusters
● Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
➔ Globally optimal: exhaustively enumerate all partitions
➔ Heuristic Method: k-means (MacQueen, 1967)
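The k-means heuristic mentioned above can be sketched as follows. This is a minimal version under stated assumptions: Euclidean distance, caller-supplied initial centroids (so the run is deterministic), and a fixed iteration cap. The function name, parameters, and sample points are hypothetical.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical sample data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])
print(labels)
```

The result depends on the initial centroids, which is why k-means is a heuristic and not a guarantee of the globally optimal partition.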
Hierarchical Clustering
➔ Example: hierarchical clustering of animals into vertebrates and invertebrates.
➔ Produce a nested sequence of clusters.
➔ One approach: Recursive application of a partitional clustering algorithm.
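A nested sequence of clusters can be sketched as below. Note that this sketch uses bottom-up (agglomerative) merging with single linkage, which is a different approach from the recursive partitional one named above. It is swapped in here only because it is short. The function names and sample points are hypothetical.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest points of two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points):
    """Return every level of the hierarchy, from singletons to one cluster."""
    clusters = [[p] for p in points]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels


# Hypothetical one-dimensional sample data.
levels = agglomerative([(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)])
for level in levels:
    print(len(level), "clusters:", level)
```

Each level is obtained from the previous one by merging exactly two clusters, so the levels are nested.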
Model Based Clustering
➔ A model is hypothesized
➔ E.g., assume the data is generated by a mixture of underlying probability distributions
➔ Fit the model to the data
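One common way to fit such a model is expectation-maximization (EM) on a Gaussian mixture. The sketch below assumes one-dimensional data, Gaussian components, caller-supplied starting parameters, and a fixed number of iterations. The slides do not name a specific fitting procedure, so EM and all names here are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a one-dimensional Gaussian mixture with len(means) components."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return means, variances, weights


# Hypothetical sample data drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = fit_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(means, weights)
```

Each point ends up with a soft membership (its responsibilities) instead of a hard cluster label.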
Density Based Clustering
➔ Based on density connected points
➔ Locates regions of high density separated by regions of low density
➔ E.g., DBSCAN
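The idea behind DBSCAN can be sketched as follows. This is a minimal, unoptimized version that assumes Euclidean distance and a brute-force neighbor search. The parameter names `eps` and `min_pts` follow common usage, but the function names and sample points are hypothetical.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE for low-density points."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Hypothetical sample data: two dense squares and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (50.0, 50.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not given in advance, and isolated points are reported as noise.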
Graph Theoretic Clustering
➔ Weights of edges between items (nodes) based on similarity.
➔ E.g., look for minimum cut in a graph
(Dis)similarity Measures
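➔ Distance metrics (scale-dependent)
➔ Minkowski family of distance measures (e.g., Manhattan for p = 1, Euclidean for p = 2)
➔ Cosine distance (depends only on the angle between vectors, not on their magnitudes)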
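The two measures above can be written out in a short sketch. It assumes the usual definitions: Minkowski distance of order p as the p-th root of the summed p-th powers of coordinate differences, and cosine distance as one minus the cosine similarity. The function names are hypothetical.

```python
import math


def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p = 1: Manhattan, p = 2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


manhattan = minkowski_distance((0.0, 0.0), (3.0, 4.0), p=1)
euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), p=2)
orthogonal = cosine_distance((1.0, 0.0), (0.0, 1.0))
same_direction = cosine_distance((1.0, 2.0), (2.0, 4.0))
print(manhattan, euclidean, orthogonal, same_direction)
```

Rescaling a vector changes its Minkowski distance to other points, while the cosine distance between two vectors pointing in the same direction stays at approximately zero.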
(Dis)similarity Measures
➔ Correlation-based and other scale-invariant measures
➔ Mahalanobis distance (rescales differences by the data covariance)
➔ Pearson correlation coefficient
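Both measures can be sketched with the standard library alone. The sketch assumes the usual definitions and, to avoid a general matrix inverse, restricts the Mahalanobis distance to two dimensions with a caller-supplied 2x2 covariance matrix. The function names are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)


r_up = pearson_correlation([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
r_down = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
d_identity = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), ((1.0, 0.0), (0.0, 1.0)))
d_scaled = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), ((4.0, 0.0), (0.0, 1.0)))
print(r_up, r_down, d_identity, d_scaled)
```

With an identity covariance the Mahalanobis distance reduces to the Euclidean distance. With a larger variance along one axis, differences along that axis count for less.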
Quality of Clustering
Internal evaluation: assigns the best score to the algorithm that produces clusters with high similarity
within a cluster and low similarity between clusters, e.g., the
Davies-Bouldin index.
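The Davies-Bouldin index can be sketched as follows, assuming its usual definition: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation against any other cluster, then average over clusters. Lower values indicate better clustering. Euclidean distance, the function names, and the sample data are assumptions.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))


def scatter(cluster):
    """Mean distance of a cluster's points to its centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)


def davies_bouldin_index(clusters):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    cents = [centroid(c) for c in clusters]
    scat = [scatter(c) for c in clusters]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scat[i] + scat[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical sample data: the same two shapes, far apart and close together.
far_apart = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_together = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]
db_far = davies_bouldin_index(far_apart)
db_close = davies_bouldin_index(close_together)
print(db_far, db_close)
```

The well-separated clustering gets the lower (better) score because the scatter is the same while the centroid separation is larger.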
External evaluation: evaluates the clustering against external data such as known class labels and
benchmarks, e.g., Rand index, Jaccard index, F-measure.
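As an illustration of external evaluation, the Rand index can be sketched as the fraction of point pairs on which the predicted clustering and the known labels agree, where agreeing means both put the pair together or both put it apart. The function name and the sample labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs treated the same way by both labelings."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# The label names differ, but the grouping is identical.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# The predicted grouping cuts across the true classes.
poor = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, poor)
```

Because only pair relationships are compared, the index does not depend on how the clusters are numbered.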
Issues in Machine Learning
Difference between ML and DS
Data science is a field that studies data and how to extract meaning from it, whereas machine learning is a
field devoted to understanding and building methods that utilize data to improve performance or inform
predictions.