ML Unit-4 Final 2024-25
Introduction to clustering:
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same group are more similar to each other than to data points in other groups.
In simple words, the aim is to segregate data points with similar traits and assign them to clusters.
Clustering or cluster analysis is a machine learning technique that groups an unlabelled dataset.
It does this by finding similar patterns in the unlabelled dataset, such as shape, size,
colour, behaviour, etc., and divides the data points according to the presence or absence of those
patterns.
It is an unsupervised learning method; no supervision is provided to the algorithm,
and it deals with the unlabelled dataset.
After applying this clustering technique, each cluster or group is given a cluster-ID.
The ML system can use this ID to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Example:
Here, the clustering technique has partitioned the entire set of data points into two clusters. The data
points within a cluster are similar to each other but different from those in other clusters. For example, when
we visit a shopping mall, we can observe that things with similar usage are grouped together:
t-shirts are grouped in one section and trousers in another; similarly, in the vegetable section,
apples, bananas, mangoes, etc., are grouped separately so that we can easily find things.
The clustering technique works in the same way.
Why Clustering?
Clustering allows us to find hidden relationships between the data points in the dataset.
Examples:
1. In marketing, customers are segmented according to similarities to carry out targeted marketing.
2. Given a collection of text documents, we need to organize them according to content similarity to create
a topic hierarchy.
3. Detecting distinct kinds of patterns in image data (image processing). It is effective in biology
research for identifying underlying patterns.
There are many more examples that make clustering so important.
Types of Clustering Methods:
Partitioning or Flat algorithms: These algorithms try to divide the dataset of interest into a predefined
number of groups/clusters. All the groups/clusters are independent of each other. Examples: K-means,
Fuzzy C-means.
K-Means Clustering algorithm: In this type, the dataset is divided into a set of k groups, where K
defines the number of pre-defined groups. The cluster centres are created in such a way that the
distance between the data points of one cluster is minimal compared to the distance to another cluster
centroid. In the figure below, the k value is three, so the given dataset is divided into three clusters.
Fuzzy C-means clustering algorithm: Fuzzy clustering is a type of partitioning method, similar to the
k-means algorithm, with the difference that a data point
may belong to more than one group or cluster. Each data point has a set of membership coefficients
that indicate its degree of membership in each cluster. It is sometimes also known as the
Fuzzy k-means algorithm. In the figure below, the given dataset is divided into two clusters, as the C value is
two, but some data points belong to both clusters.
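As a rough illustration (not part of the original notes), the sketch below implements the fuzzy membership idea from scratch in Python on made-up 2-D points; c=2 clusters and fuzzifier m=2 are assumed, and libraries such as scikit-fuzzy offer ready-made implementations:

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)                  # memberships of each point sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]  # membership-weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        u = 1.0 / (d ** (2 / (m - 1)))                  # closer centres get higher membership
        u /= u.sum(axis=1, keepdims=True)
    return u, centers

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.2, 4.9], [3.0, 3.0]])
u, centers = fuzzy_c_means(X)
print(np.round(u, 2))    # the middle point gets sizeable membership in both clusters

Unlike hard k-means, the output is a membership matrix rather than a single cluster label per point.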
Hierarchical Clustering :
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters to be created. In this technique, the dataset is
divided into clusters to create a tree-like structure, which is also called a dendrogram. The
observations, or any number of clusters, can be selected by cutting the tree at the correct level. The
most common examples of this method are the Agglomerative Hierarchical algorithm (bottom-up) and the
Divisive method (top-down).
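As a quick illustration (an assumption-laden sketch, not from the notes), the code below builds a dendrogram with SciPy on six made-up 2-D points using Ward linkage and cuts the tree to obtain two clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method="ward")                    # build the merge tree bottom-up
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)                                    # e.g. [1 1 1 2 2 2]

dendrogram(Z)                                    # visualise the tree-like structure
plt.show()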
Divisive Method:
A top-down approach: begin with the whole dataset and proceed to divide it into successively smaller clusters.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and
arbitrarily shaped distributions are formed as long as the dense regions can be connected. The
algorithm does this by identifying different clusters in the dataset and connecting the areas of high
density into clusters. The dense areas in the data space are separated from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset has varying densities
and high dimensionality. Popular examples of density models are DBSCAN and OPTICS.
• Discovers clusters of arbitrary shapes
• Handles noise
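A minimal DBSCAN sketch with scikit-learn; the toy points and the eps/min_samples values below are illustrative assumptions and would normally need tuning for the dataset at hand:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])   # toy data; the last point is an outlier

db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)                          # e.g. [0 0 0 1 1 -1]; label -1 marks noise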
Distribution Model-Based Clustering: In the distribution model-based clustering method, the data
is divided based on the probability that a data point belongs to a particular distribution. The
grouping is done by assuming some distribution, most commonly the Gaussian distribution. An example of
this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture
Models (GMM).
These clustering models are based on the notion of how probable it is that all data points in a
cluster belong to the same distribution.
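A minimal sketch of distribution model-based clustering with scikit-learn's GaussianMixture (which runs EM internally); the toy points and n_components=2 are assumptions for the example:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [8.2, 8.3]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))          # hard cluster assignments
print(gmm.predict_proba(X))    # soft probability of belonging to each distribution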
K Means Clustering:
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way
that each data point belongs to only one group of points with similar properties.
K-Means Clustering is an unsupervised learning algorithm which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process; for example, if K=2 there will be two clusters, for K=3 there will be three
clusters, and so on.
It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in the unlabeled dataset on its own, without the need for any training labels.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data points and their corresponding
cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters,
and repeats the process until it finds the best clusters. The value of k should be
predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points, which are near to the
particular k-center, create a cluster.
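For instance, a quick scikit-learn sketch on made-up 2-D points (k=2 and the data are assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)             # cluster id assigned to each point
print(km.cluster_centers_)    # final centroid positions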
The diagram below explains the working of the K-means clustering algorithm:
Let's understand the above steps by considering the example:
Initialisation: Here, consider the k value to be two. So, first, you need to randomly initialise two points
called the cluster centroids. You need to make sure that the number of cluster centroids (depicted by the
orange and blue crosses in the image) is less than the number of training data points (depicted by navy
blue dots). The k-means clustering algorithm is an iterative algorithm and it follows the next two steps
iteratively. Once you are done with the initialisation, let's move on to the next step.
Cluster Assignment: In this step, the algorithm goes through all the navy blue data points and computes the
distance between each data point and the cluster centroids initialised in the previous step. Depending
on the minimum distance to the orange or the blue cluster centroid, each data point assigns itself
to that particular group. So, the data points are divided into two groups, one
represented by the orange colour and the other by the blue colour, as shown in the graph. Since these
cluster formations are not the optimised clusters, let's move ahead and see how to get the final
clusters.
Move Centroid: Now, you will take the above two cluster centroids and iteratively reposition them
for optimization. You will take all the blue dots, compute their average, and move the current cluster
centroid to this new location. Similarly, you will move the orange cluster centroid to the average of
the orange data points. Therefore, the new cluster centroids will look as shown in the graph. Moving
forward, let us see how we can optimize the clusters, which will give us better insight.
Optimization: You need to repeat the above two steps iteratively until the cluster centroids stop changing
their positions and become static. Once the clusters become static, the k-means clustering algorithm is said
to have converged.
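The four steps above can be written out directly. The sketch below is a minimal from-scratch version on toy data (k=2, the points, and random initialisation from the data are assumptions; production implementations add better initialisation and empty-cluster handling):

import numpy as np

def kmeans(X, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Optimisation: stop once the centroids become static (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2], [7.8, 7.9]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)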
K Means Numerical Example with Illustration:
Given data
Advantages of k-means algorithm:
1. Ease of implementation
2. Speed
3. Availability
4. If the number of variables is huge, K-Means is most of the time computationally faster than hierarchical
clustering, provided we keep k small.
5. K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are
globular.
Bisecting K-means
Bisecting K-means starts with all points in a single cluster and repeatedly splits one cluster into two
(a bisection) with ordinary K-means until k clusters are obtained.
– It has less trouble with initialization because it performs several trial bisections and takes
the one with the lowest SSE, and because there are only two centroids at each step.
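A rough sketch of the idea (an assumption-laden illustration, not from the notes): each round splits the largest cluster in two with ordinary K-Means, trying several bisections and keeping the split with the lowest SSE; recent scikit-learn versions also provide a ready-made BisectingKMeans class:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k=4, n_trials=5, seed=0):
    clusters = [X]                                    # start with all points in one cluster
    while len(clusters) < k:
        # pick the largest cluster to split (a simple selection rule for this sketch)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        best_sse, best_labels = None, None
        for t in range(n_trials):                     # several trial bisections
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(target)
            if best_sse is None or km.inertia_ < best_sse:
                best_sse, best_labels = km.inertia_, km.labels_   # inertia_ = SSE of the split
        clusters.append(target[best_labels == 0])
        clusters.append(target[best_labels == 1])
    return clusters

X = np.array([[1, 1], [1, 2], [8, 8], [8, 9],
              [1, 8], [2, 8], [8, 1], [8, 2]], dtype=float)
for c in bisecting_kmeans(X, k=4):
    print(c)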
The data consists of two pairs of clusters, where the clusters in each (top-bottom) pair are
closer to each other than to the clusters in the other pair. The figure shows that if we start with
two initial centroids per pair of clusters, then even when both centroids are in a single
cluster, the centroids will redistribute themselves so that the "true" clusters are found.
Ensemble Methods:
Ensemble learning is a machine learning technique that combines several base models in order to
produce one optimal predictive model.
Ensemble learning is used to enhance the performance of a machine
learning model by combining several learners. When compared to a single model, this
type of learning builds models with improved efficiency and accuracy.
The main principle behind the ensemble model is that a group of weak learners come
together to form a strong learner, thus increasing the accuracy of the model.
Ensemble methods can decrease variance using a bagging approach, decrease bias using a boosting
approach, or improve predictions using a stacking approach.
Ensemble training methods can either be homogenous or heterogeneous in nature. Most
ensemble learning methods are homogeneous, meaning that they use a single type of
base learning model/algorithm. In contrast, heterogeneous ensembles make use of
different learning algorithms, diversifying and varying the learners to ensure that
accuracy is as high as possible.
Working:
An ensemble model can be built using several base classifiers or learning models, say model1,
model2, ..., modelN. All of these models are trained on the same training dataset, their individual
predictions are collected, and the predictions are combined using a voting classifier to build the final
prediction of the ensemble model.
Here, a voting classifier is a machine-learning model that trains on an ensemble of
numerous models and predicts an output (class) based on the class with the highest combined probability
or the majority of votes.
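As a rough illustration, the sketch below combines three common scikit-learn classifiers with a VotingClassifier on synthetic data; the particular base models, the soft-voting choice, and the make_classification data are all illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=0)

# model1, model2, ..., modelN are all trained on the same training data
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="soft")                  # "soft" averages the predicted class probabilities
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))      # combined (ensemble) predictions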
Parallel ensemble techniques (example: Bagging algorithms):
– Random forest
– Bagging meta-estimator
Sequential ensemble techniques (example: Boosting algorithms):
– AdaBoost
– GBM
– XGBM
In parallel ensemble techniques, base learners are generated in a parallel format. Parallel methods
utilize the parallel generation of base learners to encourage independence between the base
learners. The independence of base learners significantly reduces the error by averaging the
predictions of the individual learners.
Example: Bagging.
BAGGING:
Bagging is an ensemble construction technique for classification and regression that aims to reduce
the variance of estimates by averaging multiple estimates together.
Bagging creates subsets from the main dataset on which the learners are trained.
It is also known as Bootstrap Aggregation, because bagging combines the bootstrapping and
aggregation methods to form one ensemble model.
Example: the random forest algorithm.
Bootstrapping: The bootstrap method refers to creating multiple small subsets of data from an
entire dataset. These subsets are randomly sampled with replacement. Sampling with replacement,
also known as resampling or row sampling, means there is a chance that the same data point is
selected more than once; each data point has an equal probability of being selected. Each bootstrap
sample has a slightly different mean and standard deviation from the original dataset, which makes the
resulting model more robust. The base learners and classifiers in the ensemble method will be trained on
these subsets.
Aggregation:
Aggregation: finding the average of all the individual models' predictions to get the final model
prediction.
Working process:
Now let's see the working process of bagging. It works in three steps.
1. Bootstrap data: first, create subsets of the training dataset with random
samples using the bootstrapping method.
2. Build multiple classifiers or models: after bootstrapping, build a model or classifier for each
subset of bootstrapped samples and obtain the prediction of every model.
3. Aggregation or model fit: in this step, combine all the predictions of the base models using the
aggregation method or a voting classifier to get the final prediction of the ensemble model, as shown in
the sketch below.
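A minimal scikit-learn sketch of these three steps, assuming decision trees as the base learners and synthetic data (note: the estimator parameter is called base_estimator in scikit-learn versions before 1.2):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # one base learner is fitted per bootstrap subset
    n_estimators=10,                      # number of bootstrapped models
    bootstrap=True,                       # sample rows with replacement (bootstrapping)
    random_state=0).fit(X, y)
print(bag.predict(X[:5]))                 # predictions aggregated by majority voting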
Advantages:
Bagging is a completely data-specific algorithm. The bagging technique reduces model over-
fitting.
It also performs well on high-dimensional data. Moreover, the missing values in the dataset
do not affect the performance of the algorithm.
Disadvantages:
That being said, one limitation is that the final prediction is based on the mean of the
predictions from the subset samples, rather than on the precise values of an individual
classification or regression model.
Bagging is not helpful in case of bias or under-fitting in the data.
Bagging ignores the highest and the lowest values, which may differ widely, and provides only an
averaged result.
Random Forest:
Random Forest is a bagging-based ensemble learning algorithm that builds a number of decision trees on
bootstrapped subsets of the dataset and combines their predictions by majority voting. The diagram below
explains the working of the Random Forest algorithm:
Step-1: Select K random data points from the training set.
Step-2: Build the decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees you want to build.
Step-4: Repeat Steps 1 and 2 N times.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to
the Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data
point occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision.
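Since the fruit-image dataset above is only illustrative, the sketch below shows the same idea with scikit-learn's RandomForestClassifier on the standard iris dataset (the dataset choice and parameters are assumptions for the example):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees are trained on bootstrapped subsets; each tree votes on the class
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))       # accuracy of the majority-vote predictions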
Dataset:
Row sampling and Feature sampling:
Advantages of Random Forest:
1. It reduces the overfitting problem in decision trees, also reduces the variance, and therefore improves
the accuracy.
2. Random Forest can be used to solve both classification as well as regression problems.
3. Random Forest works well with both categorical and continuous variables.
4. Random Forest can automatically handle missing values.
5. No feature scaling required: No feature scaling (standardization and normalization) required in case of
Random Forest as it uses rule-based approach instead of distance calculation.
6. Handles non-linear parameters efficiently: Nonlinear parameters do not affect the performance of a
Random Forest unlike curve-based algorithms. So, if there is high non-linearity between the independent
variables, Random Forest may outperform as compared to other curve-based algorithms.
7. Random Forest is usually robust to outliers and can handle them automatically.
8. The Random Forest algorithm is very stable. Even if a new data point is introduced in the dataset, the
overall algorithm is not affected much, since the new data may impact one tree, but it is very hard for it
to impact all the trees.
9. Random Forest is comparatively less impacted by noise.
BOOSTING:
• Boosting is a sequential ensemble model.
• Boosting is family of algorithms that converts weak learners into strong learners.
• Boosting is an ensemble technique that learns from previous predictor mistakes to make
better predictions in the future.
• The technique combines several weak base learners to form one strong learner, thus
significantly improving the predictive power of the model. Boosting works by arranging weak
learners in a sequence, such that each weak learner learns from the mistakes of the previous learner
in the sequence, to create better predictive models.
• Boosting is a sequential process in which more weightage is given to misclassified
instances after every iteration
• E.g.: AdaBoost, GradientBoost, LightGBM, XGBM (XGBoost) algorithms
How Boosting Algorithm Works?
1. The base learner takes all the distributions and assigns equal weight or attention to each
observation.
2. If there is any prediction error caused by the first base learning algorithm, then we pay higher
attention to the observations having prediction errors. Then, we apply the next base learning algorithm.
3. Repeat step 2 till the limit of the base learning algorithms is reached.
4. Finally, it combines the outputs from the weak learners and creates a strong learner, which eventually
improves the prediction power of the model.
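A minimal AdaBoost sketch with scikit-learn that mirrors these steps: decision stumps are trained in sequence, with more weight given to misclassified observations after each round (the data and parameters are illustrative; the estimator parameter is called base_estimator in scikit-learn versions before 1.2):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # weak base learner (decision stump)
    n_estimators=50,                                 # number of sequential boosting rounds
    random_state=0).fit(X, y)
print(boost.predict(X[:5]))                          # weighted combination of the weak learners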
Advantages of Boosting
It is one of the most successful techniques in solving the two-class classification problems.
The boosting technique takes care of the weightage of the higher-accuracy and lower-accuracy
samples and then gives the combined results.
The net error is evaluated at each learning step. It works well with interactions.
The boosting technique helps when we are dealing with bias or under-fitting in the data set.
Multiple boosting techniques are available. For example: AdaBoost, LPBoost, XGBoost,
Gradient Boost, Brown Boost.
It is good at handling the missing data.
Disadvantages of Boosting
Boosting is hard to implement in real-time due to the increased complexity of the algorithm.
The high flexibility of these techniques results in a large number of parameters that
have a direct effect on the behaviour of the model.
The boosting technique does not address overfitting or variance issues in the data set.
It increases the complexity of the classification.
Time and computation can be a bit expensive.
Applications:
• Credit risk assessment
• Recommender systems (e.g., Netflix)
• Malware detection
• Wildlife conservation, and so on.
Bayesian learning algorithm:
Bayesian learning is a powerful method for building models that predict future outcomes by
updating beliefs based on new evidence.
Bayes' Theorem is a fundamental concept in probability theory and statistics, used to update the
probability of a hypothesis based on new evidence. It describes the relationship between conditional
probabilities and is widely applied in machine learning, data analysis, and Bayesian inference.
Formula:
P(A|B) = [P(B|A) × P(A)] / P(B)
where:
P(A|B): probability of event A occurring given that event B has occurred (posterior probability)
P(B|A): probability of event B occurring given that event A has occurred (likelihood)
P(A): probability of event A on its own (prior probability)
P(B): probability of event B on its own (evidence)
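A tiny numeric illustration of the formula (the probabilities are made-up values, not from the notes):

p_a = 0.01           # prior P(A), e.g. probability of a rare condition
p_b_given_a = 0.9    # likelihood P(B|A), e.g. a test is positive given the condition
p_b = 0.05           # evidence P(B), overall probability of a positive test

p_a_given_b = p_b_given_a * p_a / p_b    # posterior P(A|B) by Bayes' Theorem
print(p_a_given_b)                       # 0.18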
Limitations:
The results can be sensitive to the choice of prior, especially when the data is limited.
In some models, like Naive Bayes, the assumption of feature independence may not give good accuracy.
Naive Bayes:
Naive Bayes is a probabilistic algorithm based on Bayes' Theorem. It assumes that all
features are independent of each other when predicting a class.
Supervised Learning algorithm.
Primarily used for classification tasks.
Independence Assumption: the features are assumed to be conditionally independent of each other given the
class, so P(x1, x2, ..., xn | class) = P(x1 | class) × P(x2 | class) × ... × P(xn | class).
Steps:
1. Calculate Prior Probabilities: for each class, compute its probability in the training data.
2. Calculate Likelihoods: for each feature, compute the probability of it occurring given the class.
3. Calculate Posterior Probabilities: for each class, calculate the posterior probability given the features.
Formula: P(class | features) ∝ P(class) × P(feature1 | class) × ... × P(featureN | class)
4. Normalization of Values and Prediction: normalize the posterior values (optional, since the denominator
P(Features) is the same for all classes) and choose the class with the highest posterior probability as the
prediction.
Step 1: Calculate Prior Probabilities
Since P(Features) is the same for all classes, we can ignore its computation during
classification.
(b) Calculate Posterior Probabilities for Play = No:
Since P(Features) is the same for all classes, we can ignore its computation during
classification.
Using Naive Bayes, the prediction for (Outlook = Sunny, Temperature = Cool,
Humidity = High, Wind = Strong) is: NO
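A short sketch of the arithmetic behind this prediction, assuming the standard 14-day play-tennis table that this example appears to follow (the counts 9 'Yes' / 5 'No' and the per-feature counts are taken from that standard table, which appears only as a figure in these notes):

# priors from the assumed table: 9 days Play=Yes, 5 days Play=No
p_yes, p_no = 9/14, 5/14

# likelihoods for the query (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
like_yes = (2/9) * (3/9) * (3/9) * (3/9)   # P(each feature | Play=Yes)
like_no  = (3/5) * (1/5) * (4/5) * (3/5)   # P(each feature | Play=No)

score_yes = p_yes * like_yes               # ≈ 0.0053 (unnormalised posterior)
score_no  = p_no * like_no                 # ≈ 0.0206 (unnormalised posterior)
print("Prediction:", "Yes" if score_yes > score_no else "No")   # -> No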