UNIT-3 Machine Learning
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and
Regression Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine
Classifiers – Basic Statistics – Gaussian Mixture Models – Nearest Neighbor Methods –
Unsupervised Learning – K means Algorithms
• Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
• Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the tests are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• One of the reasons that decision trees are popular is that we can turn them into a set of
logical disjunctions (if ... then rules) that then go into program code very simply.
The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm used in
machine learning and data mining for classification tasks. It was developed by Ross Quinlan in
the 1980s and is the predecessor of more advanced decision tree algorithms like C4.5 and
CART.
How the ID3 Algorithm Works
ID3 builds a decision tree by selecting attributes that maximize information gain (or minimize
entropy). The process follows these steps:
1. Calculate Entropy
Entropy measures the impurity or disorder in a dataset. It is calculated using the formula:
Entropy(S) = -Σ pᵢ log₂(pᵢ)
where pᵢ is the proportion of examples in S that belong to class i.
2. Calculate Information Gain
For each attribute A, Gain(S, A) = Entropy(S) - Σᵥ (|Sᵥ|/|S|) × Entropy(Sᵥ), where Sᵥ is the subset of S for which attribute A takes value v.
3. Select the Best Attribute
Choose the attribute with the highest information gain as the decision node, split the dataset on its values, and repeat the process recursively on each subset until a subset is pure (entropy 0) or no attributes remain.
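As a quick illustration of these two quantities, here is a minimal Python sketch (not part of the original notes); the small "Weather" data is made up to match the counts used in the example below.

```python
# Minimal sketch of the ID3 splitting criterion: entropy and information gain.
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)) over the values v of A."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy "play outside" data: attribute 0 is Weather (3 Yes, 2 No overall).
rows = [("Sunny",), ("Sunny",), ("Overcast",), ("Rainy",), ("Rainy",)]
labels = ["Yes", "Yes", "Yes", "No", "No"]
print(entropy(labels))                    # ~0.971
print(information_gain(rows, labels, 0))  # ~0.971, since every Weather subset is pure
```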
Disadvantages of ID3
Overfits on noisy or small datasets
Cannot handle continuous numerical values directly (must be discretized)
Prefers attributes with many values (can be biased toward high-cardinality attributes)
Example of ID3
Consider a small training set where the attribute Weather takes the values Sunny, Overcast and Rainy, and the target Play occurs:
• Yes = 3 times
• No = 2 times
So Entropy(S) = -(3/5) log₂(3/5) - (2/5) log₂(2/5) ≈ 0.971. Splitting on Weather produces subsets that are each pure (all Yes or all No), so the entropy after splitting is 0 and we stop here. The final tree is:
Weather
/ | \
Sunny Overcast Rainy
Yes Yes No
The C4.5 algorithm is an improved version of the ID3 decision tree algorithm developed by
Ross Quinlan. It overcomes some limitations of ID3 and is widely used in classification
problems.
How C4.5 Works
1. Start with the entire dataset and calculate the entropy.
2. Select the best attribute to split the data using the Gain Ratio (instead of just Information
Gain).
3. Create branches based on attribute values.
4. Handle missing values and continuous data (ID3 cannot handle these well).
5. Recursively repeat the process for each subset until all data is classified.
6. Use pruning to remove unnecessary branches and prevent overfitting.
• ID3 uses Information Gain, which can favor attributes with many values.
• C4.5 solves this issue by introducing Gain Ratio, which normalizes Information Gain.
• Formula for Gain Ratio:
Gain Ratio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = -Σᵥ (|Sᵥ|/|S|) log₂(|Sᵥ|/|S|) penalizes attributes that split the data into many small subsets.
Example:
Imagine a dataset where we decide whether to play outside based on weather conditions, and one row has a missing Temperature value.
Since the missing row is split between "Hot" and "Mild", entropy and information gain are calculated by weighting the contributions accordingly (C4.5 distributes the example across the branches in proportion to the known values).
Unlike ID3, which requires categorical attributes, C4.5 can split numerical data dynamically
by finding the best threshold.
How It Works:
Step 1: Sort the numerical values and consider candidate thresholds (typically the midpoints between consecutive values).
Step 2: Compute the Information Gain (or Gain Ratio) for each threshold and pick the best one.
• Suppose 20°C gives the highest Gain Ratio; C4.5 then splits the data:
o ≤ 20°C → "No"
o > 20°C → "Yes"
The CART algorithm (Classification and Regression Trees) is a decision tree learning
technique used for classification and regression tasks. It was introduced by Breiman et al.
(1984) and is widely used in machine learning for predictive modelling.
CART constructs binary decision trees by recursively splitting the dataset into two subsets based
on feature values. The algorithm selects the best split at each step using Gini impurity (for
classification) or mean squared error (for regression).
• For classification problems, the split is chosen based on Gini Index (default in CART).
• For regression problems, the split is chosen based on Mean Squared Error (MSE).
Splitting Criteria:
The Gini Index measures the impurity of a node. The formula for the Gini Index is:
Gini = 1 - Σ pᵢ²
where pᵢ is the probability of each class. A lower Gini Index means purer nodes.
Pruning (Optional): after the tree is grown, branches that do not improve performance on unseen data can be pruned to reduce overfitting.
Example Dataset
ID Feature: Age Label: Play Tennis (Yes=1, No=0)
1 25 Yes (1)
2 30 Yes (1)
3 35 No (0)
4 40 No (0)
5 45 No (0)
Weighted Gini for the split Age ≤ 30:
Left node (Age ≤ 30): {Yes, Yes} → Gini = 1 - (1² + 0²) = 0
Right node (Age > 30): {No, No, No} → Gini = 1 - (0² + 1²) = 0
Weighted Gini = (2/5) × 0 + (3/5) × 0 = 0
Since the split at Age = 30 results in Gini = 0, this is the best split.
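To double-check this arithmetic, here is a short Python sketch (added for illustration; the data is the five-row table above).

```python
# Verify the weighted Gini of the Age <= 30 split from the example table.
def gini(labels):
    """Gini = 1 - sum(p_i^2) over the class proportions in a node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

ages   = [25, 30, 35, 40, 45]
labels = [1, 1, 0, 0, 0]          # Play Tennis: Yes=1, No=0

threshold = 30
left  = [l for a, l in zip(ages, labels) if a <= threshold]
right = [l for a, l in zip(ages, labels) if a > threshold]

weighted = (len(left) / len(labels)) * gini(left) + (len(right) / len(labels)) * gini(right)
print(weighted)   # 0.0 -> both child nodes are pure, so Age <= 30 is the best split
```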
For regression, CART splits the data based on Mean Squared Error (MSE):
MSE = (1/n) Σ (yᵢ - ȳ)², where ȳ is the mean target value of the node; the split that minimizes the weighted MSE of the two child nodes is chosen.
Example Dataset
ID Feature: Age Target: Salary (in $1000s)
1 25 50
2 30 55
3 35 60
4 40 70
5 45 80
• Left Node (Age ≤ 35): { (25, 50), (30, 55), (35, 60) } → predicted salary = mean = 55
• Right Node (Age > 35): { (40, 70), (45, 80) } → predicted salary = mean = 75
Age ≤ 35?
/ \
55 75
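For comparison, here is a minimal sketch using scikit-learn's DecisionTreeRegressor (an implementation of CART) on the same table; max_depth=1 forces the single Age ≤ 35 split shown above (assuming scikit-learn is installed).

```python
# CART regression on the Age/Salary example with scikit-learn.
from sklearn.tree import DecisionTreeRegressor

X = [[25], [30], [35], [40], [45]]
y = [50, 55, 60, 70, 80]                   # salary in $1000s

tree = DecisionTreeRegressor(max_depth=1)  # a single split, matching the Age <= 35 example
tree.fit(X, y)
print(tree.predict([[28], [42]]))          # prints [55. 75.], the two leaf means
```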
ENSEMBLE LEARNING
Ensemble learning is a technique in machine learning where multiple models (often called weak
learners or base models) are combined to create a stronger, more accurate model. The main idea
is that multiple models working together can reduce errors and improve predictions compared to
a single model.
1. Boosting
• Models are trained sequentially, where each new model corrects the mistakes of the previous
ones.
• Helps reduce bias and improve weak models.
Example: AdaBoost, Gradient Boosting (XGBoost, LightGBM, CatBoost).
Steps in Boosting:
• Train a weak learner on the (weighted) training data.
• Evaluate its errors on the training set.
• Assign more weight to misclassified samples so the next weak learner focuses on them.
• Repeat, and combine all weak learners (weighted by their accuracy) into the final model.
AdaBoost (Adaptive Boosting) Algorithm
AdaBoost (Adaptive Boosting) is an ensemble learning method that combines multiple weak
learners (usually decision stumps) to create a strong classifier. It adjusts the weights of
misclassified samples to focus more on difficult cases in each iteration.
1. Initialize Weights:
o Assign equal weights to all training samples.
2. Train a Weak Learner:
o Fit a weak classifier (typically a decision stump) to the weighted data.
3. Calculate Error:
o The error is measured as the total weight of misclassified samples.
4. Update Weights:
o Compute the classifier weight αₜ = ½ ln((1 - error)/error), increase the weights of misclassified samples, decrease the weights of correctly classified ones, and repeat from step 2.
Step 1: Dataset
Suppose we have 4 training samples with a single numeric feature x and labels +1 / -1.
Step 2: Initialize Weights
Initially, all samples are given equal importance. Since we have 4 samples, their initial weight is wᵢ = 1/4 = 0.25:
Sample Weight wi
1 0.25
2 0.25
3 0.25
4 0.25
• If x<2.5, predict +1
• If x≥2.5, predict -1
The error of the weak classifier is the sum of the weights of the misclassified samples:
ε₁ = Σ wᵢ over the misclassified samples, and the classifier's weight is α₁ = ½ ln((1 - ε₁)/ε₁).
After several iterations, we get multiple weak classifiers h₁, h₂, h₃, ... with different weights αₜ. The final strong classifier is their weighted vote: H(x) = sign(Σₜ αₜ hₜ(x)).
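The weight-update mechanics can be sketched in a few lines of Python; the x values and labels below are made-up illustration data, not the exact table from the notes.

```python
# One AdaBoost round, from scratch, on a made-up 1-D dataset.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([+1, +1, -1, +1])           # one "hard" sample that the stump gets wrong
w = np.full(len(x), 0.25)                # step 1: equal initial weights

# Weak learner h1: a decision stump that predicts +1 if x < 2.5, else -1.
pred = np.where(x < 2.5, 1, -1)

err   = w[pred != y].sum()               # step 3: error = total weight of misclassified samples
alpha = 0.5 * np.log((1 - err) / err)    # classifier weight alpha_1

# Step 4: re-weight samples (misclassified ones get heavier), then normalise.
w = w * np.exp(-alpha * y * pred)
w = w / w.sum()
print(err, alpha, w)                     # the misclassified sample's weight grows to 0.5
```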
2. Bagging (Bootstrap Aggregating)
• Multiple models (usually the same algorithm) are trained on random subsets of data.
• Predictions are averaged (for regression) or voted (for classification).
• Helps reduce variance and prevents overfitting.
Example: Random Forest (combines multiple decision trees).
Steps in Bagging (Random Forest example):
• Instead of a single Decision Tree, Random Forest builds multiple trees and combines their predictions.
• The Random Forest algorithm is a machine learning technique that uses multiple decision trees to make predictions. It can be used for both classification and regression tasks.
1. Create multiple datasets → Randomly pick data with replacement (some data may be
repeated).
2. Train multiple decision trees → Each tree learns from a different dataset.
3. Make predictions → Each tree makes its own prediction.
4. Combine the results →
o For classification → Take the majority vote (most common prediction).
o For regression → Take the average of all predictions.
Key points:
• More trees = better accuracy & less overfitting.
• Every tree in the forest makes its own predictions without relying on others.
• Each tree is built using random samples and features to reduce correlated errors.
• Sufficient data ensures the trees are different and learn distinct patterns.
• Combining the predictions from different trees leads to a more accurate final result (see the code sketch below).
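A minimal sketch of bagging in practice, using scikit-learn's RandomForestClassifier on a synthetic dataset (the dataset and parameter values are illustrative assumptions, not part of the original notes).

```python
# Random Forest = bagging of decision trees + random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each trained on a bootstrap sample
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # final prediction is the majority vote of the trees
```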
Advantages of Random Forest
• Random Forest provides very accurate predictions even with large datasets.
• Random Forest can handle missing data well without compromising accuracy.
• It doesn’t require normalization or standardization of the dataset.
• When we combine multiple decision trees it reduces the risk of overfitting of the
model.
Combining multiple classifiers can improve machine learning model performance by leveraging
the strengths of different algorithms. There are various ways to combine classifiers:
• Stacking: Train several models, then use another model (a meta-model) to combine
their predictions for a better result.
• Bagging (Bootstrap Aggregating)
Trains multiple instances of the same classifier on different subsets of data. Reduces
variance and prevents overfitting. Example: Random Forest (uses bagging with decision
trees).
• Boosting
Sequentially trains classifiers, where each new model focuses on the mistakes of the
previous ones. Reduces bias and increases accuracy. Examples: AdaBoost, Gradient Boosting (XGBoost, LightGBM, CatBoost).
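As a hedged illustration of two of these combination strategies, here is a short scikit-learn sketch on synthetic data (the particular base models and meta-model are arbitrary choices, not prescribed by the notes).

```python
# Voting vs. stacking: two ways to combine different classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
base = [("tree", DecisionTreeClassifier(random_state=0)), ("nb", GaussianNB())]

# Voting: each base classifier votes and the majority class wins.
voter = VotingClassifier(estimators=base, voting="hard").fit(X, y)

# Stacking: a meta-model (logistic regression) learns how to combine the base predictions.
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression()).fit(X, y)
print(voter.score(X, y), stack.score(X, y))
```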
The Mixture of Experts (MoE) is an ensemble learning technique that divides a complex
problem into subproblems and assigns specialized models (called experts) to solve each
subproblem. A gating network learns to combine the outputs of these experts to make a final
prediction.
It is widely used in deep learning and large-scale AI models, such as Google’s Switch
Transformers, which use MoE to efficiently allocate computational resources.
Input: This is the problem or data you want to handle.
Experts: These are smaller models, each trained to be really good at a specific part of the overall
problem. Think of them like the different specialists on your team.
Gating network: This is like a manager who decides which expert is best suited for each part of
the problem. It looks at the input and figures out who should work on what.
Output: This is the final answer or solution that the model produces after the experts have done
their work.
Advantages of MoE
Scalability – MoE can handle large-scale problems by distributing tasks across specialized
models.
Improved Accuracy – Experts specialize in different areas, leading to better generalization.
Parallel Computation – Experts can run independently, making MoE efficient for distributed
computing.
Reduced Overfitting – Specialization prevents overfitting to general patterns.
Disadvantages of MoE
Training complexity – the experts and the gating network must be trained together, which is harder to tune than a single model.
Load balancing – the gating network may route most inputs to only a few experts, leaving the others under-trained.
Higher memory cost – all experts must be stored, even if only a few are active for a given input.
BASIC STATISTICS
Mean: The arithmetic average of a set of values, obtained by adding all the values and dividing by the number of values.
Median: The middle value when the data are arranged in order; with an even number of values, it is the average of the two middle values.
Mode: The value that occurs most frequently in the dataset.
Variance: Variance, in statistics, is a measure of how spread out or dispersed data points are from their
average (mean), calculated by averaging the squared differences from the mean.
Covariance:
Covariance is a measure of the relationship between two variables that is scale dependent, i.e., how
much one variable changes when another variable changes.
Standard Deviation: The square root of the variance is known as the standard deviation.
Interquartile Range: The range between the first and third quartiles, measuring data spread
around the median.
• Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In
this curve, there is a heavy concentration of items near the central value.
• Mesokurtic: A mesokurtic curve has the same peak as the normal curve. In this
curve, items are distributed around the central value as in the normal distribution.
• Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this
curve, there is less concentration of items around the central value.
Mahalanobis Distance: The Mahalanobis distance is a statistical measurement that determines
how far a point is from a distribution. It's used in many fields, including computer science,
chemometrics, and cluster analysis.
It is a powerful technique that considers the correlations between variables in a dataset, making it
a valuable tool in various applications such as outlier detection, clustering, and classification.
D² = (x-μ)ᵀΣ⁻¹(x-μ)
Where D² is the squared Mahalanobis Distance, x is the point in question, μ is the mean vector of
the distribution, Σ is the covariance matrix of the distribution, and ᵀ denotes the transpose of a
matrix.
The Gaussian / Normal Distribution: Normal distribution, also known as the Gaussian
distribution, is a continuous probability distribution that is symmetric about the mean, depicting
that data near the mean are more frequent in occurrence than data far from the mean.
Model Training
Expectation-Maximization:
• During the E step, the model calculates the probability of each data point belonging to
each Gaussian component.
• The M step then adjusts the model’s parameters based on these probabilities.
1. Mixture of Gaussians
o Instead of assuming all points belong to just one cluster (like in k-means), GMM
assumes data is a mix of several Gaussian distributions.
o Each distribution represents one hidden group (e.g., different flavors of candy).
2. Soft Clustering (Probabilities Instead of Hard Labels)
o Instead of saying, “This point is in Cluster A,” GMM says, “This point is 70%
likely to be in Cluster A and 30% likely to be in Cluster B.”
3. Expectation-Maximization (EM) Algorithm
o Since we don’t know which Gaussian a point belongs to, we start with a guess.
o We then refine this guess using the E-step (Expectation) and M-step
(Maximization) until the clusters make sense.
Example: Imagine a Class of Students
Let’s say we measure the heights of students in a school. If we plot the heights, we might see
three peaks in the data.
GMM assumes that each peak represents a Gaussian distribution, and the overall height
distribution is just a mix of these three groups.
If we give a new student’s height, GMM can tell us the probability that the student belongs to
each group.
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what’s nearby. The
K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method employed to
tackle classification and regression problems.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and performs the computation only at
classification time.
As an example, consider a set of data points with two features, where each point belongs to one of two categories and a new point must be classified.
The new point is classified as Category 2 because most of its closest neighbours are blue
squares; KNN assigns the category based on the majority of nearby points.
In the accompanying figure:
• The red diamonds represent Category 1 and the blue squares represent Category
2.
• The new data point checks its closest neighbours (circled points).
• Since the majority of its closest neighbours are blue squares (Category 2), KNN
predicts the new data point belongs to Category 2.
• K represents the number of nearest neighbors that needs to be considered while making
prediction.
• To measure the similarity between target and training data points, Euclidean distance is used.
Distance is calculated between each of the data points in the dataset and target point.
• The k data points with the smallest distances to the target point are the nearest neighbors.
• In the classification problem, the class labels of K-nearest neighbors are determined by
performing majority voting. The class with the most occurrences among the neighbors becomes
the predicted class for the target data point.
• In the regression problem, the prediction is calculated by taking the average of the target values of
the K nearest neighbors. The calculated average value becomes the predicted output for the target
data point.
Example:
Given Query:
X= (Maths=6, CS=8) → Find the class?
Maths CS Result
4 3 Fail
6 7 Pass
6 8 Pass
5 5 Fail
8 8 Pass
Given K=3.
Step 1: Compute the Euclidean distance from the query X = (6, 8) to every training point:
• (4, 3): √((6-4)² + (8-3)²) = √29 ≈ 5.39 → Fail
• (6, 7): √(0² + 1²) = 1 → Pass
• (6, 8): √(0² + 0²) = 0 → Pass
• (5, 5): √(1² + 3²) = √10 ≈ 3.16 → Fail
• (8, 8): √(2² + 0²) = 2 → Pass
Step 2: Sort the distances in increasing order.
Step 3: Since K = 3, consider the 3 smallest distances (0, 1 and 2), which all belong to points labelled Pass.
Thus, we assign the new data point to the Pass category.
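The same example can be reproduced with scikit-learn's KNeighborsClassifier (assuming scikit-learn is available).

```python
# KNN on the Maths/CS example with K = 3 and the default Euclidean distance.
from sklearn.neighbors import KNeighborsClassifier

X = [[4, 3], [6, 7], [6, 8], [5, 5], [8, 8]]      # (Maths, CS)
y = ["Fail", "Pass", "Pass", "Fail", "Pass"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[6, 8]]))                       # ['Pass'] -- matches the hand calculation above
```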
Advantages:
• Easy to implement: The KNN algorithm is easy to implement because its
complexity is relatively low as compared to other machine learning algorithms.
• No training required: KNN stores all data in memory and doesn’t require any
training, so when new data points are added it automatically incorporates them
for future predictions.
• Few Hyperparameters: The only parameters required by a KNN model are the
value of k and the choice of distance metric.
• Flexible: It works for classification problems (e.g., is this email spam or not?) and
also for regression tasks (e.g., predicting house prices based on nearby similar
houses).
Disadvantages:
• Doesn’t scale well: KNN is considered a “lazy” algorithm, so prediction is slow,
especially with large datasets.
• Curse of Dimensionality: When the number of features increases, KNN struggles to
classify data accurately, a problem known as the curse of dimensionality.
• Prone to Overfitting: Because the algorithm suffers from the curse of
dimensionality, it is also prone to overfitting.
K-dimensional tree
A k-d tree is a special kind of binary search tree that helps organize points in multiple
dimensions (like 2D or 3D space).
Imagine you have a list of locations on a map (like stores or houses), and you want to quickly
find the one closest to you. Instead of checking every single location one by one, a k-d tree
organizes them in a way that makes searching much faster.
1. It starts by dividing the space based on one coordinate (like splitting a map along a vertical line).
2. Then, it keeps dividing the smaller sections using other coordinates (like splitting horizontally
next).
3. This process continues, making it easier to search for nearby points.
The purpose of a k-d tree is to efficiently organize and search points in multiple dimensions
(2D, 3D, or higher).
1. Fast Nearest Neighbor Search
o Example: Finding the closest gas station or restaurant to your location.
2. Range Search
o Example: Finding all delivery addresses within a certain distance from a warehouse.
3. Efficient Spatial Partitioning
o Example: Used in 3D graphics and gaming to speed up rendering by organizing objects in
space.
4. Machine Learning (KNN Algorithm)
o Helps speed up the k-Nearest Neighbors (KNN) classifier by reducing search time.
5. Robotics & Pathfinding
o Used in motion planning for robots to navigate around obstacles efficiently.
• Faster searches than checking every point one by one (especially in large datasets).
• Organizes multi-dimensional data in a structured way.
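A minimal sketch of nearest-neighbour and range search with SciPy's KDTree, on made-up 2-D locations:

```python
# k-d tree queries: nearest neighbour and range search.
from scipy.spatial import KDTree

locations = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]   # e.g. stores on a 2-D map
tree = KDTree(locations)

distance, index = tree.query((6, 3))         # nearest single point to the query location
print(locations[index], distance)

nearby = tree.query_ball_point((6, 3), r=3)  # range search: all points within radius 3
print([locations[i] for i in nearby])
```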
UNSUPERVISED LEARNING
Unsupervised learning is a type of machine learning that works with data that has no labels or
categories. The main goal is to find patterns and relationships in the data without any
guidance. In this approach, the machine analyzes unorganized information and groups it based
on similarities, patterns, or differences. Unlike supervised learning, there is no teacher or
training involved; the machine must uncover hidden structures in the data on its own.
Example
Imagine a machine learning model is given a large dataset of unlabeled images containing both
dogs and cats. The model has never seen a dog or a cat before and has no pre-existing labels or
categories for these animals, so it cannot name the groups “dogs” and “cats”. However, it can
still categorize the images according to their similarities, patterns, and differences: one group
may contain all the pictures with dogs, and the other all the pictures with cats. No training data
or labelled examples were used; the grouping emerges purely from the structure of the data.
It allows the model to work on its own to discover patterns and information that was previously
undetected. It mainly deals with unlabeled data.
The unsupervised learning algorithm can be further categorized into two types of problems:
• Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in one group and have few or no similarities with the
objects of another group. Cluster analysis finds the commonalities between data objects
and categorizes them according to the presence or absence of those commonalities.
• Association: An association rule is an unsupervised learning method used for
finding relationships between variables in a large database. It determines the sets of
items that occur together in the dataset. Association rules make marketing strategies more
effective; for example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical application of association rules is Market Basket Analysis.
K MEANS ALGORITHM
• K-means clustering is a popular unsupervised machine learning algorithm used for
partitioning a dataset into a pre-defined number of clusters. The goal is to group similar
data points together and discover underlying patterns or structures within the data.
• The first property of clusters states that the points within a cluster should be similar to
each other. So, our aim here is to minimize the distance between the points within a
cluster.
• There is an algorithm that tries to minimize the distance of the points in a cluster with
their centroid – the k-means clustering technique.
• K-means is a centroid-based algorithm or a distance-based algorithm, where we calculate
the distances to assign a point to a cluster. In K-Means, each cluster is associated with a
centroid.
• The main objective of the K-Means algorithm is to minimize the sum of distances
between the points and their respective cluster centroid.
• Optimization plays a crucial role in the k-means clustering algorithm. The goal of the
optimization process is to find the best set of centroids that minimizes the sum of squared
distances between each data point and its closest centroid.
• Initialization: Start by randomly selecting K points from the dataset. These points will
act as the initial cluster centroids.
• Assignment: For each data point in the dataset, calculate the distance between that point
and each of the K centroids. Assign the data point to the cluster whose centroid is closest
to it. This step effectively forms K clusters.
• Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
• Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids
no longer change significantly or when a specified number of iterations is reached.
• Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster.
Mathematical Representation
The objective of K-Means is to minimize the sum of squared differences between each point and
its assigned cluster centroid:
J = Σⱼ₌₁..ₖ Σ (over points x in cluster Cⱼ) ||x - μⱼ||²
where Cⱼ is the set of points assigned to cluster j and μⱼ is its centroid.
The main objective of k-means clustering is to partition your data into a specific number (k) of
groups, where data points within each group are similar to one another and dissimilar to points
in other groups. It achieves this by minimizing the distance between data points and their
assigned cluster’s center, called the centroid.
• Grouping similar data points: K-means aims to identify patterns in your data by
grouping data points that share similar characteristics together. This allows you to
discover underlying structures within the data.
• Minimizing within-cluster distance: The algorithm strives to make sure data points
within a cluster are as close as possible to each other, as measured by a distance metric
(usually Euclidean distance). This ensures tight-knit clusters with high cohesiveness.
• Maximizing between-cluster distance: Conversely, k-means also tries to maximize the
separation between clusters. Ideally, data points from different clusters should be far
apart, making the clusters distinct from each other.
Advantages of K-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets
with high dimensionality.
3. Scalability: K-means can handle large datasets with many data points and can be
easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used
with varying metrics of distance and initialization methods.
Disadvantages of K-Means
1. The number of clusters k must be chosen in advance.
2. The result depends on the random initialization of the centroids and can get stuck in a poor local optimum.
3. It is sensitive to outliers, which can pull centroids away from the true cluster centers.
4. It assumes roughly spherical clusters of similar size and works poorly on irregularly shaped clusters.
Example:
Consider four data points, each described by two features (the points are listed in the table at the end of this example).
(Note: Keep points 1 and 2 as the initial centroids and label them K1 & K2.)
Step 1:
Decide the centroids. Let's take points 1 and 2 as the centroids of clusters K1 and K2:
K1 = (185, 72)
K2 = (170, 56)
Steps 2–5: Assign the remaining points to the nearest centroid using Euclidean distance.
• Point 3 (168, 60): distance to K1 = √((185-168)² + (72-60)²) = √433 ≈ 20.8; distance to K2 = √((170-168)² + (56-60)²) = √20 ≈ 4.5 → assign to K2.
• Point 4 (179, 68): distance to K1 = √((185-179)² + (72-68)²) = √52 ≈ 7.2; distance to K2 = √((179-170)² + (68-56)²) = √225 = 15 → assign to K1.
Step 6:
Total clusters are K = 2.
K1 = {1, 4}
K2 = {2, 3}
Point  Feature 1  Feature 2  Cluster
1      185        72         K1
2      170        56         K2
3      168        60         K2
4      179        68         K1
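The same four-point example can be reproduced with scikit-learn's KMeans, seeding the initial centroids at points 1 and 2 as the note above suggests (scikit-learn is assumed to be available).

```python
# K-means on the four-point example, initialised at points 1 and 2.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[185, 72], [170, 56], [168, 60], [179, 68]])
init_centroids = np.array([[185, 72], [170, 56]])                # K1 and K2

kmeans = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(points)
print(kmeans.labels_)            # [0 1 1 0] -> K1 = {1, 4}, K2 = {2, 3}
print(kmeans.cluster_centers_)   # final centroids after the update step
```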