ML UNIT 3 QA (2)
1. Explain the Decision Tree algorithm. What are its advantages and disadvantages?
Decision Tree : A decision tree is a type of supervised learning algorithm that is commonly used
in machine learning to model and predict outcomes based on input data.
It is a tree-like structure where each internal node tests an attribute, each branch
corresponds to an attribute value, and each leaf node represents the final decision or
prediction.
The decision tree algorithm is used to solve both regression and classification
problems.
A decision tree in machine learning is a versatile, interpretable algorithm used for
predictive modelling; it structures decisions based on input data, making it suitable for both
classification and regression tasks.
Root Node: The highest node in the tree; it represents the original choice or feature from
which the tree branches.
Internal Nodes (Decision Nodes): Nodes in the tree where choices are made based on the
values of particular attributes.
Attribute Selection Measures : A technique used to select the best attribute for the root node
and for sub-nodes is called an attribute selection measure (ASM).
There are 3 techniques for ASM:
(i) Entropy: Measures the amount of uncertainty or impurity in the dataset.
Entropy(S) = -\sum_{i=1}^{C} p_i \log_2 p_i
(ii) Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it
was randomly classified according to the distribution of classes in the dataset.
If p_i is the probability of an instance being classified into class i, then
Gini(S) = 1 - \sum_{i=1}^{C} p_i^2
(iii) Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is
split on an attribute.
Suppose S is a set of instances, A is an attribute, and S_{f_i} is the subset of S for which
attribute A has value f_i. The entropy of partitioning the data is calculated by weighting the
entropy of each partition by its size relative to the original set:
InformationGain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_{f_i}|}{|S|} \, Entropy(S_{f_i})
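These three measures can be computed directly from class counts. Below is a minimal Python sketch (NumPy is assumed; the toy labels and feature values are hypothetical) showing entropy, Gini impurity, and the information gain of a candidate split.

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, feature_values):
    # Gain(S, A) = Entropy(S) - sum(|S_i|/|S| * Entropy(S_i))
    total = len(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical toy data: yes/no labels split by a categorical feature
labels = np.array(["yes", "yes", "no", "no", "yes"])
feature = np.array(["sunny", "rain", "sunny", "rain", "rain"])
print(entropy(labels), gini(labels), information_gain(labels, feature))
```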
Advantages of Decision Tree :
1. Easy to understand and interpret, making them accessible to non-experts.
2. Handle both numerical and categorical data without requiring extensive preprocessing.
3. Provides insights into feature importance for decision-making.
4. Handle missing values and outliers without significant impact.
5. Applicable to both classification and regression tasks.
Decision Tree : A decision tree is a supervised learning algorithm that is commonly used
in machine learning for predictive modelling.
A decision tree is built (for example, by the ID3 algorithm) by iteratively selecting the best
attribute to split the data, based on information gain.
Entropy(S) = -\sum_{i=1}^{C} p_i \log_2 p_i
Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is split
on an attribute.
Suppose S is a set of instances, A is an attribute, and S_{f_i} is the subset of S for which
attribute A has value f_i. The entropy of partitioning the data is calculated by weighting the
entropy of each partition by its size relative to the original set:
InformationGain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_{f_i}|}{|S|} \, Entropy(S_{f_i})
ID3 algorithm :
• If all examples have the same label:
– return a leaf with that label
• Else if there are no features left to test:
– return a leaf with the most common label
• Else:
– choose the feature F that maximises the information gain of S to be the next node, using the equation
InformationGain(S, F) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_{f_i}|}{|S|} \, Entropy(S_{f_i})
– add a branch from the node for each possible value f_i of F
– for each branch:
∗ calculate S_{f_i} by removing F from the set of features
∗ recursively call the algorithm with S_{f_i} to compute the gain relative to the current
set of examples
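The recursion above corresponds to growing a tree whose splits maximise information gain. A hedged sketch using scikit-learn (an assumption; the notes do not prescribe a library, and the tiny encoded dataset is hypothetical):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded dataset: each row is [outlook, temperature], label is play yes(1)/no(0)
X = [[0, 1], [0, 0], [1, 1], [1, 0], [2, 1], [2, 0]]
y = [0, 0, 1, 1, 1, 0]

# criterion="entropy" makes each split maximise information gain, in the spirit of ID3
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned tree structure as text
print(export_text(tree, feature_names=["outlook", "temperature"]))
```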
Flow Chart : (flow chart figure not reproduced here)
The CART (Classification and Regression Trees) Algorithm : CART is a type of supervised
learning algorithm and is a decision tree methodology used for predictive modeling tasks. It can
perform both classification (predicting discrete labels) and regression (predicting continuous
values).
CART was first introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone
in 1984.
Tree Structure
Root Node: The top node of the tree, representing the entire dataset.
Internal Nodes: Represent decision points. Each internal node splits the data into two
subsets (CART produces binary splits).
Leaf Nodes (Terminal Nodes): Represent the outcome or prediction. For classification trees,
each leaf node represents a class label.
For regression trees, each leaf node represents a continuous value.
CART Algorithm :
1. Splitting Criteria : CART uses a greedy approach to split the data at each node. It evaluates
all possible splits and selects the one that best reduces the impurity of the resulting subsets.
(i) Gini Index (for Classification) : Measures the likelihood of an incorrect classification of a
new instance if it was randomly classified according to the distribution of classes in the dataset.
If p_i is the probability of an instance being classified into class i, then
Gini(S) = 1 - \sum_{i=1}^{C} p_i^2
The lower the Gini impurity, the purer the subset.
(ii) Mean Squared Error (for Regression) : For regression tasks, CART uses the mean squared
error as the residual-reduction criterion. The lower the MSE, the better the fit of the model
to the data.
MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
where Y_i is the actual value of the i-th observation and \hat{Y}_i is the predicted value
of the i-th observation.
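As a small illustration of the MSE criterion, the sketch below (hypothetical values, NumPy assumed) shows a regression-tree leaf predicting the mean of its target values and the resulting MSE:

```python
import numpy as np

y_true = np.array([3.0, 4.5, 5.0, 6.5])          # hypothetical targets falling in one leaf
y_pred = np.full_like(y_true, y_true.mean())      # the leaf predicts the mean target value

mse = np.mean((y_true - y_pred) ** 2)             # MSE = (1/n) * sum((Y_i - Y_hat_i)^2)
print("leaf prediction:", y_pred[0], "MSE:", mse)
```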
2. Recursive Partitioning: After the initial split, each subset of data is recursively split again,
creating branches until reaching terminal nodes or meeting stopping criteria (e.g., maximum
depth or minimum samples per leaf).
3. Leaf Nodes and Prediction: Once no further splits are required, the leaves of the tree
represent the final predictions. For classification, the leaves represent class labels, while for
regression, they contain mean or median values of the target variable.
4. Pruning : The process of removing sections of the tree to reduce complexity and
improve generalization. To prevent overfitting of the data, pruning is used to remove nodes
that add little predictive value.
Pruning techniques :
(i) Cost complexity pruning : Calculating the cost of each node and removing nodes that
have a negative cost.
(ii) Information gain pruning : Calculating the information gain of each node and removing
nodes that have a low information gain.
5. Evaluation Metrics
(i) Classification Accuracy: The proportion of correctly classified instances.
(ii) Confusion Matrix: A table used to evaluate the performance of a classification model.
Accuracy = \frac{TP + TN}{TP + FP + FN + TN}
(iii) R-squared: Used for regression tasks to measure the proportion of variance explained by
the model.
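A hedged sketch of computing these evaluation metrics with scikit-learn (an assumption; the label and target arrays below are hypothetical):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, r2_score

# Classification: hypothetical true vs. predicted labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Regression: hypothetical targets and predictions for R-squared
print("R^2:", r2_score([2.0, 3.0, 4.0], [2.1, 2.9, 4.2]))
```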
Advantages
▶ It is a simple and intuitive algorithm that is easy to understand and interpret.
▶ It can handle both numerical and categorical data.
▶ It can handle missing values by imputing them with surrogate splits.
▶ It can handle multi-class classification problems by using an extension called multi-class CART.
Disadvantages
▶ It tends to overfit the data, especially if the tree is allowed to grow too deep.
▶ It is a greedy algorithm that may not find the optimal tree.
▶ It may be biased towards predictors with many categories or high cardinality.
▶ It may produce unstable results, since small changes in the data can lead to a very different tree.
Example of CART in Classification : Consider a binary classification problem of deciding
whether to ENJOY a game (YES or NO), with 14 training examples given in the table below.
Decision :
Feature        Gini index
Outlook        0.342
Temperature    0.439
Humidity       0.367
Wind           0.428
Among the calculated Gini index values, the Outlook feature has the lowest cost.
So the Outlook decision is placed at the top of the tree (root node). The sub-dataset in the
Overcast branch contains only YES decisions, which means the Overcast branch ends in a leaf.
Step 3 : Construction of Decision Tree : Assign a leaf node to each subset that contains
instances that belong to the same class.
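The per-feature Gini values in the table above are weighted averages of the Gini impurity of each branch. A minimal sketch of that computation (the small weather-style dataset here is hypothetical, not the original 14-example table):

```python
import numpy as np

def weighted_gini(feature_values, labels):
    # Weighted Gini of a categorical split: sum over values v of |S_v|/|S| * Gini(S_v)
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    total, score = len(labels), 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        _, counts = np.unique(subset, return_counts=True)
        p = counts / counts.sum()
        score += len(subset) / total * (1.0 - np.sum(p ** 2))
    return score

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
# The feature with the lowest weighted Gini is the best root-node candidate
print("Gini(Outlook):", weighted_gini(outlook, play))
```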
Regression Tree : A Regression Tree is a type of decision tree that is used for predicting
continuous response variables.
It is a simple and fast algorithm commonly used for modelling continuous or nonlinear
sample data.
Calculate the standard deviation reduction for all features, and select as the split the
feature with the highest score.
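A short sketch of standard deviation reduction under one common definition, SDR(S, A) = std(S) − Σ |S_v|/|S| · std(S_v) (the feature values and numeric targets below are hypothetical):

```python
import numpy as np

def sdr(feature_values, targets):
    # SDR(S, A) = std(S) - sum over values v of |S_v|/|S| * std(S_v)
    feature_values, targets = np.asarray(feature_values), np.asarray(targets)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = targets[feature_values == v]
        weighted += len(subset) / len(targets) * subset.std()
    return targets.std() - weighted

outlook = ["sunny", "sunny", "rain", "rain", "overcast"]
hours   = np.array([25.0, 30.0, 46.0, 45.0, 52.0])   # hypothetical continuous targets
# The feature with the highest SDR is chosen as the split
print("SDR(Outlook):", sdr(outlook, hours))
```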
Clustering is an unsupervised learning problem: the goal is to find clusters of points in the
data set that have common characteristics.
EM Algorithm :
Step 4: Convergence. Repeat the E-step and M-step until convergence, typically
measured by a small change in the log-likelihood or in the parameter values between
iterations.
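In practice, EM for Gaussian mixtures is available off the shelf, and the convergence check above corresponds to a tolerance on the change in log-likelihood. A hedged sketch using scikit-learn's GaussianMixture (an assumption; the 2-D data is randomly generated):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical Gaussian clusters in 2-D
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# tol is the convergence threshold on the change in log-likelihood between EM iterations
gmm = GaussianMixture(n_components=2, tol=1e-3, max_iter=100, random_state=0).fit(X)
print("converged:", gmm.converged_, "iterations:", gmm.n_iter_)
```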
1. Voting : Some classification systems will only produce an output where all the classifiers
agree, or where more than half of them agree, whereas others simply take the most common output,
which is what we usually mean by majority voting.
2. Bagging : Bagging is a supervised ensemble technique that can be used for both regression
and classification tasks. An overview of the steps of the bagging classifier algorithm
(a code sketch follows this list):
Bootstrap Sampling : The original training data is divided into 'N' subsets by randomly
sampling rows with replacement, so some rows may appear in more than one subset. This step
ensures that the base models are trained on diverse subsets of the data.
3. Boosting : Sequentially builds a group of decision trees, each correcting the residual
errors made by the previous trees, enhancing predictive accuracy.
It trains each new weak learner to fit the residuals of the previous ensemble's predictions,
making the ensemble less sensitive to individual data points or outliers in the data.
Boosting is an ensemble technique that combines multiple weak learners to create a strong
learner.
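As a concrete illustration of the bagging step described in item 2 above, here is a hedged sketch using scikit-learn's BaggingClassifier on synthetic data (the library choice and parameters are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 25 base learners (decision trees by default), each fit on a bootstrap
# sample drawn with replacement from the training data
bag = BaggingClassifier(n_estimators=25, bootstrap=True, random_state=0).fit(X_tr, y_tr)
print("bagging test accuracy:", bag.score(X_te, y_te))
```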
AdaBoost algorithm :
Step 1 – Initialize the weights
For a dataset with N training instances, initialize a weight for each data point as
w_i = \frac{1}{N}
Step 2 – Train weak classifiers
After applying the weak classifier model to the training data, we update the weight assigned
to each point using the accuracy of the model. The weight-update formula is
W_i = W_i \exp(-\alpha_k y_i M_k(x_i))
where y_i is the true output and x_i is the corresponding input vector.
Step 5 – Normalize the instance weights
We normalize the instance weights so that they sum to 1, using the formula
W_i = \frac{W_i}{\sum_j W_j}
Step 6 – Repeat steps 2-5 for K iterations
We train K classifiers, calculate each model's importance, and update the instance weights
using the above formulas.
The final model M(X) is an ensemble model obtained by combining these weak models, weighted
by their model weights.
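A compact sketch of the weight-update loop described above, using decision stumps from scikit-learn as the weak learners (an illustrative assumption; labels are taken to be in {-1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=10):
    # y must be in {-1, +1}; start with uniform instance weights w_i = 1/N
    n = len(y)
    w = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(K):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # model importance
        w = w * np.exp(-alpha * y * pred)                    # W_i = W_i * exp(-a_k y_i M_k(x_i))
        w = w / w.sum()                                      # normalise so weights sum to 1
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def ensemble_predict(models, alphas, X):
    # Final model: sign of the importance-weighted sum of weak-learner predictions
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)

# Hypothetical toy data
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([1, 1, -1, -1, 1, 1])
models, alphas = adaboost(X, y, K=5)
print(ensemble_predict(models, alphas, X))
```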
K-Nearest Neighbours (KNN) : Evelyn Fix and Joseph Hodges developed this algorithm in 1951,
and it was subsequently expanded by Thomas Cover.
Euclidean Distance : the Cartesian distance between two points lying in the plane/hyperplane,
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
Manhattan Distance : the sum of the absolute differences between the coordinates of the two points,
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
Minkowski Distance : generalizes both the Euclidean and the Manhattan distances,
d(x, y) = \left[ \sum_{i=1}^{n} |x_i - y_i|^p \right]^{1/p}
If p = 2, it is the Euclidean distance.
If p = 1, it is the Manhattan distance.
Despite its simplicity, nearest neighbors has been successful in a large number of
classification and regression problems, including handwritten digits and satellite image
scenes.
Being a non-parametric method, it is often successful in classification situations where
the decision boundary is very irregular.
In the regression problem, the prediction is calculated by taking the average of the target
values of the K nearest neighbours. The calculated average value becomes the predicted output
for the target data point.
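A brief sketch of KNN classification and regression with scikit-learn (an assumption; the tiny one-dimensional dataset is hypothetical), where the parameter p selects the Minkowski distance:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])             # class labels
y_reg = np.array([1.1, 1.9, 3.2, 6.1, 7.0, 8.2])   # continuous targets

# p=2 gives Euclidean distance, p=1 Manhattan (Minkowski in general)
clf = KNeighborsClassifier(n_neighbors=3, p=2).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3, p=2).fit(X, y_reg)

print("class:", clf.predict([[2.5]]))   # majority vote of the 3 nearest neighbours
print("value:", reg.predict([[2.5]]))   # average of their target values
```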
K-means is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of the algorithm is to minimize the sum of distances between the data points and their
corresponding cluster centroids.
The term "k-means" was first used by James MacQueen in 1967.
The K-means algorithm is also referred to as the Lloyd–Forgy algorithm.
K-means clustering is a method of vector quantization, originally from signal processing.
It starts by randomly placing the cluster centroids in the space. Each data point is then
assigned to one of the clusters based on its distance from the cluster centroids. After every
point has been assigned to a cluster, new cluster centroids are computed. This process runs
iteratively until good clusters are found. In this analysis, assume that the number of clusters
is given in advance.
Algorithm :
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be other from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
d_i = \min_j d(x_i, \mu_j)
where \mu_j is the centroid of cluster j and the distance can be measured as:
Euclidean Distance : d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
Manhattan Distance : d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
Minkowski Distance (generalizes both) : d(x, y) = \left[ \sum_{i=1}^{n} |x_i - y_i|^p \right]^{1/p}
If p = 2, it is the Euclidean distance; if p = 1, it is the Manhattan distance.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of
its cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
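A minimal NumPy sketch of Steps 2-6 above (the synthetic 2-D data and k = 2 are hypothetical; scikit-learn's KMeans could be used instead):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when the centroids no longer move (no reassignment)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```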
Advantages :
Easy to implement; computationally fast and efficient even with a large number of variables.
Works well with distinct boundary data sets.
Disadvantages :
Difficulty in predicting the exact k-value for unknown data set.
Initial seeds have a strong influence on the final resulting cluster.
7. What is Boosting?
Boosting is a machine learning technique that improves the accuracy of predictive data analysis
by training multiple models sequentially. The goal is to create a strong learner from a collection
of weak learners, which are models that perform only slightly better than random guessing.
Boosting algorithms work by iteratively adjusting the weights of training instances, giving more
importance to misclassified instances. The final prediction is the weighted average of all the
predictions from the weak learners.
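A hedged end-to-end sketch of boosting in practice, using scikit-learn's AdaBoostClassifier on synthetic data (the library and parameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 weak learners trained sequentially; each round re-weights the training
# instances so that misclassified points receive more importance
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("boosting test accuracy:", boost.score(X_te, y_te))
```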