
R22 – JNTUH-Machine Learning UNIT -III Notes

Decision Tree in Machine Learning


A decision tree is a supervised learning algorithm commonly used in
machine learning to model and predict outcomes from input data. It
is a tree-like structure in which each internal node tests an
attribute, each branch corresponds to an attribute value, and each
leaf node represents the final decision or prediction. Decision
trees can be used to solve both regression and classification
problems.
Decision Tree Terminologies
The following specialized terms describe the components of the tree
structure and the decision-making procedure:
 Root Node: The topmost node of the tree, representing the
first feature or decision from which the tree branches.
 Internal Nodes (Decision Nodes): Nodes that test the value of
a particular attribute; each has branches leading to further
nodes.
 Leaf Nodes (Terminal Nodes): The ends of the branches, where
the final decision or prediction is made. Leaf nodes have no
further branches.
 Branches (Edges): Links between nodes that represent the
outcome of a test and lead from one decision to the next.
 Splitting: The process of dividing a node into two or more
sub-nodes based on a decision criterion. It involves selecting
a feature and a threshold to create subsets of the data.
 Parent Node: A node that is split into child nodes; the node
from which a split originates.
 Child Node: A node created as the result of a split from a
parent node.
 Decision Criterion: The rule or condition used to determine
how the data should be split at a decision node, typically by
comparing feature values against a threshold.
 Pruning: The process of removing branches or nodes from a
decision tree to improve its generalisation and prevent
overfitting.


Understanding these terminologies is crucial for interpreting and
working with decision trees in machine learning applications.
How is a Decision Tree formed?
A decision tree is formed by recursively partitioning the data based
on the values of different attributes. At each internal node the
algorithm selects the best attribute to split the data, using a
criterion such as information gain or Gini impurity. This splitting
process continues until a stopping criterion is met, such as
reaching a maximum depth or having a minimum number of instances in
a leaf node.
Why Decision Tree?
Decision trees are widely used in machine learning for a number
of reasons:
 They are versatile and interpretable, which makes them well
suited to modelling intricate decision-making processes.
 Their hierarchical structure lets them represent complex
decision scenarios that involve many factors and outcomes.
 They provide comprehensible insight into the decision logic,
which makes them especially helpful for classification and
regression tasks.
 They handle both numerical and categorical data, and they
adapt easily to a variety of datasets thanks to their built-in
feature selection.
 They are simple to visualize, which helps in understanding and
explaining the decision process of a model.
Decision Tree Approach
A decision tree uses a tree representation to solve a problem: each
leaf node corresponds to a class label, and attributes are tested at
the internal nodes of the tree. Any Boolean function on discrete
attributes can be represented with a decision tree.
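To illustrate this point, the short sketch below (not from the original notes; the Boolean function and feature names are chosen only for demonstration) fits a scikit-learn decision tree to the full truth table of f(A, B, C) = (A AND NOT B) OR C and prints the learned rules.

```python
from itertools import product

from sklearn.tree import DecisionTreeClassifier, export_text

# Truth table of f(A, B, C) = (A and not B) or C over all 8 input combinations.
X = list(product([0, 1], repeat=3))
y = [int((a and not b) or c) for a, b, c in X]

# A small tree is enough to represent this Boolean function exactly.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

print("training accuracy:", tree.score(X, y))
print(export_text(tree, feature_names=["A", "B", "C"]))
```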


Constructing Decision Trees


[Figure: a sample decision tree with decision nodes and leaf nodes; D1 to D6 denote the data subsets produced by the splits.]

Learning with Trees – Decision Trees – Constructing Decision Trees
Step 1: Given the data set, choose the target attribute.
Step 2: Calculate the information gain with entropy.
Step 3: Information Gain (IG) $= -\frac{P}{P+N}\log_2\frac{P}{P+N} - \frac{N}{P+N}\log_2\frac{N}{P+N}$, where P and N are the numbers of positive and negative examples.

For each attribute A, find its entropy $E(A) = \sum_i \frac{P_i + N_i}{P+N}\, I(P_i, N_i)$, i.e. the entropy of each attribute value weighted by its probability.

Step 4: Calculate Gain(A) = IG − E(A).
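To make these formulas concrete, here is a small Python sketch (added for illustration; the counts for the hypothetical three-valued attribute are invented) that computes the dataset entropy, the entropy of one attribute, and the resulting gain.

```python
from math import log2

def entropy(p, n):
    """Entropy I(p, n) of a set with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * log2(frac)
    return result

# Illustrative counts: 9 positive and 5 negative examples overall.
P, N = 9, 5
ig = entropy(P, N)                      # entropy of the whole data set

# Hypothetical attribute with three values; (p_i, n_i) counts per value.
value_counts = [(2, 3), (4, 0), (3, 2)]
e_attr = sum((p + n) / (P + N) * entropy(p, n) for p, n in value_counts)

print("IG   =", round(ig, 3))
print("E(A) =", round(e_attr, 3))
print("Gain =", round(ig - e_attr, 3))
```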


CART (Classification And Regression Trees) for Decision Trees
CART is a predictive algorithm used in machine learning that
explains how a target variable’s values can be predicted from other
variables. It is a decision tree in which each fork is a split on a
predictor variable and each leaf node holds a prediction for the
target variable. The term CART serves as a generic term for the
following categories of decision trees:
 Classification Trees: used to determine the “class” into which
the target variable is most likely to fall when the target is
categorical.
 Regression Trees: used to predict the value of a continuous
target variable.
In the decision tree, nodes are split into sub-nodes based on a
threshold value of an attribute. The root node is taken as the
training set and is split into two by considering the best attribute
and threshold value. The resulting subsets are then split using the
same logic. This continues until pure subsets are found or the tree
reaches its maximum allowed number of leaves.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree
algorithm that is used for both classification and regression tasks.
It is a supervised learning algorithm that learns from labelled
data to predict unseen data.
 Tree structure: CART builds a tree-like structure
consisting of nodes and branches. The nodes represent
different decision points, and the branches represent
the possible outcomes of those decisions. The leaf
nodes in the tree contain a predicted class label or
value for the target variable.
 Splitting criteria: CART uses a greedy approach to split the
data at each node. It evaluates all possible splits and selects
the one that best reduces the impurity of the resulting
subsets. For classification tasks, CART uses Gini impurity as
the splitting criterion: the lower the Gini impurity, the purer
the subset. For regression tasks, CART uses the reduction in
residual (squared) error as the splitting criterion: the
greater the reduction, the better the model fits the data.
 Pruning: To prevent overfitting, pruning removes nodes that
contribute little to the model’s accuracy. Cost-complexity
pruning and information-gain pruning are two popular
techniques. Cost-complexity pruning weighs each subtree’s
contribution to accuracy against its added complexity and
removes subtrees that do not justify that complexity;
information-gain pruning removes nodes with low information
gain.
How does the CART algorithm work?
The CART algorithm works via the following process:
 The best-split point of each input is obtained.
 Based on the best-split points of each input in Step 1,
the new “best” split point is identified.
 Split the chosen input according to the “best” split point.
 Continue splitting until a stopping rule is satisfied or
no further desirable splitting is available.

The CART algorithm uses Gini impurity to split the dataset into a
decision tree. It does so by searching for the splits that give the
most homogeneous sub-nodes, using the Gini index criterion.
Gini index / Gini impurity
The Gini index is the metric CART uses for classification tasks. It
is based on the sum of squared class probabilities and measures how
likely a randomly chosen element would be misclassified if it were
labelled randomly according to the class distribution; it is a
variation of the Gini coefficient. It works on categorical
variables, gives outcomes of either “success” or “failure”, and
hence performs binary splitting only.
The value of the Gini index varies from 0 to 1:
 0 means that all the elements belong to a single class, i.e.
only one class is present.
 A Gini index close to 1 indicates a high level of impurity,
where the elements are spread across many classes, each with a
small fraction of the elements.
 A value of 1 − 1/n occurs when the elements are uniformly
distributed over n classes, each with an equal probability of
1/n. For example, with two classes the Gini impurity is
1 − 1/2 = 0.5.
Mathematically, we can write the Gini impurity as
$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$
where $p_i$ is the probability of an object being classified to a
particular class.
In conclusion, Gini impurity is the probability of
misclassification, assuming independent selection of the
element and its class based on the class probabilities.
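The formula translates directly into code. The following helper (an illustrative sketch added to these notes, not part of the original) computes the Gini impurity of a list of class labels:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in `labels`."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(gini_impurity(["yes", "yes", "yes", "yes"]))   # 0.0  -> pure node
print(gini_impurity(["yes", "no", "yes", "no"]))     # 0.5  -> two balanced classes
print(gini_impurity(["a", "b", "c", "d"]))           # 0.75 -> 1 - 1/n for n = 4
```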
CART for Classification
A classification tree is used when the target variable is
categorical. The algorithm identifies the “class” into which the
target variable is most likely to fall. Classification trees are
used when the dataset needs to be split into classes that belong to
the response variable (for example, yes or no).
CART for classification is a decision tree learning algorithm that
creates a tree-like structure to predict class labels. The tree
consists of nodes, which represent decision points, and branches,
which represent the possible outcomes of those decisions. The
predicted class labels are stored at the leaf nodes of the tree.
How Does CART for Classification Work?
CART for classification works by recursively splitting the training
data into smaller and smaller subsets based on certain criteria. The
goal is to split the data in a way that minimizes the impurity
within each subset. Impurity measures how mixed the data in a
particular subset is; for classification tasks, CART uses Gini
impurity.
 Gini Impurity - Gini impurity measures the probability of
misclassifying a random instance from a subset labelled
according to the majority class. Lower Gini impurity means a
purer subset.
 Splitting Criteria- The CART algorithm evaluates all
potential splits at every node and chooses the one that
best decreases the Gini impurity of the resultant subsets.
This process continues until a stopping criterion is
reached, like a maximum tree depth or a minimum
number of instances in a leaf node.
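In practice this procedure is available off the shelf; the following scikit-learn sketch (illustrative only, using the built-in iris data rather than anything from these notes) grows a Gini-based classification tree with a depth limit and a minimum leaf size as the stopping criteria:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gini impurity as the splitting criterion; max_depth and min_samples_leaf act
# as the stopping criteria described above.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```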
CART for Regression
A regression tree is used when the target variable is continuous and
the tree is used to predict its value, for example the temperature
of the day.
CART for regression is a decision tree learning method that creates
a tree-like structure to predict continuous target variables. The
tree consists of nodes that represent decision points and branches
that represent the possible outcomes of those decisions. The
predicted values for the target variable are stored at the leaf
nodes of the tree.
How Does CART Work for Regression?
Regression CART works by splitting the training data recursively
into smaller subsets based on specific criteria. The objective is to
split the data in a way that minimizes the residual (squared) error
within each subset.
 Residual Reduction - Residual reduction measures how much the
average squared difference between the predicted values and
the actual values of the target variable is reduced by
splitting the subset. The greater the reduction, the better the
model fits the data.
 Splitting Criteria- CART evaluates every possible split at
each node and selects the one that results in the
greatest reduction of residual error in the resulting
subsets. This process is repeated until a stopping
criterion is met, such as reaching the maximum tree
depth or having too few instances in a leaf node.
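A corresponding regression sketch (again illustrative, on synthetic data invented for this example, and assuming a recent scikit-learn where the squared-error criterion is named "squared_error"):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # noisy sine curve

# Each split is chosen to minimise the squared error of the resulting subsets;
# min_samples_leaf stops the tree from chasing individual noisy points.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4, min_samples_leaf=10)
reg.fit(X, y)

print("prediction at x = 1.5:", reg.predict([[1.5]])[0])
```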
Pseudo-code of the CART algorithm

d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d - 1) + Node(2^d) + ... + Node(2^(d+1) - 2) = 2 - 2^(d+1)
        endtree = 1
    else
        do i = 2^d - 1, 2^d, ..., 2^(d+1) - 2
            if Node(i) > -1
                Split tree
            else
                Node(2i + 1) = -1
                Node(2i + 2) = -1
            end if
        end do
    end if
    d = d + 1
end while


CART model representation


CART models are formed by picking input variables and
evaluating split points on those variables until an appropriate
tree is produced.
Steps to create a Decision Tree using the CART algorithm:
 Greedy algorithm: The input space is divided using a greedy
method known as recursive binary splitting. This is a numerical
procedure in which the values are lined up and a number of
candidate split points are tried and assessed using a cost
function.
 Stopping Criterion: As it works its way down the tree with the
training data, the recursive binary splitting method must know
when to stop splitting. The most common stopping criterion is a
minimum number of training instances assigned to each leaf
node: if the count falls below the specified threshold, the
split is rejected and the node is treated as a final leaf node.
 Tree pruning: A decision tree’s complexity is defined as the
number of splits in the tree. Trees with fewer branches are
preferred, as they are easier to understand and less prone to
overfit the data. The quickest and simplest pruning approach is
to work through each leaf node in the tree and evaluate the
effect of removing it using a hold-out test set (a
cost-complexity pruning sketch is shown after this list).
 Data preparation for the CART: No special data
preparation is required for the CART algorithm.
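As referenced above, scikit-learn exposes cost-complexity pruning directly; the sketch below (illustrative, reusing the iris data) computes the pruning path and refits a pruned tree with a chosen complexity parameter ccp_alpha:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path lists the effective alphas at which subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
print("candidate ccp_alphas:", path.ccp_alphas)

# Refit with a non-zero alpha: larger alpha -> more pruning -> smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02)
pruned.fit(X_train, y_train)
print("leaves:", pruned.get_n_leaves(), "test accuracy:", pruned.score(X_test, y_test))
```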
POPULAR CART-BASED ALGORITHMS:
 CART (Classification and Regression Trees): The
original algorithm that uses binary splits to build
decision trees.
 C4.5 and C5.0: Related decision tree algorithms (successors of
ID3) that allow multiway splits and handle categorical
variables more effectively.
 Random Forests: Ensemble methods that use multiple decision
trees (often CART) to improve predictive performance and reduce
overfitting.
 Gradient Boosting Machines (GBM): Boosting algorithms
that also use decision trees (often CART) as base
learners, sequentially improving model performance.
Advantages of CART
 Results are simple to interpret.

 Classification and regression trees are nonparametric and
nonlinear.
 Classification and regression trees implicitly perform
feature selection.
 Outliers have no meaningful effect on CART.
 It requires minimal supervision and produces
easy-to- understand models.
Limitations of CART
 Overfitting.
 High variance.
 Low bias.
 The tree structure may be unstable.
Applications of the CART algorithm
 For quick data insights.
 In blood donor classification.
 For environmental and ecological data.
 In the financial sector.

What is Ensemble Learning with examples?


 Ensemble learning is a machine learning technique that combines
the predictions from multiple individual models to obtain
better predictive performance than any single model. The basic
idea is to leverage the wisdom of the crowd by aggregating the
predictions of multiple models, each of which may have its own
strengths and weaknesses. This can lead to improved performance
and generalization.
 Ensembles are computationally more expensive than a single
model, but they often compensate for the weaknesses of
individual learning algorithms and can be more effective than a
single, heavily trained non-ensemble model. This section gives
an overview of why ensemble learning matters and how it works,
the different types of ensemble classifiers, advanced ensemble
learning techniques, and some algorithms (such as random forest
and XGBoost) that illustrate the common ensemble classifiers
and their uses in practice.


 Several individual base models (experts) are fitted to


learn from the same data and produce an aggregation of
output based on which a final decision is taken. These
base models can be machine learning algorithms such as
decision trees (mostly used), linear models, support
vector machines (SVM), neural networks, or any other
model that is capable of making predictions.
 The most commonly used ensemble techniques include Bagging,
used to build algorithms such as Random Forest, and Boosting,
used to build algorithms such as AdaBoost and XGBoost.
Ensemble Learning Techniques
 Gradient Boosting Machines (GBM): Gradient Boosting is a
popular ensemble learning technique that sequentially
builds a group of decision trees and corrects the residual
errors made by previous trees, enhancing its predictive
accuracy. It trains each new weak learner to fit the
residuals of the previous ensemble’s predictions thus
making it less sensitive to individual data points or
outliers in the data.
 Extreme Gradient Boosting (XGBoost): XGBoost features
tree pruning, regularization, and parallel processing,
which makes it a preferred choice for data scientists
seeking robust and accurate predictive models.
 CatBoost: Designed to handle categorical features natively,
which eliminates the need for extensive pre-processing.
CatBoost is known for its high predictive accuracy, fast
training, and automatic handling of overfitting.
 Stacking: Combines the outputs of multiple base models by
training a combiner (an algorithm that takes the base models’
predictions as input) to generate a more accurate prediction.
Stacking allows more flexibility in combining diverse models,
and the combiner can be any machine learning algorithm.
 Random Subspace Method (Random Subspace Ensembles): An ensemble
learning approach that improves predictive accuracy by training
base models on random subsets of the input features. It
mitigates overfitting and improves generalization by
introducing diversity in the model space.

 Random Forest Variants: They introduce variations in tree


construction, feature selection, or model optimization to
enhance performance.
Selecting the right advanced ensemble technique depends on the
nature of the data, the specific problem being solved, and the
computational resources available. It often requires
experimentation and tuning to achieve the best results.
Algorithm based on Bagging and Boosting
Bagging Algorithm
Bagging is a supervised learning technique that can be used for
both regression and classification tasks. Here is an overview of
the steps including Bagging classifier algorithm:
 Bootstrap Sampling: Draws ‘N’ bootstrap samples from the
original training data by randomly selecting rows with
replacement. This step ensures that the base models are trained
on diverse subsets of the data.
 Base Model Training: For each bootstrapped sample,
train a base model independently on that subset of
data. These weak models are trained in parallel to
increase computational efficiency and reduce time
consumption.
 Prediction Aggregation: To make a prediction on testing
data combine the predictions of all base models. For
classification tasks, it can include majority voting or
weighted majority while for regression, it involves
averaging the predictions.
 Out-of-Bag (OOB) Evaluation: Some samples are excluded
from the training subset of particular base models
during the bootstrapping method. These “out-of-bag”
samples can be used to estimate the model’s
performance without the need for cross- validation.
 Final Prediction: After aggregating the predictions from
all the base models, Bagging produces a final
prediction for each instance.
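These steps map directly onto scikit-learn's bagging implementation; a brief illustrative sketch (the breast-cancer dataset is just a convenient stand-in) with out-of-bag evaluation enabled:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base model is a decision tree; each tree sees its own bootstrap
# sample, and oob_score=True evaluates every tree on the rows it never saw.
bag = BaggingClassifier(n_estimators=50, oob_score=True, random_state=0)
bag.fit(X_train, y_train)

print("OOB estimate :", bag.oob_score_)
print("test accuracy:", bag.score(X_test, y_test))
```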
Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak
learners to create a strong learner. The weak models are trained in
series, such that each model tries to correct the errors of the
previous model, until the entire training dataset is predicted
correctly. One of the most well-known boosting algorithms is
AdaBoost (Adaptive Boosting).
Here are a few popular boosting algorithm frameworks:
 AdaBoost (Adaptive Boosting): AdaBoost assigns
different weights to data points, focusing on
challenging examples in each iteration. It combines
weighted weak classifiers to make predictions.
 Gradient Boosting: Gradient Boosting, including
algorithms like Gradient Boosting Machines (GBM),
XGBoost, and LightGBM, optimizes a loss function by
training a sequence of weak learners to minimize the
residuals between predictions and actual values,
producing strong predictive models.
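For a concrete picture, here is an illustrative AdaBoost sketch in scikit-learn (the dataset choice is arbitrary); each successive weak learner focuses on the examples the previous ones got wrong:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base learner is a decision stump; misclassified points receive
# higher weights in the next round, and the weighted stumps are combined.
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boost.fit(X_train, y_train)

print("test accuracy:", boost.score(X_test, y_test))
```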
Uses of Ensemble Learning
Ensemble learning is a versatile approach that can be applied
to a wide range of machine learning problems such as:-
 Classification and Regression: Ensemble techniques are applied
to classification and regression problems in many domains,
including finance, healthcare, marketing, and more.
 Anomaly Detection: Ensembles can be used to detect
anomalies in datasets by combining multiple anomaly
detection algorithms, thus making it more robust.
 Portfolio Optimization: Ensembles can be employed to
optimize investment portfolios by collecting predictions
from various models to make better investment
decisions.
 Customer Churn Prediction: In business and marketing
analytics, by combining the results of various models
capturing different aspects of customer behaviour,
ensembles can be used to predict customer churn.
 Medical Diagnostics: In healthcare, ensembles can be
used to make more accurate predictions of diseases
based on various medical data sources and diagnostic
models.
 Credit Scoring: Ensembles can be used to improve the accuracy
of credit scoring models by combining the outputs of various
credit risk assessment models.
 Climate Prediction: Ensembles of climate models help in making
more accurate and reliable predictions for weather forecasting,
climate change projections, and related environmental studies.
 Time Series Forecasting: Ensemble learning combines
multiple time series forecasting models to enhance
accuracy and reliability, adapting to changing
temporal patterns.
What is Statistics?
Statistics is the science of collecting, organizing, analyzing,
interpreting, and presenting data. It encompasses a wide
range of techniques for summarizing data, making
inferences, and drawing conclusions.
Statistical methods help quantify uncertainty and variability in
data, allowing researchers and analysts to make data-driven
decisions with confidence.
Types of Statistics
There are commonly two types of statistics, which are
discussed below:
• Descriptive Statistics: "Descriptive Statistics" helps us
simplify and organize big chunks of data. This makes large
amounts of data easier to understand.
• Inferential Statistics: "Inferential Statistics" is a little
different. It uses smaller data to draw conclusions about
a larger group. It helps us predict and draw conclusions
about a population.

Measures of Central Tendency

• Mean: calculated by summing all values present in the sample and
dividing by the total number of values:
Mean (μ) = Sum of Values / Number of Values

• Median: the middle value of a sample when arranged from lowest to
highest (or highest to lowest); the data must be sorted before the
median can be found.
For an odd number of data points:
Median = the ((n + 1) / 2)-th value
For an even number of data points:
Median = the average of the (n / 2)-th value and the ((n / 2) + 1)-th value
• Mode is the most frequently occurring value in the dataset.

Measures of Dispersion

• Range: The difference between the maximum and minimum values.

• Variance: The average squared deviation from the mean, representing data spread.

• Standard Deviation: The square root of variance, indicating data spread relative to the mean.

• Interquartile Range: The range between the first and third quartiles, measuring data
spread around the median.

Measures of Shape

• Skewness: Indicates data asymmetry.


• Kurtosis: Measures the peakedness of the data distribution.
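A quick illustrative computation of these descriptive measures in Python (the sample values below are made up):

```python
import statistics

import numpy as np

sample = [12, 15, 11, 18, 15, 22, 15, 17, 14, 19]

print("mean              :", statistics.mean(sample))
print("median            :", statistics.median(sample))
print("mode              :", statistics.mode(sample))
print("range             :", max(sample) - min(sample))
print("variance          :", statistics.pvariance(sample))   # population variance
print("standard deviation:", statistics.pstdev(sample))
q1, q3 = np.percentile(sample, [25, 75])
print("interquartile range:", q3 - q1)
```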

Inferential Statistics

Inferential statistics is used to make predictions about a group of data in which you are
interested. A random sample of data is taken from a population and used to describe and make
inferences about that population. The complete group of data you are interested in is known as
the population. Inferential statistics allows you to make predictions from a small sample instead
of working with the whole population.

Population and Sample

• Population: The entire group being studied.

• Sample: A subset of the population used for analysis.

Estimation

• Point Estimation: Provides a single value estimate of a population parameter.

• Interval Estimation: Offers a range of values (confidence interval) within which


the parameter likely lies.

• Confidence Intervals: Indicate the reliability of an estimate.


Hypothesis Testing

• Null and Alternative Hypotheses: The null hypothesis assumes no effect or relationship,
while the alternative suggests otherwise.

• Type I and Type II Errors: A Type I error is rejecting a true null hypothesis, while a Type II
error is failing to reject a false null hypothesis.

• p-Values: Measure the probability of obtaining the observed results under the
null hypothesis.

• t-Tests and z-Tests: Compare means to assess statistical significance.
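As a small illustration of a two-sample test (the two groups below are synthetic), SciPy's t-test returns the test statistic and the p-value described above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50.0, scale=5.0, size=30)   # e.g. scores under treatment A
group_b = rng.normal(loc=53.0, scale=5.0, size=30)   # e.g. scores under treatment B

# Null hypothesis: the two groups have equal means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t statistic:", round(t_stat, 3))
print("p-value    :", round(p_value, 4))
print("reject H0 at the 5% level:", p_value < 0.05)
```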

ANOVA (Analysis of Variance):

Compares means across multiple groups to determine if they differ significantly.

Chi-Square Tests:

Assess the association between categorical variables.

Correlation and Regression:

Understanding relationships between variables is critical in machine learning.

Correlation

• Pearson Correlation Coefficient: Measures linear relationship strength between


two variables.

• Spearman Rank Correlation: Assesses the strength and direction of the


monotonic relationship between variables.

Regression Analysis

• Simple Linear Regression: Models the relationship between two variables.

• Multiple Linear Regression: Extends to multiple predictors.

• Assumptions of Linear Regression: Linearity, independence, homoscedasticity, normality.

• Interpretation of Regression Coefficients: Explains predictor influence on the


response variable.

• Model Evaluation Metrics: R-squared, Adjusted R-squared, RMSE.
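The following illustrative sketch (on synthetic data) fits a simple linear regression and reports the R-squared and RMSE metrics mentioned above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 7.0 + rng.normal(scale=2.0, size=100)   # y is roughly 3x + 7 plus noise

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("slope     :", model.coef_[0])
print("intercept :", model.intercept_)
print("R-squared :", r2_score(y, y_pred))
print("RMSE      :", np.sqrt(mean_squared_error(y, y_pred)))
```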


Bayesian Statistics

Bayesian statistics incorporate prior knowledge with current evidence to update beliefs.

Bayes' Theorem is a fundamental concept in probability theory that relates conditional probabilities.
It is named after the Reverend Thomas Bayes, who first introduced the theorem. Bayes' Theorem is
a mathematical formula that provides a way to update probabilities based on new evidence. The
formula is as follows:

P(A|B) = P(B|A) · P(A) / P(B), where

• P(A∣B): The probability of event A given that event B has occurred (posterior probability).

• P(B∣A): The probability of event B given that event A has occurred (likelihood).

• P(A): The probability of event A occurring (prior probability).

• P(B): The probability of event B occurring.
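A worked numeric illustration of the formula (the prevalence and test accuracies below are invented for the example): suppose 1% of a population has a disease, the test detects it 95% of the time, and it gives a false positive 5% of the time; Bayes' Theorem then gives the probability of disease given a positive test.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' Theorem.
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# P(B): total probability of a positive test.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(A|B): posterior probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print("P(disease | positive test) =", round(p_disease_given_pos, 3))   # ~0.161
```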

Applications of Statistics in Machine Learning

Statistics is a key component of machine learning, with broad applicability in various fields.

• Feature engineering relies heavily on statistics to convert geometric features into


meaningful predictors for machine learning algorithms.

• In image processing tasks like object recognition and segmentation, statistics


accurately reflect the shape and structure of objects in images.

• Anomaly detection and quality control benefit from statistics by identifying deviations
from norms, aiding in the detection of defects in manufacturing processes.

• Environmental observation and geospatial mapping leverage statistical analysis to


monitor land cover patterns and ecological trends effectively.

Gaussian Mixture Models

Normal or Gaussian Distribution

In real life, many datasets can be modeled by a Gaussian
distribution (univariate or multivariate). So it is quite natural
and intuitive to assume that the clusters come from different
Gaussian distributions. In other words, the model tries to describe
the dataset as a mixture of several Gaussian distributions. This is
the core idea of Gaussian Mixture Models.
In one dimension the probability density function of a Gaussian
distribution is given by

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

where $\mu$ and $\sigma^2$ are respectively the mean and variance of
the distribution. For a multivariate (say d-variate) Gaussian
distribution, the probability density function is given by

$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^{T}\Sigma^{-1}(\mathbf{x}-\mu)\right)$

Here $\mu$ is a d-dimensional vector denoting the mean of the
distribution and $\Sigma$ is the d × d covariance matrix.
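These densities can be evaluated directly with SciPy; a short illustrative check (the parameter values are arbitrary) that the library functions match the formulas above:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Univariate: N(mu = 2, sigma^2 = 4) evaluated at x = 3.
mu, sigma = 2.0, 2.0
x = 3.0
manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print("univariate pdf  :", norm.pdf(x, loc=mu, scale=sigma), "manual:", manual)

# Bivariate (d = 2) with a mean vector and covariance matrix.
mean = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
point = np.array([0.5, 0.5])
print("multivariate pdf:", multivariate_normal.pdf(point, mean=mean, cov=cov))
```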

In probability theory and statistics, a normal distribution or Gaussian
distribution is a type of continuous probability distribution for a
real-valued random variable; the general form of its probability density
function is the one given above. A random variable with a Gaussian
distribution is said to be normally distributed, and is called a normal
deviate.

Normal distributions are important in statistics and are often used in
the natural and social sciences to represent real-valued random variables
whose distributions are not known. Their importance is partly due to the
central limit theorem, which states that, under some conditions, the
average of many samples (observations) of a random variable with finite
mean and variance is itself a random variable whose distribution converges
to a normal distribution as the number of samples increases. Therefore,
physical quantities that are expected to be the sum of many independent
processes, such as measurement errors, often have distributions that are
nearly normal.

Moreover, Gaussian distributions have some unique properties that are


valuable in analytic studies. For instance, any linear combination of a
fixed collection of independent normal deviates is a normal deviate.
Many results and methods, such as propagation of uncertainty and least
squares[5] parameter fitting, can be derived analytically in explicit form
when the relevant variables are normally distributed.

A normal distribution is sometimes informally called a bell curve.


However, many other distributions are bell-shaped (such as the Cauchy,
Student's t,
and logistic distributions).

The univariate probability distribution is generalized for vectors in the
multivariate normal distribution and for matrices in the matrix normal
distribution.


In real-world machine learning applications, it is common to


have many relevant features, but only a subset of them may
be observable. When dealing with variables that are
sometimes observable and sometimes not, it is indeed
possible to utilize the instances when that variable is visible or
observed in order to learn and make predictions for the
instances where it is not observable. This approach is often
referred to as handling missing data. By using the available
instances where the variable is observable, machine learning
algorithms can learn patterns and relationships from the
observed data. These learned patterns can then be used to
predict the values of the variable in instances where it is
missing or not observable.
The expectation-Maximization algorithm can be used to handle
situations where variables are partially observable. When certain
variables are observable, we can use those instances to learn
and estimate their values. Then, we can predict the values of
these variables in instances when it is not observable.
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is an iterative
optimization method that combines different unsupervised
machine


learning algorithms to find maximum likelihood or maximum


posterior estimates of parameters in statistical models that
involve unobserved latent variables. The EM algorithm is
commonly used for latent variable models and can handle
missing data. It consists of an estimation step (E-step) and a
maximization step (M-step), forming an iterative process to improve
model fit.
 In the E step, the algorithm computes the latent
variables i.e. expectation of the log-likelihood using
the current parameter estimates.
 In the M step, the algorithm determines the parameters
that maximize the expected log-likelihood obtained in the
E step, and corresponding model parameters are updated
based on the estimated latent variables.

Key Terms in Expectation-Maximization (EM) Algorithm


Some of the most commonly used key terms in the
Expectation- Maximization (EM) Algorithm are as follows:
 Latent Variables: Latent variables are unobserved
variables in statistical models that can only be inferred
indirectly through their effects on observable variables.
They cannot be directly measured but can be detected
by their impact on the observable variables.
 Likelihood: It is the probability of observing the given data
given the parameters of the model. In the EM
algorithm, the goal is to find the parameters that
maximize the likelihood.
 Log-Likelihood: It is the logarithm of the likelihood
function, which measures the goodness of fit between
the observed data and the model. EM algorithm seeks
to maximize the log- likelihood.
 Maximum Likelihood Estimation (MLE): MLE is a method to
estimate the parameters of a statistical model by finding the
parameter values that maximize the likelihood function, which
measures how well the model explains the observed data.
 Posterior Probability: In the context of Bayesian
inference, the EM algorithm can be extended to estimate
the maximum a posteriori (MAP) estimates, where the
posterior probability of the parameters is calculated
based on the prior distribution and the likelihood
function.
 Expectation (E) Step: The E-step of the EM algorithm
computes the expected value or posterior probability of
the latent variables given the observed data and
current parameter estimates. It involves calculating
the probabilities of each latent variable for each data
point.
 Maximization (M) Step: The M-step of the EM algorithm
updates the parameter estimates by maximizing the
expected log-likelihood obtained from the E-step. It
involves finding the parameter values that optimize
the likelihood function, typically through numerical
optimization methods.
 Convergence: Convergence refers to the condition when
the EM algorithm has reached a stable solution. It is
typically determined by checking if the change in the
log-likelihood or the parameter estimates falls below a
predefined threshold.
How Expectation-Maximization (EM) Algorithm Works:
The essence of the Expectation-Maximization algorithm is to use
the available observed data of the dataset to estimate the
missing data and then use that data to update the values of
the parameters. Let us understand the EM algorithm in detail.


EM Algorithm Flowchart

1. Initialization:
 Initially, a set of initial values of the
parameters are considered. A set of incomplete
observed data is given to the system with the
assumption that the observed data comes
from a specific model.
2. E-Step (Expectation Step): In this step, we use the
observed data in order to estimate or guess the
values of the missing or incomplete data. It is
basically used to update the variables.
 Compute the posterior probability or
responsibility of each latent variable given
the observed data and current parameter
estimates.
 Estimate the missing or incomplete data values
using the current parameter estimates.
 Compute the log-likelihood of the observed
data based on the current parameter estimates
and estimated missing data.

3. M-step (Maximization Step): In this step, we use the complete
data generated in the preceding “Expectation” step in order to
update the values of the parameters. It is basically used to
update the hypothesis.
 Update the parameters of the model by
maximizing the expected complete data log-
likelihood obtained from the E-step.
 This typically involves solving optimization
problems to find the parameter values that
maximize the log- likelihood.
 The specific optimization technique used
depends on the nature of the problem and
the model being used.
4. Convergence: In this step, we check whether the values are
converging; if yes, stop, otherwise repeat step 2 and step 3
(the “Expectation” and “Maximization” steps) until convergence
occurs.
 Check for convergence by comparing the change
in log- likelihood or the parameter values
between iterations.
 If the change is below a predefined threshold,
stop and consider the algorithm converged.
 Otherwise, go back to the E-step and
repeat the process until convergence is
achieved.
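To ground the E-step and M-step, here is a minimal NumPy sketch of EM for a one-dimensional mixture of two Gaussians (the data and starting values are synthetic, and the convergence check is deliberately simple):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians; which component generated each point is latent.
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial guesses for the mixture weights, means and variances.
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

for _ in range(200):
    # E-step: responsibility of each component for each data point.
    dens = np.vstack([w * gaussian_pdf(data, m, v)
                      for w, m, v in zip(weights, means, variances)])  # shape (2, n)
    resp = dens / dens.sum(axis=0, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities.
    nk = resp.sum(axis=1)
    weights = nk / len(data)
    new_means = (resp @ data) / nk
    variances = np.array([(resp[k] * (data - new_means[k]) ** 2).sum() / nk[k]
                          for k in range(2)])

    # Convergence: stop when the means barely move between iterations.
    if np.max(np.abs(new_means - means)) < 1e-6:
        means = new_means
        break
    means = new_means

print("weights  :", np.round(weights, 3))
print("means    :", np.round(means, 3))
print("variances:", np.round(variances, 3))
```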
Applications of the EM algorithm
 It can be used to fill in the missing data in a sample.
 It can be used as the basis of unsupervised learning of
clusters.
 It can be used for the purpose of estimating the
parameters of the Hidden Markov Model (HMM).
 It can be used for discovering the values of latent variables.
Advantages of EM algorithm
 It is always guaranteed that likelihood will increase
with each iteration.
 The E-step and M-step are often pretty easy for many
problems in terms of implementation.
 Solutions to the M-steps often exist in the closed form.


Disadvantages of EM algorithm
 It has slow convergence.
 It makes convergence to the local optima only.
 It requires both the probabilities, forward and
backward (numerical optimization requires only
forward probability).


What is the K-Nearest Neighbors Algorithm?


KNN is one of the most basic yet essential classification
algorithms in machine learning. It belongs to the supervised
learning domain and finds intense application in pattern
recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is
non-parametric, meaning it does not make any underlying assumptions
about the distribution of the data (as opposed to other algorithms
such as GMM, which assume a Gaussian distribution of the given
data). We are given some prior data (also called training data),
which classifies coordinates into groups identified by an attribute.
[Figure: example table of data points containing two features.]

Why do we need a KNN algorithm?


The K-Nearest Neighbors (K-NN) algorithm is a versatile and widely
used machine learning algorithm, valued primarily for its simplicity
and ease of implementation. It does not require any assumptions
about the underlying data distribution. It can also handle both
numerical and categorical data, making it a flexible choice for
various types of datasets in classification and regression tasks. It
is a non-parametric method that makes predictions based on the
similarity of data points in a given dataset. K-NN is less sensitive
to outliers compared to other algorithms.


The K-NN algorithm works by finding the K nearest neighbors to a


given data point based on a distance metric, such as Euclidean
distance. The class or value of the data point is then determined
by the majority vote or average of the K neighbors. This
approach allows the algorithm to adapt to different patterns and
make predictions based on the local structure of the data.
Distance Metrics Used in KNN Algorithm
The KNN algorithm helps us identify the nearest points or groups for
a query point, but to determine the closest points we need a metric.
For this purpose, we use the distance metrics below:
Euclidean Distance
This is nothing but the cartesian distance between the two
points which are in the plane/hyperplane. Euclidean distance can
also be visualized as the length of the straight line that joins
the two points which are into consideration. This metric helps
us calculate the net displacement done between the two
states of an object.
$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$
Manhattan Distance
Manhattan Distance metric is generally used when we are
interested in the total distance traveled by the object instead of
the displacement. This metric is calculated by summing the
absolute difference between the coordinates of the points in n-
dimensions.
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Minkowski Distance
We can say that the Euclidean, as well as the Manhattan
distance, are special cases of the Minkowski distance.
$d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$
From the formula above we can say that when p = 2 then it is
the same as the formula for the Euclidean distance and when
p = 1 then we obtain the formula for the Manhattan distance.
The metrics discussed above are the most common ones for machine
learning problems, but there are other distance metrics as well,
like Hamming distance, which comes in handy for problems that
require overlapping comparisons between two vectors whose contents
can be Boolean or string values.
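A small illustrative implementation of these three metrics (showing that Minkowski with p = 1 and p = 2 reproduces Manhattan and Euclidean distance):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def minkowski(x, y, p):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p)

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print("euclidean        :", euclidean(a, b))      # 5.0
print("minkowski (p = 2):", minkowski(a, b, 2))   # same as euclidean
print("manhattan        :", manhattan(a, b))      # 7.0
print("minkowski (p = 1):", minkowski(a, b, 1))   # same as manhattan
```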


How to choose the value of k for KNN Algorithm?


The value of k is crucial in the KNN algorithm: it defines how many
neighbors are considered. The value of k should be chosen based on
the input data; if the data has many outliers or much noise, a
higher value of k usually works better. It is recommended to choose
an odd value for k to avoid ties in classification, and
cross-validation can help select the best k for a given dataset.
Algorithm for K-NN
DistanceToNN = sorted distances of the first k training examples
value        = target values of those k examples
for i = 1 to number of training records:
    Dist = distance(test example, ith example)
    if Dist < any distance in DistanceToNN:
        remove the farthest example from DistanceToNN and value
        insert the new example into DistanceToNN and value in sorted order
return average of value

A fit using K-NN is more reasonable than 1-NN: K-NN is affected much
less by noise when the dataset is large.
In the K-NN algorithm we can see jumps in the prediction values for
small changes in the input; the reason is the change in the set of
neighbors. To handle this, we can weight the neighbors: if the
distance to a neighbor is large, we want less effect from that
neighbor, and if the distance is small, that neighbor should have
more influence than the others.


Workings of KNN algorithm


The K-Nearest Neighbors (KNN) algorithm operates on the principle of
similarity: it predicts the label or value of a new data point by
considering the labels or values of its K nearest neighbors in the
training dataset.

Step-by-Step explanation of how KNN works is discussed below:


Step 1: Selecting the optimal value of K
 K represents the number of nearest neighbors that
needs to be considered while making prediction.
Step 2: Calculating distance
 To measure the similarity between target and training
data points, Euclidean distance is used. Distance is
calculated between each of the data points in the
dataset and target point.
Step 3: Finding Nearest Neighbors
 The k data points with the smallest distances to the
target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
 In the classification problem, the class labels of K-
nearest neighbors are determined by performing
majority voting. The class with the most occurrences
among the neighbors becomes the predicted class for
the target data point.
 In the regression problem, the class label is calculated
by taking average of the target values of K nearest
neighbors. The calculated average value becomes the
predicted output for the target data point.

Let X be the training dataset with n data points, where each data
point is represented by a d-dimensional feature vector $X_i$, and
let Y be the corresponding labels or values for each data point in
X. Given a new data point x, the algorithm calculates the distance
between x and each data point $X_i$ in X using a distance metric,
such as the Euclidean distance
$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$
The algorithm selects the K data points from X that have the
shortest distances to x. For classification tasks, the algorithm
assigns the label y that is most frequent among the K nearest
neighbors to x. For regression tasks, the algorithm calculates
the average or weighted average of the values y of the K
nearest neighbors and assigns it as the predicted value for x.
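An illustrative scikit-learn version of this procedure (the iris data is used only as a stand-in), with k = 5 and Euclidean distance:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# n_neighbors is k; p=2 with the Minkowski metric gives Euclidean distance,
# and weights="distance" down-weights far-away neighbors as discussed above.
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2, weights="distance")
knn.fit(X_train, y_train)

print("test accuracy:", knn.score(X_test, y_test))
```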
Advantages of the KNN Algorithm
 Easy to implement as the complexity of the
algorithm is not that high.
 Adapts Easily – KNN stores all the training data in memory, so
whenever a new example or data point is added, the algorithm
adjusts itself and the new example contributes to future
predictions as well.
 Few Hyperparameters – The only parameters required when
training a KNN model are the value of k and the choice of
distance metric.
Disadvantages of the KNN Algorithm
 Does not scale – KNN is considered a lazy algorithm: it defers
all computation to prediction time, so it requires a lot of
computing power and data storage. This makes the algorithm both
time-consuming and resource-intensive.
 Curse of Dimensionality – Owing to the peaking phenomenon, KNN
is affected by the curse of dimensionality, which means the
algorithm has a hard time classifying data points correctly
when the dimensionality is too high.
 Prone to Overfitting – Because the algorithm is affected by the
curse of dimensionality, it is also prone to overfitting.
Feature selection and dimensionality reduction techniques are
therefore generally applied to deal with this problem.
