
All the IPython Notebooks in the **Machine Learning** lecture series by **[Mr. Mounesh Gouda](https://www.linkedin.com/in/mounesh-gouda-858069246/)** are available at **[GitHub](https://github.com/Mouneshgouda)**.

Mounesh Gouda

Unit 3: Classification and Tree-based Methods

Probabilities in machine learning:

Probability Distributions:

Continuous Distributions: Probability distributions such as the Gaussian (Normal) distribution, the exponential distribution, etc., are frequently used in machine learning algorithms. For example, Gaussian distributions are commonly used in statistical models, and the parameters of these distributions are often estimated from data.

Discrete Distributions: Probability mass functions for discrete distributions, such as the binomial distribution or multinomial distribution, are
used in situations where the outcome is discrete (e.g., counting successes in a fixed number of trials).
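
As a small illustration of how distribution parameters can be estimated from data, here is a minimal sketch using NumPy and SciPy; the sample values are made up for demonstration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample assumed to come from a Gaussian distribution
data = np.array([4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.1])

# Maximum-likelihood estimates of the Gaussian parameters
mu_hat = data.mean()
sigma_hat = data.std()          # MLE uses the 1/n (biased) estimator, the NumPy default
print(f"estimated mean = {mu_hat:.3f}, std = {sigma_hat:.3f}")

# Density of a new observation under the fitted Gaussian
print(stats.norm.pdf(5.0, loc=mu_hat, scale=sigma_hat))

# Discrete case: probability of exactly 3 successes in 10 trials with p = 0.4
print(stats.binom.pmf(k=3, n=10, p=0.4))
```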

Bayesian Inference:

Bayesian Probability: Bayesian probability is a framework for updating beliefs about a hypothesis as evidence is collected. In machine
learning, Bayesian methods are used for model training and updating based on new information.

Bayesian Networks: These are graphical models that represent the probabilistic relationships among a set of variables. Bayesian networks
are commonly used in areas like medical diagnosis, fraud detection, and risk assessment.

Probabilistic Models:

Probabilistic Graphical Models (PGMs): PGMs, such as Bayesian networks and Markov networks, are used to represent and reason about
uncertainty in complex systems. They are especially useful in situations where variables are interdependent.

Hidden Markov Models (HMMs): HMMs are used for modeling time-series data with hidden states. They have applications in speech
recognition, bioinformatics, and finance.

Classification and Regression:

Logistic Regression: In binary classification problems, logistic regression models the probability of a given instance belonging to a particular
class.

Softmax Regression: In multi-class classification problems, softmax regression models the probabilities of an instance belonging to each
class.
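
As a brief sketch (using scikit-learn and its bundled iris data purely for illustration), both cases come down to calling `predict_proba`, which returns one probability per class for each instance:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# The iris dataset stands in for any multi-class classification problem
X, y = load_iris(return_X_y=True)

# For multi-class targets, recent scikit-learn versions fit a multinomial
# (softmax) logistic regression by default
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict_proba(X[:2]))   # class-membership probabilities per instance
print(clf.predict(X[:2]))         # the class with the highest probability
```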

Evaluation Metrics:

Probability Calibration: In binary classification, probability calibration ensures that predicted probabilities align well with actual outcomes.
Calibrated probabilities are important in applications like risk assessment and fraud detection.

Precision, Recall, F1-Score: These metrics are often used to evaluate the performance of classification models; they are computed from predicted class labels, which are typically obtained by thresholding the predicted probabilities.

Ensemble Methods:

Random Forest, Gradient Boosting: Ensemble methods often use probabilistic models at their core. They combine predictions from multiple
models to improve overall performance.

Naive Bayes is a classification technique based on Bayes' theorem, which assumes that all features that predict the target outcome are
independent of each other. It calculates the probability of each category and then selects the category with the highest probability. It has been
used successfully for many purposes, but is especially useful for natural language processing (NLP) problems.

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to that event.

What makes Naive Bayes a “Naive” algorithm?

The Naive Bayes classifier assumes that the features we use to predict the target are independent and do not affect each other. In real-life data, features usually do depend on each other in determining the target, but the Naive Bayes classifier ignores this.

Although the independence assumption is rarely correct in real-world data, it often works well in practice; this simplifying assumption is why the algorithm is called "Naive".

Math behind the Naive Bayes Algorithm

Given a feature vector X = (x1, x2, …, xn) and a class variable y, Bayes' theorem states that:

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$

We are interested in calculating the posterior probability P(y | X) from the likelihood P(X | y), the prior P(y), and the evidence P(X).

Using the chain rule, the likelihood P(X | y) can be decomposed as:

$$P(X \mid y) = P(x_1 \mid x_2, \dots, x_n, y)\, P(x_2 \mid x_3, \dots, x_n, y) \cdots P(x_n \mid y)$$

However, these conditional terms are hard to estimate, so Naive Bayes makes the "naive" assumption that the features are conditionally independent of each other given the class.

Thus, by conditional independence, we have:

$$P(X \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

Since the denominator P(X) remains the same for all classes, we can compare classes using only the numerator:

$$P(y \mid X) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The Naive Bayes classifier combines this model with a decision rule. The general rule is to choose the class with the highest posterior probability; this is called the maximum a posteriori (MAP) decision rule:

$$\hat{y} = \arg\max_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

How Naive Bayes really works:

To make it clearer, let's explain with an example:

Suppose we have a group of emails and we want to classify them as spam or not spam.

Our dataset contains 15 non-spam emails and 10 spam emails. After some analysis, the frequency of each word in each class was recorded (the word-frequency table is not reproduced here).

Note: Stop words like "the", "a", "on", "is", "all" have been removed, as they do not carry important meaning and are usually removed from texts. The same applies to numbers and punctuation.

Exploring some probabilities:

P(Dear | Not Spam) = 8/34, P(Visit | Not Spam) = 2/34, P(Dear | Spam) = 3/47, P(Visit | Spam) = 6/47,

and so on.

Now assume we have the message "Hello friend" and we want to know whether it is spam or not.

Using Bayes' theorem, and ignoring the denominator (which is the same for both classes):

$$P(\text{Not Spam} \mid \text{Hello friend}) \propto P(\text{Hello friend} \mid \text{Not Spam})\, P(\text{Not Spam})$$

and similarly for the Spam class.

But P(Hello friend | Not Spam) = 0, because the exact phrase "Hello friend" never appears in our dataset: the table records single words, not whole sentences. For the same reason, P(Hello friend | Spam) = 0 as well, which would make both posterior scores zero and tell us nothing.

But wait: we said that Naive Bayes assumes that the features we use to predict the target are independent.

So, under the conditional independence assumption, we can multiply the per-word probabilities instead:

$$P(\text{Hello friend} \mid \text{Not Spam}) = P(\text{Hello} \mid \text{Not Spam}) \times P(\text{friend} \mid \text{Not Spam})$$

Multiplying by the prior P(Not Spam) gives the score for the Not Spam class. Now let's calculate the probability of being spam using the same procedure and compare the two scores.

The Not Spam score turns out to be larger, so the message "Hello friend" is classified as not spam.
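
The calculation above can be written as a short Python sketch. The per-word counts below are hypothetical (the original frequency table is not reproduced here); only the class totals of 34 non-spam words and 47 spam words, and the 15/10 email split, follow the text.

```python
# Hypothetical per-class word counts (assumed for illustration)
not_spam_counts = {"dear": 8, "visit": 2, "hello": 5, "friend": 3}
spam_counts     = {"dear": 3, "visit": 6, "hello": 2, "friend": 1}
total_not_spam_words, total_spam_words = 34, 47
n_not_spam, n_spam = 15, 10                    # number of emails in each class

def class_score(words, counts, total_words, prior):
    """Unnormalised posterior: prior * product of per-word likelihoods."""
    score = prior
    for w in words:
        score *= counts.get(w, 0) / total_words
    return score

message = ["hello", "friend"]
prior_not_spam = n_not_spam / (n_not_spam + n_spam)
prior_spam = n_spam / (n_not_spam + n_spam)

score_not_spam = class_score(message, not_spam_counts, total_not_spam_words, prior_not_spam)
score_spam = class_score(message, spam_counts, total_spam_words, prior_spam)

print("not spam" if score_not_spam > score_spam else "spam")
```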

Pros and Cons for Naive Bayes

Pros:

Requires a small amount of training data, so training takes less time.

Handles continuous and discrete data, and is not sensitive to irrelevant features.

Very simple, fast, and easy to implement.

Can be used for both binary and multi-class classification problems.

Highly scalable, as it scales linearly with the number of predictor features and data points.

When the Naive Bayes conditional independence assumption holds true, it converges more quickly than discriminative models like logistic regression.

Cons:

The assumption of independent predictors/features: Naive Bayes implicitly assumes that all attributes are mutually independent, which is almost impossible to find in real-world data.

If a categorical variable has a value that appears in the test dataset but was not observed in the training dataset, the model will assign it a zero probability and will not be able to make a prediction. This is called the "zero-frequency problem", and it can be solved using smoothing techniques, as shown in the sketch below.
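
As a minimal sketch of smoothing in practice, scikit-learn's MultinomialNB applies Laplace (add-one) smoothing through its `alpha` parameter; the toy emails below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration (1 = spam, 0 = not spam)
emails = ["win a free prize now", "meeting schedule for friday",
          "free prize claim now", "project update and schedule"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(emails)

# alpha=1.0 gives every word a small non-zero count in every class,
# so an unseen word no longer zeroes out the whole product
model = MultinomialNB(alpha=1.0).fit(X, labels)

print(model.predict_proba(vec.transform(["urgent prize inside"])))
```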

Applications of Naive Bayes Algorithm


Real-time Prediction.

Multi-class Prediction.
Text classification/ Spam Filtering/ Sentiment Analysis.

Recommendation Systems.

Support Vector Machine (SVM)

Support Vector Machines (SVM) is a powerful supervised machine learning algorithm used for classification and regression tasks. The primary goal of SVM is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly separates the data points of different classes.

Here's a brief explanation of SVM:

Hyperplane:

In a 2-dimensional space, a hyperplane is a line. In 3-dimensional space, it is a plane, and so on. In general, a hyperplane is an (N-1)-dimensional subspace of an N-dimensional space. For a binary classification problem (two classes), the SVM aims to find the hyperplane that best separates the two classes.

Support Vectors:

Support vectors are the data points that are closest to the hyperplane and have the most significant influence on its position. These support vectors essentially "support" the optimal separation between classes.

Margin:

The margin is the distance between the hyperplane and the nearest data point from each class (the support vectors). SVM aims to maximize this margin because a larger margin generally leads to a more robust and generalized model.

Kernel Trick:

SVMs can efficiently handle non-linear decision boundaries through the kernel trick. The kernel function implicitly maps the input features into a higher-dimensional space, making it possible to find a hyperplane that separates the classes. Common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid.

C Parameter:

The regularization parameter C is crucial in SVM. It determines the trade-off between having a smooth decision boundary and classifying training points correctly. A small C encourages a larger margin but may misclassify more points, while a large C classifies more training points correctly but may result in a smaller margin.

Soft Margin SVM:

In real-world scenarios, data may not always be perfectly separable. In such cases, SVM can be adapted to allow some misclassifications by introducing slack variables. Soft margin SVM finds a balance between maximizing the margin and minimizing the misclassification errors.

Multi-Class Classification:

SVM can be extended to multi-class classification using techniques like one-vs-one or one-vs-all.

In summary, SVM is a versatile algorithm suitable for both linear and non-linear classification tasks. Its ability to handle high-dimensional spaces and the flexibility provided by different kernel functions make it a popular choice in various applications, including image recognition, text classification, and bioinformatics.
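
The sketch below (using scikit-learn, with a made-up two-moons dataset) ties the kernel trick and the C parameter together: an RBF kernel handles the non-linear boundary, while C sets the margin-vs-misclassification trade-off.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy, non-linearly separable data: two interleaving half-moons
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel for the non-linear boundary; C controls the soft-margin trade-off
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
print("support vectors per class:", model.named_steps["svc"].n_support_)
```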
Margin

The margin is the minimum distance from the hyperplane to any of the observations.

Functional Margin: the theoretical definition of the margin, based on the raw (unnormalized) value of the decision function.

Geometric Margin: the functional margin normalized by the length of the weight vector, i.e. the actual distance to the hyperplane.

A wider margin means future points can be classified with greater certainty.

Why maximize Margin ?

Points near the decision surface represent uncertain classification decisions (roughly 50% either way).

A classifier with a large margin makes no low-certainty classification decisions.

It gives a classification safety margin with respect to slight errors in measurement.


Why maximize Margin ?

An SVM classifier places a large margin around the decision boundary.

This acts as a "fat separator" between the classes.

There are fewer choices of where a fat separator can be placed, in comparison to a thin hyperplane.

Linear SVM Mathematically


Hard Margin

Finding the classifier

Solve a constrained optimization problem (a standard form is sketched below).
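
The optimization problem itself is not reproduced in this text; a standard statement of the hard-margin linear SVM, assuming labels $y_i \in \{-1, +1\}$, is:

$$
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, N.
$$

Minimizing $\tfrac{1}{2}\lVert w \rVert^{2}$ is equivalent to maximizing the geometric margin $1 / \lVert w \rVert$.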


The real world: not so clean!
Hard vs. Soft Margin

Soft Margin SVM:

Always has a solution.

More robust to outliers.

Hard Margin SVM:

Requires no regularization parameter (there is no C to tune).

Why Dual Problem


Why Dual?

High-dimensional data: the number of features is much larger than the number of samples (p >> N), e.g. image data or genetic data.

When p >> N, we also have N·p >> N·N, so the N × N matrix of pairwise inner products between samples is much smaller than the N × p data matrix.

The dual problem depends on the data only through these N × N inner products, which leads to the 'kernel' formulation.
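
A standard form of the hard-margin dual problem (stated here as an assumption, since the original slide's equations are not reproduced) depends on the training data only through the $N \times N$ matrix of pairwise inner products:

$$
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i \;-\; \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, \langle x_i, x_j \rangle
\quad \text{subject to} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{N} \alpha_i y_i = 0.
$$

Replacing each inner product $\langle x_i, x_j \rangle$ with a kernel $K(x_i, x_j)$ gives the kernelized SVM without ever computing coordinates in the higher-dimensional space.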

Decision Trees

A decision tree is a type of non-parametric supervised learning algorithm characterized by its hierarchical, tree-like structure. The tree
comprises various components, including a root node, branches, internal nodes, and leaf nodes.

The foundational concept of decision trees has greatly influenced classical machine learning algorithms such as Random Forests, Bagging,
and Boosted Decision Trees. The fundamental idea behind decision trees is to represent data using a tree structure. In this structure, each
internal node serves as a test on a specific attribute, essentially representing a condition. Each branch emanating from an internal node
signifies an outcome of the corresponding test, and each leaf node, also known as a terminal node, holds a class label.

This representation allows decision trees to make decisions based on a series of attribute tests, leading to a path from the root to a specific
leaf node. Decision trees are valuable in classification and regression tasks due to their interpretability, ease of understanding, and ability to
capture complex decision boundaries in the data. The tree structure is utilized to segment the feature space into regions, with each leaf node
corresponding to a particular class or regression value.

Before learning more about decision trees let’s get familiar with some of the terminologies.

Root Nodes: These nodes mark the starting point of a decision tree, initiating the division of the population based on various features.

Decision Nodes: After the root node, the resulting nodes from subsequent splits are referred to as decision nodes.

Leaf Nodes: Nodes where further division is not feasible are termed as leaf nodes or terminal nodes.

Branch/Sub-tree: Similar to a sub-graph in a graph, a subsection of a decision tree is known as a sub-tree or branch.

Pruning: This process involves selectively removing nodes to prevent overfitting and improve the generalization ability of the decision tree.
Pruning ensures a more balanced and accurate model by trimming unnecessary branches and nodes.
Why use decision trees?

Reflecting Human Thinking: Decision trees often mirror the logical reasoning inherent in human decision-making. This characteristic makes
them intuitive and straightforward for individuals to comprehend.

Transparent Structure: Decision trees adopt a tree-like structure, contributing to the ease of understanding the underlying logic. The clear and
visual representation simplifies the interpretation of decision-making processes within the algorithm.

Decision Tree Example

Let's explore the concept of decision trees through an example. Decision trees are employed for constructing classification or regression
models, structured in the form of a tree. This process involves breaking down a dataset into progressively smaller subsets while
simultaneously developing an associated decision tree. The outcome is a tree structure comprising decision nodes and leaf nodes.

A decision node (e.g., "Outlook") features two or more branches (e.g., "Sunny," "Overcast," and "Rainy"). Meanwhile, a leaf node (e.g.,
"Play") signifies a specific classification or decision. The highest decision node in the tree, corresponding to the most influential predictor, is
referred to as the root node. Notably, decision trees exhibit versatility by accommodating both categorical and numerical data in their
construction.
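
A minimal sketch of this example with scikit-learn follows; the tiny weather table is invented to mirror the Outlook/Play example, and the categorical features are ordinally encoded for simplicity.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up "play tennis"-style data in the spirit of the Outlook/Play example
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast", "Sunny", "Rainy"],
    "Windy":   ["Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"],
})

enc = OrdinalEncoder()                              # encode categories as numbers
X = enc.fit_transform(data[["Outlook", "Windy"]])
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned tree as nested if/else rules
print(export_text(tree, feature_names=["Outlook", "Windy"]))
```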

Entropy

Entropy is a measure of impurity in a set of samples; it quantifies the randomness of the class labels.

Entropy comes from information theory: the higher the entropy, the higher the information content (uncertainty).

$$H(X) = -\sum_{i} p_i \log_2 p_i$$

where X is the total sample and p_i is the probability (proportion) of category i in X.

Gini Index

The Gini index is a measure of impurity (or purity) used when building decision trees with the CART (Classification and Regression Trees) algorithm.

An attribute with a low Gini index should be preferred over an attribute with a high Gini index.

The CART algorithm uses the Gini index to create binary splits.

The Gini index can be calculated using the following formula:

$$Gini = 1 - \sum_{i} p_i^2$$

Information Gain

We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be
learned.

Information gain tells us how important a given attribute of the feature vectors is.

We will use it to decide the ordering of attributes in the nodes of a decision tree.

Information Gain = Entropy(parent) − [weighted average Entropy(children)]
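
A minimal NumPy sketch of these three quantities, with a small invented split for illustration:

```python
import numpy as np

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the categories present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini index = 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, children):
    """Entropy(parent) minus the weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted_child_entropy

# Example: a node with 6 "yes" / 4 "no" labels split into two child nodes
parent = ["yes"] * 6 + ["no"] * 4
children = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]

print("entropy:", entropy(parent))
print("gini:", gini(parent))
print("information gain of the split:", information_gain(parent, children))
```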

Decision Tree to Decision Rules

A decision tree can easily be transformed into a set of rules by tracing the path from the root node to each leaf node, one rule per leaf.

Pruning is a crucial process in decision tree construction aimed at optimizing the tree's structure. The primary goal is to strike a balance
between a tree that is too large, increasing the risk of overfitting, and one that is too small, potentially missing important features in the
dataset. Pruning helps reduce the size of the tree while maintaining or even improving accuracy. Two commonly used tree pruning
techniques are:

Cost Complexity Pruning: Cost Complexity Pruning involves the use of a hyperparameter known as the "cost-complexity parameter" (often
denoted as alpha). This parameter controls the trade-off between the complexity of the tree and its fit to the training data. By iteratively
pruning subtrees based on this parameter, the algorithm seeks to find the optimal balance, leading to a tree with improved generalization
performance on unseen data.
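
A minimal sketch with scikit-learn, whose `ccp_alpha` parameter implements cost complexity pruning (the breast-cancer dataset here is just a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values for this training set, from no pruning to full pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit with one mid-range alpha; larger alphas prune more aggressively
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

print("alpha:", alpha)
print("leaves:", pruned.get_n_leaves())
print("test accuracy:", pruned.score(X_test, y_test))
```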

Reduced Error Pruning: Reduced Error Pruning, also known as post-pruning, is a technique where the tree is initially allowed to grow to its full
extent, capturing as much detail from the training data as possible. Subsequently, nodes are removed based on their contribution to reducing
errors on a validation dataset. This process continues until further pruning does not result in a significant improvement in performance.

These pruning methods are essential for preventing overfitting and enhancing the model's ability to generalize to new data. By carefully
removing unnecessary nodes, decision trees become more interpretable, computationally efficient, and better suited for real-world
applications.

What is Ensemble Learning?

Ensemble methods represent a robust approach to enhancing model performance by combining predictions from multiple models. Leveraging several models together typically results in better outcomes than any single model on its own.

Types of Ensemble Learning

Bagging or Bootstrap Aggregation — Random Forest

Boosting — AdaBoost, XGBoost and Gradient Boosting


Why do we use Ensemble Techniques?

These techniques help reduce variance (bagging) and bias (boosting), and improve overall predictions.


Bagging: Bagging, as exemplified by Random Forest, effectively addresses overfitting concerns and exhibits reduced training time. While
there is occasional bias increase, variance is mitigated, and the use of independent parallel classifiers contributes to its robust performance.

Boosting: Gradient Boosting, a representative of boosting techniques, may encounter overfitting issues, which can be mitigated through
parameter tuning. Boosting is adept at reducing bias and is characterized as a set of sequential classifiers, emphasizing its sequential nature
in model building.

Boosting

Boosting algorithms are a family of algorithms that combine weak learners into a strong learner.

What is the idea behind boosting algorithms?

Boosting algorithms aim to learn weak classifiers that exhibit slight correlation with the true classification. These weak classifiers are then
combined to form a strong classifier that demonstrates high correlation with the true classification.

Illustrating this concept with the mail spam detection problem involves breaking down the task into several steps:

Identifying whether the email contains a phrase such as 'you have earned a prize'.

Checking whether the email contains only an image.

Determining the sender of the email.

Assessing how caps lock was used in the email.

Examining the subject line of the email.

Each of these steps represents a weak classifier, individually insufficient for determining whether the email is spam. However, when
combined, these weak classifiers work synergistically, resulting in a robust and accurate spam detection system with a higher probability of
correctly identifying spam emails.

How does a boosting algorithm work?

Boosting algorithms operate through an iterative process, progressively learning weak classifiers and integrating them into a final strong
classifier. The inclusion of weak classifiers is typically weighted based on their accuracy. Following each iteration, the training data undergoes
reweighting, with misclassified instances gaining weight and correctly classified instances losing weight. In subsequent iterations, the focus of
the learner is primarily on instances that were previously misclassified. The distinctive characteristics of various boosting algorithms primarily
stem from the specific reweighting approaches applied to the training set.
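
A minimal sketch using scikit-learn's AdaBoostClassifier, which implements this kind of reweighting scheme; the synthetic dataset is made up for illustration, and the default weak learner is a depth-1 decision tree (a "stump"):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round up-weights previously misclassified points and adds
# another weighted weak learner (by default, a depth-1 decision stump)
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boost.fit(X_train, y_train)

print("test accuracy:", boost.score(X_test, y_test))
```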

Bagging (Bootstrap Aggregating)

As discussed before, bagging is an ensemble technique mainly used to reduce the variance of our predictions by combining the results of multiple classifiers modelled on different sub-samples of the same dataset.

The main steps involved in bagging are:

Generating Multiple Datasets: Through sampling with replacement from the original dataset, new datasets are created.

Constructing Multiple Classifiers: Each of these smaller datasets undergoes the building of a classifier, typically using the same model across
all datasets.

Combining Classifiers: The predictions from each individual classifier are then aggregated to form an improved classifier, often characterized
by significantly reduced variance.

Bagging operates akin to a "Divide and Conquer" strategy, involving a collection of predictive models executed on diverse subsets derived
from the original dataset. These models are subsequently amalgamated to enhance accuracy and promote model stability.
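
A minimal sketch of these steps using scikit-learn (with a synthetic dataset invented for illustration), comparing a single tree against bagged trees and a random forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging trains each tree on a bootstrap sample and aggregates the predictions;
# a random forest additionally randomizes the features considered at each split
models = [
    ("single tree", DecisionTreeClassifier(random_state=0)),
    ("bagged trees", BaggingClassifier(n_estimators=100, random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]

for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```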

Gradient Boosting

Gradient Boosting is an ensemble learning technique that works by combining the predictions of multiple weak learners, typically decision trees, to create a strong predictive model. The key idea behind Gradient Boosting is to iteratively improve the model by fitting a weak learner to the residual errors of the existing model.

How does Gradient Boosting work? (Iterative Corrections)

Compute the Target Column Average:

Start by determining the average of the target column to establish an initial prediction.

Residual Calculation:

Calculate residuals by comparing actual and predicted values; the residuals are the differences between them.

Model Training with Residuals:

Train a model using the residuals as the target variable, aiming to capture patterns not considered in the initial prediction.

Update Predictions:

Adjust the current prediction by incorporating the predictions obtained from the newly trained model.

Optimize the Loss Function:

Enhance the model by optimizing the loss function of the previous learner, contributing to a refined and more accurate predictive model.
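
These iterative corrections can be sketched from scratch for a regression target; this is an illustrative toy implementation (made-up data, small regression trees as weak learners), not a production one:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # toy regression data

learning_rate, n_rounds = 0.1, 100

# Step 1: the initial prediction is the average of the target column
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                      # Step 2: residuals
    tree = DecisionTreeRegressor(max_depth=2)       # Step 3: fit a weak learner to them
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # Step 4: update the running prediction
    trees.append(tree)

# Each round takes a small step that reduces the squared-error loss
print("training MSE:", float(np.mean((y - prediction) ** 2)))
```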
XGBoost

XGBoost, short for eXtreme Gradient Boosting, is an optimized and efficient implementation of the gradient boosting algorithm. It has gained popularity in machine learning competitions and various applications due to its speed and performance. XGBoost is designed to be highly scalable and provides high predictive accuracy.

Here are the key components and aspects of XGBoost:

Gradient Boosting Framework:

XGBoost belongs to the family of ensemble learning methods, specifically the gradient boosting framework. It builds a series of weak learners (typically decision trees) sequentially, with each subsequent learner correcting the errors of the previous ones.

Objective Function:

XGBoost uses a customizable objective function that combines a loss function and a regularization term. The objective function is optimized during the training process, ensuring a robust and accurate model.

Regularization:

XGBoost incorporates L1 (Lasso) and L2 (Ridge) regularization terms into its objective function to prevent overfitting and improve the generalization of the model.

Tree Pruning:

XGBoost employs a technique called "pruning" during the construction of decision trees. Pruning helps avoid the growth of deep, overfit trees, contributing to the overall efficiency and interpretability of the model.

Parallel and Distributed Computing:

XGBoost is designed to take advantage of parallel and distributed computing capabilities. It can efficiently handle large datasets and speed up the training process by utilizing multiple cores or distributed computing environments.

Handling Missing Values:

XGBoost has a built-in mechanism for handling missing values in the dataset. It automatically learns the best way to route missing values during the training process.

Feature Importance:

XGBoost provides feature importance scores, allowing users to interpret the contribution of each feature to the model's predictions. This information aids in feature selection and in understanding the model's behavior.

Cross-Validation:

XGBoost facilitates the use of cross-validation during the training process, enabling robust model evaluation and hyperparameter tuning.

XGBoost's efficiency, scalability, and effectiveness make it a popular choice for various machine learning tasks, including classification, regression, and ranking problems.
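
A minimal sketch of XGBoost's scikit-learn-style interface follows; it assumes the `xgboost` Python package is installed, and the dataset and hyperparameter values are chosen only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # assumes the xgboost package is installed

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,      # number of boosted trees
    max_depth=4,           # depth of each tree
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    reg_lambda=1.0,        # L2 (Ridge) regularization term
    reg_alpha=0.0,         # L1 (Lasso) regularization term
    random_state=0,
)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
print("top feature importances:", model.feature_importances_[:5])
```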

Credits

https://www.kaggle.com/
https://medium.com/
https://www.wikipedia.org/

