Classification and Clustering Algorithm Notes

The document provides an overview of classification algorithms, a supervised learning technique used to categorize new observations based on training data. It details various types of classification algorithms, including Logistic Regression, Naive Bayes, K-Nearest Neighbors, Decision Trees, Random Forests, and Support Vector Machines, along with their evaluation methods. Additionally, it briefly mentions K-Means clustering as a separate technique for grouping data points into clusters.

The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories. Unlike regression, the output variable of Classification is a category, not a continuous value, e.g. "Green or Blue" or "fruit or animal". Since the Classification algorithm is a Supervised Learning technique, it takes labelled input data, which means each input comes with the corresponding output.

Classification algorithms can be further divided into two main categories:

o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification

Learners in Classification Problems


There are two types of learners.

 Lazy Learners

A lazy learner stores the training dataset and waits until the test dataset arrives. The classification is then carried out using the most relevant data in the stored training dataset. Less time is spent on training, but more time is spent on predictions. Examples include case-based reasoning and the KNN algorithm.

 Eager Learners

Eager learners build a classification model from the training dataset before a test dataset is received. They spend more time on training and less time on predictions. Examples include ANN, Naive Bayes, and decision trees.
Types of Classification Algorithms

1. Logistic Regression

It is a supervised learning classification technique that forecasts the probability of a target variable. There is a choice between only two classes: data can be coded as 1 or yes, representing success, or as 0 or no, representing failure. Logistic regression is used when the prediction is categorical, such as true or false, yes or no, or 0 or 1. For example, a logistic regression model can be used to determine whether or not an email is spam.
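As a minimal sketch (assuming scikit-learn is available; the features and data points below are made up purely for illustration), a logistic regression spam classifier could look like this:

# Minimal logistic regression sketch with scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per email: [number of links, count of the word "free"]
X = np.array([[0, 0], [1, 0], [5, 3], [7, 4], [0, 1], [6, 5]])
y = np.array([0, 0, 1, 1, 0, 1])  # 0 = not spam, 1 = spam

model = LogisticRegression()
model.fit(X, y)

# Predicted probability that a new email is spam
print(model.predict_proba([[4, 2]])[0][1])

The model outputs a probability between 0 and 1, which is then thresholded (typically at 0.5) to decide between the two classes.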

2. Naive Bayes

Naive Bayes determines whether a data point falls into a particular category. It can be used to classify phrases or words in text analysis as either falling within a predetermined classification or not. It assumes that the predictors in a dataset are independent, i.e. that the features are unrelated to each other. For example, if given a banana, the classifier will see that the fruit is yellow in color, oblong-shaped, long, and tapered. All of these features contribute independently to the probability of it being a banana and are not dependent on each other. Naive Bayes is based on Bayes' theorem, which is given as:

P(A | B) = P(B | A) × P(A) / P(B)

Where :

P(A | B) = how often A happens given that B happens

P(A) = how likely A will happen

P(B) = how likely B will happen

P(B | A) = how often B happens given that A happens


For example, consider the following labelled training sentences for a small "Sports / Not Sports" text classifier (a short code sketch follows the table):

Text                                        Tag
“A great game”                              Sports
“The election is over”                      Not Sports
“What a great score”                        Sports
“A clean and unforgettable game”            Sports
“The spelling bee winner was a surprise”    Not Sports
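A minimal sketch of such a classifier (assuming scikit-learn; a bag-of-words multinomial Naive Bayes is used here as one common choice), trained on the five sentences above, might look like this:

# Naive Bayes text classification sketch on the tiny "Sports / Not Sports" table.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "A great game",
    "The election is over",
    "What a great score",
    "A clean and unforgettable game",
    "The spelling bee winner was a surprise",
]
tags = ["Sports", "Not Sports", "Sports", "Sports", "Not Sports"]

# Count word occurrences, then apply Bayes' theorem with the independence assumption
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, tags)

print(model.predict(["A very close game"]))  # expected: ['Sports']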

3. K-Nearest Neighbors

It calculates the likelihood that a data point will join a group based on which group the data points closest to it belong to. When using k-NN for classification, you determine how to classify the data according to its nearest neighbours.

The parameter k in kNN refers to the number of labeled points (neighbours) considered for classification; its value determines how many of these points are used to decide the result. Our task is to calculate the distances and identify which categories are closest to our unknown entity.

Given a point whose class we do not know, we can try to understand which
points in our feature space are closest to it. These points are the k-nearest
neighbors. Since similar things occupy similar places in feature space, it’s
very likely that the point belongs to the same class as its neighbors. Based on
that, it’s possible to classify a new point as belonging to one class or another.
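As a minimal sketch (assuming scikit-learn and a small made-up 2-D dataset), classifying a new point by its k nearest neighbours could look like this:

# k-NN classification sketch with k = 3 (illustrative data).
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [8, 6]]  # hypothetical 2-D points
y = [0, 0, 0, 1, 1, 1]                                # their class labels

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# The 3 nearest neighbours of each query point vote on its class
print(knn.predict([[2, 2], [7, 6]]))  # expected: [0 1]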

Some methods for selecting a suitable value of k are described below, followed by a short code sketch.

1. Square root method

The optimal K value can be calculated as the square root of the total number of samples in the training dataset. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but on data with many outliers it can fail, and you'll need to use other methods.

2. Cross-validation method (Elbow method)

Begin with k=1, then perform cross-validation (5 to 10 fold – these figures are
common practice as they provide a good balance between the computational
efforts and statistical validity), and evaluate the accuracy. Keep repeating the
same steps until you get consistent results. As k goes up, the error usually
decreases, then stabilizes, and then grows again. The optimal k lies at the
beginning of the stable zone.
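A minimal sketch of this search (assuming scikit-learn and some feature matrix X with labels y; the helper name choose_k is made up here) could look like this, using the square root rule above to cap the range of k values tried:

# Choose k by cross-validated accuracy (elbow-style search), capped near sqrt(n).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, max_k=None):
    if max_k is None:
        max_k = int(np.sqrt(len(X)))  # square root rule as an upper bound
    scores = {}
    for k in range(1, max_k + 1):
        knn = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(knn, X, y, cv=5).mean()  # 5-fold CV accuracy
    return max(scores, key=scores.get)  # k with the best average accuracy

In practice the accuracy-versus-k values would also be plotted to see where the curve flattens out.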

K-distance is the distance between data points and a given query point. To
calculate it, we have to pick a distance metric.
Some of the most popular metrics are explained below.

Euclidean distance

The Euclidean distance between two points is the length of the straight line
segment connecting them. This most common distance metric is applied to
real-valued vectors.

Manhattan distance

The Manhattan distance between two points is the sum of the absolute
differences between the x and y coordinates of each point. Used to measure
the minimum distance by summing the length of all the intervals needed to get
from one location to another in a city, it’s also known as the taxicab distance.
Minkowski distance

Minkowski distance generalizes the Euclidean and Manhattan distances. It adds a parameter called "order" that allows different distance measures to be calculated. Minkowski distance indicates a distance between two points in a normed vector space.

Hamming distance

Hamming distance is used to compare two binary vectors (also called data
strings or bitstrings). To calculate it, data first has to be translated into a binary
system.
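As a small sketch, the four metrics above can be computed with SciPy (assuming it is available) for a pair of example vectors:

# Distance metric sketch using scipy.spatial.distance.
from scipy.spatial import distance

a, b = [1, 2, 3], [4, 0, 3]

print(distance.euclidean(a, b))        # straight-line (Euclidean) distance
print(distance.cityblock(a, b))        # Manhattan / taxicab distance
print(distance.minkowski(a, b, p=3))   # Minkowski distance with order p = 3
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # fraction of positions that differ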

4. Decision Tree

A decision tree is an example of supervised learning. Although it can solve both regression and classification problems, it excels at classification. Similar to a flow chart, it divides data points into two similar groups at a time, starting with the "tree trunk" and moving through the "branches" and "leaves" until the categories are more closely related to one another.

A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data, and they use all predictors while allowing for dependence between them.

Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided the entropy is one.

To build a decision tree, we need to calculate two types of entropy using frequency tables, as follows:
a) Entropy using the frequency table of one attribute: E(S) = −Σ p(i) log2 p(i), summed over the classes i of the target.

b) Entropy using the frequency table of two attributes: E(T, X) = Σ P(c) E(c), the weighted sum of the entropies of the subsets c produced by splitting T on attribute X.

Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches). The steps below walk through this; a short calculation sketch follows Step 5.

Step 1: Calculate the entropy of the target.

Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated and added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy.

Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.

Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
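A minimal sketch of these entropy and information gain calculations (plain Python; the 9-versus-5 class counts below are made up for illustration) might look like this:

# Entropy and information gain, as used by ID3 to pick the splitting attribute.
from collections import Counter
from math import log2

def entropy(labels):
    # E(S) = sum over classes of -p * log2(p)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, branches):
    # Gain = entropy before the split minus the weighted entropy after the split
    total = len(labels)
    after = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(labels) - after

target = ["Yes"] * 9 + ["No"] * 5          # hypothetical target column
branches = [["Yes"] * 6 + ["No"] * 1,      # one branch after a split
            ["Yes"] * 3 + ["No"] * 4]      # the other branch
print(entropy(target))                     # about 0.94
print(information_gain(target, branches))

The attribute whose split yields the largest information gain becomes the decision node.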
5. Random Forest Algorithm

The random forest algorithm is an extension of the Decision Tree algorithm: you first create a number of decision trees using the training data and then pass new data through all of the created trees, which together form the "random forest". The predictions of the individual trees are averaged (or put to a majority vote) to produce the final result. These models are great for mitigating the decision tree's problem of forcing data points unnecessarily into a category.

The following steps explain the working of the Random Forest algorithm:

Step 1: Select random samples from the given data or training set.

Step 2: The algorithm constructs a decision tree for every sample of training data.

Step 3: Voting takes place over the predictions of the decision trees (averaging them for regression).

Step 4: Finally, select the most voted prediction result as the final prediction result.

This combination of multiple models is called an Ensemble. Ensembles use two methods (a short code sketch follows this list):

1. Bagging: Creating different training subsets from the sample training data with replacement is called Bagging. The final output is based on majority voting.

2. Boosting: Combining weak learners into strong learners by creating sequential models such that the final model has the highest accuracy is called Boosting. Examples: AdaBoost, XGBoost.
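A minimal sketch of both approaches (assuming scikit-learn; the dataset here is generated only to make the example runnable) might look like this:

# Bagging-style ensemble (random forest) vs boosting-style ensemble (AdaBoost).
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Bagging: many trees trained on bootstrap samples, combined by majority vote
bagging = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting: weak learners built sequentially, each focusing on earlier mistakes
boosting = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

print(bagging.predict(X[:3]), boosting.predict(X[:3]))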

6. Support Vector Machine


Support Vector Machine is a popular supervised machine learning technique for classification and regression problems. It goes beyond simple X/Y prediction by classifying and training the data according to polarity, i.e. which side of the decision boundary a data point falls on.

The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily
put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Picture a diagram in which two different categories are separated by a decision boundary or hyperplane.
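A minimal sketch (assuming scikit-learn; the 2-D points are made up for illustration) of a linear SVM separating two categories could look like this:

# Linear SVM sketch: fit a hyperplane and inspect its support vectors.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [7, 8], [8, 8], [9, 7]]  # two well-separated groups
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear")
svm.fit(X, y)

print(svm.support_vectors_)           # the extreme points that define the hyperplane
print(svm.predict([[3, 3], [8, 6]]))  # expected: [0 1]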

Evaluating a Classification model:


Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. For evaluating a Classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
o For a good binary Classification model, the value of log loss should be
near to 0.
o The value of log loss increases if the predicted value deviates from the
actual value.
o The lower log loss represents the higher accuracy of the model.
o For Binary classification, cross-entropy can be calculated as:
−(y log(p) + (1 − y) log(1 − p))

Where y = actual output and p = predicted probability.
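A tiny sketch of this calculation (plain Python, with made-up labels and predicted probabilities) is shown below:

# Binary cross-entropy (log loss), averaged over all samples.
from math import log

def log_loss(y_true, y_prob):
    losses = [-(y * log(p) + (1 - y) * log(1 - p)) for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

print(log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))  # close to 0, i.e. a good model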

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output that describes the performance of the model.
o It is also known as the error matrix.
o The matrix presents the prediction results in a summarized form, showing the total numbers of correct and incorrect predictions. The matrix looks like the table below:
                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o The AUC-ROC curve is also used to visualize the performance of multi-class classification models.
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
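As a minimal sketch (assuming scikit-learn, with made-up true labels and predicted probabilities), the confusion matrix and ROC AUC can be computed like this:

# Confusion matrix and AUC-ROC sketch for a binary classifier.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # hypothetical actual labels
y_prob = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]  # predicted probability of class 1
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]    # threshold at 0.5

print(confusion_matrix(y_true, y_pred))  # scikit-learn puts actual labels on the rows
print(roc_auc_score(y_true, y_prob))     # area under the ROC curve (1.0 is perfect)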

K-Means Clustering Algorithm

The K-means clustering algorithm computes centroids and repeats until the optimal centroids are found. It assumes that the number of clusters is known in advance. It is also known as the flat clustering algorithm. The number of clusters found in the data by the method is denoted by the letter 'K' in K-means.

In this method, data points are assigned to clusters in such a way that the sum of the squared distances between the data points and their centroid is as small as possible. It is essential to note that lower diversity within a cluster means the data points in that cluster are more similar to one another.

The following stages will help us understand how the K-Means clustering technique works (a short code sketch follows these steps):
 Step 1: First, we need to provide the number of clusters, K, that need to be generated by this algorithm.
 Step 2: Next, choose K data points at random and assign each to a cluster; in brief, the data is partitioned based on these initial points.
 Step 3: The cluster centroids will now be computed.
 Step 4: Iterate the steps below until we find the ideal centroids, i.e. until the assignment of data points to clusters no longer changes.
 4.1 The sum of squared distances between data points and centroids is calculated first.
 4.2 Each data point is then allocated to the cluster whose centroid is closest to it.
 4.3 Finally, the centroids of the clusters are recomputed by averaging all of each cluster's data points.
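As a minimal sketch (assuming scikit-learn and a small made-up 2-D dataset), running K-Means with K = 2 could look like this:

# K-Means sketch: two clusters on six illustrative 2-D points.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [2, 3], [8, 7], [9, 8], [8, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)                     # cluster assignment of each point
print(kmeans.cluster_centers_)            # the final centroids
print(kmeans.predict([[0, 0], [9, 9]]))   # new points go to the nearest centroid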

Applications of K-Means clustering


 To get relevant insights from the data we’re dealing with.
 Distinct models will be created for different subgroups in a
cluster-then-predict approach.
 Market segmentation
 Document Clustering
 Image segmentation
 Image compression
 Customer segmentation
 Analysing the trend on dynamic data
