Clustering in Machine Learning

Cluster analysis, or clustering, is an unsupervised learning method, which means that the machine
being trained has to deal with unclassified, unlabelled information and act on it with no guidance.
The machine has to find the structure of the data it's given by itself. With clustering, this
unlabelled dataset is divided into groups according to the similarities between data points. The
data points in the same cluster have to share more similarities with each other than they do with
data points in other clusters. The clusters are formed based on similar patterns, such as shape,
size, colour or behaviour, found in the unclassified dataset, and the data is divided according to
the presence or absence of those shared patterns. Clustering is especially useful for dealing with
data you have no prior knowledge of. Different clustering methods have a wide variety of application
areas, such as medical imaging, sequence analysis, human genetic clustering, analysis of
antimicrobial activity, market research, social network analysis, search result grouping, anomaly
detection, image segmentation, crime analysis, data mining and much more. Some of the most widely
used and best-known clustering algorithms are the following:

Centroid Based Clustering (also known as Partitioning Clustering):

In this method the data is divided into non-hierarchical groups. This method's most common
example is K-Means Clustering, in which the dataset is divided into a predetermined number k of
groups according to the distance of each data point to each cluster centre, which we call a
centroid. The centroids are chosen so that the distance between the data points in a cluster and
their own cluster centre is smaller than their distance to the centroids of other clusters. Since
the underlying optimization problem is NP-hard, the algorithm is iterative: the solution is
approximated over several passes. K-Means Clustering is mostly used in areas such as computer
vision, astronomy and market segmentation.
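
As an illustration, here is a minimal sketch of K-Means with scikit-learn on synthetic data (the dataset and the choice of k = 3 are illustrative assumptions, not part of the original write-up):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

#generate an unlabelled 2-D dataset with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

#ask for k=3 clusters; fit_predict assigns each point to its nearest centroid
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  #the learned centroids
print(labels[:10])              #cluster index of the first ten points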

Hierarchical Clustering (or Connectivity Based Clustering):

This is similar to centroid-based clustering in that the grouping is also done based on the
proximity of data points, on the assumption that data points that are closer share more
similarities than ones that are more distant. As a result of this grouping we get clusters that
form a tree structure, where everything is organized from top to bottom. It's well suited to more
specific groups of data. Some examples of this method are single linkage clustering (a form of
agglomerative clustering), UPGMA and WPGMA (unweighted and weighted pair group method with
arithmetic mean), and complete linkage clustering. BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies) is another example, which works better than K-Means Clustering when
it comes to large sets of data: instead of the data points themselves, summaries that hold the
distribution information about the data are clustered. These summaries can be used by other
clustering algorithms as well, so BIRCH is also used together with them. Mean-Shift Clustering, a
related mode-seeking method, is commonly used for image handling and computer vision processing.
In this method an area with a high density of data points is a mode, and all the data points are
shifted towards it in an iterative manner.

Hierarchical Clustering is mostly used in areas such as data mining and statistics.
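
As a minimal sketch, agglomerative clustering and BIRCH are both available in scikit-learn (the synthetic dataset, the Ward linkage choice and the cluster counts are illustrative assumptions):

from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

#"ward" merges the pair of clusters that least increases total variance;
#linkage="single" would merge based on the closest pair of points instead
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
print(agg.fit_predict(X)[:10])

#BIRCH first builds a tree of compact summaries, then clusters the summaries
birch = Birch(n_clusters=3)
print(birch.fit_predict(X)[:10])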

Distribution Based Clustering:

In this method the data points are divided into clusters according to the probability that they
belong to a given cluster. The most popular approach uses the normal (Gaussian) distribution: the
incoming data is fitted to a fixed, predetermined number of distributions in a way that maximizes
the likelihood of the data placement. Expectation-Maximization Clustering is an example of this
method. It is used in areas such as structural engineering, medical image reconstruction and
psychometrics.
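
As a sketch, this Expectation-Maximization approach over Gaussian distributions is available in scikit-learn as GaussianMixture (the synthetic dataset and the number of components are illustrative assumptions):

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

#fit three Gaussian components with the EM algorithm
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)

#soft assignments: probability of each point belonging to each component
print(gmm.predict_proba(X)[:5])
#hard assignments: the most probable component for each point
print(gmm.predict(X)[:5])
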
Density Based Clustering:

In this method the clusters are formed by grouping together data points that lie close together in
high-density regions. Examples of this method include DBSCAN and OPTICS. DBSCAN (density-based
spatial clustering of applications with noise) is ideal for finding outliers in a dataset: it
separates the territory according to density, distinguishing the outliers that fall between
high-density clusters. The OPTICS (ordering points to identify the clustering structure) method is
similar to DBSCAN but doesn't struggle the way DBSCAN does when dealing with clusters of varying
density. Density-based clustering methods are commonly used for data mining.
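
A minimal DBSCAN sketch with scikit-learn (the two-moons dataset, eps and min_samples values are illustrative assumptions; these parameters normally need tuning per dataset):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

#two crescent-shaped clusters that centroid-based methods handle poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

#eps: neighbourhood radius; min_samples: points needed for a dense region
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

#points labelled -1 are treated as noise / outliers
print(set(labels))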

Subspace Clustering:

This method is well suited to working on high-dimensional data. It finds clusters in different
subspaces, which may overlap. There are two types of subspace clustering depending on the strategy
used: top-down, where each cluster's subspace is evaluated after finding an initial cluster in the
full set of dimensions, or bottom-up, where clusters found in low dimensions are merged to handle
higher dimensions. This method is used for movie recommendations, social networks and biological
data sets.

Fuzzy Clustering:

This is a soft clustering method, where data points may belong to several clusters at once, each
with a degree of membership. Fuzzy C-Means is an example of this kind of algorithm.
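
Since scikit-learn has no built-in Fuzzy C-Means, below is a minimal NumPy sketch of the standard update rules (the dataset, the fuzziness exponent m and the iteration count are illustrative assumptions):

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    #X: (n_samples, n_features); c: number of clusters; m > 1: fuzziness
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)      #memberships of each point sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = um.T @ X / um.sum(axis=0)[:, None]   #weighted cluster means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        u = 1.0 / d ** (2 / (m - 1))       #closer centre -> higher membership
        u /= u.sum(axis=1, keepdims=True)
    return centers, u

#two well-separated blobs as toy data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
centers, u = fuzzy_c_means(X, c=2)
print(u[:5])  #each row: degree of membership in each cluster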

Affinity Propagation Clustering:

In this method data points exchange messages with each other to reveal their similarities and form
the clusters. As the data points pass messages to each other, exemplars, which are actual data
points chosen to represent the clusters, are found. It is used in areas such as computer vision and
computational biology.
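
A minimal sketch with scikit-learn's AffinityPropagation (synthetic data; note that, unlike K-Means, no number of clusters is specified in advance):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

#exemplars emerge from the message passing between points
ap = AffinityPropagation(random_state=0)
labels = ap.fit_predict(X)

print(ap.cluster_centers_indices_)  #indices of the exemplar data points
print(len(set(labels)), "clusters found")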

AUC
AUC is the Area Under the Curve, in this instance the ROC (Receiver Operating Characteristic)
curve, sometimes also called the "error curve". It's used in binary classification problems. With a
ROC curve we can visualize the results after an algorithm operates on a certain set of samples and
gives an estimate, for each object, of which class it belongs to (0 or 1); by calculating the AUC
we can assess the quality of the classifier's performance. The ROC curve, a probability curve,
plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at
different threshold values. Choosing the threshold depends on the balance between false positives
and false negatives. A bigger AUC means that the model's ability to distinguish between the classes
is better. AUC = 1 means the classifier is perfect at distinguishing positive and negative points;
AUC = 0 means it classifies everything incorrectly; if 0.5 < AUC < 1, the chance of the classifier
getting it right is high; and AUC = 0.5 means the classifier cannot distinguish the classes at all:
it either decides randomly for every point or predicts a constant class for every point. In the
example Python code below:

We first import the packages necessary to perform Logistic Regression (a classification algorithm
used when the response variable is binary):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
Then we need a dataset to fit the model on:
#import dataset from CSV file on Github
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/default.csv"
data = pd.read_csv(url)

#define the predictor variables and the response variable
X = data[['student', 'balance', 'income']]
y = data['default']

#split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

#instantiate the model
log_regression = LogisticRegression()

#fit the model using the training data
log_regression.fit(X_train, y_train)
Finally, using the metrics.roc_auc_score() function, we calculate the area under the curve:

#use model to predict probability that given y value is 1
y_pred_proba = log_regression.predict_proba(X_test)[:, 1]

#calculate AUC of model
auc = metrics.roc_auc_score(y_test, y_pred_proba)

#print AUC score
print(auc)
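
To also visualize the ROC curve behind this score, we could compute its points with metrics.roc_curve() and plot them (a small illustrative addition, assuming matplotlib is available):

#plot the ROC curve and the AUC = 0.5 chance line
import matplotlib.pyplot as plt
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="AUC = %.3f" % auc)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()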
