Clustering in Machine Learning

Cluster analysis, or clustering, is an unsupervised learning method, which means that the machine
being trained has to deal with unclassified, unlabelled information and act on it with no guidance.
The machine has to find the structure of the data it's given by itself. With clustering, this
unlabelled dataset is divided into groups according to the similarities between data points. The
data points in the same cluster have to share more similarities with each other than they do with
data points in other clusters. The clusters are formed based on similar patterns, such as shape,
size, colour or behaviour, found in the unclassified dataset, and the data is divided according to
the presence or absence of those shared patterns. Clustering is especially useful for dealing with
data you have no prior knowledge of. Different clustering methods have a wide variety of application
areas, such as medical imaging, sequence analysis, human genetic clustering, analysis of
antimicrobial activity, market research, social network analysis, search result grouping, anomaly
detection, image segmentation, crime analysis, data mining and much more. Some of the most widely
used and best-known clustering algorithms are the following:

Centroid Based Clustering (also known as Partitioning Clustering):

In this method the data is divided into non-hierarchical groups. This method's most common
example is K-Means Clustering, in which the dataset is divided into a predetermined number k of
groups according to the distance of each data point to each cluster centre, which we call a
centroid. The centroids are chosen so that the distance between the data points in a cluster and
their own cluster centre is smaller than their distance to the centroids of other clusters. Since
the underlying optimization problem is NP-hard, the algorithm is iterative: the solution is
approximated over several passes. K-Means Clustering is mostly used in areas such as computer
vision, astronomy and market segmentation.
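
As an illustration, here is a minimal sketch of K-Means with scikit-learn on synthetic data (the dataset and the choice of k = 3 are illustrative assumptions, not part of the original write-up):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

#generate an unlabelled 2-D dataset with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

#ask for k=3 clusters; fit_predict assigns each point to its nearest centroid
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  #the learned centroids
print(labels[:10])              #cluster index of the first ten points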

Hierarchical Clustering (or Connectivity Based Clustering):

This is similar to centroid-based clustering in that the grouping is also done based on the
proximity of data points, on the assumption that data points that are closer share more
similarities than ones that are more distant. As a result of this grouping we get clusters that
form a tree structure, where everything is organized from top to bottom. It's well suited to more
specific groups of data. Some examples of this method are single linkage clustering (a form of
agglomerative clustering), UPGMA and WPGMA (unweighted and weighted pair group method with
arithmetic mean), and complete linkage clustering. BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies) is another example, which works better than K-Means Clustering when
it comes to large sets of data: instead of the data points themselves, summaries that hold the
distribution information about the data are clustered. These summaries can be used by other
clustering algorithms as well, so BIRCH is also used together with them. Mean-Shift Clustering, a
related mode-seeking method, is commonly used for image handling and computer vision processing.
In this method an area with a high density of data points is a mode, and all the data points are
shifted towards it in an iterative manner.

Hierarchical Clustering is mostly used in areas such as data mining and statistics.
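
As a minimal sketch, agglomerative clustering and BIRCH are both available in scikit-learn (the synthetic dataset, the Ward linkage choice and the cluster counts are illustrative assumptions):

from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

#"ward" merges the pair of clusters that least increases total variance;
#linkage="single" would merge based on the closest pair of points instead
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
print(agg.fit_predict(X)[:10])

#BIRCH first builds a tree of compact summaries, then clusters the summaries
birch = Birch(n_clusters=3)
print(birch.fit_predict(X)[:10])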

Distribution Based Clustering:

In this method the data points are divided into clusters according to the probability that they
belong to a given cluster. The most popular approach uses the normal (Gaussian) distribution: the
incoming data is fitted to a fixed, predetermined number of distributions in a way that maximizes
the likelihood of the data placement. Expectation-Maximization Clustering is an example of this
method. It is used in areas such as structural engineering, medical image reconstruction and
psychometrics.
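
As a sketch, this Expectation-Maximization approach over Gaussian distributions is available in scikit-learn as GaussianMixture (the synthetic dataset and the number of components are illustrative assumptions):

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

#fit three Gaussian components with the EM algorithm
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)

#soft assignments: probability of each point belonging to each component
print(gmm.predict_proba(X)[:5])
#hard assignments: the most probable component for each point
print(gmm.predict(X)[:5])
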
Density Based Clustering:

In this method the clusters are formed by grouping together data points that lie close together in
high-density regions. Examples of this method include DBSCAN and OPTICS. DBSCAN (density-based
spatial clustering of applications with noise) is ideal for finding outliers in a dataset: it
separates the territory according to density, distinguishing the outliers that fall between
high-density clusters. The OPTICS (ordering points to identify the clustering structure) method is
similar to DBSCAN but doesn't struggle the way DBSCAN does when dealing with clusters of varying
density. Density-based clustering methods are commonly used for data mining.
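
A minimal DBSCAN sketch with scikit-learn (the two-moons dataset, eps and min_samples values are illustrative assumptions; these parameters normally need tuning per dataset):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

#two crescent-shaped clusters that centroid-based methods handle poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

#eps: neighbourhood radius; min_samples: points needed for a dense region
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

#points labelled -1 are treated as noise / outliers
print(set(labels))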

Subspace Clustering:

This method is well suited to working on high-dimensional data. It finds clusters in different
subspaces, which may overlap. There are two types of subspace clustering depending on the strategy
used: top-down, where each cluster's subspace is evaluated after finding an initial cluster in the
full set of dimensions, or bottom-up, where clusters found in low dimensions are merged to handle
higher dimensions. This method is used for movie recommendations, social networks and biological
data sets.

Fuzzy Clustering:

This is a soft clustering method, where data points may belong to several clusters at once, each
with a degree of membership. Fuzzy C-Means is an example of this kind of algorithm.
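
Since scikit-learn has no built-in Fuzzy C-Means, below is a minimal NumPy sketch of the standard update rules (the dataset, the fuzziness exponent m and the iteration count are illustrative assumptions):

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    #X: (n_samples, n_features); c: number of clusters; m > 1: fuzziness
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)      #memberships of each point sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = um.T @ X / um.sum(axis=0)[:, None]   #weighted cluster means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        u = 1.0 / d ** (2 / (m - 1))       #closer centre -> higher membership
        u /= u.sum(axis=1, keepdims=True)
    return centers, u

#two well-separated blobs as toy data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
centers, u = fuzzy_c_means(X, c=2)
print(u[:5])  #each row: degree of membership in each cluster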

Affinity Propagation Clustering:

In this method data points exchange messages with each other to reveal their similarities and form
the clusters. As the data points pass messages to each other, exemplars, which are actual data
points chosen to represent the clusters, are found. It is used in areas such as computer vision and
computational biology.
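
A minimal sketch with scikit-learn's AffinityPropagation (synthetic data; note that, unlike K-Means, no number of clusters is specified in advance):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

#exemplars emerge from the message passing between points
ap = AffinityPropagation(random_state=0)
labels = ap.fit_predict(X)

print(ap.cluster_centers_indices_)  #indices of the exemplar data points
print(len(set(labels)), "clusters found")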

AUC
AUC is the Area Under the Curve, in this instance the ROC (Receiver Operating Characteristic)
curve, sometimes also called the "error curve". It's used in binary classification problems. With a
ROC curve we can visualize the results after an algorithm operates on a certain set of samples and
gives an estimate, for each object, of which class it belongs to (0 or 1); by calculating the AUC
we can assess the quality of the classifier's performance. The ROC curve, a probability curve,
plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at
different threshold values. Choosing the threshold depends on the balance between false positives
and false negatives. A bigger AUC means that the model's ability to distinguish between the classes
is better. AUC = 1 means the classifier is perfect at distinguishing positive and negative points;
AUC = 0 means it classifies everything incorrectly; if 0.5 < AUC < 1, the chance of the classifier
getting it right is high; and AUC = 0.5 means the classifier cannot distinguish the classes at all:
it either decides randomly for every point or predicts a constant class for every point. In the
example Python code below:

We first import the packages necessary to perform Logistic Regression (a classification algorithm
used when the response variable is binary):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
Then we need a dataset to fit the model on:
#import dataset from CSV file on Github
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/default.csv"
data = pd.read_csv(url)

#define the predictor variables and the response variable
X = data[['student', 'balance', 'income']]
y = data['default']

#split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

#instantiate the model
log_regression = LogisticRegression()

#fit the model using the training data
log_regression.fit(X_train, y_train)
Finally, using the metrics.roc_auc_score() function, we calculate the area under the curve:

#use model to predict probability that given y value is 1
y_pred_proba = log_regression.predict_proba(X_test)[:, 1]

#calculate AUC of model
auc = metrics.roc_auc_score(y_test, y_pred_proba)

#print AUC score
print(auc)
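
To also visualize the ROC curve behind this score, we could compute its points with metrics.roc_curve() and plot them (a small illustrative addition, assuming matplotlib is available):

#plot the ROC curve and the AUC = 0.5 chance line
import matplotlib.pyplot as plt
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="AUC = %.3f" % auc)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()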
