Distance-Based Models


Distance-based Models

Intuition for distance based models

•Set of (x,y) points


•Two classes
•Is the box red or blue?
How can we do it?
-use Bayes' rule
-use a decision tree
-fit a hyperplane
Or, since the nearest points are red:
-use this as the basis of your algorithm

Why distance measures?

Introduction
There are many ways to measure distance.
(Minkowski distance). If X = R^d, the Minkowski distance of order
p > 0 is defined as

Dis_p(x, y) = (Σ_{j=1}^{d} |x_j − y_j|^p)^(1/p) = ||x − y||_p

where ||z||_p = (Σ_{j=1}^{d} |z_j|^p)^(1/p) is the p-norm (sometimes denoted
the Lp norm) of the vector z. We will often refer to Dis_p simply as the
p-norm.
The 2-norm refers to the familiar Euclidean distance:

Dis_2(x, y) = √(Σ_{j=1}^{d} (x_j − y_j)²) = ||x − y||_2
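
As an illustrative sketch (not part of the slides; Python with numpy assumed), the Minkowski distance can be computed directly from this definition:

import numpy as np

def minkowski(x, y, p):
    # Dis_p(x, y) = (sum_j |x_j - y_j|^p)^(1/p) = ||x - y||_p
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))   # 7.0 -> Manhattan distance
print(minkowski(x, y, 2))   # 5.0 -> Euclidean distance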

Introduction
The 1-norm denotes Manhattan distance, also called cityblock distance:

Dis_1(x, y) = Σ_{j=1}^{d} |x_j − y_j| = ||x − y||_1

The 0-norm (or L0 norm) counts the number of non-zero elements in a vector.
The corresponding distance counts the number of positions in which vectors x
and y differ. This is not strictly a Minkowski distance; however, we can define
it as

Dis_0(x, y) = Σ_{j=1}^{d} |x_j − y_j|^0 = ||x − y||_0

under the understanding that x^0 = 0 for x = 0 and 1 otherwise.


If x and y are binary strings, this is also called the Hamming distance.
We can see the Hamming distance as the number of bits that need to be
flipped to change x into y.

For example, suppose we record in a two-element vector whether the username and
password entered at login are correct. If the L0 distance to the correct combination
is 0, the login is successful. If it is 1, either the username or the password is
incorrect, but not both. And if it is 2, both the username and the password are
incorrect.
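
A minimal sketch of this L0/Hamming distance (not from the slides; the 0/1 encoding of the login fields is an assumption used only for illustration):

import numpy as np

def hamming(x, y):
    # number of positions in which x and y differ (the L0 distance)
    return int(np.sum(np.asarray(x) != np.asarray(y)))

# encode each login attempt as [username_ok, password_ok] and compare it with the
# fully correct combination [1, 1]; the Hamming distance counts the wrong fields
print(hamming([1, 1], [1, 1]))   # 0 -> login successful
print(hamming([0, 1], [1, 1]))   # 1 -> exactly one field is incorrect
print(hamming([0, 0], [1, 1]))   # 2 -> both fields are incorrect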

Example of various distances
With the L1 norm you can't go directly: for X = [3, 4], the L1 norm is
||X||_1 = |3| + |4| = 7.

Using the same example, the L2 norm is calculated by
||X||_2 = √(|3|² + |4|²) = √(9 + 16) = √25 = 5.

As you can see in the graphic, the L2 norm is the most direct route.

L-infinity norm:
Gives the largest magnitude among each element of a vector.
Having the vector X= [-6, 4, 2], the L-infinity norm is 6.
In L-infinity norm, only the largest element has any effect.
So, for example, if your vector represents the costs of constructing a building, by
minimizing the L-infinity norm we are reducing the cost of the most expensive component.
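
The same norms can be checked with numpy (a sketch, not from the slides):

import numpy as np

X = np.array([3.0, 4.0])
print(np.linalg.norm(X, ord=1))       # 7.0 -> L1, the city-block route
print(np.linalg.norm(X, ord=2))       # 5.0 -> L2, the direct route

Z = np.array([-6.0, 4.0, 2.0])
print(np.linalg.norm(Z, ord=np.inf))  # 6.0 -> L-infinity, the largest magnitude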

A is the testing point


For Euclidean distance, points B and C are at the same distance from A: it
cannot capture the fact that B differs from A in only one attribute, while C
differs in two. But for different values of p, things change.

Introduction
Distance metric: Desired properties
Given an instance space X, a distance metric is a function
Dis : X×X → R such that for any x, y, z ∈ X:
1. distances between a point and itself are zero: Dis(x,x) = 0;
2. all other distances are larger than zero: if x ≠ y then Dis(x,y) > 0;
3. distances are symmetric: Dis(y,x) = Dis(x,y);
4. detours cannot shorten the distance: Dis(x,z) ≤ Dis(x,y) + Dis(y,z).
If the second condition is weakened to a non-strict inequality – i.e.,
Dis(x,y) may be zero even if x ≠ y – the function Dis is called a
pseudo-metric.
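
As a quick sanity check (a sketch, not from the slides; the helper is_metric_on_sample is hypothetical), the four properties can be spot-checked on a finite sample of points:

import numpy as np

def is_metric_on_sample(dis, points, tol=1e-9):
    # spot-check the four metric properties on a finite sample of points
    for x in points:
        if abs(dis(x, x)) > tol:                               # property 1
            return False
        for y in points:
            if not np.allclose(x, y) and dis(x, y) <= 0:       # property 2
                return False
            if abs(dis(x, y) - dis(y, x)) > tol:               # property 3
                return False
            for z in points:
                if dis(x, z) > dis(x, y) + dis(y, z) + tol:    # property 4
                    return False
    return True

rng = np.random.default_rng(0)
points = [rng.normal(size=3) for _ in range(10)]
euclidean = lambda a, b: np.linalg.norm(a - b)
print(is_metric_on_sample(euclidean, points))   # True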

Suitable distance function for strings:


Edit Distance

From biology (DNA sequences): e.g., just delete an A at the beginning, or add one,
to turn one sequence into the other.

The edit distance is the number of insertions and deletions needed to get from x to y.
It satisfies all of the distance-metric properties above.
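
A small dynamic-programming sketch of this insertion/deletion edit distance (not from the slides; the DNA strings are made up for illustration):

def edit_distance(x, y):
    # number of insertions and deletions needed to turn x into y
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete the remaining characters of x
    for j in range(n + 1):
        d[0][j] = j                      # insert the remaining characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # characters match, no cost
            else:
                d[i][j] = 1 + min(d[i - 1][j],       # delete x[i-1]
                                  d[i][j - 1])       # insert y[j-1]
    return d[m][n]

print(edit_distance("AGCAT", "GCAT"))   # 1 -> delete the leading 'A'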

Non-Metric Distance Functions

Neighbours and Exemplars

In distance-based models, we formulate the model in terms of a


number of prototypical instances or exemplars, and define the
decision rules in terms of the nearest exemplars or neighbours.

For example, the basic linear classifier uses the two class means
μ⊕ and μ⊖ as exemplars, as a summary of all we need to know
about the training data in order to build the classifier.

A fundamental property of the mean of a set of vectors is that it


minimises the sum of squared Euclidean distances to those
vectors.

Neighbours and Exemplars

Theorem: The arithmetic mean minimises squared Euclidean


distance:
The arithmetic mean μ of a set of data points D in a Euclidean
space is the unique point that minimises the sum of squared
Euclidean distances to those data points.

Minimising the sum of squared Euclidean distances of a given set


of points is the same as minimising the average squared
Euclidean distance.
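
A quick numerical illustration of the theorem (a sketch, not from the slides): perturbing the mean always increases the sum of squared Euclidean distances.

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 2))          # a small set of data points
mu = D.mean(axis=0)                   # the arithmetic mean

def sse(c):
    # sum of squared Euclidean distances from the data points to c
    return float(np.sum(np.linalg.norm(D - c, axis=1) ** 2))

print(sse(mu))
print(sse(mu + np.array([0.1, 0.0])) > sse(mu))    # True
print(sse(mu + rng.normal(size=2)) > sse(mu))      # True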

Neighbours and Exemplars

Centroid and Medoid

The centroid or geometric center is the arithmetic mean


position of all the points.

Medoids are representative objects of a data set, or of a cluster
within a data set, whose average dissimilarity to all the objects in
the cluster is minimal.

Medoids are similar in concept to means or centroids, but


medoids are always restricted to be members of the data set.

Neighbours and Exemplars

In certain situations it makes sense to restrict an exemplar to be


one of the given data points. In that case, we speak of a medoid,
to distinguish it from a centroid which is an exemplar that
doesn’t have to occur in the data.

Finding a medoid requires us to calculate, for each data point,


the total distance to all other data points, in order to choose the
point that minimises it.
Regardless of the distance metric used, this is an O(n²) operation
for n points, so for medoids there is no computational reason to
prefer one distance metric over another.
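
A brute-force medoid computation along these lines (a sketch, not from the slides; scipy's cdist is assumed for the pairwise distances):

import numpy as np
from scipy.spatial.distance import cdist

def medoid(D, metric="euclidean"):
    # O(n^2): total distance from each point to all others, then take the minimiser
    pairwise = cdist(D, D, metric=metric)
    return D[np.argmin(pairwise.sum(axis=1))]

D = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0]])
print(medoid(D, metric="euclidean"))
print(medoid(D, metric="cityblock"))   # Manhattan distance; may pick a different point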

A small data set of 10 points, with circles indicating centroids and
squares indicating medoids (the latter must be data points), for
different distance metrics.

Neighbours and Exemplars
Once we have determined the exemplars, the basic linear
classifier constructs the decision boundary as the perpendicular
bisector of the line segment connecting the two exemplars.

An alternative, distance-based way to classify instances without


direct reference to a decision boundary is by the following
decision rule:
if x is nearest to μ⊕ then classify it as positive, otherwise as
negative; or equivalently, classify an instance to the class of the
nearest exemplar.

If we use Euclidean distance as our closeness measure, simple


geometry tells us we get exactly the same decision boundary.
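
A minimal nearest-exemplar decision rule using class means (a sketch, not from the slides; the toy data are made up):

import numpy as np

def nearest_mean_classify(X_train, y_train, x):
    # classify x as the class whose mean (exemplar) is nearest in Euclidean distance
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

X = np.array([[1.0, 1.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
y = np.array([1, 1, -1, -1])                                 # positive / negative class
print(nearest_mean_classify(X, y, np.array([2.0, 2.0])))     # 1
print(nearest_mean_classify(X, y, np.array([6.0, 6.0])))     # -1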

Neighbours and Exemplars

So the basic linear classifier can be interpreted from a distance-


based perspective as constructing exemplars that minimise
squared Euclidean distance within each class, and then applying
a nearest-exemplar decision rule.

This change of perspective opens up many new possibilities.


For example, we can investigate what the decision boundary
looks like if we use Manhattan distance for the decision rule.

The figure on the next slide illustrates this.

Neighbours and Exemplars

Neighbours and Exemplars

A Voronoi diagram is a partitioning of a plane into regions based on distance to points.

Neighbours and Exemplars

To summarise, the main ingredients of distance-based models are:

• distance metrics, which can be Euclidean, Manhattan, Minkowski or
Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a
chosen distance metric, or medoids that find the most centrally
located data point; and
• distance-based decision rules, which take a vote among the k
nearest exemplars.

These ingredients are combined in various ways to obtain supervised


and unsupervised learning algorithms.

Nearest-neighbor Classification

The most commonly used distance-based classifier is straightforward:


it simply uses each training instance as an exemplar.

Consequently, ‘training’ this classifier requires nothing more than


memorising the training data.

This extremely simple classifier is known as the nearest neighbour


classifier.

Its decision regions are made up of the cells of a Voronoi tessellation,


with piecewise linear decision boundaries selected from the Voronoi
boundaries.

Nearest-neighbor Classification
Properties of the nearest-neighbor classifier:

Unless the training set contains identical instances from different


classes, it will be able to separate the classes perfectly on the training
set – as it memorizes all training examples!

By choosing the right exemplars we can more or less represent any


decision boundary, or at least an arbitrarily close piecewise linear
approximation.

It follows that the nearest-neighbor classifier has low bias, but also
high variance. This suggests a risk of overfitting if the training data is
limited, noisy or unrepresentative.

Nearest-neighbor Classification

From an algorithmic point of view, training the nearest-neighbor


classifier is very fast, taking only O(n) time for storing n
exemplars.

The downside is that classifying a single instance also takes O(n)


time, as the instance will need to be compared with every
exemplar to determine which one is the nearest.

It is possible to reduce classification time at the expense of


increased training time by storing the exemplars in a more
elaborate data structure, but this tends not to scale well to large
numbers of features.

Nearest-neighbor Classification
High-dimensional instance spaces can be problematic because of the
infamous curse of dimensionality. High-dimensional spaces tend to be
extremely sparse, which means that every point is far away from
virtually every other point, and hence pairwise distances tend to be
uninformative.

However, irrespective of the curse of dimensionality, it is not simply a


matter of counting the number of features, as there are several
reasons why the effective dimensionality of the instance space may be
much smaller than the number of features. For example, some of the
features may be irrelevant and wipe out the relevant features’ signal in
the distance calculations.

In such a case it would be a good idea, before building a distance-


based model, to reduce dimensionality by performing feature
selection.

Nearest-neighbor Classification
In any case, before applying nearest-neighbor classification it is a
good idea to plot a histogram of pairwise distances of a sample to
see if they are sufficiently varied.

The nearest-neighbor method can easily be applied to regression


problems with a real-valued target variable.

The k-nearest neighbour method in its simplest form takes a vote


between the k ≥ 1 nearest exemplars of the instance to be
classified, and predicts the majority class.
We can easily turn this into a probability estimator by returning
the normalised class counts as a probability distribution over
classes.
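
A minimal k-nearest-neighbour vote with normalised class counts (a sketch, not from the slides; the toy data are made up):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # vote among the k nearest exemplars and return normalised class counts
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = [y_train[i] for i in np.argsort(dists)[:k]]
    counts = Counter(nearest)
    probs = {c: n / k for c, n in counts.items()}
    return counts.most_common(1)[0][0], probs

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
y = ["red", "red", "red", "blue", "blue"]
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))   # ('red', {'red': 1.0})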

Knn Overview

Intuition for knn

•Set of (x,y) points


•Two classes
•Is the box red or blue?
How can we do it?
-use Bayes' rule
-use a decision tree
-fit a hyperplane
Or, since the nearest points are red:
-use this as the basis of your algorithm

Nearest-neighbor Classification

If k-nearest neighbour is used for regression problems, the


obvious way to aggregate the predictions from the k neighbours
is by taking the mean value, which can again be distance-
weighted.

This would lend the model additional predictive power by


predicting values that aren’t observed among the stored
exemplars.

More generally, we can apply k-nearest neighbour to any learning problem
where we have an appropriate ‘aggregator’ for multiple target
values.

Predicting the value for x = 4

The curve is the target function; the dots represent the data.

Predicting for x = 0: kNN fails since it is outside the range of the training samples (extrapolation).
Predicting for x = 4 is interpolation, and kNN is good at it.
More neighbours don't guarantee more accuracy.

Choosing the value of K

There is no easy recipe to decide what value of k is appropriate


for a given data set.

However, it is possible to sidestep this question to some extent


by applying distance weighting to the votes: that is, the closer an
exemplar is to the instance to be classified, the more its vote
counts.

One could say that distance weighting makes k-nearest


neighbour more of a global model, while without it (and for
small k) it is more like an aggregation of local models.
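
A distance-weighted variant of the vote (a sketch, not from the slides), where closer exemplars count more:

import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-12):
    # each of the k nearest exemplars votes with weight 1 / distance
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(dists)[:k]:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
y = ["red", "red", "red", "blue", "blue"]
print(weighted_knn_predict(X, y, np.array([4.0, 4.0]), k=5))   # 'blue' (an unweighted 5-NN vote would say 'red')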

Nearest-neighbor Classification

The next figure illustrates k-nearest neighbour method on a


small data set of 20 exemplars from five different classes, for k =
3, 5, 7.

The class distribution is visualised by assigning each test point


the class of a uniformly sampled neighbour: so, in a region
where two of k = 3 neighbours are red and one is orange, the
shading is a mix of two-thirds red and one-third orange.

While for k = 3 the decision regions are still mostly discernible,


this is much less so for k = 5 and k = 7.

Nearest-neighbor Classification

Nearest-neighbor Classification

To take an extreme example: if k is equal to the number of


exemplars n, every test instance will have the same number of
neighbours and will receive the same probability vector.

We conclude that the refinement of k-nearest neighbour – the


number of different predictions it can make – initially increases
with increasing k, then decreases again.

Furthermore, we can say that the bias increases and the variance
decreases with increasing k.

Why is knn slow?

Making knn fast

Use the median and apply splits.

Classifying with the tree, and a disadvantage: missing neighbours (the true
nearest neighbour can lie just across a split).
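
One common realisation of this median-splitting idea is a k-d tree; a sketch using SciPy's cKDTree (the k-d tree itself is an assumption here, as the slides only hint at median splits):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))          # stored exemplars
tree = cKDTree(X)                        # recursively splits the data at median values

query = rng.normal(size=3)
dists, idx = tree.query(query, k=5)      # 5 nearest neighbours without a full linear scan
print(idx, dists)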

Distance-based Clustering

In a distance-based context, unsupervised learning is usually taken


to refer to clustering.

The distance-based clustering methods can be exemplar-based


and hence predictive: they naturally generalise to unseen
instances.

There can be clustering methods that are not exemplar-based and


hence descriptive.

Distance-based Clustering

Predictive distance-based clustering methods use the same


ingredients as distance-based classifiers: a distance metric, a way
to construct exemplars and a distance-based decision rule.

In the absence of an explicit target variable, the assumption is


that the distance metric indirectly encodes the learning target,
so that we aim to find clusters that are compact with respect to
the distance metric.

This requires a notion of cluster compactness that can serve as


the optimization criterion.

Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)

Hard vs. soft clustering


• Hard clustering: Each document belongs to exactly one cluster
– More common and easier to do
• Soft clustering: A document can belong to more than one
cluster.
– Makes more sense for applications like creating browsable
hierarchies
– You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
– You can only do that with a soft clustering approach.

Distance-based Clustering
K-means algorithm

The K-means problem is NP-hard, which means that there is no known
efficient algorithm to find the global minimum, and we need to resort
to a heuristic algorithm.

The best known algorithm is usually also called K-means,


although the name ‘Lloyd’s algorithm’ is also used.

The outline of the algorithm is given in the next slide. The


algorithm iterates between partitioning the data using the
nearest-centroid decision rule, and recalculating centroids from
a partition.

K-means algorithm

Cluster assignment step:
c(i) := argmin_k ||x(i) − μ_k||²

Move centroid step:
e.g., μ_2 = (1/3)(x(1) + x(3) + x(5)) if examples 1, 3 and 5 are assigned to cluster 2

K-means algorithm
Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K

Repeat {
  for i = 1 to m:
    c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
  for k = 1 to K:
    μ_k := average (mean) of the points assigned to cluster k
}
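
A compact numpy sketch of this algorithm (not from the slides; the toy data and the empty-cluster guard are assumptions):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # K random training points
    for _ in range(n_iter):
        # cluster assignment step: c(i) = index of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # move centroid step: mu_k = mean of the points assigned to cluster k
        new_centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
                                  for k in range(K)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return c, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
print(centroids)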


K-means for non-separated clusters

Example: T-shirt sizing, clustering customers on weight vs. height.

K-means optimization objective
c(i) = index of the cluster (1, 2, ..., K) to which example x(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ R^n)
μ_c(i) = cluster centroid of the cluster to which example x(i) has been assigned
Optimization objective:
J(c(1), ..., c(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^{m} ||x(i) − μ_c(i)||²
K-means algorithm


Termination conditions
• Several possibilities, e.g.,
– A fixed number of iterations.
– Doc partition unchanged.
– Centroid positions don’t change.

Does this mean that the docs in a


cluster are unchanged?


Convergence
• Why should the K-means algorithm ever reach a
fixed point?
– A state in which clusters don’t change.
• K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
– EM is known to converge.
– The number of iterations could be large.
– But in practice it usually isn't.


Time Complexity
• Computing distance between two docs is
O(M) where M is the dimensionality of the
vectors.
• Reassigning clusters: O(KN) distance
computations, or O(KNM).
• Computing centroids: Each doc gets added
once to some centroid: O(NM).
• Assume these two steps are each done once
for I iterations: O(IKNM).


Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to
sub-optimal clusterings.
– Select good seeds using a heuristic (e.g., the doc least similar to any
existing mean)
– Try out multiple starting points
– Initialize with the results of another method.

Example showing sensitivity to seeds: in the figure, if you start with B and E
as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F
you converge to {A,B,D,E} and {C,F}.

Random initialization
Should have K < m (fewer clusters than training examples).

Randomly pick K training examples.

Set μ_1, ..., μ_K equal to these examples.

Local optima


Random initialization
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means. Get c(1), ..., c(m), μ_1, ..., μ_K.
  Compute the cost function (distortion) J(c(1), ..., c(m), μ_1, ..., μ_K)
}
Pick the clustering that gave the lowest cost J.
What is the right value of K?


Choosing the value of K

Elbow method: plot the cost function J against the number of clusters
(e.g., K = 1 to 8) and look for an 'elbow' where the cost stops dropping
sharply; in some cases the curve decreases smoothly and no clear elbow
can be identified.
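
A sketch of the elbow method (not from the slides; scikit-learn assumed): compute the cost for a range of K and look where it stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.4, size=(40, 2)) for m in ([0, 0], [4, 4], [0, 4])])

for K in range(1, 9):
    cost = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    print(K, round(cost, 1))   # the drop should flatten out around the 'elbow'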

Choosing the value of K
Sometimes, you're running K-means to get clusters to use for some
later/downstream purpose. Evaluate K-means based on a metric for how
well it performs for that later purpose.

E.g., T-shirt sizing: cluster customers on weight vs. height, and compare the
clusterings obtained for different K (such as three sizes vs. five sizes) by how
well they serve the sizing decision.

K-means Algorithm


Example - Clustering MLM data

When we run K-means on the MLM data with K = 3, we


obtain the clusters {Associations, Trees, Rules}, {GMM, naive
Bayes}, and a larger cluster with the remaining data points.

When we run it with K = 4, we get that the large cluster splits


into two: {kNN, Linear Classifier, Linear Regression} and {K-
means, Logistic Regression, SVM}; but also that GMM gets
reallocated to the latter cluster, and naive Bayes ends up as a
singleton.

MLM data clustered by K-means with K = 3 and K = 4.

Example - Clustering MLM data


It can be shown that one iteration of K-means can never increase
the within-cluster scatter, from which it follows that the
algorithm will reach a stationary point: a point where no further
improvement is possible. It is worth noting that even the
simplest data set will have many stationary points.

In general, while K-means converges to a stationary point in


finite time, no guarantees can be given about whether the
convergence point is in fact the global minimum, or if not, how
far we are from it.

In practice it is advisable to run the algorithm a number of times


and select the solution with the smallest within-cluster scatter.

Example - Stationary points in clustering

Consider the task of dividing the set of numbers {8,44,50,58,84}


into two clusters. There are four possible partitions that 2-means
can find: {8}, {44,50,58,84}; {8,44}, {50,58, 84}; {8,44, 50}, {58,
84}; and {8,44,50,58}, {84}.

It is easy to verify that each of these establishes a stationary


point for 2-means, and hence will be found with a suitable
initialisation.

Only the first clustering is optimal; i.e., it minimises the total


within-cluster scatter.
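
The within-cluster scatters of the four partitions can be checked directly (a sketch, not from the slides):

import numpy as np

def scatter(cluster):
    # within-cluster scatter: sum of squared distances to the cluster mean
    c = np.array(cluster, dtype=float)
    return float(np.sum((c - c.mean()) ** 2))

partitions = [([8], [44, 50, 58, 84]),
              ([8, 44], [50, 58, 84]),
              ([8, 44, 50], [58, 84]),
              ([8, 44, 50, 58], [84])]
for a, b in partitions:
    print(a, b, scatter(a) + scatter(b))
# the first partition has the smallest total scatter, so it is the optimal 2-means clustering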

Distance-based Clustering

Clustering around medoids

It is straightforward to adapt the K-means algorithm to use a


different distance metric.
K-medoids algorithm additionally requires the exemplars to be
data points.

Calculating the medoid of a cluster requires examining all pairs


of points – whereas calculating the mean requires just a single
pass through the points – which can be prohibitive for large data
sets.

Hierarchical Clustering

Dendrograms are tree diagrams which are purely defined in


terms of a distance measure.

Because dendrograms use features only indirectly, as the basis


on which the distance measure is calculated, they partition the
given data rather than the entire instance space, and hence
represent a descriptive clustering rather than a predictive one.

Hierarchical clustering

Definition – Dendrogram

Given a data set D, a dendrogram is a binary tree with the


elements of D at its leaves.
An internal node of the tree represents the subset of elements in
the leaves of the subtree rooted at that node.
The level of a node is the distance between the two clusters
represented by the children of the node.
Leaves have level 0.

Dendrogram: Hierarchical Clustering

– Clustering obtained
by cutting the
dendrogram at a
desired level: each
connected
component forms a
cluster.

• The decision on the number of clusters that best depicts the different
groups can be made by observing the dendrogram.
• The best choice for the number of clusters is the number of vertical lines
in the dendrogram cut by a horizontal line that can traverse the maximum
vertical distance without intersecting a cluster.

Example - Hierarchical clustering of MLM data


A hierarchical clustering of the MLM data is given in the next slide.
The tree shows that the three logical methods at the top form a strong
cluster.

If we wanted three clusters, we get the logical cluster, a second small


cluster {GMM, naive Bayes}, and the remainder.

If we wanted four clusters, we would separate GMM and naive Bayes,


as the tree indicates this cluster is the least tight of the three.

If we wanted five clusters, we would construct {Linear Regression,
Linear Classifier} as a separate cluster.

This illustrates the key advantage of hierarchical clustering: it doesn’t


require fixing the number of clusters in advance.

Example - Hierarchical clustering of MLM data

A dendrogram constructed by hierarchical clustering from the data

Hierarchical clustering

Taking cluster means as exemplars assumes Euclidean distance,


and we may want to use one of the other distance metrics.

This has led to the introduction of the linkage function, which is


a general way to turn pairwise point distances into pairwise
cluster distances.

Definition - Linkage function


A linkage function L : 2X × 2X → R calculates the distance between
arbitrary subsets of the instance space, given a distance metric
Dis : X × X →R.

Hierarchical clustering

The most common linkage functions are as follows:

Single linkage defines the distance between two clusters


as the smallest pairwise distance between
elements from each cluster.
Complete linkage defines the distance between two clusters
as the largest point-wise distance.
Average linkage defines the cluster distance as the average
point-wise distance.
Centroid linkage defines the cluster distance as the point
distance between the cluster means.

Hierarchical clustering
These linkage functions can be defined mathematically as follows:

Single linkage:   L_single(A, B)   = min { Dis(x, y) | x ∈ A, y ∈ B }
Complete linkage: L_complete(A, B) = max { Dis(x, y) | x ∈ A, y ∈ B }
Average linkage:  L_average(A, B)  = (1 / (|A|·|B|)) Σ_{x ∈ A} Σ_{y ∈ B} Dis(x, y)
Centroid linkage: L_centroid(A, B) = Dis(μ_A, μ_B), where μ_A and μ_B are the
                  means of clusters A and B.
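
These definitions translate directly into code (a sketch, not from the slides; scipy's cdist is assumed for the pairwise distances):

import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):   return cdist(A, B).min()    # smallest pairwise distance
def complete_link(A, B): return cdist(A, B).max()    # largest pairwise distance
def average_link(A, B):  return cdist(A, B).mean()   # average pairwise distance
def centroid_link(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
# 3.0 6.0 4.5 4.5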

Hierarchical Agglomerative Clustering (HAC)
The general algorithm to build a dendrogram is given in the next slide.

The tree is built from the data points upwards and is hence a bottom–up or
agglomerative algorithm.

At each iteration the algorithm constructs a new partition of the data by


merging the two nearest clusters together.
In general, the HAC algorithm gives different results when different linkage
functions are used.
Hierarchical clustering using single linkage can essentially be done by
calculating and sorting all pairwise distances between data points, which
requires O(n²) time for n points.

The other linkage functions require at least O(n² log n). The un-optimised
algorithm has time complexity O(n³).
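
In practice the whole HAC procedure is available in SciPy (assumed here); a sketch using single linkage and a cut into three clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(10, 2)) for m in ([0, 0], [3, 3], [0, 3])])

Z = linkage(X, method="single")                    # also: "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)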


Computational Complexity
• In the first iteration, all HAC methods need to compute the
similarity of all pairs of N initial instances, which is O(N²).
• In each of the subsequent N−2 merging iterations, compute the
distance between the most recently created cluster and all other
existing clusters.
• In order to maintain an overall O(N²) performance, computing the
similarity to each other cluster must be done in constant time.
– Often O(N³) if done naively, or O(N² log N) if done more cleverly.

Hierarchical clustering

• Input distance matrix (L = 0 for all clusters)
• The nearest pair of cities is MI and TO, at
distance 138. These are merged into a single
cluster called "MI/TO". The level of the new
cluster is L(MI/TO) = 138 and the new sequence
number is m = 1
• Then we compute the distance from this new
compound object to all other objects.
• In single link clustering the rule is that the
distance from the compound object to another
object is equal to the shortest distance from any
member of the cluster to the outside object.
• So the distance from "MI/TO" to RM is chosen to
be 564, which is the distance from MI to RM, and
so on.
• After merging MI with TO we obtain the following matrix.

• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster
called NA/RM; L(NA/RM) = 219, m = 2

• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new
cluster called BA/NA/RM; L(BA/NA/RM) = 255, m = 3

• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a
new cluster called BA/FI/NA/RM; L(BA/FI/NA/RM) = 268, m = 4

• Finally, we merge the last two clusters at level 295.

