Distance-Based Models


Distance-based Models

Intuition for distance based models

•Set of (x,y) points


•Two classes
•Is the box red or blue?
How can we do it?
-use Bayes' rule
-use a decision tree
-fit a hyperplane
Or, since the nearest points are red:
-use this as the basis of your algorithm

Why distance measures?

Introduction
There are many ways to measure distance.
(Minkowski distance). If X = R^d, the Minkowski distance of order
p > 0 is defined as

Dis_p(x, y) = (Σ_{j=1}^{d} |x_j − y_j|^p)^(1/p) = ||x − y||_p

where ||z||_p = (Σ_{j=1}^{d} |z_j|^p)^(1/p) is the p-norm (sometimes denoted
the Lp norm) of the vector z. We will often refer to Dis_p simply as the
p-norm.
The 2-norm refers to the familiar Euclidean distance:

Dis_2(x, y) = √(Σ_{j=1}^{d} (x_j − y_j)²) = ||x − y||_2
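
As an illustrative sketch (not part of the slides; Python with numpy assumed), the Minkowski distance can be computed directly from this definition:

import numpy as np

def minkowski(x, y, p):
    # Dis_p(x, y) = (sum_j |x_j - y_j|^p)^(1/p) = ||x - y||_p
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))   # 7.0 -> Manhattan distance
print(minkowski(x, y, 2))   # 5.0 -> Euclidean distance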

Introduction
The 1-norm denotes Manhattan distance, also called cityblock distance:

Dis_1(x, y) = Σ_{j=1}^{d} |x_j − y_j| = ||x − y||_1

The 0-norm (or L0 norm) counts the number of non-zero elements in a vector.
The corresponding distance counts the number of positions in which vectors x
and y differ. This is not strictly a Minkowski distance; however, we can define
it as

Dis_0(x, y) = Σ_{j=1}^{d} |x_j − y_j|^0 = ||x − y||_0

under the understanding that x^0 = 0 for x = 0 and 1 otherwise.


If x and y are binary strings, this is also called the Hamming distance.
We can see the Hamming distance as the number of bits that need to be
flipped to change x into y.

For example, suppose we record in a two-element vector whether the username and
password entered at login are correct. If the L0 distance to the correct combination
is 0, the login is successful. If it is 1, either the username or the password is
incorrect, but not both. And if it is 2, both the username and the password are
incorrect.
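
A minimal sketch of this L0/Hamming distance (not from the slides; the 0/1 encoding of the login fields is an assumption used only for illustration):

import numpy as np

def hamming(x, y):
    # number of positions in which x and y differ (the L0 distance)
    return int(np.sum(np.asarray(x) != np.asarray(y)))

# encode each login attempt as [username_ok, password_ok] and compare it with the
# fully correct combination [1, 1]; the Hamming distance counts the wrong fields
print(hamming([1, 1], [1, 1]))   # 0 -> login successful
print(hamming([0, 1], [1, 1]))   # 1 -> exactly one field is incorrect
print(hamming([0, 0], [1, 1]))   # 2 -> both fields are incorrect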

Example of various distances
With the L1 norm you can't go directly: for X = [3, 4], the L1 norm is
||X||_1 = |3| + |4| = 7.

Using the same example, the L2 norm is calculated by
||X||_2 = √(|3|² + |4|²) = √(9 + 16) = √25 = 5.

As you can see in the graphic, the L2 norm is the most direct route.

L-infinity norm:
Gives the largest magnitude among each element of a vector.
Having the vector X= [-6, 4, 2], the L-infinity norm is 6.
In L-infinity norm, only the largest element has any effect.
So, for example, if your vector represents the costs of constructing a building, by
minimizing the L-infinity norm we are reducing the cost of the most expensive component.
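
The same norms can be checked with numpy (a sketch, not from the slides):

import numpy as np

X = np.array([3.0, 4.0])
print(np.linalg.norm(X, ord=1))       # 7.0 -> L1, the city-block route
print(np.linalg.norm(X, ord=2))       # 5.0 -> L2, the direct route

Z = np.array([-6.0, 4.0, 2.0])
print(np.linalg.norm(Z, ord=np.inf))  # 6.0 -> L-infinity, the largest magnitude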

A is the testing point


For Euclidean distance, points B and C are at the same distance from A: it
cannot capture the fact that B differs from A in only one attribute, while C
differs in two. But for different values of p, things change.

Introduction
Distance metric: Desired properties
Given an instance space X, a distance metric is a function
Dis : X×X → R such that for any x, y, z ∈ X:
1. distances between a point and itself are zero: Dis(x,x) = 0;
2. all other distances are larger than zero: if x ≠ y then Dis(x,y) > 0;
3. distances are symmetric: Dis(y,x) = Dis(x,y);
4. detours cannot shorten the distance: Dis(x,z) ≤ Dis(x,y) + Dis(y,z).
If the second condition is weakened to a non-strict inequality – i.e.,
Dis(x,y) may be zero even if x ≠ y – the function Dis is called a
pseudo-metric.
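
As a quick sanity check (a sketch, not from the slides; the helper is_metric_on_sample is hypothetical), the four properties can be spot-checked on a finite sample of points:

import numpy as np

def is_metric_on_sample(dis, points, tol=1e-9):
    # spot-check the four metric properties on a finite sample of points
    for x in points:
        if abs(dis(x, x)) > tol:                               # property 1
            return False
        for y in points:
            if not np.allclose(x, y) and dis(x, y) <= 0:       # property 2
                return False
            if abs(dis(x, y) - dis(y, x)) > tol:               # property 3
                return False
            for z in points:
                if dis(x, z) > dis(x, y) + dis(y, z) + tol:    # property 4
                    return False
    return True

rng = np.random.default_rng(0)
points = [rng.normal(size=3) for _ in range(10)]
euclidean = lambda a, b: np.linalg.norm(a - b)
print(is_metric_on_sample(euclidean, points))   # True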

Suitable distance function for strings:


Edit Distance

From biology (DNA sequences): e.g., just delete an A at the beginning, or add one,
to turn one sequence into the other.

The edit distance is the number of insertions and deletions needed to get from x to y.
It satisfies all of the distance-metric properties above.
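
A small dynamic-programming sketch of this insertion/deletion edit distance (not from the slides; the DNA strings are made up for illustration):

def edit_distance(x, y):
    # number of insertions and deletions needed to turn x into y
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete the remaining characters of x
    for j in range(n + 1):
        d[0][j] = j                      # insert the remaining characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # characters match, no cost
            else:
                d[i][j] = 1 + min(d[i - 1][j],       # delete x[i-1]
                                  d[i][j - 1])       # insert y[j-1]
    return d[m][n]

print(edit_distance("AGCAT", "GCAT"))   # 1 -> delete the leading 'A'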

Non-Metric Distance Functions

Neighbours and Exemplars

In distance-based models, we formulate the model in terms of a


number of prototypical instances or exemplars, and define the
decision rules in terms of the nearest exemplars or neighbours.

For example, the basic linear classifier uses the two class means
μ⊕ and μ⊖ as exemplars, as a summary of all we need to know
about the training data in order to build the classifier.

A fundamental property of the mean of a set of vectors is that it


minimises the sum of squared Euclidean distances to those
vectors.

Neighbours and Exemplars

Theorem: The arithmetic mean minimises squared Euclidean


distance:
The arithmetic mean μ of a set of data points D in a Euclidean
space is the unique point that minimises the sum of squared
Euclidean distances to those data points.

Minimising the sum of squared Euclidean distances of a given set


of points is the same as minimising the average squared
Euclidean distance.
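
A quick numerical illustration of the theorem (a sketch, not from the slides): perturbing the mean always increases the sum of squared Euclidean distances.

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 2))          # a small set of data points
mu = D.mean(axis=0)                   # the arithmetic mean

def sse(c):
    # sum of squared Euclidean distances from the data points to c
    return float(np.sum(np.linalg.norm(D - c, axis=1) ** 2))

print(sse(mu))
print(sse(mu + np.array([0.1, 0.0])) > sse(mu))    # True
print(sse(mu + rng.normal(size=2)) > sse(mu))      # True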

Neighbours and Exemplars

Centroid and Medoid

The centroid or geometric center is the arithmetic mean


position of all the points.

Medoids are representative objects of a data set, or of a cluster
within a data set, whose average dissimilarity to all the objects in
the cluster is minimal.

Medoids are similar in concept to means or centroids, but


medoids are always restricted to be members of the data set.

Neighbours and Exemplars

In certain situations it makes sense to restrict an exemplar to be


one of the given data points. In that case, we speak of a medoid,
to distinguish it from a centroid which is an exemplar that
doesn’t have to occur in the data.

Finding a medoid requires us to calculate, for each data point,


the total distance to all other data points, in order to choose the
point that minimises it.
Regardless of the distance metric used, this is an O(n²) operation
for n points, so for medoids there is no computational reason to
prefer one distance metric over another.
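
A brute-force medoid computation along these lines (a sketch, not from the slides; scipy's cdist is assumed for the pairwise distances):

import numpy as np
from scipy.spatial.distance import cdist

def medoid(D, metric="euclidean"):
    # O(n^2): total distance from each point to all others, then take the minimiser
    pairwise = cdist(D, D, metric=metric)
    return D[np.argmin(pairwise.sum(axis=1))]

D = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0]])
print(medoid(D, metric="euclidean"))
print(medoid(D, metric="cityblock"))   # Manhattan distance; may pick a different point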

A small data set of 10 points, with circles indicating centroids and
squares indicating medoids (the latter must be data points), for
different distance metrics.

Neighbours and Exemplars
Once we have determined the exemplars, the basic linear
classifier constructs the decision boundary as the perpendicular
bisector of the line segment connecting the two exemplars.

An alternative, distance-based way to classify instances without


direct reference to a decision boundary is by the following
decision rule:
if x is nearest to μ⊕ then classify it as positive, otherwise as
negative; or equivalently, classify an instance to the class of the
nearest exemplar.

If we use Euclidean distance as our closeness measure, simple


geometry tells us we get exactly the same decision boundary.
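
A minimal nearest-exemplar decision rule using class means (a sketch, not from the slides; the toy data are made up):

import numpy as np

def nearest_mean_classify(X_train, y_train, x):
    # classify x as the class whose mean (exemplar) is nearest in Euclidean distance
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

X = np.array([[1.0, 1.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
y = np.array([1, 1, -1, -1])                                 # positive / negative class
print(nearest_mean_classify(X, y, np.array([2.0, 2.0])))     # 1
print(nearest_mean_classify(X, y, np.array([6.0, 6.0])))     # -1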

Neighbours and Exemplars

So the basic linear classifier can be interpreted from a distance-


based perspective as constructing exemplars that minimise
squared Euclidean distance within each class, and then applying
a nearest-exemplar decision rule.

This change of perspective opens up many new possibilities.


For example, we can investigate what the decision boundary
looks like if we use Manhattan distance for the decision rule.

The figure on the next slide illustrates this.

Neighbours and Exemplars

Neighbours and Exemplars

A Voronoi diagram is a partitioning of a plane into regions based on distance to points.

Neighbours and Exemplars

To summarise, the main ingredients of distance-based models are:

• distance metrics, which can be Euclidean, Manhattan, Minkowski or
Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a
chosen distance metric, or medoids that find the most centrally
located data point; and
• distance-based decision rules, which take a vote among the k
nearest exemplars.

These ingredients are combined in various ways to obtain supervised


and unsupervised learning algorithms.

Nearest-neighbor Classification

The most commonly used distance-based classifier is straightforward:


it simply uses each training instance as an exemplar.

Consequently, ‘training’ this classifier requires nothing more than


memorising the training data.

This extremely simple classifier is known as the nearest neighbour


classifier.

Its decision regions are made up of the cells of a Voronoi tessellation,


with piecewise linear decision boundaries selected from the Voronoi
boundaries.

Nearest-neighbor Classification
Properties of the nearest-neighbor classifier:

Unless the training set contains identical instances from different


classes, it will be able to separate the classes perfectly on the training
set – as it memorizes all training examples!

By choosing the right exemplars we can more or less represent any


decision boundary, or at least an arbitrarily close piecewise linear
approximation.

It follows that the nearest-neighbor classifier has low bias, but also
high variance. This suggests a risk of overfitting if the training data is
limited, noisy or unrepresentative.

Nearest-neighbor Classification

From an algorithmic point of view, training the nearest-neighbor


classifier is very fast, taking only O(n) time for storing n
exemplars.

The downside is that classifying a single instance also takes O(n)


time, as the instance will need to be compared with every
exemplar to determine which one is the nearest.

It is possible to reduce classification time at the expense of


increased training time by storing the exemplars in a more
elaborate data structure, but this tends not to scale well to large
numbers of features.

Nearest-neighbor Classification
High-dimensional instance spaces can be problematic because of the
infamous curse of dimensionality. High-dimensional spaces tend to be
extremely sparse, which means that every point is far away from
virtually every other point, and hence pairwise distances tend to be
uninformative.

However, irrespective of the curse of dimensionality, it is not simply a


matter of counting the number of features, as there are several
reasons why the effective dimensionality of the instance space may be
much smaller than the number of features. For example, some of the
features may be irrelevant and wipe out the relevant features’ signal in
the distance calculations.

In such a case it would be a good idea, before building a distance-


based model, to reduce dimensionality by performing feature
selection.

Nearest-neighbor Classification
In any case, before applying nearest-neighbor classification it is a
good idea to plot a histogram of pairwise distances of a sample to
see if they are sufficiently varied.

The nearest-neighbor method can easily be applied to regression


problems with a real-valued target variable.

The k-nearest neighbour method in its simplest form takes a vote


between the k ≥ 1 nearest exemplars of the instance to be
classified, and predicts the majority class.
We can easily turn this into a probability estimator by returning
the normalised class counts as a probability distribution over
classes.
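
A minimal k-nearest-neighbour vote with normalised class counts (a sketch, not from the slides; the toy data are made up):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # vote among the k nearest exemplars and return normalised class counts
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = [y_train[i] for i in np.argsort(dists)[:k]]
    counts = Counter(nearest)
    probs = {c: n / k for c, n in counts.items()}
    return counts.most_common(1)[0][0], probs

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
y = ["red", "red", "red", "blue", "blue"]
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))   # ('red', {'red': 1.0})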

Knn Overview

Intuition for knn

•Set of (x,y) points


•Two classes
•Is the box red or blue?
How can we do it?
-use Bayes' rule
-use a decision tree
-fit a hyperplane
Or, since the nearest points are red:
-use this as the basis of your algorithm

Nearest-neighbor Classification

If k-nearest neighbour is used for regression problems, the


obvious way to aggregate the predictions from the k neighbours
is by taking the mean value, which can again be distance-
weighted.

This would lend the model additional predictive power by


predicting values that aren’t observed among the stored
exemplars.

More generally, we can apply k-nearest neighbour to any learning problem
where we have an appropriate ‘aggregator’ for multiple target
values.

Predicting the value for x = 4

The curve is the target function; the dots represent the data.

Predicting for x = 0: kNN fails since it is outside the range of the training samples (extrapolation).
Predicting for x = 4 is interpolation, and kNN is good at it.
More neighbours don't guarantee more accuracy.

Choosing the value of K

There is no easy recipe to decide what value of k is appropriate


for a given data set.

However, it is possible to sidestep this question to some extent


by applying distance weighting to the votes: that is, the closer an
exemplar is to the instance to be classified, the more its vote
counts.

One could say that distance weighting makes k-nearest


neighbour more of a global model, while without it (and for
small k) it is more like an aggregation of local models.
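
A distance-weighted variant of the vote (a sketch, not from the slides), where closer exemplars count more:

import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-12):
    # each of the k nearest exemplars votes with weight 1 / distance
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(dists)[:k]:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
y = ["red", "red", "red", "blue", "blue"]
print(weighted_knn_predict(X, y, np.array([4.0, 4.0]), k=5))   # 'blue' (an unweighted 5-NN vote would say 'red')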

Nearest-neighbor Classification

The next figure illustrates k-nearest neighbour method on a


small data set of 20 exemplars from five different classes, for k =
3, 5, 7.

The class distribution is visualised by assigning each test point


the class of a uniformly sampled neighbour: so, in a region
where two of k = 3 neighbours are red and one is orange, the
shading is a mix of two-thirds red and one-third orange.

While for k = 3 the decision regions are still mostly discernible,


this is much less so for k = 5 and k = 7.

Nearest-neighbor Classification

Nearest-neighbor Classification

To take an extreme example: if k is equal to the number of


exemplars n, every test instance will have the same number of
neighbours and will receive the same probability vector.

We conclude that the refinement of k-nearest neighbour – the


number of different predictions it can make – initially increases
with increasing k, then decreases again.

Furthermore, we can say that the bias increases and the variance
decreases with increasing k.

Why is knn slow?

Making knn fast

Use the median and apply splits.

Classifying with the tree, and a disadvantage: missing neighbours (the true
nearest neighbour can lie just across a split).
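
One common realisation of this median-splitting idea is a k-d tree; a sketch using SciPy's cKDTree (the k-d tree itself is an assumption here, as the slides only hint at median splits):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))          # stored exemplars
tree = cKDTree(X)                        # recursively splits the data at median values

query = rng.normal(size=3)
dists, idx = tree.query(query, k=5)      # 5 nearest neighbours without a full linear scan
print(idx, dists)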

Distance-based Clustering

In a distance-based context, unsupervised learning is usually taken


to refer to clustering.

The distance-based clustering methods can be exemplar-based


and hence predictive: they naturally generalise to unseen
instances.

There can be clustering methods that are not exemplar-based and


hence descriptive.

Distance-based Clustering

Predictive distance-based clustering methods use the same


ingredients as distance-based classifiers: a distance metric, a way
to construct exemplars and a distance-based decision rule.

In the absence of an explicit target variable, the assumption is


that the distance metric indirectly encodes the learning target,
so that we aim to find clusters that are compact with respect to
the distance metric.

This requires a notion of cluster compactness that can serve as


the optimization criterion.

Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)

Hard vs. soft clustering


• Hard clustering: Each document belongs to exactly one cluster
– More common and easier to do
• Soft clustering: A document can belong to more than one
cluster.
– Makes more sense for applications like creating browsable
hierarchies
– You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
– You can only do that with a soft clustering approach.

Distance-based Clustering
K-means algorithm

The K-means problem is NP-hard, which means that there is no known
efficient algorithm to find the global minimum, and we need to resort
to a heuristic algorithm.

The best known algorithm is usually also called K-means,


although the name ‘Lloyd’s algorithm’ is also used.

The outline of the algorithm is given in the next slide. The


algorithm iterates between partitioning the data using the
nearest-centroid decision rule, and recalculating centroids from
a partition.

K-means algorithm

Cluster assignment step:
c(i) := argmin_k ||x(i) − μ_k||²

Move centroid step:
e.g., μ_2 = (1/3)(x(1) + x(3) + x(5)) if examples 1, 3 and 5 are assigned to cluster 2

K-means algorithm
Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K

Repeat {
  for i = 1 to m:
    c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
  for k = 1 to K:
    μ_k := average (mean) of the points assigned to cluster k
}
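
A compact numpy sketch of this algorithm (not from the slides; the toy data and the empty-cluster guard are assumptions):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # K random training points
    for _ in range(n_iter):
        # cluster assignment step: c(i) = index of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # move centroid step: mu_k = mean of the points assigned to cluster k
        new_centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
                                  for k in range(K)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return c, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
print(centroids)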


K-means for non-separated clusters

Example: T-shirt sizing, clustering customers on weight vs. height.

K-means optimization objective
c(i) = index of the cluster (1, 2, ..., K) to which example x(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ R^n)
μ_c(i) = cluster centroid of the cluster to which example x(i) has been assigned
Optimization objective:
J(c(1), ..., c(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^{m} ||x(i) − μ_c(i)||²
K-means algorithm


Termination conditions
• Several possibilities, e.g.,
– A fixed number of iterations.
– Doc partition unchanged.
– Centroid positions don’t change.

Does this mean that the docs in a


cluster are unchanged?


Convergence
• Why should the K-means algorithm ever reach a
fixed point?
– A state in which clusters don’t change.
• K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
– EM is known to converge.
– The number of iterations could be large.
– But in practice it usually isn't.


Time Complexity
• Computing distance between two docs is
O(M) where M is the dimensionality of the
vectors.
• Reassigning clusters: O(KN) distance
computations, or O(KNM).
• Computing centroids: Each doc gets added
once to some centroid: O(NM).
• Assume these two steps are each done once
for I iterations: O(IKNM).


Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to
sub-optimal clusterings.
– Select good seeds using a heuristic (e.g., the doc least similar to any
existing mean)
– Try out multiple starting points
– Initialize with the results of another method.

Example showing sensitivity to seeds: in the figure, if you start with B and E
as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F
you converge to {A,B,D,E} and {C,F}.

Random initialization
Should have K < m (fewer clusters than training examples).

Randomly pick K training examples.

Set μ_1, ..., μ_K equal to these examples.

Local optima


Random initialization
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means. Get c(1), ..., c(m), μ_1, ..., μ_K.
  Compute the cost function (distortion) J(c(1), ..., c(m), μ_1, ..., μ_K)
}
Pick the clustering that gave the lowest cost J.
What is the right value of K?


Choosing the value of K

Elbow method: plot the cost function J against the number of clusters
(e.g., K = 1 to 8) and look for an 'elbow' where the cost stops dropping
sharply; in some cases the curve decreases smoothly and no clear elbow
can be identified.
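
A sketch of the elbow method (not from the slides; scikit-learn assumed): compute the cost for a range of K and look where it stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.4, size=(40, 2)) for m in ([0, 0], [4, 4], [0, 4])])

for K in range(1, 9):
    cost = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    print(K, round(cost, 1))   # the drop should flatten out around the 'elbow'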

Choosing the value of K
Sometimes, you're running K-means to get clusters to use for some
later/downstream purpose. Evaluate K-means based on a metric for how
well it performs for that later purpose.

E.g., T-shirt sizing: cluster customers on weight vs. height, and compare the
clusterings obtained for different K (such as three sizes vs. five sizes) by how
well they serve the sizing decision.

K-means Algorithm


Example - Clustering MLM data

When we run K-means on the MLM data with K = 3, we


obtain the clusters {Associations, Trees, Rules}, {GMM, naive
Bayes}, and a larger cluster with the remaining data points.

When we run it with K = 4, we get that the large cluster splits


into two: {kNN, Linear Classifier, Linear Regression} and {K-
means, Logistic Regression, SVM}; but also that GMM gets
reallocated to the latter cluster, and naive Bayes ends up as a
singleton.

MLM data clustered by K-means with K = 3 and K = 4.

Example - Clustering MLM data


It can be shown that one iteration of K-means can never increase
the within-cluster scatter, from which it follows that the
algorithm will reach a stationary point: a point where no further
improvement is possible. It is worth noting that even the
simplest data set will have many stationary points.

In general, while K-means converges to a stationary point in


finite time, no guarantees can be given about whether the
convergence point is in fact the global minimum, or if not, how
far we are from it.

In practice it is advisable to run the algorithm a number of times


and select the solution with the smallest within-cluster scatter.

Example - Stationary points in clustering

Consider the task of dividing the set of numbers {8,44,50,58,84}


into two clusters. There are four possible partitions that 2-means
can find: {8}, {44,50,58,84}; {8,44}, {50,58, 84}; {8,44, 50}, {58,
84}; and {8,44,50,58}, {84}.

It is easy to verify that each of these establishes a stationary


point for 2-means, and hence will be found with a suitable
initialisation.

Only the first clustering is optimal; i.e., it minimises the total


within-cluster scatter.
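
The within-cluster scatters of the four partitions can be checked directly (a sketch, not from the slides):

import numpy as np

def scatter(cluster):
    # within-cluster scatter: sum of squared distances to the cluster mean
    c = np.array(cluster, dtype=float)
    return float(np.sum((c - c.mean()) ** 2))

partitions = [([8], [44, 50, 58, 84]),
              ([8, 44], [50, 58, 84]),
              ([8, 44, 50], [58, 84]),
              ([8, 44, 50, 58], [84])]
for a, b in partitions:
    print(a, b, scatter(a) + scatter(b))
# the first partition has the smallest total scatter, so it is the optimal 2-means clustering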

Distance-based Clustering

Clustering around medoids

It is straightforward to adapt the K-means algorithm to use a


different distance metric.
K-medoids algorithm additionally requires the exemplars to be
data points.

Calculating the medoid of a cluster requires examining all pairs


of points – whereas calculating the mean requires just a single
pass through the points – which can be prohibitive for large data
sets.

Hierarchical Clustering

Dendrograms are tree diagrams which are purely defined in


terms of a distance measure.

Because dendrograms use features only indirectly, as the basis


on which the distance measure is calculated, they partition the
given data rather than the entire instance space, and hence
represent a descriptive clustering rather than a predictive one.

Hierarchical clustering

Definition – Dendrogram

Given a data set D, a dendrogram is a binary tree with the


elements of D at its leaves.
An internal node of the tree represents the subset of elements in
the leaves of the subtree rooted at that node.
The level of a node is the distance between the two clusters
represented by the children of the node.
Leaves have level 0.

Dendrogram: Hierarchical Clustering

– Clustering obtained
by cutting the
dendrogram at a
desired level: each
connected
component forms a
cluster.

• The decision on the number of clusters that best depicts the different
groups can be made by observing the dendrogram.
• The best choice for the number of clusters is the number of vertical lines
in the dendrogram cut by a horizontal line that can traverse the maximum
vertical distance without intersecting a cluster.

Example - Hierarchical clustering of MLM data


A hierarchical clustering of the MLM data is given in the next slide.
The tree shows that the three logical methods at the top form a strong
cluster.

If we wanted three clusters, we get the logical cluster, a second small


cluster {GMM, naive Bayes}, and the remainder.

If we wanted four clusters, we would separate GMM and naive Bayes,


as the tree indicates this cluster is the least tight of the three.

If we wanted five clusters, we would construct {Linear Regression,
Linear Classifier} as a separate cluster.

This illustrates the key advantage of hierarchical clustering: it doesn’t


require fixing the number of clusters in advance.

Example - Hierarchical clustering of MLM data

A dendrogram constructed by hierarchical clustering from the data

Hierarchical clustering

Taking cluster means as exemplars assumes Euclidean distance,


and we may want to use one of the other distance metrics.

This has led to the introduction of the linkage function, which is


a general way to turn pairwise point distances into pairwise
cluster distances.

Definition - Linkage function


A linkage function L : 2X × 2X → R calculates the distance between
arbitrary subsets of the instance space, given a distance metric
Dis : X × X →R.

Hierarchical clustering

The most common linkage functions are as follows:

Single linkage defines the distance between two clusters


as the smallest pairwise distance between
elements from each cluster.
Complete linkage defines the distance between two clusters
as the largest point-wise distance.
Average linkage defines the cluster distance as the average
point-wise distance.
Centroid linkage defines the cluster distance as the point
distance between the cluster means.

Hierarchical clustering
These linkage functions can be defined mathematically as follows:

Single linkage:   L_single(A, B)   = min { Dis(x, y) | x ∈ A, y ∈ B }
Complete linkage: L_complete(A, B) = max { Dis(x, y) | x ∈ A, y ∈ B }
Average linkage:  L_average(A, B)  = (1 / (|A|·|B|)) Σ_{x ∈ A} Σ_{y ∈ B} Dis(x, y)
Centroid linkage: L_centroid(A, B) = Dis(μ_A, μ_B), where μ_A and μ_B are the
                  means of clusters A and B.
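
These definitions translate directly into code (a sketch, not from the slides; scipy's cdist is assumed for the pairwise distances):

import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):   return cdist(A, B).min()    # smallest pairwise distance
def complete_link(A, B): return cdist(A, B).max()    # largest pairwise distance
def average_link(A, B):  return cdist(A, B).mean()   # average pairwise distance
def centroid_link(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
# 3.0 6.0 4.5 4.5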

Hierarchical Agglomerative Clustering (HAC)
The general algorithm to build a dendrogram is given in the next slide.

The tree is built from the data points upwards and is hence a bottom–up or
agglomerative algorithm.

At each iteration the algorithm constructs a new partition of the data by


merging the two nearest clusters together.
In general, the HAC algorithm gives different results when different linkage
functions are used.
Hierarchical clustering using single linkage can essentially be done by
calculating and sorting all pairwise distances between data points, which
requires O(n²) time for n points.

The other linkage functions require at least O(n² log n). The un-optimised
algorithm has time complexity O(n³).
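
In practice the whole HAC procedure is available in SciPy (assumed here); a sketch using single linkage and a cut into three clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(10, 2)) for m in ([0, 0], [3, 3], [0, 3])])

Z = linkage(X, method="single")                    # also: "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)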


Computational Complexity
• In the first iteration, all HAC methods need to compute the
similarity of all pairs of N initial instances, which is O(N²).
• In each of the subsequent N−2 merging iterations, compute the
distance between the most recently created cluster and all other
existing clusters.
• In order to maintain an overall O(N²) performance, computing the
similarity to each other cluster must be done in constant time.
– Often O(N³) if done naively, or O(N² log N) if done more cleverly.

Hierarchical clustering

• Input distance matrix (L = 0 for all clusters)
• The nearest pair of cities is MI and TO, at
distance 138. These are merged into a single
cluster called "MI/TO". The level of the new
cluster is L(MI/TO) = 138 and the new sequence
number is m = 1
• Then we compute the distance from this new
compound object to all other objects.
• In single link clustering the rule is that the
distance from the compound object to another
object is equal to the shortest distance from any
member of the cluster to the outside object.
• So the distance from "MI/TO" to RM is chosen to
be 564, which is the distance from MI to RM, and
so on.
• After merging MI with TO we obtain the following matrix.

• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster
called NA/RM; L(NA/RM) = 219, m = 2

• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new
cluster called BA/NA/RM; L(BA/NA/RM) = 255, m = 3

• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a
new cluster called BA/FI/NA/RM; L(BA/FI/NA/RM) = 268, m = 4

• Finally, we merge the last two clusters at level 295.

