Distance Based Models
Why distance measures?
Introduction
There are many ways to measure distance.
(Minkowski distance). If X = R^d, the Minkowski distance of order p > 0 is defined as
Dis_p(x, y) = ( Σ_{j=1}^{d} |x_j − y_j|^p )^{1/p} = ||x − y||_p,
where ||z||_p = ( Σ_{j=1}^{d} |z_j|^p )^{1/p} is the p-norm (or L_p norm) of the vector z.
The 1-norm denotes Manhattan distance, also called cityblock distance:
Dis_1(x, y) = Σ_{j=1}^{d} |x_j − y_j| = ||x − y||_1.
The 0-norm (or L0 norm) counts the number of non-zero elements in a vector. The corresponding distance counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance; however, we can define it as
Dis_0(x, y) = Σ_{j=1}^{d} (x_j − y_j)^0 = Σ_{j=1}^{d} I[x_j ≠ y_j]
(using the convention 0^0 = 0), i.e. the Hamming distance.
For example, consider a login attempt as two vectors (username and password) compared component-wise with the stored credentials. If the L0 distance between the vectors is 0, the login is successful. If the L0 distance is 1, either the username or the password is incorrect, but not both. And if the L0 distance is 2, both the username and the password are incorrect.
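A minimal sketch of this login check in plain Python (the credential values here are made up; only the counting of mismatched positions matters):

def l0_distance(x, y):
    # Number of positions in which the two vectors differ (Hamming / L0 "distance").
    return sum(1 for a, b in zip(x, y) if a != b)

stored   = ("alice", "s3cret")     # hypothetical stored credentials
supplied = ("alice", "wrong")      # hypothetical login attempt

mismatches = l0_distance(stored, supplied)
if mismatches == 0:
    print("login successful")
elif mismatches == 1:
    print("username or password incorrect, but not both")
else:
    print("both username and password incorrect")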
Example of various distances
You can't go directly: the cityblock route has to follow the grid.
For example, for X = [3, 4] the L1 norm is ||X||_1 = |3| + |4| = 7, whereas the Euclidean (L2) length of the same vector is 5.
L-infinity norm:
Gives the largest magnitude among the elements of a vector.
For the vector X = [-6, 4, 2], the L-infinity norm is 6.
In the L-infinity norm, only the element with the largest magnitude has any effect.
So, for example, if your vector holds the costs of constructing a set of buildings, minimizing the L-infinity norm reduces the cost of the most expensive building.
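A quick NumPy check of the norms discussed above, using the example vectors:

import numpy as np

x = np.array([3, 4])
v = np.array([-6, 4, 2])

print(np.sum(np.abs(x)))      # L1 norm of [3, 4]: |3| + |4| = 7
print(np.linalg.norm(x))      # L2 (Euclidean) norm of [3, 4]: 5.0
print(np.max(np.abs(v)))      # L-infinity norm of [-6, 4, 2]: 6
print(np.count_nonzero(v))    # L0 "norm" of [-6, 4, 2]: 3 non-zero elements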
Distance metric: Desired properties
Given an instance space X, a distance metric is a function
Dis : X×X → R such that for any x, y, z ∈ X:
1. distances between a point and itself are zero: Dis(x,x) = 0;
2. all other distances are larger than zero: if x ≠ y then Dis(x,y) > 0;
3. distances are symmetric: Dis(y,x) = Dis(x,y);
4. detours cannot shorten the distance (triangle inequality):
Dis(x,z) ≤ Dis(x,y) + Dis(y,z).
If the second condition is weakened to a non-strict inequality – i.e.,
Dis(x,y) may be zero even if x ≠ y – the function Dis is called a
pseudo-metric.
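These properties can be spot-checked numerically. The sketch below (our own helper, not a library routine) verifies the four axioms for Euclidean distance on a handful of sample points:

import itertools
import numpy as np

def check_metric(dis, points, tol=1e-9):
    # Verify the four metric axioms on every triple of sample points.
    for x, y, z in itertools.product(points, repeat=3):
        assert dis(x, x) <= tol                           # 1. Dis(x, x) = 0
        if not np.array_equal(x, y):
            assert dis(x, y) > 0                          # 2. positivity
        assert abs(dis(x, y) - dis(y, x)) <= tol          # 3. symmetry
        assert dis(x, z) <= dis(x, y) + dis(y, z) + tol   # 4. triangle inequality
    return True

pts = [np.array(p, dtype=float) for p in [(0, 0), (1, 2), (3, 1), (-2, 4)]]
print(check_metric(lambda a, b: np.linalg.norm(a - b), pts))   # True for Euclidean distance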
Non-Metric Distance Functions
For example, the basic linear classifier uses the two class means μ⊕ and μ⊖ as exemplars: a summary of all we need to know about the training data in order to build the classifier.
Neighbours and Exemplars
Once we have determined the exemplars, the basic linear
classifier constructs the decision boundary as the perpendicular
bisector of the line segment connecting the two exemplars.
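A sketch of that construction in NumPy (the class arrays and test points below are made up for illustration):

import numpy as np

def basic_linear_classifier(X_pos, X_neg):
    # Exemplars: the two class means.
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Decision boundary: perpendicular bisector of the segment joining them,
    # i.e. w.x = t with w the difference of the means and t set at the midpoint.
    w = mu_pos - mu_neg
    t = ((mu_pos + mu_neg) / 2) @ w
    return lambda x: +1 if x @ w > t else -1

clf = basic_linear_classifier(np.array([[2., 2.], [3., 3.]]),   # positive examples
                              np.array([[0., 0.], [1., 0.]]))   # negative examples
print(clf(np.array([2.5, 2.0])), clf(np.array([0.2, 0.1])))     # +1 -1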
Nearest-neighbor Classification
Properties of the nearest-neighbor classifier:
Because it simply memorizes the training examples and its decision boundary can be arbitrarily complex, the nearest-neighbor classifier has low bias, but also high variance. This suggests a risk of overfitting if the training data is limited, noisy or unrepresentative.
High-dimensional instance spaces can be problematic because of the infamous curse of dimensionality. High-dimensional spaces tend to be extremely sparse, which means that every point is far away from virtually every other point, and hence pairwise distances tend to be uninformative.
In any case, before applying nearest-neighbor classification it is a
good idea to plot a histogram of pairwise distances of a sample to
see if they are sufficiently varied.
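For example, with SciPy and matplotlib (X below stands in for your own data sample):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

X = np.random.rand(500, 50)        # placeholder for your data sample
d = pdist(X, metric='euclidean')   # all pairwise distances in the sample
plt.hist(d, bins=50)
plt.xlabel('pairwise Euclidean distance')
plt.ylabel('count')
plt.show()   # a very narrow, peaked histogram suggests distances carry little information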
kNN Overview
Nearest-neighbor Classification
Predicting the value for x = 4 is interpolation, and kNN is good at it (see the sketch below).
Predicting for x = 0 fails, since it lies outside the range of the training samples (extrapolation).
More neighbours don't guarantee more accuracy.
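A small sketch of this behaviour with scikit-learn's KNeighborsRegressor on a made-up 1-D dataset:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training data: y = 2x for x = 1..8
X_train = np.arange(1, 9, dtype=float).reshape(-1, 1)
y_train = 2 * X_train.ravel()

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[4.0]]))   # interpolation: neighbours 3,4,5 give (6+8+10)/3 = 8
print(knn.predict([[0.0]]))   # extrapolation: neighbours 1,2,3 give (2+4+6)/3 = 4, far off the trend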
Choosing the value of K
As k increases, the bias increases and the variance decreases.
Why is kNN slow?
A naive implementation must compute the distance from the query point to every training example at prediction time, i.e. O(N·d) work per query.
Making kNN fast
Use the median and apply splits: recursively split the training data along one coordinate at its median value (a k-d tree), so that a query only has to search the nearby branches of the tree.
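SciPy's cKDTree implements this idea (with its default balanced_tree=True it splits each node at the median); a usage sketch on random data:

import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 3)
tree = cKDTree(X_train)                          # built once, with recursive median splits
dist, idx = tree.query([0.5, 0.5, 0.5], k=5)     # fast query for the 5 nearest neighbours
print(idx, dist)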
Distance-based Clustering
Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K-means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)
K-means algorithm
Cluster assignment step: c(i) := argmin_k ||x(i) − μ_k||, i.e. each example x(i) is assigned to its closest cluster centroid μ_k.
K-means algorithm
Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K
Repeat {
  for i = 1 to m:
    c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
  for k = 1 to K:
    μ_k := average (mean) of the points x(i) assigned to cluster k
}
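A minimal NumPy sketch of these two alternating steps (our own implementation, for illustration only):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize the K centroids at K distinct training points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: index of the closest centroid for each x(i).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step: mean of the points assigned to each cluster
        # (keep the old centroid if a cluster ends up empty).
        centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
                              for k in range(K)])
    return c, centroids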
[Figure: T-shirt sizing example, clustering customers by Weight and Height.]
K-means optimization objective
c(i) = index of the cluster (1, 2, …, K) to which example x(i) is currently assigned
μ_k = cluster centroid k (a vector in R^n)
μ_c(i) = cluster centroid of the cluster to which example x(i) has been assigned
Optimization objective (the distortion):
J(c(1), …, c(m), μ_1, …, μ_K) = (1/m) Σ_{i=1}^{m} ||x(i) − μ_c(i)||²
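In code the objective is a one-liner; this sketch reuses the variable names from the kmeans sketch above:

import numpy as np

def distortion(X, c, centroids):
    # J = (1/m) * sum_i ||x(i) - mu_c(i)||^2
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))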
K-means algorithm
Sec. 16.4
Termination conditions
• Several possibilities, e.g.,
– A fixed number of iterations.
– Doc partition unchanged.
– Centroid positions don’t change.
Sec. 16.4
Convergence
• Why should the K-means algorithm ever reach a
fixed point?
– A state in which clusters don’t change.
• K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
– EM is known to converge.
– The number of iterations could be large, but in practice it usually isn't.
Sec. 16.4
Time Complexity
• Computing distance between two docs is
O(M) where M is the dimensionality of the
vectors.
• Reassigning clusters: O(KN) distance
computations, or O(KNM).
• Computing centroids: Each doc gets added
once to some centroid: O(NM).
• Assume these two steps are each done once
for I iterations: O(IKNM).
Sec. 16.4
Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
– Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
– Try out multiple starting points
– Initialize with the results of another method.
[Figure: example showing sensitivity to seeds. In the example, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.]
Random initialization
Should have K < m (fewer clusters than training examples); a common choice is to randomly pick K distinct training examples and set the centroids μ_1, …, μ_K equal to them.
Local optima
Depending on the initialization, K-means can converge to a local optimum of the cost J; the usual remedy is to run it several times from different random initializations and keep the best result.
Random initialization
For i = 1 to 100 {
  Randomly initialize the K centroids.
  Run K-means to get c(1), …, c(m) and μ_1, …, μ_K.
  Compute the cost function (distortion) J.
}
Pick the clustering that gave the lowest cost J.
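A sketch of that loop, assuming the kmeans and distortion helpers sketched earlier in these notes (the data here is a placeholder):

import numpy as np

X = np.random.rand(200, 2)                     # placeholder data
best_J, best = np.inf, None
for i in range(100):
    c, centroids = kmeans(X, K=3, seed=i)      # a different random initialization each run
    J = distortion(X, c, centroids)
    if J < best_J:                             # keep the clustering with the lowest cost
        best_J, best = J, (c, centroids)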
What is the right value of K?
[Figure: the cost function J plotted against the number of clusters K = 1…8, for two example datasets.]
Choosing the value of K
Sometimes you're running K-means to get clusters to use for some later/downstream purpose. In that case, evaluate K-means based on a metric for how well it performs for that later purpose.
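When no such downstream metric exists, a common heuristic is the elbow method suggested by the plot above: compute the cost for a range of K and look for a kink. A sketch with scikit-learn on synthetic data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                      # placeholder data
Ks = range(1, 9)
J = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in Ks]
plt.plot(list(Ks), J, 'o-')
plt.xlabel('K')
plt.ylabel('cost J (inertia)')
plt.show()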
[Figure: T-shirt sizing, Weight vs. Height scatter plots clustered with different numbers of clusters K.]
K-means Algorithm
[Figure: K-means clustering of the MLM data with K = 3 and K = 4.]
Example - Stationary points in clustering
Hierarchical Clustering
Definition – Dendrogram: a tree whose leaves are the data points and whose internal nodes represent successive cluster merges, drawn at a height proportional to the distance between the merged clusters. A clustering is obtained by cutting the dendrogram at a desired level: each connected component then forms a cluster.
• The number of clusters that best depicts the different groups can be chosen by observing the dendrogram.
• The best choice is the number of vertical lines intersected by a horizontal line that can traverse the maximum vertical distance without crossing a merge point (see the sketch below).
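A sketch with SciPy: build the dendrogram, then cut it at a chosen height (data and threshold are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)
Z = linkage(X, method='single')                       # agglomerative clustering, single linkage
labels = fcluster(Z, t=0.3, criterion='distance')     # cut the dendrogram at height 0.3
print(labels)                                         # each connected component becomes a cluster
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself (needs matplotlib)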
Example - Hierarchical clustering of MLM data
Hierarchical clustering
The linkage function measures the distance between two clusters A and B. The common linkage functions can be defined mathematically as follows (a code sketch follows the list):
– Single linkage: L(A, B) = min over x ∈ A, y ∈ B of Dis(x, y) (closest pair);
– Complete linkage: L(A, B) = max over x ∈ A, y ∈ B of Dis(x, y) (farthest pair);
– Average linkage: L(A, B) = the mean of Dis(x, y) over all pairs x ∈ A, y ∈ B.
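A direct translation of these definitions for two clusters A and B given as arrays of points (helper names are ours):

import numpy as np
from scipy.spatial.distance import cdist

def single_linkage(A, B):   return cdist(A, B).min()    # distance of the closest pair
def complete_linkage(A, B): return cdist(A, B).max()    # distance of the farthest pair
def average_linkage(A, B):  return cdist(A, B).mean()   # mean of all pairwise distances

A = np.array([[0., 0.], [1., 0.]])
B = np.array([[3., 0.], [5., 0.]])
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))   # 2.0 5.0 3.5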
Hierarchical Agglomerative Clustering (HAC)
The general algorithm to build a dendrogram is sketched below.
The tree is built from the data points upwards and is hence a bottom-up or agglomerative algorithm.
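A naive sketch of that bottom-up procedure (O(N^3) as written, matching the "naive" case in the complexity discussion below; real implementations maintain the cluster distances more cleverly):

import numpy as np
from scipy.spatial.distance import cdist

def naive_hac(X, linkage=lambda A, B: cdist(A, B).min()):   # single linkage by default
    clusters = [[i] for i in range(len(X))]     # start with every point in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(X[clusters[i]], X[clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))   # one dendrogram node per merge
        clusters[i] = clusters[i] + clusters[j]                    # merge cluster j into i ...
        del clusters[j]                                            # ... and drop j
    return merges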
Sec. 17.2.1
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
• In each of the subsequent N−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
– Often O(N³) if done naively, or O(N² log N) if done more cleverly.
Hierarchical clustering
• Input distance matrix (L = 0 for all clusters)
• The nearest pair of cities is MI and TO, at
distance 138. These are merged into a single
cluster called "MI/TO". The level of the new
cluster is L(MI/TO) = 138 and the new sequence
number is m = 1
• Then we compute the distance from this new
compound object to all other objects.
• In single link clustering the rule is that the
distance from the compound object to another
object is equal to the shortest distance from any
member of the cluster to the outside object.
• So the distance from "MI/TO" to RM is chosen to
be 564, which is the distance from MI to RM, and
so on.
• After merging MI with TO we obtain the updated distance matrix, computed with this rule (a code sketch of the update follows).
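That single-link update is just an element-wise minimum; a tiny sketch using the numbers from the example (the TO-RM distance here is illustrative):

# Distances from the example: MI-TO = 138, MI-RM = 564; TO-RM is illustrative.
d = {('MI', 'TO'): 138, ('MI', 'RM'): 564, ('TO', 'RM'): 669}

# Single-link rule: the distance from the merged cluster MI/TO to any outside
# object is the minimum of its members' distances to that object.
d_mito_rm = min(d[('MI', 'RM')], d[('TO', 'RM')])
print(d_mito_rm)   # 564, as stated above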