Similarity-Based Prediction and Validation


SIMILARITY-BASED LEARNING AND
PERFORMANCE MEASURES FOR
CATEGORICAL TARGETS

Machine Learning, S1 IF, IT Del, 2020
WHAT IS SIMILARITY-BASED LEARNING?
Similarity-based approaches to machine learning come from the idea
that the best way to make a prediction is simply to look at what has
worked well in the past and predict the same thing again.
The fundamental concepts required to build a system based on this idea
are feature spaces and measures of similarity, and these are covered
in the fundamentals section of this chapter. These concepts allow us to
understand the standard approach to building similarity-based models:
the nearest neighbor algorithm. After covering the standard algorithm,
we then look at extensions and variations that allow us to handle noisy
data (the k nearest neighbor, or k-NN, algorithm), to make predictions
more efficiently (k-d trees), to predict continuous targets, and to handle
different kinds of descriptive features with varying measures of
similarity.
BIG IDEA
The process of classifying an unknown animal by matching its features
against the features of known animals neatly encapsulates the big idea underpinning
similarity-based learning: if you are trying to make a prediction for a current situation,
then you should search your memory to find situations that are similar to the current one
and make a prediction based on what was true for the most similar situation in your
memory.
FUNDAMENTALS OF SIMILARITY-BASED
LEARNING: THE FEATURE SPACE
As the name similarity-based learning suggests, a key component of this approach to
prediction is defining a computational measure of similarity between instances. Often this
measure of similarity is actually some form of distance measure.
A consequence of this, and a somewhat less obvious requirement of similarity-based
learning, is that if we are going to compute distances between instances, we need to
have a concept of space in the representation of the domain used by our model. In this
section we introduce the concept of a feature space as a representation for a
training dataset and then illustrate how we can compute measures of
similarity between instances in a feature space.
We can represent this dataset in a feature space by taking each of the descriptive
features to be the axes of a coordinate system. We can then place each instance within
the feature space based on the values of its descriptive features.
EXAMPLE
The SPEED and AGILITY ratings for 20 college athletes and whether they
were drafted by a professional team:

ID   SPEED   AGILITY   DRAFT
1    2.5     6.0       no
2    3.75    8.0       no
3    2.25    5.5       no
4    3.25    8.25      no
5    2.75    7.5       no
6    4.5     5.0       no
7    3.5     5.25      no
8    3.0     3.25      no
9    4.0     4.0       no
10   4.25    3.75      no
11   2.0     2.0       no
12   5.0     2.5       no
13   8.25    8.5       no
14   5.75    8.75      yes
15   4.75    6.25      yes
16   5.5     6.75      yes
17   5.25    9.5       yes
18   7.0     4.25      yes
19   7.5     8.0       yes
20   7.25    5.75      yes

Feature space plot: [scatter plot of the instances with SPEED and
AGILITY as the coordinate axes, labeled by DRAFT]
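A minimal sketch (not from the original slides) that recreates the feature
space plot in Python: SPEED and AGILITY become coordinate axes, with one
point per athlete, marked by the DRAFT target level.

import matplotlib.pyplot as plt

# (SPEED, AGILITY, DRAFT) for the 20 college athletes in the table above.
athletes = [
    (2.5, 6.0, "no"), (3.75, 8.0, "no"), (2.25, 5.5, "no"),
    (3.25, 8.25, "no"), (2.75, 7.5, "no"), (4.5, 5.0, "no"),
    (3.5, 5.25, "no"), (3.0, 3.25, "no"), (4.0, 4.0, "no"),
    (4.25, 3.75, "no"), (2.0, 2.0, "no"), (5.0, 2.5, "no"),
    (8.25, 8.5, "no"), (5.75, 8.75, "yes"), (4.75, 6.25, "yes"),
    (5.5, 6.75, "yes"), (5.25, 9.5, "yes"), (7.0, 4.25, "yes"),
    (7.5, 8.0, "yes"), (7.25, 5.75, "yes"),
]

# Plot each target level with its own marker so the two classes are visible.
for level, marker in [("no", "^"), ("yes", "o")]:
    xs = [s for s, a, d in athletes if d == level]
    ys = [a for s, a, d in athletes if d == level]
    plt.scatter(xs, ys, marker=marker, label=f"DRAFT = {level}")

plt.xlabel("SPEED")
plt.ylabel("AGILITY")
plt.legend()
plt.show()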
MEASURING SIMILARITY USING
DISTANCE METRICS
The simplest way to measure the similarity between two instances,
a and b, in a dataset is to measure the distance between the instances in a feature
space. We can use a distance metric to do this: metric(a, b) is a function that returns the
distance between two instances a and b. Mathematically, a metric must conform to the
following four criteria:
1. Non-negativity: metric(a, b) ≥ 0
2. Identity: metric(a, b) = 0 ⇔ a = b
3. Symmetry: metric(a, b) = metric(b, a)
4. Triangular inequality: metric(a, b) ≤ metric(a, c) + metric(c, b)
1. Euclidean distance:
   $\mathrm{Euclidean}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{m} \left(\mathbf{a}[i] - \mathbf{b}[i]\right)^2}$
2. Manhattan distance:
   $\mathrm{Manhattan}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{m} \mathrm{abs}\left(\mathbf{a}[i] - \mathbf{b}[i]\right)$
3. Minkowski distance:
   $\mathrm{Minkowski}(\mathbf{a}, \mathbf{b}) = \left(\sum_{i=1}^{m} \mathrm{abs}\left(\mathbf{a}[i] - \mathbf{b}[i]\right)^p\right)^{1/p}$
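A minimal sketch of these three metrics in Python (not from the original
slides; function names are illustrative):

import math

def euclidean(a, b):
    # Square root of the sum of squared differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def minkowski(a, b, p):
    # Generalizes both: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

# Example with two athletes from the dataset: d12 = (5, 2.5) and d5 = (2.75, 7.5).
print(euclidean((5, 2.5), (2.75, 7.5)))     # 5.48...
print(manhattan((5, 2.5), (2.75, 7.5)))     # 7.25
print(minkowski((5, 2.5), (2.75, 7.5), 2))  # same as Euclidean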
THE NEAREST NEIGHBOR ALGORITHM
EUCLIDEAN DISTANCE
The distances (Dist.) between the query instance with SPEED = 6.75 and AGILITY = 3.00
and each instance in the training dataset:

ID   SPEED   AGILITY   X1 = SPEED - 6.75   X2 = AGILITY - 3.00   X1²     X2²     Dist. = √(X1² + X2²)
1    2.5     6.0       -4.25               3.0                   18.06    9.00   5.20
2    3.75    8.0       -3.0                5.0                    9.00   25.00   5.83
3    2.25    5.5       -4.5                2.5                   20.25    6.25   5.15
4    3.25    8.25      -3.5                5.25                  12.25   27.56   6.31
5    2.75    7.5       -4.0                4.5                   16.00   20.25   6.02
6    4.5     5.0       -2.25               2.0                    5.06    4.00   3.01
7    3.5     5.25      -3.25               2.25                  10.56    5.06   3.95
8    3.0     3.25      -3.75               0.25                  14.06    0.06   3.76
9    4.0     4.0       -2.75               1.0                    7.56    1.00   2.93
10   4.25    3.75      -2.5                0.75                   6.25    0.56   2.61
11   2.0     2.0       -4.75               -1.0                  22.56    1.00   4.85
12   5.0     2.5       -1.75               -0.5                   3.06    0.25   1.82
13   8.25    8.5       1.5                 5.5                    2.25   30.25   5.70
14   5.75    8.75      -1.0                5.75                   1.00   33.06   5.84
15   4.75    6.25      -2.0                3.25                   4.00   10.56   3.82
16   5.5     6.75      -1.25               3.75                   1.56   14.06   3.95
17   5.25    9.5       -1.5                6.5                    2.25   42.25   6.67
18   7.0     4.25      0.25                1.25                   0.06    1.56   1.27
19   7.5     8.0       0.75                5.0                    0.56   25.00   5.06
20   7.25    5.75      0.5                 2.75                   0.25    7.56   2.80
SORTED BY EUCLIDEAN DISTANCE

The nearest neighbor prediction algorithm creates a set of local models, or
neighborhoods, across the feature space, where each model is defined by a subset
of the training dataset (in this case, one instance). The decision boundary is the
boundary between regions of the feature space in which different target levels will
be predicted. We can generate the decision boundary by aggregating the
neighboring local models (in this case, Voronoi regions) that make the same
prediction. When the distances between the query instance and each training
instance are ranked from lowest to highest, we can see that the nearest neighbor
to the query is instance d18, with a distance of 1.2749 and a target level of yes.
For 3-NN, the nearest neighbors to the query are instances d18, d12, and d10,
and the majority vote over the target levels of these three instances is no.
How about 5-NN?
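As a check, a minimal k-NN sketch in Python (not from the original slides)
that ranks the training instances by Euclidean distance to the query
(SPEED = 6.75, AGILITY = 3.00) and takes a majority vote over the k nearest:

import math
from collections import Counter

# Athlete dataset from the table above: ID -> ((SPEED, AGILITY), DRAFT).
train = {
    1: ((2.5, 6.0), "no"),    2: ((3.75, 8.0), "no"),   3: ((2.25, 5.5), "no"),
    4: ((3.25, 8.25), "no"),  5: ((2.75, 7.5), "no"),   6: ((4.5, 5.0), "no"),
    7: ((3.5, 5.25), "no"),   8: ((3.0, 3.25), "no"),   9: ((4.0, 4.0), "no"),
    10: ((4.25, 3.75), "no"), 11: ((2.0, 2.0), "no"),   12: ((5.0, 2.5), "no"),
    13: ((8.25, 8.5), "no"),  14: ((5.75, 8.75), "yes"), 15: ((4.75, 6.25), "yes"),
    16: ((5.5, 6.75), "yes"), 17: ((5.25, 9.5), "yes"), 18: ((7.0, 4.25), "yes"),
    19: ((7.5, 8.0), "yes"),  20: ((7.25, 5.75), "yes"),
}

def knn_predict(query, k):
    # Rank all training instances by Euclidean distance to the query.
    ranked = sorted(
        (math.dist(query, x), target) for x, target in train.values()
    )
    # Majority vote over the target levels of the k nearest neighbors.
    votes = Counter(target for _, target in ranked[:k])
    return votes.most_common(1)[0][0]

query = (6.75, 3.00)
print(knn_predict(query, 1))  # yes  (d18 alone)
print(knn_predict(query, 3))  # no   (d18, d12, d10)
print(knn_predict(query, 5))  # no   (d18, d12, d10, d20, d9)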
VORONOI TESSELLATION AND DECISION
BOUNDARY
When the algorithm is searching for the nearest neighbor using Euclidean
distance, it is partitioning the feature space into what is known as a
Voronoi tessellation, and it is trying to decide which Voronoi region the
query belongs to.
PREDICTING THE TARGET
To do this we simply modify the algorithm to return the majority target level within
the set of k nearest neighbors to the query q:

$\mathbb{M}_k(\mathbf{q}) = \underset{l \in levels(t)}{\arg\max} \sum_{i=1}^{k} \delta(t_i, l)$

where $\mathbb{M}_k(\mathbf{q})$ is the prediction of the model for the query q given the parameter of
the model, k; levels(t) is the set of levels in the domain of the target feature, and l is
an element of this set; i iterates over the instances $d_i$ in increasing distance from the
query q; $t_i$ is the value of the target feature for instance $d_i$; and $\delta(t_i, l)$ is the
Kronecker delta function, which takes two parameters and returns 1 if they are equal
and 0 otherwise.
In a distance-weighted k-NN, votes of the neighbors that are farther away from the
query get less weight. The easiest way to implement this weighting scheme is to
weight each neighbor by the reciprocal of the squared distance between the
neighbor d and the query q:

$\mathbb{M}_k(\mathbf{q}) = \underset{l \in levels(t)}{\arg\max} \sum_{i=1}^{k} \frac{1}{dist(\mathbf{q}, \mathbf{d}_i)^2} \times \delta(t_i, l)$
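A minimal sketch of the distance-weighted vote in Python (not from the
original slides), reusing the train dictionary from the earlier sketch:

import math
from collections import defaultdict

def weighted_knn_predict(query, k, train):
    # train maps ID -> ((SPEED, AGILITY), target), as in the earlier sketch.
    ranked = sorted(
        (math.dist(query, x), target) for x, target in train.values()
    )
    # Each of the k nearest neighbors votes with weight 1 / dist^2,
    # so closer neighbors count for more.
    scores = defaultdict(float)
    for dist, target in ranked[:k]:
        if dist == 0:
            return target  # exact match: return its target directly
        scores[target] += 1.0 / dist ** 2
    return max(scores, key=scores.get)

print(weighted_knn_predict((6.75, 3.00), 3, train))
# 'yes': with 1/dist^2 weighting, d18's single close vote (weight ~0.62)
# outweighs the combined votes of d12 and d10 (~0.45), flipping the
# unweighted 3-NN result.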
PERFORMANCE MEASURES FOR CATEGORICAL
TARGETS: THE CONFUSION MATRIX
A confusion matrix is a performance measurement for machine learning classification
problems where the output can be two or more classes. For a binary problem, it is a
table with the 4 different combinations of predicted and actual values:
True Positive: you predicted positive and it's true.
True Negative: you predicted negative and it's true.
False Positive (Type 1 error): you predicted positive and it's false.
False Negative (Type 2 error): you predicted negative and it's false.
F-MEASURE
It is difficult to compare two models when one has low precision and high recall, or
vice versa. To make them comparable, we use the F-score (F-measure), which
measures recall and precision at the same time. It uses the harmonic mean in place
of the arithmetic mean, punishing extreme values more:

$Precision = \frac{TP}{TP + FP}$,  $Recall = \frac{TP}{TP + FN}$

$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$

We calculate accuracy with the formula:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$
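A minimal sketch of these measures in Python (not from the original slides;
function names are illustrative, and the denominators are assumed nonzero):

def confusion_counts(actual, predicted, positive="yes"):
    # Count the four combinations of actual vs. predicted labels.
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def scores(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy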
EXAMPLE: 90% TRAINING AND 10% TESTING FROM THE ATHLETE DATA
Instances 19 and 20 are held out for testing; the distances from each training
instance to the two test instances are:

Distance to instance 19 (SPEED = 7.5, AGILITY = 8.0):

ID   SPEED   AGILITY   Target   X1 = SPEED - 7.5   X2 = AGILITY - 8.0   Dist.
1    2.5     6.0       no       -5.0               -2.0                 5.39
2    3.75    8.0       no       -3.75              0.0                  3.75
3    2.25    5.5       no       -5.25              -2.5                 5.81
4    3.25    8.25      no       -4.25              0.25                 4.26
5    2.75    7.5       no       -4.75              -0.5                 4.78
6    4.5     5.0       no       -3.0               -3.0                 4.24
7    3.5     5.25      no       -4.0               -2.75                4.85
8    3.0     3.25      no       -4.5               -4.75                6.54
9    4.0     4.0       no       -3.5               -4.0                 5.32
10   4.25    3.75      no       -3.25              -4.25                5.35
11   2.0     2.0       no       -5.5               -6.0                 8.14
12   5.0     2.5       no       -2.5               -5.5                 6.04
13   8.25    8.5       no       0.75               0.5                  0.90
14   5.75    8.75      yes      -1.75              0.75                 1.90
15   4.75    6.25      yes      -2.75              -1.75                3.26
16   5.5     6.75      yes      -2.0               -1.25                2.36
17   5.25    9.5       yes      -2.25              1.5                  2.70
18   7.0     4.25      yes      -0.5               -3.75                3.78

Distance to instance 20 (SPEED = 7.25, AGILITY = 5.75):

ID   SPEED   AGILITY   Target   X1 = SPEED - 7.25   X2 = AGILITY - 5.75   Dist.
1    2.5     6.0       no       -4.75               0.25                  4.76
2    3.75    8.0       no       -3.5                2.25                  4.16
3    2.25    5.5       no       -5.0                -0.25                 5.01
4    3.25    8.25      no       -4.0                2.5                   4.72
5    2.75    7.5       no       -4.5                1.75                  4.83
6    4.5     5.0       no       -2.75               -0.75                 2.85
7    3.5     5.25      no       -3.75               -0.5                  3.78
8    3.0     3.25      no       -4.25               -2.5                  4.93
9    4.0     4.0       no       -3.25               -1.75                 3.69
10   4.25    3.75      no       -3.0                -2.0                  3.61
11   2.0     2.0       no       -5.25               -3.75                 6.45
12   5.0     2.5       no       -2.25               -3.25                 3.95
13   8.25    8.5       no       1.0                 2.75                  2.93
14   5.75    8.75      yes      -1.5                3.0                   3.35
15   4.75    6.25      yes      -2.5                0.5                   2.55
16   5.5     6.75      yes      -1.75               1.0                   2.02
17   5.25    9.5       yes      -2.0                3.75                  4.25
18   7.0     4.25      yes      -0.25               -1.5                  1.52
3-NN FOR THE 10% DATA TESTING
The three nearest neighbors to instance 19 (actual target: yes):

ID   SPEED   AGILITY   Target   X1 = SPEED - 7.5   X2 = AGILITY - 8.0   Dist.
13   8.25    8.5       no       0.75               0.5                  0.90
14   5.75    8.75      yes      -1.75              0.75                 1.90
16   5.5     6.75      yes      -2.0               -1.25                2.36

Majority vote: yes.

The three nearest neighbors to instance 20 (actual target: yes):

ID   SPEED   AGILITY   Target   X1 = SPEED - 7.25   X2 = AGILITY - 5.75   Dist.
18   7.0     4.25      yes      -0.25               -1.5                  1.52
16   5.5     6.75      yes      -1.75               1.0                   2.02
15   4.75    6.25      yes      -2.5                0.5                   2.55

Majority vote: yes.
CONFUSION MATRIX AND F-MEASURE
FOR THE 10% DATA TESTING

                 Actual: yes   Actual: no
Predicted: yes   2             0
Predicted: no    0             0

$Recall = \frac{2}{2} = 1$

$Precision = \frac{2}{2} = 1$

$F\text{-}measure = \frac{2 \times 1 \times 1}{1 + 1} = 1$

$Accuracy = \frac{2}{2} \times 100\% = 100\%$
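As an end-to-end check, a minimal sketch in Python (not from the slides)
reusing the train dictionary, knn_predict, confusion_counts, and scores
from the earlier sketches:

# Hold out instances 19 and 20 (the 10% test set); the remaining 18
# instances form the training set used by knn_predict.
test_ids = [19, 20]
held_out = {i: train.pop(i) for i in test_ids}

actual = [target for _, target in held_out.values()]
predicted = [knn_predict(x, 3) for x, _ in held_out.values()]
# predicted == ["yes", "yes"] and actual == ["yes", "yes"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(scores(tp, tn, fp, fn))
# precision 1.0, recall 1.0, F-measure 1.0, accuracy 1.0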
EXERCISES
1. Make predictions using the athlete data with a 70:30 training:testing split.
2. Build the confusion matrix and compute the F-measure and the accuracy.
REFERENCES
Narkhede, Sarang. 2018. Understanding Confusion Matrix.
https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
Kelleher, J.D., Mac Namee, B., D'Arcy, A. 2015. Fundamentals of Machine
Learning for Predictive Data Analytics. Cambridge, MA: The MIT Press.
