
Foundations of

Machine Learning

Prof. Dr.-Ing. Michael Botsch


Summer Term 2024
i. Administrativia and Course Information

Contact Information
▶ Name: Michael Botsch
▶ Room: K209
▶ Email: michael.botsch@thi.de
▶ Tel.: +49 841 9348 2721
▶ Office Hours: Wednesday 13.30 − 14.30 in my office
Moodle-Password: FML_AVE_ss24_Botsch

1 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


i. Administrativia and Course Information

▶ Prerequisites
▶ Programming 1 and 2
▶ Math 1 and 2 (Algebra and Calculus)
▶ Statistics 1
▶ Literature
▶ BOTSCH, Michael, UTSCHICK, Wolfgang, 2020. Fahrzeugsicherheit und automatisiertes Fahren: Methoden der Signalverarbeitung und des maschinellen Lernens. PDF e-book: ISBN 978-3-446-46804-7. Hardcover: ISBN 978-3-446-45326-5. https://www.hanser-fachbuch.de/buch/Fahrzeugsicherheit+und+automatisiertes+Fahren/9783446453265.

▶ VANDERPLAS, Jake. Python Data Science Handbook [online]. [Accessed on: ]. Available via: https://jakevdp.github.io/PythonDataScienceHandbook.
▶ MURPHY, Kevin P., 2022. Probabilistic Machine Learning: An Introduction. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-04682-4.
▶ GÉRON, Aurélien, September 2019. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Second edition. Beijing; Boston; Farnham; Sebastopol; Tokyo: O'Reilly. ISBN 978-1-492-03264-9, 1-492-03264-6.
2 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
Outline
Foundations of Machine Learning

1. Basics
   1.1 Introduction
   1.2 Curse of Dimensionality
   1.3 Normalization of Features
   1.4 Similarity and Dissimilarity Measures
   1.5 Constrained Optimization using Lagrange Multipliers
   1.6 Dimensionality Reduction and Visualization
2. Generalization and Validation
   2.1 Loss, Risk, Empirical Risk
   2.2 Generalization and Bias-Variance Decomposition
   2.3 Model Selection and Evaluation of Machine Learning
3. Classification and Regression
   3.1 Optimal Classification and Regression
   3.2 k-Nearest Neighbor Classifier and Nadaraya-Watson Regressor
4. Linear Models
   4.1 Linear Classification and Regression
   4.2 Classification Using the softmax Function
5. Decision Trees
6. Ensembles
7. Neural Networks
   7.1 Deep Multilayer Perceptron
   7.2 Stochastic Gradient Descent
   7.3 Backpropagation
8. Convolutions and Image Processing
9. Recurrency and Sequence Processing
   9.1 Recurrency and Backpropagation Through Time
   9.2 Vanishing Gradient and LSTM
10. Attention and Transformers
   10.1 Embeddings
   10.2 Attention Mechanism
   10.3 Transformers
11. Unsupervised Learning and Generative Models
   11.1 Clustering
   11.2 Autoencoder
   11.3 Variational Autoencoder
   11.4 Generative Adversarial Networks
Outline
Foundations of Machine Learning

1 1. Basics
1. 1 Introduction
1. 2 Curse of Dimensionality
1. 3 Normalization of Features
1. 4 Similarity and Dissimilarity Measures
1. 5 Constrained Optimization using Lagrange Multipliers
1. 6 Dimensionality Reduction and Visualization
1. Basics
1. 1 Introduction 1. 1. 1 Notation

▶ scalars:
▶ printed: lowercase letter, normal, e. g. x
▶ handwritten: lowercase letter, normal, e. g., x
▶ vectors:
▶ printed: lowercase letter, bold, e. g. x
▶ handwritten: lowercase letter, underlined, e. g., x
▶ matrices:
▶ printed: uppercase letter, bold, e. g. A
▶ handwritten: uppercase letter, underlined, e. g., A

3 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 2 Terms in Machine Learning

Machine Learning (ger. Maschinelles Lernen) denotes methods of signal processing that find statistical relationships in data with the help of computers, with the goal of making predictions for new data. One can speak of the artificial generation of knowledge through experience. Machine learning is based on mathematical statistics and deals with “Learning from Data”, i. e., finding regularities in data

4 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 2 Terms in Machine Learning

Categories of machine learning


▶ Supervised Learning (ger. Überwachtes Lernen): an algorithm learns from examples of inputs
AND outputs, i. e., input data and output data are available. It can be said that the input data is
labeled, i. e., for each input the corresponding output is known in the learning phase. The goal is to
predict the output values to new input data
▶ Unsupervised Learning (ger. Unüberwachtes Lernen): an algorithm learns based solely on the input data; there is no output data available. The goal is to learn the structure of the input data and to generate a model for it
▶ Semi-Supervised Learning (ger. Teilüberwachtes Lernen): for a part of the input data there is output data available and for another part there is no output data available. The goal is to predict the output values for new input data, but in the learning process the unlabeled data must also be used
▶ Reinforcement Learning (ger. Verstärkendes Lernen): an algorithm is learning a policy
(sequence of actions) on its own such that a measure for reward is maximized

5 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 2 Terms in Machine Learning

▶ Machine learning is a part of Artificial Intelligence (ger. Künstliche Intelligenz) and


plays an increasing role also in engineering since the available amount of data and the
computational resources have grown at a high pace in the last few years. This enables
solving complex practical applications in various fields of engineering with machine
learning algorithms
▶ It should be mentioned in this context that with a large amount of data for a given learning
task, many types of machine learning algorithms can find suitable solutions

6 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 2 Terms in Machine Learning

▶ Above a certain amount of data, the crucial factor for the performance is no longer the chosen type of algorithm (given a sufficiently large number of parameters) but the large amount of data itself. This has been shown for example in [BB01] for an application from speech recognition. The reason is that many algorithms that are available nowadays are Universal Approximators (ger. Universelle Approximatoren), i. e., they can represent any smooth function
▶ In practice for most applications only small or medium-sized data sets are available. In this
case for certain learning tasks some algorithms do perform better than others
▶ Unfortunately it is not possible to make general statements about what “large amount of
data” means, since this must be answered individually for each task. As it will be shown
later on, when talking about the curse of dimensionality it is very easy to be wrong
regarding the size of a data set

7 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 2 Terms in Machine Learning

▶ The tasks in supervised learning can be mainly divided into Classification (ger.
Klassifikation) and Regression (ger. Regression) problems
▶ In the field of unsupervised learning a main task is Clustering (ger. Clusteranalyse), i. e., methods to reveal similarity in unlabeled data

8 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 3 Supervised Learning

▶ Classification and regression aim at estimating values of an attribute of a system based on


previously measured attributes of this system.

▶ Given a set of measured observation attributes v = [v_1, . . . , v_{N′}]^T ∈ R^{N′}, statistical learning methods estimate the values of a different attribute y. If y takes on continuous numerical values, i. e., y ∈ R, one talks about regression, and if it takes on discrete values from a set of K categorical values, called classes, i. e., y ∈ {c_1, . . . , c_K}, one talks about classification
Figure 1: Classification and Regression (schematic panels: classification and regression as the two types of supervised learning)


9 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 1 Introduction 1. 1. 3 Supervised Learning

▶ Often a preprocessing of the observation vector v is performed in order to simplify the mapping

$$\tilde{f}: \mathbb{R}^{N'} \to \mathbb{R}, \quad \mathbf{v} \mapsto y \quad \text{for regression, and} \tag{1}$$
$$\tilde{f}: \mathbb{R}^{N'} \to \{c_1, \ldots, c_K\}, \quad \mathbf{v} \mapsto y \quad \text{for classification.} \tag{2}$$

▶ Preprocessing plays a very important role, since it is a possibility to introduce a priori knowledge about the considered machine learning problem. This preprocessing, also called Feature Generation (ger. Merkmalsgenerierung), transforms the observation vector v into the so-called feature vector x = [x_1, . . . , x_N]^T ∈ R^N

10 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 3 Supervised Learning

▶ In classical machine learning the feature generation is done manually, while in Deep
Learning (ger. Tiefes Lernen) the feature generation is also part of the learning process
▶ If also the feature generation is part of the learning process this type of learning is called
Representation Learning (ger. Repräsentationslernen)
Figure 2: Manual and Automatic Feature Generation (classical machine learning: manual feature generation followed by the machine learning algorithm; representation learning / deep learning: automatic feature generation as part of the machine learning algorithm)

11 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 3 Supervised Learning

▶ A pair (x, y) is called a Pattern (ger. Muster), x the Input (ger. Eingang), y the target
(ger. Zielwert), and ŷ the Output (ger. Ausgang)
▶ Because the measured attribute values are subject to variations which often cannot be
described deterministically, a statistical framework must be adopted. In this framework, x
is the realization of the random variable x and y of the random variable y
▶ The mapping from v to y or the mapping from x to y describes the behavior of the system
that has to be approximated by machine learning
▶ The mapping is computed based on a Data Set (ger. Datensatz) D that contains M patterns

D = {(v1 , y1 ), . . . , (vM , yM )} or D = {(x1 , y1 ), . . . , (xM , yM )} (3)

▶ In the following x will be considered as an input but all concepts and ideas also hold if v is
the input

12 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 4 Unsupervised Learning

▶ In Unsupervised Learning, there is only input data without the associated target values, i. e.
unlabeled data. The training data set can thus be written as

D = {x1 , . . . , xM } (4)

▶ The task is to learn the structure in the input data and to create a model for it. The main tasks of Unsupervised Learning include cluster analysis (also called automatic segmentation), the compression or reduction of data, and the generation of new data similar to those from D

13 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 4 Unsupervised Learning

The goal of cluster analysis is to automatically segment unlabeled input data into groups of
similar data points. This can find use
▶ to represent the available data in a clear way and thus allow a better management of the
data
▶ to understand the available data better
▶ to simplify the available data for subsequent signal processing steps

14 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 4 Unsupervised Learning

Steps in cluster analysis

15 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 4 Unsupervised Learning

▶ Hard Clustering: subdivision of the data set D = {x_1, x_2, . . . , x_M}, with x_m ∈ R^N, into K partitions C_1, C_2, . . . , C_K, with K ≤ M, such that
  1. C_k ≠ ∅ for k = 1, . . . , K
  2. $\bigcup_{k=1}^{K} C_k = D$
  3. C_k ∩ C_l = ∅ for k, l = 1, . . . , K and k ≠ l
▶ Fuzzy Clustering: an element x_m can belong to all clusters/partitions with a “membership share” u_{km} ∈ [0, 1], where
  1. $\sum_{k=1}^{K} u_{km} = 1$ for all x_m
  2. $\sum_{m=1}^{M} u_{km} < M$ for all C_k

16 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 1 Introduction 1. 1. 4 Unsupervised Learning

▶ Clustering is a subjective process

▶ A good clustering method generates clusters with


▶ high intra-cluster similarity
▶ low inter-cluster similarity

⇒ The quality of clustering mainly depends on the similarity measure used


17 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 1 Introduction 1. 1. 4 Unsupervised Learning

▶ Categorizing the clustering methods based on the input data x_m ∈ R^N, one can distinguish between similarity-based and feature-based clustering. In Similarity-Based Clustering (or Relational Clustering) the starting point is an M × M matrix in which the similarities or the dissimilarities between all elements of D are stored. In Feature-Based Clustering, the starting point is an M × N matrix in which the M elements of dimension N from D are stored.
▶ If we categorize the clustering methods based on the way in which the cluster analysis is
performed, the following important methods can be identified: partitioning, model-based,
hierarchical, probability density-based, grid-based methods, and spectral clustering.

18 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


Outline
Foundations of Machine Learning

1 1. Basics
1. 1 Introduction
1. 2 Curse of Dimensionality
1. 3 Normalization of Features
1. 4 Similarity and Dissimilarity Measures
1. 5 Constrained Optimization using Lagrange Multipliers
1. 6 Dimensionality Reduction and Visualization
1. Basics
1. 2 Curse of Dimensionality

▶ If there are many features in the vector x, i. e., if N is large, the possible complexity of the Bayes classifier f_B (the theoretically best classifier, see Section 3.1) also increases. The complexity can increase exponentially with the dimensionality N, and thus the number M of pairs (x_m, y_m) in the training data D must also increase exponentially (worst case) to estimate such a complex f_B
▶ The Curse of Dimensionality (ger. Fluch der hohen Dimensionen) can be visualized [Bis95]:
[Figure: a cube with side length L = 2ℓ subdivided into cells of side length ℓ in N = 1, 2, 3 dimensions, giving w = 2 = 2¹, w = 4 = 2², and w = 8 = 2³ volume elements.]
The number w of volume elements with side length ℓ increases exponentially with the dimensionality N given the same side length L of the cube: w = (L/ℓ)^N
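As a quick numerical illustration of this formula (a minimal sketch; the side lengths below are chosen arbitrarily):

```python
# Number of cells w = (L/l)**N needed to cover a cube of side L with cells of side l.
L, l = 1.0, 0.1  # cube side length and cell side length (illustrative values)

for N in (1, 2, 3, 10, 20):
    w = (L / l) ** N
    print(f"N = {N:2d}:  w = {w:.0e} cells")
# Already for N = 20 one would need 1e20 cells, i.e., far more cells than any
# realistic data set has samples.
```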
19 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 2 Curse of Dimensionality

▶ In many practical applications classification and regression algorithms can perform well
despite high dimensions because the relevant data is located in lower dimensional
subspaces or because it has some smoothness property, i. e., for most of the points in the
input space, if x is changing slightly, in classification problems the target class is not
changing at all and in regression problems the target value is also changing only slightly
▶ So, for many practical problems the Bayes regression function or the Bayes classifier fB
(see Section 3.1) is not as complex as the space RN might suggest
▶ In order to get a good performance in high dimensional spaces it is often useful to use
prior knowledge about the system that generates the data, e. g., by taking into account what
type of data it is (time series, images, etc.)

20 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


Outline
Foundations of Machine Learning

1 1. Basics
1. 1 Introduction
1. 2 Curse of Dimensionality
1. 3 Normalization of Features
1. 4 Similarity and Dissimilarity Measures
1. 5 Constrained Optimization using Lagrange Multipliers
1. 6 Dimensionality Reduction and Visualization
1. Basics
1. 3 Normalization of Features

▶ For most machine learning algorithms it is necessary to normalize the features in the vector x. Otherwise, features x_n whose range of values is large will have a higher influence on the loss function and thus on the solution
▶ One option to deal with this aspect is to normalize all features so that their range is between 0 and 1. Denoting by x_{n,min} the minimal value and by x_{n,max} the maximal value of the n-th feature in the data set D, the normalized data set is obtained by normalizing each of the N features of each input vector x_m according to

$$x_{n,m}^{(\mathrm{norm})} = \frac{x_{n,m} - x_{n,\min}}{x_{n,\max} - x_{n,\min}} \tag{5}$$

21 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 3 Normalization of Features

▶ The sample mean μ̂_n and the sample variance σ̂_n² of the n-th feature are computed according to
$$\hat{\mu}_n = \frac{1}{M}\sum_{m=1}^{M} x_{m,n} \quad \text{and} \quad \hat{\sigma}_n^2 = \frac{1}{M-1}\sum_{m=1}^{M}\left(x_{m,n} - \hat{\mu}_n\right)^2$$
▶ Another option to perform the normalization is to make the sample mean 0 and the sample variance 1 for every feature, i. e.,
$$x_{n,m}^{(\mathrm{norm})} = \frac{x_{n,m} - \hat{\mu}_n}{\hat{\sigma}_n}. \tag{6}$$
With this normalization the sample mean of x_n^(norm) is 0 and the sample variance of x_n^(norm) is 1
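The two normalizations from Eq. (5) and Eq. (6) can be sketched in a few lines of NumPy; the small data matrix below is purely illustrative and its rows are the M patterns:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 100.0]])          # M = 3 patterns (rows), N = 2 features (columns)

# Eq. (5): min-max normalization to the range [0, 1]
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Eq. (6): zero sample mean and unit sample variance per feature
mu_hat = X.mean(axis=0)
sigma_hat = X.std(axis=0, ddof=1)      # ddof=1 gives the 1/(M-1) estimator used above
X_standard = (X - mu_hat) / sigma_hat

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.var(axis=0, ddof=1))  # approx. 0 and 1
```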
22 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
Outline
Foundations of Machine Learning

1 1. Basics
1. 1 Introduction
1. 2 Curse of Dimensionality
1. 3 Normalization of Features
1. 4 Similarity and Dissimilarity Measures
1. 5 Constrained Optimization using Lagrange Multipliers
1. 6 Dimensionality Reduction and Visualization
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 1 Dissimilarity Measures

Definition: a function d is called a dissimilarity measure, distance measure, or metric if the following holds for all x_1, x_2, x_3 ∈ R^N:
  1. d(x_1, x_2) = d(x_2, x_1) (symmetry)
  2. d(x_1, x_2) = 0 ⇔ x_1 = x_2 (positive definiteness)
  3. d(x_1, x_3) ≤ d(x_1, x_2) + d(x_2, x_3) (triangle inequality)
This implies that d(x_1, x_2) ≥ 0

23 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 1 Dissimilarity Measures

▶ A norm is a function ∥ · ∥ that maps from R^N to R and fulfills the following:
  1. ∥x∥ = 0 ⇔ x = [0, 0, . . . , 0]^T
  2. ∥a · x∥ = |a| · ∥x∥ for all a ∈ R and x ∈ R^N
  3. ∥x_1 + x_2∥ ≤ ∥x_1∥ + ∥x_2∥ for all x_1, x_2 ∈ R^N
▶ A category of dissimilarity measures is defined by the norm ∥ · ∥ of x_1 − x_2:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|$$
▶ The best known distance measure is the Euclidean distance
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \sqrt{(x_{1,1} - x_{2,1})^2 + (x_{1,2} - x_{2,2})^2 + \dots + (x_{1,N} - x_{2,N})^2}$$

24 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 1 Dissimilarity Measures

The most commonly used measures of dissimilarity include


▶ Matrix norms
▶ Lebesgue or Minkowski norms
▶ Hamming distance

25 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 1 Dissimilarity Measures

Matrix norms
▶ The matrix norm induced by a matrix A is $\|\mathbf{x}\|_A = \sqrt{\mathbf{x}^T A \mathbf{x}}$
▶ Examples:
  - A = I (the identity matrix) leads to the Euclidean norm
  - A = diag(w_1, . . . , w_N) (a diagonal matrix with weights w_n) leads to the diagonal (weighted) norm
  - $A = \hat{C}^{-1} = \left(\frac{1}{M-1}\sum_{m=1}^{M}(\mathbf{x}_m - \hat{\boldsymbol{\mu}})(\mathbf{x}_m - \hat{\boldsymbol{\mu}})^T\right)^{-1}$ with $\hat{\boldsymbol{\mu}} = \frac{1}{M}\sum_{m=1}^{M}\mathbf{x}_m$ leads to the Mahalanobis norm
The Mahalanobis norm adjusts the weights of the features to the statistics of the data set and takes into account correlations between the features
26 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 1 Dissimilarity Measures

Lebesgue or Minkowski norms
▶ The Lebesgue or Minkowski norm (the L_p norm of x) is defined as
$$\|\mathbf{x}\|_p = \sqrt[p]{\sum_{n=1}^{N} |x_n|^p}$$
▶ Special cases:
  L_{-∞} norm: $\|\mathbf{x}\|_{-\infty} = \min_{n=1,\ldots,N} |x_n|$
  L_{∞} norm: $\|\mathbf{x}\|_{\infty} = \max_{n=1,\ldots,N} |x_n|$
  L_2 norm or Euclidean norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_{n=1}^{N} x_n^2}$
  L_1 norm or Manhattan norm: $\|\mathbf{x}\|_1 = \sum_{n=1}^{N} |x_n|$
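A small NumPy sketch of these special cases, using illustrative vectors:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 3.5])
diff = x1 - x2

d_manhattan = np.sum(np.abs(diff))                  # L1 norm of the difference
d_euclidean = np.sqrt(np.sum(diff ** 2))            # L2 norm
d_chebyshev = np.max(np.abs(diff))                  # L-infinity norm
d_min       = np.min(np.abs(diff))                  # L-(-infinity) "norm" (smallest component)
d_p3        = np.sum(np.abs(diff) ** 3) ** (1 / 3)  # general Lp norm with p = 3

print(d_manhattan, d_euclidean, d_chebyshev, d_min, d_p3)
```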

27 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 1 Dissimilarity Measures

Hamming distance
▶ The Hamming distance is defined as
$$d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{n=1}^{N} \delta(x_{1,n}, x_{2,n}), \quad \text{with} \quad \delta(x_{1,n}, x_{2,n}) = \begin{cases} 0 & \text{if } x_{1,n} = x_{2,n} \\ 1 & \text{else} \end{cases}$$
▶ So, the Hamming distance corresponds to the number of features in which x_1 and x_2 differ
▶ The Hamming distance is a metric since it fulfills the properties of a metric (symmetry, positive definiteness, triangle inequality)
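A minimal sketch of the Hamming distance with NumPy (the example vectors are illustrative):

```python
import numpy as np

x1 = np.array(["red", "small", "round"])
x2 = np.array(["red", "large", "round"])

d_hamming = int(np.sum(x1 != x2))  # number of features in which the vectors differ
print(d_hamming)                   # -> 1
```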

28 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 2 Similarity Measures

▶ A function s is a similarity measure for all x_1, x_2 ∈ R^N if
  1. s(x_1, x_2) = s(x_2, x_1)
  2. s(x_1, x_2) ≤ s(x_1, x_1)
  3. s(x_1, x_2) ≥ 0
▶ If additionally s(x_1, x_1) = 1, one speaks of a normalized similarity measure
▶ Using a monotonically decreasing positive function f with f(0) = 1, a dissimilarity measure d can be transformed into a similarity measure s, e. g.,
$$s(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{1 + d(\mathbf{x}_1, \mathbf{x}_2)} \quad \Leftrightarrow \quad d(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{s(\mathbf{x}_1, \mathbf{x}_2)} - 1$$

29 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 2 Similarity Measures

A category of similarity measures for positive real-valued features results from differently scaled scalar products
▶ Cosine similarity: $s(\mathbf{x}_1, \mathbf{x}_2) = \dfrac{\mathbf{x}_1^T \mathbf{x}_2}{\|\mathbf{x}_1\|_2 \cdot \|\mathbf{x}_2\|_2}$
▶ Overlap similarity: $s(\mathbf{x}_1, \mathbf{x}_2) = \dfrac{\mathbf{x}_1^T \mathbf{x}_2}{\min(\|\mathbf{x}_1\|_2^2, \|\mathbf{x}_2\|_2^2)}$
▶ Dice similarity: $s(\mathbf{x}_1, \mathbf{x}_2) = \dfrac{2\,\mathbf{x}_1^T \mathbf{x}_2}{\|\mathbf{x}_1\|_2^2 + \|\mathbf{x}_2\|_2^2}$
▶ Jaccard similarity: $s(\mathbf{x}_1, \mathbf{x}_2) = \dfrac{\mathbf{x}_1^T \mathbf{x}_2}{\|\mathbf{x}_1\|_2^2 + \|\mathbf{x}_2\|_2^2 - \mathbf{x}_1^T \mathbf{x}_2}$
▶ For the null vector these similarities have to be defined separately, e. g., set the similarity to 0
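A compact NumPy sketch of the four scaled scalar-product similarities (the helper function name is chosen here for illustration and assumes non-zero vectors):

```python
import numpy as np

def scalar_product_similarities(x1, x2):
    """Cosine, overlap, Dice, and Jaccard similarity of two non-zero vectors."""
    dot = x1 @ x2
    n1, n2 = np.sum(x1 ** 2), np.sum(x2 ** 2)   # squared Euclidean norms
    return {
        "cosine":  dot / (np.sqrt(n1) * np.sqrt(n2)),
        "overlap": dot / min(n1, n2),
        "dice":    2 * dot / (n1 + n2),
        "jaccard": dot / (n1 + n2 - dot),
    }

print(scalar_product_similarities(np.array([1.0, 2.0, 0.0]),
                                  np.array([2.0, 1.0, 1.0])))
```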

30 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 2 Similarity Measures

The similarity between two features (not between two input vectors!) in the data set D can be computed using the Pearson correlation coefficient
$$r_{n\ell} = \frac{\sum_{m=1}^{M} (x_{m,n} - \hat{\mu}_n)(x_{m,\ell} - \hat{\mu}_\ell)}{\sqrt{\sum_{m=1}^{M} (x_{m,n} - \hat{\mu}_n)^2 \sum_{m=1}^{M} (x_{m,\ell} - \hat{\mu}_\ell)^2}}, \quad \text{with} \quad \hat{\mu}_n = \frac{1}{M}\sum_{m=1}^{M} x_{m,n}, \quad \hat{\mu}_\ell = \frac{1}{M}\sum_{m=1}^{M} x_{m,\ell}$$
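As a sketch, the coefficients r_nl can be computed directly from the formula above and checked against np.corrcoef; the data matrix is illustrative:

```python
import numpy as np

X = np.array([[1.0, 2.1, 10.0],
              [2.0, 3.9,  8.0],
              [3.0, 6.2,  7.5],
              [4.0, 8.1,  4.0]])        # M = 4 patterns (rows), N = 3 features (columns)

Xc = X - X.mean(axis=0)                 # centered features
num = Xc.T @ Xc                         # sums of cross products
den = np.sqrt(np.outer(np.sum(Xc**2, axis=0), np.sum(Xc**2, axis=0)))
R = num / den                           # R[n, l] is the Pearson coefficient r_nl

print(np.allclose(R, np.corrcoef(X, rowvar=False)))  # True
```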

31 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 3 Features with Non-Real Value Range

Similarity for feature vectors whose features are represented by binary variables
▶ Notation a_{11} for the number of features that have the value 1 in x_1 and the value 1 in x_2
▶ Notation a_{00} for the number of features that have the value 0 in x_1 and the value 0 in x_2
▶ Notation a_{10} for the number of features that have the value 1 in x_1 and the value 0 in x_2
▶ Notation a_{01} for the number of features that have the value 0 in x_1 and the value 1 in x_2
  1. The match of 1 with 1 is weighted the same as the match of 0 with 0:
$$s(\mathbf{x}_1, \mathbf{x}_2) = \frac{a_{11} + a_{00}}{a_{11} + a_{00} + w(a_{10} + a_{01})}, \quad \text{with } w \in \left\{1, 2, \tfrac{1}{2}\right\}$$
  2. Only the match of 1 with 1 is counted, the match of 0 with 0 is not:
$$s(\mathbf{x}_1, \mathbf{x}_2) = \frac{a_{11}}{a_{11} + w(a_{10} + a_{01})}, \quad \text{with } w \in \left\{1, 2, \tfrac{1}{2}\right\}$$

32 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 3 Features with Non-Real Value Range

Similarity for feature vectors whose features are represented by categorical variables
$$s(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{N}\sum_{n=1}^{N} s_{12,n}, \quad \text{with} \quad s_{12,n} = \begin{cases} 1 & \text{if } \mathbf{x}_1 \text{ and } \mathbf{x}_2 \text{ have the same value in the } n\text{th feature} \\ 0 & \text{if } \mathbf{x}_1 \text{ and } \mathbf{x}_2 \text{ have a different value in the } n\text{th feature} \end{cases}$$

33 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 3 Features with Non-Real Value Range

Similarity for feature vectors whose features are represented by ordinal numbers
▶ The possible values of the nth feature are 1, 2, . . . , W_n
▶ The closer x_{1,n} and x_{2,n} are, the more similar x_1 and x_2 are with respect to the nth feature
▶ Since W_n is different for the individual features, first a mapping to the interval [0, 1] is performed:
$$x_{m,n}^{(\mathrm{new})} = \frac{x_{m,n} - 1}{W_n - 1}$$
▶ Then similarity measures for real valued features can be used

34 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 3 Features with Non-Real Value Range

Similarity for feature vectors whose features are of different types (have different ranges)
$$s(\mathbf{x}_1, \mathbf{x}_2) = \frac{\sum_{n=1}^{N} \delta_{12,n}\, s_{12,n}}{\sum_{n=1}^{N} \delta_{12,n}},$$
where s_{12,n} is the similarity between x_1 and x_2 in the nth feature and δ_{12,n} ∈ {0, 1} indicates whether the nth feature is present in both x_1 and x_2 (yes → 1; no → 0)
▶ for features that are represented by categorical variables the following holds:
$$s_{12,n} = \begin{cases} 1 & \text{if } \mathbf{x}_1 \text{ and } \mathbf{x}_2 \text{ have the same value in the } n\text{th feature} \\ 0 & \text{if } \mathbf{x}_1 \text{ and } \mathbf{x}_2 \text{ have a different value in the } n\text{th feature} \end{cases}$$
▶ for real-valued features, one of the similarities mentioned above for real-valued features can be used
35 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

▶ For sequential data the following peculiarities appear


▶ each element of the sequence describes the same feature (e. g. acceleration)
▶ the order of the elements in the sequence does play a role
▶ sequences can have different length
▶ Levenshtein or Edit-distance: measures the minimum number of single-character edits
required to change one sequence into the other. These edits may either be: insertions,
deletions or substitutions. Example: The Levenshtein distance between the German words
“Tier” and “Tor” is 2. Here are the steps to reach from “Tier” to “Tor”: 1. replace “i” with
“o” to get “Toer” and 2. delete “e” to get “Tor”
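A minimal dynamic-programming sketch of the Levenshtein distance described above (not the Moodle code; the function name is chosen for illustration):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all characters of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(a)][len(b)]

print(levenshtein("Tier", "Tor"))  # -> 2, as in the example above
```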

36 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

37 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping


▶ In applications where the similarity of shape is important, a dissimilarity measure other
than the Minkowski metric is more suitable: Dynamic Time Warping (DTW)
▶ Because the triangle inequality does not apply to DTW, it is not a metric and is therefore
not referred to as „distance“
▶ In contrast to Euclidean distance, DTW allows an elastic shift of the time axis ⇒ DTW
allows a local compression and stretching of the time axis, e. g., phase-shifted patterns are
recognized as similar

38 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping

Figure 3: Without Time Warping. Figure 4: With Time Warping.

The difference between Euclidean distance and DTW becomes clear from Fig. 3 and Fig. 4:
While the Euclidean distance is very sensitive to small distortions in the time axis because it
assumes that the nth point in s1 is aligned with the nth point in s2 , DTW can shift the time axes
of both sequences to achieve a better alignment
39 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping: Example

Figure 5: Example for determining the similarity of the sequences s1 and s2 using DTW.

40 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping: Example


▶ The time series s1 and s2 in the figure on the left both have the length
nend,1 = nend,2 = 128. They both have a plateau as a similar shape. The plateau of s1 ,
which is represented by the solid line, is larger than the plateau of s2 , which is represented
by the dashed line. This results in a large Euclidean distance between the two time series
▶ Using the dynamic programming method, the optimal warping path W′_DTW, which is shown in the middle figure, can be calculated
▶ With the knowledge of W′_DTW, the time axes for each sequence can now be stretched locally to take into account the similarity of the shape
▶ The result, which stretches both time series to a length of 193, is shown in the figure on the
right

41 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping


▶ Unlike other dissimilarity measures, with DTW, it is not necessary for the two sequences
being compared to be of the same length. For the sequence s1 with length nend,1 and the
sequence s2 with the length nend,2 the distance between the nth element in s1 and the kth
element in s2 is defined as
d[n, k] = ∥s1 [n] − s2 [k]∥2
▶ To calculate the DTW dissimilarity, the first step is to calculate the nend,1 × nend,2 matrix
DDTW , where the (n, k)-th element is d[n, k]
▶ A contiguous path in DDTW from d[0, 0] to d[nend,1 − 1, nend,2 − 1] is called Warping-Path

WDTW = {w1,DTW , . . . , wI,DTW },


where max(nend,1 , nend,2 ) ≤ I < nend,1 + nend,2 − 1 and wi,DTW = [n, k]Ti stores the indices
n and k of the ith element in the path
42 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping


▶ The warping path W_DTW is subject to three conditions
1 Boundary condition: w_{1,DTW} = [0, 0]^T and w_{I,DTW} = [n_{end,1} − 1, n_{end,2} − 1]^T. This forces the path to start and end in diagonally opposite corners of D_DTW
2 Continuity condition: restricts the permitted steps in the path to neighboring cells: if w_{i,DTW} = [n, k]^T and w_{i−1,DTW} = [n′, k′]^T, then n − n′ ≤ 1 and k − k′ ≤ 1
3 Monotonicity condition: forces the points in W_DTW to be arranged monotonically in time, i. e., if w_{i,DTW} = [n, k]^T and w_{i−1,DTW} = [n′, k′]^T, then n − n′ ≥ 0 and k − k′ ≥ 0
▶ There are many paths that fulfill these three conditions
▶ The DTW algorithm determines the path W′_DTW that minimizes the warping costs
$$W'_{\mathrm{DTW}} = \underset{W_{\mathrm{DTW}}}{\operatorname{argmin}} \left\{ \sum_{i=1}^{I(W_{\mathrm{DTW}})} d[\mathbf{w}_{i,\mathrm{DTW}}] \right\}, \tag{7}$$
where I(W_DTW) represents the length of the warping path W_DTW


43 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping

▶ The optimal warping path W′_DTW leads to the DTW dissimilarity
$$d_{\mathrm{DTW}}(s_1, s_2) = \frac{1}{I(W'_{\mathrm{DTW}})} \sum_{i=1}^{I(W'_{\mathrm{DTW}})} d[\mathbf{w}_{i,\mathrm{DTW}}]$$
▶ The optimization problem in Eq. (7) can be solved very efficiently by dynamic programming. For this purpose, a new (n_{end,1} + 1) × (n_{end,2} + 1) matrix Γ_DTW is calculated, whose elements are pre-initialized with “∞” and with γ_DTW[0, 0] = 0. The matrix Γ_DTW is then calculated in such a way that for n, k ≥ 1 its entries contain the cumulative distances γ_DTW[n, k]
$$\gamma_{\mathrm{DTW}}[n, k] = d[n-1, k-1] + \min\left\{\gamma_{\mathrm{DTW}}[n-1, k-1],\ \gamma_{\mathrm{DTW}}[n-1, k],\ \gamma_{\mathrm{DTW}}[n, k-1]\right\}$$
The sequence of indices that leads to γ_DTW[n_{end,1}, n_{end,2}] forms the warping path W′_DTW
▶ Note that for time series of equal length, n_{end,1} = n_{end,2}, the Euclidean distance can be considered a special case of DTW if n = k = i − 1 applies for all w_{i,DTW} = [n, k]^T
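A minimal NumPy sketch of this recurrence (names are illustrative, not the Moodle code). It returns the cumulative warping cost γ_DTW[n_end,1, n_end,2]; obtaining the normalized dissimilarity d_DTW as defined above would additionally require backtracking the optimal path to get its length I(W′_DTW). The optional window argument anticipates the warping window discussed below:

```python
import numpy as np

def dtw(s1, s2, window=None):
    """Cumulative DTW cost of two 1-D sequences via the gamma recurrence."""
    n1, n2 = len(s1), len(s2)
    w = max(window, abs(n1 - n2)) if window is not None else max(n1, n2)
    gamma = np.full((n1 + 1, n2 + 1), np.inf)
    gamma[0, 0] = 0.0
    for n in range(1, n1 + 1):
        for k in range(max(1, n - w), min(n2, n + w) + 1):
            d = abs(s1[n - 1] - s2[k - 1])                 # local distance d[n-1, k-1]
            gamma[n, k] = d + min(gamma[n - 1, k - 1],     # diagonal step (match)
                                  gamma[n - 1, k],         # step in s1 only
                                  gamma[n, k - 1])         # step in s2 only
    return gamma[n1, n2]

s1 = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
s2 = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(dtw(s1, s2), dtw(s1, s2, window=2))
```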
44 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping: Pseudocode¹

¹ [Zha20]: https://towardsdatascience.com/dynamic-time-warping-3933f25fcdd; Python code: see Moodle
45 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping


▶ The complexity of the DTW is O(nend,1 nend,2 ), but can easily be accelerated by introducing
a limit on how far the path may deviate from the diagonal of DDTW . The width of the
window around the diagonal that marks the subset of DDTW that the path is allowed to
visit is called the warping window
▶ In addition to the complexity reasons, the main advantage of introducing a warping
window is the avoidance of „pathological warping“, where a small section of one sequence
maps onto a large section of the other

46 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences

Dynamic Time Warping with Warping Window: Pseudocode² [Zha20]

² Python code: see Moodle
47 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
Outline
Foundations of Machine Learning

1 1. Basics
1. 1 Introduction
1. 2 Curse of Dimensionality
1. 3 Normalization of Features
1. 4 Similarity and Dissimilarity Measures
1. 5 Constrained Optimization using Lagrange Multipliers
1. 6 Dimensionality Reduction and Visualization
1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers

The Lagrange multiplier method is a technique for solving optimization problems with
constraints. The task is to find a local extremum of a function in several variables while meeting
the constraints.

48 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints

▶ The task in optimization with equality constraints is to find the maximum of the function
f (x) under the constraint g(x) = 0
▶ Considering the N-dimensional vector x and the equality constraint g(x) = 0, this
constraint represents a (N − 1)-dimensional hyperplane in the N-dimensional space. An
example is the constraint wT x + t = 0. All points x, for which wT x + t = 0 applies, form
a (N − 1)-dimensional hyperplane

49 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints

Example: For N = 2, g(x) = 0 defines a curve, and in this 2-dimensional space, the method of Lagrange multipliers
can be geometrically explained using Fig. 6
Figure 6: Geometric representation of the Lagrange multiplier method for equality constraints (contour lines f(x_1, x_2) = c_1, c_2, c_3, the constraint curve g(x_1, x_2) = 0, and the sought point x_Goal).

The point x = [x1 , x2 ]T is to be found, which lies on the curve g(x1 , x2 ) = 0 and where f (x1 , x2 ) has the highest
value. In Fig. 6, contour lines f (x1 , x2 ) = c for various values of c are shown, with c1 < c2 < c3 . As can be seen in
the sketch, the point sought is xGoal
50 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints

The gradient ∇x g(x) is orthogonal to the hyperplane determined by g(x) = 0:


▶ The gradient ∇x g(x) of the constraint is orthogonal to the (N − 1)-dimensional
hyperplane determined by g(x) = 0. In Fig. 6, gradients ∇x g(x1 , x2 ) are represented by
red arrows, which stand perpendicular to the red curve g(x1 , x2 ) = 0
▶ The orthogonality between ∇_x g(x) and the hyperplane determined by g(x) = 0 can be illustrated using the Taylor series expansion³ of the function g around the point x_0, evaluated at a nearby point x_0 + ε, where both x_0 and x_0 + ε lie on the hyperplane determined by g(x) = 0. It results in:
$$g(\mathbf{x}_0 + \boldsymbol{\epsilon}) \approx g(\mathbf{x}_0) + \left(\nabla_{\mathbf{x}} g(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0}\right)^T \boldsymbol{\epsilon} = g(\mathbf{x}_0) + \boldsymbol{\epsilon}^T \left(\nabla_{\mathbf{x}} g(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0}\right). \tag{8}$$
³ Taylor series expansion:
$$g(\mathbf{x}) = g(\mathbf{x}_0) + \underbrace{\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}^T}\bigg|_{\mathbf{x}=\mathbf{x}_0}}_{J}(\mathbf{x} - \mathbf{x}_0) + \frac{1}{2!}(\mathbf{x} - \mathbf{x}_0)^T \underbrace{\frac{\partial^2 g(\mathbf{x})}{\partial \mathbf{x}\,\partial \mathbf{x}^T}\bigg|_{\mathbf{x}=\mathbf{x}_0}}_{H}(\mathbf{x} - \mathbf{x}_0) + O\!\left(\|\mathbf{x} - \mathbf{x}_0\|^3\right)$$

51 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints

The gradient ∇x g(x) is orthogonal to the hyperplane determined by g(x) = 0:


▶ Because both x_0 and x_0 + ε lie on the hyperplane determined by g(x) = 0, i. e., g(x_0) = 0 and g(x_0 + ε) = 0, it follows from Eq. (8) that
$$\boldsymbol{\epsilon}^T \left(\nabla_{\mathbf{x}} g(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0}\right) \approx 0$$
▶ If ∥ε∥ approaches zero, then $\boldsymbol{\epsilon}^T \left(\nabla_{\mathbf{x}} g(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0}\right) = 0$ holds
▶ Because in this case the direction of ε is tangential to the hyperplane determined by g(x) = 0, it follows that $\nabla_{\mathbf{x}} g(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0}$ is perpendicular to this hyperplane.⁴ Therefore, the red arrows in Fig. 6 are perpendicular to the curve g(x_1, x_2) = 0
⁴ When two vectors a and b are orthogonal, their scalar product is zero: a^T b = 0

52 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints

The gradient ∇x f (x) is orthogonal to the hyperplane determined by g(x) = 0 at xGoal :


▶ Looking for a point on the hyperplane of the constraint g(x) = 0 where f (x) becomes
maximal, the gradient ∇x f (x) must also be orthogonal to the hyperplane of the constraint
at this point. In Fig. 6 this means that at the sought point xGoal , the blue arrow is also
perpendicular to the red curve g(x1 , x2 ) = 0
▶ This is due to the fact that the gradient always points in the direction of the steepest ascent,
and also because, if ∇x f (x) were not orthogonal to the hyperplane of the constraint, the
value of f (x) could be increased by moving forward or backward on this hyperplane

53 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints

Lagrange Multiplier and Lagrange Function


▶ In the sought point xGoal , the two gradients ∇x g(x) and ∇x f (x) are parallel or
antiparallel vectors
▶ Thus, there must be a parameter λ at the sought point such that
$$\nabla_{\mathbf{x}} f(\mathbf{x}_{\mathrm{Goal}}) + \lambda \nabla_{\mathbf{x}} g(\mathbf{x}_{\mathrm{Goal}}) = \mathbf{0}, \tag{9}$$

where λ is called the Lagrange multiplier. The Lagrange multiplier can be both positive
and negative
▶ To find the sought point xGoal that maximizes f (x) and satisfies the equality constraint
g(x) = 0, the Lagrange function

L(x, λ) = f (x) + λg(x) (10)

can be introduced
54 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints

Lagrange Multiplier and Lagrange Function


▶ Setting the derivative of L(x, λ) with respect to x to zero, i. e., ∇_x L(x, λ) = 0, results in Eq. (9), and differentiating L(x, λ) with respect to λ and setting it to zero, i. e., ∂L(x, λ)/∂λ = 0, gives the constraint g(x) = 0
▶ Thus, x_Goal and λ can be found by calculating the stationary points of L(x, λ). From
$$\nabla_{\mathbf{x}} L(\mathbf{x}, \lambda) = \mathbf{0} \quad \text{and} \quad \frac{\partial L(\mathbf{x}, \lambda)}{\partial \lambda} = 0 \tag{11}$$
there are N + 1 equations from which both x_Goal and λ can be calculated
▶ If only xGoal is needed, one can eliminate λ from the equations, and it is not necessary to
calculate the value of λ
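As a small worked sketch of this procedure with SymPy (the example function and constraint are chosen purely for illustration): maximize f(x) = −x_1² − x_2² subject to g(x) = x_1 + x_2 − 1 = 0 by solving the stationarity conditions of Eq. (11):

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lambda", real=True)
f = -(x1**2 + x2**2)            # function to maximize
g = x1 + x2 - 1                 # equality constraint g(x) = 0
L = f + lam * g                 # Lagrange function, Eq. (10)

# Eq. (11): gradient w.r.t. x and derivative w.r.t. lambda set to zero
stationary = sp.solve([sp.diff(L, x1), sp.diff(L, x2), sp.diff(L, lam)],
                      [x1, x2, lam], dict=True)
print(stationary)               # -> x1 = x2 = 1/2, lambda = 1
```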

55 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 2 Inequality Constraints

▶ The task to be solved is to find the maximum of the function f (x), with the inequality constraint g(x) ≥ 0
being met
▶ Fig. 7 visualizes the task in the case that the dimension N of the vector x is 2
Figure 7: Geometric representation of the Lagrange multiplier method for inequality constraints (left: x_Goal lies inside the region g(x_1, x_2) > 0, i. e., the constraint is not active; right: x_Goal lies on the boundary g(x_1, x_2) = 0, i. e., the constraint is active).
▶ As shown, one has to distinguish between two cases, depending on whether the point xGoal lies within the
region where g(x) ≥ 0 applies, or whether it lies on the hyperplane g(x) = 0
56 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 2 Inequality Constraints

▶ In the first case (shown on the left in Fig. 7), one also speaks of the constraint being not
active because the constraint does not play a role and the point xGoal is calculated solely by
∇x f (x) = 0. This corresponds to the calculation of a stationary point of the Lagrange
function from Eq. (10), with λ = 0
▶ In the second case (shown on the right in Fig. 7), where xGoal lies on the hyperplane
g(x) = 0, one also speaks of the constraint being active. It is an analogous case to the
optimization with equality constraints and corresponds to the calculation of a stationary
point of the Lagrange function from Eq. (10), with λ ̸= 0. Unlike the optimization with
equality constraints, however, the sign of λ does play a. A maximum of f (x) is only
achieved if the gradient ∇x f (x) points in the opposite direction of the region where
g(x) ≥ 0 applies, as visualized on the right in Fig. 7. The blue arrows represent the
gradient vectors of the function f (x) at the respective points. In this second case,
∇x f (x) = −λ∇x g(x), with λ > 0 holds

57 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 2 Inequality Constraints

▶ In both cases, λ g(x_Goal) = 0 is true: in the first case because λ = 0, and in the second case because x_Goal lies on the hyperplane g(x) = 0. Using the Lagrange function from Eq. (10), L(x, λ) = f(x) + λg(x), one can state for a local maximum x_Goal that there is a λ* such that
$$\nabla_{\mathbf{x}} L(\mathbf{x}_{\mathrm{Goal}}, \lambda^*) = \mathbf{0} \tag{12}$$
$$g(\mathbf{x}_{\mathrm{Goal}}) \geq 0 \tag{13}$$
$$\lambda^* \geq 0 \tag{14}$$
$$\lambda^*\, g(\mathbf{x}_{\mathrm{Goal}}) = 0 \tag{15}$$
▶ These conditions are known as the Karush-Kuhn-Tucker (KKT) conditions. They are very useful because they allow searching for solutions x for which one can find λ*

58 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 2 Inequality Constraints

If one wishes to minimize the function f (x) (instead of maximizing) subject to the inequality
constraint g(x) ≥ 0, one would have to suppose in Fig. 7 for visualization that c3 < c2 < c1
and hence, draw all the blue arrows in the respective opposite directions. If the constraints are
active, then the gradients ∇x f (x) and ∇x g(x) point in the same direction and thus
∇x f (x) = λ∇x g(x) with λ > 0. As a result, the Lagrange function is

L(x, λ) = f (x) − λg(x). (16)

The task is therefore to determine the stationary point of this Lagrange function according to
Eq. (16) with respect to x and λ, with λ ≥ 0

59 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 2 Inequality Constraints

If one wishes to extend the method of Lagrange multipliers to optimization tasks with several
equality and inequality constraints, this can be achieved according to the above considerations.
If the task involves maximizing the function f (x) subject to the equality constraints gk (x) = 0
where k = 1, . . . , K, and the inequality constraints hm (x) ≥ 0 where m = 1, . . . , M, one must
introduce the Lagrange multipliers λ = [λ1 , . . . , λK ]T and µ = [µ1 , . . . , µM ]T and calculate the
stationary points of the Lagrange function
$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{k=1}^{K} \lambda_k\, g_k(\mathbf{x}) + \sum_{m=1}^{M} \mu_m\, h_m(\mathbf{x}) \tag{17}$$
with respect to x, λ, and µ, subject to the conditions µ_m ≥ 0 and µ_m h_m(x) = 0 for m = 1, . . . , M
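A numerical counterpart can be sketched with scipy.optimize, whose SLSQP method accepts equality constraints g(x) = 0 and inequality constraints h(x) ≥ 0 in exactly this sign convention; note that minimize minimizes, so a maximization of f corresponds to minimizing −f. The toy problem below is purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2                           # objective to minimize
cons = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 1},   # g(x) = 0
    {"type": "ineq", "fun": lambda x: x[0] - 0.7},        # h(x) >= 0
]
res = minimize(f, x0=np.array([0.0, 0.0]), method="SLSQP", constraints=cons)
print(res.x)   # -> approximately [0.7, 0.3]; the inequality constraint is active
```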

60 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


Outline
Foundations of Machine Learning

1 1. Basics
1. 1 Introduction
1. 2 Curse of Dimensionality
1. 3 Normalization of Features
1. 4 Similarity and Dissimilarity Measures
1. 5 Constrained Optimization using Lagrange Multipliers
1. 6 Dimensionality Reduction and Visualization
1. Basics
1. 6 Dimensionality Reduction and Visualization

▶ Redundancies and noise are often contained in the input features, which make learning
difficult (Curse of Dimensionality)
▶ The extraction of a few but relevant features simplifies learning and leads to reduced
resource consumption
▶ With only 2 or 3 dimensions, data visualization becomes possible.
⇒ Reduction of the dimension of the input space is possible, preferably without “loss of
information”

61 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization

Dimensionality Reduction
▶ Linear combinations of input features are attractive because they are simple to compute and the projection methods are easy to interpret
▶ Principal Component Analysis (PCA): linear projection of input features into a
low-dimensional space, which leads to an optimal representation by minimization of the
square error.
▶ Multidimensional Scaling: a technique to visualize the level of similarity of individual cases
of a dataset.
▶ Nonlinear combinations of input features: t-SNE, Autoencoder, Self-Organizing Maps,
Locally-Linear Embedding, IsoMap, UMAP, etc.
▶ Feature Selection: Filter, Wrapper or Embedded methods

62 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Two approaches for the derivation


1 Finding the coordinate axes in whose direction the variance (“dynamics”) of the data is
greatest after the projection. The motivation for this approach is noise and redundancy in
the input features
2 Finding the linear projection that leads to the smallest reconstruction error in terms of
minimizing the squared error
⇒ both approaches lead to the same result

63 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


Example:5 a mass point oscillates without friction on a spring and its position in the plane (x, y) is captured by 3
sensors, which are attached at different locations

Figure 8: Data recording [Shl14]. Figure 9: Measurements from sensor A with noise [Shl14].
▶ Each sensor provides a vector with position data at each time point. Therefore, for the one-dimensional movement, the input
vector is obtained as [xA , yA , xB , yB , xC , yC ]T
▶ Focusing only on sensor A and assuming that the measurement points are noisy, a scenario as shown in the figure on the
right is obtained. Without noise, the measurements from each sensor would result in a straight segment. The goal is to find
projection axes so that the data exhibit a large variance (“dynamics”) after projection
5
[Shl14] A Tutorial on Principal Component Analysis; J. Shlens; arXiv:1404.1100.
64 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


Continuation of the example: Clearly, with the input vector [xA , yA , xB , yB , xC , yC ]T , a reduction of features is possible
and meaningful
▶ For example, if sensors A and B are very close to each other, the features xA and xB contain a lot of redundancy
▶ In general, a lot of redundancy is indicated when one feature can be well described by another; for example, for features r_2 and r_1, a relationship r_2 ≈ k·r_1 implies the presence of considerable redundancy. It would be more meaningful to use only one feature instead of r_1 and r_2, for example, the linear combination r_2 − k·r_1.

Figure 10: Different levels of redundancy [Shl14].


▶ The goal is to find projection axes that lead to new features after projection that are uncorrelated with each other.
Maximizing the variance of the projected data (orthogonal axes) results in a diagonal structure of the sample covariance
matrix of these projected data (i. e., estimated correlation zero)
65 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


We assume in the following that all features are centered, i. e., the expected value of each feature is zero. If this is not the case, one can center the data in a preprocessing step by subtracting the sample expected value in each feature: $\mathbf{x}_m = \mathbf{x}'_m - \hat{\boldsymbol{\mu}}$, with $\hat{\boldsymbol{\mu}} = \frac{1}{M}\sum_{m=1}^{M}\mathbf{x}'_m$

▶ Generally, the projection of the vector a onto the vector b is calculated as $\frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{b}\|_2^2}\,\mathbf{b}$. If one wants to express the projection only through $\mathbf{a}^T\mathbf{b}$, the vector b must be a unit vector, i. e., ∥b∥_2 = 1
▶ If the available data x_m ∈ R^N from D are contained in the matrix X ∈ R^{N×M}, the projection of the M data points onto an axis described by the unit vector u_1 ∈ R^N is calculated as
$$\mathbf{y}_1^T = \mathbf{u}_1^T X, \quad \text{with } \mathbf{y}_1^T \in \mathbb{R}^{1\times M}$$
▶ The optimization task now consists in finding u_1 in such a way that the sample variance of the data after the projection becomes maximal, under the constraint that ∥u_1∥_2 = 1. The projected points are also centered, which means they have a sample expected value of zero.⁶ With this, the optimization task is
$$\mathbf{u}_1 = \underset{\tilde{\mathbf{u}}}{\operatorname{argmax}}\left\{\left\|\tilde{\mathbf{u}}^T X\right\|^2\right\} \quad \text{s. t.} \quad \|\mathbf{u}_1\|_2 - 1 = 0 \tag{18}$$
⁶ $\frac{1}{M}\sum_{m=1}^{M} y_{1,m} = \frac{1}{M}\sum_{m=1}^{M}\mathbf{u}_1^T\mathbf{x}_m = \mathbf{u}_1^T\left(\frac{1}{M}\sum_{m=1}^{M}\mathbf{x}_m\right) = \mathbf{u}_1^T\mathbf{0} = 0.$
66 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


▶ The optimization task in (18) can be solved using Lagrange multipliers. The Lagrange function from Eq. (10) is here
$$L(\mathbf{u}_1, \lambda) = \left\|\mathbf{u}_1^T X\right\|^2 + \lambda\left(\|\mathbf{u}_1\|_2^2 - 1\right).$$
▶ To find the optimal vector u_1, apply Eq. (11):
$$\nabla_{\mathbf{u}_1} L(\mathbf{u}_1, \lambda) = \frac{\partial\left(\mathbf{u}_1^T X X^T \mathbf{u}_1\right)}{\partial \mathbf{u}_1} + \lambda\,\frac{\partial\left(\mathbf{u}_1^T\mathbf{u}_1 - 1\right)}{\partial \mathbf{u}_1} \overset{!}{=} \mathbf{0}$$
With $\frac{\partial\, \mathbf{a}^T\mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}$ and $\frac{\partial\, \mathbf{b}^T\mathbf{a}}{\partial \mathbf{a}} = \mathbf{b}$ one obtains
$$\nabla_{\mathbf{u}_1} L(\mathbf{u}_1, \lambda) = 2\, X X^T \mathbf{u}_1 + 2\lambda\,\mathbf{u}_1 \overset{!}{=} \mathbf{0}.$$
So, one obtains the equation for determining the eigenvectors of the matrix X X^T
$$X X^T \mathbf{u}_1 = \lambda' \mathbf{u}_1 \tag{19}$$
▶ So, the optimal vector u_1 is the first eigenvector of the matrix X X^T ∈ R^{N×N}


67 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


▶ The matrix X X^T is equal to the sample covariance matrix up to a scaling factor
$$\hat{C}_{\mathbf{x}} = \frac{1}{M-1}\underbrace{\sum_{m=1}^{M}\mathbf{x}_m\mathbf{x}_m^T}_{X X^T} \tag{20}$$
So, u_1 is the first eigenvector of the sample covariance matrix
▶ If one wants to project the M data points x_m not only into one dimension, i. e. onto u_1, but into two dimensions, it makes sense to find a suitable second vector u_2 that is orthogonal to u_1. As will be shown below, choosing orthogonal vectors u_1 and u_2 avoids redundancies in the projected data. The resulting optimization task is therefore
$$\mathbf{u}_2 = \underset{\tilde{\mathbf{u}}}{\operatorname{argmax}}\left\{\left\|\tilde{\mathbf{u}}^T X\right\|^2\right\} \quad \text{s. t.} \quad \|\mathbf{u}_2\|_2 - 1 = 0 \ \text{ and } \ \mathbf{u}_1^T\mathbf{u}_2 = 0 \tag{21}$$

68 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


▶ The optimization task in (21) can be solved using Lagrange multipliers. The Lagrange function from Eq. (10) is here
$$L(\mathbf{u}_2, \lambda_1, \lambda_2) = \left\|\mathbf{u}_2^T X\right\|^2 + \lambda_1\left(\|\mathbf{u}_2\|_2^2 - 1\right) + \lambda_2\,\mathbf{u}_1^T\mathbf{u}_2.$$
▶ To find the optimal vector u_2, apply Eq. (11)
$$\nabla_{\mathbf{u}_2} L(\mathbf{u}_2, \lambda_1, \lambda_2) = \frac{\partial\left(\mathbf{u}_2^T X X^T \mathbf{u}_2\right)}{\partial \mathbf{u}_2} + \lambda_1\,\frac{\partial\left(\mathbf{u}_2^T\mathbf{u}_2 - 1\right)}{\partial \mathbf{u}_2} + \lambda_2\,\frac{\partial\left(\mathbf{u}_1^T\mathbf{u}_2\right)}{\partial \mathbf{u}_2} \overset{!}{=} \mathbf{0}$$
With $\frac{\partial\, \mathbf{a}^T\mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}$ and $\frac{\partial\, \mathbf{b}^T\mathbf{a}}{\partial \mathbf{a}} = \mathbf{b}$ one obtains
$$\nabla_{\mathbf{u}_2} L(\mathbf{u}_2, \lambda_1, \lambda_2) = 2\, X X^T \mathbf{u}_2 + 2\lambda_1\mathbf{u}_2 + \lambda_2\mathbf{u}_1 \overset{!}{=} \mathbf{0} \tag{22}$$
Multiplication of Eq. (22) from the left with u_1^T leads to
$$2\underbrace{\mathbf{u}_1^T X X^T}_{\lambda'\mathbf{u}_1^T \text{ due to Eq. (19)}}\mathbf{u}_2 + 2\lambda_1\underbrace{\mathbf{u}_1^T\mathbf{u}_2}_{0} + \lambda_2\underbrace{\mathbf{u}_1^T\mathbf{u}_1}_{1} = 0 \;\;\Rightarrow\;\; 2\lambda'\underbrace{\mathbf{u}_1^T\mathbf{u}_2}_{0} + \lambda_2 = 0 \;\;\Rightarrow\;\; \lambda_2 = 0$$

69 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


▶ If one now sets λ_2 = 0 in Eq. (22), one gets the equation for determining the eigenvectors of the matrix X X^T
$$X X^T \mathbf{u}_2 = \lambda_1' \mathbf{u}_2 \tag{23}$$
▶ So, the optimal vector u_2 is the second eigenvector of the matrix X X^T ∈ R^{N×N} and thus of the sample covariance matrix Ĉ_x from Eq. (20)
▶ If one wants to project the M data points x_m into a K-dimensional subspace in such a way that the axes onto which they are projected are orthogonal to each other, the steps described above must be repeated for the derivation. The result is that the K vectors u_k onto which one must project are the first K eigenvectors of the sample covariance matrix Ĉ_x
▶ The projection x_{m,proj} ∈ R^K of a data point x_m ∈ R^N into the K-dimensional subspace is therefore done by multiplying with the matrix U_K^T ∈ R^{K×N}
$$\mathbf{x}_{m,\mathrm{proj}} = U_K^T\,\mathbf{x}_m \quad \text{with} \quad U_K^T = \begin{bmatrix}\mathbf{u}_1^T\\ \mathbf{u}_2^T\\ \vdots\\ \mathbf{u}_K^T\end{bmatrix} \tag{24}$$
70 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via maximization of variance:


As mentioned above, after the projection the sample covariance matrix is a diagonal matrix, i. e., the estimated correlation between the projected features is zero:
$$\hat{C}_{\mathbf{x}_{\mathrm{proj}}} = \frac{1}{M-1}\sum_{m=1}^{M}\mathbf{x}_{m,\mathrm{proj}}\,\mathbf{x}_{m,\mathrm{proj}}^T = \frac{1}{M-1}\sum_{m=1}^{M} U_K^T\,\mathbf{x}_m\mathbf{x}_m^T\, U_K = U_K^T\underbrace{\left(\frac{1}{M-1}\sum_{m=1}^{M}\mathbf{x}_m\mathbf{x}_m^T\right)}_{\hat{C}_{\mathbf{x}},\ \text{see Eq. (20)}} U_K$$
$$= U_K^T\,\hat{C}_{\mathbf{x}}\,U_K = U_K^T\underbrace{U\,\Delta\,U^T}_{\text{eigenvalue decomposition of }\hat{C}_{\mathbf{x}}} U_K = \begin{bmatrix}\lambda_1 & 0 & \ldots & 0\\ 0 & \lambda_2 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & \lambda_K\end{bmatrix},$$
because the column vectors in U and U_K are the eigenvectors of Ĉ_x and these are orthogonal to each other
71 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via the minimization of the reconstruction error:


▶ The projection of the data in X ∈ R^{N×M} onto a unit vector u ∈ R^N takes place via x_proj^T = u^T X, with x_proj ∈ R^M
▶ The m-th element x_{m,proj} in x_proj indicates the position of x_m along u
▶ If one wants to project x_{m,proj} back from the one-dimensional space into the N-dimensional space, i. e., reconstruct x_m from x_{m,proj}, this is done using x_{m,proj}·u
▶ The reconstruction of all M data points is obtained via
$$X_{\mathrm{rec}} = \mathbf{u}\,\mathbf{x}_{\mathrm{proj}}^T$$

72 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Derivation via the minimization of the reconstruction error:


▶ If one wants to choose u in such a way that the reconstruction error becomes minimal in terms of minimizing the squared error, one gets
$$\sum_{m=1}^{M}\left\|\mathbf{x}_m - \mathbf{u}\mathbf{u}^T\mathbf{x}_m\right\|^2 = \sum_{m=1}^{M}\left(\mathbf{x}_m^T - \mathbf{x}_m^T\mathbf{u}\mathbf{u}^T\right)\left(\mathbf{x}_m - \mathbf{u}\mathbf{u}^T\mathbf{x}_m\right) = \sum_{m=1}^{M}\mathbf{x}_m^T\mathbf{x}_m - 2\,\mathbf{x}_m^T\mathbf{u}\mathbf{u}^T\mathbf{x}_m + \mathbf{x}_m^T\mathbf{u}\mathbf{u}^T\mathbf{x}_m$$
$$= \sum_{m=1}^{M}\|\mathbf{x}_m\|^2 - \sum_{m=1}^{M}\left(\mathbf{u}^T\mathbf{x}_m\right)^T\left(\mathbf{u}^T\mathbf{x}_m\right) = \sum_{m=1}^{M}\|\mathbf{x}_m\|^2 - \sum_{m=1}^{M}\left\|\mathbf{u}^T\mathbf{x}_m\right\|^2 = \sum_{m=1}^{M}\|\mathbf{x}_m\|^2 - \left\|\mathbf{u}^T X\right\|^2$$
▶ The first term is positive and cannot be influenced by the choice of u; only the second one can. Thus, minimizing the reconstruction error corresponds to maximizing ∥u^T X∥². Therefore, this approach also leads to the optimization task from Eq. (18) and to the result that the optimal vectors u for projecting the data x_m are the eigenvectors of the sample covariance matrix Ĉ_x

73 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

PCA Summary:
▶ PCA is a linear projection method, i. e., a linear combination of the original features occurs for each new
feature in the reduced space
▶ The projection is carried out on the main axes (Principal Components) of the data matrix X
▶ The main axes of the data matrix X are the eigenvectors of the sample covariance matrix Ĉx , i. e., the
columns of the matrix U
▶ The k-th value on the diagonal of the diagonal matrix Ĉxm,proj is the sample variance of the data in X along the
direction uk
▶ PCA attempts to retain the global structure of the data through variance maximization, i. e., it attempts to
project data clusters “as a whole”, which can result in loss of local structure
Computation of PCA and dimensionality reduction:
▶ Centering each individual feature by subtracting the mean (sample expected value) ⇒ X ∈ RN×M
▶ Calculation of the eigenvectors of the sample covariance matrix Ĉx from Eq. (20) ⇒ U ∈ RN×N
▶ Selection of the K eigenvectors corresponding to the largest K eigenvalues ⇒ UK ∈ RN×K
▶ Projection of the data points in X using the first K eigenvectors in U
Xproj = UKT X
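A minimal NumPy sketch of these computation steps (centering, sample covariance, eigendecomposition, projection); the random data and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(5, 200))                  # N = 5 features, M = 200 patterns

# 1. Center each feature
X = X_raw - X_raw.mean(axis=1, keepdims=True)      # X is N x M

# 2. Sample covariance matrix, Eq. (20)
M = X.shape[1]
C_hat = (X @ X.T) / (M - 1)

# 3. Eigendecomposition; sort eigenvectors by decreasing eigenvalue
eigvals, U = np.linalg.eigh(C_hat)                 # eigh: C_hat is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# 4. Keep the first K eigenvectors and project
K = 2
U_K = U[:, :K]                                     # N x K
X_proj = U_K.T @ X                                 # K x M

# Proportion of the variance covered by the first K principal components
print(eigvals[:K].sum() / eigvals.sum())
```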

74 Foundations of Machine Learning; ST 2024; Prof. Dr.-Ing. Michael Botsch


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Which value of K is suitable for the projection?


▶ For visualizations, K = 2 or K = 3 is chosen
▶ If one does not want to use PCA for visualization but for dimensionality reduction in preprocessing, a meaningful approach is to require that the projection covers a certain proportion τ of the variance, for example τ = 90%. In this case, K is determined such that the following still holds
$$\frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{N} \lambda_k} > \tau$$
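A brief sketch of this criterion (the eigenvalues are assumed to be sorted in descending order; choosing the smallest K satisfying the inequality is an assumption of this sketch):

```python
import numpy as np

eigvals = np.array([4.2, 2.1, 0.9, 0.5, 0.2, 0.1])   # made-up eigenvalues, sorted in descending order
tau = 0.90

explained = np.cumsum(eigvals) / np.sum(eigvals)      # cumulative proportion of the total variance
K = int(np.argmax(explained > tau)) + 1               # smallest K for which the proportion exceeds tau
print(explained.round(3), K)
```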

Complexity of the PCA:7


▶ Sample expected value for N features: O(NM), because M numbers are added per feature
▶ Sample covariance matrix $\hat{C} = \frac{1}{M-1} X X^T \in \mathbb{R}^{N \times N}$: for each entry of the matrix, two vectors of dimension M are multiplied, and there are N² entries in total, i. e., O(MN²)
▶ Eigenvalue decomposition for K eigenvalues/eigenvectors: O(KN²)
▶ Projection of the M data points: MK entries in X_proj, for each of which two N-dimensional vectors are multiplied, i. e., O(MKN)
7
Overview of the complexity of some mathematical operations: [Wik24]
https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)

Examples PCA:8
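The referenced handbook examples are not reproduced here; as a placeholder, a short sketch using scikit-learn's PCA on made-up toy data could look as follows (note that scikit-learn expects samples in rows, whereas the slides use features in rows):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 200 samples with 2 correlated features (samples in rows, as scikit-learn expects)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.4]])

pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)             # principal axes (eigenvectors of the sample covariance matrix)
print(pca.explained_variance_)     # sample variances along the principal axes
X_proj = pca.transform(X)          # projected data
```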

8: [Van16]: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html;
Python code: see Moodle



1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

▶ t-SNE9 is a method for dimensionality reduction that aims to preserve the local neighborhood relationships of each point, in contrast to PCA, which aims to retain the global structure of the data
▶ The basic idea of t-SNE is to represent the distances between points by probability density
functions (PDF). For two points in a high-dimensional space, the value of a joint PDF is
determined, and then for these two points the value of another PDF is defined in a low-dimensional
space. In an optimization step, the aim is to maximize the similarity between the densities, i. e., to
preserve roughly the same distances between the points in the low-dimensional space as in the
high-dimensional space. The selection of Gaussian or Gaussian-like PDFs ensures that local
neighborhoods are preserved
▶ To compare PDFs for points in high-dimensional space with PDFs for points in low-dimensional
space, a similarity or dissimilarity measure for PDFs is needed. In t-SNE, the Kullback-Leibler
divergence is used

9
t-SNE was introduced in 2008 by L. J. P. van der Maaten and G. E. Hinton [vdMH08]. The method is based on the “Stochastic
Neighbor Embedding” technique by Hinton and Roweis from 2002.
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

Prerequisites: Kullback-Leibler Divergence
The random variable x is distributed according to the probability P. If Q is a probability on the same event space, it may be
important to compare the two probability density functions p(x = x) and q(x = x). For this, usually, the Kullback-Leibler
divergence is used. It is a suitable measure to describe the dissimilarity of the two PDFs

$$\mathrm{KL}(p\,\|\,q) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\mathbf{x}=x)\, \ln\!\left(\frac{p(\mathbf{x}=x)}{q(\mathbf{x}=x)}\right) \mathrm{d}x_1 \cdots \mathrm{d}x_N$$

Some properties of the Kullback-Leibler divergence are:


▶ It is not symmetric, i. e., KL(p∥q) ̸= KL(q∥p).
▶ It is always greater than or equal to zero KL(p∥q) ≥ 0.
▶ If KL(p∥q) = 0, then p = q.
If the random variable x is a discrete random variable with the range of values X, the Kullback-Leibler divergence for the two
probabilities P(x = x) and Q(x = x) is given as

$$\mathrm{KL}(P\,\|\,Q) = \sum_{x \in \mathbb{X}} P(\mathbf{x}=x)\, \ln\!\left(\frac{P(\mathbf{x}=x)}{Q(\mathbf{x}=x)}\right)$$
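A small numerical sketch of the discrete Kullback-Leibler divergence (toy distributions, chosen only for illustration):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])           # toy probability distribution
Q = np.array([0.4, 0.4, 0.2])           # a second distribution on the same event space

kl_pq = np.sum(P * np.log(P / Q))       # KL(P||Q)
kl_qp = np.sum(Q * np.log(Q / P))       # KL(Q||P): generally a different value (not symmetric)
print(kl_pq, kl_qp)
```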



1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

1 In a first step, the Euclidean distance between two data points xm ∈ RN and xj ∈ RN is converted
into a conditional probability density pj|m , which represents a similarity:
▶ The similarity of the data point xj to the data point xm is the conditional probability pj|m that
xm would choose xj as its neighbor, if the neighborhood is chosen proportionally to their
probability density under a Gaussian PDF centered around xm :
$$p_{j|m} = \frac{\exp\!\left(-\frac{\|x_m - x_j\|^2}{2\sigma_m^2}\right)}{\sum_{k \neq m} \exp\!\left(-\frac{\|x_m - x_k\|^2}{2\sigma_m^2}\right)}$$

Here, σm2 is the variance of the Gaussian PDF centered around xm


▶ Since only pairwise similarities are of interest, pm|m is set to zero
▶ The variance σm2 is different for each point m and is chosen such that points in areas with high
density in RN receive a lower variance than points in areas with low density
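This first step could be sketched as follows (toy data; for simplicity a common σ is passed for all points, whereas t-SNE determines σm per point as described next):

```python
import numpy as np

def conditional_p(X, sigmas):
    """X: (M, N) data points in rows; sigmas: (M,) per-point standard deviations."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # (M, M) squared distances
    logits = -sq_dists / (2.0 * sigmas[:, None] ** 2)                  # row m uses sigma_m
    np.fill_diagonal(logits, -np.inf)                                  # enforces p_{m|m} = 0
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)                            # normalize each row

X = np.random.default_rng(2).normal(size=(6, 3))    # 6 toy points in R^3
P_cond = conditional_p(X, sigmas=np.ones(6))        # here: a common sigma = 1 for illustration
print(P_cond.sum(axis=1))                           # each row sums to 1
```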



1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

1 In a first step, the Euclidean distance between two data points xm ∈ RN and xj ∈ RN is converted
into a conditional probability density pj|m , which represents similarity:
▶ The choice of variance σm2 is based on a smooth measure for the effective number of neighbors,
specifically on the so-called perplexity. The perplexity is defined as
$$\mathrm{Perp}(P_m) = 2^{H(P_m)},$$

where H(Pm ) is the Shannon entropy, measured in bits10


$$H(P_m) = -\sum_{j} p_{j|m} \log_2 p_{j|m}$$

The entropy increases with increasing values of σm2


▶ A binary search algorithm is used to determine the value of σm2 per data point xm in such a way that it
leads to a predetermined perplexity
▶ The t-SNE algorithm is relatively robust with respect to the choice of perplexity; in practice, this value is chosen between 5 and 50
10: Consider a random variable: the information content of a realization with probability P is defined as −log2(P). The entropy is the expected value of the information content.
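Such a binary search for σm could be sketched as follows for a single data point (the search bounds, the fixed iteration count, and the variable names are assumptions of this sketch):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_m, target_perp, n_iter=50):
    """sq_dists_m: squared distances from x_m to all other points (entry for m itself removed)."""
    lo, hi = 1e-10, 1e10                       # assumed search bounds
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dists_m / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if 2.0 ** entropy > target_perp:       # perplexity too high -> decrease sigma
            hi = sigma
        else:                                  # perplexity too low -> increase sigma
            lo = sigma
    return sigma

d = np.array([0.5, 1.0, 2.0, 4.0, 8.0])        # toy squared distances
print(sigma_for_perplexity(d, target_perp=3.0))
```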
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

2 In a second step, the similarity between two data points is introduced as a symmetrized version of
the conditional probabilities
$$p_{mj} = \frac{p_{j|m} + p_{m|j}}{2M},$$
so that $\sum_j p_{mj} > \frac{1}{2M}$ and thus each data point xm has a significant contribution to the cost function
(introduced further below)
3 In a third step, the similarity of the data points xred,m ∈ RNred and xred,j ∈ RNred in the
low-dimensional space RNred is described by the PDF qmj
$$q_{mj} = \frac{\left(1 + \|x_{\mathrm{red},m} - x_{\mathrm{red},j}\|^2\right)^{-1}}{\sum_{k \neq m} \left(1 + \|x_{\mathrm{red},m} - x_{\mathrm{red},k}\|^2\right)^{-1}}\,.$$

This is the Student’s t-distribution with one degree of freedom, or in other words, the Cauchy
distribution
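Steps 2 and 3 could be sketched as follows, reusing the conditional similarities from the earlier sketch (illustrative only; `P_cond` and `Y` are assumed inputs):

```python
import numpy as np

def joint_p(P_cond):
    """Symmetrized similarities p_mj from the conditional matrix (step 2)."""
    M = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * M)

def low_dim_q(Y):
    """Student-t / Cauchy similarities q_mj in the low-dimensional space (step 3)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)                       # q_mm = 0
    return inv / inv.sum(axis=1, keepdims=True)      # per-row normalization, as in the slide formula

Y = np.random.default_rng(3).normal(size=(6, 2))     # toy low-dimensional coordinates
# Example usage: P = joint_p(P_cond); Q = low_dim_q(Y)
```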
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)
3 In a third step, the similarity of the data points xred,m ∈ RNred and xred,j ∈ RNred in the
low-dimensional space RNred is described by the PDF qmj
▶ The advantage of the Student’s t-distribution (or Cauchy distribution) is that it has higher values at the
tails compared to the Gaussian distribution. Thereby, in the attempt to match qmj and pmj , data points
that have “medium” distances to each other in the high-dimensional space RN can also have “medium”
distances to each other in the low-dimensional space RNred .11 This is advantageous to enable clustering
in the low-dimensional space (avoidance of the “crowding problem” [vdMH08])
▶ Comparison between Gaussian and Cauchy distribution (Student’s t-distribution)

Figure 11: The Student’s t-distribution (Cauchy distribution) has higher values at the tails than the Gaussian distribution.
11
Otherwise, if small distances are correctly mapped in RNred , there would not be enough space left in RNred for “medium” distances and data points with “medium”
distances would be mapped with “large” distances.
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

4 In a fourth step, the data points in the low-dimensional space RNred are moved in such a way that the
PDF qmj comes as close as possible to the PDF pmj for all m and j
▶ This is realized by minimizing the Kullback-Leibler divergence
$$\mathrm{KL}(P\,\|\,Q) = \sum_{m}\sum_{j} p_{mj} \log\frac{p_{mj}}{q_{mj}}\,,$$

where pmm and qmm are set to zero


▶ The gradient for implementing the minimization task using the gradient descent method is (derivation: [vdMH08])

$$\frac{\partial\,\mathrm{KL}(P\,\|\,Q)}{\partial x_{\mathrm{red},m}} = 4 \sum_{j} (p_{mj} - q_{mj})\, \frac{x_{\mathrm{red},m} - x_{\mathrm{red},j}}{1 + \|x_{\mathrm{red},m} - x_{\mathrm{red},j}\|^2}\,.$$

▶ In the gradient descent method, a momentum term is introduced
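A sketch of one such update, combining the gradient above with a momentum term (the learning rate η and momentum weight α are made-up values; Q would be computed, e.g., with the `low_dim_q` sketch above):

```python
import numpy as np

def tsne_gradient(P, Q, Y):
    """Gradient of KL(P||Q) with respect to the low-dimensional coordinates Y (shape (M, N_red))."""
    diff = Y[:, None, :] - Y[None, :, :]                    # pairwise differences, shape (M, M, N_red)
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))          # 1 / (1 + ||y_m - y_j||^2)
    return 4.0 * np.einsum('mj,mj,mjd->md', P - Q, inv, diff)

def momentum_step(Y, Y_prev, P, Q, eta=100.0, alpha=0.5):
    """One gradient descent update with a momentum term (eta and alpha are illustrative values)."""
    Y_new = Y - eta * tsne_gradient(P, Q, Y) + alpha * (Y - Y_prev)
    return Y_new, Y                                          # new coordinates and the previous ones
```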



1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

Algorithm from the original paper [vdMH08]


(relation to the notation used here: n ≡ M; x_i ≡ xm ∈ RN; y_i ≡ xred,m ∈ RNred; C ≡ KL(P∥Q))



1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

Example with the “HELLO” dataset from [Van16]


[Figure: the nonlinearly transformed “HELLO” dataset (left) and its t-SNE projection with a fixed perplexity (right); axis tick labels not reproduced.]

⇒ t-SNE is well suited for a dataset that has been nonlinearly embedded into a higher-dimensional space. With t-SNE, local neighborhoods are preserved.12
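For experiments like this, scikit-learn's t-SNE implementation can be used; a brief usage sketch on placeholder data (parameter values are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
X_high = rng.normal(size=(500, 10))          # placeholder for a high-dimensional dataset

tsne = TSNE(n_components=2, perplexity=30.0, init='pca', random_state=0)
X_red = tsne.fit_transform(X_high)           # low-dimensional embedding, shape (500, 2)
print(X_red.shape)
```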

12: Python code: see Moodle
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

Choice of the perplexity in t-SNE


▶ The perplexity balances the local and global aspects of the dataset in the visualization. A perplexity value that is too large leads to a merging of clusters, and one that is too small leads to many small, closely spaced clusters. The value usually lies between 5 and 50.
▶ Example for the “HELLO” dataset with different perplexity values
[Figure: t-SNE projections of the “HELLO” dataset for six different perplexity values (panel titles: “t-SNE with perplexity …”); axis tick labels not reproduced.]



1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)

▶ Further examples of t-SNE with Python code:


▶ https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/

Figure 12: Projection of the MNIST dataset to 2D using t-SNE.


▶ https://blog.paperspace.com/dimension-reduction-with-t-sne/ or
https://github.com/asdspal/dimRed/blob/master/tsne.ipynb

▶ The complexity of t-SNE is O(M²), but it can be reduced to O(M log M) (e. g., with the Barnes-Hut approximation)


1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 3 Other nonlinear methods

▶ Other methods that aim to maintain local neighborhood relationships after projection into
low-dimensional spaces (and not the global structure like PCA) are:
▶ Isomap
▶ Locally-Linear Embedding
▶ Uniform Manifold Approximation and Projection (UMAP)
▶ Self-Organizing Maps
▶ ...
▶ Autoencoder: see Section 11.4
▶ . . .13

13
e. g.: https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 3 Other nonlinear methods

Example with the “HELLO” dataset from [Van16] for Locally-Linear Embedding
[Figure: the nonlinearly transformed “HELLO” dataset (left) and Locally-Linear Embedding applied to the nonlinearly transformed “HELLO” dataset (right); axis tick labels not reproduced.]

⇒ Locally-Linear Embedding can also be well suited for a dataset that has been nonlinearly embedded into a higher-dimensional space14
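As with t-SNE, scikit-learn provides an implementation of Locally-Linear Embedding; a brief usage sketch on placeholder data (parameter values are illustrative):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(5)
X_high = rng.normal(size=(500, 10))                    # placeholder for a high-dimensional dataset

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
X_red = lle.fit_transform(X_high)                      # low-dimensional embedding, shape (500, 2)
print(X_red.shape)
```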

14: Python code: see Moodle
Bibliography
[Aut20] Various Authors.
Sensitivity and specificity.
In https://en.wikipedia.org/wiki/Sensitivity_and_specificity, 2020.
[BB01] Michele Banko and Eric Brill.
Scaling to very very large corpora for natural language disambiguation.
In Proceedings of the 39th Annual Meeting on Association for Computational
Linguistics, ACL ’01, pages 26–33, Stroudsburg, PA, USA, 2001. Association for
Computational Linguistics.
[BFOS84] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone.
Classification and Regression Trees.
The Wadsworth & Brooks/Cole Statistics/Probability Series. Wadsworth, 1984.
[Bis95] C. M. Bishop.
Neural Networks for Pattern Recognition.
Oxford University Press, 1995.
[CKV13] M. E. Celebi, H. A. Kingravi, and P. A. Vela.
A comparative study of efficient initialization methods for the k-means clustering
algorithm.
Expert Systems with Applications, 40(1):200–210, 2013.
[Doe16] C. Doersch.
Tutorial on variational autoencoders.
arXiv:1606.05908, 2016.
[Fro18] J. Frochte.
Maschinelles Lernen.
Hanser, 2018.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning.
MIT Press, 2016.
http://www.deeplearningbook.org.
[KW14] D. P. Kingma and M. Welling.
Auto-encoding variational bayes.
arXiv: 1312.6114, 2014.
[Shl14] J. Shlens.
A tutorial on principal component analysis.
arXiv:1404.1100, 2014.
[Van16] J. VanderPlas.
The Python Data Science Handbook.
O’Reilly, 2016.
[vdMH08] Laurens van der Maaten and Geoffrey Hinton.
Visualizing data using t-SNE.
Journal of Machine Learning Research, 9:2579–2605, 2008.
[VLL+ 10] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol.
Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion.
Journal of Machine Learning Research, 11:3371–3408, 2010.
[Wik24] Wikipedia contributors.
Computational complexity of mathematical operations, 2024.
[Online; accessed 23-March-2024].
[Zha20] J. Zhang.
Dynamic Time Warping.
https://towardsdatascience.com/dynamic-time-warping-3933f25fcdd, 2020.
[Online; accessed 09-May-2021].
