FML Chapter 1
Machine Learning
Contact Information
▶ Name: Michael Botsch
▶ Room: K209
▶ Email: michael.botsch@thi.de
▶ Tel.: +49 841 9348 2721
▶ Office Hours: Wednesday 13.30 − 14.30 in my office
Moodle-Password: FML_AVE_ss24_Botsch
▶ Prerequisites
▶ Programming 1 and 2
▶ Math 1 and 2 (Algebra and Calculus)
▶ Statistics 1
▶ Literature
▶ BOTSCH, Michael, UTSCHICK, Wolfgang, 2020. Fahrzeugsicherheit und automatisiertes Fahren: Methoden der Signalverarbeitung und des maschinellen Lernens. PDF e-book: ISBN 978-3-446-46804-7. Hardcover: ISBN 978-3-446-45326-5. https://www.hanser-fachbuch.de/buch/Fahrzeugsicherheit+und+automatisiertes+Fahren/9783446453265.
▶ VANDERPLAS, Jake. Python Data Science Handbook [online]. Available via: https://jakevdp.github.io/PythonDataScienceHandbook.
▶ MURPHY, Kevin P., 2022. Probabilistic machine Learning: an introduction. Cambridge, Massachusetts: The MIT
Press. ISBN 978-0-262-04682-4.
▶ GÉRON, Aurélien, September 2019. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems. Second edition. Beijing; Boston; Farnham; Sebastopol; Tokyo: O’Reilly. ISBN 978-1-492-03264-9, 1-492-03264-6.
Outline
Foundations of Machine Learning
1. Basics
1. 1 Introduction
1. 2 Curse of Dimensionality
1. 3 Normalization of Features
1. 4 Similarity and Dissimilarity Measures
1. 5 Constrained Optimization using Lagrange Multipliers
1. 6 Dimensionality Reduction and Visualization
1. Basics
1. 1 Introduction 1. 1. 1 Notation
▶ scalars:
▶ printed: lowercase letter, normal, e. g. x
▶ handwritten: lowercase letter, normal, e. g., x
▶ vectors:
▶ printed: lowercase letter, bold, e. g. x
▶ handwritten: lowercase letter, underlined, e. g., x
▶ matrices:
▶ printed: uppercase letter, bold, e. g. A
▶ handwritten: uppercase letter, underlined, e. g., A
Machine Learning (ger. Maschinelles Lernen) denotes methods of signal processing that use computers to find statistical relationships in data, with the goal of making predictions for new data. One can speak of the artificial generation of knowledge from experience. Machine learning is based on mathematical statistics and deals with “Learning from Data”, i. e., finding regularities in data
▶ Above a certain amount of data, the chosen type of algorithm (given a large number of parameters) no longer plays the crucial role for the performance; the large amount of data does. This has been shown for example in [BB01] for an application from speech recognition. The reason is that many algorithms that are available nowadays are Universal Approximators (ger. Universelle Approximatoren), i. e., they can represent any smooth function
▶ In practice, for most applications only small or medium-sized data sets are available. In this case, for certain learning tasks some algorithms perform better than others
▶ Unfortunately, it is not possible to make general statements about what a “large amount of data” means, since this must be answered individually for each task. As will be shown later on when discussing the curse of dimensionality, it is very easy to be wrong regarding the size of a data set
▶ The tasks in supervised learning can mainly be divided into Classification (ger. Klassifikation) and Regression (ger. Regression) problems
▶ In the field of unsupervised learning a main task is Clustering (ger. Clusteranalyse), i. e., methods to reveal similarity in unlabeled data
▶ In classical machine learning the feature generation is done manually, while in Deep Learning (ger. Tiefes Lernen) the feature generation is also part of the learning process
▶ If the feature generation is also part of the learning process, this type of learning is called Representation Learning (ger. Repräsentationslernen)
[Diagram: classical machine learning with manual feature generation followed by a machine learning algorithm vs. representation learning / deep learning, where automatic feature generation is part of the machine learning algorithm.]
▶ A pair (x, y) is called a Pattern (ger. Muster), x the Input (ger. Eingang), y the target
(ger. Zielwert), and ŷ the Output (ger. Ausgang)
▶ Because the measured attribute values are subject to variations which often cannot be
described deterministically, a statistical framework must be adopted. In this framework, x
is the realization of the random variable x and y of the random variable y
▶ The mapping from v to y or the mapping from x to y describes the behavior of the system
that has to be approximated by machine learning
▶ The mapping is computed based on a Data Set (ger. Datensatz) D that contains M patterns
▶ In the following x will be considered as an input but all concepts and ideas also hold if v is
the input
▶ In Unsupervised Learning, there is only input data without the associated target values, i. e.
unlabeled data. The training data set can thus be written as
D = {x1 , . . . , xM } (4)
▶ The task is to learn the structure in the input data and to create a model for it. The main tasks of Unsupervised Learning include cluster analysis (also called automatic segmentation), the compression or reduction of data, and the generation of new data similar to those from D
The goal of cluster analysis is to automatically segment unlabeled input data into groups of similar data points. This can be used
▶ to represent the available data in a clear way and thus allow a better management of the
data
▶ to understand the available data better
▶ to simplify the available data for subsequent signal processing steps
▶ Categorizing the clustering methods based on the input data xm ∈ RN, one can distinguish between similarity-based and feature-based clustering. In Similarity-Based Clustering (or Relational Clustering) the starting point is an M × M matrix in which the similarities or the dissimilarities between all elements of D are stored. In Feature-Based Clustering, the starting point is an M × N matrix, in which the M elements of dimension N from D are stored.
▶ If we categorize the clustering methods based on the way in which the cluster analysis is
performed, the following important methods can be identified: partitioning, model-based,
hierarchical, probability density-based, grid-based methods, and spectral clustering.
1. Basics
1. 2 Curse of Dimensionality
▶ If there are many features in the vector x, i. e., N is large, the possible complexity of the Bayes classifier fB (the theoretically best classifier, see Section 3.1) also increases. The complexity can increase exponentially with the dimensionality N, and thus the number M of pairs (xm, ym) in the training data D must also increase exponentially (worst case) to estimate such a complex fB
▶ The Curse of Dimensionality (ger. Fluch Der Hohen Dimensionen) can be visualized [Bis95]
[Figure: a cube with side length L = 2 divided into volume elements of side length ℓ, shown for N = 1, 2, 3 dimensions with w = 2¹, w = 4 = 2², and w = 8 = 2³ volume elements, respectively.]
The number w of volume elements with side length ℓ increases exponentially with the dimensionality N given the same side length L of the cube: w = (L/ℓ)^N
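As a small illustration (a plain Python sketch, not the code referenced on Moodle), the number of volume elements w = (L/ℓ)^N can be computed for increasing N:

```python
# Number of volume elements w = (L / l)**N for a cube with side length L
# divided into cells of side length l: w grows exponentially with N.
L = 2.0   # side length of the cube
l = 0.5   # side length of one volume element

for N in (1, 2, 3, 10, 20):
    w = (L / l) ** N
    print(f"N = {N:2d}:  w = {w:.3e} volume elements")
```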
▶ In many practical applications classification and regression algorithms can perform well
despite high dimensions because the relevant data is located in lower dimensional
subspaces or because it has some smoothness property, i. e., for most of the points in the
input space, if x is changing slightly, in classification problems the target class is not
changing at all and in regression problems the target value is also changing only slightly
▶ So, for many practical problems the Bayes regression function or the Bayes classifier fB
(see Section 3.1) is not as complex as the space RN might suggest
▶ In order to get a good performance in high dimensional spaces it is often useful to use
prior knowledge about the system that generates the data, e. g., by taking into account what
type of data it is (time series, images, etc.)
1. Basics
1. 3 Normalization of Features
▶ For most of the machine learning algorithms it is necessary to normalize the features in the
vector x. Otherwise features xn whose range of values is large will have a higher influence
on the loss function and thus on the solution
▶ One option to deal with this aspect is to normalize all features so that their range is between 0 and 1. Denoting with xn,min the minimal value and with xn,max the maximal value of the n-th feature in the data set D, the normalized data set is obtained by normalizing each of the N features for each input vector xm according to

x_{m,n}^{(new)} = (x_{m,n} − x_{n,min}) / (x_{n,max} − x_{n,min})
▶ The sample mean µ̂n and the sample variance σ̂n2 of the n-th feature are computed
according to
µ̂_n = (1/M) Σ_{m=1}^{M} x_{m,n}   and   σ̂_n² = (1/(M − 1)) Σ_{m=1}^{M} (x_{m,n} − µ̂_n)²
▶ Another option to perform the normalization is to make the sample mean 0 and the sample variance 1 for every feature, i. e.,

x_{m,n}^{(new)} = (x_{m,n} − µ̂_n) / σ̂_n
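To make both options concrete, a minimal sketch (assuming the data set D is stored as a NumPy array with M rows and N columns; not the Moodle code):

```python
import numpy as np

# Data set D as an (M x N) array: M patterns, N features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 300.0]])

# Option 1: min-max normalization of each feature to the range [0, 1].
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Option 2: zero sample mean and unit sample variance for each feature.
mu_hat = X.mean(axis=0)            # sample mean of each feature
sigma_hat = X.std(axis=0, ddof=1)  # sample standard deviation with 1/(M-1)
X_standardized = (X - mu_hat) / sigma_hat

print(X_minmax)
print(X_standardized)
```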
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 1 Dissimilarity Measures
▶ A dissimilarity measure between two feature vectors can be defined via a norm of their difference: d(x1, x2) = ∥x1 − x2∥
Matrix norms
▶ The matrix norm for a matrix A is ∥x∥_A = √(x^T A x)
▶ Examples:
A = I (the N × N identity matrix) leads to the Euclidean norm;
A = diag(w1, w2, . . . , wN) leads to the diagonal (weighted) norm;
A = Ĉ⁻¹ = ( (1/(M − 1)) Σ_{m=1}^{M} (xm − µ̂)(xm − µ̂)^T )⁻¹, with µ̂ = (1/M) Σ_{m=1}^{M} xm, leads to the Mahalanobis norm
The Mahalanobis norm adjusts the weights of the features to the statistics of the data set and takes into account correlations between the features
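A sketch of the distance ∥x1 − x2∥_A for the three choices of A discussed above (NumPy assumed; the data matrix and weights are made up for illustration):

```python
import numpy as np

def matrix_norm_distance(x1, x2, A):
    """d(x1, x2) = sqrt((x1 - x2)^T A (x1 - x2)) for a symmetric positive definite A."""
    d = x1 - x2
    return np.sqrt(d @ A @ d)

# Small data set D with M = 4 patterns and N = 2 features (rows = patterns).
X = np.array([[1.0, 10.0],
              [2.0, 14.0],
              [3.0, 11.0],
              [4.0, 17.0]])
M, N = X.shape
x1, x2 = X[0], X[1]

A_euclid = np.eye(N)                              # leads to the Euclidean norm
A_diag = np.diag([1.0, 0.1])                      # leads to a diagonal (weighted) norm
mu_hat = X.mean(axis=0)
C_hat = (X - mu_hat).T @ (X - mu_hat) / (M - 1)   # sample covariance matrix
A_mahal = np.linalg.inv(C_hat)                    # leads to the Mahalanobis norm

for name, A in [("Euclidean", A_euclid), ("diagonal", A_diag), ("Mahalanobis", A_mahal)]:
    print(name, matrix_norm_distance(x1, x2, A))
```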
▶ Special cases
L_{−∞} norm: ∥x∥_{−∞} = min_{n=1,...,N} |x_n|;   L₂ norm or Euclidean norm: ∥x∥₂ = √(Σ_{n=1}^{N} x_n²)
L_{∞} norm: ∥x∥_{∞} = max_{n=1,...,N} |x_n|;   L₁ norm or Manhattan norm: ∥x∥₁ = Σ_{n=1}^{N} |x_n|
Hamming distance
▶ The Hamming distance is defined as
d(x1, x2) = Σ_{n=1}^{N} δ(x_{1,n}, x_{2,n}),   with δ(x_{1,n}, x_{2,n}) = 0 if x_{1,n} = x_{2,n}, and 1 else
▶ So, the Hamming distance corresponds to the number of different features in x1 and x2
▶ The Hamming distance is a metric since it fulfills the properties for a metric (symmetry,
positive definiteness, triangle inequality)
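A short sketch of the special-case norms and of the Hamming distance (NumPy assumed):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
l2 = np.sqrt(np.sum(x**2))    # L2 / Euclidean norm
l1 = np.sum(np.abs(x))        # L1 / Manhattan norm
linf = np.max(np.abs(x))      # L-infinity norm
lminf = np.min(np.abs(x))     # L-minus-infinity norm

# Hamming distance: number of features in which two vectors differ.
x1 = np.array([1, 0, 2, 2, 1])
x2 = np.array([1, 1, 2, 0, 1])
hamming = np.sum(x1 != x2)    # 2 here

print(l2, l1, linf, lminf, hamming)
```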
A category of similarity measures for positive real-valued features results from differently
scaled scalar products
▶ Cosine similarity: s(x1, x2) = x1^T x2 / (∥x1∥₂ · ∥x2∥₂)
▶ Overlap similarity: s(x1, x2) = x1^T x2 / min(∥x1∥₂², ∥x2∥₂²)
▶ Dice similarity: s(x1, x2) = 2 x1^T x2 / (∥x1∥₂² + ∥x2∥₂²)
▶ Jaccard similarity: s(x1, x2) = x1^T x2 / (∥x1∥₂² + ∥x2∥₂² − x1^T x2)
▶ For the null vector these similarities have to be defined separately, e. g.,
set the similarity to 0
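A sketch of the four scaled scalar-product similarities, with the null-vector case handled as suggested above (NumPy assumed; intended for positive real-valued features):

```python
import numpy as np

def scalar_product_similarities(x1, x2):
    """Cosine, overlap, Dice and Jaccard similarity of two feature vectors."""
    dot = float(x1 @ x2)
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    if n1 == 0.0 or n2 == 0.0:
        # Convention from above: similarity involving the null vector is set to 0.
        return {"cosine": 0.0, "overlap": 0.0, "dice": 0.0, "jaccard": 0.0}
    return {
        "cosine":  dot / (n1 * n2),
        "overlap": dot / min(n1**2, n2**2),
        "dice":    2.0 * dot / (n1**2 + n2**2),
        "jaccard": dot / (n1**2 + n2**2 - dot),
    }

print(scalar_product_similarities(np.array([1.0, 2.0, 0.0]),
                                  np.array([2.0, 1.0, 1.0])))
```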
The similarity between two features (not between two input vectors!!!) in the dataset D can be
computed using the Pearson correlation coefficient
r_{nℓ} = Σ_{m=1}^{M} (x_{m,n} − µ̂_n)(x_{m,ℓ} − µ̂_ℓ) / √( Σ_{m=1}^{M} (x_{m,n} − µ̂_n)² · Σ_{m=1}^{M} (x_{m,ℓ} − µ̂_ℓ)² ),
with µ̂_n = (1/M) Σ_{m=1}^{M} x_{m,n},   µ̂_ℓ = (1/M) Σ_{m=1}^{M} x_{m,ℓ}
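A small numerical check of this formula (NumPy assumed); np.corrcoef gives the same value:

```python
import numpy as np

# Data set D as an (M x N) array; r_nl is the correlation of features n and l.
X = np.array([[1.0, 2.1, 5.0],
              [2.0, 3.9, 4.0],
              [3.0, 6.2, 3.5],
              [4.0, 8.1, 1.0]])
n, l = 0, 1

xn, xl = X[:, n], X[:, l]
num = np.sum((xn - xn.mean()) * (xl - xl.mean()))
den = np.sqrt(np.sum((xn - xn.mean())**2) * np.sum((xl - xl.mean())**2))
r_nl = num / den

print(r_nl, np.corrcoef(xn, xl)[0, 1])  # both values agree
```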
Similarity for feature vectors whose features are represented by binary variables
▶ Notation a11 for the number of features that have the value 1 in x1 and the value 1 in x2
▶ Notation a00 for the number of features that have the value 0 in x1 and the value 0 in x2
▶ Notation a10 for the number of features that have the value 1 in x1 and the value 0 in x2
▶ Notation a01 for the number of features that have the value 0 in x1 and the value 1 in x2
1 The match of 1 with 1 is weighted the same as the match of 0 with 0:
s(x1, x2) = (a11 + a00) / (a11 + a00 + w(a10 + a01)),   with w ∈ {1, 2, 1/2}
2 The match of 1 with 1 is weighted and the match of 0 with 0 is not:
s(x1, x2) = a11 / (a11 + w(a10 + a01)),   with w ∈ {1, 2, 1/2}
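A sketch of both variants for binary feature vectors (the function name and the flag count_zero_matches are made up for illustration):

```python
import numpy as np

def binary_similarity(x1, x2, w=1.0, count_zero_matches=True):
    """Similarity of two binary feature vectors via the counts a11, a00, a10, a01."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    a11 = np.sum((x1 == 1) & (x2 == 1))
    a00 = np.sum((x1 == 0) & (x2 == 0))
    a10 = np.sum((x1 == 1) & (x2 == 0))
    a01 = np.sum((x1 == 0) & (x2 == 1))
    if count_zero_matches:   # variant 1: 0/0 matches count like 1/1 matches
        return (a11 + a00) / (a11 + a00 + w * (a10 + a01))
    return a11 / (a11 + w * (a10 + a01))   # variant 2: 0/0 matches are ignored

x1 = [1, 0, 1, 1, 0, 0]
x2 = [1, 1, 1, 0, 0, 0]
print(binary_similarity(x1, x2, w=1.0))                           # 4/6
print(binary_similarity(x1, x2, w=1.0, count_zero_matches=False)) # 2/4
```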
Similarity for feature vectors whose features are represented by categorical variables
s(x1, x2) = (1/N) Σ_{n=1}^{N} s_{12,n},
with s_{12,n} = 1 if x1 and x2 have the same value in the n-th feature, and 0 if they have a different value in the n-th feature
Similarity for feature vectors whose features are represented by ordinal numbers
▶ The possible values of the nth feature are 1, 2, . . . , Wn
▶ The closer x1,n and x2,n , the more similar x1 and x2 are with respect to the nth feature
▶ Since Wn is different for the individual features first a mapping to the interval [0, 1] is
performed
x_{m,n}^{(new)} = (x_{m,n} − 1) / (W_n − 1)
▶ Then similarity measures for real valued features can be used
Similarity for feature vectors whose features are of different type (have other range)
s(x1, x2) = Σ_{n=1}^{N} δ_{12,n} s_{12,n} / Σ_{n=1}^{N} δ_{12,n},
where s_{12,n} is the similarity between x1 and x2 in the n-th feature and δ_{12,n} ∈ {0, 1} indicates if the n-th feature is existent in both x1 and x2 (yes → 1; no → 0)
▶ for features that are represented by categorical variables the following holds:
s_{12,n} = 1 if x1 and x2 have the same value in the n-th feature, and 0 if they have a different value in the n-th feature
▶ for real valued features a similarity from the ones mentioned above for real valued features
can be used
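A hedged sketch of such a mixed-type (Gower-style) similarity; for real-valued features, 1 minus the absolute difference of the [0, 1]-normalized values is used here as one possible choice, and the helper arguments (feature_types, present1, present2) are made up for illustration:

```python
def mixed_similarity(x1, x2, feature_types, present1, present2):
    """Similarity of two feature vectors with mixed feature types.

    feature_types: "categorical" or "real" per feature; real-valued (and ordinal)
    features are assumed to be already mapped to [0, 1].
    present1, present2: flags indicating whether the n-th feature exists (delta_12,n).
    """
    num, den = 0.0, 0.0
    for n, ftype in enumerate(feature_types):
        delta = 1.0 if (present1[n] and present2[n]) else 0.0  # delta_12,n
        if delta == 0.0:
            continue
        if ftype == "categorical":
            s_n = 1.0 if x1[n] == x2[n] else 0.0
        else:  # real-valued in [0, 1]: one possible similarity choice
            s_n = 1.0 - abs(x1[n] - x2[n])
        num += delta * s_n
        den += delta
    return num / den if den > 0 else 0.0

x1 = ["red", 0.2, 0.9]
x2 = ["red", 0.5, 0.4]
print(mixed_similarity(x1, x2, ["categorical", "real", "real"],
                       present1=[True, True, True],
                       present2=[True, True, False]))   # (1.0 + 0.7) / 2 = 0.85
```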
1. Basics
1. 4 Similarity and Dissimilarity Measures 1. 4. 4 Sequences
The difference between Euclidean distance and DTW becomes clear from Fig. 3 and Fig. 4:
While the Euclidean distance is very sensitive to small distortions in the time axis because it
assumes that the nth point in s1 is aligned with the nth point in s2 , DTW can shift the time axes
of both sequences to achieve a better alignment
Figure 5: Example for determining the similarity of the sequences s1 and s2 using DTW.
▶ The optimization problem in Eq. (7) can be solved very efficiently by dynamic programming. For this purpose, a new (n_end,1 + 1) × (n_end,2 + 1) matrix Γ_DTW is calculated, whose elements are pre-initialized with “∞” and with γ_DTW[0, 0] = 0. The matrix Γ_DTW is then calculated in such a way that for n, k ≥ 1 its entries contain the cumulative distances γ_DTW[n, k]
γ_DTW[n, k] = d[n − 1, k − 1] + min {γ_DTW[n − 1, k − 1], γ_DTW[n − 1, k], γ_DTW[n, k − 1]}
The sequence of indices that leads to γ_DTW[n_end,1, n_end,2] forms the warping path W′_DTW
▶ Note that for time series of equal length, n_end,1 = n_end,2, the Euclidean distance can be considered a special case of DTW if w_{i,DTW} = [n, k]^T with n = k = i − 1 applies for all i
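A minimal NumPy sketch of this dynamic-programming recursion (not the Moodle code; d[n, k] is taken here as the absolute difference of the samples, and only the cumulative distance, not the warping path, is returned):

```python
import numpy as np

def dtw_distance(s1, s2):
    """Dynamic time warping distance between two 1-D sequences."""
    n1, n2 = len(s1), len(s2)
    d = np.abs(np.subtract.outer(s1, s2))        # local distances d[n, k]

    # Cumulative distance matrix, pre-initialized with infinity, gamma[0, 0] = 0.
    gamma = np.full((n1 + 1, n2 + 1), np.inf)
    gamma[0, 0] = 0.0
    for n in range(1, n1 + 1):
        for k in range(1, n2 + 1):
            gamma[n, k] = d[n - 1, k - 1] + min(gamma[n - 1, k - 1],
                                                gamma[n - 1, k],
                                                gamma[n, k - 1])
    return gamma[n1, n2]

s1 = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
s2 = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])   # s1 with a small time shift
print(dtw_distance(s1, s2))                      # 0.0: DTW compensates the shift
```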
¹ [Zha20]: https://towardsdatascience.com/dynamic-time-warping-3933f25fcdd; Python code: see Moodle
² Python code: see Moodle
1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers
The Lagrange multiplier method is a technique for solving optimization problems with
constraints. The task is to find a local extremum of a function in several variables while meeting
the constraints.
▶ The task in optimization with equality constraints is to find the maximum of the function
f (x) under the constraint g(x) = 0
▶ Considering the N-dimensional vector x and the equality constraint g(x) = 0, this
constraint represents a (N − 1)-dimensional hyperplane in the N-dimensional space. An
example is the constraint wT x + t = 0. All points x, for which wT x + t = 0 applies, form
a (N − 1)-dimensional hyperplane
Example: For N = 2, g(x) = 0 defines a curve, and in this 2-dimensional space, the method of Lagrange multipliers
can be geometrically explained using Fig. 6
Figure 6: Geometric representation of the Lagrange multipliers method for equality constraints.
The point x = [x1 , x2 ]T is to be found, which lies on the curve g(x1 , x2 ) = 0 and where f (x1 , x2 ) has the highest
value. In Fig. 6, contour lines f (x1 , x2 ) = c for various values of c are shown, with c1 < c2 < c3 . As can be seen in
the sketch, the point sought is xGoal
1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 1 Equality Constraints
▶ Two vectors a and b are orthogonal if a^T b = 0. The gradient ∇_x g(x) is orthogonal to the constraint surface g(x) = 0, and at the point sought the gradient ∇_x f(x) must also be orthogonal to this surface (otherwise f could be increased by moving along the surface). Hence ∇_x f(x) + λ ∇_x g(x) = 0, where λ is called the Lagrange multiplier. The Lagrange multiplier can be both positive and negative
▶ To find the sought point xGoal that maximizes f(x) and satisfies the equality constraint g(x) = 0, the Lagrange function
L(x, λ) = f(x) + λ g(x)    (10)
can be introduced
∇_x L(x, λ) = 0   and   ∂L(x, λ)/∂λ = 0    (11)
there are N + 1 equations from which both xGoal and λ can be calculated
▶ If only xGoal is needed, one can eliminate λ from the equations, and it is not necessary to
calculate the value of λ
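To make the procedure concrete, a small worked example (chosen here for illustration, not taken from the lecture): maximize f(x) = x1 + x2 under the equality constraint g(x) = 1 − x1² − x2² = 0.

```latex
% Lagrange function and stationarity conditions for the example
\begin{align*}
L(\mathbf{x},\lambda) &= x_1 + x_2 + \lambda\,(1 - x_1^2 - x_2^2)\\
\nabla_{\mathbf{x}} L(\mathbf{x},\lambda) &=
  \begin{bmatrix} 1 - 2\lambda x_1\\ 1 - 2\lambda x_2 \end{bmatrix} = \mathbf{0}
  \;\Rightarrow\; x_1 = x_2 = \frac{1}{2\lambda},\\
\frac{\partial L(\mathbf{x},\lambda)}{\partial \lambda} &= 1 - x_1^2 - x_2^2 = 0
  \;\Rightarrow\; \frac{1}{2\lambda^2} = 1
  \;\Rightarrow\; \lambda = \pm\frac{1}{\sqrt{2}}.
\end{align*}
% The maximum corresponds to lambda = +1/sqrt(2), i.e.
% x_Goal = [1/sqrt(2), 1/sqrt(2)]^T with f(x_Goal) = sqrt(2).
```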
▶ The task to be solved is to find the maximum of the function f (x), with the inequality constraint g(x) ≥ 0
being met
▶ Fig. 7 visualizes the task in the case that the dimension N of the vector x is 2
Figure 7: Geometric representation of the Lagrange multiplier method for inequality constraints.
▶ As shown, one has to distinguish between two cases, depending on whether the point xGoal lies within the
region where g(x) ≥ 0 applies, or whether it lies on the hyperplane g(x) = 0
1. Basics
1. 5 Constrained Optimization using Lagrange Multipliers 1. 5. 2 Inequality Constraints
▶ In the first case (shown on the left in Fig. 7), one also speaks of the constraint being not
active because the constraint does not play a role and the point xGoal is calculated solely by
∇x f (x) = 0. This corresponds to the calculation of a stationary point of the Lagrange
function from Eq. (10), with λ = 0
▶ In the second case (shown on the right in Fig. 7), where xGoal lies on the hyperplane
g(x) = 0, one also speaks of the constraint being active. It is an analogous case to the
optimization with equality constraints and corresponds to the calculation of a stationary
point of the Lagrange function from Eq. (10), with λ ̸= 0. Unlike the optimization with
equality constraints, however, the sign of λ does play a role. A maximum of f (x) is only
achieved if the gradient ∇x f (x) points in the opposite direction of the region where
g(x) ≥ 0 applies, as visualized on the right in Fig. 7. The blue arrows represent the
gradient vectors of the function f (x) at the respective points. In this second case,
∇x f (x) = −λ∇x g(x), with λ > 0 holds
▶ In both cases, λ g(xGoal) = 0 is true. In the first case, because λ = 0, and in the second case, because xGoal lies on the hyperplane g(x) = 0. Using the Lagrange function from Eq. (10), L(x, λ) = f(x) + λg(x), one can state for a local maximum xGoal that there is a λ* such that
∇_x L(xGoal, λ*) = 0    (12)
g(xGoal) ≥ 0    (13)
λ* ≥ 0    (14)
λ* g(xGoal) = 0    (15)
▶ These conditions are known as the Karush-Kuhn-Tucker (KKT) conditions. They are
very useful because they allow searching for solutions for x for which one can find λ∗
If one wishes to minimize the function f(x) (instead of maximizing) subject to the inequality constraint g(x) ≥ 0, one would have to suppose in Fig. 7 for visualization that c3 < c2 < c1 and hence draw all the blue arrows in the respective opposite directions. If the constraint is active, then the gradients ∇_x f(x) and ∇_x g(x) point in the same direction and thus ∇_x f(x) = λ ∇_x g(x) with λ > 0. As a result, the Lagrange function is
L(x, λ) = f(x) − λ g(x)    (16)
The task is therefore to determine the stationary point of this Lagrange function according to
Eq. (16) with respect to x and λ, with λ ≥ 0
If one wishes to extend the method of Lagrange multipliers to optimization tasks with several
equality and inequality constraints, this can be achieved according to the above considerations.
If the task involves maximizing the function f (x) subject to the equality constraints gk (x) = 0
where k = 1, . . . , K, and the inequality constraints hm (x) ≥ 0 where m = 1, . . . , M, one must
introduce the Lagrange multipliers λ = [λ1 , . . . , λK ]T and µ = [µ1 , . . . , µM ]T and calculate the
stationary points of the Lagrange function
L(x, λ, µ) = f(x) + Σ_{k=1}^{K} λ_k g_k(x) + Σ_{m=1}^{M} µ_m h_m(x)    (17)
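As a hedged sketch of how such a constrained problem can be solved numerically (assuming SciPy is available; scipy.optimize.minimize minimizes, so maximizing f corresponds to minimizing −f; the concrete functions are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Example: maximize f(x) = x1 + x2
# subject to g(x) = 1 - x1^2 - x2^2 = 0 (equality) and h(x) = x1 >= 0 (inequality).
f = lambda x: -(x[0] + x[1])                        # minimize -f to maximize f
constraints = [
    {"type": "eq",   "fun": lambda x: 1.0 - x[0]**2 - x[1]**2},
    {"type": "ineq", "fun": lambda x: x[0]},        # SciPy convention: fun(x) >= 0
]
res = minimize(f, x0=np.array([0.5, 0.5]), method="SLSQP", constraints=constraints)
print(res.x)    # approximately [1/sqrt(2), 1/sqrt(2)]
```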
1. Basics
1. 6 Dimensionality Reduction and Visualization
▶ Redundancies and noise are often contained in the input features, which make learning
difficult (Curse of Dimensionality)
▶ The extraction of a few but relevant features simplifies learning and leads to reduced
resource consumption
▶ With only 2 or 3 dimensions, data visualization becomes possible.
⇒ Reduction of the dimension of the input space is possible, preferably without “loss of
information”
Dimensionality Reduction
▶ Linear combinations of input features are attractive because they are simple to compute and the resulting projections are easy to understand and analyze
▶ Principal Component Analysis (PCA): linear projection of input features into a
low-dimensional space, which leads to an optimal representation by minimization of the
square error.
▶ Multidimensional Scaling: a technique to visualize the level of similarity of individual cases
of a dataset.
▶ Nonlinear combinations of input features: t-SNE, Autoencoder, Self-Organizing Maps,
Locally-Linear Embedding, IsoMap, UMAP, etc.
▶ Feature Selection: Filter, Wrapper or Embedded methods
Figure 8: Data recording [Shl14]. Figure 9: Measurements from sensor A with noise [Shl14].
▶ Each sensor provides a vector with position data at each time point. Therefore, for the one-dimensional movement, the input
vector is obtained as [xA , yA , xB , yB , xC , yC ]T
▶ Focusing only on sensor A and assuming that the measurement points are noisy, a scenario as shown in the figure on the
right is obtained. Without noise, the measurements from each sensor would result in a straight segment. The goal is to find
projection axes so that the data exhibit a large variance (“dynamics”) after projection
⁵ [Shl14]: A Tutorial on Principal Component Analysis; J. Shlens; arXiv:1404.1100.
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 1 Principal Component Analysis (linear)
∇_{u1} L(u1, λ) = ∂(u1^T X X^T u1)/∂u1 + λ ∂(u1^T u1 − 1)/∂u1 = 0

With ∂(a^T b)/∂a = b and ∂(b^T a)/∂a = b one obtains

∇_{u1} L(u1, λ) = 2 X X^T u1 + 2λ u1 = 0.

So, one obtains the equation for determining the eigenvectors of the matrix X X^T:

X X^T u1 = λ′ u1    (19)

▶ So, the optimal vector u1 is the first eigenvector of the matrix X X^T ∈ R^{N×N}

Ĉx = (1/(M − 1)) Σ_{m=1}^{M} xm xm^T = (1/(M − 1)) X X^T    (20)
because the column vectors in U and UK are the eigenvectors of Ĉx and these are orthogonal to each other
X_rec = u x_proj^T
▶ The first term is positive and cannot be influenced by the choice of u, only the second one can. Thus,
minimizing the reconstruction error corresponds to maximizing ∥uT X∥2 . Therefore, this approach also leads
to the optimization task from Eq. (18) and the result that the optimal vectors u for projecting the data xm are
the eigenvectors of the sample covariance matrix Ĉx
PCA Summary:
▶ PCA is a linear projection method, i. e., a linear combination of the original features occurs for each new
feature in the reduced space
▶ The projection is carried out on the main axes (Principal Components) of the data matrix X
▶ The main axes of the data matrix X are the eigenvectors of the sample covariance matrix Ĉx , i. e., the
columns of the matrix U
▶ The k-th value on the diagonal of the diagonal matrix Ĉ_{x_proj} is the sample variance of the data in X along the direction uk
▶ PCA attempts to retain the global structure of the data through variance maximization, i. e., it attempts to
project data clusters “as a whole”, which can result in loss of local structure
Computation of PCA and dimensionality reduction:
▶ Centering each individual feature by subtracting the mean (sample expected value) ⇒ X ∈ RN×M
▶ Calculation of the eigenvectors of the sample covariance matrix Ĉx from Eq. (20) ⇒ U ∈ RN×N
▶ Selection of the K eigenvectors corresponding to the largest K eigenvalues ⇒ UK ∈ RN×K
▶ Projection of the data points in X using the first K eigenvectors in U
X_proj = U_K^T X
Examples PCA:⁸
⁸ [Van16]: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html; Python code: see Moodle
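In addition to the examples referenced above, a minimal NumPy sketch of the four computation steps (the data matrix is made up; scikit-learn's sklearn.decomposition.PCA would give the same projection up to signs):

```python
import numpy as np

rng = np.random.default_rng(0)
# M = 200 patterns with N = 3 features (rows = patterns), different variances per axis.
data = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

# Step 1: center each feature and arrange the data as X with shape (N, M).
X = (data - data.mean(axis=0)).T
N, M = X.shape

# Step 2: sample covariance matrix and its eigendecomposition.
C_hat = X @ X.T / (M - 1)
eigvals, U = np.linalg.eigh(C_hat)          # eigh returns ascending eigenvalues

# Step 3: select the K eigenvectors belonging to the largest K eigenvalues.
K = 2
order = np.argsort(eigvals)[::-1]
U_K = U[:, order[:K]]                       # shape (N, K)

# Step 4: project the data points onto the first K principal directions.
X_proj = U_K.T @ X                          # shape (K, M)
print(eigvals[order])                       # sample variances along the principal axes
print(X_proj.shape)
```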
▶ t-SNE⁹ is a method for dimension reduction that aims to preserve the local neighborhood relationships of each point, in contrast to PCA, which aims to retain the global structure of the data
▶ The basic idea of t-SNE is to represent the distances between points by probability density
functions (PDF). For two points in a high-dimensional space, the value of a joint PDF is
determined, and then for these two points the value of another PDF is defined in a low-dimensional
space. In an optimization step, the aim is to maximize the similarity between the densities, i. e., to
preserve roughly the same distances between the points in the low-dimensional space as in the
high-dimensional space. The selection of Gaussian or Gaussian-like PDFs ensures that local
neighborhoods are preserved
▶ To compare PDFs for points in high-dimensional space with PDFs for points in low-dimensional
space, a similarity or dissimilarity measure for PDFs is needed. In t-SNE, the Kullback-Leibler
divergence is used
⁹ t-SNE was introduced in 2008 by L. J. P. van der Maaten and G. E. Hinton [vdMH08]. The method is based on the “Stochastic Neighbor Embedding” technique by Hinton and Roweis from 2002.
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 2 t-SNE (nonlinear)
Prerequisites: Kullback-Leibler-Divergence
The random variable x is distributed according to the probability P. If Q is a probability on the same event space, it may be
important to compare the two probability density functions p(x = x) and q(x = x). For this, usually, the Kullback-Leibler
divergence is used. It is a suitable measure to describe the dissimilarity of the two PDFs
KL(p∥q) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} p(x = x) ln( p(x = x) / q(x = x) ) dx1 · · · dxN

and, for discrete random variables,

KL(P∥Q) = Σ_{x∈X} P(x = x) ln( P(x = x) / Q(x = x) )
1 In a first step, the Euclidean distance between two data points xm ∈ RN and xj ∈ RN is converted
into a conditional probability density pj|m , which represents a similarity:
▶ The similarity of the data point xj to the data point xm is the conditional probability pj|m that
xm would choose xj as its neighbor, if the neighborhood is chosen proportionally to their
probability density under a Gaussian PDF centered around xm :
p_{j|m} = exp(−∥xm − xj∥² / (2σm²)) / Σ_{k≠m} exp(−∥xm − xk∥² / (2σm²))
▶ The choice of the variance σm² is based on a smooth measure for the effective number of neighbors, specifically on the so-called perplexity. The perplexity is defined as

Perp(Pm) = 2^{H(Pm)},

where H(Pm) denotes the Shannon entropy of the conditional distribution Pm
2 In a second step, the similarity between two data points is introduced as a symmetrized version of
the conditional probabilities
p_{mj} = (p_{j|m} + p_{m|j}) / (2M),

so that Σ_j p_{mj} > 1/(2M) and thus each data point xm has a significant contribution to the cost function (introduced further below)
3 In a third step, the similarity of the data points xred,m ∈ RNred and xred,j ∈ RNred in the
low-dimensional space RNred is described by the PDF qmj
q_{mj} = (1 / (1 + ∥x_red,m − x_red,j∥²)) / Σ_{k≠m} (1 / (1 + ∥x_red,m − x_red,k∥²)).
This is the Student’s t-distribution with one degree of freedom, or in other words, the Cauchy
distribution
▶ The advantage of the Student’s t-distribution (or Cauchy distribution) is that it has higher values at the
tails compared to the Gaussian distribution. Thereby, in the attempt to match qmj and pmj , data points
that have “medium” distances to each other in the high-dimensional space RN can also have “medium”
distances to each other in the low-dimensional space RNred .11 This is advantageous to enable clustering
in the low-dimensional space (avoidance of the “crowding problem” [vdMH08])
▶ Comparison between Gaussian and Cauchy distribution (Student’s t-distribution)
Figure 11: The Student’s t-distribution (Cauchy distribution) has higher values at the tails than the Gaussian distribution.
¹¹ Otherwise, if small distances are correctly mapped in R^{N_red}, there would not be enough space left in R^{N_red} for “medium” distances, and data points with “medium” distances would be mapped with “large” distances.
4 In a fourth step, the data points in the low-dimensional space RNred are moved in such a way that the
PDF qmj comes as close as possible to the PDF pmj for all m and j
▶ This is realized by minimizing the Kullback-Leibler divergence
KL(P∥Q) = Σ_m Σ_j p_{mj} log( p_{mj} / q_{mj} ),

whose gradient with respect to the points in the low-dimensional space is

∂KL(P∥Q)/∂x_red,m = 4 Σ_j (p_{mj} − q_{mj}) (x_red,m − x_red,j) / (1 + ∥x_red,m − x_red,j∥²).
⇒ t-SNE is well suited for a non-linear embedding of the data set into a low-dimensional space. With t-SNE, local neighborhoods are maintained.¹²
¹² Python code: see Moodle
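Complementary to the Moodle code, a short sketch of how t-SNE is typically applied in practice (assuming scikit-learn; the toy data and parameter values are chosen for illustration, and the perplexity corresponds to the effective number of neighbors discussed above):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy data set: two well separated clusters in a 50-dimensional space.
X = np.vstack([rng.normal(loc=0.0, size=(100, 50)),
               rng.normal(loc=5.0, size=(100, 50))])

# Embed into N_red = 2 dimensions; local neighborhoods are preserved.
X_red = TSNE(n_components=2, perplexity=30, init="pca",
             random_state=0).fit_transform(X)
print(X_red.shape)   # (200, 2)
```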
▶ Other methods that aim to maintain local neighborhood relationships after projection into
low-dimensional spaces (and not the global structure like PCA) are:
▶ Isomap
▶ Locally-Linear Embedding
▶ Uniform Manifold Approximation and Projection (UMAP)
▶ Self-Organizing Maps
▶ ...
▶ Autoencoder: see Section 11.4
▶ . . .¹³
¹³ e. g.: https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
1. Basics
1. 6 Dimensionality Reduction and Visualization 1. 6. 3 Other nonlinear methods
Example with the “HELLO” dataset from [Van16] for Locally-Linear Embedding
[Figure: the nonlinearly transformed HELLO data set (left) and the result of Locally-Linear Embedding applied to the nonlinearly transformed HELLO data set (right).]
⇒ Locally-Linear Embedding can be well suited for a non-linear embedding of the data set into a low-dimensional space¹⁴
¹⁴ Python code: see Moodle
Bibliography
[Aut20] Various Authors.
Sensitivity and specificity.
In https://en.wikipedia.org/wiki/Sensitivity_and_specificity, 2020.
[BB01] Michele Banko and Eric Brill.
Scaling to very very large corpora for natural language disambiguation.
In Proceedings of the 39th Annual Meeting on Association for Computational
Linguistics, ACL ’01, pages 26–33, Stroudsburg, PA, USA, 2001. Association for
Computational Linguistics.
[BFOS84] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone.
Classification and Regression Trees.
The Wadsworth & Brooks/Cole Statistics/Probability Series. Wadsworth, 1984.
[Bis95] C. M. Bishop.
Neural Networks for Pattern Recognition.
Oxford University Press, 1995.
[CKV13] M. E. Celebi, H. A. Kingravi, and P. A. Vela.
A comparative study of efficient initialization methods for the k-means clustering
algorithm.
Expert Systems with Applications, 40(1):200–210, 2013.
[Doe16] C. Doersch.
Tutorial on variational autoencoders.
arXiv:1606.05908, 2016.
[Fro18] J. Frochte.
Maschinelles Lernen.
Hanser, 2018.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning.
MIT Press, 2016.
http://www.deeplearningbook.org.
[KW14] D. P. Kingma and M. Welling.
Auto-encoding variational bayes.
arXiv: 1312.6114, 2014.
[Shl14] J. Shlens.
A tutorial on principal component analysis.
arXiv:1404.1100, 2014.
[Van16] J. VanderPlas.
The Python Data Science Handbook.
O’Reilly, 2016.
[vdMH08] Laurens van der Maaten and Geoffrey Hinton.
Visualizing data using t-SNE.
Journal of Machine Learning Research, 9:2579–2605, 2008.
[VLL+ 10] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol.
Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion.
Journal of Machine Learning Research, 11:3371–3408, 2010.
[Wik24] Wikipedia contributors.
Computational complexity of mathematical operations, 2024.
[Online; accessed 23-March-2024].
[Zha20] J. Zhang.
Dynamic Time Warping.
https://towardsdatascience.com/
dynamic-time-warping-3933f25fcdd, 2020.
[Online; accessed 09-May-2021].