Ann 5TH
Bruno Galerne
2023-2024
Credits
Most of the slides are from Charles Deledalle’s course “UCSD ECE285 Machine
learning for image processing” (30 × 50-minute sessions).
www.charles-deledalle.fr/
https://www.charles-deledalle.fr/pages/teaching.php#learning
Computer Vision and Machine Learning
Computer vision – Artificial Intelligence – Machine Learning
CV is a subfield of AI, and its new best friend is machine learning (ML).
ML is also a subfield of AI, but not all computer vision algorithms are ML.
Definition
Machine Learning, noun: type of Artificial Intelligence that provides
computers with the ability to learn without being explicitly programmed.
ML provides various techniques that can learn from and make predictions on
data. Most of them follow the same general structure:
Computer vision – Image classification
Computer vision – Image segmentation
Goal: to partition an image into multiple segments such that pixels in the same
segment share certain characteristics (color, texture, or semantics).
Computer vision – Image captioning
IP ∩ CV – Image colorization
Image colorization
Image generation
Generated images of bedrooms (Source: Alec Radford, Luke Metz, Soumith Chintala, 2015)
Image stylization
IP ∩ CV – Style transfer
Style transfer
Machine learning – Learning from examples
3 main ingredients
1. Data: a training set {x_1, x_2, . . . , x_N}
2. Machine or model: a function/algorithm x → f(x; θ) → y producing a prediction y
3. Loss/energy: learn the parameters θ by solving
   argmin_θ E(θ; x_1, x_2, . . . , x_N)
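A minimal sketch of these three ingredients on a toy problem. The 1-D linear model f(x; θ) = θ·x and the squared-error loss are illustrative assumptions, not the course's specific setup:

```python
import numpy as np

# 1. Data: {x_1, ..., x_N} with observed targets d_n
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
d = 3.0 * x + 0.1 * rng.normal(size=100)

# 2. Machine / model: x -> f(x; theta) -> y (prediction)
def f(x, theta):
    return theta * x

# 3. Loss / energy E(theta; x_1, ..., x_N), minimized over theta
def E(theta):
    return np.mean((f(x, theta) - d) ** 2)

# argmin_theta E via gradient descent
theta, step = 0.0, 0.1
for _ in range(200):
    grad = np.mean(2 * (f(x, theta) - d) * x)   # dE/dtheta
    theta -= step * grad
print(theta)  # close to 3.0
```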
Machine learning – Learning from examples
Tools:
• Data ↔ Statistics
• Loss ↔ Optimization
Machine learning – Terminology
Terminology
Sample (Observation or Data): item to process (e.g., classify). Example: an
individual, a document, a picture, a sound, a video. . .
Features (Input): set of distinct traits that can be used to describe each
sample in a quantitative manner. Represented as a multi-dimensional vector
usually denoted by x. Example: size, weight, citizenship, . . .
Learning approaches
Machine learning – Problem types
Problem types
Deep learning: Academic actors
Goal for first sessions
Goal of this course
[Figure: machine learning timeline: 1958, the perceptron (an artificial neuron: inputs, weights, sum, activation function); 1969, the Perceptrons book and the perceptron criticized; the maximal-margin hyperplane]
Machine learning – Perceptron – Representation
y = f(x; w) = sign(w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + b) =
  +1 for the first class
  −1 for the second class
Machine learning – Perceptron – Principle
Notation: ⟨w, x⟩ = w^T x.
Signed distance from a point x to the separating hyperplane:
  r = (⟨w, x⟩ + b) / ||w||
Alternative representation: append a constant component 1 to x, so that the bias b becomes one more weight.
Simplifies algorithms as all parameters can now be processed in the same way.
Machine learning – Perceptron – Training
Perceptron algorithm
• Initialize w randomly
• Repeat until convergence:
  • For all (x, d) ∈ T (or a random subset T′ ⊂ T ):
    • Compute: y = sign(⟨w, x⟩)
    • If y ≠ d, update: w ← w + γ d x
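A minimal sketch of this algorithm in NumPy, assuming the bias is absorbed into w via a constant 1 component in x, and with "repeat until convergence" simplified to a fixed epoch budget:

```python
import numpy as np

def perceptron_train(X, d, gamma=1.0, n_epochs=100):
    """X: (N, p) samples with a constant-1 column; d: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])          # initialize w randomly
    for _ in range(n_epochs):                # "repeat until convergence" (fixed budget here)
        errors = 0
        for x, label in zip(X, d):           # for all (x, d) in the training set
            y = np.sign(w @ x)               # y = sign(<w, x>)
            if y != label:                   # misclassified sample
                w = w + gamma * label * x    # update: w <- w + gamma * d * x
                errors += 1
        if errors == 0:                      # no mistakes: converged on separable data
            break
    return w
```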
A perceptron can only classify data points that are linearly separable.
Artificial neural network
Machine learning – Artificial neural network
Machine learning – ANN
y_1 = g_2( w^2_{11} h_1 + w^2_{12} h_2 + w^2_{13} h_3 + w^2_{14} h_4 + b^2_1 )
y_2 = g_2( w^2_{21} h_1 + w^2_{22} h_2 + w^2_{23} h_3 + w^2_{24} h_4 + b^2_2 )
In matrix form: y = g_2(W_2 h + b_2)
w^k_{ij}: synaptic weight between previous node j and next node i at layer k.
g_k: any activation function applied to each coefficient of its input vector.
The matrices W_k and biases b_k are learned from labeled training data.
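A minimal sketch of the corresponding forward pass. The layer sizes and the choices g_1 = tanh, g_2 = identity are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden units -> 2 outputs

def forward(x):
    h = np.tanh(W1 @ x + b1)     # hidden layer: h = g1(W1 x + b1)
    y = W2 @ h + b2              # output layer: y = g2(W2 h + b2), with g2 = identity
    return y

print(forward(np.array([0.5, -1.0, 2.0])))
```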
Machine learning – ANN
Machine learning – ANN – Activation functions
Activation functions
Without a nonlinear activation, stacked layers collapse into a single affine map:
  y = W_L h_{L−1} + b_L
  h_{L−1} = W_{L−1} h_{L−2} + b_{L−1}
  ⇒ y = W_L W_{L−1} h_{L−2} + W_L b_{L−1} + b_L
  ⇒ y = W_L ⋯ W_1 x + Σ_{k=1}^{L−1} W_L ⋯ W_{k+1} b_k + b_L
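A quick numerical check of this collapse, with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)
x = rng.normal(size=3)

# Layer-by-layer computation (no nonlinearity)
y_layers = W2 @ (W1 @ x + b1) + b2

# Single affine map: y = (W2 W1) x + (W2 b1 + b2)
y_affine = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(y_layers, y_affine))  # True
```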
Machine learning – ANN – Activation functions
Activation functions
g(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}) ∈ [−1, 1]
g(a) = logistic(a) = 1 / (1 + e^{−a}) ∈ [0, 1]
Relation: tanh(a/2) = 2 logistic(a) − 1
Machine learning – ANN
Separation: {x s.t. P(C1 | x) = P(C2 | x)}
Complexity/capacity of the network ⇒ trade-off between generalization and overfitting.
[Figure: decision boundaries for networks of increasing complexity]
Machine learning – ANN – Activation functions
Activation functions
“Modern” units:
• Typical architecture:
  • Hidden layers: ReLU(a) = max(a, 0)
  • Linear output: g(a) = a
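A small sketch of the activation functions mentioned in this section, in plain NumPy (nothing library-specific is assumed):

```python
import numpy as np

def tanh(a):
    return np.tanh(a)                      # values in [-1, 1]

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))        # values in [0, 1]

def relu(a):
    return np.maximum(a, 0.0)              # "modern" hidden-layer unit

def linear(a):
    return a                               # linear output, g(a) = a

a = np.linspace(-3, 3, 7)
# Sanity check of the identity tanh(a/2) = 2 * logistic(a) - 1
print(np.allclose(tanh(a / 2), 2 * logistic(a) - 1))
```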
Tasks, architectures and loss functions
• Solution: Provided the network has enough flexibility and the size of the training set grows to infinity,
  y* = f(x; W*) = E[d | x] = ∫ d p(d | x) dd   (posterior mean)
Tasks, architectures and loss functions
• Hidden layers: ReLU(a) = max(a, 0)
• Output layer: softmax(a)_k = exp(a_k) / Σ_{ℓ=1}^{K} exp(a_ℓ)
• Solution: Provided the network has enough flexibility and the size of the training set grows to infinity, the output y_k estimates the posterior probability p(C_k | x).
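A minimal, numerically stable softmax for such an output layer; subtracting the maximum is a standard stabilization trick (softmax is shift-invariant), not something specific to these slides:

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                 # stabilization: does not change the result
    e = np.exp(a)
    return e / np.sum(e)              # softmax(a)_k = exp(a_k) / sum_l exp(a_l)

print(softmax(np.array([1.0, 2.0, 3.0])))            # sums to 1
print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # same result, no overflow
```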
Multivariate logistic regression
Multiclass classification – Multivariate logistic regression
• SVMs allow for multiclass classification but are not easily pluggable into
neural networks.
• Instead, neural networks generally use multivariate logistic regression.
Multivariate logistic regression
New notation
• Remark: Do not use the class index k directly as a scalar label: the ordering of the labels is not informative.
• Instead, denote by t_n the class index of sample n and encode it as a one-hot vector (t_{n,1}, . . . , t_{n,K}) with t_{n,k} = 1 if t_n = k and 0 otherwise.
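A tiny sketch of this one-hot encoding (0-indexed classes here, which is an illustrative convention):

```python
import numpy as np

def one_hot(k, K):
    """Return the K-dimensional indicator vector of class k (0-indexed)."""
    t = np.zeros(K)
    t[k] = 1.0
    return t

print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]
```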
Multivariate logistic regression
Feature transform
• We apply a feature transform ϕ : Rp → RD to each xn :
ϕn = ϕ(xn ), n = 1, . . . , N.
Multivariate logistic regression
Linear classifier
We will consider linear multiclass classifiers in the feature space R^D:
• Classifier parameters: each class k has a weight vector w_k ∈ R^D and a bias b_k ∈ R.
• Class separation: class k is preferred over class ℓ when w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ.
• Class k region:
  { ϕ ∈ R^D : ∀ℓ ≠ k, w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ }
  = ⋂_{ℓ=1, ℓ≠k}^{K} { ϕ ∈ R^D : w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ }
  = intersection of K − 1 half-spaces.
• From now on this is implicit: we assume that the feature transform has
a constant 1 component, so that w_k^T ϕ has an implicit bias component.
• The classifier parameters are then just a set of weight vectors
{w_k , k = 1, . . . , K}.
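A small sketch of this decision rule with the constant 1 component appended to the features; the sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 5
W = rng.normal(size=(K, D))            # one weight vector w_k per class (bias included)

def predict(phi_raw):
    phi = np.append(phi_raw, 1.0)      # append the constant 1 component
    return int(np.argmax(W @ phi))     # class k maximizing w_k^T phi

print(predict(rng.normal(size=D - 1)))
```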
• What are the boundaries if we forget the bias?
Multivariate logistic regression
• Model: the classifier outputs y(ϕ) = softmax(w_1^T ϕ, . . . , w_K^T ϕ), such that ideally y_k ≈ p(C_k | ϕ) is an estimate of the posterior probability.
• By construction of the softmax, one has y ∈ (0, 1)^K with Σ_{k=1}^{K} y_k = 1.
Multivariate logistic regression
y_k(ϕ) = p(C_k | ϕ) = exp(w_k^T ϕ) / Σ_{j=1}^{K} exp(w_j^T ϕ),   k = 1, . . . , K.
Training:
Multivariate logistic regression
Model: y_k(ϕ) = p(C_k | ϕ) = exp(w_k^T ϕ) / Σ_{j=1}^{K} exp(w_j^T ϕ),   k = 1, . . . , K.
Negative log-likelihood:
• Expression:
  L(W) = − log Π_{n=1}^{N} p(C_{t_n} | ϕ_n)
       = − Σ_{n=1}^{N} log y_{t_n}(ϕ_n)
       = − Σ_{n=1}^{N} ( w_{t_n}^T ϕ_n − log Σ_{j=1}^{K} exp(w_j^T ϕ_n) )
Multivariate logistic regression
Negative log-likelihood:
• Expression:
  L(W) = − Σ_{n=1}^{N} log y_{t_n}(ϕ_n) = − Σ_{n=1}^{N} ( w_{t_n}^T ϕ_n − log Σ_{j=1}^{K} exp(w_j^T ϕ_n) )
• With the one-hot encoding t_{n,j}:
  L(W) = − Σ_{n=1}^{N} ( Σ_{j=1}^{K} t_{n,j} w_j^T ϕ_n − log Σ_{j=1}^{K} exp(w_j^T ϕ_n) )
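A minimal sketch of this negative log-likelihood in NumPy. The array shapes (Phi: N×D features, T: N×K one-hot targets, W: D×K weights) are illustrative assumptions, and no numerical stabilization is applied:

```python
import numpy as np

def neg_log_likelihood(W, Phi, T):
    """W: (D, K) weights, Phi: (N, D) features, T: (N, K) one-hot targets."""
    scores = Phi @ W                                      # scores[n, j] = w_j^T phi_n
    log_norm = np.log(np.sum(np.exp(scores), axis=1))     # log sum_j exp(w_j^T phi_n)
    return -np.sum(np.sum(T * scores, axis=1) - log_norm)

rng = np.random.default_rng(0)
N, D, K = 8, 4, 3
Phi = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]                 # random one-hot targets
print(neg_log_likelihood(rng.normal(size=(D, K)), Phi, T))
```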
Multivariate logistic regression
Gradient of the negative log-likelihood:
  L(W) = − Σ_{n=1}^{N} ( Σ_{j=1}^{K} t_{n,j} w_j^T ϕ_n − log Σ_{j=1}^{K} exp(w_j^T ϕ_n) )
• Partial gradient ∇_{w_ℓ} log( Σ_{j=1}^{K} exp(w_j^T ϕ_n) ) ?
  ∇_{w_ℓ} log( Σ_{j=1}^{K} exp(w_j^T ϕ_n) ) = ∇_{w_ℓ} log( exp(w_ℓ^T ϕ_n) + constant ) = ?
Multivariate logistic regression
Gradient of the negative log-likelihood:
Recall that for f : R^n → R and g : R → R, ∇(g ∘ f)(x) = g′(f(x)) ∇f(x).
Here,
  g(t) = log(exp(t) + c),   g′(t) = exp(t) / (exp(t) + c),
  f(w_ℓ) = w_ℓ^T ϕ_n,   ∇f(w_ℓ) = ϕ_n.
So,
  ∇_{w_ℓ} log( Σ_{j=1}^{K} exp(w_j^T ϕ_n) ) = exp(w_ℓ^T ϕ_n) / ( Σ_{j=1}^{K} exp(w_j^T ϕ_n) ) · ϕ_n = y_ℓ(ϕ_n) ϕ_n
since
  y_k(ϕ) = exp(w_k^T ϕ) / Σ_{j=1}^{K} exp(w_j^T ϕ),   k = 1, . . . , K.
Multivariate logistic regression
Gradient of the negative log-likelihood:
  L(W) = − Σ_{n=1}^{N} ( Σ_{j=1}^{K} t_{n,j} w_j^T ϕ_n − log Σ_{j=1}^{K} exp(w_j^T ϕ_n) )
  ∇_{w_ℓ} L(W) = − Σ_{n=1}^{N} (t_{n,ℓ} − y_ℓ(ϕ_n)) ϕ_n = Σ_{n=1}^{N} (y_ℓ(ϕ_n) − t_{n,ℓ}) ϕ_n ∈ R^D
Stacking these columns over ℓ = 1, . . . , K:
  ∇L(W) = Σ_{n=1}^{N} ϕ_n (y(ϕ_n) − t_n)^T ∈ R^{D×K}.
• OK with intuition?
Optimization:
• We can apply the gradient descent algorithm to minimize L.
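A minimal sketch of the gradient formula ∇L(W) = Σ_n ϕ_n (y(ϕ_n) − t_n)^T, with the same illustrative shapes as before (Phi: N×D, T: N×K one-hot, W: D×K):

```python
import numpy as np

def gradient(W, Phi, T):
    """Return the (D, K) gradient matrix, one column per class weight vector w_l."""
    scores = Phi @ W
    scores -= scores.max(axis=1, keepdims=True)    # stabilized softmax
    Y = np.exp(scores)
    Y /= Y.sum(axis=1, keepdims=True)              # Y[n, k] = y_k(phi_n)
    return Phi.T @ (Y - T)                         # sum_n phi_n (y(phi_n) - t_n)^T

rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 4))
T = np.eye(3)[rng.integers(0, 3, size=8)]
print(gradient(rng.normal(size=(4, 3)), Phi, T).shape)   # (4, 3) = (D, K)
```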
Machine learning – Optimization – Gradient descent
Gradient descent
• Let F be a real, coercive, twice-differentiable function. Under suitable conditions on F and on the step size γ, the gradient descent iteration
    x^(k+1) = x^(k) − γ ∇F(x^(k))
  converges to a stationary point x* (i.e., a point that cancels the gradient): ∇F(x*) = 0.
• The parameter γ is called the step size (or learning rate in the ML field).
• A too small step size γ leads to slow convergence.
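A minimal sketch of this iteration on a toy function F(x) = ||x||²/2; the choice of F, the step size, and the iteration budget are illustrative:

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, n_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - gamma * grad_F(x)      # x^(k+1) = x^(k) - gamma * grad F(x^(k))
    return x

# For F(x) = ||x||^2 / 2 the gradient is grad F(x) = x; the iterates shrink toward 0.
print(gradient_descent(lambda x: x, x0=[3.0, -2.0]))
```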
Machine learning – ANN – Optimization – Gradient descent
[Figure: gradient descent trajectories in one and two dimensions]
Multivariate logistic regression
Gradient of the negative log-likelihood: ∇L(W) = Σ_{n=1}^{N} ϕ_n (y(ϕ_n) − t_n)^T ∈ R^{D×K}.
Optimization:
• Problem: In machine learning, the larger the dataset the better... but then
more and more computation for the gradient.
• Solution: Use (averaged) stochastic gradient descent:
  • Draw randomly a small subset S ⊂ T of the training set.
  • Compute a noisy gradient with this small set only and update the weights:
    W^(n) = W^(n−1) − γ^(n) ∇L(W^(n−1); S) = W^(n−1) − γ^(n) Σ_{n∈S} ϕ_n (y(ϕ_n) − t_n)^T
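A minimal sketch of one mini-batch stochastic gradient step implementing the update above; the batch size and learning rate are illustrative assumptions:

```python
import numpy as np

def sgd_step(W, Phi, T, lr=0.1, batch_size=32, rng=None):
    """One SGD update for W: (D, K), given Phi: (N, D) and one-hot T: (N, K)."""
    if rng is None:
        rng = np.random.default_rng()
    S = rng.choice(len(Phi), size=min(batch_size, len(Phi)), replace=False)  # random subset S
    scores = Phi[S] @ W
    scores -= scores.max(axis=1, keepdims=True)    # stabilized softmax on the mini-batch
    Y = np.exp(scores)
    Y /= Y.sum(axis=1, keepdims=True)              # Y[n, k] = y_k(phi_n)
    noisy_grad = Phi[S].T @ (Y - T[S])             # sum_{n in S} phi_n (y(phi_n) - t_n)^T
    return W - lr * noisy_grad                     # W^(n) = W^(n-1) - gamma^(n) * noisy gradient
```

Repeating this step over many random mini-batches gives the (averaged) stochastic gradient descent training loop described on the slide.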
Non-convexity in machine learning
Machine learning – Timeline
[Figure: machine learning timeline: the perceptron (an artificial neuron: inputs, weights, sum, activation function); the perceptron criticized; the maximal-margin hyperplane]