
Master 2 Statistique & Data Science, Ingénierie Mathématique

Réseaux de neurones profonds pour l’apprentissage


Deep neural networks for machine learning

Course I – Introduction to Artificial Neural Networks:


Multiclass logistic classification

Bruno Galerne
2023-2024

1
Credits

Most of the slides are from Charles Deledalle’s course “UCSD ECE285 Machine
learning for image processing” (a 30 × 50-minute course).

www.charles-deledalle.fr/
https://www.charles-deledalle.fr/pages/teaching.php#learning

2
Computer Vision and Machine Learning
Computer vision – Artificial Intelligence – Machine Learning

Definition (The British Machine Vision Association)


Computer vision (CV) is concerned with the automatic extraction, analysis
and understanding of useful information from a single image or a sequence of
images.

CV is a subfield of Artificial Intelligence.

Definition (Oxford dictionary)


Artificial Intelligence, noun: the theory and development of computer
systems able to perform tasks normally requiring human intelligence, such as
visual perception, speech recognition, decision-making, and translation.
3
Computer vision – Artificial Intelligence – Machine Learning

CV is a subfield of AI, and CV’s new best friend is machine learning (ML).
ML is also a subfield of AI, but not all computer vision algorithms rely on ML.

Definition
Machine Learning, noun: type of Artificial Intelligence that provides
computers with the ability to learn without being explicitly programmed.

ML provides various techniques that can learn from and make predictions on
data. Most of them follow the same general structure:

4
Computer vision – Image classification

Computer vision – Image classification

Goal: to assign a given image to one of the predefined classes.


5
Computer vision – Object detection

Computer vision – Object detection

(Source: Joseph Redmon)

Goal: to detect instances of objects of a certain class (such as human).

6
Computer vision – Image segmentation

Computer vision – Image segmentation

(Source: Abhijit Kundu)

Goal: to partition an image into multiple segments such that pixels in the same
segment share certain characteristics (color, texture or semantics).

7
Computer vision – Image captioning

Computer vision – Image captioning

(Karpathy, Fei-Fei, CVPR, 2015)

Goal: to write a sentence that describes what is happening. 8


Computer vision – Depth estimation

Computer vision – Depth estimation


(Stereo-vision: from two images acquired with different views.)

Goal: to estimate a depth map from one, two or several frames.

9
IP ∩ CV – Image colorization

Image colorization

(Source: Richard Zhang, Phillip Isola and Alexei A. Efros, 2016)

Goal: to add color to grayscale photographs. 10


IP ∩ CV – Image generation

Image generation

Generated images of bedrooms (Source: Alec Radford, Luke Metz, Soumith Chintala, 2015)

Goal: to automatically create realistic pictures of a given category.


11
IP ∩ CV – Image stylization

Image stylization

(Source: Neural Doodle, Champandard, 2016)

Goal: to create stylized images from rough sketches.

12
IP ∩ CV – Style transfer

Style transfer

(Source: Gatys, Ecker and Bethge, 2015)

Goal: to transfer the style of one image onto another. 13


Machine learning – Learning from examples

Learning from examples

14
Machine learning – Learning from examples

Learning from examples

3 main ingredients

1 Training set / examples:

{x1 , x2 , . . . , xN }

2 Machine or model:

x → f(x; θ) → y
(f: the function / algorithm; y: the prediction)

θ: parameters of the model

3 Loss, cost, objective function / energy:

argmin_θ E(θ; x1, x2, . . . , xN)

14
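To make these three ingredients concrete, here is a minimal Python sketch with a made-up dataset and a linear model trained by gradient descent; everything in it (data, sizes, step size) is illustrative and not from the slides.

    import numpy as np

    # 1) Training set / examples: N samples x_i with real-valued targets d_i (made-up data)
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 3))                      # N = 100 samples, 3 features
    d = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    # 2) Machine / model: prediction y = f(x; theta), here a simple linear model
    def f(x, theta):
        return x @ theta

    # 3) Loss / objective function: E(theta; x_1, ..., x_N)
    def E(theta):
        return np.sum((f(x, theta) - d) ** 2)

    # argmin_theta E(theta), here by plain gradient descent
    theta = np.zeros(3)
    for _ in range(500):
        grad = 2 * x.T @ (f(x, theta) - d)             # gradient of the squared error
        theta -= 1e-3 * grad
    print(E(theta), theta)                             # theta should be close to [1.0, -2.0, 0.5]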
Machine learning – Learning from examples

Learning from examples

Tools:  Data ↔ Statistics,   Loss ↔ Optimization

Goal: to extract information from the training set

• relevant for the given task,


• relevant for other data of the same kind.

15
Machine learning – Terminology

Terminology
Sample (Observation or Data): item to process (e.g., classify). Example: an
individual, a document, a picture, a sound, a video. . .

Features (Input): set of distinct traits that can be used to describe each
sample in a quantitative manner. Represented as a multi-dimensional vector
usually denoted by x. Example: size, weight, citizenship, . . .

Training set: Set of data used to discover potentially predictive relationships.

Validation set: Set used to adjust the model hyperparameters.

Testing set: Set used to assess the performance of a model.

Label (Output): The class or outcome assigned to a sample. The actual
prediction is often denoted by y and the desired/targeted class by d or t.
Example: man/woman, wealth, education level, . . . 16
Machine learning – Learning approaches

Learning approaches

Unsupervised learning: Discovering patterns in unlabeled data.
Example: cluster similar documents based on their text content.

Supervised learning: Learning with a labeled training set.
Example: email spam detector with a training set of already labeled emails.

Semisupervised learning: Learning with a small amount of labeled data and a
large amount of unlabeled data.
Example: web content and protein sequence classifications.

Reinforcement learning: Learning based on feedback or reward.
Example: learn to play chess by winning or losing.
17
(Source: Jason Brownlee and Lucas Masuch)
Machine learning – Workflow

Machine learning workflow

(Source: Michael Walker)

18
Machine learning – Problem types

Problem types

(Source: Lucas Masuch)


19
Deep learning – What is deep learning?

What is deep learning?

• Part of the machine learning field of learning representations of data.


Exceptionally effective at learning patterns.

• Utilizes learning algorithms that derive meaning out of data by using a


hierarchy of multiple layers that mimic the neural networks of our brain.

• If you provide the system tons of information, it begins to understand it


and respond in useful ways.

• Rebirth of artificial neural networks.

(Source: Lucas Masuch)

20
Deep learning: Academic actors

• Popularized by Hinton in 2006 with Restricted Boltzmann Machines

• Developed by different actors:

and many others...


• Yoshua Bengio, Geoffrey Hinton, and Yann LeCun received the 2018
ACM A.M. Turing Award for conceptual and engineering breakthroughs
that have made deep neural networks a critical component of computing. 21
Deep learning

Actors and applications


• Very active technology adopted by big actors.

• Success stories for many different academic problems:
  • Image processing
  • Computer vision
  • Speech recognition
  • Natural language processing
  • Translation
  • etc.

• Today all industries wonder if DL can improve their processes.
22
Machine learning – Timeline

Timeline of (deep) learning

1958   Perceptron
1969   "Perceptrons" book – the perceptron is criticized
1974   Backpropagation
~1980  Multilayer networks
1995   SVM reigns (Support Vector Machines, maximal-margin hyperplane)
1998   Convolutional Neural Networks for handwritten recognition
2006   Restricted Boltzmann Machine
2012   Google Brain project on 16k cores; AlexNet wins ImageNet
(with an "awkward silence", the AI winter, in between)

[Figure: timeline illustrated with an artificial neuron (inputs X1–X4, weights W11–W14, weighted sum, activation function, output f(x)) and an SVM maximal-margin hyperplane with support vectors.]

(Source: Lucas Masuch & Vincent Lepetit) 23


Plan of the course

1 Introduction to neural networks (recall from M1)


2 Convolutional neural networks for image classification (recall from M1)
3 Deep CNN for image classification, transfer learning
4 Convolutional neural networks for image segmentation
5 Deep generative models

Software: Python + PyTorch using Google Colab.


Remark: Focus on image processing and computer vision, but deep learning
works for many other applications:

• Signal processing, speech recognition,...


• Text processing
• Graph processing (discrete geometry, social networks,...)
• Physics, chemistry,. . .

24
Goal for first sessions

Neural networks for image classification

• Goal: Train a convolutional neural network for image classification

25
Goal for first sessions

Neural networks for image classification

• Goal: Train a convolutional neural network for image classification


• Goal: Understand the training of a convolutional neural network for
image classification

25
Goal of this course

Understand the training of a convolutional neural network for
image classification. Many notions are involved; we go through them backwards:

• Convolutional neural networks: special neural networks for images that
use local convolutions (e.g. 3 × 3 filters) in the first layers.
• Neural network: a specific architecture to compute a classifier (or a
regression) with parameters (weights) W to be trained at each layer.
• Training is done by optimizing a classification loss L(W) on a training
dataset: typically the last layer is a linear classifier trained with the cross-entropy loss.
• The optimization of the classification loss is done using stochastic
gradient descent on batches of training data.
• The gradient ∇L(W) is computed using backpropagation. 26
Machine learning – Timeline

Timeline of (deep) learning

1958   Perceptron
1969   "Perceptrons" book – the perceptron is criticized
1974   Backpropagation
~1980  Multilayer networks
1995   SVM reigns (Support Vector Machines, maximal-margin hyperplane)
1998   Convolutional Neural Networks for handwritten recognition
2006   Restricted Boltzmann Machine
2012   Google Brain project on 16k cores; AlexNet wins ImageNet
(with an "awkward silence", the AI winter, in between)

[Figure: timeline illustrated with an artificial neuron and an SVM maximal-margin hyperplane with support vectors.]

(Source: Lucas Masuch & Vincent Lepetit) 27


Perceptron
Machine learning – Perceptron

Perceptron

1958: Perceptron.   1969: "Perceptrons" book – the perceptron is criticized.

[Figure: artificial neuron – inputs X1–X4 with weights W11–W14, weighted sum, activation function, output f(x).]

(Source: Lucas Masuch & Vincent Lepetit) 28


Machine learning – Perceptron

Perceptron (Frank Rosenblatt, 1958)

First binary classifier based on supervised learning (discrimination).


Foundation of modern artificial neural networks.
At that time: technological, scientific and philosophical challenges.

29
Machine learning – Perceptron – Representation

Representation of the Perceptron

Parameters of the perceptron


• wk: synaptic weights
• b: bias

These are the real parameters to be estimated.

Training = adjusting the weights and biases


30
Machine learning – Perceptron – Inspiration

The origin of the Perceptron


Takes inspiration from the visual system known for its ability to learn patterns.

• When a neuron receives a stimulus


with high enough voltage, it emits
an action potential (aka, nerve
impulse or spike). It is said to fire.

• The perceptron mimics this activation effect: it fires only when

  ∑_i w_i x_i + b > 0

The decision is

  y = f(x; w) = sign(w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + b) = +1 for the first class, −1 for the second class.

31
Machine learning – Perceptron – Principle

1 Data are represented as vectors:

2 Collect training data with positive and negative examples:

(Source: Vincent Lepetit) 32


Machine learning – Perceptron – Principle

3 Training: find w and b so that:

• ⟨w, x⟩ + b is positive for positive samples x,
• ⟨w, x⟩ + b is negative for negative samples x,

where ⟨w, x⟩ = ∑_{i=1}^d w_i x_i = wᵀx is the dot product.

The equation ⟨w, x⟩ + b = 0 defines a hyperplane.
The hyperplane acts as a linear separator.
w is a normal vector to the hyperplane.

(Source: Vincent Lepetit) 32


Machine learning – Perceptron – Principle

4 Testing: the perceptron can now classify new examples.

• A new example x is classified positive if ⟨w, x⟩ + b is positive,
• and negative if ⟨w, x⟩ + b is negative.

(Signed) distance of x to the hyperplane:

  r = (⟨w, x⟩ + b) / ||w||

(Source: Vincent Lepetit) 32


Machine learning – Perceptron – Representation

Alternative representation

Use the zero-index to encode the bias as a synaptic weight.

Simplifies algorithms as all parameters can now be processed in the same way.

33
Machine learning – Perceptron – Training

Perceptron algorithm

Goal: find the vector of weights w from a labeled training dataset T

How: minimize classification errors


  min_w E(w) = − ∑_{(x,d)∈T, y≠d} d × ⟨w, x⟩ = ∑_{(x,d)∈T} max(−d × ⟨w, x⟩, 0)

• penalizes only misclassified samples (y ≠ d), for which d × ⟨w, x⟩ < 0,
• zero if all samples are correctly classified.

34
Machine learning – Perceptron – Training

Perceptron algorithm

• We assume that max(0, t) is differentiable with derivative 1 if t > 0 and 0 if t ≤ 0.

Algorithm: (stochastic) gradient descent for E(w) (see later); a code sketch is given after this slide.

• Initialize w randomly
• Repeat until convergence
  • For all (x, d) ∈ T (or a random subset T' ⊂ T)
    • Compute y = sign(⟨w, x⟩)
    • If y ≠ d, update: w ← w + γ d x

• Converges to some solution if the training data are linearly separable,


• But may pick any of many solutions of varying quality.

⇒ Poor generalization error, compared with SVM and logistic loss.


35
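A minimal NumPy sketch of this perceptron algorithm on a toy linearly separable dataset; the data and all names are illustrative, and the bias is folded into the weights by prepending a constant 1 feature, as in the "Alternative representation" slide.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy linearly separable data: labels d = sign of a fixed linear rule (illustrative)
    X = rng.normal(size=(200, 2))
    d = np.sign(X @ np.array([2.0, -1.0]) + 0.3)
    X = np.hstack([np.ones((200, 1)), X])          # bias trick: prepend a constant 1 feature

    w = np.zeros(3)                                # weights (bias included)
    gamma = 0.1                                    # step size / learning rate
    for epoch in range(100):
        errors = 0
        for x, t in zip(X, d):
            if np.sign(w @ x) != t:                # misclassified sample
                w += gamma * t * x                 # perceptron update: w <- w + gamma * d * x
                errors += 1
        if errors == 0:                            # all training samples correctly classified
            break
    print(epoch, w)

On linearly separable data this loop stops after finitely many updates, as stated on the slide; on non-separable data it would cycle forever.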
Machine learning – Perceptron – Perceptrons book

Perceptrons book (Minsky and Papert, 1969)

A perceptron can only classify data points that are linearly separable:

[Figure: a dataset that is not linearly separable (e.g., the XOR configuration), which a perceptron cannot classify.]

Seen by many as a justification to stop research on perceptrons.

(Source: Vincent Lepetit)

36
Artificial neural network
Machine learning – Artificial neural network

Artificial neural network

(Source: Lucas Masuch & Vincent Lepetit) 37


Machine learning – Artificial neural network

Artificial neural network

• Supervised learning method initially inspired by


the behavior of the human brain.

• Consists of the inter-connection of several


small units (just like in the human brain).

• Introduced in the late 50s, very popular in the


90s, reappeared in the 2010s with deep
learning.

• Also referred to as Multi-Layer Perceptron (MLP).

• Historically used after feature extraction.

38
Machine learning – Artificial neural network

Artificial neuron (McCulloch & Pitts, 1943)

• An artificial neuron contains several incoming weighted connections, an


outgoing connection and has a nonlinear activation function g.
• Neurons are trained to filter and detect specific features or patterns (e.g.
edge, nose) by receiving weighted input, transforming it with the
activation function and passing it to the outgoing connections.
• Unlike the perceptron, can be used for regression (with proper choice of g).

39
Machine learning – ANN

Artificial neural network / Multilayer perceptron / NeuralNet

• Inter-connection of several artificial


neurons (also called nodes or units).
• Each level in the graph is called a layer:
• Input layer,
• Hidden layer(s),
• Output layer.
• Each neuron in the hidden layers acts as a
classifier / feature detector.
• Feedforward NN (no cycle)
• first and simplest type of NN,
• information moves in one direction.
• Recurrent NN (with cycle)
• used for time sequences,
• such as speech-recognition.
40
Machine learning – ANN

Artificial neural network / Multilayer perceptron / NeuralNet

h_1 = g_1(w^1_{11} x_1 + w^1_{12} x_2 + w^1_{13} x_3 + b^1_1)
h_2 = g_1(w^1_{21} x_1 + w^1_{22} x_2 + w^1_{23} x_3 + b^1_2)
h_3 = g_1(w^1_{31} x_1 + w^1_{32} x_2 + w^1_{33} x_3 + b^1_3)
h_4 = g_1(w^1_{41} x_1 + w^1_{42} x_2 + w^1_{43} x_3 + b^1_4)            i.e.   h = g_1(W_1 x + b_1)

y_1 = g_2(w^2_{11} h_1 + w^2_{12} h_2 + w^2_{13} h_3 + w^2_{14} h_4 + b^2_1)
y_2 = g_2(w^2_{21} h_1 + w^2_{22} h_2 + w^2_{23} h_3 + w^2_{24} h_4 + b^2_2)   i.e.   y = g_2(W_2 h + b_2)

w^k_{ij} is the synaptic weight between previous node j and next node i at layer k.
g_k is any activation function applied to each coefficient of its input vector.
The matrices W_k and biases b_k are learned from labeled training data. 41
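A minimal PyTorch sketch of this two-layer network (3 inputs, 4 hidden units, 2 outputs). The choices g1 = ReLU and g2 = softmax are only illustrative; the slide leaves g1 and g2 unspecified.

    import torch
    import torch.nn as nn

    class SmallMLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer1 = nn.Linear(3, 4)   # W1 (4x3) and b1: h = g1(W1 x + b1)
            self.layer2 = nn.Linear(4, 2)   # W2 (2x4) and b2: y = g2(W2 h + b2)

        def forward(self, x):
            h = torch.relu(self.layer1(x))              # g1 = ReLU (example choice)
            y = torch.softmax(self.layer2(h), dim=-1)   # g2 = softmax (example choice)
            return y

    net = SmallMLP()
    x = torch.randn(5, 3)     # a batch of 5 input vectors
    print(net(x).shape)       # torch.Size([5, 2])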
Machine learning – ANN

Artificial neural network / Multilayer perceptron

It can have 1 hidden layer only (shallow network),


It can have more than 1 hidden layer (deep network),
each layer may have a different size, and
hidden and output layers often have different activation functions.
42
Machine learning – ANN

Artificial neural network / Multilayer perceptron


• As for the perceptron, the biases can be integrated into the weights:

  W_k h_{k−1} + b_k = (b_k | W_k) (1 ; h_{k−1}) = W̃_k h̃_{k−1},
  where W̃_k = (b_k | W_k) and h̃_{k−1} = (1 ; h_{k−1}).

• A neural network with L layers is a function of x parameterized by W̃:

  y = f(x; W̃)   where   W̃ = (W̃_1, W̃_2, . . . , W̃_L)

• It can be defined recursively as

  y = f(x; W̃) = h_L,   h_k = g_k(W̃_k h̃_{k−1})   and   h_0 = x.

• For simplicity, W̃ will be denoted W (when no confusion is possible).

43
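A quick NumPy check of this bias trick, with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))                  # Wk
    b = rng.normal(size=4)                       # bk
    h = rng.normal(size=3)                       # h_{k-1}

    W_tilde = np.hstack([b[:, None], W])         # W~k = (bk | Wk), shape (4, 4)
    h_tilde = np.concatenate([[1.0], h])         # h~_{k-1} = (1, h_{k-1})

    print(np.allclose(W @ h + b, W_tilde @ h_tilde))   # True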
Machine learning – ANN – Activation functions

Activation functions

Linear units: g(a) = a

  y = W_L h_{L−1} + b_L
  h_{L−1} = W_{L−1} h_{L−2} + b_{L−1}
  ⇒ y = W_L W_{L−1} h_{L−2} + W_L b_{L−1} + b_L
  ⇒ y = W_L · · · W_1 x + ∑_{k=1}^{L−1} W_L · · · W_{k+1} b_k + b_L

We can always find an equivalent network without hidden units,


because compositions of affine functions are affine.

In general, non-linearity is needed to learn complex (non-linear)


representations of data, otherwise the NN would be just a linear function.
Otherwise, back to the problem of nonlinearly separable datasets.

44
Machine learning – ANN – Activation functions

Activation functions

Threshold units: for instance the sign function

  g(a) = −1 if a < 0, +1 otherwise,

or the Heaviside (aka step) activation function

  g(a) = 0 if a < 0, 1 otherwise.

Discontinuities in the hidden layers make the optimization really difficult.
We prefer functions that are continuous and differentiable.

45
Machine learning – ANN – Activation functions

Activation functions

Sigmoidal units: for instance the hyperbolic tangent function

  g(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}) ∈ [−1, 1],

or the logistic sigmoid function

  g(a) = 1 / (1 + e^{−a}) ∈ [0, 1].

• In fact equivalent up to linear transformations: tanh(a/2) = 2 logistic(a) − 1.
• Differentiable approximations of the sign and step functions, respectively.
• Act as threshold units for large values of |a| and as linear units for small values.

46
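A one-line NumPy check of the relation tanh(a/2) = 2 logistic(a) − 1 (purely illustrative):

    import numpy as np

    a = np.linspace(-5, 5, 101)
    logistic = 1 / (1 + np.exp(-a))
    print(np.allclose(np.tanh(a / 2), 2 * logistic - 1))   # True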
Machine learning – ANN

Sigmoidal units: logistic activation functions are used in binary classification
(class C1 vs C2) as they can be interpreted as posterior probabilities:

  y = P(C1|x)   and   1 − y = P(C2|x)

The architecture of the network defines the shape of the separator
{x s.t. P(C1|x) = P(C2|x)}.

[Figure: decision boundaries of networks of increasing complexity/capacity.]

Complexity/capacity of the network: trade-off between generalization and overfitting.
47
Machine learning – ANN – Activation functions

Activation functions

“Modern” units:

  g(a) = max(a, 0)   (ReLU)        or        g(a) = log(1 + e^a)   (Softplus)

Most neural networks nowadays use ReLU
(rectified linear unit), max(a, 0),
for the hidden layers, since it trains
much faster, is more expressive than the
logistic function, and prevents the
vanishing gradient problem.

(Source: Lucas Masuch)


48
Machine learning – ANN

Neural networks solve non-linear separable problems

(Source: Vincent Lepetit) 49


Tasks, architectures and loss functions
Tasks, architectures and loss functions

Approximation – Least square regression


• Goal: Predict a real multivariate function.

• How: estimate the coefficients W of y = f (x; W )


from labeled training examples where labels are real vectors:

• Typical architecture:

• Hidden layer:

ReLU(a) = max(a, 0)

• Linear output:

g(a) = a
50
Tasks, architectures and loss functions

Approximation – Least square regression

• Loss: as for polynomial curve fitting, it is standard to consider the
sum of square errors (assumption of Gaussian distributed errors)

  E(W) = ∑_{i=1}^N ||y_i − d_i||₂² = ∑_{i=1}^N ||f(x_i; W) − d_i||₂²

and look for W* such that ∇E(W*) = 0.

• Solution: provided the network has enough flexibility and the size of the
training set grows to infinity,

  y* = f(x; W*) = E[d|x] = ∫ d p(d|x) dd   (the posterior mean).

51
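A minimal PyTorch sketch of this setup: a one-hidden-layer ReLU network with a linear output, trained with the sum-of-squares loss on made-up 1D data. All sizes, the learning rate and the data are illustrative choices, not from the slides.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.linspace(-3, 3, 200).unsqueeze(1)        # inputs x_i, shape (200, 1)
    d = torch.sin(x) + 0.1 * torch.randn_like(x)       # noisy real-valued targets d_i

    # One hidden ReLU layer, linear output (the "typical architecture" above)
    net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(net.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss(reduction="sum")              # sum of square errors E(W)

    for step in range(2000):
        opt.zero_grad()
        loss = loss_fn(net(x), d)                      # E(W) = sum_i ||f(x_i; W) - d_i||^2
        loss.backward()                                # gradient of E(W) (backpropagation)
        opt.step()
    print(loss.item())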
Tasks, architectures and loss functions

Multiclass classification – Multivariate logistic regression


(aka, multinomial classification)

• Goal: Classify an object x into one among K classes C1 , . . . , CK .

• How: Estimate the coefficients W of a multivariate function

  y = f(x; W) ∈ [0, 1]^K   s.t.   ∑_{k=1}^K y_k = 1,

from training examples T = {(x_i, d_i)} where d_i is a 1-of-K (one-hot) code:

• Class 1: d_i = (1, 0, . . . , 0)^T if x_i ∈ C1
• Class 2: d_i = (0, 1, . . . , 0)^T if x_i ∈ C2
• ...
• Class K: d_i = (0, 0, . . . , 1)^T if x_i ∈ CK

• y_k = f_k(x; W) is understood as the probability of x ∈ Ck.


• Remark: Do not use the class index k directly as a scalar label: the order
of the labels is not informative.
52
Tasks, architectures and loss functions

Multiclass classification – Multivariate logistic regression


• Typical architecture:
  • Hidden layer: ReLU(a) = max(a, 0)
  • Output layer: softmax(a)_k = exp(a_k) / ∑_{ℓ=1}^K exp(a_ℓ)

• Softmax maps R^K to the set of probability vectors {y ∈ (0, 1)^K, ∑_{k=1}^K y_k = 1}.
• Smooth version of the winner-takes-all activation model (maxout).
• The final decision function is winner-takes-all:

  argmax_k softmax(a)_k = argmax_k a_k


53
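A small NumPy sketch of the softmax and of the winner-takes-all property above; subtracting max(a) before exponentiating is a standard numerical-stability trick that does not change the result.

    import numpy as np

    def softmax(a):
        e = np.exp(a - np.max(a))          # subtract max(a) for numerical stability
        return e / e.sum()

    a = np.array([2.0, -1.0, 0.5, 3.0])
    y = softmax(a)
    print(y, y.sum())                      # a probability vector summing to 1
    print(np.argmax(y) == np.argmax(a))    # True: softmax preserves the winner-takes-all decision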
Tasks, architectures and loss functions

Multiclass classification – Multivariate logistic regression


• Loss: it is standard to consider the cross-entropy for K classes
(assumption of multinomially distributed data)

  E(W) = − ∑_{i=1}^N ∑_{k=1}^K d_{i,k} log y_{i,k}   with y_i = f(x_i; W) = softmax(a_i) ∈ (0, 1)^K
       = − ∑_{i=1}^N ( a_{i,d_i} − log ∑_{k=1}^K exp(a_{i,k}) )   with d_i the class of x_i,

and look for W* such that ∇E(W*) = 0.

• Solution: provided the network has enough flexibility and the size of the
training set grows to infinity,

  y_k* = f_k(x; W*) = P(Ck|x)   (the posterior probability).

54
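A small NumPy sketch checking that the two expressions of the cross-entropy above coincide, on made-up scores and labels (0-based class indices are used here purely for convenience):

    import numpy as np

    def softmax_rows(A):                               # row-wise softmax of the score matrix A
        E = np.exp(A - A.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 3))                        # N = 5 samples, K = 3 class scores a_i
    d = np.array([0, 2, 1, 0, 2])                      # class indices d_i (0-based here)
    D = np.eye(3)[d]                                   # corresponding one-hot codes d_i

    Y = softmax_rows(A)
    E1 = -np.sum(D * np.log(Y))                                          # -sum_i sum_k d_ik log y_ik
    E2 = -np.sum(A[np.arange(5), d] - np.log(np.exp(A).sum(axis=1)))     # log-sum-exp form
    print(E1, E2, np.allclose(E1, E2))                                   # both expressions coincide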
Multivariate logistic regression
Multiclass classification – Multivariate logistic regression

• SVMs allow for multiclass classification but are not easily pluggable into
neural networks.
• Instead, neural networks generally use multivariate logistic regression.

Goal of this section:

• Mathematics of multivariate logistic regression.


• Reference: Section ”4.3.4 Multiclass logistic regression” of
C. M. Bishop, Pattern Recognition and Machine Learning, Information
Science and Statistics, Springer, 2006

55
Multivariate logistic regression

New notation

• Goal: Classify an object x into one among K classes C1 , . . . , CK .


• Training set: T = {(xn , tn ), n = 1, . . . , N }, tn ∈ {1, . . . , K} encodes the
class of xn .
• Each tn is transformed into a vector tn ∈ {0, 1}K with a 1-of-K code:
• Class 1: t_n = (1, 0, . . . , 0)^T if x_n ∈ C1, i.e. t_n = 1,
• Class 2: t_n = (0, 1, . . . , 0)^T if x_n ∈ C2, i.e. t_n = 2,
• ...
• Class K: t_n = (0, 0, . . . , 1)^T if x_n ∈ CK, i.e. t_n = K.

• Remark: Do not use the class index k directly as a scalar label: the order
of the labels is not informative.

56
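A tiny NumPy sketch of this 1-of-K transformation of the labels t_n (illustrative values):

    import numpy as np

    K = 4
    t = np.array([1, 3, 2, 1, 4])          # class indices t_n in {1, ..., K}
    T = np.eye(K)[t - 1]                   # each row is the 1-of-K code t_n
    print(T)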
Multivariate logistic regression

Feature transform
• We apply a feature transform ϕ : Rp → RD to each xn :

ϕn = ϕ(xn ), n = 1, . . . , N.

• Depending on the context, it allows one to increase (D > p) or decrease (D < p)
the dimension in a way that favors class discrimination (e.g. PCA...).
• This is a nonlinear map that should make the classes linearly separable.

57
Multivariate logistic regression

Linear classifier
We will consider linear multiclass classifiers in feature space R^D:
• Classifier parameters: each class k has a weight vector w_k ∈ R^D and a bias b_k ∈ R.
• Class separation: is w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ ?
• Class k region:

  {ϕ ∈ R^D, ∀ℓ ≠ k, w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ}
  = ∩_{ℓ=1, ℓ≠k}^K {ϕ ∈ R^D, w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ}
  = intersection of K − 1 half-spaces.

• The classification partition is made of (unbounded) convex polyhedra (in
feature space).
58
Multivariate logistic regression

Bias trick for linear classifier

• Classifier parameters: each class k has a weight vector w_k ∈ R^D and a bias b_k ∈ R.
• Add an additional dummy coordinate 1 to ϕ so that

  w_k^T ϕ + b_k = (w_k ; b_k)^T (ϕ ; 1) = w̃_k^T ϕ̃.

• From now on this is implicit: we assume that the feature transform has
a 1 component, so that w_k^T ϕ has an implicit bias component.
• The classifier parameters are then just a set of weight vectors
{w_k, k = 1, . . . , K}.
• What are the boundaries if we forget the bias?

59
Multivariate logistic regression

• After the feature transform the training set is T = {(ϕ_n, t_n), n = 1, . . . , N}.
• We want to estimate

  y = f(ϕ) ∈ [0, 1]^K   s.t.   ∑_{k=1}^K y_k = 1,

such that ideally y_k ≃ p(Ck|ϕ) is an estimate of the posterior probability

  p(Ck|ϕ) = probability of being in class Ck given the feature vector ϕ.

• Model assumption: the posterior probability p(Ck|ϕ) given the feature is a
softmax transformation of a linear function of the feature variable:
there exist K vectors w_1, . . . , w_K ∈ R^D such that

  y_k(ϕ) = p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ),   k = 1, . . . , K.

• By construction of the softmax, one has y ∈ (0, 1)^K s.t. ∑_{k=1}^K y_k = 1.
60
Multivariate logistic regression

• Model assumption: there exist K vectors w_1, . . . , w_K ∈ R^D such that

  y_k(ϕ) = p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ),   k = 1, . . . , K.

• We denote by W = (w_1, . . . , w_K) ∈ R^{D×K} the matrix containing all the
weights.

Training:

• Training = find the best weight matrix W to explain the dataset.
• Performed using maximum likelihood.

61
Multivariate logistic regression

Likelihood: Assume a multinomial model for the classes

• For each ϕ, associate the multinomial random variable T(ϕ) that takes the value k with probability

  p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ).

• The realizations (ϕ_n, t_n) of the dataset are assumed independent.
• Then the likelihood of the dataset is

  P((T(ϕ_1), . . . , T(ϕ_N)) = (t_1, . . . , t_N)) = ∏_{n=1}^N P(T(ϕ_n) = t_n) = ∏_{n=1}^N p(C_{t_n}|ϕ_n).

• We want to maximize the likelihood with respect to W = (w_1, . . . , w_K) ∈ R^{D×K},
the matrix containing all the weights.
• We minimize L(W) = − log P instead (i.e., maximize the log-likelihood).

62
Multivariate logistic regression

Log-likelihood:

• Model assumption: there exist K vectors w_1, . . . , w_K ∈ R^D such that

  y_k(ϕ) = p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ),   k = 1, . . . , K.

• Expression of the negative log-likelihood:

  L(W) = − log ∏_{n=1}^N p(C_{t_n}|ϕ_n)
       = − ∑_{n=1}^N log y_{t_n}(ϕ_n)
       = − ∑_{n=1}^N ( w_{t_n}^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

63
Multivariate logistic regression

Log-likelihood:

• Expression of the negative log-likelihood:

  L(W) = − ∑_{n=1}^N log y_{t_n}(ϕ_n) = − ∑_{n=1}^N ( w_{t_n}^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• Alternative expression with the 1-of-K code: recall that

  t_{n,k} = 1 if k = t_n, 0 otherwise,   so that   w_{t_n} = ∑_{k=1}^K t_{n,k} w_k.

Hence

  L(W) = − ∑_{n=1}^N ( ∑_{j=1}^K t_{n,j} w_j^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• One can show that W ↦ L(W) is convex.
• What do we need to optimize L(W)?

64
Multivariate logistic regression

Gradient of the negative log-likelihood:

  L(W) = − ∑_{n=1}^N ( ∑_{j=1}^K t_{n,j} w_j^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• Linear part: partial gradient with respect to column w_ℓ, ℓ ∈ {1, . . . , K}:

  ∇_{w_ℓ} [ ∑_{j=1}^K t_{n,j} w_j^T ϕ_n ] = ∇_{w_ℓ} [ t_{n,ℓ} w_ℓ^T ϕ_n + constant ] = t_{n,ℓ} ϕ_n.

• Partial gradient of the log-sum-exp term?

  ∇_{w_ℓ} log ∑_{j=1}^K exp(w_j^T ϕ_n) = ∇_{w_ℓ} log( exp(w_ℓ^T ϕ_n) + constant ) = ?

65
Multivariate logistic regression

Gradient of the negative log-likelihood:
Recall that for f : R^n → R and g : R → R,

  ∇(g ∘ f)(x) = g'(f(x)) ∇f(x).

Here,

  g(t) = log(exp(t) + c),   g'(t) = exp(t) / (exp(t) + c),
  f(w_ℓ) = w_ℓ^T ϕ_n,       ∇f(w_ℓ) = ϕ_n.

So

  ∇_{w_ℓ} log ∑_{j=1}^K exp(w_j^T ϕ_n) = ( exp(w_ℓ^T ϕ_n) / ∑_{j=1}^K exp(w_j^T ϕ_n) ) ϕ_n = y_ℓ(ϕ_n) ϕ_n,

since y_k(ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ), k = 1, . . . , K.

66
Multivariate logistic regression

Gradient of the negative log-likelihood:

  L(W) = − ∑_{n=1}^N ( ∑_{j=1}^K t_{n,j} w_j^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• For each column w_ℓ ∈ R^D of W, ℓ ∈ {1, . . . , K},

  ∇_{w_ℓ} L(W) = − ∑_{n=1}^N ( t_{n,ℓ} − y_ℓ(ϕ_n) ) ϕ_n = ∑_{n=1}^N ( y_ℓ(ϕ_n) − t_{n,ℓ} ) ϕ_n ∈ R^D.

• Full gradient for W:

  ∇L(W) = ∑_{n=1}^N ϕ_n ( y(ϕ_n) − t_n )^T ∈ R^{D×K}.

• OK with intuition?

Optimization:
• We can apply the gradient descent algorithm to minimize L. 67
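A minimal NumPy sketch of L(W) and of this gradient formula, with made-up features and labels (Phi stores one ϕ_n per row, with the implicit 1 component for the bias; all names and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 100, 5, 3
    Phi = np.hstack([rng.normal(size=(N, D - 1)), np.ones((N, 1))])   # features phi_n, bias trick
    t = rng.integers(1, K + 1, size=N)                                # class labels t_n in {1,...,K}
    T = np.eye(K)[t - 1]                                              # 1-of-K codes t_n

    def y_of(Phi, W):                                 # y_k(phi_n): row-wise softmax of Phi @ W
        A = Phi @ W
        E = np.exp(A - A.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    def loss(W):                                      # L(W) = -sum_n log y_{t_n}(phi_n)
        return -np.sum(T * np.log(y_of(Phi, W)))

    def grad(W):                                      # sum_n phi_n (y(phi_n) - t_n)^T
        return Phi.T @ (y_of(Phi, W) - T)

    W = np.zeros((D, K))
    for _ in range(500):                              # plain gradient descent on L
        W -= 1e-2 * grad(W)
    print(loss(W))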
Machine learning – Optimization – Gradient descent

An iterative algorithm trying to find a minimum of a real function.

Gradient descent

• Let F be a real, coercive, twice-differentiable function such that

  ||∇²F(x)||₂ ≤ L   for some L > 0   (∇²F is the Hessian matrix of F).

• Then, whatever the initialization x(0), if 0 < γ < 2/L, the sequence

  x(n+1) = x(n) − γ ∇F(x(n))     (step in the direction of greatest descent)

converges to a stationary point x*, i.e., a point that cancels the gradient: ∇F(x*) = 0.

• The parameter γ is called the step size (or learning rate in the ML field).
• A too small step size γ leads to slow convergence.
68
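A minimal Python sketch of this iteration on a simple quadratic function (purely illustrative):

    import numpy as np

    def F(x):                         # a simple coercive, twice-differentiable function
        return 0.5 * x @ x

    def grad_F(x):
        return x                      # here ||grad^2 F||_2 = 1, so any 0 < gamma < 2 works

    x = np.array([5.0, -3.0])
    gamma = 0.5                       # step size (learning rate)
    for n in range(100):
        x = x - gamma * grad_F(x)
    print(F(x), x)                    # close to the stationary point x* = 0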
Machine learning – ANN – Optimization – Gradient descent

[Figure: gradient descent iterations in one dimension and in two dimensions, converging to stationary points.]

69
Multivariate logistic regression

Gradient of the negative log-likelihood: ∇L(W) = ∑_{n=1}^N ϕ_n (y(ϕ_n) − t_n)^T ∈ R^{D×K}.

Optimization:
• Problem: in machine learning, the larger the dataset the better... but then
more and more computation is needed for the gradient.
• Solution: use (averaged) stochastic gradient descent:
  • Draw randomly a small subset S ⊂ T of the training set.
  • Compute a noisy gradient with this small set only and update the weights:

    W(n) = W(n−1) − γ(n) ∇L(W(n−1); S) = W(n−1) − γ(n) ∑_{n∈S} ϕ_n (y(ϕ_n) − t_n)^T,

  and compute the averaged weights

    W̄(n) = 1/(n+1) ∑_{k=0}^n W(k) = n/(n+1) W̄(n−1) + 1/(n+1) W(n).

• Convergence results hold for L (strongly) convex if γ(n) decays appropriately, etc.

70
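A minimal NumPy sketch of this averaged stochastic gradient descent for the multiclass logistic loss above, on synthetic data; the batch size, step-size decay and number of iterations are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 1000, 5, 3
    Phi = np.hstack([rng.normal(size=(N, D - 1)), np.ones((N, 1))])   # features + bias trick
    t = np.argmax(Phi @ rng.normal(size=(D, K)), axis=1)              # synthetic labels from a linear rule
    T = np.eye(K)[t]                                                  # 1-of-K codes

    def y_of(Phi, W):                                  # row-wise softmax of the scores Phi @ W
        A = Phi @ W
        E = np.exp(A - A.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    W = np.zeros((D, K))
    W_bar = W.copy()                                   # averaged weights W_bar^(n)
    for n in range(1, 2001):
        S = rng.choice(N, size=32, replace=False)      # random small subset S of the training set
        noisy_grad = Phi[S].T @ (y_of(Phi[S], W) - T[S])
        gamma = 0.1 / np.sqrt(n)                       # decaying step size gamma^(n)
        W = W - gamma * noisy_grad
        W_bar = (n / (n + 1)) * W_bar + (1 / (n + 1)) * W   # running average of the iterates
    print(np.mean(np.argmax(y_of(Phi, W_bar), axis=1) == t))  # training accuracy of the averaged model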
Non-convexity in machine learning

But for neural networks the cost is not convex...

71
Machine learning – Timeline

Timeline of (deep) learning

1958   Perceptron
1969   "Perceptrons" book – the perceptron is criticized
1974   Backpropagation
~1980  Multilayer networks
1995   SVM reigns (Support Vector Machines, maximal-margin hyperplane)
1998   Convolutional Neural Networks for handwritten recognition
2006   Restricted Boltzmann Machine
2012   Google Brain project on 16k cores; AlexNet wins ImageNet
(with an "awkward silence", the AI winter, in between)

[Figure: timeline illustrated with an artificial neuron and an SVM maximal-margin hyperplane with support vectors.]

(Source: Lucas Masuch & Vincent Lepetit) 72


Backpropagation
Questions?

Next class: Backpropagation

Slides from Charles Deledalle


Sources, images courtesy and acknowledgment

• K. Chatfield, P. Gallinari, C. Hazırbaş, A. Horodniceanu, Y. LeCun, V. Lepetit, L. Masuch, A. Ng, M. Ranzato

72
