
Master 2 Statistique & Data Science, Ingénierie Mathématique

Réseaux de neurones profonds pour l’apprentissage


Deep neural networks for machine learning

Course I – Introduction to Artificial Neural Networks:


Multiclass logistic classification

Bruno Galerne
2023-2024

1
Credits

Most of the slides are from Charles Deledalle’s course “UCSD ECE285 Machine
learning for image processing” (a 30 × 50-minute course).

www.charles-deledalle.fr/
https://www.charles-deledalle.fr/pages/teaching.php#learning

2
Computer Vision and Machine Learning
Computer vision – Artificial Intelligence – Machine Learning

Definition (The British Machine Vision Association)


Computer vision (CV) is concerned with the automatic extraction, analysis
and understanding of useful information from a single image or a sequence of
images.

CV is a subfield of Artificial Intelligence.

Definition (Oxford dictionary)


Artificial Intelligence, noun: the theory and development of computer
systems able to perform tasks normally requiring human intelligence, such as
visual perception, speech recognition, decision-making, and translation.
3
Computer vision – Artificial Intelligence – Machine Learning

CV is a subfield of AI, and CV’s new best friend is machine learning (ML).
ML is also a subfield of AI, but not all computer vision algorithms rely on ML.

Definition
Machine Learning, noun: type of Artificial Intelligence that provides
computers with the ability to learn without being explicitly programmed.

ML provides various techniques that can learn from and make predictions on
data. Most of them follow the same general structure:

4
Computer vision – Image classification

Computer vision – Image classification

Goal: to assign a given image to one of the predefined classes.


5
Computer vision – Object detection

Computer vision – Object detection

(Source: Joseph Redmon)

Goal: to detect instances of objects of a certain class (such as human).

6
Computer vision – Image segmentation

Computer vision – Image segmentation

(Source: Abhijit Kundu)

Goal: to partition an image into multiple segments such that pixels in the same
segment share certain characteristics (color, texture or semantics).

7
Computer vision – Image captioning

Computer vision – Image captioning

(Karpathy, Fei-Fei, CVPR, 2015)

Goal: to write a sentence that describes what is happening. 8


Computer vision – Depth estimation

Computer vision – Depth estimation


(Stereo-vision: from two images acquired with different views.)

Goal: to estimate a depth map from one, two or several frames.

9
IP ∩ CV – Image colorization

Image colorization

(Source: Richard Zhang, Phillip Isola and Alexei A. Efros, 2016)

Goal: to add color to grayscale photographs. 10


IP ∩ CV – Image generation

Image generation

Generated images of bedrooms (Source: Alec Radford, Luke Metz, Soumith Chintala, 2015)

Goal: to automatically create realistic pictures of a given category.


11
IP ∩ CV – Image stylization

Image stylization

(Source: Neural Doodle, Champandard, 2016)

Goal: to create stylized images from rough sketches.

12
IP ∩ CV – Style transfer

Style transfer

(Source: Gatys, Ecker and Bethge, 2015)

Goal: to transfer the style of one image onto another. 13


Machine learning – Learning from examples

Learning from examples

14
Machine learning – Learning from examples

Learning from examples

3 main ingredients

1 Training set / examples:

{x1 , x2 , . . . , xN }

2 Machine or model:

x → f(x; θ) → y
(f: the function / algorithm; y: the prediction)

θ: parameters of the model

3 Loss, cost, objective function / energy:

argmin_θ E(θ; x1, x2, . . . , xN)

14
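To make these three ingredients concrete, here is a minimal Python sketch with a made-up dataset and a linear model trained by gradient descent; everything in it (data, sizes, step size) is illustrative and not from the slides.

    import numpy as np

    # 1) Training set / examples: N samples x_i with real-valued targets d_i (made-up data)
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 3))                      # N = 100 samples, 3 features
    d = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    # 2) Machine / model: prediction y = f(x; theta), here a simple linear model
    def f(x, theta):
        return x @ theta

    # 3) Loss / objective function: E(theta; x_1, ..., x_N)
    def E(theta):
        return np.sum((f(x, theta) - d) ** 2)

    # argmin_theta E(theta), here by plain gradient descent
    theta = np.zeros(3)
    for _ in range(500):
        grad = 2 * x.T @ (f(x, theta) - d)             # gradient of the squared error
        theta -= 1e-3 * grad
    print(E(theta), theta)                             # theta should be close to [1.0, -2.0, 0.5]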
Machine learning – Learning from examples

Learning from examples

Tools:  Data ↔ Statistics,   Loss ↔ Optimization

Goal: to extract information from the training set

• relevant for the given task,


• relevant for other data of the same kind.

15
Machine learning – Terminology

Terminology
Sample (Observation or Data): item to process (e.g., classify). Example: an
individual, a document, a picture, a sound, a video. . .

Features (Input): set of distinct traits that can be used to describe each
sample in a quantitative manner. Represented as a multi-dimensional vector
usually denoted by x. Example: size, weight, citizenship, . . .

Training set: Set of data used to discover potentially predictive relationships.

Validation set: Set used to adjust the model hyperparameters.

Testing set: Set used to assess the performance of a model.

Label (Output): The class or outcome assigned to a sample. The actual
prediction is often denoted by y and the desired/targeted class by d or t.
Example: man/woman, wealth, education level, . . . 16
Machine learning – Learning approaches

Learning approaches

Unsupervised learning: Discovering patterns in unlabeled data.
Example: cluster similar documents based on their text content.

Supervised learning: Learning with a labeled training set.
Example: email spam detector with a training set of already labeled emails.

Semisupervised learning: Learning with a small amount of labeled data and a
large amount of unlabeled data.
Example: web content and protein sequence classifications.

Reinforcement learning: Learning based on feedback or reward.
Example: learn to play chess by winning or losing.
17
(Source: Jason Brownlee and Lucas Masuch)
Machine learning – Workflow

Machine learning workflow

(Source: Michael Walker)

18
Machine learning – Problem types

Problem types

(Source: Lucas Masuch)


19
Deep learning – What is deep learning?

What is deep learning?

• Part of the machine learning field of learning representations of data.


Exceptionally effective at learning patterns.

• Utilizes learning algorithms that derive meaning out of data by using a


hierarchy of multiple layers that mimic the neural networks of our brain.

• If you provide the system tons of information, it begins to understand it


and respond in useful ways.

• Rebirth of artificial neural networks.

(Source: Lucas Masuch)

20
Deep learning: Academic actors

• Popularized by Hinton in 2006 with Restricted Boltzmann Machines

• Developed by different actors:

and many others...


• Yoshua Bengio, Geoffrey Hinton, and Yann LeCun received the 2018
ACM A.M. Turing Award for conceptual and engineering breakthroughs
that have made deep neural networks a critical component of computing. 21
Deep learning

Actors and applications


• Very active technology adopted by big actors.

• Success stories for many different academic problems:
  • Image processing
  • Computer vision
  • Speech recognition
  • Natural language processing
  • Translation
  • etc.

• Today all industries wonder if DL can improve their processes.
22
Machine learning – Timeline

Timeline of (deep) learning

1958   Perceptron
1969   "Perceptrons" book – the perceptron is criticized
1974   Backpropagation
~1980  Multilayer networks
1995   SVM reigns (Support Vector Machines, maximal-margin hyperplane)
1998   Convolutional Neural Networks for handwritten recognition
2006   Restricted Boltzmann Machine
2012   Google Brain project on 16k cores; AlexNet wins ImageNet
(with an "awkward silence", the AI winter, in between)

[Figure: timeline illustrated with an artificial neuron (inputs X1–X4, weights W11–W14, weighted sum, activation function, output f(x)) and an SVM maximal-margin hyperplane with support vectors.]

(Source: Lucas Masuch & Vincent Lepetit) 23


Plan of the course

1 Introduction to neural networks (recall from M1)


2 Convolutional neural networks for image classification (recall from M1)
3 Deep CNN for image classification, transfer learning
4 Convolutional neural networks for image segmentation
5 Deep generative models

Software: Python + PyTorch using Google Colab.


Remark: Focus on image processing and computer vision, but deep learning
works for many other applications:

• Signal processing, speech recognition,...


• Text processing
• Graph processing (discrete geometry, social networks,...)
• Physics, chemistry,. . .

24
Goal for first sessions

Neural networks for image classification

• Goal: Train a convolutional neural network for image classification

25
Goal for first sessions

Neural networks for image classification

• Goal: Train a convolutional neural network for image classification


• Goal: Understand the training of a convolutional neural network for
image classification

25
Goal of this course

Understand the training of a convolutional neural network for
image classification. Many notions are involved; we go through them backwards:

• Convolutional neural networks: special neural networks for images that
use local convolutions (e.g. 3 × 3 filters) in the first layers.
• Neural network: a specific architecture to compute a classifier (or a
regression) with parameters (weights) W to be trained at each layer.
• Training is done by optimizing a classification loss L(W) on a training
dataset: typically the last layer is a linear classifier trained with the cross-entropy loss.
• The optimization of the classification loss is done using stochastic
gradient descent on batches of training data.
• The gradient ∇L(W) is computed using backpropagation. 26
Machine learning – Timeline

Timeline of (deep) learning

1958   Perceptron
1969   "Perceptrons" book – the perceptron is criticized
1974   Backpropagation
~1980  Multilayer networks
1995   SVM reigns (Support Vector Machines, maximal-margin hyperplane)
1998   Convolutional Neural Networks for handwritten recognition
2006   Restricted Boltzmann Machine
2012   Google Brain project on 16k cores; AlexNet wins ImageNet
(with an "awkward silence", the AI winter, in between)

[Figure: timeline illustrated with an artificial neuron and an SVM maximal-margin hyperplane with support vectors.]

(Source: Lucas Masuch & Vincent Lepetit) 27


Perceptron
Machine learning – Perceptron

Perceptron

1958: Perceptron.   1969: "Perceptrons" book – the perceptron is criticized.

[Figure: artificial neuron – inputs X1–X4 with weights W11–W14, weighted sum, activation function, output f(x).]

(Source: Lucas Masuch & Vincent Lepetit) 28


Machine learning – Perceptron

Perceptron (Frank Rosenblatt, 1958)

First binary classifier based on supervised learning (discrimination).


Foundation of modern artificial neural networks.
At that time: technological, scientific and philosophical challenges.

29
Machine learning – Perceptron – Representation

Representation of the Perceptron

Parameters of the perceptron


• wk: synaptic weights
• b: bias

These are the real parameters to be estimated.

Training = adjusting the weights and biases


30
Machine learning – Perceptron – Inspiration

The origin of the Perceptron


Takes inspiration from the visual system known for its ability to learn patterns.

• When a neuron receives a stimulus


with high enough voltage, it emits
an action potential (aka, nerve
impulse or spike). It is said to fire.

• The perceptron mimics this activation effect: it fires only when

  ∑_i w_i x_i + b > 0

The decision is

  y = f(x; w) = sign(w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + b) = +1 for the first class, −1 for the second class.

31
Machine learning – Perceptron – Principle

1 Data are represented as vectors:

2 Collect training data with positive and negative examples:

(Source: Vincent Lepetit) 32


Machine learning – Perceptron – Principle

3 Training: find w and b so that:

• ⟨w, x⟩ + b is positive for positive samples x,
• ⟨w, x⟩ + b is negative for negative samples x,

where ⟨w, x⟩ = ∑_{i=1}^d w_i x_i = wᵀx is the dot product.

The equation ⟨w, x⟩ + b = 0 defines a hyperplane.
The hyperplane acts as a linear separator.
w is a normal vector to the hyperplane.

(Source: Vincent Lepetit) 32


Machine learning – Perceptron – Principle

4 Testing: the perceptron can now classify new examples.

• A new example x is classified positive if ⟨w, x⟩ + b is positive,
• and negative if ⟨w, x⟩ + b is negative.

(Signed) distance of x to the hyperplane:

  r = (⟨w, x⟩ + b) / ||w||

(Source: Vincent Lepetit) 32


Machine learning – Perceptron – Representation

Alternative representation

Use the zero-index to encode the bias as a synaptic weight.

Simplifies algorithms as all parameters can now be processed in the same way.

33
Machine learning – Perceptron – Training

Perceptron algorithm

Goal: find the vector of weights w from a labeled training dataset T

How: minimize classification errors


  min_w E(w) = − ∑_{(x,d)∈T, y≠d} d × ⟨w, x⟩ = ∑_{(x,d)∈T} max(−d × ⟨w, x⟩, 0)

• penalizes only misclassified samples (y ≠ d), for which d × ⟨w, x⟩ < 0,
• zero if all samples are correctly classified.

34
Machine learning – Perceptron – Training

Perceptron algorithm

• We assume that max(0, t) is differentiable with derivative 1 if t > 0 and 0 if t ≤ 0.

Algorithm: (stochastic) gradient descent for E(w) (see later); a code sketch is given after this slide.

• Initialize w randomly
• Repeat until convergence
  • For all (x, d) ∈ T (or a random subset T' ⊂ T)
    • Compute y = sign(⟨w, x⟩)
    • If y ≠ d, update: w ← w + γ d x

• Converges to some solution if the training data are linearly separable,


• But may pick any of many solutions of varying quality.

⇒ Poor generalization error, compared with SVM and logistic loss.


35
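A minimal NumPy sketch of this perceptron algorithm on a toy linearly separable dataset; the data and all names are illustrative, and the bias is folded into the weights by prepending a constant 1 feature, as in the "Alternative representation" slide.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy linearly separable data: labels d = sign of a fixed linear rule (illustrative)
    X = rng.normal(size=(200, 2))
    d = np.sign(X @ np.array([2.0, -1.0]) + 0.3)
    X = np.hstack([np.ones((200, 1)), X])          # bias trick: prepend a constant 1 feature

    w = np.zeros(3)                                # weights (bias included)
    gamma = 0.1                                    # step size / learning rate
    for epoch in range(100):
        errors = 0
        for x, t in zip(X, d):
            if np.sign(w @ x) != t:                # misclassified sample
                w += gamma * t * x                 # perceptron update: w <- w + gamma * d * x
                errors += 1
        if errors == 0:                            # all training samples correctly classified
            break
    print(epoch, w)

On linearly separable data this loop stops after finitely many updates, as stated on the slide; on non-separable data it would cycle forever.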
Machine learning – Perceptron – Perceptrons book

Perceptrons book (Minsky and Papert, 1969)

A perceptron can only classify data points that are linearly separable:

[Figure: a dataset that is not linearly separable (e.g., the XOR configuration), which a perceptron cannot classify.]

Seen by many as a justification to stop research on perceptrons.

(Source: Vincent Lepetit)

36
Artificial neural network
Machine learning – Artificial neural network

Artificial neural network

(Source: Lucas Masuch & Vincent Lepetit) 37


Machine learning – Artificial neural network

Artificial neural network

• Supervised learning method initially inspired by


the behavior of the human brain.

• Consists of the inter-connection of several


small units (just like in the human brain).

• Introduced in the late 50s, very popular in the


90s, reappeared in the 2010s with deep
learning.

• Also referred to as Multi-Layer Perceptron (MLP).

• Historically used after feature extraction.

38
Machine learning – Artificial neural network

Artificial neuron (McCulloch & Pitts, 1943)

• An artificial neuron contains several incoming weighted connections, an


outgoing connection and has a nonlinear activation function g.
• Neurons are trained to filter and detect specific features or patterns (e.g.
edge, nose) by receiving weighted input, transforming it with the
activation function and passing it to the outgoing connections.
• Unlike the perceptron, can be used for regression (with proper choice of g).

39
Machine learning – ANN

Artificial neural network / Multilayer perceptron / NeuralNet

• Inter-connection of several artificial


neurons (also called nodes or units).
• Each level in the graph is called a layer:
• Input layer,
• Hidden layer(s),
• Output layer.
• Each neuron in the hidden layers acts as a
classifier / feature detector.
• Feedforward NN (no cycle)
• first and simplest type of NN,
• information moves in one direction.
• Recurrent NN (with cycle)
• used for time sequences,
• such as speech-recognition.
40
Machine learning – ANN

Artificial neural network / Multilayer perceptron / NeuralNet

h_1 = g_1(w^1_{11} x_1 + w^1_{12} x_2 + w^1_{13} x_3 + b^1_1)
h_2 = g_1(w^1_{21} x_1 + w^1_{22} x_2 + w^1_{23} x_3 + b^1_2)
h_3 = g_1(w^1_{31} x_1 + w^1_{32} x_2 + w^1_{33} x_3 + b^1_3)
h_4 = g_1(w^1_{41} x_1 + w^1_{42} x_2 + w^1_{43} x_3 + b^1_4)            i.e.   h = g_1(W_1 x + b_1)

y_1 = g_2(w^2_{11} h_1 + w^2_{12} h_2 + w^2_{13} h_3 + w^2_{14} h_4 + b^2_1)
y_2 = g_2(w^2_{21} h_1 + w^2_{22} h_2 + w^2_{23} h_3 + w^2_{24} h_4 + b^2_2)   i.e.   y = g_2(W_2 h + b_2)

w^k_{ij} is the synaptic weight between previous node j and next node i at layer k.
g_k is any activation function applied to each coefficient of its input vector.
The matrices W_k and biases b_k are learned from labeled training data. 41
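A minimal PyTorch sketch of this two-layer network (3 inputs, 4 hidden units, 2 outputs). The choices g1 = ReLU and g2 = softmax are only illustrative; the slide leaves g1 and g2 unspecified.

    import torch
    import torch.nn as nn

    class SmallMLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer1 = nn.Linear(3, 4)   # W1 (4x3) and b1: h = g1(W1 x + b1)
            self.layer2 = nn.Linear(4, 2)   # W2 (2x4) and b2: y = g2(W2 h + b2)

        def forward(self, x):
            h = torch.relu(self.layer1(x))              # g1 = ReLU (example choice)
            y = torch.softmax(self.layer2(h), dim=-1)   # g2 = softmax (example choice)
            return y

    net = SmallMLP()
    x = torch.randn(5, 3)     # a batch of 5 input vectors
    print(net(x).shape)       # torch.Size([5, 2])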
Machine learning – ANN

Artificial neural network / Multilayer perceptron

It can have 1 hidden layer only (shallow network),


It can have more than 1 hidden layer (deep network),
each layer may have a different size, and
hidden and output layers often have different activation functions.
42
Machine learning – ANN

Artificial neural network / Multilayer perceptron


• As for the perceptron, the biases can be integrated into the weights:

  W_k h_{k−1} + b_k = (b_k | W_k) (1 ; h_{k−1}) = W̃_k h̃_{k−1},
  where W̃_k = (b_k | W_k) and h̃_{k−1} = (1 ; h_{k−1}).

• A neural network with L layers is a function of x parameterized by W̃:

  y = f(x; W̃)   where   W̃ = (W̃_1, W̃_2, . . . , W̃_L)

• It can be defined recursively as

  y = f(x; W̃) = h_L,   h_k = g_k(W̃_k h̃_{k−1})   and   h_0 = x.

• For simplicity, W̃ will be denoted W (when no confusion is possible).

43
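A quick NumPy check of this bias trick, with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))                  # Wk
    b = rng.normal(size=4)                       # bk
    h = rng.normal(size=3)                       # h_{k-1}

    W_tilde = np.hstack([b[:, None], W])         # W~k = (bk | Wk), shape (4, 4)
    h_tilde = np.concatenate([[1.0], h])         # h~_{k-1} = (1, h_{k-1})

    print(np.allclose(W @ h + b, W_tilde @ h_tilde))   # True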
Machine learning – ANN – Activation functions

Activation functions

Linear units: g(a) = a

  y = W_L h_{L−1} + b_L
  h_{L−1} = W_{L−1} h_{L−2} + b_{L−1}
  ⇒ y = W_L W_{L−1} h_{L−2} + W_L b_{L−1} + b_L
  ⇒ y = W_L · · · W_1 x + ∑_{k=1}^{L−1} W_L · · · W_{k+1} b_k + b_L

We can always find an equivalent network without hidden units,


because compositions of affine functions are affine.

In general, non-linearity is needed to learn complex (non-linear)


representations of data, otherwise the NN would be just a linear function.
Otherwise, back to the problem of nonlinearly separable datasets.

44
Machine learning – ANN – Activation functions

Activation functions

Threshold units: for instance the sign function

  g(a) = −1 if a < 0, +1 otherwise,

or the Heaviside (aka step) activation function

  g(a) = 0 if a < 0, 1 otherwise.

Discontinuities in the hidden layers make the optimization really difficult.
We prefer functions that are continuous and differentiable.

45
Machine learning – ANN – Activation functions

Activation functions

Sigmoidal units: for instance the hyperbolic tangent function

  g(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}) ∈ [−1, 1],

or the logistic sigmoid function

  g(a) = 1 / (1 + e^{−a}) ∈ [0, 1].

• In fact equivalent up to linear transformations: tanh(a/2) = 2 logistic(a) − 1.
• Differentiable approximations of the sign and step functions, respectively.
• Act as threshold units for large values of |a| and as linear units for small values.

46
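A one-line NumPy check of the relation tanh(a/2) = 2 logistic(a) − 1 (purely illustrative):

    import numpy as np

    a = np.linspace(-5, 5, 101)
    logistic = 1 / (1 + np.exp(-a))
    print(np.allclose(np.tanh(a / 2), 2 * logistic - 1))   # True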
Machine learning – ANN

Sigmoidal units: logistic activation functions are used in binary classification
(class C1 vs C2) as they can be interpreted as posterior probabilities:

  y = P(C1|x)   and   1 − y = P(C2|x)

The architecture of the network defines the shape of the separator
{x s.t. P(C1|x) = P(C2|x)}.

[Figure: decision boundaries of networks of increasing complexity/capacity.]

Complexity/capacity of the network: trade-off between generalization and overfitting.
47
Machine learning – ANN – Activation functions

Activation functions

“Modern” units:

  g(a) = max(a, 0)   (ReLU)        or        g(a) = log(1 + e^a)   (Softplus)

Most neural networks nowadays use ReLU
(rectified linear unit), max(a, 0),
for the hidden layers, since it trains
much faster, is more expressive than the
logistic function, and prevents the
vanishing gradient problem.

(Source: Lucas Masuch)


48
Machine learning – ANN

Neural networks solve non-linear separable problems

(Source: Vincent Lepetit) 49


Tasks, architectures and loss functions
Tasks, architectures and loss functions

Approximation – Least square regression


• Goal: Predict a real multivariate function.

• How: estimate the coefficients W of y = f (x; W )


from labeled training examples where labels are real vectors:

• Typical architecture:

• Hidden layer:

ReLU(a) = max(a, 0)

• Linear output:

g(a) = a
50
Tasks, architectures and loss functions

Approximation – Least square regression

• Loss: as for polynomial curve fitting, it is standard to consider the
sum of square errors (assumption of Gaussian distributed errors)

  E(W) = ∑_{i=1}^N ||y_i − d_i||₂² = ∑_{i=1}^N ||f(x_i; W) − d_i||₂²

and look for W* such that ∇E(W*) = 0.

• Solution: provided the network has enough flexibility and the size of the
training set grows to infinity,

  y* = f(x; W*) = E[d|x] = ∫ d p(d|x) dd   (the posterior mean).

51
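A minimal PyTorch sketch of this setup: a one-hidden-layer ReLU network with a linear output, trained with the sum-of-squares loss on made-up 1D data. All sizes, the learning rate and the data are illustrative choices, not from the slides.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.linspace(-3, 3, 200).unsqueeze(1)        # inputs x_i, shape (200, 1)
    d = torch.sin(x) + 0.1 * torch.randn_like(x)       # noisy real-valued targets d_i

    # One hidden ReLU layer, linear output (the "typical architecture" above)
    net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(net.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss(reduction="sum")              # sum of square errors E(W)

    for step in range(2000):
        opt.zero_grad()
        loss = loss_fn(net(x), d)                      # E(W) = sum_i ||f(x_i; W) - d_i||^2
        loss.backward()                                # gradient of E(W) (backpropagation)
        opt.step()
    print(loss.item())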
Tasks, architectures and loss functions

Multiclass classification – Multivariate logistic regression


(aka, multinomial classification)

• Goal: Classify an object x into one among K classes C1 , . . . , CK .

• How: Estimate the coefficients W of a multivariate function

  y = f(x; W) ∈ [0, 1]^K   s.t.   ∑_{k=1}^K y_k = 1,

from training examples T = {(x_i, d_i)} where d_i is a 1-of-K (one-hot) code:

• Class 1: d_i = (1, 0, . . . , 0)^T if x_i ∈ C1
• Class 2: d_i = (0, 1, . . . , 0)^T if x_i ∈ C2
• ...
• Class K: d_i = (0, 0, . . . , 1)^T if x_i ∈ CK

• y_k = f_k(x; W) is understood as the probability of x ∈ Ck.


• Remark: Do not use the class index k directly as a scalar label: the order
of the labels is not informative.
52
Tasks, architectures and loss functions

Multiclass classification – Multivariate logistic regression


• Typical architecture:
  • Hidden layer: ReLU(a) = max(a, 0)
  • Output layer: softmax(a)_k = exp(a_k) / ∑_{ℓ=1}^K exp(a_ℓ)

• Softmax maps R^K to the set of probability vectors {y ∈ (0, 1)^K, ∑_{k=1}^K y_k = 1}.
• Smooth version of the winner-takes-all activation model (maxout).
• The final decision function is winner-takes-all:

  argmax_k softmax(a)_k = argmax_k a_k


53
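A small NumPy sketch of the softmax and of the winner-takes-all property above; subtracting max(a) before exponentiating is a standard numerical-stability trick that does not change the result.

    import numpy as np

    def softmax(a):
        e = np.exp(a - np.max(a))          # subtract max(a) for numerical stability
        return e / e.sum()

    a = np.array([2.0, -1.0, 0.5, 3.0])
    y = softmax(a)
    print(y, y.sum())                      # a probability vector summing to 1
    print(np.argmax(y) == np.argmax(a))    # True: softmax preserves the winner-takes-all decision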
Tasks, architectures and loss functions

Multiclass classification – Multivariate logistic regression


• Loss: it is standard to consider the cross-entropy for K classes
(assumption of multinomially distributed data)

  E(W) = − ∑_{i=1}^N ∑_{k=1}^K d_{i,k} log y_{i,k}   with y_i = f(x_i; W) = softmax(a_i) ∈ (0, 1)^K
       = − ∑_{i=1}^N ( a_{i,d_i} − log ∑_{k=1}^K exp(a_{i,k}) )   with d_i the class of x_i,

and look for W* such that ∇E(W*) = 0.

• Solution: provided the network has enough flexibility and the size of the
training set grows to infinity,

  y_k* = f_k(x; W*) = P(Ck|x)   (the posterior probability).

54
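A small NumPy sketch checking that the two expressions of the cross-entropy above coincide, on made-up scores and labels (0-based class indices are used here purely for convenience):

    import numpy as np

    def softmax_rows(A):                               # row-wise softmax of the score matrix A
        E = np.exp(A - A.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 3))                        # N = 5 samples, K = 3 class scores a_i
    d = np.array([0, 2, 1, 0, 2])                      # class indices d_i (0-based here)
    D = np.eye(3)[d]                                   # corresponding one-hot codes d_i

    Y = softmax_rows(A)
    E1 = -np.sum(D * np.log(Y))                                          # -sum_i sum_k d_ik log y_ik
    E2 = -np.sum(A[np.arange(5), d] - np.log(np.exp(A).sum(axis=1)))     # log-sum-exp form
    print(E1, E2, np.allclose(E1, E2))                                   # both expressions coincide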
Multivariate logistic regression
Multiclass classification – Multivariate logistic regression

• SVMs allow for multiclass classification but are not easily pluggable into
neural networks.
• Instead, neural networks generally use multivariate logistic regression.

Goal of this section:

• Mathematics of multivariate logistic regression.


• Reference: Section ”4.3.4 Multiclass logistic regression” of
C. M. Bishop, Pattern Recognition and Machine Learning, Information
Science and Statistics, Springer, 2006

55
Multivariate logistic regression

New notation

• Goal: Classify an object x into one among K classes C1 , . . . , CK .


• Training set: T = {(xn , tn ), n = 1, . . . , N }, tn ∈ {1, . . . , K} encodes the
class of xn .
• Each tn is transformed into a vector tn ∈ {0, 1}K with a 1-of-K code:
• Class 1: t_n = (1, 0, . . . , 0)^T if x_n ∈ C1, i.e. t_n = 1,
• Class 2: t_n = (0, 1, . . . , 0)^T if x_n ∈ C2, i.e. t_n = 2,
• ...
• Class K: t_n = (0, 0, . . . , 1)^T if x_n ∈ CK, i.e. t_n = K.

• Remark: Do not use the class index k directly as a scalar label: the order
of the labels is not informative.

56
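A tiny NumPy sketch of this 1-of-K transformation of the labels t_n (illustrative values):

    import numpy as np

    K = 4
    t = np.array([1, 3, 2, 1, 4])          # class indices t_n in {1, ..., K}
    T = np.eye(K)[t - 1]                   # each row is the 1-of-K code t_n
    print(T)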
Multivariate logistic regression

Feature transform
• We apply a feature transform ϕ : Rp → RD to each xn :

ϕn = ϕ(xn ), n = 1, . . . , N.

• Depending on the context, it allows one to increase (D > p) or decrease (D < p)
the dimension in a way that favors class discrimination (e.g. PCA...).
• This is a nonlinear map that should make the classes linearly separable.

57
Multivariate logistic regression

Linear classifier
We will consider linear multiclass classifiers in feature space R^D:
• Classifier parameters: each class k has a weight vector w_k ∈ R^D and a bias b_k ∈ R.
• Class separation: is w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ ?
• Class k region:

  {ϕ ∈ R^D, ∀ℓ ≠ k, w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ}
  = ∩_{ℓ=1, ℓ≠k}^K {ϕ ∈ R^D, w_k^T ϕ + b_k > w_ℓ^T ϕ + b_ℓ}
  = intersection of K − 1 half-spaces.

• The classification partition is made of (unbounded) convex polyhedra (in
feature space).
58
Multivariate logistic regression

Bias trick for linear classifier

• Classifier parameters: each class k has a weight vector w_k ∈ R^D and a bias b_k ∈ R.
• Add an additional dummy coordinate 1 to ϕ so that

  w_k^T ϕ + b_k = (w_k ; b_k)^T (ϕ ; 1) = w̃_k^T ϕ̃.

• From now on this is implicit: we assume that the feature transform has
a 1 component, so that w_k^T ϕ has an implicit bias component.
• The classifier parameters are then just a set of weight vectors
{w_k, k = 1, . . . , K}.
• What are the boundaries if we forget the bias?

59
Multivariate logistic regression

• After the feature transform the training set is T = {(ϕ_n, t_n), n = 1, . . . , N}.
• We want to estimate

  y = f(ϕ) ∈ [0, 1]^K   s.t.   ∑_{k=1}^K y_k = 1,

such that ideally y_k ≃ p(Ck|ϕ) is an estimate of the posterior probability

  p(Ck|ϕ) = probability of being in class Ck given the feature vector ϕ.

• Model assumption: the posterior probability p(Ck|ϕ) given the feature is a
softmax transformation of a linear function of the feature variable:
there exist K vectors w_1, . . . , w_K ∈ R^D such that

  y_k(ϕ) = p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ),   k = 1, . . . , K.

• By construction of the softmax, one has y ∈ (0, 1)^K s.t. ∑_{k=1}^K y_k = 1.
60
Multivariate logistic regression

• Model assumption: there exist K vectors w_1, . . . , w_K ∈ R^D such that

  y_k(ϕ) = p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ),   k = 1, . . . , K.

• We denote by W = (w_1, . . . , w_K) ∈ R^{D×K} the matrix containing all the
weights.

Training:

• Training = find the best weight matrix W to explain the dataset.
• Performed using maximum likelihood.

61
Multivariate logistic regression

Likelihood: Assume a multinomial model for the classes

• For each ϕ, associate the multinomial random variable T(ϕ) that takes the value k with probability

  p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ).

• The realizations (ϕ_n, t_n) of the dataset are assumed independent.
• Then the likelihood of the dataset is

  P((T(ϕ_1), . . . , T(ϕ_N)) = (t_1, . . . , t_N)) = ∏_{n=1}^N P(T(ϕ_n) = t_n) = ∏_{n=1}^N p(C_{t_n}|ϕ_n).

• We want to maximize the likelihood with respect to W = (w_1, . . . , w_K) ∈ R^{D×K},
the matrix containing all the weights.
• We minimize L(W) = − log P instead (i.e., maximize the log-likelihood).

62
Multivariate logistic regression

Log-likelihood:

• Model assumption: there exist K vectors w_1, . . . , w_K ∈ R^D such that

  y_k(ϕ) = p(Ck|ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ),   k = 1, . . . , K.

• Expression of the negative log-likelihood:

  L(W) = − log ∏_{n=1}^N p(C_{t_n}|ϕ_n)
       = − ∑_{n=1}^N log y_{t_n}(ϕ_n)
       = − ∑_{n=1}^N ( w_{t_n}^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

63
Multivariate logistic regression

Log-likelihood:

• Expression of the negative log-likelihood:

  L(W) = − ∑_{n=1}^N log y_{t_n}(ϕ_n) = − ∑_{n=1}^N ( w_{t_n}^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• Alternative expression with the 1-of-K code: recall that

  t_{n,k} = 1 if k = t_n, 0 otherwise,   so that   w_{t_n} = ∑_{k=1}^K t_{n,k} w_k.

Hence

  L(W) = − ∑_{n=1}^N ( ∑_{j=1}^K t_{n,j} w_j^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• One can show that W ↦ L(W) is convex.
• What do we need to optimize L(W)?

64
Multivariate logistic regression

Gradient of the negative log-likelihood:

  L(W) = − ∑_{n=1}^N ( ∑_{j=1}^K t_{n,j} w_j^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• Linear part: partial gradient with respect to column w_ℓ, ℓ ∈ {1, . . . , K}:

  ∇_{w_ℓ} [ ∑_{j=1}^K t_{n,j} w_j^T ϕ_n ] = ∇_{w_ℓ} [ t_{n,ℓ} w_ℓ^T ϕ_n + constant ] = t_{n,ℓ} ϕ_n.

• Partial gradient of the log-sum-exp term?

  ∇_{w_ℓ} log ∑_{j=1}^K exp(w_j^T ϕ_n) = ∇_{w_ℓ} log( exp(w_ℓ^T ϕ_n) + constant ) = ?

65
Multivariate logistic regression

Gradient of the negative log-likelihood:
Recall that for f : R^n → R and g : R → R,

  ∇(g ∘ f)(x) = g'(f(x)) ∇f(x).

Here,

  g(t) = log(exp(t) + c),   g'(t) = exp(t) / (exp(t) + c),
  f(w_ℓ) = w_ℓ^T ϕ_n,       ∇f(w_ℓ) = ϕ_n.

So

  ∇_{w_ℓ} log ∑_{j=1}^K exp(w_j^T ϕ_n) = ( exp(w_ℓ^T ϕ_n) / ∑_{j=1}^K exp(w_j^T ϕ_n) ) ϕ_n = y_ℓ(ϕ_n) ϕ_n,

since y_k(ϕ) = exp(w_k^T ϕ) / ∑_{j=1}^K exp(w_j^T ϕ), k = 1, . . . , K.

66
Multivariate logistic regression

Gradient of the negative log-likelihood:

  L(W) = − ∑_{n=1}^N ( ∑_{j=1}^K t_{n,j} w_j^T ϕ_n − log ∑_{j=1}^K exp(w_j^T ϕ_n) ).

• For each column w_ℓ ∈ R^D of W, ℓ ∈ {1, . . . , K},

  ∇_{w_ℓ} L(W) = − ∑_{n=1}^N ( t_{n,ℓ} − y_ℓ(ϕ_n) ) ϕ_n = ∑_{n=1}^N ( y_ℓ(ϕ_n) − t_{n,ℓ} ) ϕ_n ∈ R^D.

• Full gradient for W:

  ∇L(W) = ∑_{n=1}^N ϕ_n ( y(ϕ_n) − t_n )^T ∈ R^{D×K}.

• OK with intuition?

Optimization:
• We can apply the gradient descent algorithm to minimize L. 67
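A minimal NumPy sketch of L(W) and of this gradient formula, with made-up features and labels (Phi stores one ϕ_n per row, with the implicit 1 component for the bias; all names and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 100, 5, 3
    Phi = np.hstack([rng.normal(size=(N, D - 1)), np.ones((N, 1))])   # features phi_n, bias trick
    t = rng.integers(1, K + 1, size=N)                                # class labels t_n in {1,...,K}
    T = np.eye(K)[t - 1]                                              # 1-of-K codes t_n

    def y_of(Phi, W):                                 # y_k(phi_n): row-wise softmax of Phi @ W
        A = Phi @ W
        E = np.exp(A - A.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    def loss(W):                                      # L(W) = -sum_n log y_{t_n}(phi_n)
        return -np.sum(T * np.log(y_of(Phi, W)))

    def grad(W):                                      # sum_n phi_n (y(phi_n) - t_n)^T
        return Phi.T @ (y_of(Phi, W) - T)

    W = np.zeros((D, K))
    for _ in range(500):                              # plain gradient descent on L
        W -= 1e-2 * grad(W)
    print(loss(W))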
Machine learning – Optimization – Gradient descent

An iterative algorithm trying to find a minimum of a real function.

Gradient descent

• Let F be a real, coercive, twice-differentiable function such that

  ||∇²F(x)||₂ ≤ L   for some L > 0   (∇²F is the Hessian matrix of F).

• Then, whatever the initialization x(0), if 0 < γ < 2/L, the sequence

  x(n+1) = x(n) − γ ∇F(x(n))     (step in the direction of greatest descent)

converges to a stationary point x*, i.e., a point that cancels the gradient: ∇F(x*) = 0.

• The parameter γ is called the step size (or learning rate in the ML field).
• A too small step size γ leads to slow convergence.
68
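A minimal Python sketch of this iteration on a simple quadratic function (purely illustrative):

    import numpy as np

    def F(x):                         # a simple coercive, twice-differentiable function
        return 0.5 * x @ x

    def grad_F(x):
        return x                      # here ||grad^2 F||_2 = 1, so any 0 < gamma < 2 works

    x = np.array([5.0, -3.0])
    gamma = 0.5                       # step size (learning rate)
    for n in range(100):
        x = x - gamma * grad_F(x)
    print(F(x), x)                    # close to the stationary point x* = 0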
Machine learning – ANN – Optimization – Gradient descent

[Figure: gradient descent iterations in one dimension and in two dimensions, converging to stationary points.]

69
Multivariate logistic regression

Gradient of the negative log-likelihood: ∇L(W) = ∑_{n=1}^N ϕ_n (y(ϕ_n) − t_n)^T ∈ R^{D×K}.

Optimization:
• Problem: in machine learning, the larger the dataset the better... but then
more and more computation is needed for the gradient.
• Solution: use (averaged) stochastic gradient descent:
  • Draw randomly a small subset S ⊂ T of the training set.
  • Compute a noisy gradient with this small set only and update the weights:

    W(n) = W(n−1) − γ(n) ∇L(W(n−1); S) = W(n−1) − γ(n) ∑_{n∈S} ϕ_n (y(ϕ_n) − t_n)^T,

  and compute the averaged weights

    W̄(n) = 1/(n+1) ∑_{k=0}^n W(k) = n/(n+1) W̄(n−1) + 1/(n+1) W(n).

• Convergence results hold for L (strongly) convex if γ(n) decays appropriately, etc.

70
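A minimal NumPy sketch of this averaged stochastic gradient descent for the multiclass logistic loss above, on synthetic data; the batch size, step-size decay and number of iterations are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 1000, 5, 3
    Phi = np.hstack([rng.normal(size=(N, D - 1)), np.ones((N, 1))])   # features + bias trick
    t = np.argmax(Phi @ rng.normal(size=(D, K)), axis=1)              # synthetic labels from a linear rule
    T = np.eye(K)[t]                                                  # 1-of-K codes

    def y_of(Phi, W):                                  # row-wise softmax of the scores Phi @ W
        A = Phi @ W
        E = np.exp(A - A.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    W = np.zeros((D, K))
    W_bar = W.copy()                                   # averaged weights W_bar^(n)
    for n in range(1, 2001):
        S = rng.choice(N, size=32, replace=False)      # random small subset S of the training set
        noisy_grad = Phi[S].T @ (y_of(Phi[S], W) - T[S])
        gamma = 0.1 / np.sqrt(n)                       # decaying step size gamma^(n)
        W = W - gamma * noisy_grad
        W_bar = (n / (n + 1)) * W_bar + (1 / (n + 1)) * W   # running average of the iterates
    print(np.mean(np.argmax(y_of(Phi, W_bar), axis=1) == t))  # training accuracy of the averaged model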
Non-convexity in machine learning

But for neural networks the cost is not convex...

71
Machine learning – Timeline

Timeline of (deep) learning

1958   Perceptron
1969   "Perceptrons" book – the perceptron is criticized
1974   Backpropagation
~1980  Multilayer networks
1995   SVM reigns (Support Vector Machines, maximal-margin hyperplane)
1998   Convolutional Neural Networks for handwritten recognition
2006   Restricted Boltzmann Machine
2012   Google Brain project on 16k cores; AlexNet wins ImageNet
(with an "awkward silence", the AI winter, in between)

[Figure: timeline illustrated with an artificial neuron and an SVM maximal-margin hyperplane with support vectors.]

(Source: Lucas Masuch & Vincent Lepetit) 72


Backpropagation
Questions?

Next class: Backpropagation

Slides from Charles Deledalle


Sources, images courtesy and acknowledgment

• K. Chatfield, P. Gallinari, C. Hazırbaş, A. Horodniceanu, Y. LeCun, V. Lepetit, L. Masuch, A. Ng, M. Ranzato

72
