Lecture Slides 2 - Neural Networks - 2021
Dr Etienne Pienaar
etienne.pienaar@uct.ac.za Room 5.43
October 5, 2021
Feed-forward Neural Networks
Based on Nielsen (2015).
Feed-forward Neural Network

y_i ≈ a = ∑_{k=1}^{d_0 = p} x_{ik} w_k + b,
Feed-forward Neural Network

π_i ≈ a = ∑_{k=1}^{d_0 = p} x_{ik} w_k + b,
Feed-forward Neural Network

π_i ≈ a = σ( ∑_{k=1}^{d_0 = p} x_{ik} w_k + b ),
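As a concrete illustration of the update above, a single unit can be evaluated in a few lines of NumPy; the logistic choice of σ and the weight/bias values below are assumptions for illustration only:

```python
import numpy as np

def unit(x, w, b):
    # pi_i ≈ a = sigma( sum_{k=1}^{p} x_k w_k + b ), with a logistic sigma
    z = x @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# p = 3 inputs with made-up weights and bias
a = unit(np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.1, -0.4]), 0.2)
```

Replacing the logistic function with the identity recovers the linear form of the previous slide.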
Feed-forward Neural Network
w denotes the kj -th weight parameter linking the k -th node in layer
l
l
training example!
9/42
Model `Class'?

Each hidden layer applies an activation function σ(.). Typically, the activation functions are chosen to be the same for all hidden layers.
The activation function for the output layer is usually chosen in
conjunction with an appropriate cost function (objective).
Naturally, the latter is also determined by the measurement scale of
the target variable/output.
Design choices for the network are often motivated within the context of the particular application, though one should not be tempted to attribute performance solely to the design parameters: the particular hypothesis chosen at the learning phase may not be statistically distinguishable from other designs!
[Diagram: a single-hidden-layer network. Inputs x_{i1}, x_{i2} feed via weights w^1_{kj} and biases b^1_j into hidden activations a^1_j, which feed via second-layer weights into the output a^2_1 ≈ y_i; the linear predictor ∑_j x_{ij} β_j ≈ y_i is shown alongside for comparison.]
[Diagram: the same network with a hidden-node evaluation written out, e.g. a^1_1 = σ( ∑_j x_{ij} β_j + β_1 ), showing that each hidden node is a logistic-regression-type unit.]
In vector form, let x_i = [x_{i1}, x_{i2}]^T denote the input/predictor vector.
Dimensions: d_0 = 2, d_1 = 3, d_2 = 1.
Evaluation of layer 1:

a^1 = σ( W^{1T} x_i + b^1 )   (a 3 × 1 matrix).
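A minimal NumPy sketch of this layer evaluation for the d_0 = 2, d_1 = 3, d_2 = 1 network; the weight and bias values are made up, and σ is taken to be logistic:

```python
import numpy as np

def sigma(z):
    # logistic activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for the d0 = 2, d1 = 3, d2 = 1 network
W1 = np.array([[0.2, -0.5, 0.1],
               [0.4,  0.3, -0.2]])     # shape (d0, d1)
b1 = np.array([0.1, 0.0, -0.1])        # shape (d1,)
W2 = np.array([[0.7], [-0.3], [0.5]])  # shape (d1, d2)
b2 = np.array([0.05])                  # shape (d2,)

x_i = np.array([0.5, -1.0])            # input vector, shape (d0,)

a1 = sigma(W1.T @ x_i + b1)            # a^1 = sigma(W^{1T} x_i + b^1), a 3 x 1 quantity
a2 = sigma(W2.T @ a1 + b2)             # output layer, a 1 x 1 quantity
```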
Activation Functions

[Plot: the Logistic, Tanh, and ReLU activation functions f(z) for z ∈ [−4, 4].]
Activation Functions

[Plot: the derivatives ∂f/∂z of the Logistic, Tanh, and ReLU activations for z ∈ [−4, 4].]
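The saturation visible in the derivative plot can be checked numerically; a sketch of the three activations and their derivatives (logistic, tanh, ReLU, as in the legend):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_logistic(z):
    s = logistic(z)
    return s * (1.0 - s)            # peaks at 0.25, vanishes in the tails

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2    # peaks at 1, also vanishes in the tails

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)    # exactly 1 for z > 0: no saturation there
```

The small tail derivatives of the logistic and tanh functions are the saturation issue referred to in the cost-function slides.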
Output Layer

The output layer relates to the final evaluation of the updating equation that defines the model. Since this is the layer that interfaces with the data via the cost function, it is usually chosen with reference to the problem at hand.
For regression-type problems, where responses are continuous, linear activations are useful. For these purposes, we often use σ(z) = z.
For classification-type problems, where responses are categorical, we choose activations that match the range of the encoded outputs. This could be logistic, Tanh or even linear (not recommended); it is often more useful to ensure that the output layer behaves like a distribution. For these purposes, it is recommended that one uses the so-called soft-max function:

σ(z)_j = e^{z_j} / ∑_{k=1}^{d_L} e^{z_k},   for j = 1, 2, ..., d_L.

The resulting outputs can then be treated as probabilities and predictions made accordingly.
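A sketch of the soft-max in NumPy; subtracting max(z) before exponentiating is a standard numerical-stability device (an implementation detail, not part of the definition above):

```python
import numpy as np

def softmax(z):
    # e^{z_j} / sum_k e^{z_k}; the shift by max(z) cancels in the ratio
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
```

The entries of p are positive and sum to one, so they can be read as class probabilities.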
Cost/Objective Functions
Based on various sources.
Cost Functions
Cost Functions: Squared Error Loss

C_MSE = (1 / 2N) ∑_{i=1}^{N} (y_i − ŷ_i)²

Post-processing of predictions can easily lead to incorrectly reporting prediction performance, since the objective function was not minimised whilst explicitly accounting for the processing mechanism.
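The squared-error cost translates directly into code; a minimal sketch:

```python
import numpy as np

def mse_cost(y, y_hat):
    # C_MSE = 1/(2N) * sum_i (y_i - yhat_i)^2
    N = len(y)
    return np.sum((y - y_hat) ** 2) / (2.0 * N)

c = mse_cost(np.array([1.0, 2.0]), np.array([1.0, 0.0]))  # (0 + 4) / (2*2) = 1.0
```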
Cost Functions: Cross-Entropy Error

This objective does not suffer from the saturation issue mentioned previously: the gradient of the objective w.r.t. the parameters of the network is less prone to bottoming out when the network is far from the global solution.
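For a binary 0/1 target, the cross-entropy objective can be sketched as follows (the binary form and the eps guard are illustrative assumptions):

```python
import numpy as np

def cross_entropy_cost(y, p):
    # C_CE = -1/N * sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    eps = 1e-12                        # guards against log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

good = cross_entropy_cost(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
bad  = cross_entropy_cost(np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```

The cost increases as predictions drift from the targets, and its gradient does not flatten out for confidently wrong predictions, which is the point made above.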
Choice of Cost Functions

[Surface plots: the cross-entropy cost C_CE as a function of the weights w^{[1]} and w^{[2]}.]
Fitting Neural Networks
Tackle Problem 1, (a), (b), and (c) in the Neural Networks section of the
exercises in your lecture notes.
Numerical Techniques for
Finding Extrema
Numerical Techniques for Finding Extrema

Newton's method sets up an iterative equation that attempts to seek out a critical point of a function based on first- and second-order properties of the function.
Suppose at the present iteration we are at w_k (which could also be the initial value) and f′(w_k) ≠ 0. Then

w_{k+1} = w_k − f(w_k) / f′(w_k).
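The iteration is a few lines of code; a sketch with a simple root-finding example (the function and starting value are made up):

```python
def newton_1d(f, f_prime, w0, tol=1e-12, max_iter=50):
    # w_{k+1} = w_k - f(w_k) / f'(w_k)
    w = w0
    for _ in range(max_iter):
        step = f(w) / f_prime(w)
        w = w - step
        if abs(step) < tol:
            break
    return w

# Example: f(w) = w^2 - 2 has a root at sqrt(2)
root = newton_1d(lambda w: w * w - 2.0, lambda w: 2.0 * w, w0=1.0)
```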
Newton's Method in 1D
Newton's Method

where

∇f(w_k) = [ ∂f/∂w_x (w_k),  ∂f/∂w_y (w_k) ]^T

and

Γ(w_k) = [ ∂²f/∂w_x² (w_k)       ∂²f/∂w_x∂w_y (w_k)
           ∂²f/∂w_x∂w_y (w_k)    ∂²f/∂w_y² (w_k) ].
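Combining the gradient with the Hessian Γ gives the two-dimensional update w_{k+1} = w_k − Γ(w_k)^{−1} ∇f(w_k); a sketch on a made-up quadratic, for which Newton converges in a single step:

```python
import numpy as np

def newton_2d(grad, hess, w0, tol=1e-12, max_iter=50):
    # w_{k+1} = w_k - Gamma(w_k)^{-1} grad f(w_k); solve the system, don't invert
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(w), grad(w))
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w

# Example: f(w) = (w_x - 1)^2 + 2 (w_y + 3)^2, minimised at (1, -3)
grad = lambda w: np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 3.0)])
hess = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
w_star = newton_2d(grad, hess, [0.0, 0.0])
```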
Newton's Method in 2D
Fitting Neural Networks
Gradient Descent

Gradient Descent gets rid of the second-order elements and makes the step-size purely proportional to the gradient:

w_{k+1} = w_k − λ∇f(w_k);   k = 1, 2, . . .

where λ is a control parameter for the algorithm (henceforth referred to as the `learning rate').
This has the advantage that the algorithm is much more computationally efficient.
Note that, for a p-parameter function, we would require all the second-order cross-derivatives at each step of the algorithm if we wished to use Newton's method.
Gradient Descent would only require evaluation of the first-order derivatives.
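A sketch of the update on a made-up quadratic; the value of λ and the iteration count are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, w0, lam=0.1, max_iter=500):
    # w_{k+1} = w_k - lambda * grad f(w_k): first-order information only
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w = w - lam * grad(w)
    return w

# Example: f(w) = (w_x - 1)^2 + 2 (w_y + 3)^2, minimised at (1, -3)
grad = lambda w: np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 3.0)])
w_star = gradient_descent(grad, [0.0, 0.0], lam=0.1)
```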
Gradient Descent with Momentum
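Momentum augments the gradient-descent update with a velocity term that accumulates past gradients; a sketch of the classical (heavy-ball) formulation, where the coefficients γ and λ are illustrative assumptions:

```python
import numpy as np

def gd_momentum(grad, w0, lam=0.1, gamma=0.9, max_iter=500):
    # v_{k+1} = gamma * v_k + lambda * grad f(w_k);  w_{k+1} = w_k - v_{k+1}
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(max_iter):
        v = gamma * v + lam * grad(w)
        w = w - v
    return w

# Example: f(w) = (w_x - 1)^2 + 2 (w_y + 3)^2, minimised at (1, -3)
grad = lambda w: np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 3.0)])
w_star = gd_momentum(grad, [0.0, 0.0])
```

The velocity damps oscillation across steep directions and speeds progress along shallow ones.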
Stochastic Gradient Descent
Passing over subsets of the data led to the convention that the length of time that a `learning' algorithm runs is expressed in terms of epochs, one epoch being a full pass over the dataset.
Often software parametrises such procedures in terms of batch-size (or `mini-batch'), i.e., the number of observations in each subset.
At the other extreme from the full data set, it is also possible to look at a single training example at a time. This is sometimes referred to as `online', though formally in the literature `online learning' is taken to mean something else.
In practice one has to manage the trade-off between computational gains and the convergence of the algorithm, so batch sizes have to be small, but not so small as to result in poor convergence.
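The epoch/mini-batch bookkeeping described above can be sketched as follows; the least-squares example and all tuning values are made up for illustration:

```python
import numpy as np

def sgd(grad_batch, w0, n, lam=0.5, batch_size=8, epochs=50, seed=0):
    # grad_batch(w, idx): gradient of the cost over the observations in idx.
    # One epoch = one full shuffled pass over the n observations,
    # updating once per mini-batch of size batch_size.
    rng = np.random.default_rng(seed)
    w = w0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lam * grad_batch(w, idx)
    return w

# Example: fit y = w*x by least squares on data generated with w = 2
x = np.linspace(0.1, 1.0, 32)
y = 2.0 * x
grad_batch = lambda w, idx: np.mean(2.0 * (w * x[idx] - y[idx]) * x[idx])
w_star = sgd(grad_batch, w0=0.0, n=32)
```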
References I

Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.