Lecture Slides 2 - Neural Networks - 2021

The document discusses feed-forward neural networks. It describes the basic structure and updating equations of a feed-forward neural network, including nodes, layers, weights, biases, and activation functions. It also notes some design choices like the number of hidden layers and nodes as well as activation functions.

Lecture 5: Neural Networks.

Dr Etienne Pienaar
etienne.pienaar@uct.ac.za  Room 5.43

University of Cape Town

October 5, 2021

Feed-forward Neural Networks
Based on Nielsen (2015).

Feed-forward Neural Network

Consider modeling a dataset consisting of observations of $p$ inputs $x_i^T = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ and a single output $y_i$ for $i = 1, 2, \ldots, N$ (a total of $N$ training examples). A simple linear model follows:

$$y_i \approx a = \sum_{k=1}^{d_0 = p} x_{ik} w_k + b,$$

for all observations $i = 1, 2, \ldots, N$.

If our approximation is good, then $a$ will be close to $y_i$ by some chosen objective measure (over all $i$).

Feed-forward Neural Network

What if the nature of the response is such that it is categorical? If it's binary, we can encode it as $y_i \in \{0, 1\}$ and pick the objective to be parametrised in terms of a probability, say $\pi_i = \Pr(Y_i = 1)$, in which case

$$\pi_i \approx a = \sum_{k=1}^{d_0 = p} x_{ik} w_k + b,$$

for all observations $i = 1, 2, \ldots, N$.

But then we'd have to constrain the parameters in the linear component in order to ensure $a \in [0, 1]$.

Feed-forward Neural Network

Alternatively, rather than constraining the parameters in the linear component to ensure $a \in [0, 1]$, we can transform the linear component appropriately using some activation function $\sigma(\cdot)$ which maps its argument to $[0, 1]$:

$$\pi_i \approx a = \sigma\left(\sum_{k=1}^{d_0 = p} x_{ik} w_k + b\right),$$

for all observations $i = 1, 2, \ldots, N$.

Feed-forward Neural Network

Over and above generalising configurations for the response, if we are in the world of non-linear problems, further generalisation of this idea relies on constructing a non-linear basis for making predictions:

$$a = \mathrm{Basis}(x_i, \theta),$$

for all observations $i = 1, 2, \ldots, N$.

Feed-forward Neural Network

Feed-forward neural networks give such a basis by extending our earlier ideas through recursion. For example, one recursive step:

$$a^1_j = \sigma_1\left(\sum_{k=1}^{d_0 = p} x_{ik} w^1_{kj} + b^1_j\right), \quad \text{for } j = 1, 2, \ldots, d_1,$$

$$a^2 = \sigma_2\left(\sum_{k=1}^{d_1} a^1_k w^2_k + b^2\right),$$

for all observations $i = 1, 2, \ldots, N$.

Each row (iteration) constitutes a `layer' (for which we incur the superscripts in our indexing), and the final layer is what our non-linear basis returns as a prediction of the response. Each element (along index $j$) in the first layer constitutes a `node' in that layer.

Feed-forward Neural Network

In a more general form, we can encapsulate this idea in the simple (despite the tedious indexing) recursive relation:

$$a^l_j = \sigma_l\left(\sum_{k=1}^{d_{l-1}} a^{l-1}_k w^l_{kj} + b^l_j\right), \quad l = 1, 2, \ldots, L; \; j = 1, 2, \ldots, d_l.$$

This is the scalar form of the updating equations which defines feed-forward neural networks. (Scalar because, for each observation, a node is a single number: a scalar.)

Feed-forward Neural Network

$\sigma_l(\cdot)$ denotes an activation function on layer $l$ (e.g., sigmoid or hyperbolic tangent),

$d_{l-1}$ denotes the number of nodes in layer $l-1$ (the number of nodes in layer $l$ is $d_l$),

$w^l_{kj}$ denotes the $kj$-th weight parameter linking the $k$-th node in layer $l-1$ and the $j$-th node in layer $l$ (although we connect from layer $l-1$ to layer $l$, we use the convention of indexing the weights using the superscript $l$, the forward layer),

$b^l_j$ denotes the $j$-th bias in layer $l$, and the equation is evaluated subject to the initial conditions $a^0_j = x_{ij}$ for all $j$ at the $i$-th training example. The updating equation thus needs to be evaluated for each training example!
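The scalar updating equation can be sketched directly as a loop over layers. This is a minimal illustration rather than a reference implementation: the function name `forward`, the nested-list layout of the weights, and all numerical values are assumptions made for the example.

```python
import math

def logistic(z):
    """Logistic activation, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases, activations):
    """Evaluate a_j^l = sigma_l(sum_k a_k^{l-1} w_kj^l + b_j^l) for l = 1..L.

    x            : list of d_0 inputs for one training example (a^0 = x_i)
    weights[l-1] : d_{l-1} x d_l nested list, entry [k][j] is w_kj^l
    biases[l-1]  : list of d_l biases b_j^l
    activations  : list of L functions, one per layer
    """
    a = list(x)  # initial condition a^0_j = x_ij
    for W, b, sigma in zip(weights, biases, activations):
        d_prev, d_cur = len(W), len(b)
        a = [sigma(sum(a[k] * W[k][j] for k in range(d_prev)) + b[j])
             for j in range(d_cur)]
    return a

# Hypothetical network: 2 inputs, 3 hidden nodes, 1 output (values made up).
W1 = [[0.1, -0.2, 0.3],
      [0.4, 0.5, -0.6]]       # 2 x 3
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7], [-0.8], [0.9]]   # 3 x 1
b2 = [0.05]

x_i = [1.0, 2.0]
prediction = forward(x_i, [W1, W2], [b1, b2], [logistic, logistic])
```

The call has to be repeated for every training example, which is exactly the point made above.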

Model `Class'?

Though we restrict ourselves to the feed-forward architecture for now, this structure permits quite a bit of freedom w.r.t. defining models.
In addition to the dimensions of the model, i.e., the number of nodes in each hidden layer, $d_{l-1}$, and the number of hidden layers, $L - 1$, the user defines the kinds of activation function used in each layer, $\sigma_l(\cdot)$.
Typically, the activation functions are chosen to be the same for hidden layers.
The activation function for the output layer is usually chosen in conjunction with an appropriate cost function (objective).
Naturally, the latter is also determined by the measurement scale of the target variable/output.
Design choices for the network are often motivated within the context of the particular application, though one should not be tempted to attribute performance solely to the design parameters: the particular hypothesis chosen at the learning phase may not be distinguishable statistically from other designs!
[Figure: Directed graph of a normal linear model (left) vs. that of a standard feed-forward neural network, specifically, a $(d_1)$-network (right). Weights shown for the first columns of $W^1$ and $W^2$.]

[Figure: Directed graph of a logistic regression model (left) vs. that of a (3)-network (right) for two predictors on a single response. Weights shown for the first columns of $W^1$ and $W^2$. Here, $a^1_1 = \sigma_1\left(\sum_{j=1}^{2} w^1_{j1} x_{ij} + b^1_1\right)$ and $a^2_1 = \sigma_2\left(\sum_{j=1}^{3} w^2_{j1} a^1_j + b^2_1\right)$.]

In vector form, let $x_i = [x_{i1}, x_{i2}]^T$ denote the input/predictor vector (the input layer), $a^l = [a^l_1, a^l_2, \ldots, a^l_{d_l}]^T$ (a $d_l \times 1$ matrix) denote the vector of activations in layer $l$, $W^l$ the $d_{l-1} \times d_l$ weight matrix connecting layers $l-1$ and $l$, and $b^l = [b^l_1, b^l_2, \ldots, b^l_{d_l}]^T$ the vector of bias terms on layer $l$. Then, for a (3)-network with two inputs on a single response, we have:

Dimensions: $d_0 = 2$, $d_1 = 3$, $d_2 = 1$.

Evaluation of layer 1 (a $3 \times 1$ matrix):

$$a^1 = \sigma_1\left((W^1)^T x_i + b^1\right).$$

Evaluation of layer 2, the output layer (a $1 \times 1$ matrix):

$$a^2 = \sigma_2\left((W^2)^T a^1 + b^2\right).$$

Note: $W^1$ is a $(2 \times 3)$-matrix, $W^2$ is a $(3 \times 1)$-matrix, $b^1$ is a $(3 \times 1)$-matrix, and $b^2$ is a $(1 \times 1)$-matrix.

$a^2$ supplies the prediction for response $y_i$ (hence $\approx y_i$).

Note that this evaluation is for a single observation! We need to apply the above steps for each observation in our data set.
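As a rough sketch of the vector form, the snippet below evaluates the two layers of a (3)-network for a handful of observations using plain nested lists. The helper names (`mat_t_vec`, `layer`) and every parameter value are illustrative assumptions; a matrix library would normally be used instead.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def mat_t_vec(W, v):
    """Return W^T v for a (rows x cols) nested list W and a length-rows list v."""
    rows, cols = len(W), len(W[0])
    return [sum(W[k][j] * v[k] for k in range(rows)) for j in range(cols)]

def layer(W, b, a_prev, sigma):
    """a^l = sigma_l((W^l)^T a^{l-1} + b^l), with sigma applied element-wise."""
    return [sigma(z + bj) for z, bj in zip(mat_t_vec(W, a_prev), b)]

# Illustrative parameters only: d_0 = 2, d_1 = 3, d_2 = 1.
W1 = [[0.5, -0.5, 1.0],
      [1.0, 0.5, -1.0]]       # (2 x 3)
b1 = [0.0, 0.0, 0.0]          # (3 x 1)
W2 = [[1.0], [1.0], [1.0]]    # (3 x 1)
b2 = [0.0]                    # (1 x 1)

X = [[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]]   # N = 3 made-up observations

predictions = []
for x_i in X:                               # one evaluation per observation
    a1 = layer(W1, b1, x_i, logistic)       # 3 x 1
    a2 = layer(W2, b2, a1, logistic)        # 1 x 1
    predictions.append(a2[0])
```

For the first observation every hidden activation is 0.5, so the output is the logistic function evaluated at 1.5.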


Activation Functions

Numerous activation functions exist, with the most commonly used variants for neural networks being logistic, hyperbolic tangent, and rectified linear units (ReLU).
The logistic activation function is the bog-standard expression from logistic regression:

$$\sigma_{LG}(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} \in [0, 1].$$

The so-called hyperbolic tangent function is defined by the elementary mathematical function

$$\sigma_H(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1} \in [-1, 1].$$

Rectified linear units are simply defined as:

$$\sigma_{ReLU}(z) = \max(0, z) \in \mathbb{R}_+.$$
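The three activations and their first derivatives can be written down in a few lines. This is a sketch for moderate values of $z$ only (the plain `math.exp(-z)` call overflows for very negative $z$), and the value returned for the ReLU derivative at $z = 0$ is a convention assumed here, since the derivative is not defined at that point.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def d_logistic(z):
    s = logistic(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z) * (1 - sigma(z))

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    # Undefined at z = 0; returning 0 there is an assumed convention.
    return 1.0 if z > 0 else 0.0

# Tabulate values and gradients on the grid used for the plots.
for z in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(z, logistic(z), tanh(z), relu(z), d_logistic(z), d_tanh(z), d_relu(z))
```

The logistic and tanh gradients shrink towards zero as $|z|$ grows, which is the saturation behaviour that matters when choosing a cost function.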

Activation Functions

[Figure: Activation Functions. $f(z)$ plotted against $z \in [-4, 4]$ for the logistic, tanh and ReLU activations.]

Activation Functions

[Figure: Activation Function Gradients. $\partial f / \partial z$ plotted against $z \in [-4, 4]$ for the logistic, tanh and ReLU activations.]

Output Layer

The output layer relates to the final evaluation of the updating equation that defines the model. Since this is the layer that interfaces with the data via the cost function, it is usually chosen with reference to the problem at hand.
For regression-type problems, where responses are continuous, linear activations are useful. For these purposes, we often use $\sigma_J(z) = z$.
For classification-type problems, where responses are categorical, we choose activations that match the range of the encoded outputs. This could be logistic, Tanh or even linear (not recommended), but it is often more useful to ensure that the output layer behaves like a distribution. For these purposes, it is recommended that one uses the so-called soft-max function:

$$\frac{e^{z_j}}{\sum_{k=1}^{d_L} e^{z_k}},$$

for $j = 1, 2, \ldots, d_L$.
The resulting outputs can then be treated as probabilities and predictions may correspond to the node with the highest predicted probability.
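A small sketch of the soft-max function follows. Subtracting the largest input before exponentiating is an added numerical safeguard that does not appear in the formula above; it leaves the result unchanged because the common factor cancels. The input values are made up.

```python
import math

def softmax(z):
    """Map a list of d_L real values to probabilities that sum to 1."""
    m = max(z)                              # shift to avoid overflow in exp
    exps = [math.exp(zj - m) for zj in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1, -1.0, 0.5]              # hypothetical inputs, d_L = 5
probs = softmax(z)

# Predict the node with the highest probability (0-based index).
predicted_node = max(range(len(probs)), key=lambda j: probs[j])
```

Here `predicted_node` is the index of the largest input, since the soft-max preserves ordering.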


Softmax Function

[Figure: Softmax function on 5 outputs, $a^L_1, \ldots, a^L_5$, shown on a vertical axis from 0 to 1.]

Cost/Objective Functions
Based on various sources.

Cost Functions

In order to fit neural networks, we need to extract a configuration of the network that replicates the data in some sense. That is, we need to find the parameters of the network that produce the `best' approximation of the relationship between the predictors and responses.
`Best' is usually taken to mean most accurate on the basis of prediction performance.
Depending on the nature of the response variable, this might mean picking a measure that is commensurate with the task.

Cost Functions: Squared Error Loss

Regression problems, in the context of supervised learning, usually refer to modeling continuous responses. That is, $Y_i$ may take on any value in a continuum of possible outcomes. As such, it makes sense to think of a prediction under a particular configuration mapping the inputs to a coordinate in the domain of the response variable. Consequently, a natural measure for the quality of a particular prediction is the distance between the prediction and the observed response for a particular training example. Assuming that we have $N$ independent examples, we may posit the so-called mean-squared-error (MSE) objective:¹

$$C_{MSE} = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i$ denotes our prediction (one dimensional output).
$C$ here represents some distance measure to be minimised.
For a neural network, $\hat{y}_i$ is calculated by evaluating the output layer of a neural network, $a^L_1$, for the $i$-th observation.
The cost function can easily accommodate multivalued outputs by simply calculating the Euclidean distance between the prediction coordinate and the observed coordinate over all observations.
For these purposes it is practical to either scale the observations to lie in $[0, 1]$, or alternatively to impose a linear activation function on the output layer.

¹ Scaling by 1/2 here is arbitrary, but simplifies calculations.
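To make the objective concrete, here is a minimal sketch of the MSE cost and its derivative with respect to each prediction; the function names and the numbers are made up. The derivative also shows why the factor 1/2 is convenient: it cancels the exponent.

```python
def mse_cost(y, y_hat):
    """C_MSE = 1/(2N) * sum_i (y_i - yhat_i)^2 for one-dimensional outputs."""
    n = len(y)
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / (2.0 * n)

def mse_gradient(y, y_hat):
    """dC/dyhat_i = (yhat_i - y_i) / N; the 1/2 cancels the factor of 2."""
    n = len(y)
    return [(pi - yi) / n for yi, pi in zip(y, y_hat)]

y = [1.0, 2.0, 3.0]            # observed responses (illustrative)
y_hat = [1.5, 2.0, 2.0]        # predictions (illustrative)

cost = mse_cost(y, y_hat)      # (0.25 + 0 + 1) / 6
grad = mse_gradient(y, y_hat)
```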
Cost Functions: Cross-Entropy-Error

Classification tasks, in the context of supervised learning, usually refer to modeling nominal/categorical responses. The responses $Y_i$ are thus assumed to take on discrete values from a finite set of possible outcomes.
Though we could still make use of an appropriate distance measure to set up a cost function for such responses, direct application is disadvantageous.
Consider a variable which takes on values in $\{0, 1\}$.
If the output layer activation is sigmoidal, we would be attempting to saturate the output activations in order to minimize the distance between our predictions and the observations.
Alternatively, we could again choose a linear activation function on the output layer. Our predictions would then take on values in a continuum and we would have to post-process the predictions in order to produce a prediction.²

² Post-processing of predictions can easily lead to incorrectly reporting prediction performance since the objective function was not minimised whilst explicitly accounting for the processing mechanism.
Cost Functions: Cross-Entropy-Error

For classification tasks, it is advantageous to formulate the objective function in terms of probabilities. That is, we set up the output layer to produce quantities that can reasonably be interpreted as probabilities. Subsequently, we can formulate an objective in terms of the distribution for the target variable. Under such a construction, we may use the so-called cross-entropy error:

$$C_{CE} = -\sum_{i=1}^{N} \left(y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right),$$

for the case of a single-valued response and $\hat{y}_i \in [0, 1]$.
Optimisation of this objective is equivalent to maximum likelihood estimation (MLE).
This objective does not suffer from the saturation issue mentioned previously: the gradient of the objective w.r.t. the parameters of the network is less prone to bottoming out when it is far from the global solution.
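The saturation point can be illustrated numerically for a single observation with a logistic output $\hat{y} = \sigma(z)$. Differentiating with respect to the pre-activation $z$ gives $\hat{y} - y$ for the cross-entropy term, and $(\hat{y} - y)\,\hat{y}(1 - \hat{y})$ for the squared-error term $\tfrac{1}{2}(y - \hat{y})^2$. The sketch below compares the two for a badly wrong prediction; the clipping constant `eps` is an added safeguard against `log(0)` and the numbers are made up.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    """C_CE = -sum_i [y_i log(yhat_i) + (1 - y_i) log(1 - yhat_i)]."""
    total = 0.0
    for yi, pi in zip(y, y_hat):
        pi = min(max(pi, eps), 1.0 - eps)   # assumed safeguard, not in the formula
        total -= yi * math.log(pi) + (1.0 - yi) * math.log(1.0 - pi)
    return total

def grad_z_ce(y, z):
    """d/dz of the cross-entropy term with yhat = logistic(z)."""
    return logistic(z) - y

def grad_z_se(y, z):
    """d/dz of 0.5 * (y - yhat)^2 with yhat = logistic(z)."""
    p = logistic(z)
    return (p - y) * p * (1.0 - p)

# True label 1, but z = -6 so the prediction is close to 0.
g_ce = grad_z_ce(1.0, -6.0)
g_se = grad_z_se(1.0, -6.0)
```

Under these assumptions the cross-entropy gradient stays close to -1 while the squared-error gradient is several hundred times smaller in magnitude, so the latter gives very little signal exactly when the prediction is most wrong.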
Choice of Cost Functions

Remember that the goal is to `learn' the structure of the relationship between the responses and the predictors. Since the interface between the data and the model is at the cost function, `learning' is heavily influenced by the behaviour of the cost function. So make sure you choose the output activations and cost function keeping the particular task in mind.

Choice of Cost Functions

[Figure: $C_{CE}$ surface vs. $C_{SE}$ surface on a two-parameter space ($w[1]$, $w[2]$) for a classification task (sigmoidal output).]

Fitting Neural Networks

Tackle Problem 1, (a), (b), and (c) in the Neural Networks section of the
exercises in your lecture notes.

Numerical Techniques for Finding Extrema

Newton's method sets up an iterative equation that attempts to seek out a critical point of a function based on first and second-order properties of the function.
Suppose at the present iteration, we are at $w_k$ (could also be the initial value) and $f(w_k) \neq 0$.
We wish to find $w^\star$ such that $f(w^\star) = 0$. Using the line tangent to the curve at $w_k$, we may get some insight as to which direction to travel in search of $w^\star$:

$$f'(w_k) \approx \frac{f(w_{k+1}) - f(w_k)}{w_{k+1} - w_k}$$

for $w_{k+1} - w_k$ small.
Setting $w_{k+1}$ to be the solution, i.e., $f(w_{k+1}) = 0$, we arrive at an updating equation:

$$w_{k+1} = w_k - \frac{f(w_k)}{f'(w_k)}.$$

Actually, we like to use $\lambda \frac{f(w_k)}{f'(w_k)}$ for some (step-size) control parameter $\lambda$.
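The update can be sketched in a few lines. The function name `newton_1d`, the stopping rule on $|f(w_k)|$ and the iteration cap are choices made for this illustration, and the sketch assumes $f'(w_k) \neq 0$ along the way. To locate an extremum of a cost $C(w)$, one would apply it to $f = C'$, so that $f' = C''$.

```python
def newton_1d(f, f_prime, w0, step=1.0, tol=1e-10, max_iter=100):
    """Iterate w_{k+1} = w_k - step * f(w_k) / f'(w_k) until |f(w_k)| < tol."""
    w = w0
    for _ in range(max_iter):
        fw = f(w)
        if abs(fw) < tol:
            break
        w = w - step * fw / f_prime(w)   # assumes f'(w) is non-zero
    return w

# Extremum of the made-up cost C(w) = (w - 2)^2 + 1: find a root of f = C'.
f = lambda w: 2.0 * (w - 2.0)            # C'(w)
f_prime = lambda w: 2.0                  # C''(w)
w_star = newton_1d(f, f_prime, w0=10.0)

# Plain root finding: solve w^2 - 2 = 0 starting from w = 1.
root = newton_1d(lambda w: w * w - 2.0, lambda w: 2.0 * w, w0=1.0)
```

Because the first example is quadratic, the full step (`step=1.0`) lands on the critical point in a single iteration; a smaller `step` trades speed for caution.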

Newton's Method in 1D

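Newton's Method

Newton's method easily generalises to higher dimensions. Using similar arguments, one can derive an updating equation for the two-dimensional case. Let $f : \mathbb{R}^2 \to \mathbb{R}$ and $w_k = \{w_x^{(k)}, w_y^{(k)}\}^T$, then

$$w_{k+1} = w_k - \Gamma(w_k)^{-1} \nabla f(w_k); \quad k = 1, 2, \ldots$$

where

$$\nabla f(w_k) = \begin{pmatrix} \frac{\partial f}{\partial w_x}(w_k) \\ \frac{\partial f}{\partial w_y}(w_k) \end{pmatrix}$$

and

$$\Gamma(w_k) = \begin{pmatrix} \frac{\partial^2 f}{\partial w_x^2}(w_k) & \frac{\partial^2 f}{\partial w_x \partial w_y}(w_k) \\ \frac{\partial^2 f}{\partial w_x \partial w_y}(w_k) & \frac{\partial^2 f}{\partial w_y^2}(w_k) \end{pmatrix}.$$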
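A minimal sketch of the two-dimensional update is given below, using the closed-form inverse of a $2 \times 2$ matrix. The function name, the stopping rule and the example objective are all assumptions made for illustration, and the sketch presumes $\Gamma(w_k)$ is invertible.

```python
def newton_2d(grad, hess, w0, tol=1e-10, max_iter=50):
    """Iterate w_{k+1} = w_k - Gamma(w_k)^{-1} grad f(w_k) for w in R^2."""
    wx, wy = w0
    for _ in range(max_iter):
        gx, gy = grad(wx, wy)
        if (gx * gx + gy * gy) ** 0.5 < tol:
            break
        (a, b), (c, d) = hess(wx, wy)
        det = a * d - b * c               # assumed non-zero
        # Gamma^{-1} grad, using the closed-form inverse of a 2x2 matrix.
        sx = (d * gx - b * gy) / det
        sy = (-c * gx + a * gy) / det
        wx, wy = wx - sx, wy - sy
    return wx, wy

# Hypothetical objective: f(wx, wy) = (wx - 1)^2 + 2 (wy + 3)^2 + wx * wy
grad = lambda wx, wy: (2 * (wx - 1) + wy, 4 * (wy + 3) + wx)
hess = lambda wx, wy: ((2.0, 1.0), (1.0, 4.0))

w_min = newton_2d(grad, hess, (0.0, 0.0))
```

Setting the gradient of this quadratic to zero by hand gives $w_x = 20/7$ and $w_y = -26/7$, which is a useful check on the result.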

Newton's Method in 2D

Fitting Neural Networks

Complete Problem 1 in the Neural Networks section of the exercises in


your lecture notes.

Gradient Descent

Gradient Descent gets rid of the second-order elements and makes the step-size purely proportional to the gradient:

$$w_{k+1} = w_k - \lambda \nabla f(w_k); \quad k = 1, 2, \ldots$$

where $\lambda$ is a control parameter for the algorithm (henceforth referred to as the `learning rate').
This has the advantage that the algorithm is much more computationally efficient.
Note that, for a $p$-parameter function, we would require all the second-order cross-derivatives at each step of the algorithm if we wished to use Newton's method.
Gradient Descent would only require evaluation of the first-order derivatives.
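The update is short enough to sketch directly. The learning rate, iteration count and the quadratic objective below are arbitrary illustrative choices; in practice both control parameters need tuning for the problem at hand.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_iter=200):
    """Iterate w_{k+1} = w_k - lambda * grad f(w_k); w is a list of parameters."""
    w = list(w0)
    for _ in range(n_iter):
        g = grad(w)                      # only first-order derivatives are needed
        w = [wj - learning_rate * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical objective: f(w) = (w[0] - 1)^2 + 2 (w[1] + 3)^2 + w[0] * w[1]
grad = lambda w: [2 * (w[0] - 1) + w[1], 4 * (w[1] + 3) + w[0]]

w_hat = gradient_descent(grad, [0.0, 0.0])
```

For this objective the minimiser can be found by hand as $(20/7, -26/7)$. Gradient descent needs many cheap iterations to get there, where a Newton step would need second-order derivatives.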

Gradient Descent

Gradient Descent with Momentum

A fundamental issue with overspecified models such as neural networks is that the underlying system has multiple (possibly identical/near identical) solutions. As such, the error surface has multiple local extrema. Since the gradient descent step-size is determined entirely by the gradient of the surface, such an algorithm might get stuck at local extrema. Gradient descent with momentum modifies the updating equation by keeping track of previous updates, e.g.:

$$w_{k+1} = w_k - \lambda \nabla f(w_k) + \nu \nabla f(w_{k-1}); \quad k = 2, 3, \ldots$$

where $\lambda$ is a control parameter for the algorithm (henceforth referred to as the `learning rate') and $\nu$ is a second control parameter weighting the previous gradient.
You will find numerous modifications of gradient descent aimed specifically at neural nets: RPROP, GRPROP etc.
Note that momentum can severely destabilize the algorithm, a phenomenon which is easy to miss without careful analysis of convergence. ALWAYS double check output!
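As a sketch of the idea, the snippet below uses the common "heavy-ball" formulation, $w_{k+1} = w_k - \lambda \nabla f(w_k) + \nu (w_k - w_{k-1})$, which reuses the previous update rather than the previous gradient. It is a swapped-in variant chosen for illustration and is not the exact example equation above. All names and parameter values are assumptions.

```python
def gd_momentum(grad, w0, learning_rate=0.05, momentum=0.8, n_iter=300):
    """Heavy-ball gradient descent: the velocity accumulates previous updates."""
    w = list(w0)
    velocity = [0.0] * len(w)            # v_k = w_k - w_{k-1}, zero at the start
    for _ in range(n_iter):
        g = grad(w)
        velocity = [momentum * vj - learning_rate * gj
                    for vj, gj in zip(velocity, g)]
        w = [wj + vj for wj, vj in zip(w, velocity)]
    return w

# Hypothetical objective: f(w) = (w[0] - 1)^2 + 2 (w[1] + 3)^2 + w[0] * w[1]
grad = lambda w: [2 * (w[0] - 1) + w[1], 4 * (w[1] + 3) + w[0]]

w_hat = gd_momentum(grad, [0.0, 0.0])
```

One way to double check the output is to record $f(w_k)$ at every iteration and confirm that it settles. With a poorly chosen `momentum` or `learning_rate` the iterates can oscillate or diverge.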
Gradient Descent with Momentum

Stochastic Gradient Descent

Stochastic gradient descent modifies the algorithm further by calculating the cost function at each iteration using only a small subset of the data. That is, by using, say, 10% of the available training examples at each iteration, the objective function is effectively perturbed at random, resulting in stochastic updates to the parameter trajectories as the algorithm attempts to find a minimum.
Since less data is used, this results in substantial computational efficiency gains.
This also conveniently takes care (ish) of multi-modality issues.
The claim is also that, by focusing on small subsets of the data, it is possible for models trained under such an algorithm to `learn' nuances in the data that would have otherwise been opaque under evaluation of the full error surface.

Passing over subsets of the data led to the convention that the length of time that a `learning' algorithm runs is expressed in terms of epochs, one epoch being a full pass over the dataset.
Often software parametrises such procedures in terms of batch-size (or `mini-batch'), i.e., the number of observations in each subset.
At the other extreme from the full data set, it is also possible to look at a single training example. This is sometimes referred to as `online', though formally in the literature `online learning' is taken to mean something else.
In practice one has to manage the trade-off between computational gains and the convergence of the algorithm, so batch sizes have to be small but not so small as to result in poor convergence.
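The relationship between batch size, iterations and epochs can be sketched on a deliberately simple model. The example below fits a straight line (not a neural network) by mini-batch stochastic gradient descent on the squared-error cost. The data are synthetic and noise-free, and the function name and all settings are assumptions made for illustration.

```python
import random

def sgd_linear(xs, ys, learning_rate=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on the squared-error cost."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                  # one epoch = one full pass over the data
        rng.shuffle(indices)                 # new random subsets every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 1/(2n) * sum (y - (w x + b))^2 over the batch only.
            gw = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * gw
            b -= learning_rate * gb
    return w, b

# Synthetic data from y = 2x - 1, made up for this example.
xs = [i / 10.0 for i in range(-10, 11)]      # 21 points in [-1, 1]
ys = [2.0 * x - 1.0 for x in xs]

w_hat, b_hat = sgd_linear(xs, ys)
```

With 21 observations and a batch size of 4, each epoch consists of six parameter updates (five full batches and one of size 1). Setting `batch_size=1` gives the single-example extreme, and `batch_size=len(xs)` recovers ordinary gradient descent.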

Stochastic Gradient Descent

References I

[Nielsen 2015] Nielsen, Michael A.: Neural Networks and Deep Learning. Determination Press, 2015.
