Lecture Slides 2 - Neural Networks - 2021

The document discusses feed-forward neural networks. It describes the basic structure and updating equations of a feed-forward neural network, including nodes, layers, weights, biases, and activation functions. It also notes some design choices like the number of hidden layers and nodes as well as activation functions.

Lecture 5: Neural Networks.

Dr Etienne Pienaar
etienne.pienaar@uct.ac.za Room 5.43

University of Cape Town

October 5, 2021

Feed-forward Neural Networks
Based on Nielsen (2015).

Feed-forward Neural Network

Consider modeling a dataset consisting of observations of $p$ inputs $x_i^T = [x_{i1}, x_{i2}, \dots, x_{ip}]$ and a single output $y_i$ for $i = 1, 2, \dots, N$ (a total of $N$ training examples). A simple linear model follows:

$$y_i \approx a = \sum_{k=1}^{d_0 = p} x_{ik} w_k + b,$$

for all observations $i = 1, 2, \dots, N$.

If our approximation is good, then $a$ will be close to $y_i$ by some chosen objective measure (over all $i$).

What if the nature of the response is such that it is categorical? If it's binary, we can encode it as $y_i \in \{0, 1\}$ and pick the objective to be parametrised in terms of a probability, say $\pi_i = \Pr(Y_i = 1)$, in which case

$$\pi_i \approx a = \sum_{k=1}^{d_0 = p} x_{ik} w_k + b,$$

for all observations $i = 1, 2, \dots, N$.

But then we'd have to constrain the parameters in the linear component in order to ensure $a \in [0, 1]$. Alternatively, we can transform the linear component appropriately using some activation function $\sigma(\cdot)$ which maps its argument to $[0, 1]$:

$$\pi_i \approx a = \sigma\left(\sum_{k=1}^{d_0 = p} x_{ik} w_k + b\right),$$

for all observations $i = 1, 2, \dots, N$.
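
As a rough illustration of why the transformation helps, the sketch below compares the raw linear component with its sigmoid-transformed version. The data, weights and bias are hypothetical values chosen only for illustration, and the logistic function is assumed as the choice of $\sigma$.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Sigmoid-transformed linear model: pi_i ~ sigma(sum_k x_ik w_k + b)."""
    return sigmoid(X @ w + b)

# Hypothetical data: N = 3 observations of p = 2 inputs.
X = np.array([[0.0, 0.0], [1.0, 2.0], [-3.0, 0.5]])
w = np.array([1.5, -0.5])   # illustrative weights, not fitted values
b = 0.25                    # illustrative bias

linear = X @ w + b          # unconstrained: can fall outside [0, 1]
probs = predict_proba(X, w, b)
print(linear)
print(probs)
```

The linear component for the third observation is negative, so it could not be read as a probability, whereas every transformed value lies strictly between 0 and 1 without any constraint on `w` or `b`.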

Over and above generalising configurations for the response, if we are in the world of non-linear problems, further generalisation of this idea relies on constructing a non-linear basis for making predictions:

$$a = \mathrm{Basis}(x_i, \theta),$$

for all observations $i = 1, 2, \dots, N$.

Feed-forward neural networks give such a basis by extending our earlier ideas through recursion. For example, one recursive step:

$$a_j^1 = \sigma_1\left(\sum_{k=1}^{d_0 = p} x_{ik} w_{kj}^1 + b_j^1\right), \quad j = 1, 2, \dots, d_1,$$

$$a^2 = \sigma_2\left(\sum_{k=1}^{d_1} a_k^1 w_k^2 + b^2\right),$$

for all observations $i = 1, 2, \dots, N$.

Each row (iteration) constitutes a 'layer' (for which we incur the superscripts in our indexing), and the final layer is what our non-linear basis returns as a prediction of the response. Each element (along index $j$) in the first layer constitutes a 'node' in that layer.

In a more general form, we can encapsulate this idea in the simple (despite the tedious indexing) recursive relation:

$$a_j^l = \sigma_l\left(\sum_{k=1}^{d_{l-1}} a_k^{l-1} w_{kj}^l + b_j^l\right), \quad l = 1, 2, \dots, L; \quad j = 1, 2, \dots, d_l.$$

This is the scalar form of the updating equations which defines feed-forward neural networks. (Scalar because, for each observation, a node is a single number, i.e. a scalar.)

In the updating equation:

$\sigma_l(\cdot)$ denotes an activation function on layer $l$ (e.g., sigmoid or hyperbolic tangent),

$d_{l-1}$ denotes the number of nodes in layer $l - 1$ (the number of nodes in layer $l$ is $d_l$),

$w_{kj}^l$ denotes the $kj$-th weight parameter linking the $k$-th node in layer $l - 1$ and the $j$-th node in layer $l$ (although we connect from layer $l - 1$ to layer $l$, we use the convention of indexing the weights using the superscript $l$, the forward layer),

$b_j^l$ denotes the $j$-th bias in layer $l$,

and the equation is evaluated subject to the initial conditions $a_j^0 = x_{ij}$ for all $j$ at the $i$-th training example. The updating equation thus needs to be evaluated for each training example!
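
The scalar recursion can be sketched directly with nested loops. This is a minimal illustration rather than an efficient implementation: the function name `forward_scalar`, the nested-list storage of the weights and all parameter values are assumptions made for the example, with `weights[l-1][k][j]` standing in for $w_{kj}^l$.

```python
import math

def logistic(z):
    """Logistic activation for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_scalar(x, weights, biases, activations):
    """Evaluate a_j^l = sigma_l(sum_k a_k^{l-1} w_kj^l + b_j^l) for one observation.

    weights[l-1][k][j] plays the role of w_kj^l and biases[l-1][j] of b_j^l.
    """
    a_prev = list(x)                       # initial condition: a_j^0 = x_ij
    for W, b, sigma in zip(weights, biases, activations):
        d_prev, d_cur = len(W), len(b)     # d_{l-1} and d_l
        a_cur = []
        for j in range(d_cur):             # one node at a time
            z = b[j]
            for k in range(d_prev):
                z += a_prev[k] * W[k][j]
            a_cur.append(sigma(z))
        a_prev = a_cur
    return a_prev                          # activations of the final layer

# Hypothetical network with d_0 = 2, d_1 = 3, d_2 = 1 (illustrative values only).
W1 = [[0.1, -0.2, 0.3],
      [0.4, 0.5, -0.6]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7], [-0.8], [0.9]]
b2 = [0.05]

X = [[1.0, 2.0], [0.5, -1.0]]              # two training examples
predictions = [forward_scalar(x, [W1, W2], [b1, b2], [logistic, logistic]) for x in X]
print(predictions)
```

Note that the outer list comprehension re-evaluates the whole recursion for every training example, which is exactly the cost the last sentence above points to.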

Model 'Class'?

Though we restrict ourselves to the feed-forward architecture for now, this structure permits quite a bit of freedom w.r.t. defining models.
In addition to the dimensions of the model, i.e., the number of nodes in each hidden layer, $d_{l-1}$, and the number of hidden layers, $L - 1$, the user defines the kinds of activation function used in each layer, $\sigma_l(\cdot)$.
Typically, the activation functions are chosen to be the same for hidden layers.
The activation function for the output layer is usually chosen in conjunction with an appropriate cost function (objective). Naturally, the latter is also determined by the measurement scale of the target variable/output.
Design choices for the network are often motivated within the context of the particular application, though one should not be tempted to attribute performance solely to the design parameters: the particular hypothesis chosen at the learning phase may not be statistically distinguishable from other designs!

[Figure: Directed graph of a normal linear model (left) vs. that of a standard feed-forward neural network, specifically a $(d_1)$-network (right). Weights shown for the first columns of $W_1$ and $W_2$.]

[Figure: Directed graph of a logistic regression model (left) vs. that of a (3)-network (right) for two predictors on a single response. Weights shown for the first columns of $W_1$ and $W_2$. Here, $a_1^1 = \sigma_1\left(\sum_{j=1}^{2} w_{j1}^1 x_{ij} + b_1^1\right)$ and $a_1^2 = \sigma_2\left(\sum_{j=1}^{3} w_{j1}^2 a_j^1 + b_1^2\right)$.]

In vector form, let $x_i = [x_{i1}, x_{i2}]^T$ denote the input/predictor vector (the input layer), $a^l = [a_1^l, a_2^l, \dots, a_{d_l}^l]^T$ (a $d_l \times 1$ matrix) denote the vector of activations in layer $l$, $W_l$ the $d_{l-1} \times d_l$ weight matrix connecting layers $l - 1$ and $l$, and $b^l = [b_1^l, b_2^l, \dots, b_{d_l}^l]^T$ the vector of bias terms on layer $l$. Then, for a (3)-network with two inputs on a single response, we have:

Dimensions: $d_0 = 2$, $d_1 = 3$, $d_2 = 1$.

Evaluation of layer 1:

$$a^1 = \sigma_1(W_1^T x_i + b^1) \quad \text{(a } 3 \times 1 \text{ matrix).}$$

Evaluation of layer 2 (the output layer):

$$a^2 = \sigma_2(W_2^T a^1 + b^2) \quad \text{(a } 1 \times 1 \text{ matrix).}$$

Note: $W_1$ is a $(2 \times 3)$-matrix, $W_2$ is a $(3 \times 1)$-matrix, $b^1$ is a $(3 \times 1)$-matrix, and $b^2$ is a $(1 \times 1)$-matrix.

$a^2$ supplies the prediction for response $y_i$ (hence $\approx y_i$).

Note that this evaluation is for a single observation! We need to apply the above steps for each observation in our data set.
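
The vector-form evaluation of the (3)-network can be sketched in NumPy as follows. The parameter values are random draws used only to check shapes, the observations in `X` are hypothetical, and the logistic activation is assumed for both layers.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only

d0, d1, d2 = 2, 3, 1
W1 = rng.normal(size=(d0, d1))            # (2 x 3)
W2 = rng.normal(size=(d1, d2))            # (3 x 1)
b1 = rng.normal(size=(d1, 1))             # (3 x 1)
b2 = rng.normal(size=(d2, 1))             # (1 x 1)

def forward(x_i):
    """Evaluate the network for one observation x_i, a (d0 x 1) column vector."""
    a1 = logistic(W1.T @ x_i + b1)        # layer 1: (3 x 1)
    a2 = logistic(W2.T @ a1 + b2)         # layer 2 (output): (1 x 1)
    return a1, a2

X = np.array([[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]])   # hypothetical N = 3 observations
for x in X:                               # one evaluation per observation
    a1, a2 = forward(x.reshape(d0, 1))
    print(a1.shape, a2.shape, a2.item())

# The same evaluation for all observations at once (rows = observations).
A1 = logistic(X @ W1 + b1.T)              # (N x 3)
A2 = logistic(A1 @ W2 + b2.T)             # (N x 1)
```

Stacking observations as rows, as in the last two lines, is a common way to avoid the explicit loop; it gives the same predictions as the per-observation evaluation.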

Activation Functions

Numerous activation functions exist, with the most commonly used variants for neural networks being logistic, hyperbolic tangent, and rectified linear units (ReLU).
The logistic activation function is the bog-standard expression from logistic regression:
$$\sigma_{LG}(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} \in [0, 1].$$
The so-called hyperbolic tangent function is defined by the elementary mathematical function
$$\sigma_H(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1} \in [-1, 1].$$
Rectified linear units are simply defined as:
$$\sigma_{ReLU}(z) = \max(0, z) \in \mathbb{R}_+.$$
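
The three definitions translate almost line for line into NumPy. The sketch below is illustrative only; the function names and the evaluation grid are assumptions, and the final lines simply confirm numerically that the alternative forms given above agree.

```python
import numpy as np

def logistic(z):
    """sigma_LG(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """sigma_H(z) = tanh(z)."""
    return np.tanh(z)

def relu(z):
    """sigma_ReLU(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

z = np.linspace(-4.0, 4.0, 9)             # grid -4, -3, ..., 4

# Check the equivalent forms of each definition on the grid.
logistic_alt = np.exp(z) / (1.0 + np.exp(z))
tanh_alt = (np.exp(2 * z) - 1.0) / (np.exp(2 * z) + 1.0)
print(np.allclose(logistic(z), logistic_alt))
print(np.allclose(tanh(z), tanh_alt))
print(relu(z))
```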

[Figure: Activation functions $f(z)$ plotted for $z$ from $-4$ to $4$: Logistic, Tanh and ReLU.]

[Figure: Activation function gradients $\partial f / \partial z$ plotted for $z$ from $-4$ to $4$: Logistic, Tanh and ReLU.]
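
For reference, the gradients of the three activation functions follow from standard calculus applied to the definitions given earlier. The value used for ReLU at $z = 0$ is a convention, since the derivative does not exist there.

$$
\begin{aligned}
\frac{\partial \sigma_{LG}}{\partial z}(z) &= \sigma_{LG}(z)\,\bigl(1 - \sigma_{LG}(z)\bigr), \\
\frac{\partial \sigma_{H}}{\partial z}(z) &= 1 - \tanh^2(z), \\
\frac{\partial \sigma_{ReLU}}{\partial z}(z) &= \begin{cases} 1, & z > 0, \\ 0, & z < 0. \end{cases}
\end{aligned}
$$

The logistic gradient peaks at $1/4$ and the Tanh gradient at $1$, both at $z = 0$, and both decay towards zero as $|z|$ grows, whereas the ReLU gradient stays at $1$ for all positive $z$.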

Output Layer

The output layer relates to the final evaluation of the updating equation that defines the model. Since this is the layer that interfaces with the data via the cost function, it is usually chosen with reference to the problem at hand.
For regression-type problems, where responses are continuous, linear activations are useful. For these purposes, we often use $\sigma_J(z) = z$.
For classification-type problems, where responses are categorical, we choose activations that match the range of the encoded outputs. This could be logistic, Tanh or even linear (not recommended), but it is often more useful to ensure that the output layer behaves like a distribution. For these purposes, it is recommended that one uses the so-called soft-max function:
$$\frac{e^{z_j}}{\sum_{k=1}^{d_L} e^{z_k}},$$
for $j = 1, 2, \dots, d_L$.
The resulting outputs can then be treated as probabilities and predictions may correspond to the node with the highest predicted probability.
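
A minimal sketch of the soft-max output layer is given below. Subtracting the maximum before exponentiating is a common numerical safeguard that is not part of the formula above but leaves its value unchanged; the input vector `z_L` is a hypothetical set of output-layer pre-activations.

```python
import numpy as np

def softmax(z):
    """Soft-max over the last axis: exp(z_j) / sum_k exp(z_k)."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # guards against overflow, same result
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z_L = np.array([2.0, 1.0, 0.1, -1.0, 0.5])        # hypothetical values, d_L = 5
a_L = softmax(z_L)                                # behaves like a distribution
predicted_class = int(np.argmax(a_L))             # node with the highest probability

print(a_L, a_L.sum(), predicted_class)
```

The outputs are positive and sum to one, so they can be read as class probabilities, and the prediction is the index of the largest one.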

Softmax Function

[Figure: Softmax function on 5 outputs: values between 0 and 1 for the output nodes $a_1^L, a_2^L, \dots, a_5^L$.]

Cost/Objective Functions
Based on various sources.

Cost Functions

In order to fit neural networks, we need to extract a configuration of the network that replicates the data in some sense. That is, we need to find the parameters of the network that produce the 'best' approximation of the relationship between the predictors and responses.
'Best' is usually taken to mean most accurate on the basis of prediction performance.
Depending on the nature of the response variable, this might mean picking a measure that is commensurate with the task.

Cost Functions: Squared Error Loss

Regression problems, in the context of supervised learning, usually refer to modeling continuous responses. That is, $Y_i$ may take on any value in a continuum of possible outcomes. As such, it makes sense to think of a prediction under a particular configuration as mapping the inputs to a coordinate in the domain of the response variable. Consequently, a natural measure for the quality of a particular prediction is the distance between the prediction and the observed response for a particular training example. Assuming that we have $N$ independent examples, we may posit the so-called mean-squared-error (MSE) objective [1]:

$$C_{MSE} = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i$ denotes our prediction (one-dimensional output).

$C$ here represents some distance measure to be minimised.
For a neural network, $\hat{y}_i$ is calculated by evaluating the output layer of the network, $a_1^L$, for the $i$-th observation.
The cost function can easily accommodate multivalued outputs by simply calculating the Euclidean distance between the prediction coordinate and the observed coordinate over all observations.
For these purposes it is practical to either scale the observations to lie in $[0, 1]$, or alternatively to impose a linear activation function on the output layer.

[1] Scaling by 1/2 here is arbitrary, but simplifies calculations.
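
The MSE objective and its gradient with respect to the predictions can be sketched as below; the gradient shows how the factor 1/2 cancels the 2 from differentiating the square. The function names and the toy responses are assumptions for illustration.

```python
import numpy as np

def mse_cost(y, y_hat):
    """C_MSE = (1 / 2N) * sum_i (y_i - y_hat_i)^2, with N the number of observations."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    N = y.shape[0]
    return np.sum((y - y_hat) ** 2) / (2.0 * N)

def mse_grad(y, y_hat):
    """Gradient of C_MSE w.r.t. each prediction: -(y_i - y_hat_i) / N."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return -(y - y_hat) / y.shape[0]

y = np.array([1.0, 2.0, 3.0])             # hypothetical observed responses
y_hat = np.array([1.5, 2.0, 2.0])         # hypothetical predictions

cost = mse_cost(y, y_hat)                 # (0.25 + 0 + 1) / 6
grad = mse_grad(y, y_hat)
print(cost, grad)
```

Because the sum runs over every element, passing arrays of shape `(N, q)` handles multivalued outputs in the same way: each row contributes its squared Euclidean distance.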

Cost Functions: Cross-Entropy-Error

Classification tasks, in the context of supervised learning, usually refer to modeling nominal/categorical responses. The responses $Y_i$ are thus assumed to take on discrete values from a finite set of possible outcomes.

Though we could still make use of an appropriate distance measure to set up a cost function for such responses, direct application is disadvantageous.
Consider a variable which takes on values in $\{0, 1\}$.
If the output layer activation is sigmoidal, we would be attempting to saturate the output activations in order to minimize the distance between our predictions and the observations.
Alternatively, we could again choose a linear activation function on the output layer. Our predictions would then take on values in a continuum and we would have to post-process the predictions in order to produce a prediction [2].

[2] Post-processing of predictions can easily lead to incorrectly reporting prediction performance, since the objective function was not minimised whilst explicitly accounting for the processing mechanism.

For classification tasks, it is advantageous to formulate the objective function in terms of probabilities. That is, we set up the output layer to produce quantities that can reasonably be interpreted as probabilities. Subsequently, we can formulate an objective in terms of the distribution for the target variable. Under such a construction, we may use the so-called cross-entropy error:

$$C_{CE} = -\sum_{i=1}^{N} \bigl( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \bigr),$$

for the case of a single-valued response and $\hat{y}_i \in [0, 1]$.

Optimisation of this objective is equivalent to maximum likelihood estimation (MLE).
This objective does not suffer from the saturation issue mentioned previously: the gradient of the objective w.r.t. the parameters of the network is less prone to bottoming out when it is far from the global solution.
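
The saturation point can be made concrete with a small numerical sketch. For a single sigmoidal output $\hat{y} = \sigma_{LG}(z)$, the chain rule gives a gradient of $\hat{y} - y$ w.r.t. $z$ under cross-entropy, and $(\hat{y} - y)\,\hat{y}(1 - \hat{y})$ under the squared error $\tfrac{1}{2}(y - \hat{y})^2$. The clipping constant `eps` and all numerical values below are assumptions for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    """C_CE = -sum_i (y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i))."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)        # avoid log(0); eps is an illustrative choice
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0, 1.0])                     # hypothetical binary responses
y_hat = np.array([0.9, 0.2, 0.6])                 # hypothetical predicted probabilities
cost = cross_entropy(y, y_hat)
print(cost)

# One observation with y = 1 whose prediction is badly wrong (far from the solution).
z, y_one = -6.0, 1.0
p = logistic(z)
grad_ce = p - y_one                               # dC_CE/dz
grad_se = (p - y_one) * p * (1.0 - p)             # d[0.5 (y - p)^2]/dz
print(grad_ce, grad_se)
```

Under cross-entropy the gradient stays close to $-1$ for this badly wrong prediction, whereas the squared-error gradient is scaled down by the factor $\hat{y}(1 - \hat{y})$ and is close to zero.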

Choice of Cost Functions

Remember that the goal is to 'learn' the structure of the relationship between the responses and the predictors. Since the interface between the data and the model is at the cost function, 'learning' is heavily influenced by the behaviour of the cost function. So make sure you choose the output activations and cost function keeping the particular task in mind.

[Figure: $C_{CE}$ surface vs. $C_{SE}$ surface on a two-parameter space ($w[1]$, $w[2]$) for a classification task (sigmoidal output).]

Fitting Neural Networks

Tackle Problem 1, (a), (b), and (c) in the Neural Networks section of the
exercises in your lecture notes.

Numerical Techniques for
Finding Extrema

Newton's method sets up an iterative equation that attempts to seek out a critical point of a function based on first and second-order properties of the function.
Suppose at the present iteration, we are at $w_k$ (could also be the initial value) and $f(w_k) \neq 0$.
We wish to find $w^\star$ such that $f(w^\star) = 0$. Using the line tangent to the curve at $w_k$, we may get some insight as to which direction to travel in search of $w^\star$:
$$f'(w_k) \approx \frac{f(w_{k+1}) - f(w_k)}{w_{k+1} - w_k}$$
for $w_{k+1} - w_k$ small.
Setting $w_{k+1}$ to be the solution, i.e. $f(w_{k+1}) = 0$, we arrive at an updating equation:
$$w_{k+1} = w_k - \frac{f(w_k)}{f'(w_k)}.$$
Actually, we like to use $\lambda \dfrac{f(w_k)}{f'(w_k)}$ for some (step-size) control parameter $\lambda$.
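
The updating equation, including the step-size parameter $\lambda$, can be sketched as a short loop. The function name `newton_root`, the stopping rule on $|f(w_k)|$ and the test function $f(w) = w^2 - 2$ are assumptions chosen for illustration.

```python
def newton_root(f, f_prime, w0, lam=1.0, tol=1e-10, max_iter=100):
    """Iterate w_{k+1} = w_k - lam * f(w_k) / f'(w_k) until |f(w_k)| < tol."""
    w = w0
    for k in range(max_iter):
        fw = f(w)
        if abs(fw) < tol:
            return w, k                    # converged after k updates
        w = w - lam * fw / f_prime(w)
    return w, max_iter

# Hypothetical example: the positive root of f(w) = w^2 - 2.
f = lambda w: w * w - 2.0
f_prime = lambda w: 2.0 * w

root, iters = newton_root(f, f_prime, w0=1.0)
root_damped, iters_damped = newton_root(f, f_prime, w0=1.0, lam=0.5)
print(root, iters)
print(root_damped, iters_damped)
```

To seek a critical point of a cost function rather than a root, the same loop can be applied with `f` set to the first derivative of the cost and `f_prime` to its second derivative. Using `lam < 1` takes shorter steps and typically needs more iterations on a well-behaved problem like this one.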

[Figure: Newton's Method in 1D.]

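Newton's Method

Newton's method easily generalises to higher dimensions. Using similar arguments, one can derive an updating equation for the two-dimensional case. Let $f : \mathbb{R}^2 \to \mathbb{R}$ and $w_k = \{w_x^{(k)}, w_y^{(k)}\}^T$, then

$$w_{k+1} = w_k - \Gamma(w_k)^{-1} \nabla f(w_k); \quad k = 1, 2, \dots$$

where

$$\nabla f(w_k) = \begin{pmatrix} \dfrac{\partial f}{\partial w_x}(w_k) \\[2ex] \dfrac{\partial f}{\partial w_y}(w_k) \end{pmatrix}$$

and

$$\Gamma(w_k) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial w_x^2}(w_k) & \dfrac{\partial^2 f}{\partial w_x \partial w_y}(w_k) \\[2ex] \dfrac{\partial^2 f}{\partial w_x \partial w_y}(w_k) & \dfrac{\partial^2 f}{\partial w_y^2}(w_k) \end{pmatrix}.$$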
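
A sketch of the two-dimensional update is given below. It solves the linear system $\Gamma(w_k)\,s = \nabla f(w_k)$ instead of forming the inverse explicitly, which is a common numerical choice rather than something prescribed above. The test function $f(w_x, w_y) = e^{w_x} - w_x + w_y^2 + \tfrac{1}{2} w_x w_y$, which has a critical point at the origin, and the starting value are assumptions for illustration.

```python
import numpy as np

def newton_2d(grad, hess, w0, tol=1e-10, max_iter=50):
    """Iterate w_{k+1} = w_k - Gamma(w_k)^{-1} grad f(w_k) until the gradient is small."""
    w = np.asarray(w0, dtype=float)
    for k in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            return w, k
        step = np.linalg.solve(hess(w), g)   # solves Gamma * step = grad f
        w = w - step
    return w, max_iter

# Hypothetical test function: f(wx, wy) = exp(wx) - wx + wy^2 + 0.5 * wx * wy.
def grad(w):
    wx, wy = w
    return np.array([np.exp(wx) - 1.0 + 0.5 * wy,
                     2.0 * wy + 0.5 * wx])

def hess(w):
    wx, _ = w
    return np.array([[np.exp(wx), 0.5],
                     [0.5, 2.0]])

w_star, iters = newton_2d(grad, hess, w0=[1.0, 1.0])
print(w_star, iters)
```

Each iteration needs the full matrix of second derivatives, which is the cost that motivates first-order alternatives when the number of parameters is large.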

[Figure: Newton's Method in 2D.]

Fitting Neural Networks

Complete Problem 1 in the Neural Networks section of the exercises in your lecture notes.

Gradient Descent

Gradient Descent gets rid of the second-order elements and makes the step-size purely proportional to the gradient:
$$w_{k+1} = w_k - \lambda \nabla f(w_k); \quad k = 1, 2, \dots$$
where $\lambda$ is a control parameter for the algorithm (henceforth referred to as the 'learning rate').
This has the advantage that the algorithm is much more computationally efficient.
Note that, for a $p$-parameter function, we would require all the second-order cross-derivatives at each step of the algorithm if we wished to use Newton's method.
Gradient Descent would only require evaluation of the first-order derivatives.
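
The update needs nothing beyond a gradient function, as the following sketch shows. The quadratic test function $f(w) = (w_x - 1)^2 + 2(w_y + 0.5)^2$, the learning rate and the iteration count are assumptions chosen so that the iteration converges.

```python
import numpy as np

def gradient_descent(grad, w0, lam=0.1, n_iter=200):
    """Iterate w_{k+1} = w_k - lam * grad f(w_k) for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    path = [w.copy()]                      # keep the trajectory for inspection
    for _ in range(n_iter):
        w = w - lam * grad(w)
        path.append(w.copy())
    return w, np.array(path)

# Hypothetical test function: f(w) = (wx - 1)^2 + 2 (wy + 0.5)^2, minimised at (1, -0.5).
def grad(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w_hat, path = gradient_descent(grad, w0=[-2.0, 2.0])
print(w_hat, path.shape)
```

Storing the path makes it easy to check convergence afterwards, for example by plotting the distance to the final iterate against the iteration number.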

[Figure: Gradient Descent.]

Gradient Descent with Momentum

A fundamental issue with overspecified models such as neural networks is that the underlying system has multiple (possibly identical/near identical) solutions. As such, the error surface has multiple local extrema. Since the gradient descent step-size is determined entirely by the gradient of the surface, such an algorithm might get stuck at local extrema. Gradient Descent with momentum modifies the updating equation by keeping track of previous updates, e.g.:
$$w_{k+1} = w_k - \lambda \nabla f(w_k) + \nu \nabla f(w_{k-1}); \quad k = 2, 3, \dots$$
where $\lambda$ is the learning rate and $\nu$ is a second control parameter that weights the gradient from the previous iteration.
You will find numerous modifications of gradient descent aimed specifically at neural nets: RPROP, GRPROP etc.
Note that momentum can severely destabilize the algorithm, a phenomenon which is easy to miss without careful analysis of convergence. ALWAYS double check output!
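
The sketch below uses the widely used 'heavy-ball' form of momentum, in which a fraction $\nu$ of the previous update (rather than the raw previous gradient) is carried forward: $v_{k+1} = \nu v_k - \lambda \nabla f(w_k)$ and $w_{k+1} = w_k + v_{k+1}$. This is a swapped-in standard variant and may differ in detail from the example equation above; the test function and the values of $\lambda$ and $\nu$ are assumptions for illustration.

```python
import numpy as np

def gd_momentum(grad, w0, lam=0.1, nu=0.9, n_iter=500):
    """Heavy-ball momentum: v <- nu * v - lam * grad f(w), then w <- w + v."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)                   # previous update, initially zero
    path = [w.copy()]
    for _ in range(n_iter):
        v = nu * v - lam * grad(w)
        w = w + v
        path.append(w.copy())
    return w, np.array(path)

# Hypothetical test function: f(w) = (wx - 1)^2 + 2 (wy + 0.5)^2, minimised at (1, -0.5).
def grad(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w_hat, path = gd_momentum(grad, w0=[-2.0, 2.0])
print(w_hat)

# Distance to the final iterate along the path: a simple convergence diagnostic.
dist = np.linalg.norm(path - w_hat, axis=1)
print(dist[:5], dist[-5:])
```

The trajectory overshoots and oscillates before settling, so inspecting a diagnostic such as `dist` is a simple way to follow the advice above about double-checking the output.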

[Figure: Gradient Descent with Momentum.]

Stochastic Gradient Descent

Stochastic gradient descent modifies the algorithm further by calculating the cost function at each iteration using only a small subset of the data. I.e., using say 10% of the available training examples at each iteration, the objective function is effectively perturbed at random, resulting in stochastic updates to the parameter trajectories as the algorithm attempts to find a minimum.
Since less data is used, this results in substantial computational efficiency gains.
This also conveniently takes care (ish) of multi-modality issues.
The claim is also that, by focusing on small subsets of the data, it is possible for models trained under such an algorithm to 'learn' nuances in the data that would have otherwise been opaque under evaluation of the full error surface.

Passing over subsets of the data led to the convention that the length of time that a 'learning' algorithm runs is expressed in terms of epochs, one epoch being a full pass over the dataset.
Often software parametrises such procedures in terms of batch-size (or 'mini-batch'), i.e., the number of observations in each subset.
At the other extreme from the full data set, it is also possible to look at a single training example. This is sometimes referred to as 'online', though formally in the literature 'online learning' is taken to mean something else.
In practice one has to manage the trade-off between computational gains and the convergence of the algorithm, so batch sizes have to be small but not so small as to result in poor convergence.
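
Epochs and mini-batches can be sketched on a small linear model fitted with the squared-error cost. Everything here is an assumption for illustration: the simulated data, the learning rate, the batch size of 20 (10% of 200 observations) and the choice to reshuffle the data once per epoch.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed, for reproducibility only

# Hypothetical data from a noisy linear model y = 2 x_1 - x_2 + 0.5 + noise.
N, p = 200, 2
X = rng.normal(size=(N, p))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = X @ true_w + true_b + 0.01 * rng.normal(size=N)

def sgd(X, y, lam=0.05, batch_size=20, n_epochs=50):
    """Mini-batch stochastic gradient descent on the cost (1/2n) * sum(residual^2)."""
    N, p = X.shape
    w, b = np.zeros(p), 0.0
    for epoch in range(n_epochs):          # one epoch = one full pass over the data
        order = rng.permutation(N)         # reshuffle once per epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]   # indices of one mini-batch
            resid = X[idx] @ w + b - y[idx]
            w = w - lam * (X[idx].T @ resid) / len(idx)
            b = b - lam * resid.mean()
    return w, b

w_hat, b_hat = sgd(X, y)
print(w_hat, b_hat)
```

With these settings each epoch performs ten parameter updates. Setting `batch_size=N` recovers ordinary gradient descent, while `batch_size=1` gives the single-example extreme mentioned above.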

[Figure: Stochastic Gradient Descent.]

References

[Nielsen 2015] Nielsen, Michael A.: Neural Networks and Deep Learning. Determination Press, 2015.
