Lec13: Neural Networks and Deep Learning
COURSE: CS60045
Pallab Dasgupta
Professor, Dept. of Computer Sc & Engg
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
Brain-inspired computing
• Simple units
• The power is in the network
Preliminaries
Neural Networks
A neural network consists of a set of nodes (neurons/units) connected by links.
[Figure: a network of units, each with a fixed bias input $a_0 = -1$.]
Perceptron
$a = g(in)$
[Figure: a single unit with bias input $x_0 = -1$ weighted by $W_0$ and input $x_1$ weighted by $W_1$, feeding the weighted sum $in$ through the activation function $g$ to produce the output $a$.]
Perceptron
[Figure: inputs $x_0 = -1$, $x_1$, $x_2$ with weights $W_0$, $W_1$, $W_2$ feed the input function $in$, which passes through the activation function $g$ to give the output $a = g(in)$.]

Linear input function:
$in = x_1 W_1 + x_2 W_2 - W_0$

Activation function:
$a = \begin{cases} 0 & \text{if } in \le 0 \\ 1 & \text{if } in > 0 \end{cases}$

AND: $W_1 = 1$, $W_2 = 1$, $W_0 = 1$, so $in = x_1 + x_2 - 1$
OR: $W_1 = 2$, $W_2 = 2$, $W_0 = 1$, so $in = 2x_1 + 2x_2 - 1$
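As a quick check of these weight choices, here is a minimal Python sketch of the threshold perceptron above. The function name and the truth-table loop are illustrative, not from the slides:

```python
def perceptron(x1, x2, W1, W2, W0):
    """Threshold unit: in = x1*W1 + x2*W2 - W0, output a = 1 if in > 0 else 0."""
    in_ = x1 * W1 + x2 * W2 - W0
    return 1 if in_ > 0 else 0

# AND: W1 = 1, W2 = 1, W0 = 1  ->  in = x1 + x2 - 1
# OR:  W1 = 2, W2 = 2, W0 = 1  ->  in = 2*x1 + 2*x2 - 1
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2),
          "AND:", perceptron(x1, x2, 1, 1, 1),
          "OR:", perceptron(x1, x2, 2, 2, 1))
```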
Multiple Layers Increase the Capacity
[Figure: a set of black and white dots in the $(x_1, x_2)$ plane.] The black and white dots are not linearly separable; that is, no linear function of the form
$in = x_1 W_1 + x_2 W_2 - W_0$
separates them.
Supervised Learning by back-propagating errors
The basic idea:
• We compute the output error as: Error = golden output ($y$) $-$ output of network ($a$)
• The training error function computed over all training data is:
$E = \frac{1}{2} \sum_i (y_i - a_i)^2$
[Figure: a training input feeds the neural network, whose output is compared against the golden output.]
Learning in Single Layered Networks
Idea: optimize the weights so as to minimize the error function:
$E = \frac{1}{2} Err^2 = \frac{1}{2} \left( y - g\!\left( \sum_{j=0}^{n} W_j x_j \right) \right)^2$

Differentiating with respect to each weight:
$\frac{\partial E}{\partial W_j} = Err \times \frac{\partial Err}{\partial W_j} = Err \times \frac{\partial}{\partial W_j}\left( y - g\!\left( \sum_{j=0}^{n} W_j x_j \right) \right) = -Err \times g'(in) \times x_j$

Weight update rule:
$W_j \leftarrow W_j + \alpha \times Err \times g'(in) \times x_j$
where $\alpha$ is the learning rate.

We purposefully eliminate a fraction of the error through the weight adjustment rule, but not the whole of it. Why?
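For concreteness, here is a hedged NumPy sketch of one application of this update rule. The slide's step activation is not differentiable, so the sketch assumes a sigmoid $g$, for which $g'(in) = g(in)(1 - g(in))$; the function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, x, y, alpha=0.1):
    """One step of W_j <- W_j + alpha * Err * g'(in) * x_j on a single example.
    x is the input vector including the bias component x_0 = -1."""
    in_ = W @ x                 # in = sum_j W_j x_j
    a = sigmoid(in_)            # a = g(in)
    err = y - a                 # Err = y - a
    g_prime = a * (1.0 - a)     # g'(in) for the sigmoid
    return W + alpha * err * g_prime * x
```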
Multi-Layer Feed-Forward Network
[Figure: input units $I_k$ connect through weights $W_{k,j}$ to hidden units $a_j$, which connect through weights $W_{j,i}$ to output units $O_i$.]

Weight update rule at the output layer:
$W_{j,i} \leftarrow W_{j,i} + \alpha \times Err_i \times g'(in_i) \times a_j$
(same as for a single layer, with the hidden-unit activations $a_j$ in place of the inputs $x_j$)

In multilayer networks, the hidden layers also contribute to the error at the output.
• So the important question is: how do we revise the weights of the hidden layers?
Back-Propagation Learning
The mathematics behind the update rule
The squared error on a single example is defined as:
$E = \frac{1}{2} \sum_i (y_i - a_i)^2$
where the sum is over the nodes in the output layer. To obtain the gradient with respect to a specific weight $W_{j,i}$ in the output layer, we need only expand out the activation $a_i$, as all other terms in the summation are unaffected by $W_{j,i}$. This yields the output-layer error term $\Delta_i = Err_i \times g'(in_i)$ and the update rule given earlier. Propagating back to a hidden-layer weight $W_{k,j}$:

$\frac{\partial E}{\partial W_{k,j}} = -\sum_i \Delta_i W_{j,i}\, g'(in_j)\, \frac{\partial}{\partial W_{k,j}} \left( \sum_k W_{k,j}\, a_k \right) = -\sum_i \Delta_i W_{j,i}\, g'(in_j)\, a_k = -a_k \Delta_j$

where $\Delta_j = g'(in_j) \sum_i W_{j,i} \Delta_i$. This gives the hidden-layer weight update rule:
$W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j$

[Figure: the path from $a_k$ through $W_{k,j}$, hidden unit $a_j$, and $W_{j,i}$ to the output layer.]
Problems with this Learning
Convolutional and Recurrent Neural Networks
• Convolution is useful for learning artifacts that have a small locality of reference
• Recurrence is useful for learning sequences
The Convolution Operation
$s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da$
This operation is called convolution. The first argument, $x(\cdot)$, is called the input, and the second argument, $w(\cdot)$, is called the kernel.
Discrete Convolution
If we assume that $x$ and $w$ are defined only on integer $t$, we can define discrete convolution:
$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$

Convolution can also be defined over more than one axis at a time. For example, if we use a two-dimensional image $I$ as our input, we may want to use a two-dimensional kernel $K$:
$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n)$

Convolution is commutative; that is, we can also write (by replacing $m$ by $i - m$ and $n$ by $j - n$):
$S(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n)$
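A minimal NumPy sketch of the two-dimensional case, evaluated at the "valid" offsets where the kernel fits entirely inside the image. Flipping the kernel implements true convolution; without the flip this would be the cross-correlation that most deep-learning libraries actually compute:

```python
import numpy as np

def conv2d(I, K):
    """S(i, j) = sum_m sum_n I(m, n) * K(i - m, j - n), 'valid' region only."""
    kh, kw = K.shape
    K_flip = K[::-1, ::-1]          # kernel flip: convolution, not correlation
    H, W = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K_flip)
    return S
```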
Convolutional networks help us to learn image filters
If the kernel width is small, the network will be sparse
Convolution and Pooling
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
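A matching sketch of a pooling function, with the maximum as the summary statistic; the 2x2 window and stride are illustrative defaults, and mean pooling would simply replace `.max()` with `.mean()`:

```python
import numpy as np

def max_pool(S, size=2, stride=2):
    """Replace each size-by-size neighbourhood of S with its maximum."""
    H = (S.shape[0] - size) // stride + 1
    W = (S.shape[1] - size) // stride + 1
    P = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            P[i, j] = S[i * stride:i * stride + size,
                        j * stride:j * stride + size].max()
    return P
```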
Sequence Modeling: Recurrent and Recursive Networks
• Recurrent Neural Networks (RNNs) are a family of neural networks for processing sequential data
• Recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization
• Most recurrent networks can also process sequences of variable length
• The key idea behind RNNs is parameter sharing
• For example, in a dynamical system, the parameters of the transfer function do not change with time
• Therefore we can use the same part of the neural network over and over again
Unfolding Computation
Consider a dynamical system:
$s^{(t)} = f(s^{(t-1)}; \theta)$
where $s^{(t)}$ is the state at time $t$ and $\theta$ is the set of parameters of $f$.
• The state after a finite number of steps can be obtained by applying the definition recursively. For example, after 3 steps:
$s^{(3)} = f(s^{(2)}; \theta) = f(f(s^{(1)}; \theta); \theta)$
Unfolding computation and Recurrent Network
$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$
• Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of the transition from one state to another, rather than in terms of a variable-length history of states
• It is possible to use the same transition function $f$ with the same parameters at each step
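A minimal sketch of unfolding this recurrence. Taking $f$ to be a tanh of an affine map is an assumption (the classic vanilla-RNN cell), and the parameter names are illustrative; note that the same parameters $\theta = (W_{hh}, W_{xh}, b)$ are reused at every time step, whatever the sequence length:

```python
import numpy as np

def rnn_forward(x_seq, h0, W_hh, W_xh, b):
    """Unfold h(t) = f(h(t-1), x(t); theta) over an input sequence."""
    h, states = h0, []
    for x_t in x_seq:                           # any sequence length works
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)  # same theta at every step
        states.append(h)
    return states
```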
Useful topologies of RNNs
• RNNs that produce an output at each time step and have recurrent connections between hidden units
• RNNs that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step
• RNNs with recurrent connections between hidden units, that read an entire sequence and then produce a single output
RNN with hidden-hidden feedback
[Figure omitted. Source: DARPA]
The ML problem in regression
What is the function $f(\cdot)$?
Classification
Given a training data set with:
• Input values: $\mathbf{x}_n = [x_1\ x_2\ \dots\ x_M]^T$ for $n = 1 \dots N$
• Output class labels, for example:
• 0/1 or $-1$/$+1$ for binary classification problems
• $1 \dots K$ for multi-class classification problems
• 1-of-K coding scheme:
$\mathbf{y} = [0 \dots 0\ 1\ 0 \dots 0]^T$
where, if $\mathbf{x}_n$ belongs to class $k$, then the $k$th bit is 1 and all others are 0.
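A small illustration of the 1-of-K coding scheme, with classes numbered $1 \dots K$ as above:

```python
import numpy as np

def one_hot(k, K):
    """Return the 1-of-K code for class k: the kth bit is 1, all others 0."""
    y = np.zeros(K, dtype=int)
    y[k - 1] = 1                  # classes are numbered 1..K
    return y

print(one_hot(3, 5))              # [0 0 1 0 0]
```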
Classification strategies
• Linear discriminants (2-class classifiers)
• K-class discriminant