
LSTM & GRU

DSE 3151 DEEP LEARNING

Dr. Rohini Rao & Dr. Abhilash K Pai


Dept. of Data Science and Computer Applications
MIT Manipal
LSTM and GRU : Introduction

• The state (s_t) of an RNN records information from all previous time steps.

• At each new time step, the old information gets morphed by the current input.

• After t steps, the information stored at time step t-k (for some k < t) gets completely morphed.

• It would be impossible to extract the original information stored at time step t-k.

• Also, there is the vanishing gradients problem!

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM and GRU : Introduction

The whiteboard analogy:

• Consider a scenario where we have to evaluate an expression on a whiteboard:

  Evaluate "ac(bd + a) + ad"
  given that a = 1, b = 3, c = 5, d = 11

• Normally, the evaluation on the whiteboard would look like:

  ac = 5
  bd = 33
  bd + a = 34
  ac(bd + a) = 170
  ad = 11
  ac(bd + a) + ad = 181

• Now, if the whiteboard has space to accommodate only 3 steps, the above evaluation cannot fit in the required space and would lead to loss of information.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM and GRU : Introduction

• A solution is to be selective about what gets written to, read from, and erased from the whiteboard while evaluating "ac(bd + a) + ad", given that a = 1, b = 3, c = 5, d = 11:

• Selectively write: record only the intermediate results that will be needed later:

  ac = 5
  bd = 33

• Selectively read: read ac = 5 and bd = 33 to compute and write the next result:

  bd + a = 34

  Now the board is full (it holds ac = 5, bd = 33 and bd + a = 34).

• So, selectively forget: before writing each new result, erase whatever is no longer needed:

  forget bd = 33, write ac(bd + a) = 170      → board: ac = 5, bd + a = 34, ac(bd + a) = 170
  forget bd + a = 34, write ad = 11           → board: ac = 5, ac(bd + a) = 170, ad = 11
  forget ac = 5, write ac(bd + a) + ad = 181  → board: ac(bd + a) + ad = 181, ac(bd + a) = 170, ad = 11

Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write and forget.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM and GRU : Introduction

Example: predicting the sentiment of a review

• The RNN reads the document from left to right and updates the state after every word.

• By the time we reach the end of the document, the information obtained from the first few words is completely lost.

• In our improvised network, ideally, we would like to:

  • Forget the information added by stop words (a, the, etc.)

  • Selectively read the information added by previous sentiment-bearing words (awesome, amazing, etc.)

  • Selectively write new information from the current word to the state.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM and GRU : Introduction

Selectively write:

• In an RNN, the state s_t is computed as:

  s_t = σ(W s_t-1 + U x_t + b)

• Instead of passing s_t-1 as it is, we want to pass (write) only some portions of it to the next time step.

• To do this, we introduce a vector o_t-1 which decides what fraction of each element of s_t-1 should be passed to the next state. Each element of o_t-1 (restricted to be between 0 and 1) gets multiplied with s_t-1:

  h_t-1 = o_t-1 ⊙ σ(s_t-1)

• How does the RNN know what fraction of the state to pass on? It has to learn o_t-1 along with the other parameters (W, U, V):

  o_t-1 = σ(W_o h_t-2 + U_o x_t-1 + b_o)

  The new parameters to be learned are W_o, U_o and b_o.

• o_t is called the output gate, as it decides how much to pass (write) to the next time step.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM and GRU : Introduction

Selectively read:

• We now use h_t-1 and x_t to compute a candidate new state at time step t:

  s̃_t = σ(W h_t-1 + U x_t + b)

• Again, to pass only useful information from s̃_t to s_t, we selectively read from it before constructing the new cell state.

• To do this, we introduce another gate, called the input gate:

  i_t = σ(W_i h_t-1 + U_i x_t + b_i)

• And use i_t ⊙ s̃_t to selectively read the information.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM and GRU : Introduction

Selectively forget:

• How do we combine s_t-1 and i_t ⊙ s̃_t to get the new state?

• We may not want to use the whole of s_t-1, but forget some parts of it. To do this, a forget gate is introduced:

  f_t = σ(W_f h_t-1 + U_f x_t + b_f)

• The new state is then:

  s_t = f_t ⊙ s_t-1 + i_t ⊙ s̃_t
  h_t = o_t ⊙ σ(s_t)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
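To make the three selective operations concrete, here is a minimal numeric sketch (not from the original slides). The gate values are hand-picked fractions between 0 and 1 purely for illustration; in the actual cell they are produced by the sigmoid expressions above, and tanh is used here as the squashing nonlinearity.

    import numpy as np

    s_prev  = np.array([0.9, -0.4, 0.7])   # s_t-1 : previous cell state (size 3 for illustration)

    # selectively write: the output gate decides what fraction of each element leaves the cell
    o_prev  = np.array([1.0, 0.2, 0.0])    # pass 100%, 20% and 0% of the corresponding elements
    h_prev  = o_prev * np.tanh(s_prev)     # h_t-1 = o_t-1 * tanh(s_t-1)

    # selectively read: the input gate decides how much of the candidate state to take in
    s_tilde = np.array([0.5, 0.8, -0.3])   # candidate state (in the cell, computed from h_t-1 and x_t)
    i_t     = np.array([0.9, 0.0, 1.0])
    read    = i_t * s_tilde

    # selectively forget: the forget gate decides how much of the old state to keep
    f_t     = np.array([0.0, 1.0, 0.5])
    s_t     = f_t * s_prev + read          # s_t = f_t * s_t-1 + i_t * s~_t
    print(s_t)                             # [ 0.45 -0.4   0.05]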
LSTM (Long Short-Term Memory)

[Figure: the complete LSTM cell, showing the long-term memory (cell state) path and the short-term memory (hidden state) path through the gates.]

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM (Long Short-Term Memory)

• The LSTM has many variants, which differ in the number of gates and in how the gates are arranged.

• A popular variant of the LSTM is the Gated Recurrent Unit (GRU).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
LSTM Cell

▪ The neuron is called a cell.

▪ FC denotes fully connected layers.

▪ The long-term state c(t-1) traverses the cell through a forget gate, dropping some memories, and then some new memories are added, giving c(t).

▪ The long-term state c(t) is also passed through tanh and filtered by an output gate, which produces the short-term state h(t).

▪ The main layer g(t) takes the current input x(t) and the previous short-term state h(t-1).

▪ The important parts of the output of g(t) go into the long-term state.

▪ Note: c_t here is the same as s_t in the earlier notation.

Source: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O'Reilly Media, Inc., 2019.
LSTM Cell

▪ Gating mechanism: regulates the information that the network stores.

▪ The other 3 layers are gate controllers:

  ▪ The forget gate f(t) controls which parts of the long-term state should be erased.

  ▪ The input gate i(t) controls which parts of g(t) should be added to the long-term state.

  ▪ The output gate o(t) controls which parts of the long-term state should be read and output at this time step, both to h(t) and to y(t).

Source: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O'Reilly Media, Inc., 2019.
LSTM computations

An LSTM cell can learn to recognize an important input (role of the input gate), store it in the long-term state, preserve it for as long as it is needed (role of the forget gate), and extract it whenever it is needed.

  i(t) = σ(Wxi x(t) + Whi h(t-1) + bi)
  f(t) = σ(Wxf x(t) + Whf h(t-1) + bf)
  o(t) = σ(Wxo x(t) + Who h(t-1) + bo)
  g(t) = tanh(Wxg x(t) + Whg h(t-1) + bg)
  c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ g(t)
  y(t) = h(t) = o(t) ⊙ tanh(c(t))

where:

• Wxi, Wxf, Wxo, Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).

• Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the previous short-term state h(t-1).

• bi, bf, bo, and bg are the bias terms for each of the four layers.

Source: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O'Reilly Media, Inc., 2019.
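As an illustration (not part of the original slides), the six equations above can be transcribed directly into NumPy. The toy sizes and random parameter values are assumptions made purely so the sketch runs; a real layer would learn these parameters by backpropagation.

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_units = 3, 5                           # toy sizes

    def dense_params():
        """Input weights, recurrent weights and bias for one FC layer of the cell."""
        return (rng.normal(scale=0.1, size=(n_inputs, n_units)),
                rng.normal(scale=0.1, size=(n_units, n_units)),
                np.zeros(n_units))

    (Wxi, Whi, bi), (Wxf, Whf, bf), (Wxo, Who, bo), (Wxg, Whg, bg) = (dense_params() for _ in range(4))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev):
        i_t = sigmoid(x_t @ Wxi + h_prev @ Whi + bi)   # input gate
        f_t = sigmoid(x_t @ Wxf + h_prev @ Whf + bf)   # forget gate
        o_t = sigmoid(x_t @ Wxo + h_prev @ Who + bo)   # output gate
        g_t = np.tanh(x_t @ Wxg + h_prev @ Whg + bg)   # main (candidate) layer
        c_t = f_t * c_prev + i_t * g_t                 # new long-term state
        h_t = o_t * np.tanh(c_t)                       # new short-term state, also the output y(t)
        return h_t, c_t

    h, c = np.zeros(n_units), np.zeros(n_units)
    for x_t in rng.normal(size=(4, n_inputs)):         # run over a toy sequence of 4 time steps
        h, c = lstm_step(x_t, h, c)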
Gated Recurrent Unit (GRU)

Gates:

  o_t = σ(W_o s_t-1 + U_o x_t + b_o)
  i_t = σ(W_i s_t-1 + U_i x_t + b_i)

States:

  s̃_t = σ(W (o_t ⊙ s_t-1) + U x_t + b)
  s_t = (1 - i_t) ⊙ s_t-1 + i_t ⊙ s̃_t

• There is no explicit forget gate (the forget gate and the input gate are tied: keeping the fraction (1 - i_t) of the old state plays the role of forgetting).

• The gates depend directly on s_t-1, and not on the intermediate h_t-1 as in the case of LSTMs.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Gated Recurrent Unit Cell (Kyunghyun Cho et al., 2014)

The main simplifications compared to the LSTM cell are:

• Both state vectors (short-term and long-term) are merged into a single vector h(t).

• A single gate controller z(t) controls both the forget gate and the input gate:
  • If the gate controller outputs 1, the forget gate is open and the input gate is closed.
  • If it outputs 0, the opposite happens.
  • In other words, whenever a memory must be written, the location where it will be stored is erased first.

• There is no output gate; the full state vector is output at every time step.

• A reset gate controller r(t) controls which part of the previous state is shown to the main layer g(t).

Source: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O'Reilly Media, Inc., 2019.
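A minimal NumPy sketch of this cell (an illustration, not code from the book): it follows the convention described above, in which z(t) = 1 keeps the old state and z(t) = 0 replaces it with the output of the main layer. The sizes and random parameters are assumptions so the snippet runs.

    import numpy as np

    rng = np.random.default_rng(1)
    n_inputs, n_units = 3, 5                                  # toy sizes

    def layer_params():
        """Input weights, recurrent weights and bias for one FC layer of the cell."""
        return (rng.normal(scale=0.1, size=(n_inputs, n_units)),
                rng.normal(scale=0.1, size=(n_units, n_units)),
                np.zeros(n_units))

    (Wxz, Whz, bz), (Wxr, Whr, br), (Wxg, Whg, bg) = (layer_params() for _ in range(3))

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def gru_step(x_t, h_prev):
        z_t = sigmoid(x_t @ Wxz + h_prev @ Whz + bz)          # controls both "forget" and "input"
        r_t = sigmoid(x_t @ Wxr + h_prev @ Whr + br)          # reset gate
        g_t = np.tanh(x_t @ Wxg + (r_t * h_prev) @ Whg + bg)  # main layer sees only part of h_t-1
        h_t = z_t * h_prev + (1.0 - z_t) * g_t                # z=1 keeps the old state, z=0 writes g_t
        return h_t                                            # no output gate: h_t is also the output

    h = np.zeros(n_units)
    for x_t in rng.normal(size=(4, n_inputs)):                # a toy sequence of 4 time steps
        h = gru_step(x_t, h)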
LSTM vs GRU computation

• GRU performance is good, but there may be a slight dip in accuracy compared to the LSTM.

• However, the GRU has fewer trainable parameters, which makes it advantageous to use.
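A quick way to see the difference in parameter count is to build one layer of each kind in Keras and count their weights. This is an illustrative check, not part of the slides; the layer sizes are arbitrary.

    import numpy as np
    import tensorflow as tf

    n_inputs, n_units = 8, 32
    dummy = np.zeros((1, 10, n_inputs), dtype="float32")   # one sequence of 10 time steps

    lstm = tf.keras.layers.LSTM(n_units)
    gru = tf.keras.layers.GRU(n_units)
    lstm(dummy)                                            # calling the layers builds their weights
    gru(dummy)

    print("LSTM parameters:", lstm.count_params())         # 4 * (8*32 + 32*32 + 32) = 5248
    print("GRU parameters: ", gru.count_params())          # roughly 3/4 of the LSTM: 3 layers instead of 4
                                                           # (exact value depends on the GRU variant,
                                                           #  e.g. Keras' reset_after option)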
Avoiding vanishing gradients with LSTMs: Intuition

• During forward propagation, the gates control the flow of information. They prevent any irrelevant information from being written to the state.

• Similarly, during backward propagation, the gates control the flow of gradients: during the backward pass the gradients get multiplied by the gate values.

• If the state at time step t-1 did not contribute much to the state at time step t (i.e., the forget gate was close to 0), then during backpropagation the gradients flowing into s_t-1 will vanish, which is exactly the desired behaviour.

• The key difference from vanilla RNNs is that the flow of information and gradients is controlled by the gates, which ensure that the gradients vanish only when they should.
Different RNNs

• Vanilla RNNs.

• Image captioning: image -> sequence of words.

• Sentiment classification: sequence of words -> sentiment.

• Machine translation: sequence of words -> sequence of words.

• Video classification at the frame level.
Deep RNNs

• RNNs that are deep not only in the time direction but also in the input-to-output direction: multiple recurrent layers are stacked, and each layer passes its sequence of outputs to the layer above.

Source: Deep Recurrent Neural Networks - Dive into Deep Learning documentation (d2l.ai)
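A minimal Keras sketch of a stacked (deep) RNN, added here as an illustration; the layer sizes and the final Dense head are arbitrary choices. Every recurrent layer except the last must return its full output sequence so that the layer above receives one vector per time step.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.GRU(32, return_sequences=True),   # passes the whole sequence upwards
        tf.keras.layers.GRU(32, return_sequences=True),
        tf.keras.layers.GRU(32),                          # last recurrent layer: only the final state
        tf.keras.layers.Dense(1),                         # e.g. a single regression target per sequence
    ])
    model.compile(loss="mse", optimizer="adam")
    # model.fit(X_train, y_train, epochs=5)   # X_train shape: (batch, time_steps, n_features)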
Bi-Directional RNNs: Intuition

• In a plain (unidirectional) RNN, the output at the third time step (where the input is the word "apple") depends only on the previous two inputs.

• Adding an additional backward layer, with connections as shown in the figure, makes the output at a time step depend on both previous and future inputs.

Source: codebasics - YouTube
Bi-directional RNNs

▪ Example (speech recognition / filling in a missing word): the right word may depend on what comes after it:

  ▪ I am ___.
  ▪ I am ___ hungry.
  ▪ I am ___ hungry, and I can eat half a cake.

▪ Regular RNNs are causal: they look only at past and present inputs to generate an output.

▪ A bi-directional RNN uses 2 recurrent layers on the same inputs:

  ▪ One reading the words from left to right.
  ▪ Another reading the words from right to left.
  ▪ Their outputs are combined at each time step.

Source: Bidirectional Recurrent Neural Networks - Dive into Deep Learning documentation (d2l.ai)
Bi-directional RNN computation

  H_t(forward)  = A(X_t * W_XH(forward)  + H_t-1(forward)  * W_HH(forward)  + b_H(forward))

  H_t(backward) = A(X_t * W_XH(backward) + H_t+1(backward) * W_HH(backward) + b_H(backward))

where A is the activation function, W denotes a weight matrix and b a bias.

The output at any given time step is:

  Y_t = H_t * W_AY + b_y

where H_t is the concatenation of H_t(forward) and H_t(backward).

Source: Bidirectional Recurrent Neural Network - GeeksforGeeks
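In practice this forward/backward pair does not have to be written by hand; for instance, Keras provides a Bidirectional wrapper. The sketch below is an illustration with arbitrary layer sizes, set up for a sequence-level prediction such as sentiment.

    import tensorflow as tf

    model = tf.keras.Sequential([
        # each Bidirectional layer runs one copy of the wrapped RNN left-to-right and another
        # right-to-left, then concatenates their outputs, so its output size is 2 * units
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(1, activation="sigmoid"),    # e.g. sentiment of the whole sequence
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam")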
Generating Shakespearean Text using a Character RNN

▪ "The Unreasonable Effectiveness of Recurrent Neural Networks" - Andrej Karpathy (2015).

▪ Chop the sequential dataset into multiple windows.

▪ A 3-layer RNN with 512 hidden nodes on each layer.

▪ The Char-RNN was trained on Shakespeare's work and used to generate novel text, one character at a time:

  PANDARUS:
  Alas, I think he shall be come approached and the day
  When little srain would be attain'd into being never fed,
  And who is but a chain and subjects of his death,
  I should not sleep.
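A sketch of the "chop into windows" step using the tf.data API (an illustration, not the exact course code). It assumes `encoded` is a 1-D array of integer character IDs produced by some tokenizer; each window of 101 characters yields 100 input characters and the 100 corresponding next-character targets.

    import tensorflow as tf

    n_steps = 100
    window_length = n_steps + 1                       # inputs plus one shifted target character

    dataset = tf.data.Dataset.from_tensor_slices(encoded)                  # `encoded`: integer char IDs
    dataset = dataset.window(window_length, shift=1, drop_remainder=True)  # overlapping windows
    dataset = dataset.flat_map(lambda w: w.batch(window_length))           # nested datasets -> tensors
    dataset = dataset.shuffle(10_000).batch(32)
    dataset = dataset.map(lambda window: (window[:, :-1], window[:, 1:]))  # X = chars, y = next chars
    dataset = dataset.prefetch(tf.data.AUTOTUNE)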
Stateful RNN

▪ Stateless RNNs:
  ▪ At each training iteration the model starts with a hidden state full of 0s.
  ▪ It updates this state at each time step.
  ▪ It discards the final state when moving on to the next training batch.

▪ Stateful RNNs:
  ▪ Use sequential, non-overlapping input sequences.
  ▪ Preserve the final state after processing one training batch and use it as the initial state for the next training batch.
  ▪ This way the model can learn long-term patterns despite only backpropagating through short sequences.
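A minimal sketch of a stateful character model, in the tf.keras 2 style used in Géron's book; the layer sizes, batch size and vocabulary size are placeholder values. The batch size must be fixed via batch_input_shape, because sequence i of one batch must continue sequence i of the previous batch, and the preserved states should be reset whenever the text starts over (e.g. at the start of each epoch).

    import tensorflow as tf

    batch_size, n_tokens = 32, 39          # placeholder batch size and vocabulary size

    model = tf.keras.Sequential([
        tf.keras.layers.GRU(128, return_sequences=True, stateful=True,
                            batch_input_shape=[batch_size, None, n_tokens]),
        tf.keras.layers.GRU(128, return_sequences=True, stateful=True),
        tf.keras.layers.Dense(n_tokens, activation="softmax"),   # next-character probabilities
    ])

    class ResetStatesCallback(tf.keras.callbacks.Callback):
        """Reset the preserved states at the beginning of every epoch."""
        def on_epoch_begin(self, epoch, logs=None):
            self.model.reset_states()

    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    # model.fit(stateful_dataset, epochs=10, callbacks=[ResetStatesCallback()])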
