

Intelligent Computing

(MIT 554)
Unit 6: Deep Learning

1
Deep Learning
• Deep learning is a method in artificial intelligence (AI) that teaches
computers to process data in a way that is inspired by the human brain
• It is capable of learning complex patterns and relationships within data
• Also known as deep neural networks
• Based on artificial neural network architecture
• Types of deep neural networks
1. FNN (Feedforward Neural Network)
2. CNN (Convolutional Neural Network)
3. RNN (Recurrent Neural Network)

2
Deep Learning…
• Create algorithms
• That can understand scenes and describe them in natural
language
• That can infer semantic concepts to allow machines to
interact with humans using these concepts
• Requires creating a series of abstractions
• Image (Pixel Intensities) → Objects in Image → Object Interactions → Scene Description
• Deep learning aims to automatically learn these
abstractions with little supervision

3
Activation Functions
• Propagate the output of one layer’s nodes forward to the next layer (up
to and including the output layer)
• Approaches
• Linear

• Sigmoid

• Tanh

• Rectified Linear

• Leaky ReLU
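As a quick illustration (not from the slides), the activation functions listed above can each be written in a few lines of NumPy; the test values are arbitrary:

```python
import numpy as np

def linear(x):                   # identity: passes the weighted sum through unchanged
    return x

def sigmoid(x):                  # squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # squashes values into (-1, 1), zero-centred
    return np.tanh(x)

def relu(x):                     # rectified linear: max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # lets a small gradient through for x < 0
    return np.where(x > 0, x, alpha * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), leaky_relu(z))
```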
4
Loss Functions
• Loss functions quantify how close a given neural network is to the
ideal toward which it is training
• We calculate a metric based on the error we observe in the network’s predictions, then aggregate these errors over the entire dataset and average them; the result is a single number representing how close the neural network is to its ideal
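For illustration, a minimal sketch of two widely used loss functions, averaged over a small dataset exactly as described above (mean squared error for regression, cross-entropy for classification); the variable names and toy values are ours:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one-hot targets and predicted class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Tiny example: 3 samples, 2 classes
y_true = np.array([[1, 0], [0, 1], [1, 0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(cross_entropy_loss(y_true, y_pred))
```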

5
Hyperparameters
• Parameters we tune to make networks train better and faster are called hyperparameters; they deal with controlling the optimization process
• Hyperparameter selection focuses on ensuring that the model neither underfits nor overfits the training dataset, while learning the structure of the data as quickly as possible
• Some categories of hyperparameters (see the configuration sketch after this list)
• Layer size
• Magnitude (momentum, learning rate)
• Regularization (dropout, drop connect, L1, L2)
• Activations (and activation function families)
• Weight initialization strategy
• Loss functions
• Settings for epochs during training (mini-batch size)
• Normalization scheme for input data (vectorization)
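To make the categories above concrete, hyperparameters are typically gathered into one configuration that the training code reads; the names and values below are purely illustrative, not taken from the slides:

```python
# Illustrative hyperparameter configuration (example names/values only)
hyperparams = {
    "layer_sizes": [784, 256, 128, 10],   # layer size
    "learning_rate": 0.01,                # magnitude
    "momentum": 0.9,                      # magnitude
    "dropout_rate": 0.5,                  # regularization
    "l2_lambda": 1e-4,                    # regularization
    "activation": "relu",                 # activation function family
    "weight_init": "he_normal",           # weight initialization strategy
    "loss": "cross_entropy",              # loss function
    "epochs": 20,                         # training epochs
    "mini_batch_size": 64,                # mini-batch size
    "input_normalization": "z_score",     # normalization scheme for input data
}
```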

6
Hyperparameters…
• Learning Rate
• The learning rate controls the amount by which parameters are adjusted during optimization in order to minimize the error of the neural network’s guesses
• Momentum
• Momentum helps the learning algorithm get out of spots in the search space where it would otherwise become
stuck
• Regularization
• Regularization is a measure taken against overfitting
• Overfitting occurs when a model describes the training set but cannot generalize well over new inputs
• Overfitted models have no predictive capacity for data that they haven’t seen
• Dropout and DropConnect mute parts of the input to each layer, such that the neural network learns other portions
• Dropout is driven by randomly dropping a neuron so that it will not contribute to the forward pass and backpropagation
• DropConnect does the same thing as Dropout, but instead of choosing a hidden unit, it mutes the connection between two neurons
• The penalty methods L1 and L2, in contrast, are a way of preventing the neural network parameter space from getting too big in one direction; they make large weights smaller
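A minimal NumPy sketch of how these hyperparameters interact in a single parameter update; it assumes a gradient `grad` has already been computed by backpropagation, and it is an illustration rather than a full training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, l2_lambda=1e-4):
    """One update of SGD with momentum and an L2 penalty."""
    grad = grad + l2_lambda * w                 # L2 penalty pulls large weights toward zero
    velocity = momentum * velocity - lr * grad  # momentum keeps part of the previous direction
    return w + velocity, velocity

def dropout(activations, rate=0.5):
    """Randomly mute a fraction of activations during training (inverted dropout)."""
    mask = (rng.random(activations.shape) > rate) / (1.0 - rate)
    return activations * mask

w = rng.normal(size=(4,))
v = np.zeros_like(w)
grad = rng.normal(size=(4,))
w, v = sgd_momentum_step(w, grad, v)
print(w, dropout(np.ones(8)))
```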
7
Optimization Algorithms
• Training a model in machine learning involves finding the best set of
values for the parameter vector of the model
• Machine learning can therefore be viewed as an optimization problem in which we minimize the loss function with respect to the parameters of our prediction function (based on our model)
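As a concrete illustration of training as optimization, here is a tiny gradient-descent loop that fits a linear model by minimizing mean squared error; the synthetic data and model are made up for the example:

```python
import numpy as np

# Synthetic data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + 0.1 * rng.normal(size=100)

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_pred = w * x + b
    error = y_pred - y
    loss = np.mean(error ** 2)            # the loss function being minimized
    grad_w = 2 * np.mean(error * x)       # d(loss)/dw
    grad_b = 2 * np.mean(error)           # d(loss)/db
    w -= lr * grad_w                      # move the parameters against the gradient
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, loss={loss:.4f}")
```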

8
Major Architectures of Deep Networks
• GAN
• CNN
• RNN
  • LSTM
  • GRU

9
Generative Adversarial Networks
• A generative adversarial network (GAN) is a class of machine learning
frameworks and a prominent framework for approaching generative AI
• Generative modeling is an unsupervised learning task in machine learning
that involves automatically discovering and learning the regularities or
patterns in input data in such a way that the model can be used to generate
or output new examples that plausibly could have been drawn from the
original dataset
• GANs are a clever way of training a generative model by framing the problem
as a supervised learning problem with two sub-models:
• The generator model that we train to generate new examples, and
• The discriminator model that tries to classify examples as either real (from the
domain) or fake (generated)
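A minimal PyTorch sketch of the two-sub-model setup described above (generator versus discriminator); the toy "real data" distribution, network sizes, and step count are made up for illustration:

```python
import torch
import torch.nn as nn

# Toy "real" data: 2-D points from a Gaussian the generator must learn to imitate
def real_batch(n=64):
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, 2.0])

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))                 # generator: noise -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator: sample -> P(real)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(1000):
    # Train the discriminator: classify real samples as 1 and generated samples as 0
    real = real_batch()
    fake = G(torch.randn(64, 8)).detach()          # detach so G is not updated in this step
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to fool the discriminator into predicting 1 for fakes
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(f"d_loss={d_loss.item():.3f}, g_loss={g_loss.item():.3f}")
```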
10
Generative Adversarial Networks…
• Generative
• To learn a generative model, which describes how data is generated
• Adversarial
• The word adversarial refers to setting one thing up against another
• This means that, in the context of GANs, the generative result is compared
with the actual images in the data set
• A mechanism known as a discriminator is used to apply a model that attempts
to distinguish between real and fake images
• Networks
• Use deep neural networks as artificial intelligence (AI) algorithms for training
purposes
11
CNN (Convolution Neural Network)
• The goal of a CNN is to learn higher-order features in the data via convolutions
• They are well suited to object recognition with images and consistently top
image classification competitions
• They can identify faces, individuals, street signs, platypuses, and many other
aspects of visual data
• A convolutional neural network can have tens or hundreds of layers that each
learn to detect different features of an image
• Filters are applied to each training image at different resolutions, and the
output of each convolved image is used as the input to the next layer
• The filters can start as very simple features, such as brightness and edges, and
increase in complexity to features that uniquely define the object
12
Convolutional Neural Network
Architecture
• A CNN typically has three types of layers:
1. A convolutional layer
2. A pooling layer, and
3. A fully connected layer

13
Convolutional Neural Network
Architecture…
• Convolution Layer
• Convolution puts the input images through a set of convolutional filters, each of which activates certain features from the images
• This layer performs a dot product between two matrices, where one matrix is the set of learnable parameters known as a kernel, and the other matrix is the restricted portion of the receptive field
• The kernel is spatially smaller than the image but extends through its full depth (all input channels)
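A minimal NumPy sketch of the dot product between a kernel and each restricted portion of the input (single channel, stride 1, no padding); the image and kernel values are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly cross-correlation, as in most CNN libraries)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + kh, j:j + kw]   # restricted portion of the input
            out[i, j] = np.sum(receptive_field * kernel)  # dot product with the kernel
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])   # a simple vertical-edge detector
print(conv2d(image, edge_kernel))
```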

14
Convolutional Neural Network
Architecture…
• Convolution Layer
• Example

15
Convolutional Neural Network
Architecture…
• Convolution Layer
• Convolution Kernels

16
Convolutional Neural Network
Architecture…
• Pooling Layer
• The pooling layer replaces the output of the network at certain locations with a summary statistic of the nearby outputs
• This helps reduce the spatial size of the representation, which decreases the required amount of computation and number of weights
• The pooling operation is applied to every slice of the representation individually
• There are several pooling functions; the most popular is max pooling, which reports the maximum output from the neighborhood
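A NumPy sketch of 2×2 max pooling applied to one slice of the representation; the window size and the input values are illustrative:

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Report the maximum output from each pooling window."""
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 0, 8]], dtype=float)
print(max_pool2d(fmap))   # -> [[6. 4.] [7. 9.]]
```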

17
Convolutional Neural Network
Architecture…
• Fully Connected Layer
• A dense network of neurons in which every neuron is connected to every neuron in the adjacent layers
• Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layers
• Learns patterns from the extracted features
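A fully connected layer is essentially a matrix multiplication plus a bias, followed by an activation; a minimal NumPy sketch with illustrative shapes:

```python
import numpy as np

def dense(x, W, b, activation=lambda z: np.maximum(0.0, z)):
    """Fully connected layer: every input unit contributes to every output unit."""
    return activation(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 128))          # flattened features from the conv/pool layers
W = rng.normal(size=(128, 10)) * 0.01  # one weight per (input unit, output unit) pair
b = np.zeros(10)
print(dense(x, W, b).shape)            # (1, 10) class scores
```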

18
Convolutional Neural Network
Architecture…
• Fully Connected Layer

19
RNN (Recurrent Neural Network)
• A recurrent neural network (RNN) is a deep learning
model that is trained to process and convert a
sequential data input into a specific sequential data
output
• Sequential data is data—such as words, sentences, or
time-series data—where sequential components
interrelate based on complex semantics and syntax
rules
• Here:
• x_1, x_2, x_3, …, x_t represent the input words from the text,
• y_1, y_2, y_3, …, y_t represent the predicted next words, and
• h_0, h_1, h_2, h_3, …, h_t hold the information for the previous input words
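A minimal NumPy sketch of the recurrence described above: the hidden state h_t carries information about all previous inputs forward through the sequence (weight shapes and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16
W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden -> hidden (the recurrence)
b_h = np.zeros(d_hidden)

def rnn_step(x_t, h_prev):
    """h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

xs = rng.normal(size=(5, d_in))     # a sequence x_1 ... x_5
h = np.zeros(d_hidden)              # h_0
for x_t in xs:
    h = rnn_step(x_t, h)            # h_t summarizes everything seen so far
print(h.shape)
```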
20
RNN (Recurrent Neural Network)…
• Example (Named Entity Recognition)

21
LSTM (Long Short Term Memory)
• An LSTM is a type of recurrent neural network (RNN) aimed at dealing with the vanishing gradient problem present in traditional RNNs
• Sometimes, we only need to look at recent information to perform the present task. For
example, consider a language model trying to predict the next word based on the previous
ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any
further context – it’s pretty obvious the next word is going to be sky. In such cases, where the
gap between the relevant information and the place that it’s needed is small, RNNs can learn
to use the past information
• But there are also cases where we need more context. Consider trying to predict the last word
in the text “I grew up in France… I speak fluent French.” Recent information suggests that the
next word is probably the name of a language, but if we want to narrow down which
language, we need the context of France, from further back. It’s entirely possible for the gap
between the relevant information and the point where it is needed to become very large.
• Unfortunately, as that gap grows, RNNs become unable to learn to connect the information
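A minimal NumPy sketch of a single LSTM cell, showing the forget, input, and output gates that let the network carry context (such as "France") across long gaps; the shapes and initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenated [h_{t-1}, x_t]
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(d_hidden + d_in, d_hidden)) for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(d_hidden)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(z @ W_f + b_f)          # forget gate: what to erase from the cell state
    i = sigmoid(z @ W_i + b_i)          # input gate: what new information to store
    o = sigmoid(z @ W_o + b_o)          # output gate: what to expose as h_t
    c_tilde = np.tanh(z @ W_c + b_c)    # candidate cell update
    c = f * c_prev + i * c_tilde        # the cell state carries long-range information
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_hidden), np.zeros(d_hidden)
for x_t in rng.normal(size=(10, d_in)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```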
22
LSTM (Long Short Term Memory)…
• Assignment
• https://colah.github.io/posts/2015-08-Understanding-LSTMs/

23
Transformer
• Since an RNN can suffer from a loss of information with long sequences of input, and it cannot be trained in parallel, transformers (Vaswani et al., 2017) can be a better choice, as they process sentences using multi-head attention mechanisms and positional embeddings
• The Transformer is an architecture that transforms one sequence into another using an encoder and a decoder, without employing any recurrent networks
• The model consists of two components
1. Encoder
2. Decoder

24
Transformer…
• Steps
1. Generate input embeddings
2. Generate positional embeddings
3. Self-attention
4. Feed-forward neural network

25
Transformer…
INPUT
Example sentence (Nepali): मानव जिवनमा खेलकुदले ज्यादै ठुलो महत्व राख्दछ ("Sports hold very great importance in human life"), tokenized as X0 … X6
Vocabulary indices: [561 678 899 9 67 10002 99]
INPUT EMBEDDING
Each index is mapped to a dense embedding vector E0 … E6 of dimension d_model = 512 (Vaswani et al., 2017)
Figure 1. Transformer (Vaswani et al., 2017)
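A NumPy sketch of the embedding lookup shown in the figure: each vocabulary index selects one row of a learned embedding matrix. The vocabulary size and the random matrix are illustrative; d_model = 512 follows Vaswani et al. (2017):

```python
import numpy as np

vocab_size, d_model = 12000, 512                            # d_model = 512 as in Vaswani et al. (2017)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))   # in practice learned during training

token_ids = np.array([561, 678, 899, 9, 67, 10002, 99])     # vocabulary indices X0 ... X6
E = embedding_matrix[token_ids]                             # input embeddings E0 ... E6
print(E.shape)                                              # (7, 512)
```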
26
Transformer…
POSITIONAL ENCODING: Why Order Matters
Example (भूपि शेरचन): "हामी वीर छौ तर वुद् छौ" versus the same words in a different order; reordering the words changes the meaning, so the model needs information about token positions
Each input embedding E0 … E6 is added element-wise to a positional encoding P0 … P6, giving position-aware embeddings EP0 … EP6
Vaswani et al. (2017) build the positional encodings from sine and cosine wave frequencies
Figure 1. Transformer (Vaswani et al., 2017)
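A sketch of the sine/cosine positional encodings of Vaswani et al. (2017), which are added element-wise to the input embeddings so the model can make use of word order; the random embeddings stand in for E0 … E6:

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]              # token positions 0 ... seq_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

E = np.random.default_rng(0).normal(size=(7, 512))  # input embeddings E0 ... E6
EP = E + positional_encoding(7)                     # position-aware embeddings EP0 ... EP6
print(EP.shape)
```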

27
Transformer…
MULTI-HEAD ATTENTION
Why Attention
Example: "मुनामदन का लेखक को हुन?" ("Who is the author of Munamadan?") and its answer "मुनामदन का लेखक महाकवि देवकोटा हुन।" ("The author of Munamadan is the great poet Devkota."); answering requires relating words across the whole sequence
Self-Attention
Example: "ढुकुटि कोठाको साँचो लुकाउने हरिलाई साँचो खोइ भन्दा उसले साँचो कुरा बोलेन।" (roughly: "When Hari, who hid the key (साँचो) to the treasury room, was asked where the key (साँचो) was, he did not tell the truth (साँचो कुरा)."); the same word साँचो must be interpreted from its surrounding context, which is what self-attention provides
Figure 1. Transformer (Vaswani et al., 2017)
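A NumPy sketch of (single-head) scaled dot-product self-attention, the mechanism that lets each token weight every other token in the sentence when building its representation (Vaswani et al., 2017); the matrix sizes and random weights are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # how much each token attends to every other token
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 512))              # position-aware embeddings EP0 ... EP6
W_q, W_k, W_v = (rng.normal(scale=0.05, size=(512, 64)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)            # (7, 64) (7, 7)
```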

28
Transformer…

Illustration: attention scores relating one token of the example sentence मानव जिवनमा खेलकुदले ज्यादै ठुलो महत्व राख्दछ to each of its seven words, e.g. [65 77 54 86 98 90 20]
29
LLM (Large Language Model)
• Large Language Models (LLMs) are machine learning models with billions of parameters that
use deep learning techniques to process natural languages
• These models are trained on massive amounts of text data, from which they learn patterns and relationships in the language, as they are specially designed to understand natural language
• LLMs can be used in different tasks like text classification, sentiment analysis, question answering, and summarization (see the example after this list)
• LLMs have mostly used the transformer (Vaswani et al., 2017) architecture
• Some of the pre-trained NLP language models are
• T5
• BERT
• GPT
• ALBERT
• ELECTRA
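For illustration, a minimal example of using pre-trained models through the Hugging Face `transformers` library (assuming it is installed; the default sentiment checkpoint it downloads is chosen by the library, and "gpt2" is named here only as a well-known decoder-only model):

```python
from transformers import pipeline

# Sentiment analysis with a pre-trained encoder-style model
classifier = pipeline("sentiment-analysis")
print(classifier("Deep learning makes sequence modeling much easier."))

# Text generation with a pre-trained GPT-style (decoder-only) model
generator = pipeline("text-generation", model="gpt2")
print(generator("Deep learning is", max_new_tokens=20))
```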
30
LLM (Large Language Model)…
• The way of interacting with language models is quite different from other machine learning paradigms
• In those cases, computer code with formalized syntax is written to interact with APIs and libraries
• In contrast, Large Language Models are able to take natural language or human-written instructions and perform tasks much as a human would
• These instructions provided to an LLM are called prompts
• Fine-tuning is a supervised learning process where you use a dataset of labelled examples to update the weights of the LLM
• The fine-tuning process extends the training of the model to improve its ability to generate good completions for a specific task
• Thus LLMs can be fine-tuned for different purposes like text classification and text generation (a schematic sketch follows)
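A schematic PyTorch sketch of the supervised fine-tuning step described above: labelled examples are passed through a pre-trained model and its weights are updated with a standard loss. Here `pretrained_model` and `dataloader` are placeholders for this sketch, not a specific library API:

```python
import torch
import torch.nn as nn

def fine_tune(pretrained_model, dataloader, epochs=3, lr=2e-5):
    """Update the weights of a pre-trained model on a labelled downstream dataset."""
    optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    pretrained_model.train()
    for epoch in range(epochs):
        for inputs, labels in dataloader:        # labelled examples for the specific task
            logits = pretrained_model(inputs)    # forward pass through the pre-trained network
            loss = loss_fn(logits, labels)
            optimizer.zero_grad()
            loss.backward()                      # gradients flow into the pre-trained weights
            optimizer.step()
    return pretrained_model
```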
31
LLM (Large Language Model)…
• A language model basically undergoes a two-step process:
1. Pre-training
• This step builds a foundation for the model by training on huge amounts of unstructured text (such as Wikipedia, social networks, literature, and news sites) to establish a high-level understanding of natural language in an unsupervised fashion
2. Fine-tuning
• The pre-trained model is further trained on a more specific downstream task separately, which requires a task-specific training dataset
• Fine-tuning can be used to apply knowledge learned from one task to another related task, as transfer learning; that is, this technique is used to optimize a model’s performance on a new or different task
• Three variants of pre-trained model
1. Encoder Only
2. Decoder Only
3. Encoder Decoder

32
LLM (Large Language Model)…
• Encoder Only
• Masked language modeling
• Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence
• Suited for sentiment analysis and NER (Named Entity Recognition)
• Builds bi-directional representations of the input sentence, meaning that the model has an understanding of the full context of a token
• Decoder Only
• Causal language modeling
• Here, the training objective is to predict the next token based on the previous sequence of tokens
• These models mask the input sequence so that each position can only see the input tokens leading up to the token in question
• The model has no knowledge of the end of the sentence (unidirectional)
• These models are best suited for text generation
• Encoder Decoder
• Uses both the encoder and decoder parts of the original transformer architecture
• It pre-trains the encoder using span corruption, which masks random sequences of input tokens
• Those masked sequences are then replaced with a unique sentinel token
• The decoder is then tasked with reconstructing the masked token sequences auto-regressively
• The output is the sentinel token followed by the predicted tokens
• These models are best suited for machine translation, text summarization, and question answering
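A NumPy sketch contrasting the two masking ideas above: random token masking for masked language modeling (encoder-only) and a causal mask for next-token prediction (decoder-only). The token IDs, the mask ID, and the 15% masking rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([561, 678, 899, 9, 67])    # illustrative token IDs
MASK_ID = 0                                  # hypothetical [MASK] token ID

# Masked language modeling (encoder-only): hide random tokens, predict them from full context
mlm_input = tokens.copy()
positions = rng.random(len(tokens)) < 0.15   # mask roughly 15% of tokens at random
mlm_input[positions] = MASK_ID
print("MLM input:", mlm_input, "targets at masked positions:", tokens[positions])

# Causal language modeling (decoder-only): each position may only attend to earlier positions
seq_len = len(tokens)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print("Causal attention mask:\n", causal_mask.astype(int))
```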

33
LLM (Large Language Model)…

34
End of Unit 6

35
