
MULTI MODEL NEURAL MACHINE TRANSLATION

22AIE114 Introduction to Electrical and Electronics Engineering

SUBMITTED BY
BATCH B GROUP 19
B.Nikitha (CB.SC.U4AIE23119)
K.Bhanu Prakash (CB.SC.U4AIE23134)
M.Sravanthi Suma (CB.SC.U4AIE23148)
M.Kavya Srihitha (CB.SC.U4AIE23167)

2023 B. Tech CSE – AI


Amrita School of Artificial Intelligence
Amrita Vishwa Vidyapeetham
Coimbatore – 641 11
June 2024

CONTENTS

1. ACKNOWLEDGEMENT
2. ABSTRACT
3. INTRODUCTION
4. BACKGROUND
5. METHODOLOGY
6. RESULTS AND DISCUSSION
7. REFERENCES
8. CONCLUSION

ACKNOWLEDGEMENT

We would like to express our sincere gratitude to all those who contributed to the success of this project entitled "MULTI MODEL NEURAL MACHINE TRANSLATION". We are also thankful for the support and guidance provided by Dr. Ambika P.S., whose insights and encouragement significantly contributed to the project's development. The collective efforts of the entire team have resulted in an outcome we can all be proud of. Thank you to everyone involved for your unwavering commitment and hard work.

PROJECT TEAM:
1.B.NIKHITA – CB.SC.U4AIE23119
2.K.BHANU PRAKASH – CB.SC.U4AIE23134
3.M.SRAVANTHI SUMA – CB.SC.U4AIE23148
4.M.KAVYA SRIHITHA – CB.SC.U4AIE23167

ABSTRACT

Neural Machine Translation (NMT) has seen a tremendous spurt of growth in less than ten years and has already entered a mature phase. While considered the most widely used solution for machine translation, its performance on low-resource language pairs still remains sub-optimal compared to its high-resource counterparts, due to the unavailability of large parallel corpora. Therefore, the implementation of NMT techniques for low-resource language pairs has been receiving the spotlight in the recent NMT research arena, leading to a substantial amount of research reported on this topic. This paper presents a detailed survey of research advancements in low-resource language NMT (LRL-NMT), along with a quantitative analysis aimed at identifying the most popular solutions. Based on our findings from reviewing previous work, this survey provides a set of guidelines to select the possible NMT technique for a given low-resource setting.

The neural translation has been implemented using various neural network architectures. We used these four models:

• Gated recurrent units (GRUs)
• LSTM with embedding layer
• Bidirectional LSTM
• Encoder-Decoder

INTRODUCTION

Neural Machine Translation (Neural MT/NMT, Deep Neural MT/NMT, or DNMT) is an advanced form of machine translation that utilizes neural network methodologies to estimate the probability of a particular sequence of words. This can be a text fragment, a complete sentence, or, with the latest advances, an entire document.

NMT is an entirely new paradigm for addressing language translation and localization: it employs deep neural networks and artificial intelligence to train neural models. NMT has emerged as the leading technique of machine translation, with the field shifting from SMT to NMT within a span of about three years. Neural Machine Translation is often observed to generate much more accurate translations than Statistical Machine Translation solutions, in terms of both fluency and adequacy.

Neural machine translation is also more efficient than traditional Statistical Machine Translation (SMT) models in terms of memory usage; it requires significantly less memory. The NMT approach further differs from SMT systems in that all components of the neural translation model are trained in an end-to-end manner to enhance translation performance.

In contrast to the traditional phrase-based translation system, which is built from many small sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and produces a correct translation.

However, this does not mean that SMT systems should be ignored completely, because there are many situations where SMT will produce a better-quality translation than NMT. That is why Omniscien has adopted a Hybrid Machine Translation approach, which combines the best features of both technologies in order to produce a much better translated text.

BACKGROUND THEORY

What is a Recurrent Neural Network?

Recurrent Neural Network, or RNN, is a kind of neural network that uses feedback to connect the present state with the previous state, i.e., the output of one input step is passed as an input to the next step. In a traditional neural network, all the inputs and outputs are independent of one another, meaning none of them affects the others. However, in scenarios where the word to be predicted is the next word in a sentence, the preceding words are necessary, so they must be remembered. This is why the RNN was conceived: it solves this problem with the use of a hidden layer. The most important feature of the RNN is the hidden state, which remembers some information about the sequence. It is also called the memory state, because it depends on the previous inputs to the network, updated at each step. The RNN employs the same parameters at every time step, since it performs the same operation on all inputs and hidden states to yield the output. This keeps the parameters simpler than in other neural networks, since there is no separate set of parameters for each position in the sequence.
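To make this concrete, the following minimal NumPy sketch shows a single RNN step and how one hidden state vector is carried across a sequence; the names and sizes here are illustrative assumptions, not the project's actual code.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One recurrent step: the new hidden state mixes the current input
    # with the previous hidden state, using the SAME parameters each step.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3                  # illustrative sizes
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                      # initial hidden (memory) state
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)         # h summarizes the inputs seen so far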

The differences between an RNN and a feedforward neural network are as follows: feedforward neural networks are artificial neural networks that do not contain looping nodes. This kind of network is also called a multi-layer neural network, because information can only be passed forward. Data flows in a feedforward ANN from the input layer to the output layer, passing through any hidden layers that exist. These networks are suitable for problems such as image classification, where the inputs and outputs are independent of each other. However, their inability to retain previous inputs makes them less effective for sequential data analysis.

METHODOLOGY

The main aim of the project is Tamil-to-English translation using neural networks.
Models:

1. Gated Recurrent Unit (GRU)


Gated Recurrent Unit (GRU) is a kind of RNN, proposed by Cho et al. in 2014, with a less complex structure than LSTM networks. Like the LSTM, the GRU can handle sequential data such as text, speech, and time series.

In more detail, the fundamental concept of the GRU is the use of gating mechanisms to maintain and control the hidden state of the network at each time step. The gates regulate how information is retained and shared within the network. To limit memory carry-over and keep the gradients well behaved, the GRU has two gating mechanisms: the reset gate and the update gate.

The reset gate defines how much of the previous hidden state to forget, whereas the update gate dictates how much of the new input to use in modifying the present hidden state. The end result of the successive computations in the GRU is the output computed from the updated hidden state h_t.

The equations used to calculate the reset gate, update gate, and hidden state of a
GRU are as follows:

Reset gate: r_t = sigmoid(W_r * [h_{t-1}, x_t])
Update gate: z_t = sigmoid(W_z * [h_{t-1}, x_t])
Candidate hidden state: h~_t = tanh(W_h * [r_t * h_{t-1}, x_t])
Hidden state: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
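These equations translate almost line-for-line into code. Below is a minimal NumPy sketch of one GRU step using the same notation (biases omitted, as in the equations above); the shapes are illustrative assumptions.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    hx = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ hx)                        # reset gate
    z_t = sigmoid(W_z @ hx)                        # update gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # h~_t
    return (1 - z_t) * h_prev + z_t * h_cand       # new hidden state h_t

rng = np.random.default_rng(0)
hidden_dim, input_dim = 3, 4                       # illustrative sizes
W_r, W_z, W_h = (rng.normal(size=(hidden_dim, hidden_dim + input_dim))
                 for _ in range(3))
h_t = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim), W_r, W_z, W_h)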

In conclusion, GRU networks are a category of RNN specifically designed to use gating mechanisms when updating the hidden state, enabling effective modelling of sequential data. They have been demonstrated to be beneficial in numerous natural language processing applications such as language modeling, machine translation, and speech recognition.

Prerequisites: a common type of Recurrent Neural Network is the Long Short-Term Memory (LSTM) network.

Since the basic Recurrent Neural Network often faces the vanishing/exploding gradients problem during operation, improvements and variations were introduced. One of the most popular is the Long Short-Term Memory network (LSTM). Another variation, not quite as well known but nearly as effective, is the Gated Recurrent Unit network (GRU).

The GRU cell has only three gates, and it does not possess an internal cell state as the LSTM does. The information that the LSTM recurrent unit stores in its internal cell state is instead incorporated into the hidden state of the Gated Recurrent Unit. This collective information is then passed on to the next Gated Recurrent Unit of the model. The different gates of a GRU are described below:

Update Gate (z): It defines how much of the prior knowledge has to be carried forward into the future. It is analogous to the Output Gate in an LSTM recurrent unit.

Reset Gate (r): It specifies how much of the previous knowledge is to be forgotten. It is analogous to the combined action of the Input Gate and the Forget Gate in an LSTM recurrent unit.

Current Memory Gate (h~_t): It is often overlooked in a typical discussion of the Gated Recurrent Unit network. It is incorporated into the Reset Gate, much as the Input Modulation Gate is a sub-component of the Input Gate in an LSTM, and is used to apply a non-linear transformation to the input signal and to make the signal zero-mean.

Working of a Gated Recurrent Unit:

Take the current input and the previous hidden state as vectors.

Calculate the values of the three different gates by following the steps given below:

1. For each gate, compute the parameterized vector by multiplying the concatenation of the current input and the previous hidden state with the respective weight matrix for that gate.

2. Apply the respective activation function element-wise to each parameterized vector. The gates and the activation functions applied to them are:

Update Gate: sigmoid function
Reset Gate: sigmoid function

The calculation of the Current Memory Gate is slightly different from the above procedure. First, the Hadamard (element-wise) product of the Reset Gate and the previous hidden state vector is calculated. This vector is then parameterized and added to the parameterized current input vector, and the tanh activation is applied, as in the candidate hidden state equation above.

To calculate the current hidden state, first define a vector of ones with the same dimensions as the update gate; this vector is referred to as ones and denoted mathematically by 1. Take the Hadamard product of the vector (1 - z_t), obtained by subtracting the update gate from ones, with the previous hidden state vector. Then take the Hadamard product of the update gate with the current memory gate (the candidate hidden state). Finally, sum the two vectors to obtain the current hidden state vector, exactly as in the equation h_t = (1 - z_t) * h_{t-1} + z_t * h~_t.

In the usual diagram of this computation, blue circles denote element-wise multiplication, a positive sign inside a circle denotes vector addition, and a negative sign denotes vector subtraction (addition of a negated vector). The weight matrix W contains different weights for the current input vector and the previous hidden state for each of the gates.
Just as with ordinary Recurrent Neural Networks, a GRU network produces an output at each time step, and this output is used to train the network with the help of gradient descent.

2. LSTM with embedding layer

What are Embedding LSTM layers?

The term "Embedding LSTM layers" refers to the integration of two important deep learning components: an embedding layer and an LSTM layer. The embedding layer maps tokens to vectors, and the LSTM layer keeps track of past information while deciding what word to generate next. The combination is used in deep neural networks that handle sequential data. Let's break down these two concepts:

Embedding Layers:

• It is common in deep learning models, especially natural language processing (NLP) models, to use embedding layers for transforming categorical or discrete input data into a continuous vector space. These layers map each category or label to a dense vector, which enables the model to learn meaningful representations of the data.

• More generally, for example in computer vision, embedding layers convert discrete values such as object labels, classes, or IDs into vectors of continuous values. For instance, an object label or tag can be represented as a real-valued vector.

LSTM Layers:

• LSTM (Long Short-Term Memory): an LSTM is a type of Recurrent Neural Network (RNN) layer. Its key purpose is the modelling of sequential data, and it is especially suitable for sequences with long-term dependencies. Memory cells are included in LSTMs, allowing them to carry context from previous time steps into their processing, depending on the problem at hand.

• In computer vision, LSTM layers are used to deal with sequences of image-related information. For instance, with video data, LSTM layers can be applied to learn the temporal dependence between two or more frames of a video, for action recognition or object tracking.

Combining Embedding and LSTM Layers:

Embedding layers transform categorical information, such as word tokens, object names, or object IDs, into dense continuous vectors; the LSTM layers then process the resulting sequence of vectors while retaining context from earlier steps, as the sketch below illustrates.
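As a concrete illustration, here is a minimal Keras sketch of the "LSTM with embedding layer" idea applied to word-level translation, similar in spirit to this project; the vocabulary sizes, sequence length, and dimensions are assumed values for illustration only, not the project's actual settings.

from tensorflow.keras import layers, models

SRC_VOCAB = 10000   # assumed source (Tamil) vocabulary size
TGT_VOCAB = 8000    # assumed target (English) vocabulary size
SEQ_LEN = 20        # assumed padded sentence length

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),                  # padded token-id sequences
    layers.Embedding(SRC_VOCAB, 256),                # token ids -> dense vectors
    layers.LSTM(256, return_sequences=True),         # one hidden state per time step
    layers.TimeDistributed(layers.Dense(TGT_VOCAB, activation="softmax")),  # word probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()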

3. Bidirectional LSTM
In this section we will first discuss bidirectional LSTMs and how the architecture of such networks operates. We will then look at an example showing why processing a sentence in both directions helps.

Bidirectional LSTM (BiLSTM)

BiLSTM, or Bidirectional LSTM, is a sequence model that contains two LSTM layers: one processes the data in the forward direction while the other processes it in the reverse direction. It is quite common in NLP-related tasks. The rationale of this approach lies in the information available during forward and backward processing, which allows the model to evaluate the connections within a sequence (for example, the correspondence between the following and previous words in a sentence). The idea of Bidirectional Recurrent Neural Networks (RNNs) is straightforward: it involves duplicating the first recurrent layer in the network so that there are two layers side by side, providing the input sequence as-is to the first layer and a reversed copy of the input sequence to the second. Before proceeding, let us illustrate this concept with an example.

Consider statement 1: "Server, can you bring me that particular dish?"
Consider statement 2: "He crashed the server."

A regular unidirectional RNN such as an LSTM, reading these sentences left to right, may treat the word "server" as the same kind of object in both. That is not the case with a Bidirectional LSTM: because it also propagates in the backward direction, it captures the context on both sides of the word and produces the correct interpretation.

Architecture
A bidirectional LSTM consists of two unidirectional LSTMs, which process the input sequence both forwards and in reverse. The architecture can be interpreted simply as two LSTMs, the first receiving the sequence of tokens as-is and the second receiving it in reverse order, with the two outputs combined. Both LSTM networks produce a probability vector as output, and the final output is the sum of the probabilities from the two LSTM networks. It can be represented as

P_t = P_t^f + P_t^b

where:
• P_t: final probability vector.
• P_t^f: probability vector from the forward direction.
• P_t^b: probability vector from the backward direction.

Bidirectional LSTM layer Architecture
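In Keras this architecture is available through the Bidirectional wrapper, which runs a forward and a backward copy of the LSTM over the input; merge_mode="sum" matches the P_t = P_t^f + P_t^b combination above. A minimal sketch with assumed sizes:

from tensorflow.keras import layers, models

VOCAB, SEQ_LEN = 10000, 20   # assumed vocabulary size and padded length

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, 128),
    # Two LSTMs: one reads the sequence forwards, the other backwards;
    # merge_mode="sum" adds their per-step outputs, as in P_t = P_t^f + P_t^b.
    layers.Bidirectional(layers.LSTM(128, return_sequences=True), merge_mode="sum"),
    layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")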

4. Encoder-Decoder Model

The main architecture used in machine translation models is known as the Encoder-Decoder, or Sequence-to-Sequence (seq2seq), model.

What is Encoder-Decoder?

An encoder-decoder model operates on sequences and is commonly used in natural language processing and machine translation. It consists of two main components: an encoder and a decoder.

An encoder can best be described as a block that encodes or compresses a certain amount of data into a much smaller data format.

1. Encoder: The encoder converts an input sequence into a fixed-size vector known as the "context vector" or "thought vector". In general, the input sequence can be any text of words or strings and is not necessarily limited to a single sentence. The encoder normally applies RNNs such as LSTM or GRU to model the sequential nature of the input data. As the input sequence is processed step by step, the information is accumulated in the context vector, a vector that is supposed to capture the input's key points.

2. Decoder: The decoder then takes the context vector produced by the encoder and generates an output sequence. This can be a sequence of a different length, for example the translated sentence or a chatbot reply. Just like the encoder, the decoder can also employ an RNN. Its first hidden state is the context vector, and after that each element of the output sequence is calculated using the recurrent computation of the RNN model. The decoding process is thus conditioned on the context vector and the previously generated output to produce the output sequence.

How does it work internally?

Let's walk through a practical example to understand how an encoder-decoder model operates internally. We shall look at a specific machine translation task: translating an English sentence to German using the encoder-decoder framework.

1. Encoder: Let the English input text be "I love playing cricket".

The encoder takes the input sentence word by word and encodes it into a form the decoder can work with. Each word is represented as a vector, e.g., using word embeddings such as GloVe. The encoder, usually LSTM- or GRU-based, receives these word vectors in sequence and updates its hidden state at each time step. The final hidden state is the context vector, which encapsulates the entire input sentence.

2. Decoder: The decoder then uses the context vector created by the encoder to generate the translated output sequence word by word. In our example, the target is the corresponding German translation, beginning "Ich liebe ...".

Before the decoding phase, the context vector is passed to the decoder as the first hidden state of the decoding process. The decoder produces the output sequence word by word, with each word predicted based on the context vector and the previously produced words.

In training, the decoder is fed the correct previous words. For instance, at the first decoding step the decoder uses the context vector and produces the word "Ich". The generated word "Ich" is then taken as input to the next decoding step, which produces the word "liebe". These steps are repeated until the entire output sequence is produced.

Training involves a procedure known as teacher forcing, where the decoder is given the proper previous words during training irrespective of its own predictions. This makes it possible to train the decoder to produce the right output sequence matching the intended one.
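In practice, teacher forcing is often set up simply by shifting the target sequence one position: the decoder input is the ground-truth sequence starting at the start marker, and the prediction target is the same sequence shifted left. A minimal sketch, assuming integer-encoded, padded sentences with made-up token ids:

import numpy as np

# Assumed encoding: 1 = <start>, 2 = <end>, 0 = <pad>; other ids are made up.
target = np.array([[1, 17, 43, 9, 2, 0]])   # e.g. "<start> Ich liebe ... <end> <pad>"

decoder_input = target[:, :-1]    # what the decoder reads:  <start>, w1, w2, ...
decoder_target = target[:, 1:]    # what it must predict:    w1, w2, ..., <end>

# During training, the decoder always receives the correct previous word
# (decoder_input), regardless of what it actually predicted at that step.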

The key idea is that the encoder captures the essential information from the input
sentence and represents it in the context vector. The decoder then uses this context
vector to generate the corresponding output sequence, one word at a time. By
jointly training the encoder and decoder, the model learns to effectively encode and
decode the input-output relationship, enabling translation or other sequence
generation tasks.

Encoder-Decoder Architecture


The architecture of an encoder-decoder model typically consists of two main components: an encoder and a decoder. The encoder summarizes the input sequence, while the decoder generates the output sequence.
1. Encoder: The encoder component encodes the input sequence into a fixed-size representation known as the context vector or thought vector. It normally has the following components:

• Input Embedding Layer: Each word or token in the input sequence (from the source sentence vocabulary) is vectorized into a fixed-dimensional representation using word embeddings such as Word2Vec or GloVe. This layer converts every word into vector form, making it easier for the network to process the data.

• Encoder RNN: The embedded input sequence is then passed to a Recurrent Neural Network (RNN) layer such as an LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit). The RNN processes the sequential input and updates its hidden state at each step, capturing the dependencies and context of the input sequence.

• Context Vector: The final hidden state of the encoder RNN serves as the context vector, a summarized representation of the input sequence in a fixed-size vector. This context vector is passed to the decoder component, which uses it throughout the decoding procedure.

2. Decoder: The decoder is in charge of generating the output sequence based on the context vector produced by the encoder. It generally consists of the following elements:

• Output Embedding Layer: Just like the input embedding layer, the output words are expressed as dense vectors using embeddings; each word in the output sequence is mapped to its corresponding vector.

• Decoder RNN: The output sequence is processed by another RNN layer, commonly of the same kind as the encoder's RNN layer. This decoder RNN takes the encoder's context vector as its initial hidden state and handles the sequential nature of generation: it produces each subsequent term progressively, conditioned on the current context vector and the words that have occurred in the output sequence previously.
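Putting these pieces together, here is a minimal Keras sketch of the training-time encoder-decoder described above: embedding layers, an LSTM encoder whose final states initialise an LSTM decoder, and a softmax output layer. All sizes are assumed for illustration, and inference would additionally need a step-by-step decoding loop.

from tensorflow.keras import layers, Model

SRC_VOCAB, TGT_VOCAB, DIM = 10000, 8000, 256   # assumed sizes

# Encoder: embed the source sentence and keep only the final LSTM states.
enc_in = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(SRC_VOCAB, DIM)(enc_in)
_, state_h, state_c = layers.LSTM(DIM, return_state=True)(enc_emb)  # the "context"

# Decoder: start from the encoder's states and, with teacher forcing,
# read the shifted target sequence as input.
dec_in = layers.Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(TGT_VOCAB, DIM)(dec_in)
dec_out, _, _ = layers.LSTM(DIM, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(TGT_VOCAB, activation="softmax")(dec_out)  # next-word distribution

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")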

RESULTS AND DISCUSSION

Results were recorded for each of the four models (training figures not reproduced here): GRU, LSTM, BI-LSTM, and the ENCODER-DECODER MODEL.
REFERENCES

FOR MODELS:

https://www.geeksforgeeks.org/gated-recurrent-unit-networks/

https://www.natalieparde.com/teaching/cs_521_spring2020/LSTMs,%20GRUs,%20
Encoder-Decoder%20Models,%20and%20Attention.pdf

DATA SET:

https://huggingface.co/datasets/Hemanth-thunder/en_ta

CODE:

https://github.com/Barqawiz/aind2-nlp-translation/tree/master

CONCLUSION

In this project, we developed and trained a neural machine translation (NMT) model for translating text from Tamil to English, using:
• an encoder-decoder architecture with recurrent neural networks (RNNs),
• long short-term memory (LSTM) units,
• Bidirectional LSTMs,
• gated recurrent units (GRUs).
The model's performance was evaluated using various metrics, and the results indicate that our approach can effectively handle the complexities of language translation.
In the overall evaluation, we found that the LSTM with an embedding layer performed well.
