
Project Report

CSI CO-CURRICULAR

French to English Translation using PyTorch

Presented By:
Divyanshu 16BCI0127

Presented to:
Prof. Govinda K

October, 2018
Abstract
Our task is to convert a sentence given in French into English. We use Recurrent Neural
Networks for this, choosing LSTMs over plain RNNs because of their improved gradient
propagation and memory. We encode a given sentence with an encoder network consisting of
stacked LSTMs and keep its final hidden state as the sentence encoding; a decoder network,
again consisting of stacked LSTMs, takes the encoder's last hidden state as its input, and each
of its cells outputs a word of the translation.

Introduction
Neural machine translation was proposed by Kalchbrenner and Blunsom [11], Sutskever et
al. [3], and Cho et al. [5]. Traditional statistical machine translation systems are based on
phrase-based translation, which translates sentences using statistical measures and consists
of many small sub-components that are tuned separately. Neural machine translation systems,
in contrast, attempt to build a single end-to-end trainable neural network that translates a
given sentence into its target language.

Most of the approaches consist of a single encoder-decoder network that takes as input the
sentence to be translated and feeds it word by word to an encoder. At the end of the encoder,
we obtain a fixed-length representation of the whole sentence, which is then fed to the
decoder. The decoder network mirrors the encoder network and produces the target sentence
word by word, so that at the end of the decoder we obtain the whole target sentence. The
encoder-decoder network is trained with a loss that aims at reducing the distance between the
target word and the predicted word over a pre-defined vocabulary.

A potential issue with this encoder-decoder approach is that the neural network has to
compress all the necessary information of a source sentence into a fixed-length vector. This
may make it difficult for the network to cope with long sentences, especially those longer
than the sentences in the training corpus. Cho et al. (2014b) showed that the performance of
a basic encoder-decoder indeed deteriorates rapidly as the length of the input sentence
increases. To address this issue, we adopt an extension to the encoder-decoder model which
learns to align and translate jointly. Each time the model generates a word of the translation,
it searches for a set of positions in the source sentence where the most relevant information
is concentrated. The model then predicts the target word based on the context vectors
associated with these source positions and all the previously generated target words.
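
Concretely, in the alignment model of Bahdanau et al. [2] that this extension follows, each decoder step builds a context vector as a weighted sum of the encoder states. A brief restatement of those equations (the notation follows that paper rather than this report):

$e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

Here $h_j$ is the encoder state for source position $j$, $s_{i-1}$ is the previous decoder state, $a(\cdot)$ is a small learned alignment network, and the target word $y_i$ is predicted from the decoder state together with the context vector $c_i$.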
Literature Review:

1. Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, Oriol Vinyals, Quoc Le. NIPS (Neural Information Processing Systems), 2014.
   Summary: Proposes a general end-to-end network for sequential learning and for dealing with sequential data, using an encoder-decoder approach with deep learning methods. Trains large DNNs on massive text corpora, demonstrating accuracy much higher than conventional statistical models. The network reads the whole sentence, encodes it, and then decodes it by predicting each word of the translation.
   Advantages: Proposes a new scope for RNNs and a different way of interpreting them.
   Disadvantages: Cannot deal with large or complex sentences.
   Future scope: A new kind of scope can be proposed.

2. Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. ICLR (International Conference on Learning Representations), 2014.
   Summary: One of the most influential works in the NLP and deep learning community. Proposes a neural network tuned to maximise translation performance, using an attention basis to focus on certain parts of the sentence and translate part by part with context-dependent attention, much like humans do. This work achieves very high accuracy and, when incorporated with AutoML, became Google's language translation engine.
   Advantages: Proposes a state-of-the-art solution for machine translation and achieves ground-breaking results.
   Disadvantages: The attention basis can sometimes lead to wrong results due to a wrong attention map.
   Future scope: The attention basis can be improved for better results.

3. Joint Language and Translation Modelling with Recurrent Neural Networks. Michael Auli, Michel Galley, Chris Quirk, Geoffrey Zweig. Association for Computational Linguistics, 2013.
   Summary: Proposes a language understanding model for both the source and target languages and then, from this knowledge representation, tries to learn a translation function using deep learning methods, specifically a conjunction of RNNs and simple feedforward neural networks, that maps from the source to the target language. Makes several independent and fair assumptions about the model and demonstrates a fairly high accuracy.
   Advantages: The conjunction of RNNs gives the model high accuracy, and it can learn to translate and model the language at the same time.
   Disadvantages: The language modelling goes somewhat off track because of the loss function involved.
   Future scope: The assumptions taken can be better justified and improved accordingly.

4. A Neural Probabilistic Language Model. Yoshua Bengio, Rejean Ducharme, Pascal Vincent, Christian Jauvin. JMLR (Journal of Machine Learning Research), 2003.
   Summary: Also one of the most influential works in the language and learning community; it solves one of the greatest problems of word representation. For a long time the question was how to represent words with a more semantic meaning; this model solves exactly that problem, learning and assigning a probability distribution for each word and producing a very good representation. A good example: given king : queen, man : ??, the model predicts woman. Essentially, this work builds a language understanding model.
   Advantages: Solves a problem that had been daunting researchers for years and proposes a new kind of vector embedding.
   Disadvantages: The word embeddings are biased towards certain types and characteristics.
   Future scope: Language modelling can be improved by jointly learning the probability distribution and the word representation.

5. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Kyunghyun Cho, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. Association for Computational Linguistics, 2014.
   Summary: Forms one of the most fundamental approaches to neural machine translation. Statistical modelling is done with an RNN encoder-decoder approach, which was taken up and applied to language translation, and it is this paper that underlies a large share of the translation models in use today. Almost all models deployed in real time use this approach, and those that do not underperform; it is the current state of the art.
   Advantages: Statistical models were improved drastically because of this work, and drawing probability distributions became very easy.
   Disadvantages: Not all models can be improved using this technique, and it needs careful implementation.
   Future scope: The model can be extended to accommodate all statistical models pertaining to machine learning, and the implementation can be improved with attention and care.

6. Context Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition. George Dahl, Dong Yu. arXiv, 2012.
   Summary: With massive text corpora and very large deep neural networks, training a model from scratch is highly computationally expensive, so in such applications people use pre-trained weights that already give a good approximation of the learned weights, which makes the process comparatively easier. This paper demonstrates the parameters to consider while doing so, as well as the methodology.
   Advantages: Gives a good method for using pre-trained weights on massive text corpora and for using the resulting model as a base for further models.
   Disadvantages: Relies on a large corpus, and such large corpora are generally not available; pre-trained weights can also sometimes lead to unstable learning.
   Future scope: Pre-trained weights can be suitably stabilised to make the best of this technique, and networks that use this model as a baseline can be improved to perform better.

7. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. Jacob Devlin, R. Zbib, Zhongqiang Huang, Thomas Lamar, R. Schwartz, J. Makhoul. Association for Computational Linguistics, 2014.
   Summary: The goal of any translation model is to learn a joint probability distribution of translated words, given the previous translation and the previous input words up to a particular time-step. Until this paper was published, however, training neural-network-based language models was very cumbersome and could take months on very large text corpora. This paper demonstrates how to implement a neural network for such models that is not only fast but also robust to deploy.
   Advantages: Makes it easy to implement and deploy a robust neural network, and makes learning a joint probability distribution of translated words easier.
   Disadvantages: The joint probability distributions can sometimes lead to incorrect learning of the two languages being translated, in which case translation quality takes a massive hit.
   Future scope: Correct learning of joint probability distributions, modelled and conditioned on appropriate translation probabilities.

8. Generating Sequences with Recurrent Neural Networks. Alex Graves. arXiv, 2013.
   Summary: Addresses how to deal with time-related data and how to model a joint probability distribution of events occurring at particular time-steps, given the outcomes of previous events. Specifically shows how LSTMs can generate complex data structures by predicting one data point at a time. Also reveals that RNNs are bad at remembering long-term inputs and suggests a modification to these "fuzzy" RNNs. One of the first papers to actually put RNNs to use and demonstrate their potential.
   Advantages: Establishes that plain RNNs are bad at remembering long-term inputs, which allows us to venture into other methods.
   Disadvantages: Rules out a major method that is widely used.
   Future scope: Not much, as it only refutes an accepted methodology.

9. Sequence Transduction with Recurrent Neural Networks. Alex Graves. arXiv, 2012.
   Summary: Transduction, essentially meaning transformation, is the main property and use of sequential or temporal networks: to transform, or transduce, a given input structure into an output sequence or structure that depends on the temporal domain. These networks, however, require a predefined alignment between the two in order to learn a transformation function for the relevant task, and a big challenge is to find that function. This paper introduces an end-to-end probabilistic sequence transduction system that transforms any input sentence into an output and generates outputs similar to given examples, for instance generating Shakespeare-style text.
   Advantages: Introduces end-to-end learning for transforming temporal-domain inputs and a probabilistic sequence transfer system.
   Disadvantages: These networks direly need a predefined alignment, without which training them is nearly impossible.
   Future scope: Predefined alignments can be trained using a different neural network altogether.

10. Long Short Term Memory Networks (LSTM). Sepp Hochreiter, Jurgen Schmidhuber. Association for Computational Linguistics, IEEE, 1997.
   Summary: This is the paper that much of the development in artificial intelligence is owed to. Developed at TU München, it introduces LSTMs, the backbone of almost all systems that process sequential or temporal data; prominent examples include speech recognition (Alexa), Google Assistant and Siri. The paper essentially solved the vanishing gradient problem in RNNs by introducing a memory structure, in conjunction with the already existing hidden layers, and gives the four essential equations that made the development of modern AI possible.
   Advantages: Establishes methods for dealing with temporal data and introduces a new kind of network that is very robust and achieves accuracies above 95 percent.
   Disadvantages: Very data hungry, needing a much larger dataset than other models, and also subject to overfitting.
   Future scope: The computational complexity can be reduced to make these networks more efficient, and further work can aim to make them less data hungry.

11. Recurrent Continuous Translation Models. Nal Kalchbrenner, Phil Blunsom. Association for Computational Linguistics (ACL), 2013.
   Summary: A seminal work on classical machine translation from the pre-neural-network era. These models are based on learning a translation probability over a set of words and then using that transition probability to find an expectation given a word and transform, or translate, it into the target language. The model calculates the transition probability by conditioning the probability distribution on the given preceding word and the input source sentence, and taking a continued product of these probabilities, since they are dependent, in accordance with Bayes' rule.
   Advantages: One of the very important techniques in statistical machine translation, achieving break-through results. The technique is also not data hungry and gives fairly good accuracy even with smaller datasets.
   Disadvantages: Although it gives high accuracy, it still does not match deep learning methods or cross the production-level accuracy bar.
   Future scope: Not much further work can be done, as these models are limited to statistical techniques and cannot learn a robust function.

12. Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation. J. Pouget-Abadie, Dzmitry Bahdanau, Bart van Merrienboer, Kyunghyun Cho, Yoshua Bengio. ACL, 2014.
   Summary: Addresses the problem of translating long sentences raised by the Sequence to Sequence Learning with Neural Networks paper. The approach adopted is to segment a long sentence into smaller chunks and iteratively apply sequence-to-sequence learning, giving results with a BLEU score higher than the one obtained in the competing paper.
   Advantages: Solves the problem of the curse of sentence length and essentially proposes a model that can learn long-term dependencies.
   Disadvantages: Not many; only that further work can be done to improve it.
   Future scope: Combining it with robust neural networks.

13. On the Properties of Neural Machine Translation: Encoder and Decoder Approaches. J. Pouget-Abadie, Dzmitry Bahdanau, Bart van Merrienboer, Kyunghyun Cho, Yoshua Bengio. ACL, 2014.
   Summary: Applies the RNN encoder-decoder approach of "Learning phrase representations using RNN encoder-decoder for statistical machine translation", discussed above, to language translation. It gave ground-breaking results on the translation task and inspired the whole NLP community.
   Advantages: Achieves ground-breaking results by applying sequence-to-sequence learning in conjunction with the RNN encoder-decoder approach.
   Disadvantages: Cannot capture long-term dependencies and then gives wrong translations.
   Future scope: The attention model was later introduced to capture long-term dependencies and thus give correct translations.

14. Recurrent Neural Network Based Language Model. Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur. Interspeech, 2010.
   Summary: Introduced after "A neural probabilistic language model", this paper trains a recurrent neural network (RNN) on the word embeddings proposed in that work. It produced better results than one-hot embeddings, gave a much better understanding of language modelling, and is essentially the approach still used today.
   Advantages: Produces better results than one-hot embeddings and introduced a new scope for using word2vec-style embeddings.
   Disadvantages: Underperforms compared with learned embedding models.
   Future scope: A different type of embedding model can be introduced.

15. LSTM Neural Networks for Language Modelling. Martin Sundermeyer, Ralf Schluter, Hermann Ney. Interspeech, 2012.
   Summary: Essentially the same work as "Recurrent neural network based language model" above, but employs the much more powerful LSTM instead of the RNNs used before.
   Advantages: Uses LSTM cells instead of plain RNNs to achieve much better results.
   Disadvantages: Subject to overfitting and needs a very large dataset.
   Future scope: Introducing a type of network that is robust to overfitting, such as MAC networks.

16. Deep Neural Networks for Acoustic Modelling in Speech Recognition. Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Navdeep Jaitly, Andrew Senior. IEEE Signal Processing Magazine, 2012.
   Summary: The work of Geoffrey Hinton and his colleagues at Google. It improved speech recognition several times over by using DNNs to process spectrogram data, finding very good onsets of particular speech tokens, which found exciting applications such as "Hey Alexa!" and "Okay Google". Until this paper the state of the art was held by Hidden Markov Models working on short window frames; this approach surpassed those limitations.
   Advantages: Introduced ground-breaking results in speech recognition and is widely used in assistants such as Alexa and Google Assistant.
   Disadvantages: Real-life speech recognition problems remain, such as being unable to understand dialects and accents.
   Future scope: Making the model robust to dialects and accents.

17. BLEU: A Method for Automatic Evaluation of Machine Translation. Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. ACL, 2002.
   Summary: Introduces BLEU, an evaluation metric for how good a generated translation is. The main principle behind BLEU is measuring the overlap in unigrams (single words) and higher-order n-grams of words between a translation being evaluated and a set of one or more reference translations. A perfect match gives a score of 1, whereas a perfect mismatch gives a score of 0.
   Advantages: The first proposed metric for evaluating machine translation models, which the community very much needed.
   Disadvantages: Has problems evaluating translation models, as it uses only an n-gram model.
   Future scope: Using F-scores and weighted n-gram models to enhance evaluation.

18. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Satanjeev Banerjee, Alon Lavie. EACL (European Chapter of the Association for Computational Linguistics), 2014.
   Summary: The METEOR translation metric, proposed by the Language Technologies Institute at Carnegie Mellon University, addresses weaknesses in the BLEU metric. It evaluates a translation by computing a score based on explicit word-to-word matches between the translation and a reference translation. If more than one reference translation is available, the given translation is scored against each reference independently and the best score is reported.
   Advantages: Better than the BLEU metric, addressing its problems and giving better translation results.
   Disadvantages: Flaws still exist in the model, and work is ongoing to improve the standard.
   Future scope: Although it uses F-scores, weighted n-grams are very much needed for an even better metric.

19. Statistical Phrase Based Translation. Philipp Koehn, Franz Josef Och, Daniel Marcu. ACL, 2003.
   Summary: Statistical machine translation systems are based on one or more translation models and a language model of the target language. While many different translation models and phrase-extraction algorithms have been proposed, most systems use a standard word n-gram back-off language model. This work proposes a new statistical language model based on a continuous representation of the words in the vocabulary, with a neural network performing the projection and the probability estimation. It considers the translation of European Parliament speeches, a task that is part of an international evaluation organised by the TC-STAR project in 2006. The proposed method achieves consistent improvements in BLEU score on the development and test data, and the paper also presents algorithms to improve the estimation of language model probabilities when splitting long sentences into shorter chunks.
   Advantages: Introduces a new way of projecting probability distributions onto an N-dimensional space, thereby increasing the accuracy of these models.
   Disadvantages: A data-hungry approach that needs a large search space.
   Future scope: Not much further work, as statistical models are not used today.

20. Learning Long-Term Dependencies with Gradient Descent is Difficult. Yoshua Bengio, Patrice Simard, Paolo Frasconi. ACM (Association for Computing Machinery), 1994.
   Summary: Highlights the problem of RNNs being unable to capture long-term dependencies because of the vanishing gradient problem: a long-term dependency implies a larger and deeper RNN, which makes propagating errors back to the earliest layers difficult. After this work, the LSTM started to be used.
   Advantages: Although only an explanation of a fact, it made the community aware of the long-term dependency problem in RNNs and prompted collaborative work to address it.
   Disadvantages: It is only a case study, not a technique, so disadvantages do not really apply.
   Future scope: Further work includes building models, such as LSTMs, that capture long-term dependencies.

21. Speed Regularization and Optimality in Word Classing. Geoffrey Zweig, Konstantin Makarychev. ICASSP (IEEE International Conference on Acoustics, Speech and Signal Processing), 2013.
   Summary: Presents a dynamic programming algorithm for determining word classes in a way that provably minimises the runtime of the resulting class-based language models. However, the speed-based methods degrade the perplexity of the language models by 5-10% compared with traditional likelihood-based classing.
   Advantages: The dynamic programming algorithm makes these models easier to implement and reduces their computational complexity.
   Disadvantages: Dynamic programming does not always reach a minimum and can give heuristic, false results.
   Future scope: Not much, as dynamic programming was discarded in this domain long ago.

22. Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation. Holger Schwenk, Anthony Rousseau, Mohammed Attik. ACL, 2012.
   Summary: Explains how continuous space language models have a high computational complexity and are difficult to implement, and therefore provides an open-source toolkit that lets people train such models on GPUs with very high efficiency, so that the large computational cost no longer poses a problem.
   Advantages: Providing an open-source kit enables more people to train such models and boosted research in this field; the toolkit also reduces the computational complexity of training.
   Disadvantages: None as such, since it only provides a toolkit that enhances research in this domain.
   Future scope: Compiling the toolkit against cuDNN and TensorRT to make it more efficient.

23. Minimum Error Rate Training in Statistical Machine Translation. Franz Josef Och. ACL, 2003.
   Summary: Gives a comparative study of the statistical machine translation models of the time and introduces a new, efficient way of training on an unsmoothed error count. Error functions are generally convex and smooth, but in some machine translation cases the discrete behaviour changes the behaviour of the loss function, and such models need a new training procedure, which this paper provides.
   Advantages: Opens up scope for further work in statistical machine translation by investigating and comparing statistical and deep learning approaches, and introduces better-behaved loss functions.
   Disadvantages: The new way of training models needs a careful implementation, as the progress of these models is unstable and needs extra attention.
   Future scope: A better loss delta function to keep the weight updates stable, making the algorithm more efficient.

24. CIDEr: Consensus Based Image Description Evaluation. Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh. CVPR (Computer Vision and Pattern Recognition), 2014.
   Summary: An evaluation metric not for machine translation but for image captioning. It compares how well a model generates a description for a given image and performs better than both BLEU and METEOR on this task. It is based on the idea that the closer a description is to a human-generated one, the better it is.
   Advantages: Gives a robust metric for image captioning that is better than BLEU.
   Disadvantages: Does not suit machine translation tasks well.
   Future scope: Making it suitable for machine translation to provide an even better translation metric.

25. A Bit of Progress in Language Modeling. Joshua T. Goodman. arXiv, 2001.
   Summary: Surveys the progress of language modelling at the time, for example continuous space models for statistical machine translation, hidden Markov models for language modelling, and probabilistic prediction of the next word given the conditional probability of the last generated word and the source sentence.
   Advantages: Opens up scope for further work in this domain by investigating and comparing statistical and deep learning approaches.
   Disadvantages: Not all models are covered in the investigation, and ones published in recent journals are missing.
   Future scope: Models currently being published can be included for a better case study comparing the current state of the art.
Proposed Model
The Recurrent Neural Network is a natural generalisation of feedforward neural networks to
sequences. Given a sequence of inputs (x1, . . . , xT), a standard RNN computes a sequence of
outputs (y1, . . . , yT) by iterating the following equations:

h_t = sigm(W^{hx} x_t + W^{hh} h_{t-1})

y_t = W^{yh} h_t

The RNN can easily map sequences to sequences whenever the alignment between the inputs
and the outputs is known ahead of time. However, it is not clear how to apply an RNN to
problems whose input and output sequences have different lengths with complicated and
non-monotonic relationships.

A simple strategy for general sequence learning is to map the input sequence to a fixed-sized
vector using one RNN, and then to map that vector to the target sequence with another RNN.
While this could work in principle, since the RNN is given all the relevant information, it is
hard to train plain RNNs this way because of the resulting long-term dependencies. The Long
Short-Term Memory (LSTM), however, is known to learn problems with long-range temporal
dependencies, so an LSTM may succeed in this setting.

The goal of the LSTM is to estimate the conditional probability p(y1, . . . , yT′ | x1, . . . , xT),
where (x1, . . . , xT) is an input sequence and y1, . . . , yT′ is its corresponding output sequence,
whose length T′ may differ from T.

The LSTM computes this conditional probability by first obtaining the fixed-dimensional
representation v of the input sequence (x1, . . . , xT), given by the last hidden state of the
LSTM, and then computing the probability of y1, . . . , yT′ with a standard LSTM language
model whose initial hidden state is set to the representation v of x1, . . . , xT. Each
p(yt | v, y1, . . . , yt−1) distribution is represented with a softmax over all the words in the
vocabulary.
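
Written out, the factorisation described above (a brief restatement of the formulation in Sutskever et al. [3]) is:

$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})$

where $v$ is the fixed-dimensional encoding of the input sequence and each factor on the right is a softmax over the vocabulary.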

We use the LSTM formulation from Graves. Note that we require each sentence to end with a
special end-of-sentence symbol "<EOS>", which enables the model to define a distribution
over sequences of all possible lengths: the LSTM computes the representation of "A", "B",
"C", "<EOS>" and then uses this representation to compute the probability of "W", "X", "Y",
"Z", "<EOS>".
First, we use two different LSTMs: one for the input sequence and another for the output
sequence, because doing so increases the number of model parameters at negligible
computational cost and makes it natural to train the LSTM on multiple language pairs
simultaneously. Second, we found that deep LSTMs significantly outperform shallow LSTMs,
so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the
order of the words of the input sentence.
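
As a minimal sketch of this setup in PyTorch (not part of the report's own listing, which uses GRU cells; the vocabulary size, hidden size and example sentence length below are illustrative assumptions):

import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only
vocab_size, hidden_size, num_layers = 10000, 256, 4

embedding = nn.Embedding(vocab_size, hidden_size)
encoder_lstm = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers)
decoder_lstm = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers)

# input_ids: word indices of one source sentence
input_ids = torch.randint(0, vocab_size, (12,))
reversed_ids = torch.flip(input_ids, dims=[0])      # reverse the source sentence

embedded = embedding(reversed_ids).unsqueeze(1)     # (seq_len, batch=1, hidden)
outputs, (h_n, c_n) = encoder_lstm(embedded)        # h_n, c_n: (num_layers, 1, hidden)

# (h_n, c_n) plays the role of the fixed-sized representation v that would
# initialise the stacked-LSTM decoder.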

Proposed Algorithm

from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

# Run on the GPU when available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

SOS_token = 0  # start-of-sentence index
EOS_token = 1  # end-of-sentence index

# Maximum sentence length kept by prepareData(); the exact value is not stated
# in this report, 10 is assumed here for illustration
MAX_LENGTH = 10

class Lang:
    """Vocabulary helper for one language (the "language class" referred to above)."""

    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


def unicodeToAscii(s):
    # Strip accents by decomposing characters and dropping combining marks
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

DEFINE function readLangs():
    Read the data file
    Split the file into lines
    Split each line into a sentence pair and normalise it

DEFINE function prepareData():
    readLangs()
    Trim the data to sentence pairs of limited length
    Count the frequency of words and build the vocabularies
    Use the relevant portion of the dataset

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1,
                 max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Attention weights over the encoder outputs
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

# PARALLEL PART OF THE ALGORITHM – ALL COMPUTATIONS ON THE GPU

# Probability of using teacher forcing (set to 0.5, as in the testing script)
teacher_forcing_ratio = 0.5


def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer,
          decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # teacher forcing

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
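
The listing above keeps only the teacher-forcing branch. As a hedged sketch (not in the original report), the alternative branch inside train() would feed the decoder's own greedy prediction back in, detached from the computation graph:

    else:
        # Without teacher forcing: use the decoder's own prediction as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input
            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break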

def trainIters(encoder, decoder, n_iters, learning_rate=0.01):
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
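
A hedged usage sketch (the iteration count below is illustrative only; the file names match the ones the testing script reloads):

hidden_size = 256
encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
decoder = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)
trainIters(encoder, decoder, 75000)

# Save the trained weights so the testing script can reload them later
torch.save(encoder.state_dict(), "encoder.pt")
torch.save(decoder.state_dict(), "attndecoder.pt")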

Testing Script
from __future__ import unicode_literals, print_function, division
from io import open
import random

import torch

from project import prepareData, tensorFromSentence, device


from project.language import MAX_LENGTH, SOS_token, EOS_token
from project.network import EncoderRNN, AttnDecoderRNN

teacher_forcing_ratio = 0.5
input_lang, output_lang, pairs = prepareData("eng", "fra", True)

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size,
                                      device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append("<EOS>")
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[: di + 1]

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print(">", pair[0])
        print("=", pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = " ".join(output_words)
        print("<", output_sentence)
        print("")

hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).cuda()
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words,
                               dropout_p=0.1).cuda()

encoder1.load_state_dict(
    torch.load("/home/sid/work_other/pdc_jcomp/encoder.pt"))
attn_decoder1.load_state_dict(
    torch.load("/home/sid/work_other/pdc_jcomp/attndecoder.pt"))
encoder1.eval()
attn_decoder1.eval()

def translate_sentence(input_sentence):
    output_words, attentions = evaluate(encoder1, attn_decoder1, input_sentence)
    print("input =", input_sentence)
    print("output =", " ".join(output_words))
    print()

translate_sentence("elle a cinq ans de moins que moi .")


translate_sentence("je n en ai pas fini avec lui .")
translate_sentence("elle est trop petit .")
translate_sentence("je ne crains pas de mourir .")
translate_sentence("c est un jeune directeur plein de talent .")
Implementation Details

PyTorch

PyTorch is an open source machine learning library primarily developed by Facebook's
Artificial Intelligence Research group (FAIR). It is widely used by researchers and developers
worldwide and is one of the most widely used deep learning libraries. It is derived from
Torch, which was written in the LuaJIT programming language on top of an underlying C
implementation. Like matrix or array libraries it supports mathematical operations such as
addition and multiplication, except that it operates on tensors, which are simply
multidimensional arrays that can also be used on a GPU. PyTorch provides two high-level
features:

(i) Tensor computation, like NumPy, with strong GPU acceleration.

(ii) Deep neural networks built on a tape-based automatic differentiation system.

PyTorch essentially has three main modules, namely the autograd, optim and nn modules. It
also has many further components such as Variable, Dataset and DataLoader.
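
A small illustration of these two features (a sketch, not taken from the project code):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tensor tracked by autograd, placed on the GPU when one is available
x = torch.randn(3, 3, device=device, requires_grad=True)

y = (x * x).sum()   # ordinary tensor arithmetic, accelerated on the GPU
y.backward()        # tape-based automatic differentiation
print(x.grad)       # gradient of y with respect to x, i.e. 2*x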

CUDA

CUDA is NVIDIA's language/API for programming the graphics card. It is a parallel
computing platform and application programming interface (API) model created by Nvidia. It
allows software developers and engineers to use a CUDA-enabled graphics processing unit
(GPU) for general-purpose processing.

This approach is termed GPGPU (General-Purpose computing on Graphics Processing
Units). The CUDA platform is a software layer that gives direct access to the GPU's virtual
instruction set and parallel computational elements for the execution of compute kernels. The
CUDA platform is designed to work with programming languages such as C, C++, and
Fortran.
This accessibility makes it easier for specialists in parallel programming to use GPU
resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced
skills in graphics programming. Also, CUDA supports programming frameworks such as
OpenACC and OpenCL. When it was first introduced by Nvidia, the name CUDA was an
acronym for Compute Unified Device Architecture, but Nvidia subsequently dropped the use
of the acronym.

CuDNN

CuDNN is a library for deep neural nets built using CUDA. It provides GPU accelerated
functionality for common operations in deep neural nets. The deep learning libraries like
Tensorflow, PyTorch already have built abstractions backed by CuDNN. CuDNN provides
highly tuned implementations of routines arising frequently in deep neural network
implementations like convolution -forward and backward including cross-correlation,
forward and backward pooling, neuron activations like ReLU,sigmoid, Hyperbolic Tangent,
tensor transformation functions, LRN and LCN batch normalizations. cuDNN's convolution
routines aim for performance competitive with the fastest GEMM (matrix multiply) based
implementations of such routines while using significantly less memory.

cuDNN's convolution routines aim for performance competitive with the fastest GEMM
(matrix multiply) based implementations of such routines while using significantly less
memory. CuDNN features customizable data layouts, supporting flexible dimension ordering,
striding, and subregions for the 4D tensors used as inputs and outputs to all of its routines.
This flexibility allows easy integration into any neural network implementation and avoids
the input/output transposition steps sometimes necessary with GEMM-based convolutions.
CuDNN offers a context-based API that allows for easy multithreading and (optional)
interoperability with CUDA streams.
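
In PyTorch the cuDNN backend is used automatically when it is available; a small hedged example of the switches PyTorch exposes for it (not part of the project code):

import torch

print(torch.backends.cudnn.is_available())   # True when cuDNN can be used
print(torch.backends.cudnn.version())        # cuDNN version PyTorch was built against

# Let cuDNN benchmark its convolution/RNN algorithms for the given input sizes
torch.backends.cudnn.benchmark = True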

The whole project is implemented in PyTorch and is run on an Nvidia GTX 960M GPU.
PyTorch is Facebook's open source machine learning library and is implemented on top of
CUDA and CuDNN. Training the model takes about 4-5 days on the mentioned system. The
testing forward pass is very fast and takes about 3-4 milliseconds per sentence, so a given
sentence can be translated in close to real time.
Result

Figure depicting GPU usage while training

Figure depicting training the model

Figure depicting results after training


Comparison with existing models

The model that we compete against is the normal encoder-decoder type model that finds itself
difficult to cope with longer sentences. Hence, RNNencdec in the below table are the results
obtained by our competitor. Our model is denoted by RNNsearch. The below table shows the
BLEU scores for the models that have been tested. 20 & 30 denote the length of the longest
sentence in the dataset that we limited to. For example, RNNencdec-20 means the encoder-
decoder model (our competitor) with the length of longest sentence being 20.

RNNsearch-30* is our model but trained for a very long period of time. As is evident from
the table, the BLEU score obtained for RNNsearch-30* outperforms the RNNencdec model
by a very large margin (almost about twice the score).

MODEL            BLEU SCORE

RNNencdec-20     12.12
RNNsearch-20     19.71
RNNencdec-30     15.32
RNNsearch-30     24.55
RNNsearch-30*    27.88

Performance parameters considered

BLEU:

BLEU, an evaluation metric on how good a generated translation is. The main principle
behind BLEU metric is the measurement of the overlap in unigrams (single words) and
higher order n-grams of words, between a translation being evaluated and a set of one or
more reference translations. A perfect match gives a score of 1, whereas a perfect mismatch
gives a score of 0. It was the metric proposed that evaluate a machine learning model that was
very much needed by the community
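
As a small illustration (a sketch using NLTK's implementation, not part of the project code; the example sentences are made up), BLEU can be computed for a single sentence pair as follows:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["she", "is", "five", "years", "younger", "than", "me", "."]
candidate = ["she", "is", "five", "years", "younger", "than", "i", "am", "."]

# One reference here; several reference translations can be supplied in the list
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print("BLEU:", score)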
METEOR:

The METEOR translation metric, proposed by the Language Technologies Institute at
Carnegie Mellon University, addresses weaknesses in the BLEU metric. It evaluates a
translation by computing a score based on explicit word-to-word matches between the
translation and a reference translation. If more than one reference translation is available, the
given translation is scored against each reference independently and the best score is reported.
It addresses several of BLEU's problems and generally gives translation scores that agree
better with human judgement.

Conclusion

In this work, we show that an attention model can outperform a standard encoder-decoder
network. The intuition is that the model behaves much like a human translator. The
encoder-decoder network encodes the whole sentence, builds a single representation of it and
then decodes it, keeping the entire sentence in context at once. Our model, much like a
human, instead takes a chunk of words, pays attention to the words in its vicinity, and then
works out a translation suitable for that context. Because it only considers a small context of
the whole sentence while translating, the length of the sentence does not matter to its
operation.

We expected our model to outperform the standard encoder-decoder model trained with a
sequence-to-sequence translation technique, but were surprised by the size of the
improvement: the BLEU score nearly doubled, beating the standard model by a large margin.

References

[1] Auli, M., Galley, M., Quirk, C., and Zweig, G. (2013). Joint language and translation
modelling with recurrent neural networks. EMNLP.
[2] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473.
[3] Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural
networks. NIPS.
[4] Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic
language model. JMLR.
[5] Cho, K., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase
representations using RNN encoder-decoder for statistical machine translation. Association
for Computational Linguistics.
[6] Dahl, G., and Yu, D. (2012). Context dependent pre-trained deep neural networks for large
vocabulary speech recognition. arXiv.
[7] Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and
robust neural network joint models for statistical machine translation. Association for
Computational Linguistics.
[8] Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv.
[9] Graves, A. (2012). Sequence transduction with recurrent neural networks. arXiv.
[10] Hochreiter, S., and Schmidhuber, J. (1997). Long Short Term Memory Networks (LSTM).
Association for Computational Linguistics, IEEE.
[11] Kalchbrenner, N., and Blunsom, P. (2013). Recurrent continuous translation models.
Association for Computational Linguistics (ACL).
[12] Pouget-Abadie, J., Bahdanau, D., van Merrienboer, B., Cho, K., and Bengio, Y. (2014).
Overcoming the curse of sentence length for neural machine translation using automatic
segmentation. ACL.
[13] Pouget-Abadie, J., Bahdanau, D., van Merrienboer, B., Cho, K., and Bengio, Y. (2014).
On the properties of neural machine translation: encoder and decoder approaches. ACL.
[14] Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent
neural network based language model. Interspeech.
[15] Sundermeyer, M., Schluter, R., and Ney, H. (2012). LSTM neural networks for language
modelling. Interspeech.
[16] Hinton, G., Deng, L., Yu, D., Dahl, G., Jaitly, N., and Senior, A. (2012). Deep neural
networks for acoustic modelling in speech recognition. IEEE Signal Processing Magazine.
[17] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic
evaluation of machine translation. ACL.
[18] Banerjee, S., and Lavie, A. (2014). METEOR: an automatic metric for MT evaluation
with improved correlation with human judgments. European Chapter of the Association for
Computational Linguistics.
[19] Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase based translation. ACL.
[20] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. Association for Computing Machinery.
[21] Zweig, G., and Makarychev, K. (2013). Speed regularization and optimality in word
classing. IEEE International Conference on Acoustics, Speech and Signal Processing.
[22] Schwenk, H., Rousseau, A., and Attik, M. (2012). Large, pruned or continuous space
language models on a GPU for statistical machine translation. ACL.
[23] Och, F. J. (2003). Minimum error rate training in statistical machine translation. ACL.
[24] Vedantam, R., Zitnick, C. L., and Parikh, D. (2014). CIDEr: consensus-based image
description evaluation. Computer Vision and Pattern Recognition.
[25] Goodman, J. T. (2001). A bit of progress in language modeling. arXiv.
