lmthang/nmt.matlab

Effective Approaches to Attention-based Neural Machine Translation

Code to train Neural Machine Translation systems as described in our EMNLP paper Effective Approaches to Attention-based Neural Machine Translation.

Features:

  • Multi-layer bilingual encoder-decoder models: GPU-enabled.
  • Attention mechanisms: global and local models.
  • Beam-search decoder: can decode multiple models.
  • Other features: dropout, training of monolingual language models.
  • End-to-end pipeline: scripts to preprocess data and compute evaluation scores.

Citations:

If you make use of this code in your research, please cite our paper:

@InProceedings{luong-pham-manning:2015:EMNLP,
  author    = {Luong, Minh-Thang  and  Pham, Hieu  and  Manning, Christopher D.},
  title     = {Effective Approaches to Attention-based Neural Machine Translation},
  booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  year      = {2015},
}

Files

README.md       - this file
code/           - main Matlab code
  trainLSTM.m: train models
  testLSTM.m: decode models
data/           - toy data
scripts/        - utility scripts

The code directory further divides into sub-directories:

  basic/: defines basic functions like sigmoid and its derivative (prime); it also has an efficient way to aggregate embeddings (see the small sketch after this list).
  layers/: defines various layers such as attention, LSTM, etc.
  misc/: things that we haven't categorized yet.
  preprocess/: data handling.
  print/: prints results and logs for debugging purposes.
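
As a rough illustration of the kind of helpers in basic/, here is a minimal sketch of a sigmoid and its derivative (illustrative only, not the repository's exact code):

sigmoid      = @(x) 1 ./ (1 + exp(-x));   % elementwise logistic sigmoid
sigmoidPrime = @(y) y .* (1 - y);         % derivative, expressed in terms of y = sigmoid(x)

y  = sigmoid([-2 0 2]);                   % -> [0.1192 0.5000 0.8808]
dy = sigmoidPrime(y);                     % -> [0.1050 0.2500 0.1050]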

Examples

  • Prepare the data
./scripts/run_prepare_data.sh ./data/train.10k.en ./data/valid.100.en ./data/test.100.en 1000 ./output/id.1000
./scripts/run_prepare_data.sh ./data/train.10k.de ./data/valid.100.de ./data/test.100.de 1000 ./output/id.1000

Here, we convert the train/valid/test files from text format into an integer format that can be handled efficiently in Matlab. The script syntax is:

run_prepare_data.sh <trainFile> <validFile> <testFile> <vocabSize> <outDir>
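
To illustrate the idea (a hypothetical toy vocabulary; the actual script may use different indexing and special tokens), each word is mapped to its position in the vocabulary file, so a text sentence becomes a line of integers:

vocab  = {'<unk>', 'the', 'cat', 'sat'};   % one word per line, as in the vocab files
lookup = containers.Map(vocab, 1:numel(vocab));
words  = strsplit('the cat sat on the mat');
ids    = zeros(1, numel(words));
for i = 1:numel(words)
  if isKey(lookup, words{i})
    ids(i) = lookup(words{i});
  else
    ids(i) = lookup('<unk>');              % out-of-vocabulary word
  end
end
disp(ids)                                  % -> 2 3 4 1 2 1
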
  • Train a bilingual, encoder-decoder model

In Matlab, go into the code/ directory and run:

trainLSTM('../output/id.1000/train.10k', '../output/id.1000/valid.100', '../output/id.1000/test.100', 'de', 'en', '../output/id.1000/train.10k.de.vocab.1000', '../output/id.1000/train.10k.en.vocab.1000', '../output/basic', 'isResume', 0)

This trains a very basic model with all the default settings. We set 'isResume' to 0 so that it will train a new model each time you run the command instead of loading existing models. See trainLSTM.m for more options.

(To run directly from your terminal, check out scripts/train.sh)

The syntax is:

trainLSTM(trainPrefix,validPrefix,testPrefix,srcLang,tgtLang,srcVocabFile,tgtVocabFile,outDir,varargin)
% Arguments:
%   trainPrefix, validPrefix, testPrefix: expect files trainPrefix.srcLang,
%     trainPrefix.tgtLang. Similarly for validPrefix and testPrefix.
%     These data files contain sequences of integers one per line.
%   srcLang, tgtLang: languages, e.g. en, de.
%   srcVocabFile, tgtVocabFile: one word per line.
%   outDir: output directory.
%   varargin: other optional arguments.

The trainer code outputs logs such as:

1, 20, 6.38K, 1, 6.52, gN=11.62

which means: at epoch 1, after 20 mini-batches, the training speed is 6.38K words/s, the learning rate is 1, the training cost is 6.52, and the gradient norm is 11.62.

Once in a while, the code will evaluate on the valid and test sets:

# eval 34.40, 2, 144, 6.19K, 1.00, train=4.76, valid=3.63, test=3.54, W_src{1}=0.050 W_tgt{1}=0.081 W_emb_src=0.050 W_emb_tgt=0.056 W_soft=0.076, time=0.57s

which gives additional information such as the current test perplexity (34.40), the valid/test costs, and the average absolute values of the model parameters.
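
Assuming the reported costs are average per-word negative log-likelihoods (natural log), perplexity is simply the exponential of the cost, which is consistent with the numbers above:

exp(3.54)   % approximately 34.5, close to the reported test perplexity of 34.40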

  • Decode
testLSTM('../output/basic/modelRecent.mat', 2, 10, 1, '../output/basic/translations.txt')

This decodes with beamSize 2, collects at most 10 translations, and uses batchSize 1; a toy sketch of how beamSize and stackSize interact follows the argument list below.

Note that testLSTM implicitly decodes the test file specified during training. To specify a different test file, use the 'testPrefix' option (see testLSTM.m for more), for example:

testLSTM('../output/basic/modelRecent.mat', 2, 10, 1, '../output/basic/translations.txt', 'testPrefix', '../output/id.1000/valid.100')

(To run directly from your terminal, check out scripts/test.sh)

Syntax:

testLSTM(modelFiles, beamSize, stackSize, batchSize, outputFile,varargin)
% Arguments:
%   modelFiles: single or multiple models to decode. Multiple models are
%     separated by commas.
%   beamSize: number of hypotheses kept at each time step.
%   stackSize: number of translations retrieved.
%   batchSize: number of sentences decoded simultaneously. We only ensure
%     accuracy of batchSize = 1 for now.
%   outputFile: output translation file.
%   varargin: other optional arguments.
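
As a toy illustration of how beamSize and stackSize interact during decoding (a hypothetical scorer, not the repository's LSTM decoder):

beamSize  = 2;     % hypotheses kept at each time step
stackSize = 10;    % completed translations to collect
maxLen    = 5;     % hypothetical maximum output length
eosId     = 1;     % hypothetical end-of-sentence id
vocabSize = 4;

% Toy scorer: fixed next-token log-probabilities (the real decoder scores with the trained model).
nextLogProbs = @(prefix) log((1:vocabSize) / sum(1:vocabSize));

beams = {struct('tokens', [], 'score', 0)};   % live hypotheses
done  = {};                                   % finished translations
for t = 1:maxLen
  cand = struct('tokens', {}, 'score', {});
  for b = 1:numel(beams)
    lp = nextLogProbs(beams{b}.tokens);
    for w = 1:vocabSize                       % expand every live hypothesis by every word
      cand(end+1) = struct('tokens', [beams{b}.tokens w], 'score', beams{b}.score + lp(w));
    end
  end
  [~, order] = sort([cand.score], 'descend'); % rank all expansions by score
  beams = {};
  for k = order
    if cand(k).tokens(end) == eosId && numel(done) < stackSize
      done{end+1} = cand(k);                  % hypothesis ended: store it as a translation
    elseif numel(beams) < beamSize
      beams{end+1} = cand(k);                 % otherwise keep only the top beamSize hypotheses
    end
  end
end
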
  • Grad check
trainLSTM('', '', '', '', '', '', '', '../output/gradcheck', 'isGradCheck', 1)

Advanced Examples

  • Profiling
trainLSTM('../output/id.1000/train.10k', '../output/id.1000/valid.100', '../output/id.1000/test.100', 'de', 'en', '../output/id.1000/train.10k.de.vocab.1000', '../output/id.1000/train.10k.en.vocab.1000', '../output/basic', 'isProfile', 1, 'isResume', 0)
  • Train a monolingual language model
trainLSTM('../output/id.1000/train.10k', '../output/id.1000/valid.100','../output/id.1000/test.100', '', 'en', '','../output/id.1000/train.10k.en.vocab.1000', '../output/mono', 'isBi', 0, 'isResume', 0)
  • Train multi-layer model with dropout
trainLSTM('../output/id.1000/train.10k', '../output/id.1000/valid.100', '../output/id.1000/test.100', 'de', 'en', '../output/id.1000/train.10k.de.vocab.1000', '../output/id.1000/train.10k.en.vocab.1000', '../output/advanced', 'numLayers', 2, 'lstmSize', 100, 'dropout', 0.8, 'isResume', 0)

This trains a model with 2 stacked LSTM layers, 100 LSTM cells per layer, and 100-dimensional embeddings. Here, 0.8 is the dropout "keep" probability.
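
A minimal sketch of dropout with a "keep" probability (illustrative; rescaling by the keep probability is a common convention and may differ from the repository's exact implementation):

keepProb = 0.8;                                   % probability of keeping a unit
h        = rand(100, 1);                          % hypothetical layer activations
mask     = (rand(size(h)) < keepProb) / keepProb; % drop units, rescale the rest to preserve the expectation
hTrain   = h .* mask;                             % applied at training time only; at test time h is used as-is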

  • More control of hyperparameters
trainLSTM('../output/id.1000/train.10k', '../output/id.1000/valid.100', '../output/id.1000/test.100', 'de', 'en', '../output/id.1000/train.10k.de.vocab.1000', '../output/id.1000/train.10k.en.vocab.1000', '../output/hypers', 'learningRate', 1.0, 'maxGradNorm', 5, 'initRange', 0.1, 'batchSize', 128, 'numEpoches', 10, 'finetuneEpoch', 5, 'isResume', 0)

Apart from the "obvious" hyperparameters, we scale the gradients whenever their norm, averaged over the batch size (128), exceeds 5. After training for 5 epochs, we halve the learning rate every subsequent epoch. For finer control of the learning-rate schedule, see the 'epochFraction' and 'finetuneRate' options in trainLSTM.m.
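
A minimal sketch of the two schedules just described (hypothetical variable names, not the repository's exact code):

maxGradNorm  = 5;   batchSize     = 128;
learningRate = 1.0; finetuneEpoch = 5;
grad  = randn(1000, 1);                     % hypothetical flattened gradient of one mini-batch
epoch = 6;                                  % hypothetical current epoch

% Gradient scaling: shrink the update when the per-batch gradient norm exceeds maxGradNorm.
gradNorm = norm(grad) / batchSize;
if gradNorm > maxGradNorm
  grad = grad * (maxGradNorm / gradNorm);
end

% Learning-rate schedule: halve the rate once per epoch after finetuneEpoch.
if epoch > finetuneEpoch
  learningRate = learningRate * 0.5;
end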

  • Train attention model:
trainLSTM('../output/id.1000/train.10k', '../output/id.1000/valid.100', '../output/id.1000/test.100', 'de', 'en', '../output/id.1000/train.10k.de.vocab.1000', '../output/id.1000/train.10k.en.vocab.1000', '../output/attn', 'attnFunc', 1, 'attnOpt', 1, 'isReverse', 1, 'feedInput', 1, 'isResume', 0)

Here, we also use source reversing ('isReverse') and the input-feeding approach ('feedInput') as described in the paper. Other attention architectures can be specified as follows (a sketch of the content-based scores appears after this list):

  % attnFunc=0: no attention.
  %          1: global attention
  %          2: local attention + monotonic alignments
  %          4: local attention  + regression for absolute pos (multiplied distWeights)
  % attnOpt: decide how we generate the alignment weights:
  %          0: location-based
  %          1: content-based, dot product
  %          2: content-based, general dot product
  %          3: content-based, concat Montreal style
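
As a rough sketch of the content-based scores (attnOpt 1-3), following the formulations in the paper (hypothetical dimensions, not the repository's exact code):

d = 4; srcLen = 6;
h_t = randn(d, 1);                        % current target hidden state
H_s = randn(d, srcLen);                   % source hidden states, one per column
W_a = randn(d, d);                        % parameter for the "general" score
W_c = randn(d, 2*d); v_a = randn(d, 1);   % parameters for the "concat" score

scoreDot     = H_s' * h_t;                                          % attnOpt=1: dot product
scoreGeneral = H_s' * (W_a * h_t);                                  % attnOpt=2: general
scoreConcat  = (v_a' * tanh(W_c * [repmat(h_t, 1, srcLen); H_s]))'; % attnOpt=3: concat

alignWeights = exp(scoreDot) / sum(exp(scoreDot));                  % softmax over source positions
contextVec   = H_s * alignWeights;                                  % attention context vector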

'isResume' is set to 0 to avoid loading existing models (which is done by default), so that you can try different attention architectures.

  • More grad checks:
./scripts/run_grad_checks.sh > output/grad_checks.txt 2>&1

Then compare with the provided grad-check output in data/grad_checks.txt; the two should look similar.

Note: the run_grad_checks.sh script runs many different configurations. For many configurations, we set 'initRange' to a large value (10), so you will notice that the total gradient differences are large. This is to debug subtle mistakes; if the total diff is < 10, you can mostly be assured that the gradients are correct. We do note that with attnFunc=4, attnOpt=1, the diff is quite large; this is something still to be checked, though the model seems to work in practice.
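
In spirit, the grad check compares analytic gradients against finite-difference estimates; here is a minimal sketch with a hypothetical loss (not the repository's exact procedure):

f            = @(w) 0.5 * sum(w.^2);      % hypothetical loss function
gradAnalytic = @(w) w;                    % its analytic gradient
w     = randn(5, 1);
delta = 1e-4;
gradNumeric = zeros(size(w));
for i = 1:numel(w)
  e = zeros(size(w)); e(i) = delta;
  gradNumeric(i) = (f(w + e) - f(w - e)) / (2 * delta);   % central finite difference
end
totalDiff = sum(abs(gradNumeric - gradAnalytic(w)))       % should be tiny for a correct gradient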
