
Improving Language Understanding

by Generative Pre-Training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
OpenAI
alec@openai.com, karthikn@openai.com, tim@openai.com, ilyasu@openai.com

Abstract
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

[Margin notes: Large unlabeled text is easy to find, but labeled data for such tasks is scarce. The method is a generative pre-training step followed by discriminative fine-tuning. Task-specific input transformations during fine-tuning give good results; transfer is effective and requires no changes to the model architecture. The task-agnostic model performs better than models built specifically for a certain type of task.]

1 Introduction
The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pre-trained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].

[Margin notes: For problems where annotated data is not available in large quantities, models that can capture linguistic information from unlabeled data can serve as a replacement for more annotation. Good representations in general always help, whether or not annotations are available.]

Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling [44], machine translation [38], and discourse coherence [22], with each method outperforming the others on different tasks.^1 Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture [43, 44], using intricate learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.

[Margin notes: (1) There is no clear way yet of knowing which optimization objective will help for which task when using transfer learning. (2) There is no consensus on the most effective way of transferring the learned representations to the target task.]

^1 https://gluebenchmark.com/leaderboard

Preprint. Work in progress.


In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.

[Margin notes: The goal is to learn a universal representation that works well with most tasks (transfer learning) and requires very little input adaptation. This is interesting: the tasks for which we have annotated data need not be in the same domain as the unlabeled corpus. Training is a two-step process.]

For our model architecture, we use the Transformer [62], which has been shown to perform strongly on various tasks such as machine translation [62], document generation [34], and syntactic parsing [29]. This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style approaches [52], which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model.

[Margin note: Transformers are used as the underlying model architecture to capture the long-term dependencies that help on such diverse tasks.]
We evaluate our approach on four types of language understanding tasks – natural language inference,
question answering, semantic similarity, and text classification. Our general task-agnostic model
outperforms discriminatively trained models that employ architectures specifically crafted for each
task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance,
we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40],
5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66] and 5.5% on
the recently introduced GLUE multi-task benchmark [64]. We also analyzed zero-shot behaviors
of the pre-trained model on four different settings and demonstrate that it acquires useful linguistic
knowledge for downstream tasks.

2 Related Work

Semi-supervised learning for NLP  Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics.

[Margin notes: Word embeddings work better than any statistical feature extraction. The paper, however, wants to focus on higher-level semantics.]

Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks [28, 32, 1, 36, 22, 12, 56, 31].

[Margin note: Important ones: 1. skip-thought vectors, 2. paragraph vectors, 3. quick-thought vectors, 4. InferSent, 5. GenSen, 6. unsupervised training for machine translation.]

Unsupervised pre-training  Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification [20, 49, 63] and regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification [69], speech recognition [68], entity disambiguation [17] and machine translation [48].

The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection and story completion. Other approaches [43, 44, 38] use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial amount of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer.

[Margin notes: There is existing work following similar lines, i.e., pre-training a neural network using a language modeling objective and then fine-tuning it on a task with supervision, but [13] and [21] used LSTMs while this paper uses transformers, which capture a longer context and perform well on a wide range of tasks. Some approaches use the final layer or other hidden-layer weights of the LM or MT model as embeddings and use those as features; however, this requires a large number of new parameters for each separate task, whereas the goal of this paper is minimal changes to the model during transfer to the different tasks.]

Auxiliary training objectives  Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.

[Margin notes: Interesting: adding a language model objective helped improve performance for sequence labeling tasks. This paper also uses an auxiliary objective, but the unsupervised pre-training already learns several linguistic aspects relevant to the tasks at hand.]

3 Framework
Our training procedure consists of two stages. The first stage is learning a high-capacity language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to
a discriminative task with labeled data.

3.1 Unsupervised pre-training

Given an unsupervised corpus of tokens U = {u_1, ..., u_n}, we use a standard language modeling objective to maximize the following likelihood:

    L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)    (1)

where k is the size of the context window, and the conditional probability P is modeled using a neural network with parameters Θ. These parameters are trained using stochastic gradient descent [51].

In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:

    h_0 = U W_e + W_p
    h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]    (2)
    P(u) = \text{softmax}(h_n W_e^{\top})

where U = (u_{-k}, ..., u_{-1}) is the context vector of tokens, n is the number of layers, W_e is the token embedding matrix, and W_p is the position embedding matrix.
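As a concrete illustration, the following is a minimal sketch (not the paper's released code) of the decoder-only language model behind Eqs. (1)-(2), written with standard PyTorch modules. The class name DecoderLM, the use of nn.TransformerEncoderLayer with a causal mask as a stand-in for transformer_block, and tying the output projection to W_e are illustrative assumptions; layer-norm placement and initialization differ from the exact model described in Section 4.1.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLM(nn.Module):
    def __init__(self, vocab_size=40000, n_ctx=512, d_model=768, n_layer=12, n_head=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)           # token embedding matrix W_e
        self.pos_emb = nn.Parameter(torch.zeros(n_ctx, d_model))   # learned position matrix W_p
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_head, dim_feedforward=3072,
                                       dropout=0.1, activation="gelu", batch_first=True)
            for _ in range(n_layer)
        ])

    def hidden(self, idx):
        # idx: (batch, seq) token ids; an additive causal mask keeps the model autoregressive
        t = idx.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=idx.device), diagonal=1)
        h = self.tok_emb(idx) + self.pos_emb[:t]                   # h_0 = U W_e + W_p
        for block in self.blocks:                                  # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=causal)
        return h                                                   # h_n: (batch, seq, d_model)

    def forward(self, idx):
        return self.hidden(idx) @ self.tok_emb.weight.T            # logits; softmax gives P(u)

def lm_loss(model, tokens):
    # L_1 from Eq. (1) as an average next-token negative log-likelihood (to be minimized)
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))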

3.2 Supervised fine-tuning


After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens, x^1, ..., x^m, along with a label y. The inputs are passed through our pre-trained model to obtain the final transformer block's activation h_l^m, which is then fed into an added linear output layer with parameters W_y to predict y:

    P(y \mid x^1, \ldots, x^m) = \text{softmax}(h_l^m W_y)    (3)

This gives us the following objective to maximize:

    L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)    (4)

We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50, 43], who also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight λ):

    L_3(C) = L_2(C) + \lambda \cdot L_1(C)    (5)

Overall, the only extra parameters we require during fine-tuning are W_y, and embeddings for delimiter tokens (described below in Section 3.3).

[Margin note: Inputs x^1, ..., x^m are passed through the pre-trained model to get the final transformer block's activation h_l^m; this is then fed into a linear layer with parameters W_y to predict y.]
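A minimal sketch of the combined fine-tuning objective in Eqs. (3)-(5), reusing the DecoderLM and lm_loss sketch above. Taking the hidden state at the last input position as h_l^m and representing W_y as an nn.Linear head are illustrative assumptions, not the released implementation.

import torch.nn as nn
import torch.nn.functional as F

def finetune_loss(model, clf_head, tokens, labels, lam=0.5):
    # clf_head plays the role of W_y, e.g. clf_head = nn.Linear(768, n_classes, bias=False)
    h = model.hidden(tokens)                     # final transformer block activations
    logits_y = clf_head(h[:, -1])                # h_l^m: activation at the last token position
    l2 = F.cross_entropy(logits_y, labels)       # Eq. (4), supervised objective
    l1 = lm_loss(model, tokens)                  # auxiliary LM objective on the labeled data
    return l2 + lam * l1                         # Eq. (5), with lambda = 0.5 as in Section 4.1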

Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input
transformations for fine-tuning on different tasks. We convert all structured inputs into token
sequences to be processed by our pre-trained model, followed by a linear+softmax layer.

3.3 Task-specific input transformations

For some tasks, like text classification, we can directly fine-tune our model as described above.
Certain other tasks, like question answering or textual entailment, have structured inputs such as
ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model
was trained on contiguous sequences of text, we require some modifications to apply it to these tasks.
Previous work proposed learning task specific architectures on top of transferred representations [44].
Such an approach re-introduces a significant amount of task-specific customization and does not
use transfer learning for these additional architectural components. Instead, we use a traversal-style
approach [52], where we convert structured inputs into an ordered sequence that our pre-trained
model can process. These input transformations allow us to avoid making extensive changes to the
architecture across tasks. We provide a brief description of these input transformations below and
Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens (⟨s⟩, ⟨e⟩).
Textual entailment  For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.

Similarity  For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations h_l^m, which are added element-wise before being fed into the linear output layer.

[Margin note: Since we are converting two sequences into a single one, the ordering could play a role during concatenation. So both orderings are tried, and the element-wise sum of the obtained representations is taken before passing it through the linear layer.]

Question Answering and Commonsense Reasoning  For these tasks, we are given a context document z, a question q, and a set of possible answers {a_k}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get [z; q; $; a_k]. Each of these sequences is processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers.

[Margin note: Interesting technique: the delimiter is introduced between (z, q) and a_k.]
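A small, purely illustrative sketch of these input transformations at the token level. The special-token names and the encode() tokenizer argument are assumptions standing in for the learned start, end, and delimiter embeddings and the BPE tokenizer.

START, DELIM, END = "<s>", "$", "<e>"   # stand-ins for the randomly initialized special tokens

def entailment_input(premise, hypothesis, encode):
    # [<s>; premise; $; hypothesis; <e>]
    return [START] + encode(premise) + [DELIM] + encode(hypothesis) + [END]

def similarity_inputs(text_a, text_b, encode):
    # No inherent ordering: build both orderings; their h_l^m are summed element-wise downstream
    return (entailment_input(text_a, text_b, encode),
            entailment_input(text_b, text_a, encode))

def qa_inputs(document, question, answers, encode):
    # One sequence [z; q; $; a_k] per candidate answer; scores are normalized with a softmax
    context = encode(document) + encode(question)
    return [[START] + context + [DELIM] + encode(a) + [END] for a in answers]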
4 Experiments

4.1 Setup

Unsupervised pre-training  We use the BooksCorpus dataset [71] for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information. An alternative dataset, the 1B Word Benchmark, which is used by a similar approach, ELMo [44], is approximately the same size but is shuffled at a sentence level, destroying long-range structure. Our language model achieves a very low token-level perplexity of 18.4 on this corpus.

[Margin notes: A dataset with long sequences was chosen. Model perplexity serves as a good technique for self-evaluation; a lower score is good, meaning the model works well at predicting the next token.]

Table 1: A list of the different tasks and datasets used in our experiments.

Task                         Datasets
Natural language inference   SNLI [5], MultiNLI [66], Question NLI [64], RTE [4], SciTail [25]
Question answering           RACE [30], Story Cloze [40]
Sentence similarity          MSR Paraphrase Corpus [14], Quora Question Pairs [9], STS Benchmark [6]
Classification               Stanford Sentiment Treebank-2 [54], CoLA [65]

Model specifications  Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768-dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072-dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non-bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library^2 to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer.^3

[Margin notes: Note the LR scheduling. Important experiment details. Positional encodings are learned instead of using the sinusoidal version proposed in the Transformer paper.]
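For reference, a sketch of the learning-rate schedule just described: linear warmup from 0 to 2.5e-4 over the first 2000 updates, then cosine annealing back to 0. The total_updates argument is an assumption; the text specifies the shape of the schedule but not this exact function.

import math

def pretrain_lr(step, max_lr=2.5e-4, warmup=2000, total_updates=100_000):
    # Linear warmup from zero, then cosine annealing down to zero
    if step < warmup:
        return max_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total_updates - warmup))
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))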

Fine-tuning details  Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5.

[Margin note: 3 epochs is pretty fast convergence.]
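The fine-tuning hyperparameters quoted above, gathered into one place for reference; the dict itself is just an illustrative convention, not something from the paper.

FINETUNE_CONFIG = {
    "classifier_dropout": 0.1,
    "learning_rate": 6.25e-5,
    "batch_size": 32,
    "epochs": 3,
    "warmup_fraction": 0.002,   # linear warmup over 0.2% of training, then linear decay
    "aux_lm_weight": 0.5,       # lambda in Eq. (5)
}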

4.2 Supervised fine-tuning

We perform experiments on a variety of supervised tasks including natural language inference, question answering, semantic similarity, and text classification. Some of these tasks are available as part of the recently released GLUE multi-task benchmark [64], which we make use of. Table 1 provides an overview of all the tasks and datasets.

Natural Language Inference The task of natural language inference (NLI), also known as recog-
nizing textual entailment, involves reading a pair of sentences and judging the relationship between
them from one of entailment, contradiction or neutral. Although there has been a lot of
recent interest [58, 35, 44], the task remains challenging due to the presence of a wide variety of
phenomena like lexical entailment, coreference, and lexical and syntactic ambiguity. We evaluate
on five datasets with diverse sources, including image captions (SNLI), transcribed speech, popular
fiction, and government reports (MNLI), Wikipedia articles (QNLI), science exams (SciTail) or news
articles (RTE).
Table 2 details various results on the different NLI tasks for our model and previous state-of-the-art
approaches. Our method significantly outperforms the baselines on four of the five datasets, achieving
absolute improvements of up to 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI and 0.6% on SNLI
over the previous best results. This demonstrates our model’s ability to better reason over multiple
sentences, and handle aspects of linguistic ambiguity. On RTE, one of the smaller datasets we
evaluate on (2490 examples), we achieve an accuracy of 56%, which is below the 61.7% reported by a
multi-task biLSTM model. Given the strong performance of our approach on larger NLI datasets, it is
likely our model will benefit from multi-task training as well but we have not explored this currently.
^2 https://ftfy.readthedocs.io/en/latest/
^3 https://spacy.io/
Table 2: Experimental results on natural language inference tasks, comparing our model with current
state-of-the-art methods. 5x indicates an ensemble of 5 models. All datasets use accuracy as the
evaluation metric.

Method                                MNLI-m  MNLI-mm  SNLI  SciTail  QNLI  RTE
ESIM + ELMo [44] (5x)                 -       -        89.3  -        -     -
CAFE [58] (5x)                        80.2    79.0     89.3  -        -     -
Stochastic Answer Network [35] (3x)   80.6    80.1     -     -        -     -
CAFE [58]                             78.7    77.9     88.5  83.3     -     -
GenSen [64]                           71.4    71.3     -     -        82.3  59.2
Multi-task BiLSTM + Attn [64]         72.2    72.1     -     -        82.1  61.7
Finetuned Transformer LM (ours)       82.1    81.4     89.9  88.3     88.1  56.0

Table 3: Results on question answering and commonsense reasoning, comparing our model with current state-of-the-art methods. 9x means an ensemble of 9 models.

Method                            Story Cloze  RACE-m  RACE-h  RACE
val-LS-skip [55]                  76.5         -       -       -
Hidden Coherence Model [7]        77.6         -       -       -
Dynamic Fusion Net [67] (9x)      -            55.6    49.4    51.2
BiAttention MRU [59] (9x)         -            60.2    50.3    53.3
Finetuned Transformer LM (ours)   86.5         62.9    57.4    59.0

Question answering and commonsense reasoning  Another task that requires aspects of single and multi-sentence reasoning is question answering. We use the recently released RACE dataset [30], consisting of English passages with associated questions from middle and high school exams. This corpus has been shown to contain more reasoning-type questions than other datasets like CNN [19] or SQuAD [47], providing the perfect evaluation for our model, which is trained to handle long-range contexts. In addition, we evaluate on the Story Cloze Test [40], which involves selecting the correct ending to multi-sentence stories from two options. On these tasks, our model again outperforms the previous best results by significant margins – up to 8.9% on Story Cloze, and 5.7% overall on RACE. This demonstrates the ability of our model to handle long-range contexts effectively.

[Margin notes: The RACE dataset has more reasoning-type questions, mainly because it comes from middle and high school exam questions. Better results on both Story Cloze and RACE.]

Semantic Similarity  Semantic similarity (or paraphrase detection) tasks involve predicting whether two sentences are semantically equivalent or not. The challenges lie in recognizing rephrasing of concepts, understanding negation, and handling syntactic ambiguity. We use three datasets for this task – the Microsoft Paraphrase corpus (MRPC) [14] (collected from news sources), the Quora Question Pairs (QQP) dataset [9], and the Semantic Textual Similarity benchmark (STS-B) [6]. We obtain state-of-the-art results on two of the three semantic similarity tasks (Table 4) with a 1 point absolute gain on STS-B. The performance delta on QQP is significant, with a 4.2% absolute improvement over Single-task BiLSTM + ELMo + Attn.

[Margin note: Better results on 2/3 datasets.]

Classification  Finally, we also evaluate on two different text classification tasks. The Corpus of Linguistic Acceptability (CoLA) [65] contains expert judgements on whether a sentence is grammatical or not, and tests the innate linguistic bias of trained models. The Stanford Sentiment Treebank (SST-2) [54], on the other hand, is a standard binary classification task. Our model obtains a score of 45.4 on CoLA, which is an especially big jump over the previous best result of 35.0, showcasing the innate linguistic bias learned by our model. The model also achieves 91.3% accuracy on SST-2, which is competitive with the state-of-the-art results. We also achieve an overall score of 72.8 on the GLUE benchmark, which is significantly better than the previous best of 68.9.

[Margin note: Better results on both datasets.]

Table 4: Semantic similarity and classification results, comparing our model with current state-of-the-art methods. All task evaluations in this table were done using the GLUE benchmark. (mc = Matthews correlation, acc = accuracy, pc = Pearson correlation)

Method                                   CoLA   SST2   MRPC   STSB   QQP    GLUE
                                         (mc)   (acc)  (F1)   (pc)   (F1)
Sparse byte mLSTM [16]                   -      93.2   -      -      -      -
TF-KLD [23]                              -      -      86.0   -      -      -
ECNU (mixed ensemble) [60]               -      -      -      81.0   -      -
Single-task BiLSTM + ELMo + Attn [64]    35.0   90.2   80.2   55.5   66.1   64.8
Multi-task BiLSTM + ELMo + Attn [64]     18.9   91.6   83.5   72.8   63.3   68.9
Finetuned Transformer LM (ours)          45.4   91.3   82.3   82.0   70.3   72.8

Overall, our approach achieves new state-of-the-art results in 9 out of the 12 datasets we evaluate on, outperforming ensembles in many cases. Our results also indicate that our approach works well across datasets of different sizes, from smaller datasets such as STS-B (≈5.7k training examples) to the largest one, SNLI (≈550k training examples).

[Margin notes: The model performs well whether we have little data or a lot of data. The fact that it outperforms ensembles as well is impressive.]

5 Analysis
Impact of number of layers transferred  We observed the impact of transferring a variable number of layers from unsupervised pre-training to the supervised target task. Figure 2 (left) illustrates the performance of our approach on MultiNLI and RACE as a function of the number of layers transferred. We observe the standard result that transferring embeddings improves performance and that each transformer layer provides further benefits, up to 9% for full transfer on MultiNLI. This indicates that each layer in the pre-trained model contains useful functionality for solving target tasks.

[Margin note: Transferring more layers from the unsupervised task to the supervised task helps.]

Figure 2: (left) Effect of transferring increasing number of layers from the pre-trained language
model on RACE and MultiNLI. (right) Plot showing the evolution of zero-shot performance on
different tasks as a function of LM pre-training updates. Performance per task is normalized between
a random guess baseline and the current state-of-the-art with a single model.

Zero-shot Behaviors  We'd like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability, and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. We designed a series of heuristic solutions that use the underlying generative model to perform tasks without supervised fine-tuning. We visualize the effectiveness of these heuristic solutions over the course of generative pre-training in Figure 2 (right). We observe that the performance of these heuristics is stable and steadily increases over training, suggesting that generative pre-training supports the learning of a wide variety of task-relevant functionality. We also observe that the LSTM exhibits higher variance in its zero-shot performance, suggesting that the inductive bias of the Transformer architecture assists in transfer.

[Margin notes: One hypothesis is that, to improve its LM capabilities, the model itself learns many of the tasks that we evaluate on later; the effectiveness of transformers can be explained by the attention in the transformers. To compare against LSTMs, zero-shot testing is used, i.e., using the pre-trained model directly without any fine-tuning. The performance of the pre-trained model increases with more updates, meaning that more training leads to implicit learning of these tasks.]
Table 5: Analysis of various model ablations on different tasks. Avg. score is an unweighted average of all the results. (mc = Matthews correlation, acc = accuracy, pc = Pearson correlation)

Method                          Avg. Score  CoLA   SST2   MRPC   STSB   QQP    MNLI   QNLI   RTE
                                            (mc)   (acc)  (F1)   (pc)   (F1)   (acc)  (acc)  (acc)
Transformer w/ aux LM (full)    74.7        45.4   91.3   82.3   82.0   70.3   81.8   88.1   56.0
Transformer w/o pre-training    59.9        18.9   84.0   79.4   30.9   65.5   75.7   71.2   53.8
Transformer w/o aux LM          75.0        47.9   92.0   84.9   83.2   69.8   81.1   86.9   54.4
LSTM w/ aux LM                  69.1        30.3   90.5   83.2   71.8   68.1   73.7   81.1   54.6

For CoLA (linguistic acceptability), examples are scored as the average token log-probability the generative model assigns, and predictions are made by thresholding. For SST-2 (sentiment analysis), we append the token "very" to each example, restrict the language model's output distribution to only the words "positive" and "negative", and guess the token it assigns higher probability to as the prediction. For RACE (question answering), we pick the answer the generative model assigns the highest average token log-probability when conditioned on the document and question. For DPRD [46] (winograd schemas), we replace the definite pronoun with the two possible referents and predict the resolution for which the generative model assigns higher average token log-probability to the rest of the sequence after the substitution.

[Margin notes: Averaging plus thresholding for linguistic acceptability. This is so interesting: for sentiment analysis, just append "very" and restrict the output distribution to only "positive" and "negative"; the one with the higher probability is the predicted sentiment. For question answering, the answer with the highest average token log-probability, conditioned on the document and question, is chosen.]
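Two of these heuristics, written out as a hedged sketch. token_logprob and avg_token_logprob are hypothetical helpers returning (average) log-probabilities under the pre-trained language model; they are assumptions, not functions from the paper.

def zero_shot_sst2(sentence, token_logprob):
    # Append "very", restrict the LM's choice to {"positive", "negative"}, pick the likelier token
    context = sentence + " very"
    pos = token_logprob(context, "positive")
    neg = token_logprob(context, "negative")
    return "positive" if pos > neg else "negative"

def zero_shot_race(document, question, answers, avg_token_logprob):
    # Pick the answer with the highest average token log-probability given the document and question
    return max(answers, key=lambda a: avg_token_logprob(document + " " + question, a))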

Ablation studies  We perform three different ablation studies (Table 5). First, we examine the performance of our method without the auxiliary LM objective during fine-tuning. We observe that the auxiliary objective helps on the NLI tasks and QQP. Overall, the trend suggests that larger datasets benefit from the auxiliary objective but smaller datasets do not. Second, we analyze the effect of the Transformer by comparing it with a single layer 2048 unit LSTM using the same framework. We observe a 5.6 average score drop when using the LSTM instead of the Transformer. The LSTM only outperforms the Transformer on one dataset – MRPC. Finally, we also compare with our transformer architecture directly trained on supervised target tasks, without pre-training. We observe that the lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease compared to our full model.

[Margin notes: The auxiliary LM objective during fine-tuning helps on larger datasets but not much on smaller ones. LSTMs in general show a drop in performance when swapped in, though they perform better on MRPC (semantic similarity). A task-specific model (a transformer) trained in a supervised fashion without pre-training performs poorly compared to the pre-trained and fine-tuned one.]

6 Conclusion
We introduced a framework for achieving strong natural language understanding with a single
task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training
on a diverse corpus with long stretches of contiguous text, our model acquires significant world
knowledge and ability to process long-range dependencies which are then successfully transferred to
solving discriminative tasks such as question answering, semantic similarity assessment, entailment
determination, and text classification, improving the state of the art on 9 of the 12 datasets we
study. Using unsupervised (pre-)training to boost performance on discriminative tasks has long
been an important goal of Machine Learning research. Our work suggests that achieving significant
performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets
(text with long range dependencies) work best with this approach. We hope that this will help enable
new research into unsupervised learning, for both natural language understanding and other domains,
further improving our understanding of how and when unsupervised learning works.

References
[1] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.

[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In
Advances in neural information processing systems, pages 153–160, 2007.

[4] L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo. The fifth pascal recognizing textual entailment
challenge. In TAC, 2009.

[5] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural
language inference. EMNLP, 2015.

[6] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. Semeval-2017 task 1: Semantic textual
similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.

[7] S. Chaturvedi, H. Peng, and D. Roth. Story comprehension for predicting what happens next. In Proceedings
of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1603–1614, 2017.

[8] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks. In Proceedings
of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 740–750,
2014.

[9] Z. Chen, H. Zhang, X. Zhang, and L. Zhao. Quora question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs, 2018.

[10] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks
with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages
160–167. ACM, 2008.

[11] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing
(almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.

[12] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence
representations from natural language inference data. EMNLP, 2017.

[13] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing
Systems, pages 3079–3087, 2015.

[14] W. B. Dolan and C. Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings
of the Third International Workshop on Paraphrasing (IWP2005), 2005.

[15] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised
pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[16] S. Gray, A. Radford, and K. P. Diederik. Gpu kernels for block-sparse weights. 2017.

[17] Z. He, S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang. Learning entity representation for entity disam-
biguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), volume 2, pages 30–34, 2013.

[18] D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear
units. arXiv preprint arXiv:1606.08415, 2016.

[19] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching
machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–
1701, 2015.

[20] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural
computation, 18(7):1527–1554, 2006.

[21] J. Howard and S. Ruder. Universal language model fine-tuning for text classification. Association for
Computational Linguistics (ACL), 2018.

[22] Y. Jernite, S. R. Bowman, and D. Sontag. Discourse-based objectives for fast unsupervised sentence
representation learning. arXiv preprint arXiv:1705.00557, 2017.

[23] Y. Ji and J. Eisenstein. Discriminative improvements to distributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 891–896, 2013.

[24] F. Jiao, S. Wang, C.-H. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields
for improved sequence segmentation and labeling. In Proceedings of the 21st International Conference on
Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics,
pages 209–216. Association for Computational Linguistics, 2006.

[25] T. Khot, A. Sabharwal, and P. Clark. Scitail: A textual entailment dataset from science question answering.
In Proceedings of AAAI, 2018.

[26] Y. Kim. Convolutional neural networks for sentence classification. EMNLP, 2014.

[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
2014.

[28] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought
vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.

[29] N. Kitaev and D. Klein. Constituency parsing with a self-attentive encoder. ACL, 2018.

[30] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. Race: Large-scale reading comprehension dataset from
examinations. EMNLP, 2017.

[31] G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora
only. ICLR, 2018.

[32] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference
on Machine Learning, pages 1188–1196, 2014.

[33] P. Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology,
2005.

[34] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer. Generating wikipedia by
summarizing long sequences. ICLR, 2018.

[35] X. Liu, K. Duh, and J. Gao. Stochastic answer networks for natural language inference. arXiv preprint
arXiv:1804.07888, 2018.

[36] L. Logeswaran and H. Lee. An efficient framework for learning sentence representations. ICLR, 2018.

[37] I. Loshchilov and F. Hutter. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101,
2017.

[38] B. McCann, J. Bradbury, C. Xiong, and R. Socher. Learned in translation: Contextualized word vectors. In
Advances in Neural Information Processing Systems, pages 6297–6308, 2017.

[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words
and phrases and their compositionality. In Advances in neural information processing systems, pages
3111–3119, 2013.

[40] N. Mostafazadeh, M. Roth, A. Louis, N. Chambers, and J. Allen. Lsdsem 2017 shared task: The story cloze
test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level
Semantics, pages 46–51, 2017.

[41] K. Nigam, A. McCallum, and T. Mitchell. Semi-supervised text classification using em. Semi-Supervised
Learning, pages 33–56, 2006.

[42] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings
of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543,
2014.

[43] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power. Semi-supervised sequence tagging with bidirec-
tional language models. ACL, 2017.

[44] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextual-
ized word representations. NAACL, 2018.

[45] Y. Qi, D. S. Sachan, M. Felix, S. J. Padmanabhan, and G. Neubig. When and why are pre-trained word
embeddings useful for neural machine translation? NAACL, 2018.

[46] A. Rahman and V. Ng. Resolving complex cases of definite pronouns: the winograd schema challenge. In
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, pages 777–789. Association for Computational Linguistics,
2012.

[47] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension
of text. EMNLP, 2016.

[48] P. Ramachandran, P. J. Liu, and Q. V. Le. Unsupervised pretraining for sequence to sequence learning.
arXiv preprint arXiv:1611.02683, 2016.

[49] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an
energy-based model. In Advances in neural information processing systems, pages 1137–1144, 2007.

[50] M. Rei. Semi-supervised multitask learning for sequence labeling. ACL, 2017.

[51] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics,
pages 400–407, 1951.

[52] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kočiskỳ, and P. Blunsom. Reasoning about entailment
with neural attention. arXiv preprint arXiv:1509.06664, 2015.

[53] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv
preprint arXiv:1508.07909, 2015.

[54] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for
semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical
methods in natural language processing, pages 1631–1642, 2013.

[55] S. Srinivasan, R. Arora, and M. Riedl. A simple and effective approach to the story cloze test. arXiv
preprint arXiv:1803.05547, 2018.

[56] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal. Learning general purpose distributed sentence
representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079, 2018.

[57] J. Suzuki and H. Isozaki. Semi-supervised sequential labeling and segmentation using giga-word scale
unlabeled data. Proceedings of ACL-08: HLT, pages 665–673, 2008.

[58] Y. Tay, L. A. Tuan, and S. C. Hui. A compare-propagate architecture with alignment factorization for
natural language inference. arXiv preprint arXiv:1801.00102, 2017.

[59] Y. Tay, L. A. Tuan, and S. C. Hui. Multi-range reasoning for machine comprehension. arXiv preprint
arXiv:1803.09074, 2018.

[60] J. Tian, Z. Zhou, M. Lan, and Y. Wu. Ecnu at semeval-2017 task 1: Leverage kernel-based traditional nlp
features and neural networks to build a universal model for multilingual and cross-lingual semantic textual
similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017),
pages 191–197, 2017.

[61] Y. Tsvetkov. Opportunities and challenges in working with low-resource languages. CMU, 2017.

[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[63] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with
denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages
1096–1103. ACM, 2008.

[64] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and
analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[65] A. Warstadt, A. Singh, and S. R. Bowman. Corpus of linguistic acceptability. http://nyu-mll.github.io/cola, 2018.

[66] A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding
through inference. NAACL, 2018.

[67] Y. Xu, J. Liu, J. Gao, Y. Shen, and X. Liu. Towards human-level machine reading comprehension:
Reasoning and inference with multiple strategies. arXiv preprint arXiv:1711.04964, 2017.

[68] D. Yu, L. Deng, and G. Dahl. Roles of pre-training and fine-tuning in context-dependent dbn-hmms for
real-world speech recognition. In Proc. NIPS Workshop on Deep Learning and Unsupervised Feature
Learning, 2010.

[69] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel
prediction. In CVPR, volume 1, page 6, 2017.

[70] X. Zhu. Semi-supervised learning literature survey. 2005.

[71] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and
movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of
the IEEE international conference on computer vision, pages 19–27, 2015.

