An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models

Alexandra Chronopoulou¹, Christos Baziotis¹, Alexandros Potamianos¹,²,³
¹ School of ECE, National Technical University of Athens, Athens, Greece
² Signal Analysis and Interpretation Laboratory (SAIL), USC, Los Angeles, USA
³ Behavioral Signal Technologies, Los Angeles, USA
el12068@central.ntua.gr, cbaziotis@mail.ntua.gr, potam@central.ntua.gr

Abstract

A growing number of state-of-the-art transfer learning methods employ language models pretrained on large generic corpora. In this paper we present a conceptually simple and effective transfer learning approach that addresses the problem of catastrophic forgetting. Specifically, we combine the task-specific optimization function with an auxiliary language model objective, which is adjusted during the training process. This preserves language regularities captured by language models, while enabling sufficient adaptation for solving the target task. Our method does not require pretraining or finetuning separate components of the network and we train our models end-to-end in a single step. We present results on a variety of challenging affective and text classification tasks, surpassing well established transfer learning methods with a greater level of complexity.

1 Introduction

Pretrained word representations captured by Language Models (LMs) have recently become popular in Natural Language Processing (NLP). Pretrained LMs encode contextual information and high-level features of language, modeling syntax and semantics, producing state-of-the-art results across a wide range of tasks, such as named entity recognition (Peters et al., 2017), machine translation (Ramachandran et al., 2017) and text classification (Howard and Ruder, 2018).

However, in cases where contextual embeddings from language models are used as additional features (e.g. ELMo (Peters et al., 2018)), results come at a high computational cost and require task-specific architectures. At the same time, approaches that rely on fine-tuning a LM to the task at hand (e.g. ULMFiT (Howard and Ruder, 2018)) depend on pretraining the model on an extensive vocabulary and on employing a sophisticated slanted triangular learning rate scheme to adapt the parameters of the LM to the target dataset.

We propose a simple and effective transfer learning approach that leverages LM contextual representations and does not require any elaborate scheduling schemes during training. We initially train a LM on a Twitter corpus and then transfer its weights. We add a task-specific recurrent layer and a classification layer. The transferred model is trained end-to-end using an auxiliary LM loss, which allows us to explicitly control the weighting of the pretrained part of the model and ensure that the distilled knowledge it encodes is preserved.

Our contributions are summarized as follows: 1) We show that transfer learning from language models can achieve competitive results, while also being intuitively simple and computationally effective. 2) We address the problem of catastrophic forgetting by adding an auxiliary LM objective and using an unfreezing method. 3) Our results show that our approach is competitive with more sophisticated transfer learning methods. We make our code widely available.¹

¹ github.com/alexandra-chron/siatl

2 Related Work

Unsupervised pretraining has played a key role in deep neural networks, building on the premise that representations learned for one task can be useful for another task. In NLP, pretrained word vectors (Mikolov et al., 2013; Pennington et al., 2014) are widely used, improving performance in various downstream tasks, such as part-of-speech tagging (Collobert et al., 2011) and question answering (Xiong et al., 2016). These pretrained word vectors serve as initialization of the embedding layer and remain frozen during training, while our pretrained language model also initializes the hidden layers of the model and is fine-tuned to each classification task.

Aiming to learn from unlabeled data, Dai and Le (2015) use unsupervised objectives such as sequence autoencoding and language modeling as pretraining methods. The pretrained model is then fine-tuned to the target task. However, the fine-tuning procedure of the language model to the target task does not include an auxiliary objective. Ramachandran et al. (2017) also pretrain encoder-decoder pairs using language models and fine-tune them to a specific task, using an auxiliary language modeling objective to prevent catastrophic forgetting. This approach, nevertheless, is only evaluated on machine translation tasks; moreover, the seq2seq (Sutskever et al., 2014) and language modeling losses are weighted equally throughout training. By contrast, we propose a weighted sum of losses, where the language modeling contribution gradually decreases. ELMo embeddings (Peters et al., 2018) are obtained from language models and improve the results in a variety of tasks as additional contextual representations. However, ELMo embeddings rely on character-level models, whereas our approach uses a word-level LM. They are, furthermore, concatenated to pretrained word vectors and remain fixed during training. We instead propose a fine-tuning procedure, aiming to adjust a generic architecture to different end tasks. Moreover, BERT (Devlin et al., 2018) pretrains language models and fine-tunes them on the target task. An auxiliary task (next sentence prediction) is used to enhance the representations of the LM. BERT fine-tunes masked bi-directional LMs. Nevertheless, we are limited to a uni-directional model. Training BERT requires vast computational resources, while our model only requires 1 GPU. We note that our approach is not orthogonal to BERT and could be used to improve it, by adding an auxiliary LM objective and weighing its contribution.

Towards the same direction, ULMFiT (Howard and Ruder, 2018) shows impressive results on a variety of tasks by employing pretrained LMs. The proposed pipeline requires three distinct steps, which include (1) pretraining the LM, (2) fine-tuning it on a target dataset with an elaborate scheduling procedure and (3) transferring it to a classification model. Our proposed model is closely related to ULMFiT. However, ULMFiT trains a LM and fine-tunes it to the target dataset, before transferring it to a classification model. While fine-tuning the LM to the target dataset, the metric (e.g. accuracy) that we intend to optimize cannot be observed. We propose adopting a multi-task learning perspective, via the addition of an auxiliary LM loss to the transferred model, to control the loss of the pretrained and the new task simultaneously. The intuition is that we should avoid catastrophic forgetting, but at the same time allow the LM to distill the knowledge of the prior data distribution and keep the most useful features.

Multi-Task Learning (MTL) via hard parameter sharing (Caruana, 1993) in neural networks has proven to be effective in many NLP problems (Collobert and Weston, 2008). More recently, alternative approaches have been suggested that only share parameters across lower layers (Sogaard and Goldberg, 2016). By introducing part-of-speech tags at the lower levels of the network, the proposed model achieves competitive results on chunking and CCG super tagging. Our auxiliary language model objective follows this line of thought and intends to boost the performance of the higher classification layer.

3 Our Model

We introduce SiATL, which stands for Single-step Auxiliary loss Transfer Learning. In our proposed approach, we first train a LM. We then transfer its weights and add a task-specific recurrent layer to the final classifier. We also employ an auxiliary LM loss to avoid catastrophic forgetting.

LM Pretraining. We train a word-level language model, which consists of an embedding LSTM layer (Hochreiter and Schmidhuber, 1997), 2 hidden LSTM layers and a linear layer. We want to minimize the negative log-likelihood of the LM:

    L(\hat{p}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log \hat{p}(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)    (1)

where p̂(x_t^n | x_1^n, ..., x_{t-1}^n) is the distribution of the t-th word in the n-th sentence given the t − 1 words preceding it, and N is the total number of sentences.
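
To make the pretraining objective concrete, here is a minimal PyTorch sketch of a word-level LM of the kind described above (an embedding layer, 2 hidden LSTM layers and a linear output layer) trained with the negative log-likelihood of Eq. 1. Class and function names, and the batch-first tensor layout, are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLM(nn.Module):
    """Word-level LM: embedding -> 2-layer LSTM -> linear projection to the vocabulary."""
    def __init__(self, vocab_size, emb_size=400, hidden_size=1000, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)                      # (batch, seq_len, vocab_size)

def lm_nll(model, tokens):
    """Negative log-likelihood of Eq. 1 (averaged over tokens): predict word t from words 1..t-1."""
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```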

Transfer & auxiliary loss. We transfer the weights of the pretrained model and add one LSTM with a self-attention mechanism (Lin et al., 2017; Bahdanau et al., 2015). In order to adapt the contribution of the pretrained model to the task at hand, we introduce an auxiliary LM loss during training.
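
As a rough illustration of how the transferred model could be assembled, the sketch below reuses the WordLM sketch from above: the pretrained embedding and LSTM layers provide the representations, a new LSTM with a simple self-attention pooling and a classification layer are added on top, and the pretrained LM head is kept so that the auxiliary LM loss can still be computed. The attention formulation and all names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Simple additive self-attention pooling over LSTM states (illustrative)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, states):                       # states: (batch, seq, hidden)
        weights = torch.softmax(self.score(states), dim=1)
        return (weights * states).sum(dim=1)         # (batch, hidden)

class TransferredClassifier(nn.Module):
    """Pretrained LM layers + new task-specific LSTM, attention and classifier."""
    def __init__(self, pretrained_lm, task_hidden=100, num_classes=3):
        super().__init__()
        self.lm = pretrained_lm                      # embedding, LSTM stack, LM head
        self.task_lstm = nn.LSTM(pretrained_lm.lstm.hidden_size,
                                 task_hidden, batch_first=True)
        self.attention = SelfAttention(task_hidden)
        self.classifier = nn.Linear(task_hidden, num_classes)

    def forward(self, tokens):
        h, _ = self.lm.lstm(self.lm.embed(tokens))   # pretrained representations
        lm_logits = self.lm.out(h)                   # kept for the auxiliary LM loss
        t, _ = self.task_lstm(h)
        return self.classifier(self.attention(t)), lm_logits
```

The two outputs would feed the task-specific loss and the auxiliary LM loss that are combined in Eq. 2 below.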

The joint loss is the weighted sum of the task-specific loss L_task and the auxiliary LM loss L_LM, where γ is a weighting parameter that enables adaptation to the target task while keeping the useful knowledge from the source task. Specifically:

    L = L_{task} + \gamma \, L_{LM}    (2)

Figure 1: High-level overview of our proposed TL architecture. We transfer the pretrained LM, add an extra recurrent layer and an auxiliary LM loss.

Exponential decay of γ. An advantage of the proposed TL method is that the contribution of the LM can be explicitly controlled in each training epoch. In the first few epochs, the LM should contribute more to the joint loss of SiATL so that the task-specific layers adapt to the new data distribution. After the knowledge of the pretrained LM is transferred to the new domain, the task-specific component of the loss function is more important and γ should become smaller. This is also crucial due to the fact that the new, task-specific LSTM layer is randomly initialized. Therefore, by backpropagating the gradients of this layer to the pretrained LM in the first few epochs, we would add noise to the pretrained representation. To avoid this issue, we choose to initially pay attention to the LM objective and gradually focus on the classification task. In this paper, we use an exponential decay for γ over the training epochs.
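
A minimal sketch of how Eq. 2 and the decay schedule could be combined in a training loop; the decay endpoints (0.2 to 0.1, the values reported in Section 5) and all function names are assumptions for illustration.

```python
import math
import torch.nn.functional as F

def gamma_at(epoch, num_epochs, gamma_start=0.2, gamma_end=0.1):
    """Exponentially decay the LM weight from gamma_start to gamma_end over training."""
    rate = math.log(gamma_end / gamma_start) / max(num_epochs - 1, 1)
    return gamma_start * math.exp(rate * epoch)      # epoch is 0-indexed here

def joint_loss(class_logits, labels, lm_logits, tokens, gamma):
    """Eq. 2: L = L_task + gamma * L_LM."""
    task_loss = F.cross_entropy(class_logits, labels)
    aux_lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1))
    return task_loss + gamma * aux_lm_loss
```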
tribute more to the joint loss of SiATL so that the
including approximately 2M unique tokens. We
task-specific layers adapt to the new data distribu-
use the 70K most frequent tokens as vocabu-
tion. After the knowledge of the pretrained LM
lary. We evaluate our model on five datasets:
is transferred to the new domain, the task-specific
Sent17 for sentiment analysis (Rosenthal et al.,
component of the loss function is more important
2017), PsychExp for emotion recognition (Wall-
and γ should become smaller. This is also crucial
bott and Scherer, 1986), Irony18 for irony detec-
due to the fact that the new, task-specific LSTM
tion (Van Hee et al., 2018), SCv1 and SCv2 for
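
A sketch of what this unfreezing schedule might look like in code, assuming the TransferredClassifier layout from the earlier sketch and grid-searched values of n and k; the epoch indexing and attribute names are illustrative.

```python
def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def apply_unfreezing(model, epoch, n, k):
    """Sequential unfreezing (epochs counted from 1): the new task-specific layers
    always train; the pretrained hidden LSTM layers are unfrozen at epoch n and
    the pretrained embedding layer at epoch k."""
    set_trainable(model.lm.lstm, epoch >= n)
    set_trainable(model.lm.embed, epoch >= k)
```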

Optimizers. While pretraining the LM, we use Stochastic Gradient Descent (SGD). When we transfer the LM and fine-tune on each classification task, we use two different optimizers: SGD for the pretrained LM (embedding and hidden layers), with a small learning rate in order to preserve its contextual information; and Adam (Kingma and Ba, 2015) for the new, randomly initialized LSTM and classification layers, in order to allow them to train fast and adapt to the target task.
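
The two-optimizer arrangement might be wired up as follows; the learning rates are the ones listed in Section 4.2 and the parameter grouping assumes the sketch classes above.

```python
import itertools
import torch

def build_optimizers(model):
    """SGD with a small learning rate for the pretrained LM parameters,
    Adam for the newly added task-specific LSTM, attention and classifier."""
    new_params = itertools.chain(model.task_lstm.parameters(),
                                 model.attention.parameters(),
                                 model.classifier.parameters())
    sgd = torch.optim.SGD(model.lm.parameters(), lr=1e-4)
    adam = torch.optim.Adam(new_params, lr=5e-4)
    return sgd, adam
```

Both optimizers would then be stepped on the same joint loss after a single backward pass.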

Dataset    Domain          # classes   # examples
Irony18    Tweets          4           4618
Sent17     Tweets          3           61854
SCv2       Debate Forums   2           3260
SCv1       Debate Forums   2           1995
PsychExp   Experiences     7           7480

Table 1: Datasets used for the downstream tasks.

4 Experiments and Results

4.1 Datasets

To pretrain the language model, we collect a dataset of 20 million English Twitter messages, including approximately 2M unique tokens. We use the 70K most frequent tokens as vocabulary. We evaluate our model on five datasets: Sent17 for sentiment analysis (Rosenthal et al., 2017), PsychExp for emotion recognition (Wallbott and Scherer, 1986), Irony18 for irony detection (Van Hee et al., 2018), SCv1 and SCv2 for sarcasm detection (Oraby et al., 2016; Lukin and Walker, 2013). More details about the datasets can be found in Table 1.

4.2 Experimental Setup

To preprocess the tweets, we use Ekphrasis (Baziotis et al., 2017). For the generic datasets, we use NLTK (Loper and Bird, 2002). For the NBoW baseline, we use word2vec (Mikolov et al., 2013) 300-dimensional embeddings as features.

Method                    Irony18      Sent17       SCv2         SCv1         PsychExp
BoW                       43.7         61.0         65.1         60.9         25.8
NBoW                      45.2         63.0         61.1         51.9         20.3
P-LM                      42.7 ± 0.6   61.2 ± 0.7   69.4 ± 0.4   48.5 ± 1.5   38.3 ± 0.3
P-LM + su                 41.8 ± 1.2   62.1 ± 0.8   69.9 ± 1.0   48.4 ± 1.7   38.7 ± 1.0
P-LM + aux                45.5 ± 0.9   65.1 ± 0.6   72.6 ± 0.7   55.8 ± 1.0   40.9 ± 0.5
SiATL (P-LM + aux + su)   47.0 ± 1.1   66.5 ± 0.2   75.0 ± 0.7   56.8 ± 2.0   45.8 ± 1.6
ULMFiT (Wiki-103)         23.6 ± 1.6   60.5 ± 0.5   68.7 ± 0.6   56.6 ± 0.5   21.8 ± 0.3
ULMFiT (Twitter)          41.6 ± 0.7   65.6 ± 0.4   67.2 ± 0.9   44.0 ± 0.7   40.2 ± 1.1
State of the art          53.6         68.5         76.0         69.0         57.0

Table 2: Ablation study on various downstream datasets. Average over five runs with standard deviation. BoW stands for Bag of Words, NBoW for Neural Bag of Words. P-LM stands for a classifier initialized with our pretrained LM, su for sequential unfreezing and aux for the auxiliary LM loss. In all cases, F1 is employed. State-of-the-art results are reported by Baziotis et al. (2018), Cliche (2017), Ilic et al. (2018) and Felbo et al. (2017).

For the neural models, we use an LM with an embedding size of 400, 2 hidden layers, 1000 neurons per layer, embedding dropout 0.1, hidden dropout 0.3 and batch size 32. We add Gaussian noise of size 0.01 to the embedding layer. A clip norm of 5 is applied, as an extra safety measure against exploding gradients. For each text classification neural network, we add on top of the transferred LM an LSTM layer of size 100 with self-attention and a softmax classification layer. In the pretraining step, SGD with a learning rate of 0.0001 is employed. In the transferred model, SGD with the same learning rate is used for the pretrained layers. However, we use Adam (Kingma and Ba, 2015) with a learning rate of 0.0005 for the newly added LSTM and classification layers. For developing our models, we use PyTorch (Paszke et al., 2017) and Scikit-learn (Pedregosa et al., 2011).
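
For convenience, the hyperparameters above can be gathered into a single configuration; this is only a restatement of the values reported in this section, with illustrative key names.

```python
# Illustrative grouping of the hyperparameters reported in Section 4.2.
CONFIG = {
    "lm": {"emb_size": 400, "hidden_layers": 2, "hidden_size": 1000,
           "emb_dropout": 0.1, "hidden_dropout": 0.3, "batch_size": 32,
           "emb_noise_std": 0.01, "clip_norm": 5},
    "classifier": {"task_lstm_size": 100, "self_attention": True},
    "optim": {"pretrained_sgd_lr": 1e-4, "new_layers_adam_lr": 5e-4},
}
```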

5 Results & Discussion

Baselines and Comparison. Table 2 summarizes our results. The top two rows detail the baseline performance of the BoW and NBoW models. We observe that when enough data is available (e.g. Sent17), baselines provide decent results. Next, the results for the generic classifier initialized from a pretrained LM (P-LM) are shown with and without sequential unfreezing, followed by the results of the proposed model SiATL. SiATL is also directly compared with its close relative ULMFiT (trained on Wiki-103 or Twitter) and the state of the art for each task; ULMFiT also fine-tunes a LM for classification tasks. The proposed SiATL method consistently outperforms the baselines, the P-LM method and ULMFiT in all datasets. Even though we do not perform any elaborate learning rate scheduling and we limit ourselves to pretraining on Twitter, we obtain higher results in two Twitter datasets and three generic ones.

Auxiliary LM objective. The effect of the auxiliary objective is highlighted in very small datasets, such as SCv1, where it results in an impressive boost in performance (7%). We hypothesize that when the classifier is simply initialized with the pretrained LM, it overfits quickly, as the target vocabulary is very limited. The auxiliary LM loss, however, permits refined adjustments to the model and fine-grained adaptation to the target task.

Exponential decay of γ. For the optimal γ interval, we empirically find that exponentially decaying γ from 0.2 to 0.1 over the number of training epochs provides the best results for our classification tasks. A heatmap of γ is depicted in Figure 3. We observe that small values of γ should be employed, in order to scale the LM loss to the same order of magnitude as the classification loss over the training period. Nevertheless, the use of exponential decay instead of linear decay does not provide a significant improvement, as our model is not sensitive to the way the hyperparameter γ is decayed.

Sequential Unfreezing. Results show that sequential unfreezing is crucial to the proposed method, as it allows the pretrained LM to adapt to the target word distribution. The performance improvement is more pronounced when there is a mismatch between the LM and task domains, i.e., the non-Twitter domain tasks. Specifically, for the PsychExp and SCv2 datasets, sequential unfreezing yields a significant improvement in F1, in line with our intuition.

Number of training examples. Transfer learning is particularly useful when limited training data are available.
We notice that for our largest dataset, Sent17, SiATL outperforms ULMFiT only by a small margin when trained on all the training examples available (see Table 2), while for the small SCv2 dataset, SiATL outperforms ULMFiT by a large margin and ranks very close to the state-of-the-art model (Ilic et al., 2018). Moreover, the performance of SiATL vs. ULMFiT as a function of the training dataset size is shown in Figure 2. Note that the proposed model achieves competitive results on less than 1000 training examples for the Irony18, SCv2, SCv1 and PsychExp datasets, demonstrating the robustness of SiATL even when trained on a handful of training examples.

Figure 2: Results of SiATL, our proposed approach (continuous lines), and ULMFiT (dashed lines) for different datasets (indicated by different markers) as a function of the number of training examples.

Figure 3: Heatmap of the effect of γ on F1-score, evaluated on SCv2. The horizontal axis depicts the initial value of γ and the vertical axis the final value of γ.

Catastrophic forgetting. We observe that SiATL indeed provides a way of mitigating catastrophic forgetting. The empirical results shown in Table 2 indicate that by only adding the auxiliary language modeling objective, we obtain better results on all downstream tasks. Specifically, a comparison of the P-LM + aux model and the P-LM model shows that the performance of SiATL on classification tasks is improved by the auxiliary objective. We hypothesize that the language model objective acts as a regularizer that prevents the loss of the most generalizable features.

6 Conclusions and Future Work

We introduce SiATL, a simple and efficient transfer learning method for text classification tasks. Our approach is based on pretraining a LM and transferring its weights to a classifier with a task-specific layer. The model is trained using a task-specific objective with an auxiliary LM loss. SiATL avoids catastrophic forgetting of the language distribution learned by the pretrained LM. Experiments on various text classification tasks yield competitive results, demonstrating the efficacy of our approach. Furthermore, our method outperforms more sophisticated transfer learning approaches, such as ULMFiT, in all tasks.

In future work, we plan to move from Twitter to more generic domains and evaluate our approach on more tasks. Additionally, we aim at exploring ways of scaling our approach to larger vocabulary sizes (Kumar and Tsvetkov, 2019) and of better handling out-of-vocabulary (OOV) words (Mielke and Eisner, 2018; Sennrich et al., 2015), in order to be applicable to diverse datasets. Finally, we want to explore approaches for improving the adaptive layer unfreezing process and the contribution of the language model objective (value of γ) to the target task.

Acknowledgments

We would like to thank Katerina Margatina and Georgios Paraskevopoulos for their helpful suggestions and comments. This work has been partially supported by computational time granted from the Greek Research & Technology Network (GR-NET) in the National HPC facility ARIS. Also, the authors would like to thank NVIDIA for supporting this work by donating a TitanX GPU.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, San Diego, California.

Christos Baziotis, Athanasiou Nikolaos, Pinelopi Papalampidi, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 Task 3: Tracking ironic tweets using ensembles of word and character level attentive RNNs. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), pages 613–621, New Orleans, Louisiana.

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, Vancouver, Canada.

Rich Caruana. 1993. Multitask learning: A knowledge-based source of inductive bias. In Machine Learning: Proceedings of the Tenth International Conference, pages 41–48.

Alexandra Chronopoulou, Aikaterini Margatina, Christos Baziotis, and Alexandros Potamianos. 2018. NTUA-SLP at IEST 2018: Ensemble of neural transfer methods for implicit emotion classification. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 57–64, Brussels, Belgium.

Mathieu Cliche. 2017. BB_twtr at SemEval-2017 Task 4: Twitter sentiment analysis with CNNs and LSTMs. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 573–580, Vancouver, Canada.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the International Conference on Machine Learning, pages 160–167.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, pages 2493–2537.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proceedings of the Advances in Neural Information Processing Systems, pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Bjarke Felbo, Alan Mislove, Anders Sogaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1615–1625, Copenhagen, Denmark.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, (8):1735–1780.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the Annual Meeting of the ACL, pages 328–339, Melbourne, Australia.

Suzana Ilic, Edison Marrese-Taylor, Jorge A. Balazs, and Yutaka Matsuo. 2018. Deep contextualized word representations for detecting sarcasm and irony. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 2–7.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Sachin Kumar and Yulia Tsvetkov. 2019. Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. In International Conference on Learning Representations.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 63–70.

Stephanie Lukin and Marilyn Walker. 2013. Really? Well. Apparently bootstrapping improves the performance of sarcasm and nastiness classifiers for online dialogue. In Proceedings of the Workshop on Language Analysis in Social Media, pages 30–40, Atlanta, Georgia.

Sebastian J. Mielke and Jason Eisner. 2018. Spell once, summon anywhere: A two-level open-vocabulary language model. CoRR, abs/1804.08205.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, pages 3111–3119.

Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff, and Marilyn A. Walker. 2016. Creating and characterizing a diverse corpus of sarcasm in dialogue. In Proceedings of the SIGDIAL 2016 Conference, The 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 31–41.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, pages 2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the Annual Meeting of the ACL, pages 1756–1765, Vancouver, Canada.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the NAACL:HLT, pages 2227–2237, New Orleans, Louisiana.

Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 383–391, Copenhagen, Denmark.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 502–518, Vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Anders Sogaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the Annual Meeting of the ACL, pages 231–235, Berlin, Germany.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 3104–3112.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. SemEval-2018 Task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), pages 39–50, New Orleans, Louisiana.

Harald G. Wallbott and Klaus R. Scherer. 1986. How universal and specific is emotional experience? Evidence from 27 countries on five continents. Information (International Social Science Council), (4):763–795.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering.
