An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models

Alexandra Chronopoulou¹, Christos Baziotis¹, Alexandros Potamianos¹,²,³
¹ School of ECE, National Technical University of Athens, Athens, Greece
² Signal Analysis and Interpretation Laboratory (SAIL), USC, Los Angeles, USA
³ Behavioral Signal Technologies, Los Angeles, USA
el12068@central.ntua.gr, cbaziotis@mail.ntua.gr, potam@central.ntua.gr

Abstract

A growing number of state-of-the-art transfer learning methods employ language models pretrained on large generic corpora. In this paper we present a conceptually simple and effective transfer learning approach that addresses the problem of catastrophic forgetting. Specifically, we combine the task-specific optimization function with an auxiliary language model objective, which is adjusted during the training process. This preserves language regularities captured by language models, while enabling sufficient adaptation for solving the target task. Our method does not require pretraining or finetuning separate components of the network and we train our models end-to-end in a single step. We present results on a variety of challenging affective and text classification tasks, surpassing well established transfer learning methods with a greater level of complexity.

1 Introduction

Pretrained word representations captured by Language Models (LMs) have recently become popular in Natural Language Processing (NLP). Pretrained LMs encode contextual information and high-level features of language, modeling syntax and semantics, producing state-of-the-art results across a wide range of tasks, such as named entity recognition (Peters et al., 2017), machine translation (Ramachandran et al., 2017) and text classification (Howard and Ruder, 2018).

However, in cases where contextual embeddings from language models are used as additional features (e.g. ELMo (Peters et al., 2018)), results come at a high computational cost and require task-specific architectures. At the same time, approaches that rely on fine-tuning a LM to the task at hand (e.g. ULMFiT (Howard and Ruder, 2018)) depend on pretraining the model on an extensive vocabulary and on employing a sophisticated slanted triangular learning rate scheme to adapt the parameters of the LM to the target dataset.

We propose a simple and effective transfer learning approach that leverages LM contextual representations and does not require any elaborate scheduling schemes during training. We initially train a LM on a Twitter corpus and then transfer its weights. We add a task-specific recurrent layer and a classification layer. The transferred model is trained end-to-end using an auxiliary LM loss, which allows us to explicitly control the weighting of the pretrained part of the model and ensure that the distilled knowledge it encodes is preserved.

Our contributions are summarized as follows: 1) We show that transfer learning from language models can achieve competitive results, while also being intuitively simple and computationally effective. 2) We address the problem of catastrophic forgetting by adding an auxiliary LM objective and using an unfreezing method. 3) Our results show that our approach is competitive with more sophisticated transfer learning methods. We make our code widely available.¹

¹ github.com/alexandra-chron/siatl

2 Related Work

Unsupervised pretraining has played a key role in deep neural networks, building on the premise that representations learned for one task can be useful for another task. In NLP, pretrained word vectors (Mikolov et al., 2013; Pennington et al., 2014) are widely used, improving performance in various downstream tasks, such as part-of-speech tagging (Collobert et al., 2011) and question answering (Xiong et al., 2016). These pretrained word vectors serve as initialization of the embedding layer and remain frozen during training, while our pretrained language model also initializes the hidden layers of the model and is fine-tuned to each classification task.

Aiming to learn from unlabeled data, Dai and Le (2015) use unsupervised objectives such as sequence autoencoding and language modeling as pretraining methods. The pretrained model is then fine-tuned to the target task. However, the fine-tuning procedure of the language model to the target task does not include an auxiliary objective. Ramachandran et al. (2017) also pretrain encoder-decoder pairs using language models and fine-tune them to a specific task, using an auxiliary language modeling objective to prevent catastrophic forgetting. This approach, nevertheless, is only evaluated on machine translation tasks; moreover, the seq2seq (Sutskever et al., 2014) and language modeling losses are weighted equally throughout training. By contrast, we propose a weighted sum of losses, where the language modeling contribution gradually decreases. ELMo embeddings (Peters et al., 2018) are obtained from language models and improve the results in a variety of tasks as additional contextual representations. However, ELMo embeddings rely on character-level models, whereas our approach uses a word-level LM. They are, furthermore, concatenated to pretrained word vectors and remain fixed during training. We instead propose a fine-tuning procedure, aiming to adjust a generic architecture to different end tasks. Moreover, BERT (Devlin et al., 2018) pretrains language models and fine-tunes them on the target task. An auxiliary task (next sentence prediction) is used to enhance the representations of the LM. BERT fine-tunes masked bi-directional LMs. Nevertheless, we are limited to a uni-directional model. Training BERT requires vast computational resources, while our model only requires 1 GPU. We note that our approach is not orthogonal to BERT and could be used to improve it, by adding an auxiliary LM objective and weighing its contribution.

Towards the same direction, ULMFiT (Howard and Ruder, 2018) shows impressive results on a variety of tasks by employing pretrained LMs. The proposed pipeline requires three distinct steps, which include (1) pretraining the LM, (2) fine-tuning it on a target dataset with an elaborate scheduling procedure and (3) transferring it to a classification model. Our proposed model is closely related to ULMFiT. However, ULMFiT trains a LM and fine-tunes it to the target dataset, before transferring it to a classification model. While fine-tuning the LM to the target dataset, the metric (e.g. accuracy) that we intend to optimize cannot be observed. We propose adopting a multi-task learning perspective, via the addition of an auxiliary LM loss to the transferred model, to control the loss of the pretrained and the new task simultaneously. The intuition is that we should avoid catastrophic forgetting, but at the same time allow the LM to distill the knowledge of the prior data distribution and keep the most useful features.

Multi-Task Learning (MTL) via hard parameter sharing (Caruana, 1993) in neural networks has proven to be effective in many NLP problems (Collobert and Weston, 2008). More recently, alternative approaches have been suggested that only share parameters across lower layers (Sogaard and Goldberg, 2016). By introducing part-of-speech tags at the lower levels of the network, the proposed model achieves competitive results on chunking and CCG super tagging. Our auxiliary language model objective follows this line of thought and intends to boost the performance of the higher classification layer.

3 Our Model

We introduce SiATL, which stands for Single-step Auxiliary loss Transfer Learning. In our proposed approach, we first train a LM. We then transfer its weights and add a task-specific recurrent layer to the final classifier. We also employ an auxiliary LM loss to avoid catastrophic forgetting.

LM Pretraining. We train a word-level language model, which consists of an embedding LSTM layer (Hochreiter and Schmidhuber, 1997), 2 hidden LSTM layers and a linear layer. We want to minimize the negative log-likelihood of the LM:

    L(\hat{p}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log \hat{p}(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)    (1)

where p̂(x_t^n | x_1^n, ..., x_{t-1}^n) is the distribution of the t-th word in the n-th sentence given the t − 1 words preceding it, and N is the total number of sentences.
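
To make the pretraining objective concrete, here is a minimal PyTorch sketch of a word-level LM of the kind described above (an embedding layer, 2 hidden LSTM layers and a linear output layer) trained with the negative log-likelihood of Eq. 1. Class and function names, and the batch-first tensor layout, are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLM(nn.Module):
    """Word-level LM: embedding -> 2-layer LSTM -> linear projection to the vocabulary."""
    def __init__(self, vocab_size, emb_size=400, hidden_size=1000, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)                      # (batch, seq_len, vocab_size)

def lm_nll(model, tokens):
    """Negative log-likelihood of Eq. 1 (averaged over tokens): predict word t from words 1..t-1."""
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```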

Transfer & auxiliary loss. We transfer the weights of the pretrained model and add one LSTM with a self-attention mechanism (Lin et al., 2017; Bahdanau et al., 2015). In order to adapt the contribution of the pretrained model to the task at hand, we introduce an auxiliary LM loss during training.
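
As a rough illustration of how the transferred model could be assembled, the sketch below reuses the WordLM sketch from above: the pretrained embedding and LSTM layers provide the representations, a new LSTM with a simple self-attention pooling and a classification layer are added on top, and the pretrained LM head is kept so that the auxiliary LM loss can still be computed. The attention formulation and all names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Simple additive self-attention pooling over LSTM states (illustrative)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, states):                       # states: (batch, seq, hidden)
        weights = torch.softmax(self.score(states), dim=1)
        return (weights * states).sum(dim=1)         # (batch, hidden)

class TransferredClassifier(nn.Module):
    """Pretrained LM layers + new task-specific LSTM, attention and classifier."""
    def __init__(self, pretrained_lm, task_hidden=100, num_classes=3):
        super().__init__()
        self.lm = pretrained_lm                      # embedding, LSTM stack, LM head
        self.task_lstm = nn.LSTM(pretrained_lm.lstm.hidden_size,
                                 task_hidden, batch_first=True)
        self.attention = SelfAttention(task_hidden)
        self.classifier = nn.Linear(task_hidden, num_classes)

    def forward(self, tokens):
        h, _ = self.lm.lstm(self.lm.embed(tokens))   # pretrained representations
        lm_logits = self.lm.out(h)                   # kept for the auxiliary LM loss
        t, _ = self.task_lstm(h)
        return self.classifier(self.attention(t)), lm_logits
```

The two outputs would feed the task-specific loss and the auxiliary LM loss that are combined in Eq. 2 below.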

The joint loss is the weighted sum of the task-specific loss L_task and the auxiliary LM loss L_LM, where γ is a weighting parameter that enables adaptation to the target task while keeping the useful knowledge from the source task. Specifically:

    L = L_{task} + \gamma \, L_{LM}    (2)

Figure 1: High-level overview of our proposed TL architecture. We transfer the pretrained LM, add an extra recurrent layer and an auxiliary LM loss.

Exponential decay of γ. An advantage of the proposed TL method is that the contribution of the LM can be explicitly controlled in each training epoch. In the first few epochs, the LM should contribute more to the joint loss of SiATL so that the task-specific layers adapt to the new data distribution. After the knowledge of the pretrained LM is transferred to the new domain, the task-specific component of the loss function is more important and γ should become smaller. This is also crucial due to the fact that the new, task-specific LSTM layer is randomly initialized. Therefore, by backpropagating the gradients of this layer to the pretrained LM in the first few epochs, we would add noise to the pretrained representation. To avoid this issue, we choose to initially pay attention to the LM objective and gradually focus on the classification task. In this paper, we use an exponential decay for γ over the training epochs.
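
A minimal sketch of how Eq. 2 and the decay schedule could be combined in a training loop; the decay endpoints (0.2 to 0.1, the values reported in Section 5) and all function names are assumptions for illustration.

```python
import math
import torch.nn.functional as F

def gamma_at(epoch, num_epochs, gamma_start=0.2, gamma_end=0.1):
    """Exponentially decay the LM weight from gamma_start to gamma_end over training."""
    rate = math.log(gamma_end / gamma_start) / max(num_epochs - 1, 1)
    return gamma_start * math.exp(rate * epoch)      # epoch is 0-indexed here

def joint_loss(class_logits, labels, lm_logits, tokens, gamma):
    """Eq. 2: L = L_task + gamma * L_LM."""
    task_loss = F.cross_entropy(class_logits, labels)
    aux_lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1))
    return task_loss + gamma * aux_lm_loss
```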
tribute more to the joint loss of SiATL so that the
including approximately 2M unique tokens. We
task-specific layers adapt to the new data distribu-
use the 70K most frequent tokens as vocabu-
tion. After the knowledge of the pretrained LM
lary. We evaluate our model on five datasets:
is transferred to the new domain, the task-specific
Sent17 for sentiment analysis (Rosenthal et al.,
component of the loss function is more important
2017), PsychExp for emotion recognition (Wall-
and γ should become smaller. This is also crucial
bott and Scherer, 1986), Irony18 for irony detec-
due to the fact that the new, task-specific LSTM
tion (Van Hee et al., 2018), SCv1 and SCv2 for
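
A sketch of what this unfreezing schedule might look like in code, assuming the TransferredClassifier layout from the earlier sketch and grid-searched values of n and k; the epoch indexing and attribute names are illustrative.

```python
def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def apply_unfreezing(model, epoch, n, k):
    """Sequential unfreezing (epochs counted from 1): the new task-specific layers
    always train; the pretrained hidden LSTM layers are unfrozen at epoch n and
    the pretrained embedding layer at epoch k."""
    set_trainable(model.lm.lstm, epoch >= n)
    set_trainable(model.lm.embed, epoch >= k)
```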

Optimizers. While pretraining the LM, we use Stochastic Gradient Descent (SGD). When we transfer the LM and fine-tune on each classification task, we use two different optimizers: SGD for the pretrained LM (embedding and hidden layers), with a small learning rate in order to preserve its contextual information; and Adam (Kingma and Ba, 2015) for the new, randomly initialized LSTM and classification layers, in order to allow them to train fast and adapt to the target task.
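
The two-optimizer arrangement might be wired up as follows; the learning rates are the ones listed in Section 4.2 and the parameter grouping assumes the sketch classes above.

```python
import itertools
import torch

def build_optimizers(model):
    """SGD with a small learning rate for the pretrained LM parameters,
    Adam for the newly added task-specific LSTM, attention and classifier."""
    new_params = itertools.chain(model.task_lstm.parameters(),
                                 model.attention.parameters(),
                                 model.classifier.parameters())
    sgd = torch.optim.SGD(model.lm.parameters(), lr=1e-4)
    adam = torch.optim.Adam(new_params, lr=5e-4)
    return sgd, adam
```

Both optimizers would then be stepped on the same joint loss after a single backward pass.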

Dataset    Domain          # classes   # examples
Irony18    Tweets          4           4618
Sent17     Tweets          3           61854
SCv2       Debate Forums   2           3260
SCv1       Debate Forums   2           1995
PsychExp   Experiences     7           7480

Table 1: Datasets used for the downstream tasks.

4 Experiments and Results

4.1 Datasets

To pretrain the language model, we collect a dataset of 20 million English Twitter messages, including approximately 2M unique tokens. We use the 70K most frequent tokens as vocabulary. We evaluate our model on five datasets: Sent17 for sentiment analysis (Rosenthal et al., 2017), PsychExp for emotion recognition (Wallbott and Scherer, 1986), Irony18 for irony detection (Van Hee et al., 2018), SCv1 and SCv2 for sarcasm detection (Oraby et al., 2016; Lukin and Walker, 2013). More details about the datasets can be found in Table 1.

4.2 Experimental Setup

To preprocess the tweets, we use Ekphrasis (Baziotis et al., 2017). For the generic datasets, we use NLTK (Loper and Bird, 2002). For the NBoW baseline, we use word2vec (Mikolov et al., 2013) 300-dimensional embeddings as features.

Method                    Irony18      Sent17       SCv2         SCv1         PsychExp
BoW                       43.7         61.0         65.1         60.9         25.8
NBoW                      45.2         63.0         61.1         51.9         20.3
P-LM                      42.7 ± 0.6   61.2 ± 0.7   69.4 ± 0.4   48.5 ± 1.5   38.3 ± 0.3
P-LM + su                 41.8 ± 1.2   62.1 ± 0.8   69.9 ± 1.0   48.4 ± 1.7   38.7 ± 1.0
P-LM + aux                45.5 ± 0.9   65.1 ± 0.6   72.6 ± 0.7   55.8 ± 1.0   40.9 ± 0.5
SiATL (P-LM + aux + su)   47.0 ± 1.1   66.5 ± 0.2   75.0 ± 0.7   56.8 ± 2.0   45.8 ± 1.6
ULMFiT (Wiki-103)         23.6 ± 1.6   60.5 ± 0.5   68.7 ± 0.6   56.6 ± 0.5   21.8 ± 0.3
ULMFiT (Twitter)          41.6 ± 0.7   65.6 ± 0.4   67.2 ± 0.9   44.0 ± 0.7   40.2 ± 1.1
State of the art          53.6         68.5         76.0         69.0         57.0

Table 2: Ablation study on various downstream datasets. Average over five runs with standard deviation. BoW stands for Bag of Words, NBoW for Neural Bag of Words. P-LM stands for a classifier initialized with our pretrained LM, su for sequential unfreezing and aux for the auxiliary LM loss. In all cases, F1 is employed. State-of-the-art results are reported by Baziotis et al. (2018), Cliche (2017), Ilic et al. (2018) and Felbo et al. (2017).

For the neural models, we use an LM with an embedding size of 400, 2 hidden layers, 1000 neurons per layer, embedding dropout 0.1, hidden dropout 0.3 and batch size 32. We add Gaussian noise of size 0.01 to the embedding layer. A clip norm of 5 is applied, as an extra safety measure against exploding gradients. For each text classification neural network, we add on top of the transferred LM an LSTM layer of size 100 with self-attention and a softmax classification layer. In the pretraining step, SGD with a learning rate of 0.0001 is employed. In the transferred model, SGD with the same learning rate is used for the pretrained layers. However, we use Adam (Kingma and Ba, 2015) with a learning rate of 0.0005 for the newly added LSTM and classification layers. For developing our models, we use PyTorch (Paszke et al., 2017) and Scikit-learn (Pedregosa et al., 2011).
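
For convenience, the hyperparameters above can be gathered into a single configuration; this is only a restatement of the values reported in this section, with illustrative key names.

```python
# Illustrative grouping of the hyperparameters reported in Section 4.2.
CONFIG = {
    "lm": {"emb_size": 400, "hidden_layers": 2, "hidden_size": 1000,
           "emb_dropout": 0.1, "hidden_dropout": 0.3, "batch_size": 32,
           "emb_noise_std": 0.01, "clip_norm": 5},
    "classifier": {"task_lstm_size": 100, "self_attention": True},
    "optim": {"pretrained_sgd_lr": 1e-4, "new_layers_adam_lr": 5e-4},
}
```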

5 Results & Discussion

Baselines and Comparison. Table 2 summarizes our results. The top two rows detail the baseline performance of the BoW and NBoW models. We observe that when enough data is available (e.g. Sent17), baselines provide decent results. Next, the results for the generic classifier initialized from a pretrained LM (P-LM) are shown with and without sequential unfreezing, followed by the results of the proposed model SiATL. SiATL is also directly compared with its close relative ULMFiT (trained on Wiki-103 or Twitter) and the state of the art for each task; ULMFiT also fine-tunes a LM for classification tasks. The proposed SiATL method consistently outperforms the baselines, the P-LM method and ULMFiT in all datasets. Even though we do not perform any elaborate learning rate scheduling and we limit ourselves to pretraining on Twitter, we obtain higher results in two Twitter datasets and three generic ones.

Auxiliary LM objective. The effect of the auxiliary objective is highlighted in very small datasets, such as SCv1, where it results in an impressive boost in performance (7%). We hypothesize that when the classifier is simply initialized with the pretrained LM, it overfits quickly, as the target vocabulary is very limited. The auxiliary LM loss, however, permits refined adjustments to the model and fine-grained adaptation to the target task.

Exponential decay of γ. For the optimal γ interval, we empirically find that exponentially decaying γ from 0.2 to 0.1 over the number of training epochs provides the best results for our classification tasks. A heatmap of γ is depicted in Figure 3. We observe that small values of γ should be employed, in order to scale the LM loss to the same order of magnitude as the classification loss over the training period. Nevertheless, the use of exponential decay instead of linear decay does not provide a significant improvement, as our model is not sensitive to the way the hyperparameter γ is decayed.

Sequential Unfreezing. Results show that sequential unfreezing is crucial to the proposed method, as it allows the pretrained LM to adapt to the target word distribution. The performance improvement is more pronounced when there is a mismatch between the LM and task domains, i.e., the non-Twitter domain tasks. Specifically, for the PsychExp and SCv2 datasets, sequential unfreezing yields a significant improvement in F1, in line with our intuition.

Number of training examples. Transfer learning is particularly useful when limited training data are available.
We notice that for our largest dataset, Sent17, SiATL outperforms ULMFiT only by a small margin when trained on all the training examples available (see Table 2), while for the small SCv2 dataset, SiATL outperforms ULMFiT by a large margin and ranks very close to the state-of-the-art model (Ilic et al., 2018). Moreover, the performance of SiATL vs. ULMFiT as a function of the training dataset size is shown in Figure 2. Note that the proposed model achieves competitive results on less than 1000 training examples for the Irony18, SCv2, SCv1 and PsychExp datasets, demonstrating the robustness of SiATL even when trained on a handful of training examples.

Figure 2: Results of SiATL, our proposed approach (continuous lines), and ULMFiT (dashed lines) for different datasets (indicated by different markers) as a function of the number of training examples.

Figure 3: Heatmap of the effect of γ on F1-score, evaluated on SCv2. The horizontal axis depicts the initial value of γ and the vertical axis the final value of γ.

Catastrophic forgetting. We observe that SiATL indeed provides a way of mitigating catastrophic forgetting. The empirical results shown in Table 2 indicate that by only adding the auxiliary language modeling objective, we obtain better results on all downstream tasks. Specifically, a comparison of the P-LM + aux model and the P-LM model shows that the performance of SiATL on classification tasks is improved by the auxiliary objective. We hypothesize that the language model objective acts as a regularizer that prevents the loss of the most generalizable features.

6 Conclusions and Future Work

We introduce SiATL, a simple and efficient transfer learning method for text classification tasks. Our approach is based on pretraining a LM and transferring its weights to a classifier with a task-specific layer. The model is trained using a task-specific objective with an auxiliary LM loss. SiATL avoids catastrophic forgetting of the language distribution learned by the pretrained LM. Experiments on various text classification tasks yield competitive results, demonstrating the efficacy of our approach. Furthermore, our method outperforms more sophisticated transfer learning approaches, such as ULMFiT, in all tasks.

In future work, we plan to move from Twitter to more generic domains and evaluate our approach on more tasks. Additionally, we aim at exploring ways of scaling our approach to larger vocabulary sizes (Kumar and Tsvetkov, 2019) and of better handling out-of-vocabulary (OOV) words (Mielke and Eisner, 2018; Sennrich et al., 2015), in order to be applicable to diverse datasets. Finally, we want to explore approaches for improving the adaptive layer unfreezing process and the contribution of the language model objective (value of γ) to the target task.

Acknowledgments

We would like to thank Katerina Margatina and Georgios Paraskevopoulos for their helpful suggestions and comments. This work has been partially supported by computational time granted from the Greek Research & Technology Network (GR-NET) in the National HPC facility ARIS. Also, the authors would like to thank NVIDIA for supporting this work by donating a TitanX GPU.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, San Diego, California.

Christos Baziotis, Athanasiou Nikolaos, Pinelopi Papalampidi, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 Task 3: Tracking ironic tweets using ensembles of word and character level attentive RNNs. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), pages 613–621, New Orleans, Louisiana.

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, Vancouver, Canada.

Rich Caruana. 1993. Multitask learning: A knowledge-based source of inductive bias. In Machine Learning: Proceedings of the Tenth International Conference, pages 41–48.

Alexandra Chronopoulou, Aikaterini Margatina, Christos Baziotis, and Alexandros Potamianos. 2018. NTUA-SLP at IEST 2018: Ensemble of neural transfer methods for implicit emotion classification. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 57–64, Brussels, Belgium.

Mathieu Cliche. 2017. BB_twtr at SemEval-2017 Task 4: Twitter sentiment analysis with CNNs and LSTMs. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 573–580, Vancouver, Canada.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the International Conference on Machine Learning, pages 160–167.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, pages 2493–2537.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proceedings of the Advances in Neural Information Processing Systems, pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Bjarke Felbo, Alan Mislove, Anders Sogaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1615–1625, Copenhagen, Denmark.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, (8):1735–1780.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the Annual Meeting of the ACL, pages 328–339, Melbourne, Australia.

Suzana Ilic, Edison Marrese-Taylor, Jorge A. Balazs, and Yutaka Matsuo. 2018. Deep contextualized word representations for detecting sarcasm and irony. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 2–7.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Sachin Kumar and Yulia Tsvetkov. 2019. Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. In International Conference on Learning Representations.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 63–70.

Stephanie Lukin and Marilyn Walker. 2013. Really? Well. Apparently bootstrapping improves the performance of sarcasm and nastiness classifiers for online dialogue. In Proceedings of the Workshop on Language Analysis in Social Media, pages 30–40, Atlanta, Georgia.

Sebastian J. Mielke and Jason Eisner. 2018. Spell once, summon anywhere: A two-level open-vocabulary language model. CoRR, abs/1804.08205.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, pages 3111–3119.

Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff, and Marilyn A. Walker. 2016. Creating and characterizing a diverse corpus of sarcasm in dialogue. In Proceedings of the SIGDIAL 2016 Conference, The 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 31–41.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, pages 2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the Annual Meeting of the ACL, pages 1756–1765, Vancouver, Canada.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the NAACL:HLT, pages 2227–2237, New Orleans, Louisiana.

Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 383–391, Copenhagen, Denmark.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 502–518, Vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Anders Sogaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the Annual Meeting of the ACL, pages 231–235, Berlin, Germany.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 3104–3112.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. SemEval-2018 Task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), pages 39–50, New Orleans, Louisiana.

Harald G. Wallbott and Klaus R. Scherer. 1986. How universal and specific is emotional experience? Evidence from 27 countries on five continents. Information (International Social Science Council), (4):763–795.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering.
