EELBERT: Tiny Models through Dynamic Embeddings
Abstract

We introduce EELBERT, an approach for compressing transformer-based models (e.g., BERT), with minimal impact on the accuracy of downstream tasks. This is achieved by replacing the input embedding layer of the model with dynamic, i.e. on-the-fly, embedding computations. Since the input embedding layer accounts for a significant fraction of the model size, especially for the smaller BERT variants, replacing this layer with an embedding computation function helps us reduce the model size significantly. Empirical evaluation on the GLUE benchmark shows that our BERT variants (EELBERT) suffer minimal regression compared to the traditional BERT models. Through this approach, we are able to develop our smallest model, UNO-EELBERT, which achieves a GLUE score within 4% of fully trained BERT-tiny, while being 15x smaller (1.2 MB) in size.

1 Introduction

It has been standard practice for the past several years for natural language understanding systems to be built upon powerful pre-trained language models, such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2020), mT5 (Xue et al., 2021), and RoBERTa (Liu et al., 2019). These language models are comprised of a series of transformer-based layers, each transforming the representation at its input into a new representation at its output. Such transformers act as the "backbone" for solving several natural language tasks, like text classification, sequence labeling, and text generation, and are primarily used to map (or encode) natural language text into a multidimensional vector space representing the semantics of that language.

Experiments in prior work (Kaplan et al., 2020) have demonstrated that the size of the language model (i.e., the number of parameters) has a direct impact on task performance, and that increasing a language model's size improves its language understanding capabilities. At the same time, however, there has been a parallel push to create much smaller models, which could be deployed in resource-constrained environments such as smart phones or watches.

Some key questions arise when considering such environments: How does one leverage the power of such large language models on these low-power devices? Is it possible to get the benefits of large language models without the massive disk, memory and compute requirements? Much recent work in the areas of model pruning (Gordon et al., 2020), quantization (Zafrir et al., 2019), distillation (Jiao et al., 2020; Sanh et al., 2020) and more targeted approaches like the lottery ticket hypothesis (Chen et al., 2020) aims to produce smaller yet effective models. Our work takes a different approach, by reclaiming the resources required for representing the model's large vocabulary.

The inspiration for our work comes from Ravi and Kozareva (2018a), who introduced dynamic embeddings, i.e. embeddings computed on-the-fly via hash functions. We extend the usage of dynamic embeddings to transformer-based language models. We observe that 21% of the trainable parameters in BERT-base (Turc et al., 2019) are in the embedding lookup layer. By replacing this input embedding layer with embeddings computed at run-time, we can reduce model size by the same percentage.

In this paper, we introduce an "embeddingless" model – EELBERT – that uses a dynamic embedding computation strategy to achieve a smaller size. We conduct a set of experiments to empirically assess the quality of these "embeddingless" models along with the relative size reduction. A size reduction of up to 88% is observed in our experiments, with minimal regression in model quality, and this approach is entirely complementary to other model compression techniques. Since EELBERT calculates embeddings at run-time, we do incur additional latency, which we measure in our experiments. We find that EELBERT's latency increases relative to BERT's as model size decreases, but could be mitigated through careful architectural and engineering optimizations. Considering the gains in model compression that EELBERT provides, this is not an unreasonable trade-off.
2 Related Work

There is a large body of work describing strategies for optimizing memory and performance of the BERT models (Ganesh et al., 2021). In this section, we highlight the studies most relevant to our work, which focus on reducing the size of the token embeddings used to map input tokens to a real-valued vector representation. We also look at past research on hash embeddings or randomized embeddings used in language applications (e.g., Tito Svenstrup et al. (2017)).

Much prior work has been done to reduce the size of pre-trained static embeddings like GloVe and Word2Vec. Lebret and Collobert (2014) apply Principal Component Analysis (PCA) to reduce the dimensionality of word embeddings. For compressing GloVe embeddings, Arora et al. (2018) proposed LASPE, which leverages matrix factorization to represent the original embeddings as a combination of basis embeddings and linear transformations. Lam (2018) proposed a method called Word2Bits that uses quantization to compress Word2Vec embeddings. Similarly, Kim et al. (2020) proposed using variable-size code-blocks to represent each word, where the codes are learned via a feedforward network with a binary constraint.

However, the works most relevant to this paper are by Ravi and Kozareva (2018b) and Ravi (2017). The key idea in the approach by Ravi and Kozareva (2018b) is the use of projection networks as a deterministic function to generate an embedding vector from a string of text, where this generator function replaces the embedding layer. That idea has been extended to word-level embeddings by Sankar et al. (2021) and Ravi and Kozareva (2021), using an LSH-based technique for the projection function. These papers demonstrate the effectiveness of projection embeddings, combined with a stacked layer of CNN, BiLSTM and CRF, on a small text classification task. In our work, we investigate the potential of these projection and hash embedding methods to achieve compression in transformer models like BERT.

3 Modeling EELBERT

EELBERT is designed with the goal of reducing the size (and thus the memory requirement) of the input embedding layers of BERT and other transformer-based models. In this section, we first describe our observations about BERT which inform our architecture choices in EELBERT, and then present the EELBERT model in detail.

3.1 Observations about BERT

BERT-like language models take a sequence of tokens as input, encoding them into a semantic vector space representation. The input tokens are generated by a tokenizer, which segments a natural language sentence into discrete sub-string units $w_1, w_2, \ldots, w_n$. In BERT, each token in the model's vocabulary is mapped to an index, corresponding to a row in the input embedding table (also referred to as the input embedding layer). This row represents the token's $d$-size embedding vector $e_{w_i} \in \mathbb{R}^d$, for a given token $w_i$.

The table-lookup-like process of mapping tokens in the vocabulary to numerical vector representations using the input embedding layer is a "non-trainable" operation, and is therefore unaffected by standard model compression techniques, which typically target the model's trainable parameters. This results in a compression bottleneck, since a profiling of BERT-like models reveals that the input embedding layer occupies a large portion of the model's parameters.
We consider three publicly available BERT models of different sizes, all pre-trained for English (Turc et al., 2019) – BERT-base, BERT-mini and BERT-tiny. BERT-base has 12 layers with a hidden layer size of 768, resulting in about 110M trainable parameters. BERT-mini has 4 layers and a hidden layer size of 256, with around 11M parameters, and BERT-tiny has 2 layers and a hidden layer size of 128, totaling about 4.4M parameters.

[Figure 1: Embedding table in BERT]

Figure 1 shows the proportion of model size occupied by the input embedding layer (blue shaded portion of the bars) versus the encoder layers (unshaded portion of the bars). Note that in the smallest of these BERT variants, BERT-tiny, the input embedding layer occupies almost 90% of the model. By taking a different approach to model compression, focusing not on reducing the trainable parameters but instead on eliminating the input embedding layer, one could potentially deliver up to 9x model size reduction.
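To make these proportions concrete, the short sketch below estimates the token-embedding share of each variant; the 30,522-token WordPiece vocabulary of bert-base-uncased and the rounded parameter totals are assumptions used only for illustration.

```python
# Back-of-the-envelope check of the embedding layer's share of trainable parameters.
VOCAB_SIZE = 30_522  # assumption: bert-base-uncased WordPiece vocabulary

models = {
    # name: (hidden size d, approx. total trainable parameters quoted above)
    "BERT-base": (768, 110_000_000),
    "BERT-mini": (256, 11_000_000),
    "BERT-tiny": (128, 4_400_000),
}

for name, (d, total) in models.items():
    embed = VOCAB_SIZE * d  # token embedding table is V x d
    print(f"{name}: {embed / 1e6:.1f}M embedding params, ~{embed / total:.0%} of the model")

# BERT-tiny: roughly 3.9M of 4.4M parameters (~89%), consistent with the
# "almost 90%" figure above; BERT-base comes out to ~21%, matching Section 1.
```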
3.2 EELBERT Architecture

EELBERT differs from BERT only in the process of going from input token to input embedding. Rather than looking up each input token in the input embedding layer as our first step, we dynamically compute an embedding for a token $w_i$ by using an n-gram pooling hash function. The output is a $d$-size vector representation, $e_{w_i} \in \mathbb{R}^d$, just as we would get from the embedding layer in standard BERT. Keep in mind that EELBERT only impacts token embeddings, not the segment or position embeddings, and that all mentions of "embeddings" hereafter refer to token embeddings.

The key aspect of this method is that it does not rely on an input embedding table stored in memory, instead using the hash function to map input tokens to embedding vectors at runtime. This technique is not intended to produce embeddings that approximate BERT embeddings. Unlike BERT's input embeddings, dynamic embeddings do not update during training.

[Figure 2: Computing dynamic hash embeddings]

Our n-gram pooling hash function methodology is shown in Figure 2, with operations in black boxes, and black lines going from the input to the output of those operations. Input and output values are boxed in blue. For ease of notation, we refer to the n-grams of length $i$ as i-grams, where $i = 1, \ldots, N$, and $N$ is the maximum n-gram size. The steps of the algorithm are as follows:

1. Initialize random hash seeds $h \in \mathbb{Z}^d$. There are $d$ hash seeds in total, where $d$ is the size of the embedding we wish to obtain, e.g. 768 for BERT-base. The $d$ hash seeds are generated via a fixed random state, so we only need to save a single integer specifying the random state.

2. Hash i-grams to get i-gram signatures $s_i$. There are $k_i = l - i + 1$ i-grams, where $l$ is the length of the token. Using a rolling hash function (Wikipedia contributors, 2023), we compute the i-gram signature vectors $s_i \in \mathbb{Z}^{k_i}$.

3. Compute the projection matrix for i-grams. For each $i$, we compute a projection matrix $P_i$ using a subset of the hash seeds. The hash seed vector $h$ is partitioned into $N$ vectors, boxed in pink in the diagram. Each partition $h_i$ is of length $d_i$, where $\sum_{i=1}^{N} d_i = d$, with larger values of $i$ corresponding to a larger $d_i$. Given the hash seed vector $h_i$ and the i-gram signature vector $s_i$, the projection matrix $P_i \in \mathbb{Z}^{k_i \times d_i}$ is the outer product $s_i \times h_i$. To ensure that the matrix values are bounded between $[-1, 1]$, we perform a sequence of transformations on $P_i$:

$$P_i = P_i \bmod B, \qquad P_i = P_i - \left(P_i > \tfrac{B}{2}\right) \cdot B, \qquad P_i = P_i \,/\, \tfrac{B}{2}$$

where $B$ is our bucket size (a scalar).

4. Compute the embedding $e_i$ for each i-gram order. We obtain $e_i \in \mathbb{R}^{d_i}$ by averaging $P_i$ across its $k_i$ rows to produce a single $d_i$-dimensional vector.

5. Concatenate the $e_i$ to get the token embedding $e$. We concatenate the $N$ vectors $\{e_i\}_{i=1}^{N}$ to get the token's final embedding vector, $e \in \mathbb{R}^d$.
For a fixed embedding size $d$, the tunable hyperparameters of this algorithm are $N$, $B$, and the choice of the hashing function. We used $N = 3$, $B = 10^9 + 7$, and a rolling hash function.
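To make the five steps concrete, here is a minimal NumPy sketch of the n-gram pooling hash function. It is an illustrative re-implementation, not the authors' code: the rolling-hash constants, the partitioning of $d$ into the $d_i$, and the handling of tokens shorter than $i$ are assumptions, while the overall structure (hash seeds, i-gram signatures, outer product, bounding, averaging, concatenation) follows the steps above.

```python
import numpy as np

B = 10**9 + 7        # bucket size, as above
N = 3                # maximum n-gram size
RANDOM_STATE = 42    # the single saved integer that determines all hash seeds

def partition_sizes(d, n=N):
    """Split d into n partition sizes d_1 <= ... <= d_n summing to d.
    The exact partitioning scheme is an assumption; the text only requires
    that larger n-gram orders receive larger partitions."""
    weights = np.arange(1, n + 1)
    sizes = (d * weights // weights.sum()).astype(int)
    sizes[-1] += d - sizes.sum()
    return sizes.tolist()

def rolling_hash(ngram):
    """Polynomial rolling hash of a character n-gram (illustrative constants)."""
    h = 0
    for ch in ngram:
        h = (h * 257 + ord(ch)) % B
    return h

def dynamic_embedding(token, d=128):
    """Steps 1-5 above: compute the d-dimensional dynamic embedding of one token."""
    seeds = np.random.default_rng(RANDOM_STATE).integers(1, B, size=d)  # step 1
    out, start = [], 0
    for i, d_i in enumerate(partition_sizes(d), start=1):
        k_i = max(len(token) - i + 1, 1)  # assumption: very short tokens yield one truncated i-gram
        s_i = np.array([rolling_hash(token[j:j + i]) for j in range(k_i)])  # step 2
        h_i = seeds[start:start + d_i]                                      # step 3: seed partition
        start += d_i
        P_i = np.outer(s_i, h_i) % B          # outer product, reduced into [0, B)
        P_i = P_i - (P_i > B / 2) * B         # shift into (-B/2, B/2]
        P_i = P_i / (B / 2)                   # scale into [-1, 1]
        out.append(P_i.mean(axis=0))          # step 4: average over the k_i i-grams
    return np.concatenate(out)                # step 5: concatenate to e in R^d

print(dynamic_embedding("hello").shape)       # (128,)
```

Nothing here is learned or stored beyond the single random state, which is what removes the $V \times d$ embedding table from the exported model.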
Since EELBERT replaces the input embedding layer with dynamic embeddings, the exported model size is reduced by the size of the input embedding layer: $O(d \times V)$, where $V$ is the vocabulary size and $d$ is the embedding size.

We specifically refer to the exported size here, because during pre-training, the model also uses an output embedding layer which maps embedding vectors back into tokens. In typical BERT pre-training, weights are shared between the input and output embedding layers, so the output embedding layer does not contribute to model size. For EELBERT, however, there is no input embedding layer to share weights with, so the output embedding layer does contribute to model size. Even if we pre-compute and store the dynamic token embeddings as an embedding lookup table, using the transposed dynamic embeddings as a frozen output layer would defeat the purpose of learning contextualized representations. In short, using coupled input and output embedding layers in EELBERT is infeasible, so BERT and EELBERT are the same size during pre-training. When pre-training is completed, the output embedding layer in both models is discarded, and the exported models are used for downstream tasks, which is when we see the size advantages of EELBERT.
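As a quick consistency check on this saving (assuming, again, a ~30,522-token vocabulary and 4-byte float32 weights):

```python
V, d = 30_522, 768      # assumed bert-base-uncased vocabulary size; BERT-base hidden size
print(V * d * 4 / 1e6)  # ~93.8 MB for the input embedding table alone,
                        # close to the 438 MB -> 344 MB gap reported in Table 1
```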
4 Experimental Setup

In this section, we assess the effectiveness of EELBERT. The key questions that interest us are: how much model compression can we achieve, and what is the impact of such compression on model quality for language understanding? We conduct experiments on a set of benchmark NLP tasks to empirically answer these questions.

In each of our experiments, we compare EELBERT to the corresponding standard BERT model – i.e., a model with the same configuration but with the standard trainable input embedding layer instead of our dynamic embeddings. This standard model serves as the baseline for comparison, to observe the impact of our approach.

4.1 Pre-training

For our experiments, we pre-train both BERT and EELBERT from scratch on the OpenWebText dataset (Radford et al., 2019; Gokaslan and Cohen, 2019), using the pre-training pipeline released by Hugging Face Transformers (Wolf et al., 2019). Each of our models is pre-trained for 900,000 steps with a maximum token length of 128, using the bert-base-uncased tokenizer. We follow the pre-training procedure described in Devlin et al. (2019), with a few differences. Specifically, (a) we use the OpenWebText corpus for pre-training, while the original work used the combined dataset of Wikipedia and BookCorpus, and (b) we only use the masked language model pre-training objective, while the original work employed both masked language model and next sentence prediction objectives.

For BERT, the input and output embedding layers are coupled and trainable. Since EELBERT has no input embedding layer, its output embedding layer is decoupled and trainable.
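For orientation, a masked-language-model-only pre-training run along these lines can be set up with the Hugging Face Trainer roughly as follows. This is a hedged sketch of the baseline BERT setup, not the authors' exact pipeline: the batch size, learning rate, and text chunking are unspecified in the text and are placeholders, and an EELBERT variant would additionally replace the input embedding lookup with the dynamic computation of Section 3.2.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# OpenWebText, tokenized to a maximum length of 128, as described above.
raw = load_dataset("openwebtext", split="train")
dataset = raw.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                  batched=True, remove_columns=["text"])

model = BertForMaskedLM(BertConfig())          # BERT-base shape, trained from scratch
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                            mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-owt-mlm",
                         max_steps=900_000,                # 900k steps, as above
                         per_device_train_batch_size=64,   # placeholder hyperparameter
                         learning_rate=1e-4)               # placeholder hyperparameter
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```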
4.2 Fine-tuning

For downstream fine-tuning and evaluation, we choose the GLUE benchmark (Wang et al., 2018) to assess the quality of our models. GLUE is a collection of nine language understanding tasks, including single-sentence tasks (sentiment analysis, linguistic acceptability), similarity/paraphrase tasks, and natural language inference tasks. Using each of our models as a backbone, we fine-tune individually for each of the GLUE tasks under a setting similar to that described in Devlin et al. (2019). The metrics on these tasks serve as a proxy for the quality of the embedding models. Since GLUE metrics are known to have high variance, we run each experiment 5 times using 5 different seeds, and report the median of the metrics over all the runs, as done in Lan et al. (2020).

We calculate an overall GLUE score for each model. For BERT-base and EELBERT-base we use the following equation:

AVERAGE(CoLA Matthews corr, SST-2 accuracy, MRPC accuracy, STSB Pearson corr, QQP accuracy, AVERAGE(MNLI match accuracy, MNLI mismatch accuracy), QNLI accuracy, RTE accuracy)

Like Devlin et al. (2019), we do not include the WNLI task in our calculations. For all the smaller BERT variants, i.e. BERT-mini, BERT-tiny, EELBERT-mini, EELBERT-tiny, and UNO-EELBERT, we use:

AVERAGE(SST-2 accuracy, MRPC accuracy, QQP accuracy, AVERAGE(MNLI match accuracy, MNLI mismatch accuracy), QNLI accuracy, RTE accuracy)

Note that we exclude CoLA and STSB from the smaller models' score, because the models (both baseline and EELBERT) appear to be unstable on these tasks. We see a similar exclusion of these tasks in Sun et al. (2019).

Also note that in the tables we abbreviate MNLI match and mismatch accuracy as MNLI (M, MM Acc.), CoLA Matthews correlation as CoLA (M Corr.), and STSB Pearson and Spearman correlation as STSB (P, S Corr.).
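The two formulas above can be expressed as a small helper, shown here for clarity; the metric dictionary keys are hypothetical names, and scores are assumed to be on the same 0–1 scale as in the tables.

```python
def overall_glue_score(m: dict, small_variant: bool = False) -> float:
    """Overall GLUE score per the equations above (WNLI is always excluded)."""
    mnli = (m["mnli_match_acc"] + m["mnli_mismatch_acc"]) / 2  # inner AVERAGE over MNLI
    parts = [m["sst2_acc"], m["mrpc_acc"], m["qqp_acc"], mnli,
             m["qnli_acc"], m["rte_acc"]]
    if not small_variant:  # base models additionally include CoLA and STSB
        parts += [m["cola_matthews_corr"], m["stsb_pearson_corr"]]
    return sum(parts) / len(parts)
```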
5 Results

We present results of experiments assessing various aspects of the model, with a view towards deployment and production use.

5.1 Model Size vs. Quality

Our first experiment directly assesses our dynamic embeddings by comparing the EELBERT models to their corresponding standard BERT baselines on GLUE benchmark tasks. We start by pre-training the models as described in Section 4.1 and fine-tune the models on downstream GLUE tasks, as described in Section 4.2.

Table 1 summarizes the results of this experiment. Note that replacing the trainable embedding layer with dynamic embeddings does have a relatively small impact on the GLUE score. EELBERT-base achieves ~21% reduction in parameter count while regressing by just 1.5% on the GLUE score.

Table 1: GLUE benchmark for BERT vs. EELBERT

|                      | BERT-base    | EELBERT-base |
|----------------------|--------------|--------------|
| Trainable Parameters | 109,514,298  | 86,073,402   |
| Exported Model Size  | 438 MB       | 344 MB       |
| SST-2 (Acc.)         | 0.899        | 0.900        |
| QNLI (Acc.)          | 0.866        | 0.864        |
| RTE (Acc.)           | 0.625        | 0.563        |
| WNLI* (Acc.)         | 0.521        | 0.563        |
| MRPC (Acc., F1)      | 0.833, 0.882 | 0.838, 0.887 |
| QQP* (Acc., F1)      | 0.898, 0.864 | 0.895, 0.861 |
| MNLI (M, MM Acc.)    | 0.799, 0.802 | 0.790, 0.795 |
| STSB (P, S Corr.)    | 0.870, 0.867 | 0.851, 0.849 |
| CoLA (M Corr.)       | 0.410        | 0.373        |
| GLUE Score           | 0.775        | 0.760        |

As a followup to this, we investigate the impact of dynamic embeddings on significantly smaller models. Table 2 shows the results for BERT-mini and BERT-tiny, which have 11 million and 4.4 million trainable parameters, respectively. The corresponding EELBERT-mini and EELBERT-tiny models have 3.4 million and 0.5 million trainable parameters, respectively. EELBERT-mini has just 0.7% absolute regression compared to BERT-mini, while being ~3x smaller. Similarly, EELBERT-tiny is almost on par with BERT-tiny, with 0.5% absolute regression, while being ~9x smaller.

Table 2: GLUE benchmark for the smaller BERT and EELBERT variants

|                      | BERT-mini    | EELBERT-mini | BERT-tiny    | EELBERT-tiny | UNO-EELBERT  |
|----------------------|--------------|--------------|--------------|--------------|--------------|
| Trainable Parameters | 11,171,074   | 3,357,442    | 4,386,178    | 479,362      | 312,506      |
| Exported Model Size  | 44.8 MB      | 13.4 MB      | 17.7 MB      | 2.04 MB      | 1.24 MB      |
| SST-2 (Acc.)         | 0.851        | 0.835        | 0.821        | 0.749        | 0.701        |
| QNLI (Acc.)          | 0.827        | 0.821        | 0.616        | 0.705        | 0.609        |
| RTE (Acc.)           | 0.552        | 0.560        | 0.545        | 0.516        | 0.527        |
| WNLI* (Acc.)         | 0.563        | 0.549        | 0.521        | 0.535        | 0.479        |
| MRPC (Acc., F1)      | 0.701, 0.814 | 0.721, 0.814 | 0.684, 0.812 | 0.684, 0.812 | 0.684, 0.812 |
| QQP* (Acc., F1)      | 0.864, 0.815 | 0.850, 0.803 | 0.780, 0.661 | 0.752, 0.712 | 0.728, 0.628 |
| MNLI (M, MM Acc.)    | 0.719, 0.730 | 0.688, 0.697 | 0.577, 0.581 | 0.582, 0.598 | 0.539, 0.552 |
| CoLA (M Corr.)       | 0.103        | 0            | 0            | 0            | 0            |
| GLUE score           | 0.753        | 0.746        | 0.671        | 0.666        | 0.632        |

Additionally, when we compare the EELBERT-mini and BERT-tiny models, which have roughly the same number of trainable parameters, we notice that EELBERT-mini has a substantially higher GLUE score than BERT-tiny. This leads us to conclude that under space-limited conditions, it would be better to train a model with dynamic embeddings and a larger number of hidden layers rather than a shallower model with a trainable embedding layer and fewer hidden layers.
5.2 Pushing the Limits: UNO-EELBERT

The results discussed in the previous section suggest that our dynamic embeddings have the most utility for extremely small models, where they perform comparably to standard BERT while providing drastic compression. Following this line of thought, we try to push the boundaries of model compression. We train UNO-EELBERT, a model with a similar configuration as EELBERT-tiny, but a reduced intermediate size of 128. We note that this model is almost 15 times smaller than BERT-tiny, with an absolute GLUE score regression of less than 4% (Table 2).
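In Hugging Face terms, the encoder shape described here corresponds roughly to the configuration below; the attention-head count is an assumption (the text specifies only the 2-layer, hidden-size-128 tiny shape and the reduced intermediate size), and the dynamic embedding computation itself is not captured by BertConfig.

```python
from transformers import BertConfig

# UNO-EELBERT encoder shape: BERT-tiny-like, with intermediate_size reduced to 128.
uno_config = BertConfig(hidden_size=128, num_hidden_layers=2,
                        num_attention_heads=2,   # assumption, matching BERT-tiny
                        intermediate_size=128)
```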
|                      | BERT-base, n-gram pooling | BERT-base, random | BERT-mini, n-gram pooling | BERT-mini, random |
|----------------------|---------------------------|-------------------|---------------------------|-------------------|
| Trainable Parameters | 86,073,402                | 86,073,402        | 3,387,962                 | 3,387,962         |
| Exported Model Size  | 344 MB                    | 344 MB            | 13.4 MB                   | 13.4 MB           |
| SST-2 (Acc.)         | 0.900                     | 0.897             | 0.835                     | 0.823             |
| QNLI (Acc.)          | 0.864                     | 0.862             | 0.821                     | 0.639             |
| RTE (Acc.)           | 0.563                     | 0.574             | 0.560                     | 0.569             |
| WNLI* (Acc.)         | 0.563                     | 0.507             | 0.549                     | 0.507             |
| MRPC (Acc., F1)      | 0.838, 0.887              | 0.806, 0.868      | 0.721, 0.814              | 0.690, 0.805      |
| QQP* (Acc., F1)      | 0.895, 0.861              | 0.893, 0.858      | 0.850, 0.803              | 0.800, 0.759      |
| MNLI (M, MM Acc.)    | 0.791, 0.795              | 0.786, 0.794      | 0.688, 0.697              | 0.647, 0.660      |
| STSB (P, S Corr.)    | 0.851, 0.849              | 0.849, 0.847      | -, -                      | -, -              |
| CoLA (M Corr.)       | 0.373                     | 0.389             | 0                         | 0                 |
| GLUE score           | 0.760                     | 0.757             | 0.746                     | 0.696             |
randomly-initialized model. We also perform this comparison for BERT-mini (not shown in the table), and observe a similar result. In fact, for BERT-mini, the hash-initialized model had an absolute increase of 1.6% in overall GLUE score, suggesting that the advantage of n-gram pooling hash-initialization may be even greater for smaller models.

5.5 Memory vs. Latency Trade-off

One consequence of using dynamic embeddings is that we are essentially trading off computation time for memory. The embedding lookup time for a token is $O(1)$ in BERT models. In EELBERT, a token embedding depends on the number of character n-grams in the token, as well as the size of the hash seed partitions. Due to the outer product between the n-gram signatures and the partitioned hash seeds, the overall time complexity is dominated by $l \times d$, where $l$ is the length of a token and $d$ is the embedding size, leading to $O(l \times d)$ time complexity to compute the dynamic hash embedding for a token. For English, the average number of letters in a word follows a roughly Poisson-shaped distribution with a mean of ~4.79 (Norvig, 2012), and the embedding size $d$ for BERT models typically ranges between 128 and 768.
The inference time for BERT-base vs. EELBERT-base is practically unchanged, as the bulk of the computation time goes into the encoder blocks for big models with multiple encoder blocks. However, our experiments in Table 5 indicate that EELBERT-tiny has ~2.3x the inference time of BERT-tiny, as the computation time in the encoder blocks decreases for smaller models, and embedding computation starts constituting a sizeable portion of the overall latency. These latency measurements were done on a standard M1 MacBook Pro with 32GB RAM. We performed inference on a set of 10 sentences (with an average word length of 4.8) for each of the models, reporting the average latency of obtaining the embeddings for a sentence (tokenization latency is the same for all the models, and is excluded from the measurements).
To improve the inference latency, we suggest some architectural and engineering optimizations. The outer product between the $O(l)$-dimensional n-gram hash values and the $O(d)$-dimensional hash seeds, resulting in a matrix of size $O(l \times d)$, is the computational bottleneck in the dynamic embedding computation. A sparse mask with a fixed number of 1's in every row could reduce the complexity of this step to $O(l \times s)$, where $s$ is the number of ones in each row and $s \ll d$. This means every n-gram will only attend to some of the hash seeds. This mask can be learned during training and saved with the model parameters without much memory overhead, as it would be of size $O(k \times s)$, $k$ being the maximum number of n-grams expected from a token. Future work could explore the effect of this approach on model quality. The hash embeddings of tokens could also be computed in parallel, since they are independent of each other. Additionally, we observe that the 1-, 2- and 3-grams follow a Zipf-ian distribution. By using a small cache of the embeddings for the most common n-grams, we could speed up the computation at the cost of a small increase in memory footprint, as sketched below.
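A minimal sketch of such a cache, reusing `rolling_hash` and `B` from the earlier sketch and assuming the per-order seed partitions are available as a dict `H_PARTITIONS[i]` (a hypothetical name); the cache size is an arbitrary illustrative choice.

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=8192)                     # small bounded cache of frequent n-grams
def ngram_row(ngram: str, i: int) -> tuple:
    """Bounded projection row for one (n-gram, order) pair, memoized."""
    row = (rolling_hash(ngram) * H_PARTITIONS[i]) % B   # one 1 x d_i row of P_i
    row = row - (row > B / 2) * B
    row = row / (B / 2)
    return tuple(row)                        # tuples keep cached values immutable

def embedding_for_order(token: str, i: int) -> np.ndarray:
    """e_i for one token, averaging the (possibly cached) i-gram rows."""
    k_i = max(len(token) - i + 1, 1)
    return np.mean([ngram_row(token[j:j + i], i) for j in range(k_i)], axis=0)
```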
6 Conclusions

In this work we explored the application of dynamic embeddings to the BERT model architecture, as an alternative to the standard, trainable input embedding layer. Our experiments show that replacing the input embedding layer with dynamically computed embeddings is an effective method of model compression, with minimal regression on downstream tasks. Dynamic embeddings appear to be particularly effective for the smaller BERT variants, where the input embedding layer comprises a larger percentage of trainable parameters.

We also find that for smaller BERT models, a deeper model with dynamic embeddings yields better results than a shallower model of comparable size with a trainable embedding layer. Since the dynamic embeddings technique used in EELBERT is complementary to existing model compression techniques, we can apply it in combination with other compression methods to produce extremely tiny models. Notably, our smallest model, UNO-EELBERT, is just 1.2 MB in size, but achieves a GLUE score within 4% of that of a standard fully trained model almost 15 times its size.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear Algebraic Structure of Word Senses, with Applications to Polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. Advances in Neural Information Processing Systems, 33:15834–15846.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, and Marianne Winslett. 2021. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Transactions of the Association for Computational Linguistics, 9:1061–1080.

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText Corpus. http://Skylion007.github.io/OpenWebTextCorpus.

Mitchell Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 143–155.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv, abs/2001.08361.

Yeachan Kim, Kang-Min Kim, and SangKeun Lee. 2020. Adaptive Compression of Word Embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3950–3959, Online. Association for Computational Linguistics.

Maximilian Lam. 2018. Word2Bits - Quantized Word Vectors. arXiv, abs/1803.05651.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the Eighth International Conference on Learning Representations.

Rémi Lebret and Ronan Collobert. 2014. Word Embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490, Gothenburg, Sweden. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv, abs/1907.11692.

Peter Norvig. 2012. English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU. http://norvig.com/mayzner.html. [Online; accessed 23-October-2023].

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Sujith Ravi. 2017. ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections. arXiv, abs/1708.00630.

Sujith Ravi and Zornitsa Kozareva. 2018a. Self-Governing Neural Networks for On-Device Short Text Classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 887–893.

Sujith Ravi and Zornitsa Kozareva. 2018b. Self-Governing Neural Networks for On-Device Short Text Classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 804–810, Brussels, Belgium. Association for Computational Linguistics.

Sujith Ravi and Zornitsa Kozareva. 2021. SoDA: On-device Conversational Slot Extraction. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 56–65, Singapore and Online. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv, abs/1910.01108.

Chinnadhurai Sankar, Sujith Ravi, and Zornitsa Kozareva. 2021. ProFormer: Towards On-Device LSH Projection Based Transformers. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2823–2828.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient Knowledge Distillation for BERT Model Compression. arXiv, abs/1908.09355.

Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. 2017. Hash Embeddings for Efficient Word Representations. Advances in Neural Information Processing Systems, 30:4935–4943.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv, abs/1908.08962.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Wikipedia contributors. 2023. Rolling hash — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Rolling_hash&oldid=1168768744. [Online; accessed 23-October-2023].