EELBERT: Tiny Models through Dynamic Embeddings
Abstract

We introduce EELBERT, an approach for compressing transformer-based models (e.g., BERT), with minimal impact on the accuracy of downstream tasks. This is achieved by replacing the input embedding layer of the model with dynamic, i.e. on-the-fly, embedding computations. Since the input embedding layer accounts for a significant fraction of the model size, especially for the smaller BERT variants, replacing this layer with an embedding computation function helps us reduce the model size significantly. Empirical evaluation on the GLUE benchmark shows that our BERT variants (EELBERT) suffer minimal regression compared to the traditional BERT models. Through this approach, we are able to develop our smallest model, UNO-EELBERT, which achieves a GLUE score within 4% of fully trained BERT-tiny, while being 15x smaller (1.2 MB) in size.

1 Introduction

It has been standard practice for the past several years for natural language understanding systems to be built upon powerful pre-trained language models, such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2020), mT5 (Xue et al., 2021), and RoBERTa (Liu et al., 2019). These language models are comprised of a series of transformer-based layers, each transforming the representation at its input into a new representation at its output. Such transformers act as the "backbone" for solving several natural language tasks, like text classification, sequence labeling, and text generation, and are primarily used to map (or encode) natural language text into a multidimensional vector space representing the semantics of that language.

Experiments in prior work (Kaplan et al., 2020) have demonstrated that the size of the language model (i.e., the number of parameters) has a direct impact on task performance, and that increasing a language model's size improves its language understanding capabilities. At the same time, however, there has been a parallel push to create much smaller models, which could be deployed in resource-constrained environments such as smart phones or watches.

Some key questions arise when considering such environments: How does one leverage the power of such large language models on these low-power devices? Is it possible to get the benefits of large language models without the massive disk, memory and compute requirements? Much recent work in the areas of model pruning (Gordon et al., 2020), quantization (Zafrir et al., 2019), distillation (Jiao et al., 2020; Sanh et al., 2020) and more targeted approaches like the lottery ticket hypothesis (Chen et al., 2020) aims to produce smaller yet effective models. Our work takes a different approach, by reclaiming the resources required for representing the model's large vocabulary.

The inspiration for our work comes from Ravi and Kozareva (2018a), who introduced dynamic embeddings, i.e. embeddings computed on-the-fly via hash functions. We extend the usage of dynamic embeddings to transformer-based language models. We observe that 21% of the trainable parameters in BERT-base (Turc et al., 2019) are in the embedding lookup layer. By replacing this input embedding layer with embeddings computed at run-time, we can reduce model size by the same percentage.

In this paper, we introduce an "embeddingless" model – EELBERT – that uses a dynamic embedding computation strategy to achieve a smaller size. We conduct a set of experiments to empirically assess the quality of these "embeddingless" models along with the relative size reduction. A size reduction of up to 88% is observed in our experiments, with minimal regression in model quality, and this approach is entirely complementary to other model compression techniques. Since EELBERT calculates embeddings at run-time, we do incur additional latency, which we measure in our experiments. We find that EELBERT's latency increases relative to BERT's as model size decreases, but could be mitigated through careful architectural and engineering optimizations. Considering the gains in model compression that EELBERT provides, this is not an unreasonable trade-off.
2 Related Work

There is a large body of work describing strategies for optimizing memory and performance of the BERT models (Ganesh et al., 2021). In this section, we highlight the studies most relevant to our work, which focus on reducing the size of the token embeddings used to map input tokens to a real-valued vector representation. We also look at past research on hash embeddings or randomized embeddings used in language applications (e.g., Tito Svenstrup et al. (2017)).

Much prior work has been done to reduce the size of pre-trained static embeddings like GloVe and Word2Vec. Lebret and Collobert (2014) apply Principal Component Analysis (PCA) to reduce the dimensionality of word embeddings. For compressing GloVe embeddings, Arora et al. (2018) proposed LASPE, which leverages matrix factorization to represent the original embeddings as a combination of basis embeddings and linear transformations. Lam (2018) proposed a method called Word2Bits that uses quantization to compress Word2Vec embeddings. Similarly, Kim et al. (2020) proposed using variable-size code-blocks to represent each word, where the codes are learned via a feedforward network with a binary constraint.

However, the works most relevant to this paper are by Ravi and Kozareva (2018b) and Ravi (2017). The key idea in the approach by Ravi and Kozareva (2018b) is the use of projection networks as a deterministic function to generate an embedding vector from a string of text, where this generator function replaces the embedding layer. That idea has been extended to word-level embeddings by Sankar et al. (2021) and Ravi and Kozareva (2021), using an LSH-based technique for the projection function. These papers demonstrate the effectiveness of projection embeddings, combined with a stacked layer of CNN, BiLSTM and CRF, on a small text classification task. In our work, we investigate the potential of these projection and hash embedding methods to achieve compression in transformer models like BERT.

3 Modeling EELBERT

EELBERT is designed with the goal of reducing the size (and thus the memory requirement) of the input embedding layers of BERT and other transformer-based models. In this section, we first describe our observations about BERT which inform our architecture choices in EELBERT, and then present the EELBERT model in detail.

3.1 Observations about BERT

BERT-like language models take a sequence of tokens as input, encoding them into a semantic vector space representation. The input tokens are generated by a tokenizer, which segments a natural language sentence into discrete sub-string units $w_1, w_2, \ldots, w_n$. In BERT, each token in the model's vocabulary is mapped to an index, corresponding to a row in the input embedding table (also referred to as the input embedding layer). This row represents the token's $d$-size embedding vector $e_{w_i} \in \mathbb{R}^d$, for a given token $w_i$.

The table-lookup-like process of mapping tokens in the vocabulary to numerical vector representations using the input embedding layer is a "non-trainable" operation, and is therefore unaffected by standard model compression techniques, which typically target the model's trainable parameters. This results in a compression bottleneck, since a profiling of BERT-like models reveals that the input embedding layer occupies a large portion of the model's parameters.
We consider three publicly available BERT models of different sizes, all pre-trained for English (Turc et al., 2019) – BERT-base, BERT-mini and BERT-tiny. BERT-base has 12 layers with a hidden layer size of 768, resulting in about 110M trainable parameters. BERT-mini has 4 layers and a hidden layer size of 256, with around 11M parameters, and BERT-tiny has 2 layers and a hidden layer size of 128, totaling about 4.4M parameters.

[Figure 1: Embedding table in BERT]

Figure 1 shows the proportion of model size occupied by the input embedding layer (blue shaded portion of the bars) versus the encoder layers (unshaded portion of the bars). Note that in the smallest of these BERT variants, BERT-tiny, the input embedding layer occupies almost 90% of the model. By taking a different approach to model compression, focusing not on reducing the trainable parameters but instead on eliminating the input embedding layer, one could potentially deliver up to 9x model size reduction.
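To make these proportions concrete, the short sketch below estimates the token-embedding share of each variant; the 30,522-token WordPiece vocabulary of bert-base-uncased and the rounded parameter totals are assumptions used only for illustration.

```python
# Back-of-the-envelope check of the embedding layer's share of trainable parameters.
VOCAB_SIZE = 30_522  # assumption: bert-base-uncased WordPiece vocabulary

models = {
    # name: (hidden size d, approx. total trainable parameters quoted above)
    "BERT-base": (768, 110_000_000),
    "BERT-mini": (256, 11_000_000),
    "BERT-tiny": (128, 4_400_000),
}

for name, (d, total) in models.items():
    embed = VOCAB_SIZE * d  # token embedding table is V x d
    print(f"{name}: {embed / 1e6:.1f}M embedding params, ~{embed / total:.0%} of the model")

# BERT-tiny: roughly 3.9M of 4.4M parameters (~89%), consistent with the
# "almost 90%" figure above; BERT-base comes out to ~21%, matching Section 1.
```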
3.2 EELBERT Architecture

EELBERT differs from BERT only in the process of going from input token to input embedding. Rather than looking up each input token in the input embedding layer as our first step, we dynamically compute an embedding for a token $w_i$ by using an n-gram pooling hash function. The output is a $d$-size vector representation, $e_{w_i} \in \mathbb{R}^d$, just as we would get from the embedding layer in standard BERT. Keep in mind that EELBERT only impacts token embeddings, not the segment or position embeddings, and that all mentions of "embeddings" hereafter refer to token embeddings.

The key aspect of this method is that it does not rely on an input embedding table stored in memory, instead using the hash function to map input tokens to embedding vectors at runtime. This technique is not intended to produce embeddings that approximate BERT embeddings. Unlike BERT's input embeddings, dynamic embeddings do not update during training.

[Figure 2: Computing dynamic hash embeddings]

Our n-gram pooling hash function methodology is shown in Figure 2, with operations in black boxes, and black lines going from the input to the output of those operations. Input and output values are boxed in blue. For ease of notation, we refer to the n-grams of length $i$ as i-grams, where $i = 1, \ldots, N$, and $N$ is the maximum n-gram size. The steps of the algorithm are as follows:

1. Initialize random hash seeds $h \in \mathbb{Z}^d$. There are $d$ hash seeds in total, where $d$ is the size of the embedding we wish to obtain, e.g. 768 for BERT-base. The $d$ hash seeds are generated via a fixed random state, so we only need to save a single integer specifying the random state.

2. Hash i-grams to get i-gram signatures $s_i$. There are $k_i = l - i + 1$ i-grams, where $l$ is the length of the token. Using a rolling hash function (Wikipedia contributors, 2023), we compute the i-gram signature vectors $s_i \in \mathbb{Z}^{k_i}$.

3. Compute the projection matrix for i-grams. For each $i$, we compute a projection matrix $P_i$ using a subset of the hash seeds. The hash seed vector $h$ is partitioned into $N$ vectors, boxed in pink in the diagram. Each partition $h_i$ is of length $d_i$, where $\sum_{i=1}^{N} d_i = d$, with larger values of $i$ corresponding to a larger $d_i$. Given the hash seed vector $h_i$ and the i-gram signature vector $s_i$, the projection matrix $P_i \in \mathbb{Z}^{k_i \times d_i}$ is the outer product $s_i \times h_i$. To ensure that the matrix values are bounded between $[-1, 1]$, we perform a sequence of transformations on $P_i$:

$$P_i = P_i \bmod B, \qquad P_i = P_i - \left(P_i > \tfrac{B}{2}\right) \cdot B, \qquad P_i = P_i \,/\, \tfrac{B}{2}$$

where $B$ is our bucket size (a scalar).

4. Compute the embedding $e_i$ for each i-gram order. We obtain $e_i \in \mathbb{R}^{d_i}$ by averaging $P_i$ across its $k_i$ rows to produce a single $d_i$-dimensional vector.

5. Concatenate the $e_i$ to get the token embedding $e$. We concatenate the $N$ vectors $\{e_i\}_{i=1}^{N}$ to get the token's final embedding vector, $e \in \mathbb{R}^d$.
For a fixed embedding size $d$, the tunable hyperparameters of this algorithm are $N$, $B$, and the choice of the hashing function. We used $N = 3$, $B = 10^9 + 7$, and a rolling hash function.
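To make the five steps concrete, here is a minimal NumPy sketch of the n-gram pooling hash function. It is an illustrative re-implementation, not the authors' code: the rolling-hash constants, the partitioning of $d$ into the $d_i$, and the handling of tokens shorter than $i$ are assumptions, while the overall structure (hash seeds, i-gram signatures, outer product, bounding, averaging, concatenation) follows the steps above.

```python
import numpy as np

B = 10**9 + 7        # bucket size, as above
N = 3                # maximum n-gram size
RANDOM_STATE = 42    # the single saved integer that determines all hash seeds

def partition_sizes(d, n=N):
    """Split d into n partition sizes d_1 <= ... <= d_n summing to d.
    The exact partitioning scheme is an assumption; the text only requires
    that larger n-gram orders receive larger partitions."""
    weights = np.arange(1, n + 1)
    sizes = (d * weights // weights.sum()).astype(int)
    sizes[-1] += d - sizes.sum()
    return sizes.tolist()

def rolling_hash(ngram):
    """Polynomial rolling hash of a character n-gram (illustrative constants)."""
    h = 0
    for ch in ngram:
        h = (h * 257 + ord(ch)) % B
    return h

def dynamic_embedding(token, d=128):
    """Steps 1-5 above: compute the d-dimensional dynamic embedding of one token."""
    seeds = np.random.default_rng(RANDOM_STATE).integers(1, B, size=d)  # step 1
    out, start = [], 0
    for i, d_i in enumerate(partition_sizes(d), start=1):
        k_i = max(len(token) - i + 1, 1)  # assumption: very short tokens yield one truncated i-gram
        s_i = np.array([rolling_hash(token[j:j + i]) for j in range(k_i)])  # step 2
        h_i = seeds[start:start + d_i]                                      # step 3: seed partition
        start += d_i
        P_i = np.outer(s_i, h_i) % B          # outer product, reduced into [0, B)
        P_i = P_i - (P_i > B / 2) * B         # shift into (-B/2, B/2]
        P_i = P_i / (B / 2)                   # scale into [-1, 1]
        out.append(P_i.mean(axis=0))          # step 4: average over the k_i i-grams
    return np.concatenate(out)                # step 5: concatenate to e in R^d

print(dynamic_embedding("hello").shape)       # (128,)
```

Nothing here is learned or stored beyond the single random state, which is what removes the $V \times d$ embedding table from the exported model.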
Since EELBERT replaces the input embedding layer with dynamic embeddings, the exported model size is reduced by the size of the input embedding layer: $O(d \times V)$, where $V$ is the vocabulary size and $d$ is the embedding size.

We specifically refer to the exported size here, because during pre-training, the model also uses an output embedding layer which maps embedding vectors back into tokens. In typical BERT pre-training, weights are shared between the input and output embedding layers, so the output embedding layer does not contribute to model size. For EELBERT, however, there is no input embedding layer to share weights with, so the output embedding layer does contribute to model size. Even if we pre-compute and store the dynamic token embeddings as an embedding lookup table, using the transposed dynamic embeddings as a frozen output layer would defeat the purpose of learning contextualized representations. In short, using coupled input and output embedding layers in EELBERT is infeasible, so BERT and EELBERT are the same size during pre-training. When pre-training is completed, the output embedding layer in both models is discarded, and the exported models are used for downstream tasks, which is when we see the size advantages of EELBERT.
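As a quick consistency check on this saving (assuming, again, a ~30,522-token vocabulary and 4-byte float32 weights):

```python
V, d = 30_522, 768      # assumed bert-base-uncased vocabulary size; BERT-base hidden size
print(V * d * 4 / 1e6)  # ~93.8 MB for the input embedding table alone,
                        # close to the 438 MB -> 344 MB gap reported in Table 1
```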
4 Experimental Setup

In this section, we assess the effectiveness of EELBERT. The key questions that interest us are: how much model compression can we achieve, and what is the impact of such compression on model quality for language understanding? We conduct experiments on a set of benchmark NLP tasks to empirically answer these questions.

In each of our experiments, we compare EELBERT to the corresponding standard BERT model – i.e., a model with the same configuration but with the standard trainable input embedding layer instead of our dynamic embeddings. This standard model serves as the baseline for comparison, to observe the impact of our approach.

4.1 Pre-training

For our experiments, we pre-train both BERT and EELBERT from scratch on the OpenWebText dataset (Radford et al., 2019; Gokaslan and Cohen, 2019), using the pre-training pipeline released by Hugging Face Transformers (Wolf et al., 2019). Each of our models is pre-trained for 900,000 steps with a maximum token length of 128, using the bert-base-uncased tokenizer. We follow the pre-training procedure described in Devlin et al. (2019), with a few differences. Specifically, (a) we use the OpenWebText corpus for pre-training, while the original work used the combined dataset of Wikipedia and BookCorpus, and (b) we only use the masked language model pre-training objective, while the original work employed both masked language model and next sentence prediction objectives.

For BERT, the input and output embedding layers are coupled and trainable. Since EELBERT has no input embedding layer, its output embedding layer is decoupled and trainable.
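For orientation, a masked-language-model-only pre-training run along these lines can be set up with the Hugging Face Trainer roughly as follows. This is a hedged sketch of the baseline BERT setup, not the authors' exact pipeline: the batch size, learning rate, and text chunking are unspecified in the text and are placeholders, and an EELBERT variant would additionally replace the input embedding lookup with the dynamic computation of Section 3.2.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# OpenWebText, tokenized to a maximum length of 128, as described above.
raw = load_dataset("openwebtext", split="train")
dataset = raw.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                  batched=True, remove_columns=["text"])

model = BertForMaskedLM(BertConfig())          # BERT-base shape, trained from scratch
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                            mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-owt-mlm",
                         max_steps=900_000,                # 900k steps, as above
                         per_device_train_batch_size=64,   # placeholder hyperparameter
                         learning_rate=1e-4)               # placeholder hyperparameter
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```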
4.2 Fine-tuning

For downstream fine-tuning and evaluation, we choose the GLUE benchmark (Wang et al., 2018) to assess the quality of our models. GLUE is a collection of nine language understanding tasks, including single-sentence tasks (sentiment analysis, linguistic acceptability), similarity/paraphrase tasks, and natural language inference tasks. Using each of our models as a backbone, we fine-tune individually for each of the GLUE tasks under a setting similar to that described in Devlin et al. (2019). The metrics on these tasks serve as a proxy for the quality of the embedding models. Since GLUE metrics are known to have high variance, we run each experiment 5 times using 5 different seeds, and report the median of the metrics over all the runs, as done in Lan et al. (2020).

We calculate an overall GLUE score for each model. For BERT-base and EELBERT-base we use the following equation:

AVERAGE(CoLA Matthews corr, SST-2 accuracy, MRPC accuracy, STSB Pearson corr, QQP accuracy, AVERAGE(MNLI match accuracy, MNLI mismatch accuracy), QNLI accuracy, RTE accuracy)

Like Devlin et al. (2019), we do not include the WNLI task in our calculations. For all the smaller BERT variants, i.e. BERT-mini, BERT-tiny, EELBERT-mini, EELBERT-tiny, and UNO-EELBERT, we use:

AVERAGE(SST-2 accuracy, MRPC accuracy, QQP accuracy, AVERAGE(MNLI match accuracy, MNLI mismatch accuracy), QNLI accuracy, RTE accuracy)

Note that we exclude CoLA and STSB from the smaller models' score, because the models (both baseline and EELBERT) appear to be unstable on these tasks. We see a similar exclusion of these tasks in Sun et al. (2019).

Also note that in the tables we abbreviate MNLI match and mismatch accuracy as MNLI (M, MM Acc.), CoLA Matthews correlation as CoLA (M Corr.), and STSB Pearson and Spearman correlation as STSB (P, S Corr.).
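The two formulas above can be expressed as a small helper, shown here for clarity; the metric dictionary keys are hypothetical names, and scores are assumed to be on the same 0–1 scale as in the tables.

```python
def overall_glue_score(m: dict, small_variant: bool = False) -> float:
    """Overall GLUE score per the equations above (WNLI is always excluded)."""
    mnli = (m["mnli_match_acc"] + m["mnli_mismatch_acc"]) / 2  # inner AVERAGE over MNLI
    parts = [m["sst2_acc"], m["mrpc_acc"], m["qqp_acc"], mnli,
             m["qnli_acc"], m["rte_acc"]]
    if not small_variant:  # base models additionally include CoLA and STSB
        parts += [m["cola_matthews_corr"], m["stsb_pearson_corr"]]
    return sum(parts) / len(parts)
```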
5 Results

We present results of experiments assessing various aspects of the model, with a view towards deployment and production use.

5.1 Model Size vs. Quality

Our first experiment directly assesses our dynamic embeddings by comparing the EELBERT models to their corresponding standard BERT baselines on GLUE benchmark tasks. We start by pre-training the models as described in Section 4.1 and fine-tune the models on downstream GLUE tasks, as described in Section 4.2.

Table 1 summarizes the results of this experiment. Note that replacing the trainable embedding layer with dynamic embeddings does have a relatively small impact on the GLUE score. EELBERT-base achieves ~21% reduction in parameter count while regressing by just 1.5% on the GLUE score.

Table 1: GLUE benchmark for BERT vs. EELBERT

|                      | BERT-base    | EELBERT-base |
|----------------------|--------------|--------------|
| Trainable Parameters | 109,514,298  | 86,073,402   |
| Exported Model Size  | 438 MB       | 344 MB       |
| SST-2 (Acc.)         | 0.899        | 0.900        |
| QNLI (Acc.)          | 0.866        | 0.864        |
| RTE (Acc.)           | 0.625        | 0.563        |
| WNLI* (Acc.)         | 0.521        | 0.563        |
| MRPC (Acc., F1)      | 0.833, 0.882 | 0.838, 0.887 |
| QQP* (Acc., F1)      | 0.898, 0.864 | 0.895, 0.861 |
| MNLI (M, MM Acc.)    | 0.799, 0.802 | 0.790, 0.795 |
| STSB (P, S Corr.)    | 0.870, 0.867 | 0.851, 0.849 |
| CoLA (M Corr.)       | 0.410        | 0.373        |
| GLUE Score           | 0.775        | 0.760        |

As a followup to this, we investigate the impact of dynamic embeddings on significantly smaller models. Table 2 shows the results for BERT-mini and BERT-tiny, which have 11 million and 4.4 million trainable parameters, respectively. The corresponding EELBERT-mini and EELBERT-tiny models have 3.4 million and 0.5 million trainable parameters, respectively. EELBERT-mini has just 0.7% absolute regression compared to BERT-mini, while being ~3x smaller. Similarly, EELBERT-tiny is almost on par with BERT-tiny, with 0.5% absolute regression, while being ~9x smaller.

Table 2: GLUE benchmark for the smaller BERT and EELBERT variants

|                      | BERT-mini    | EELBERT-mini | BERT-tiny    | EELBERT-tiny | UNO-EELBERT  |
|----------------------|--------------|--------------|--------------|--------------|--------------|
| Trainable Parameters | 11,171,074   | 3,357,442    | 4,386,178    | 479,362      | 312,506      |
| Exported Model Size  | 44.8 MB      | 13.4 MB      | 17.7 MB      | 2.04 MB      | 1.24 MB      |
| SST-2 (Acc.)         | 0.851        | 0.835        | 0.821        | 0.749        | 0.701        |
| QNLI (Acc.)          | 0.827        | 0.821        | 0.616        | 0.705        | 0.609        |
| RTE (Acc.)           | 0.552        | 0.560        | 0.545        | 0.516        | 0.527        |
| WNLI* (Acc.)         | 0.563        | 0.549        | 0.521        | 0.535        | 0.479        |
| MRPC (Acc., F1)      | 0.701, 0.814 | 0.721, 0.814 | 0.684, 0.812 | 0.684, 0.812 | 0.684, 0.812 |
| QQP* (Acc., F1)      | 0.864, 0.815 | 0.850, 0.803 | 0.780, 0.661 | 0.752, 0.712 | 0.728, 0.628 |
| MNLI (M, MM Acc.)    | 0.719, 0.730 | 0.688, 0.697 | 0.577, 0.581 | 0.582, 0.598 | 0.539, 0.552 |
| CoLA (M Corr.)       | 0.103        | 0            | 0            | 0            | 0            |
| GLUE score           | 0.753        | 0.746        | 0.671        | 0.666        | 0.632        |

Additionally, when we compare the EELBERT-mini and BERT-tiny models, which have roughly the same number of trainable parameters, we notice that EELBERT-mini has a substantially higher GLUE score than BERT-tiny. This leads us to conclude that under space-limited conditions, it would be better to train a model with dynamic embeddings and a larger number of hidden layers rather than a shallower model with a trainable embedding layer and fewer hidden layers.
5.2 Pushing the Limits: UNO-EELBERT

The results discussed in the previous section suggest that our dynamic embeddings have the most utility for extremely small models, where they perform comparably to standard BERT while providing drastic compression. Following this line of thought, we try to push the boundaries of model compression. We train UNO-EELBERT, a model with a similar configuration as EELBERT-tiny, but a reduced intermediate size of 128. We note that this model is almost 15 times smaller than BERT-tiny, with an absolute GLUE score regression of less than 4% (Table 2).
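In Hugging Face terms, the encoder shape described here corresponds roughly to the configuration below; the attention-head count is an assumption (the text specifies only the 2-layer, hidden-size-128 tiny shape and the reduced intermediate size), and the dynamic embedding computation itself is not captured by BertConfig.

```python
from transformers import BertConfig

# UNO-EELBERT encoder shape: BERT-tiny-like, with intermediate_size reduced to 128.
uno_config = BertConfig(hidden_size=128, num_hidden_layers=2,
                        num_attention_heads=2,   # assumption, matching BERT-tiny
                        intermediate_size=128)
```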
|                      | BERT-base, n-gram pooling | BERT-base, random | BERT-mini, n-gram pooling | BERT-mini, random |
|----------------------|---------------------------|-------------------|---------------------------|-------------------|
| Trainable Parameters | 86,073,402                | 86,073,402        | 3,387,962                 | 3,387,962         |
| Exported Model Size  | 344 MB                    | 344 MB            | 13.4 MB                   | 13.4 MB           |
| SST-2 (Acc.)         | 0.900                     | 0.897             | 0.835                     | 0.823             |
| QNLI (Acc.)          | 0.864                     | 0.862             | 0.821                     | 0.639             |
| RTE (Acc.)           | 0.563                     | 0.574             | 0.560                     | 0.569             |
| WNLI* (Acc.)         | 0.563                     | 0.507             | 0.549                     | 0.507             |
| MRPC (Acc., F1)      | 0.838, 0.887              | 0.806, 0.868      | 0.721, 0.814              | 0.690, 0.805      |
| QQP* (Acc., F1)      | 0.895, 0.861              | 0.893, 0.858      | 0.850, 0.803              | 0.800, 0.759      |
| MNLI (M, MM Acc.)    | 0.791, 0.795              | 0.786, 0.794      | 0.688, 0.697              | 0.647, 0.660      |
| STSB (P, S Corr.)    | 0.851, 0.849              | 0.849, 0.847      | -, -                      | -, -              |
| CoLA (M Corr.)       | 0.373                     | 0.389             | 0                         | 0                 |
| GLUE score           | 0.760                     | 0.757             | 0.746                     | 0.696             |
randomly-initialized model. We also perform this comparison for BERT-mini (not shown in the table), and observe a similar result. In fact, for BERT-mini, the hash-initialized model had an absolute increase of 1.6% in overall GLUE score, suggesting that the advantage of n-gram pooling hash-initialization may be even greater for smaller models.

5.5 Memory vs. Latency Trade-off

One consequence of using dynamic embeddings is that we are essentially trading off computation time for memory. The embedding lookup time for a token is $O(1)$ in BERT models. In EELBERT, a token embedding depends on the number of character n-grams in the token, as well as the size of the hash seed partitions. Due to the outer product between the n-gram signatures and the partitioned hash seeds, the overall time complexity is dominated by $l \times d$, where $l$ is the length of a token and $d$ is the embedding size, leading to $O(l \times d)$ time complexity to compute the dynamic hash embedding for a token. For English, the average number of letters in a word follows a roughly Poisson-shaped distribution with a mean of ~4.79 (Norvig, 2012), and the embedding size $d$ for BERT models typically ranges between 128 and 768.
The inference time for BERT-base vs. EELBERT-base is practically unchanged, as the bulk of the computation time goes into the encoder blocks for big models with multiple encoder blocks. However, our experiments in Table 5 indicate that EELBERT-tiny has ~2.3x the inference time of BERT-tiny, as the computation time in the encoder blocks decreases for smaller models, and embedding computation starts constituting a sizeable portion of the overall latency. These latency measurements were done on a standard M1 MacBook Pro with 32GB RAM. We performed inference on a set of 10 sentences (with an average word length of 4.8) for each of the models, reporting the average latency of obtaining the embeddings for a sentence (tokenization latency is the same for all the models, and is excluded from the measurements).
To improve the inference latency, we suggest some architectural and engineering optimizations. The outer product between the $O(l)$-dimensional n-gram hash values and the $O(d)$-dimensional hash seeds, resulting in a matrix of size $O(l \times d)$, is the computational bottleneck in the dynamic embedding computation. A sparse mask with a fixed number of 1's in every row could reduce the complexity of this step to $O(l \times s)$, where $s$ is the number of ones in each row and $s \ll d$. This means every n-gram will only attend to some of the hash seeds. This mask can be learned during training and saved with the model parameters without much memory overhead, as it would be of size $O(k \times s)$, $k$ being the maximum number of n-grams expected from a token. Future work could explore the effect of this approach on model quality. The hash embeddings of tokens could also be computed in parallel, since they are independent of each other. Additionally, we observe that the 1-, 2- and 3-grams follow a Zipf-ian distribution. By using a small cache of the embeddings for the most common n-grams, we could speed up the computation at the cost of a small increase in memory footprint, as sketched below.
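A minimal sketch of such a cache, reusing `rolling_hash` and `B` from the earlier sketch and assuming the per-order seed partitions are available as a dict `H_PARTITIONS[i]` (a hypothetical name); the cache size is an arbitrary illustrative choice.

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=8192)                     # small bounded cache of frequent n-grams
def ngram_row(ngram: str, i: int) -> tuple:
    """Bounded projection row for one (n-gram, order) pair, memoized."""
    row = (rolling_hash(ngram) * H_PARTITIONS[i]) % B   # one 1 x d_i row of P_i
    row = row - (row > B / 2) * B
    row = row / (B / 2)
    return tuple(row)                        # tuples keep cached values immutable

def embedding_for_order(token: str, i: int) -> np.ndarray:
    """e_i for one token, averaging the (possibly cached) i-gram rows."""
    k_i = max(len(token) - i + 1, 1)
    return np.mean([ngram_row(token[j:j + i], i) for j in range(k_i)], axis=0)
```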
6 Conclusions

In this work we explored the application of dynamic embeddings to the BERT model architecture, as an alternative to the standard, trainable input embedding layer. Our experiments show that replacing the input embedding layer with dynamically computed embeddings is an effective method of model compression, with minimal regression on downstream tasks. Dynamic embeddings appear to be particularly effective for the smaller BERT variants, where the input embedding layer comprises a larger percentage of trainable parameters.

We also find that for smaller BERT models, a deeper model with dynamic embeddings yields better results than a shallower model of comparable size with a trainable embedding layer. Since the dynamic embeddings technique used in EELBERT is complementary to existing model compression techniques, we can apply it in combination with other compression methods to produce extremely tiny models. Notably, our smallest model, UNO-EELBERT, is just 1.2 MB in size, but achieves a GLUE score within 4% of that of a standard fully trained model almost 15 times its size.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear Algebraic Structure of Word Senses, with Applications to Polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. Advances in Neural Information Processing Systems, 33:15834–15846.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, and Marianne Winslett. 2021. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Transactions of the Association for Computational Linguistics, 9:1061–1080.

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText Corpus. http://Skylion007.github.io/OpenWebTextCorpus.

Mitchell Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 143–155.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv, abs/2001.08361.

Yeachan Kim, Kang-Min Kim, and SangKeun Lee. 2020. Adaptive Compression of Word Embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3950–3959, Online. Association for Computational Linguistics.

Maximilian Lam. 2018. Word2Bits - Quantized Word Vectors. arXiv, abs/1803.05651.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the Eighth International Conference on Learning Representations.

Rémi Lebret and Ronan Collobert. 2014. Word Embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490, Gothenburg, Sweden. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv, abs/1907.11692.

Peter Norvig. 2012. English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU. http://norvig.com/mayzner.html. [Online; accessed 23-October-2023].

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Sujith Ravi. 2017. ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections. arXiv, abs/1708.00630.

Sujith Ravi and Zornitsa Kozareva. 2018a. Self-Governing Neural Networks for On-Device Short Text Classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 887–893.

Sujith Ravi and Zornitsa Kozareva. 2018b. Self-Governing Neural Networks for On-Device Short Text Classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 804–810, Brussels, Belgium. Association for Computational Linguistics.

Sujith Ravi and Zornitsa Kozareva. 2021. SoDA: On-device Conversational Slot Extraction. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 56–65, Singapore and Online. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv, abs/1910.01108.

Chinnadhurai Sankar, Sujith Ravi, and Zornitsa Kozareva. 2021. ProFormer: Towards On-Device LSH Projection Based Transformers. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2823–2828.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient Knowledge Distillation for BERT Model Compression. arXiv, abs/1908.09355.

Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. 2017. Hash Embeddings for Efficient Word Representations. Advances in Neural Information Processing Systems, 30:4935–4943.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv, abs/1908.08962.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Wikipedia contributors. 2023. Rolling hash — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Rolling_hash&oldid=1168768744. [Online; accessed 23-October-2023].