Adapting Language Models Via Token Translation

Zhili Feng    Tanya Marwah    Lester Mackey
Abstract
Modern large language models use a fixed tokenizer to effectively compress text
drawn from a source domain. However, applying the same tokenizer to a new target
domain often leads to inferior compression, more costly inference, and reduced
semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn
Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain
and learns to translate between target and source tokens, enabling more effective
reuse of the pre-trained next-source-token predictor. In our experiments with
finetuned English language models, S2T2 improves both the perplexity and the
compression of out-of-domain protein sequences, outperforming direct finetuning
with either the source or target tokenizer. In addition, we find that token translations
learned for smaller, less expensive models can be directly transferred to larger,
more powerful models to reap the benefits of S2T2 at lower cost.
1 Introduction
Modern large language models (LLMs) are typically trained in two stages. First, a tokenizer is trained
to map commonly occurring character sequences in the training data into vocabulary units known as
tokens. Next, all training text is tokenized, i.e., translated into this token vocabulary, and a model is
trained to predict the next token given a context of preceding tokens. The tokenizer can be viewed
as an initial compressor of input bytes [Gage, 1994] that significantly shortens text drawn from the
training domain and arguably improves the training dynamics [Rajaraman et al., 2024]. Despite
its widespread adoption, this two-stage procedure suffers from a key failing: When faced with text
from a new target domain, compression quality drops, context length and inference costs increase,
and learned semantic alignment deteriorates. This effect is especially evident when modern LLMs
(trained predominantly on English and code) are used to reason about molecular sequences like
proteins. Such sequences are commonly represented using the Latin-script alphabet, but the meaning
and frequency of each substring differ significantly from their natural language counterparts, resulting in
semantic misalignment.
To tackle the analogous alignment problem for low-resource languages, Remy et al. [2024] proposed
to use fast_align [Dyer et al., 2013], an expectation-maximization algorithm that requires parallel
data from the training and target domains.
This approach shows promising results, but for many target domains, parallel training data is difficult
or impossible to gather. For example, there is no agreed-upon parallel translation between protein
sequences and natural language.
In this work, we propose a Sparse Sinkhorn Token Translation (S2T2) algorithm that does not require
parallel data. Instead, S2T2 learns a translation between training domain tokens and new target
domain tokens using only sample data from the target domain and the pretrained LLM weights. After
training a tokenizer on the target domain, S2T2 translates each target-domain token into a (sparse)
distribution over training-domain tokens, uses the pretrained LLM to predict the next training-domain token, and translates that training-domain token back into a (sparse) distribution over target-domain tokens. In our experiments with English LLMs, we find that
1. S2T2 provides an effective initialization for continual finetuning on protein sequences, yielding both better compression and better perplexity than direct finetuning of the pretrained model, and
2. S2T2 enables weak-to-strong model transferability: translations learned for smaller, less expensive models can be transferred to larger, more powerful models to reap the benefits at lower cost.

Figure 1: Overview of S2T2. Left: S2T2 injects a weight-tied sparse optimal transport (OT) layer into both the token embedding and the language model head: input tokens are encoded as a sparse convex combination of the original token embeddings and decoded by a sparse combination of the original language model head. Right: The sparse OT matrix is obtained by iteratively projecting a dense cost matrix along its rows and columns (row and column SparseMax); the dense cost matrix is updated by backpropagation.
2 Sparse Sinkhorn Token Translation

Consider a pretrained LLM M with token vocabulary size v, embedding matrix E ∈ R^{v×d}, and language model head L ∈ R^{v×d} that maps a token sequence x_{1:s} to a next-token distribution

    M(x_{1:s}) = softmax(L h(E_{x_1}, . . . , E_{x_s})),                                  (1)

where h : R^{s×d} → R^d maps an embedding sequence into a single vector, the internal representation of the next token.
Consider also a dataset D drawn from a new target domain, and let u be the vocabulary size of a new
tokenizer trained on D. For given marginal distributions over training and target tokens µ ∈ ∆^{v−1} and ν ∈ ∆^{u−1}, we define the constraint set C(µ, ν) = {P ∈ [0, 1]^{v×u} : P1 = µ, P^⊤1 = ν}.
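As a concrete illustration, membership in C(µ, ν) just means that the row sums of P match µ and the column sums match ν; a toy check in PyTorch (sizes and uniform marginals are our own choices):

```python
import torch

v, u = 6, 4                              # toy source / target vocabulary sizes
mu = torch.full((v,), 1.0 / v)           # source-token marginal in the simplex
nu = torch.full((u,), 1.0 / u)           # target-token marginal in the simplex

P = torch.outer(mu, nu)                  # independent coupling: always lies in C(mu, nu)
assert torch.allclose(P.sum(dim=1), mu)  # P1 = mu    (row sums)
assert torch.allclose(P.sum(dim=0), nu)  # P^T 1 = nu (column sums)
```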
S2T2 finds a joint probability matrix P ∈ C(µ, ν) and defines a new target-domain LLM M′ with embedding matrix E′ = (P^⊤ ⊙ (1/µ)) E ∈ R^{u×d} and language model head L′ = (P ⊙ (1/ν))^⊤ L ∈ R^{u×d} substituted for (E, L) in (1). Here, A ⊙ v denotes the Hadamard product of a matrix A with a vector v broadcast along the last dimension. This rescaling is crucial, since we want the new and old token embeddings to be on the same scale. More generally, one could use different P matrices to translate E and L, but we focus on a single P here for simplicity. An overview of S2T2 can be found in Fig. 1.
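To make the construction concrete, here is a minimal PyTorch sketch of the translated embedding matrix and head under the broadcast convention above; the function and variable names are ours, not part of a reference implementation.

```python
import torch

def translate_embeddings(P, E, L, mu, nu):
    """Sketch: build the target-domain embedding matrix and head from a plan P.

    P:  (v, u) joint probability matrix with P.sum(1) = mu and P.sum(0) = nu
    E:  (v, d) source-domain token embeddings (frozen)
    L:  (v, d) source-domain language model head (frozen)
    mu: (v,) source-token marginal;  nu: (u,) target-token marginal
    """
    # E' = (P^T ⊙ (1/mu)) E: 1/mu is broadcast along the last dimension of P^T, so
    # each target token's embedding is a sparse mixture of the original embeddings.
    E_new = (P.T * (1.0 / mu)) @ E       # (u, v) @ (v, d) -> (u, d)
    # L' = (P ⊙ (1/nu))^T L: 1/nu is broadcast along the last dimension of P, giving
    # each target token a sparse mixture of the original head rows.
    L_new = (P * (1.0 / nu)).T @ L       # (u, v) @ (v, d) -> (u, d)
    return E_new, L_new
```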
Since it is difficult to directly parameterize a joint probability matrix P ∈ C(µ, ν), we instead
maintain a dense weight matrix C ∈ Rv×u and recover P as the solution to the following two
equivalent optimization problems.
    min_{P′} (1/2) ‖P′ − C‖²_F    s.t. P′ ∈ C(µ, ν)                                       (2)

    min_{P′} ⟨−C, P′⟩ + (1/2) ‖P′‖²_F    s.t. P′ ∈ C(µ, ν)                                (3)
Notice that (3) is an ℓ2-regularized optimal transport problem, which is known to generate sparse solutions [Essid and Solomon, 2018, Peyré et al., 2019]. Moreover, since C(µ, ν) = C1 ∩ C2 for the convex sets C1 = {P ∈ R^{v×u}_+ : P1 = µ} and C2 = {P ∈ R^{v×u}_+ : P^⊤1 = ν}, these problems can be solved using Dykstra's iterative projections [Boyle and Dykstra, 1986], a Sinkhorn-like algorithm with guaranteed convergence (see Algorithm 1).
In every Sinkhorn iteration, we solve a set of ℓ2 projections onto a probability simplex. This
optimization problem admits an efficient backpropagation computation [Martins and Astudillo, 2016].
A small caveat is that we are not always projecting onto the unit simplex but rather onto a scaled
simplex, so the optimization is modified accordingly in Algorithm 2.
Algorithm 2 SparseMax
Require: z ∈ R^K, scale α
1: Sort z as z_(1) ≥ · · · ≥ z_(K)
2: Find k(z) = max{ k ∈ [K] : α + k z_(k) > Σ_{j≤k} z_(j) }
3: Let τ(z) = (Σ_{j≤k(z)} z_(j) − α) / k(z)
4: return p where p_i = max{z_i − τ(z), 0}
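For concreteness, a minimal PyTorch sketch of the scaled-simplex projection in Algorithm 2, together with a simplified alternating row/column projection loop (illustrative names; unlike Algorithm 1, this simplified loop drops Dykstra's correction terms):

```python
import torch

def scaled_sparsemax(z, alpha):
    """Euclidean projection of z onto the simplex rescaled to sum to alpha (Algorithm 2)."""
    z_sorted, _ = torch.sort(z, descending=True)
    cumsum = torch.cumsum(z_sorted, dim=0)
    ks = torch.arange(1, z.numel() + 1, dtype=z.dtype, device=z.device)
    support = (alpha + ks * z_sorted) > cumsum        # candidate support of the projection
    k = support.sum()                                 # k(z): size of the support
    tau = (cumsum[k - 1] - alpha) / k                 # threshold tau(z)
    return torch.clamp(z - tau, min=0.0)

def project_to_couplings(C, mu, nu, n_iters=3):
    """Alternate row/column scaled-sparsemax projections so that P1 ≈ mu and P^T 1 ≈ nu."""
    P = C.clone()
    v, u = P.shape
    for _ in range(n_iters):
        # project each row onto the simplex scaled by mu[j]
        P = torch.stack([scaled_sparsemax(P[j], mu[j]) for j in range(v)])
        # project each column onto the simplex scaled by nu[i]
        P = torch.stack([scaled_sparsemax(P[:, i], nu[i]) for i in range(u)], dim=1)
    return P
```

Each step is built from sorts, cumulative sums, and clamps, so the whole projection remains differentiable almost everywhere and can sit inside a standard autograd graph.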
To learn our token translation, we initialize the weight matrix C by setting each entry to be 1/v,
obtain the joint probability matrix P by applying Algorithm 1 to C, and perform a normal forward
pass using P. During the backward pass, we differentiate through the Sinkhorn iteration and update
C directly. In practice, we find that iterating 3 times is enough to generate an effective sparse P.
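Putting the pieces together, a toy end-to-end sketch of this recipe, reusing `project_to_couplings` and `translate_embeddings` from the sketches above (the cumulative-sum stand-in for the transformer, the random data, and all sizes are our own placeholders rather than the actual OLMo setup):

```python
import torch
import torch.nn.functional as F

v, u, d, seq = 64, 16, 32, 8                          # toy sizes, not the real vocabularies
mu, nu = torch.full((v,), 1.0 / v), torch.full((u,), 1.0 / u)
E, L = torch.randn(v, d), torch.randn(v, d)           # frozen source embeddings and head
C = torch.nn.Parameter(torch.full((v, u), 1.0 / v))   # dense weight matrix, every entry 1/v
opt = torch.optim.AdamW([C], lr=1e-3, weight_decay=0.0)

tokens = torch.randint(u, (seq + 1,))                 # stand-in target-domain sequence
for _ in range(10):
    P = project_to_couplings(C, mu, nu, n_iters=3)    # three projection iterations
    E_new, L_new = translate_embeddings(P, E, L, mu, nu)
    h = E_new[tokens[:-1]].cumsum(dim=0)              # stand-in for the frozen transformer
    logits = h @ L_new.T                              # next-token logits over the target vocab
    loss = F.cross_entropy(logits, tokens[1:])
    opt.zero_grad()
    loss.backward()                                   # gradients reach C through the projections
    opt.step()
```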
3 Experiment
We conduct experiments on the UniRef50 [Suzek et al., 2015] protein sequence dataset using the
OLMo-1B English LLM [Groeneveld et al., 2024] with batch size 16 and context length of 512.
The training domain tokens in our experiment are bytes (single characters), and the target domain
tokenizer is a new Byte-Pair Encoding (BPE) tokenizer [Gage, 1994] trained on UniRef50 with
vocabulary size 512. The new tokenizer reduces the length of our protein sequences by a factor of 1.82× on average. This in turn has a sizable impact on the standard measure of model compression, bits-per-byte (BpB) [see Biderman et al., 2024, for details on calculating BpB]. To control the sparsity
level of P, we add an entropy regularizer αH(P) to the next token prediction loss with larger α
encouraging smaller entropy and hence sparser P. Unless otherwise specified, α = 0.
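For reference, a small sketch of the two bookkeeping quantities used in this section, with our own helper names: the standard conversion from per-token loss to bits-per-byte, and the entropy-regularized objective.

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    """Convert an average next-token loss (in nats) into bits-per-byte for a text span."""
    return (mean_nll_nats / math.log(2)) * (n_tokens / n_bytes)

def regularized_loss(ce_loss, P, alpha=0.0, eps=1e-12):
    """Next-token loss plus alpha * H(P); larger alpha penalizes entropy, encouraging sparser P."""
    entropy = -(P * (P + eps).log()).sum()
    return ce_loss + alpha * entropy

# A 1.82x shorter tokenization means each byte is charged ~0.55 tokens' worth of loss:
print(bits_per_byte(mean_nll_nats=3.0, n_tokens=100, n_bytes=182))  # ~2.38 bits per byte
```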
We compare with four baseline methods:
1. Training an unconstrained translator P followed by whole-model finetuning.
2. Training a dense probabilistic translator P (using SoftMax in place of SparseMax) followed by whole-model finetuning.
3. Finetuning the model directly using the original OLMo tokenizer.
4. Finetuning the model with the new tokenizer, resizing the embedding matrix E and language model head L by truncation.
Figure 2: Evaluation loss after initializing OLMo-7B with the token translator P learned from OLMo-1B. Along the x-axis, S2T2-α denotes S2T2 trained with the α-entropy regularizer that controls the sparsity of P. New Tok. is OLMo-7B with the new tokenizer and truncated E, L; Orig. Tok. is OLMo-7B with the original tokenizer. The red dashed line is the loss of randomly guessing the next token.
Training details. We always train with AdamW [Loshchilov and Hutter, 2019]. When training P, we use a learning rate of 10^{-3} (except for our model transfer experiments, which use 2 × 10^{-5}) and no weight decay; when finetuning the whole model, we always use a learning rate of 2 × 10^{-5} with 0.01 weight decay. We follow the convention of training with BFloat16, β1 = 0.9, β2 = 0.95, and ε = 10^{-5}. We always use the cosine annealing scheduler with 20% linear warm-up steps and decay to 10% of the learning rate. We train P and finetune the whole model for 2000 steps.
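One possible PyTorch realization of this optimizer and learning-rate schedule (an illustrative sketch with a placeholder parameter list, not the exact training code):

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

params = [torch.nn.Parameter(torch.zeros(1))]    # placeholder for C or the full model
total_steps, warmup_steps = 2000, 400            # 2000 steps with a 20% linear warm-up
opt = torch.optim.AdamW(params, lr=1e-3,         # 2e-5 when finetuning the whole model
                        weight_decay=0.0,        # 0.01 when finetuning the whole model
                        betas=(0.9, 0.95), eps=1e-5)

def lr_lambda(step):
    if step < warmup_steps:                      # linear warm-up
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 10% of peak

scheduler = LambdaLR(opt, lr_lambda)             # call scheduler.step() after each opt.step()
```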
Remarkably, Table 1 shows that simply initializing with S2T2 produces better language model
quality (as measured by perplexity) and compression (as measured by BpB) than whole-model
finetuning with the original tokenizer (baseline 3). Note that baseline 3 has much worse BpB due to
its longer sequence length, further motivating the usage of a tailored tokenizer. In addition, S2T2
initialization outperforms both dense Sinkhorn and unconstrained token translation in both metrics.
Moreover, after finetuning, S2T2 also improves upon the perplexity and BpB of baseline 4, direct
finetuning with the new tokenizer. Fig. 2 shows that the translator P learned using OLMo-1B can also be transferred directly to the larger, more expensive OLMo-7B, yielding significantly better performance than random guessing, than OLMo-7B with its original tokenizer, and than OLMo-7B with the new tokenizer and truncated embedding matrix and language model head.
Table 1: Performance on the UniRef50 evaluation set, measured by perplexity (Perp.) and bits-per-byte (BpB). Plain P: unconstrained P. CFT: continual whole-model finetuning, initialized from the learned P (the preceding column reports the translator before CFT). FT orig. tok.: finetuning with the original tokenizer. FT new tok.: finetuning with the new tokenizer.

           Plain P  + CFT   Sinkhorn P  + CFT   S2T2    + CFT   FT orig. tok.  FT new tok.
  Perp.    174.20   130.44  167.74      136.12  144.03  118.78  151.05         130.56
  BpB        4.09     3.86    4.06        3.89    3.94    3.78    7.24           3.86
4 Conclusion
We proposed S2T2 as a token translation technique for continual finetuning of LLMs on out-of-distribution data and demonstrated its effectiveness on protein sequence modeling. As a next step,
we plan to expand this framework to adapt to other modalities such as code and images. Another
natural extension is to combine the training and target token vocabularies to produce an effective
“multidomain” LLM.
5 Acknowledgement
This work was done during Zhili Feng’s and Tanya Marwah’s internship at Microsoft Research New
England.
References
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi,
Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. Lessons from
the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782,
2024.
James P Boyle and Richard L Dykstra. A method for finding projections onto the intersection of
convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference: Proceedings
of the Symposium on Order Restricted Statistical Inference held in Iowa City, Iowa, September
11–13, 1985, pages 28–47. Springer, 1986.
Chris Dyer, Victor Chahuneau, and Noah A Smith. A simple, fast, and effective reparameterization
of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, 2013.
Montacer Essid and Justin Solomon. Quadratically regularized optimal transport on graphs. SIAM
Journal on Scientific Computing, 40(4):A1961–A1986, 2018.
Philip Gage. A new algorithm for data compression. The C Users Journal, 12(2):23–38, 1994.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord,
Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson,
Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack
Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik,
Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk,
Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep
Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Sol-
daini, Noah A. Smith, and Hannaneh Hajishirzi. OLMo: Accelerating the science of language
models. Preprint, 2024.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International
Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=
Bkg6RiCqY7.
Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and
multi-label classification. In International Conference on Machine Learning, pages 1614–1623.
PMLR, 2016.
Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data
science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
Nived Rajaraman, Jiantao Jiao, and Kannan Ramchandran. Toward a theory of tokenization in LLMs.
arXiv preprint arXiv:2404.08335, 2024.
François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux,
and Thomas Demeester. Trans-tokenization and cross-lingual vocabulary transfers: Language
adaptation of LLMs for low-resource NLP. In First Conference on Language Modeling, 2024.
URL https://openreview.net/forum?id=sBxvoDhvao.
Baris E Suzek, Yuqi Wang, Hongzhan Huang, Peter B McGarvey, Cathy H Wu, and UniProt
Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence
similarity searches. Bioinformatics, 31(6):926–932, 2015.