LANGUAGE MODEL BEATS DIFFUSION - TOKENIZER IS KEY TO VISUAL GENERATION
ABSTRACT
While Large Language Models (LLMs) are the dominant models for generative
tasks in language, they do not perform as well as diffusion models on image and
video generation. To effectively use LLMs for visual generation, one crucial com-
ponent is the visual tokenizer that maps pixel-space inputs to discrete tokens ap-
propriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video
tokenizer designed to generate concise and expressive tokens for both videos and
images using a common token vocabulary. Equipped with this new tokenizer, we
show that LLMs outperform diffusion models on standard image and video gener-
ation benchmarks including ImageNet and Kinetics. In addition, we demonstrate
that our tokenizer surpasses the previously top-performing video tokenizer on two
more tasks: (1) video compression comparable to the next-generation video codec
(VVC) according to human evaluations, and (2) learning effective representations
for action recognition tasks.
1 INTRODUCTION
Large transformer-based language models, commonly referred to as LMs or LLMs, are the de facto
models for natural language generation (OpenAI, 2023; Google, 2023). Over time, LMs have ex-
panded their capabilities to generate content in various modalities, asserting their dominance in other
domains like audio (Agostinelli et al., 2023), speech (Rubenstein et al., 2023), code generation (Li
et al., 2023), medical applications (Singhal et al., 2023) and robotics (Zitkovich et al., 2023).
LMs are capable of generating images and videos. To do so, the image pixels are mapped into a
sequence of discrete tokens by a visual tokenizer (cf. Section 2). These tokens are then fed into
the LM transformer, as if they were lexical words, for generative modeling. Despite notable ad-
vancements in employing LMs for visual generation (Esser et al., 2021; Chang et al., 2022), LMs
still do not perform as well as diffusion models (Rombach et al., 2022). For instance, when evalu-
ating on the ImageNet dataset, a gold standard benchmark for image generation, the best language
model (Lee et al., 2022) underperforms the diffusion model (Gao et al., 2023) by a substantial 48%
margin (FID 3.41 vs. 1.79 when generating images at the 256×256 resolution).
Why do language models lag behind diffusion models in visual generation? This paper suggests that
a primary reason is the lack of a good visual representation, resembling our natural language system,
for effectively modeling the visual world. To substantiate this hypothesis, this paper shows that,
when utilizing a good visual tokenizer, the masked language model (Devlin et al., 2019; Chang et al.,
2022; Yu et al., 2023a) surpasses the state-of-the-art diffusion models in terms of both generation
fidelity and efficiency across image and video benchmarks, given the same training data, comparable
model size, and training budget. To the best of our knowledge, this provides the first evidence that
language models beat diffusion models on the hallmark ImageNet benchmark.
It is worth emphasizing that our intention is not to assert whether the language model is superior
to others, but to promote the exploration of visual tokenization methods for LLMs. A fundamental
difference of LLMs from other models, such as diffusion models, is that LLMs utilize a discrete
latent format: tokens obtained from a visual tokenizer. We show that the values of these discrete
visual tokens should not be overlooked considering their distinct advantages as follows. (1) Com-
patibility with LLMs. The main advantage of a token representation is that it shares the same form
* Work done during a research internship at Google Research.
as language tokens, making it straightforward to leverage the optimizations our community has de-
veloped over many years for LLMs. This includes faster training and inference speeds (Shazeer,
2019; Lester et al., 2021), advancements in model infrastructure (Dao et al., 2022; Du et al., 2022),
learning recipes for model scaling (Brown et al., 2020; Chowdhery et al., 2022), and GPU/TPU op-
timization, among other innovations. Unifying vision and language by the same token space could
set the stage for a true multimodal LLM that can understand, generate, and reason within our visual
environment. (2) Compressed representation. The discrete token may offer a fresh perspective
on video compression. The visual tokens can serve as a new video compression format to reduce
disk storage and bandwidth during internet transfers. Unlike compressed RGB pixels, these tokens
can be fed directly into generative models, bypassing the conventional decompression and latent
encoding steps. This allows for faster processing in generative video applications, especially bene-
ficial in edge computing cases. (3) Visual understanding benefits. Prior research has shown that
the discrete tokens are valuable as a pre-training target in self-supervised representation learning, as
discussed in BEiT (Bao et al., 2021) and BEVT (Wang et al., 2022). Additionally, research finds
that using tokens as the model inputs improves the robustness and generalization (Mao et al., 2021).
In this paper, we introduce MAGVIT-v2, a video tokenizer designed to map videos (and images) into
compact discrete tokens. Our model is built on the state-of-the-art video tokenizer, MAGVIT (Yu
et al., 2023a), within the VQ-VAE framework (Van Den Oord et al., 2017). We propose two new
techniques. First, a novel lookup-free quantization method enables the learning of a large vocabulary
that is able to improve generation quality of the language model. Second, through extensive empir-
ical analyses, we have identified modifications to the tokenizer that not only enhance generation
quality but also enable the tokenization of both images and videos using a shared vocabulary.
We empirically demonstrate that our model outperforms the previously top-performing video tok-
enizer, MAGVIT, in three key areas. First, our model significantly improves the generation quality
of MAGVIT, establishing the state of the art on the common image and video benchmarks. Second,
user studies indicate that its compression quality exceeds that of MAGVIT and the current video
compression standard, HEVC (Sullivan et al., 2012). Moreover, it is on par with the next-generation
video codec, VVC (Bross et al., 2021). Finally, we show that, compared to MAGVIT, our new
tokens are stronger for video understanding tasks across two setups and three datasets. The main
contributions of this work are:
• A new video tokenizer that outperforms the previously best-performing video tokenizer in three
areas: visual generation, video compression, and action recognition.
• A novel lookup-free quantization approach that enables improving the visual generation quality
of language models by learning a large vocabulary.
• To the best of our knowledge, the first evidence suggesting that a language model can outperform
diffusion models on ImageNet when provided with the same training data, an equivalent model
size, and a similar training budget.
• A video compressor with better quality than HEVC and VVC, at similar bit rates, according to
user studies. To our knowledge, this is the first successful attempt of a visual tokenizer designed
for video generation to achieve comparable results to standard codecs.
2 BACKGROUND
Language Model (LM) for visual generation. LMs have been extended to generate images and
videos. A visual tokenizer f is used to first map visual inputs into a sequence of discrete tokens.
A video $V \in \mathbb{R}^{T \times H \times W \times 3}$ (or an image when $T = 1$) is tokenized into a discrete representation
$X = f(V) \in \{1, 2, \cdots, K\}^{T' \times H' \times W'}$, where $K$ is the codebook (vocabulary) size of the visual
tokenizer. X is flattened into a 1D token sequence obtained using raster scan ordering and then fed
into an LM transformer for generative modeling.
Two types of LMs are commonly used for visual generation. The Autoregressive LM (AR-LM)
includes ImageGPT (Chen et al., 2020), DALL-E (Ramesh et al., 2021), Parti (Yu et al., 2022b), etc.
An AR-LM predicts the next token given the previous tokens along with additional conditioning
information $c$ using a categorical distribution $p_\theta(x_i \mid x_{<i}; c)$. During inference, AR-LMs use
the standard autoregressive decoding over the tokens. Finally, the tokens are converted back to pixels
by a decoder associated with the visual tokenizer.
The Masked LM (MLM) is another type of language model for visual generation, such as:
MaskGIT (Chang et al., 2022), MAGVIT (Yu et al., 2023a), Phenaki (Villegas et al., 2022), and
MUSE (Chang et al., 2023), among others. An MLM is trained using a masked token objective (De-
vlin et al., 2019), where some tokens in the sequence are randomly masked and need to be predicted
given the observed tokens. Let $m \in \{0, 1\}^n$ be a random binary sequence where $m^\top \mathbf{1} \in [0, n-1]$.
The MLM learns $p_\theta(x_i \mid \{x_j : m_j = 1, \forall j\}; c)$ for all $i$ where $m_i = 0$. To generate a video or
image during inference, the MLM uses the non-autoregressive decoding algorithms for images and
videos (Chang et al., 2022; Yu et al., 2023a). The decoding starts with a fully masked sequence,
which is iteratively filled by repeating two steps: (1) sample the whole sequence $\hat{x}^{(t)}$ from $p_\theta$ given
the non-masked tokens from the previous step, (2) re-mask the $\lfloor \lambda(t) \cdot n \rfloor$ tokens in $\hat{x}^{(t)}$ with the
lowest probability, following a decreasing masking ratio schedule $\lambda(t)$ according to timestep $t$.
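To make the decoding loop concrete, the sketch below implements the iterative sample-and-re-mask procedure described above. The model interface `logits_fn`, the cosine mask schedule, the sentinel `mask_id`, and the tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of MaskGIT-style non-autoregressive decoding (assumptions noted above).
import math
import torch

def nonautoregressive_decode(logits_fn, seq_len, vocab_size, num_steps=12, mask_id=-1):
    """logits_fn(tokens) -> [seq_len, vocab_size] logits; mask_id marks masked slots."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for t in range(num_steps):
        logits = logits_fn(tokens)                            # predict all positions at once
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-decoded tokens are kept: give them infinite confidence so they are never re-masked.
        conf = torch.where(tokens == mask_id, conf, torch.tensor(float("inf")))
        tokens = torch.where(tokens == mask_id, sampled, tokens)
        # Decreasing masking-ratio schedule lambda(t): re-mask the least confident tokens.
        ratio = math.cos(math.pi / 2 * (t + 1) / num_steps)
        num_mask = int(ratio * seq_len)
        if num_mask > 0:
            remask = torch.topk(conf, num_mask, largest=False).indices
            tokens[remask] = mask_id
    return tokens
```

At the final step the schedule reaches zero, so no positions are re-masked and the sequence is fully decoded.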
Denoising Diffusion Models (DDM). DDMs (Sohl-Dickstein et al., 2015; Song & Ermon, 2019)
are regarded as the state-of-the-art in visual generation due to their high-quality image (Dhariwal &
Nichol, 2021; Ho et al., 2022a) and video generation (Ho et al., 2022c). For instance, DDPM (Ho
et al., 2020) learns a denoising process parameterized as conditional Gaussian distributions over
image pixels. Recently, diffusion models and language models have displayed a significant overlap.
Recent DDMs diffuse over latents rather than raw pixels. These latents are obtained using models
similar to the visual tokenizer used by LMs. In fact, the very first latent in diffusion, proposed
by Rombach et al. (2022), is derived from a visual tokenizer. Additionally, the diffusion model’s
architecture has been shifting from the U-Net to the transformer architecture (Peebles & Xie, 2022).
Consequently, the boundaries between diffusion and language models in visual generation have
become less distinct. Yet, a fundamental difference between DDMs and LMs lies in the latent
format, i.e., continuous vs. discrete. We have discussed the benefits of having discrete tokens in
Section 1 and will show that the proposed tokenizer improves in these aspects.
Visual tokenization. Visual tokenization plays an essential role in mapping pixels into a discrete
representation suitable for generative modeling. VQ-VAE (Van Den Oord et al., 2017) is a corner-
stone work in image tokenization. A VQ-VAE model consists of a convolutional neural network
(CNN) encoder, a vector-quantization (VQ) bottleneck, and a CNN decoder. Given a video $V \in \mathbb{R}^{T \times H \times W \times 3}$, the VQ-VAE's encoder $E$ produces latent embeddings $Z = E(V) \in \mathbb{R}^{T' \times H' \times W' \times d}$.
Each embedding vector $z \in \mathbb{R}^d$ in $Z$ is then passed through the vector quantizer $q$, which assigns it
to the closest entry $c_i \in \mathbb{R}^d$ in the learned codebook embedding $C \in \mathbb{R}^{K \times d}$:
$$q(z) = c_i, \quad \text{where } i = \arg\min_{j \in \{1, 2, \cdots, K\}} \|z - c_j\|_2. \qquad (1)$$
To get discrete tokens, we drop the embedding dimension and represent $Z$ by its indices $X \in \{1, 2, \cdots, K\}^{T' \times H' \times W'}$. For decoding, embeddings of all image tokens are given as input to the
decoder $D$ to reconstruct the input $\hat{V} = D(Z)$. Following VQ-VAE, VQGAN (Esser et al., 2021)
introduces an adversarial loss and feature-level perceptual losses to enhance the image quality.
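The VQ bottleneck in Eq. (1) amounts to a nearest-neighbor assignment per latent vector. The sketch below shows this assignment with a straight-through gradient; the function name, shapes, and the inclusion of the straight-through trick are illustrative rather than a specific implementation from the paper.

```python
# A minimal sketch of the VQ bottleneck in Eq. (1): assign each d-dim latent to its nearest codebook entry.
import torch

def vector_quantize(z, codebook):
    """z: [..., d] latent embeddings; codebook: [K, d]. Returns (quantized, indices)."""
    flat = z.reshape(-1, z.shape[-1])                # [N, d]
    dist = torch.cdist(flat, codebook)               # [N, K] pairwise L2 distances
    indices = dist.argmin(dim=-1)                    # i = argmin_j ||z - c_j||_2
    quantized = codebook[indices].reshape(z.shape)   # q(z) = c_i
    # Straight-through estimator: copy decoder gradients back to the encoder output.
    quantized = z + (quantized - z).detach()
    return quantized, indices.reshape(z.shape[:-1])
```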
Video tokenization is more challenging and VQGAN has been adapted to meet this purpose (Ge
et al., 2022; Villegas et al., 2022; Yu et al., 2023a). The state of the art in video tokenization
is MAGVIT (Yu et al., 2023a), which introduces a better 3D architecture, an inflation technique
for initialization using image pre-training, and robust training losses. With MAGVIT, the LMs
achieve leading generation quality across multiple video benchmarks. However, MAGVIT struggles
to tokenize images and often results in noticeable flickering in longer videos.
3 METHOD
We introduce a new video tokenizer designed to map the spatial-temporal dynamics from a visual
scene into compact discrete tokens suitable for language models. Our approach builds upon the
state-of-the-art video tokenizer, MAGVIT, as detailed in Yu et al. (2023a). This section highlights
two new designs: a lookup-free quantizer and a collection of enhancements to the tokenizer model.
3.1 LOOKUP-FREE QUANTIZER
Although the community has made great progress in developing VQ-VAEs, the relationship be-
tween improvements in the reconstruction quality and subsequent generation quality is still not well
understood. A common misconception is that improving reconstruction equates to improving the
generation of the language model. For example, enlarging the vocabulary can improve reconstruc-
tion quality. However, such improvement only extends to generation when the vocabulary size is
small, and a very large vocabulary can actually hurt the performance of the language model.
As illustrated by the dashed curves in Fig. 1, the reconstruction FID, indicated by the right y-axis
(where a lower value is better), improves as the vocabulary size (the x-axis) increases. The orange
solid curve in Fig. 1 represents the LM’s generation quality (the left y-axis). The generation FID
initially improves but deteriorates for larger vocabulary. This may shed light on why the vocabulary
size of most language models for visual generation is around 1-8k (Esser et al., 2021; Villegas et al.,
2022), which is significantly smaller than the size of natural language vocabulary, i.e. over 200k.
[Figure 1: Reconstruction FID↓ (right y-axis) and generation FID↓ (left y-axis) as a function of vocabulary size, for VQ and LFQ tokenizers.]
A simple trick for training a larger codebook involves decreasing the code embedding dimension when increasing the vocabulary size (Yu et al., 2022a). This trick captures the intuition of limiting the representational capacity of individual tokens, which in turn facilitates learning the distribution over a large vocabulary.
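To make this idea concrete, below is a minimal sketch of lookup-free quantization in this spirit: the code embedding dimension is reduced to the point where every dimension is binary, so the codebook becomes implicit (the Cartesian product of per-dimension sign sets) and quantization requires no embedding lookup. The straight-through gradient and the omission of any training-time regularizer are simplifications for illustration, not the paper's exact formulation.

```python
# A hedged sketch of lookup-free quantization: per-dimension binary codes, vocabulary size 2**d.
import torch

def lookup_free_quantize(z):
    """z: [..., d] latents. Returns (quantized latents in {-1, +1}, integer token indices)."""
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))  # per-dimension sign quantization
    q = z + (q - z).detach()                                          # straight-through gradient
    bits = (q > 0).long()                                             # [..., d] in {0, 1}
    powers = 2 ** torch.arange(z.shape[-1])                           # weight of each binary dimension
    indices = (bits * powers).sum(dim=-1)                             # token id in [0, 2**d)
    return q, indices
```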
[Figure 2: Alternative tokenizer designs: (a) C-ViViT, with per-frame patch embedding, a spatial transformer, and a causal transformer before the quantizer; (b) a 3D CNN followed by a causal transformer; (c) a causal 3D CNN.]
The C-ViViT design (Fig. 2a) performs reasonably well but has two drawbacks. First, unlike CNNs, the positional embeddings make it difficult to tokenize spatial resolutions that were not seen during training. Second, we empirically found that 3D CNNs perform better than the spatial transformer and produce tokens with better spatial causality of the corresponding patch.
To tackle these drawbacks, we explore two plausible designs. Fig. 2b combines C-ViViT and
MAGVIT. Assuming a temporal compression ratio of 4, a 3D CNN processes blocks of 4 frames
followed by a causal transformer. In Fig. 2c, we use a temporally causal 3D convolution to replace
the regular 3D CNN. Specifically, the temporal padding scheme for a regular 3D convolution layer
with kernel size $(k_t, k_h, k_w)$ includes $\lfloor \frac{k_t - 1}{2} \rfloor$ frames before and $\lfloor \frac{k_t}{2} \rfloor$ frames after the input frames.
In contrast, a causal 3D convolution layer pads with $k_t - 1$ frames before the input and nothing after,
so that the output for each frame depends only on previous frames. Consequently, the first
frame is always independent of other frames, allowing the model to tokenize single images.
Temporal convolutional subsampling with stride $s$ is sufficient for $s\times$ down-sampling by mapping
$1 + s \cdot t$ frames into $1 + t$. After a regular $s\times$ up-sampling, we drop the first $s - 1$ resulting frames,
which maps $1 + t$ frames into $1 + s \cdot t$ and allows for the tokenization of a single image. Tab. 5a
empirically compares the designs in Fig. 2, and we find that the causal 3D CNN performs the best.
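The following is a minimal sketch of the temporally causal 3D convolution described above, written as a PyTorch-style layer: the input is padded with $k_t - 1$ frames before (and none after), and a temporal stride $s$ maps $1 + s \cdot t$ frames to $1 + t$ outputs. The layer name, kernel sizes, and channel counts are illustrative assumptions.

```python
# A sketch of causal temporal padding for a strided 3D convolution (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), temporal_stride=1):
        super().__init__()
        self.kt = kernel[0]
        pad_h, pad_w = kernel[1] // 2, kernel[2] // 2
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              stride=(temporal_stride, 1, 1),
                              padding=(0, pad_h, pad_w))      # temporal padding is applied manually

    def forward(self, x):                                     # x: [B, C, T, H, W]
        x = F.pad(x, (0, 0, 0, 0, self.kt - 1, 0))            # pad k_t - 1 frames before, none after
        return self.conv(x)

# With temporal stride s = 4, a clip of 1 + 4*t frames yields 1 + t output frames,
# so a single image (t = 0) is tokenized independently of any other frame.
x = torch.randn(1, 3, 17, 64, 64)                             # 17 = 1 + 4*4 frames
y = CausalConv3d(3, 8, kernel=(5, 3, 3), temporal_stride=4)(x)
print(y.shape)                                                # torch.Size([1, 8, 5, 64, 64])
```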
Architecture modifications. In addition to using causal 3D CNN layers, we made several other
architectural modifications to improve upon the MAGVIT model. First, we change the encoder
downsamplers from average pooling to strided convolutions to leverage learned kernels, and replace
the decoder upsamplers (nearest resizing followed by convolution) with a depth-to-space
operator. Second, we defer the temporal downsampling from the first few encoder blocks to the last
ones. In addition, the downsampling layer in the discriminator now utilizes 3D blur pooling (Zhang,
2019) to encourage shift invariance. Finally, we add one adaptive group normalization layer before
the residual blocks at each resolution in the decoder to pass in the quantized latents as the control
signal following StyleGAN (Karras et al., 2019). Tabs. 5b and 5c empirically verify these designs.
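As a brief illustration of the depth-to-space upsampler mentioned above, the sketch below trades channels for spatial resolution via pixel shuffling instead of nearest-neighbor resizing followed by a convolution. The channel counts and kernel size are illustrative assumptions, not the paper's exact configuration.

```python
# A sketch of a depth-to-space (pixel-shuffle) upsampling block.
import torch
import torch.nn as nn

class DepthToSpaceUpsample(nn.Module):
    def __init__(self, channels, factor=2):
        super().__init__()
        # Expand channels so PixelShuffle can fold factor**2 of them into spatial resolution.
        self.conv = nn.Conv2d(channels, channels * factor ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(factor)   # [B, C*r^2, H, W] -> [B, C, r*H, r*W]

    def forward(self, x):
        return self.shuffle(self.conv(x))

print(DepthToSpaceUpsample(64)(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 32, 32])
```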
Token factorization for efficient prediction. The output tokens can be fed into language models
to generate videos. To help smaller transformers predict over a large vocabulary, we can factorize
the LFQ token's latent space into equal subspaces. For instance, rather than predicting with a
codebook of size $2^{18}$, we can predict in two concatenated codebooks, each of size $2^9$. We embed
each subspace token separately and use the sum of their embeddings as the token embedding for
the transformer input. For the output layer with weight tying (Press & Wolf, 2017), we use the
embedding matrix of each subspace to obtain logits with separate prediction heads.
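A minimal sketch of this factorization is given below: a token id from a $2^{18}$ vocabulary is split into two $2^9$ subspace ids, each embedded separately, the embeddings summed for the transformer input, and logits produced by separate weight-tied heads. The model dimension, class name, and the bit-split convention are assumptions for illustration.

```python
# A sketch of token factorization with summed subspace embeddings and weight-tied heads.
import torch
import torch.nn as nn

class FactorizedTokenCodec(nn.Module):
    def __init__(self, d_model=768, sub_vocab=2 ** 9, num_subspaces=2):
        super().__init__()
        self.sub_vocab = sub_vocab
        self.bits = sub_vocab.bit_length() - 1                       # e.g., 9 bits per subspace
        self.embeds = nn.ModuleList([nn.Embedding(sub_vocab, d_model)
                                     for _ in range(num_subspaces)])

    def embed(self, token_ids):                                      # token_ids: int64 tensor
        subs = [(token_ids >> (self.bits * i)) % self.sub_vocab      # split full id into subspace ids
                for i in range(len(self.embeds))]
        return sum(emb(s) for emb, s in zip(self.embeds, subs))      # summed token embedding

    def logits(self, hidden):                                        # hidden: [..., d_model]
        # Weight tying (Press & Wolf, 2017): reuse each subspace embedding matrix as its head.
        return [hidden @ emb.weight.t() for emb in self.embeds]
```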
4 EXPERIMENTS
This section empirically verifies the proposed tokenizer across three distinct tasks: video and
image generation, video compression, and action recognition. Fig. 3 visually compares the re-
construction quality of our tokenizer with prior works. More qualitative samples are shown at
https://magvit.cs.cmu.edu/v2.
Figure 3: Image reconstruction samples with different tokenizers. We compare the VQGAN
used in MaskGIT (Chang et al., 2022) with two of our models trained on ImageNet and web im-
ages (Chen et al., 2022). Original images are by Eric TERRADE and Barth Bailey on Unsplash.
The masked language model (MLM) (Devlin et al., 2019) is used for image and video generation.
To verify the tokenizer, we employ the same MLM transformer as in MAGVIT (Yu et al., 2023a).
As we use a smaller MLM (~300M parameters) with a large codebook ($2^{18} \approx 262$K), the token
factorization discussed in Section 3.2 is applied using two heads, each predicting from a
codebook of size $2^9$.
Video generation. We consider two standard video benchmarks, UCF-101 for class-conditional
generation and K600 for frame prediction with 5-frame condition. FVD (Unterthiner et al., 2018) is
used as our primary evaluation metric. Tab. 1 shows that our model surpasses all prior art on both
benchmarks. Specifically, it outperforms the previous best model MAGVIT by a large margin, while
using the same MLM transformer backbone. These results demonstrate the essential role of a good
visual tokenizer in enabling LMs to generate high-quality videos. Fig. 4 shows qualitative samples
from the model.
Image generation on ImageNet. We evaluate MAGVIT-v2 on image generation under the stan-
dard ImageNet class-conditional setting. We present results for resolution 512×512 in Tab. 2 and
Table 1: Video generation results: frame prediction on Kinetics-600 and class-conditional genera-
tion on UCF-101. We adopt the evaluation protocol of MAGVIT.
Type          Method                                K600 FVD↓    UCF FVD↓    #Params    #Steps
GAN           TrIVD-GAN-FP (Luc et al., 2020)       25.7±0.7     -           -          1
Diffusion     Video Diffusion (Ho et al., 2022c)    16.2±0.3     -           1.1B       256
Diffusion     RIN (Jabri et al., 2023)              10.8         -           411M       1000
AR-LM + VQ    TATS (Ge et al., 2022)                -            332±18      321M       1024
MLM + VQ      Phenaki (Villegas et al., 2022)       36.4±0.2     -           227M       48
MLM + VQ      MAGVIT (Yu et al., 2023a)             9.9±0.3      76±2        306M       12
MLM + LFQ     MAGVIT-v2 (this paper)                5.2±0.2      -           307M       12
MLM + LFQ     MAGVIT-v2 (this paper)                4.3±0.1      58±3        307M       24
refer to the Appendix for 256×256 results. FID (Heusel et al., 2017) and Inception Score (IS) (Sali-
mans et al., 2016) are used as evaluation metrics. Our model surpasses the best performing diffusion
models both in sampling quality (w.r.t. FID and IS), and inference-time efficiency (w.r.t. sampling
steps).
It is worth noting that all the models compared are trained using the same ImageNet training data,
with a comparable model size and training budget. Therefore, the performance primarily evaluates
the model’s capabilities. The masked language model, equipped with our tokenizer, exhibits a no-
table improvement in FID over the best diffusion model baseline at 512×512 (FID 1.91 vs. 2.65,
a 28% reduction). While this margin narrows at 256×256 resolution, the MLM uses a 50% smaller model
and needs far fewer decoding steps (e.g., 64 vs. 250) to reach comparable image generation quality.
Qualitative samples in comparison with other models are shown in Fig. 5.
Sixteen raters are engaged, each providing responses to an average of roughly 800
pairwise-preference questions. We calculate Elo scores (Elo & Sloan, 2008) based on pairwise
preferences to quantify the relative visual quality between the models. The study compares our
model with MAGVIT as well as the current video compression standard, the HEVC (H.265) video
codec (Sullivan et al., 2012), and the next-generation codec VVC (H.266) (Bross et al., 2021).
Figure 6: Video compression rater study (Elo score vs. bits per pixel, bpp).
Figure 4: Frame prediction samples on Kinetics-600 (condition → generation).
Figure 5: Class-conditional generation samples on ImageNet 512×512. We compare with each
of the previous works using a random sample from the same image class.
As shown in Fig. 6, raters prefer our model to the compared methods at multiple bit rates.
We also compare the compression quality using common distortion metrics (LPIPS, PSNR,
and MS-SSIM) at 0.0384 bpp, the bit rate of MAGVIT. The results in Tab. 3 show that our
model outperforms MAGVIT on all metrics, and it outperforms all methods on LPIPS, a
metric which correlates more closely with subjective quality assessments than PSNR or MS-SSIM.
Table 3: Video compression metrics.
Method                          LPIPS↓    PSNR↑    MS-SSIM↑
HEVC (Sullivan et al., 2012)    0.199     30.10    0.943
VVC (Bross et al., 2021)        0.153     32.65    0.966
MAGVIT (Yu et al., 2023a)       0.144     23.70    0.846
MAGVIT-v2 (this paper)          0.104     26.18    0.894
In this subsection, we assess the tokenizer's capability to learn a video understanding model
for action recognition. Two setups are examined: (1) using tokens as prediction targets for
the transformer's output, and (2) using tokens as the input to the transformer. For the former
setup, we use a similar architecture following the BEVT (Wang et al., 2022) pre-training. For
the tokens as inputs, to work with the ViViT backbone (Arnab et al., 2021), we detokenize the
tokens to pixels before feeding them to the ViViT transformers.
Table 4: Video action recognition performance (classification accuracy↑ ×100). "Output": tokens
as the transformer's prediction target; "Input": tokens as the transformer's input.
Tokenizer                      Output SSv2    Input SSv2    Input K400    Input K600
3D VQ-VAE                      64.13          41.27         44.44         45.67
MAGVIT (Yu et al., 2023a)      67.22          57.34         72.29         74.65
MAGVIT-v2 (this paper)         67.38          62.40         75.34         77.93
Raw pixel                      n/a            63.08         76.13         78.92
Tab. 4 shows that MAGVIT-v2 outperforms the previous best MAGVIT in these evaluations. Specif-
ically, when using the decoded tokens as input, the performance approaches that of the model trained
with ground-truth pixels using the same ViViT backbone. While these numbers are still worse than
the state-of-the-art in action recognition, they represent solid improvements credited to the new
tokenizer.
In Fig. 1, we have ablated LFQ vs. VQ and the vocabulary size. In Tab. 5, we validate the key designs
proposed in Section 3.2. Specifically, Tab. 5a compares the architecture illustrated in Fig. 2; Tab. 5b
and Tab. 5c verify the LFQ and other improvements on ImageNet and UCF-101, respectively.
5 RELATED WORK
Visual tokenization. Beyond the VQ-VAE models discussed in Section 2, additional models have
been proposed. ViT-VQGAN (Yu et al., 2022a) introduces transformer blocks as a substitute for
CNNs for image tokenization. C-ViViT (Villegas et al., 2022) further extends this idea for video to-
kenization. Early studies on video tokenization treat frames as independent images with no temporal
compression (Wu et al., 2022; Gupta et al., 2022). Later research (Yan et al., 2021; Ge et al., 2022;
Yu et al., 2023a) integrates 3D CNNs to tokenize spatial-temporal volumes. Despite these advances
in vector quantization (VQ), the codebook learned by previous VQ models is relatively small (e.g.,
8k) due to the difficulty in improving the generation quality with larger vocabularies. In contrast,
our tokenizer can induce a large vocabulary (e.g., 262k) that can be effectively modeled by an LM,
leading to enhanced image and video generation quality.
Text-to-{image, video}. Text-to-image and text-to-video generation have seen rapid
advancements using both language models (Yu et al., 2023b; Chang et al., 2023) and dif-
fusion models (Ho et al., 2022a; Blattmann et al., 2023; Singer et al., 2022; Ge et al., 2023; Ramesh
et al., 2022). Although diffusion models, such as Midjourney, are considered the top performers in
these tasks, it is unclear whether their advantage stems from the model, data, or some other uniden-
tified factors. Indeed, it is challenging to scientifically compare these text-to-image models as they
are trained on varied datasets, with some even being proprietary data, under inconsistent training
conditions. To facilitate a fairer comparison, this paper prioritizes using the ImageNet and Kinetics
benchmarks.
Diffusion models. Exhibiting high-quality sampling, pixel-space diffusion models (Sohl-
Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) rose to the top of the generative
modeling space for both image (Ho et al., 2020; Dhariwal & Nichol, 2021; Saharia et al., 2022) and
video (Ho et al., 2022c;a; Singer et al., 2022) synthesis. The pixel-space denoising diffusion models
(DDMs) are later refined by the latent-space DDM (Rombach et al., 2022), which conducts diffusion
over the continuous latent embeddings derived from a pre-trained variational autoencoder (VAE).
Binary latents for image modeling were used in Wang et al. (2023), where the diffusion process is
parameterized with Bernoulli distributions. Recent studies have identified advantages in substituting
the U-Net (Ronneberger et al., 2015) denoising backbone with a Transformer (Peebles & Xie, 2022;
Jabri et al., 2023) or a hybrid of both (Hoogeboom et al., 2023), making the distinctions between
diffusion and language models in visual generation more blurred, with a key distinction being their
latent format — continuous for diffusion and discrete for language models.
6 CONCLUSION
We introduce MAGVIT-v2, a novel video tokenizer that exploits lookup-free quantization along
with architectural advancements to tokenize images and videos with a shared vocabulary. The ex-
periments show that our tokenizer outperforms the previously leading video tokenizer across three
areas: visual generation, video compression, and action recognition in videos. Our results suggest
that a good visual tokenizer is key for enabling language models to excel in image and video gener-
ation. These results demonstrate the great capabilities of LMs in visual generation, and advocate for
further exploration of advanced visual tokenization methods designed for LLMs.
REFERENCES
Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon,
Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. MusicLM: Generating
music from text. arXiv:2301.11325, 2023. 1
Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Gener-
ative adversarial networks for extreme learned image compression. In ICCV, 2019. 4
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid.
ViViT: A video vision transformer. In ICCV, 2021. 8, 16
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transform-
ers. In ICLR, 2021. 2, 16
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler,
and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion
models. In CVPR, 2023. 9
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity
natural image synthesis. In ICLR, 2018. 17
Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer
Ohm. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans-
actions on Circuits and Systems for Video Technology, 31(10):3736–3764, 2021. 2, 8
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. In NeurIPS, 2020. 2
Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short
note about Kinetics-600. arXiv:1808.01340, 2018. 6
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked genera-
tive image transformer. In CVPR, 2022. 1, 2, 3, 4, 6, 7, 17
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan
Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image gen-
eration via masked generative transformers. In ICML, 2023. 2, 9
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever.
Generative pretraining from pixels. In ICML, 2020. 2
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian
Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual
language-image model. In ICLR, 2022. 6
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM:
Scaling language modeling with pathways. arXiv:2204.02311, 2022. 2
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-
efficient exact attention with io-awareness. In NeurIPS, 2022. 2
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale
hierarchical image database. In CVPR, 2009. 6
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL, 2019. 1, 2, 6
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In
NeurIPS, 2021. 3, 7, 9, 17
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim
Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language
models with mixture-of-experts. In ICML, 2022. 2
Arpad E. Elo and Sam Sloan. The rating of chessplayers : past and present. Ishi Press International,
2008. 7, 15
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image
synthesis. In CVPR, 2021. 1, 3, 4, 17
Gustav Theodor Fechner. Elemente der psychophysik, volume 2. Breitkopf u. Härtel, 1860. 15
Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is
a strong image synthesizer. arXiv:2303.14389, 2023. 1, 15, 17
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and
Devi Parikh. Long video generation with time-agnostic VQGAN and time-sensitive transformer.
In ECCV, 2022. 3, 7, 9
Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs,
Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior
for video diffusion models. arXiv preprint arXiv:2305.10474, 2023. 9
Google. PaLM 2 technical report. arXiv:2305.10403, 2023. 1
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne West-
phal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al.
The “something something” video database for learning and evaluating visual common sense. In
ICCV, 2017. 6
Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martı́n-Martı́n, and Li Fei-Fei.
MaskViT: Masked visual pre-training for video prediction. In ICLR, 2022. 9
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS,
2017. 7
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021. 7,
15, 17
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS,
2020. 3, 9
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P
Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition
video generation with diffusion models. arXiv:2210.02303, 2022a. 3, 9
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Sali-
mans. Cascaded diffusion models for high fidelity image generation. JMLR, 23(1):2249–2281,
2022b. 17
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J
Fleet. Video diffusion models. In ICLR Workshops, 2022c. 3, 7, 9
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for
high resolution images. In ICML, 2023. 7, 9, 17
Allan Jabri, David J Fleet, and Ting Chen. Scalable adaptive computation for iterative generation.
In ICML, 2023. 7, 9, 17
Aren Jansen, Daniel PW Ellis, Shawn Hershey, R Channing Moore, Manoj Plakal, Ashok C Popat,
and Rif A Saurous. Coincidence, categorization, and consolidation: Learning to recognize sounds
with minimal supervision. In ICASSP, 2020. 4
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In CVPR, 2019. 5
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya-
narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action
video dataset. arXiv:1705.06950, 2017. 6
Diederik P Kingma and Ruiqi Gao. Understanding the diffusion objective as a weighted integral of
elbos. arXiv:2303.00848, 2023. 7, 17
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and WOOK SHIN HAN. Draft-and-revise:
Effective image generation with contextual rq-transformer. In NeurIPS, 2022. 1, 17
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In EMNLP, 2021. 2
José Lezama, Huiwen Chang, Lu Jiang, and Irfan Essa. Improved masked image generation with
Token-Critic. In ECCV, 2022. 17
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, and Irfan Essa. Discrete
predictor-corrector diffusion models for image synthesis. In ICLR, 2023. 7, 15, 17
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou,
Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. StarCoder: may the source be with
you! arXiv:2305.06161, 2023. 1
Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer,
and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data.
arXiv:2003.04035, 2020. 7
Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, and Irfan Essa.
Discrete representations strengthen vision transformer robustness. In ICLR, 2021. 2
OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023. 1
William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv:2212.09748,
2022. 3, 7, 9, 17
Ofir Press and Lior Wolf. Using the output embedding to improve language models. In EACL, 2017.
5
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen,
and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021. 2
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-
conditional image generation with clip latents. arXiv:2204.06125, 2022. 9
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In CVPR, 2022. 1, 3, 9, 17
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomed-
ical image segmentation. In MICCAI, 2015. 9
Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos,
Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al.
AudioPaLM: A large language model that can speak and listen. arXiv:2306.12925, 2023. 1
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar
Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic
text-to-image diffusion models with deep language understanding. In NeurIPS, 2022. 9
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training GANs. In NeurIPS, 2016. 7
Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling stylegan to large diverse
datasets. In SIGGRAPH, 2022. 7, 17
Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv:1911.02150,
2019. 2
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry
Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video
data. arXiv:2209.14792, 2022. 9
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen
Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering
with large language models. arXiv:2305.09617, 2023. 1
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. In ICML, 2015. 3, 9
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
In NeurIPS, 2019. 3, 9
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human
actions classes from videos in the wild. arXiv:1212.0402, 2012. 6
Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high
efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video
Technology, 22(12):1649–1668, 2012. 2, 7, 8
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski,
and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.
arXiv:1812.01717, 2018. 6
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS,
2017. 2, 3
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang,
Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable
length video generation from open domain textual description. arXiv:2210.02399, 2022. 2, 3, 4,
5, 7, 9
Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang,
Ioannis Katsavounidis, Anne Aaron, and C-C Jay Kuo. MCL-JCV: a JND-based H.264/AVC
video quality assessment dataset. In ICIP, 2016. 6, 15
Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang
Jiang, Luowei Zhou, and Lu Yuan. BEVT: BERT pretraining of video transformers. In CVPR,
2022. 2, 8, 16
Ze Wang, Jiang Wang, Zicheng Liu, and Qiang Qiu. Binary latent diffusion. In CVPR, 2023. 9, 17
Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA:
Visual synthesis pre-training for neural visual world creation. In ECCV, 2022. 9
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using
VQ-VAE and transformers. arXiv:2104.10157, 2021. 9
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong
Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQ-
GAN. In ICLR, 2022a. 4, 9
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan,
Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-
rich text-to-image generation. arXiv:2206.10789, 2022b. 2
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G
Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. MAGVIT: Masked generative video
transformer. In CVPR, 2023a. 1, 2, 3, 6, 7, 8, 9, 15
Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun
Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models:
Pretraining and instruction tuning. arXiv:2309.02591, 2023b. 9
Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019. 5
Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models
with masked transformers. arXiv:2306.09305, 2023. 17
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart,
Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowl-
edge to robotic control. In CoRL, 2023. 1
A IMPLEMENTATION DETAILS
We set up two image tokenizers to downsample by 16× and 32×, where they are used for generation
at 256×256 and 512×512, respectively. In both cases, an image is represented as 16×16 tokens.
We train them on the ImageNet training set for 270 epochs using a batch size of 256, both with
256×256 images.
With this tokenizer we train a Masked Language Model following Yu et al. (2023a), using the token
factorization described in Section 3.2. We train for 1080 epochs in accordance with the prior best
model MDT (Gao et al., 2023), with batch size 1024 for better efficiency. For preprocessing and
data augmentation, we randomly crop 80-100% of an image while keeping the aspect ratio, followed
by random horizontal flipping. The class label is dropped for 10% of the training batches to enable
classifier-free guidance (Ho & Salimans, 2021). For unguided generation, we use temperature 30
for 512×512 and 15 for 256×256 in the non-autoregressive decoding. For guided generation, we
adopt the guidance schedule from Gao et al. (2023) with temperature scaling (Lezama et al., 2023),
where we use guidance scale 25 with temperature 15.
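To illustrate the guided generation step, the sketch below shows the standard classifier-free guidance combination of conditional and unconditional logits assumed here; how the temperature of 15 enters the confidence-based re-masking (the temperature scaling of Lezama et al., 2023) and the exact per-step guidance schedule from Gao et al. (2023) are not shown.

```python
# A hedged sketch of classifier-free guidance for the masked LM sampler.
# The linear combination below is the standard CFG form; treating it as the paper's
# exact per-step rule is an assumption.
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               guidance_scale: float = 25.0) -> torch.Tensor:
    # l_guided = l_cond + w * (l_cond - l_uncond), with w the guidance scale.
    return cond_logits + guidance_scale * (cond_logits - uncond_logits)
```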
We inflate an image tokenizer trained at 128×128 for video modeling. Different from the inflation
in Yu et al. (2023a), we fill in the temporally last slice to correspond to the causal padding scheme.
In addition, we disable the inflation for the discriminator and train it from scratch for better stability.
We train the causal video tokenizer on Kinetics-600 training set for 190 epochs with batch size 256.
This tokenizer is also used in subsequent evaluations of video compression and action recognition.
With the causal tokenizer producing 5×16×16 tokens for a 17×128×128 clip, the first 2×16×16
tokens are provided as the condition for the first 5 frames, per the standard setup of the Kinetics-600
frame prediction benchmark. We train the MLM transformer following Yu et al. (2023a) with token
factorization for 360 epochs with batch size 256. The model is sampled with a cosine schedule using
temperature 32.
To rate the quality of the different methods, we use a two-alternative forced choice rating method-
ology (Fechner, 1860). As this methodology produces a sequence of binary decisions, we calculate
Elo scores (Elo & Sloan, 2008) based on pairwise preferences to quantify the relative visual quality
between the models. The study was conducted on the 30 videos of the MCL-JCV dataset (Wang
et al., 2016), scaled down to a resolution of 640×360 pixels. Sixteen raters are engaged, each pro-
viding responses to an average of roughly 800 pairwise-preference questions. The questions are
presented with an interface that parallels the one used for the Challenge on Learned Image Compression (CLIC).
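For reference, the sketch below shows how a sequence of such binary preferences can be converted into Elo scores with the standard rating update; the K-factor and initial rating are illustrative assumptions, as the paper does not specify them.

```python
# A minimal sketch of the Elo update used to turn pairwise preferences into scores.
def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Update ratings after one comparison (a_wins: 1.0 if A is preferred, else 0.0)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (a_wins - expected_a)
    rating_b += k * ((1.0 - a_wins) - (1.0 - expected_a))
    return rating_a, rating_b

print(elo_update(1500.0, 1500.0, 1.0))   # -> (1516.0, 1484.0)
```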
Tokens as prediction targets. BEiT (Bao et al., 2021) and BEVT (Wang et al., 2022) class of
models pretrain visual encoders on pixel inputs by predicting tokens as targets in a masked-modeling
framework, and demonstrate state-of-the-art downstream results. We use a simplified BEVT pre-
training setup to test the effectiveness of our video tokens as targets for masked modeling. The main
difference is that we drop the image stream from pre-training and only use the video stream; for
this reason, we also drop the multiple decoders completely and adopt an encoder-only architecture
similar to BEiT. Detailed pre-training and fine-tuning setup is presented in Tab. 6. In Tab. 4 of the
main paper, we show that our video tokens are effective targets for masked modeling based video
understanding.
Tokens as inputs. In Tab. 4, we show that we can re-use video understanding models trained on
pixels using our video tokens as input, with very minimal performance drop. For this experiment,
we train a factorized variant of the ViViT model (Arnab et al., 2021) on pixels, and evaluate it on de-
tokenized pixels from our model. We use the same hyper-parameters as used in Arnab et al. (2021)
with a Base sized model operating on 32 frames of inputs at 224p resolution. For the Kinetics-600
experiment, we use the same hyper-parameters as the Kinetics-400 experiments.
B ADDITIONAL RESULTS
For better visualization, the generated video samples can be viewed at https://magvit.cs.cmu.
edu/v2.
Where are the text-to-image results? We want to emphasize that our goal is to develop a video
tokenizer, and many of the proposed techniques are designed specifically for videos. Text-to-image
may be out of the scope of our paper. We are currently training text-to-video models that require
considerable computational resources. Due to time constraints, these results are not available at the
moment. We intend to add the generated videos in the next revision. However, it is important to
note that comparing these text-to-image or text-to-video models scientifically is challenging. These
models were trained on different datasets, and some were even based on proprietary or non-public
data, all under varying training conditions.