LANGUAGE MODEL BEATS DIFFUSION - TOKENIZER IS KEY TO VISUAL GENERATION
ABSTRACT
While Large Language Models (LLMs) are the dominant models for generative
tasks in language, they do not perform as well as diffusion models on image and
video generation. To effectively use LLMs for visual generation, one crucial com-
ponent is the visual tokenizer that maps pixel-space inputs to discrete tokens ap-
propriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video
tokenizer designed to generate concise and expressive tokens for both videos and
images using a common token vocabulary. Equipped with this new tokenizer, we
show that LLMs outperform diffusion models on standard image and video gener-
ation benchmarks including ImageNet and Kinetics. In addition, we demonstrate
that our tokenizer surpasses the previously top-performing video tokenizer on two
more tasks: (1) video compression comparable to the next-generation video codec
(VVC) according to human evaluations, and (2) learning effective representations
for action recognition tasks.
1 INTRODUCTION
Large transformer-based language models, commonly referred to as LMs or LLMs, are the de facto
models for natural language generation (OpenAI, 2023; Google, 2023). Over time, LMs have ex-
panded their capabilities to generate content in various modalities, asserting their dominance in other
domains like audio (Agostinelli et al., 2023), speech (Rubenstein et al., 2023), code generation (Li
et al., 2023), medical applications (Singhal et al., 2023) and robotics (Zitkovich et al., 2023).
LMs are capable of generating images and videos. To do so, the image pixels are mapped into a
sequence of discrete tokens by a visual tokenizer (cf. Section 2). These tokens are then fed into
the LM transformer, as if they were lexical words, for generative modeling. Despite notable ad-
vancements in employing LMs for visual generation (Esser et al., 2021; Chang et al., 2022), LMs
still do not perform as well as diffusion models (Rombach et al., 2022). For instance, when evalu-
ating on the ImageNet dataset, a gold standard benchmark for image generation, the best language
model (Lee et al., 2022) underperforms the diffusion model (Gao et al., 2023) by a substantial 48%
margin (FID 3.41 vs. 1.79 when generating images at the 256×256 resolution).
Why do language models lag behind diffusion models in visual generation? This paper suggests that
a primary reason is the lack of a good visual representation, resembling our natural language system,
for effectively modeling the visual world. To substantiate this hypothesis, this paper shows that,
when utilizing a good visual tokenizer, the masked language model (Devlin et al., 2019; Chang et al.,
2022; Yu et al., 2023a) surpasses the state-of-the-art diffusion models in terms of both generation
fidelity and efficiency across image and video benchmarks, given the same training data, comparable
model size, and training budget. To the best of our knowledge, this provides the first evidence that
language models beat diffusion models on the hallmark ImageNet benchmark.
It is worth emphasizing that our intention is not to assert whether the language model is superior
to others, but to promote the exploration of visual tokenization methods for LLMs. A fundamental
difference of LLMs from other models, such as diffusion models, is that LLMs utilize a discrete
latent format: tokens obtained from a visual tokenizer. We show that the values of these discrete
visual tokens should not be overlooked considering their distinct advantages as follows. (1) Com-
patibility with LLMs. The main advantage of a token representation is that it shares the same form
* Work done during a research internship at Google Research.
as language tokens, making it straightforward to leverage the optimizations our community has de-
veloped over many years for LLMs. This includes faster training and inference speeds (Shazeer,
2019; Lester et al., 2021), advancements in model infrastructure (Dao et al., 2022; Du et al., 2022),
learning recipes for model scaling (Brown et al., 2020; Chowdhery et al., 2022), and GPU/TPU op-
timization, among other innovations. Unifying vision and language by the same token space could
set the stage for a true multimodal LLM that can understand, generate, and reason within our visual
environment. (2) Compressed representation. The discrete token may offer a fresh perspective
on video compression. The visual tokens can serve as a new video compression format to reduce
disk storage and bandwidth during internet transfers. Unlike compressed RGB pixels, these tokens
can be fed directly into generative models, bypassing the conventional decompression and latent
encoding steps. This allows for faster processing in generative video applications, especially bene-
ficial in edge computing cases. (3) Visual understanding benefits. Prior research has shown that
the discrete tokens are valuable as a pre-training target in self-supervised representation learning, as
discussed in BEiT (Bao et al., 2021) and BEVT (Wang et al., 2022). Additionally, research finds
that using tokens as the model inputs improves the robustness and generalization (Mao et al., 2021).
In this paper, we introduce MAGVIT-v2, a video tokenizer designed to map videos (and images) into
compact discrete tokens. Our model is built on the state-of-the-art video tokenizer, MAGVIT (Yu
et al., 2023a), within the VQ-VAE framework (Van Den Oord et al., 2017). We propose two new
techniques. First, a novel lookup-free quantization method enables the learning of a large vocabulary
that is able to improve generation quality of the language model. Second, through extensive empir-
ical analyses, we have identified modifications to the tokenizer that not only enhance generation
quality but also enable the tokenization of both images and videos using a shared vocabulary.
We empirically demonstrate that our model outperforms the previously top-performing video tok-
enizer, MAGVIT, in three key areas. First, our model significantly improves the generation quality
of MAGVIT, establishing the state of the art on the common image and video benchmarks. Second,
user studies indicate that its compression quality exceeds that of MAGVIT and the current video
compression standard, HEVC (Sullivan et al., 2012). Moreover, it is on par with the next-generation
video codec, VVC (Bross et al., 2021). Finally, we show that, compared to MAGVIT, our new
tokens are stronger for video understanding tasks across two setups and three datasets. The main
contributions of this work are:
• A new video tokenizer that outperforms the previously best-performing video tokenizer in three
areas: visual generation, video compression, and action recognition.
• A novel lookup-free quantization approach that enables improving the visual generation quality
of language models by learning a large vocabulary.
• To the best of our knowledge, the first evidence suggesting that a language model can outperform
diffusion models on ImageNet when provided with the same training data, an equivalent model
size, and a similar training budget.
• A video compressor with better quality than HEVC and VVC, at similar bit rates, according to
user studies. To our knowledge, this is the first successful attempt of a visual tokenizer designed
for video generation to achieve comparable results to standard codecs.
2 BACKGROUND
Language Model (LM) for visual generation. LMs have been extended to generate images and
videos. A visual tokenizer f is used to first map visual inputs into a sequence of discrete tokens.
A video $V \in \mathbb{R}^{T \times H \times W \times 3}$ (or an image when $T = 1$) is tokenized into a discrete representation
$X = f(V) \in \{1, 2, \cdots, K\}^{T' \times H' \times W'}$, where $K$ is the codebook (vocabulary) size of the visual
tokenizer. X is flattened into a 1D token sequence obtained using raster scan ordering and then fed
into an LM transformer for generative modeling.
Two types of LMs are commonly used for visual generation. The Autoregressive LM (AR-LM)
includes ImageGPT (Chen et al., 2020), DALL-E (Ramesh et al., 2021), Parti (Yu et al., 2022b), etc.
An AR-LM predicts the next token given the previous tokens along with additional conditioning
information $c$ using a categorical distribution $p_\theta(x_i \mid x_{<i}; c)$. During inference, AR-LMs use
the standard autoregressive decoding over the tokens. Finally, the tokens are converted back to pixels
by a decoder associated with the visual tokenizer.
The Masked LM (MLM) is another type of language model for visual generation, such as:
MaskGIT (Chang et al., 2022), MAGVIT (Yu et al., 2023a), Phenaki (Villegas et al., 2022), and
MUSE (Chang et al., 2023), among others. An MLM is trained using a masked token objective (De-
vlin et al., 2019), where some tokens in the sequence are randomly masked and need to be predicted
given the observed tokens. Let $m \in \{0, 1\}^n$ be a random binary sequence where $m^\top \mathbf{1} \in [0, n-1]$.
The MLM learns $p_\theta(x_i \mid \{x_j : m_j = 1, \forall j\}; c)$ for all $i$ where $m_i = 0$. To generate a video or
image during inference, the MLM uses the non-autoregressive decoding algorithms for images and
videos (Chang et al., 2022; Yu et al., 2023a). The decoding starts with a fully masked sequence,
which is iteratively filled by repeating two steps: (1) sample the whole sequence $\hat{x}^{(t)}$ from $p_\theta$ given
the non-masked tokens from the previous step, (2) re-mask the $\lfloor \lambda(t) \cdot n \rfloor$ tokens in $\hat{x}^{(t)}$ with the
lowest probability, following a decreasing masking ratio schedule $\lambda(t)$ according to timestep $t$.
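To make the decoding loop concrete, the sketch below implements the iterative sample-and-re-mask procedure described above. The model interface `logits_fn`, the cosine mask schedule, the sentinel `mask_id`, and the tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of MaskGIT-style non-autoregressive decoding (assumptions noted above).
import math
import torch

def nonautoregressive_decode(logits_fn, seq_len, vocab_size, num_steps=12, mask_id=-1):
    """logits_fn(tokens) -> [seq_len, vocab_size] logits; mask_id marks masked slots."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for t in range(num_steps):
        logits = logits_fn(tokens)                            # predict all positions at once
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-decoded tokens are kept: give them infinite confidence so they are never re-masked.
        conf = torch.where(tokens == mask_id, conf, torch.tensor(float("inf")))
        tokens = torch.where(tokens == mask_id, sampled, tokens)
        # Decreasing masking-ratio schedule lambda(t): re-mask the least confident tokens.
        ratio = math.cos(math.pi / 2 * (t + 1) / num_steps)
        num_mask = int(ratio * seq_len)
        if num_mask > 0:
            remask = torch.topk(conf, num_mask, largest=False).indices
            tokens[remask] = mask_id
    return tokens
```

At the final step the schedule reaches zero, so no positions are re-masked and the sequence is fully decoded.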
Denoising Diffusion Models (DDM). DDMs (Sohl-Dickstein et al., 2015; Song & Ermon, 2019)
are regarded as the state-of-the-art in visual generation due to their high-quality image (Dhariwal &
Nichol, 2021; Ho et al., 2022a) and video generation (Ho et al., 2022c). For instance, DDPM (Ho
et al., 2020) learns a denoising process parameterized as conditional Gaussian distributions over
image pixels. Recently, diffusion models and language models have displayed a significant overlap.
Recent DDMs diffuse over latents rather than raw pixels. These latents are obtained using models
similar to the visual tokenizer used by LMs. In fact, the very first latent in diffusion, proposed
by Rombach et al. (2022), is derived from a visual tokenizer. Additionally, the diffusion model’s
architecture has been shifting from the U-Net to the transformer architecture (Peebles & Xie, 2022).
Consequently, the boundaries between diffusion and language models in visual generation have
become less distinct. Yet, a fundamental difference between DDMs and LMs lies in the latent
format, i.e., continuous vs. discrete. We have discussed the benefits of having discrete tokens in
Section 1 and will show that the proposed tokenizer improves in these aspects.
Visual tokenization. Visual tokenization plays an essential role in mapping pixels into a discrete
representation suitable for generative modeling. VQ-VAE (Van Den Oord et al., 2017) is a corner-
stone work in image tokenization. A VQ-VAE model consists of a convolutional neural network
(CNN) encoder, a vector-quantization (VQ) bottleneck, and a CNN decoder. Given a video $V \in \mathbb{R}^{T \times H \times W \times 3}$, the VQ-VAE's encoder $E$ produces latent embeddings $Z = E(V) \in \mathbb{R}^{T' \times H' \times W' \times d}$.
Each embedding vector $z \in \mathbb{R}^d$ in $Z$ is then passed through the vector quantizer $q$, which assigns it
to the closest entry $c_i \in \mathbb{R}^d$ in the learned codebook embedding $C \in \mathbb{R}^{K \times d}$:
$$q(z) = c_i, \quad \text{where } i = \arg\min_{j \in \{1, 2, \cdots, K\}} \|z - c_j\|_2. \qquad (1)$$
To get discrete tokens, we drop the embedding dimension and represent $Z$ by its indices $X \in \{1, 2, \cdots, K\}^{T' \times H' \times W'}$. For decoding, embeddings of all image tokens are given as input to the
decoder $D$ to reconstruct the input $\hat{V} = D(Z)$. Following VQ-VAE, VQGAN (Esser et al., 2021)
introduces an adversarial loss and feature-level perceptual losses to enhance the image quality.
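The VQ bottleneck in Eq. (1) amounts to a nearest-neighbor assignment per latent vector. The sketch below shows this assignment with a straight-through gradient; the function name, shapes, and the inclusion of the straight-through trick are illustrative rather than a specific implementation from the paper.

```python
# A minimal sketch of the VQ bottleneck in Eq. (1): assign each d-dim latent to its nearest codebook entry.
import torch

def vector_quantize(z, codebook):
    """z: [..., d] latent embeddings; codebook: [K, d]. Returns (quantized, indices)."""
    flat = z.reshape(-1, z.shape[-1])                # [N, d]
    dist = torch.cdist(flat, codebook)               # [N, K] pairwise L2 distances
    indices = dist.argmin(dim=-1)                    # i = argmin_j ||z - c_j||_2
    quantized = codebook[indices].reshape(z.shape)   # q(z) = c_i
    # Straight-through estimator: copy decoder gradients back to the encoder output.
    quantized = z + (quantized - z).detach()
    return quantized, indices.reshape(z.shape[:-1])
```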
Video tokenization is more challenging and VQGAN has been adapted to meet this purpose (Ge
et al., 2022; Villegas et al., 2022; Yu et al., 2023a). The state of the art in video tokenization
is MAGVIT (Yu et al., 2023a), which introduces a better 3D architecture, an inflation technique
for initialization using image pre-training, and robust training losses. With MAGVIT, the LMs
achieve leading generation quality across multiple video benchmarks. However, MAGVIT struggles
to tokenize images and often results in noticeable flickering in longer videos.
3 METHOD
We introduce a new video tokenizer designed to map the spatial-temporal dynamics from a visual
scene into compact discrete tokens suitable for language models. Our approach builds upon the
state-of-the-art video tokenizer, MAGVIT, as detailed in Yu et al. (2023a). This section highlights
two new designs: a lookup-free quantizer and a collection of enhancements to the tokenizer model.
3.1 LOOKUP-FREE QUANTIZER
Although the community has made great progress in developing VQ-VAEs, the relationship be-
tween improvements in the reconstruction quality and subsequent generation quality is still not well
understood. A common misconception is that improving reconstruction equates to improving the
generation of the language model. For example, enlarging the vocabulary can improve reconstruc-
tion quality. However, such improvement only extends to generation when the vocabulary size is
small, and a very large vocabulary can actually hurt the performance of the language model.
As illustrated by the dashed curves in Fig. 1, the reconstruction FID, indicated by the right y-axis
(where a lower value is better), improves as the vocabulary size (the x-axis) increases. The orange
solid curve in Fig. 1 represents the LM’s generation quality (the left y-axis). The generation FID
initially improves but deteriorates for larger vocabulary. This may shed light on why the vocabulary
size of most language models for visual generation is around 1-8k (Esser et al., 2021; Villegas et al.,
2022), which is significantly smaller than the size of natural language vocabulary, i.e. over 200k.
[Figure 1: Reconstruction FID↓ (right y-axis) and generation FID↓ (left y-axis) as a function of vocabulary size, for VQ and LFQ tokenizers.]
A simple trick for training a larger codebook involves decreasing the code embedding dimension when increasing the vocabulary size (Yu et al., 2022a). This trick captures the intuition of limiting the representational capacity of individual tokens, which in turn facilitates learning the distribution over a large vocabulary.
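To make this idea concrete, below is a minimal sketch of lookup-free quantization in this spirit: the code embedding dimension is reduced to the point where every dimension is binary, so the codebook becomes implicit (the Cartesian product of per-dimension sign sets) and quantization requires no embedding lookup. The straight-through gradient and the omission of any training-time regularizer are simplifications for illustration, not the paper's exact formulation.

```python
# A hedged sketch of lookup-free quantization: per-dimension binary codes, vocabulary size 2**d.
import torch

def lookup_free_quantize(z):
    """z: [..., d] latents. Returns (quantized latents in {-1, +1}, integer token indices)."""
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))  # per-dimension sign quantization
    q = z + (q - z).detach()                                          # straight-through gradient
    bits = (q > 0).long()                                             # [..., d] in {0, 1}
    powers = 2 ** torch.arange(z.shape[-1])                           # weight of each binary dimension
    indices = (bits * powers).sum(dim=-1)                             # token id in [0, 2**d)
    return q, indices
```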
[Figure 2: Alternative tokenizer designs: (a) C-ViViT, with per-frame patch embedding, a spatial transformer, and a causal transformer before the quantizer; (b) a 3D CNN followed by a causal transformer; (c) a causal 3D CNN.]
The C-ViViT design (Fig. 2a) performs reasonably well but has two drawbacks. First, unlike CNNs, the positional embeddings make it difficult to tokenize spatial resolutions that were not seen during training. Second, we empirically found that 3D CNNs perform better than the spatial transformer and produce tokens with better spatial causality of the corresponding patch.
To tackle these drawbacks, we explore two plausible designs. Fig. 2b combines C-ViViT and
MAGVIT. Assuming a temporal compression ratio of 4, a 3D CNN processes blocks of 4 frames
followed by a causal transformer. In Fig. 2c, we use a temporally causal 3D convolution to replace
the regular 3D CNN. Specifically, the temporal padding scheme for a regular 3D convolution layer
with kernel size $(k_t, k_h, k_w)$ includes $\lfloor \frac{k_t - 1}{2} \rfloor$ frames before and $\lfloor \frac{k_t}{2} \rfloor$ frames after the input frames.
In contrast, a causal 3D convolution layer pads with $k_t - 1$ frames before the input and nothing after,
so that the output for each frame depends only on previous frames. Consequently, the first
frame is always independent of other frames, allowing the model to tokenize single images.
Temporal convolutional subsampling with stride $s$ is sufficient for $s\times$ down-sampling by mapping
$1 + s \cdot t$ frames into $1 + t$. After a regular $s\times$ up-sampling, we drop the first $s - 1$ resulting frames,
which maps $1 + t$ frames into $1 + s \cdot t$ and allows for the tokenization of a single image. Tab. 5a
empirically compares the designs in Fig. 2, and we find that the causal 3D CNN performs the best.
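The following is a minimal sketch of the temporally causal 3D convolution described above, written as a PyTorch-style layer: the input is padded with $k_t - 1$ frames before (and none after), and a temporal stride $s$ maps $1 + s \cdot t$ frames to $1 + t$ outputs. The layer name, kernel sizes, and channel counts are illustrative assumptions.

```python
# A sketch of causal temporal padding for a strided 3D convolution (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), temporal_stride=1):
        super().__init__()
        self.kt = kernel[0]
        pad_h, pad_w = kernel[1] // 2, kernel[2] // 2
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              stride=(temporal_stride, 1, 1),
                              padding=(0, pad_h, pad_w))      # temporal padding is applied manually

    def forward(self, x):                                     # x: [B, C, T, H, W]
        x = F.pad(x, (0, 0, 0, 0, self.kt - 1, 0))            # pad k_t - 1 frames before, none after
        return self.conv(x)

# With temporal stride s = 4, a clip of 1 + 4*t frames yields 1 + t output frames,
# so a single image (t = 0) is tokenized independently of any other frame.
x = torch.randn(1, 3, 17, 64, 64)                             # 17 = 1 + 4*4 frames
y = CausalConv3d(3, 8, kernel=(5, 3, 3), temporal_stride=4)(x)
print(y.shape)                                                # torch.Size([1, 8, 5, 64, 64])
```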
Architecture modifications. In addition to using causal 3D CNN layers, we made several other
architectural modifications to improve upon the MAGVIT model. First, we change the encoder
downsamplers from average pooling to strided convolutions to leverage learned kernels, and replace
the decoder upsamplers (nearest resizing followed by convolution) with a depth-to-space
operator. Second, we defer the temporal downsampling from the first few encoder blocks to the last
ones. In addition, the downsampling layer in the discriminator now utilizes 3D blur pooling (Zhang,
2019) to encourage shift invariance. Finally, we add one adaptive group normalization layer before
the residual blocks at each resolution in the decoder to pass in the quantized latents as the control
signal following StyleGAN (Karras et al., 2019). Tabs. 5b and 5c empirically verify these designs.
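As a brief illustration of the depth-to-space upsampler mentioned above, the sketch below trades channels for spatial resolution via pixel shuffling instead of nearest-neighbor resizing followed by a convolution. The channel counts and kernel size are illustrative assumptions, not the paper's exact configuration.

```python
# A sketch of a depth-to-space (pixel-shuffle) upsampling block.
import torch
import torch.nn as nn

class DepthToSpaceUpsample(nn.Module):
    def __init__(self, channels, factor=2):
        super().__init__()
        # Expand channels so PixelShuffle can fold factor**2 of them into spatial resolution.
        self.conv = nn.Conv2d(channels, channels * factor ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(factor)   # [B, C*r^2, H, W] -> [B, C, r*H, r*W]

    def forward(self, x):
        return self.shuffle(self.conv(x))

print(DepthToSpaceUpsample(64)(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 32, 32])
```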
Token factorization for efficient prediction. The output tokens can be fed into language models
to generate videos. To help smaller transformers predict over a large vocabulary, we can factorize
the LFQ token's latent space into equal subspaces. For instance, rather than predicting with a
codebook of size $2^{18}$, we can predict in two concatenated codebooks, each of size $2^9$. We embed
each subspace token separately and use the sum of their embeddings as the token embedding for
the transformer input. For the output layer with weight tying (Press & Wolf, 2017), we use the
embedding matrix of each subspace to obtain logits with separate prediction heads.
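A minimal sketch of this factorization is given below: a token id from a $2^{18}$ vocabulary is split into two $2^9$ subspace ids, each embedded separately, the embeddings summed for the transformer input, and logits produced by separate weight-tied heads. The model dimension, class name, and the bit-split convention are assumptions for illustration.

```python
# A sketch of token factorization with summed subspace embeddings and weight-tied heads.
import torch
import torch.nn as nn

class FactorizedTokenCodec(nn.Module):
    def __init__(self, d_model=768, sub_vocab=2 ** 9, num_subspaces=2):
        super().__init__()
        self.sub_vocab = sub_vocab
        self.bits = sub_vocab.bit_length() - 1                       # e.g., 9 bits per subspace
        self.embeds = nn.ModuleList([nn.Embedding(sub_vocab, d_model)
                                     for _ in range(num_subspaces)])

    def embed(self, token_ids):                                      # token_ids: int64 tensor
        subs = [(token_ids >> (self.bits * i)) % self.sub_vocab      # split full id into subspace ids
                for i in range(len(self.embeds))]
        return sum(emb(s) for emb, s in zip(self.embeds, subs))      # summed token embedding

    def logits(self, hidden):                                        # hidden: [..., d_model]
        # Weight tying (Press & Wolf, 2017): reuse each subspace embedding matrix as its head.
        return [hidden @ emb.weight.t() for emb in self.embeds]
```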
4 EXPERIMENTS
This section empirically verifies the proposed tokenizer across three distinct tasks: video and
image generation, video compression, and action recognition. Fig. 3 visually compares the re-
construction quality of our tokenizer with prior works. More qualitative samples are shown at
https://magvit.cs.cmu.edu/v2.
Figure 3: Image reconstruction samples with different tokenizers. We compare the VQGAN
used in MaskGIT (Chang et al., 2022) with two of our models trained on ImageNet and web im-
ages (Chen et al., 2022). Original images are by Eric TERRADE and Barth Bailey on Unsplash.
The masked language model (MLM) (Devlin et al., 2019) is used for image and video generation.
To verify the tokenizer, we employ the same MLM transformer as in MAGVIT (Yu et al., 2023a).
As we use a smaller MLM (~300M parameters) with a large codebook ($2^{18} \approx 262$K), the token
factorization discussed in Section 3.2 is applied using two heads, each predicting from a
codebook of size $2^9$.
Video generation. We consider two standard video benchmarks, UCF-101 for class-conditional
generation and K600 for frame prediction with 5-frame condition. FVD (Unterthiner et al., 2018) is
used as our primary evaluation metric. Tab. 1 shows that our model surpasses all prior art on both
benchmarks. Specifically, it outperforms the previous best model MAGVIT by a large margin, while
using the same MLM transformer backbone. These results demonstrate the essential role of a good
visual tokenizer in enabling LMs to generate high-quality videos. Fig. 4 shows qualitative samples
from the model.
Image generation on ImageNet. We evaluate MAGVIT-v2 on image generation under the stan-
dard ImageNet class-conditional setting. We present results for resolution 512×512 in Tab. 2 and
Table 1: Video generation results: frame prediction on Kinetics-600 and class-conditional genera-
tion on UCF-101. We adopt the evaluation protocol of MAGVIT.
Type          Method                                K600 FVD↓    UCF FVD↓    #Params    #Steps
GAN           TrIVD-GAN-FP (Luc et al., 2020)       25.7±0.7     -           -          1
Diffusion     Video Diffusion (Ho et al., 2022c)    16.2±0.3     -           1.1B       256
Diffusion     RIN (Jabri et al., 2023)              10.8         -           411M       1000
AR-LM + VQ    TATS (Ge et al., 2022)                -            332±18      321M       1024
MLM + VQ      Phenaki (Villegas et al., 2022)       36.4±0.2     -           227M       48
MLM + VQ      MAGVIT (Yu et al., 2023a)             9.9±0.3      76±2        306M       12
MLM + LFQ     MAGVIT-v2 (this paper)                5.2±0.2      -           307M       12
MLM + LFQ     MAGVIT-v2 (this paper)                4.3±0.1      58±3        307M       24
refer to the Appendix for 256×256 results. FID (Heusel et al., 2017) and Inception Score (IS) (Sali-
mans et al., 2016) are used as evaluation metrics. Our model surpasses the best performing diffusion
models both in sampling quality (w.r.t. FID and IS), and inference-time efficiency (w.r.t. sampling
steps).
It is worth noting that all the models compared are trained using the same ImageNet training data,
with a comparable model size and training budget. Therefore, the performance primarily evaluates
the model’s capabilities. The masked language model, equipped with our tokenizer, exhibits a no-
table improvement in FID over the best diffusion model baseline at 512×512 (FID 1.91 vs. 2.65,
a 28% reduction). While this margin narrows at 256×256 resolution, the MLM uses a 50% smaller model
and needs far fewer decoding steps (e.g., 64 vs. 250) to reach comparable image generation quality.
Qualitative samples in comparison with other models are shown in Fig. 5.
Sixteen raters are engaged, each providing responses to an average of roughly 800
pairwise-preference questions. We calculate Elo scores (Elo & Sloan, 2008) based on pairwise
preferences to quantify the relative visual quality between the models. The study compares our
model with MAGVIT as well as the current video compression standard, the HEVC (H.265) video
codec (Sullivan et al., 2012), and the next-generation codec VVC (H.266) (Bross et al., 2021).
Figure 6: Video compression rater study (Elo score vs. bits per pixel, bpp).
Figure 4: Frame prediction samples on Kinetics-600 (condition → generation).
Figure 5: Class-conditional generation samples on ImageNet 512×512. We compare with each
of the previous works using a random sample from the same image class.
As shown in Fig. 6, raters prefer our model to the compared methods at multiple bit rates.
We also compare the compression quality using common distortion metrics (LPIPS, PSNR,
and MS-SSIM) at 0.0384 bpp, the bit rate of MAGVIT. The results in Tab. 3 show that our
model outperforms MAGVIT on all metrics, and it outperforms all methods on LPIPS, a
metric which correlates more closely with subjective quality assessments than PSNR or MS-SSIM.
Table 3: Video compression metrics.
Method                          LPIPS↓    PSNR↑    MS-SSIM↑
HEVC (Sullivan et al., 2012)    0.199     30.10    0.943
VVC (Bross et al., 2021)        0.153     32.65    0.966
MAGVIT (Yu et al., 2023a)       0.144     23.70    0.846
MAGVIT-v2 (this paper)          0.104     26.18    0.894
In this subsection, we assess the tokenizer's capability to learn a video understanding model
for action recognition. Two setups are examined: (1) using tokens as prediction targets for
the transformer's output, and (2) using tokens as the input to the transformer. For the former
setup, we use a similar architecture following the BEVT (Wang et al., 2022) pre-training. For
the tokens as inputs, to work with the ViViT backbone (Arnab et al., 2021), we detokenize the
tokens to pixels before feeding them to the ViViT transformers.
Table 4: Video action recognition performance (classification accuracy↑ ×100). "Output": tokens
as the transformer's prediction target; "Input": tokens as the transformer's input.
Tokenizer                      Output SSv2    Input SSv2    Input K400    Input K600
3D VQ-VAE                      64.13          41.27         44.44         45.67
MAGVIT (Yu et al., 2023a)      67.22          57.34         72.29         74.65
MAGVIT-v2 (this paper)         67.38          62.40         75.34         77.93
Raw pixel                      n/a            63.08         76.13         78.92
Tab. 4 shows that MAGVIT-v2 outperforms the previous best MAGVIT in these evaluations. Specif-
ically, when using the decoded tokens as input, the performance approaches that of the model trained
with ground-truth pixels using the same ViViT backbone. While these numbers are still worse than
the state-of-the-art in action recognition, they represent solid improvements credited to the new
tokenizer.
In Fig. 1, we have ablated LFQ vs. VQ and the vocabulary size. In Tab. 5, we validate the key designs
proposed in Section 3.2. Specifically, Tab. 5a compares the architecture illustrated in Fig. 2; Tab. 5b
and Tab. 5c verify the LFQ and other improvements on ImageNet and UCF-101, respectively.
5 RELATED WORK
Visual tokenization. Beyond the VQ-VAE models discussed in Section 2, additional models have
been proposed. ViT-VQGAN (Yu et al., 2022a) introduces transformer blocks as a substitute for
CNNs for image tokenization. C-ViViT (Villegas et al., 2022) further extends this idea for video to-
kenization. Early studies on video tokenization treat frames as independent images with no temporal
compression (Wu et al., 2022; Gupta et al., 2022). Later research (Yan et al., 2021; Ge et al., 2022;
Yu et al., 2023a) integrates 3D CNNs to tokenize spatial-temporal volumes. Despite these advances
in vector quantization (VQ), the codebook learned by previous VQ models is relatively small (e.g.,
8k) due to the difficulty in improving the generation quality with larger vocabularies. In contrast,
our tokenizer can induce a large vocabulary (e.g., 262k) that can be effectively modeled by an LM,
leading to enhanced image and video generation quality.
Text-to-{image, video}. Text-to-image and text-to-video generation have seen rapid
advancements using both language models (Yu et al., 2023b; Chang et al., 2023) and dif-
fusion models (Ho et al., 2022a; Blattmann et al., 2023; Singer et al., 2022; Ge et al., 2023; Ramesh
et al., 2022). Although diffusion models, such as Midjourney, are considered the top performers in
these tasks, it is unclear whether their advantage stems from the model, data, or some other uniden-
tified factors. Indeed, it is challenging to scientifically compare these text-to-image models as they
are trained on varied datasets, with some even being proprietary data, under inconsistent training
conditions. To facilitate a fairer comparison, this paper prioritizes using the ImageNet and Kinetics
benchmarks.
Diffusion models. Exhibiting high-quality sampling, pixel-space diffusion models (Sohl-
Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) rose to the top of the generative
modeling space for both image (Ho et al., 2020; Dhariwal & Nichol, 2021; Saharia et al., 2022) and
video (Ho et al., 2022c;a; Singer et al., 2022) synthesis. The pixel-space denoising diffusion models
(DDMs) are later refined by the latent-space DDM (Rombach et al., 2022), which conducts diffusion
over the continuous latent embeddings derived from a pre-trained variational autoencoder (VAE).
Binary latents for image modeling were used in Wang et al. (2023), where the diffusion process is
parameterized with Bernoulli distributions. Recent studies have identified advantages in substituting
the U-Net (Ronneberger et al., 2015) denoising backbone with a Transformer (Peebles & Xie, 2022;
Jabri et al., 2023) or a hybrid of both (Hoogeboom et al., 2023), making the distinctions between
diffusion and language models in visual generation more blurred, with a key distinction being their
latent format — continuous for diffusion and discrete for language models.
6 CONCLUSION
We introduce MAGVIT-v2, a novel video tokenizer that exploits lookup-free quantization along
with architectural advancements to tokenize images and videos with a shared vocabulary. The ex-
periments show that our tokenizer outperforms the previously leading video tokenizer across three
areas: visual generation, video compression, and action recognition in videos. Our results suggest
that a good visual tokenizer is key for enabling language models to excel in image and video gener-
ation. These results demonstrate the great capabilities of LMs in visual generation, and advocate for
further exploration of advanced visual tokenization methods designed for LLMs.
REFERENCES
Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon,
Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. MusicLM: Generating
music from text. arXiv:2301.11325, 2023. 1
Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Gener-
ative adversarial networks for extreme learned image compression. In ICCV, 2019. 4
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid.
ViViT: A video vision transformer. In ICCV, 2021. 8, 16
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transform-
ers. In ICLR, 2021. 2, 16
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler,
and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion
models. In CVPR, 2023. 9
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity
natural image synthesis. In ICLR, 2018. 17
Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer
Ohm. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans-
actions on Circuits and Systems for Video Technology, 31(10):3736–3764, 2021. 2, 8
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. In NeurIPS, 2020. 2
Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short
note about Kinetics-600. arXiv:1808.01340, 2018. 6
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked genera-
tive image transformer. In CVPR, 2022. 1, 2, 3, 4, 6, 7, 17
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan
Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image gen-
eration via masked generative transformers. In ICML, 2023. 2, 9
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever.
Generative pretraining from pixels. In ICML, 2020. 2
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian
Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual
language-image model. In ICLR, 2022. 6
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM:
Scaling language modeling with pathways. arXiv:2204.02311, 2022. 2
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-
efficient exact attention with io-awareness. In NeurIPS, 2022. 2
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale
hierarchical image database. In CVPR, 2009. 6
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL, 2019. 1, 2, 6
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In
NeurIPS, 2021. 3, 7, 9, 17
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim
Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language
models with mixture-of-experts. In ICML, 2022. 2
Arpad E. Elo and Sam Sloan. The rating of chessplayers : past and present. Ishi Press International,
2008. 7, 15
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image
synthesis. In CVPR, 2021. 1, 3, 4, 17
Gustav Theodor Fechner. Elemente der psychophysik, volume 2. Breitkopf u. Härtel, 1860. 15
Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is
a strong image synthesizer. arXiv:2303.14389, 2023. 1, 15, 17
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and
Devi Parikh. Long video generation with time-agnostic VQGAN and time-sensitive transformer.
In ECCV, 2022. 3, 7, 9
Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs,
Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior
for video diffusion models. arXiv preprint arXiv:2305.10474, 2023. 9
Google. PaLM 2 technical report. arXiv:2305.10403, 2023. 1
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne West-
phal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al.
The “something something” video database for learning and evaluating visual common sense. In
ICCV, 2017. 6
Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martı́n-Martı́n, and Li Fei-Fei.
MaskViT: Masked visual pre-training for video prediction. In ICLR, 2022. 9
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS,
2017. 7
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021. 7,
15, 17
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS,
2020. 3, 9
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P
Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition
video generation with diffusion models. arXiv:2210.02303, 2022a. 3, 9
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Sali-
mans. Cascaded diffusion models for high fidelity image generation. JMLR, 23(1):2249–2281,
2022b. 17
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J
Fleet. Video diffusion models. In ICLR Workshops, 2022c. 3, 7, 9
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for
high resolution images. In ICML, 2023. 7, 9, 17
Allan Jabri, David J Fleet, and Ting Chen. Scalable adaptive computation for iterative generation.
In ICML, 2023. 7, 9, 17
Aren Jansen, Daniel PW Ellis, Shawn Hershey, R Channing Moore, Manoj Plakal, Ashok C Popat,
and Rif A Saurous. Coincidence, categorization, and consolidation: Learning to recognize sounds
with minimal supervision. In ICASSP, 2020. 4
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In CVPR, 2019. 5
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya-
narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action
video dataset. arXiv:1705.06950, 2017. 6
Diederik P Kingma and Ruiqi Gao. Understanding the diffusion objective as a weighted integral of
elbos. arXiv:2303.00848, 2023. 7, 17
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and WOOK SHIN HAN. Draft-and-revise:
Effective image generation with contextual rq-transformer. In NeurIPS, 2022. 1, 17
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In EMNLP, 2021. 2
José Lezama, Huiwen Chang, Lu Jiang, and Irfan Essa. Improved masked image generation with
Token-Critic. In ECCV, 2022. 17
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, and Irfan Essa. Discrete
predictor-corrector diffusion models for image synthesis. In ICLR, 2023. 7, 15, 17
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou,
Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. StarCoder: may the source be with
you! arXiv:2305.06161, 2023. 1
Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer,
and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data.
arXiv:2003.04035, 2020. 7
Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, and Irfan Essa.
Discrete representations strengthen vision transformer robustness. In ICLR, 2021. 2
OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023. 1
William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv:2212.09748,
2022. 3, 7, 9, 17
Ofir Press and Lior Wolf. Using the output embedding to improve language models. In EACL, 2017.
5
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen,
and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021. 2
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-
conditional image generation with clip latents. arXiv:2204.06125, 2022. 9
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In CVPR, 2022. 1, 3, 9, 17
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomed-
ical image segmentation. In MICCAI, 2015. 9
Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos,
Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al.
AudioPaLM: A large language model that can speak and listen. arXiv:2306.12925, 2023. 1
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar
Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic
text-to-image diffusion models with deep language understanding. In NeurIPS, 2022. 9
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training GANs. In NeurIPS, 2016. 7
Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling stylegan to large diverse
datasets. In SIGGRAPH, 2022. 7, 17
Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv:1911.02150,
2019. 2
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry
Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video
data. arXiv:2209.14792, 2022. 9
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen
Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering
with large language models. arXiv:2305.09617, 2023. 1
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. In ICML, 2015. 3, 9
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
In NeurIPS, 2019. 3, 9
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human
actions classes from videos in the wild. arXiv:1212.0402, 2012. 6
Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high
efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video
Technology, 22(12):1649–1668, 2012. 2, 7, 8
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski,
and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.
arXiv:1812.01717, 2018. 6
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS,
2017. 2, 3
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang,
Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable
length video generation from open domain textual description. arXiv:2210.02399, 2022. 2, 3, 4,
5, 7, 9
Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang,
Ioannis Katsavounidis, Anne Aaron, and C-C Jay Kuo. MCL-JCV: a JND-based H.264/AVC
video quality assessment dataset. In ICIP, 2016. 6, 15
Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang
Jiang, Luowei Zhou, and Lu Yuan. BEVT: BERT pretraining of video transformers. In CVPR,
2022. 2, 8, 16
Ze Wang, Jiang Wang, Zicheng Liu, and Qiang Qiu. Binary latent diffusion. In CVPR, 2023. 9, 17
Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA:
Visual synthesis pre-training for neural visual world creation. In ECCV, 2022. 9
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using
VQ-VAE and transformers. arXiv:2104.10157, 2021. 9
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong
Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQ-
GAN. In ICLR, 2022a. 4, 9
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan,
Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-
rich text-to-image generation. arXiv:2206.10789, 2022b. 2
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G
Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. MAGVIT: Masked generative video
transformer. In CVPR, 2023a. 1, 2, 3, 6, 7, 8, 9, 15
Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun
Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models:
Pretraining and instruction tuning. arXiv:2309.02591, 2023b. 9
Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019. 5
Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models
with masked transformers. arXiv:2306.09305, 2023. 17
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart,
Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowl-
edge to robotic control. In CoRL, 2023. 1
A IMPLEMENTATION DETAILS
We set up two image tokenizers to downsample by 16× and 32×, where they are used for generation
at 256×256 and 512×512, respectively. In both cases, an image is represented as 16×16 tokens.
We train them on the ImageNet training set for 270 epochs using a batch size of 256, both with
256×256 images.
With this tokenizer we train a Masked Language Model following Yu et al. (2023a), using the token
factorization described in Section 3.2. We train for 1080 epochs in accordance with the prior best
model MDT (Gao et al., 2023), with batch size 1024 for better efficiency. For preprocessing and
data augmentation, we randomly crop 80-100% of an image while keeping the aspect ratio, followed
by random horizontal flipping. The class label is dropped for 10% of the training batches to enable
classifier-free guidance (Ho & Salimans, 2021). For unguided generation, we use temperature 30
for 512×512 and 15 for 256×256 in the non-autoregressive decoding. For guided generation, we
adopt the guidance schedule from Gao et al. (2023) with temperature scaling (Lezama et al., 2023),
where we use guidance scale 25 with temperature 15.
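To illustrate the guided generation step, the sketch below shows the standard classifier-free guidance combination of conditional and unconditional logits assumed here; how the temperature of 15 enters the confidence-based re-masking (the temperature scaling of Lezama et al., 2023) and the exact per-step guidance schedule from Gao et al. (2023) are not shown.

```python
# A hedged sketch of classifier-free guidance for the masked LM sampler.
# The linear combination below is the standard CFG form; treating it as the paper's
# exact per-step rule is an assumption.
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               guidance_scale: float = 25.0) -> torch.Tensor:
    # l_guided = l_cond + w * (l_cond - l_uncond), with w the guidance scale.
    return cond_logits + guidance_scale * (cond_logits - uncond_logits)
```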
We inflate an image tokenizer trained at 128×128 for video modeling. Different from the inflation
in Yu et al. (2023a), we fill in the temporally last slice to correspond to the causal padding scheme.
In addition, we disable the inflation for the discriminator and train it from scratch for better stability.
We train the causal video tokenizer on Kinetics-600 training set for 190 epochs with batch size 256.
This tokenizer is also used in subsequent evaluations of video compression and action recognition.
With the causal tokenizer producing 5×16×16 tokens for a 17×128×128 clip, the first 2×16×16
tokens are provided as the condition for the first 5 frames, per the standard setup of the Kinetics-600
frame prediction benchmark. We train the MLM transformer following Yu et al. (2023a) with token
factorization for 360 epochs with batch size 256. The model is sampled with a cosine schedule using
temperature 32.
To rate the quality of the different methods, we use a two-alternative forced choice rating method-
ology (Fechner, 1860). As this methodology produces a sequence of binary decisions, we calculate
Elo scores (Elo & Sloan, 2008) based on pairwise preferences to quantify the relative visual quality
between the models. The study was conducted on the 30 videos of the MCL-JCV dataset (Wang
et al., 2016), scaled down to a resolution of 640×360 pixels. Sixteen raters are engaged, each pro-
viding responses to an average of roughly 800 pairwise-preference questions. The questions are
presented with an interface that parallels the one used for the Challenge on Learned Image Compression (CLIC).
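For reference, the sketch below shows how a sequence of such binary preferences can be converted into Elo scores with the standard rating update; the K-factor and initial rating are illustrative assumptions, as the paper does not specify them.

```python
# A minimal sketch of the Elo update used to turn pairwise preferences into scores.
def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Update ratings after one comparison (a_wins: 1.0 if A is preferred, else 0.0)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (a_wins - expected_a)
    rating_b += k * ((1.0 - a_wins) - (1.0 - expected_a))
    return rating_a, rating_b

print(elo_update(1500.0, 1500.0, 1.0))   # -> (1516.0, 1484.0)
```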
Tokens as prediction targets. BEiT (Bao et al., 2021) and BEVT (Wang et al., 2022) class of
models pretrain visual encoders on pixel inputs by predicting tokens as targets in a masked-modeling
framework, and demonstrate state-of-the-art downstream results. We use a simplified BEVT pre-
training setup to test the effectiveness of our video tokens as targets for masked modeling. The main
difference is that we drop the image stream from pre-training and only use the video stream; for
this reason, we also drop the multiple decoders completely and adopt an encoder-only architecture
similar to BEiT. Detailed pre-training and fine-tuning setup is presented in Tab. 6. In Tab. 4 of the
main paper, we show that our video tokens are effective targets for masked modeling based video
understanding.
Tokens as inputs. In Tab. 4, we show that we can re-use video understanding models trained on
pixels using our video tokens as input, with very minimal performance drop. For this experiment,
we train a factorized variant of the ViViT model (Arnab et al., 2021) on pixels, and evaluate it on de-
tokenized pixels from our model. We use the same hyper-parameters as used in Arnab et al. (2021)
with a Base sized model operating on 32 frames of inputs at 224p resolution. For the Kinetics-600
experiment, we use the same hyper-parameters as the Kinetics-400 experiments.
B ADDITIONAL RESULTS
For better visualization, the generated video samples can be viewed at https://magvit.cs.cmu.
edu/v2.
Where are the text-to-image results? We want to emphasize that our goal is to develop a video
tokenizer, and many of the proposed techniques are designed specifically for videos. Text-to-image
may be out of the scope of our paper. We are currently training text-to-video models that require
considerable computational resources. Due to time constraints, these results are not available at the
moment. We intend to add the generated videos in the next revision. However, it is important to
note that comparing these text-to-image or text-to-video models scientifically is challenging. These
models were trained on different datasets, and some were even based on proprietary or non-public
data, all under varying training conditions.