
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

Oran Gafni Adam Polyak Oron Ashual Shelly Sheynin Devi Parikh Yaniv Taigman
Meta AI Research
{oran,adampolyak,oron,shellysheynin,dparikh,yaniv}@fb.com

Figure 1. Make-A-Scene: samples of generated images from text inputs (a), and from a text and scene input (b). Our method is able to both generate the scene (a, bottom left) and image, or generate the image from text and a simple sketch input (b, center).

Abstract

Recent text-to-image generation methods provide a simple yet exciting conversion capability between the text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unanswered, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512 × 512 pixels, significantly improving visual quality. Through scene controllability, we introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story we wrote.

1. Introduction

“A poet would be overcome by sleep and hunger before being able to describe with words what a painter is able to depict in an instant.”

Similar to this quote by Leonardo da Vinci [27], equivalents of the expression “A picture is worth a thousand words” have been iterated in different languages and eras [14, 1, 25], alluding to the heightened expressiveness of images over text from the human perspective. It is no surprise, then, that the task of text-to-image generation has been gaining increased attention with the recent success of text-to-image modeling via large-scale models and datasets. This new capability of effortlessly bridging between the text and image domains enables new forms of creativity to be accessible to the general public.

While current methods provide a simple yet exciting conversion between the text and image domains, they still lack several pivotal aspects:
(i) Controllability. The sole input accepted by the majority of models is text, confining any output to be controlled by a text description only. While certain aspects can be controlled with text, such as style or color, others such as structure, form, or arrangement can only be loosely described at best [46]. This lack of control conveys a notion of randomness and weak user influence on the image content and context [34]. Controlling elements additional to text have been suggested by [69], yet their use is confined to restricted datasets such as fashion items or faces. An earlier work by [23] suggests coarse control in the form of bounding boxes, resulting in low resolution images.

(ii) Human perception. While images are generated to match human perception and attention, the generation process does not include any relevant prior knowledge, resulting in little correlation between generation and human attention. A clear example of this gap can be observed in person and face generation, where a dissonance is present between the importance of face pixels from the human perspective and the loss applied over the whole image [28, 66]. This gap is relevant to animals and salient objects as well.

(iii) Quality and resolution. Although quality has gradually improved between consecutive methods, the previous state-of-the-art methods are still limited to an output image resolution of 256 × 256 pixels [45, 41]. Alternative approaches propose a super-resolution network, which results in less favorable visual and quantitative results [12]. Quality and resolution are strongly linked, as scaling up to a resolution of 512 × 512 requires substantially higher quality with fewer artifacts than 256 × 256.

In this work, we introduce a novel method that successfully tackles these pivotal gaps, while attaining state-of-the-art results in the task of text-to-image generation. Our method provides a new type of control complementary to text, enabling new generation capabilities while improving structural consistency and quality. Furthermore, we propose explicit losses correlated with human preferences, significantly improving image quality, breaking the common resolution barrier, and thus producing results in a resolution of 512 × 512 pixels.

Our method is comprised of an autoregressive transformer, where in addition to the conventional use of text and image tokens, we introduce implicit conditioning over optionally controlled scene tokens, derived from segmentation maps. During inference, the segmentation tokens are either generated independently by the transformer or extracted from an input image, providing freedom to impose additional constraints over the generated image. Contrary to the common use of segmentation for explicit conditioning as employed in many GAN-based methods [24, 62, 42], our segmentation tokens provide implicit conditioning in the sense that the generated image and image tokens are not constrained to use the segmentation information, as there is no loss tying them together. In practice, this contributes to the variety of samples generated by the model, producing diverse results constrained to the input segmentations.

We demonstrate the new capabilities this method provides in addition to controllability, such as (i) complex scene generation (Fig. 1), (ii) out-of-distribution generation (Fig. 3), (iii) scene editing (Fig. 4), and (iv) text editing with anchored scenes (Fig. 5). We additionally provide an example of harnessing controllability to assist with the creative process of storytelling in this video.

While most approaches rely on losses agnostic to human perception, this approach differs in that respect. We use two modified Vector-Quantized Variational Autoencoders (VQ-VAE) to encode and decode the image and scene tokens, with explicit losses targeted at specific image regions correlated with human perception and attention, such as faces and salient objects. The losses contribute to the generation process by emphasizing the specific regions of interest and integrating domain-specific perceptual knowledge in the form of network feature-matching.

While some methods rely on image re-ranking for post-generation image filtering (utilizing CLIP [44], for instance), we extend the use of classifier-free guidance, suggested for diffusion models [53, 20] by [22, 41], to transformers, eliminating the need for post-generation filtering, thus producing faster and higher quality generation results that better adhere to input text prompts.

An extensive set of experiments is provided to establish the visual and numerical validity of our contributions.

2. Related Work

2.1. Image generation

Recent advancements in deep generative models have enabled algorithms to generate high-quality and natural-looking images. Generative Adversarial Networks (GANs) [17] facilitate the generation of high fidelity images [29, 3, 30, 56] in multiple domains by simultaneously training a generator network G and a discriminator network D, where G is trained to fool D, while D is trained to judge whether a given image is real or fake. Concurrently to GANs, Variational Autoencoders (VAEs) [32, 57] have introduced a likelihood-based approach to image generation. Other likelihood-based models include autoregressive models [58, 43, 13, 8] and diffusion models [11, 21, 20]. While the former model image pixels as a sequence with autoregressive dependency between each pixel, the latter synthesize images via a gradual denoising process: sampling starts with a noisy image which is iteratively denoised until all denoising steps are performed. Applying both methods directly to the image pixel space can be challenging. Consequently, recent approaches either compress the image to a discrete representation [13, 59] via Vector Quantized (VQ) VAEs [59], or down-sample the image resolution [11, 21]. Our method is based on autoregressive modeling of a discrete image representation.
Figure 2. Qualitative comparison with previous work (rows: XMC-GAN [67], DALL-E [45], CogView [12], GLIDE [41], ours) for the prompts “a green train is coming down the tracks”, “a group of skiers are preparing to ski down a mountain”, “a small kitchen with a low ceiling”, “a group of elephants walking in muddy water”, and “a living area with a television and a table”. The text and generated images for [67, 45, 41] were taken from [41]. For CogView [12] we use the released 512 × 512 model weights, applying self-reranking of 60 for post-generation selection.

2.2. Image tokenization

Image generation models based on discrete representation [59, 45, 47, 12, 13] follow a two-stage training scheme. First, an image tokenizer is trained to extract a discrete image representation. In the second stage, a generative model generates the image in the discrete latent space. Inspired by Vector Quantization (VQ) techniques, VQ-VAE [59] learns to extract a discrete latent representation by performing online clustering. VQ-VAE-2 [47] presented a hierarchical architecture composed of VQ-VAE models operating at multiple scales, enabling faster generation compared with pixel-space generation. The DALL-E [45] text-to-image model used dVAE, which uses gumbel-softmax [26, 39], relaxing the VQ-VAE's online clustering. Recently, VQGAN [13] added adversarial and perceptual losses [68] on top of the VQ-VAE reconstruction task, producing reconstructed images with higher quality. In our work, we modify the VQGAN framework by adding perceptual losses to specific image regions, such as faces and salient objects, which further improve the fidelity of the generated images.

2.3. Image-to-image generation

Generating images from segmentation maps or scenes can be viewed as a conditional image synthesis task [71, 38, 24, 61, 62, 42]. Specifically, this form of image synthesis permits more controllability over the desired output. CycleGAN [71] trained a mapping function from one domain to the other. UNIT [38] projected two different domains into a shared latent space and used a per-domain decoder to re-synthesize images in the desired domain. Neither method requires supervision between domains. pix2pix [24] utilized conditional GANs together with a supervised reconstruction loss. pix2pixHD [62] improved the latter by increasing the output image resolution thanks to an improved network architecture. SPADE [42] introduced a spatially-adaptive normalization layer which alleviated information lost in normalization layers. [15] introduced face refinement to SPADE through a pre-trained face-embedding network inspired by face-generation methods [16]. Unlike the aforementioned, our work conditions jointly on text and segmentation, enabling bi-domain controllability.

2.4. Text-to-image generation

Text-to-image generation [64, 72, 54, 65, 67, 45, 12, 41, 70] focuses on generating images from standalone text descriptions. Preliminary text-to-image methods conditioned RNN-based DRAW [18] on text [40]. Text-conditioned GANs provided additional improvement [48]. AttnGAN [64] introduced an attention component, allowing the generator network to attend to relevant words in the text. DM-GAN [72] introduced a dynamic memory component, while DF-GAN [54] employed a fusion block, fusing text information into image features. Contrastive learning further improved the results of DM-GAN [65], while XMC-GAN [67] used contrastive learning to maximize the mutual information between image and text.

DALL-E [45] and CogView [12] trained an autoregressive transformer [60] on text and image tokens, demonstrating convincing zero-shot capabilities on the MS-COCO dataset. GLIDE [41] used diffusion models conditioned on text; inspired by the high quality of unconditional image generation models, GLIDE employed guided inference with and without a classifier network to generate high-fidelity images. LAFITE [70] employed a pre-trained CLIP [44] model to project text and images to the same latent space, training text-to-image models without text data. Similarly to DALL-E and CogView, we train an autoregressive transformer model on text and image tokens. Our main contributions are introducing additional controlling elements in the form of a scene, improving the tokenization process, and adapting classifier-free guidance to transformers.

3. Method

Our model generates an image given a text input and an optional scene layout (segmentation map). As demonstrated in our experiments, by conditioning over the scene layout, our method provides a new form of implicit controllability, improves structural consistency and quality, and adheres to human preference (as assessed by our human evaluation study). In addition to our scene-based approach, we extend our aspiration of improving the general and perceived quality with a better representation of the token space. We introduce several modifications to the tokenization process, emphasizing awareness of aspects with increased importance in the human perspective, such as faces and salient objects. To refrain from post-generation filtering and further improve the generation quality and text alignment, we employ classifier-free guidance.

We follow next with a detailed overview of the proposed method, comprised of (i) scene representation and tokenization, (ii) attending to human preference in the token space with explicit losses, (iii) the scene-based transformer, and (iv) transformer classifier-free guidance. Aspects commonly used prior to this method are not extensively detailed below, whereas specific settings for all elements can be found in the appendix.

3.1. Scene representation and tokenization

The scene is composed of a union of three complementary semantic segmentation groups: panoptic, human, and face. By combining the three extracted semantic segmentation groups, the network learns to both generate the semantic layout and condition on it while generating the final image. The semantic layout provides additional global context in an implicit form that correlates with human preference, as the choice of categories within the scene groups, and the choice of the groups themselves, are a prior to human preference and awareness. We consider this form of conditioning to be implicit, as the network may disregard any scene information and generate the image conditioned solely on text. Our experiments indicate that both the text and scene firmly control the image.
Figure 3. Overcoming out-of-distribution text prompts with scene control (rows: GLIDE [41], ours). By introducing simple scene sketches (bottom right) as additional inputs, our method is able to overcome unusual objects and scenarios presented as failure cases in previous methods.

Figure 4. Generating images through edited scenes. For an input text (a) and the segmentations extracted from an input image (b), we can re-generate the image (c) or edit the segmentations (d) by replacing classes (top) or adding classes (bottom), generating images with new context or content (e).

In order to create the scene token space, we employ VQ-SEG: a modified VQ-VAE for semantic segmentation, building on the VQ-VAE suggested for semantic segmentation in [13]. In our implementation, the inputs and outputs of VQ-SEG are m channels, representing the number of classes for all semantic segmentation groups, m = m_p + m_h + m_f + 1, where m_p, m_h, m_f are the number of categories for the panoptic segmentation [63], human segmentation [35], and face segmentation extracted with [5], respectively. The additional channel is a map of the edges separating the different classes and instances. The edge channel provides both separation for adjacent instances of the same class and emphasis on scarce classes with high importance, as edges (perimeter) are less biased towards larger categories than pixels (area).
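To make this layout concrete, the following is a minimal PyTorch-style sketch of assembling the m-channel VQ-SEG input from the three segmentation groups and the edge map. The helper name, the one-hot layout, and the use of the Appendix A.1 category counts are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Assumed category counts (values listed in Appendix A.1).
M_PANOPTIC, M_HUMAN, M_FACE = 133, 20, 5
M_TOTAL = M_PANOPTIC + M_HUMAN + M_FACE + 1  # +1 for the edge channel

def build_scene_tensor(panoptic_ids, human_ids, face_ids, edge_map):
    """Stack the three one-hot segmentation groups and the edge map into a
    single (M_TOTAL, H, W) scene tensor, the assumed VQ-SEG input/output shape.

    panoptic_ids, human_ids, face_ids: (H, W) integer class maps.
    edge_map: (H, W) binary map marking class/instance boundaries.
    """
    one_hot = lambda ids, m: F.one_hot(ids.long(), m).permute(2, 0, 1).float()
    scene = torch.cat([
        one_hot(panoptic_ids, M_PANOPTIC),
        one_hot(human_ids, M_HUMAN),
        one_hot(face_ids, M_FACE),
        edge_map.float().unsqueeze(0),  # edge channel
    ], dim=0)
    assert scene.shape[0] == M_TOTAL
    return scene
```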
Figure 5. Generating new image interpretations through text editing and anchor scenes. For an input text (a) and image (b), we first extract the semantic segmentation (c); we can then re-generate new images (d) given the input segmentation and edited text. Purple denotes text added to or replacing the original text.

3.2. Adhering to human emphasis in the token space

We observe an inherent upper-bound on image quality when generating images with the transformer, stemming from the tokenization reconstruction method. In other words, quality limitations of the VQ image reconstruction method inherently transfer to quality limitations on images generated by the transformer. To that end, we introduce several modifications to both the segmentation and image reconstruction methods. These modifications are losses in the form of emphasis (specific region awareness) and perceptual knowledge (feature-matching over task-specific pre-trained networks).

3.3. Face-aware vector quantization

While using a scene as an additional form of conditioning provides an implicit prior for human preference, we institute explicit emphasis in the form of additional losses, explicitly targeted at specific image regions.

We employ a feature-matching loss over the activations of a pre-trained face-embedding network, introducing “awareness” of face regions and additional perceptual information, motivating high-quality face reconstruction.

Before training the face-aware VQ (denoted as VQ-IMG), faces are located using the semantic segmentation information extracted for VQ-SEG. The face locations are then used during the face-aware VQ training stage, running up to k_f faces per image from the ground-truth and reconstructed images through the face-embedding network. The face loss can then be formulated as follows:

\mathcal{L}_{Face} = \sum_{k}\sum_{l} \alpha_f^l \left\| FE_l(\hat{c}_k^f) - FE_l(c_k^f) \right\|,    (1)

where the index l denotes the size of the spatial activation at specific layers of the face-embedding network FE [6], the summation runs over the last layers of each block of size 112 × 112, 56 × 56, 28 × 28, 7 × 7, 1 × 1 (1 × 1 being the size of the top-most block), ĉ_k^f and c_k^f are respectively the reconstructed and ground-truth face crops k out of the k_f faces in an image, α_f^l is a per-layer normalizing hyperparameter, and L_Face is the face loss added to the VQGAN losses defined by [13].

3.4. Face emphasis in the scene space

While training the VQ-SEG network, we observe a frequent reduction of the semantic segmentations representing the face parts (such as the eyes, nose, lips, and eyebrows) in the reconstructed scene. This effect is not surprising given the relatively small number of pixels that each face part accounts for in the scene space. A straightforward solution would be to employ a loss more suitable for class imbalance, such as focal loss [36]. However, we do not aspire to increase the importance of classes that are both scarce and of less importance, such as fruit or a toothbrush. Instead, we (1) employ a weighted binary cross-entropy face loss over the segmentation face-parts classes, emphasizing higher importance for face parts, and (2) include the face-parts edges as part of the semantic segmentation edge map mentioned above. The weighted binary cross-entropy loss can then be formulated as follows:

\mathcal{L}_{WBCE} = \alpha_{cat}\, BCE(s, \hat{s}),    (2)

where s and ŝ are the input and reconstructed segmentation maps respectively, α_cat is a per-category weight function, BCE is a binary cross-entropy loss, and L_WBCE is the weighted binary cross-entropy loss added to the conditional VQ-VAE losses defined by [13].
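The two face-oriented losses of Eqs. 1-2 can be sketched as follows; the face_embedder interface, the choice of an L1 feature distance, and the tensor layouts are assumptions for illustration, not the released code.

```python
import torch
import torch.nn.functional as F

def face_feature_matching_loss(face_embedder, recon_crops, gt_crops, alpha):
    """Sketch of Eq. 1: feature matching over face-embedding activations.

    recon_crops, gt_crops: (K, 3, H, W) reconstructed / ground-truth face crops.
    face_embedder(x): assumed to return a list of activations, one per block.
    alpha: list of per-layer normalizing weights (alpha_f^l).
    """
    loss = 0.0
    feats_hat, feats = face_embedder(recon_crops), face_embedder(gt_crops)
    for a, f_hat, f in zip(alpha, feats_hat, feats):
        loss = loss + a * (f_hat - f).abs().mean()  # || FE_l(c_hat) - FE_l(c) ||
    return loss

def weighted_bce_scene_loss(recon_logits, target, alpha_cat):
    """Sketch of Eq. 2: per-category weighted binary cross-entropy over the
    reconstructed scene maps, up-weighting the face-part channels.

    recon_logits, target: (B, m, H, W) reconstructed logits / binary targets.
    alpha_cat: (m,) per-category weights (e.g. 20 for face parts, 1 otherwise,
    as in Eq. 4 of the appendix).
    """
    bce = F.binary_cross_entropy_with_logits(recon_logits, target, reduction="none")
    return (alpha_cat.view(1, -1, 1, 1) * bce).mean()
```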
3.5. Object-aware vector quantization

We generalize and extend the face-aware VQ method to increase awareness and perceptual knowledge of objects defined as “things” in the panoptic segmentation categories. Rather than a specialized face-embedding network, we employ a pre-trained VGG [52] network trained on ImageNet [33], and introduce a feature-matching loss representing the perceptual differences between the object crops of the reconstructed and ground-truth images. By running the feature-matching over image crops, we are able to increase the output image resolution from 256 × 256 by simply adding to VQ-IMG an additional down-sample and up-sample layer to the encoder and decoder, respectively. Similarly to Eq. 1, the loss can be formulated as:

\mathcal{L}_{Obj} = \sum_{k}\sum_{l} \alpha_o^l \left\| VGG_l(\hat{c}_k^o) - VGG_l(c_k^o) \right\|,    (3)

where ĉ_k^o and c_k^o are the reconstructed and input object crops respectively, VGG_l are the activations of the l-th layer of the pre-trained VGG network, α_o^l is a per-layer normalizing hyperparameter, and L_Obj is the object-aware loss added to the VQ-IMG losses defined in Eq. 1.
3.6. Scene-based transformer

The method relies on an autoregressive transformer with three independent consecutive token spaces: text, scene, and image, as depicted in Fig. 6. The token sequence is comprised of n_x text tokens encoded by a BPE [50] encoder, followed by n_y scene tokens encoded by VQ-SEG, and n_z image tokens encoded or decoded by VQ-IMG.

Prior to training the scene-based transformer, each encoded token sequence corresponding to a [text, scene, image] triplet is extracted using the corresponding encoder, producing a sequence that consists of:

t_x, t_y, t_z = BPE(i_x), \mathrm{VQ\text{-}SEG}(i_y), \mathrm{VQ\text{-}IMG}(i_z), \qquad t = [t_x, t_y, t_z],

where i_x, i_y, i_z are the input text, scene and image respectively, i_x ∈ N^{d_x}, d_x is the length of the input text sequence, i_y ∈ R^{h_y × w_y × m}, i_z ∈ R^{h_z × w_z × 3}, h_y, w_y, h_z, w_z are the height and width dimensions of the scene and image inputs respectively, BPE is the Byte Pair Encoding encoder, t_x, t_y, t_z are the text, scene and image input tokens respectively, and t is the complete token sequence.

Figure 6. The scene-based method high-level architecture. Given an input text and optional scene layout, a corresponding image is generated. The transformer generates the relevant tokens, which are encoded and decoded by the corresponding networks.

3.7. Transformer classifier-free guidance

Inspired by the high fidelity of unconditional image generation models, we employ classifier-free guidance [9, 22, 44]. Classifier-free guidance is the process of guiding an unconditional sample in the direction of a conditional sample. To support unconditional sampling, we fine-tune the transformer while randomly replacing the text prompt with padding tokens with a probability p_CF. During inference, we generate two parallel token streams: a conditional token stream conditioned on text, and an unconditional token stream conditioned on an empty text stream initialized with padding tokens. For transformers, we apply classifier-free guidance on logit scores:

\mathrm{logits}_{cond} = T(t_y, t_z \mid t_x),
\mathrm{logits}_{uncond} = T(t_y, t_z \mid \varnothing),
\mathrm{logits}_{cf} = \mathrm{logits}_{uncond} + \alpha_c \cdot (\mathrm{logits}_{cond} - \mathrm{logits}_{uncond}),

where ∅ is the empty text stream, logits_cond are the logit scores output by the conditional token stream, logits_uncond are the logit scores output by the unconditional token stream, α_c is the guidance scale, logits_cf are the guided logit scores used to sample the next scene or image token, and T is an autoregressive transformer based on the GPT-3 [4] architecture. Note that since we use an autoregressive transformer, we use logits_cf to sample once and feed the same token (image or scene) to both the conditional and unconditional streams.
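A condensed sketch of the inference procedure described in Secs. 3.6-3.7, with the sampling details from Appendix A.3, is shown below. The encoder and transformer interfaces, the pad_id, and the single-example (unbatched) layout are assumptions for illustration; only the guidance formula, the token ordering, and the top-half/softmax/multinomial sampling follow the text.

```python
import torch

@torch.no_grad()
def guided_generation(transformer, bpe, vq_seg, vq_img, text, scene,
                      alpha_c=5.0, num_image_tokens=1024, pad_id=0):
    """Sketch: scene-based generation with transformer classifier-free guidance.

    `transformer(tokens)` is assumed to return vocab-sized next-token logits for
    the last position; the scene tokens could equally be generated by the
    transformer itself instead of being encoded from an input layout.
    """
    t_x = bpe.encode(text)                 # n_x text tokens (1-D LongTensor)
    t_y = vq_seg.encode(scene)             # n_y scene tokens
    t_pad = torch.full_like(t_x, pad_id)   # "empty" text stream (padding tokens)
    cond = torch.cat([t_x, t_y])           # conditional stream
    uncond = torch.cat([t_pad, t_y])       # unconditional stream

    image_tokens = []
    for _ in range(num_image_tokens):
        logits_cond = transformer(cond)
        logits_uncond = transformer(uncond)
        logits_cf = logits_uncond + alpha_c * (logits_cond - logits_uncond)

        # Appendix A.3: keep the top half of the logits, softmax, sample once.
        top_val, top_idx = logits_cf.topk(logits_cf.shape[-1] // 2, dim=-1)
        choice = torch.multinomial(torch.softmax(top_val, dim=-1), 1)
        next_token = top_idx[choice]

        # The same sampled token is fed to both streams.
        cond = torch.cat([cond, next_token])
        uncond = torch.cat([uncond, next_token])
        image_tokens.append(next_token)

    return vq_img.decode(torch.cat(image_tokens))  # decode image tokens to pixels
```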
4. Experiments

Our model achieves state-of-the-art results in human-based and numerical metric comparisons. Samples supporting the qualitative advantage are provided in Fig. 2. Additionally, we demonstrate new creative capabilities possible with this method's new form of controllability. Finally, to better assess the effect of each contribution, an ablation study is provided.

Experiments were performed with a 4 billion parameter transformer, generating a sequence of 256 text tokens, 256 scene tokens, and 1024 image tokens, which are then decoded into an image with a resolution of 256 × 256 or 512 × 512 pixels (depending on the model of choice).

4.1. Datasets

The scene-based transformer is trained on a union of CC12m [7], CC [51], and subsets of YFCC100m [55] and Redcaps [10], amounting to 35m text-image pairs. MS-COCO [37] is used unless otherwise specified. VQ-SEG and VQ-IMG are trained on CC12m, CC, and MS-COCO.

4.2. Metrics

The goal of text-to-image generation is to generate high-quality and text-aligned images from a human perspective. Different metrics have been suggested to mimic the human perspective, where some are considered more reliable than others. We consider human evaluation the highest authority when evaluating image quality and text alignment, and rely on FID [19] to increase evaluation confidence and handle cases where human evaluation is not applicable. We do not use IS [49], as it has been noted to be insufficient for model evaluation [2].

4.3. Comparison with previous work

The task of text-to-image generation does not contain absolute ground-truths, as a specific text description could apply to multiple images and vice versa. This constrains evaluation metrics to evaluate distributions of images rather than specific images; thus we employ FID [19] as our secondary metric.

4.4. Baselines

We compare our results with several state-of-the-art methods using the FID metric and human evaluators (AMT) when possible. DALL-E [45] provides strong zero-shot capabilities, similarly employing an autoregressive transformer with VQ-VAE tokenization. We train a re-implementation of DALL-E with 4B parameters to enable human evaluation and fairly compare both methods employing an identical VQ method (VQGAN). GLIDE [41] demonstrates vastly improved results over DALL-E, adopting a diffusion-based [53] approach with classifier-free guidance [22]. We additionally provide an FID comparison with CogView [12], LAFITE [70], XMC-GAN [67], DM-GAN(+CL) [65], DF-GAN [54], DM-GAN [72], and AttnGAN [64].

4.5. Human evaluation results

Human evaluation with previous methods is provided in Tab. 1. In each instance, human evaluators are required to choose between two images generated by the two models being compared. The two models are compared in three aspects: (i) image quality, (ii) photorealism (which image appears more real), and (iii) text alignment (which image best matches the text). Each question is surveyed using 500 image pairs, where 5 different evaluators answer each question, amounting to 2500 instances per question for a given comparison. We compare our 256 × 256 model with our re-implementation of DALL-E [45] and CogView's [12] 256 × 256 model. CogView's 512 × 512 model is compared with our corresponding model. Results are presented as a percentage of majority votes in favor of our method when comparing between a certain model and ours. Compared with the three methods, ours achieves significantly higher favorability in all aspects.

4.6. FID comparison

FID is calculated over a subset of 30k images generated from the MS-COCO validation set text prompts with no re-ranking, and provided in Tab. 1. The evaluated models are divided into two groups: trained with and without (denoted as filtered) the MS-COCO training set. In both scenarios our model achieves the lowest FID. In addition, we provide a loose practical lower-bound (denoted as ground-truth), calculated between the training and validation subsets of MS-COCO. As FID results are approaching small numbers, it is interesting to get an idea of a possible practical lower-bound.

4.7. Generating out of distribution

Methods that rely on text inputs only are more confined to generate within the training distribution, as demonstrated by [41]. Unusual objects and scenarios can be challenging to generate, as certain objects are strongly correlated with specific structures, such as cats with four legs, or cars with round wheels. The same is true for scenarios: “a mouse hunting a lion” is most likely not a scenario easily found within the dataset. By conditioning on scenes in the form of simple sketches, we are able to attend to these uncommon objects and scenarios, as demonstrated in Fig. 3, despite the fact that some objects do not exist as categories in our scene (mouse, lion). We solve the category gap by using categories that may be close in certain aspects (elephant instead of mouse, cat instead of lion). In practice, for non-existent categories, several categories could be used instead.

4.8. Scene controllability

Samples are provided in Figs. 1, 3, 4, 5 and in the appendix with both our 256 × 256 and 512 × 512 models. In addition to generating high fidelity images from text only, we demonstrate the applicability of scene-wise image control and maintaining consistency between generations.
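The FID protocol used above (30k images generated from MS-COCO validation prompts, no re-ranking) can be approximated with off-the-shelf tooling. The sketch below assumes the torchmetrics FrechetInceptionDistance metric and hypothetical data-loading and generation helpers (coco_val_loader, coco_val_prompts, model.generate); it is not the evaluation code used in the paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# feature=2048 selects the standard Inception-v3 pool features used for FID.
fid = FrechetInceptionDistance(feature=2048, normalize=True)

for real_batch in coco_val_loader:                 # hypothetical MS-COCO val loader
    fid.update(real_batch, real=True)              # float images in [0, 1], (N, 3, H, W)

for prompts in coco_val_prompts(batch_size=32):    # hypothetical prompt iterator
    fake_batch = model.generate(prompts)           # hypothetical text-to-image call
    fid.update(fake_batch.clamp(0, 1), real=False)

print(f"FID: {fid.compute().item():.2f}")
```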
Model           | FID↓  | FID↓ (filt.) | Image quality | Photorealism | Text alignment
----------------|-------|--------------|---------------|--------------|---------------
AttnGAN [64]    | 35.49 | -            | -             | -            | -
DM-GAN [72]     | 32.64 | -            | -             | -            | -
DF-GAN [54]     | 21.42 | -            | -             | -            | -
DM-GAN+CL [65]  | 20.79 | -            | -             | -            | -
XMC-GAN [67]    | 9.33  | -            | -             | -            | -
DALL-E [45]     | -     | 34.60        | 81.8%         | 81.0%        | 65.9%
CogView256 [12] | -     | 32.20        | 92.2%         | 94.2%        | 92.2%
CogView512 [12] | -     | 36.53        | 91.1%         | 88.2%        | 87.8%
LAFITE [70]     | 8.12  | 26.94        | -             | -            | -
GLIDE [41]      | -     | 12.24        | -             | -            | -
Ours256         | 7.55  | 11.84        | -             | -            | -
Ground-truth    | 2.47  | -            | -             | -            | -

Table 1. Comparison with previous work (FID and human preference). FID is calculated over a subset of 30k images generated from the MS-COCO validation set text prompts. When possible, we include models trained with and without (filtered) the MS-COCO training set. In both scenarios our model achieves state-of-the-art results, correlating with visual samples and human evaluation. We add a loose practical lower-bound (denoted as ground-truth), calculated between the training and validation subsets of MS-COCO. Human evaluation is shown as a percentage of majority votes in favor of our method when comparing between a certain model and ours.

Model                 | FID↓  | Image quality | Photorealism | Text alignment
----------------------|-------|---------------|--------------|---------------
Base                  | 18.01 | -             | -            | -
+Scene tokens         | 19.16 | 57.3%         | 65.3%        | 58.3%
+Face-aware           | 14.45 | 63.6%         | 59.8%        | 57.4%
+CF                   | 7.55  | 76.8%         | 66.8%        | 66.8%
+Obj-aware512         | 8.70  | 62.0%         | 53.5%        | 52.2%
+CF with scene input  | 4.69  | -             | -            | -

Table 2. Ablation study (FID and human preference). FID is calculated over a subset of 30k images generated from the MS-COCO validation set text prompts. Human evaluation is shown as a percentage of majority votes in favor of the added element compared to the previous model.

4.9. Scene editing and anchoring

Rather than editing certain regions of images as demonstrated by [45], we introduce new capabilities of generating images from existing or edited scenes. In Fig. 4, two scenarios are considered. In both scenarios the semantic segmentation is extracted from an input image, and used to re-generate an image conditioned on the input text. In the top row, the scene is edited, replacing the 'sky' and 'tree' categories with 'sea', and the 'grass' category with 'sand', resulting in a generated image adhering to the new scene. A simple sketch of a giant dog is added to the scene in the bottom row, resulting in a generated image corresponding to the new scene without any change in text.

Fig. 5 demonstrates the ability to generate new interpretations of existing images and scenes. After extracting the semantic segmentation from a given image, we re-generate the image conditioned on the input scene and edited text.

4.10. Storytelling through controllability

To demonstrate the applicability of harnessing scene control for story illustrations, we wrote a children's story and illustrated it using our method. The main advantages of using simple sketches as additional inputs in this case are (i) that authors can translate their ideas into paintings or realistic images, while being less susceptible to the “randomness” of text-to-image generation, and (ii) improved consistency between generations. We provide a short video of the story and process.

4.11. Ablation study

An ablation study of human preference and FID is provided in Tab. 2 to assess the effectiveness of our different contributions. Settings in both studies are similar to the comparison made with previous work (Sec. 4.3). Each row corresponds to a model trained with the additional element, compared with the model without that specific addition for human preference. We note that while the lowest FID is attained by the 256 × 256 model, human preference favors the 512 × 512 model with object-aware training, particularly in quality. Furthermore, we re-examine the FID of the best model, where the scene is given as an additional input, to gain a better notion of the gap from the lower-bound (Tab. 2).

5. Conclusion

The text-to-image domain has witnessed a plethora of novel methods aimed at improving the general quality and adherence to text of generated images. While some methods propose image editing techniques, progress is not often directed towards enabling new forms of human creativity and experiences. We attempt to progress text-to-image generation towards a more interactive experience, where people can perceive more control over the generated outputs, thus enabling real-world applications such as storytelling. In addition to improving the general image quality, we focus on improving key image aspects we deem significant in human perception, such as faces and salient objects, resulting in higher favorability of our method in human evaluations and objective metrics.
References IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 882–891, 2021. 4
[1] Speakers Give Sound Advice. Syracuse post standard. [16] Oran Gafni, Lior Wolf, and Yaniv Taigman. Live face de-
March, 28:18, 1911. 1 identification in video. In Proceedings of the IEEE/CVF
[2] Shane Barratt and Rishi Sharma. A note on the inception International Conference on Computer Vision, pages 9378–
score. arXiv preprint arXiv:1801.01973, 2018. 8 9387, 2019. 4, 13
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
scale gan training for high fidelity natural image synthesis. Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
arXiv preprint arXiv:1809.11096, 2018. 2 Yoshua Bengio. Generative adversarial nets. Advances in
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- neural information processing systems, 27, 2014. 2
biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- [18] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende,
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- and Daan Wierstra. Draw: A recurrent neural network for
guage models are few-shot learners. Advances in neural in- image generation. In International Conference on Machine
formation processing systems, 33:1877–1901, 2020. 7 Learning, pages 1462–1471. PMLR, 2015. 4
[5] Adrian Bulat and Georgios Tzimiropoulos. How far are we [19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
from solving the 2d & 3d face alignment problem? (and a Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
dataset of 230,000 3d facial landmarks). In International two time-scale update rule converge to a local nash equilib-
Conference on Computer Vision, 2017. 5 rium. Advances in neural information processing systems,
[6] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and An- 30, 2017. 8
drew Zisserman. VGGFace2: A dataset for recognising faces [20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
across pose and age. arXiv preprint arXiv:1710.08092, 2017. sion probabilistic models. Advances in Neural Information
6 Processing Systems, 33:6840–6851, 2020. 2
[7] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu [21] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet,
Soricut. Conceptual 12m: Pushing web-scale image-text pre- Mohammad Norouzi, and Tim Salimans. Cascaded diffu-
training to recognize long-tail visual concepts. In Proceed- sion models for high fidelity image generation. Journal of
ings of the IEEE/CVF Conference on Computer Vision and Machine Learning Research, 23(47):1–33, 2022. 2
Pattern Recognition, pages 3558–3568, 2021. 8 [22] Jonathan Ho and Tim Salimans. Classifier-free diffusion
[8] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- guidance. In NeurIPS 2021 Workshop on Deep Generative
woo Jun, David Luan, and Ilya Sutskever. Generative pre- Models and Downstream Applications, 2021. 2, 7, 8
training from pixels. In Hal Daumé III and Aarti Singh, [23] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and
editors, Proceedings of the 37th International Conference Honglak Lee. Inferring semantic layout for hierarchical text-
on Machine Learning, volume 119 of Proceedings of Ma- to-image synthesis. In Proceedings of the IEEE conference
chine Learning Research, pages 1691–1703. PMLR, 13–18 on computer vision and pattern recognition, pages 7986–
Jul 2020. 2 7994, 2018. 2
[24] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A
[9] Katherine Crowson. Classifier Free Guidance for Auto-
Efros. Image-to-image translation with conditional adver-
regressive Transformers, 2021b. 7
sarial networks. In Proceedings of the IEEE conference on
[10] Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin John-
computer vision and pattern recognition, pages 1125–1134,
son. Redcaps: Web-curated image-text data created by the
2017. 2, 4
people, for the people. arXiv preprint arXiv:2111.11431,
[25] Turgenev Ivan. Fathers and Sons. Pandora’s Box, 2017. 1
2021. 8
[26] Eric Jang, Shixiang Gu, and Ben Poole. Categorical
[11] Prafulla Dhariwal and Alexander Nichol. Diffusion models reparameterization with gumbel-softmax. arXiv preprint
beat gans on image synthesis. Advances in Neural Informa- arXiv:1611.01144, 2016. 4
tion Processing Systems, 34, 2021. 2
[27] Horst Woldemar Janson, Anthony F Janson, and Max Mar-
[12] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, mor. History of art. Thames and Hudson London, 1991.
Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, 1
Hongxia Yang, et al. Cogview: Mastering text-to-image gen- [28] Tilke Judd, Frédo Durand, and Antonio Torralba. A bench-
eration via transformers. Advances in Neural Information mark of computational models of saliency to predict human
Processing Systems, 34, 2021. 2, 3, 4, 8, 9 fixations, 2012. 2
[13] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming [29] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen,
transformers for high-resolution image synthesis. In Pro- Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free
ceedings of the IEEE/CVF Conference on Computer Vision generative adversarial networks. Advances in Neural Infor-
and Pattern Recognition, pages 12873–12883, 2021. 2, 3, 4, mation Processing Systems, 34, 2021. 2
5, 6, 7, 13 [30] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,
[14] Marie-Madeleine Fourcade. L’Arche de Noé: réseau Al- Jaakko Lehtinen, and Timo Aila. Analyzing and improv-
liance, 1940-1945. Plon, 1968. 1 ing the image quality of stylegan. In Proceedings of
[15] Oran Gafni, Oron Ashual, and Lior Wolf. Single- the IEEE/CVF conference on computer vision and pattern
shot freestyle dance reenactment. In Proceedings of the recognition, pages 8110–8119, 2020. 2

[31] Diederik P Kingma and Jimmy Ba. Adam: A method for Zero-shot text-to-image generation. In International Confer-
stochastic optimization. arXiv preprint arXiv:1412.6980, ence on Machine Learning, pages 8821–8831. PMLR, 2021.
2014. 13 2, 3, 4, 8, 9
[32] Diederik P Kingma and Max Welling. Auto-encoding varia- [46] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,
tional bayes. arXiv preprint arXiv:1312.6114, 2013. 2 Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Zero-shot text-to-image generation (ICML spotlight), 2021.
Imagenet classification with deep convolutional neural net- 2
works. Advances in neural information processing systems, [47] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Gener-
25, 2012. 7 ating diverse high-fidelity images with vq-vae-2. Advances
in neural information processing systems, 32, 2019. 3
[34] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip
[48] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Lo-
Torr. Controllable text-to-image generation. Advances in
geswaran, Bernt Schiele, and Honglak Lee. Generative ad-
Neural Information Processing Systems, 32, 2019. 2
versarial text to image synthesis. In International conference
[35] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self- on machine learning, pages 1060–1069. PMLR, 2016. 4
correction for human parsing. IEEE Transactions on Pattern
[49] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki
Analysis and Machine Intelligence, 2020. 5
Cheung, Alec Radford, and Xi Chen. Improved techniques
[36] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and for training gans. Advances in neural information processing
Piotr Dollár. Focal loss for dense object detection. In Pro- systems, 29, 2016. 8
ceedings of the IEEE international conference on computer [50] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural
vision, pages 2980–2988, 2017. 6 machine translation of rare words with subword units. arXiv
[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, preprint arXiv:1508.07909, 2015. 7
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence [51] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu
Zitnick. Microsoft coco: Common objects in context. In Soricut. Conceptual captions: A cleaned, hypernymed, im-
European conference on computer vision, pages 740–755. age alt-text dataset for automatic image captioning. In Pro-
Springer, 2014. 8 ceedings of the 56th Annual Meeting of the Association for
[38] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised Computational Linguistics (Volume 1: Long Papers), pages
image-to-image translation networks. Advances in neural in- 2556–2565, 2018. 8
formation processing systems, 30, 2017. 4 [52] Karen Simonyan and Andrew Zisserman. Very deep convo-
[39] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The lutional networks for large-scale image recognition. arXiv
concrete distribution: A continuous relaxation of discrete preprint arXiv:1409.1556, 2014. 7
random variables. arXiv preprint arXiv:1611.00712, 2016. [53] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
4 and Surya Ganguli. Deep unsupervised learning using
[40] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Rus- nonequilibrium thermodynamics. In International Confer-
lan Salakhutdinov. Generating images from captions with ence on Machine Learning, pages 2256–2265. PMLR, 2015.
attention. arXiv preprint arXiv:1511.02793, 2015. 4 2, 8
[54] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan
[41] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav
Jing, Fei Wu, and Bingkun Bao. Df-gan: Deep fusion gener-
Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and
ative adversarial networks for text-to-image synthesis. arXiv
Mark Chen. Glide: Towards photorealistic image generation
preprint arXiv:2008.05865, 2020. 4, 8, 9
and editing with text-guided diffusion models. arXiv preprint
[55] Bart Thomee, David A Shamma, Gerald Friedland, Ben-
arXiv:2112.10741, 2021. 2, 3, 4, 5, 8, 9
jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and
[42] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Li-Jia Li. Yfcc100m: The new data in multimedia research.
Zhu. Semantic image synthesis with spatially-adaptive nor- Communications of the ACM, 59(2):64–73, 2016. 8
malization. In Proceedings of the IEEE/CVF conference on
[56] Hung-Yu Tseng, Lu Jiang, Ce Liu, Ming-Hsuan Yang, and
computer vision and pattern recognition, pages 2337–2346,
Weilong Yang. Regularizing generative adversarial networks
2019. 2, 4
under limited data. In Proceedings of the IEEE/CVF Con-
[43] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz ference on Computer Vision and Pattern Recognition, pages
Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Im- 7921–7931, 2021. 2
age transformer. In International Conference on Machine [57] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical vari-
Learning, pages 4055–4064. PMLR, 2018. 2 ational autoencoder. Advances in Neural Information Pro-
[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya cessing Systems, 33:19667–19679, 2020. 2
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, [58] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- Oriol Vinyals, Alex Graves, et al. Conditional image genera-
ing transferable visual models from natural language super- tion with pixelcnn decoders. Advances in neural information
vision. In International Conference on Machine Learning, processing systems, 29, 2016. 2
pages 8748–8763. PMLR, 2021. 2, 4, 7 [59] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete
[45] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, representation learning. Advances in neural information pro-
Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. cessing systems, 30, 2017. 2, 3

[60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- ference on Computer Vision and Pattern Recognition, pages
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia 5802–5810, 2019. 4, 8, 9
Polosukhin. Attention is all you need. Advances in neural
information processing systems, 30, 2017. 4
[61] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu,
Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-
video synthesis. arXiv preprint arXiv:1808.06601, 2018. 4
[62] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao,
Jan Kautz, and Bryan Catanzaro. High-resolution image syn-
thesis and semantic manipulation with conditional gans. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 8798–8807, 2018. 2, 4
[63] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen
Lo, and Ross Girshick. Detectron2. https://github.
com/facebookresearch/detectron2, 2019. 5
[64] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang,
Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-
grained text to image generation with attentional generative
adversarial networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 1316–
1324, 2018. 4, 8, 9
[65] Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderra-
man, and Shihao Ji. Improving text-to-image synthesis us-
ing contrastive learning. arXiv preprint arXiv:2107.02423,
2021. 4, 8, 9
[66] Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J Zelin-
sky, and Tamara L Berg. Studying relationships between
human gaze, description, and computer vision. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 739–746, 2013. 2
[67] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and
Yinfei Yang. Cross-modal contrastive learning for text-to-
image generation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
833–842, 2021. 3, 4, 8, 9
[68] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht-
man, and Oliver Wang. The unreasonable effectiveness of
deep features as a perceptual metric. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 586–595, 2018. 4, 13
[69] Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li,
Ming Ding, Jie Tang, Jingren Zhou, and Hongxia Yang. M6-
ufc: Unifying multi-modal controls for conditional image
synthesis. arXiv preprint arXiv:2105.14211, 2021. 2
[70] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li,
Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and
Tong Sun. Lafite: Towards language-free training for text-to-
image generation. arXiv preprint arXiv:2111.13792, 2021.
4, 8, 9
[71] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. In Proceedings of the IEEE
international conference on computer vision, pages 2223–
2232, 2017. 4
[72] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-gan:
Dynamic memory generative adversarial networks for text-
to-image synthesis. In Proceedings of the IEEE/CVF Con-

A. Additional implementation details

A.1. VQ-SEG

VQ-SEG is trained for 600k iterations, with a batch size of 48 and a dictionary size of 1024. The number of segmentation categories per group are m_p = 133 for the panoptic segmentation, m_h = 20 for the human parsing, and m_f = 5 for the face parsing. The per-category weight function follows the notation:

\alpha_{cat} = \begin{cases} 20, & \text{if } cat \in [154, \ldots, 158] \\ 1, & \text{otherwise,} \end{cases}    (4)

where cat ∈ [154, ..., 158] are the face-parts categories eyebrows, eyes, nose, outer-mouth, and inner-mouth.
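Read as code, Eq. 4 is a small per-category lookup; the snippet below assumes the face-part classes occupy indices 154-158 of the concatenated category axis, as stated above.

```python
import torch

FACE_PART_IDS = range(154, 159)  # eyebrows, eyes, nose, outer-mouth, inner-mouth

def alpha_cat(num_categories: int) -> torch.Tensor:
    """Per-category weights of Eq. 4: 20 for face-part classes, 1 otherwise."""
    weights = torch.ones(num_categories)
    weights[list(FACE_PART_IDS)] = 20.0
    return weights
```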
A.2. VQ-IMG

VQ-IMG256 and VQ-IMG512 are trained for 800k and 940k iterations respectively, with batch sizes of 192 and 128 and channel multipliers of [1, 1, 2, 4] and [1, 1, 2, 4, 4], while both are trained with a dictionary size of 8192. The per-layer normalizing hyperparameters for the face-aware loss are α_f^l = [α_f1, α_f2 × 0.01, α_f2 × 0.1, α_f2 × 0.2, α_f2 × 0.02], corresponding to the last layer of each block of size 1 × 1, 7 × 7, 28 × 28, 56 × 56, 128 × 128, where α_f1 = 0.1 and α_f2 = 0.25. We experimented with two settings: the first where α_f1 = α_f2 = 1.0, and the second, which was used to train the final models, where α_f1 = 0.1, α_f2 = 0.25. The remaining face-loss values were taken from the work of [16]. The per-layer normalizing hyperparameters for the object-aware loss, α_o^l, were taken from the work of [13], based on LPIPS [68].

A.3. Scene-based transformer

The 512 × 512 and 256 × 256 models share all implementation details, excluding the VQ-IMG used for token encoding and decoding, and the object-aware loss that was applied to the 512 × 512 model only. Both transformers share an architecture of 48 layers, 48 attention heads, and an embedding dimension of 2560. The models were trained for a total of 170k iterations, with a batch size of 1024 and the Adam [31] optimizer, with a starting learning rate of 4.5 × 10−4 for the first 40k iterations, transitioning to 1.5 × 10−4 for the remainder, β1 = 0.9, β2 = 0.96, weight decay of 4.5 × 10−4, and a loss ratio of 7/1 between the image and text tokens. For classifier-free guidance, we fine-tune the transformer while replacing the text tokens with padding tokens in the last 30k iterations, with a probability of p_CF = 0.2. At inference time we set the guidance scale to α_c = 5, though we found that α_c = 3 works as well. At each inference step, the next token is sampled by (i) selecting half the logits with the highest probabilities, (ii) applying a softmax operation over the selected logits, and (iii) sampling a single logit from a multinomial probability distribution.

B. Additional samples

Additional samples generated from challenging text inputs are provided in Figs. 7-8, while samples generated from text and scene inputs are provided in Figs. 9-12. The different text colors emphasize the large number of different objects/scenarios being attended to. As there are no 'octopus' or 'dinosaur' categories, we use the 'cat' and 'giraffe' categories instead, respectively. We did not attempt to use other classes in this case. However, we found that generally there are no “one-to-one” mappings between absent and existing categories, hence several categories may work for an absent category.
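For quick reference, the scene-based transformer settings from Appendix A.3 (and the token counts from Sec. 4) can be collected into a single configuration sketch; the field names are illustrative and do not correspond to the released code.

```python
# Illustrative summary of the Appendix A.3 settings (field names are assumed).
SCENE_TRANSFORMER_CONFIG = dict(
    num_layers=48,
    num_heads=48,
    embed_dim=2560,
    text_tokens=256,
    scene_tokens=256,
    image_tokens=1024,
    batch_size=1024,
    total_iters=170_000,
    optimizer="Adam",
    lr_schedule={0: 4.5e-4, 40_000: 1.5e-4},  # starting lr, reduced after 40k iterations
    betas=(0.9, 0.96),
    weight_decay=4.5e-4,
    image_to_text_loss_ratio=7.0,             # 7/1 between image and text tokens
    cf_finetune_iters=30_000,                 # last 30k iterations
    p_cf=0.2,                                 # text-dropout probability
    guidance_scale=5.0,                       # alpha_c (3.0 also works)
)
```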
Figure 7. Additional samples generated from challenging text inputs.

Figure 8. Additional samples generated from challenging text inputs.

Figure 9. Additional samples generated (b) from text and segmentation inputs (a).

Figure 10. Additional samples generated (b) from text and segmentation inputs (a).

Figure 11. Additional samples generated (b) from text and segmentation inputs (a).

Figure 12. Additional samples generated (b) from text and segmentation inputs (a).

