Generative Adversarial Transformers
Figure 2. We introduce the GANsformer network, which leverages a bipartite structure to allow long-range interactions while evading the quadratic complexity that standard transformers suffer from. We present two novel attention operations over the bipartite graph: simplex and duplex. The former permits communication in one direction only: in the generative context, from the latents to the image features. The latter enables both top-down and bottom-up connections between these two groups of variables.
Meanwhile, the NLP community has witnessed a major revolution with the advent of the Transformer network [66], a highly-adaptive architecture centered around relational attention and dynamic interaction. In response, several attempts have been made to integrate the transformer into computer vision models, but so far they have met only limited success due to scalability limitations stemming from their quadratic mode of operation.

Motivated to address these shortcomings and unlock the full potential of this promising network for the field of computer vision, we introduce the Generative Adversarial Transformer, or GANsformer for short, a simple yet effective generalization of the vanilla transformer, explored here for the task of visual synthesis. The model features a bipartite construction for computing soft attention, which iteratively aggregates and disseminates information between the generated image features and a compact set of latent variables to enable bidirectional interaction between these dual representations. This proposed design achieves a favorable balance, being capable of flexibly modeling global phenomena and long-range interactions on the one hand, while featuring an efficient setup that still scales linearly with the input size on the other. As such, the GANsformer can sidestep the computational costs and applicability constraints incurred by prior work due to the dense and potentially excessive pairwise connectivity of the standard transformer [5, 72], and successfully advance generative modeling of images and scenes.

We study the model's quantitative and qualitative behavior through a series of experiments, where it achieves state-of-the-art performance over a wide selection of datasets, of both simulated as well as real-world kinds, obtaining particularly impressive gains in generating highly-compositional multi-object scenes. The analysis we conduct indicates that the GANsformer requires fewer training steps and fewer samples than competing approaches to successfully synthesize images of high quality and diversity. Further evaluation provides robust evidence for the network's enhanced transparency and compositionality, while ablation studies empirically validate the value and effectiveness of our approach. We then present visualizations of the model's produced attention maps to shed more light upon its internal representations and visual generation process. All in all, as we will see through the rest of the paper, by bringing the renowned GANs and Transformer architectures together under one roof, we can integrate their complementary strengths, to create a strong, compositional and efficient network for visual generative modeling.

2. Related Work

Generative Adversarial Networks (GANs), originally introduced in 2014 [29], have made remarkable progress over the past few years, with significant advances in training stability and dramatic improvements in image quality and diversity, turning them into a leading paradigm in visual synthesis nowadays [5, 45, 60]. In turn, GANs have been widely adopted for a rich variety of tasks, including image-to-image translation [41, 74], super-resolution [49], style transfer [12], and representation learning [18], to name a few. But while automatically produced images for faces, single objects or natural scenery have reached astonishing fidelity, becoming nearly indistinguishable from real samples, the unconditional synthesis of more structured or compositional scenes is still lagging behind, suffering from inferior coherence, reduced geometric consistency and, at times, a lack of global coordination [1, 9, 44, 72]. As of now, faithful generation of structured scenes is thus yet to be reached.

Concurrently, the last couple of years saw impressive progress in the field of NLP, driven by the innovative architecture called the Transformer [66], which has attained substantial gains within the language domain and consequently sparked considerable interest across the deep learning community [16, 66]. In response, several attempts have been
made to incorporate self-attention constructions into vision models, most commonly for image recognition, but also in segmentation [25], detection [8], and synthesis [72]. From a structural perspective, they can be roughly divided into two streams: those that apply local attention operations, failing to capture global interactions [14, 38, 58, 59, 73], and others that borrow the original transformer structure as-is and perform attention globally across the entire image, resulting in prohibitive computation due to its quadratic complexity, which fundamentally limits its applicability to low-resolution layers only [3, 5, 19, 24, 42, 67, 72]. A few other works proposed sparse, discrete or approximated variations of self-attention, either within the adversarial or autoregressive contexts, but they still fall short of reducing memory footprint and computation costs to a sufficient degree [11, 24, 37, 39, 63].

Compared to these prior works, the GANsformer stands out as it manages to avoid the high costs incurred by self-attention, employing instead bipartite attention between the image features and a small collection of latent variables. Its design fits naturally with the generative goal of transforming source latents into an image, facilitating long-range interaction without sacrificing computational efficiency. Rather, the network maintains a scalable, linear efficiency across all layers, realizing the transformer's full potential. In doing so, we seek to take a step forward in tackling the challenging task of compositional scene generation. Intuitively, and as is later corroborated by our findings, holding multiple latent variables that interact through attention with the evolving generated image may serve as a structural prior that promotes the formation of compact and compositional scene representations, as the different latents may specialize for certain objects or regions of interest. Indeed, as demonstrated in section 4, the Generative Adversarial Transformer achieves state-of-the-art performance in synthesizing both controlled and real-world indoor and outdoor scenes, while showing indications for a semantic compositional disposition along the way.

In designing our model, we drew inspiration from multiple lines of research on generative modeling, compositionality and scene understanding, including techniques for scene decomposition, object discovery and representation learning. Several approaches, such as [7, 22, 23, 31], perform iterative variational inference to encode scenes into multiple slots, but are mostly applied in the contexts of synthetic and oftentimes fairly rudimentary 2D settings. Works such as Capsule networks [62] leverage ideas from psychology about Gestalt principles [34, 64], perceptual grouping [6] or analysis-by-synthesis [4], and like us, introduce ways to piece together visual elements to discover compound entities or, in the cases of Set Transformers [50] and A²-Nets [10], group local information into global aggregators, which proves useful for a broad spectrum of tasks, spanning unsupervised segmentation [30, 52], clustering [50], image recognition [10], NLP [61] and viewpoint generalization [47]. However, our work stands out by incorporating new ways to integrate information between nodes, as well as novel forms of attention (Simplex and Duplex) that iteratively update and refine the assignments between image features and latents, and it is the first to explore these techniques in the context of high-resolution generative modeling.

Most related to our work are certain GAN models for conditional and unconditional visual synthesis: a few methods [21, 33, 56, 65] utilize multiple replicas of a generator to produce a set of image layers, which are then combined through alpha-composition. As a result, these models make quite strong assumptions about the independence between the components depicted in each layer. In contrast, our model generates one unified image through a cooperative process, coordinating between the different latents through the use of soft attention. Other works, such as SPADE [57, 75], employ region-based feature modulation for the task of layout-to-image translation, but, contrary to us, use fixed segmentation maps or static class embeddings to control the visual features. Of particular relevance is the prominent StyleGAN model [45, 46], which utilizes a single global style vector to consistently modulate the features of each layer. The GANsformer generalizes this design, as multiple style vectors impact different regions in the image concurrently, allowing for spatially finer control over the generation process. Finally, while StyleGAN broadcasts information in one direction from the global latent to the local image features, our model propagates information both from latents to features and vice versa, enabling top-down and bottom-up reasoning to occur simultaneously¹.

¹ Note however that our model certainly does not claim to serve as a biologically-accurate reflection of cognitive top-down processing. Rather, this analogy played as a conceptual source of inspiration that aided us through the idea development.
3. The Generative Adversarial Transformer

The Generative Adversarial Transformer is a type of Generative Adversarial Network, which involves a generator network (G) that maps a sample from the latent space to the output space (e.g. an image), and a discriminator network (D) which seeks to discern between real and fake samples [29]. The two networks compete with each other through a minimax game until reaching an equilibrium. Typically, each of these networks consists of multiple layers of convolution, but in the GANsformer case we instead construct them using a novel architecture, called the Bipartite Transformer, formally defined below.
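For reference, and as standard background rather than anything specific to this paper, the minimax game between G and D can be written as

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right],$$

where in practice, as noted later in section 3.2, training follows StyleGAN's non-saturating variant of this loss with lazy R1 regularization.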
Figure 3. Samples of images generated by the GANsformer for the CLEVR, Bedroom and Cityscapes datasets, and a visualization of the
produced attention maps. The different colors correspond to the latents that attend to each region.
The section is structured as follows: we first present a formulation of the Bipartite Transformer, a general-purpose generalization of the Transformer² (section 3.1). Then, we provide an overview of how the transformer is incorporated into the generator and discriminator framework (section 3.2). We conclude by discussing the merits and distinctive properties of the GANsformer that set it apart from the traditional GAN and transformer networks (section 3.3).

² By transformer, we precisely mean a multi-layer bidirectional transformer encoder, as described in [16], which interleaves self-attention and feed-forward layers.

3.1. The Bipartite Transformer

The standard transformer network consists of alternating multi-head self-attention and feed-forward layers. We refer to each pair of self-attention and feed-forward layers as a transformer layer, such that a Transformer is considered to be a stack composed of several such layers. The self-attention operator considers all pairwise relations among the input elements, so as to update each single element by attending to all the others. The Bipartite Transformer generalizes this formulation, featuring instead a bipartite graph between two groups of variables (in the GAN case, latents and image features). In the following, we consider two forms of attention that could be computed over the bipartite graph – Simplex attention and Duplex attention – depending on the direction in which information propagates³: either in one way only, or both in top-down and bottom-up ways. While for clarity purposes we present the technique here in its one-head version, in practice we make use of a multi-head variant, in accordance with [66].

³ In computer networks, simplex refers to communication in a single direction, while duplex refers to communication in both ways.

3.1.1. Simplex Attention

We begin by introducing the simplex attention, which distributes information in a single direction over the Bipartite Transformer. Formally, let $X^{n \times d}$ denote an input set of $n$ vectors of dimension $d$ (where, for the image case, $n = W \times H$), and $Y^{m \times d}$ denote a set of $m$ aggregator variables (the latents, in the case of the generator). We can then compute attention over the derived bipartite graph between these two groups of elements. Specifically, we define:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right) V$$
$$a_b(X, Y) = \mathrm{Attention}\left(q(X), k(Y), v(Y)\right)$$

where $a$ stands for Attention, and $q(\cdot), k(\cdot), v(\cdot)$ are functions that respectively map elements into queries, keys, and values, all maintaining the same dimensionality $d$. We also provide the query and key mappings with respective positional encoding inputs, to reflect the distinct position of each element in the set (e.g. in the image); further details on the specifics of the positional encoding scheme are given in section 3.2.

We can then combine the attended information with the input elements $X$. Whereas the standard transformer implements an additive update rule of the form

$$u^a(X, Y) = \mathrm{LayerNorm}(X + a_b(X, Y))$$

(where $Y = X$ in the standard self-attention case), we instead use the retrieved information to control both the scale as well as the bias of the elements in $X$, in line with the practices promoted by the StyleGAN model [45]. As our experiments indicate, such multiplicative integration enables significant gains in model performance. Formally:

$$u^s(X, Y) = \gamma\left(a_b(X, Y)\right) \odot \omega(X) + \beta\left(a_b(X, Y)\right)$$

where $\gamma(\cdot), \beta(\cdot)$ are mappings that compute multiplicative and additive styles (gain and bias), maintaining a dimension of $d$, and $\omega(X) = \frac{X - \mu(X)}{\sigma(X)}$ normalizes each element with respect to the other features⁴. This update rule fuses together the normalization and the information propagation from $Y$ to $X$, by essentially letting $Y$ control the statistical tendencies of $X$, which for instance can be useful in the case of visual synthesis for generating particular objects or entities.

⁴ The statistics are computed with respect to the other elements in the case of instance normalization, or among element channels in the case of layer normalization. We have experimented with both forms and found that for our model layer normalization performs a little better, matching reports by [48].
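To make the update above concrete, here is a minimal single-head NumPy sketch of the bipartite simplex attention with multiplicative integration. The projection matrices, the layer-normalization form of ω, and the omission of positional encodings are our own simplifying assumptions; this illustrates the formulas, not the authors' reference implementation.

```python
# Minimal sketch of a_b(X, Y) and the simplex update u^s(X, Y) described above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def simplex_update(X, Y, params):
    """u^s(X, Y) = gamma(a_b(X, Y)) * omega(X) + beta(a_b(X, Y))."""
    # a_b(X, Y): queries come from X, keys and values from Y.
    A = attention(X @ params["Wq"], Y @ params["Wk"], Y @ params["Wv"])
    gamma, beta = A @ params["Wg"], A @ params["Wb"]   # multiplicative / additive styles
    omega = (X - X.mean(-1, keepdims=True)) / (X.std(-1, keepdims=True) + 1e-8)
    return gamma * omega + beta

n, m, d = 16 * 16, 8, 32                       # n = W*H image features, m latents
rng = np.random.default_rng(0)
params = {k: rng.normal(0, 0.02, (d, d)) for k in ["Wq", "Wk", "Wv", "Wg", "Wb"]}
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
print(simplex_update(X, Y, params).shape)       # (256, 32)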
3.1.2. Duplex Attention

We can go further and consider the variables $Y$ to possess a key-value structure of their own [55]: $Y = (K^{P \times d}, V^{P \times d})$, where the values store the content of the $Y$ variables (e.g. the randomly sampled latents for the case of GANs), while the keys track the centroids of the attention-based assignments from $X$ to $Y$, which can be computed as $K = a_b(Y, X)$. Consequently, we can define a new update rule, which is later empirically shown to work more effectively than the simplex attention:

$$u^d(X, Y) = \gamma\left(a(X, K, V)\right) \odot \omega(X) + \beta\left(a(X, K, V)\right)$$

This update compounds two attention operations on top of each other: we first compute soft attention assignments between $X$ and $Y$, and then refine the assignments by considering their centroids, analogously to the k-means algorithm [51, 52].

Finally, to support bidirectional interaction between the elements, we can chain two reciprocal simplex attentions from $X$ to $Y$ and from $Y$ to $X$, obtaining the duplex attention, which alternates computing $Y := u^a(Y, X)$ and $X := u^d(X, Y)$, such that each representation is refined in light of its interaction with the other, integrating together bottom-up and top-down interactions.
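The following sketch illustrates the duplex update $u^d(X, Y)$, extending the simplex snippet above (it reuses its attention() helper, parameter dictionary, X, Y, rng and d). The extra projections Wq2 and Wv2 are illustrative assumptions on our part rather than the paper's exact parameterization.

```python
# Minimal sketch of the duplex update u^d(X, Y) defined above.
params.update({k: rng.normal(0, 0.02, (d, d)) for k in ["Wq2", "Wv2"]})

def duplex_update(X, Y, params):
    # Keys track the centroids of the attention-based assignments from X to Y: K = a_b(Y, X).
    K = attention(Y @ params["Wq"], X @ params["Wk"], X @ params["Wv"])
    V = Y @ params["Wv2"]                       # values store the content of the Y variables
    A = attention(X @ params["Wq2"], K, V)      # a(X, K, V)
    gamma, beta = A @ params["Wg"], A @ params["Wb"]
    omega = (X - X.mean(-1, keepdims=True)) / (X.std(-1, keepdims=True) + 1e-8)
    return gamma * omega + beta                 # u^d(X, Y)

# Duplex attention then alternates Y := u^a(Y, X) and X := u^d(X, Y),
# integrating bottom-up and top-down interaction.
print(duplex_update(X, Y, params).shape)        # (256, 32)
```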
3.1.3. Overall Architecture Structure

Vision-specific adaptations. In the standard Transformer used for NLP, each self-attention layer is followed by a feed-forward FC layer that processes each element independently (which can be deemed a 1 × 1 convolution). Since our case pertains to images, we use instead a kernel size of k = 3 after each application of attention. We also apply a Leaky ReLU nonlinearity after each convolution [53] and then upsample or downsample the features X, as part of the generator or discriminator respectively, following e.g. StyleGAN2 [46]. To account for the features' location within the image, we use a sinusoidal positional encoding along the horizontal and vertical dimensions for the visual features X [66], and a trained embedding for the set of latent variables Y.
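As an illustration of the kind of sinusoidal encoding referred to here, the sketch below builds sine and cosine features along the horizontal and vertical axes and concatenates them per pixel; the frequency schedule and the equal channel split between axes are assumptions on our part, not the paper's exact scheme.

```python
# Sketch of a 2D sinusoidal positional encoding for an H x W feature grid.
import numpy as np

def sinusoidal_encoding_2d(height, width, channels):
    assert channels % 4 == 0
    freqs = 1.0 / (10000 ** (np.arange(channels // 4) / (channels // 4)))
    ys, xs = np.arange(height)[:, None] * freqs, np.arange(width)[:, None] * freqs
    enc_y = np.concatenate([np.sin(ys), np.cos(ys)], -1)      # (H, C/2): vertical features
    enc_x = np.concatenate([np.sin(xs), np.cos(xs)], -1)      # (W, C/2): horizontal features
    grid = np.concatenate(
        [np.repeat(enc_y[:, None, :], width, 1),
         np.repeat(enc_x[None, :, :], height, 0)], -1)        # (H, W, C)
    return grid.reshape(height * width, channels)

print(sinusoidal_encoding_2d(8, 8, 32).shape)                 # (64, 32)
```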
Overall, the Bipartite Transformer is thus composed of a stack that alternates attention (simplex or duplex) and convolution layers, starting from a 4 × 4 grid up to the desirable resolution. Conceptually, this structure fosters an interesting communication flow: rather than densely modeling interactions among all the pairs of pixels in the image, it supports adaptive long-range interaction between far-away pixels in a moderated manner, passing through a compact and global bottleneck that selectively gathers information from the entire input and distributes it to the relevant regions. Intuitively, this form can be viewed as analogous to the top-down notions discussed in section 1, as information is propagated in the two directions, both from the local pixel to the global high-level representation and vice versa.

We note that both the simplex and the duplex attention operations enjoy a bilinear efficiency of O(mn) thanks to the network's bipartite structure, which considers all pairs of corresponding elements from X and Y. Since, as we see below, we maintain Y to be of a fairly small size, choosing m in the range of 8–32, this compares favorably to the potentially prohibitive O(n²) complexity of self-attention, which impedes its applicability to high-resolution images.

3.2. The Generator and Discriminator Networks

We use the celebrated StyleGAN model as a starting point for our GAN design. Commonly, a generator network consists of a multi-layer CNN that receives a randomly sampled vector z and transforms it into an image. The StyleGAN approach departs from this design and, instead, introduces a feed-forward mapping network that outputs an intermediate vector w, which in turn interacts directly with each convolution through the synthesis network, globally controlling the feature map statistics of every layer.

Effectively, this approach attains a layer-wise decomposition of visual properties, allowing the model to control particular global aspects of the image such as pose, lighting conditions or color schemes, in a coherent manner over the entire image. But while StyleGAN successfully disentangles global properties, it is potentially limited in its ability to perform spatial decomposition, as it provides no means to control the style of a localized region within the generated image.

Luckily, the Bipartite Transformer offers a solution to meet this goal. Instead of controlling the style of all features globally, we use our new attention layer to perform adaptive, local, region-wise modulation. We split the latent vector z into k components, z = [z1, ..., zk] and, as in StyleGAN, pass each of them through a shared mapping network, obtaining a corresponding set of intermediate latent variables Y = [y1, ..., yk].
Figure 5. Sampled images and attention maps. Samples of images generated by the GANsformer for the CLEVR, LSUN-Bedroom and Cityscapes datasets, and a visualization of the produced attention maps. The different colors correspond to the latent variables that attend to each region. For the CLEVR dataset we show multiple attention maps from different layers of the model, revealing how the latent variables' roles change over the layers: while they correspond to different objects as the layout of the scene is being formed in the early layers, they behave similarly to a surface normal in the final layers of the generator.
Then, during synthesis, after each CNN layer in the generator, we let the feature map X and the latents Y play the roles of the two element groups, mediating their interaction through our new attention layer (either simplex or duplex). Since soft attention tends to group elements based on their content similarity, the transformer architecture naturally fits into the generative task and proves useful in the visual domain, allowing the network to exercise finer control in modulating local semantic regions.
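To make the data flow concrete, here is a small NumPy sketch of this generator-side scheme, under our own assumptions about shapes and a toy two-layer shared mapping network; it is meant only as an illustration of the split-map-modulate pattern described above, not the authors' implementation.

```python
# Sketch: split z into k components, map each with a shared MLP, then let the
# resulting latent set Y modulate the feature map X through bipartite attention.
import numpy as np

def mapping_network(z_parts, W1, W2):
    # Shared mapping network applied to each latent component independently.
    return np.maximum(z_parts @ W1, 0.0) @ W2                 # (k, d)

rng = np.random.default_rng(0)
k, d, H = 16, 32, 8                                           # latents, channels, feature grid
z = rng.normal(size=(k, d))                                   # z = [z_1, ..., z_k]
W1, W2 = rng.normal(0, 0.02, (d, d)), rng.normal(0, 0.02, (d, d))
Y = mapping_network(z, W1, W2)                                # Y = [y_1, ..., y_k]

X = rng.normal(size=(H * H, d))                               # feature map after a conv layer
# Region-wise modulation: each spatial feature attends to the latents and is
# rescaled/shifted by the retrieved styles, e.g. via simplex_update(X, Y, params)
# from the earlier sketch, applied after every convolution of the synthesis network.
print(Y.shape, X.shape)                                       # (16, 32) (64, 32)
```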
For the discriminator, we similarly apply attention after every convolution, in this case using trained embeddings to initialize the aggregator variables Y, which may intuitively represent background knowledge the model learns about the scenes. At the last layer, we concatenate these variables to the final feature map to make a prediction about the identity of the image source. We note that this construction holds some resemblance to the PatchGAN discriminator introduced by [41], but whereas PatchGAN pools features according to a fixed, predetermined scheme, the GANsformer can gather the information in a more adaptive and selective manner. Overall, using this structure endows the discriminator with the capacity to likewise model long-range dependencies, which can aid the discriminator in its assessment of image fidelity, allowing it to acquire a more holistic understanding of the visual modality.

In terms of the loss function, optimization and training configuration, we adopt the settings and techniques used in the StyleGAN and StyleGAN2 models [45, 46], including in particular style mixing, stochastic variation, an exponential moving average for weights, and a non-saturating logistic loss with lazy R1 regularization.

3.3. Summary

To recapitulate the discussion above, the GANsformer successfully unifies GANs and Transformers for the task of scene generation. Compared to traditional GANs and transformers, it introduces multiple key innovations:

• Featuring a bipartite structure that reaches a sweet spot, balancing expressiveness and efficiency, being able to model long-range dependencies while maintaining linear computational costs.

• Introducing a compositional structure with multiple latent variables that coordinate through attention to produce the image cooperatively, in a manner that matches the inherent compositionality of natural scenes.

• Supporting bidirectional interaction between the latents and the visual features, which allows the refinement and interpretation of each in light of the other.

• Employing a multiplicative update rule to affect feature styles, akin to StyleGAN but in contrast to the transformer architecture.

As we see in the following section, the combination of these design choices yields a strong architecture that demonstrates high efficiency, improved latent space disentanglement, and enhanced transparency of its generation process.

4. Experiments

We investigate the GANsformer through a suite of experiments to study its quantitative performance and qualitative behavior. As detailed in the sections below, the GANsformer achieves state-of-the-art results, successfully producing high-quality images for a varied assortment of datasets: FFHQ for human faces [45], the CLEVR dataset for multi-object scenes [43], and the LSUN-Bedroom [71] and Cityscapes [15] datasets for challenging indoor and outdoor scenes. The use of these datasets and their reproduced images is only for the purpose of scientific communication. The further analysis we conduct in sections 4.1, 4.2 and 4.3 provides evidence for several favorable properties that the GANsformer possesses, including better data-efficiency, enhanced transparency, and stronger disentanglement compared to prior approaches. Section 4.4 then quantitatively assesses the network's semantic coverage of the natural image distribution for the CLEVR dataset, while the ablation studies in section 4.5 empirically validate the relative importance of each of the model's design choices. Taken altogether, our evaluation offers solid evidence for the GANsformer's effectiveness and efficacy in modeling compositional images and scenes.

We compare our network with multiple related approaches, including both baselines as well as leading models for image synthesis: (1) A baseline GAN [29]: a standard GAN that follows the typical convolutional architecture.⁵ (2) StyleGAN2 [46], where a single global latent interacts with the evolving image by modulating its style in each layer.

⁵ We specifically use a default configuration from the StyleGAN2 codebase, but with the noise being inputted through the network stem instead of through weight demodulation.
Table 1. Comparison between the GANsformer and competing methods for image synthesis. We evaluate the models along commonly used metrics such as FID, Inception (IS), and Precision & Recall scores. FID is considered to be the most well-received metric, serving as a reliable indication of image fidelity and diversity. We compute each metric 10 times over 50k samples with different random seeds and report the average.

CLEVR
Model           FID ↓      IS ↑     Precision ↑   Recall ↑
GAN             25.0244    2.1719   21.77         16.76
k-GAN           28.2900    2.2097   22.93         18.43
SAGAN           26.0433    2.1742   30.09         15.16
StyleGAN2       16.0534    2.1472   28.41         23.22
VQGAN           32.6031    2.0324   46.55         63.33
GANsformer_s    10.2585    2.4555   38.47         37.76
GANsformer_d     9.1679    2.3654   47.55         66.63

LSUN-Bedroom
Model           FID ↓      IS ↑     Precision ↑   Recall ↑
GAN             12.1567    2.6613   52.17         13.63
k-GAN           69.9014    2.4114   28.71          3.45
SAGAN           14.0595    2.6946   54.82          7.26
StyleGAN2       11.5255    2.7933   51.69         19.42
VQGAN           59.6333    1.9319   55.24         28.00
GANsformer_s     8.5551    2.6896   55.52         22.89
GANsformer_d     6.5085    2.6694   57.41         29.71

FFHQ
Model           FID ↓      IS ↑     Precision ↑   Recall ↑
GAN             13.1844    4.2966   67.15         17.64
k-GAN           61.1426    3.9990   50.51          0.49
SAGAN           16.2069    4.2570   64.84         12.26
StyleGAN2       10.8309    4.3294   68.61         25.45
VQGAN           63.1165    2.2306   67.01         29.67
GANsformer_s    13.2861    4.4591   68.94         10.14
GANsformer_d    12.8478    4.4079   68.77          5.7589

Cityscapes
Model           FID ↓      IS ↑     Precision ↑   Recall ↑
GAN             11.5652    1.6295   61.09         15.30
k-GAN           51.0804    1.6599   18.80          1.73
SAGAN           12.8077    1.6837   43.48          7.97
StyleGAN2        8.3500    1.6960   59.35         27.82
VQGAN          173.7971    2.8155   30.74         43.00
GANsformer_s    14.2315    1.6720   64.12          2.03
GANsformer_d     5.7589    1.6927   48.06         33.65
(3) SAGAN [72], a model that performs self-attention across all pixel pairs in the low-resolution layers of the generator and discriminator. (4) k-GAN [65], which produces k separated images that are then blended through alpha-composition. (5) VQGAN [24], which has been proposed recently and utilizes transformers for discrete autoregressive auto-encoding.

To evaluate all models under comparable conditions of training scheme, model size, and optimization details, we implement them all within the codebase introduced by the StyleGAN authors. All models have been trained with images of 256 × 256 resolution and for the same number of training steps, roughly spanning a week on 2 NVIDIA V100 GPUs per model (or equivalently 3-4 days using 4 GPUs). For the GANsformer, we select k – the number of latent variables – from the range of 8–32. Note that increasing the value of k does not translate to an increased overall latent dimension; rather, we kept the overall latent dimension equal across models. See supplementary material A for further implementation details, hyperparameter settings and training configuration.

We can see that the GANsformer matches or outperforms the performance of prior approaches, with the least benefits for the non-compositional FFHQ dataset of human faces, and the largest gains for the highly-compositional CLEVR dataset.

As shown in table 1, our model matches or outperforms prior work, achieving substantial gains in terms of the FID score, which correlates with image quality and diversity [36], as well as other commonly used metrics such as Inception score (IS) and Precision and Recall (P&R).⁶ As could be expected, we obtain the least gains for the FFHQ human faces dataset, where naturally there is relatively lower diversity in image layouts. On the flip side, most notable are the significant improvements in performance for the CLEVR case, where our approach successfully lowers FID scores from 16.05 to 9.16, as well as the Bedroom dataset, where the GANsformer nearly halves the FID score from 11.32 to 6.5, being trained for an equal number of steps. These findings suggest that the GANsformer is particularly adept at modeling scenes of high compositionality (CLEVR) or layout diversity (Bedroom). Comparing between the Simplex and Duplex attention versions further reveals the strong benefits of integrating the reciprocal bottom-up and top-down processes together.

⁶ Note that while the StyleGAN paper [46] reports lower FID scores in the FFHQ and Bedroom cases, they obtain them by training their model for 5-7 times longer than our experiments (StyleGAN models are trained for up to 17.5 million steps, producing 70M samples and demanding over 90 GPU-days). To comply with a reasonable compute budget, in our evaluation we equally reduced the training duration for all models, maintaining the same number of steps.
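For reference, the FID score used above is the Fréchet distance between Gaussians fitted to Inception features of real and generated images. Below is a minimal sketch of that computation, under the assumption that the Inception feature extraction happens elsewhere; it is an illustration of the metric, not the exact evaluation code used for the paper.

```python
# Minimal FID sketch: Frechet distance between two Gaussian feature distributions.
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real                      # discard tiny imaginary residue
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(1000, 64)), rng.normal(0.1, 1.0, size=(1000, 64))))
```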
4.1. Data and Learning Efficiency

We examine the learning curves of our and competing models (figure 7, middle) and inspect samples of generated images at different stages of the training (figure 12, supplementary). These results reveal that our model learns significantly faster than competing approaches, in the case of CLEVR producing high-quality images in approximately 3-times fewer training steps than the second-best approach. To explore the GANsformer's learning aptitude further, we have performed experiments where we reduced, to varied degrees, the size of the dataset that each model (and specifically, its discriminator) is exposed to during the training (figure 7, rightmost). These results similarly validate the model's superior data-efficiency, especially when as little as 1k images are given to the model.

4.2. Transparency & Compositionality

To gain more insight into the model's internal representation and its underlying generative process, we visualize the attention distributions produced by the GANsformer as it synthesizes new images. Recall that at each layer of the
Figure 7. From left to right: (1-2) performance as a function of the start and final layers the attention is applied to; (3) data-efficiency experiments for CLEVR. The plots compare GAN, k-GAN, SAGAN, StyleGAN2, and the Simplex and Duplex GANsformer variants, showing FID score against training step and against dataset size (logarithmic).
[10] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A²-Nets: Double attention networks. In NeurIPS, 2018, pp. 350–359.
[11] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[12] Yunjey Choi, Min-Je Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018, pp. 8789–8797.
[13] Charles E. Connor, Howard E. Egeth, and Steven Yantis. Visual attention: bottom-up versus top-down. Current Biology, 14(19):R850–R852, 2004.
[14] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In ICLR, 2020.
[15] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016, pp. 3213–3223.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019, pp. 4171–4186.
[17] Nadine Dijkstra, Peter Zeidman, Sasha Ondobaka, Marcel A. J. van Gerven, and Karl Friston. Distinct top-down and bottom-up brain connectivity during visual perception and imagery. Scientific Reports, 7(1):1–9, 2017.
[18] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[20] Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In ICLR, 2018.
[21] Sébastien Ehrhardt, Oliver Groth, Aron Monszpart, Martin Engelcke, Ingmar Posner, Niloy J. Mitra, and Andrea Vedaldi. RELATE: Physically plausible multi-object scene synthesis using structured latent spaces. In NeurIPS, 2020.
[22] Martin Engelcke, Adam R. Kosiorek, Oiwi Parker Jones, and Ingmar Posner. GENESIS: Generative scene inference and sampling with object-centric latent representations. In ICLR, 2020.
[23] S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, and Geoffrey E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NeurIPS, 2016, pp. 3225–3233.
[24] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841, 2020.
[25] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019, pp. 3146–3154.
[26] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.
[27] James J. Gibson. A theory of direct visual perception. Vision and Mind: Selected Readings in the Philosophy of Perception, pp. 77–90, 2002.
[28] James Jerome Gibson and Leonard Carmichael. The Senses Considered as Perceptual Systems, volume 2. Houghton Mifflin, Boston, 1966.
[29] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[30] Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NeurIPS, 2017, pp. 6691–6701.
[31] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In ICML, 2019, pp. 2424–2433.
[32] Richard Langton Gregory. The Intelligent Eye. 1970.
[33] Jinjin Gu, Yujun Shen, and Bolei Zhou. Image processing using multi-code GAN prior. In CVPR, 2020, pp. 3009–3018.
[34] David Walter Hamlyn. The Psychology of Perception: A Philosophical Examination of Gestalt Theory and Derivative Theories of Perception, volume 13. Routledge, 2017.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016, pp. 770–778.
[36] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017, pp. 6626–6637.
[37] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
[38] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In ICCV, 2019, pp. 3463–3472.
[39] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, 2019, pp. 603–612.
[40] Monika Intaitė, Valdas Noreika, Alvydas Šoliūnas, and Christine M. Falter. Interaction of bottom-up and top-down processes in the perception of ambiguous figures. Vision Research, 89:24–31, 2013.
[41] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017, pp. 5967–5976.
[42] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. TransGAN: Two transformers can make one strong GAN. arXiv preprint arXiv:2102.07074, 2021.
[43] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017, pp. 1988–1997.
[44] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In CVPR, 2018, pp. 1219–1228.
[45] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019, pp. 4401–4410.
[46] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020, pp. 8107–8116.
[47] Adam R. Kosiorek, Sara Sabour, Yee Whye Teh, and Geoffrey E. Hinton. Stacked capsule autoencoders. In NeurIPS, 2019, pp. 15486–15496.
[48] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019, pp. 3929–3938.
[49] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017, pp. 105–114.
[50] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019, pp. 3744–3753.
[51] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[52] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. arXiv preprint arXiv:2006.15055, 2020.
[53] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
[54] Andrea Mechelli, Cathy J. Price, Karl J. Friston, and Alumit Ishai. Where bottom-up meets top-down: neuronal interactions during perception and imagery. Cerebral Cortex, 14(11):1256–1265, 2004.
[55] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In EMNLP, 2016, pp. 1400–1409.
[56] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. BlockGAN: Learning 3D object-aware scene representations from unlabelled images. arXiv preprint arXiv:2002.08988, 2020.
[57] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019, pp. 2337–2346.
[58] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image Transformer. In ICML, 2018, pp. 4052–4061.
[59] Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019, pp. 68–80.
[60] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[61] Anirudh Ravula, Chris Alberti, Joshua Ainslie, Li Yang, Philip Minh Pham, Qifan Wang, Santiago Ontanon, Sumit Kumar Sanghai, Vaclav Cvicek, and Zach Fisher. ETC: Encoding long and structured inputs in transformers. 2020.
[62] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In NeurIPS, 2017, pp. 3856–3866.
[63] Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, and Ching-Hui Chen. Global self-attention networks for image recognition. arXiv preprint arXiv:2010.03019, 2020.
[64] Barry Smith. Foundations of Gestalt Theory. 1988.
[65] Sjoerd van Steenkiste, Karol Kurach, Jürgen Schmidhuber, and Sylvain Gelly. Investigating object compositionality in generative adversarial networks. Neural Networks, 130:309–325, 2020.
[66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017, pp. 5998–6008.
[67] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018, pp. 7794–7803.
[68] Lawrence Wheeler. Concepts and mechanisms of perception by R. L. Gregory. Leonardo, 9(2):156–157, 1976.
[69] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[70] Zongze Wu, Dani Lischinski, and Eli Shechtman. StyleSpace analysis: Disentangled controls for StyleGAN image generation. arXiv preprint arXiv:2011.12799, 2020.
[71] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[72] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019, pp. 7354–7363.
[73] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In CVPR, 2020, pp. 10073–10082.
[74] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017, pp. 2242–2251.
[75] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. SEAN: Image synthesis with semantic region-adaptive normalization. In CVPR, 2020, pp. 5103–5112.
Supplementary Material
In the following we provide additional quantitative experiments and qualitative results, along with further details about the spatial compositionality evaluation (Section 4.2), for the last of which we describe the experiment and its results: see figure 8 for the spatial compositionality experiment, which sheds light upon the roles of the different latent variables. To complement the numerical evaluations with qualitative results, we present in figures 12 and 9 comparisons of images sampled by the GANsformer and a set of baseline models, over the course of the training and at its end. We also report ablation and variation studies (Appendix C) that assess the contribution of the different models' design choices.
A. Implementation and Training Details

To evaluate all models under comparable conditions of training scheme, model size, and optimization details, we implement them all within the TensorFlow codebase introduced by the StyleGAN authors [45]. See table 4 for the particular settings of the GANsformer and table 5 for a comparison of model sizes. In terms of the loss function, optimization and training configuration, we adopt the settings and techniques used in the StyleGAN and StyleGAN2 models [45, 46], including in particular style mixing, Xavier initialization, stochastic variation, an exponential moving average for weights, and a non-saturating logistic loss with lazy R1 regularization. We use the Adam optimizer with a batch size of 32 (4 times 8 using gradient accumulation), an equalized learning rate of 0.001, β1 = 0.9 and β2 = 0.999, as well as leaky ReLU activations with α = 0.2, bilinear filtering in all up/downsampling layers, and a minibatch standard deviation layer at the end of the discriminator. The mapping layer of the generator consists of 8 layers, and ResNet connections are used throughout the model: for the mapping network, the synthesis network and the discriminator. We train all models on images of 256 × 256 resolution, padded as necessary. The CLEVR dataset consists of 100k images, FFHQ has 70k images, Cityscapes has overall about 25k images, and LSUN-Bedroom has 3M images. The images in the Cityscapes and FFHQ datasets are mirror-augmented to increase the effective training set size. All models have been trained for the same number of training steps, roughly spanning a week on 2 NVIDIA V100 GPUs per model.
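For convenience, the hyperparameters listed above can be collected as follows; the field names are our own, and the values simply restate the text.

```python
# Training configuration summarized from the paragraph above (names are illustrative).
train_config = {
    "optimizer": "Adam",
    "batch_size": 32,              # 4 x 8 with gradient accumulation
    "learning_rate": 1e-3,         # equalized learning rate
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "leaky_relu_alpha": 0.2,
    "resampling_filter": "bilinear",
    "minibatch_stddev_layer": True,
    "mapping_layers": 8,
    "resolution": 256,
    "loss": "non-saturating logistic + lazy R1",
}
print(train_config)
```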
B. Spatial Compositionality

To quantify the degree of compositionality exhibited by the model, we employ a pre-trained segmentor to produce semantic segmentations for the synthesized scenes, and use them to measure the correlation between the attention cast by the latent variables and the various semantic classes. We derive the correlation by computing the maximum intersection-over-union between a class segment and the attention segments produced by the model in the different layers. The mean of these scores is then taken over a set of 1k images. Results are presented in figure 8 for the Bedroom and Cityscapes datasets, showing the semantic classes that have high correlation with the model's attention, indicating that it decomposes the image into semantically-meaningful segments of objects and entities.

Figure 8. Attention spatial compositionality experiments. Correlation (IoU) between attention heads and semantic segments, computed over 1k sample images. Results presented for the Bedroom and Cityscapes datasets.
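A small sketch of this measure for a single image is given below, under the assumption that each pixel is assigned to the latent variable with the highest attention weight; the inputs attn and seg are hypothetical (per-image attention maps and a semantic segmentation), not the evaluation code used for the paper.

```python
# Max IoU between one semantic class and the attention segments of a single image.
import numpy as np

def max_iou(attn, seg, segment_class):
    """attn: (m, H, W) attention maps; seg: (H, W) class ids."""
    class_mask = seg == segment_class
    assignment = attn.argmax(0)                  # which latent attends to each pixel
    best = 0.0
    for latent in range(attn.shape[0]):
        attn_mask = assignment == latent
        inter = np.logical_and(class_mask, attn_mask).sum()
        union = np.logical_or(class_mask, attn_mask).sum()
        best = max(best, inter / union if union else 0.0)
    return best  # in the paper's evaluation, averaged over layers and ~1k images

rng = np.random.default_rng(0)
print(max_iou(rng.random((8, 16, 16)), rng.integers(0, 5, (16, 16)), segment_class=3))
```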
C. Ablation and Variation Studies

To validate the usefulness of our approach and obtain a better assessment of the relative contribution of each design choice, we conduct multiple ablation and variation studies (table will be updated soon), where we test our model under varying conditions, specifically studying the impact of: latent dimension, attention heads, the number of layers attention is incorporated into, simplex vs. duplex attention, generator vs. discriminator attention, and multiplicative vs. additive integration, all performed over the diagnostic CLEVR dataset.
Figure 9. State-of-the-art comparison (rows: GAN, StyleGAN2, k-GAN). A comparison of the models' sampled images for the CLEVR, LSUN-Bedroom and Cityscapes datasets. All models have been trained for the same number of steps, which ranges between 5k and 15k samples. Note that the original StyleGAN2 model has been trained by its authors to generate up to 70k samples, which is expected to take over 90 GPU-days for a single model. See the next page for image samples by further models. These images show that, given the same training length, the GANsformer model's sampled images enjoy high quality and diversity compared to the prior works, demonstrating the efficacy of our approach.
Figure 10. A comparison of the models' sampled images (rows: SAGAN, VQGAN, GANsformer_s) for the CLEVR, LSUN-Bedroom and Cityscapes datasets. See figure 9 for further description.
Figure 11. A comparison of the models' sampled images (rows: GANsformer_d) for the CLEVR, LSUN-Bedroom and Cityscapes datasets. See figure 9 for further description.
Figure 12. State-of-the-art comparison over training (rows: GAN, StyleGAN, k-GAN). A comparison of the models' sampled images for the CLEVR, LSUN-Bedroom and Cityscapes datasets, generated at different stages throughout the training. Sampled images from different points in training are based on the same sampled latents, thereby showing how the image evolves during the training. For CLEVR and Cityscapes, we present results after training to generate 100k, 200k, 500k, 1m, and 2m samples. For the Bedroom case, we present results after 500k, 1m, 2m, 5m and 10m generated samples. These results show how the GANsformer, especially when using duplex attention, manages to learn a lot faster than the competing approaches, generating impressive images very early in the training.
Figure 13. A comparison of the models' sampled images (rows: SAGAN, VQGAN, GANsformer_s) for the CLEVR, LSUN-Bedroom and Cityscapes datasets throughout the training. See figure 12 for further description.
Figure 14. A comparison of the models' sampled images (rows: GANsformer_d) for the CLEVR, LSUN-Bedroom and Cityscapes datasets throughout the training. See figure 12 for further description.