Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

1. Introduction

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.

What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range.

We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these
two sets, and thereby imagine what a scene might look like if we were to "translate" it from one set into the other.

In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples.

This problem can be more broadly described as image-to-image translation [21], converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs {x, y} are available (Figure 2, left), e.g., [10, 18, 21, 22, 26, 31, 44, 55, 57, 61]. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1, top-middle), the desired output is not even well-defined.

We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right). We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y. We may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution p_data(y) (in general, this requires G to be stochastic) [15]. The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [14].

These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be "cycle consistent", in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3]. Mathematically, if we have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mappings G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation.

We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. Our code is available at https://github.com/junyanz/CycleGAN. Check out more results at https://junyanz.github.io/CycleGAN/.

Figure 2: Paired training data (left) consists of training examples {x_i, y_i}_{i=1}^N, where the correspondence between x_i and y_i exists [21]. We instead consider unpaired training data (right), consisting of a source set {x_i}_{i=1}^N (x_i ∈ X) and a target set {y_j}_{j=1}^M (y_j ∈ Y), with no information provided as to which x_i matches which y_j.

2. Related work

Generative Adversarial Networks (GANs) [15, 62] have achieved impressive results in image generation [5, 37], image editing [65], and representation learning [37, 42, 35]. Recent methods adopt the same idea for conditional image generation applications, such as text2image [39], image inpainting [36], and future prediction [34], as well as to other domains like videos [53] and 3D data [56]. The key to GANs' success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real images. This is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated image cannot be distinguished from images in the target domain.
Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators DY and DX. DY encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for DX and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.
Image-to-Image Translation The idea of image-to-image translation goes back at least to Hertzmann et al.'s Image Analogies [18], who employ a non-parametric texture model [9] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs, e.g., [31]. Our approach builds on the "pix2pix" framework of Isola et al. [21], which uses a conditional generative adversarial network [15] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [43] or from attribute and semantic layouts [23]. However, unlike these prior works, we learn the mapping without paired training examples.

Unpaired Image-to-Image Translation Several other methods also tackle the unpaired setting, where the goal is to relate two data domains, X and Y. Rosales et al. [40] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image, and a likelihood term obtained from multiple style images. More recently, CoGAN [30] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [29] extend this framework with a combination of variational autoencoders [25] and generative adversarial networks. Another line of concurrent work [45, 48, 2] encourages the input and output to share certain "content" features even though they may differ in "style". These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [45], and image feature space [48].

Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.

Cycle Consistency The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [47]. In the language domain, verifying and improving translations via "back translation and reconciliation" is a technique used by human translators [3] (including, humorously, by Mark Twain [50]), as well as by machines [16]. More recently, higher-order cycle consistency has been used in structure from motion [60], 3D shape matching [20], co-segmentation [54], dense semantic alignment [63, 64], and depth estimation [13]. Of these, Zhou et al. [64] and Godard et al. [13] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we introduce a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. [58] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [16].

Neural Style Transfer [12, 22, 51, 11] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our main focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as painting→photo, object transfiguration, etc., where single-sample transfer methods do not perform well. We compare these two methods in Section 5.2.

Figure 4 (panel titles): input x, generated image G(x), and reconstruction F(G(x)).
3. Formulation
Our goal is to learn mapping functions between two
domains X and Y given training samples {x_i}_{i=1}^N where x_i ∈ X and {y_j}_{j=1}^M where y_j ∈ Y. We denote the data distribution as x ∼ p_data(x) and y ∼ p_data(y). As illus-
trated in Figure 3 (a), our model includes two mappings
G : X → Y and F : Y → X. In addition, we in-
troduce two adversarial discriminators DX and DY , where
DX aims to distinguish between images {x} and translated
images {F (y)}; in the same way, DY aims to discriminate
between {y} and {G(x)}. Our objective contains two types
of terms: adversarial losses [15] for matching the distribu-
tion of generated images to the data distribution in the target
domain; and cycle consistency losses to prevent the learned
mappings G and F from contradicting each other.
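To make these two kinds of terms concrete, the sketch below shows one way the combined objective could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not our released implementation: the generator and discriminator modules (G, F, D_X, D_Y) are placeholders, the cycle weight lambda_cyc is a free parameter, and the adversarial term uses the least-squares form discussed in Section 4.

```python
import torch
import torch.nn as nn

# Placeholder generator/discriminator modules; any image-to-image
# network and patch classifier with matching shapes would do here.
G = nn.Identity()    # G : X -> Y (placeholder)
F = nn.Identity()    # F : Y -> X (placeholder)
D_X = nn.Identity()  # discriminator on domain X (placeholder)
D_Y = nn.Identity()  # discriminator on domain Y (placeholder)

mse = nn.MSELoss()   # least-squares GAN criterion (see Section 4)
l1 = nn.L1Loss()     # cycle consistency uses an L1 penalty
lambda_cyc = 10.0    # weight on the cycle term (see the appendix)

def generator_loss(real_x, real_y):
    """Adversarial + cycle-consistency terms for one batch."""
    fake_y = G(real_x)   # G(x), should fool D_Y
    fake_x = F(real_y)   # F(y), should fool D_X
    rec_x = F(fake_y)    # F(G(x)) ~ x  (forward cycle)
    rec_y = G(fake_x)    # G(F(y)) ~ y  (backward cycle)

    # Adversarial losses: the generators push discriminator outputs toward 1.
    adv = mse(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) + \
          mse(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # Cycle consistency: reconstructions should match the original inputs.
    cyc = l1(rec_x, real_x) + l1(rec_y, real_y)
    return adv + lambda_cyc * cyc

def discriminator_loss(D, real, fake):
    """Real patches are pushed toward 1, generated patches toward 0."""
    return 0.5 * (mse(D(real), torch.ones_like(D(real))) +
                  mse(D(fake.detach()), torch.zeros_like(D(fake.detach()))))
```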
4. Implementation

Network Architecture We adapt the architecture for our generative networks from Johnson et al. [22], who have shown impressive results for neural style transfer and super-resolution. This network contains two stride-2 convolutions, several residual blocks [17], and two fractionally-strided convolutions with stride 1/2. We use 6 blocks for 128 × 128 images, and 9 blocks for 256 × 256 and higher-resolution training images. Similar to Johnson et al. [22], we use instance normalization [52]. For the discriminator networks we use 70 × 70 PatchGANs [21, 28, 27], which aim to classify whether 70 × 70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator, and can be applied to arbitrarily-sized images in a fully convolutional fashion [21].

Training details We apply two techniques from recent works to stabilize our model training procedure. First, for L_GAN (Equation 1), we replace the negative log likelihood objective with a least-squares loss [33]. This loss is more stable during training and generates higher-quality results. Second, to reduce model oscillation [14], we follow Shrivastava et al. [45] and update the discriminators using a history of generated images rather than the ones produced by the latest generators, keeping an image buffer that stores the 50 previously generated images.
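As a hedged illustration of these two tricks, the snippet below sketches the least-squares criterion and a simple image history buffer. The 50-image pool size follows the text above; the class and function names are placeholders rather than our released code.

```python
import random
import torch

def lsgan_d_loss(d_real, d_fake):
    # Least-squares objective for the discriminator:
    # push scores on real patches toward 1 and on fakes toward 0.
    return 0.5 * (((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean())

def lsgan_g_loss(d_fake):
    # The generator instead pushes its fakes' scores toward 1.
    return ((d_fake - 1.0) ** 2).mean()

class ImageHistoryBuffer:
    """Pool of previously generated images used to update the discriminators."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.images = []

    def query(self, image):
        # Until the pool is full, store a copy and return the new image.
        if len(self.images) < self.capacity:
            self.images.append(image.detach().clone())
            return image
        # Otherwise, half the time swap the new image for a stored one.
        if random.random() < 0.5:
            idx = random.randrange(self.capacity)
            old = self.images[idx]
            self.images[idx] = image.detach().clone()
            return old
        return image
```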
5.1. Evaluation

Using the same evaluation datasets and metrics as "pix2pix" [21], we compare our method against several baselines both qualitatively and quantitatively. The tasks include semantic labels↔photo on the Cityscapes dataset [4], and map↔aerial photo on data scraped from Google Maps. We also perform an ablation study on the full loss function.

Figure 5: Different methods for mapping labels↔photos trained on Cityscapes images. From left to right: input, BiGAN/ALI [6, 8], CoGAN [30], feature loss + GAN, SimGAN [45], CycleGAN (ours), pix2pix [21] trained on paired data, and ground truth.

Figure 6: Different methods for mapping aerial photos↔maps on Google Maps. From left to right: input, BiGAN/ALI [6, 8], CoGAN [30], feature loss + GAN, SimGAN [45], CycleGAN (ours), pix2pix [21] trained on paired data, and ground truth.

5.1.1 Metrics

AMT perceptual studies On the map↔aerial photo task, we run "real vs fake" perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs. We follow the same perceptual study protocol from Isola et al. [21], except we only gather data from 25 participants per algorithm tested. Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real. The first 10 trials of each session were practice, and feedback was given as to whether the participant's response was correct or incorrect. The remaining 40 trials were used to assess the rate at which each algorithm fooled participants. Each session only tested a single algorithm, and participants were only allowed to complete a single session. Note that the numbers we report here are not directly comparable to those in [21], as our ground truth images were processed slightly differently² and the participant pool we tested may be distributed differently from those tested in [21] (due to running the experiment at a different date and time). Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [21].

² We train all the models on 256 × 256 images, while in pix2pix [21] the model was trained on 256 × 256 patches of 512 × 512 images and run convolutionally on the 512 × 512 images at test time. We choose 256 × 256 in our experiments because many baselines cannot scale up to high-resolution images, and CoGAN cannot be tested fully convolutionally.

FCN score Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this we adopt the "FCN score" from [21], and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [31]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using the standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of "car on road", then we have succeeded if the FCN applied to the generated photo detects "car on road".

Semantic segmentation metrics To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark, including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4].
Table 1: AMT "real vs fake" test on maps↔aerial photos at 256 × 256 resolution.

Loss                  Map → Photo (% Turkers labeled real)   Photo → Map (% Turkers labeled real)
CoGAN [30]            0.6% ± 0.5%                             0.9% ± 0.5%
BiGAN/ALI [8, 6]      2.1% ± 1.0%                             1.9% ± 0.9%
SimGAN [45]           0.7% ± 0.5%                             2.6% ± 1.1%
Feature loss + GAN    1.2% ± 0.6%                             0.3% ± 0.2%
CycleGAN (ours)       26.8% ± 2.8%                            23.2% ± 3.4%

Table 2: FCN-scores for different methods, evaluated on Cityscapes labels→photo.

Loss                  Per-pixel acc.   Per-class acc.   Class IOU
CoGAN [30]            0.40             0.10             0.06
BiGAN/ALI [8, 6]      0.19             0.06             0.02
SimGAN [45]           0.20             0.10             0.04
Feature loss + GAN    0.06             0.04             0.01
CycleGAN (ours)       0.52             0.17             0.11
pix2pix [21]          0.71             0.25             0.18

Table 3: Classification performance of photo→labels for different methods on Cityscapes.

Loss                  Per-pixel acc.   Per-class acc.   Class IOU
CoGAN [30]            0.45             0.11             0.08
BiGAN/ALI [8, 6]      0.41             0.13             0.07
SimGAN [45]           0.47             0.11             0.07
Feature loss + GAN    0.50             0.10             0.06
CycleGAN (ours)       0.58             0.22             0.16
pix2pix [21]          0.85             0.40             0.32

Table 4: Ablation study: FCN-scores for different variants of our method, evaluated on Cityscapes labels→photo.

Loss                  Per-pixel acc.   Per-class acc.   Class IOU
Cycle alone           0.22             0.07             0.02
GAN alone             0.51             0.11             0.08
GAN + forward cycle   0.55             0.18             0.12
GAN + backward cycle  0.39             0.14             0.06
CycleGAN (ours)       0.52             0.17             0.11

Table 5: Ablation study: classification performance of photo→labels for different losses, evaluated on Cityscapes.

Loss                  Per-pixel acc.   Per-class acc.   Class IOU
Cycle alone           0.10             0.05             0.02
GAN alone             0.53             0.11             0.07
GAN + forward cycle   0.49             0.11             0.07
GAN + backward cycle  0.01             0.06             0.01
CycleGAN (ours)       0.58             0.22             0.16

5.1.2 Baselines

CoGAN [30] This method learns one GAN generator for domain X and one for domain Y, with tied weights on the first few layers for the shared latent representation. Translation from X to Y can be achieved by finding a latent representation that generates image X and then rendering this latent representation into style Y.
SimGAN [45] Like our method, Shrivastava et al. [45] use an adversarial loss to train a translation from X to Y. The regularization term ||X − G(X)||_1 is used to penalize making large changes at the pixel level.

Feature loss + GAN We also test a variant of SimGAN [45] where the L1 loss is computed over deep image features using a pretrained network (VGG-16 relu4_2 [46]), rather than over RGB pixel values. Computing distances in deep feature space, like this, is also sometimes referred to as using a "perceptual loss" [7, 22].

BiGAN/ALI [8, 6] Unconditional GANs [15] learn a generator G : Z → X that maps random noise Z to images X. BiGAN [8] and ALI [6] propose to also learn the inverse mapping function F : X → Z. Though they were originally designed for mapping a latent vector z to an image x, we implement the same objective for mapping a source image x to a target image y.

pix2pix [21] We also compare against pix2pix [21], which is trained on paired data, to see how close we can get to this "upper bound" without using any paired training data.

For a fair comparison, we implement all the baselines using the same architecture and details as our method, except for CoGAN [30]. CoGAN builds on generators that produce images from a shared latent representation, which is incompatible with our image-to-image network. We use the public implementation of CoGAN instead³.

³ https://github.com/mingyuliutw/CoGAN

5.1.3 Comparison against baselines

As can be seen in Figure 5 and Figure 6, we were unable to achieve compelling results with any of the baselines. Our method, on the other hand, is able to produce translations that are often of similar quality to the fully supervised pix2pix.

Table 1 reports performance on the AMT perceptual realism task. Here, we see that our method can fool participants on around a quarter of trials, in both the maps→aerial photos direction and the aerial photos→maps direction, at 256 × 256 resolution⁴. All baselines almost never fooled participants.

Table 2 assesses the performance of the labels→photo task on Cityscapes, and Table 3 assesses the opposite mapping (photos→labels). In both cases, our method again outperforms the baselines.

⁴ We also train CycleGAN and pix2pix at 512 × 512 resolution, and observe comparable performance: maps→aerial photos: CycleGAN 37.5% ± 3.6% and pix2pix 33.9% ± 3.1%; aerial photos→maps: CycleGAN 16.5% ± 4.1% and pix2pix 8.5% ± 2.6%.

5.1.4 Analysis of the loss function

In Table 4 and Table 5, we compare against ablations of our full loss. Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss. We therefore conclude that both terms are critical to our results. We also evaluate our method with the cycle loss in only one direction: GAN + forward cycle loss E_x∼p_data(x)[||F(G(x)) − x||_1], or GAN + backward cycle loss E_y∼p_data(y)[||G(F(y)) − y||_1] (Equation 2), and find that it often incurs training instability and causes mode collapse, especially for the direction of the mapping that was removed. Figure 7 shows several qualitative examples.

Figure 7: Different variants of our method for mapping labels↔photos trained on Cityscapes. From left to right: input, cycle-consistency loss alone, adversarial loss alone, GAN + forward cycle-consistency loss (F(G(x)) ≈ x), GAN + backward cycle-consistency loss (G(F(y)) ≈ y), CycleGAN (our full method), and ground truth. Both Cycle alone and GAN + backward fail to produce images similar to the target domain. GAN alone and GAN + forward suffer from mode collapse, producing identical label maps regardless of the input photo.
5.1.5 Image reconstruction quality

In Figure 4, we show a few random samples of the reconstructed images F(G(x)). We observed that the reconstructed images were very close to the original inputs x, at both training and testing time, even in cases where one domain represents significantly more diverse information, such as map↔aerial photos.

5.1.6 Additional results on paired datasets

Figure 8: Example results of CycleGAN on paired datasets used in "pix2pix" [21], such as architectural labels↔photos and edges↔shoes.

5.2. Applications

We demonstrate our method on several applications where paired training data does not exist. Please refer to the appendix (Section 7) for more details about the datasets. We observe that translations on training data are often more appealing than those on test data, and full results of all applications on both training and test data can be viewed on our project website.

Collection style transfer (Figure 10 and Figure 11) We train the model on landscape photographs downloaded from Flickr and WikiArt. Note that unlike recent work on "neural style transfer" [12], our method learns to mimic the style of an entire collection of artworks, rather than transferring the style of a single selected piece of art. Therefore, we can learn to generate photos in the style of, e.g., Van Gogh, rather than just in the style of Starry Night. The size of the dataset for each artist/style was 526, 1073, 400, and 563 for Cezanne, Monet, Van Gogh, and Ukiyo-e, respectively.

Object transfiguration (Figure 13) The model is trained to translate one object class from ImageNet [41] to another (each class contains around 1000 training images). Turmukhambetov et al. [49] propose a subspace model to translate one object into another object of the same category, while our method focuses on object transfiguration between two visually similar categories.
Season transfer (Figure 13) The model is trained on 854 winter photos and 1273 summer photos of Yosemite downloaded from Flickr.

Photo generation from paintings (Figure 12) For painting→photo, we find that it is helpful to introduce an additional loss to encourage the mapping to preserve color composition between the input and output. In particular, we adopt the technique of Taigman et al. [48] and regularize the generator to be near an identity mapping when real samples of the target domain are provided as the input to the generator, i.e., L_identity(G, F) = E_y∼p_data(y)[||G(y) − y||_1] + E_x∼p_data(x)[||F(x) − x||_1].
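A minimal sketch of how this identity term could be added on top of the losses in Section 3 is shown below; G, F, and the 0.5λ weighting (see the appendix) stand in for whatever generators and loss weights are in use.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def identity_loss(G, F, real_x, real_y, lambda_cyc=10.0, lambda_idt=0.5):
    # When fed a real sample of its *target* domain, each generator
    # should behave like an identity map: G(y) ~ y and F(x) ~ x.
    loss = l1(G(real_y), real_y) + l1(F(real_x), real_x)
    # In our experiments this term is weighted by 0.5 * lambda,
    # where lambda is the cycle-consistency weight.
    return lambda_idt * lambda_cyc * loss
```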
Without L_identity, the generators G and F are free to change the tint of input images when there is no need to. For example, when learning the mapping between Monet's paintings and Flickr photographs, the generator often maps paintings of daytime to photographs taken during sunset, because such a mapping may be equally valid under the adversarial loss and cycle consistency loss. The effect of this identity mapping loss is shown in Figure 9.

Figure 9: The effect of the identity mapping loss on Monet's paintings→photos. From left to right: input paintings, CycleGAN without the identity mapping loss, CycleGAN with the identity mapping loss. The identity mapping loss helps preserve the color of the input paintings.

In Figure 12, we show additional results translating Monet's paintings to photographs. This figure and Figure 9 show results on paintings that were included in the training set, whereas for all other experiments in the paper, we only evaluate and show test set results. Because the training set does not include paired data, coming up with a plausible translation for a training set painting is a nontrivial task. Indeed, since Monet is no longer able to create new paintings, generalization to unseen, "test set" paintings is not a pressing problem.

Photo enhancement (Figure 14) We show that our method can be used to generate photos with a shallower depth of field. We train the model on flower photos downloaded from Flickr. The source domain consists of photos of flowers taken by smartphones, which usually have a deep depth of field due to the small aperture. The target domain contains photos captured by DSLRs with a larger aperture. Our model successfully generates photos with a shallower depth of field from the photos taken by smartphones.

Comparison with Gatys et al. [12] In Figure 15, we compare our results with neural style transfer [12] on photo stylization. For each row, we first use two representative artworks as the style images for [12]. Our method, on the other hand, is able to produce photos in the style of the entire collection. To compare against neural style transfer on an entire collection, we compute the average Gram matrix across the target domain and use this matrix to transfer the "average style" with [12]. Figure 16 demonstrates similar comparisons for other translation tasks. We observe that Gatys et al. [12] requires finding target style images that closely match the desired output, but still often fails to produce photo-realistic results, while our method succeeds in generating natural-looking results, similar to the target domain.

6. Limitations and Discussion

Although our method can achieve compelling results in many cases, the results are far from uniformly positive. Several typical failure cases are shown in Figure 17. On translation tasks that involve color and texture changes, like many of those reported above, the method often succeeds. We have also explored tasks that require geometric changes, with little success. For example, on the task of dog→cat transfiguration, the learned translation degenerates to making minimal changes to the input (Figure 17). This might be caused by our generator architecture choices, which are tailored for good performance on appearance changes. Handling more varied and extreme transformations, especially geometric changes, is an important problem for future work.

Some failure cases are caused by the distribution characteristics of the training datasets. For example, the horse→zebra example (Figure 17, right) became confused because our model was trained on the wild horse and zebra synsets of ImageNet, which do not contain images of a person riding a horse or zebra.

We also observe a lingering gap between the results achievable with paired training data and those achieved by our unpaired method. In some cases, this gap may be very hard – or even impossible – to close: for example, our method sometimes permutes the labels for tree and building in the output of the photos→labels task. Resolving this ambiguity may require some form of weak semantic supervision. Integrating weak or semi-supervised data may lead to substantially more powerful translators, still at a fraction of the annotation cost of fully-supervised systems.

Nonetheless, in many cases completely unpaired data is plentifully available and should be made use of. This paper pushes the boundaries of what is possible in this "unsupervised" setting.
Acknowledgments: We thank Aaron Hertzmann, Shiry
Ginosar, Deepak Pathak, Bryan Russell, Eli Shechtman,
Richard Zhang, and Tinghui Zhou for many helpful com-
ments. This work was supported in part by NSF SMA-
1514512, NSF IIS-1633310, a Google Research Award, In-
tel Corp, and hardware donations from NVIDIA. JYZ is
supported by the Facebook Graduate Fellowship and TP is
supported by the Samsung Scholarship. The photographs
used for style transfer were taken by AE, mostly in France.
Figure 10: Collection style transfer I: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, and
Ukiyo-e. Please see our website for additional examples.
Figure 11: Collection style transfer II: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, Ukiyo-e.
Please see our website for additional examples.
Figure 12: Relatively successful results on mapping Monet’s paintings to photographs. Please see our website for additional
examples.
Figure 13: Our method applied to several translation problems. These images are selected as relatively successful results
– please see our website for more comprehensive and random results. In the top two rows, we show results on object
transfiguration between horses and zebras, trained on 939 images from the wild horse class and 1177 images from the zebra
class in ImageNet [41]. Also check out the horse→zebra demo video at https://youtu.be/9reHvktowLY. The
middle two rows show results on season transfer, trained on winter and summer photos of Yosemite from Flickr. In the
bottom two rows, we train our method on 996 apple images and 1020 navel orange images from ImageNet.
Figure 14: Photo enhancement: when mapping from a set of iPhone snaps to professional DSLR photographs, the system often
learns to produce shallow focus. Here we show some of the most successful results in our test set – average performance is
considerably worse. Please see our website for more comprehensive and random examples.
Figure 15: We compare our method with neural style transfer [12] on photo stylization (photo→Ukiyo-e and photo→Cezanne). Left to right: input image, results from [12] using two different representative artworks as style images, results from [12] using the entire collection of the artist, and CycleGAN (ours).
Figure 16: We compare our method with neural style transfer [12] on various applications. From top to bottom:
apple→orange, horse→zebra, and Monet→photo. Left to right: input image, results from [12] using two different images as
style images, results from [12] using all the images from the target domain, and CycleGAN (ours).
Figure 17: Typical failure cases of our method. The panels shown include photo→Ukiyo-e, photo→Van Gogh, iPhone photo→DSLR photo, and ImageNet "wild horse" training images. Please see our website for more comprehensive results.
References

[1] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. arXiv preprint arXiv:1610.09003, 2016.
[2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint arXiv:1612.05424, 2016.
[3] R. W. Brislin. Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1(3):185–216, 1970.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[5] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
[6] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[7] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, pages 658–666, 2016.
[8] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[9] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033–1038. IEEE, 1999.
[10] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650–2658, 2015.
[11] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897, 2016.
[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
[14] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[16] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In NIPS, pages 820–828, 2016.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[18] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, pages 327–340. ACM, 2001.
[19] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[20] Q.-X. Huang and L. Guibas. Consistent shape maps via semidefinite programming. In Computer Graphics Forum, volume 32, pages 177–186. Wiley Online Library, 2013.
[21] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[22] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711. Springer, 2016.
[23] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016.
[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[26] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014.
[27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[28] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.
[29] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
[30] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, pages 469–477, 2016.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[32] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[33] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multi-class generative adversarial networks with the L2 loss function. arXiv preprint arXiv:1611.04076, 2016.
[34] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[35] M. F. Mathieu, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, pages 5040–5048, 2016.
[36] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[37] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[38] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrücken, Germany, 2013.
[39] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[40] R. Rosales, K. Achan, and B. J. Frey. Unsupervised image translation. In ICCV, pages 472–478, 2003.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[42] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[43] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR, 2017.
[44] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
[45] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828, 2016.
[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[47] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV, pages 438–451. Springer, 2010.
[48] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[49] D. Turmukhambetov, N. D. Campbell, S. J. Prince, and J. Kautz. Modeling object appearance using context-conditioned component analysis. In CVPR, pages 4156–4164, 2015.
[50] M. Twain. The Jumping Frog: in English, then in French, and then Clawed Back into a Civilized Language Once More by Patient, Unremunerated Toil. 1903.
[51] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
[52] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[53] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, pages 613–621, 2016.
[54] F. Wang, Q. Huang, and L. J. Guibas. Image co-segmentation via consistent functional maps. In ICCV, pages 849–856, 2013.
[55] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
[56] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, pages 82–90, 2016.
[57] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[58] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
[59] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, pages 192–199, 2014.
[60] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, pages 1426–1433. IEEE, 2010.
[61] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
[62] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[63] T. Zhou, Y. Jae Lee, S. X. Yu, and A. A. Efros. FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In CVPR, pages 1191–1200, 2015.
[64] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3D-guided cycle consistency. In CVPR, pages 117–126, 2016.
[65] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
7. Appendix

7.1. Training details

All the networks (except edges↔shoes) were trained from scratch, with a learning rate of 0.0002, for 200 epochs. In practice, we divide the objective by 2 while optimizing D, which slows down the rate at which D learns relative to G. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.
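A rough sketch of this schedule and initialization is shown below; the 100 + 100 epoch split and the Gaussian initialization follow the text above, while the choice of optimizer and its hyperparameters are illustrative assumptions, as they are not specified in this excerpt.

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Gaussian initialization with mean 0 and std 0.02 for conv layers.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

def make_lr_scheduler(optimizer, n_epochs=100, n_epochs_decay=100):
    # Constant learning rate for the first n_epochs, then linear decay
    # to zero over the following n_epochs_decay epochs.
    def lr_lambda(epoch):
        return 1.0 - max(0, epoch - n_epochs) / float(n_epochs_decay)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

# Example usage with a placeholder network and an assumed optimizer:
net = nn.Conv2d(3, 3, 3)
net.apply(init_weights)
optimizer = torch.optim.Adam(net.parameters(), lr=0.0002)
scheduler = make_lr_scheduler(optimizer)
```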
Cityscapes label↔photo 2975 training images from the Cityscapes training set [4] with image size 128 × 128. We used the Cityscapes val set for testing.

Maps↔aerial photograph 1096 training images scraped from Google Maps [21] with image size 256 × 256. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).

Architectural facades labels↔photo 400 training images from [38].

Edges→shoes Around 50,000 training images from the UT Zappos50K dataset [59]. The model was trained for 5 epochs with a learning rate of 0.0002.

Horse↔Zebra and Apple↔Orange The images for each class were downloaded from ImageNet using the keywords wild horse, zebra, apple, and navel orange. The images were scaled to 256 × 256 pixels. The training set size of each class was horse: 939, zebra: 1177, apple: 996, orange: 1020.

Summer↔Winter Yosemite The images were downloaded using the Flickr API with the tag yosemite and the datetaken field. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training set size of each class was summer: 1273, winter: 854.

Photo↔Art for style transfer The art images were downloaded from Wikiart.org by crawling. Some artworks that were sketches or too obscene were pruned by hand. The photos were downloaded from Flickr using the combination of tags landscape and landscapephotography. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training set size of each class was Monet: 1074, Cezanne: 584, Van Gogh: 401, Ukiyo-e: 1433, Photographs: 6853. The Monet dataset was particularly pruned to include only landscape paintings, and the Van Gogh dataset included only his later works that represent his most recognizable artistic style.

Monet's paintings→photos In order to achieve high resolution while conserving memory, random square crops of the rectangular images were used for training. To generate results, images of width 512 pixels with the correct aspect ratio were passed to the generator. The weight for the identity mapping loss was 0.5λ, where λ was the weight for the cycle consistency loss, and we set λ = 10.

Flower photo enhancement Flower images taken on iPhones were downloaded from Flickr by searching for photos taken by an Apple iPhone 5, 5s, or 6, with the search text flower. DSLR images with shallow depth of field were also downloaded from Flickr with the search tag flower, dof. The images were scaled to width 360 pixels. The identity mapping loss of weight 0.5λ was used. The training set sizes of the smartphone and DSLR datasets were 1813 and 3326, respectively.

7.2. Network architectures

Our code and models are available at https://github.com/junyanz/CycleGAN. We also provide a PyTorch implementation at https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.

Generator architectures We adapt our architectures from Johnson et al. [22]. We use 6 residual blocks for 128 × 128 training images, and 9 residual blocks for 256 × 256 or higher-resolution training images. Below, we follow the naming convention used in Johnson et al.'s GitHub repository⁵.

⁵ https://github.com/jcjohnson/

Let c7s1-k denote a 7 × 7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a 3 × 3 Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding was used to reduce artifacts. Rk denotes a residual block that contains two 3 × 3 convolutional layers with the same number of filters on both layers. uk denotes a 3 × 3 fractional-strided Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2.

The network with 6 residual blocks consists of:
c7s1-32,d64,d128,R128,R128,R128,R128,R128,R128,u64,u32,c7s1-3

The network with 9 residual blocks consists of:
c7s1-32,d64,d128,R128,R128,R128,R128,R128,R128,R128,R128,R128,u64,u32,c7s1-3
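A hedged PyTorch sketch of this generator specification follows. It uses the c7s1-k/dk/Rk/uk blocks as described above; details not stated in the text, such as the final tanh and the exact padding of the resampling layers, are assumptions.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Rk: two 3x3 conv layers with k filters and a skip connection."""
    def __init__(self, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k),
        )

    def forward(self, x):
        return x + self.block(x)

def resnet_generator(n_blocks=9, in_ch=3, out_ch=3):
    """c7s1-32, d64, d128, n_blocks x R128, u64, u32, c7s1-3."""
    layers = [  # c7s1-32
        nn.ReflectionPad2d(3), nn.Conv2d(in_ch, 32, 7), nn.InstanceNorm2d(32), nn.ReLU(True),
        # d64, d128: stride-2 downsampling convolutions
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.InstanceNorm2d(64), nn.ReLU(True),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(True),
    ]
    layers += [ResnetBlock(128) for _ in range(n_blocks)]  # R128 x n_blocks
    layers += [  # u64, u32: stride-1/2 (fractionally-strided) convolutions
        nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(64), nn.ReLU(True),
        nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(32), nn.ReLU(True),
        # c7s1-3 followed by tanh (assumed) to produce an image in [-1, 1]
        nn.ReflectionPad2d(3), nn.Conv2d(32, out_ch, 7), nn.Tanh(),
    ]
    return nn.Sequential(*layers)
```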
Discriminator architectures For the discriminator networks, we use 70 × 70 PatchGANs [21]. Let Ck denote a 4 × 4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. After the last layer, we apply a convolution to produce a 1-dimensional output. We do not use InstanceNorm for the first C64 layer. We use leaky ReLUs with a slope of 0.2. The discriminator architecture is:
C64-C128-C256-C512
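And a matching sketch of the PatchGAN discriminator; the stride and padding of the final scoring layer are not fully specified above, so treat them as assumptions.

```python
import torch.nn as nn

def patchgan_discriminator(in_ch=3):
    """C64-C128-C256-C512, then a 1-channel convolution of patch scores."""
    def C(k_in, k_out, norm=True):
        layers = [nn.Conv2d(k_in, k_out, 4, stride=2, padding=1)]
        if norm:  # no InstanceNorm on the first C64 layer
            layers.append(nn.InstanceNorm2d(k_out))
        layers.append(nn.LeakyReLU(0.2, True))
        return layers

    return nn.Sequential(
        *C(in_ch, 64, norm=False),   # C64
        *C(64, 128),                 # C128
        *C(128, 256),                # C256
        *C(256, 512),                # C512
        # Final convolution maps features to a single-channel map of
        # real/fake scores, one per overlapping image patch.
        nn.Conv2d(512, 1, 4, stride=1, padding=1),
    )
```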