Meta Optimal Transport

Brandon Amos¹  Samuel Cohen¹²†  Giulia Luise²  Ievgen Redko³⁴

¹Meta AI   ²University College London   ³University Jean Monnet   ⁴Aalto University

Correspondence to: bda@meta.com


arXiv:2206.05262v1 [cs.LG] 10 Jun 2022

Abstract
We study the use of amortized optimization to predict optimal transport (OT) maps
from the input measures, which we call Meta OT. This helps repeatedly solve sim-
ilar OT problems between different measures by leveraging the knowledge and
information present from past problems to rapidly predict and solve new prob-
lems. Otherwise, standard methods ignore the knowledge of the past solutions
and suboptimally re-solve each problem from scratch. Meta OT models surpass
the standard convergence rates of log-Sinkhorn solvers in the discrete setting and
convex potentials in the continuous setting. We improve the computational time
of standard OT solvers by multiple orders of magnitude in discrete and continuous
transport settings between images, spherical data, and color palettes. Our source
code is available at http://github.com/facebookresearch/meta-ot.

1 Introduction
Optimal transportation [Villani, 2009, Ambrosio, 2003, Santambrogio, 2015, Peyré et al., 2019,
Merigot and Thibert, 2021] is thriving in domains including economics [Galichon, 2016], rein-
forcement learning [Dadashi et al., 2020, Fickinger et al., 2021], style transfer [Kolkin et al., 2019],
generative modeling [Arjovsky et al., 2017, Seguy et al., 2017, Huang et al., 2020, Rout et al., 2021],
geometry [Solomon et al., 2015, Cohen et al., 2021], domain adaptation [Courty et al., 2017b, Redko
et al., 2019], signal processing [Kolouri et al., 2017], fairness [Jiang et al., 2020], and cell repro-
gramming [Schiebinger et al., 2019]. A core component in these settings is to couple two measures
(α, β) supported on domains (X , Y) by solving a transport optimization problem such as the primal
Kantorovich problem, which is defined by:
$$\pi^\star(\alpha,\beta,c) \in \arg\min_{\pi \in \mathcal{U}(\alpha,\beta)} \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,d\pi(x,y), \tag{1}$$

where the optimal coupling π⋆ is a joint distribution over the product space, U(α, β) is the set of
admissible couplings between α and β, and c : X × Y → R is the ground cost, which represents a
notion of distance between elements in X and elements in Y.
Challenges. Unfortunately, solving eq. (1) even once is computationally expensive between general
measures, and computationally cheaper alternatives are an active research topic: entropic optimal trans-
port [Cuturi, 2013] smooths the transport problem with an entropy penalty, and sliced distances
[Kolouri et al., 2016, 2018, 2019, Deshpande et al., 2019] solve OT between 1-dimensional projec-
tions of the measures, where eq. (1) can be solved easily.
Furthermore, when an optimal transport method is deployed in practice, eq. (1) is not just solved
a single time, but is repeatedly solved for new scenarios between different input measures (α, β).
For example, the measures could be representations of images we care about optimally transporting
between, and in deployment we would receive a stream of new images to couple.

† Contributed while at UCL and visiting Meta AI.

Preprint. Under review.


Standard optimal transport solvers deployed in this setting would re-solve the optimization problems from scratch, but
this ignores the shared structure and information present between different coupling problems. We
note this is not the out-of-sample setting of Seguy et al. [2017], Perrot et al. [2016] that seeks to
couple measures and then extrapolate the map to larger measures containing the original measures.
Overview and outline. We study the use of amortized optimization and machine learning methods
to rapidly solve multiple optimal transport problems and predict the solution from the input measures
(α, β). This setting involves learning a meta model to predict the solution to the optimal transport
problem, which we will refer to as Meta Optimal Transport. We learn Meta OT models to predict
the solutions to optimal transport problems and significantly improve the computational time and
number of iterations needed to solve eq. (1) between discrete (sect. 3.1) and continuous (sect. 3.2)
measures. The paper is organized as follows: sect. 2 recalls the main concepts needed for the rest
of the paper, in particular the formulations of the entropy regularized and unregularized optimal
transport problems and the basic notions of amortized optimization; sect. 3 presents the Meta OT
models and algorithms; and sect. 4 empirically demonstrates the effectiveness of Meta OT.

2 Preliminaries and background


2.1 Dual optimal transport solvers

We review foundations of optimal transportation, following the notation of Peyré et al. [2019] in
most places. The discrete setting often favors the entropic regularized version since it can be com-
puted efficiently and in a parallelized way using the Sinkhorn algorithm. On the other hand, the
continuous setting is often solved from samples using convex potentials. While the primal Kan-
torovich formulation in eq. (1) provides an intuitive problem description, optimal transport problems
are rarely solved directly in this form due to the high-dimensionality of the couplings π and the diffi-
culty of satisfying the coupling constraints U(α, β). Instead, most computational OT solvers use the
dual of eq. (1), which we build our Meta OT solvers on top of in discrete and continuous settings.

2.1.1 Entropic OT between discrete measures with the Sinkhorn algorithm


Let $\alpha := \sum_{i=1}^m a_i \delta_{x_i}$ and $\beta := \sum_{i=1}^n b_i \delta_{y_i}$ be discrete measures, where δ_z is a Dirac at point z and a ∈ ∆^{m−1} and b ∈ ∆^{n−1} are in the probability simplex defined by

$$\Delta^{k-1} := \Big\{x \in \mathbb{R}^k : x \geq 0 \ \text{and}\ \textstyle\sum_i x_i = 1\Big\}. \tag{2}$$

Algorithm 1 Sinkhorn(α, β, c, ε, f₀ = 0, g₀ = 0)
  for iteration i = 1 to N do
    f_i ← ε log a − ε log(K exp{g_{i−1}/ε})
    g_i ← ε log b − ε log(K⊤ exp{f_i/ε})
  end for
  Compute P_N from f_N, g_N using eq. (6)
  return P_N ≈ P⋆

Discrete OT. In the discrete setting, eq. (1) simplifies to the linear program

$$P^\star(\alpha,\beta,c) \in \arg\min_{P \in \mathcal{U}(a,b)} \langle C, P\rangle, \qquad \mathcal{U}(a,b) := \{P \in \mathbb{R}_+^{n\times m} : P\mathbf{1}_m = a,\ P^\top\mathbf{1}_n = b\}, \tag{3}$$

where P is a coupling matrix, P⋆(α, β, c) is the optimal coupling, the cost can be discretized as a
matrix $C \in \mathbb{R}^{m\times n}$ with entries $C_{i,j} := c(x_i, y_j)$, and $\langle C, P\rangle := \sum_{i,j} C_{i,j} P_{i,j}$.
Entropic OT. The linear program above can be regularized by adding the entropy of the coupling to
smooth the objective, as in Cominetti and San Martín [1994], Cuturi [2013], resulting in:

$$P^\star(\alpha,\beta,c,\varepsilon) \in \arg\min_{P \in \mathcal{U}(a,b)} \langle C, P\rangle - \varepsilon H(P), \tag{4}$$

where $H(P) := -\sum_{i,j} P_{i,j}(\log(P_{i,j}) - 1)$ is the discrete entropy of a coupling matrix P.
Entropic OT dual. As presented in Peyré et al. [2019, Prop. 4.4], the dual of eq. (4) is
$$f^\star, g^\star \in \arg\max_{f \in \mathbb{R}^n,\, g \in \mathbb{R}^m} \langle f, a\rangle + \langle g, b\rangle - \varepsilon\,\langle \exp\{f/\varepsilon\},\, K\exp\{g/\varepsilon\}\rangle, \qquad K_{i,j} := \exp\{-C_{i,j}/\varepsilon\}, \tag{5}$$

where $K \in \mathbb{R}^{m\times n}$ is the Gibbs kernel and the dual variables or potentials $f \in \mathbb{R}^n$ and $g \in \mathbb{R}^m$ are
associated, respectively, with the marginal constraints P1_m = a and P⊤1_n = b. The optimal duals
depend on the problem, e.g. f⋆(α, β, c, ε), but we omit this dependence for notational simplicity.

Recovering the primal solution from the duals. Given optimal duals f⋆, g⋆ that solve eq. (5), the
optimal coupling P⋆ to the primal problem in eq. (4) can be obtained by

$$P^\star_{i,j}(\alpha,\beta,c,\varepsilon) := \exp\{f^\star_i/\varepsilon\}\, K_{i,j}\, \exp\{g^\star_j/\varepsilon\} \qquad (K \text{ is defined in eq. (5)}). \tag{6}$$

The Sinkhorn algorithm. Algorithm 1 summarizes the log-space version, which takes closed-form
block coordinate ascent updates on eq. (5) obtained from the first-order optimality conditions [Peyré
et al., 2019, Remark 4.21]. We will use it to fine-tune predictions made by our Meta OT models.
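To make the updates concrete, the following is a minimal JAX sketch of Algorithm 1's log-domain iterations; the function name and the fixed iteration count are our own illustrative choices rather than the OTT-based implementation used for the experiments.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def log_sinkhorn(a, b, C, eps, num_iters=100, f=None, g=None):
    """Log-domain Sinkhorn (Algorithm 1): a are source weights (m,),
    b are target weights (n,), C is the (m, n) cost matrix."""
    f = jnp.zeros_like(a) if f is None else f
    g = jnp.zeros_like(b) if g is None else g
    for _ in range(num_iters):
        # f <- eps log a - eps log(K exp(g/eps)), evaluated stably in log-space.
        f = eps * jnp.log(a) - eps * logsumexp((g[None, :] - C) / eps, axis=1)
        # g <- eps log b - eps log(K^T exp(f/eps)).
        g = eps * jnp.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
    # Recover the coupling of eq. (6): P_ij = exp((f_i + g_j - C_ij) / eps).
    P = jnp.exp((f[:, None] + g[None, :] - C) / eps)
    return f, g, P
```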
Computing the error. Standard implementations of the Sinkhorn algorithm, such as Flamary et al.
[2021], Cuturi et al. [2022], measure the error of a candidate dual solution (f, g) by computing the
deviation from the marginal constraints, which we will also use in comparing our solution quality:
$$\mathrm{err}(f,g;\alpha,\beta,c) := \|P\mathbf{1}_m - a\|_1 + \|P^\top\mathbf{1}_n - b\|_1 \qquad (\text{compute } P \text{ from eq. (6)}). \tag{7}$$

Mapping between the duals. The first-order optimality conditions of eq. (5) also provide an equiv-
alence between the optimal dual potentials that we will make use of:
$$g(f; b, c) := \varepsilon\log b - \varepsilon\log\big(K^\top \exp\{f/\varepsilon\}\big). \tag{8}$$
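For reference, eqs. (6) to (8) translate directly into code; the helper names below are ours and purely illustrative.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def coupling_from_duals(f, g, C, eps):
    # Eq. (6): P_ij = exp(f_i/eps) K_ij exp(g_j/eps) with K_ij = exp(-C_ij/eps).
    return jnp.exp((f[:, None] + g[None, :] - C) / eps)

def marginal_error(f, g, a, b, C, eps):
    # Eq. (7): total deviation of the coupling's marginals from (a, b).
    P = coupling_from_duals(f, g, C, eps)
    return jnp.sum(jnp.abs(P.sum(axis=1) - a)) + jnp.sum(jnp.abs(P.sum(axis=0) - b))

def g_from_f(f, b, C, eps):
    # Eq. (8): the optimal g given f, from the first-order optimality conditions.
    return eps * jnp.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
```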

2.1.2 Wasserstein-2 OT between continuous (Euclidean) measures with dual potentials


Let α and β be continuous measures in Euclidean space X = Y = R^d (with α absolutely continuous
with respect to the Lebesgue measure) and the ground cost be the squared Euclidean distance
c(x, y) := ‖x − y‖²₂. Then the minimum of eq. (1) defines the square of the Wasserstein-2 distance:

$$W_2^2(\alpha,\beta) := \min_{\pi \in \mathcal{U}(\alpha,\beta)} \int_{\mathcal{X}\times\mathcal{Y}} \|x-y\|_2^2\, d\pi(x,y) = \min_{T} \int_{\mathcal{X}} \|x - T(x)\|_2^2\, d\alpha(x), \tag{9}$$

where T is a transport map pushing α to β, i.e. T#α = β with the pushforward operator defined
by T#α(B) := α(T⁻¹(B)) for any measurable set B.

Algorithm 2 W2GN(α, β, ϕ₀)
  for iteration i = 1 to N do
    Sample from (α, β) and estimate L(ϕ_{i−1})
    Update ϕ_i with an approximation to ∇_ϕ L(ϕ_{i−1})
  end for
  return T_N(·) := ∇_x ψ_{ϕ_N}(·) ≈ T⋆(·)
Convex dual potentials. The primal form in eq. (9) is difficult to solve, as in the discrete setting, due
to the difficulty of representing the coupling and satisfying the constraints. Makkuva et al. [2020],
Taghvaei and Jalali [2019], Korotin et al. [2019, 2021b, 2022] propose to instead solve the dual:
$$\psi^\star(\,\cdot\,;\alpha,\beta) \in \arg\min_{\psi \in \mathrm{convex}} \int_{\mathcal{X}} \psi(x)\,d\alpha(x) + \int_{\mathcal{Y}} \overline{\psi}(y)\,d\beta(y), \tag{10}$$

where ψ is a convex function referred to as a convex potential, and $\overline{\psi}(y) := \max_{x\in\mathcal{X}} \langle x, y\rangle - \psi(x)$ is
the Legendre-Fenchel transform or convex conjugate of ψ [Fenchel, 1949, Rockafellar, 2015]. The
potential ψ is often approximated with an input-convex neural network (ICNN) [Amos et al., 2017].
Recovering the primal solution from the dual. Given an optimal dual ψ⋆ for eq. (10), Brenier
[1991] remarkably shows that an optimal map T⋆ for eq. (9) can be obtained with differentiation:

$$T^\star(x) = \nabla_x \psi^\star(x). \tag{11}$$
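Because eq. (11) only requires differentiating the potential, the map can be read off with jax.grad; the small convex potential below is an illustrative stand-in for a trained ICNN, not a model from this paper.

```python
import jax
import jax.numpy as jnp

def psi(params, x):
    # Toy convex potential standing in for an ICNN psi_phi: a convex quadratic
    # plus a sum of softplus(linear) terms is convex in x.
    A, W = params
    return 0.5 * jnp.sum((A @ x) ** 2) + jnp.sum(jax.nn.softplus(W @ x))

# Eq. (11): the optimal (Brenier) map is the gradient of the potential.
transport_map = jax.grad(psi, argnums=1)

params = (jnp.eye(3), jnp.ones((4, 3)))
x = jnp.array([0.1, 0.2, 0.3])
print(transport_map(params, x))   # image of x under the (toy) map
```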

Wasserstein-2 Generative Networks (W2GNs). Korotin et al. [2019] model ψ_ϕ and its conjugate ψ̄_ϕ in eq. (10)
with two separate ICNNs parameterized by ϕ. The separate model for ψ̄_ϕ is useful because the
conjugate operation in eq. (10) becomes computationally expensive. They optimize the loss:

$$\mathcal{L}(\varphi) := \underbrace{\mathbb{E}_{x\sim\alpha}\big[\psi_\varphi(x)\big] + \mathbb{E}_{y\sim\beta}\big[\langle \nabla\overline{\psi}_\varphi(y),\, y\rangle - \psi_\varphi(\nabla\overline{\psi}_\varphi(y))\big]}_{\text{cyclic monotone correlations (dual objective)}} + \gamma\, \underbrace{\mathbb{E}_{y\sim\beta}\big\|\nabla\psi_{\bar\varphi} \circ \nabla\overline{\psi}_\varphi(y) - y\big\|_2^2}_{\text{cycle-consistency regularizer}}, \tag{12}$$

where ϕ̄ is a detached copy of the parameters and γ is a hyper-parameter. The first term gives the
cyclic monotone correlations [Chartrand et al., 2009, Taghvaei and Jalali, 2019], which optimize the
dual objective in eq. (10), and the second term provides cycle consistency [Zhu et al., 2017] to
estimate the conjugate ψ̄. Algorithm 2 shows how L is typically optimized using samples from the
measures, which we use to fine-tune Meta OT predictions.
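To make the structure of eq. (12) explicit, here is a schematic JAX version of the loss under our reading of the notation: psi and psi_conj stand in for the two ICNN potentials, stop_gradient plays the role of the detached copy, and the batches are samples from α and β. This is an illustrative sketch rather than the W2GN implementation.

```python
import jax
import jax.numpy as jnp

def w2gn_loss(params, params_conj, x_batch, y_batch, psi, psi_conj, gamma=3.0):
    """Sketch of eq. (12). psi(params, x) and psi_conj(params_conj, y) are
    the two potentials; x_batch ~ alpha and y_batch ~ beta."""
    psi_b = jax.vmap(psi, in_axes=(None, 0))
    grad_psi = jax.vmap(jax.grad(psi, argnums=1), in_axes=(None, 0))
    grad_psi_conj = jax.vmap(jax.grad(psi_conj, argnums=1), in_axes=(None, 0))

    t_y = grad_psi_conj(params_conj, y_batch)          # nabla psi_conj(y)
    # Cyclic monotone correlations (the dual objective terms).
    dual = jnp.mean(psi_b(params, x_batch)) + jnp.mean(
        jnp.sum(t_y * y_batch, axis=-1) - psi_b(params, t_y))
    # Cycle-consistency regularizer with a detached copy of the parameters.
    cycle = jnp.mean(jnp.sum(
        (grad_psi(jax.lax.stop_gradient(params), t_y) - y_batch) ** 2, axis=-1))
    return dual + gamma * cycle
```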

[Figure 1 diagram: three panels, General, Discrete (Entropic), and Continuous (Wasserstein-2), each mapping input measures and cost (α, β, c) ∼ D through the amortization parameters θ to dual potentials and couplings: π⋆ in general, (f⋆, g⋆) → P⋆ in the discrete case, and ψ⋆ → T⋆ in the continuous case.]

Figure 1: Meta OT uses objective-based amortization for optimal transport. In the general formula-
tion, the parameters θ capture shared structure in the optimal couplings π⋆ between multiple input
measures and costs over some distribution D. In practice, we learn this shared structure over the
dual potentials which map back to the coupling: f⋆ in discrete settings and ψ⋆ in continuous ones.

2.2 Amortized optimization and learning to optimize

Our paper is an application of amortized optimization methods that predict the solutions of opti-
mization problems, as surveyed in, e.g., Chen et al. [2021], Amos [2022]. We use the basic setup
from Amos [2022], which considers unconstrained continuous optimization problems of the form
$$z^\star(\phi) \in \arg\min_{z}\, J(z;\phi), \tag{13}$$

where J is the objective, z ∈ Z is the domain, and φ ∈ Φ is some context or parameterization. In
other words, the context conditions the objective but is not optimized over. Given a distribution over
contexts P(φ), we learn a model ẑ_θ parameterized by θ to approximate eq. (13), i.e. ẑ_θ(φ) ≈ z⋆(φ).
J will be differentiable for us, so we optimize the parameters using objective-based learning with

$$\min_\theta\; \mathbb{E}_{\phi\sim\mathcal{P}(\phi)}\, J(\hat z_\theta(\phi); \phi), \tag{14}$$

which does not require ground-truth solutions z⋆ and can be optimized with a gradient-based solver.
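As a self-contained toy example of eqs. (13) and (14) (ours, not from the paper), consider amortizing the family of quadratics J(z; φ) = ½‖z − φ‖², whose solution is z⋆(φ) = φ; the model learns to output it without ever seeing ground-truth solutions.

```python
import jax
import jax.numpy as jnp

def J(z, phi):                        # objective of eq. (13); minimizer is z* = phi
    return 0.5 * jnp.sum((z - phi) ** 2)

def z_hat(theta, phi):                # linear amortization model
    W, b = theta
    return W @ phi + b

def amortization_loss(theta, phis):   # eq. (14): E_phi J(z_hat_theta(phi); phi)
    return jnp.mean(jax.vmap(lambda p: J(z_hat(theta, p), p))(phis))

theta = (jnp.zeros((2, 2)), jnp.zeros(2))
key = jax.random.PRNGKey(0)
for _ in range(500):                  # plain gradient descent on theta
    key, sub = jax.random.split(key)
    phis = jax.random.normal(sub, (128, 2))          # contexts phi ~ P(phi)
    grads = jax.grad(amortization_loss)(theta, phis)
    theta = jax.tree_util.tree_map(lambda t, g: t - 0.1 * g, theta, grads)
```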

3 Meta Optimal Transport


Figure 1 illustrates our key contribution of connecting objective-based amortization in eq. (14) to
optimal transport. We consider solving multiple OT problems and learning shared structure and
correlations between them. We denote a joint meta-distribution over the input measures and costs
with D(α, β, c), which we call meta to distinguish it from the measures α, β.
In general, we could introduce a model that directly predicts the primal solution to eq. (1), i.e.
π_θ(α, β, c) ≈ π⋆(α, β, c) for (α, β, c) ∼ D. This is difficult for the same reason why most compu-
tational methods do not operate directly in the primal space: the optimal coupling is often a high-
dimensional joint distribution with non-trivial marginal constraints. We instead turn to predicting
the dual variables used by today’s solvers.

3.1 Meta OT between discrete measures

We build
Pmon standard methods Pn for entropic OT reviewed in sect. 2.1.1 between discrete measures
α := i=1 ai δxi and β := i=1 bi δxi with a ∈ ∆m−1 and b ∈ ∆n−1 coupled using a cost c. In the
Meta OT setting, the measures and cost are the contexts for amortization and sampled from a meta-
distribution, i.e. (α, β, c) ∼ D(α, β, c). For example, sects. 4.1 and 4.2 considers meta-distributions
over the weights of the atoms, i.e. (a, b) ∼ D, where D is a distribution over ∆m−1 × ∆n−1 .
Amortization objective. We will seek to predict the optimal potential. At optimality, the pair of
potentials are related to each other via eq. (8), i.e. g(f; α, β, c) := ε log b − ε log(K⊤ exp{f/ε}),
where $K \in \mathbb{R}^{m\times n}$ is the Gibbs kernel from eq. (5). Hence, it is sufficient to predict one of the
potentials, e.g. f, and recover the other.

Algorithm 3 Training Meta OT
  Initialize amortization model with θ₀
  for iteration i do
    Sample (α, β, c) ∼ D
    Predict duals f̂_θ or ϕ̂_θ on the sample
    Estimate the loss in eq. (16) or eq. (17)
    Update θ_{i+1} with a gradient step
  end for

Algorithm 4 Fine-tuning with Sinkhorn
  Predict duals f̂_θ(α, β, c)
  Compute ĝ from f̂_θ using eq. (8)
  return Sinkhorn(α, β, c, ε, f̂_θ, ĝ)

Algorithm 5 Fine-tuning with W2GN
  Predict dual ICNN parameters ϕ̂_θ(α, β, c)
  return W2GN(α, β, ϕ̂_θ)

We thus re-formulate eq. (5) to just optimize over f with

$$f^\star(\alpha,\beta,c,\varepsilon) \in \arg\min_{f\in\mathbb{R}^n} J(f;\alpha,\beta,c), \tag{15}$$

where $-J(f;\alpha,\beta,c) := \langle f, a\rangle + \langle g, b\rangle$ is the dual objective over f, with g = g(f; b, c) given by
eq. (8). Even though most solvers optimize over f and g jointly as in eq. (5), amortizing over both
would likely need: 1) to have a higher capacity than a model just predicting f, and 2) to learn how
f and g are connected through eq. (8), while in eq. (15) we explicitly provide this knowledge.
Amortization model. We predict the solution to eq. (15) with f̂_θ(α, β, c) parameterized by θ,
resulting in a computationally efficient approximation f̂_θ ≈ f⋆. Here we use the notation f̂_θ(α, β, c)
to mean that the model f̂_θ depends on representations of the input measures and cost. In our settings,
we define f̂_θ as a fully-connected MLP mapping from the atoms of the measures to the duals.
Amortization loss. Applying objective-based amortization from eq. (14) to the dual in eq. (15)
completes our learning setup. Our model should best-optimize the expectation of the dual objective
$$\min_\theta\; \mathbb{E}_{(\alpha,\beta,c)\sim\mathcal{D}}\, J(\hat f_\theta(\alpha,\beta,c); \alpha,\beta,c), \tag{16}$$

which is appealing as it does not require ground-truth solutions f⋆. Algorithm 3 shows a basic
training loop for eq. (16) using a gradient-based optimizer such as Adam [Kingma and Ba, 2014].
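Putting eqs. (8), (15) and (16) together, a schematic JAX/Optax version of Algorithm 3 for the discrete case could look as follows; the MLP, the synthetic problem sampler, and the hyper-parameters are our own illustration rather than the architecture or code used in the experiments.

```python
import jax
import jax.numpy as jnp
import optax
from jax.scipy.special import logsumexp

def mlp(params, x):
    for W, b in params[:-1]:
        x = jax.nn.relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b                                   # predicted dual potential f_hat

def dual_loss(params, a, b, C, eps):
    # Eq. (16): minimize J(f_hat), i.e. the negated dual objective of eq. (15).
    f = mlp(params, jnp.concatenate([a, b]))
    g = eps * jnp.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)   # eq. (8)
    return -(jnp.dot(f, a) + jnp.dot(g, b))

def init_mlp(key, sizes):
    params = []
    for k, (d_in, d_out) in zip(jax.random.split(key, len(sizes) - 1),
                                zip(sizes[:-1], sizes[1:])):
        params.append((jax.random.normal(k, (d_out, d_in)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

m = n = 16
eps = 1e-2
key = jax.random.PRNGKey(0)
params = init_mlp(key, [m + n, 256, 256, m])
opt = optax.adam(1e-3)
opt_state = opt.init(params)
C = jax.random.uniform(key, (m, n))            # fixed cost; (a, b) vary per problem

for step in range(1000):                       # Algorithm 3
    key, ka, kb = jax.random.split(key, 3)
    a = jax.nn.softmax(jax.random.normal(ka, (m,)))   # sample a problem (a, b) ~ D
    b = jax.nn.softmax(jax.random.normal(kb, (n,)))
    grads = jax.grad(dual_loss)(params, a, b, C, eps)
    updates, opt_state = opt.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
```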
Sinkhorn fine-tuning. The dual prediction made by fˆθ with an associated ĝ can easily be input as
the initialization to a standard Sinkhorn solver as shown in algorithm 4. This allows us to deploy the
predicted potential with Sinkhorn to obtain the optimal potentials with only a few extra iterations.
On accelerated solvers. Here we have only considered fine-tuning the Meta OT prediction with
a log-Sinkhorn solver. Meta OT can also be combined with accelerated variants of entropic OT
solvers such as Thibault et al. [2017], Altschuler et al. [2017], Alaya et al. [2019], Lin et al. [2019]
that would otherwise solve every problem from scratch.
Convergence rates. The knowledge of the hyper-distribution D of problems being solved enables
Meta OT methods to surpass the convergence rates and computational time of standard algorithms
by restricting the set of problems rather than considering the average- or worst-case performance.
The model fˆθ distills information between the problem instances into the parameters θ.

3.2 Meta OT between continuous measures (Wasserstein-2)

We take an analogous approach to predicting the Wasserstein-2 map between continuous measures
for Wasserstein-2 as reviewed in sect. 2.1.2. Here the measures α, β are supported in continuous
space X = Y = Rd and we focus on computing Wasserstein-2 couplings from instances sampled
from a meta-distribution (α, β) ∼ D(α, β). The cost c is not included in D as it remains fixed to the
squared Euclidean cost everywhere here.
One challenge here is that the optimal dual potential ψ⋆( · ; α, β) in eq. (10) is a convex function and
not simply a finite-dimensional real vector. The dual potentials in this setting are approximated by,
e.g., an ICNN. We thus propose a Meta ICNN that predicts the parameters ϕ of an ICNN ψϕ that
approximates the optimal dual potentials, which can be seen as a hypernetwork [Stanley et al., 2009,
Ha et al., 2016]. The dual prediction made by ϕ̂_θ can easily be input as the initial value to a standard W2GN solver, as shown in algorithm 5.

[Figure 2 panels: interpolations α₀, α₁, α₂ from Sinkhorn (converged, ground-truth) and from the Meta OT initial prediction.]

Figure 2: Interpolations between MNIST test digits using couplings obtained from (left) solving
the problem with Sinkhorn, and (right) Meta OT model’s initial prediction, which is ≈100 times
computationally cheaper and produces a nearly identical coupling.

[Figure 3 diagram: α and β are encoded by a shared ResNet_θ into latents z₁, z₂; an MLP_θ decodes z into the ICNN parameters ϕ̂_θ of the potential ψ_{ϕ̂_θ}, whose gradient gives the transport map T̂(·) = ∇_x ψ_{ϕ̂_θ}(·).]
Figure 3: A Meta ICNN for image-based input measures. A shared ResNet processes the input
measures α and β into latents z that are decoded with an MLP into the parameters ϕ of an ICNN
dual potential ψϕ . The derivative of the ICNN provides the transport map T̂ .

Table 1: Discrete OT runtime (in seconds) to reach a marginal error of 10⁻³, and Meta OT's runtime.

                                    MNIST                     Spherical
  Sinkhorn                          3.3·10⁻³ ±1.0·10⁻³        1.5 ±0.64
  Meta OT + Sinkhorn                2.2·10⁻³ ±3.8·10⁻⁴        0.48 ±0.24
  Meta OT (Initial prediction)      4.6·10⁻⁵ ±2.8·10⁻⁶        4.4·10⁻⁵ ±3.2·10⁻⁶

Table 2: Color transfer runtimes and values.

                     Iter    Runtime (s)            Dual Value
  Meta OT            None    3.5·10⁻³ ±2.7·10⁻⁴     0.90 ±6.08·10⁻²
  Meta OT + W2GN     1k      0.93 ±2.27·10⁻²        1.0 ±2.57·10⁻³
                     2k      1.84 ±3.78·10⁻²        1.0 ±5.30·10⁻³
  W2GN               1k      0.90 ±1.62·10⁻²        0.96 ±2.62·10⁻²
                     2k      1.81 ±3.05·10⁻²        0.99 ±1.14·10⁻²

We report the mean and standard deviation across 10 test instances.

App. B discusses other modeling choices we considered:
we tried models based on MAML [Finn et al., 2017] and neural processes [Garnelo et al., 2018b,a].
Amortization objective. We build on the W2GN formulation [Korotin et al., 2019] and seek parameters
ϕ⋆ optimizing the dual ICNN potentials ψ_ϕ and ψ̄_ϕ with L(ϕ; α, β) from eq. (12). We
chose W2GN due to its stability, but could also easily use other losses optimizing ICNN potentials.
Amortization model: the Meta ICNN. We predict the solution to eq. (12) with ϕ̂_θ(α, β) param-
eterized by θ, resulting in a computationally efficient approximation to the optimum ϕ̂_θ ≈ ϕ⋆.
Figure 3 instantiates a convolutional Meta ICNN model using a ResNet-18 [He et al., 2016] archi-
tecture for coupling image-based measures. We again emphasize that α, β used with the model here
are representations of measures, which in our cases are simply images.
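To make the hypernetwork idea concrete, here is a heavily simplified sketch (our own toy version; the actual Meta ICNN of fig. 3 uses a ResNet-18 encoder and a much larger ICNN): a small encoder maps a flat representation of (α, β) to the parameter vector of a tiny input-convex potential, whose gradient is the predicted map.

```python
import jax
import jax.numpy as jnp

D, H = 3, 8                      # ICNN input dim (e.g. RGB) and hidden width
N_PARAM = H * D + H + H + 1      # W1, b1, w2, quadratic weight

def icnn(phi, x):
    # Tiny input-convex potential psi_phi(x): a nonnegative mixture of convex
    # softplus features plus a convex quadratic term keeps convexity in x.
    W1 = phi[:H * D].reshape(H, D)
    b1 = phi[H * D:H * D + H]
    w2 = jax.nn.softplus(phi[H * D + H:H * D + 2 * H])     # >= 0
    a = jax.nn.softplus(phi[-1])                            # >= 0
    return jnp.dot(w2, jax.nn.softplus(W1 @ x + b1)) + 0.5 * a * jnp.dot(x, x)

def meta_icnn(theta, measures_repr):
    # Hypernetwork: a small MLP maps a representation of (alpha, beta)
    # to the flat parameter vector phi_hat of the ICNN potential.
    (W_a, b_a), (W_b, b_b) = theta
    h = jax.nn.relu(W_a @ measures_repr + b_a)
    return W_b @ h + b_b                                    # shape (N_PARAM,)

def predicted_map(theta, measures_repr, x):
    # Predicted transport map: T_hat(x) = grad_x psi_{phi_hat}(x), as in eq. (11).
    phi_hat = meta_icnn(theta, measures_repr)
    return jax.grad(icnn, argnums=1)(phi_hat, x)

key = jax.random.PRNGKey(0)
R, H_ENC = 32, 64                                           # repr. and encoder sizes
theta = ((jax.random.normal(key, (H_ENC, R)) / jnp.sqrt(R), jnp.zeros(H_ENC)),
         (jax.random.normal(key, (N_PARAM, H_ENC)) / jnp.sqrt(H_ENC), jnp.zeros(N_PARAM)))
x = jnp.array([0.2, 0.5, 0.7])                              # an RGB color in [0, 1]^3
print(predicted_map(theta, jnp.zeros(R), x))                # toy forward pass
```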
Amortization loss. Applying objective-based amortization from eq. (14) to the W2GN loss in
eq. (12) completes our learning setup. Our model should best-optimize the expectation of the loss:
$$\min_\theta\; \mathbb{E}_{(\alpha,\beta)\sim\mathcal{D}}\, \mathcal{L}(\hat\varphi_\theta(\alpha,\beta); \alpha,\beta). \tag{17}$$

As in the discrete setting, it does not require ground-truth solutions ϕ? and we learn it with Adam.

4 Experiments
We demonstrate how Meta OT models improve the convergence of the state-of-the-art solvers in
settings where solving multiple OT problems naturally arises. We implemented our code in JAX
[Bradbury et al., 2018] as an extension to the Optimal Transport Tools (OTT) package [Cuturi
et al., 2022]. All experiments take roughly 2 hours to run on our single Quadro GP100 GPU.
App. C covers further experimental and implementation details. The source code to reproduce all of
our experiments is available at http://github.com/facebookresearch/meta-ot.

[Figure 4 plots: marginal error vs. number of Sinkhorn iterations on MNIST (0 to 25 iterations) and Spherical (0 to 1000 iterations) test instances, comparing Sinkhorn with Meta OT + Sinkhorn.]
Figure 4: Sinkhorn convergence on test instances. Meta OT successfully predicts warm-start initial-
izations that significantly improve the convergence of Sinkhorn iterations.

[Figure 5 panels: Sinkhorn (converged, ground-truth) and Meta OT (initial prediction).]

Figure 5: Test set coupling predictions of the spherical transport problem. Meta OT’s initial pre-
diction is ≈37500 times faster than solving Sinkhorn to optimality. Supply locations are shown as
black dots and the blue lines show the spherical transport maps T going to demand locations at the
end. The sphere is visualized with the Mercator projection.

4.1 Discrete OT between MNIST digits

Images provide a natural setting for Meta OT where the distribution over images provides the meta-
distribution D over OT problems. Given a pair of images α0 and α1, each grayscale image is
cast as a discrete measure in 2-dimensional space where the intensities define the probabilities of
the atoms. The goal is to compute the optimal transport interpolation between the two measures
as in, e.g., Peyré et al. [2019, §7]. Formally, this means computing the optimal coupling P⋆ by
solving the entropic optimal transport problem between α0 and α1 and computing the interpolates
as α_t = (t proj_y + (1 − t) proj_x)_# P⋆ for t ∈ [0, 1], where proj_x(x, y) := x and proj_y(x, y) := y.
We selected ε = 10⁻² as app. A shows that it gives interpolations that are not too blurry or sharp.
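Concretely, the interpolate places mass P⋆_ij at the location (1 − t)x_i + t y_j; a short sketch of this pushforward (the helper name is ours):

```python
import jax.numpy as jnp

def displacement_interpolate(P, x, y, t):
    """Atoms and weights of alpha_t = (t proj_y + (1 - t) proj_x)_# P.

    P: (m, n) coupling, x: (m, 2) source atoms, y: (n, 2) target atoms.
    Returns (m*n, 2) interpolated atom locations and their (m*n,) weights."""
    locations = (1.0 - t) * x[:, None, :] + t * y[None, :, :]   # (m, n, 2)
    return locations.reshape(-1, 2), P.reshape(-1)
```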
Our Meta OT model fˆθ (sect. 3.1) is an MLP that predicts the transport map between pairs of
MNIST digits. We train on every pair from the standard training dataset. Figure 2 shows that even
without fine-tuning, Meta OT’s predicted Wasserstein interpolations between the measures are close
to the ground-truth interpolations obtained from running the Sinkhorn algorithm to convergence.
We then fine-tune Meta OT’s prediction with Sinkhorn as in algorithm 4. Figure 4 shows that the
near-optimal predictions can be quickly refined in fewer iterations than running Sinkhorn with the
default initialization, and table 1 shows the runtime required to reach the default threshold.

[Figure 6 image grid: columns α, β, T#α, T#⁻¹β; rows W2GN (converged, ground-truth) and Meta OT (initial prediction).]

Figure 6: Color transfers with a Meta ICNN on test pairs of images. The objective is to optimally
transport the continuous RGB measure of the first image α to the second β, producing an invertible
transport map T . Meta OT’s prediction is ≈1000 times faster than training W2GN from scratch.
The image generating α is Market in Algiers by August Macke (1914) and β is Argenteuil, The
Seine by Claude Monet (1872), obtained from WikiArt.

4.2 Discrete OT for supply-demand transportation on spherical data

We next set up a synthetic transport problem between supply and demand locations, where the supplies
and demands may change locations or quantities frequently, creating another Meta OT setting in which
new instances must be solved rapidly. We specifically consider measures living on the 2-sphere
defined by S² := {x ∈ R³ : ‖x‖ = 1}, i.e. X = Y = S², with the transport cost given by the
spherical distance c(x, y) = arccos(⟨x, y⟩). We then randomly sample supply locations uniformly
from Earth's landmass and demand locations from Earth's population density, obtained from the
CC-licensed dataset of Doxsey-Whitfield et al. [2015], to induce a class of transport problems on
the sphere. Figure 5 shows that the predicted transport maps on test instances are close to the optimal
maps obtained from Sinkhorn to convergence. Similar to the MNIST setting, fig. 4 and table 1 show
improved convergence and runtime.
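The spherical ground cost is simple to form for all supply-demand pairs; in the sketch below (ours), the clip guards against inner products that round slightly outside [−1, 1].

```python
import jax.numpy as jnp

def spherical_cost_matrix(X, Y):
    # c(x, y) = arccos(<x, y>) for unit vectors X: (m, 3) and Y: (n, 3).
    return jnp.arccos(jnp.clip(X @ Y.T, -1.0, 1.0))   # (m, n) cost on the 2-sphere
```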

4.3 Continuous Wasserstein-2 color transfer

The problem of color transfer between two images consists of mapping the color palette of one
image onto the other one. The images are required to have the same number of channels, for
example RGB images. The continuous formulation that we use from Korotin et al. [2019] operates
on the RGB color space, i.e. X = Y = [0, 1]³, with c being the squared Euclidean distance. We
collected ≈200 public domain images from WikiArt and trained a Meta ICNN model from sect. 3.2
to predict the color transfer maps between every pair of them. Figure 6 shows the predictions on
test pairs and fig. 7 shows the convergence in comparison to the standard W2GN learning. Table 2
reports runtimes and app. D shows additional results.

[Figure 7 plot: normalized dual objective vs. W2GN iterations (0 to 2000), comparing W2GN with Meta OT + W2GN.]

Figure 7: Convergence on color transfer test instances using W2GN. The Meta ICNN predicts
warm-start initializations that significantly improve the (normalized) dual objective values.
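Once a potential is predicted for a pair of images, color transfer amounts to pushing every pixel's RGB value through T = ∇ψ; the sketch below (ours) assumes some convex potential psi(phi, x) over RGB space, and the clipping back to [0, 1] is our own addition for display.

```python
import jax
import jax.numpy as jnp

def transfer_colors(psi, phi, image):
    """Apply the color transfer map T(x) = grad_x psi_phi(x) to every pixel
    of a (height, width, 3) image with RGB values in [0, 1]."""
    T = jax.vmap(lambda x: jax.grad(psi, argnums=1)(phi, x))
    pixels = image.reshape(-1, 3)
    return jnp.clip(T(pixels), 0.0, 1.0).reshape(image.shape)
```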

5 Related work

Efficiently estimating OT maps. To compute OT maps with fixed cost between pairs of measures
efficiently, neural OT models [Korotin et al., 2019, Li et al., 2020, Korotin et al., 2021a, Mokrov
et al., 2021, Korotin et al., 2021b] leverage ICNNs to estimate maps between continuous high-
dimensional measures given samples from these, and Litvinenko et al. [2021], Scetbon et al. [2021a],
Forrow et al. [2019], Sommerfeld et al. [2019], Scetbon et al. [2021b], Muzellec and Cuturi [2019],
Bonet et al. [2021] leverage structural assumptions on coupling and cost matrices to reduce the
computational and memory complexity. In the Meta OT setting, we consider learning to rapidly
compute OT mappings between new pairs of measures. All of these works can hence potentially benefit
from a similar acceleration by leveraging amortization.
Embedding measures where OT distances are discriminative. Effort has been invested in learn-
ing encodings/projections of measures through a nested optimization problem, which aims to find
discriminative embeddings of the measures to be compared [Genevay et al., 2018, Deshpande et al.,
2019, Nguyen and Ho, 2022]. While these works share an encoder and/or a projection across tasks
with the aim of leveraging more discriminative alignments (and hence an OT distance with a metric
different from the Euclidean metric), our work differs in the sense that we find good initializations
to solve the OT problem itself with fixed cost more efficiently across tasks.
Optimal transport and amortization. Few previous works in the OT literature leverage amorti-
zation. Courty et al. [2017a] learn a latent space in which the Wasserstein distance between the
measure’s embeddings is equivalent to the Euclidean distance. Concurrent work [Nguyen and Ho,
2022] amortizes the estimation of the optimal projection in the max-sliced objective, which differs
from our work where we instead amortize the estimation of the optimal coupling directly. Also,
Lacombe et al. [2021] learns to predict Wasserstein barycenters of pixel images by training a con-
volutional network that, given images as input, outputs their barycenter. Our work is hence a
generalization of this pixel-based work to general measures, both discrete and continuous. A limi-
tation of Lacombe et al. [2021] is that it does not provide alignments, as the amortization network
predicts the barycenter directly rather than individual couplings.

6 Conclusions, future directions, and limitations

We have presented foundations for modeling and learning to solve OT problems with Meta OT by
using amortized optimization to predict optimal transport plans. This works best in applications that
require solving multiple OT problems with shared structure. We instantiated it to speed up entropic
regularized optimal transport and unregularized optimal transport with squared cost by multiple
orders of magnitude. We envision extensions of the work in: 1) Meta OT models. While we mostly
consider models based on hypernetworks, other meta-learning paradigms could also be plugged in; and 2)
OT algorithms. While we instantiated models on top of log-Sinkhorn and W2GN, Meta OT could
be integrated with most other methods too. 3) OT applications that are computationally expensive
and repeatedly solved, e.g. in multi-marginal and barycentric settings, or for Gromov-Wasserstein
distances between metric-measure spaces.
Limitations. While we have illustrated successful applications of Meta OT, it is also important to
understand the limitations: 1) Meta OT does not make previously intractable problems tractable.
All of the baseline OT solvers we consider solve our problems within milliseconds or seconds. 2)
Out-of-distribution generalization. Meta OT may not generate good predictions on instances that
are not close to the training OT problems from the meta-distribution D over the measures and cost.
If the model makes a bad prediction, one fallback option is to re-solve the instance from scratch.

Acknowledgments

We would like to thank Mark Tygert, Maximilian Nickel, and Eugene Vinitsky for insightful com-
ments and discussions. The core set of tools in Python [Van Rossum and Drake Jr, 1995, Oliphant,
2007] enabled this work, including Hydra [Yadan, 2019], JAX [Bradbury et al., 2018], Matplotlib
[Hunter, 2007], numpy [Oliphant, 2006, Van Der Walt et al., 2011], Optimal Transport Tools [Cuturi
et al., 2022], and pandas [McKinney, 2012].

References
Mokhtar Z Alaya, Maxime Berar, Gilles Gasso, and Alain Rakotomamonjy. Screening sinkhorn algorithm for
regularized optimal transport. Advances in Neural Information Processing Systems, 32, 2019.
Jason Altschuler, Jonathan Niles-Weed, and Philippe Rigollet. Near-linear time approximation algorithms for
optimal transport via sinkhorn iteration. Advances in neural information processing systems, 30, 2017.
Luigi Ambrosio. Lecture notes on optimal transport problems. In Mathematical aspects of evolving interfaces,
pages 1–52. Springer, 2003.
Brandon Amos. Tutorial on amortized optimization for learning to optimize over continuous domains. arXiv
preprint arXiv:2202.00665, 2022.
Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In International Conference on
Machine Learning, pages 146–155. PMLR, 2017.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Inter-
national conference on machine learning, pages 214–223. PMLR, 2017.
Clément Bonet, Titouan Vayer, Nicolas Courty, François Septier, and Lucas Drumetz. Subspace detours meet
gromov–wasserstein. Algorithms, 14(12):366, 2021.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George
Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable trans-
formations of Python+NumPy programs. GitHub, 2018. URL http://github.com/google/jax.
Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications
on pure and applied mathematics, 44(4):375–417, 1991.
Rick Chartrand, Brendt Wohlberg, Kevin Vixie, and Erik Bollt. A gradient descent solution to the monge-
kantorovich problem. Applied Mathematical Sciences, 3(22):1071–1080, 2009.
Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin.
Learning to optimize: A primer and a benchmark. arXiv preprint arXiv:2103.12828, 2021.
Samuel Cohen, Brandon Amos, and Yaron Lipman. Riemannian convex potential maps. In International
Conference on Machine Learning, pages 2028–2038. PMLR, 2021.
Roberto Cominetti and J San Martín. Asymptotic analysis of the exponential penalty trajectory in linear pro-
gramming. Mathematical Programming, 67(1):169–187, 1994.
Nicolas Courty, Rémi Flamary, and Mélanie Ducoffe. Learning wasserstein embeddings. arXiv preprint
arXiv:1710.07457, 2017a.
Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal trans-
portation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017b.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural informa-
tion processing systems, 26:2292–2300, 2013.
Marco Cuturi, Laetitia Meng-Papaxanthos, Yingtao Tian, Charlotte Bunne, Geoff Davis, and Olivier Teboul.
Optimal transport tools (ott): A jax toolbox for all things wasserstein. arXiv preprint arXiv:2201.12324,
2022.
Robert Dadashi, Léonard Hussenot, Matthieu Geist, and Olivier Pietquin. Primal wasserstein imitation learning.
arXiv preprint arXiv:2006.04678, 2020.
Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David
Forsyth, and Alexander G Schwing. Max-sliced wasserstein distance and its use for gans. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10648–10656, 2019.
Erin Doxsey-Whitfield, Kytt MacManus, Susana B Adamo, Linda Pistolesi, John Squires, Olena Borkovska,
and Sandra R Baptista. Taking advantage of the improved availability of census data: a first look at the
gridded population of the world, version 4. Papers in Applied Geography, 1(3):226–234, 2015.
Werner Fenchel. On conjugate convex functions. Canadian Journal of Mathematics, 1(1):73–77, 1949.
Arnaud Fickinger, Samuel Cohen, Stuart Russell, and Brandon Amos. Cross-domain imitation learning via
optimal transport. arXiv preprint arXiv:2110.03684, 2021.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep
networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference
on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR,
06–11 Aug 2017. URL https://proceedings.mlr.press/v70/finn17a.html.

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aurélie Boisbunon, Stanislas Chambon,
Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, et al. Pot: Python optimal transport.
Journal of Machine Learning Research, 22(78):1–8, 2021.

Aden Forrow, Jan-Christian Hütter, Mor Nitzan, Philippe Rigollet, Geoffrey Schiebinger, and Jonathan Weed.
Statistical optimal transport via factored couplings. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 2454–2465. PMLR, 2019.

Alfred Galichon. Optimal transport methods in economics. In Optimal Transport Methods in Economics.
Princeton University Press, 2016.

Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan,
Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Con-
ference on Machine Learning, pages 1704–1713. PMLR, 2018a.

Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye
Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.

Aude Genevay, Gabriel Peyre, and Marco Cuturi. Learning generative models with sinkhorn divergences. In
Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference
on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages
1608–1617. PMLR, 09–11 Apr 2018. URL https://proceedings.mlr.press/v84/genevay18a.
html.

David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In
European conference on computer vision, pages 630–645. Springer, 2016.

Chin-Wei Huang, Ricky TQ Chen, Christos Tsirigotis, and Aaron Courville. Convex potential flows: Universal
probability distributions with optimal transport and convex optimization. arXiv preprint arXiv:2012.05942,
2020.

John D Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):90, 2007.

Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classification.
In Uncertainty in Artificial Intelligence, pages 862–872. PMLR, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Style transfer by relaxed optimal transport and
self-similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 10051–10060, 2019.

Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced wasserstein kernels for probability distributions. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5258–5267, 2016.

Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal mass transport:
Signal processing and machine-learning applications. IEEE signal processing magazine, 34(4):43–59, 2017.

Soheil Kolouri, Phillip E Pope, Charles E Martin, and Gustavo K Rohde. Sliced wasserstein auto-encoders. In
International Conference on Learning Representations, 2018.

Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo K Rohde. Generalized sliced
wasserstein distances. arXiv preprint arXiv:1902.00434, 2019.

Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein-2
generative networks. arXiv preprint arXiv:1909.13082, 2019.

Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, Alexander Filippov, and Evgeny Bur-
naev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. arXiv preprint
arXiv:2106.01954, 2021a.

Alexander Korotin, Lingxiao Li, Justin Solomon, and Evgeny Burnaev. Continuous wasserstein-2 barycenter
estimation without minimax optimization. arXiv preprint arXiv:2102.01752, 2021b.

Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural optimal transport. arXiv preprint
arXiv:2201.12220, 2022.

Julien Lacombe, Julie Digne, Nicolas Courty, and Nicolas Bonneel. Learning to generate wasserstein barycen-
ters, 2021. URL https://openreview.net/forum?id=2ioNazs6lvw.

Lingxiao Li, Aude Genevay, Mikhail Yurochkin, and Justin Solomon. Continuous regularized wasserstein
barycenters. arXiv preprint arXiv:2008.12534, 2020.

Tianyi Lin, Nhat Ho, and Michael I Jordan. On the acceleration of the sinkhorn and greenkhorn algorithms for
optimal transport. arXiv preprint arXiv:1906.01437, 2019.

Alexander Litvinenko, Youssef Marzouk, Hermann G Matthies, Marco Scavino, and Alessio Spantini. Com-
puting f-divergences and distances of high-dimensional probability density functions–low-rank tensor ap-
proximations. arXiv preprint arXiv:2111.07164, 2021.

Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input
convex neural networks. In International Conference on Machine Learning, pages 6672–6681. PMLR, 2020.

Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. " O’Reilly
Media, Inc.", 2012.

Quentin Merigot and Boris Thibert. Optimal transport: discretization and algorithms. In Handbook of Numer-
ical Analysis, volume 22, pages 133–212. Elsevier, 2021.

Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, and Evgeny Burnaev. Large-
scale wasserstein gradient flows. arXiv preprint arXiv:2106.00736, 2021.

Boris Muzellec and Marco Cuturi. Subspace detours: Building transport plans that are optimal on subspace
projections. arXiv preprint arXiv:1905.10099, 2019.

Khai Nguyen and Nhat Ho. Amortized projection optimization for sliced wasserstein generative models. arXiv
preprint arXiv:2203.13417, 2022.

Travis E Oliphant. A guide to NumPy, volume 1. Trelgol Publishing USA, 2006.

Travis E Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3):10–20, 2007.

Michaël Perrot, Nicolas Courty, Rémi Flamary, and Amaury Habrard. Mapping estimation for discrete optimal
transport. Advances in Neural Information Processing Systems, 29, 2016.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foun-
dations and Trends® in Machine Learning, 11(5-6):355–607, 2019.

Ievgen Redko, Nicolas Courty, Rémi Flamary, and Devis Tuia. Optimal transport for multi-source domain
adaptation under target shift. In The 22nd International Conference on Artificial Intelligence and Statistics,
pages 849–858. PMLR, 2019.

Ralph Tyrell Rockafellar. Convex analysis. In Convex analysis. Princeton university press, 2015.

Litu Rout, Alexander Korotin, and Evgeny Burnaev. Generative modeling with optimal transport maps. arXiv
preprint arXiv:2110.02999, 2021.

Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia
Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.

Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015.

Meyer Scetbon, Marco Cuturi, and Gabriel Peyré. Low-rank sinkhorn factorization. arXiv preprint
arXiv:2103.04737, 2021a.

Meyer Scetbon, Gabriel Peyré, and Marco Cuturi. Linear-time gromov wasserstein distances using low rank
couplings and costs. arXiv preprint arXiv:2106.01128, 2021b.

Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua
Gould, Siyan Liu, Stacie Lin, Peter Berube, Lia Lee, Jenny Chen, Justin Brumbaugh, Philippe Rigol-
let, Konrad Hochedlinger, Rudolf Jaenisch, Aviv Regev, and Eric S. Lander. Optimal-transport analy-
sis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):
928–943.e22, 2019. ISSN 0092-8674. doi: https://doi.org/10.1016/j.cell.2019.01.006. URL https:
//www.sciencedirect.com/science/article/pii/S009286741930039X.

Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blon-
del. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.

Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du,
and Leonidas Guibas. Convolutional wasserstein distances: Efficient optimal transportation on geometric
domains. ACM Transactions on Graphics (TOG), 34(4):1–11, 2015.

Max Sommerfeld, Jörn Schrieber, Yoav Zemel, and Axel Munk. Optimal transport: Fast probabilistic approxi-
mation with exact solvers. J. Mach. Learn. Res., 20:105–1, 2019.
Kenneth O Stanley, David B D’Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-
scale neural networks. Artificial life, 15(2):185–212, 2009.

Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex potentials with
application to improved training for gans. arXiv preprint arXiv:1902.07197, 2019.

Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P Srinivasan, Jonathan T Barron, and
Ren Ng. Learned initializations for optimizing coordinate-based neural representations. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2846–2855, 2021.
Alexis Thibault, Lenaic Chizat, Charles Dossal, and Nicolas Papadakis. Overrelaxed sinkhorn-knopp algorithm
for regularized optimal transport. arXiv preprint arXiv:1711.01851, 2017.

Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical
computation. Computing in Science & Engineering, 13(2):22, 2011.

Guido Van Rossum and Fred L Drake Jr. Python reference manual. Centrum voor Wiskunde en Informatica
Amsterdam, 1995.
Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.

Omry Yadan. Hydra - a framework for elegantly configuring complex applications. Github, 2019. URL
https://github.com/facebookresearch/hydra.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using
cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer
vision, pages 2223–2232, 2017.

Luisa Zintgraf, Kyriacos Shiarli, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation
via meta-learning. In International Conference on Machine Learning, pages 7693–7702. PMLR, 2019.

A Selecting ε for MNIST

[Figure 8 panels: MNIST couplings for ε = 0.001, 0.005, 0.01, 0.05, and 0.1.]

Figure 8: We selected ε = 10⁻² for our MNIST coupling experiments as it results in transport maps
that are not too blurry or sharp.

B Other models for continuous OT


While developing the hyper-network or Meta ICNN in sect. 3.2 for predicting couplings between
continuous measures, we considered alternative modeling formulations briefly documented in this
section. We finalized only the hyper-network model because it is conceptually the most similar to
predicting the optimal dual variables in the continuous setting and results in rapid predictions.

B.1 Optimization-based meta-learning (MAML-inspired)

The model-agnostic meta-learning setup proposed in MAML [Finn et al., 2017] could also be ap-
plied in the Meta OT setting to learn an adaptable initial parameterization. In the continuous setting,
one initial version would take a parameterized dual potential model ψϕ (x) and seek to learn an ini-
tial parameterization ϕ0 so that optimizing a loss such as the W2GN loss L from eq. (12) results in
a minimal L(ϕK ) after adapting the model for K steps. Formally, this would optimize:
arg min L(ϕK ) where ϕt+1 = ϕt − ∇ϕ L(ϕt ) (18)
ϕ0

Tancik et al. [2021] explores similar learned initializations for coordinate-based neural implicit rep-
resentations for 2D images, CT scan reconstruction, and 3d shape and scene recovery from 2D
observations.
Challenges for Meta OT. The transport maps given by T = ∇ψ can significantly vary depending on
the input measures α, β. We found it difficult to learn an initialization that can be rapidly adapted,
and optimizing eq. (18) is more computationally expensive than eq. (17) as it requires unrolling
through many evaluations of the transport loss L. In contrast, we found that learning to predict
the optimal parameters with eq. (17), conditional on the input measures, and then fine-tuning with
W2GN was stable.
Advantages for Meta OT. Exploring MAML-inspired methods could further incorporate the knowl-
edge that the model’s prediction is going to be fine-tuned into the learning process. One promising

direction we did not try could be to integrate some of the ideas from LEO [Rusu et al., 2018] and
CAVIA [Zintgraf et al., 2019], which propose to learn a latent space for the parameters where the
initialization is also conditional on the input.

B.2 Neural process

The (conditional) neural process models considered in Garnelo et al. [2018b,a] can also be adapted
for the Meta OT setting. In the continuous setting, this would result in a dual potential that is also
conditioned on a representation of the input measures, e.g. ψ_ϕ(x; z) where z := f_ϕ^emb(α, β) is an
embedding of the input measures that is learned jointly with the parameters of ψ. This could be
formulated as

$$\arg\min_{\varphi}\; \mathbb{E}_{(\alpha,\beta)\sim\mathcal{D}}\, \mathcal{L}(\varphi, f^{\mathrm{emb}}_\varphi(\alpha,\beta)), \tag{19}$$

where L modifies the model used in the loss eq. (12) to also be conditioned on the context extracted
from the measures.
Challenges for Meta OT. This raises the issue on best-formulating the model to be conditional on
the context. One way could be to append z to the input point x in the domain, but if ψ is an input-
convex neural network, then the model would only need to be convex with respect to x and not z.
Advantages for Meta OT. A large advantage is that the representation z of the measures α, β would
be significantly lower-dimensional than the parameters ϕ that our Meta OT models are predicting.

C Additional experimental and implementation details


C.1 Hyper-parameters

We briefly summarize the hyper-parameters we used for training, which we did not extensively tune.
In the discrete setting, we use the same hyper-parameters for the MNIST and spherical settings.

Table 3: Discrete OT hyper-parameters.

Name Value
Batch size 128
Number of training iterations 50000
MLP Hidden Sizes [1024, 1024, 1024]
Adam learning rate 1e-3

Table 4: Continuous OT hyper-parameters.

Name Value
Meta batch size (for α, β) 8
Inner batch size (to estimate L) 1024
Cycle loss weight (γ) 3.
Adam learning rate 1e-3
ℓ2 weight penalty 1e-6
Max grad norm (for clipping) 1.
Number of training iterations 200000
Meta ICNN Encoder ResNet18
Encoder output size (both measures) 256×2
Meta ICNN Decoder Hidden Sizes [512]

D Additional color transfer results
We next show additional color transfer results from the experiments in sect. 4.3 on the following
public domain images from WikiArt:

• Distant View of the Pyramids by Winston Churchill (1921)


• Charing Cross Bridge, Overcast Weather by Claude Monet (1900)
• Houses of Parliament by Claude Monet (1904)
• October Sundown, Newport by Childe Hassam (1901)
• Landscape with House at Ceret by Juan Gris (1913)
• Irises in Monet’s Garden by Claude Monet (1900)
• Crystal Gradation by Paul Klee (1921)
• Senecio by Paul Klee (1922)
• Váza s květinami by Josef Capek (1914)
• Sower with Setting Sun by Vincent van Gogh (1888)
• Three Trees in Grey Weather by Claude Monet (1891)
• Vase with Daisies and Anemones by Vincent van Gogh (1887)

[Figure 9 image grid: columns α, β, T#α, T#⁻¹β.]

Figure 9: Meta ICNN (initial prediction). The sources are given in the beginning of app. D.

[Figure 10 image grid: columns α, β, T#α, T#⁻¹β.]

Figure 10: Meta ICNN + W2GN fine-tuning. The sources are given in the beginning of app. D.

[Figure 11 image grid: columns α, β, T#α, T#⁻¹β.]

Figure 11: W2GN (final). The sources are given in the beginning of app. D.
