Dear Arthur,
Thank you for the detailed reading of the paper, and for elaborating on connections to the MMD literature that we hadn’t made. We discuss some of these connections in Appendix D, but we’ll make sure to highlight the most important ones in the main text. Thanks also for pointing out that one of our proofs was already given by Szekely and Rizzo; we’ll certainly cite them instead. Could you give us the exact reference? It doesn’t seem to be in the 2004 paper you cite, nor in their “Hierarchical Clustering” paper.
Before I address the main points, one technical remark about integral probability metrics: they need not involve smooth functions (the Kolmogorov distance, for example, corresponds to a class of indicator functions), nor need they even yield a witness function (e.g. if the function space is open). As a result, I’d be wary of treating all IPMs as “distances we can optimize with GANs”.
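For concreteness, and in my own notation rather than the paper’s: an IPM over a function class F is

    d_F(P, Q) = \sup_{f \in F} | E_{X \sim P}[f(X)] - E_{Y \sim Q}[f(Y)] |,

and the Kolmogorov distance is the special case F = { 1_{(-\infty, t]} : t \in R }, i.e. the sup-norm distance between CDFs. None of these indicator functions is smooth, and in general the supremum need not be attained by any member of the class.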
It’s certainly very interesting to show that our results generalize to RKHSs and their corresponding MMDs, but it’s somewhat orthogonal to the paper’s aim, which is to show that the Wasserstein metric is flawed in a way the Cramer distance is not. The theory is developed for the 1D case, so we naturally compare to the more intuitive Cramer distance, which is just the L2 distance between CDFs. The sum invariance and scale sensitivity (not scale invariance) properties are about comparing the KL, Wasserstein, and Cramer distances, rather than about training GANs.
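As a quick reminder of the 1D picture (standard identities, written here in my own notation): with F_P and F_Q the CDFs of P and Q,

    W_1(P, Q)     = \int | F_P(x) - F_Q(x) | dx,
    l_2^2(P, Q)   = \int ( F_P(x) - F_Q(x) )^2 dx,

so the two quantities differ only in whether the CDF gap is measured in L1 or in (squared) L2.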
Which brings us to the question of what’s in a name: we called our algorithm Cramer GAN so as not to confuse it with the energy-based GAN (https://arxiv.org/abs/1609.03126). It’s unfortunate that this loses a little technical accuracy, but we believe, as always, that a memorable name is more useful to the community than an exact (but potentially confusing) one. Contrast this with the common practice in some fields (sketching algorithms, I’m looking at you!) of using the authors’ initials: do we really want the BDDMLHM GAN?
Now to address the question, “is the critic incorrect?”. We use the approximate critic only when we don’t have access to two independent real samples; this point was unfortunately relegated to the appendix (C.3). There are many situations where we can draw one, but not two, samples from the reference P, i.e. we have an X but not an X’. In our case, we learn conditional models of faces, which means we only have one right half for each left half we’re conditioning on. But let me add that this approximation is only used for training the critic; the generator is still trained to minimize the energy distance of the transformed samples (L_g in Algorithm 1).
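To make the generator side concrete, here is a minimal numpy sketch of a sample-based energy distance estimator; it is not the exact L_g of Algorithm 1 (which operates on critic-transformed samples in the conditional setup), just the quantity being minimized in spirit:

    import numpy as np

    def pairwise_dists(a, b):
        # Euclidean distance between every row of a and every row of b.
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(-1))

    def energy_distance(x, y):
        """Biased (V-statistic) estimate of the energy distance
        D(P, Q) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||
        from samples x ~ P and y ~ Q, each of shape (n, d)."""
        d_xy = pairwise_dists(x, y).mean()
        d_xx = pairwise_dists(x, x).mean()
        d_yy = pairwise_dists(y, y).mean()
        return 2.0 * d_xy - d_xx - d_yy

Note that the E||X - X’|| term does not depend on the generator, so dropping it changes nothing for the generator update; it is only the critic that feels the lack of a second independent real sample.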
It seems plausible, though not obvious, that one could avoid the approximate critic and use something closer in spirit to the energy distance / MMD, even for conditional GANs. Open problem? :)