
Improving Generative Adversarial Nets with the Wasserstein distance

 

In this blogpost, we explore how Wasserstein Generative Adversarial Nets (WGAN) improve upon the minimax game / objective of Generative Adversarial Nets (GAN) to stabilize training and make the value of the game correlate better with the performance of the generator. We first derive the divergence between the real data distribution and the generated one that GANs minimize. Then, we discuss how this divergence is sub-optimal for the optimization of neural networks and introduce the Wasserstein distance, proving that it has better properties w.r.t. neural net optimization. Thereafter, we prove that the Wasserstein distance, although intractable, can be approximated and indeed back-propagated to the generator.

We assume the reader is familiar with basic principles of machine learning, such as neural networks and gradient descent, with simple probabilistic concepts such as probability density functions, with the basic formulation of GANs, and with Lipschitz functions.

 

GANs minimize the Jensen-Shannon divergence

 

Classical approaches to learning a probability distribution assume a parametric family of probability distributions \( \{p_{\theta}\}_{\theta \in \mathbb{R}^d} \) and learn the parameters by maximizing the likelihood of the data:

\[ \max_{\theta\in\mathbb{R}^d}\frac{1}{m}\sum_i \log{p_{\theta}(x^{(i)})} \]

which can easily be proven to be equal to minimizing the Kullback-Leibler (KL) divergence, \( D_{KL}(\mathbb{P}_r \| \mathbb{P}_{\theta}) = \mathop{\mathbb{E}}_{x \sim \mathbb{P}_r} [\log\frac{\mathbb{P}_r(x)}{\mathbb{P}_{\theta}(x)}] \):

\[ \begin{equation} \begin{split}\min_{\theta} D_{KL}(\mathbb{P}_r \| \mathbb{P}_{\theta}) & = \min_{\theta} \mathop{\mathbb{E}}_{x\sim \mathbb{P}_r} \left[\log{\mathbb{P}_r(x)} - \log{\mathbb{P}_{\theta}(x)}\right] \\ & = \min_{\theta} \mathop{\mathbb{E}}_{x\sim \mathbb{P}_r} \left[- \log{\mathbb{P}_{\theta}(x)}\right] \quad \text{(the first term is independent of } \theta\text{)} \\ &= \max_{\theta} \mathop{\mathbb{E}}_{x\sim \mathbb{P}_r} \left[\log{\mathbb{P}_{\theta}(x)}\right] \\ & \approx \max_{\theta} \frac{1}{m} \sum_i \log{\mathbb{P}_{\theta}(x^{(i)})} \quad \text{(law of large numbers, } m\rightarrow+\infty\text{)},   \end{split}\end{equation}  \] 

which lets us discuss maximizing the likelihood in terms of divergences between distributions. When learning with the KL divergence, we have to keep four things in mind. First, as its definition shows, the KL divergence heavily punishes "mode dropping", i.e. the model assigning little to no probability to regions where the real distribution has "mass". In fact, if the model assigns zero mass where the real distribution is non-zero, the divergence becomes infinite. On the other hand, assigning "mass" to fake samples is not heavily punished, because the integrand of the KL divergence vanishes in regions where the real distribution has no mass. That is important in light of the fact that the KL divergence is non-negative (we prove that in [1]). Thirdly, notice that the KL divergence is not symmetric. Finally, the KL divergence has the nice property of a single minimum, attained where the distributions coincide. A minimal numeric illustration of these properties follows below.
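To make the first three points concrete, here is a small numeric sketch with toy discrete distributions; the distributions and the `kl` helper are ours, chosen purely for illustration.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions; terms where p is 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    with np.errstate(divide="ignore"):  # q = 0 on the support of p gives inf, on purpose
        return float(np.sum(p[support] * np.log(p[support] / q[support])))

p_r    = np.array([0.5, 0.5, 0.0])   # "real" distribution with two modes
p_drop = np.array([1.0, 0.0, 0.0])   # model that drops the second mode
p_fake = np.array([0.4, 0.4, 0.2])   # model that also puts mass on a "fake" region

print(kl(p_r, p_drop))                     # inf   -> dropping a real mode is punished infinitely
print(kl(p_r, p_fake))                     # ~0.22 -> mass on fake samples is punished only mildly
print(kl(p_r, p_fake) == kl(p_fake, p_r))  # False -> the KL divergence is not symmetric
```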

So, how do GANs handle learning a probability distribution? The intuitive formulation, presented in the original paper of Goodfellow et al. (2014), is that we use a neural net that allows sampling from the distribution based on a helper random variable, and train it in a minimax game against another helper neural net, the discriminator. This can be formulated as:

\[ \begin{equation} \begin{split} G, D = \text{arg}\min_{G}\max_{D} \mathop{\mathbb{E}}_{x\sim\mathbb{P}_r} \log{D(x)} + \mathop{\mathbb{E}}_{x\sim\mathbb{P}_{\theta}} \log{(1 - D(x))} \end{split} \end{equation} \]

where G is the generator and D the discriminator. So, is that all, just alternating training steps between two neural nets? Not quite. As Goodfellow et al. (2014) showed, the optimal discriminator for a fixed generator is

\[ D^*(x; G) = \frac{\mathbb{P}_r(x)}{\mathbb{P}_r(x) + \mathbb{P}_{\theta}(x)} \]

which can easily be proven using the objective of the minimax game:

\[ \begin{equation} \begin{split} V(G, D) & = \mathop{\mathbb{E}}_{x\sim\mathbb{P}_r} \log{D(x)} + \mathop{\mathbb{E}}_{x\sim\mathbb{P}_{\theta}} \log{(1 - D(x))} \\ & = \int_x \left[p_r(x) \log D(x) + p_{\theta}(x) \log{(1-D(x))}\right] dx \end{split} \end{equation} \]

and noting that, for \( a, b > 0 \), the function \( v(y) = a \log{y} + b \log{(1-y)} \) is maximized w.r.t. \( y \) at \( y = \frac{a}{a+b} \) (setting \( v'(y) = \frac{a}{y} - \frac{b}{1-y} = 0 \)), we get \( D^*(x; G) \). Plugging this into \(V\), we get:

\[ \begin{equation}\begin{split} V'(G) & = \max_D V(G, D) \\ & = \mathop{\mathbb{E}}_{x\sim \mathbb{P}_r} \log{D^*(x; G)} + \mathop{\mathbb{E}}_{x\sim \mathbb{P}_{\theta}}\log{(1 - D^*(x; G))}\\ & = \mathop{\mathbb{E}}_{x\sim \mathbb{P}_r} \log{\frac{\mathbb{P}_r(x)}{\mathbb{P}_r(x) + \mathbb{P}_{\theta}(x)}} + \mathop{\mathbb{E}}_{x\sim \mathbb{P}_{\theta}}\log{\frac{\mathbb{P}_{\theta}(x)}{\mathbb{P}_r(x) + \mathbb{P}_{\theta}(x)}} \\ & = D_{KL}\left(\mathbb{P}_r \,\middle\|\, \frac{\mathbb{P}_r + \mathbb{P}_{\theta}}{2}\right) + D_{KL}\left(\mathbb{P}_{\theta} \,\middle\|\, \frac{\mathbb{P}_r + \mathbb{P}_{\theta}}{2}\right) - \log{4} \\ & = - \log{4} + 2 \cdot JS(\mathbb{P}_r \| \mathbb{P}_{\theta}), \end{split}\end{equation} \]

where JS is the Jensen-Shannon divergence. Therefore, we see that we learn the generator by minimizing the JS divergence between the real distribution and the generated one (approximately, depending on how close the discriminator is to optimal). In contrast to the KL divergence, the JS divergence is symmetric. Upon closer inspection, we also see that it serves as a middle ground between punishing fake samples and punishing mode dropping, because it combines two KL divergences, with each of the two distributions appearing as the first argument in one of them; a small numeric check follows below.
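Reusing the toy distributions from the earlier sketch, we can check these claims with scipy's `jensenshannon`, which returns the square root of the JS divergence (with the natural logarithm by default):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p_r    = np.array([0.5, 0.5, 0.0])   # same toy distributions as above
p_drop = np.array([1.0, 0.0, 0.0])
p_fake = np.array([0.4, 0.4, 0.2])

# jensenshannon returns the JS *distance* (square root of the divergence), so we square it.
print(jensenshannon(p_r, p_drop) ** 2)   # ~0.22 -> finite, unlike D_KL(p_r || p_drop)
print(jensenshannon(p_r, p_fake) ** 2)   # ~0.07 -> mass on fake samples is also punished
print(jensenshannon(p_fake, p_r) ** 2)   # same value -> the JS divergence is symmetric
```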

It is also important to note that in the inaugural GAN paper [2], Goodfellow et al. already propose updating the discriminator multiple times before each generator update, which means that the optimality assumption for the JS divergence is not far from the truth, especially as training proceeds. 
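To make the alternating scheme concrete, here is a schematic PyTorch sketch of one training round. It assumes user-defined networks `G` and `D` (with `D` ending in a sigmoid so that it outputs probabilities), their optimizers `opt_g` and `opt_d`, and a batch of real samples `real`; it is a minimal sketch of the minimax objective above, not a reproduction of any reference implementation.

```python
import torch

def gan_step(G, D, opt_g, opt_d, real, z_dim=100, n_disc=1):
    """One round of the minimax game: n_disc discriminator updates, then one generator update."""
    for _ in range(n_disc):
        z = torch.randn(real.size(0), z_dim, device=real.device)
        fake = G(z).detach()                       # block gradients into the generator
        # The discriminator ascends E[log D(x)] + E[log(1 - D(g(z)))]; we minimize the negative.
        loss_d = -(torch.log(D(real)).mean() + torch.log(1 - D(fake)).mean())
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    z = torch.randn(real.size(0), z_dim, device=real.device)
    # The generator descends E[log(1 - D(g(z)))], the inner value of the minimax game.
    loss_g = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```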


Wasserstein distance


We saw that GAN training can be thought of as minimizing the JS divergence between the real distribution and the distribution of the generator. Questions naturally arise: what other divergences can we minimize? What are the advantages and disadvantages of using a particular divergence? To begin with, the most fundamental difference between divergences is their impact on the convergence of sequences of probability distributions (e.g. the sequence induced by the generator's parameters during training). A sequence of distributions converges iff there exists \( \mathbb{P}_{\infty} \) such that \( \rho (\mathbb{P}_t, \mathbb{P}_{\infty}) \xrightarrow{t\rightarrow\infty} 0 \), where \( \rho \) is the divergence of choice; whether such a \( \mathbb{P}_{\infty} \) exists depends on the choice of \( \rho \). We say that \( \rho \) induces a weaker topology when it is "easier" for sequences to converge under it.

Essentially, we would like the mapping \( \theta \mapsto \mathbb{P}_{\theta} \) to be continuous, and consequently the loss \( \theta \mapsto \rho( \mathbb{P}_r, \mathbb{P}_{\theta}) \) to be continuous as well, so that \( \mathbb{P}_{\theta} \) converges when \( \theta \) converges. This is what allows training \( \theta \) with gradient-based methods.

As it happens, the JS divergence is not ideal for such a setting: the mapping \( \theta \mapsto JS(\mathbb{P}_r \| \mathbb{P}_{\theta}) \) is not continuous in general, and therefore not differentiable. In contrast, another divergence, called the Earth-Mover (EM) or Wasserstein distance, does have these properties. It is defined as:

\[ W(\mathbb{P}_r, \mathbb{P}_g) = \mathop{\text{inf}}_{\gamma \in \Pi (\mathbb{P}_r, \mathbb{P}_g)} \mathop{\mathbb{E}}_{(x,y)\sim\gamma}[\|x-y\|], \]

where \( \Pi (\mathbb{P}_r, \mathbb{P}_g) \) denotes the set of all joint distributions \( \gamma(x, y) \) whose marginals are \( \mathbb{P}_r \) and \( \mathbb{P}_g \), respectively. Intuitively, the Wasserstein distance is the minimum cost of transporting probability mass from \( x \) to \( y \) in order to transform one distribution into the other, hence the name Earth-Mover distance; a small numeric sanity check follows below.
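Although the infimum is intractable in general (more on this below), in one dimension the optimal transport cost between empirical distributions has a simple closed form, which scipy implements. The following toy check, with Gaussian samples chosen purely for illustration, shows how the distance shrinks smoothly as the generated samples approach the real ones:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real      = rng.normal(loc=0.0, scale=1.0, size=10_000)  # "real" samples
gen_far   = rng.normal(loc=3.0, scale=1.0, size=10_000)  # generator far from the data
gen_close = rng.normal(loc=0.5, scale=1.0, size=10_000)  # generator closer to the data

# In one dimension the optimal transport problem has a closed form, which scipy implements.
print(wasserstein_distance(real, gen_far))    # ~3.0: a lot of mass has to travel far
print(wasserstein_distance(real, gen_close))  # ~0.5: the distance shrinks smoothly
```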

We now prove that the Wasserstein distance indeed has the continuity properties we are after. Before doing that, we briefly state the following regularity assumption: if the function \( g_{\theta}(z) \), parameterized by \( \theta \), is locally Lipschitz between finite dimensional vector spaces, we say that it satisfies the regularity assumption for a probability distribution \( p \) over the input random variable if there are local Lipschitz constants \( L(\theta, z) \) such that \( \mathop{\mathbb{E}}_{z\sim p} [L(\theta, z)] < \infty \).

It is relatively easy to show that a neural network, and therefore a generator network, satisfies the regularity assumption: it essentially suffices to show that \( \mathbb{E}_z \| \nabla_{\theta, z}g_{\theta}(z) \| < \infty \). We skip this short proof; for those interested, it can be found in Appendix C of [3].

Now, we prove the following:

If the generator is continuous in \( \theta \), so is \( W(\mathbb{P}_r, \mathbb{P}_{\theta}) \). Moreover, if the generator is locally Lipschitz and satisfies the regularity assumption, \( W(\mathbb{P}_r, \mathbb{P}_{\theta}) \) is continuous everywhere and differentiable almost everywhere. In contrast, neither statement is true for KL-based divergences, such as the JS divergence which GANs minimize.

Let \( \theta \) and \( \theta' \) be two parameter vectors and \( g_{\theta} \) the generator function parameterized by \( \theta \). If \( \gamma' \) is the joint distribution of \( (g_{\theta}(z), g_{\theta'}(z)) \) with \( z \sim p \), which belongs to \( \Pi(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) \), then

\[ \begin{equation}\begin{split} W(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) & \le \mathop{\mathbb{E}}_{(x, y)\sim \gamma'}[\| x - y \|] \\ & = \mathbb{E}_z [\| g_{\theta}(z) - g_{\theta'}(z) \|], \end{split}\end{equation} \]

and because the modeled variable lives in a compact space, \( \| g_{\theta}(z) - g_{\theta'}(z) \| \) is uniformly bounded by some constant; combined with the continuity of \( g_{\theta} \) in \( \theta \), which makes the integrand vanish pointwise as \( \theta \rightarrow \theta' \), the bounded convergence theorem gives:

\[ \begin{equation}\begin{split} 0 \le W(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) \le \mathbb{E}_z [\| g_{\theta}(z) - g_{\theta'}(z) \|] \\ \Rightarrow W(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) \xrightarrow{\theta \rightarrow \theta'} 0  \end{split}\end{equation} \]

In turn, because the Wasserstein distance is a proper metric and thus satisfies the triangle inequality:

\[ | W(\mathbb{P}_r, \mathbb{P}_{\theta}) - W(\mathbb{P}_r, \mathbb{P}_{\theta'}) | \le W(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) \xrightarrow{\theta \rightarrow \theta'} 0 \]

which proves the continuity of \( W(\mathbb{P}_r, \mathbb{P}_{\theta}) \) in \( \theta \). Moreover, from the local Lipschitz assumption, for a given pair \( (\theta, z) \) there are a constant \( L(\theta, z) \) and an open set \( U \ni (\theta, z) \) such that for every \( (\theta', z') \in U \) we have

\[ \| g_{\theta}(z) - g_{\theta'}(z') \| \le L(\theta, z) (\| \theta - \theta' \| + \| z - z' \|) \]

Setting \( z' = z \) and taking expectations over \( z \),

\[ \mathbb{E}_z[\| g_{\theta}(z) - g_{\theta'}(z) \|] \le \mathbb{E}_z [L(\theta, z)] (\| \theta - \theta' \|). \]

If we define \( U_{\theta} = \{ \theta' \mid (\theta', z) \in U \} \), this is still an open set. Using the regularity assumption to define \( L(\theta) = \mathbb{E}_z [L(\theta, z)] < \infty \), along with the inequality we arrived at for the continuity, we get

\[  | W(\mathbb{P}_r, \mathbb{P}_{\theta}) - W(\mathbb{P}_r, \mathbb{P}_{\theta'}) | \le W(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) \le L(\theta) \| \theta - \theta' \| \]

for all \( \theta' \in U_{\theta} \). Since the generator satisfies the regularity assumption, this means that \( W(\mathbb{P}_r, \mathbb{P}_{\theta}) \) is locally Lipschitz in \( \theta \). By Rademacher's theorem, it is then also differentiable almost everywhere.

To show that these properties do not hold for the JS divergence, we use a counterexample. Let \( Z \sim U[0, 1] \), let \( \mathbb{P}_0 \) be the distribution of \( (0, Z) \), and let \( g_{\theta}(z) = (\theta, z) \). It is easy to see that \( \text{JS}(\mathbb{P}_0 \| \mathbb{P}_{\theta}) = \log2 \cdot \mathbb{1}\{\theta \ne 0\}\), while \( W(\mathbb{P}_0, \mathbb{P}_{\theta}) = |\theta| \). This concludes the proof \( \blacksquare \).
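The "easy to see" part works out as follows: for \( \theta \ne 0 \) the supports of \( \mathbb{P}_0 \) and \( \mathbb{P}_{\theta} \) are disjoint, so with \( M = \frac{1}{2}(\mathbb{P}_0 + \mathbb{P}_{\theta}) \) we have

\[ D_{KL}(\mathbb{P}_0 \| M) = \mathop{\mathbb{E}}_{x\sim\mathbb{P}_0} \log{\frac{\mathbb{P}_0(x)}{\frac{1}{2}\mathbb{P}_0(x)}} = \log{2}, \qquad D_{KL}(\mathbb{P}_{\theta} \| M) = \log{2}, \]

so \( \text{JS}(\mathbb{P}_0 \| \mathbb{P}_{\theta}) = \frac{1}{2}(\log{2} + \log{2}) = \log{2} \) no matter how small \( |\theta| \) is: the map \( \theta \mapsto \text{JS}(\mathbb{P}_0 \| \mathbb{P}_{\theta}) \) jumps from \( 0 \) to \( \log{2} \) at \( \theta = 0 \) and provides no useful gradient. The optimal transport plan, on the other hand, simply moves every point \( (0, z) \) horizontally to \( (\theta, z) \), at cost \( |\theta| \) each, hence \( W(\mathbb{P}_0, \mathbb{P}_{\theta}) = |\theta| \), which is continuous in \( \theta \).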

This theorem basically shows us that the Wasserstein distance is a more sensible choice of divergence than the JS one for a generator. However, as one can guess, the infimum is highly intractable (see [4]). Thankfully, in the same manner that the JS divergence is approximately the GAN objective when the discriminator is optimal for a given generator, we can arrive at a similar tractable approximation of the Wasserstein distance.


Wasserstein GAN

 

In [4], one can find the Kantorovich-Rubinstein duality formula, i.e.

\[ \mathop{\text{inf}}_{\gamma \in \Pi (\mathbb{P}_r, \mathbb{P}_{\theta})} \mathop{\mathbb{E}}_{(x,y)\sim\gamma}[\|x-y\|] = \mathop{\text{sup}}_{\| f\|_L \le 1} [ \mathop{\mathbb{E}}_{x\sim \mathbb{P}_r} f(x) - \mathop{\mathbb{E}}_{x\sim \mathbb{P}_{\theta}} f(x) ], \]

where \( \| f \|_L \le K \) denotes the set of K-Lipschitz functions. Therefore, we immediately have

\[ W(\mathbb{P}_r, \mathbb{P}_{\theta}) = \mathop{\text{sup}}_{\| f \|_L \le 1} [ \mathop{\mathbb{E}}_{x\sim \mathbb{P}_r} f(x) - \mathop{\mathbb{E}}_{x\sim \mathbb{P}_{\theta}} f(x) ]. \]

Notice that the right-hand side of the above expression, viewed as a functional of \( f \), is a traditional machine learning objective: we maximize a well-defined quantity over a family of functions. Consequently, if we parameterize \( f \) as a neural network \( f_w \) with parameters \( w \), all we need is a way to make \( f_w \) Lipschitz, even approximately, and a way to back-propagate a gradient from the above formulation of the Wasserstein distance to the parameters of the generator. In this setup, \( f_w \) plays the role of the discriminator, or critic, as the authors of [3] call it to differentiate it from the GAN discriminator. Naturally, this "GAN" framework is referred to as the Wasserstein GAN (WGAN). We can easily show, using Leibniz's integral rule, that

\[ \nabla_{\theta} W(\mathbb{P}_r, \mathbb{P}_{\theta}) = -\mathop{\mathbb{E}}_{z} \nabla_{\theta} \tilde{f}(g_{\theta}(z)), \]

where \( \tilde{f} \) is the optimal critic for the given generator. To impose the Lipschitz constraint, we can restrict the parameters \( w \) to lie in a compact space \( \mathcal{W} \), for example \( \mathcal{W} = [-0.01, 0.01]^d \), by clipping them after every update. (Note that this approximates the Wasserstein distance only up to a multiplicative constant that depends on \( \mathcal{W} \); such a constant can be absorbed into the learning rate and therefore does not really matter for our purposes.) A minimal training sketch with weight clipping follows below.
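Here is a schematic PyTorch sketch of one WGAN training round with weight clipping. It assumes user-defined networks `f_w` (the critic, with no sigmoid at the output) and `g_theta` (the generator), their optimizers, and a batch `real`; `n_critic = 5` and `clip = 0.01` follow the defaults reported in [3], while the latent dimension is an arbitrary choice.

```python
import torch

def wgan_step(f_w, g_theta, opt_c, opt_g, real, z_dim=100, n_critic=5, clip=0.01):
    """One WGAN round: n_critic critic updates with weight clipping, then one generator update."""
    for _ in range(n_critic):
        z = torch.randn(real.size(0), z_dim, device=real.device)
        fake = g_theta(z).detach()
        # The critic maximizes E_r[f_w(x)] - E_theta[f_w(g(z))]; we minimize the negative.
        loss_c = -(f_w(real).mean() - f_w(fake).mean())
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        for p in f_w.parameters():               # keep w inside W = [-clip, clip]^d
            p.data.clamp_(-clip, clip)

    z = torch.randn(real.size(0), z_dim, device=real.device)
    # The generator minimizes -E_z[f_w(g_theta(z))], matching the gradient formula above.
    loss_g = -f_w(g_theta(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return -loss_c.item(), loss_g.item()   # -loss_c is the current estimate of the Wasserstein distance
```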

The authors of [5] suggest replacing this hard constraint with a gradient-penalty regularization term,

\[ - \lambda \mathop{\mathbb{E}}_{\hat{x}\sim \mathbb{P}_{\hat{x}}}[(\|  \nabla_{\hat{x}}f(\hat{x})\|_2 - 1)^2], \]

where \( \mathbb{P}_{\hat{x}} \) is the distribution obtained by sampling uniformly along straight lines between pairs of points drawn from the data distribution \( \mathbb{P}_r \) and the generator distribution \( \mathbb{P}_{\theta} \). This term (note the minus sign) is added to the critic's objective, i.e. to the approximation of the Wasserstein distance that the critic maximizes. A sketch of the penalty computation follows below.
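The following sketch shows how such a penalty can be computed with automatic differentiation; `critic` is assumed to be a user-defined network that outputs one score per sample, and the interpolation follows the description of \( \mathbb{P}_{\hat{x}} \) above.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Penalty lam * E[(||grad_xhat f(xhat)||_2 - 1)^2] on points between real and fake samples."""
    # Sample uniformly along the straight lines between paired real and generated samples.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    xhat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(xhat)
    # Gradient of the critic's output w.r.t. the interpolated inputs.
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=xhat, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()
```

During training, this penalty is added to the critic's loss (the negative of the objective it maximizes, so the sign matches the term above), and the weight clipping of the previous sketch is dropped; \( \lambda = 10 \) is the value used in [5].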


Brief empirical comparison

 

GANs are notoriously difficult to train. Experiments using the Wasserstein distance show improved stability in the training of GANs. Moreover, owing to the properties of the Wasserstein distance compared to the JS divergence, the value of the minimax game in the WGAN framework correlates better with the quality of the generated samples than the corresponding value of the original GAN. Some examples, reproduced from [3], are:

 

[Figure: objective of the minimax game in a GAN (the JS divergence estimate) during training.]

[Figure: objective of the minimax game in a WGAN (the Wasserstein distance estimate) during training.]

 

It is evident that the JS divergence begins to increase as the generator becomes better, in contrast to the Wasserstein distance that either decreases as sample quality increases or remains constant if sample quality does not improve.

Even though WGANs have not gained much traction in image generation, they have become the norm in other settings, such as Zero-shot learning [6].


 

References

  1. Chochlakis, G. (2020). The math behind Variational Autoencoders (VAEs), https://thinking-ai-aloud.blogspot.com/2020/10/variational-autoencoders-vae-in-depth.html
  2. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.
  3. Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214-223). PMLR.
  4. Villani, C. (2008). Optimal transport: old and new (Vol. 338). Springer Science & Business Media.
  5. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.
  6. Chochlakis, G., Georgiou, E., & Potamianos, A. (2021). End-to-end Generative Zero-shot Learning via Few-shot Learning. arXiv preprint arXiv:2102.04379.

