
The math behind Variational Autoencoders (VAEs)

 


The goal of this blogpost is to derive the mathematical formulation of the Variational Autoencoder (VAE) from simple principles and to interpret it intuitively. We first describe what a VAE is, followed by how it differs from other neural networks of its class; then we derive in detail its formulation as presented in Kingma et al. and, lastly, explain intuitively how the final framework achieves what we describe.

We assume the reader is familiar with basic principles of machine learning (neural networks, gradient descent), very basic graph jargon (nodes and edges), simple probabilistic concepts (probability density functions, the expectation, the notion of i.i.d. samples), and simple calculus and linear algebra.

For hands-on experiments and code, see our respective blogpost.

What is a VAE?

A VAE is an autoencoder (AE). An AE is a neural network that is trained to copy its input to its output. Internally, it has a hidden layer whose output \(h\) is referred to as the code, used to represent the input. The whole network can be thought of as consisting of an encoder \(h = q(x)\) that encodes the input, and a decoder \(\hat{x} = p(h)\) that decodes the code.
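To make the encoder / decoder split concrete, here is a minimal sketch of a standard (deterministic) AE. PyTorch, the layer sizes and the flattened 784-dimensional input are our own illustrative assumptions, not something prescribed by the text.

```python
import torch
import torch.nn as nn

# A minimal deterministic autoencoder: encoder h = q(x), decoder x_hat = p(h).
# The architecture and dimensions are arbitrary illustrative choices.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)        # the code
        x_hat = self.decoder(h)    # the reconstruction
        return x_hat
```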

A standard Autoencoder
 

How does a VAE differ from other AEs?

The VAE results from trying to perform inference on a directed probabilistic model with continuous latent variables. Let's take a moment to briefly elaborate on these concepts, so that the formulation and the conclusions become clearer.

VAE's Directed Probabilistic Model.

A directed probabilistic model uses a directed graph to represent a process / joint distribution. The nodes of this graph represent random variables, and the edges causal probabilistic relationships. The graph demonstrates how the Probability Density Function (PDF) of the nodes can be decomposed into a product of factors, each depending only on a subset of the variables. We use the graph of the VAE, shown on the right, as an example to get a better understanding, focusing on the solid lines for the time being. We have the random variables \(z\) and \(x\), depicted as circles. \(\theta\) denotes the deterministic parameters of the joint distribution, depicted as a dot, and is included to make the representation more expressive. The plate, aka the rounded rectangle with the symbol \(N\) at the bottom, is used to render the representation compact: it denotes \(N\) independent copies of the enclosed nodes, of which only one is shown. No edge points into \(z\), which means it does not depend on any other random variable, whereas \(x\) is a direct successor of \(z\), denoting a causal relationship. Based on this model, and since the plate denotes \(N\) independent copies of the pair \((z, x)\), each \(x_i\) depends only on its own \(z_i\) and the joint distribution as a whole can be expressed as \( p_{\theta}(x_1, \dots, x_N, z_1, \dots, z_N) = \prod_{i=1}^{N} p_{\theta}(z_i)\, p_{\theta}(x_i | z_i) \), where we include the subscript \(\theta\) to make the reliance on the parameters \(\theta\) explicit.

There are many situations in which we want to sample from such a distribution. To do this, we can simply sample from a variable once all of its parents in the graph have been determined (if it has none, we simply sample from its marginal distribution), a process denoted as ancestral sampling. In our case, it suffices to sample each \(z_i\) from the "prior" distribution \(p_{\theta}(z_i)\). Then, given these values, we can sample the output variables \(x_i\) using \( p_{\theta}(x_i | z_i) \).
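As a small illustration of ancestral sampling (not the VAE's learned model), the sketch below samples from a toy directed model in which the prior is a standard normal and the likelihood is a Gaussian centered at a fixed linear map of \(z\); the distributions, dimensions and the matrix W are assumptions made purely for the example.

```python
import numpy as np

# Ancestral sampling: sample a node only after all of its parents are determined.
# z has no parents, so it is drawn from its prior; x is then drawn given z.
rng = np.random.default_rng(0)
N, z_dim, x_dim = 5, 2, 4
W = rng.normal(size=(x_dim, z_dim))      # stands in for the parameters theta

z = rng.normal(size=(N, z_dim))          # z_i ~ p_theta(z_i)
x = rng.normal(loc=z @ W.T, scale=0.1)   # x_i ~ p_theta(x_i | z_i)
print(x.shape)                           # (5, 4)
```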


On the matter of inference, it usually refers to the (often intractable) problem of computing the posterior distribution of the latent variables given the visible ones, e.g. \( p(z_1, \dots, z_N | x_1, \dots, x_N) \) in the notation of the VAE's directed probabilistic model. Variational inference, which is the technique we are concerned with, poses the inference problem as an optimization problem in which the quantity being optimized is a functional (a mapping that takes a function as input), much like neural network training does.


So, returning to the subject at hand, a VAE consists of an encoder that results from (efficiently) solving the problem of inference, a decoder that coincides with the generative part of the directed probabilistic model, and a code \(h\) that coincides with the "controllable" latent variables. We use controllable in the sense that the latent variables have a prior distribution, parameterized by \(\theta\), from which we can sample as many times as we like. Therefore, using ancestral sampling, we can sample arbitrarily many output variables, which in the case of AEs mirror the input distribution. In this manner, we can use the VAE as a simple AE or retain only its decoder to be used as a generator, unlike most other AEs. We comment after the derivation on why this is possible for VAEs.

Autoencoding Variational Bayes: The basic formulation

 
VAE's Directed Probabilistic Model.

In this section we present the formulation for any directed graphical probabilistic model, not just the VAE, and adjust it accordingly in the next section. We are given a dataset \(X = \{x_1, \dots, x_M\}\) of \(M\) i.i.d. samples. We assume that the data are generated by some random process involving an unobserved continuous random variable \(z\): first \(z\) is sampled from \(p_{\theta^*}(z)\), then \(x\) is sampled given \(z\) from \(p_{\theta^*}(x|z)\) (exactly as described in the ancestral sampling process of the VAE's directed probabilistic model). We repeat the figure above for convenience. We assume the prior \(p_{\theta^*}(z)\) and the likelihood \(p_{\theta^*}(x|z)\) come from parametric families of distributions \(p_{\theta}(z)\) and \(p_{\theta}(x|z)\) and that their PDFs are differentiable almost everywhere w.r.t both \(\theta\) and \(z\). Unfortunately, the actual parameters \(\theta^*\) and the latent variables \(z\) are unknown to us.
 
So, we are tasked with finding suitable parameters \(\theta\) using \(X\). The only way we can achieve such a feat is by using the only available information and maximizing the likelihood of \(X\) w.r.t the parameters \(\theta\),
\[ p_{\theta}(X) = p_{\theta}(x_1, \dots, x_M) = \prod_{i=1}^Mp_{\theta}(x_i). \]
However,
\[ p_{\theta}(z|x) = p_{\theta}(x|z)p_{\theta}(z) / p_{\theta}(x) \]
is in general intractable, thus
\[ p_{\theta}(x) = \int p_{\theta}(z)p_{\theta}(x|z)dz \]
is intractable too (if it were tractable, that would render \(p_{\theta}(z|x)\) tractable, since the other two terms are easily computed using the model described by \(\theta\); Lei et al. prove the intractability for deep generative networks). Consequently, we cannot evaluate or differentiate the marginal likelihood.
 
Fortunately, we can avoid this inconvenience by transforming the likelihood. Firstly, note that instead of the likelihood itself, we optimize the log-likelihood \( \log p_{\theta}(X) = \sum_{i=1}^M{\log p_{\theta}(x_i)} \) to transform the product of factors into a sum (we can do that because the logarithm is an increasing function). We also introduce \( q_{\phi}(z|x) \): a parametric approximation to the intractable true posterior \( p_{\theta}(z|x) \), referred to as the recognition model, which will become the encoder of the VAE. Note that the true posterior is, first, under no obligation to be parametric [citation needed :P]; second, always dependent upon the "choice" of \(\theta\) (hence the notation \( p_{\theta}(z|x) \)); and third, usually intractable. Therefore, new parameters \( \phi \) are introduced to distinguish the parametric approximation from the true posterior it approximates. The influence of \(\phi\) is denoted in the figure by dashed lines. Thereafter, we transform each term in the log-likelihood as follows:
\[ \begin{equation*}         \begin{split}             \log{p_{\theta}(x)} & = \log{p_{\theta}(x)} \int q_{\phi}(z|x)dz \quad (\text{since } \int q_{\phi}(z|x)dz = 1) \\[4pt]             & = \int q_{\phi}(z|x) \log{p_{\theta}(x)} dz \quad (\log{p_{\theta}(x)} \text{ constant w.r.t } z) \\[4pt]             & = \int q_{\phi}(z|x) \log{\frac{p_{\theta}(x, z)}{p_{\theta}(z|x)}} dz \quad \left(p_{\theta}(z|x) = \frac{p_{\theta}(x, z)}{p_{\theta}(x)} \Leftrightarrow p_{\theta}(x) = \frac{p_{\theta}(x, z)}{p_{\theta}(z|x)}\right) \\[4pt]             & = \int q_{\phi}(z|x) \log{\frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}\frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}} dz \\[4pt]             & = \int q_{\phi}(z|x) \log{\frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}} dz + \int q_{\phi}(z|x) \left[ \log{p_{\theta}(x, z)} - \log{q_{\phi}(z|x)} \right] dz \\[4pt]             & = D_{KL}\left(q_{\phi}(z|x) \| p_{\theta}(z|x)\right) + \mathbb{E}_{q_{\phi}(z|x)}\left[ \log{p_{\theta}(x, z)} - \log{q_{\phi}(z|x)} \right] \quad (\text{by definitions}) \\[4pt]             & = D_{KL}\left(q_{\phi}(z|x) \| p_{\theta}(z|x)\right) + \mathcal{L}(\theta, \phi; x),         \end{split}     \end{equation*} \]
where \(D_{KL}(p(x)\|q(x)) = \int p(x)\log{\frac{p(x)}{q(x)}}dx \) is the Kullback-Leibler divergence, \( \mathbb{E}_q \) is the expectation w.r.t the PDF \(q\) and \( \mathcal{L} \) is defined for convenience. Next, we use the property of the Kullback-Leibler divergence being non-negative, which follows from Jensen's inequality applied to the convex function \( f(t) = -\log{t} \):
\[ \begin{equation*} \begin{split} D_{KL}(p(x)\|q(x)) & = \int p(x)\log{\frac{p(x)}{q(x)}}dx = \int p(x) \left( - \log{\frac{q(x)}{p(x)}} \right) dx \\[4pt] & = \mathbb{E}_p\left[ -\log{\frac{q(x)}{p(x)}} \right] = \mathbb{E}_p\left[ f\left(q(x)/p(x)\right) \right] \\[4pt] & \ge f\left( \mathbb{E}_p[q(x)/p(x)] \right) \quad (\text{Jensen's inequality, } f \text{ convex}) \\[4pt] & = f\left(\int p(x) \frac{q(x)}{p(x)} dx\right) = f\left(\int q(x) dx\right) \\[4pt] & = f(1) = -\log{1} = 0 \\[4pt] \Rightarrow D_{KL}(p(x)\|q(x)) & \ge 0 \end{split} \end{equation*} \]
to get a lower bound of the log-likelihood: \( \log{p_{\theta}(x)} \ge \mathcal{L}(\theta, \phi; x) \). By transforming the lower bound further, we get:
\[ \begin{equation*}         \begin{split}             \log{p_{\theta}(x)} & \ge \mathcal{L}(\theta, \phi; x) = \int q_{\phi}(z|x)\log{\frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}} dz \\[4pt]             & = \int q_{\phi}(z|x) \log{\frac{p_{\theta}(x|z)p_{\theta}(z)}{q_{\phi}(z|x)}} dz \\[4pt]             & = \int q_{\phi}(z|x) \log{\frac{p_{\theta}(z)}{q_{\phi}(z|x)}} dz + \int q_{\phi}(z|x) \log{p_{\theta}(x|z)} dz \\[4pt]             & = -D_{KL}(q_{\phi}(z|x) \| p_{\theta}(z)) + \mathbb{E}_{q_{\phi}(z|x)}\left[ \log{p_{\theta}(x|z)} \right] \\[4pt]             & = -D_{KL}(q_{\phi}(z|x) \| p_{\theta}(z)) - \mathcal{L}_{REC}(\theta, \phi; x).        \end{split}     \end{equation*} \]
Here we define the reconstruction loss \( \mathcal{L}_{REC}(\theta, \phi; x) = -\mathbb{E}_{q_{\phi}(z|x)}\left[ \log{p_{\theta}(x|z)} \right] \), i.e. the negative expected log-likelihood of the data under the decoder. In this manner, we can maximize a lower bound of the log-likelihood w.r.t \(\phi\) and \(\theta\), "maximizing" it indirectly. Notice that the introduction of \( q_{\phi} \) into the equation doesn't involve any assumptions about its nature or properties other than it being a valid PDF over \(z\) (so that \( \int q_{\phi}(z|x)dz = 1 \)). The specific choice of it being a PDF of \(z\) given the desirable output variable \(x\), \( q_{\phi}(z|x) \), is made because it makes intuitive sense when trying to optimize the lower bound: the closer \( q_{\phi}(z|x) \) is to the true posterior \( p_{\theta}(z|x) \), the smaller the Kullback-Leibler term in the decomposition above and the tighter the bound.

Reformulation for Deep Learning

Even though we are not able to compute or differentiate the initial target, we have arrived at a lower bound, \( \mathcal{L} \), which we can maximize. Given that Deep Learning formulations involve the minimization of a cost function, we instead consider the remaining problem as the minimization of \( - \mathcal{L} \). We consider \( q_{\phi}(z|x) \) to be the encoder and \( p_{\theta}(x|z) \) to be the decoder. Nevertheless, the job is still not done. These quantities are PDFs, not mere mappings/functions. Moreover, with traditional neural networks, we are unable to calculate the Kullback-Leibler divergence term as we have defined it, since the network only outputs one value per input, whereas we require a whole distribution. To tackle these hindrances, we introduce the notion of Multilayer Perceptrons (MLPs) as probabilistic encoders and decoders.

For the decoder, we discuss the Bernoulli MLP. In this case, we let \( p(x|z) \) be a multivariate Bernoulli (not to be confused with a Multinoulli, whose probabilities would add up to 1) whose probabilities \( \hat{x} \) are computed from \( z \) with an MLP:
\[ \log{p(x|z)} = \sum_i \left[ x_i\log{\hat{x}_i} + (1 - x_i)\log{(1 - \hat{x}_i)} \right], \]
which is maximized (and hence its negative, appearing in \( - \mathcal{L} \), is minimized) when \( \hat{x}_i = x_i \). This binary cross-entropy formulation necessitates normalizing the input to the \( [0, 1] \) range and using an output activation function that restricts the output to the same range, e.g. a sigmoid.
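As a sketch, the Bernoulli decoder's negative log-likelihood is exactly the binary cross-entropy between the input and the decoder output. The snippet below assumes PyTorch and a hypothetical `decoder` module ending in a sigmoid.

```python
import torch
import torch.nn.functional as F

# Negative Bernoulli log-likelihood -log p(x|z) for a decoder whose output
# x_hat = decoder(z) lies in (0, 1); x is assumed normalized to [0, 1].
# binary_cross_entropy sums -[x*log(x_hat) + (1-x)*log(1-x_hat)] over elements.
def bernoulli_reconstruction_loss(decoder, x, z):
    x_hat = decoder(z)
    return F.binary_cross_entropy(x_hat, x, reduction="sum")
```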

For the encoder (and potentially for the decoder), we utilize the Gaussian MLP. In this case, the MLP produces the mean and the variance (usually the logarithm of the variance) of a Gaussian distribution, and the approximate posterior becomes:
\[ \log{q_{\phi}(z|x)} = \log{\mathcal{N}(z; m(x), diag(\sigma^2(x)))}, \]
where \(diag\) denotes the diagonal matrix whose diagonal elements are equal to the elements of its vector argument. We see that the MLP generates the mean and the diagonal elements of the (diagonal) covariance matrix.
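A possible sketch of such a Gaussian MLP encoder in PyTorch is shown below; the hidden size and latent dimensionality are arbitrary assumptions, and the network outputs the log-variance for numerical convenience, as mentioned above.

```python
import torch
import torch.nn as nn

# Gaussian MLP encoder: outputs the mean m(x) and log-variance log(sigma^2(x))
# of a diagonal Gaussian over the latent code z.
class GaussianEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, z_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mean = nn.Linear(hidden_dim, z_dim)
        self.logvar = nn.Linear(hidden_dim, z_dim)  # unconstrained, interpreted as log(sigma^2)

    def forward(self, x):
        h = self.hidden(x)
        return self.mean(h), self.logvar(h)
```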
 
Having defined these MLPs, let's put into words how they solve our problems. First, the Bernoulli MLP, used as a decoder, directly solves the problem of the decoder being probabilistic, and the loss \( \mathcal{L}_{REC} \) is used as is. For the encoder, using the Gaussian MLP solves the problem of the encoder being probabilistic. Moreover, since the encoder produces a Gaussian distribution, the Kullback-Leibler divergence can be computed analytically. Naturally, the prior is selected to be Gaussian itself, usually a standard one. In fact, for two multivariate Gaussian distributions \( \mathcal{N}(\mu_0, \Sigma_0) \) and \( \mathcal{N}(\mu_1, \Sigma_1) \), the Kullback-Leibler divergence is equal to
\[ D_{KL}(\mathcal{N}_0 \| \mathcal{N}_1) = 0.5 \left\{ tr(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^T \Sigma_1^{-1}(\mu_1 - \mu_0) - k + \ln{\frac{|\Sigma_1|}{|\Sigma_0|}} \right\}, \]
where \( tr \) is the trace of a matrix and \( k \) is the dimensionality of the distributions. Another issue arises, though. If we sample the latent code \( z \) directly from \( \mathcal{N}(z; m(x), diag(\sigma^2(x))) \), then we cannot backpropagate the gradients to the encoder. To get the actual latent variable / code, we therefore use the reparameterization trick, according to which \( z = m(x) + \epsilon \odot \sigma(x) \), where \( \epsilon \sim \mathcal{N}(\epsilon; 0, I) \) and \( \odot \) denotes the element-wise product. This allows the gradients to be backpropagated through this stochastic process. Intuitively, sampling directly from the former distribution "stops" the forward propagation: the latent code is then treated as an external input variable, with no functional dependence on the values produced by the encoder. With the reparameterization, on the other hand, \( z \) becomes a deterministic function of the mean and the variance, and \( \epsilon \) acts merely as an auxiliary noise input that requires no updating, so gradients can flow back to the encoder's parameters.
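For the common choice of a standard normal prior \( p_{\theta}(z) = \mathcal{N}(0, I) \), substituting \( \mu_0 = m(x) \), \( \Sigma_0 = diag(\sigma^2(x)) \), \( \mu_1 = 0 \) and \( \Sigma_1 = I \) into the general formula gives the closed form
\[ D_{KL}\left(\mathcal{N}(m, diag(\sigma^2)) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2}\sum_{j=1}^{k}\left( \sigma_j^2 + m_j^2 - 1 - \ln{\sigma_j^2} \right). \]
The sketch below implements this closed form together with the reparameterization trick, assuming PyTorch and encoder outputs `mean` and `logvar` as in the hypothetical encoder above.

```python
import torch

# Reparameterization trick: z = m(x) + eps * sigma(x), eps ~ N(0, I),
# so z is a deterministic, differentiable function of mean and logvar.
def reparameterize(mean, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mean + eps * std

# Closed-form KL divergence D_KL(N(m, diag(sigma^2)) || N(0, I)),
# summed over latent dimensions (and over the batch).
def kl_to_standard_normal(mean, logvar):
    return 0.5 * torch.sum(logvar.exp() + mean.pow(2) - 1.0 - logvar)
```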

Intuitive Explanation

Recall that the final objective to be minimized is:
\[-\mathcal{L}(\theta, \phi; x) = D_{KL}(q_{\phi}(z|x) \| p_{\theta}(z)) + \mathcal{L}_{REC}(\theta, \phi; x)\]
The \( \mathcal{L}_{REC} \) term is the reconstruction loss of the AE part of the VAE (hence the notation "Reconstruction Loss"), due to the fact that it is minimized when the output of the VAE is identical to the input, as noted before. The expectation is, as usual, substituted during the actual training by its Monte Carlo approximation, the sample mean, i.e.
\[ \mathcal{L}_{REC} = -\mathbb{E}_{q_{\phi}(z|x)}\left[ \log{p_{\theta}(x|z)} \right] \simeq -\frac{1}{B} \sum_{b=1}^B \sum_i \left[ x_i \log{\hat{x}_i(z_b)} + (1 - x_i)\log{(1 - \hat{x}_i(z_b))} \right], \quad z_b = m(x) + \epsilon_b \odot \sigma(x), \]
where \( \hat{x}(z_b) \) denotes the decoder's output for the reparameterized sample \( z_b \) and \( \epsilon_b \sim \mathcal{N}(0, I) \). Kingma et al. propose that \(B\) can even be kept equal to 1, as long as enough data points are used within each batch (recall that we maximize a sum of such terms but only derived the formulation for a single one; the initial functional in its totality easily follows and is approximated as usual with mini-batches).
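Putting the pieces together, a single evaluation of \( -\mathcal{L} \) on a mini-batch with \( B = 1 \) could look like the sketch below; `encoder` and `decoder` are the hypothetical modules from the previous sketches, and averaging over the batch is a common but not mandatory convention.

```python
import torch
import torch.nn.functional as F

# One evaluation of -L(theta, phi; x) on a batch, using a single (B = 1)
# reparameterized sample per data point.
def negative_elbo(encoder, decoder, x):
    mean, logvar = encoder(x)                                        # m(x), log(sigma^2(x))
    z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)      # reparameterized sample
    x_hat = decoder(z)                                               # Bernoulli probabilities
    rec = F.binary_cross_entropy(x_hat, x, reduction="sum")          # L_REC
    kl = 0.5 * torch.sum(logvar.exp() + mean.pow(2) - 1.0 - logvar)  # D_KL(q || N(0, I))
    return (rec + kl) / x.shape[0]                                   # averaged over the batch
```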

The Kullback-Leibler divergence term can be thought of as a regularization term that constricts the distribution of the encoder to be close to a prior distribution. It is this term that allows the usage of the decoder on its own as a generator. If the encoder's distribution is truly close to the prior, then we can indeed simply sample from the prior and generate samples, since there is no distribution shift between training and testing that would otherwise deter us from using it in such a manner.
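In practice, using the decoder alone as a generator then amounts to sampling \( z \) from the prior and passing it through the decoder; a minimal sketch, again assuming the hypothetical PyTorch decoder from above:

```python
import torch

# Generation with the decoder only: since the KL term pushes q_phi(z|x)
# toward the prior N(0, I), sampling z from the prior and decoding it
# should produce samples resembling the training data.
def generate(decoder, num_samples=16, z_dim=32):
    with torch.no_grad():
        z = torch.randn(num_samples, z_dim)  # z ~ p(z) = N(0, I)
        return decoder(z)
```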

References
  1. Lei, Q., Jalal, A., Dhillon, I. S., & Dimakis, A. G. (2019). Inverting Deep Generative Models, One Layer at a Time. In Advances in Neural Information Processing Systems (pp. 13910-13919).
  2. Ian Goodfellow, Yoshua Bengio, & Aaron Courville (2016). Deep Learning. MIT Press.
  3. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  4. Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
