Introduction to Diffusion Models for Machine Learning

This article was originally published on AssemblyAI and re-published to TOPBOTS with permission from the author.

Diffusion Models are generative models which have been gaining significant popularity in the past several years, and for good reason. A handful of seminal papers released in the 2020s alone have shown the world what Diffusion models are capable of, such as beating GANs^[6] on image synthesis. Most recently, practitioners will have seen Diffusion Models used in DALL-E 2, OpenAI’s image generation model released last month.

*Various images generated by DALL-E 2 (source).*

Given the recent wave of success by Diffusion Models, many Machine Learning practitioners are surely interested in their inner workings. In this article, we will examine the theoretical foundations for Diffusion Models, and then demonstrate how to generate images with a Diffusion Model in PyTorch. Let’s dive in!


Diffusion Models – Introduction

Diffusion Models are generative models, meaning that they are used to generate data similar to the data on which they are trained. Fundamentally, Diffusion Models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, we can use the Diffusion Model to generate data by simply passing randomly sampled noise through the learned denoising process.

*Diffusion Models can be used to generate images from noise (adapted from source).*

More specifically, a Diffusion Model is a latent variable model which maps to the latent space using a fixed Markov chain. This chain gradually adds noise to the data in order to obtain the approximate posterior q(x_1:T|x₀), where x₁,…,x_T are the latent variables with the same dimensionality as x₀. In the figure below, we see such a Markov chain manifested for image data.

Ultimately, the image is asymptotically transformed to pure Gaussian noise. The goal of training a diffusion model is to learn the reverse process – i.e. training p_θ(x_t−1|x_t). By traversing backwards along this chain, we can generate new data.

Benefits of Diffusion Models

As mentioned above, research into Diffusion Models has exploded in recent years. Inspired by non-equilibrium thermodynamics^[1], Diffusion Models currently produce State-of-the-Art image quality, examples of which can be seen below:

Beyond cutting-edge image quality, Diffusion Models come with a host of other benefits, including not requiring adversarial training. The difficulties of adversarial training are well-documented; and, in cases where non-adversarial alternatives exist with comparable performance and training efficiency, it is usually best to utilize them. On the topic of training efficiency, Diffusion Models also have the added benefits of scalability and parallelizability.

While Diffusion Models almost seem to be producing results out of thin air, there are a lot of careful and interesting mathematical choices and details that provide the foundation for these results, and best practices are still evolving in the literature. Let’s take a look at the mathematical theory underpinning Diffusion Models in more detail now.

Diffusion Models – A Deep Dive

As mentioned above, a Diffusion Model consists of a forward process (or diffusion process), in which a datum (generally an image) is progressively noised, and a reverse process (or reverse diffusion process), in which noise is transformed back into a sample from the target distribution.

The sampling chain transitions in the forward process can be set to conditional Gaussians when the noise level is sufficiently low. Combining this fact with the Markov assumption leads to a simple parameterization of the forward process:
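In the notation of [3], this parameterization is commonly written as follows, where β_t is the variance added at step t. This is a sketch in standard notation:

```latex
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1})
  := \prod_{t=1}^{T} \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
```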

Mathematical Note

We have been talking about corrupting the data by adding Gaussian noise, but it may at first be unclear where we are performing this addition. According to the above equation, at each step in the chain we are simply sampling from a Gaussian distribution whose mean is the previous value (i.e. image) in the chain.

These two statements are equivalent. That is

To understand why, we will utilize a slight abuse of notation by asserting

Where the final implication stems from the mathematical equivalence between a sum of random variables and the convolution of their distributions – see this Wikipedia page for more information.

In other words, we have shown that asserting the distribution of a timestep conditioned on the previous one via the mean of a Gaussian distribution is equivalent to asserting that the distribution of a given timestep is that of the previous one with the addition of Gaussian noise. We omitted the scalars introduced by the variance schedule and showed this for one dimension for simplicity, but a similar proof holds for multivariate Gaussians.
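For concreteness, the one-dimensional, unit-variance case described in this note can be sketched as follows. The scalars from the variance schedule are omitted, as in the text, and ε is a symbol introduced here for the added noise:

```latex
% Statement: sampling around the previous value is the same as adding noise to it
x_t \sim \mathcal{N}(x_{t-1},\, 1)
\quad\Longleftrightarrow\quad
x_t = x_{t-1} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\, 1)

% Reason: the conditional density depends only on the difference x_t - x_{t-1}
q(x_t \mid x_{t-1})
  = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2}\,(x_t - x_{t-1})^2\right)
  = \mathcal{N}(x_t - x_{t-1};\ 0,\ 1)
```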

Where β₁,…,β_T is a variance schedule (either learned or fixed) which, if well-behaved, ensures that x_T is nearly an isotropic Gaussian for sufficiently large T.
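To make the claim about x_T tangible, here is a minimal sketch that pushes a single scalar value through the forward chain many times and inspects the result. It assumes the usual update x_t = sqrt(1 − β_t)·x_{t−1} + sqrt(β_t)·ε from [3]; the helper names, the schedule endpoints, and the starting value 3.0 are illustrative choices rather than anything prescribed by the article:

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T increasing linearly (illustrative endpoints)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def run_forward_chain(x0, betas, rng):
    """Apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps step by step."""
    x = x0
    for beta in betas:
        eps = rng.gauss(0.0, 1.0)
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * eps
    return x

rng = random.Random(0)
betas = linear_beta_schedule(1000)

# Push the same starting value through many independent chains.
finals = [run_forward_chain(3.0, betas, rng) for _ in range(500)]
mean = sum(finals) / len(finals)
var = sum((v - mean) ** 2 for v in finals) / len(finals)
print(f"mean of x_T: {mean:.3f}, variance of x_T: {var:.3f}")
```

Even though every chain starts at the same value, the final samples should look like draws from a standard Gaussian: the starting point has been almost entirely forgotten.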

*Given the Markov assumption, the joint distribution of the latent variables is the product of the Gaussian conditional chain transitions (modified from source).*

As mentioned previously, the “magic” of diffusion models comes in the reverse process. During training, the model learns to reverse this diffusion process in order to generate new data. Starting with the pure Gaussian noise p(x_T):=N(x_T; 0, I), the model learns the joint distribution p_θ(x_0:T) as

where the time-dependent parameters of the Gaussian transitions are learned. Note in particular that the Markov formulation asserts that a given reverse diffusion transition distribution depends only on the previous timestep (or following timestep, depending on how you look at it):
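In the notation of [3], the joint distribution and the individual reverse transitions referred to above are commonly written as follows, where μ_θ and Σ_θ denote the learned, time-dependent mean and covariance:

```latex
p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)

p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```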

Training

A Diffusion Model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data. In practice, training equivalently consists of minimizing the variational upper bound on the negative log likelihood.
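In the notation of [3], this bound is commonly stated as follows, with q the forward process and p_θ the learned reverse process:

```latex
\mathbb{E}\left[-\log p_\theta(x_0)\right]
  \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]
  =: L_{vlb}
```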

Notation Detail

Note that L_vlb is technically an upper bound (the negative of the ELBO) which we are trying to minimize, but we refer to it as L_vlb for consistency with the literature.

We seek to rewrite the L_vlb in terms of Kullback-Leibler (KL) Divergences. The KL Divergence is an asymmetric statistical distance measure of how much one probability distribution P differs from a reference distribution Q. We are interested in formulating L_vlb in terms of KL divergences because the transition distributions in our Markov chain are Gaussians, and the KL divergence between Gaussians has a closed form.

What is the KL Divergence?

The mathematical form of the KL divergence for continuous distributions is

*The double bars indicate that the function is not symmetric with respect to its arguments.*
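For reference, the standard definition for continuous distributions P and Q with densities p and q is:

```latex
D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x)\, \log\!\left(\frac{p(x)}{q(x)}\right) dx
```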

Below you can see the KL divergence of a varying distribution P (blue) from a reference distribution Q (red). The green curve indicates the function within the integral in the definition for the KL divergence above, and the total area under the curve represents the value of the KL divergence of P from Q at any given moment, a value which is also displayed numerically.
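The same picture can be checked numerically. The sketch below integrates the function inside the KL integral for two univariate Gaussians and compares the result with the well-known closed form for Gaussians; the particular means and standard deviations are arbitrary illustrative values, and the function names are ours:

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def kl_numeric(mean_p, std_p, mean_q, std_q, lo=-20.0, hi=20.0, n=40001):
    """Trapezoidal estimate of the integral of p(x) * log(p(x) / q(x))."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        log_p = gaussian_log_pdf(x, mean_p, std_p)
        log_q = gaussian_log_pdf(x, mean_q, std_q)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * math.exp(log_p) * (log_p - log_q)
    return total * h

def kl_closed_form(mean_p, std_p, mean_q, std_q):
    """Closed-form KL divergence between two univariate Gaussians."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)

# P = N(1, 1^2) measured against the reference Q = N(0, 1.5^2), and the reverse.
kl_pq = kl_numeric(1.0, 1.0, 0.0, 1.5)
kl_qp = kl_numeric(0.0, 1.5, 1.0, 1.0)
print(kl_pq, kl_closed_form(1.0, 1.0, 0.0, 1.5))
print(kl_qp, kl_closed_form(0.0, 1.5, 1.0, 1.0))
```

The two directions give different values, which is the asymmetry that the double-bar notation signals, and the numerical integral agrees with the closed form in each direction.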

Casting L_vlb in Terms of KL Divergences

As mentioned previously, it is possible [1] to rewrite L_vlb almost completely in terms of KL divergences:

where
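Following [1] and [3], this decomposition is commonly written as shown here. The expectation is over the forward process q, and the term names match the L_T, L_1:T−1 and L₀ used in the section titles of this article:

```latex
L_{vlb} = \mathbb{E}_q\!\left[\, L_T + \sum_{t=2}^{T} L_{t-1} + L_0 \,\right]

L_T     = D_{KL}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)
L_{t-1} = D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right), \qquad 2 \le t \le T
L_0     = -\log p_\theta(x_0 \mid x_1)
```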

Derivation Details

The variational bound is equal to

Replacing the distributions with their definitions given our Markov assumption, we get

We use log rules to transform the expression into a sum of logs, and then we pull out the first term

Using Bayes’ Theorem and our Markov assumption, this expression becomes

We then split up the middle term using log rules

Isolating the second term, we see

Plugging this back into our equation for L_vlb, we have

Using log rules, we rearrange

Next, we note the following equivalence for the KL divergence for any two distributions:

Finally, applying this equivalence to the previous expression, we arrive at

Conditioning the forward process posterior on x₀ in L_t−1 results in a tractable form that leads to all KL divergences being comparisons between Gaussians. This means that the divergences can be exactly calculated with closed-form expressions rather than with Monte Carlo estimates^[3].
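Written out in full, the chain of steps in the derivation above reads as follows. This is a sketch in the notation of [3], with q the forward process and p_θ the reverse process; each comment names the step it corresponds to:

```latex
% The variational bound
L_{vlb} = \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]

% Replace both distributions with their Markov factorizations
= \mathbb{E}_q\!\left[-\log \frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)}{\prod_{t=1}^{T} q(x_t \mid x_{t-1})}\right]

% Log rules: turn the product into a sum and pull out the t = 1 term
= \mathbb{E}_q\!\left[-\log p(x_T)
    - \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}
    - \log \frac{p_\theta(x_0 \mid x_1)}{q(x_1 \mid x_0)}\right]

% Bayes' Theorem plus the Markov assumption:
%   q(x_t | x_{t-1}) = q(x_{t-1} | x_t, x_0) q(x_t | x_0) / q(x_{t-1} | x_0)
= \mathbb{E}_q\!\left[-\log p(x_T)
    - \sum_{t=2}^{T} \log\!\left(\frac{p_\theta(x_{t-1} \mid x_t)}{q(x_{t-1} \mid x_t, x_0)}
        \cdot \frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}\right)
    - \log \frac{p_\theta(x_0 \mid x_1)}{q(x_1 \mid x_0)}\right]

% Split the middle term
= \mathbb{E}_q\!\left[-\log p(x_T)
    - \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_{t-1} \mid x_t, x_0)}
    - \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}
    - \log \frac{p_\theta(x_0 \mid x_1)}{q(x_1 \mid x_0)}\right]

% Isolate the second sum: it telescopes
\sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}
  = \log \frac{q(x_1 \mid x_0)}{q(x_T \mid x_0)}

% Plug this back in
L_{vlb} = \mathbb{E}_q\!\left[-\log p(x_T)
    - \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_{t-1} \mid x_t, x_0)}
    - \log \frac{q(x_1 \mid x_0)}{q(x_T \mid x_0)}
    - \log \frac{p_\theta(x_0 \mid x_1)}{q(x_1 \mid x_0)}\right]

% Rearrange with log rules: the q(x_1 | x_0) terms cancel
= \mathbb{E}_q\!\left[-\log \frac{p(x_T)}{q(x_T \mid x_0)}
    - \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_{t-1} \mid x_t, x_0)}
    - \log p_\theta(x_0 \mid x_1)\right]

% KL identity for any two distributions P and Q
D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]
                   = -\,\mathbb{E}_{x \sim P}\!\left[\log \frac{Q(x)}{P(x)}\right]

% Apply the identity
L_{vlb} = \mathbb{E}_q\!\left[ D_{KL}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)
    + \sum_{t=2}^{T} D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)
    - \log p_\theta(x_0 \mid x_1)\right]
```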

Model Choices

With the mathematical foundation for our objective function established, we now need to make several choices regarding how our Diffusion Model will be implemented. For the forward process, the only choice required is defining the variance schedule, the values of which are generally increasing during the forward process.

For the reverse process, we must choose the Gaussian distribution parameterization/model architecture(s). Note the high degree of flexibility that Diffusion Models afford – the only requirement on our architecture is that its input and output have the same dimensionality.

We explore these choices in more detail below.

Forward Process and L_T

As noted above, regarding the forward process, we must define the variance schedule. In particular, we set its values to be time-dependent constants, ignoring the fact that they can be learned. For example^[3], a linear schedule from β₁=10⁻⁴ to β_T=0.02 might be used, or perhaps a geometric series.

Regardless of the particular values chosen, the fact that the variance schedule is fixed results in L_T becoming a constant with respect to our set of learnable parameters, allowing us to ignore it as far as training is concerned.
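As a rough illustration of why little is lost by ignoring L_T, the sketch below evaluates it for a single scalar pixel. It relies on the closed-form marginal q(x_T|x₀) = N(sqrt(ᾱ_T)·x₀, 1 − ᾱ_T) from [3], where ᾱ_T is the product of all (1 − β_t); the function names and schedule endpoints are illustrative assumptions:

```python
import math

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T increasing linearly (illustrative endpoints)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def prior_matching_kl(x0, betas):
    """Per-pixel L_T = KL( q(x_T | x_0) || N(0, 1) ) for a scalar pixel x0."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta          # alpha_bar_T = prod_t (1 - beta_t)
    mean = math.sqrt(alpha_bar) * x0     # mean of q(x_T | x_0)
    var = 1.0 - alpha_bar                # variance of q(x_T | x_0)
    # Closed-form KL between N(mean, var) and N(0, 1).
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))

betas = linear_beta_schedule(1000)
l_T_values = [prior_matching_kl(x0, betas) for x0 in (-1.0, 0.0, 1.0)]
print(l_T_values)
```

Nothing in this computation involves a learnable parameter: it depends only on the fixed schedule and the data. With a schedule like this one the value is also close to zero, since q(x_T|x₀) is nearly a standard Gaussian.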

Reverse Process and L_1:T−1

Now we discuss the choices required in defining the reverse process. Recall from above we defined the reverse Markov transitions as a Gaussian:

We must now define the functional forms of μ_θ and Σ_θ. While there are more complicated ways to parameterize Σ_θ^[5], we simply set

That is, we assume that the multivariate Gaussian is a product of independent Gaussians with identical variance, a variance value which can change with time. We set these variances to be equivalent to our forward process variance schedule.

Given this new formulation of Σ_θ, we have

which allows us to transform

where the first term in the difference is a linear combination of x_t and x₀ that depends on the variance schedule β_t. The exact form of this function is not relevant for our purposes, but it can be found in [3].
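In the notation of [3], the steps of this subsection can be written as follows. Here μ̃_t denotes the mean of the forward process posterior q(x_{t−1}|x_t, x₀), and C is a constant that does not depend on θ:

```latex
% Reverse Markov transition
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

% Fixed, isotropic covariance tied to the forward variance schedule
\Sigma_\theta(x_t, t) = \sigma_t^2 I, \qquad \sigma_t^2 = \beta_t

% Resulting transition
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)

% Resulting form of the KL terms
L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}
    \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2\right] + C
```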

The significance of the above proportion is that the most straightforward parameterization of μ_θ simply predicts the diffusion posterior mean. Importantly, the authors of [3] actually found that training μ_θ to predict the noise component at any given timestep yields better results. In particular, let

where

This leads to the following alternative loss function, which the authors of [3] found to lead to more stable training and better results:
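For reference, the noise-prediction parameterization and the resulting loss are given in [3] in the following form, where ε ~ N(0, I) is the noise used to produce x_t from x₀ and ε_θ is the network that predicts it:

```latex
% Mean expressed through a noise-prediction network
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}
    \left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)

% Shorthand for the schedule
\alpha_t := 1 - \beta_t, \qquad \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s

% Loss term after substituting the parameterization
L_{t-1} - C = \mathbb{E}_{x_0, \epsilon}\!\left[
    \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}
    \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0
        + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2\right]
```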

The authors of [3] also note connections of this formulation of Diffusion Models to score-matching generative models based on Langevin dynamics. Indeed, it appears that Diffusion Models and Score-Based models may be two sides of the same coin, akin to the independent and concurrent development of wave-based quantum mechanics and matrix-based quantum mechanics revealing two equivalent formulations of the same phenomena^[2].

Network Architecture

While our simplified loss function seeks to train a model ϵ_θ, we have still not yet defined the architecture of this model. Note that the only requirement for the model is that its input and output dimensionality are identical.

Given this restriction, it is perhaps unsurprising that image Diffusion Models are commonly implemented with U-Net-like architectures.

Reverse Process Decoder and L₀

The path along the reverse process consists of many transformations under continuous conditional Gaussian distributions. At the end of the reverse process, recall that we are trying to produce an image, which is composed of integer pixel values. Therefore, we must devise a way to obtain discrete (log) likelihoods for each possible pixel value across all pixels.

The way that this is done is by setting the last transition in the reverse diffusion chain to an independent discrete decoder. To determine the likelihood of a given image x₀ given x₁, we first impose independence between the data dimensions:

where D is the dimensionality of the data and the superscript i indicates the extraction of one coordinate. The goal now is to determine how likely each integer value is for a given pixel given the distribution across possible values for the corresponding pixel in the slightly noised image at time t=1:

where the pixel distributions for t=1 are derived from the below multivariate Gaussian whose diagonal covariance matrix allows us to split the distribution into a product of univariate Gaussians, one for each dimension of the data:

We assume that the images consist of integers in 0,1,…,255 (as standard RGB images do) which have been scaled linearly to [−1,1]. We then break down the real line into small “buckets”, where, for a given scaled pixel value x, the bucket for that range is [x−1/255, x+1/255]. The probability of a pixel value x, given the univariate Gaussian distribution of the corresponding pixel in x₁, is the area under that univariate Gaussian distribution within the bucket centered at x.

Below you can see the area for each of these buckets with their probabilities for a mean-0 Gaussian which, in this context, corresponds to a distribution with an average pixel value of 255/2 (half brightness). The red curve represents the distribution of a specific pixel in the t=1 image, and the areas give the probability of the corresponding pixel value in the t=0 image.

Technical Note

The first and final buckets extend out to -inf and +inf to preserve total probability.
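The bucket construction can be sketched in a few lines using the Gaussian cumulative distribution function. The function names and the example mean and standard deviation are illustrative assumptions; in a real model the mean would come from the network output for that pixel:

```python
import math

def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std^2); accepts +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def pixel_value_probabilities(mean, std):
    """Probability of each integer pixel value 0..255 under N(mean, std^2).

    `mean` and `std` describe one pixel of x_1 on the scaled [-1, 1] axis.
    """
    probs = []
    for value in range(256):
        x = value / 127.5 - 1.0   # scale 0..255 linearly to [-1, 1]
        # The first and last buckets extend to -inf and +inf.
        upper = math.inf if value == 255 else x + 1.0 / 255.0
        lower = -math.inf if value == 0 else x - 1.0 / 255.0
        probs.append(gaussian_cdf(upper, mean, std) - gaussian_cdf(lower, mean, std))
    return probs

probs = pixel_value_probabilities(mean=0.5, std=0.1)
most_likely = probs.index(max(probs))
print(most_likely, sum(probs))
```

Because adjacent buckets share their endpoints and the outer buckets are unbounded, the 256 probabilities sum to one, so they form a valid discrete distribution over pixel values.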

Given a t=0 pixel value for each pixel, the value of p_θ(x₀|x₁) is simply their product. This process is succinctly encapsulated by the following equation:

where

and

Given this equation for p_θ(x₀|x₁), we can calculate the final term of L_vlb which is not formulated as a KL Divergence:
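In the notation of [3], the decoder and the resulting final term are written as follows, where D is the data dimensionality, the superscript i selects one coordinate, and σ₁² is the reverse process variance at t=1:

```latex
% Independent discrete decoder
p_\theta(x_0 \mid x_1) = \prod_{i=1}^{D}
    \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)}
    \mathcal{N}\!\left(x;\ \mu_\theta^i(x_1, 1),\ \sigma_1^2\right) dx

% Bucket edges, with the outermost buckets extended to infinity
\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}
\qquad
\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}

% Final term of the bound
L_0 = -\log p_\theta(x_0 \mid x_1)
```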

Final Objective

As mentioned in the last section, the authors of [3] found that predicting the noise component of an image at a given timestep produced the best results. Ultimately, they use the following objective:
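In [3] this simplified objective is stated as follows, with t drawn uniformly from 1,…,T, ε ~ N(0, I), and ᾱ_t the product of (1 − β_s) for s up to t:

```latex
L_{simple}(\theta) := \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[
    \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0
        + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2\right]
```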

The training and sampling algorithms for our Diffusion Model therefore can be succinctly captured in the below figure:
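As a rough companion to these algorithms, the sketch below follows the training and sampling procedures of [3] for a single scalar "pixel". Several simplifications are assumptions made purely for illustration: the dataset is one hypothetical point, the neural network ε_θ is replaced by an oracle that knows the exact noise for that dataset, and the gradient step of training is omitted, so only the loss evaluation is shown:

```python
import math
import random

T = 1000
# Illustrative linear variance schedule and the derived alpha / alpha-bar values.
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

DATA_POINT = 0.7  # hypothetical one-element dataset (a single scalar pixel)

def oracle_eps_model(x_t, t):
    """Stand-in for a trained eps_theta: the exact noise for a one-point dataset."""
    ab = alpha_bars[t - 1]
    return (x_t - math.sqrt(ab) * DATA_POINT) / math.sqrt(1.0 - ab)

def training_loss(eps_model, x0, rng):
    """One Monte Carlo term of the simplified objective (gradient step omitted)."""
    t = rng.randint(1, T)                  # t ~ Uniform({1, ..., T})
    eps = rng.gauss(0.0, 1.0)              # eps ~ N(0, 1)
    ab = alpha_bars[t - 1]
    x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
    return (eps - eps_model(x_t, t)) ** 2

def sample(eps_model, rng):
    """Start from Gaussian noise and walk the reverse chain down to x_0."""
    x = rng.gauss(0.0, 1.0)                # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0   # no noise on the last step
        beta, alpha, ab = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        mean = (x - beta / math.sqrt(1.0 - ab) * eps_model(x, t)) / math.sqrt(alpha)
        x = mean + math.sqrt(beta) * z     # sigma_t^2 = beta_t
    return x

rng = random.Random(0)
loss = training_loss(oracle_eps_model, DATA_POINT, rng)
generated = sample(oracle_eps_model, rng)
print(loss, generated)
```

With a perfect noise predictor the loss is zero up to rounding, and the reverse chain lands on the data point. In practice ε_θ is a neural network trained by gradient descent on this loss, so samples follow the learned data distribution instead of reproducing a single point.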

Diffusion Model Theory Summary

In this section we took a detailed dive into the theory of Diffusion Models. It can be easy to get caught up in mathematical details, so we note the most important points within this section below in order to keep ourselves oriented from a birds-eye perspective:

Our Diffusion Model is parameterized as a Markov chain, meaning that our latent variables x₁,…,x_T depend only on the previous (or following) timestep.
The transition distributions in the Markov chain are Gaussian, where the forward process requires a variance schedule, and the reverse process parameters are learned.
The diffusion process ensures that x_T is asymptotically distributed as an isotropic Gaussian for sufficiently large T.
In our case, the variance schedule was fixed, but it can be learned as well. For fixed schedules, following a geometric progression may afford better results than a linear progression. In either case, the variances are generally increasing with time in the series (i.e. β_i<β_j for i<j).
Diffusion Models are highly flexible and allow for any architecture whose input and output dimensionality are the same to be used. Many implementations use U-Net-like architectures.
The training objective is to maximize the likelihood of the training data. This is manifested as tuning the model parameters to minimize the variational upper bound of the negative log likelihood of the data.
Almost all terms in the objective function can be cast as KL Divergences as a result of our Markov assumption. These values become tractable to calculate given that we are using Gaussians, removing the need to perform Monte Carlo approximation.
Ultimately, using a simplified training objective to train a function which predicts the noise component of a given latent variable yields the best and most stable results.
A discrete decoder is used to obtain log likelihoods across pixel values as the last step in the reverse diffusion process.

With this high-level overview of Diffusion Models in our minds, let’s move on to see how to use a Diffusion Model in PyTorch.

Diffusion Models in PyTorch

While Diffusion Models have not yet been democratized to the same degree as other older architectures/approaches in Machine Learning, there are still implementations available for use. The easiest way to use a Diffusion Model in PyTorch is to use the denoising-diffusion-pytorch package, which implements an image diffusion model like the one discussed in this article. To install the package, simply type the following command in the terminal:

pip install denoising_diffusion_pytorch

Minimal Example

To train a model and generate images, we first import the necessary packages:

import torch
from denoising_diffusion_pytorch import Unet, GaussianDiffusion

Next, we define our network architecture, in this case a U-Net. The dim parameter specifies the number of feature maps before the first down-sampling, and the dim_mults parameter provides multiplicands for this value and successive down-samplings:

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)

Now that our network architecture is defined, we need to define the Diffusion Model itself. We pass in the U-Net model that we just defined along with several parameters – the size of images to generate, the number of timesteps in the diffusion process, and a choice between the L1 and L2 norms.

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
)

Now that the Diffusion Model is defined, it’s time to train. We generate random data to train on, and then train the Diffusion Model in the usual fashion:

training_images = torch.randn(8, 3, 128, 128)
loss = diffusion(training_images)
loss.backward()

Once the model is trained, we can finally generate images by using the sample() method of the diffusion object. Here we generate 4 images, which are only noise given that our training data was random:

sampled_images = diffusion.sample(batch_size = 4)

Training on Custom Data

The denoising-diffusion-pytorch package also allows you to train a diffusion model on a specific dataset. Simply replace the 'path/to/your/images' string with the dataset directory path in the Trainer() object below, and change image_size to the appropriate value. After that, simply run the code to train the model, and then sample as before. Note that PyTorch must be compiled with CUDA enabled in order to use the Trainer class:

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
).cuda()

trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 2e-5,
    train_num_steps = 700000,         # total training steps
    gradient_accumulate_every = 2,    # gradient accumulation steps
    ema_decay = 0.995,                # exponential moving average decay
    amp = True                        # turn on mixed precision
)

trainer.train()

Below you can see progressive denoising from multivariate Gaussian noise to MNIST digits akin to reverse diffusion:

Final Words

Diffusion Models are a conceptually simple and elegant approach to the problem of generating data. Their State-of-the-Art results combined with non-adversarial training have propelled them to great heights, and further improvements can be expected in the coming years given their nascent status. In particular, Diffusion Models have been found to be essential to the performance of cutting-edge models like DALL-E 2.

References

[1] Deep Unsupervised Learning using Nonequilibrium Thermodynamics

[2] Generative Modeling by Estimating Gradients of the Data Distribution

[3] Denoising Diffusion Probabilistic Models

[4] Improved Techniques for Training Score-Based Generative Models

[5] Improved Denoising Diffusion Probabilistic Models

[6] Diffusion Models Beat GANs on Image Synthesis

[7] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

[8] Hierarchical Text-Conditional Image Generation with CLIP Latents


Source: https://www.topbots.com/introduction-to-diffusion-models-for-machine-learning/
