Diffusion models
— GenAI Series Part 2
Introduction:
Earlier in the GenAI series, we discussed GANs and variational autoencoders (VAEs) and noted several limitations of each, for example:
1. Mode collapse: GANs are prone to mode collapse, where the generator learns to produce only a few samples that fool the discriminator, leading to a lack of diversity in generated samples.
2. Training instability: GAN training can be unstable and difficult to converge, requiring careful tuning of hyperparameters and training procedures to achieve good results.
3. Restrictive distributional assumptions: VAEs assume a simple form of the approximate posterior and can struggle to model complex distributions in high-dimensional spaces.
4. Trade-off between reconstruction quality and sample diversity: VAEs typically produce blurry samples and struggle to achieve high quality and diversity at the same time.
To deal with the above limitations, researchers introduced diffusion models, also known as denoising diffusion probabilistic models (DDPMs), which have gained popularity in recent years for their ability to generate realistic samples and model complex high-dimensional data distributions. These models are based on the concept of gradually noising data and then learning to reverse that corruption, step by step, to generate samples that resemble the target distribution.
In diffusion models, generating a sample means transforming an initial noise distribution through a series of denoising steps. At each step, the model predicts and removes a small amount of noise via a learned Gaussian transition. These transitions gradually push the noise distribution towards the target data distribution over many steps, resulting in the generation of high-quality samples.
Diffusion models are capable of capturing complex data distributions and generating high-fidelity samples across various domains, such as image generation, speech synthesis, and natural language processing. They have also demonstrated improved sample quality and stability compared to other generative models like GANs and VAEs.
One of the key advantages of diffusion models is their ability to model long-range dependencies and capture fine-grained details in the generated samples. They are also less prone to issues like mode collapse and training instability, making them an attractive choice for researchers and practitioners working on generative modeling tasks.
Detailed Analysis of Diffusion Models
Diffusion Training step:
1. Modeling the conditional distributions: The goal is to model the conditional distributions of the data, typically an image, at different levels of noise. The model learns these conditional distributions by estimating the parameters of a diffusion process.
2. Noise levels: The forward diffusion process adds noise to the input data at gradually increasing levels. As the process moves forward, the data becomes noisier and noisier, until it is essentially indistinguishable from pure noise.
3. Diffusion process: The forward diffusion process iteratively corrupts the input data by adding small amounts of Gaussian noise according to a variance schedule β₁, …, β_T. In the standard formulation this schedule is fixed in advance rather than learned.
4. Training the model: During training, the model learns to predict the noise that was added at a randomly chosen step of the forward process. It is optimized to minimize the difference between the true and predicted noise, which in turn closes the gap between generated samples and the true data distribution. A U-Net architecture is typically used as the denoising network.
5. Sampling: The model starts with pure noise and applies the learned denoising steps of the reverse diffusion process to generate new samples. By iterating these steps, the generated samples progressively come to resemble the true data distribution.
6. Evaluation: The quality of the generated samples can be evaluated using metrics such as log-likelihood, Fréchet Inception Distance (FID), or visual inspection. These metrics help determine how closely the generated samples match the true data distribution.
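The training objective described in steps 1–5 can be sketched in a few lines. This is a minimal NumPy illustration of the DDPM-style noise-prediction loss: the `model` function is a placeholder for a real U-Net, and the linear β schedule is one common choice among several, so treat the specifics as assumptions rather than a definitive implementation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (a common choice)
alpha_bars = np.cumprod(1.0 - betas)     # cumulative product: ᾱ_t = ∏ (1 − β_s)

def model(x_t, t):
    # Placeholder for the noise-prediction network (a U-Net in practice);
    # it predicts zero noise here just so the sketch runs end to end.
    return np.zeros_like(x_t)

def training_loss(x0, rng):
    """One DDPM-style training step: predict the noise added at a random t."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)                # true noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = model(x_t, t)                            # predicted noise
    return np.mean((eps - eps_hat) ** 2)               # simple MSE objective

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))   # toy stand-in for a training image
loss = training_loss(x0, rng)
```

In practice this loss would be backpropagated through a real network; the key idea is that the network only ever has to predict the noise, never the full image.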
Forward Diffusion Process
The forward diffusion process gradually adds Gaussian noise to the input image x₀ step by step, over T steps in total. This process produces a sequence of increasingly noisy samples x₁, …, x_T.
As T → ∞, x_T approaches pure noise, as if it were sampled from an isotropic Gaussian distribution.
Conveniently, a closed-form formula lets us sample a noisy image at any time step t directly from x₀:
xₜ = √ᾱₜ · x₀ + √(1 − ᾱₜ) · ε, where ᾱₜ = ∏ₛ₌₁ᵗ (1 − βₛ), ε ~ N(0, I), and β₁, …, β_T is the noise schedule.
Using this formula, we can sample x at any time step t without iterating through the intermediate steps, which makes the forward process much faster.
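The closed-form sampling above can be sketched in NumPy. The linear β schedule and the toy 8×8 "image" are illustrative assumptions; any schedule with the right cumulative products would work the same way:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # β₁ … β_T, a linear schedule
alpha_bars = np.cumprod(1.0 - betas)  # ᾱ_t = ∏ₛ (1 − β_s)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form: x_t = √ᾱ_t·x0 + √(1−ᾱ_t)·ε."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))   # toy "image"
x_mid = q_sample(x0, 500, rng)     # partially noised
x_T = q_sample(x0, T - 1, rng)     # ᾱ_T ≈ 0, so this is close to pure noise
```

Note that no loop over the intermediate steps is needed: the noisy sample at any t comes from a single draw of ε.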
Reverse Diffusion Process
After the forward diffusion process, the image has been fully ‘noised’, and then the role of the reverse diffusion process emerges. It helps us reconstruct the original image from its noisy state.
Leveraging what it learned during the noise-adding phase, the model predicts and subtracts the added noise. This denoising is iterative, with each step refining the model’s predictions and bringing the data closer to its original form.
The reverse process starts with pure Gaussian noise. At each step, the model estimates the noise present in the sample and uses this estimate to gradually recover the original image.
The model learns the joint distribution p_θ(x₀:T) as
p_θ(x₀:T) = p(x_T) ∏ₜ₌₁ᵀ p_θ(xₜ₋₁ | xₜ),
where the time-dependent parameters of the Gaussian transitions are learned. Each transition p_θ(xₜ₋₁ | xₜ) is trained to approximate the corresponding forward-process posterior q(xₜ₋₁ | xₜ, x₀), i.e. to undo the noise added during that step of the forward diffusion process.
The Markov formulation asserts that a given reverse diffusion transition distribution depends only on the previous timestep:
p_θ(xₜ₋₁ | xₜ) = N(xₜ₋₁; μ_θ(xₜ, t), Σ_θ(xₜ, t)).
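Putting the reverse process together, a minimal ancestral-sampling loop might look like the following NumPy sketch. The `model` here is a stand-in for a trained noise predictor ε_θ(xₜ, t), and the update uses the standard DDPM posterior mean with a fixed per-step variance βₜ; these are common but not the only valid choices.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def model(x_t, t):
    # Stand-in for the trained noise predictor ε_θ(x_t, t).
    return np.zeros_like(x_t)

def p_sample_loop(shape, rng):
    """Ancestral sampling: start from pure noise x_T and denoise step by step."""
    x = rng.standard_normal(shape)   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = model(x, t)
        # DDPM posterior mean: (x_t − β_t/√(1−ᾱ_t)·ε̂) / √α_t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # no noise at the final step
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
sample = p_sample_loop((8, 8), rng)
```

With a trained network in place of the zero predictor, each iteration removes a little of the estimated noise, which is exactly the iterative refinement described above.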
Model Architecture:
In diffusion models that generate images from noise, the U-Net architecture serves as the denoising network. U-Net is a convolutional neural network (CNN) originally designed for image segmentation, but it adapts well to image generation tasks such as denoising and image reconstruction.
The U-Net architecture typically consists of the following key components:
1. Contracting Path (Encoder): This part of the network consists of convolutional and pooling layers that progressively downsample the input data to capture its essential features.
2. Bottleneck: At the center of the U-Net architecture is a bottleneck layer that captures the most abstract features of the input data.
3. Expanding Path (Decoder): The expanding path consists of upsampling and convolutional layers that reconstruct a high-resolution output (in a diffusion model, the predicted noise or the denoised image) from the learned features.
4. Skip Connections: One of the distinguishing features of the U-Net architecture is the inclusion of skip connections between corresponding encoder and decoder layers. These connections help the network retain and reuse spatial information during upsampling, which preserves fine detail in the reconstructed output.
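The encoder, bottleneck, decoder, and skip-connection data flow described above can be illustrated with a shape-only NumPy sketch. Average pooling and nearest-neighbour upsampling stand in for the learned convolutional blocks of a real U-Net, so this shows only how tensors move through the architecture, not how features are learned:

```python
import numpy as np

def down(x):
    # 2x2 average pooling stands in for a conv + downsample encoder block.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    # Nearest-neighbour upsampling stands in for a transposed-conv decoder block.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet(x):
    """Shape-only sketch of U-Net data flow: contracting path, bottleneck,
    expanding path, with skip connections added back at each resolution."""
    skip1 = x                         # skip from full resolution
    d1 = down(x)                      # contracting path, level 1
    skip2 = d1                        # skip from half resolution
    bottleneck = down(d1)             # most abstract features
    u1 = up(bottleneck) + skip2       # expanding path + skip connection
    out = up(u1) + skip1              # back to full resolution + skip
    return out

x = np.arange(16.0).reshape(4, 4)     # toy 4x4 feature map
y = toy_unet(x)                       # same spatial shape as the input
```

The key property the sketch demonstrates is that the output has the same spatial shape as the input, with information from every resolution level merged back in on the way up.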
By employing the U-Net architecture in diffusion models for generating images from noise, researchers can effectively denoise and reconstruct images in various applications, such as medical imaging, microscopy, and remote sensing. The ability of U-Net to capture complex spatial features and details makes it a powerful tool for enhancing image quality and information recovery from noisy data.
Conclusion
We have seen that diffusion models produce high-quality image samples, and that they have deep connections to variational inference for training Markov chains, denoising score matching, autoregressive models, and progressive lossy compression. These connections suggest that diffusion models have excellent inductive biases for image data.