Stable Diffusion Models
— GenAI Series Part 3
Introduction
The Stable Diffusion model is a variant of diffusion models, a class of generative models used for image generation. It is also known as a latent diffusion model, because it runs the diffusion process in a compressed latent space rather than directly on pixels. This design addresses some of the practical challenges faced by standard diffusion models, such as high computational cost, slow sampling, and training instability.
Why do we need Stable Diffusion Models?
If you haven’t read about diffusion models yet, it is worth going through an introduction to them before continuing with this post.
- Latent Space Optimization: Unlike traditional diffusion models that operate directly on pixel space, stable diffusion models work in a compressed feature space (latent space). This allows them to generate images more quickly and with fewer computational resources.
- Guidance Techniques: These models use techniques like classifier-free guidance, where the strength of the text conditioning is adjusted at sampling time so that outputs align more closely with the desired text descriptions.
Let's now discuss the key components of stable diffusion models.
Model Architecture
Stable diffusion models generally use a variant of the U-Net architecture for the core image generation process, combined with an autoencoder for encoding and decoding images:
- Autoencoder: The model starts with a convolutional autoencoder (a variational autoencoder, or VAE, in Stable Diffusion) that compresses an image into a lower-dimensional latent representation. This compression reduces the computational load and speeds up generation. The encoder maps images into the latent space, and the decoder reconstructs images from this latent representation.
- U-Net: The U-Net architecture, originally designed for medical image segmentation, is adapted here to predict the noise present in the latent representation at each timestep. It features a symmetric structure with skip connections between the downsampling (encoder) and upsampling (decoder) paths, which helps preserve fine detail during image generation.
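To make this concrete, here is a minimal sketch of how the two components fit together, assuming the Hugging Face diffusers package and a publicly hosted Stable Diffusion v1.5 checkpoint (the model id below is an assumption; substitute whichever checkpoint you use):

```python
# Minimal sketch: autoencoder (VAE) + U-Net from a Stable Diffusion checkpoint.
# Assumes the `diffusers` package and access to the checkpoint named below.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint id
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed RGB image in [-1, 1]

with torch.no_grad():
    # Encoder: image -> compressed latent (4 x 64 x 64 instead of 3 x 512 x 512)
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

    # U-Net: predicts the noise in the latent at a given timestep,
    # conditioned on text embeddings (dummy zeros here, in the shape used by SD v1.x).
    text_embeddings = torch.zeros(1, 77, 768)
    noise_pred = unet(latents, torch.tensor([10]),
                      encoder_hidden_states=text_embeddings).sample

    # Decoder: latent -> reconstructed image
    recon = vae.decode(latents / vae.config.scaling_factor).sample

print(latents.shape, noise_pred.shape, recon.shape)
```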
Training Process
The training of stable diffusion models involves a denoising diffusion probabilistic model, which consists of two main phases: the forward diffusion (noising) process and the reverse diffusion (denoising) process.
- Forward Diffusion Process: During this phase, noise is gradually added to the original image (or, in latent diffusion, to its latent representation) over a series of steps until it becomes indistinguishable from Gaussian noise. This process is fixed and involves no learning; it produces the noisy examples from which the model learns to predict the noise that was added at each step, rather than directly predicting the pixel values of the original image.
- Reverse Diffusion Process: The reverse process is where the actual generation happens. The model is trained to predict the noise and then reverse it, effectively denoising the image step-by-step to reconstruct the clean image from noise. This is done by iteratively predicting and subtracting the added noise from the noisy images at each timestep, using the U-Net model.
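The core training step can be summarized in a few lines. The sketch below is illustrative rather than the exact Stable Diffusion training code: `model(x_noisy, t)` stands in for the text-conditioned U-Net, `x0` for the clean latents produced by the encoder, and a simple linear beta schedule is assumed.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

def training_step(model, x0):
    """One denoising-diffusion training step on clean latents x0 of shape (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                   # random timestep for each sample
    noise = torch.randn_like(x0)

    # Forward diffusion in closed form:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
    abar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

    # The U-Net is trained to predict the noise that was added.
    noise_pred = model(x_t, t)
    return F.mse_loss(noise_pred, noise)
```

At sampling time the same model is applied in reverse: starting from pure noise, it repeatedly predicts and removes a small amount of noise at each timestep until a clean latent remains.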
Latent Space Optimization
Training in the latent space instead of the pixel space is one of the key optimizations in stable diffusion models:
- Efficiency: By operating in a compressed latent space, the model requires less computational power and memory, and can generate images faster compared to traditional pixel-based diffusion models.
- Quality: The latent space provides a more abstract representation that discards imperceptible high-frequency detail while keeping the semantically important structure, which helps maintain quality and consistency during generation.
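To give a sense of the savings: in Stable Diffusion v1, a 512×512 RGB image (3 × 512 × 512 = 786,432 values) is encoded into a 4 × 64 × 64 latent (16,384 values), roughly a 48× reduction in the amount of data the diffusion process has to work through at every step.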
Text Conditioning
To generate images from textual descriptions, stable diffusion models incorporate text conditioning.
Text Embedding:
- Tokenization: The first step involves converting the input text into a sequence of tokens. These tokens are typically words or parts of words that have been pre-defined in a vocabulary.
- Embedding: Each token is then mapped to a high-dimensional vector by an embedding layer and a text encoder (a pretrained CLIP text encoder in Stable Diffusion). These embeddings capture semantic meaning and relationships between words, and they are what the diffusion model actually conditions on.
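As a small illustration, the snippet below runs these two steps with the CLIP text encoder used by Stable Diffusion v1 (assuming the Hugging Face transformers package and the openai/clip-vit-large-patch14 checkpoint; other model versions use different encoders):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a watercolor painting of a fox in the snow"

# Tokenization: text -> fixed-length sequence of token ids (padded to 77 tokens)
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    # Embedding: one 768-dimensional vector per token -> shape (1, 77, 768)
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)
```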
Integration of Text into the Model:
- Conditioning the Latent Space: The text embeddings are used to condition the latent representations in the model. This can be achieved by adding, concatenating, or modulating the latent space with the text embeddings.
- Integration with U-Net: These text embeddings are integrated into the U-Net architecture, influencing the generation process to align the output with the text description. This integration is typically achieved by conditioning the model at various layers of the U-Net and modulating its parameters based on the text input.
- Attention Mechanisms: Many stable diffusion models use attention mechanisms, particularly cross-attention, in which the model attends to different parts of the text embedding depending on the region of the image being generated. This ensures that specific details mentioned in the text are given more weight during image generation.
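The cross-attention idea can be sketched in a few lines of PyTorch. This is a simplified toy block, not the exact layer used inside Stable Diffusion's U-Net: queries come from the latent image features, while keys and values come from the text embeddings, so each spatial location can decide which words to attend to.

```python
import torch
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    """Simplified single-head cross-attention (illustrative dimensions)."""
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)  # queries from image latents
        self.to_k = nn.Linear(text_dim, latent_dim, bias=False)    # keys from text embeddings
        self.to_v = nn.Linear(text_dim, latent_dim, bias=False)    # values from text embeddings
        self.scale = latent_dim ** -0.5

    def forward(self, latent_tokens, text_embeddings):
        # latent_tokens: (B, H*W, latent_dim), text_embeddings: (B, 77, text_dim)
        q = self.to_q(latent_tokens)
        k = self.to_k(text_embeddings)
        v = self.to_v(text_embeddings)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v  # text-informed latent features, shape (B, H*W, latent_dim)

latents = torch.randn(1, 64 * 64, 320)   # flattened latent feature map
text_emb = torch.randn(1, 77, 768)       # text embeddings from the encoder
out = ToyCrossAttention()(latents, text_emb)
print(out.shape)                          # torch.Size([1, 4096, 320])
```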
Classifier-Free Guidance
To enhance the fidelity and relevance of the generated images to the text prompts, stable diffusion models often use classifier-free guidance, where the influence of the text embeddings is increased or decreased by a guidance scale factor during generation.
How does it work?
- Training with Dropout: During training, the model randomly drops out the conditioning signal (e.g., text embeddings) at a certain rate. This means the model learns to generate data both with and without the conditioning text. The dropout acts like a switch that sometimes provides the model with the text information and sometimes forces it to generate outputs without any text guidance.
- Inference with Scaling: At inference time, the model is run twice at each denoising step, once with the text embeddings and once without, and the difference between the two predictions is scaled up. Because the model has learned to operate both with and without the text prompt, amplifying this difference guides the output to follow the input text’s specifics more tightly (see the sketch after this list).
- Guidance Scale: The degree to which the text embeddings are scaled is determined by a guidance scale parameter. A higher guidance scale increases the influence of the text, leading to images that more closely match the description but may also result in less diversity. Conversely, a lower guidance scale results in more diverse outputs but with potentially less fidelity to the text prompt.
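Putting the inference-time behaviour together, a guided noise prediction is typically computed as shown below. Here unet(x_t, t, cond) is a hypothetical noise-prediction function standing in for the text-conditioned U-Net, not an actual library call:

```python
import torch

def guided_noise(unet, x_t, t, text_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: combine conditional and unconditional predictions."""
    # Run the model twice: once with the prompt embeddings, once with the
    # "empty prompt" embeddings it saw during conditioning dropout in training.
    noise_text = unet(x_t, t, text_emb)
    noise_uncond = unet(x_t, t, uncond_emb)

    # Push the prediction away from the unconditional one and toward the
    # text-conditioned one; larger scales follow the prompt more closely,
    # at the cost of diversity.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```

A guidance scale around 7 to 8 is a common default; setting it to 1 disables guidance, while very large values tend to produce over-saturated, less diverse images.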
Conclusions
Stable diffusion models are highly versatile and suitable for applications ranging from digital art creation and photo enhancement to more commercial uses like advertising and content generation.
The key advantages of stable diffusion models are:
- Efficiency: They are more computationally efficient because they operate in latent space rather than pixel space.
- Quality and Stability: The models are optimized to produce high-quality, stable outputs, reducing artifacts and enhancing visual appeal.
- Control and Creativity: Users can control the output through detailed text descriptions, allowing a high degree of creative expression.