ControlNet — Stable Diffusion Model

Ankit kumar
Apr 28, 2024


GenAI Series Part 4

Introduction

ControlNet is a neural network architecture that adds spatial conditioning controls to large, pre-trained text-to-image diffusion models. It locks the production-ready large diffusion model and reuses its deep, robust encoding layers, pre-trained on billions of images, as a strong backbone for learning a diverse set of conditional controls. The architecture is connected with “zero convolutions” (zero-initialized convolution layers) that progressively grow their parameters from zero, ensuring that no harmful noise affects the finetuning. The authors tested various conditioning controls, e.g., edges, depth, segmentation, and human pose, with Stable Diffusion, using single or multiple conditions, with or without prompts.

Related Work

  1. Finetuning Neural Networks: One way to finetune a neural network is to continue training it directly on the additional data, but this approach can lead to overfitting, mode collapse, and catastrophic forgetting.
  2. HyperNetwork is an approach that originated in the Natural Language Processing (NLP) domain; it trains a small recurrent neural network to influence the weights of a larger one.
  3. Adapter methods are widely used in NLP for customizing a pre-trained transformer model to other tasks by embedding new module layers into it. In computer vision, adapters are used for incremental learning and domain adaptation. This technique is often used with CLIP for transferring pre-trained backbone models to different tasks.
  4. Additive Learning circumvents forgetting by freezing the original model weights and adding a small number of new parameters using learned weight masks, pruning, or hard attention.
  5. Low-Rank Adaptation (LoRA) prevents catastrophic forgetting by learning the offset of parameters with low-rank matrices, based on the observation that many over-parameterized models reside in a low intrinsic dimension subspace.
  6. Zero-Initialized Layers are used by ControlNet for connecting network blocks. These layers are initialized with weights and biases set to zero, so they serve as a buffer, preserving the integrity of the original model’s training while allowing the new, trainable copy to adapt to additional conditions without introducing noise or disrupting the established neural processes (see the sketch after this list).
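As a minimal sketch of item 6 (assuming PyTorch; the helper name zero_conv is mine), a zero convolution is simply a 1×1 convolution whose weights and bias start at zero, so at the first training step it outputs zero for any input:

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """A 1x1 'zero convolution': weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# At initialization the layer outputs zeros for any input,
# so it injects nothing into the frozen model it connects to.
z = zero_conv(64)
x = torch.randn(1, 64, 32, 32)
assert torch.all(z(x) == 0)
```

Note that although the layer’s output is zero at initialization, the gradients with respect to its own weights depend on the input feature map and are generally nonzero, so the layer can still grow away from zero during training.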

ControlNet Architecture

ControlNet architecture

To add a ControlNet to a pre-trained neural block F(x; Θ), the authors lock (freeze) the parameters Θ of the original block and clone the block into a trainable copy with parameters Θ_c.

The trainable copy takes an external conditioning vector c as input. When this structure is applied to large models like Stable Diffusion, the locked parameters preserve the production-ready model trained on billions of images, while the trainable copy reuses this large-scale pre-trained model as a deep, robust backbone for handling diverse input conditions.

The trainable copy is connected to the locked model with zero convolution layers.

y_c = F(x; Θ) + Z(F(x + Z(c; Θ_z1); Θ_c); Θ_z2)

where Z(·; Θ_z1) and Z(·; Θ_z2) are the two zero convolution layers.

1st training step of ControlNet

In this way, harmful noise cannot influence the hidden states of the trainable copy’s layers when training starts. Moreover, since Z(c; Θ_z1) = 0 and the trainable copy also receives the input image x, the trainable copy is fully functional and retains the capabilities of the large, pre-trained model, allowing it to serve as a strong backbone for further learning. The zero convolutions protect this backbone by preventing random noise from flowing in as gradients during the initial training steps.
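Putting the pieces together, here is a minimal PyTorch sketch of the structure the equation describes. The names ControlledBlock and zero_conv are illustrative (not the paper’s code), and it assumes the wrapped block preserves its channel count and that the conditioning c has already been projected to the same shape as x:

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Sketch of y_c = F(x; Θ) + Z(F(x + Z(c; Θ_z1); Θ_c); Θ_z2)."""

    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.locked = block                    # F(·; Θ), frozen
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.trainable = copy.deepcopy(block)  # trainable copy F(·; Θ_c)
        self.zero_in = zero_conv(channels)     # Z(·; Θ_z1)
        self.zero_out = zero_conv(channels)    # Z(·; Θ_z2)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # At the first training step both zero convs output 0, so
        # y_c == locked(x): the conditioned model starts out identical
        # to the frozen pre-trained one.
        return self.locked(x) + self.zero_out(self.trainable(x + self.zero_in(c)))
```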

Text prompts are encoded with the CLIP text encoder, and diffusion timesteps are encoded with a time encoder that uses positional encoding.
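As a rough illustration of the time encoder, here is a standard sinusoidal timestep embedding of the kind used across diffusion models (the exact frequency schedule in Stable Diffusion may differ):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal positional encoding of diffusion timesteps t (shape [B])."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    )
    args = t.float()[:, None] * freqs[None, :]                 # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # [B, dim]
```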

The ControlNet structure is applied to each encoder level of the U-Net. In particular, ControlNet creates a trainable copy of the 12 encoding blocks and the 1 middle block of Stable Diffusion. The 12 encoding blocks span 4 resolutions (64 × 64, 32 × 32, 16 × 16, 8 × 8), with each resolution replicated 3 times. Since Stable Diffusion has a typical U-Net structure, the outputs are added to the 12 skip connections and the 1 middle block of the U-Net.
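To make the wiring concrete, the following pseudocode-style sketch (illustrative names; real Stable Diffusion blocks also take timestep and text embeddings, which are elided here) shows how the 12 + 1 ControlNet outputs would be added to the U-Net’s skip connections and middle block:

```python
import torch

def unet_forward_with_control(x, encoder_blocks, middle_block, decoder_blocks,
                              control_residuals, middle_residual):
    """control_residuals: 12 tensors from ControlNet's output zero
    convolutions, one per encoder block; middle_residual: ControlNet's
    middle-block output."""
    skips, h = [], x
    for block in encoder_blocks:              # 12 frozen encoder blocks
        h = block(h)
        skips.append(h)

    h = middle_block(h) + middle_residual     # condition the middle block

    # Decoder: each skip connection is conditioned before concatenation.
    for block, res in zip(decoder_blocks, reversed(control_residuals)):
        skip = skips.pop() + res
        h = block(torch.cat([h, skip], dim=1))
    return h
```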

Controlling Stable Diffusion with various conditions without prompts.

Conclusions

This novel approach allows the integration of additional image-based inputs (such as edges or depth maps) to guide the image generation process, giving more precise control over the resulting images. It employs zero-initialized convolution layers whose parameters grow progressively during training, preserving the integrity of the pre-trained diffusion model’s parameters while enhancing the model’s ability to handle diverse conditions without introducing harmful noise. The method demonstrates robust training across various datasets and conditions, potentially broadening the applications of diffusion models in generating controlled and context-specific visual content.
