Vision Transformer Part 3

Ankit kumar
4 min read · Mar 23, 2024


Dive deep into the details of transformer architectures

ViT encoder module architecture

The ViT encoder architecture consists of a stem that converts the input image into patches, a body based on a multilayer Transformer encoder, and a head that transforms the global representation into the output label.

How does the ViT Encoder work?

The ViT encoder module is a key component of the ViT architecture: it converts the input image into a sequence of tokens and applies the transformer architecture to learn the spatial relationships between them. Here is how the ViT encoder module works, step by step (a minimal PyTorch sketch follows the list):

1. Patch Embedding: The input image is divided into fixed-size patches, which are linearly projected into embeddings. Each patch represents a small spatial region of the image and is transformed into a lower-dimensional vector representation. These patch embeddings serve as the input tokens for the transformer encoder.

2. Positional Encoding: As in the original transformer architecture, positional encodings are added to the patch embeddings to provide information about the relative positions of the patches in the image; without them, self-attention would treat the patches as an unordered set.

3. Tokenization: The patch embeddings, along with the positional encodings, are flattened into a sequence of tokens. These tokens are then passed as input to the transformer encoder.

4. Transformer Encoder: The token sequence is processed through a stack of transformer encoder layers. Each encoder layer consists of multi-head self-attention mechanisms and feedforward neural networks. The self-attention mechanism allows the model to capture relationships between different patches in the image, while the feedforward networks apply non-linear transformations to learn complex patterns present in the image data.

5. Classification Head: The output of the transformer encoder (typically the [CLS] token representation, or a mean pooling over the patch tokens) is passed through a classification head, usually a fully connected layer followed by a softmax activation function. This final layer produces the predicted probabilities for the different classes, enabling image classification.
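To make these five steps concrete, here is a minimal, illustrative PyTorch sketch of a ViT-style encoder. The layer sizes, the learnable [CLS] token, and the use of `nn.TransformerEncoder` are assumptions made for brevity, not the exact configuration of any published ViT variant.

```python
import torch
import torch.nn as nn

class MiniViTEncoder(nn.Module):
    # Illustrative sketch only: sizes and defaults are assumptions, not a published ViT config.
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # 1. Patch embedding: a strided convolution cuts the image into patches
        #    and linearly projects each patch to a `dim`-dimensional vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

        # 2. Positional encoding: learnable position embeddings, plus a [CLS] token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # 4. Transformer encoder: a stack of self-attention + feed-forward layers
        #    (ViT uses pre-norm and GELU, hence norm_first=True and activation="gelu").
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           activation="gelu", norm_first=True,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # 5. Classification head applied to the [CLS] token.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                        # images: (B, 3, H, W)
        x = self.patch_embed(images)                  # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)              # 3. tokenize: (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                           # (B, num_patches + 1, dim)
        return self.head(x[:, 0])                     # class logits from the [CLS] token

logits = MiniViTEncoder()(torch.randn(2, 3, 224, 224))
probs = logits.softmax(dim=-1)                        # predicted class probabilities
```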

By transforming the input images into a sequence of tokens and utilizing the transformer architecture within the encoder module, the ViT model can efficiently capture spatial dependencies and patterns in image data. This approach has shown impressive performance in image classification tasks and has demonstrated the effectiveness of leveraging transformers for computer vision problems.

Overall, the transformer encoder module plays a crucial role in transforming input sequences into meaningful representations. The same mechanism drives NLP tasks such as machine translation, text generation, and sentiment analysis; ViT simply applies it to sequences of image patches instead of words.

What is the multi-head attention module?

Multi-head attention module

The ViT architecture contains a stack of transformer encoder layers, and each layer holds a multi-head self-attention mechanism. The attention module repeats its computation multiple times in parallel; each of these parallel computations is called an attention head. The module splits its Query, Key, and Value parameters N-ways and passes each split independently through a separate head. The outputs of all these parallel attention calculations are then concatenated and projected to produce the final attention output. This is called multi-head attention, and it helps the ViT network capture spatial dependencies and patterns in image data from several representation subspaces at once. A minimal sketch of this head split is shown below.
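The N-way head split can be written out directly with tensor reshapes. This is a sketch under assumed sizes (dim=192, 3 heads), not the exact projection layout of any particular ViT implementation.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    # Sketch with assumed sizes; real ViT variants differ in dims and head counts.
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)    # one projection producing Q, K and V
        self.proj = nn.Linear(dim, dim)       # recombines the heads' outputs

    def forward(self, x):                     # x: (B, tokens, dim)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split Q, K and V N-ways: each head gets its own slice of the embedding.
        def split(t):
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)          # (B, heads, T, head_dim)

        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        weights = scores.softmax(dim=-1)                # attention weights per head
        out = weights @ v                               # (B, heads, T, head_dim)

        # Concatenate the heads and mix them with a final linear projection.
        out = out.transpose(1, 2).reshape(B, T, D)
        return self.proj(out)

attn = MultiHeadSelfAttention()
print(attn(torch.randn(2, 197, 192)).shape)             # torch.Size([2, 197, 192])
```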

How do we compute the attention scores in the attention module of the ViT?

1. Query, Key, Value Vectors: First, the input token embeddings (patch embeddings) are linearly projected into three separate sets of vectors: queries (Q), keys (K), and values (V). These projections are learned during training and allow the model to represent the tokens in different latent spaces.

2. Matrix Multiplication: The query matrix (Q) is multiplied by the transpose of the key matrix (K) to produce a matrix of raw attention scores. This operation computes the similarity between every query-key pair, indicating how much attention each token should pay to every other token.

3. Scale: The raw attention scores are scaled by the square root of the key dimension (√d_k). Without this scaling, large dot products would push the softmax into a saturated region with extremely small gradients, destabilizing training.

4. Softmax Activation: The scaled attention scores are passed through a softmax activation function. This normalizes the scores across all tokens, turning them into weights that sum up to 1. These weights represent the importance or attention that the model should place on each token relative to others.

5. Weighted Summation: The softmax weights are used to weight the corresponding value vectors (V). A weighted sum of the value vectors is computed for each token, producing a context vector that represents the information the token attended to (see the step-by-step sketch after this list).
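Here are the five steps written out for a single attention head in plain PyTorch; the token count and dimensions are illustrative assumptions.

```python
import torch

d_k = 64                                   # per-head embedding dimension (assumed)
tokens = torch.randn(197, d_k)             # 196 patch embeddings + 1 [CLS] token
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v   # 1. learned linear projections
scores = Q @ K.T                                      # 2. raw pairwise similarities
scores = scores / d_k ** 0.5                          # 3. scale by sqrt(d_k)
weights = scores.softmax(dim=-1)                      # 4. each row sums to 1
context = weights @ V                                 # 5. weighted sum of values
print(context.shape)                                  # torch.Size([197, 64])
```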

Conclusion

By computing attention scores in this way, the Vision Transformer captures the relationships and dependencies between different patches of the image, allowing it to build meaningful representations for image processing and classification tasks.
