CNN Architectures
— Everything you need to know about important CNN architectures.
It all started with LeNet-5:
The LeNet-5 architecture was introduced in 1998 by Yann LeCun et al. It was designed for handwritten digit recognition, most notably on the MNIST dataset, and it played a significant role in advancing deep learning and convolutional neural networks.
The architecture of LeNet-5 consists of seven layers, including two convolutional layers, two subsampling (pooling) layers, and three fully connected layers. Let’s dive into the details of each layer:
1. Input layer: The first layer acts as the input layer and expects a grayscale image of size 32x32 pixels.
2. Convolutional Layer C1: This is the initial convolutional layer, which consists of six feature maps. Each feature map is produced by convolving a learnable set of 5x5 filters with the input image. A hyperbolic tangent (tanh) activation function is applied to introduce non-linearity (the original paper used a scaled tanh; modern re-implementations often substitute ReLU).
3. Subsampling Layer S2: This layer reduces the spatial dimensions of the feature maps and provides local translation invariance. It downsamples by aggregating each 2x2 neighborhood in each feature map obtained from the previous layer (the original paper used a trainable, average-based subsampling; modern re-implementations often substitute max pooling).
4. Convolutional Layer C3: This layer consists of 16 feature maps, each resulting from convolving 5x5 filters with a subset of the feature maps from the previous layer (the original paper used a sparse connection table to limit computation and break symmetry). As in C1, tanh activation is applied.
5. Subsampling Layer S4: This layer, like S2, downsamples the feature maps obtained from the C3 layer, again over 2x2 neighborhoods.
6. Fully Connected Layer F5: This layer acts as a traditional neural network layer with 120 neurons. Each neuron is connected to all the feature maps of S4. Tanh activation is applied again.
7. Fully Connected Layer F6: This layer has 84 neurons and is fully connected to F5, again followed by tanh activation.
8. Output Layer: The last layer is the output layer, which consists of 10 neurons corresponding to the 10 digit classes (0–9). Modern re-implementations apply a softmax activation to produce a probability distribution over the classes (the original paper used Euclidean radial basis function units).
It combines convolutional layers for feature extraction with subsampling layers for spatial downsampling. The fully connected layers at the end of the network act as classifiers.
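To make the layer arithmetic concrete, here is a minimal PyTorch sketch of LeNet-5. It is one possible modern rendering, not the original implementation: plain average pooling stands in for the paper's trainable subsampling, and C3 is densely rather than sparsely connected.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal LeNet-5 sketch: tanh activations, average-pool subsampling."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 1x32x32 -> 6x28x28
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # S2: 6x28x28 -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),  # C3: 6x14x14 -> 16x10x10
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # S4: 16x10x10 -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # F5
            nn.Tanh(),
            nn.Linear(120, 84),               # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),       # output; softmax lives in the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One grayscale 32x32 image in, 10 class logits out.
print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```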
The architecture set the foundation for future advancements in deep learning and convolutional neural networks. While it was initially designed for handwritten digit recognition, its concepts and principles have been extended and applied to various computer vision tasks, such as image recognition, object detection, and image segmentation.
AlexNet (2012):
AlexNet was introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012. It marked a breakthrough in deep learning by winning the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a wide margin. The architecture consists of eight learned layers: five convolutional layers and three fully connected layers. Let’s delve into the details of each layer:
1. Input layer: The first layer acts as the input layer and expects RGB images of size 227x227 pixels (the paper states 224x224, but 227x227 is the size that makes the layer arithmetic consistent).
2. Convolutional Layer C1: The initial layer applies convolution to the input image using 96 filters with a kernel size of 11x11 pixels. The stride is set to 4 pixels, which means the filters move by four pixels at a time. ReLU activation is applied to introduce non-linearity.
3. Max Pooling Layer S2: This layer follows C1 and performs overlapping max pooling with a 3x3 window and a stride of 2 pixels. Because the window is larger than the stride, neighboring pooling regions overlap, which reduces the spatial dimensions and, per the authors, slightly reduces overfitting.
4. Convolutional Layer C3: This layer employs 256 filters with a kernel size of 5x5 pixels. Similar to C1, ReLU activation is applied.
5. Max Pooling Layer S4: Similar to S2, this layer performs max pooling with a size of 3x3 and a stride of 2 pixels.
6. Convolutional Layer C5: This layer uses 384 filters with a kernel size of 3x3 pixels.
7. Convolutional Layer C6: C6 is another convolutional layer with 384 filters and a kernel size of 3x3 pixels.
8. Convolutional Layer C7: C7 is the final convolutional layer, with 256 filters and a kernel size of 3x3 pixels. It is followed by one more 3x3, stride-2 max pooling layer before the fully connected layers.
9. Fully Connected Layer F8: This layer consists of 4096 neurons and is fully connected to the preceding layer. It employs ReLU activation.
10. Fully Connected Layer F9: F9 has 4096 neurons and is also fully connected. ReLU activation is again applied.
11. Output Layer: The last fully connected layer, F10, acts as the output layer and consists of 1000 neurons, each representing a class in the ImageNet dataset used during training. It uses softmax activation to produce the probabilities of different classes.
AlexNet utilizes techniques like local response normalization (LRN), dropout, and data augmentation to regularize the network and prevent overfitting. These techniques, combined with its deep architecture and large-scale training on GPUs, helped AlexNet achieve significant advancements in image classification accuracy.
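The layer stack described above translates almost line for line into code. Below is a hedged PyTorch sketch of the single-branch AlexNet layout; the original model was split across two GPUs, and local response normalization is omitted here for brevity.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Simplified single-branch AlexNet (LRN omitted)."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),     # C1: 227 -> 55
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # S2: 55 -> 27
            nn.Conv2d(96, 256, kernel_size=5, padding=2),   # C3: 27 -> 27
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # S4: 27 -> 13
            nn.Conv2d(256, 384, kernel_size=3, padding=1),  # C5
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),  # C6
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),  # C7
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096),  # F8
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),         # F9
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # F10 output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

print(AlexNetSketch()(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```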
The success of AlexNet paved the way for further research and advancements in deep learning, establishing convolutional neural networks as a powerful tool in various computer vision tasks.
VGGNet (2014)
It is a deep convolutional neural network (CNN) architecture developed by the Visual Geometry Group (VGG) at the University of Oxford. It was proposed by Karen Simonyan and Andrew Zisserman in 2014. VGGNet is known for its simplicity and uniformity in architecture. It achieved remarkable performance on the ImageNet dataset and helped advance the field of deep learning.
The architecture of VGGNet consists of several convolutional layers followed by a few fully connected layers. Let’s explore the key aspects of its architecture:
1. Input layer: The first layer acts as the input layer and expects RGB images of size 224x224 pixels.
2. Convolutional layers: VGGNet is characterized by deep stacks of convolutional layers that all use small 3x3 filters with stride 1 and padding 1. Stacking small filters emulates a larger receptive field (two 3x3 layers cover the same area as one 5x5 layer) while using fewer parameters and adding more non-linearity. The depth of the network varies depending on the version of VGGNet.
3. Max Pooling layers: After every two or three convolutional layers, a max pooling layer with a filter size of 2x2 and a stride of 2 pixels is applied. The pooling operation reduces the spatial dimensions, increasing the receptive field and aiding in capturing more abstract features.
4. Fully Connected layers: VGGNet concludes with fully connected layers similar to traditional neural networks. These layers combine the features extracted from the convolutional layers and make predictions.
5. ReLU activation: Rectified Linear Unit (ReLU) activation is used after each convolutional and fully connected layer except for the output layer. ReLU introduces non-linearity, helping the network learn complex relationships in the data.
6. Softmax activation: The output layer of VGGNet employs softmax activation to produce a probability distribution over the classes for classification tasks.
VGGNet is known for its deep architecture, primarily consisting of 16 or 19 weight layers. The different versions of VGGNet, namely VGG16 and VGG19, differ in the number of convolutional and fully connected layers. VGG16 has 13 convolutional layers and 3 fully connected layers, while VGG19 has 16 convolutional layers and 3 fully connected layers.
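Because the design is so uniform, VGG16 is commonly expressed as a configuration list. The sketch below, a common re-implementation pattern rather than the authors' original code, builds the 13 convolutional and 3 fully connected layers of VGG16 from such a list.

```python
import torch
import torch.nn as nn

# Numbers are output channels of 3x3 convolutions; "M" marks 2x2/stride-2 max pooling.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def make_vgg16(num_classes: int = 1000) -> nn.Module:
    layers, in_ch = [], 3
    for v in VGG16_CFG:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    # Five poolings shrink 224x224 to 7x7 before the classifier head.
    layers += [nn.Flatten(),
               nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
               nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
               nn.Linear(4096, num_classes)]
    return nn.Sequential(*layers)

print(make_vgg16()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```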
The VGGNet architecture prioritizes depth over other design considerations and exhibits impressive performance due to its deep and uniform structure. However, its main drawback is its high computational cost and large number of parameters, which can make training and inference more time-consuming.
Despite its computational demands, VGGNet has significantly contributed to the development of deep learning. It has influenced subsequent architectures and provided a benchmark for evaluating the performance of new convolutional neural networks.
GoogLeNet (2014)
GoogLeNet, also known as Inception v1, was developed by researchers at Google. It was designed to be more computationally efficient than previous CNN architectures while still achieving high accuracy on image classification tasks. The architecture introduces several novel features:
1. Inception modules: Instead of committing to a single filter size for each convolutional layer, the Inception module employs multiple filter sizes simultaneously (1x1, 3x3, 5x5) alongside a pooling branch. This allows the network to capture features at different scales within the same layer and extract more meaningful representations.
2. Parallel convolutions: Inception modules are constructed with parallel convolutional layers of different sizes. This parallel structure ensures the network can effectively capture both fine-grained and global features from images.
3. 1x1 convolutions: GoogLeNet utilizes 1x1 convolutions to reduce the dimensionality of feature maps before applying computationally expensive 3x3 or 5x5 convolutions. This helps reduce the number of parameters and lowers computational complexity, making the network more efficient.
4. Max pooling and average pooling: In addition to traditional max pooling, GoogLeNet incorporates average pooling as a way to further downsample feature maps. Average pooling can preserve more spatial information, allowing the network to capture global context.
5. Auxiliary classifiers: To mitigate the vanishing gradient problem during training, GoogLeNet includes auxiliary classifiers attached to intermediate layers. These classifiers provide additional supervision signals and gradients for earlier layers, helping the network to learn more discriminative features.
6. Global average pooling and classifier: Instead of a large stack of fully connected layers, the network concludes with a global average pooling layer that averages each feature map across its spatial dimensions. This is followed by a single fully connected layer and a softmax classifier for assigning the input image to different categories.
Overall, GoogLeNet’s architecture offers a trade-off between computational efficiency and model accuracy. By employing multiple parallel convolutional branches, 1x1 convolutions, and auxiliary classifiers, it achieved state-of-the-art performance on image classification benchmarks at the time while keeping computational costs under control.
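A minimal PyTorch sketch of one Inception module shows the four parallel branches and the 1x1 reductions; the channel counts below are those of the "inception (3a)" module from the paper.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches concatenated along the channel dimension."""
    def __init__(self, in_ch, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(  # plain 1x1 convolution
            nn.Conv2d(in_ch, ch1x1, 1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(  # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, ch3x3red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3red, ch3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(  # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, ch5x5red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5red, ch5x5, 5, padding=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(  # pooling, then 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

# "inception (3a)": 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```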
ResNet (2015)
ResNet, short for Residual Network, is a deep convolutional neural network (CNN) architecture developed by Kaiming He and colleagues at Microsoft Research. It is designed to address the degradation problem, in which deeper networks start to perform worse than shallower ones during training. The key innovation of ResNet is the introduction of residual connections, which allow the network to learn residual mappings instead of directly learning the desired underlying mapping.
1. Input layer: The network takes an input image of fixed size.
2. Convolutional and pooling layers: The network begins with a 7x7 convolutional layer with stride 2, which extracts low-level features, followed by 3x3 max pooling with stride 2 to downsample the feature maps.
3. Residual blocks: The core building blocks of ResNet are the residual blocks. Each block consists of a stack of convolutional layers, typically with 3x3 filters, each followed by batch normalization and a Rectified Linear Unit (ReLU) activation (the final ReLU is applied after the addition). The residual connection is achieved by summing the output of these convolutional layers with the original input to the block.
4. Identity mapping: The use of residual connections in ResNet allows for the learning of identity mappings. If the input and output of a residual block have the same dimensions, the connection simply copies the input and allows the network to learn the residual mapping. Otherwise, a 1x1 convolutional layer is used in the shortcut path to match the dimensions.
5. Shortcut connections: The skip or shortcut connection in a residual block enables the flow of gradients during backpropagation, ensuring that the gradients can travel through the network without vanishing. This helps in training very deep networks more effectively.
6. Bottleneck architecture: In deeper ResNet models (e.g., ResNet-50, ResNet-101), a bottleneck architecture is employed in the residual blocks. Here, the first 1x1 convolutional layer reduces the number of channels, followed by a 3x3 convolutional layer to capture features, and then another 1x1 convolutional layer to expand the number of channels again. This design reduces the computational cost while still capturing complex patterns.
7. Global average pooling: Instead of using fully connected layers at the end of the network, ResNet typically employs global average pooling. This operation averages the spatial dimensions of the feature maps, reducing them to a single vector. This reduces the number of parameters in the network and helps prevent overfitting.
8. Output layer: The global average pooled features are then fed into a fully connected layer, followed by a softmax activation function, which outputs the probabilities for different classes in a classification task.
By introducing residual connections, ResNet enables the training of much deeper networks, addressing the degradation problem faced by previous architectures. It allows for better optimization, improved gradient flow, and improved accuracy on various computer vision tasks.
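The following PyTorch sketch shows a basic two-layer residual block (the variant used in ResNet-18/34), including the 1x1 projection shortcut applied when the block changes dimensions.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv-BN stages plus a shortcut that adds the block input."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        if stride != 1 or in_ch != out_ch:
            # 1x1 projection to match dimensions on the shortcut path.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()  # identity mapping when shapes match

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # residual addition

block = BasicBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```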
MobileNetV3 (2019)
The MobileNetV3 architecture is a convolutional neural network (CNN) designed for efficient and lightweight computation on mobile and embedded devices. It aims to strike a balance between model size and accuracy. Here is an overview of the MobileNetV3 architecture:
1. Efficient design: MobileNetV3 is specifically optimized to be efficient on mobile devices, where computational resources are limited. It achieves this by using a combination of depthwise separable convolutions and inverted residual blocks.
2. Depthwise separable convolutions: MobileNetV3 extensively employs depthwise separable convolutions. This type of convolution splits the standard convolution into two separate operations: a depthwise convolution and a pointwise convolution. The depthwise convolution filters each input channel independently, while the pointwise convolution applies a 1x1 convolution to combine the filtered channels. This separation significantly reduces computational costs by reducing the number of parameters and operations.
3. Inverted residual blocks: The core building blocks of MobileNetV3 are inverted residual blocks. Each block consists of a 1x1 pointwise convolution that expands the number of channels, a non-linear activation (ReLU or, in several MobileNetV3 blocks, hard-swish), a depthwise convolution, and another 1x1 pointwise convolution that projects back down. When the block’s input and output have the same shape, an element-wise addition connects them. This skip or residual connection helps the network learn residual mappings similar to ResNet, improving gradient flow and information preservation.
4. MobileNetV3-Large and MobileNetV3-Small: MobileNetV3 comes in two main variations: MobileNetV3-Large and MobileNetV3-Small. MobileNetV3-Large is designed for high-accuracy tasks and has a slightly higher computational cost. MobileNetV3-Small, on the other hand, prioritizes speed and efficiency and is suitable for applications with strict latency constraints.
5. Efficient channel attention: MobileNetV3 incorporates squeeze-and-excitation (SE) blocks, a lightweight channel attention mechanism, inside many of its inverted residual blocks. SE blocks capture channel relationships to boost the importance of informative channels, improving feature representation and focusing the network on relevant information.
6. Multiple input options: MobileNetV3 supports various input sizes, including 224x224 for standard high-accuracy tasks and smaller input sizes like 96x96 or 128x128 for low-latency applications. This flexibility allows MobileNetV3 to adapt to different scenarios and device constraints.
7. Final layers and classifier: Like other CNN architectures, MobileNetV3 typically ends with a global average pooling layer, which averages the spatial dimensions of feature maps into a single vector. This vector is then fed into a fully connected layer followed by a softmax activation function for classification tasks.
MobileNetV3 offers an efficient and lightweight architecture for running CNNs on mobile and embedded devices. By leveraging depthwise separable convolutions, inverted residual blocks, and efficient channel attention, it achieves an optimal trade-off between model size and accuracy, making it suitable for real-time and resource-constrained applications.
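A simplified PyTorch sketch of an inverted residual block with an optional squeeze-and-excitation stage ties these pieces together. Plain ReLU stands in for the hard-swish activation MobileNetV3 uses in some blocks, stride 1 is assumed so the residual connection applies, and the expansion ratio and SE reduction factor are illustrative choices, not the exact per-block values from the paper.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expansion -> 3x3 depthwise -> optional SE -> 1x1 projection."""
    def __init__(self, in_ch: int, out_ch: int, expand_ratio: int = 4,
                 use_se: bool = True):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_res = in_ch == out_ch   # residual only when shapes match
        self.expand = nn.Sequential(     # pointwise expansion
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.depthwise = nn.Sequential(  # depthwise: one filter per channel
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.se = (nn.Sequential(        # squeeze-and-excitation channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, hidden // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden // 4, hidden, 1), nn.Sigmoid())
            if use_se else None)
        self.project = nn.Sequential(    # pointwise projection, no activation
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.depthwise(self.expand(x))
        if self.se is not None:
            out = out * self.se(out)  # reweight channels by learned importance
        out = self.project(out)
        return out + x if self.use_res else out

blk = InvertedResidual(16, 16)
print(blk(torch.randn(1, 16, 56, 56)).shape)  # torch.Size([1, 16, 56, 56])
```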
EfficientNet (2019)
The EfficientNet architecture is a family of convolutional neural network (CNN) models designed to achieve state-of-the-art performance with significantly fewer parameters and computations than other popular CNN architectures. The core idea behind EfficientNet is to scale the network’s depth, width, and input resolution jointly, starting from a baseline network (EfficientNet-B0) that was discovered with Neural Architecture Search (NAS).
Here’s a breakdown of the EfficientNet architecture:
1. Compound scaling: EfficientNet introduces a compound scaling method that uniformly scales the network’s depth, width, and resolution. A single compound coefficient (phi) determines how much to scale each dimension: the depth multiplier (d), width multiplier (w), and resolution multiplier (r) all grow with phi, with constants alpha, beta, and gamma (found by a small grid search) controlling how the added capacity is distributed. By scaling these dimensions together, EfficientNet ensures a balanced network that achieves good performance; see the worked example at the end of this section.
2. Mobile inverted bottleneck convolution: EfficientNet utilizes a variant of the inverted bottleneck block, similar to MobileNetV2 and MobileNetV3. This block (MBConv) consists of a 1x1 pointwise expansion convolution, a depthwise convolution, a squeeze-and-excitation stage, and another 1x1 pointwise projection convolution. It helps reduce the computational cost while capturing effective spatial and channel-wise representations.
3. Neural Architecture Search (NAS): The EfficientNet-B0 baseline was discovered with Neural Architecture Search, which automatically and efficiently explores a large space of candidate architectures under a target resource constraint. The larger variants are then obtained by applying compound scaling to this baseline.
4. Depth, width, and resolution optimization: EfficientNet scales the network’s depth by adding more layers according to the depth multiplier, scales the width by adjusting the number of channels in each layer via the width multiplier, and scales the resolution by increasing the input image size via the resolution multiplier.
5. EfficientNet architecture variants: EfficientNet provides multiple model variants, such as EfficientNet-B0, B1, B2, B3, B4, B5, B6, and B7. These variants differ in terms of their depth, width, and resolution based on the scaling coefficients. EfficientNet-B0 is the base model, while B7 represents the largest and most computationally expensive variant.
6. Performance and efficiency trade-off: EfficientNet achieves state-of-the-art performance on various computer vision benchmarks, such as image classification and object detection, while maintaining higher efficiency compared to other CNN architectures. The compound scaling technique ensures a well-balanced trade-off between model size, accuracy, and computational cost.
EfficientNet has gained significant attention and popularity due to its ability to achieve highly competitive performance while being computationally efficient. Its scalable architecture, achieved through compound scaling and Neural Architecture Search, offers a versatile solution for different resource constraints and real-world applications.
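As a worked example of the compound-scaling rule, the short Python sketch below uses the base coefficients reported in the paper (alpha = 1.2 for depth, beta = 1.1 for width, gamma = 1.15 for resolution, chosen so that alpha * beta^2 * gamma^2 is roughly 2, meaning each unit of phi roughly doubles FLOPs). The published B1-B7 variants round and hand-tune the resulting values, so this is an approximation of the rule, not a reproduction of the released models.

```python
# Compound scaling: depth, width, and resolution grow together with phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # constraint: ALPHA * BETA**2 * GAMMA**2 ~= 2

def compound_scale(phi: int):
    depth_mult = ALPHA ** phi       # multiplies the number of layers
    width_mult = BETA ** phi        # multiplies channels per layer
    resolution_mult = GAMMA ** phi  # multiplies the input image size
    return depth_mult, width_mult, resolution_mult

for phi in range(4):  # phi = 0 corresponds to the EfficientNet-B0 baseline
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input ~{round(224 * r)} px")
```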