Object Detection

Ankit kumar
Feb 11, 2024


An overview of different types of object detection methods.

Histogram of Oriented Gradients (HOG)

HOG is a popular feature descriptor used for object detection and recognition tasks. It captures information about the local gradients in an image to characterize the shape and appearance of objects.

The HOG algorithm works as follows:

1. Preprocessing: The input image is first preprocessed by converting it to grayscale and applying local contrast normalization to improve the robustness against illumination variations.

2. Gradient Computation: The gradient magnitude and direction are computed at each pixel of the preprocessed image. This is typically done using techniques like the Sobel operator.

3. Cell Formation: The image is divided into small cells, typically with a square shape. The gradient magnitudes and directions within each cell are used to construct a histogram of gradient orientations.

4. Block Formation: Adjacent cells are grouped together to form larger blocks. Each block consists of a set of cells and can have overlaps with neighboring blocks.

5. Histogram Concatenation: The histograms computed within the cells of a block are concatenated to form a block feature vector. The length of this vector depends on the number of histogram bins and the number of cells per block.

6. Normalization: The computed feature vector is further normalized within each block to account for variations in contrast and lighting conditions. This normalization helps to make the descriptor more robust.

7. Feature Extraction: The normalized feature vectors from all blocks are concatenated to form the final feature representation for the input image.

8. Classification: The extracted HOG features can be used as input to a machine learning algorithm, such as a Support Vector Machine (SVM), to train a model for object detection or recognition. The SVM or a similar classifier can then be used to predict the presence and location of objects in new images based on their HOG features.
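
To make the steps concrete, here is a minimal sketch of HOG feature extraction using scikit-image; the test image and all parameter values are illustrative choices, not ones mandated by the method:

```python
# Minimal HOG sketch with scikit-image; parameter values are
# illustrative, and the astronaut image is just a built-in test image.
from skimage import color, data
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())   # step 1: grayscale preprocessing

features = hog(
    image,
    orientations=9,              # step 3: 9 histogram bins per cell
    pixels_per_cell=(8, 8),      # step 3: cell size
    cells_per_block=(2, 2),      # step 4: overlapping 2x2-cell blocks
    block_norm="L2-Hys",         # step 6: per-block normalization
)                                # steps 5/7: one concatenated feature vector

print(features.shape)

# Step 8: stack such vectors into a matrix X and train a classifier, e.g.
# from sklearn.svm import LinearSVC; clf = LinearSVC().fit(X, y)
```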

Pros:

1. Robust to variations in illumination.

2. Effective at capturing object shape and appearance.

3. Computationally efficient.

Cons:

1. Limited in capturing fine-grained details.

2. Sensitive to image resolution and orientation.

3. Limited context information.

4. Requires manual parameter tuning.

Faster R-CNN (Region-based Convolutional Neural Network)

Faster R-CNN is a popular object detection framework that combines a learned region proposal stage with deep learning-based detection. It was introduced by Shaoqing Ren et al. in 2015.

The Faster R-CNN framework consists of two main components: a Region Proposal Network (RPN) and a Region-based CNN (RCNN) for object detection.

1. Region Proposal Network (RPN):
The RPN is a fully convolutional network that proposes candidate object bounding boxes in an image. It operates on the convolutional feature maps extracted from the input image: a small network slides across the feature maps and, at each location, scores a set of reference boxes (anchors) at multiple scales and aspect ratios. For each anchor, the RPN predicts whether it covers an object or background, and simultaneously predicts offsets that adjust the anchor's coordinates to match the object's position more accurately. This generates a set of region proposals, ranked by objectness score and with refined bounding box coordinates.
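
As a rough illustration, the following sketch tiles anchors over a feature map; the stride, scales, and aspect ratios are assumed values for illustration, not the exact configuration from the paper:

```python
# Illustrative anchor tiling for an RPN; stride/scales/ratios are
# assumed values, and r is treated as the width-to-height ratio.
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * stride, y * stride          # center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(2, 2).shape)   # (36, 4): 9 anchors at each of 4 locations
```

The RPN then predicts an objectness score and four coordinate offsets for every row of this array.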

2. Region-based CNN (RCNN):
The RCNN head takes the proposed regions from the RPN as input and performs object classification and box refinement. Each proposed region is mapped onto the shared feature maps and pooled to a fixed spatial size (RoI pooling), then passed through layers of a pre-trained CNN backbone such as VGG or ResNet. These features are used to classify the object in each region and refine its bounding box coordinates: classification is typically done with a softmax layer, and the bounding box refinement with regression. Non-maximum suppression is applied to eliminate redundant or overlapping bounding boxes.

The key idea of Faster R-CNN is that both the RPN and RCNN share the convolutional features of the input image, allowing for end-to-end training. This enables efficient and accurate object detection without the need for handcrafted features.

The training process of Faster R-CNN involves jointly training the RPN and RCNN components. It begins with a backbone CNN pre-trained on a large classification dataset, followed by fine-tuning the RPN and RCNN heads on a detection dataset where bounding box annotations are available.

Faster R-CNN has shown significant improvements in accuracy and efficiency compared to earlier object detection methods. It has become one of the go-to frameworks for object detection and has paved the way for the development of many subsequent state-of-the-art object detection approaches.
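
For practical use, torchvision ships a pre-trained Faster R-CNN; here is a short inference sketch, with a random tensor standing in for a real RGB image scaled to [0, 1]:

```python
# Faster R-CNN inference via torchvision; the random tensor is a
# placeholder for a real RGB image with values in [0, 1].
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # COCO-pretrained weights
model.eval()

image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])    # RPN proposals -> RCNN head -> NMS

# Each result dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
print(predictions[0]["boxes"].shape)
```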

Single Shot MultiBox Detector (SSD)

SSD is a popular object detection algorithm introduced by Wei Liu et al. in 2016 that focuses on achieving real-time detection performance. SSD directly predicts object bounding boxes and class labels from raw images without requiring a separate region proposal stage.

The main idea behind SSD is to use a set of default anchor boxes (also known as default bounding boxes) at different scales and aspect ratios to densely predict object bounding boxes and class probabilities. Here’s how SSD works:

1. Feature Extraction:
SSD starts with a base CNN network, such as VGG or ResNet, to extract feature maps from the input image. These feature maps are obtained at multiple spatial scales, capturing information at different levels of granularity.

2. Anchor Boxes:
For each spatial location on the feature maps, SSD associates a set of default anchor boxes of various scales and aspect ratios. These anchor boxes act as reference templates and are responsible for predicting object bounding boxes.

3. Prediction Layers:
At each feature map level, SSD applies additional convolutional layers to generate two sets of predictions for each anchor box: class probabilities for different object categories and the offsets to adjust the anchor boxes’ positions and sizes to match the actual objects in the image.

4. Anchor Box Matching:
During training, ground truth bounding boxes are matched with anchor boxes based on their Jaccard overlap (Intersection over Union, IoU) to assign positive and negative labels to the anchors. This determines which anchor boxes are responsible for predicting object classes and bounding box coordinates (a small matching sketch follows this list).

5. Loss Function:
SSD uses a combination of classification loss (such as softmax or focal loss) and regression loss (such as smooth L1 loss) to train the network. The classification loss measures the accuracy of object category predictions, while the regression loss measures the accuracy of predicted bounding box coordinates.

6. Inference and Post-processing:
During inference, SSD applies non-maximum suppression (NMS) to filter out redundant and overlapping bounding boxes, keeping only the most confident detections. The remaining bounding boxes with their corresponding class labels give the final detected objects.
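
Returning to the matching rule in step 4, here is a small sketch using torchvision's IoU utility; the boxes and the 0.5 threshold are illustrative:

```python
# IoU-based anchor matching sketch; boxes and threshold are illustrative.
import torch
from torchvision.ops import box_iou

anchors = torch.tensor([[0., 0., 50., 50.],
                        [25., 25., 100., 100.]])
gt_boxes = torch.tensor([[30., 30., 90., 90.]])

iou = box_iou(anchors, gt_boxes)   # (num_anchors, num_gt) Jaccard overlaps
positive = iou > 0.5               # anchors above threshold become positives
print(iou)                         # first anchor ~0.07 (negative), second ~0.64
```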

The key advantage of SSD is its efficiency and real-time performance, as it performs object detection in a single pass of the network. It simplifies the object detection pipeline by eliminating the separate region proposal stage and achieves a good balance between speed and accuracy.
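
That single pass is easy to see with torchvision's pre-trained SSD; a brief sketch with placeholder input:

```python
# SSD300 inference via torchvision; a random tensor stands in for an image.
import torch
from torchvision.models.detection import ssd300_vgg16

model = ssd300_vgg16(weights="DEFAULT")
model.eval()

image = torch.rand(3, 300, 300)     # SSD300 works on ~300x300 inputs
with torch.no_grad():
    detections = model([image])     # one forward pass, NMS applied internally

print(detections[0]["boxes"].shape)
```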

SSD has been widely adopted and extended, influencing later one-stage detectors such as EfficientDet and subsequent versions of YOLO (You Only Look Once).

You Only Look Once (YOLO)

YOLO is an efficient and popular object detection algorithm introduced by Joseph Redmon et al. in 2016. Unlike two-stage detectors, YOLO divides the input image into a grid and predicts object bounding boxes and class probabilities directly from the grid cells in a single pass of the network. YOLO focuses on achieving real-time detection performance while maintaining good accuracy.

Here’s how YOLO works:

1. Grid Division:
The input image is divided into a grid of cells. Each grid cell is responsible for predicting bounding boxes and class probabilities for objects present within its boundaries.

2. Bounding Box Prediction:
For each grid cell, YOLO predicts a fixed number of bounding boxes. Each bounding box consists of coordinates (x, y, width, height) and a confidence score that reflects both the probability that the box contains an object and the expected accuracy of the predicted box.

3. Class Probability Prediction:
Alongside the bounding box predictions, YOLO also predicts class probabilities, indicating the probability of a detected object belonging to each of the predefined categories (per grid cell in the original YOLO, per box in later versions).

4. Network Architecture:
YOLO employs a deep convolutional neural network (CNN) architecture, such as Darknet, to extract features from the input image. The network is trained end-to-end to optimize the joint loss of both bounding box predictions and class probabilities.

5. Non-Maximum Suppression (NMS):
During post-processing, YOLO applies non-maximum suppression (NMS) to reduce redundant and overlapping bounding boxes. NMS keeps the box with the highest confidence score for each object, discarding others that significantly overlap with it.
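
The following sketch decodes YOLO-style grid predictions into image-space boxes and applies NMS; the grid size, box count, random "network output", and the use of already-activated values are assumptions for illustration, not the exact original architecture:

```python
# Decoding YOLO-style grid predictions; S, B, and the random "network
# output" are illustrative, and activations are assumed already applied.
import torch
from torchvision.ops import nms

S, B = 7, 2                               # grid size, boxes per cell
pred = torch.rand(S, S, B, 5)             # (x, y, w, h, confidence) per box

ys, xs = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
cx = (xs.unsqueeze(-1) + pred[..., 0]) / S    # cell offset -> image fraction
cy = (ys.unsqueeze(-1) + pred[..., 1]) / S
w, h = pred[..., 2], pred[..., 3]             # sizes relative to the image
conf = pred[..., 4]

boxes = torch.stack([cx - w / 2, cy - h / 2,
                     cx + w / 2, cy + h / 2], dim=-1)   # (S, S, B, 4)

keep = nms(boxes.reshape(-1, 4), conf.reshape(-1), iou_threshold=0.5)
print(len(keep), "boxes survive NMS out of", S * S * B)
```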

The advantages of YOLO include its simplicity, speed, and ability to detect objects in real time thanks to its single-pass design. Because it reasons over the whole image at once, it produces fewer background false positives than region-based methods. However, YOLO may struggle with small or densely grouped objects due to its grid-based approach, and it can have lower localization accuracy than two-stage methods like Faster R-CNN.

Since its initial release, YOLO has undergone several improvements with subsequent versions like YOLOv2, YOLOv3, and YOLOv4, incorporating advancements like anchor boxes, feature pyramid networks, and other techniques to enhance detection accuracy and further optimize speed.

CenterNet

CenterNet is an object detection algorithm that simultaneously predicts object centers and their associated bounding boxes in an image. The anchor-free formulation described below was introduced in 2019 (popularized as "Objects as Points" by Xingyi Zhou et al.; Kaiwen Duan et al. proposed a related keypoint-based CenterNet the same year). CenterNet offers a simple yet effective approach for accurate and efficient object detection.

Here’s how CenterNet works:

1. Center Heatmap Prediction:
CenterNet predicts a heatmap that highlights the most likely positions of object centers in an image. A convolutional neural network (CNN) outputs this heatmap at a downsampled resolution of the input image; each pixel holds a confidence score indicating the likelihood that an object center is present at that location.

2. Size Regression:
In addition to the center heatmap, CenterNet regresses the size of the object associated with each center: it predicts the box width and height at each center location, along with a small offset that compensates for the discretization introduced by the downsampled heatmap.

3. Object Classification:
CenterNet assigns a class label to each detected center point. The center heatmap has one channel per object class, so a peak in a given channel indicates an object of that class centered at that location.

4. Post-processing:
During inference, CenterNet performs post-processing steps to generate the final object detections. It first identifies the local maximum points in the center heatmap above a certain threshold. Each local maximum point represents a potential object center. The corresponding bounding box is generated by combining the predicted sizes and the offsets from the center.
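
The NMS-free decoding in step 4 can be sketched with a max-pooling trick that keeps only local maxima in the heatmap; the tensor shapes and the top-100 cutoff are illustrative:

```python
# CenterNet-style peak extraction: a 3x3 max-pool keeps only local
# maxima in the heatmap, standing in for NMS. Shapes are illustrative.
import torch
import torch.nn.functional as F

heatmap = torch.rand(1, 80, 128, 128)          # (batch, classes, H, W)

pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
peaks = heatmap * (heatmap == pooled)          # zero out non-maxima

scores, idx = peaks.flatten(1).topk(100)       # top-100 candidate centers
print(scores.shape, idx.shape)                 # class/position recoverable from idx
```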

CenterNet offers several advantages:

1. High Accuracy: CenterNet achieves high accuracy due to its explicit focus on detecting object centers, making it effective for objects of different shapes and sizes.

2. Simplicity: The approach of predicting object centers and regressing the bounding boxes is conceptually simple and reduces complexity compared to other object detection methods. This simplicity aids in faster training and inference.

3. Efficiency: CenterNet provides efficient object detection, as it avoids the need for anchor boxes or complex post-processing steps like non-maximum suppression (NMS). This results in faster inference times.

4. Versatility: CenterNet is capable of detecting objects in a wide range of scenarios, including objects of different scales, crowded scenes, and occluded objects.

CenterNet has inspired further research and variants, pairing the approach with lightweight backbones such as MobileNet for real-time performance or with stronger Hourglass networks for accuracy. It has demonstrated impressive results on various benchmarks and remains an active area of research in the field of object detection.

DETR (DEtection TRansformer)

DETR is a transformer-based object detection method proposed by Nicolas Carion et al. in 2020. DETR departs from traditional detection approaches by eliminating anchor-based components and formulating object detection as a set prediction problem.

The key idea behind DETR is to cast object detection as a direct set prediction problem. Instead of using a predefined number of anchor boxes, DETR employs a fixed-size set of learnable object queries: learned embeddings that, by attending to the image features, each come to represent a potential object in the scene.

The DETR architecture consists of two main components: the convolutional backbone and the transformer-based encoder-decoder.

The convolutional backbone network processes the input image and extracts a set of feature maps. These feature maps serve as the initial representation of the image and contain high-level features.

The transformer-based encoder-decoder network processes the feature maps and performs the detection task. The encoder takes the input feature maps, supplemented with positional encodings, and passes them through a stack of transformer encoder layers; their self-attention captures global spatial relationships among the image features, enabling the model to reason about the presence and location of objects.

The decoder module, comprised of transformer decoder layers, takes the encoded representation and generates the final predictions. Each object query attends to the encoded representation, resulting in a set of predicted object bounding boxes and their corresponding class labels. By using transformer-based attention mechanisms, DETR can effectively capture global context and spatial dependencies, making it robust to occlusions and varying object scales.
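
The authors publish pre-trained models loadable through torch.hub; here is a short inference sketch, where the entry-point name follows the facebookresearch/detr repository and the random tensor is a placeholder for a resized, ImageNet-normalized image:

```python
# DETR inference via torch.hub; requires network access, and the random
# tensor stands in for a properly resized, ImageNet-normalized image.
import torch

model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

image = torch.rand(1, 3, 800, 800)
with torch.no_grad():
    out = model(image)

# 100 object queries -> 100 predictions: class logits and normalized boxes.
print(out["pred_logits"].shape, out["pred_boxes"].shape)
```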

During training, DETR uses bipartite (Hungarian) matching to find an optimal one-to-one assignment between predicted boxes and ground truth objects. The loss on each matched pair combines a class prediction loss with box regression terms (an L1 loss and a generalized IoU loss), while unmatched predictions are trained to predict a "no object" class. The optimal assignment ensures each ground truth object is explained by exactly one prediction, encouraging accurate, duplicate-free detections.
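
The assignment step itself can be sketched with SciPy's Hungarian solver; the random cost matrix is a placeholder for the real cost DETR builds from class probabilities and box distances:

```python
# Bipartite matching sketch; the cost matrix is a random placeholder
# for DETR's mix of classification and box-distance costs.
import numpy as np
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 100, 4
cost = np.random.rand(num_queries, num_gt)   # cost of pairing each prediction
                                             # with each ground truth box

pred_idx, gt_idx = linear_sum_assignment(cost)   # optimal one-to-one matching
print(list(zip(pred_idx, gt_idx)))               # 4 matched pairs; the other 96
                                                 # queries predict "no object"
```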

DETR has several advantages over traditional object detection methods. By removing the need for anchor boxes, it simplifies the detection pipeline and avoids the hyperparameter tuning associated with anchor-based methods. DETR also allows end-to-end training and has the potential for capturing long-range dependencies effectively.

DETR has achieved competitive results on standard object detection benchmarks while introducing a new direction in the field of computer vision. It has paved the way for further research and advancements in transformer-based techniques for object detection problems.
