EFPN: Extended Feature Pyramid Network for Small Object Detection
When small things make a big problem
Object detection has been a cornerstone of computer vision applications since the early days of intelligent machine systems. Despite having been investigated for a long time, the topic never seems to get old and remains one of the fundamental problems in video understanding and computer vision. I assume you already have some background in object detection, so I will skip the basics, such as what object detection is and the main types of object detectors; you can easily find plenty of definitions and explanations in any computer vision resource.
People love to make things harder for technology just when it has achieved something notable. In object detection, the data keeps getting more challenging even as more innovative algorithms are proposed. Modern image data often contains a great number of small (or very small) objects, and classical object detectors such as R-CNN, Single Shot Detector (SSD), and YOLO fall short because they cannot handle small objects well. Therefore, successors of these methods (Cascade R-CNN, DSSD, YOLOv4, YOLOv4–5D) have been proposed to cope with the challenge. You can take a look at my post reviewing and explaining YOLOv4–5D here. In this post, I am going to review the Extended Feature Pyramid Network (EFPN) for small object detection; you may want to read the full paper here.
The Problem of Vanilla FPN Detectors
As mentioned in the paper, the feature pyramid network (FPN) introduced the concept of detecting objects on feature maps at different scales to enhance the performance of object detectors, especially for small objects. An FPN has the same structure as an encoder-decoder architecture: the features extracted from the input image are first encoded down to a distilled feature map (the bottleneck), then this information is upscaled again and combined with the corresponding feature levels to form the final feature maps used for prediction. By doing so, the location information in the shallower layers is combined with the rich semantic information in the deeper layers to improve overall detection performance.
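The top-down merge step described above can be sketched in a few lines. This is a minimal, framework-free illustration on plain 2D lists (the 1x1 lateral convolutions and the real tensor shapes are omitted); the function names are my own, not from the paper.

```python
# Sketch of one FPN top-down merge step: the deeper (lower-resolution)
# map is upsampled 2x and added element-wise to the lateral map.

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def merge(lateral, deeper):
    """One FPN merge: upsample the deeper map, then add it to the lateral map."""
    up = upsample2x(deeper)
    return [[a + b for a, b in zip(lr, ur)] for lr, ur in zip(lateral, up)]

# Toy example: a 2x2 deep map merged into a 4x4 lateral map of zeros.
deep = [[1, 2],
        [3, 4]]
lateral = [[0] * 4 for _ in range(4)]
p = merge(lateral, deep)  # 4x4 map carrying the upsampled deep features
```

In a real FPN, `merge` is repeated level by level from the bottleneck upward, which is exactly the chain that EFPN extends with one more level.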
Although vanilla FPN object detectors can achieve favorable detection results at certain scales, they still have a shortcoming: as shown in the figure below, they do not use the feature maps efficiently for the object sizes each level is meant to handle. Specifically, in the FPN's working mechanism, the high-resolution feature map (Level 1 in the figure below) should be used for small object detection, and the low-resolution feature map (Level 4) should be responsible for predicting large objects. However, large object proposals are produced at almost all feature map levels, while medium and small objects are proposed at the same level (Level 1). These statistics show that vanilla FPN detectors do not work the way we would expect.
Extended Feature Pyramid Network (EFPN)
To address the aforementioned problem, Deng et al. proposed Extended Feature Pyramid Network for Small Object Detection (full paper). The main contributions in the paper can be summarized as:
- An extended feature pyramid network (EFPN) to improve small object detection performance.
- A feature texture transfer (FTT) module to grant credible details for more accurate small object detection results.
- A foreground-background-balanced loss function to alleviate area imbalance between foreground and background.
Now, let’s walk through each item.
1. EFPN’s Architecture
As shown in the figure above, and compared to the vanilla FPN architecture, the main differences are: (1) EFPN adds one more detection layer, P2′; (2) EFPN uses an FTT module to transfer features from P2 and P3 to P2′. Unlike the regular top-down steps, which only use the lower, adjacent feature map for upscaling, the FTT module takes the two feature maps P2 and P3 as input and generates P3′, which is then used to build the new detection layer P2′. As a result, EFPN can make predictions at five different scales.
2. FTT Module
In the FTT module, a content extractor is first applied to extract the semantic features from P3 (Main); a sub-pixel convolution layer then upscales the content extractor's output. This upscaled information is combined with the feature map P2 (Reference) to form the input of a texture extractor, which is designed to select credible textures of small objects. Finally, a residual connection performs the feature fusion and yields the output feature map P3′. In this way, P3′ obtains selective textures from the shallow feature map P2 and semantics from the deeper layer P3.
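The sub-pixel convolution mentioned above upscales by rearranging channels into space (often called pixel shuffle) rather than by interpolation. Here is a hedged, pure-Python sketch of just that rearrangement, with the learned convolution omitted; the function name is my own.

```python
# Sketch of the sub-pixel (pixel-shuffle) upscaling used in the FTT module:
# r*r channels of an HxW map are rearranged into a single (r*H)x(r*W) map.

def pixel_shuffle(channels, r):
    """channels: list of r*r feature maps, each H x W (lists of lists).
    Returns one (r*H) x (r*W) map."""
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for c, fmap in enumerate(channels):
        dy, dx = divmod(c, r)  # each channel fills one sub-pixel offset
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = fmap[y][x]
    return out

# Toy example: four 1x1 channels become one 2x2 map.
up = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2)
# up == [[1, 2], [3, 4]]
```

Because the new pixels come from learned channels rather than interpolation, the upscaled map can carry genuinely new detail, which is what the texture extractor then refines with P2.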
3. Foreground-background-balanced Loss
A classical way to improve object detection performance is to use high-resolution inputs. Inspired by this idea, the authors propose a new training mechanism called Cross Resolution Distillation.
In knowledge distillation, the student model learns from the output of a teacher network. Here, the authors run the model on a 2x-resolution input and use the outputs of its top four pyramid levels as targets for training with knowledge distillation. For example, P5 of the 2x input has double the resolution of P5 of the original (1x) input, and therefore the same resolution as P4 of the original input, as illustrated in the figure above. Hence, P3 and P2 of the 2x input are used as targets for training P3′ and P2′ of the original (1x) input, respectively. The student EFPN is trained with the following loss function:
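The level pairing above works because each student level of the 1x input has exactly the spatial size of its teacher level from the 2x input. A small sanity-check sketch (the input size and the stride convention are my assumptions, not values from the paper):

```python
# Check of the Cross Resolution Distillation pairing: each student pyramid
# level (1x input) matches the spatial size of its teacher level (2x input).

def level_size(image_size, level):
    """Spatial size of pyramid level Pk for a square input (stride 2**k)."""
    return image_size // 2 ** level

IMG = 512  # toy input size (assumption)

# Student (1x) stride exponent -> teacher (2x) level, following the post:
# P4 <- P5@2x, P3 <- P4@2x, P3' <- P3@2x, P2' <- P2@2x.
# The extended levels P3' and P2' run one stride finer than P3 and P2,
# so their stride exponents are 2 and 1 respectively.
pairs = {4: 5, 3: 4, 2: 3, 1: 2}

for student_exp, teacher_level in pairs.items():
    assert level_size(IMG, student_exp) == level_size(2 * IMG, teacher_level)
```

So the teacher targets need no resizing; the L1-style loss below can be applied directly between matching maps.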
where L_fbb is the proposed foreground-background-balanced loss. This loss is essentially an L1 loss, but it is composed of two parts: a global reconstruction loss L_glob and a positive patch loss L_pos. The foreground-background-balanced loss can be written as:
where F denotes the computed feature map, Ft denotes the target feature map, and λ is a balancing weight. L_glob and L_pos are formulated as:
where P_pos denotes the patches of ground-truth objects, (x, y) indexes the feature map coordinates, and N is the number of positive pixels. See the original paper for more details.
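Putting the two terms together, the balanced loss can be sketched as follows. This is a minimal pure-Python reading of the definitions above, assuming a binary 0/1 mask marks the positive (ground-truth) pixels; the normalization details in the paper may differ slightly.

```python
# Sketch of the foreground-background-balanced loss on 2D maps (lists of
# lists): a global L1 term over the whole map plus an L1 term restricted
# to ground-truth (positive) pixels, weighted by lam.

def l_fbb(F, Ft, pos_mask, lam=1.0):
    h, w = len(F), len(F[0])
    # L_glob: mean absolute difference over every pixel of the map.
    glob = sum(abs(F[y][x] - Ft[y][x])
               for y in range(h) for x in range(w)) / (h * w)
    # L_pos: mean absolute difference over positive pixels only,
    # normalized by N, the number of positive pixels.
    n_pos = sum(sum(row) for row in pos_mask) or 1
    pos = sum(abs(F[y][x] - Ft[y][x])
              for y in range(h) for x in range(w) if pos_mask[y][x]) / n_pos
    return glob + lam * pos

# Toy example: one mismatched pixel that happens to be a positive pixel.
loss = l_fbb([[1, 0], [0, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 0]])
# glob = 1/4, pos = 1/1, so loss == 1.25 with lam = 1.0
```

The point of the second term is that small foreground regions contribute little to the global average, so weighting them separately keeps the distillation from being dominated by background pixels.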
To compare with other modern methods, the authors tested EFPN on the Tsinghua-Tencent 100K small traffic sign dataset and the MS COCO small object dataset. The quantitative results are shown in the following tables: EFPN achieves state-of-the-art detection results in various experimental settings compared with other recent algorithms.
Visual detection results are shown in the figure below; each image pair compares the vanilla FPN (left) with EFPN (right) (red: false negative, blue: false positive, green: true positive). It is evident that EFPN outperforms FPN on small object detection.
In this post, I have reviewed the Extended Feature Pyramid Network (EFPN), an improvement over the vanilla Feature Pyramid Network (FPN) for small object detection. EFPN is effective in terms of both accuracy and computation, producing state-of-the-art results on the Tsinghua-Tencent 100K small traffic sign dataset and the MS COCO small object dataset.
Readers are welcome to visit my Facebook fan page which is for sharing things regarding Machine Learning: Diving Into Machine Learning. Other notable posts from me regarding object detection can also be found at YOLOv4–5D review and Darkeras.
Thanks for reading!