Hamid Serry, WMG Graduate Trainee, University of Warwick
Two-stage detectors split the detection process into two tasks: proposing areas of the image that are of significant interest using a Region Proposal Network (RPN), then classifying those areas and applying bounding boxes using Region of Interest (RoI) pooling.
A well-known example of a two-stage algorithm is the Region-based Convolutional Neural Network (R-CNN), which employs a selective search algorithm to narrow the candidate regions down to just 2000 proposals, as opposed to several hundred thousand.
The first stage is the extraction of Regions of Interest (RoI) from the input image: any areas that could potentially be defined and labeled. The second stage is the classification of these regions.
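The two stages described above can be sketched as a short Python example. This is a toy illustration of the control flow only, not R-CNN itself: the sliding-window proposer and the mean-intensity classifier are hypothetical stand-ins for selective search (or an RPN) and a trained CNN classifier.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)
Image = List[List[float]]         # toy grayscale image as nested lists


def propose_regions(image: Image, window: int, stride: int) -> List[Box]:
    """Stage 1 (stand-in): slide a fixed window to generate candidate boxes."""
    height, width = len(image), len(image[0])
    boxes = []
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            boxes.append((x, y, x + window, y + window))
    return boxes


def classify_region(image: Image, box: Box) -> Tuple[str, float]:
    """Stage 2 (stand-in): label a region by its mean pixel intensity."""
    x0, y0, x1, y1 = box
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    score = sum(pixels) / len(pixels)
    return ("object" if score > 0.5 else "background", score)


def detect(image: Image, window: int = 2, stride: int = 2):
    """Run stage 1, then classify every proposal in stage 2."""
    detections = []
    for box in propose_regions(image, window, stride):
        label, score = classify_region(image, box)
        if label == "object":
            detections.append((box, label, score))
    return detections


toy_image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(detect(toy_image))
```

Because the second stage runs once per proposal, the cost of a two-stage detector grows with the number of regions produced by the first stage.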
Classifying and labeling 2000 region proposals for every image or frame takes an enormous amount of training time, on average around 84 hours to train the model using the PASCAL VOC 2007 dataset on an Nvidia K40 GPU. At this stage, the model could only be run on still images.
Faster R-CNN was developed over a year later and addressed many of these issues. It was the second step of a two-step improvement, with Fast R-CNN having been released 3 months prior. It maintained a mean Average Precision (mAP) of 73.2% using the PASCAL VOC 2007 dataset on a TITAN X GPU while dramatically reducing processing time, reaching 7 fps, a step closer to real-time detection.
One-stage (or single-stage) detectors instead produce the model’s full output in just one stage, using a CNN to both extract RoI and apply bounding boxes at the same time.
A popular one-stage detector is YOLO (You Only Look Once) which, as the name suggests, passes the image through the network in a single run. By cutting down the bounding boxes around RoI to just 100 boxes, it vastly improves processing time compared to R-CNN and Faster R-CNN, achieving a processing rate of 45 fps.
This, however, comes at a cost to its accuracy: YOLOv1 scored around 63.3% mAP on the same PASCAL VOC 2007 dataset on a TITAN X GPU. YOLOv2 achieved impressive results of 78.6% mAP at 40 fps, compared with 76.4% mAP at 5 fps for Faster R-CNN with a ResNet backbone.
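To make the speed figures easier to compare, frames per second can be converted into time per frame (1000 / fps milliseconds). The short sketch below does this for the results quoted above; the numbers are copied from the text rather than re-measured, and the variable and function names are chosen only for illustration.

```python
# (detector, mAP % on PASCAL VOC 2007, frames per second), as quoted in the text
quoted_results = [
    ("Faster R-CNN", 73.2, 7),
    ("YOLOv1", 63.3, 45),
    ("YOLOv2", 78.6, 40),
    ("Faster R-CNN (ResNet)", 76.4, 5),
]


def ms_per_frame(fps: float) -> float:
    """Convert a frame rate into the time spent on each frame, in milliseconds."""
    return 1000.0 / fps


for name, m_ap, fps in quoted_results:
    print(f"{name:<24} {m_ap:5.1f}% mAP  {fps:3d} fps  {ms_per_frame(fps):6.1f} ms/frame")
```

At 7 fps a detector spends roughly 143 ms on each frame, whereas at 40 fps it spends 25 ms, which is why the one-stage results are much closer to real-time use.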
Lastly, Transformer algorithms work slightly differently, using an attention mechanism that applies a weighting value to each input. This allows them to prioritise how they handle the data, and therefore to determine their own order of processing the inputs.
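The weighting idea can be illustrated with a minimal sketch, assuming the weighting in question is the scaled dot-product attention commonly used in Transformer models. The function names and the tiny example vectors are illustrative only, and a real model learns the query, key and value vectors rather than having them written by hand.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float],
           keys: List[List[float]],
           values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Weight each input by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# The first key matches the query, so the first input receives the larger weight.
weights, output = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0], [20.0]],
)
print(weights, output)
```

Inputs whose keys align more closely with the query receive larger weights, which is how the model prioritises some inputs over others.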
Transformers are commonly used within natural language applications, such as detecting positive and negative speech from comments, or voice-to-text translations, but have also been applied to object detection, with comparisons made to Faster R-CNN.
A comparison was made, with the same training parameters, between a Faster R-CNN with a ResNet-101 backbone and a Detection Transformer (DETR), also with a ResNet-101 backbone; DETR achieved 12 fps compared to Faster R-CNN’s 16 fps under the same conditions.
As it stands, Faster R-CNN offers a suitable compromise between a slower two-stage detector with higher accuracy and a one-stage detector that is much faster but at the price of mean Average Precision. However, in the future, we might consider investigating more deep neural network architectures.
 L. Du, R. Zhang and X. Wang, “Overview of two-stage object detection algorithms”, Journal of Physics: Conference Series, vol. 1544, no. 1, p. 012033, 2020. Available: 10.1088/1742-6596/1544/1/012033