Fast R-CNN
Understanding why it’s 213 Times Faster than R-CNN and More Accurate

In 2013, Ross Girshick et al. introduced R-CNN, a groundbreaking object detection model that combined convolutional networks with existing computer vision techniques and broke previous records. In 2015, Ross Girshick followed up with Fast R-CNN, which set a new record: it was more accurate, and inference became 213 times faster. Of course, a claim like that depends on what was being compared, so this article examines the results published in the paper to understand how Fast R-CNN became that fast.
If you are not familiar with R-CNN, please read the previous article first so that this article makes more sense.
Why R-CNN Is Slow
In the original R-CNN paper, Ross Girshick explained that R-CNN was more accurate than OverFeat (Sermanet et al.) but also noted that it was nine times slower. So, he wanted to make R-CNN faster.
Speeding up R-CNN should be possible in a variety of ways and remains as future work.
Source: R-CNN paper
However, the figure below, taken from the paper, shows that the pipeline is rather complex.
Ross Girshick identified three problems with R-CNN:
- Training is a multi-stage pipeline.
- Training is expensive in space and time.
- Object detection is slow.
Let’s examine each problem:
Training is a multi-stage pipeline
They needed to train a CNN, SVMs, and bounding-box regressors. First, they pre-trained the CNN on the ImageNet (2012) classification task with 1,000 classes. Then, they replaced the classification layer with a randomly initialized (N+1)-way classification layer (N = 20 classes for PASCAL VOC, N = 200 for the ImageNet detection task, plus one class for background) and fine-tuned the model using only warped region proposals. They treated any region proposal with 0.5 or greater IoU overlap with a ground-truth box as a positive example for that box's class, and all other proposals as background.
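To make these two fine-tuning steps concrete, here is a minimal sketch (not the authors' code) in PyTorch, using torchvision's AlexNet as a stand-in for the pre-trained CNN. The class counts and the 0.5 IoU threshold come from the paper; the helper names and the background-is-class-0 convention are illustrative choices for this sketch.

```python
import torch.nn as nn
from torchvision import models

N_CLASSES = 20               # PASCAL VOC object classes
NUM_OUTPUTS = N_CLASSES + 1  # +1 for the background class

# 1) Replace the 1000-way ImageNet classifier with a randomly
#    initialized (N+1)-way classification layer.
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn.classifier[6] = nn.Linear(cnn.classifier[6].in_features, NUM_OUTPUTS)

# 2) Label each region proposal by its IoU overlap with ground truth.
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes, gt_labels, threshold=0.5):
    """Return the class of the best-overlapping ground-truth box,
    or 0 (background) if no overlap reaches the IoU threshold."""
    best_iou, best_label = 0.0, 0
    for box, label in zip(gt_boxes, gt_labels):
        overlap = iou(proposal, box)
        if overlap >= threshold and overlap > best_iou:
            best_iou, best_label = overlap, label
    return best_label

# Example: a proposal overlapping a ground-truth box of class 12.
gt_boxes = [(50, 50, 200, 200)]
gt_labels = [12]
print(label_proposal((60, 60, 190, 210), gt_boxes, gt_labels))    # -> 12
print(label_proposal((300, 300, 400, 400), gt_boxes, gt_labels))  # -> 0 (background)
```

Note that this labeling happens before the CNN ever sees a proposal: each warped region enters fine-tuning already tagged as a specific object class or as background.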