Understanding the Logic and Tricky Part of NMS
Object detection models like YOLOv5 and SSD predict objects’ locations by generating bounding boxes (shown in blue rectangles below). However, they do not predict just one bounding box per object: the raw output contains many more boxes than the final result, with different locations, sizes, and confidence levels.
This is where Non-Maximum Suppression (NMS) comes into play: it keeps the most probable bounding boxes and eliminates the less likely ones. This article explains how NMS works.
Overlapping Bounding Boxes
Without NMS, an object detection output may look like the one below, with many overlapping bounding boxes. In reality there would be many more bounding boxes than the image shows, but drawing them all would make it too chaotic, so I’m showing only some overlapping boxes to get my point across.
Some object detection models like YOLO (since version 2) use anchors to generate predictions. Anchors are a predefined set of boxes (width and height) located at every grid cell. They provide reasonable priors for object shapes (width-height ratios) calculated from the training dataset, so the model only needs to predict an offset and scale for each anchor box, which simplifies the network as we can make it fully convolutional. Because every grid cell carries several anchors, the number of predictions grows quickly: YOLOv2, for example, predicts more than a thousand bounding boxes.
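To make the offset-and-scale idea concrete, here is a minimal sketch, assuming a YOLOv2-style parameterization in which the box center is a sigmoid offset inside its grid cell and the width and height rescale the anchor exponentially. The function names `decode_box` and `count_boxes`, the anchor size, and the grid size are hypothetical and chosen only for illustration; they are not taken from any particular implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw network output into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, cell, anchor):
    """Turn raw outputs (tx, ty, tw, th) into a box in grid-cell units.

    cell   -- (cx, cy), top-left corner of the grid cell (assumed layout)
    anchor -- (pw, ph), width and height of the anchor prior
    """
    tx, ty, tw, th = raw
    cx, cy = cell
    pw, ph = anchor
    bx = cx + sigmoid(tx)    # center x stays inside the cell
    by = cy + sigmoid(ty)    # center y stays inside the cell
    bw = pw * math.exp(tw)   # anchor width, rescaled
    bh = ph * math.exp(th)   # anchor height, rescaled
    return bx, by, bw, bh


def count_boxes(grid_w: int, grid_h: int, anchors_per_cell: int) -> int:
    """Every grid cell predicts one box per anchor."""
    return grid_w * grid_h * anchors_per_cell


# All-zero raw outputs leave the anchor shape unchanged and
# place the center in the middle of the cell.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(3.0, 2.0)))
# (6.5, 6.5, 3.0, 2.0)

# Illustrative numbers only: a 19x19 grid with 5 anchors per cell.
print(count_boxes(19, 19, 5))
# 1805
```

With these assumed numbers, a single image already yields well over a thousand candidate boxes, which is why some form of post-processing is needed.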
In this article, we are not going into more detail about anchors. We only need to know that an object detection model generates many bounding boxes and we need to apply post-processing to eliminate redundant ones.
Each predicted bounding box has a confidence score, which indicates how likely the model believes it is that the box contains an object. For example, the model may output a bounding box for a dog with a confidence score of 75%. The confidence score tells us…