CHAPTER 4
Author
Hamid Serry, WMG Graduate Trainee, University of Warwick
We hope you had a happy Easter! While there were unfortunately no Easter eggs to be found in our blog post last week, this week you get to experience a different type of treat: performance metrics!
These are used to compare the outcomes of different machine learning models and to track the progress of model training.
Base metrics
Most performance metrics are built up from a few simple definitions relating to how the model predictions compare with ground-truth data.
In the following, we will refer to the task of object detection in images; thus, the neural network (NN) will predict bounding boxes and the dataset will contain the ground truth in the form of labeled bounding boxes.
A ground-truth bounding box is a labeled section of an image containing a specific object which has been previously identified/classified.
The bounding boxes output by the NN can be categorised as follows:
- True Positive (TP): ground-truth bounding box correctly detected
- False Positive (FP): detection of an object which does not exist
- False Negative (FN): ground-truth bounding box undetected [1]
- True Negative (TN): correctly not detecting an object which was not intended to be detected, i.e. where there is no ground-truth bounding box
True Negatives are often neglected [1]. Whether a detection counts as correct also depends on the Intersection over Union (IoU), discussed in the following paragraph.
Intersection over Union (Eq.1) describes how a predicted bounding box compares to a ground-truth bounding box by comparing their areas, see Figure 1.
Equation 1:  IoU = area of intersection / area of union
Figure 1 – Graphical representation of the Intersection over Union metric: ground-truth boxes are depicted in green, while detection boxes are in red. The areas considered are highlighted in yellow: intersection on top and union on bottom
IoU is measured on a scale of 0 to 1, with 1 being a perfectly overlapping prediction, and is compared against a threshold value, α, which sets the proportion of overlap between the two bounding boxes required for a detection to count as correct.
Figure 1, for example, shows an IoU of around 0.4. With α = 0.5, the object would be labeled as a FN, since the overlap falls below the threshold; with α = 0.3, it would instead be labeled a TP, since it has been successfully detected [2].
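As a minimal sketch of how these definitions could be turned into code (assuming boxes are given as (x1, y1, x2, y2) corner coordinates and matched greedily, neither of which is prescribed above), IoU and the threshold-based TP/FP/FN counts might look like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def classify_detections(detections, ground_truths, alpha=0.5):
    """Count TP, FP and FN, matching each ground truth to at most one detection."""
    matched_gt = set()
    tp = fp = 0
    for det in detections:
        # Find the best still-unmatched ground-truth box for this detection
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i in matched_gt:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_idx is not None and best_iou >= alpha:
            tp += 1                    # overlap meets the threshold: true positive
            matched_gt.add(best_idx)
        else:
            fp += 1                    # no sufficiently overlapping ground truth
    fn = len(ground_truths) - len(matched_gt)  # ground truths left undetected
    return tp, fp, fn
```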
Precision and Recall
Precision is the ratio of the true positive detections, TP, to all detections, Equation 2.
Equation 2:  Precision = TP / (TP + FP)
Recall illustrates how well the model picks up every ground truth and is the ratio of TP to all ground truths in the image, Equation 3.
Equation 3:  Recall = TP / (TP + FN)
A highly precise model would only detect TPs, so every detection it makes can be trusted, but it may miss some ground-truth objects. A high-recall model would detect all ground-truth objects in a scene, but may also collect some false positives.
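Building on the counts from the sketch above, precision and recall follow directly; the zero-division guards are just a practical convenience:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```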
Mean Average Precision
A precision-recall curve can be plotted as the NN is evaluated on the dataset, by sweeping over the detection confidence threshold; Average Precision (AP) can then be calculated as the area under this curve for a given IoU threshold α, Equation 4.
Equation 4:  AP = area under the precision-recall curve, i.e. the integral of precision p(r) over recall r from 0 to 1
Across many different α values for IoU, the AP would vary; mean Average Precision (mAP) is the average across these values.
This metric is widely used to compare the performance of different models against each other, and is cited in multiple research papers which make such comparisons [3].
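A rough sketch of how AP and mAP might be computed from confidence-scored detections, reusing the iou helper above; the rectangle-rule integration and the particular set of α values are illustrative assumptions rather than any benchmark's exact recipe:

```python
def average_precision(scored_detections, ground_truths, alpha):
    """AP at one IoU threshold: area under the precision-recall curve.

    scored_detections: list of (confidence, box); ground_truths: list of boxes.
    """
    if not ground_truths:
        return 0.0
    # Rank detections by confidence, highest first
    ranked = sorted(scored_detections, key=lambda d: d[0], reverse=True)
    matched_gt = set()
    tp = fp = 0
    precisions, recalls = [], []
    for _, det in ranked:
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i in matched_gt:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_idx is not None and best_iou >= alpha:
            tp += 1
            matched_gt.add(best_idx)
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / len(ground_truths))
    # Rectangle-rule area under the precision-recall curve
    area, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        area += (r - prev_recall) * p
        prev_recall = r
    return area


def mean_average_precision(scored_detections, ground_truths,
                           alphas=(0.50, 0.55, 0.60, 0.65, 0.70, 0.75)):
    """mAP as the mean of AP over a set of IoU thresholds."""
    return sum(average_precision(scored_detections, ground_truths, a)
               for a in alphas) / len(alphas)
```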
F1 Score
Another useful metric for comparing models is the F1 Score. Also based on precision and recall, it condenses overall performance into a single value, Equation 5.
The F1 Score is often referred to as the harmonic mean of precision and recall [4]; a practical use is establishing the optimum confidence threshold at which the two are balanced.
Equation 5:  F1 = 2 × (Precision × Recall) / (Precision + Recall)
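Finally, a small sketch of using the F1 Score to pick a balanced confidence threshold, reusing classify_detections and precision_recall from the earlier sketches; the 0.05-step threshold sweep is an arbitrary choice for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_confidence_threshold(scored_detections, ground_truths, alpha=0.5):
    """Sweep candidate confidence thresholds and keep the one giving the highest F1."""
    best_f1, best_threshold = 0.0, 0.0
    for threshold in (i / 100 for i in range(0, 100, 5)):
        kept = [box for conf, box in scored_detections if conf >= threshold]
        tp, fp, fn = classify_detections(kept, ground_truths, alpha)
        p, r = precision_recall(tp, fp, fn)
        score = f1_score(p, r)
        if score > best_f1:
            best_f1, best_threshold = score, threshold
    return best_threshold, best_f1
```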
Localisation Precision Recall
Rather than maintaining a constant IoU threshold for every detection, Localisation Precision-Recall (LPR) attempts to avoid over- and under-penalising the localisation of bounding boxes.
A good way to imagine this is detecting a cat. A model may try to include the tail, resulting in poor localisation of the bounding box; yet the region of most interest for an autonomous car is the cat's body, to establish that there is a cat in the path, with no need for an exact tail position, Figure 2.
LPR applies a penalty to poorly localised detections and therefore provides another metric of particular interest for our use case.

Figure 2 – Localisation bounding boxes on a cat, with and without tail included
Conclusion
A number of suitable performance metrics have been identified, many of which build up into higher-level metrics with a single-value output.
The next steps are: investigating the spatial recall index; and implementing the above-described metrics in code, so that they can then be used to compare the selected model against the Anyverse and Kitti datasets.
References