Synthetic data to develop a trustworthy autonomous driving system | Chapter 4



Hamid Serry, WMG Graduate Trainee, University of Warwick

We hope you had a happy Easter! While there were unfortunately no Easter eggs hidden in last week's blog post, this week you get a different type of treat: performance metrics!

These are used to compare the outcomes of different machine learning models and to track the progress of model training.

Base metrics

Most performance metrics are built up from a few simple definitions relating to how the model predictions compare with ground-truth data.

In the following, we will refer to the task of object detection in images; thus, the neural network (NN) will predict bounding boxes and the dataset will contain the ground truth in the form of labeled bounding boxes.

A ground-truth bounding box is a labeled section of an image containing a specific object which has been previously identified/classified.

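The bounding boxes output by the NN can be categorised as follows:

True Positive (TP): a correct detection of a ground-truth bounding box.
False Positive (FP): a detection of a non-existent object, or a misplaced detection of an existing object.
False Negative (FN): a ground-truth bounding box that was not detected.
True Negative (TN): correctly making no detection where there is nothing to detect, i.e. no ground-truth bounding box.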

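True Negatives are often neglected in object detection [1], since an image contains countless regions where no box should be predicted. Whether a detection counts as correct also depends on the Intersection over Union (IoU), discussed in the following paragraphs.

Intersection over Union (IoU, Equation 1) describes how closely a predicted bounding box matches a ground-truth bounding box: it is the area of their overlap divided by the area of their union, see Figure 1.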

$$\mathrm{IoU} = \frac{\text{area}(B_{p} \cap B_{gt})}{\text{area}(B_{p} \cup B_{gt})} \tag{1}$$

where $B_{p}$ is the predicted bounding box and $B_{gt}$ is the ground-truth bounding box.

Figure 1 – Graphical representation of the Intersection over Union metric: ground-truth boxes are depicted in green, detected boxes in red. The area under consideration is highlighted in yellow: the intersection on top and the union on the bottom.

IoU takes values between 0 and 1, with 1 being a perfectly overlapping prediction. It is compared against a threshold value, α, which sets the minimum overlap required for a detection to count as correct.

Figure 1, for example, shows an IoU of around 0.4. If α = 0.5, the detection would be rejected and the ground-truth object would count as undetected, i.e. an FN; if α = 0.3, the detection would instead be labeled a TP, since the object has been successfully detected [2].
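The thresholding described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: boxes are given as (x_min, y_min, x_max, y_max) tuples, and the function names `iou` and `is_true_positive` are hypothetical, not taken from any particular library.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.
    """
    # Corners of the overlapping rectangle (if any)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(prediction, ground_truth, alpha=0.5):
    """A detection counts as correct when its IoU reaches the threshold alpha."""
    return iou(prediction, ground_truth) >= alpha


# Two boxes chosen so that the IoU is 0.4, similar to Figure 1
ground_truth = (0, 0, 7, 10)
prediction = (3, 0, 10, 10)

print(iou(prediction, ground_truth))                          # overlap 40 / union 100
print(is_true_positive(prediction, ground_truth, alpha=0.5))  # rejected at alpha = 0.5
print(is_true_positive(prediction, ground_truth, alpha=0.3))  # accepted at alpha = 0.3
```

The same pair of boxes flips from rejected to accepted purely because the threshold changes, which is why α always needs to be reported alongside any metric built on IoU.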

Precision and Recall

Precision is the ratio of correct detections, TP, to all detections made by the model, Equation 2.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall describes how well the model picks up every ground-truth object: it is the ratio of TP to all ground truths in the image, Equation 3.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

A high-precision model produces few false positives, so almost every detection it makes is correct, but it may miss some ground-truth objects. A high-recall model detects nearly all ground-truth objects in a scene, but may also collect some false positives.
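As a small illustration of this trade-off, the sketch below computes both metrics from TP, FP and FN counts. The counts for the two models are made-up numbers chosen for the example, and the function names are our own.

```python
def precision(tp, fp):
    """Share of detections that are correct: TP / (TP + FP)."""
    detections = tp + fp
    return tp / detections if detections > 0 else 0.0


def recall(tp, fn):
    """Share of ground-truth objects that were found: TP / (TP + FN)."""
    ground_truths = tp + fn
    return tp / ground_truths if ground_truths > 0 else 0.0


# Hypothetical counts for a scene with 12 ground-truth objects
cautious_model = {"tp": 8, "fp": 0, "fn": 4}   # never wrong, but misses objects
eager_model = {"tp": 12, "fp": 6, "fn": 0}     # finds everything, plus false alarms

for name, c in [("cautious", cautious_model), ("eager", eager_model)]:
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

Neither model is clearly better from one number alone, which motivates the combined metrics in the following sections.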

Mean Average Precision

A precision-recall curve can be plotted from the NN's detections on the dataset by varying the confidence threshold; Average Precision (AP) can then be calculated as the area under this curve for a given IoU threshold α, Equation 4.

$$\mathrm{AP}_{\alpha} = \int_{0}^{1} p(r)\,dr \tag{4}$$

where $p(r)$ is the precision at recall $r$, both evaluated at the IoU threshold α.

The AP varies with the IoU threshold α; mean Average Precision (mAP) is the average of the AP across a range of α values. In many benchmarks, the AP is also averaged over the object classes.

This metric is widely used to compare the performance of different models and is cited in many research papers that make such comparisons [3].
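A minimal sketch of this calculation is shown below. It assumes the detections have already been sorted by descending confidence and flagged as TP or FP for a given α, and it approximates the area under the curve with a simple step-wise sum; benchmarks often apply an interpolation of the curve first, which is not done here. The function names and the example flags are illustrative only.

```python
def average_precision(is_tp, num_ground_truths):
    """Step-wise area under the precision-recall curve.

    is_tp: list of booleans, one per detection, sorted by descending
           confidence; True means the detection is a TP at this alpha.
    num_ground_truths: total number of ground-truth boxes.
    """
    if num_ground_truths <= 0:
        return 0.0

    tp = fp = 0
    previous_recall = 0.0
    ap = 0.0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        current_precision = tp / (tp + fp)
        current_recall = tp / num_ground_truths
        # Each step adds a rectangle: recall gained times precision at that point
        ap += (current_recall - previous_recall) * current_precision
        previous_recall = current_recall
    return ap


def mean_average_precision(flags_by_alpha, num_ground_truths):
    """Average of the AP over several IoU thresholds alpha."""
    aps = [average_precision(flags, num_ground_truths)
           for flags in flags_by_alpha.values()]
    return sum(aps) / len(aps)


# Made-up example: three detections, two ground-truth objects.
# A stricter alpha turns the last TP into an FP.
flags_by_alpha = {
    0.50: [True, False, True],
    0.75: [True, False, False],
}
for alpha, flags in flags_by_alpha.items():
    print(alpha, average_precision(flags, num_ground_truths=2))
print("mAP:", mean_average_precision(flags_by_alpha, num_ground_truths=2))
```

Because the TP/FP flags themselves depend on α, the matching step has to be repeated for every threshold before the APs are averaged.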

F1 Score

Another metric that makes models comparable is the F1 Score. Also based on precision and recall, it summarises overall performance in a single value, Equation 5.

F1 is the harmonic mean of precision and recall; a practical use is finding the confidence threshold that gives the best balance between the two [4].

$$F_{1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$
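To illustrate how F1 can be used to pick a confidence threshold, the sketch below scores a few candidate thresholds and keeps the one with the highest F1. The precision and recall values are invented for the example; in practice they would come from evaluating the model at each threshold.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def best_confidence(candidates):
    """Return the (confidence, precision, recall) tuple with the highest F1."""
    return max(candidates, key=lambda c: f1_score(c[1], c[2]))


# Hypothetical (confidence threshold, precision, recall) measurements
candidates = [
    (0.3, 0.60, 0.90),  # permissive: high recall, lower precision
    (0.5, 0.80, 0.80),  # balanced
    (0.7, 0.95, 0.50),  # strict: high precision, lower recall
]

confidence, p, r = best_confidence(candidates)
print(f"best confidence threshold: {confidence} (F1 = {f1_score(p, r):.2f})")
```

Since the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1 by maximising only precision or only recall.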

Localisation Recall Precision

Rather than applying a constant IoU threshold and treating every accepted detection equally, Localisation Recall Precision (LRP) attempts to address over- and under-penalised localisations of bounding boxes by taking into account how well each detection is localised.

A good way to picture this is detecting a cat. A model may try to include the tail, which results in a poorly localised bounding box, whereas the region of most interest for an autonomous car is the cat's body: it establishes that there is a cat in the path without needing the exact tail position, Figure 2.

The metric penalises poorly localised detections and therefore provides another metric of particular interest for our use case.

Figure 2 – Localisation bounding boxes on a cat, with and without tail included
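The sketch below shows how such a localisation-aware error could be computed. It follows our reading of the definition by Oksuz et al.: each TP contributes a localisation error that grows as its IoU drops towards the threshold, and every FP and FN contributes a full error of 1. The exact form should be checked against the paper; the function name, the parameter `tau` and the example values are our own assumptions.

```python
def lrp_error(tp_ious, num_fp, num_fn, tau=0.5):
    """Localisation-aware detection error; 0 is perfect, 1 is worst.

    tp_ious: IoU of each true-positive detection with its ground truth
             (each assumed to be >= tau).
    num_fp, num_fn: counts of false positives and false negatives.
    tau: IoU threshold used to accept a detection as a TP (must be < 1).
    """
    total = len(tp_ious) + num_fp + num_fn
    if total == 0:
        return 0.0
    # A TP with IoU = 1 adds no error; a TP that barely passes tau adds almost 1
    localisation_error = sum((1.0 - value) / (1.0 - tau) for value in tp_ious)
    return (localisation_error + num_fp + num_fn) / total


# Same counts, different localisation quality (made-up IoU values)
tight_boxes = lrp_error([0.95, 0.90], num_fp=1, num_fn=0)
loose_boxes = lrp_error([0.60, 0.55], num_fp=1, num_fn=0)
print(f"tight boxes: {tight_boxes:.2f}, loose boxes: {loose_boxes:.2f}")
```

Both cases would give identical precision and recall at τ = 0.5, but the loosely fitted boxes receive a noticeably higher error, which is the behaviour the cat example is meant to convey.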


A number of suitable performance metrics have been identified, many of which build up to higher-level metrics with a single value output.

The next steps are investigating the spatial recall index and implementing the metrics described above in code, so that they can be used to compare the selected model's performance on the Anyverse and KITTI datasets.


[1] R. Padilla, S. L. Netto and E. A. B. da Silva, “A Survey on Performance Metrics for Object-Detection Algorithms”, 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), 2020, pp. 237-242, doi: 10.1109/IWSSIP48289.2020.9145130.
[3] L. Jiao et al., “A Survey of Deep Learning-Based Object Detection”, IEEE Access, vol. 7, pp. 128837-128868, 2019, doi: 10.1109/access.2019.2939201.
[4] Z. C. Lipton, C. Elkan and B. Narayanaswamy, “Thresholding classifiers to maximize F1 score”, arXiv preprint arXiv:1402.1892, 2014.
[5] K. Oksuz, B. C. Cam, E. Akbas and S. Kalkan, “Localization recall precision (LRP): A new performance metric for object detection”, Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 504-519.
