Synthetic vs. Real-world data for traffic light classification
The Issue of Traffic Light Classification
Having a robust traffic light (TL) classifier is essential in any autonomous driving / ADAS system. Even though it may seem like a simple AI problem, a TL classifier must perform extremely well. This calls for a dataset that is high-quality in both size and composition: training images should reflect the variability in shape, illumination, and weather conditions of traffic lights on any road on the planet.
The most popular datasets for TL detection and classification use real-world images: Bosch, DTLD, LARA, and LISA. Although these differ widely in quantity and quality, all of them present some undesirable characteristics, such as wrong labels, inaccurate bounding boxes, overrepresented annotations due to motionless scenes, and unbalanced classes. Overcoming these shortcomings by gathering more real-world data would be very expensive and time-consuming. Synthetic data is therefore an increasingly attractive alternative.
ANYVERSE for TL
Another study has already shown that ANYVERSE’s physics-based render quality provides greater fidelity to the real world than other synthetic data generation procedures. But how does ANYVERSE perform in comparison with real-world data? In this blog post, we conduct a study on TL classification using the four real-world datasets mentioned above together with a new synthetic one generated with ANYVERSE.
The Training Subjects
Assuming we have the location of a traffic light, we train a classifier to infer its state: green, red, or yellow. As input, we use 96×96 image crops. These contain only traffic lights with three round bulbs and a vertical housing, captured during the daytime. Some samples from each dataset are shown below:
The next figure shows the size and class distribution of each dataset in terms of red, green, and yellow lights. All real-world datasets except LISA include 7K-15K annotated traffic lights, and the yellow class is underrepresented in all of them. LARA and LISA contain a large number of overrepresented traffic lights, i.e. very similar crops caused by a stationary recording camera. In terms of class balance, LISA is the most uneven dataset because its yellow lights are extremely underrepresented, while DTLD is much more uniform.
The ANYVERSE dataset was generated independently of the real-world datasets. Since we can control all data generation parameters, we obtained a balanced dataset with high variation in appearance.
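As a rough illustration of how class balance can be quantified, the sketch below computes the per-class share and a simple imbalance ratio from a list of labels. The function names and the toy label list are hypothetical and are not taken from any of the datasets discussed here.

```python
from collections import Counter

CLASSES = ("red", "green", "yellow")


def class_distribution(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts[c] for c in CLASSES)
    if total == 0:
        raise ValueError("no labels from the known classes")
    return {c: counts[c] / total for c in CLASSES}


def imbalance_ratio(labels):
    """Most frequent class count divided by least frequent (1.0 = balanced)."""
    counts = Counter(labels)
    per_class = [counts[c] for c in CLASSES]
    if min(per_class) == 0:
        return float("inf")  # at least one class is missing entirely
    return max(per_class) / min(per_class)


# Hypothetical labels for illustration only: yellow is heavily underrepresented.
toy_labels = ["red"] * 50 + ["green"] * 45 + ["yellow"] * 5
print(class_distribution(toy_labels))
print(imbalance_ratio(toy_labels))
```

A ratio close to 1.0 indicates a balanced dataset, while large values flag a class that the classifier will see too rarely during training.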
To compare ANYVERSE with all four real-world datasets, we perform an all-vs-all experiment. We train five ResNet-8 classifiers, one per dataset, and then evaluate each of them on the testing sets of all five datasets. When a dataset provides no training-testing division, we use an 80:20 split. This gives 25 experiments. The following table reports the results in terms of average precision (the F1 score yields similar results). These values measure how well a classifier trained on one dataset performs on other distributions and, thus, how well it generalizes to unexpected real situations.
As you can see, the numbers on the diagonal are all very close to 1.0, because a classifier trained on a given dataset performs very well on that dataset's own testing set. The off-diagonal numbers are more diverse. Some of them, for example when training with ANYVERSE and testing on LISA, are close to 98%. Yet other combinations show very poor results, for instance when training on LISA and testing on Bosch. There are two reasons for this:
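The structure of the all-vs-all experiment can be sketched as follows. This is a minimal outline under stated assumptions, not the code used in the study: `split_80_20` and `all_vs_all` are hypothetical names, and `train_fn` and `eval_fn` are placeholders for the ResNet-8 training routine and the average precision computation. The split is a plain random shuffle, which is also an assumption.

```python
import random


def split_80_20(samples, seed=0):
    """Shuffle samples and split them into 80% training and 20% testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def all_vs_all(datasets, train_fn, eval_fn):
    """Train one model per dataset and score it on every testing set.

    datasets: dict mapping name -> (train_samples, test_samples)
    train_fn: callable(train_samples) -> model (placeholder for training)
    eval_fn:  callable(model, test_samples) -> float (placeholder for the metric)
    Returns a dict mapping (train_name, test_name) -> score.
    """
    results = {}
    for train_name, (train_set, _) in datasets.items():
        model = train_fn(train_set)  # one model per training dataset
        for test_name, (_, test_set) in datasets.items():
            results[(train_name, test_name)] = eval_fn(model, test_set)
    return results
```

With five datasets, the returned dictionary has 5 × 5 = 25 entries: the `(name, name)` keys form the diagonal of the results table and all other keys are cross-dataset evaluations.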
- LISA is very unbalanced: it has few yellow lights for training (<100), so the classifier cannot properly learn the difference between red and yellow from LISA. In fact, it classifies 90% of the Bosch red lights as yellow.
- Bosch and LISA present different levels of global illumination (average channel values). On average, the channel values of LISA crops are more than 20 units higher than those of Bosch crops, which indicates that the Bosch images were captured under darker conditions.
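The illumination comparison can be illustrated with a small sketch that averages channel values per crop and then per dataset. It assumes each crop is a grid of `(R, G, B)` tuples with values in 0..255 and that all three channels are pooled into one average; the exact averaging used in the study is not specified, and the function names are hypothetical.

```python
def mean_channel_value(crop):
    """Average of all channel values in one crop.

    crop: rows of pixels, each pixel an (R, G, B) tuple with values in 0..255.
    """
    total = 0
    count = 0
    for row in crop:
        for pixel in row:
            total += sum(pixel)
            count += len(pixel)
    return total / count


def dataset_illumination(crops):
    """Mean of the per-crop averages across a whole dataset."""
    values = [mean_channel_value(crop) for crop in crops]
    return sum(values) / len(values)


# Hypothetical 2x2 crops, for illustration only.
dark_crop = [[(10, 20, 30)] * 2] * 2
bright_crop = [[(40, 50, 60)] * 2] * 2
gap = dataset_illumination([bright_crop]) - dataset_illumination([dark_crop])
print(gap)
```

A consistently positive gap between two datasets suggests that one of them was captured under brighter conditions, which a classifier trained on the other may not handle well.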
In the next figure, we show a simplified version of the previous table. For each training dataset, we plot the mean average precision obtained on the testing sets of the other four datasets. Error bars indicate the standard deviation. ANYVERSE and DTLD outperform the others. The result is particularly notable for the high-fidelity synthetic approach, which performs as well as the best real-world dataset and noticeably better than the rest.
In a nutshell, this experiment indicates that, when classifying traffic lights, the performance obtained with high-fidelity synthetic data is comparable to that obtained with real-world data. In addition, because generating new data with ANYVERSE is easier and faster than capturing it in the real world, we could quickly extend the ANYVERSE dataset to improve some of the results. Doing so with real-world data alone would be very expensive and time-consuming.
The capability of ANYVERSE to produce reliable synthetic data in a time-efficient manner proves to be essential for accelerating the development of advanced perception models.
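The per-dataset summary behind such a plot can be sketched as below: for each training dataset, take its scores on the other datasets and compute their mean and standard deviation. The function name is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev


def off_diagonal_summary(scores, names):
    """Summarise how each training dataset scores on the other datasets.

    scores: dict mapping (train_name, test_name) -> score
    names:  list of dataset names
    Returns a dict mapping train_name -> (mean, standard deviation).
    """
    summary = {}
    for train_name in names:
        # Skip the diagonal entry, i.e. the dataset's own testing set.
        others = [scores[(train_name, t)] for t in names if t != train_name]
        summary[train_name] = (mean(others), pstdev(others))
    return summary
```

A high mean with a small standard deviation indicates a training set whose classifier generalizes consistently across the other distributions.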