The performance of modern deep neural networks for object detection and semantic scene understanding depends on the data used to train them. However, obtaining data with enough diversity and quality to produce reliable classifiers is not straightforward, because collecting and annotating real-world media is expensive, time-consuming, and error-prone.
It therefore comes as no surprise that synthetic data is widely used in computer vision, mainly because of its versatility, its configurability, and the fact that it is easier to obtain than its real-world counterpart. Nevertheless, not all synthetic data is the same. There are different data generation approaches, most of which rely on game engines to render the final images. Such images lack fidelity to the real world, which is crucial when the trained model has to handle real-life situations. ANYVERSE, on the other hand, applies a sophisticated rendering process and uses realistic physics-based models to mitigate this risk.
In this post, we describe a project we carried out to show how ANYVERSE’s render quality and physical accuracy influence the training of an object detector in the context of autonomous driving/ADAS applications. In particular, we compare the performance of a model trained on SYNTHIA (an alternative synthetic dataset) with that of a model trained on ANYVERSE, and we test each model on real-world datasets.
Firstly, images from SYNTHIA were generated as a random perturbation of a virtual world within the Unity framework. SYNTHIA uses a hand-modeled game world and rasterization-based image synthesis. For our project, we use a subset of 5K images corresponding to those scenes where the camera viewpoint is the same as the car viewpoint.
Secondly, we generate the ANYVERSE dataset taking into account some of SYNTHIA’s characteristics: class distribution, camera viewpoint, randomness of object location, and scene layout, for a fairer comparison. In the end, we generated a total of 1.6K images for the project.
Below we can see some sample images from both datasets. Differences in illumination quality and level of detail are visible even to an untrained eye:
For this project, only five classes are considered: bicycle, car, motorcycle, rider, and pedestrian. The distribution of these within each dataset can be observed in the graph below. Whereas the SYNTHIA dataset consists of 270K annotations, the ANYVERSE one has only 37K.
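To put these totals in perspective, here is a rough back-of-the-envelope calculation of annotation density. It is only a sketch: it assumes the rounded figures quoted in this post (5K and 1.6K images, 270K and 37K annotations) and that the 270K SYNTHIA annotations belong to the 5K-image subset we use.

```python
# Approximate figures quoted in this post (rounded to the nearest "K").
# Assumption: the annotation totals refer to the image subsets used here.
datasets = {
    "SYNTHIA": {"images": 5_000, "annotations": 270_000},
    "ANYVERSE": {"images": 1_600, "annotations": 37_000},
}


def annotations_per_image(stats):
    """Average number of annotated objects per image."""
    return stats["annotations"] / stats["images"]


density = {name: annotations_per_image(s) for name, s in datasets.items()}

for name, value in density.items():
    print(f"{name}: ~{value:.0f} annotations per image")
# SYNTHIA: ~54 annotations per image
# ANYVERSE: ~23 annotations per image
```

Under these assumptions, the ANYVERSE set is smaller both in image count and in objects per image.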
We train two Faster R-CNN models, one per synthetic dataset, each with a ResNet backbone initialized with COCO weights, on the five classes above. We then evaluate both on two real-world datasets, Cityscapes and BDD100K, using the COCO detection metrics.
Cityscapes contains 5K real-world images recorded in the streets of 50 different European cities over several months (spring, summer, and fall), during the daytime, and in good/medium weather conditions. We run the evaluation on the validation set, which contains 500 images and 10K annotations. The following figure shows the per-category precision-recall curves for an IoU threshold of 0.5.
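For readers unfamiliar with how a precision-recall curve at a given IoU threshold is built, here is a minimal sketch for one class in one image. It is a simplified illustration, not the COCO evaluation code we ran: the function names and the `(x1, y1, x2, y2)` box format are our own assumptions, and details such as crowd regions and multi-image accumulation are left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_points(detections, ground_truth, iou_threshold=0.5):
    """Return (recall, precision) points, one per detection.

    detections: list of (score, box); ground_truth: list of boxes.
    Detections are processed from highest to lowest score, and each
    ground-truth box can be matched at most once.
    """
    matched = set()
    true_pos = 0
    false_pos = 0
    points = []
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the best still-unmatched ground-truth box for this detection.
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_pos += 1
        else:
            false_pos += 1
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / len(ground_truth) if ground_truth else 0.0
        points.append((recall, precision))
    return points


# Toy example: two ground-truth objects, three detections.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (21, 21, 30, 30))]
points = precision_recall_points(dets, gt)
```

A detection that overlaps no ground-truth box by at least the threshold counts as a false positive, and every ground-truth object left unmatched lowers recall. That is why many missed small objects pull the curves toward low recall.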
For all classes, the model trained on ANYVERSE outperforms the one trained on SYNTHIA. Except for the dominant class (pedestrian), the ANYVERSE and SYNTHIA curves are far apart. In general, recall is very low for both models because of the large number of undetected small objects, as the next figure shows. This graph shows the average precision (AP) across different intersection over union (IoU) thresholds and different bounding-box sizes (small, medium, and large), following the COCO metrics. For the most part, the AP obtained with ANYVERSE is about twice that obtained with SYNTHIA.
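As a reminder of what the COCO size categories and IoU thresholds mean, here is a small sketch. The area limits (32² and 96² pixels) and the ten thresholds from 0.50 to 0.95 follow the COCO detection metric definitions. The helper names are ours, and we use the bounding-box area as a stand-in for the annotation area COCO normally uses.

```python
# COCO area limits in square pixels.
SMALL_MAX_AREA = 32 ** 2   # below this: "small"
MEDIUM_MAX_AREA = 96 ** 2  # below this (and not small): "medium"

# COCO averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def size_bucket(width, height):
    """Classify a box as small, medium, or large by its pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def mean_ap(ap_per_threshold):
    """Average AP values given as a dict {IoU threshold: AP}."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# A distant pedestrian of 15x40 pixels falls into the "small" bucket.
print(size_bucket(15, 40))  # small
```

Objects in the "small" bucket cover fewer than 1,024 pixels, which helps explain why they are the hardest to detect for both models.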
The images below show some qualitative results from Cityscapes evaluated with both models. It becomes evident that the model trained on SYNTHIA fails to detect the majority of cars, even when they are large. The same goes for some pedestrians, even though pedestrian is the most populated class. In comparison, the model trained on ANYVERSE, with fewer samples and annotations, correctly locates and classifies most of the objects present in the scene.
BDD100K is a large and diverse driving dataset captured by a real driving platform. It covers several locations in the USA, varying weather conditions, and different times of the day. This dataset provides 100K images with rich annotations. Evaluation is performed on the validation set, which contains 10K images with 67.6K annotations. As the following figures show, the results are very similar to those on Cityscapes. There is a considerable gap between the ANYVERSE and SYNTHIA curves for most classes. Moreover, the model trained on ANYVERSE achieves twice the AP of the one trained on SYNTHIA.
More interestingly, let’s take a look at some visual examples. The first row contains images taken on a rainy day, with raindrops on the car’s windscreen. Neither the SYNTHIA nor the ANYVERSE dataset used here includes images with these characteristics. However, the model trained on ANYVERSE accurately detects all vehicles, whereas the one trained on SYNTHIA fails to generalize to this situation and misses the cars in the rain.
In summary, we have seen that the quality of a synthetic dataset influences the performance of a state-of-the-art object detection system.
This is very relevant for current and future developments in autonomous vehicle/ADAS applications. There are still many tests and improvements to be made, but the results of this test are promising. At ANYVERSE we do our best to boost the realism of our images and provide the right dataset for each specific problem.
As a bonus, the next video shows a validation test run on video footage from Paris. Pay special attention to the unsteadiness of the detections from the model trained on SYNTHIA (right) compared with those from the model trained on ANYVERSE (left).