Synthetic data to develop a trustworthy autonomous driving system | Chapter 10



Hamid Serry, WMG Graduate Trainee, University of Warwick

Last week we discussed the setup of the Anyverse Synthetic Data Platform, focusing on the sensor configuration and the simulation environment. This week we run through the steps taken to build a dataset similar to KITTI, and how this process was validated before generating a fully rendered dataset.

KITTI Dataset Composition

In Chapter 2 of the series, we calculated the class composition of the KITTI dataset and created appropriate subsets of the data, i.e. training, testing, and validation splits. The split was performed so that the class distribution in each subset resembles the distribution in the full dataset.

Moreover, for each class, it is worth computing the probability that a given number of instances is present in one image. This lets us answer questions such as “Given an image, what is the probability that 4 cars are present in the scene?”. Figure 1, for example, shows the probability that a given number of cars is present in a frame captured by the camera.

Figure 1 – Probability Distribution for Cars in the KITTI Dataset

Given the aforementioned probabilities for each class, we defined a function that generates a random number of cars, pedestrians, cyclists and vans to spawn in the scene.
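As a rough illustration of such a generator, here is a minimal Python sketch. The probability tables are placeholder values rather than the measured KITTI statistics, the names `COUNT_PROBS` and `sample_scene_counts` are hypothetical, and sampling each class independently is an assumption made for simplicity.

```python
import random

# Hypothetical per-class tables: index k holds P(k instances in one image).
# Placeholder values for illustration only, NOT the measured KITTI statistics.
COUNT_PROBS = {
    "car": [0.10, 0.20, 0.25, 0.20, 0.15, 0.10],
    "pedestrian": [0.70, 0.15, 0.10, 0.05],
    "cyclist": [0.85, 0.10, 0.05],
    "van": [0.65, 0.25, 0.10],
}


def sample_scene_counts(count_probs, rng):
    """Draw how many instances of each class to spawn in one scene.

    Assumption: classes are sampled independently of each other.
    """
    counts = {}
    for cls, probs in count_probs.items():
        possible_counts = list(range(len(probs)))
        # Weighted draw of a single count according to the class table.
        counts[cls] = rng.choices(possible_counts, weights=probs, k=1)[0]
    return counts


rng = random.Random(0)  # fixed seed so the draw is reproducible
scene = sample_scene_counts(COUNT_PROBS, rng)
print(scene)
```

Each call returns a dictionary mapping every class to the number of instances to spawn in that scene.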

We tested the generation procedure by drawing 5236 random samples from our generator (the same number of samples as the KITTI training set) and comparing, for each class, the resulting distribution with that of the original dataset. The results are shown in Figures 2 and 3.

Figure 2 – Comparison between Car class distribution in test set and KITTI

Figure 3 – Comparison between KITTI and generated class distributions
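The comparison shown in the figures can also be summarised numerically. The sketch below draws 5236 samples for one class and measures how far the generated distribution is from the target. The `TARGET` table is a placeholder rather than the KITTI values, and the total variation distance is a metric chosen here for illustration, not necessarily the one used in our validation.

```python
import random
from collections import Counter

# Placeholder target: index k holds P(k cars in one image). Not KITTI values.
TARGET = [0.10, 0.20, 0.25, 0.20, 0.15, 0.10]
N_SAMPLES = 5236  # same sample count as quoted in the text


def empirical_distribution(samples, support_size):
    """Relative frequency of each count 0..support_size-1 in the samples."""
    tally = Counter(samples)
    return [tally[k] / len(samples) for k in range(support_size)]


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


rng = random.Random(42)
samples = rng.choices(range(len(TARGET)), weights=TARGET, k=N_SAMPLES)
generated = empirical_distribution(samples, len(TARGET))
tv = total_variation(TARGET, generated)
print(f"total variation distance: {tv:.4f}")
```

A small distance indicates that the generator reproduces the target distribution closely. What counts as small enough is a judgment call that depends on the sample size.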

Dry Runs

Once the generated distributions had been verified to emulate the KITTI class distributions, we could implement this process in the generation scripts in Anyverse Studio. Here, a further technique can be used to validate the spawn rate of objects and ensure everything runs smoothly: the dry run. A dry run is the process of varying environmental variables (such as time of day and weather), spawning objects and relocating the Ego vehicle, all without creating an actual rendered image or camera simulation.

This test ensures that the scene is set up correctly at every iteration, and that no errors arise from missing assets, incorrectly spawned items or scripting issues. It also enables us to track the number of spawned instances of each class, per iteration and overall, allowing for further validation of the distribution and of the total number of vehicles and people in the scenes.
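The bookkeeping behind a dry run can be sketched in plain Python. This is not the Anyverse Studio API: `setup_scene` is a hypothetical stand-in that only picks environment settings and spawn counts at random, and the class, time-of-day and weather lists are assumed for illustration.

```python
import random
from collections import Counter

CLASSES = ("car", "pedestrian", "cyclist", "van")
TIMES_OF_DAY = ("morning", "noon", "dusk")  # assumed options
WEATHER = ("clear", "overcast", "rain")     # assumed options


def setup_scene(rng):
    """Hypothetical stand-in for one scene setup, with no rendering.

    A real setup step would vary the environment, spawn assets and move
    the Ego vehicle, and could raise if an asset is missing.
    """
    env = {"time_of_day": rng.choice(TIMES_OF_DAY), "weather": rng.choice(WEATHER)}
    spawned = {cls: rng.randint(0, 3) for cls in CLASSES}
    return env, spawned


def dry_run(n_iterations, seed=0):
    """Run scene setup repeatedly, tallying spawns and collecting errors."""
    rng = random.Random(seed)
    totals = Counter()
    per_iteration = []
    errors = []
    for i in range(n_iterations):
        try:
            _env, spawned = setup_scene(rng)
        except Exception as exc:
            # Record the failing iteration and carry on with the next one.
            errors.append((i, repr(exc)))
            continue
        per_iteration.append(spawned)
        totals.update(spawned)  # adds this iteration's counts per class
    return totals, per_iteration, errors


totals, per_iteration, errors = dry_run(100)
print(dict(totals), f"errors: {len(errors)}")
```

The per-iteration records can then be compared against the target class distributions, while the error list shows which iterations failed to set up.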


This week we have focused on the validation and verification side of the virtual dataset generation process, looking at how the class distribution is matched to that found in KITTI. We also covered the use of the Dry Run feature before creating any render, to ensure that all the scenes can be set up correctly and that the environment is ready for a batch-rendered dataset to be generated. The next step is to generate a sample dataset for the Highway scene and begin analysing the results.
