Don't let poor data become your perception system's kryptonite
Poor data… the most dangerous villain that advanced perception system developers need to face and defeat if they want to develop an accurate deep learning model.
A little hint: synthetic data may help… a lot.
If your data is poor, your deep learning model is useless
Poor data quality is enemy number one to the use of deep learning for advanced perception development and many other use cases. But why? The answer is common sense: deep learning models demand high-quality, accurate data from the very early stages, first in the data used to train the predictive model, and later in the new data that model uses to make future decisions.
To properly train a DL model, the data must meet exceptionally high standards of accuracy, breadth, and quality.
The data must be right:
- Physically correct
- Properly labeled
- De-duplicated
But it must be the right data for the specific model too:
- Unbiased
- Sensor-specific
- With enough scene variability
Most data gatherers focus on one criterion or another, but for computer perception deep learning models you have to consider all of them simultaneously.
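A minimal sketch of what checking the first set of criteria can look like in practice, assuming a hypothetical dataset layout of `images/*.png` with matching `labels/*.txt` files (none of this comes from the article itself):

```python
# Illustrative sanity checks on an image dataset: exact duplicates and missing labels.
# The directory layout and file extensions are assumptions, not a real project's structure.
import hashlib
from pathlib import Path

def find_exact_duplicates(image_dir: str) -> dict[str, list[Path]]:
    """Group image files by content hash to spot exact duplicates."""
    groups: dict[str, list[Path]] = {}
    for path in Path(image_dir).glob("*.png"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups.setdefault(digest, []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

def find_unlabeled_images(image_dir: str, label_dir: str) -> list[Path]:
    """List images that have no corresponding label file."""
    return [
        img for img in Path(image_dir).glob("*.png")
        if not (Path(label_dir) / f"{img.stem}.txt").exists()
    ]

if __name__ == "__main__":
    print("Duplicate groups:", find_exact_duplicates("images"))
    print("Unlabeled images:", find_unlabeled_images("images", "labels"))
```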
Why? Let’s take the most commonly discussed case right now as an example: self-driving cars. If your self-driving car kills someone on the road, whose fault is it? You can blame the poor quality of your model’s training data, but that won’t help, will it? We are talking about advanced perception applications where reliability is a must, and therefore the data must be reliable too.
The cost of poor data

As Rongala A. shares in his research study Cost of Bad Data for Organizations, attempts to quantify the financial impact of poor data have led to some pretty shocking numbers, among others:
- Gartner states that organizations lose an average of $13.3 million yearly to poor data.
- Cio.com states that 80% of companies believe they have lost revenue due to data challenges.
- CrowdFlower states that data scientists spend 60% of their time cleaning and organizing data.
- Pragmaticworks states that 20 to 30% of operating expenses are due to bad data.
In addition to these, you should consider other non-financial costs such as manual labeling, the creation of customized data, time-consuming processes, and a lack of scalability and adjustability, all of which ultimately make you less efficient and overload your team.
How to deal with poor data and data gaps in machine learning
Now that you know that poor data has the power to ruin your deep learning model, consume too much of your team’s time, and waste a huge part of your budget, what are you going to do?

- Model complexity
You can build a simpler model with fewer parameters. Such a model is less susceptible to over-fitting, and this approach is often used to improve classification and prediction.
The problem is that this method has clear limits: it may work for simpler models and applications, but technologists seem to be moving in the opposite direction, toward ever larger and more complex models.
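As an illustration only (not taken from the article), a deliberately small PyTorch CNN shows what "fewer parameters" means in practice; the layer sizes are arbitrary choices:

```python
# Minimal sketch of a low-capacity classifier, which tends to over-fit less on small datasets.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),   # few filters keeps the parameter count low
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global pooling instead of large fully connected layers
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(1)
        return self.classifier(x)

model = SmallCNN()
print(sum(p.numel() for p in model.parameters()), "trainable parameters")
```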
- Transfer learning
Transfer learning is applied to DL and neural networks: you take a pre-built model and adjust it on the small dataset that you have.
You can also reuse neural networks already trained on a problem similar to yours, usually leaving the network architecture unchanged and reusing some of the model weights.
This is useful when the new dataset is small and not sufficient to train the model from scratch. But as you can imagine, even if the model you choose has already been trained and tested, the poor new data you introduce to it (and that it uses to make future decisions) will still affect its performance and results.
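A minimal sketch of this idea, assuming torchvision (version 0.13 or later) and an ImageNet-pretrained ResNet-18 as the reused network; this is one common recipe, not the article’s specific method:

```python
# Reuse a pretrained backbone, freeze its weights, and train only a new classification head
# on the small target dataset.
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int) -> nn.Module:
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained on ImageNet
    for param in model.parameters():
        param.requires_grad = False            # freeze the reused weights
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new, trainable head
    return model

model = build_transfer_model(num_classes=5)
```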
- Data augmentation
Data augmentation helps you generate new images by applying slight modifications to existing ones. It takes pre-existing samples and modifies them to create new ones, increasing the number of training samples. Common data augmentation techniques are scaling, rotation, and affine transforms.
These image-processing operations are often used as pre-processing techniques to make image classification models built with CNNs more robust and to minimize the effect of insufficient data. Still, their ability to alleviate the problem of poor data is not clear.
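For reference, the techniques mentioned above (scaling, rotation, affine transforms) can be expressed with torchvision transforms; this is a generic sketch, and the dataset path is a hypothetical placeholder:

```python
# A typical augmentation pipeline applied on the fly during training.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # random scaling and cropping
    transforms.RandomRotation(degrees=15),                      # small random rotations
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # affine translations
    transforms.ToTensor(),
])

# Hypothetical usage with an ImageFolder-style dataset:
# from torchvision.datasets import ImageFolder
# dataset = ImageFolder("data/train", transform=train_transforms)
```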
- Synthetic data
Synthetic data empowers you to artificially generate samples that mimic real-world data and to complement your datasets with data that is difficult (sometimes impossible) to get in the real world, adding custom scene variability or corner cases, for example. And not only that: synthetic data alone is already being tested to train object detection algorithms, with promising results, as you can read in the paper RarePlanes: Synthetic Data Takes Flight.
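One simple way to complement a real dataset with synthetic samples, assuming the synthetic images are exported to an ImageFolder-style directory (the paths and export format are assumptions, not Anyverse’s actual workflow):

```python
# Concatenate real and synthetic samples into a single training set.
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

to_tensor = transforms.ToTensor()
real_data = ImageFolder("data/real/train", transform=to_tensor)            # hypothetical path
synthetic_data = ImageFolder("data/synthetic/train", transform=to_tensor)  # hypothetical path

train_loader = DataLoader(
    ConcatDataset([real_data, synthetic_data]),  # the model trains on the combined set
    batch_size=32,
    shuffle=True,
)
```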
After analyzing these 4 paths, synthetic data seems to be the best-positioned alternative to fight bad data, right?
Can synthetic data bridge the AI training data gap?
Advanced perception systems’ deep learning models don’t just learn by themselves, at least not yet. They need to be trained with data that contains enough information to help them generalize when the system sees new data it hasn’t seen before. And “enough” is the keyword here… Normally, real-world data is not enough, and even worse, it can be poorly labeled, making it inaccurate, or it can add bias to the system, causing the model not to perform as expected or leading it to erroneous results.

The need for synthetic data in the development of perception systems’ deep learning models is a fact, and developers can’t afford to look the other way: their competitors have already jumped on the train and are ready to anticipate and overcome their data gaps.
About Anyverse™
With Anyverse™ you can accurately simulate any camera sensor and decide which one will perform best with your perception system. No more complex and expensive experiments with real devices, thanks to our state-of-the-art photometric pipeline.