Benefits of synthetic data for privacy preservation in computer vision

Computer vision technology has long existed in the automotive world, but it is becoming increasingly important. Its applications are varied and useful, especially in driving safety and autonomous functions. It is estimated that over 20 percent of cars sold in the next 10 years will have advanced autonomous driving functions, putting the privacy preservation of individuals' data in the spotlight.

Cars equipped with Advanced Driver Assistance Systems (ADAS), in-cabin monitoring systems, and autonomous driving capabilities, along with fully self-driving cars, are becoming more common on city streets. In response, scholars and policymakers have raised concerns over the volume of data required to create and validate the software underlying this technology, which could lead to the over-surveillance of people's movements and actions.

How is this sensitive data being collected? Are developers complying with the principle of privacy preservation?

Privacy legislation at a glance

What does the legislation say about privacy preservation? According to UNCTAD, 71% of the world's nations have already enacted privacy legislation and 9% have drafted legislation.

In 2016, the European Union adopted the General Data Protection Regulation (GDPR), a formal regulation on information privacy. The regulation set out individuals' rights over their data and established requirements for companies deemed data controllers. Following the adoption of GDPR and similar policies worldwide, data privacy has become a priority for companies, and best practices have spread beyond the European Union.

Although the United States does not yet have a comprehensive data privacy law equivalent to the EU's, several laws cover specific aspects of data privacy, such as health data or data collected from children, and some state laws have drawn comparisons to the EU legislation. An example is the California Consumer Privacy Act, which was enacted in 2018 to enhance privacy and consumer protection for the state's residents.

In reality, because the data collected by many companies is unregulated in most states, these companies can use, sell, or share people's data without consent, undermining basic privacy preservation.

Data collection and privacy preservation

This year, Matthew Guariglia, a policy analyst focused on surveillance and policing, wrote an article for the Electronic Frontier Foundation detailing how the collection of video, images, audio, and location data for the design and usage of autonomous cars could easily be abused by companies, law enforcement agencies, and hackers.

Privacy preservation is now in the spotlight. Protecting personal data is crucial to giving individuals agency over how their information is used and shared. Companies often use data to tailor their products or ads to their users or to improve their technologies, but the overcollection of data, especially personally identifiable information or sensitive data, can pose safety risks to individuals. Various hacks in recent years have exposed individuals' biographical, contact, and financial information, among other data. Even where real data has been anonymized, people have found ways to trace it back to individuals, demonstrating the fragility of real data.
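
To make the re-identification risk concrete, here is a minimal sketch in Python of a k-anonymity style check: it groups records by a set of quasi-identifiers and reports the size of the smallest group. The field names and records are invented for illustration and do not come from any real dataset. A smallest group of size 1 means at least one person is unique on those fields and could potentially be singled out even though no name is stored.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for all of the given quasi-identifier fields."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical "anonymized" trip records: no names, but still linkable fields.
records = [
    {"zip": "90210", "birth_year": 1985, "vehicle": "sedan"},
    {"zip": "90210", "birth_year": 1985, "vehicle": "sedan"},
    {"zip": "10001", "birth_year": 1990, "vehicle": "suv"},
]

k = smallest_group_size(records, ["zip", "birth_year", "vehicle"])
print(f"Smallest group size (k): {k}")
```

In this toy example the third record is the only one with its combination of values, so removing names alone would not prevent it from being linked back to one person.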

Autonomous systems require vast amounts of visual data involving people, so they are potentially exposed to the same risks.

These systems need to understand what they see in the real world and react accordingly, and data is essential to teach these machines what the world around them is and how to interpret it. To implement this "understanding" in a computer vision-based system, engineers use deep neural networks trained for different purposes, such as object detection, object segmentation, or depth estimation.

No matter what the application is (driver monitoring, baby seat detection, and so on), neural networks need data, and lots of it. This means gathering thousands of pictures of real people in many situations. Getting these images in the real world while complying with privacy legislation is far from easy.

Advantages of synthetic data for privacy preservation

Using synthetic data in the design and training of computer vision-based systems drastically reduces some of these privacy concerns. It makes it possible to protect both people's safety and their privacy, two goals commonly viewed as a tradeoff in technology.

Synthetic data allows data scientists and machine learning engineers to gather data to train their ML models without involving real people, including children. It also spares them from traveling thousands of miles, filming thousands of scenes, hiring a labeling company, and burning through their budgets, only to realize when they begin training their deep learning models that they have not obtained enough data, or enough variability in the data, for the neural networks to learn.

In short, synthetic data provides a convenient solution for creating efficient and scalable datasets without including individuals’ data.

Synthetic data has a variety of use cases. The following are instances where, when developing computer vision-based systems, it offers privacy preservation advantages that real datasets cannot match.

Privacy preservation advantages for in-cabin monitoring systems

Since 2023, Euro NCAP has required OEMs to conduct in-depth assessments of Driver State Monitoring Systems. To comply with the highest safety standards, vehicles must have sensing, driver state, and vehicle response capabilities.

In-cabin monitoring systems can evaluate driver conditions such as distraction, fatigue, and unresponsiveness, which helps trigger safety mechanisms. Using synthetic data, situations such as people not wearing seatbelts, texting while driving, or being drowsy at the wheel can be recreated without real videos or actors.

Using synthetic data can also help reduce some of the biases that exist in real-world data and that raise privacy issues of their own. Rather than needing to capture images of people of different races, genders, or physical conditions, these can all be recreated with synthetic data, covering a far wider range of cases. Systems tested with synthetic renderings of people of different ethnic groups and backgrounds are better placed to perform well worldwide.
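
As a rough illustration of how such coverage might be planned, the Python sketch below enumerates every combination of a few in-cabin attributes so that each value appears equally often in the scenario list. The attribute names and values are hypothetical placeholders, not the taxonomy or API of any particular synthetic data tool.

```python
from collections import Counter
from itertools import product

# Hypothetical attribute values; a real project would define its own taxonomy.
DRIVER_STATES = ["attentive", "texting", "drowsy", "unresponsive"]
SEATBELT = ["fastened", "unfastened"]
APPEARANCE_PROFILES = ["profile_a", "profile_b", "profile_c"]
AGE_GROUPS = ["young_adult", "adult", "senior"]


def build_scenarios():
    """Return one scenario description per combination of attributes."""
    keys = ("driver_state", "seatbelt", "appearance_profile", "age_group")
    combos = product(DRIVER_STATES, SEATBELT, APPEARANCE_PROFILES, AGE_GROUPS)
    return [dict(zip(keys, combo)) for combo in combos]


scenarios = build_scenarios()

# Check that every appearance profile is represented equally often.
profile_counts = Counter(s["appearance_profile"] for s in scenarios)
print(len(scenarios), dict(profile_counts))
```

Because the list is a full grid, no profile is over- or under-represented by construction. Each scenario dictionary could then be handed to whatever rendering pipeline is in use.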

Privacy preservation advantages for self-driving system development

Synthetic data is also used to train and test autonomous car software. To ensure the safety of users in such high-stakes technology, the models must be trained using large datasets (in size and diversity) that take into account a variety of edge cases or adverse weather conditions that one could encounter on the road.

Synthetic data reduces the need to gather real data containing the movements and images of real people. When developing an autonomous driving system, for instance, accurate and robust AI is mandatory. Safety is nonnegotiable: the system must be ready to recognize, and respond appropriately to, any situation that puts people at risk. This means detecting different types of people in rare conditions and being able to handle accidents and emergencies adequately.

Synthetic data allows these situations to be replicated without collecting actual imagery, which in some contexts could compromise people's safety and lead to the surveillance of individuals.

Privacy preservation advantages for security and defense cases

These same principles apply to training perception systems in the field of security and defense. One important application of synthetic data for computer vision is helping detect people in cases of natural disasters and conflicts.

Autonomous systems must be able to operate under any circumstance, even in cases of car accidents, robberies, or potential terrorist attacks. It is also important to ensure that the technology still works under unsafe conditions without putting people at risk. Not only are these situations difficult and expensive to replicate, but any real footage of them shows people in graphic or unsafe conditions. With synthetic data, these edge cases can be tested while both eliminating individual privacy concerns and keeping real people safe.

By using synthetic datasets, the gathering of human data needed to train for these edge cases can be largely eliminated. Because the people represented in synthetic images are virtual renderings rather than real individuals, systems can be trained to handle risky situations without violating anyone's privacy. Training such systems with real images would require footage of people who may be injured or in incredibly vulnerable situations, and who would not want their images used publicly.

Anyverse, preserving individuals' privacy with synthetic data

Anyverse generates synthetic data using a pure spectral path-tracing engine and physically-based sensor simulation, allowing for the creation of accurate datasets that include ground truth and are 100% free of real-world events.

This means:

  • Being able to generate datasets that comply with any data protection regulations worldwide.
  • Reproducing any edge cases that are difficult to find in the real world to achieve the desired variability in the data to develop a robust system.
  • At the same time, gathering sensor-specific data to help reduce the domain gap in the dataset, achieve higher confidence levels, and make sure the autonomous system generalizes to inputs from the real world.

Privacy preservation is more straightforward with synthetic data.

Find out more about Anyverse

Anyverse is a scalable synthetic data software platform to design, train, validate, or fine-tune advanced perception systems based on ML. It provides the power to generate all the data needed in a fraction of the time and cost compared with other real-world data workflows.
