How Synthetic Data Can Disrupt Machine Learning at Scale


Machine learning (ML) is a continuously evolving process that requires large, diverse and carefully labeled datasets for training ML algorithms. But collecting and labeling datasets with millions of real-world elements is time-consuming and expensive. This has vaulted synthetic data into the spotlight as a favored tool for training. Let’s see how synthetic data offers a reprieve from the data quality conundrum and reshapes large-scale ML deployments.

Enterprises want to leverage artificial intelligence (AI) and machine learning (ML) more than ever. A 2021 survey by Algorithmia found that 76% of organizations prioritize AI/ML over other IT initiatives, while 71% have increased their annual spending on AI/ML.

The flip side is that many ML projects never come to fruition. A 2020 survey by International Data Corporation (IDC) found that 28% of the 2,000 IT decision-makers involved had witnessed their AI/ML initiatives fail.

In a similar vein, a 2021 report by Wakefield Research for Alation found that around 87% of respondents identified data quality issues as the biggest hurdle to successfully implementing AI in their organizations. Bad data has already cost businesses trillions of dollars: IBM estimates that it costs the U.S. more than $3 trillion every year.

Generating good-quality training data for machine learning can be tricky. Errors in measurement or a poor understanding of the requirements of the machine learning model can erode data quality. Data scientists are trying to address this issue in various ways. One solution is to cleanse data before feeding it to the machine learning algorithm. However, this can take a lot of time.
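To make that cleansing step concrete, here is a minimal, illustrative sketch of the kind of pre-training pass described above. It assumes the raw data sits in a pandas DataFrame with a hypothetical "label" column; a real pipeline would be tailored to the dataset and to the model's requirements.

```python
# A minimal sketch of a pre-training cleansing pass. The DataFrame layout and
# the "label" column name are hypothetical, for illustration only.
import pandas as pd

def basic_cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, remove rows with missing labels, and fill numeric gaps."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["label"])  # unlabeled rows are unusable for supervised training
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    return df
```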

On the other hand, synthetic data is a lot more cost-effective and less time-intensive. More importantly, it improves the data quality critical to the effectiveness of a machine learning model and the success of the project.

“Synthetic data is very effective at improving data quality in learning models, and we have experienced success using it. It can be challenging obtaining real production data, and when we do, it frequently needs close scrutiny to remove errors and ensure accuracy of labeling. Additionally, the data invariably requires significant redaction,” says Richard Whitehead, chief evangelist and CTO at Moogsoft, an AIOps company.

With more ML practitioners throwing their weight behind synthetic data, its use is expected to grow further. Gartner predicts that by 2024, the use of synthetic data and transfer learning will halve the volume of real-world data needed for machine learning.

IDC also observes that the use of synthetic data for machine learning has been steadily increasing, particularly in use cases that lack precedent or sufficient historical data (such as autonomous driving and computer vision). It expects that the majority of the data used for AI/ML projects will be synthetic within the next five years.

Learn more: Why Federated Learning Is Pivotal to Privacy-Preserving Machine Learning

What Is Synthetic Data?

Synthetic data refers to information generated through computer simulations instead of being collected or measured in the real world.

Though it is artificial, it is meant to reflect real-world data and share its mathematical and statistical properties. Unlike with real-world data, ML practitioners have complete control over a synthetic dataset, which lets them dictate the degree of labeling, the sample size, and the noise level. Synthetic data also helps address privacy and security concerns that arise when real-world data contains sensitive and personal user information. That makes it much easier for ML practitioners to publish, share, and analyze synthetic datasets with the wider ML community without worrying about exposing personally identifiable information or facing the ire of data protection authorities.
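As a rough illustration of that control, the sketch below generates a small labeled dataset in which sample size, class balance and noise level are explicit parameters. The two-cluster setup, names and distributions are purely hypothetical and chosen for simplicity.

```python
# Illustrative only: labels, class balance and noise are controlled by the
# generator's parameters rather than discovered in collected data.
import numpy as np

def make_synthetic_dataset(n_samples=1_000, noise_level=0.1, positive_fraction=0.5, seed=0):
    rng = np.random.default_rng(seed)
    y = (rng.random(n_samples) < positive_fraction).astype(int)    # labels known by construction
    centers = np.where(y[:, None] == 1, [2.0, 2.0], [-2.0, -2.0])  # two well-separated clusters
    X = centers + rng.normal(scale=1.0 + noise_level, size=(n_samples, 2))
    return X, y

X, y = make_synthetic_dataset(n_samples=500, noise_level=0.3)
```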

Synthetic data has seen a lot of traction in self-driving vehicles, robotics, healthcare, cybersecurity and fraud protection. Google and Uber have leveraged it extensively to improve the autonomy levels of their self-driving cars. Likewise, Amazon reportedly uses synthetic data to train Alexa’s language tool.

Synthetic data can be classified into three categories: fully synthetic data, hybrid synthetic data and partially synthetic data.

When it comes to synthetic data generation, there are various techniques to build and perfect synthetic datasets in line with the complexity of the use case. Less complicated options include statistical modeling and Monte Carlo simulations, while the more complex options include engine-based simulation, agent-based synthetic data generation and generative adversarial networks (GAN).

Monte Carlo simulation is a mathematical technique that supports risk analysis by building models of possible outcomes and substituting a range of random values for any input that carries uncertainty. Agent-based synthetic data generation first creates a physical model of the real-world data and then reproduces random data using that same model.
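The following is a minimal Monte Carlo sketch in the spirit of that description: uncertain inputs are drawn from assumed distributions, the model is run many times, and the distribution of outcomes is examined. The revenue/cost model and all its numbers are purely illustrative.

```python
# Toy Monte Carlo risk analysis: sample uncertain inputs many times and look
# at the resulting outcome distribution. All figures are made up.
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

unit_price = rng.normal(loc=10.0, scale=1.5, size=n_trials)  # uncertain price
units_sold = rng.poisson(lam=1_000, size=n_trials)           # uncertain demand
fixed_cost = 8_000.0

profit = unit_price * units_sold - fixed_cost
print(f"mean profit: {profit.mean():.0f}")
print(f"probability of a loss: {(profit < 0).mean():.2%}")
```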

A GAN pits two neural networks against each other. The first network, called the generator, creates synthetic data from random input. The second network, called the discriminator, tries to tell whether its input is real or fake. The discriminator is trained on both real and generated data, and because the two networks are trained together, the generator gradually learns to produce data the discriminator can no longer distinguish from the real thing.
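Below is a highly condensed, toy-scale sketch of that generator/discriminator loop. It uses PyTorch (our choice, not something specified here) and learns to mimic a one-dimensional Gaussian, so it should be read as an illustration of the training dynamic rather than a production GAN.

```python
# Toy GAN: the generator maps random noise to samples, the discriminator
# scores samples as real or fake, and the two are trained adversarially.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2_000):
    real = torch.randn(64, 1) * 0.5 + 3.0   # "real" samples drawn from N(3, 0.5)
    fake = generator(torch.randn(64, 8))    # synthetic samples from random noise

    # Discriminator step: learn to label real as 1 and fake as 0.
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: learn to make the discriminator call fakes real.
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```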

“The core intuition is that synthetic data — by virtue of being artificially generated — allows us to introduce knowledge into AI models that we wouldn’t be able to incorporate via real data,” says Farhan Choudhary, Principal Analyst at Gartner.

Choudhary explains, “Many AI, machine learning and analytics projects suffer from delays caused by obtaining production data for development and testing. Synthetic data generation helps reduce the bias in datasets by representing data with appropriate balance, density, distribution and other parameters, ultimately solving the data quality problem in ML projects.”

According to Gartner, synthetic data can:

  • Introduce domain-specific knowledge in the training of AI models, thereby improving the quality of model predictions.
  • Complete datasets where data is scarce, unavailable or unbalanced, and generate previously unseen scenarios for better model training.
  • Help in testing complex AI models and increase their robustness.
  • Address combinatorial explosion, especially in portfolio optimization and initiative sequencing business problems, where real data is impossible to obtain.

Learn more: How To Drive Human-AI Collaboration in a Post-COVID-19 World

Complexities of Synthetic Data

Though adoption has been growing, generating and using synthetic data to train machine learning algorithms, and finding people with the right skills to support it, can be challenging.

Choudhary points out that the quality of the generated synthetic data depends on the model that generates the data; hence, not all approaches will yield high-quality results.

“Depending on the approach, synthetic data can still reveal sensitive information, can miss natural anomalies, or not even contribute any significant value over and above the already existing real world data, therefore understanding a wider variety of approaches is recommended,” he adds.

Sriram Subramanian, primary analyst for AI ML lifecycle software at IDC, believes synthetic data is not a panacea. Its effectiveness depends on various factors, such as the source of the data, the accuracy of the generation algorithms, and how close it is to real data.

“Some of the complexities include finding the right source, efforts and time involved, the possibility of bias introduction, and trustworthiness of the generated data,” Subramanian adds.

Learn more: Getting the Data and AI Implementation Right for Your Organization

Best Practices to Use Synthetic Data

To get the most out of synthetic data, ML practitioners should keep a few things in mind. First, they need to ensure the dataset adequately emulates their use case. Whitehead suggests making sure the dataset represents the production environment in terms of both the complexity and the comprehensiveness of its examples.

They also need to ensure the data is clean. “Since the data is synthesized, this is more to do with making sure there are no bugs in the generation. When possible, have subject matter experts validate the dataset,” he adds.
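One possible way to back up that validation with a quick automated check is sketched below: a two-sample Kolmogorov-Smirnov test per feature flags columns whose synthetic distribution drifts away from a real reference sample. The function name and array layout are assumptions for illustration, and this is only one sanity check, not a substitute for subject matter expert review.

```python
# Illustrative sanity check: flag features whose synthetic distribution
# differs markedly from a real reference sample (same feature columns).
import numpy as np
from scipy.stats import ks_2samp

def flag_drifting_features(real: np.ndarray, synthetic: np.ndarray, alpha: float = 0.01):
    """Return indices of features where synthetic data deviates from the real sample."""
    flagged = []
    for col in range(real.shape[1]):
        stat, p_value = ks_2samp(real[:, col], synthetic[:, col])
        if p_value < alpha:  # distributions differ more than chance would suggest
            flagged.append(col)
    return flagged
```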

ML practitioners also need to accept that synthetic data may not work for every use case. Choudhary cautions that the opportunity needs to be evaluated first to determine whether synthetic data can actually solve the problem.

He also advocates upskilling within the organization to use synthetic data generation techniques, and collaborating with researchers, academia and startups working on the topic.

Subramanian feels that companies should have an enterprise-wide data intelligence strategy and invest in the right tools for DataOps and data labeling solutions. 

Conclusion  

Machine learning is highly sensitive to poor-quality input data. By leveraging synthetic data, ML practitioners are beginning to close this gap. Compared with real-world data, synthetic data is often more reliable and more cost-effective to generate. Many experts feel it is even better than real-world data for training purposes.

As the application of synthetic data has grown, so has awareness of it. More companies across sectors including computing, healthcare, banking and automotive now want to tap it to improve their machine learning-based products and services.

Do you think synthetic data generation is better than real-world data for training machine learning algorithms? Comment below or let us know on LinkedIn, Twitter, or Facebook. We would love to hear from you!