You can see the faint stubble coming in on his upper lip, the wrinkles on his forehead, the blemishes on his skin. He isn’t a real person, but he’s meant to mimic one—as are the hundreds of thousands of others made by Datagen, a company that sells fake, simulated humans.
These humans are not gaming avatars or animated characters for movies. They are synthetic data designed to feed the growing appetite of deep-learning algorithms. Firms like Datagen offer a compelling alternative to the expensive and time-consuming process of gathering real-world data. They will make it for you: how you want it, when you want—and relatively cheaply.
To generate its synthetic humans, Datagen first scans actual humans. It partners with vendors who pay people to step inside giant full-body scanners that capture every detail from their irises to their skin texture to the curvature of their fingers. The startup then takes the raw data and pumps it through a series of algorithms, which develop 3D representations of a person’s body, face, eyes, and hands.
The company, which is based in Israel, says it’s already working with four major US tech giants, though it won’t disclose which ones on the record. Its closest competitor, Synthesis AI, also offers on-demand digital humans. Other companies generate data to be used in finance, insurance, and health care. There are about as many synthetic-data companies as there are types of data.
Once viewed as less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it hard to collect. By contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces, say, of different ages, shapes, and ethnicities to build a face-detection system that works across populations.
But synthetic data has its limitations. If it fails to reflect reality, it could end up producing even worse AI than messy, biased real-world data—or it could simply inherit the same problems. “What I don’t want to do is give the thumbs up to this paradigm and say, ‘Oh, this will solve so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it will also ignore a lot of things.”
Realistic, not real
Deep learning has always been about data. But in the last few years, the AI community has learned that good data is more important than big data. Even small amounts of the right, cleanly labeled data can do more to improve an AI system’s performance than 10 times the amount of uncurated data, or even a more advanced algorithm.
That changes the way companies should approach developing their AI models, says Datagen’s CEO and cofounder, Ofir Chakon. Today, they start by acquiring as much data as possible and then tweak and tune their algorithms for better performance. Instead, they should be doing the opposite: use the same algorithm while improving on the composition of their data.
But collecting real-world data to perform this kind of iterative experimentation is too costly and time intensive. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to identify which one maximizes a model’s performance.
To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many individuals to scan in each age bracket, BMI range, and ethnicity, as well as a set list of actions for them to perform, like walking around a room or drinking a soda. The vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthesized data is sometimes then checked again. Fake faces are plotted against real faces, for example, to see if they seem realistic.
Datagen is now generating facial expressions to monitor driver alertness in smart cars, body motions to track customers in cashier-free stores, and irises and hand motions to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data has already been used to develop computer-vision systems serving tens of millions of users.
It’s not just synthetic humans that are being mass-manufactured. Click-Ins is a startup that uses synthetic AI to perform automated vehicle inspections. Using design software, it re-creates all car makes and models that its AI needs to recognize and then renders them with different colors, damages, and deformations under different lighting conditions, against different backgrounds. This lets the company update its AI when automakers put out new models, and helps it avoid data privacy violations in countries where license plates are considered private information and thus cannot be present in photos used to train AI.
Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake client data that let companies share their customer database with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. But synthetic data can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data that the company doesn’t yet have, including a more diverse client population or scenarios like fraudulent activity.
Proponents of synthetic data say that it can help evaluate AI as well. In a recent paper published at an AI conference, Suchi Saria, an associate professor of machine learning and health care at Johns Hopkins University, and her coauthors demonstrated how data-generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York City’s more youthful population but wanted to understand how its AI performs on an aging population with higher prevalence of diabetes. She’s now starting her own company, Bayesian Health, which will use this technique to help test medical AI systems.
The limits of faking it
But is synthetic data overhyped?
When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them fully regurgitate that data.
This might be fine for a firm like Datagen, whose synthetic data isn’t meant to conceal the identity of the individuals who consented to be scanned. But it would be bad news for companies that offer their solution as a way to protect sensitive financial or patient information.
Research suggests that the combination of two synthetic-data techniques in particular—differential privacy and generative adversarial networks—can produce the strongest privacy protections, says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance can be lost in the marketing lingo of synthetic-data vendors, which won’t always be forthcoming about what techniques they are using.
Meanwhile, little evidence suggests that synthetic data can effectively mitigate the bias of AI systems. For one thing, extrapolating new data from an existing data set that is skewed doesn’t necessarily produce data that’s more representative. Datagen’s raw data, for example, contains proportionally fewer ethnic minorities, which means it uses fewer real data points to generate fake humans from those groups. While the generation process isn’t entirely guesswork, those fake humans might still be more likely to diverge from reality. “If your darker-skin-tone faces aren’t particularly good approximations of faces, then you’re not actually solving the problem,” says O’Neil.
For another, perfectly balanced data sets don’t automatically translate into perfectly fair AI systems, says Christo Wilson, an associate professor of computer science at Northeastern University. If a credit card lender were trying to develop an AI algorithm for scoring potential borrowers, it would not eliminate all possible discrimination by simply representing white people as well as Black people in its data. Discrimination could still creep in through differences between white and Black applicants.
To complicate matters further, early research shows that in some cases, it may not even be possible to achieve both private and fair AI with synthetic data. In a recent paper published at an AI conference, researchers from the University of Toronto and the Vector Institute tried to do so with chest x-rays. They found they were unable to create an accurate medical AI system when they tried to make a diverse synthetic data set through the combination of differential privacy and generative adversarial networks.
None of this means that synthetic data shouldn’t be used. In fact, it may well become a necessity. As regulators confront the need to test AI systems for legal compliance, it could be the only approach that gives them the flexibility they need to generate on-demand, targeted testing data, O’Neil says. But that makes questions about its limitations even more important to study and answer now.
“Synthetic data is likely to get better over time,” she says, “but not by accident.”