What is synthetic data?
And how can you use it to fuel AI breakthroughs?
What factors are driving demand for synthetic data across industries? And what are the risks and benefits of using synthetic data for decision making? In this article, we'll discuss synthetic data’s vital place in our data-hungry AI initiatives, how businesses can use it to unlock growth, and the ethical challenges yet to be solved.
It’s hard to believe, but the rise of artificial intelligence has, in some ways, created data scarcity. Not a shortage, per se. We have an astonishing amount of data that’s growing exponentially (estimates show that 120 zettabytes were created in 2023). And that number could more than double by 2027!
No, our current data problem is suitability, not quantity. Synthetic data – a product of generative AI – may be the answer for that.
What is synthetic data? And why do we need it?
Simply put, synthetic data is algorithmically generated data that mimics real-world data. It could be randomly generated – 100,000 birth dates. Easy.
Usually, though, synthetic data fills a gap in fit-for-purpose data: 100,000 birth dates of women who recently registered to vote. Tough.
Synthetic data’s real sweet spot, however, is found in the rare edge cases: a data set of male prostate cancer patients younger than 35 years old, or images of wear patterns in bronze piston rings, for example. See where this is going? That specificity – that rarity – makes the data harder to get and, in some cases, riskier to use.
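To make the “easy” case concrete, here is a minimal sketch in plain Python (standard library only; the 1940–2005 date range is an illustrative assumption) that generates 100,000 random birth dates:

```python
# A minimal sketch of the "easy" case: randomly generating 100,000
# plausible birth dates using only the standard library.
import random
from datetime import date, timedelta

START = date(1940, 1, 1)   # assumed range; adjust to your population
END = date(2005, 12, 31)
SPAN_DAYS = (END - START).days

birth_dates = [START + timedelta(days=random.randrange(SPAN_DAYS + 1))
               for _ in range(100_000)]

print(len(birth_dates), birth_dates[:3])
```

The fit-for-purpose and edge-case examples above are harder precisely because they can't be produced by uniform sampling like this; they have to reproduce real-world structure.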
Accenture's Chief Data Scientist Fernando Lucini explains in a podcast conversation with SAS strategic advisor Kimberly Nevala that synthetic data can also help with data privacy. Personally identifiable information (PII) is closely guarded in health care, the public sector and even retail. When we can’t risk exposing PII, we need replacement data to analyze.
“We ask (AI to create …) data with the same patterns but none of the characteristics of the original data. In simple terms (synthetic data) is machine-generated data that is a facsimile – not a copy, but a facsimile – of the signals and patterns within the original data,” Lucini explains.
Key data equivalents:
1 yottabyte (YB) = 1,000 zettabytes
1 zettabyte (ZB) = 1,000 exabytes
1 exabyte (EB) = 1,000 petabytes
1 petabyte (PB) = 1,000 terabytes
1 terabyte (TB) = 1,000 gigabytes
1 gigabyte (GB) = 1,000 megabytes
1 megabyte (MB) = 1,000 kilobytes
1 kilobyte (KB) = 1,000 bytes
Benefits of synthetic data
Access to large, diverse and authentic data is crucial for training robust AI models. But getting that kind of real-world data can be tough given increasing privacy concerns, legal restrictions, and high data acquisition and annotation costs.
Synthetic data can be created with labels and annotations already baked in – saving time and resources – and without exposing sensitive information, because the links to real individuals have been severed. Privacy is built in.
What about anonymized data, you ask? According to Edwin van Unen, SAS Principal Customer Advisor, anonymization isn’t the answer either. It is inadequate, laborious and inconsistent.
“Its poor quality makes it almost impossible to use for advanced analytics tasks such as AI or machine learning modeling and dashboarding,” explains van Unen.
Synthetic data changes the game here. Because it mirrors the original data’s statistical properties and correlations, the resulting data sets can be used to test and train precise predictive models with no need to mask sensitive information. This “synthetic twin” approach helps counteract bias and achieves near-perfect anonymity.
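By way of illustration, here is a minimal sketch of one such fidelity check: comparing column means and pairwise correlations between an original table and its synthetic twin. The toy data and the jittered-resample stand-in for a generator are our assumptions, not a SAS method:

```python
# A minimal sketch of a "synthetic twin" fidelity check: does the
# synthetic table preserve the original's means and correlations?
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
original = pd.DataFrame({
    "age": rng.normal(45, 12, 1000),
    "spend": rng.gamma(2.0, 150.0, 1000),
})

# Stand-in for a real generator's output (in practice: SMOTE, a GAN, etc.).
synthetic = original.sample(frac=1.0, replace=True, random_state=1).reset_index(drop=True)
synthetic = synthetic + rng.normal(0, 0.5, synthetic.shape)

print("Mean gap per column:\n", (original.mean() - synthetic.mean()).abs())
print("Max correlation gap:", (original.corr() - synthetic.corr()).abs().to_numpy().max())
```

Checks like these, alongside privacy tests, are one way to judge whether a synthetic twin can stand in for the original data.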
Infographic: Why Synthetic Data Is Essential for Your Organization's AI-Driven Future
A look at four basic types of synthetic data and how they’re often used
- Synthetic structured data represents individuals, products and other entities and their activities or attributes – including customers and their purchasing habits, or patients and their symptoms, medications and diagnoses.
- Synthetic images are crucial for training object detection, image classification and segmentation models. These images are useful for early cancer detection, drug discovery and clinical trials, or for teaching self-driving cars. Synthetic images can also cover rare edge cases where little real data is available, like horizontally oriented traffic signals.
- Synthetic text can be tailored to enable robust, versatile natural language processing (NLP) models for translation, sentiment analysis and text generation for applications such as fraud detection and stress testing.
- Synthetic time series data (including sensor data) can be used in radar systems, IoT sensor readings, and light detection and ranging. It can be valuable for predictive maintenance and autonomous vehicle systems, where more data can ensure safety and reliability.
SAS® Data Maker – Now in Preview
Protect existing data, innovate faster and ensure scalable outcomes using a low-code/no-code interface to augment or generate data quickly. Unlock the potential of existing data with SAS Data Maker.
Creating synthetic data: When to use SMOTE vs. GAN
Generating data with business rules and business logic is not a new concept. AI adds a layer of accuracy to data generation by introducing algorithms that can use existing data to automatically model appropriate values and relationships.
Two popular AI techniques for generating synthetic data are:
- Synthetic minority oversampling technique (SMOTE).
- Generative adversarial network (GAN).
SMOTE is an intelligent interpolation technique. It works by taking a sample of real data and generating new points along the line segments between randomly chosen points and their nearest neighbors. In this way, SMOTE lets you focus on points of interest, such as underrepresented classes, and create similar points to balance the data set and improve overall accuracy in predictive models.
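As a concrete illustration, here is a minimal from-scratch sketch of that interpolation step. The function name smote_like, the toy data and the parameter choices are our own; production work would more likely use a library implementation such as imbalanced-learn's SMOTE:

```python
# A minimal sketch of the SMOTE idea: for each new sample, pick a random
# minority-class point, pick one of its k nearest minority neighbors, and
# interpolate a new point somewhere on the segment between them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_minority, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)   # column 0 is the point itself
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))        # a random minority point
        j = idx[i, rng.integers(1, k + 1)]       # one of its k neighbors
        lam = rng.random()                       # position along the segment
        samples.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(samples)

# Toy minority class: 20 points in 2-D, oversampled with 100 new points.
X_min = np.random.default_rng(1).normal(size=(20, 2))
print(smote_like(X_min, n_new=100).shape)   # (100, 2)
```

Because every new point lies between two existing minority points, SMOTE stays inside the original data's envelope – a key contrast with GANs, described next.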
GAN, on the other hand, is a technique that generates data by training a sophisticated deep learning model to represent the original data. A GAN comprises two neural networks: a generator that creates synthetic data and a discriminator that tries to tell synthetic data from real. This iterative adversarial relationship produces increasingly realistic synthetic data, continuing until the discriminator can no longer easily distinguish synthetic records from real ones. The training process can be time-consuming and often requires graphics processing units (GPUs), but it can capture highly nonlinear, complex relationships among variables and thus produce very accurate synthetic data. It can also generate data at or beyond the boundaries of the original data, potentially representing novel cases that would otherwise be neglected.
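Here is a minimal PyTorch sketch of that adversarial loop; the network sizes, optimizer settings and stand-in “real” table are illustrative assumptions, not a production recipe:

```python
# A minimal GAN sketch: a generator maps noise to fake rows, a
# discriminator scores rows as real vs. synthetic, and each network
# is trained against the other.
import torch
import torch.nn as nn

N_FEATURES, NOISE_DIM, BATCH = 4, 8, 64

generator = nn.Sequential(nn.Linear(NOISE_DIM, 32), nn.ReLU(),
                          nn.Linear(32, N_FEATURES))
discriminator = nn.Sequential(nn.Linear(N_FEATURES, 32), nn.ReLU(),
                              nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

real_table = torch.randn(1024, N_FEATURES) * 2 + 1   # stand-in for real data

for step in range(500):
    real = real_table[torch.randint(0, len(real_table), (BATCH,))]
    fake = generator(torch.randn(BATCH, NOISE_DIM))

    # Discriminator update: label real rows 1, synthetic rows 0.
    d_loss = (loss_fn(discriminator(real), torch.ones(BATCH, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(BATCH, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: try to make the discriminator call fakes real.
    g_loss = loss_fn(discriminator(fake), torch.ones(BATCH, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

print(generator(torch.randn(5, NOISE_DIM)).detach())   # 5 synthetic rows
```

Real tabular GANs add considerable machinery on top of this (categorical columns, mode collapse, privacy testing), which is part of why GAN training is the costlier of the two approaches.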
A test: Synthetic data versus anonymized data
SAS and a partner tested synthetic data’s viability as an alternative to anonymized data using a real-world telecom customer’s churn data set (read the blog post, Using AI-generated synthetic data for easy and fast access to high-quality data). Van Unen explained that the team assessed the outcome on data quality, legal validity and usability.
What they learned:
- Synthetic data retained the original statistical properties and business logic, including “deep hidden statistical patterns.” By comparison, anonymization destroyed the underlying correlations.
- Synthetic data models predicted churn about as well as models trained on the original data, while anonymized data models performed poorly (a minimal sketch of this comparison follows the list).
- Synthetic data can be used to train models and understand key data characteristics while protecting privacy, because it reduces the need for access to the original data.
- Synthetic data generation processes are reproducible. Anonymization is variable, inconsistent and more manual.
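To make the modeling comparison concrete, here is a minimal sketch of its logic: train one model on real rows and one on synthetic rows, then score both on the same held-out real data. The toy classification data and the jittered-resample “synthetic” set are our stand-ins for the telecom churn records, not the study's actual data or method:

```python
# A minimal "train on synthetic, test on real" comparison sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stand-in synthetic twin: jittered resamples of the real training rows.
rng = np.random.default_rng(0)
idx = rng.integers(0, len(X_tr), len(X_tr))
X_syn, y_syn = X_tr[idx] + rng.normal(0, 0.1, X_tr.shape), y_tr[idx]

for name, (Xf, yf) in {"real": (X_tr, y_tr), "synthetic": (X_syn, y_syn)}.items():
    model = LogisticRegression(max_iter=1000).fit(Xf, yf)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"Trained on {name} data: holdout AUC = {auc:.3f}")
```

If the synthetic-trained model's holdout score tracks the real-trained model's, that is evidence the generator preserved the signal that matters for the task.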
“This case study reinforces the idea that AI-generated synthetic data provides fast, easy access to high-quality data for analytics and model development,” affirms van Unen. “Its privacy-by-design approach makes analysis, testing and development more agile.”
“We must approach synthetic data with great care to avoid unintended consequences.” – Natalya Spicer, Synthetic Data Product Manager, SAS
Ethical considerations of synthetic data
As synthetic data use becomes more widespread, synthetic data vaults will also become more prevalent. These shared repositories will foster collaboration, data democratization and cross-pollination of ideas. But they could also inadvertently entrench bias, hide data privacy infractions and perpetuate unfair data practices.
Contrary to popular belief, Lucini argues, synthetic data is neither automatically private nor privacy-preserving. If not implemented with the right controls and testing, synthetic data generation can still lead to privacy leaks.
"Generative models can be a ‘black box.’ To ensure responsible use, they require rigorous validation, which the industry has not yet fully developed. We have to approach synthetic data with great care to avoid unintended consequences," says Natalya Spicer, a Synthetic Data Product Manager at SAS.
The right to privacy is black and white – we can regulate it, put rules around it, and everyone can be bound by those rules. Fairness and bias are not as simple to regulate. If those subjective decisions are left to individuals, they could have long-term consequences. So we need enterprise-level governance until more comprehensive government regulations exist.
“We built SAS® Viya® to serve as an enterprise platform for the compliant use of data and analytics, which is crucial with the acceleration of AI and synthetic data,” says Spicer. “SAS Viya has full traceability regarding how models are created, all the way back to raw data and the models used to analyze its accuracy.”
The future of synthetic data and AI
As artificial intelligence and data science advance, synthetic data will become increasingly important. The synergy between synthetic data and emerging generative techniques will enable the creation of even more sophisticated and realistic synthetic data sets, further pushing the boundaries of what is possible.
Governance will play an important role as the use of synthetic data evolves. Organizations must implement robust governance frameworks, data auditing practices, and clear communication around the limitations and appropriate use cases for synthetic data. Policies for labeling and identifying the use of synthetic data will also become crucial to avoid misuse and misunderstanding. By embracing the power of synthetic data, data scientists can unlock new frontiers of innovation, develop more robust and reliable AI models, and drive transformation that positively impacts our world.