Purpose

The utility of synthetic data is how effectively it can replace or augment real data for purposes such as research, analysis, modeling, or training ML algorithms. Evaluating utility is important to confirm that the synthetic data is a reliable and viable substitute for real data.

Additionally, the synthetic data should maintain contextual relevance and applicability to the problem or use case at hand. It should accurately represent the domain knowledge and specific features of the real data, allowing for meaningful insights and informed decision-making.

Data utility is use case dependent

The utility of synthetic data can be evaluated by training one ML model on synthetic data and another on real data (giving two models: synthetic and real), then comparing the performance of both models on the same real test data.

This approach allows us to see how practical it is to use synthetic data in various ML tasks.

When models are trained on synthetic data, performance metrics such as Accuracy, Precision, Recall, F1-Score, or Area Under the Curve (AUC) can be compared with the same metrics for models trained on real data. These comparisons show how well the models trained on synthetic data perform relative to models trained on real data.
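The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy dataset from `make_data`, the nearest-centroid classifier, and the shifted "synthetic" training set are all hypothetical stand-ins for a real dataset, a real model, and the output of a real synthetic data generator. Accuracy is used here, but any of the metrics listed above could take its place.

```python
import random

def make_data(n, shift, seed):
    """Generate a toy two-class dataset as (features, label) pairs (hypothetical data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label + shift  # class 1 is centred 2 units away from class 0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def fit_centroids(rows):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in rows)
    return hits / len(rows)

real_train = make_data(500, shift=0.0, seed=1)
real_test = make_data(200, shift=0.0, seed=2)
# Assumption: a slightly shifted sample stands in for generated synthetic data.
synthetic_train = make_data(500, shift=0.3, seed=3)

real_model = fit_centroids(real_train)            # trained on real data
synthetic_model = fit_centroids(synthetic_train)  # trained on synthetic data

# Both models are scored on the SAME real test data.
real_score = accuracy(real_model, real_test)
synthetic_score = accuracy(synthetic_model, real_test)
print(f"real: {real_score:.3f}, synthetic: {synthetic_score:.3f}")
```

The key design point is that the test set is always real data: the synthetic model is judged on how well it handles the data it is meant to stand in for.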

A high utility score indicates that ML models trained on synthetic data will behave similarly to ML models trained on real data.

Utility Score Guide

Utility Score	Use cases	Utility
U > 75	Models trained on synthetic data can be used in place of models trained on real data for ML analysis and inference on new data.	High
50 ≤ U ≤ 75	Can still be used for ML analysis, but with reduced confidence in getting similar results between real and synthetic data.	Medium
U < 50	Not suitable as a replacement for the real data.	Low
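
As a rough sketch, the guide above can be expressed as a small lookup function. The name `utility_level` is hypothetical, and the handling of the exact boundary values 50 and 75 is an assumption: here both are treated as Medium.

```python
def utility_level(u):
    """Map a utility score U (0 to 100) to the level used in the guide."""
    if not 0 <= u <= 100:
        raise ValueError("utility score must be between 0 and 100")
    if u > 75:
        return "High"
    if u >= 50:  # assumption: boundary values 50 and 75 count as Medium
        return "Medium"
    return "Low"

for score in (90, 60, 30):
    print(score, utility_level(score))
```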

Utility Score Calculation

What is the % performance loss in synthetic data (L)?

If the model trained on real data scores 90% on the real test data and the model trained on synthetic data scores 82% on the same test data, then the % performance loss in synthetic data (L), measured relative to the real-data score, is (90 - 82) / 90 ≈ 8.9%.

U = 100 - % performance loss in synthetic data (L)
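
The calculation can be sketched as follows. Two assumptions are made that the text does not spell out: L is measured relative to the real-data performance, and L is clamped to the range 0 to 100 so that a synthetic model that outperforms the real one does not push U above 100. The function names are hypothetical.

```python
def performance_loss(real_perf, synthetic_perf):
    """Percentage performance loss L of the synthetic model relative to the real model."""
    if real_perf <= 0:
        raise ValueError("real_perf must be positive")
    loss = (real_perf - synthetic_perf) / real_perf * 100
    # Assumption: clamp so that L stays within 0..100.
    return max(0.0, min(100.0, loss))

def utility_score(real_perf, synthetic_perf):
    """U = 100 - L."""
    return 100 - performance_loss(real_perf, synthetic_perf)

# Real model scores 90%, synthetic model scores 82%, both on real test data.
loss = performance_loss(90, 82)
score = utility_score(90, 82)
print(f"L = {loss:.1f}, U = {score:.1f}")
```

Both inputs must use the same metric and the same real test data; otherwise the ratio is not meaningful.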

Utility Score	% performance loss in synthetic data (L)	Utility
U = 100	L = 0	High
U = 80	L = 20	High
U = 0	L = 100	Low

If ML performance on synthetic data is close to the ML performance on real data, we will get a U score close to 100.

Disclaimer: The acceptable level of utility score depends on the use case. For high-stakes use cases, such as those in the healthcare industry, only synthetic data with L = 0 might be acceptable.

