Betterdata Docs

Utility Summary

Purpose

The utility of synthetic data is a measure of how well it can replace or augment real data for purposes such as research, analysis, modeling, or training ML algorithms. Evaluating utility is how you confirm that synthetic data is a reliable and viable substitute for real data.
The synthetic data should also remain relevant and applicable to the problem or use case at hand. It should accurately represent the domain knowledge and specific features of the real data, so that insights and decisions drawn from it carry over to the real data.

Data utility is use-case dependent

The utility of synthetic data can be evaluated by training one ML model on synthetic data and another on real data, then comparing the performance of both models on the same real test data.
This approach shows how practical it is to use synthetic data in various ML tasks.
Performance metrics such as Accuracy, Precision, Recall, F1-Score, or Area Under the Curve (AUC) are computed for the model trained on synthetic data and compared with the same metrics for the model trained on real data. The gap between the two sets of metrics shows how well the synthetic-trained model performs relative to the real-trained model.
A high utility score indicates that ML models trained on synthetic data will behave similarly to ML models trained on real data.
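
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not Betterdata's implementation: the nearest-centroid classifier, the helper names (`fit_centroids`, `predict`, `accuracy`), and the small hand-written datasets are all assumptions chosen so that the example runs without extra dependencies. Both models are scored on the same real test rows.

```python
from statistics import mean

# Toy data invented for illustration: each row is ((x1, x2), label).
real_train = [((1.0, 1.0), 0), ((2.0, 1.0), 0), ((1.0, 2.0), 0),
              ((5.0, 5.0), 1), ((6.0, 5.0), 1), ((5.0, 6.0), 1)]
# Hypothetical synthetic data whose distribution is slightly shifted.
synthetic_train = [((2.0, 2.0), 0), ((3.0, 2.0), 0), ((2.0, 3.0), 0),
                   ((6.0, 6.0), 1), ((7.0, 6.0), 1), ((6.0, 7.0), 1)]
# Held-out real data, used to score both models.
real_test = [((1.5, 1.5), 0), ((2.0, 2.0), 0), ((3.0, 3.0), 0),
             ((4.0, 4.0), 1), ((5.5, 5.5), 1), ((6.0, 6.0), 1)]


def fit_centroids(rows):
    """Train a nearest-centroid classifier: one mean point per label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*points))
        for label, points in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given point."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, features) == label for features, label in rows)
    return hits / len(rows)


real_model = fit_centroids(real_train)
synthetic_model = fit_centroids(synthetic_train)

# Both models are evaluated on the same real test set.
real_acc = accuracy(real_model, real_test)
synthetic_acc = accuracy(synthetic_model, real_test)

print(f"trained on real data:      {real_acc:.3f}")
print(f"trained on synthetic data: {synthetic_acc:.3f}")
```

In practice the classifier and the metric would be replaced by whatever model and metric match the use case; the structure (two training sets, one shared real test set) stays the same.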

Utility Score Guide

| Utility Score | Use cases | Utility |
|---|---|---|
| U > 75 | Models trained on synthetic data can be used in place of models trained on real data for ML analysis and inference on new data. | High |
| 50 < U < 75 | Can still be used for ML analysis but with reduced confidence of getting similar results between real and synthetic data. | Medium |
| U < 50 | Not suitable as a replacement for the real data. | Low |
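
The bands in the guide can be expressed as a small lookup function. This is a sketch under stated assumptions: the function name `utility_band` is hypothetical, and because the guide does not say which band the exact values U = 75 and U = 50 belong to, this version places both in Medium.

```python
def utility_band(u: float) -> str:
    """Map a utility score U (0 to 100) to the High / Medium / Low bands."""
    if not 0 <= u <= 100:
        raise ValueError("utility score must be between 0 and 100")
    if u > 75:
        return "High"
    # Assumption: the guide leaves U = 75 and U = 50 unassigned;
    # this sketch treats both boundary values as Medium.
    if u >= 50:
        return "Medium"
    return "Low"


for score in (90, 60, 30):
    print(score, utility_band(score))
```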

Utility Score Calculation

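The utility score is derived from the % performance loss in synthetic data (L): how much worse a model trained on synthetic data performs than a model trained on real data when both are evaluated on the same real test data.
For example, if the model trained on real data scores 90% and the model trained on synthetic data scores 82%, the loss is 8 percentage points, or about 9% relative to the real-data score ((90 - 82) / 90 ≈ 8.9%).
U = 100 - % performance loss in synthetic data (L)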
| Utility Score | % performance loss in synthetic data (L) | Utility |
|---|---|---|
| U = 100 | L = 0 | High |
| U = 80 | L = 20 | High |
| U = 0 | L = 100 | Low |
If ML performance on synthetic data is close to the ML performance on real data, we will get a U score close to 100.
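
The calculation can be sketched as follows. The page does not spell out how L is computed, so this example assumes L is the drop in the synthetic-trained model's score relative to the real-trained model's score, expressed as a percentage. The function names `performance_loss` and `utility_score` are hypothetical, and clamping L to the range 0 to 100 is likewise an assumption made for this sketch.

```python
def performance_loss(real_score: float, synthetic_score: float) -> float:
    """Percentage loss L of the synthetic-trained model vs. the real-trained one.

    Assumption: L is measured relative to the real-trained model's score.
    """
    if real_score <= 0:
        raise ValueError("real_score must be positive")
    loss = 100 * (real_score - synthetic_score) / real_score
    # Assumption: clamp to [0, 100] so that a synthetic-trained model that
    # outperforms the real-trained one is reported as zero loss.
    return min(100.0, max(0.0, loss))


def utility_score(real_score: float, synthetic_score: float) -> float:
    """U = 100 - L."""
    return 100 - performance_loss(real_score, synthetic_score)


print(utility_score(90, 90))  # 100.0 (no loss)
print(utility_score(90, 81))  # 90.0  (L = 10)
print(utility_score(90, 0))   # 0.0   (L = 100)
```

Both scores must use the same metric and the same real test data for the result to be meaningful.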
Disclaimer: The acceptable utility score depends on the use case. For high-stakes use cases, such as those in the healthcare industry, only synthetic data with L = 0 might be acceptable.
Modified at 2023-08-29 05:40:29