Betterdata Docs

Statistical Similarity Summary

Purpose

Statistical similarity between synthetic data and real data is the degree to which the distributional properties of the synthetic data match those of the real data. Measuring it involves comparing statistical characteristics such as mean, variance, correlation, and higher-order moments across the real and synthetic datasets.
The goal of statistical similarity is to ensure that synthetic data captures the underlying patterns, trends, and relationships present in real data. If synthetic data closely resembles real data in its statistical properties, it is more likely to be useful for analytical tasks, modeling, and downstream use cases.
When evaluating the statistical similarity between synthetic and real data, several aspects need to be considered. These aspects provide insights into the different properties and characteristics of the datasets. Here are the key aspects that need to be checked:
1. Univariate Level: In jigsaw-puzzle terms, these are your corner and edge pieces. They provide the basic structure of your dataset by analyzing one variable at a time. Each univariate metric is a different tool for examining a piece. For example, one metric might tell you the range of ages, and another might tell you the most common gender in your dataset. When comparing real and synthetic data, these metrics help ensure that the basic structure of the synthetic dataset matches that of the real one. If most ages in the real data fall in the range 0-25, most ages in the synthetic data should fall in that range too.
2. Bivariate Level: These are the pieces that start to fill in the picture. They analyze the relationship between two variables, and different bivariate metrics capture different aspects of that relationship. For instance, one metric might show how age and gender interact in your dataset, such as whether one gender is the majority within a certain age group. When comparing real and synthetic data, these metrics ensure that relationships between pairs of variables in the synthetic data match those in the real data. If 80% of individuals under the age of 25 are female in the real data, the synthetic data should maintain this relationship.
3. Multivariate Level: These are the intricate pieces that complete the puzzle. They take multiple variables into account at once, offering a more nuanced understanding of your data. Different multivariate metrics offer different perspectives on the interactions among the variables. For example, one could show how age, gender, and location interact. When comparing real and synthetic data, these metrics help ensure that the complex interactions among multiple variables in the synthetic data match those in the real data. If most females under 25 from a certain location prefer a specific product in the real data, the synthetic data should capture this pattern.
By examining all these levels together, you can verify that your synthetic data is a faithful representation of your real data, much like how individual puzzle pieces come together to form a complete picture. The univariate level verifies the basic structure, the bivariate level checks the relationships between pairs of variables, and the multivariate level ensures that complex interactions among multiple variables are preserved. Together, they provide a full, detailed comparison between your real and synthetic datasets.
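As a rough illustration of the univariate and bivariate checks described above, the sketch below compares mean, variance, and Pearson correlation between a real and a synthetic table. The column names, the toy values, and the helper names (`univariate_gaps`, `bivariate_gap`, `pearson`) are hypothetical and chosen only for this example; they are not part of any Betterdata API.

```python
from math import sqrt
from statistics import mean, pvariance

# Toy data, invented purely for illustration.
real = {"age": [22, 25, 31, 40, 52], "income": [28, 31, 40, 55, 70]}
synthetic = {"age": [21, 26, 30, 42, 50], "income": [27, 33, 41, 54, 68]}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def univariate_gaps(real, synthetic):
    """Absolute difference in mean and variance for each column."""
    return {
        col: {
            "mean_gap": abs(mean(real[col]) - mean(synthetic[col])),
            "variance_gap": abs(pvariance(real[col]) - pvariance(synthetic[col])),
        }
        for col in real
    }


def bivariate_gap(real, synthetic, col_a, col_b):
    """Absolute difference in Pearson correlation for one pair of columns."""
    return abs(
        pearson(real[col_a], real[col_b])
        - pearson(synthetic[col_a], synthetic[col_b])
    )


print(univariate_gaps(real, synthetic))
print(bivariate_gap(real, synthetic, "age", "income"))
```

Gaps close to zero suggest the synthetic data preserves that property. These summary statistics are only a starting point; distribution-level distances give a fuller comparison.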
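Wasserstein Distance (WD) Banding
| Fidelity Score (F) | Wasserstein Distance (WD) | Expected Fidelity |
|--------------------|---------------------------|-------------------|
| F > 95             | WD < 0.05                 | Very High         |
| 80 < F < 95        | 0.05 < WD < 0.2           | High              |
| 65 < F < 80        | 0.2 < WD < 0.35           | Medium            |
| F < 65             | WD > 0.35                 | Low               |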
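The Wasserstein Distance thresholds above can be applied with a small lookup. The sketch below is a minimal example under two assumptions that the page does not state: the distance is computed on values min-max scaled with the real column's range, so that it is comparable with thresholds such as 0.05, and a value that lands exactly on a threshold is assigned to the lower-fidelity band. The function names are hypothetical.

```python
def min_max_scale(values, lo, hi):
    """Scale values to [0, 1] using a fixed range (here: the real data's range)."""
    return [(v - lo) / (hi - lo) for v in values]


def wasserstein_1d(real, synthetic):
    """1-D Wasserstein distance between two equal-size samples.

    For equal-size samples this equals the mean absolute difference
    between the sorted values.
    """
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes equal-size samples")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(r - s) for r, s in pairs) / len(real)


def wd_band(wd):
    """Map a Wasserstein distance to the fidelity bands in the table.

    Assumption: exact threshold values fall into the lower-fidelity band,
    because the table uses strict inequalities and leaves them unspecified.
    """
    if wd < 0.05:
        return "Very High"
    if wd < 0.2:
        return "High"
    if wd < 0.35:
        return "Medium"
    return "Low"


# Toy ages, invented for illustration.
real_ages = [18, 22, 25, 31, 40]
synthetic_ages = [19, 21, 26, 30, 41]

lo, hi = min(real_ages), max(real_ages)
wd = wasserstein_1d(
    min_max_scale(real_ages, lo, hi),
    min_max_scale(synthetic_ages, lo, hi),
)
print(wd, wd_band(wd))
```

For samples of different sizes, a library routine such as `scipy.stats.wasserstein_distance` is a more general option.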
Jensen-Shannon Distance (JSD) Banding
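| Fidelity Score (F) | Jensen-Shannon Distance (JSD) | Fidelity  |
|--------------------|-------------------------------|-----------|
| F > 95             | JSD < 0.05                    | Very High |
| 80 < F < 95        | 0.05 < JSD < 0.2              | High      |
| 65 < F < 80        | 0.2 < JSD < 0.35              | Medium    |
| F < 65             | JSD > 0.35                    | Low       |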
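For a categorical column, the Jensen-Shannon Distance can be computed from the category frequencies and then mapped to a band. The sketch below is one way to do this, not a description of Betterdata's implementation. It assumes base-2 logarithms (so the distance lies between 0 and 1), treats the Very High band as JSD < 0.05 (the complement of the High band's lower bound), and assigns exact threshold values to the lower-fidelity band. All function names are hypothetical.

```python
from collections import Counter
from math import log2, sqrt


def distribution(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, range 0 to 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return sqrt(max(divergence, 0.0))  # guard against tiny negative round-off


def jsd_band(jsd):
    """Map a Jensen-Shannon distance to a fidelity band (assumed boundaries)."""
    if jsd < 0.05:
        return "Very High"
    if jsd < 0.2:
        return "High"
    if jsd < 0.35:
        return "Medium"
    return "Low"


# Toy gender column, invented for illustration.
real_gender = ["F"] * 8 + ["M"] * 2
synthetic_gender = ["F"] * 7 + ["M"] * 3

categories = ["F", "M"]
p = distribution(real_gender, categories)
q = distribution(synthetic_gender, categories)
jsd = jensen_shannon_distance(p, q)
print(jsd, jsd_band(jsd))
```

Identical distributions give a distance of 0, and distributions with no categories in common give a distance of 1.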
Chi-squared Distance Banding
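| Fidelity Score (F) | Chi-squared Distance (CSD) | Fidelity  |
|--------------------|----------------------------|-----------|
| F > 95             | CSD > 0.95                 | Very High |
| 80 < F < 95        | 0.8 < CSD < 0.95           | High      |
| 65 < F < 80        | 0.65 < CSD < 0.8           | Medium    |
| F < 65             | CSD < 0.65                 | Low       |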
All bivariate and multivariate metrics are visual metrics. The following section explains how to interpret visual metrics when assessing fidelity.
Modified at 2023-08-29 05:44:18