Betterdata Docs

Statistical Similarity Metrics

Within each level of analysis (univariate and bivariate), various metrics evaluate specific characteristics of the synthetic data, allowing us to gain a holistic understanding of its fidelity and quality.
This section explores these metrics, their role in evaluating synthetic data, and how they can be used to measure statistical similarity.

Univariate Metrics#

Univariate Distribution Plot#

Univariate Distribution plots provide visual representations of the distributions of individual variables in the synthetic data compared to the real data. By examining these plots, data analysts can assess how well the synthetic data replicates the distributional characteristics of the original data.

How to Interpret Distribution Plot#

Interpreting univariate distribution plots involves comparing the shape, center, and spread of the distributions. If the synthetic data closely resembles the real data, the plots will exhibit similar patterns, indicating a high level of statistical similarity.
On the other hand, discrepancies in the shape or location of the distributions may indicate limitations or biases in the synthetic data generation process. Deviations from the real data distribution may point to areas where the synthetic data fails to capture the true characteristics of the underlying phenomenon. Therefore, careful analysis of univariate distribution plots allows users to evaluate the extent to which synthetic data accurately reflects the distributional properties of the original data.
[Figure: univariate distribution plots comparing real and synthetic data]
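As an illustration, a comparison plot of this kind can be sketched with matplotlib. This is a minimal sketch: the column values below are randomly generated stand-ins, not output from any Betterdata tool.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for one numerical column from the real and synthetic tables.
rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=2000)
synthetic = rng.normal(loc=0.1, scale=1.1, size=2000)

# Overlaid density histograms make differences in shape, center,
# and spread directly visible.
fig, ax = plt.subplots()
ax.hist(real, bins=40, density=True, alpha=0.5, label="real")
ax.hist(synthetic, bins=40, density=True, alpha=0.5, label="synthetic")
ax.set_xlabel("value")
ax.set_ylabel("density")
ax.legend()
```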

Univariate Statistical Distance#

Wasserstein Distance Banding#

| Fidelity Score | Wasserstein Distance (WD) | Expected Fidelity |
|---|---|---|
| F > 95 | WD < 0.05 | Very High |
| 80 < F < 95 | 0.05 < WD < 0.2 | High |
| 65 < F < 80 | 0.2 < WD < 0.35 | Medium |
| F < 65 | WD > 0.35 | Low |
The fidelity score is calculated from the WD value with the formula F = 100 * (1 - WD), valid for WD ≤ 1.
If you are aiming for high fidelity, make sure the Wasserstein Distance for all important features is close to 0.
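The banding above can be expressed directly in code. This is a minimal sketch; the function names are illustrative and not part of any Betterdata API.

```python
def wd_fidelity(wd: float) -> float:
    """Fidelity score F = 100 * (1 - WD), valid for 0 <= WD <= 1."""
    if not 0 <= wd <= 1:
        raise ValueError("formula assumes 0 <= WD <= 1")
    return 100 * (1 - wd)

def wd_band(wd: float) -> str:
    """Map a Wasserstein Distance to its expected fidelity band."""
    f = wd_fidelity(wd)
    if f > 95:
        return "Very High"
    if f > 80:
        return "High"
    if f > 65:
        return "Medium"
    return "Low"
```

For example, WD = 0.01 gives F = 99 ("Very High"), while WD = 0.5 gives F = 50 ("Low").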

Wasserstein Distance#

Wasserstein Distance measures the discrepancy between the empirical distributions of two datasets. It quantifies the amount of mass that needs to be transported to transform one distribution into the other, providing a measure of similarity in terms of shape and location. Wasserstein Distance is particularly useful for comparing continuous variables and capturing differences in their distributional characteristics. However, it may not be suitable for discrete or categorical variables due to its focus on continuous distributions.
How to interpret Wasserstein Distance
If the Wasserstein Distance between synthetic and real data is very small (e.g., 0.1), it suggests that the distributions are relatively similar or close in shape and position. This indicates a higher degree of similarity between the two datasets.
If the Wasserstein Distance is larger (e.g., 1.5), it suggests that the distributions are more dissimilar and exhibit notable differences. This indicates a lower degree of similarity between the synthetic and real data.
Learn the mathematics behind Wasserstein Distance:
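In Python, the Wasserstein Distance between two samples can be computed with SciPy. The distributions below are randomly generated stand-ins for one numerical feature, chosen to show a near-identical and a clearly shifted case.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)
real = rng.normal(loc=0.0, scale=1.0, size=5000)
# A synthetic sample that is almost identical, and one that is clearly shifted.
synth_close = rng.normal(loc=0.05, scale=1.0, size=5000)
synth_shifted = rng.normal(loc=1.5, scale=1.0, size=5000)

wd_close = wasserstein_distance(real, synth_close)      # small: distributions overlap
wd_shifted = wasserstein_distance(real, synth_shifted)  # large: mass must move ~1.5
```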

Jensen-Shannon Distance (JSD) Banding#

| Fidelity Score | Jensen-Shannon Distance (JSD) | Expected Fidelity |
|---|---|---|
| F > 95 | JSD < 0.05 | Very High |
| 80 < F < 95 | 0.05 < JSD < 0.2 | High |
| 65 < F < 80 | 0.2 < JSD < 0.35 | Medium |
| F < 65 | JSD > 0.35 | Low |
The fidelity score is calculated from JSD with the formula F = 100 * (1 - JSD).
If you are aiming for high fidelity, make sure the Jensen-Shannon Distance for all important features is close to 0.

Jensen Shannon Distance (JSD)#

JSD quantifies the similarity between probability distributions. It takes into account both the shared information and the difference between the distributions. JSD is a versatile metric that can be applied to evaluate the similarity of probability distributions for both continuous and discrete variables. It offers a comprehensive measure of divergence and can capture differences in shape, spread, and density. However, JSD may be sensitive to the choice of binning or discretization when applied to continuous variables.
How to interpret Jensen Shannon Distance (JSD)
When the JSD between synthetic and real data is close to zero (e.g., 0.1), it indicates a high level of similarity between the distributions. This suggests that the synthetic data closely resembles the real data in terms of their distributional characteristics.
Conversely, when the JSD is larger (e.g., 1.0), it indicates a greater dissimilarity between the distributions. This suggests that the synthetic data deviates more from the real data, and they exhibit notable differences in their distributional properties.
Learn the mathematics behind Jensen Shannon Distance (JSD): Link
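As a sketch, SciPy's `jensenshannon` returns the Jensen-Shannon distance between two probability vectors. The category frequencies below are hypothetical; `base=2` bounds the distance by 1.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical category frequencies for one column.
real_probs = np.array([0.50, 0.30, 0.20])
synth_close = np.array([0.48, 0.31, 0.21])  # nearly the same distribution
synth_far = np.array([0.10, 0.10, 0.80])    # mass concentrated differently

jsd_close = jensenshannon(real_probs, synth_close, base=2)
jsd_far = jensenshannon(real_probs, synth_far, base=2)
```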

Chi-squared Distance Banding#

| Fidelity Score | Chi-squared Distance (CSD) | Expected Fidelity |
|---|---|---|
| F > 95 | CSD > 0.95 | Very High |
| 80 < F < 95 | 0.8 < CSD < 0.95 | High |
| 65 < F < 80 | 0.65 < CSD < 0.8 | Medium |
| F < 65 | CSD < 0.65 | Low |
The fidelity score is calculated from the Chi-squared Distance (CSD) with the formula F = 100 * CSD.
If you are aiming for high fidelity, make sure the Chi-squared Distance for all important features is close to 1.

Chi-squared Distance#

The chi-squared test evaluates the difference between observed and expected frequencies in categorical variables. It focuses on comparing the distribution of categorical variables in the synthetic and real data. Chi-squared measures the discrepancy between the observed and expected counts and provides insights into the similarity or dissimilarity of the categorical distributions. Chi-squared is particularly useful for assessing the fidelity of categorical variables. However, it may not be suitable for continuous or ordinal variables.
How to interpret Chi-squared Distance (p-value)
When the Chi-squared Distance (p-value) between synthetic and real data is close to 1 (e.g., 0.95), it suggests a high level of similarity between the distributions. This indicates that the synthetic data closely aligns with the distribution of the real data.
When the Chi-squared Distance (p-value) is close to zero (e.g., 0.01), it signifies a greater dissimilarity between the distributions. The synthetic data differs significantly from the real data in terms of its distributional characteristics.
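A sketch of this comparison with SciPy: build expected counts from the real data's category proportions and test the synthetic counts against them. The proportions and counts below are invented for illustration.

```python
import numpy as np
from scipy.stats import chisquare

real_proportions = np.array([0.5, 0.3, 0.2])  # from the real column
n_synth = 1000
expected = real_proportions * n_synth

synth_close = np.array([505, 295, 200])  # tracks the real proportions
synth_far = np.array([300, 300, 400])    # clearly different mix

# Large p-value: the synthetic counts are consistent with the real
# distribution; tiny p-value: they differ significantly.
p_close = chisquare(synth_close, f_exp=expected).pvalue
p_far = chisquare(synth_far, f_exp=expected).pvalue
```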

Bivariate Metrics#

Bivariate Plots#

Bivariate Plots measure how similar the relationship between two variables is in real data compared to synthetic data.
How to interpret Bivariate Plots
Interpreting bivariate plots involves examining the overall shape, direction, and density of the data points. If the bivariate plot of the synthetic data closely resembles the plot of the real data, it indicates a good level of similarity in the relationship between the variables. On the other hand, if the patterns in the synthetic data plot differ significantly from the real data plot, it suggests that the synthetic data may not accurately capture the bivariate relationship.
Key aspects to consider when interpreting bivariate plots include the alignment of points along a trend or pattern, the distribution of points within different regions of the plot, and the presence of outliers or anomalous behavior. These characteristics help evaluate the statistical similarity of the synthetic data and determine if it adequately represents the bivariate relationships observed in the real data.
The two variables being compared can each be either categorical or numerical, so a bivariate plot falls into one of three combinations: categorical-numerical, categorical-categorical, or numerical-numerical.

Categorical-Numerical Plots#

A box-plot is a visual tool that helps us understand the relationship between a categorical column (containing categories or groups on the x-axis) and a numerical column (containing values or measurements on the y-axis).
[Figure: side-by-side box plots of a categorical-numerical pair in real and synthetic data]
How to Interpret Categorical Numerical Plots
Median: Compare the position of the median for each category. Similar medians suggest the central tendency is well captured.
Box Size (IQR): Examine the size of the box. Similar box sizes indicate preserved variability.
Outliers: Check for outliers. Comparable presence and distribution suggest the representation of extreme values.
Distribution: Compare the spread and shape of distributions within each category.
Deviations: Note significant variations between synthetic and real data, indicating limitations or biases.
Visual Consistency: Assess overall similarity in box sizes, whisker lengths, and outlier presence.
Context: Consider domain-specific knowledge to interpret the findings appropriately.
Learn more about Box Plots here: Link
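A minimal sketch of such side-by-side box plots with matplotlib; the categories and values below are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
categories = ["A", "B", "C"]
# One numerical sample per category, for the real and synthetic tables.
real_groups = [rng.normal(mu, 1.0, 200) for mu in (0.0, 2.0, 4.0)]
synth_groups = [rng.normal(mu, 1.0, 200) for mu in (0.1, 1.9, 4.2)]

# Shared y-axis so medians, IQRs, and outliers line up for comparison.
fig, (ax_real, ax_synth) = plt.subplots(1, 2, sharey=True)
for ax, groups, title in ((ax_real, real_groups, "Real"),
                          (ax_synth, synth_groups, "Synthetic")):
    ax.boxplot(groups)
    ax.set_xticks([1, 2, 3], categories)
    ax.set_title(title)
```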

Categorical-Categorical Plots#

Heat-maps are used to visualize the relationship between two categorical columns.
[Figure: side-by-side heatmaps of a categorical-categorical pair in real and synthetic data]
Both the x and y axes represent the unique values of the categorical columns. To construct the heat maps, we calculate the count of each pair of unique values (one from each column) and use that to decide the hue of each cell in the graph.
In the context of comparing real and synthetic data, we create two separate heatmaps: one for the original (real) data and another for the synthetic data. These heatmaps are then plotted side by side.
How to Interpret Categorical-Categorical Plots
Color Intensity: Darker colors indicate higher frequencies, while lighter colors represent lower frequencies.
Compare Patterns: Analyze color patterns between synthetic and real data for differences or similarities.
High/Low Frequencies: Pay attention to cells with the highest or lowest frequencies for common or rare category combinations.
Distribution Differences: Look for variations in category distribution between synthetic and real data.
Patterns and Trends: Identify noticeable patterns or trends in the heatmap, indicating frequent category combinations.
Visual Consistency: Assess the overall visual similarity or dissimilarity of the heatmap.
Domain Knowledge: Consider domain-specific knowledge to accurately interpret the findings.
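The counts behind such a heatmap can be sketched with `pandas.crosstab`; the column names and values below are hypothetical.

```python
import pandas as pd

# Two hypothetical categorical columns from the real table; the same
# crosstab is built for the synthetic table and plotted alongside it.
real = pd.DataFrame({
    "plan":   ["free", "free", "pro", "pro", "pro", "free"],
    "region": ["EU",   "US",   "EU",  "EU",  "US",  "EU"],
})

counts = pd.crosstab(real["plan"], real["region"])
# Each cell holds the frequency of one (plan, region) pair; these
# frequencies determine the hue of the corresponding heatmap cell.
```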

Numerical-Numerical Plots#

[Figure: joint plots of a numerical-numerical pair in real and synthetic data]
Joint plots are used to visualize the relationship between two numerical columns in real and synthetic data.
How to Interpret Numerical-Numerical Plots
Trend and Direction: Look for similarities or differences in the overall trend and direction of the lines between synthetic and real data.
Dispersion: Assess the spread or scattering of the lines to identify similarities or differences in variability between synthetic and real data.
Intersection and Overlapping: Note any consistent differences or similarities in intersection points or overlapping regions between synthetic and real data.
Deviations and Outliers: Identify any consistent differences or discrepancies in deviations or outliers from the overall trend between synthetic and real data.
Linearity: Evaluate if the lines show a clear linear pattern or exhibit deviations from linearity in synthetic and real data.
Visual Consistency: Consider the overall visual similarity or dissimilarity of the lines between synthetic and real data.
Domain Knowledge: Interpret the findings within the context of the domain and understanding of the variables being compared.
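Such a comparison can be sketched as side-by-side scatter plots (a simplified stand-in for a full joint plot); the data below is invented, with the synthetic pair following the same trend slightly more noisily.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
x_real = rng.normal(size=500)
y_real = 2 * x_real + rng.normal(scale=0.5, size=500)    # strong linear trend
x_synth = rng.normal(size=500)
y_synth = 2 * x_synth + rng.normal(scale=0.8, size=500)  # similar, slightly noisier

# Shared axes so trend, dispersion, and outliers can be compared directly.
fig, (ax_real, ax_synth) = plt.subplots(1, 2, sharex=True, sharey=True)
ax_real.scatter(x_real, y_real, s=5)
ax_real.set_title("Real")
ax_synth.scatter(x_synth, y_synth, s=5)
ax_synth.set_title("Synthetic")
```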

Correlation Metrics#

Correlation is a fundamental metric used to evaluate the similarity between synthetic and real data in bivariate analysis. It measures the strength and direction of the relationship between two variables. Assessing the correlation between variables helps us understand the consistency of their association in the synthetic data compared to the real data.
How to interpret Correlation Metrics
Interpreting the correlation matrix involves examining the correlation coefficients between pairs of variables. A high positive correlation (close to +1) indicates a strong positive relationship, while a high negative correlation (close to -1) indicates a strong negative relationship. A correlation close to 0 suggests no significant linear relationship.
When comparing the correlation matrix of synthetic and real data, we look for similarities in the pattern and magnitude of the correlation coefficients. If the correlation structure in the synthetic data closely resembles that of the real data, it indicates that the relationships between variables are accurately captured. Conversely, significant differences in the correlation matrix suggest that the synthetic data may not adequately represent the associations observed in the real data.
[Figure: correlation matrices for real and synthetic data]
[ Graph coming soon: Correlation matrix - Difference between real and fake Correlation Matrix ]
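One common way to compare the two matrices is their element-wise absolute difference. A sketch with pandas, using invented data in which the synthetic set deliberately loses the x-y association:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
x = rng.normal(size=1000)
real = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(size=1000)})
# Synthetic data where x and y are independent: the association is lost.
synth = pd.DataFrame({"x": rng.normal(size=1000), "y": rng.normal(size=1000)})

diff = (real.corr(method="spearman") - synth.corr(method="spearman")).abs()
# Large off-diagonal entries flag variable pairs whose association the
# synthetic data failed to preserve.
```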
Why is Spearman rank correlation better than Pearson correlation?
Spearman's rank correlation does not assume that the variables follow a specific distribution or that the relationship is linear. Instead, it assesses the monotonic relationship, which can be any increasing or decreasing pattern.
Learn more about Spearman Rank Correlation: Here.
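A quick demonstration of the point: on a monotonic but non-linear relationship, Spearman's rank correlation is a perfect 1, while Pearson's is noticeably lower.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1.0, 10.0, 50)
y = np.exp(x)  # strictly increasing, but far from linear

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
# Spearman sees a perfect monotonic relationship; Pearson is dragged
# down by the non-linearity.
```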
Modified at 2023-08-29 05:44:39