Distance-based Metrics#
Distance to Closest Record (DCR)#
DCR measures the similarity, or proximity, between records in the synthetic data and records in the real data. It is calculated as the Euclidean distance between a data point s in the synthetic data and its closest data point r in the real data. A synthetic data point with DCR = 0 is an exact copy of a real record and therefore leaks real information.
Distance to Closest Record (DCR) Test#
- DCR of Test-Train: the distance between each record in a test dataset and the closest record in the training dataset. The test dataset is a subset of the original data that was not used in the training process. A smaller DCR of Test-Train indicates that the test records are very similar to the training records.
- DCR of Synthetic-Train: the distance between each record in the synthetic dataset and the closest record in the training dataset. A smaller DCR of Synthetic-Train indicates that the synthetic records are very similar to the training records.
- DCR difference between Test-Train and Synthetic-Train (Diff DCR): the difference between the DCR of Test-Train and the DCR of Synthetic-Train.
- Diff DCR in % (D) = Diff DCR / DCR of Test-Train × 100
- Privacy score = 100 - (Diff DCR in %)

| Diff DCR in % (D) | Expected Privacy |
|---|---|
| D < 10 | High |
| 10 < D ≤ 50 | Medium |
| D > 50 | Low |
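
The scoring formulas and the threshold table can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `dcr_privacy_score` is hypothetical, the sign convention (Test-Train minus Synthetic-Train) is an assumption inferred from the table, and the treatment of D exactly equal to 10 is also an assumption because the table does not cover it.

```python
def dcr_privacy_score(dcr_test_train: float, dcr_synth_train: float):
    """Return (diff_pct, privacy_score, tier) from two aggregate DCR values.

    Assumption: Diff DCR = DCR(Test-Train) - DCR(Synthetic-Train), so a
    synthetic set that sits closer to the training data than the held-out
    test set does produces a positive difference and a lower score.
    """
    if dcr_test_train <= 0:
        raise ValueError("DCR of Test-Train must be positive")

    diff = dcr_test_train - dcr_synth_train
    diff_pct = diff / dcr_test_train * 100   # Diff DCR in % (D)
    score = 100 - diff_pct                   # Privacy score

    # Thresholds from the table; D == 10 is mapped to "High" by assumption.
    if diff_pct > 50:
        tier = "Low"
    elif diff_pct > 10:
        tier = "Medium"
    else:
        tier = "High"
    return diff_pct, score, tier


# Hypothetical aggregate values, for illustration only.
print(dcr_privacy_score(4.0, 3.0))  # (25.0, 75.0, 'Medium')
```

Note that when the synthetic data is farther from the training data than the test data is, D becomes negative and the score exceeds 100 under this convention.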
To identify and remove compromised (low privacy) rows, use privacy attacks.
How to interpret DCR#
If the DCR values for the synthetic data (Synthetic-Train) are consistently higher than the DCR values for the held-out real data (Test-Train), it suggests that the synthetic data provides better privacy preservation, because higher DCR values indicate a greater distance, or dissimilarity, between the synthetic records and the real records. DCR = 0 means a synthetic record is a clone of a real record, so it is advisable to remove all synthetic data points with a DCR of 0 before sharing the synthetic data.
Implementation of DCR#
1. Start with the synthetic dataset and the original dataset.
2. For each synthetic data point in the synthetic dataset, calculate its distance to the closest record in the original dataset. This can be done using a distance metric such as Euclidean distance or Manhattan distance.
3. Repeat this process for all synthetic data points, obtaining the distances to their respective closest records in the original dataset.
4. Calculate the average or median distance across all synthetic data points. This represents the DCR value for the synthetic dataset.
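
The steps above can be sketched with the Python standard library alone. This is a brute-force illustration under stated assumptions: records are tuples of numeric features that have already been scaled to comparable ranges, the function name `dcr_per_record` is hypothetical, and the toy data is invented for the example.

```python
import math
from statistics import median


def dcr_per_record(synthetic, original):
    """Euclidean distance from each synthetic record to its closest original record."""
    return [min(math.dist(s, o) for o in original) for s in synthetic]


# Hypothetical toy data: each record is a tuple of numeric features.
original = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0), (10.0, 1.0)]

distances = dcr_per_record(synthetic, original)   # [0.0, 3.0, 1.0]
dcr_median = median(distances)                    # aggregate DCR (median)
dcr_mean = sum(distances) / len(distances)        # aggregate DCR (mean)

# Synthetic rows with DCR = 0 are exact copies of an original record.
exact_copies = [i for i, d in enumerate(distances) if d == 0.0]
print(distances, dcr_median, exact_copies)
```

The nested loop compares every synthetic record with every original record, so it is only practical for small datasets. For larger data, a spatial index or a vectorized nearest-neighbor search would likely be a better fit.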
Nearest Neighbor Distance Ratio (NNDR)#
NNDR is the ratio of the closest distance to the second-closest distance from a synthetic data point to the real data points. It indicates the relative proximity of the nearest neighbor compared to the second-nearest neighbor.
Nearest Neighbor Distance Ratio (NNDR) Test#
- NNDR of Test-Train: for each record in a test dataset, the ratio of its distance to the closest training record to its distance to the second-closest training record. A lower NNDR of Test-Train indicates that test records lie much closer to one training record than to any other, which could increase the privacy risk if the training data were to be released or compromised.
- NNDR of Synthetic-Train: for each record in the synthetic dataset, the ratio of its distance to the closest training record to its distance to the second-closest training record. A lower NNDR of Synthetic-Train indicates that synthetic records lie very close to individual training records. This could be a sign of good utility (since the synthetic data closely resembles the original data), but if the NNDR is too low, it could also indicate a privacy risk (since the synthetic data might be too similar to the original data).
- NNDR difference between Test-Train and Synthetic-Train (Diff NNDR): the difference between the NNDR of Test-Train and the NNDR of Synthetic-Train.
- Diff NNDR in % (D) = Diff NNDR / NNDR of Test-Train × 100
- Privacy score = 100 - (Diff NNDR in %)

| Diff NNDR in % (D) | Expected Privacy |
|---|---|
| D < 10 | High |
| 10 < D ≤ 50 | Medium |
| D > 50 | Low |
How to interpret NNDR#
If NNDR is close to 0, the synthetic data point is much closer to its nearest real data point than to its second-nearest one. This suggests that the synthetic data may reveal sensitive information about the real data. On the other hand, if NNDR is close to 1, the distances to the nearest and the second-nearest neighbor are almost the same. In this case it becomes more challenging to single out a real data point, as the synthetic data points are more uniformly distributed relative to the real records.
Implementation of NNDR#
1. Start with the synthetic dataset (S) and the original dataset (O).
2. For each synthetic data point in S, find its nearest neighbor in O and measure the distance between them.
3. For each synthetic data point in S, find its second-nearest neighbor in O and measure the distance between them.
4. Calculate the Nearest Neighbor Distance Ratio (NNDR) for each synthetic data point by dividing the distance to the nearest neighbor by the distance to the second-nearest neighbor, so that the ratio lies between 0 and 1.
5. Compute the average or median NNDR value across all synthetic data points.
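
A minimal sketch of NNDR follows the definition given earlier on this page: for each synthetic record, the distance to its closest original record divided by the distance to its second-closest original record. The function name `nndr_per_record` and the toy data are hypothetical, records are assumed to be tuples of comparably scaled numeric features, and reporting 0.0 when both distances are zero is an assumption made for this example.

```python
import math
from statistics import median


def nndr_per_record(synthetic, original):
    """Nearest / second-nearest distance ratio in `original` for each synthetic record."""
    if len(original) < 2:
        raise ValueError("need at least two original records")

    ratios = []
    for s in synthetic:
        # Two smallest distances from this synthetic record to the original data.
        d1, d2 = sorted(math.dist(s, o) for o in original)[:2]
        # Assumption: if both distances are 0 (the record matches duplicated
        # originals), report 0.0, the same value an exact copy receives.
        ratios.append(d1 / d2 if d2 > 0 else 0.0)
    return ratios


# Hypothetical toy data: each record is a tuple of numeric features.
original = [(0.0, 0.0), (4.0, 0.0), (0.0, 10.0)]
synthetic = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

ratios = nndr_per_record(synthetic, original)
nndr_median = median(ratios)
print(ratios, nndr_median)
```

In this toy data the first synthetic record is an exact copy (ratio 0), the second is three times closer to one original record than to the next, and the third is equidistant from two original records (ratio 1).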
Modified at 2023-08-29 05:53:39