Singling Out Attack#
Legal Definition according to GDPR:#
Singling out, which corresponds to the possibility to isolate some or all records which identify an individual in the dataset.

A Singling Out Attack is a type of privacy attack that identifies a specific individual in a synthetic dataset by exploiting that individual's unique attributes. These attributes could be anything that distinguishes the individual from others in the dataset, such as age, gender, or occupation.

The attacker's goal in a Singling Out Attack is to find a synthetic record that is similar to a specific individual in the original dataset, even if the synthetic record is not an exact match. Once the attacker has identified such a record, they can use other information available to them, such as publicly available information or data from other sources, to re-identify that individual.

Singling Out Attack Banding#
Number of columns refers to the number of attributes used to identify unique records in a dataset. For example, in a dataset with attributes like age, gender, and city, an attacker could use one attribute (univariate attack) or multiple attributes (multivariate attack) to single out individuals. In a univariate attack, they might single out individuals based on city alone; in a multivariate attack, they could use age, gender, and city together. The more attributes used, the more specific the identification, but the attacker also needs more detailed knowledge about the individuals.

A privacy risk of 0.117 means there is an 11.7% chance that an attacker could correctly single out (identify) records using a combination of attributes of the synthetic data.

Privacy Score = 100 * ( 1 - Upper limit of the confidence interval )

| Privacy Score | Upper limit of the confidence interval |
|---|---|
| P = 100 | ci2 = 0 |
| P = 90 | ci2 = 0.1 |
| P = 0 | ci2 = 1 |
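The score formula above can be expressed as a small helper. This is a minimal sketch; the function name is hypothetical, and `ci_upper` stands for the upper bound (ci2) of the privacy risk's confidence interval:

```python
def privacy_score(ci_upper: float) -> float:
    """Map the upper confidence bound of a privacy risk to a 0-100 score."""
    return 100 * (1 - ci_upper)

# A risk bound of 0.117 yields a conservative score of 88.3.
print(privacy_score(0.117))
```

Using the upper bound rather than the point estimate is what makes the score conservative: the reported safety never exceeds what the worst plausible risk would allow.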
Privacy Score, ranging from 0 to 100, is a measure of the safety of synthetic data against singling out attacks. A score of 100 signifies no risk of any individual being singled out based on the synthetic data, while a score of 0 indicates an extremely high risk. The score is calculated using the upper limit of the confidence interval for the singling out privacy risk, providing a conservative estimate.

Disclaimer: The acceptable level of privacy score in synthetic data depends on various factors, including regulatory requirements, industry standards, and the specific use case of the synthetic data.

How to interpret Singling Out Attack#
The singling out attack aims to create predicates, or guesses, that can identify individual data records, both in the original data and in the synthetic data. The attack exploits the fact that attributes or combinations of attributes that are rare or unique in the synthetic data are likely to be rare or unique in the original data as well.

Imagine we have a dataset containing information about people, including their gender, age, and ZIP code. The singling out attack tries to find predicates that can single out specific individuals in the dataset.

Example 1: Say the synthetic data contains only one person who is male, 65 years old, and lives in ZIP code 30305. The attack may generate a predicate like "There is just one person in the original dataset who is male, 65 years old, and lives in ZIP code 30305." This predicate is based on the observation that this combination of attributes is unique in the synthetic data, so it might also be unique in the original data.

Example 2: Suppose the synthetic data contains several missing values for the age attribute. The attack may generate a predicate like "There is a person in the original dataset whose age is missing." This predicate takes advantage of the fact that missing values are rare in the synthetic data and may help identify individuals in the original data.

Example 3: The synthetic data contains only one person who had a heart attack. The attack may generate a predicate like "There is just one person in the original dataset who had a heart attack." This predicate exploits the rarity of heart attacks in the synthetic data and assumes that it could also be a unique characteristic of an individual in the original data.

These examples demonstrate how the singling out attack creates predicates that leverage the uniqueness or rarity of attribute values in the synthetic data to identify individuals in the original data. By evaluating these predicates on the original dataset, the attack can assess the level of privacy risk associated with the synthetic data.

Implementation of Singling Out Attack#
1. Identify the target dataset: Select the dataset on which you want to perform the singling out attack. This dataset contains sensitive or personal information about the individuals you aim to single out.
2. Analyze attribute uniqueness: Analyze the uniqueness of attributes in the target dataset. Look for attributes that have a low cardinality or contain rare values; these are more likely to be effective in singling out individuals.
3. Create univariate singling out predicates: Generate univariate singling out predicates by examining each attribute in the target dataset. For categorical attributes or missing values, consider unique values or NaN (Not a Number) values. For numerical continuous attributes, create predicates based on being smaller than the minimum value or larger than the maximum value.
4. Create multivariate singling out predicates: Combine multiple attributes from the target dataset. Select a random record from the dataset and consider a random set of attributes. Formulate the predicate as the logical combination (AND) of the univariate predicates derived from the values of the selected record.
5. Evaluate the singling out predicates: Evaluate the predicates on the target dataset and check whether each one singles out a unique individual. If a predicate identifies exactly one record, it indicates a potential singling out of that individual.
6. Compare with a control dataset: To measure the true privacy risk, compare the results of the singling out attack against a control dataset of similar size to the target dataset. This comparison accounts for differences in the number of predicates that single out individuals due to dataset size.
7. Quantify the risk and assess privacy: Based on the evaluation results, quantify the singling out risk and assess the privacy implications. Consider the number of predicates that successfully single out individuals and the sensitivity of the information disclosed. This assessment helps determine the level of privacy risk associated with the singling out attack.
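The predicate construction and evaluation described above can be sketched in Python with pandas. The toy tables, column names, and `multivariate_predicate` helper below are hypothetical; a full evaluation would repeat this over many random records and attribute sets and include the control-dataset comparison and confidence intervals:

```python
import pandas as pd

# Toy "original" and "synthetic" tables; values are invented for illustration.
original = pd.DataFrame({
    "age": [65, 34, 29, 65, 51],
    "gender": ["M", "F", "F", "M", "M"],
    "zip": ["30305", "30306", "30306", "30307", "30305"],
})
synthetic = pd.DataFrame({
    "age": [65, 33, 30, 50, 51],
    "gender": ["M", "F", "F", "M", "M"],
    "zip": ["30305", "30306", "30307", "30307", "30305"],
})

def multivariate_predicate(row, cols):
    """Return a function testing the AND of univariate equality predicates."""
    def evaluate(df):
        mask = pd.Series(True, index=df.index)
        for col in cols:
            mask &= df[col] == row[col]
        return mask
    return evaluate

# Build a predicate from one synthetic record over a chosen attribute set.
record = synthetic.iloc[0]
pred = multivariate_predicate(record, ["age", "gender", "zip"])

# A predicate "singles out" an individual if it matches exactly one record.
singles_out_synthetic = int(pred(synthetic).sum()) == 1
singles_out_original = int(pred(original).sum()) == 1
print(singles_out_synthetic, singles_out_original)
```

Here the combination (65, M, 30305) is unique in the synthetic data, and the same predicate also matches exactly one record in the original data, which is precisely the kind of transfer the attack exploits.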
Linkability Attack#
Legal Definition according to GDPR:#
Linkability, which is the ability to link, at least, two records concerning the same data subject or a group of data subjects (either in the same database or in two different databases). If an attacker can establish (e.g. by means of correlation analysis) that two records are assigned to the same group of individuals but cannot single out individuals in this group, the technique provides resistance against "singling out" but not against linkability.

A Linkability Attack is a type of privacy attack that links two or more datasets together to identify individuals. This can be done by linking a synthetic dataset with another dataset or by linking a synthetic dataset to external information such as social media profiles or public records.

The attacker's goal in a Linkability Attack is to find a unique identifier or attribute in one dataset that can be used to link it to another dataset. For example, an attacker could use a person's ZIP code or age in a synthetic healthcare dataset to link it to public records and identify the person.

Linkability Attack Banding#
Number of neighbors (k) refers to the number of closest matches in the synthetic dataset that the attacker considers for each record in the original dataset. For example, if k is set to 3, the attacker considers the three closest matches in the synthetic dataset for each record in the original dataset. This accounts for the fact that synthetic data might not preserve exact one-to-one mappings of records.

Similar to the singling out attack, the confidence interval in a linkability attack represents the range within which the true privacy risk lies with a certain level of confidence. For example, a 95% confidence interval means that if you were to repeat the linkability attack many times, the true privacy risk would fall within that interval 95% of the time. It provides an estimate of the reliability and precision of the privacy risk assessment.

A privacy risk of 0.117 means there is an 11.7% chance that an attacker could correctly link a record in the synthetic dataset to a record in the original dataset using the considered attributes. This does not necessarily mean that 11.7% of the data is compromised, but rather that there is an 11.7% chance of a successful linkability attack based on the attribute values in the synthetic data. The higher the privacy risk, the greater the potential for privacy leakage from the synthetic data to the original data.

Privacy Score = 100 * ( 1 - Upper limit of the confidence interval )

| Privacy Score | Upper limit of the confidence interval |
|---|---|
| P = 100 | 0 |
| P = 90 | 0.1 |
| P = 0 | 1 |
Privacy Score, ranging from 0 to 100, is a measure of the safety of synthetic data against linkability attacks. A score of 100 signifies no risk of a record in the synthetic data being linked to a record in the original dataset, while a score of 0 indicates an extremely high risk. The score is calculated using the upper limit of the confidence interval for the linkability privacy risk, providing a conservative estimate.

Disclaimer: The acceptable level of privacy score in synthetic data depends on various factors, including regulatory requirements, industry standards, and the specific use case of the synthetic data.

How to interpret Linkability Attack#
The linkability attack focuses on establishing connections or links between datasets that share common attributes. By exploiting these shared attributes, an attacker can potentially link or associate records from different datasets, compromising individuals' privacy.

Example 1: Alice is a privacy-conscious user who uses Tor to browse the web anonymously. An adversary can still link her different Tor sessions together by looking at the time and location of her connections. Tor does not provide perfect anonymity, and the adversary can use statistical techniques to identify connections that are likely to belong to the same user.

Example 2: Bob uses a credit card to make online purchases. An adversary could link his different credit card transactions together by looking at the merchant, the amount, and the time of each purchase. This information could be used to track Bob's spending habits and identify his online shopping patterns.

Example 3: Charlie uses a social media platform. An adversary could link Charlie's different social media accounts together by looking for the same usernames, the same friends, and the same interests. This information could be used to build a profile of Charlie and learn more about his personal life.

Implementation of Linkability Attack#
1. Identify shared attributes: Identify the attributes that are shared between the released dataset and the external dataset you want to link. These shared attributes act as potential linking keys or identifiers that can establish a connection between the two datasets.
2. Analyze attribute uniqueness: Assess the uniqueness of the shared attributes in both datasets. Look for attributes that have a low cardinality or contain rare values. Unique identifiers such as names, social security numbers, or email addresses are particularly valuable for linking.
3. Combine the datasets: Combine the released dataset and the external dataset based on the shared attributes, either by merging on a common identifier or by performing a join operation on the shared attributes.
4. Identify matching records: Identify the records in the combined dataset that have matching values on the shared attributes. These matching records indicate potential links between the released dataset and the external dataset.
5. Evaluate linkability: Assess the level of linkability by calculating the proportion of records that can be successfully linked, i.e., the number of matched records divided by the total number of records in the released dataset.
6. Assess the privacy risk: Evaluate the privacy risk associated with the linkability attack. Consider the sensitivity of the information in both datasets and the potential harm that could result from the linkage, and assess whether the linkability poses a significant privacy threat to the individuals or organizations involved.
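The linking and evaluation steps above can be sketched with a pandas join. The released and external tables, their column names, and the attribute values are all hypothetical; a real attack would handle approximate matches (e.g. k nearest neighbors) rather than exact equality:

```python
import pandas as pd

# Hypothetical released (synthetic) data and an external dataset that
# happen to share the quasi-identifiers "zip" and "age".
released = pd.DataFrame({
    "zip": ["30305", "30306", "30307"],
    "age": [65, 34, 29],
    "diagnosis": ["flu", "cold", "flu"],
})
external = pd.DataFrame({
    "zip": ["30305", "30306", "30308"],
    "age": [65, 34, 51],
    "name": ["Alice", "Bob", "Carol"],
})

# Join on the shared attributes to find candidate links.
linked = released.merge(external, on=["zip", "age"], how="inner")

# Linkability: fraction of released records that could be linked.
linkability = len(linked) / len(released)
print(linked[["name", "diagnosis"]])
print(linkability)
```

In this toy case two of the three released records link to named individuals, attaching a sensitive diagnosis to each of them, which is exactly the harm the privacy-risk assessment step is meant to quantify.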
Inference Attack#
Legal Definition according to GDPR:#
Inference, which is the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.

An Inference Attack is a type of privacy attack that uses statistical patterns in a synthetic dataset to infer sensitive information about individuals. This can be done by analyzing the statistical properties of the synthetic dataset and using machine learning or statistical models to make inferences about individuals in the dataset.

The attacker's goal in an Inference Attack is to infer sensitive information about individuals, such as their income level, race, or gender, by analyzing patterns in the data, such as the distribution of values for a particular attribute or the relationships between attributes. For example, an attacker might use the distribution of ages in a synthetic healthcare dataset to infer information about an individual's health condition.

Inference Attack Banding#
Privacy risk for each column (or attribute) is like a game of guessing: it is the chance that an attacker could correctly guess information about a person in the original dataset based on the synthetic data. For example, a privacy risk of 0.117 means there is an 11.7% chance they could make a correct guess.

The confidence interval shows how sure we are about the privacy risk. A 95% confidence interval means that if we repeated the guessing game many times, we would expect the true privacy risk to fall within the interval 95% of the time. A narrower confidence interval means we are more sure about the privacy risk estimate. ci1 is the lower bound of the confidence interval; ci2 is the upper bound.

Privacy Score = 100 * ( 1 - Upper limit of the confidence interval [ci2] )

| Privacy Score | Upper limit of the confidence interval |
|---|---|
| P = 100 | 0 |
| P = 90 | 0.1 |
| P = 0 | 1 |
Privacy Score, ranging from 0 to 100, is a measure of the privacy safety of each attribute (column) in synthetic data. A score of 100 signifies no privacy risk, while a score of 0 indicates extremely high risk. The score is calculated using the upper limit of the confidence interval for the privacy risk, providing a conservative estimate.

Disclaimer: The acceptable level of privacy score in synthetic data depends on various factors, including regulatory requirements, industry standards, and the specific use case of the synthetic data.

How to interpret Inference Attack#
Imagine we have a dataset containing information about individuals, including their age, education level, and income. The goal of the inference attack is to infer an individual's income by leveraging the relationships between age and education level.

Example 1: Suppose we have the following dataset:

| Age | Education Level | Income |
|---|---|---|
| 30 | High School | Low |
| 40 | College | Medium |
| 50 | Graduate | High |
| 35 | High School | Low |
| 45 | Graduate | High |
An attacker performing an inference attack may observe that individuals with a higher education level tend to have a higher income. Based on this observation, they can make an inference about the income of an individual without directly knowing it. For instance, if they encounter a person in the dataset who is 30 years old and has a high school education, the attacker might infer that their income is likely to be low.

Example 2: Let's consider another dataset:

| Age | Education Level | Income |
|---|---|---|
| 25 | College | Low |
| 35 | Graduate | High |
| 40 | High School | Low |
| 45 | College | Medium |
| 30 | High School | Low |
In this scenario, the attacker again notices that individuals with higher education levels tend to have higher incomes. By combining this observation with the age attribute, they might conclude that among younger individuals, only those with a graduate education are likely to have high incomes. Therefore, if they come across a person in the dataset who is 25 years old and has a college education, the attacker could infer that their income is low based on the observed patterns.

These examples illustrate how the inference attack exploits relationships between non-sensitive attributes to infer sensitive information, such as income. By analyzing patterns and correlations within the dataset, an attacker can make educated guesses about sensitive attributes without having direct access to them.

Implementation of Inference Attack#
1. Identify the target attribute: Determine the sensitive attribute you want to infer from the released data. This could be any information that is not directly disclosed but can be deduced from the available attributes.
2. Analyze attribute relationships: Examine the relationships between the target attribute and the non-sensitive attributes in the dataset. Look for patterns, correlations, or dependencies that can help infer the target attribute. Statistical analysis techniques such as regression analysis, data mining, or machine learning algorithms can be used for this purpose.
3. Select the attributes for inference: Identify the non-sensitive attributes that are most informative for inferring the target attribute. These attributes should have a strong relationship or correlation with the target attribute.
4. Build an inference model: Develop a model or algorithm that uses the selected attributes to infer the target attribute. This can be as simple as a set of rules or as complex as a machine learning model, and it should capture the patterns and relationships observed in the dataset.
5. Test the inference model: Evaluate the accuracy and effectiveness of the model on a validation dataset or through cross-validation. Assess how well the model predicts the target attribute from the selected non-sensitive attributes.
6. Apply the inference model: Once the model is validated, apply it to new or unseen data to infer the target attribute by inputting the values of the selected non-sensitive attributes.
7. Assess the privacy risk: Evaluate the potential privacy risk associated with the inference attack. Consider the accuracy of the inference and the potential harm or sensitivity of the inferred information.
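The model-building, application, and risk-assessment steps above can be sketched with the toy education/income table. The data and the majority-vote "model" are hypothetical and deliberately minimal; a real attack would use a held-out validation set and typically a stronger model:

```python
import pandas as pd

# Toy dataset mirroring the first example above.
data = pd.DataFrame({
    "age": [30, 40, 50, 35, 45],
    "education": ["High School", "College", "Graduate", "High School", "Graduate"],
    "income": ["Low", "Medium", "High", "Low", "High"],
})

# Build a minimal inference "model": most common income per education level.
model = (
    data.groupby("education")["income"]
    .agg(lambda s: s.mode().iloc[0])
    .to_dict()
)

# Apply the model: infer the sensitive attribute for a new individual.
guess = model["High School"]

# Assess the risk: attack accuracy on the known records.
accuracy = float((data["education"].map(model) == data["income"]).mean())
print(guess, accuracy)
```

On this tiny table the education level alone predicts income perfectly, so the inference risk for the income column would be maximal; with realistic data the accuracy, and hence the privacy risk, would be estimated with a confidence interval as described in the banding section.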
Modified at 2023-08-29 05:57:50