Betterdata Docs

Syntactical Accuracy Metrics

As explained in the previous section, syntactical accuracy checks verify that the generated synthetic data adheres to the predefined rules and structures of the original data. Synthetic data that maintains the same format, structure, and constraints as the original data is suitable for applications such as testing, training machine learning models, and conducting research.

Syntactical Accuracy Metrics

Syntactical accuracy can be broken down into the following components. All features in synthetic data should pass syntactical accuracy checks.

Data Type Consistency Check:

This check ensures that the synthetic data maintains the same data types as the original data. For example, if the "Age" feature in the original data is an integer, the "Age" feature in the synthetic data should also be an integer. Floating-point numbers or strings in this field indicate a lack of data type consistency.
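A data type check along these lines can be sketched in plain Python. This is a minimal illustration, not Betterdata's implementation: the `type_mismatches` helper and the sample rows are hypothetical, and rows are assumed to be dictionaries keyed by feature name.

```python
def type_mismatches(original_rows, synthetic_rows):
    """Return {field: set of type names} for synthetic values whose Python
    type never appears for that field in the original rows."""
    # Collect the types observed per field in the original data.
    expected = {}
    for row in original_rows:
        for field, value in row.items():
            expected.setdefault(field, set()).add(type(value))

    # Flag synthetic values whose type was not observed for that field.
    mismatches = {}
    for row in synthetic_rows:
        for field, value in row.items():
            if type(value) not in expected.get(field, set()):
                mismatches.setdefault(field, set()).add(type(value).__name__)
    return mismatches


# Hypothetical sample data: "Age" is an integer in the original.
original = [{"Age": 34}, {"Age": 51}]
synthetic = [{"Age": 29}, {"Age": 41.5}, {"Age": "forty"}]

mismatches = type_mismatches(original, synthetic)
print(mismatches)  # reports float and str under "Age"
```

An empty result means every synthetic value has a type that was also seen in the original data for the same field.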

Format Consistency Check:

This involves maintaining the same data formats as the original data. For example, if dates in the original data are in the format "YYYY-MM-DD", the synthetic data should also use this format. If you find dates in formats like "MM-DD-YYYY" or "DD-MM-YYYY", it indicates a lack of format consistency.
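As a rough sketch of the date example, the check below accepts only strings in "YYYY-MM-DD" form. The helper name `is_iso_date` and the sample values are illustrative assumptions. A regular expression is combined with `datetime.strptime` because `strptime` on its own tolerates non-padded values such as "2024-1-5".

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """True if value is a string in YYYY-MM-DD form and a real calendar date."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Right shape, but not a valid date (for example February 30).
        return False
    return True


# Hypothetical synthetic values for a date field.
synthetic_dates = ["2024-03-15", "03-15-2024", "15-03-2024", "2024-02-30"]
bad_dates = [d for d in synthetic_dates if not is_iso_date(d)]
print(bad_dates)  # ['03-15-2024', '15-03-2024', '2024-02-30']
```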

Constraint Adherence Check:

This involves ensuring that the synthetic data adheres to the same constraints as the original data. For example, if a particular field in the original data can only take on values within a certain range, the synthetic data should also respect this constraint.
Suppose you have a dataset of employees in a company, and one of the fields is "Department", which can only take on values from a predefined list (e.g., "Sales", "Marketing", "HR", "Engineering"). If the synthetic data generation process is accurate, it should also generate values for the "Department" field from this same list. If you find values like "Finance" or "Legal" which are not in the original list, it indicates a lack of constraint adherence.
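The two kinds of constraint mentioned above, an allowed list of values and a numeric range, can be checked with a short sketch like the following. The `constraint_violations` helper is hypothetical, and the "Age" range of 18 to 65 is an assumption added for illustration; only the "Department" list comes from the example above.

```python
def constraint_violations(rows, allowed_values=None, ranges=None):
    """Return (row index, field, value) for every value that breaks a constraint.

    allowed_values maps a field to the set of permitted values.
    ranges maps a field to an inclusive (low, high) pair.
    """
    allowed_values = allowed_values or {}
    ranges = ranges or {}
    violations = []
    for index, row in enumerate(rows):
        for field, allowed in allowed_values.items():
            if row.get(field) not in allowed:
                violations.append((index, field, row.get(field)))
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is None or not (low <= value <= high):
                violations.append((index, field, value))
    return violations


# "Department" values from the example above; the "Age" range is assumed.
allowed = {"Department": {"Sales", "Marketing", "HR", "Engineering"}}
ranges = {"Age": (18, 65)}

synthetic = [
    {"Department": "Sales", "Age": 30},
    {"Department": "Finance", "Age": 45},  # not in the allowed list
    {"Department": "HR", "Age": 17},       # below the assumed range
]

violations = constraint_violations(synthetic, allowed, ranges)
print(violations)  # [(1, 'Department', 'Finance'), (2, 'Age', 17)]
```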

Relationship Consistency Check:

This refers to maintaining the same relationships between different data fields as in the original data. For example, if a certain relationship exists between two fields in the original data (such as one field always being a certain value when another field is a certain value), this relationship should also exist in the synthetic data.
Let's say in your original dataset, there is a relationship between two fields: "Vehicle Type" and "Fuel Type". For instance, if the vehicle type is "Electric Car", the fuel type is always "Electricity". If the synthetic data generation process is accurate, it should also maintain this relationship. If you find an "Electric Car" with a fuel type of "Diesel" or "Petrol", it indicates a lack of relationship consistency.
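One way to sketch this check is to learn which value pairs occur together in the original data and then flag synthetic rows whose pair was never observed. The helper names and sample rows below are hypothetical. The approach assumes the original data covers all valid combinations, so a vehicle type that never appears in the original is also flagged.

```python
def learn_allowed_pairs(rows, key_field, dependent_field):
    """Map each key_field value to the dependent_field values seen with it."""
    allowed = {}
    for row in rows:
        allowed.setdefault(row[key_field], set()).add(row[dependent_field])
    return allowed


def relationship_violations(rows, key_field, dependent_field, allowed):
    """Return (row index, key value, dependent value) for unseen pairs."""
    return [
        (index, row[key_field], row[dependent_field])
        for index, row in enumerate(rows)
        if row[dependent_field] not in allowed.get(row[key_field], set())
    ]


# Hypothetical sample data following the vehicle example above.
original = [
    {"Vehicle Type": "Electric Car", "Fuel Type": "Electricity"},
    {"Vehicle Type": "Truck", "Fuel Type": "Diesel"},
    {"Vehicle Type": "Sedan", "Fuel Type": "Petrol"},
]
synthetic = [
    {"Vehicle Type": "Electric Car", "Fuel Type": "Electricity"},
    {"Vehicle Type": "Electric Car", "Fuel Type": "Diesel"},  # broken relationship
    {"Vehicle Type": "Truck", "Fuel Type": "Diesel"},
]

allowed_pairs = learn_allowed_pairs(original, "Vehicle Type", "Fuel Type")
broken = relationship_violations(synthetic, "Vehicle Type", "Fuel Type", allowed_pairs)
print(broken)  # [(1, 'Electric Car', 'Diesel')]
```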

Semantic Consistency Check:

While not strictly a part of syntactical accuracy, semantic consistency is closely related and refers to ensuring that the synthetic data maintains the same meaning as the original data. This involves ensuring that the synthetic data makes sense in the context of the original data and the domain from which it comes.
Suppose you have a dataset of books in a library, and one of the fields is "Genre", which takes on values like "Fiction", "Non-fiction", "Sci-Fi", "Romance", etc. If the synthetic data generation process is accurate, it should generate values for the "Genre" field that make sense in the context of books. If you find values like "Fruit" or "Animal", which are semantically inconsistent with the concept of book genres, it indicates a lack of semantic consistency.
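Semantic checks usually need domain knowledge, so they are harder to automate than the checks above. One simple approximation is to compare values against a domain vocabulary that can be broader than the values present in the original data. The `BOOK_GENRES` vocabulary and the `semantically_invalid` helper below are assumptions made for illustration only.

```python
# Hypothetical domain vocabulary, stored in lowercase. It may include genres
# that are valid for books even if they never occur in the original dataset.
BOOK_GENRES = {"fiction", "non-fiction", "sci-fi", "romance", "mystery", "biography"}


def semantically_invalid(values, vocabulary):
    """Return the sorted distinct values that fall outside the vocabulary.

    Comparison ignores surrounding whitespace and letter case.
    """
    return sorted({v for v in values if v.strip().lower() not in vocabulary})


# Hypothetical synthetic values for the "Genre" field.
synthetic_genres = ["Fiction", "Sci-Fi", "Fruit", "Mystery", "Animal"]
invalid = semantically_invalid(synthetic_genres, BOOK_GENRES)
print(invalid)  # ['Animal', 'Fruit']
```

Unlike the constraint adherence check, "Mystery" passes here even if it never appears in the original data, because it is a plausible book genre under the assumed vocabulary.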
Modified at 2025-02-21 02:01:50