Data Synthesis for Machine Learning

Note: fork of https://github.com/SAP/data-synthesis-for-machine-learning

Overview

The recent enforcement of data privacy protection regulations, such as GDPR, has made data sharing more difficult. This tool intends to facilitate data sharing from a customer by synthesizing a dataset based on the original dataset for later machine learning. There are two parts to this tool:

Data synthesizer
Synthesizes a dataset based on the original dataset. It accepts CSV data as input and outputs a synthesized dataset based on Differential Privacy. The algorithm in the data synthesizer is based on the PrivBayes paper (2017).
Data utility evaluation
Evaluates the utility of the synthesized dataset. It takes the original dataset and the synthesized dataset as input and generates a utility evaluation report with several indicators.

Positioning

Our project is independent of any database and focuses on the later machine learning use case. HANA also has a data anonymization feature (not open source). If a customer has HANA, please use that feature within HANA. If a customer does not have HANA and the use case is machine learning, please use our tool. Exception: if a customer with HANA wants to try the algorithm implemented in our tool to find out whether it gives better results than HANA, please use our tool.

Quickstart

The folder example contains a demo dataset (part of the public Adult dataset), the synthesized dataset, and the utility evaluation report for your reference.

Prerequisites

Install Python >= 3.6.0 from python.org
Install the project: run pip install ds4ml
Download the demo data from example/adult.csv

Procedure

The project has two parts: the data synthesizer and the data utility evaluation.

Synthesizer:
```
data-synthesize <original-dataset> <-o> <synthesized-dataset-path>
```
Using adult.csv as original-dataset, the synthesized dataset adult-a.csv is generated in the current folder by default.
- To save the cost of transferring (synthesized) data, use the data-pattern command, which serializes the anonymous patterns of a dataset. Then use data-synthesize to generate a dataset from the pattern.
```
data-pattern <original-dataset> <-o> <anonymous-pattern-path>
data-synthesize <anonymous-pattern-path> <-o> <synthesized-dataset-path>
```
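The two-step workflow above can also be driven from a script. The following is a minimal Python sketch, not part of the project: the helper names and the file names adult-pattern.json and adult-a.csv are chosen for illustration, and only the -o option documented in the help text is used. The commands are executed only if both tools are found on the PATH.

```python
import shutil
import subprocess


def pattern_command(source, pattern_path):
    """Build the argument list for: data-pattern <source> -o <pattern_path>."""
    return ["data-pattern", source, "-o", pattern_path]


def synthesize_command(source, output_path):
    """Build the argument list for: data-synthesize <source> -o <output_path>.

    `source` may be a CSV file or a pattern file written by data-pattern.
    """
    return ["data-synthesize", source, "-o", output_path]


def run_pipeline(source, pattern_path, output_path):
    """Serialize the pattern, then synthesize a dataset from it.

    Returns the commands so callers can log them. Nothing is executed
    unless both command-line tools are installed.
    """
    steps = [
        pattern_command(source, pattern_path),
        synthesize_command(pattern_path, output_path),
    ]
    if all(shutil.which(cmd[0]) for cmd in steps):
        for cmd in steps:
            subprocess.run(cmd, check=True)
    return steps
```

Calling run_pipeline("adult.csv", "adult-pattern.json", "adult-a.csv") on the machine that holds the original data and then shipping only the pattern file is one way to use this; the synthesis step can equally be run by the recipient.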
Evaluation:
```
data-evaluate <original-dataset> <synthesized-dataset> --class-label <attribute1,attribute...>
```
Using adult.csv as original-dataset, adult-a.csv as synthesized-dataset, and sex,salary as the attributes for "--class-label", a report.html is generated in the current folder by default.
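As a sketch of how the evaluation call might be assembled programmatically, the hypothetical helper below joins several class labels with commas, as the --class-label option expects. The function name and its parameters are assumptions made for illustration; the options themselves (--class-label, -o, -t) are taken from the help text of data-evaluate.

```python
def evaluate_command(source, target, class_labels=(), output=None, test=None):
    """Build the argument list for data-evaluate.

    class_labels: column names, passed as one comma-separated value.
    output:       optional report path (-o).
    test:         optional test dataset path (-t).
    """
    cmd = ["data-evaluate", source, target]
    if class_labels:
        cmd += ["--class-label", ",".join(class_labels)]
    if output is not None:
        cmd += ["-o", output]
    if test is not None:
        cmd += ["-t", test]
    return cmd


# The example from the text: evaluate adult-a.csv against adult.csv.
command = evaluate_command("adult.csv", "adult-a.csv", ["sex", "salary"])
```

The resulting list can be passed to subprocess.run(command, check=True) once the project is installed.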

Download and Installation

Install Python: ensure your Python version is >= 3.6.0; you can download it from python.org.
After cloning the project, install it with: python setup.py install. The project provides three commands: data-synthesize, data-pattern and data-evaluate. Run them with option -h for details.
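To confirm that the installation worked, a small check like the following may help. It is only a sketch using the Python standard library: the function names are illustrative, and it assumes the three commands are installed as executables on the PATH.

```python
import shutil
import sys

# The three command-line tools named in this README.
COMMANDS = ("data-synthesize", "data-pattern", "data-evaluate")


def python_ok(minimum=(3, 6, 0)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:3] >= minimum


def installed_commands(names=COMMANDS):
    """Map each command name to True if it is found on the PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    print("Python version ok:", python_ok())
    for name, found in installed_commands().items():
        print(name, "found" if found else "missing")
```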

Help of data-synthesize: run data-synthesize -h.

usage: data-synthesize [-h] [--pseudonym LIST] [--delete LIST]
                     [--na-values LIST] [-o FILE] [--no-header] [-e FLOAT]
                     [--category LIST] [--retain LIST]
                     file

Synthesize one dataset by Differential Privacy

positional arguments:
  file                set path of a csv file to be synthesized or path of a 
                      pattern file to be generated

general arguments:
  -h, --help          show this help message and exit
  --pseudonym LIST    set candidate columns separated by a comma, which will
                      be replaced with a pseudonym. It only works on the
                      string column.
  --delete LIST       set columns separated by a comma, which will be deleted
                      when synthesis.
  --na-values LIST    set additional values to recognize as NA/NaN; (default
                      null values are from pandas.read_csv)
  -o, --output FILE   set the file name of output synthesized dataset
                      (default is input file name with suffix '-a.csv')
  --no-header         indicate there is no header in a CSV file, and will
                      take [#0, #1, #2, ...] as header. (default: the tool
                      will try to detect and take actions)
  --records INT       specify the records you want to generate; default is the
                      same records with the original dataset
  --sep STRING        specify the delimiter of the input file

advanced arguments:
  -e, --epsilon FLOAT  set epsilon for differential privacy (default 0.1)
  --category LIST      set categorical columns separated by a comma.
  --retain LIST        set columns to retain the values

Note: the argument epsilon (ε) sets the privacy guarantee; the appropriate value depends on the size and features of the dataset to be anonymized. The lower the value of epsilon, the more noise is applied and the lower the utility.
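The relationship between epsilon and noise can be illustrated with the generic Laplace mechanism, in which the noise scale is sensitivity / epsilon. The sketch below is not the tool's implementation: the function names, the example counts, and the sensitivity of 1 are assumptions made for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, epsilon, sensitivity=1.0, seed=0):
    """Add Laplace noise with scale sensitivity / epsilon to each count."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def mean_abs_error(counts, epsilon, trials=1000):
    """Average absolute deviation of the noisy counts from the true counts."""
    total = 0.0
    for trial in range(trials):
        noisy = privatize_counts(counts, epsilon, seed=trial)
        total += sum(abs(n - c) for n, c in zip(noisy, counts))
    return total / (trials * len(counts))


if __name__ == "__main__":
    example_counts = [120, 45, 300]  # made-up histogram counts
    for eps in (0.1, 1.0, 10.0):
        print("epsilon", eps, "mean abs error", mean_abs_error(example_counts, eps))
```

Running it shows the average error growing as epsilon shrinks, which is the utility loss the note refers to.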

Help of data-pattern: run data-pattern -h.

usage: data-pattern [-h] [--pseudonym LIST] [--delete LIST] [--na-values LIST]
                  [-o FILE] [--no-header] [--sep STRING] [-e FLOAT]
                  [--category LIST] [--retain LIST]
                  file

Serialize patterns of a dataset anonymously

positional arguments:
  file                 set path of a csv file to be patterned anonymously

general arguments:
  -h, --help           show this help message and exit
  --pseudonym LIST     set candidate columns separated by a comma, which will
                       be replaced with a pseudonym. It only works on the
                       string column.
  --delete LIST        set columns separated by a comma, which will be deleted
                       when synthesis.
  --na-values LIST     set additional values to recognize as NA/NaN; (default
                       null values are from pandas.read_csv)
  -o, --output FILE    set the file name of anonymous patterns (default is
                       input file name with a suffix '-pattern.json')
  --no-header          indicate there is no header in a CSV file, and will
                       take [#0, #1, #2, ...] as header. (default: the tool
                       will try to detect and take actions)
  --sep STRING         specify the delimiter of the input file

advanced arguments:
  -e, --epsilon FLOAT  set epsilon for differential privacy (default 0.1)
  --category LIST      set categorical columns separated by a comma.

Note: after data-pattern has generated the anonymous pattern (.json) file, run the command data-synthesize *-pattern.json to synthesize the data.

Help of data-evaluate: run data-evaluate -h.

usage: data-evaluate [-h] [--na-values LIST] [-o FILE] [-t TEST]
                    [--class-label LIST]
                    source target

Evaluate the utility of the synthesized dataset compared with the source dataset.

positional arguments:
  source              set file path of source (raw) dataset to be compared with
                      synthesized dataset, only support CSV files
  target              set file path of target (synthesized) dataset to evaluate

general arguments:
  -h, --help          show this help message and exit
  --na-values LIST    set additional values to recognize as NA/NaN; (default
                      null values are from pandas.read_csv)
  -o, --output FILE   set output path for evaluation report; (default is
                      "report.html" under current work directory)

advanced arguments:
  -t, --test TEST     set test dataset for classification or regression task;
                      (default take 20 percent from source dataset)
  --category LIST     set categorical columns separated by a comma.
  --class-label LIST  set column name as a class label for classification or
                      regression task; supports one or multiple columns
                      (separated by comma)

How to obtain support

If you encounter an issue, you can open an issue in GitHub.

Contribute

Please check our Contribution Guidelines.

Licensing

Copyright 2019-2021 SAP SE or an SAP affiliate company and data-synthesis-for-machine-learning contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.
