Synthesize a new dataset from an original dataset for downstream machine learning, making it easier for customers to share data.
APACHE-2.0 License
Note: fork of https://github.com/SAP/data-synthesis-for-machine-learning
The recent enforcement of data privacy protection regulations, such as GDPR, has made data sharing more difficult. This tool intends to facilitate data sharing from a customer by synthesizing a dataset based on the original dataset for later machine learning. The tool has two parts: a data synthesizer and a data utility evaluation.
Our project is independent of any database and focuses on downstream machine learning. HANA also offers a data anonymization feature (not open source). If a customer has HANA, please use that feature within HANA. If a customer does not have HANA and the use case is machine learning, please use our tool. Exception: if a customer with HANA wants to find out whether the algorithm implemented in our tool gives better results than HANA, they can use our tool as well.
The example folder contains a demo dataset (part of the public Adult dataset), the synthesized dataset, and the utility evaluation report for your reference.
pip install ds4ml
The project has two parts: the data synthesizer and the data utility evaluation.
Synthesizer:
data-synthesize <original-dataset> <-o> <synthesized-dataset-path>
For example, with adult.csv as the original dataset, the synthesized dataset adult-a.csv is generated in the current folder by default.
The data-pattern command serializes the anonymous patterns of a dataset. data-synthesize can then generate a dataset from the pattern:
data-pattern <original-dataset> <-o> <anonymous-pattern-path>
data-synthesize <anonymous-pattern-path> <-o> <synthesized-dataset-path>
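The two-step workflow above can also be scripted. The following is a minimal sketch, assuming ds4ml is installed so that data-pattern and data-synthesize are on the PATH. The helper names pattern_workflow and run_commands and the file names are illustrative and not part of ds4ml; only the positional file argument and the -o option from the help output are used.

```python
import shlex
import subprocess
from typing import List


def pattern_workflow(dataset: str, pattern: str, output: str) -> List[List[str]]:
    """Return the two commands of the pattern-based workflow, in order."""
    return [
        # Step 1: serialize the anonymous pattern of the original dataset.
        ["data-pattern", dataset, "-o", pattern],
        # Step 2: synthesize a dataset from the pattern file.
        ["data-synthesize", pattern, "-o", output],
    ]


def run_commands(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(shlex.quote(part) for part in cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    # Dry run: only prints the commands, nothing is executed.
    run_commands(pattern_workflow("adult.csv", "adult-pattern.json", "adult-a.csv"))
```

Since the second command takes only the pattern file as input, it can in principle run where the original dataset is not available.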
Evaluation:
data-evaluate <original-dataset> <synthesized-dataset> --class-label <attribute1,attribute...>
For example, with adult.csv as the original dataset, adult-a.csv as the synthesized dataset, and sex,salary as the --class-label attributes, a report.html is generated in the current folder by default.
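The evaluation call can be assembled programmatically in the same spirit. This is a small sketch, assuming data-evaluate is on the PATH; the function evaluate_command is a hypothetical helper, and it only uses the --class-label and -o options listed in the help output. Multiple class labels are joined with commas, as in the example above.

```python
from typing import List, Optional, Sequence


def evaluate_command(
    source: str,
    target: str,
    class_labels: Sequence[str] = (),
    report: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a data-evaluate call."""
    cmd = ["data-evaluate", source, target]
    if class_labels:
        # One or more columns, separated by commas, form a single argument.
        cmd += ["--class-label", ",".join(class_labels)]
    if report is not None:
        # Without -o, the report defaults to report.html in the working directory.
        cmd += ["-o", report]
    return cmd


if __name__ == "__main__":
    print(evaluate_command("adult.csv", "adult-a.csv", ["sex", "salary"]))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to execute the evaluation.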
Install Python
Ensure your Python version is >=3.6.0; you can download it from python.org.
After cloning the project, install it with: python setup.py install
The project provides three commands: data-synthesize, data-pattern, and data-evaluate. Run any of them with the -h option for details.
Help of data-synthesize
Run data-synthesize -h:
usage: data-synthesize [-h] [--pseudonym LIST] [--delete LIST]
                       [--na-values LIST] [-o FILE] [--no-header] [-e FLOAT]
                       [--category LIST] [--retain LIST]
                       file

Synthesize one dataset by Differential Privacy

positional arguments:
  file                  set path of a csv file to be synthesized or path of a
                        pattern file to be generated

general arguments:
  -h, --help            show this help message and exit
  --pseudonym LIST      set candidate columns separated by a comma, which will
                        be replaced with a pseudonym. It only works on the
                        string column.
  --delete LIST         set columns separated by a comma, which will be deleted
                        when synthesis.
  --na-values LIST      set additional values to recognize as NA/NaN; (default
                        null values are from pandas.read_csv)
  -o, --output FILE     set the file name of output synthesized dataset
                        (default is input file name with suffix '-a.csv')
  --no-header           indicate there is no header in a CSV file, and will
                        take [#0, #1, #2, ...] as header. (default: the tool
                        will try to detect and take actions)
  --records INT         specify the records you want to generate; default is the
                        same records with the original dataset
  --sep STRING          specify the delimiter of the input file

advanced arguments:
  -e, --epsilon FLOAT   set epsilon for differential privacy (default 0.1)
  --category LIST       set categorical columns separated by a comma.
  --retain LIST         set columns to retain the values
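The -e/--epsilon option sets the privacy budget for differential privacy. As a rough illustration of what epsilon means in general, the sketch below applies the textbook Laplace mechanism to a histogram of category counts. This is not taken from ds4ml and does not claim to reproduce its algorithm; the function names and sample values are made up for the example.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_counts(
    values: Iterable[str], epsilon: float, seed: Optional[int] = None
) -> Dict[str, float]:
    """Return category counts perturbed with Laplace noise of scale 1/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    # Adding or removing one record changes one count by 1, so sensitivity is 1.
    scale = 1.0 / epsilon
    counts = Counter(values)
    return {key: count + laplace_noise(scale, rng) for key, count in sorted(counts.items())}


if __name__ == "__main__":
    sample = ["Male", "Female", "Male", "Male", "Female"]  # made-up values
    for eps in (0.1, 1.0, 10.0):
        print(eps, noisy_counts(sample, epsilon=eps, seed=42))
```

A smaller epsilon means a larger noise scale, so the counts are better protected but less accurate; a larger epsilon keeps the counts closer to the true values.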
Help of data-pattern
Run data-pattern -h:
usage: data-pattern [-h] [--pseudonym LIST] [--delete LIST] [--na-values LIST]
                    [-o FILE] [--no-header] [--sep STRING] [-e FLOAT]
                    [--category LIST] [--retain LIST]
                    file

Serialize patterns of a dataset anonymously

positional arguments:
  file                  set path of a csv file to be patterned anonymously

general arguments:
  -h, --help            show this help message and exit
  --pseudonym LIST      set candidate columns separated by a comma, which will
                        be replaced with a pseudonym. It only works on the
                        string column.
  --delete LIST         set columns separated by a comma, which will be deleted
                        when synthesis.
  --na-values LIST      set additional values to recognize as NA/NaN; (default
                        null values are from pandas.read_csv)
  -o, --output FILE     set the file name of anonymous patterns (default is
                        input file name with a suffix '-pattern.json')
  --no-header           indicate there is no header in a CSV file, and will
                        take [#0, #1, #2, ...] as header. (default: the tool
                        will try to detect and take actions)
  --sep STRING          specify the delimiter of the input file

advanced arguments:
  -e, --epsilon FLOAT   set epsilon for differential privacy (default 0.1)
  --category LIST       set categorical columns separated by a comma.
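The --no-header and --sep options describe how the input file is read: a file without a header row gets the column names #0, #1, #2, and so on. The sketch below mimics that naming with the standard library only. It is an illustration under that assumption, not the tool's own reader (the help text suggests the tool relies on pandas.read_csv); the function read_headerless_csv and the sample rows are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def read_headerless_csv(text: str, sep: str = ",") -> Tuple[List[str], List[List[str]]]:
    """Parse CSV text that has no header row and generate #0, #1, ... names."""
    # Skip completely empty lines so they do not count as records.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=sep) if row]
    width = max((len(row) for row in rows), default=0)
    header = ["#{}".format(i) for i in range(width)]
    return header, rows


if __name__ == "__main__":
    sample = "39;State-gov;Bachelors\n50;Private;Masters\n"  # made-up rows
    header, rows = read_headerless_csv(sample, sep=";")
    print(header)
    print(rows)
```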
Use data-pattern to serialize the patterns of a dataset anonymously, then run data-synthesize *-pattern.json to synthesize the data.
Help of data-evaluate
Run data-evaluate -h:
usage: data-evaluate [-h] [--na-values LIST] [-o FILE] [-t TEST]
                     [--class-label LIST]
                     source target

Evaluate the utility of the synthesized dataset compared with the source dataset.

positional arguments:
  source                set file path of source (raw) dataset to be compared with
                        synthesized dataset, only support CSV files
  target                set file path of target (synthesized) dataset to evaluate

general arguments:
  -h, --help            show this help message and exit
  --na-values LIST      set additional values to recognize as NA/NaN; (default
                        null values are from pandas.read_csv)
  -o, --output FILE     set output path for evaluation report; (default is
                        "report.html" under current work directory)

advanced arguments:
  -t, --test TEST       set test dataset for classification or regression task;
                        (default take 20 percent from source dataset)
  --category LIST       set categorical columns separated by a comma.
  --class-label LIST    set column name as a class label for classification or
                        regression task; supports one or multiple columns
                        (separated by comma)
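When no -t/--test file is given, the help text says 20 percent of the source dataset is taken as the test set. How the tool selects those rows is not stated, so the sketch below only shows one common way to hold out a fixed fraction: a seeded random split. The function holdout_split and its parameters are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly hold out a fraction of rows as a test set, keeping row order."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


if __name__ == "__main__":
    train, test = holdout_split(list(range(10)))
    print(len(train), len(test))
```

To control the test data explicitly, pass a prepared test file with -t instead of relying on the default.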
If you encounter a problem, you can open an issue on GitHub.
Please check our Contribution Guidelines.
Copyright 2019-2021 SAP SE or an SAP affiliate company and data-synthesis-for-machine-learning contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.