This package features data-science related tasks for developing new recognizers for Presidio. It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models. In addition, it contains a fake data generator which creates fake sentences based on templates and fake PII.
Note: Presidio evaluator requires Python>=3.9
conda create --name presidio python=3.9
conda activate presidio
pip install presidio-evaluator
# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg
To install the package:
# Install package+dependencies
pip install poetry
poetry install --with=dev
# To install with all additional NER dependencies (e.g. Flair, Stanza, CRF), run:
# poetry install --with='ner,dev'
# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg
# Verify installation
pytest
Note that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.
See Data Generator README for more details.
The data generation process receives a file with templates, e.g. My name is {{name}}.
Then, it creates new synthetic sentences by sampling templates and PII values.
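Conceptually, the template-filling step looks like this minimal sketch. The `FAKE_VALUES` table and `fill_template` helper are illustrative only, not the package's API; the real generator draws from much richer fake-PII sources:

```python
import random
import re

# Illustrative stand-in values; the real generator samples from large fake-PII sources
FAKE_VALUES = {"name": ["John Doe", "Jane Roe"], "city": ["Springfield", "Riverton"]}

def fill_template(template: str, rng=random) -> str:
    """Replace each {{placeholder}} with a randomly sampled fake value."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: rng.choice(FAKE_VALUES[m.group(1)]),
        template,
    )

sentence = fill_template("My name is {{name}} and I live in {{city}}.")
```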
Furthermore, it tokenizes the data and creates tags (IO/BIO/BILUO) and spans for the newly created samples.
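For intuition, converting character spans into BIO token tags can be sketched as follows. `bio_tags` and its tuple inputs are hypothetical, not the package's internal representation:

```python
def bio_tags(tokens, spans):
    """tokens: (text, start, end) triples; spans: (entity, start, end) triples.

    Assigns B- to the first token inside each span and I- to the rest.
    """
    tags = ["O"] * len(tokens)
    for entity, s_start, s_end in spans:
        # indices of tokens fully contained in the span
        inside = [i for i, (_, t_start, t_end) in enumerate(tokens)
                  if t_start >= s_start and t_end <= s_end]
        for rank, i in enumerate(inside):
            tags[i] = ("B-" if rank == 0 else "I-") + entity
    return tags

tokens = [("My", 0, 2), ("name", 3, 7), ("is", 8, 10), ("John", 11, 15), ("Doe", 16, 19)]
tags = bio_tags(tokens, [("PERSON", 11, 19)])
# → ["O", "O", "O", "B-PERSON", "I-PERSON"]
```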
Once data is generated, it can be split into train/test/validation sets while ensuring that each template appears in only one set. See this notebook for more details.
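The template-aware split can be sketched generically. Here `Sample` is a stand-in for `InputSample`, and the assumption that each sample records a `template_id` is illustrative:

```python
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Sample:  # stand-in for InputSample
    text: str
    template_id: int  # assumed: each sample remembers the template it came from

def split_by_template(samples, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split samples so that every template lands in exactly one subset."""
    by_template = defaultdict(list)
    for s in samples:
        by_template[s.template_id].append(s)
    ids = sorted(by_template)
    random.Random(seed).shuffle(ids)
    cut1 = int(len(ids) * ratios[0])
    cut2 = cut1 + int(len(ids) * ratios[1])
    groups = (ids[:cut1], ids[cut1:cut2], ids[cut2:])
    # expand each group of template ids back into its samples
    return tuple([s for t in g for s in by_template[t]] for g in groups)
```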
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see data_objects.py.
The standardized structure, List[InputSample], can be translated into different formats:
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
conll = InputSample.create_conll_dataset(dataset)
conll.to_csv("dataset.csv", sep="\t")
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
flair = InputSample.create_flair_dataset(dataset)
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.to_json(dataset, output_file="dataset_json")
The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model, or a specific PII recognizer, measuring precision and recall and supporting error analysis.
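At token level, precision and recall can be illustrated with this simplified sketch. `token_pr` is hypothetical; the package's evaluator performs a more detailed per-entity analysis:

```python
def token_pr(gold, pred):
    """Token-level precision/recall over non-'O' tags (simplified illustration)."""
    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and g == p)  # correctly tagged PII
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and g != p)  # spurious PII tags
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)  # missed PII tokens
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["O", "B-PERSON", "I-PERSON", "O"]
pred = ["O", "B-PERSON", "O", "O"]
p, r = token_pr(gold, pred)
# → precision 1.0, recall 0.5
```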
To train a vanilla CRF on a new dataset, see this notebook. To evaluate, see this notebook.
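A vanilla CRF classifies each token from a hand-crafted feature dictionary; the featurizer below is a hypothetical sketch, not the notebook's exact feature set:

```python
def token_features(tokens, i):
    """Feature dict for token i, with a one-token context window."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalization often signals names
        "word.isdigit": word.isdigit(),  # digits often signal IDs, phones, dates
        "suffix3": word[-3:],
    }
    feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

features = [token_features(["My", "name", "is", "John"], i) for i in range(4)]
```

Each sentence then becomes a list of such dicts, which a CRF library (e.g. sklearn-crfsuite) consumes alongside the BIO tag sequence.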
To train a new spaCy model, first save the dataset in a spaCy format:
# dataset is a List[InputSample]
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
To evaluate, see this notebook.
from presidio_evaluator.models import FlairTrainer
train_samples = "data/generated_train.json"
test_samples = "data/generated_test.json"
val_samples = "data/generated_validation.json"
trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
corpus = trainer.read_corpus("")
trainer.train(corpus)
Note that the three json files are created using InputSample.to_json.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Copyright notice:
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.