patent-parsing-tools

USPTO patents dataset generator

MIT License


Documentation

Read the docs

System requirements

sudo yum install python-devel libxslt-devel libxml2-devel

Installation:

pip install patent-parsing-tools

Examples:

Downloading dataset:

python -m patent_parsing_tools.downloader \
  --directory dataset \
  --year-from 2010 \
  --year-to 2010
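
If the download step needs to be driven from a Python script rather than a shell, the command above can be assembled programmatically. The sketch below is only a thin wrapper around the command line shown in this README: `module_command` and `download` are hypothetical helper names, not part of patent-parsing-tools, and the only flags used are the ones listed above.

```python
import subprocess
import sys


def module_command(module, **options):
    """Build an argv list for `python -m <module>` with `--flag value` pairs."""
    argv = [sys.executable, "-m", module]
    for name, value in options.items():
        # Python keyword arguments use underscores; the CLI flags use hyphens.
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


def download(directory, year_from, year_to):
    """Run the downloader module with the flags shown in this README."""
    argv = module_command(
        "patent_parsing_tools.downloader",
        directory=directory,
        year_from=year_from,
        year_to=year_to,
    )
    # check=True raises CalledProcessError if the downloader exits non-zero.
    subprocess.run(argv, check=True)
```

Calling `download("dataset", 2010, 2010)` would run the same command as the shell example, using the interpreter that is executing the script.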

Collecting and serializing data:

python -m patent_parsing_tools.supervisor \
  --working-directory patents/working_directory \
  --train-destination patents/train_destination \
  --test-destination patents/test_destination \
  --year-from 2014 \
  --year-to 2015

Generating a dictionary from the train set:

python -m patent_parsing_tools.bow.dictionary_maker \
  --train-directory patents/train_destination \
  --max-patents 1000000000 \
  --dictionary dictionary.txt \
  --dict-max-size 4096
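
To illustrate what a size-capped dictionary means in general, here is a minimal sketch that keeps the most frequent words of a small corpus. It is not the package's implementation: the tokenizer, the frequency-based ranking, and the names `tokenize` and `build_dictionary` are assumptions made for this example, and the real `dictionary_maker` may select words differently.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens; the package may tokenize differently.
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(texts, max_size):
    """Return up to max_size words, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_size]]


texts = [
    "A method for parsing patent documents",
    "Patent claims describe a method",
    "Parsing XML patent files",
]
dictionary = build_dictionary(texts, max_size=4)
print(dictionary)
```

In this toy corpus the cap of 4 plays the role that `--dict-max-size 4096` plays in the command above: it bounds the vocabulary, and therefore the length of every bag-of-words vector built from it.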

Generating bags of words for the train set and the test set:

python -m patent_parsing_tools.bow.bag_of_words \
  --serialized-patents patents/train_destination \
  --destination-directory patents/final_dataset_train \
  --dictionary dictionary.txt \
  --batch-size 1048576
python -m patent_parsing_tools.bow.bag_of_words \
  --serialized-patents patents/test_destination \
  --destination-directory patents/final_dataset_test \
  --dictionary dictionary.txt \
  --batch-size 1048576
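
Both commands use the same `dictionary.txt`, so the train and test vectors share one column layout. The sketch below shows that idea on toy data. It is a generic bag-of-words encoding under stated assumptions, not the package's code: `bag_of_words` is a hypothetical helper, documents are assumed to be already tokenized, and the on-disk output format and the meaning of `--batch-size` are not modelled.

```python
from collections import Counter


def bag_of_words(tokens, dictionary):
    """Count how often each dictionary word occurs in a tokenized document."""
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token, count in Counter(tokens).items():
        position = index.get(token)
        # Words outside the dictionary are dropped.
        if position is not None:
            vector[position] = count
    return vector


dictionary = ["patent", "a", "method", "parsing"]
train_doc = ["a", "method", "for", "parsing", "patent", "documents", "patent"]
test_doc = ["patent", "claims", "describe", "a", "device"]

train_vector = bag_of_words(train_doc, dictionary)
test_vector = bag_of_words(test_doc, dictionary)
print(train_vector, test_vector)
```

Because the dictionary is built from the train set only, test-set words that never made it into the dictionary simply contribute nothing to the vector.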

Testing

pytest

Contributing and development

pip install -r requirements.txt

License

The MIT License (MIT). Copyright (c) 2014 Michał Dul, Piotr Przetacznik, Krzysztof Strojny. See the LICENSE file for more information.
