USPTO patents dataset generator.
Installation:
sudo yum install python-devel libxslt-devel libxml2-devel
pip install patent-parsing-tools
Downloading dataset:
python -m patent_parsing_tools.downloader \
--directory dataset \
--year-from 2010 \
--year-to 2010
Collecting and serializing data:
python -m patent_parsing_tools.supervisor \
--working-directory patents/working_directory \
--train-destination patents/train_destination \
--test-destination patents/test_destination \
--year-from 2014 \
--year-to 2015
Generating dictionary with train set:
python -m patent_parsing_tools.bow.dictionary_maker \
--train-directory patents/train_destination \
--max-patents 1000000000 \
--dictionary dictionary.txt \
--dict-max-size 4096
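To make the idea of a size-limited dictionary concrete, here is a minimal sketch. It is not the package's actual implementation: the function name `build_dictionary`, the tokenization regex, and the sample texts are assumptions chosen for illustration. It counts word frequencies over the training texts and keeps only the most frequent words, which is roughly what a cap such as `--dict-max-size` suggests.

```python
import re
from collections import Counter


def build_dictionary(texts, dict_max_size):
    """Return the dict_max_size most frequent lowercase words (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: runs of ASCII letters, lowercased.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:dict_max_size]]


# Made-up sample texts standing in for serialized patents.
sample_texts = [
    "A method for parsing patent documents",
    "A system for parsing and storing patent claims",
]
dictionary = build_dictionary(sample_texts, dict_max_size=4)
print(dictionary)
```

With a small cap, rare words fall out of the dictionary, which keeps the later bag-of-words vectors at a fixed, manageable length.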
Generating bags of words for the train set and the test set:
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/train_destination \
--destination-directory patents/final_dataset_train \
--dictionary dictionary.txt \
--batch-size 1048576
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/test_destination \
--destination-directory patents/final_dataset_test \
--dictionary dictionary.txt \
--batch-size 1048576
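The bag-of-words step can be pictured with the following sketch. Again, this is an illustration under assumptions, not the code the package runs: `bag_of_words` is a hypothetical helper, and the dictionary and text are made up. Each document becomes a vector with one count per dictionary word, and words outside the dictionary are ignored.

```python
import re
from collections import Counter


def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in text (hypothetical helper)."""
    # Assumed tokenization: runs of ASCII letters, lowercased.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # A Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in dictionary]


# Made-up dictionary and document for illustration.
dictionary = ["a", "for", "parsing", "patent"]
vector = bag_of_words("A patent for parsing a patent", dictionary)
print(vector)
```

Because the train and test sets are encoded against the same dictionary, their vectors share the same length and word order.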
Running tests:
pytest
Installing dependencies from requirements.txt:
pip install -r requirements.txt
The MIT License (MIT). Copyright (c) 2014 Michał Dul, Piotr Przetacznik, Krzysztof Strojny. See the LICENSE file for more information.