delta and deletion file generator for JSON and CSV dumps (often called feeds)
MIT License
Python programs that extract changes and deletions from raw CSV or JSON data dumps.
You typically need this to maintain a mirror database or system integration efficiently when the master / source system cannot provide clean delta / diff exports (or provides them without deletion information).
In addition to the generic CSV and JSON commands, the repository contains an example script that parses a custom nested JSON structure.
TODO: migrate TODOs from code to issues and implement them.
brew install python
brew install yajl
pip install ijson cffi mmh3
'newfile' is the path to the new full dump feed and 'lastfingerprintsfile' is the path to the last fingerprints file generated by this program.
If the last fingerprints file is given, the program generates:
.changes.csv or 'changes.json' respectively, based on the newfile name.
deltacsv newfile idColumn [lastfingerprintsfile]
'idColumn' is the name or 0-based index of the CSV column that contains a reliable ID for each line. CSV files need a header row.
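The fingerprint comparison described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function names (resolve_id_column, fingerprint, delta) are hypothetical, hashlib.md5 stands in for whatever hash the tool actually uses (the install step lists mmh3), and the fingerprints are kept in a plain dict rather than a file.

```python
import csv
import hashlib
import io


def resolve_id_column(header, id_column):
    """Return the 0-based position of the ID column, given a name or an index."""
    if id_column in header:
        return header.index(id_column)
    return int(id_column)  # raises ValueError if it is neither a name nor a number


def fingerprint(row):
    """Hash the whole row (assumption: the real tool may hash differently)."""
    joined = "\x1f".join(row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def delta(new_csv, id_column, last_fingerprints):
    """Compare a new full dump against the previous {id: fingerprint} mapping."""
    reader = csv.reader(new_csv)
    header = next(reader)  # a header row is required
    id_pos = resolve_id_column(header, id_column)
    new_fingerprints, changes = {}, []
    for row in reader:
        row_id = row[id_pos]
        new_fingerprints[row_id] = fingerprint(row)
        # Rows that are new or whose fingerprint differs count as changes.
        if last_fingerprints.get(row_id) != new_fingerprints[row_id]:
            changes.append(row)
    # IDs present in the last run but missing now are deletions.
    deletions = sorted(set(last_fingerprints) - set(new_fingerprints))
    return changes, deletions, new_fingerprints


old_dump = io.StringIO("id,name\n1,apple\n2,pear\n3,plum\n")
new_dump = io.StringIO("id,name\n1,apple\n2,peach\n4,fig\n")
_, _, last = delta(old_dump, "id", {})              # first run: no fingerprints yet
changes, deletions, _ = delta(new_dump, "0", last)  # ID column given as 0-based index
print(changes)    # [['2', 'peach'], ['4', 'fig']]
print(deletions)  # ['3']
```

Only the ID and a short hash per line need to be kept between runs, which is why a fingerprints file is enough to detect both changes and deletions without storing the previous dump.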
deltajson newfile entriesProperty idJsonPath [lastfingerprintsfile]
'entriesProperty' is the name of the property in the JSON that contains the array of entry objects to be processed and analyzed for changes. It is typically in the root object, but does not need to be (if it is nested, no other property in the JSON should have that name).
'idJsonPath' is a JSONPath to the value inside the entry objects that represents the ID.
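To illustrate how the two JSON parameters relate, here is a small sketch under stated assumptions. The helpers find_entries and extract_id are hypothetical, the path handling covers only simple dotted paths such as '$.meta.id' (no wildcards or filters), and the standard json module loads the whole document into memory, whereas the install step lists ijson, which suggests the actual tool streams the file.

```python
import json


def find_entries(node, entries_property):
    """Depth-first search for the first property with the given name."""
    if isinstance(node, dict):
        if entries_property in node:
            return node[entries_property]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_entries(child, entries_property)
        if found is not None:
            return found
    return None


def extract_id(entry, id_path):
    """Follow a simplified dotted path; list elements are addressed by index."""
    value = entry
    for key in id_path.lstrip("$").strip(".").split("."):
        value = value[int(key)] if isinstance(value, list) else value[key]
    return value


doc = json.loads("""
{"feed": {"generated": "today",
          "items": [{"meta": {"id": "a1"}, "title": "first"},
                    {"meta": {"id": "b2"}, "title": "second"}]}}
""")
entries = find_entries(doc, "items")  # nested one level below the root
ids = [extract_id(entry, "$.meta.id") for entry in entries]
print(ids)  # ['a1', 'b2']
```

Because the search stops at the first property with a matching name, a second property called 'items' elsewhere in the document could be picked up instead, which is why the name should be unique.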
csvformat command.
On a 2012 MacBook Pro:
A 460 MB CSV with 1.2 million lines is processed in under a minute. RAM usage currently peaks at around 300 MB. TODO: RAM usage needs to be debugged, as it should be much lower since the CSV is supposed to be streamed.
A ca. 100 MB JSON file with 50,000 entries is processed in ca. 60 seconds. RAM usage peaks at 30 MB.
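One possible way to investigate the RAM TODO is to measure the peak of Python-level allocations with the standard tracemalloc module while rows are streamed. The sketch below is an assumption about how such a check could look, not code from the repository: stream_fingerprints and generated_lines are hypothetical names, and the input is generated lazily instead of being read from a real dump.

```python
import csv
import hashlib
import tracemalloc


def stream_fingerprints(lines):
    """Yield (id, fingerprint) pairs one row at a time, without keeping rows."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for row in reader:
        digest = hashlib.md5("\x1f".join(row).encode("utf-8")).hexdigest()
        yield row[0], digest


def generated_lines(count):
    """Hypothetical stand-in for a large dump file, produced lazily."""
    yield "id,name\n"
    for i in range(count):
        yield f"{i},name-{i}\n"


tracemalloc.start()
row_count = sum(1 for _ in stream_fingerprints(generated_lines(100_000)))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(row_count, f"rows, peak {peak / 1024:.0f} KiB")
```

If the peak stays small here but grows in the real program, the growth is likely caused by something that accumulates per row, such as the fingerprint mapping itself. Note that tracemalloc only sees allocations made through Python's allocators, so memory used inside C libraries would not show up.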