es-dedupe

Tool for removing duplicate documents from Elasticsearch

APACHE-2.0 License

es-dedupe - v2.0.0 Latest Release

Published by deric about 3 years ago

Completely rewritten implementation.

  • Supports removing duplicates by multiple fields
  • Uses a sliding window for timestamp-ordered documents
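
The two points above can be illustrated with a minimal sketch. This is not the project's actual code: the function name `find_duplicates`, the `timestamp` and `_id` keys, and the choice to keep the first occurrence are all assumptions made for illustration. Documents are assumed to arrive sorted by timestamp, and two documents count as duplicates when all of the listed fields match within the window.

```python
from collections import OrderedDict


def find_duplicates(docs, fields, window_seconds, ts_field="timestamp"):
    """Return IDs of documents that repeat an earlier document's field
    values within `window_seconds`. `docs` must be sorted by `ts_field`.

    Hypothetical helper: names and document layout are assumptions.
    """
    # Maps a tuple of field values to the timestamp at which it was first
    # seen. Insertion order equals timestamp order, so the oldest is first.
    seen = OrderedDict()
    duplicates = []
    for doc in docs:
        ts = doc[ts_field]
        # Slide the window: evict keys first seen too long ago.
        while seen:
            _, oldest_ts = next(iter(seen.items()))
            if ts - oldest_ts > window_seconds:
                seen.popitem(last=False)
            else:
                break
        # Compare on several fields at once by building a composite key.
        key = tuple(doc.get(f) for f in fields)
        if key in seen:
            duplicates.append(doc["_id"])
        else:
            seen[key] = ts
    return duplicates
```

Because only keys inside the window are held in memory, the memory use of such an approach depends on the window size rather than on the size of the index.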

es-dedupe - v1.0.2

Published by deric over 3 years ago

  • Fixed Docker image

es-dedupe - v1.0.1

Published by deric over 4 years ago

  • Reorganized ENV variables

es-dedupe - v1.0.0

Published by deric over 6 years ago

  • Refactored the code; it should work with ES 5.x
  • Fixed the StringIO import for Python 2 and Python 3 (imports)
  • Added better logging (logme)
  • Added more log messages
  • Revamped verbose/debug output (debug = more output)
  • Made sure the index name is referred to consistently throughout the script (idxname)
  • Removed date counting completely (inc_day, msg_using)
  • Implemented fetching the full index list instead, including excludes via regexp (idxlist_uri, fetch_indexlist, -I)
  • Failing function calls now return rc = -1; rc = 0 is no longer interpreted as an error (it is not one when an index query returns 0 entries on a readable ES index)
  • If an index has blocks->write=true set, it is temporarily set to false while duplicates are deleted, then restored to its previous value (settings_uri, allsettings_uri, fetch_allsettings, set_index_writable)
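
The last item, temporarily lifting a write block, can be sketched as a context manager. This is an illustration only, not the script's `set_index_writable`: the name `writable_index` and the injected `get_settings` / `put_settings` callables are hypothetical stand-ins for HTTP calls to the index `_settings` endpoint. The sketch assumes the settings response is keyed by index name and holds the flag under `settings.index.blocks.write`.

```python
from contextlib import contextmanager


@contextmanager
def writable_index(index, get_settings, put_settings):
    """Lift `index.blocks.write` for the duration of the block, then
    restore it. Both callables are hypothetical and supplied by the caller.
    """
    settings = get_settings(index)
    blocks = (
        settings.get(index, {})
        .get("settings", {})
        .get("index", {})
        .get("blocks", {})
    )
    # Settings values may come back as strings, so compare case-insensitively.
    was_blocked = str(blocks.get("write", "false")).lower() == "true"
    if was_blocked:
        put_settings(index, {"index": {"blocks": {"write": False}}})
    try:
        yield was_blocked
    finally:
        # Restore the block only if it was set before we touched it,
        # even when the duplicate deletion raised an exception.
        if was_blocked:
            put_settings(index, {"index": {"blocks": {"write": True}}})
```

Putting the restore step in `finally` means a failed deletion does not leave a previously read-only index writable.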