My implementation of "perceptual hash" for images: find duplicate images by contents, not bytes
With a Little Help From ...
- JetBrains : the Acme of .NET tool suites!
- Deleaker : the best tool for finding memory, GDI and other leaks!
My implementation of "perceptual hash" (phash) for images.
Finding duplicate or similar images is a two-phase process.
Phase 1: Calculate the phash value for all images in a folder and sub-folders. The image paths and phashes are stored in a file.
Phase 2: Load a file from phase 1 into a viewer. The viewer compares all image phash values and shows a list of image pairs, ordered by phash similarity. Select a row in the list to view the two images side-by-side.
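A rough sketch of the pairwise comparison, for illustration only. It assumes a 64-bit phash and uses the Hamming distance (number of differing bits) as the similarity measure; neither is confirmed by this README, and the names `HashEntry`, `PairScore`, `hammingDistance` and `rankPairs` are hypothetical, not taken from the viewer code.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: one image path and its phash as loaded from a phase 1 file.
struct HashEntry {
    std::string path;
    std::uint64_t phash;  // ASSUMPTION: 64-bit hash; the real width may differ
};

// One compared pair: indices into the entry list plus their bit distance.
struct PairScore {
    std::size_t a;
    std::size_t b;
    int distance;  // 0 means the two hashes are identical
};

// Hamming distance: count the bits that differ between two hashes.
inline int hammingDistance(std::uint64_t x, std::uint64_t y) {
    return static_cast<int>(std::bitset<64>(x ^ y).count());
}

// Compare every pair of entries and return the pairs sorted most-similar first.
std::vector<PairScore> rankPairs(const std::vector<HashEntry>& entries) {
    std::vector<PairScore> pairs;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        for (std::size_t j = i + 1; j < entries.size(); ++j) {
            pairs.push_back({i, j, hammingDistance(entries[i].phash, entries[j].phash)});
        }
    }
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const PairScore& l, const PairScore& r) {
                         return l.distance < r.distance;
                     });
    return pairs;
}
```

Comparing all pairs is quadratic in the number of images, so a real viewer would likely cap the distance or the number of rows it keeps.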
Phase 2a: More than one phash file from phase 1 may be loaded into the viewer. An example use case is to compare a separate set of recently downloaded images against an existing set, to find out whether the new images already exist or might be better than the existing ones. By using the "Filter same phash" menu, you can focus on matches between the sets, rather than matches within a set.
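The idea behind filtering can be sketched as follows: tag each loaded image with the phash file it came from, and keep only pairs whose tags differ. This is a minimal illustration, not the viewer's actual code; `LoadedImage`, `sourceFile` and `crossFilePairs` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: an image path plus the index of the phash file it was loaded from.
struct LoadedImage {
    std::string path;
    int sourceFile;
};

// Return index pairs (i, j), i < j, whose images come from different phash files.
// Pairs within the same phash file are skipped.
std::vector<std::pair<std::size_t, std::size_t>>
crossFilePairs(const std::vector<LoadedImage>& images) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < images.size(); ++i) {
        for (std::size_t j = i + 1; j < images.size(); ++j) {
            if (images[i].sourceFile != images[j].sourceFile) {
                pairs.emplace_back(i, j);
            }
        }
    }
    return pairs;
}
```

With two images from each of two files, only the four cross-file pairs remain out of the six possible pairs.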
Two PHash files have been loaded.
Not shown: "Filter same phash" option has been used to filter out duplicates within each Phash file.
Not shown: double-clicking on either of the two images in the area marked '2' will invoke the viewing window to show the actual images, allowing you to examine them "as is" at the same size.
20200322: Performance improvements:
20200201: Updated the repository with the latest changes.
20160411: Provide some accumulated changes for the viewer:
20160410: Added a CRC calculation (based on the image pixels). Allows the viewer program to indicate that a file is an actual duplicate.
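As an illustration of a pixel-based CRC, here is a minimal sketch using the common CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and PNG). Which CRC variant the project actually uses is not stated here, so treat that choice, and the name `pixelCrc32`, as assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: standard CRC-32 over the raw decoded pixel bytes.
// The project's actual CRC variant and pixel layout may differ.
std::uint32_t pixelCrc32(const std::vector<std::uint8_t>& pixels) {
    std::uint32_t crc = 0xFFFFFFFFu;  // standard initial value
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        crc ^= pixels[i];
        // Process the byte one bit at a time (slow but table-free).
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;  // standard final XOR
}
```

Because the CRC is computed over decoded pixels rather than file bytes, two files with different metadata but identical image data would get the same value. Matching CRCs make identical pixels very likely but do not strictly prove it.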
20151201: Upgraded the Phase 1 code/project to use OpenMP for parallelism. Each file will be processed in its own thread.
Timed on my physical machine, processing a directory tree containing 940 images at 196M in size:
- Non-OpenMP: 105.66 seconds
- OpenMP: 28.36 seconds
That is roughly a 3.7x speedup. Your mileage may vary, depending on the number of processors you have ...
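A minimal sketch of how a per-file loop can be parallelised with OpenMP. It is not the project's code: `hashAllFiles` is a hypothetical name, and the per-file hash function is passed in as a parameter because the real phash routine is not shown here. Without OpenMP enabled in the compiler (for example `/openmp` on MSVC or `-fopenmp` on GCC) the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hash every file, spreading loop iterations across OpenMP threads.
// hashOne is a caller-supplied stand-in for the real per-file phash routine;
// it must be safe to call from several threads at once.
template <typename HashFn>
std::vector<std::uint64_t> hashAllFiles(const std::vector<std::string>& paths,
                                        HashFn hashOne) {
    std::vector<std::uint64_t> results(paths.size());
    // A signed loop index keeps the loop valid for older OpenMP versions.
    const long long count = static_cast<long long>(paths.size());

    // dynamic scheduling helps when images differ a lot in size
    #pragma omp parallel for schedule(dynamic)
    for (long long i = 0; i < count; ++i) {
        // Each iteration writes only its own slot, so no locking is needed.
        results[static_cast<std::size_t>(i)] = hashOne(paths[static_cast<std::size_t>(i)]);
    }
    return results;
}
```

Results stay in the same order as the input paths regardless of which thread handled which file.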
20151129: Uploaded the code for the viewer. This version is WinForms/C#. The code has a bit of historic cruft to be removed...
20151128: Uploaded the code for phase 1. The initial check-in uses CImg to load files; due to various issues it is limited to JPG files only.
Today's update replaces CImg with GDI+ for loading the files. As a result, GIF, PNG, TIFF and BMP files are now supported. Preliminary testing suggests GDI+ is about 25% faster than CImg/libjpeg.