similarity-search

Implementation and survey of similarity search methods that rely on dimensionality reduction (e.g. LSH), D-dimensional vector clustering

Stars
4

Similarity Search

This project contains implementations for the following dimensionality reduction methods and their application to the domains of similarity search (i.e. approximate k-Nearest Neighbors) and clustering:

Clustering through reverse assignment using range queries
---------------------------------------------------------

1. Index all data points of interest in either data structure (LSH hash tables or a single Hypercube hash table)
2. Initialize all cluster centroids
3. At each iteration, for each centroid execute a ranged query around it and assign all neighboring points to it  
4. The initial radii can be computed as min(dist between centers)/2 and after each iteration they are doubled
5. Repeat the above steps until the algorithm converges to a fix point, or certain criteria are met
6. If there are any unassigned points, we can assign them using the classic K-Means algorithm
7. Output computed clusters

For clustering, we used the initialization scheme of K-Means++ and we implemented the classic Lloyd's algorithm, comparing it to the above methods based on the well-known Silhouette metric. See the project documentation, benchmark results and the related bibliography.

Contributors

George Sittas (Jo) Dimitra Kousta (Demesta)