# GPU-powered brute-force KNN ground truth dataset generator

NeighborhoodWatch (`nw`) is a GPU-powered brute-force KNN ground truth dataset generator.
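Conceptually, "brute force" here means computing the exact distance from every query vector to every base vector and keeping the `k` smallest, rather than approximating with an index. The NumPy sketch below illustrates that idea on CPU for clarity only; it is not `nw`'s actual implementation, which performs the equivalent computation on the GPU.

```python
import numpy as np

# Illustrative CPU sketch of brute-force KNN ground truth.
# nw itself performs the equivalent computation on the GPU.
def brute_force_knn(queries: np.ndarray, base: np.ndarray, k: int):
    # Exact Euclidean distance from every query to every base vector.
    # (Materializes an (n_query, n_base, dim) intermediate, so small inputs only.)
    dists = np.linalg.norm(queries[:, None, :] - base[None, :, :], axis=-1)
    # Indices of the k nearest base vectors per query, in ascending distance order.
    idx = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx

# Toy example: 100 query and 1000 base vectors of dimension 64, k=10.
queries = np.random.rand(100, 64).astype(np.float32)
base = np.random.rand(1000, 64).astype(np.float32)
distances, indices = brute_force_knn(queries, base, k=10)
print(distances.shape, indices.shape)  # (100, 10) (100, 10)
```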
At a high level, the following prerequisites need to be satisfied to run this program:

- A machine with NVIDIA GPUs (e.g., an AWS p3.8xlarge instance type)

An example of setting up a bare-metal environment on an AWS p3.8xlarge instance with Ubuntu 22.04 OS is provided in an accompanying script.
For convenience, a Dockerfile is also provided that allows you to build a Docker image and run the `nw` program within a Docker container with all the required driver and library dependencies. For more detailed information, please refer to the nw_docker document.
First, check and install the Python dependencies by running the following commands in the home directory of this program:

```bash
poetry lock && poetry install
```
Then run the program with the `poetry run nw <input parameter list>` command. The available input parameters are listed below:
```
$ poetry run nw -h
usage: nw [-h] [-m MODEL_NAME] [-rd REDUCED_DIMENSION_SIZE] [-k K] [--data_dir DATA_DIR] [--use-dataset-api | --no-use-dataset-api] [--gen-hdf5 | --no-gen-hdf5]
          [--post-validation | --no-post-validation] [--enable-memory-tuning] [--disable-memory-tuning]
          query_count base_count

nw (neighborhood watch) uses GPU acceleration to generate ground truth KNN datasets

positional arguments:
  query_count           number of query vectors to generate
  base_count            number of base vectors to generate

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        model name to use for generating embeddings, e.g. text-embedding-ada-002, textembedding-gecko, or intfloat/e5-large-v2
  -rd REDUCED_DIMENSION_SIZE, --reduced_dimension_size REDUCED_DIMENSION_SIZE
                        Reduced (output) dimension size. Only supported in models (e.g. OpenAI text-embedding-3-xxx) that have this feature. Ignored otherwise!
  -k K, --k K           number of neighbors to compute per query vector
  --data_dir DATA_DIR   Directory to store the generated data (default: knn_dataset)
  --use-dataset-api, --no-use-dataset-api
                        Use 'pyarrow.dataset' API to read the dataset. Recommended for large datasets. (default: False)
  --gen-hdf5, --no-gen-hdf5
                        Generate hdf5 files (default: True)
  --post-validation, --no-post-validation
                        Validate the generated files (default: False)
  --enable-memory-tuning
                        Enable memory tuning
  --disable-memory-tuning
                        Disable memory tuning (useful for very small datasets)
```
Some example commands:

```bash
nw 1000 10000 -k 100 -m 'textembedding-gecko' --disable-memory-tuning
nw 1000 10000 -k 100 -m 'intfloat/e5-large-v2' --disable-memory-tuning
nw 1000 10000 -k 100 -m 'intfloat/e5-small-v2' --disable-memory-tuning
nw 1000 10000 -k 100 -m 'intfloat/e5-base-v2' --disable-memory-tuning
```
After a successful run, the program generates a set of datasets under a specified folder, which defaults to the `knn_dataset` subfolder. You can override the output directory with the `--data_dir <dir_name>` option.
In particular, the following datasets include the KNN ground truth results:
| file format | dataset name | dataset file |
|---|---|---|
| fvec | train dataset (base) | `<model_name>_<base_count>_base_vectors` |
| fvec | test dataset (query) | `<model_name>_<base_count>_query_vectors_<query_count>` |
| fvec | distances dataset (distances) | `<model_name>_<base_count>_distances_<query_count>` |
| ivec | neighbors dataset (indices) | `<model_name>_<base_count>_indices_query_<query_count>` |
| hdf5 | consolidated hdf5 dataset of the above 4 datasets | `<model_name>_base_<base_count>_query_<query_count>` |
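The fvec/ivec names suggest the common layout used by benchmark datasets such as TEXMEX/SIFT, where every record is a little-endian int32 dimension `d` followed by `d` float32 (fvec) or int32 (ivec) values. The readers below are a minimal sketch under that assumption, with the file paths as placeholders; verify against the actual generated files before relying on them.

```python
import numpy as np
import h5py

# Minimal readers assuming the standard fvec/ivec record layout:
# int32 dimension d, followed by d float32 (fvec) or int32 (ivec) values.
def read_fvec(path: str) -> np.ndarray:
    raw = np.fromfile(path, dtype=np.float32)
    dim = raw[:1].view(np.int32)[0]          # first 4 bytes hold the dimension
    return raw.reshape(-1, dim + 1)[:, 1:]   # drop the per-record dimension column

def read_ivec(path: str) -> np.ndarray:
    raw = np.fromfile(path, dtype=np.int32)
    return raw.reshape(-1, raw[0] + 1)[:, 1:]

# Placeholder paths -- substitute the actual file names from the table above.
base = read_fvec("knn_dataset/<model_name>_<base_count>_base_vectors")
indices = read_ivec("knn_dataset/<model_name>_<base_count>_indices_query_<query_count>")

# The keys inside the consolidated hdf5 file are not documented here,
# so list them rather than assuming names.
with h5py.File("knn_dataset/<model_name>_base_<base_count>_query_<query_count>", "r") as f:
    print(list(f.keys()))
```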
To run the unit tests:

```bash
poetry run pytest
```