Async bulk data ingestion and querying in various document, graph, and vector databases via their Python clients
MIT License
Full Changelog: https://github.com/prrao87/db-hub-fastapi/compare/0.10.0...0.10.1
Published by prrao87 about 1 year ago
Full Changelog: https://github.com/prrao87/db-hub-fastapi/compare/0.9.2...0.10.0
Published by prrao87 about 1 year ago
Full Changelog: https://github.com/prrao87/db-hub-fastapi/compare/0.9.1...0.9.2
Published by prrao87 about 1 year ago
Improved the bulk indexer for Meilisearch and compared its performance with the sync version of the Python client. The async client clearly performs better, and the gap widens as the dataset grows.
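A minimal sketch of the async bulk-indexing pattern, assuming the `meilisearch-python-sdk` package's `AsyncClient`; the URL, master key, and `wines` index name below are placeholders, not necessarily the repo's actual values:

```python
import asyncio
from typing import Any, Iterator


def chunked(items: list[Any], size: int) -> Iterator[list[Any]]:
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


async def bulk_index(docs: list[dict], batch_size: int = 10_000) -> None:
    # Assumes the async client from the meilisearch-python-sdk package;
    # URL, key and index name are illustrative placeholders.
    from meilisearch_python_sdk import AsyncClient

    async with AsyncClient("http://localhost:7700", "masterKey") as client:
        index = client.index("wines")
        # Fire off one add_documents task per batch instead of awaiting each
        # sequentially -- this concurrency is where async beats the sync client.
        await asyncio.gather(
            *(index.add_documents(batch) for batch in chunked(docs, batch_size))
        )
```

The batching helper keeps each request payload bounded, while `asyncio.gather` lets all batch requests be in flight at once.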
Full Changelog: https://github.com/prrao87/db-hub-fastapi/compare/0.9.0...0.9.1
Published by prrao87 about 1 year ago
Finished the update to Pydantic v2.
Full Changelog: https://github.com/prrao87/db-hub-fastapi/compare/0.8.3...0.9.0
Published by prrao87 about 1 year ago
Full Changelog: https://github.com/prrao87/async-db-fastapi/compare/0.8.2...0.8.3
Published by prrao87 over 1 year ago
Full Changelog: https://github.com/prrao87/async-db-fastapi/compare/0.8.1...0.8.2
Published by prrao87 over 1 year ago
Full Changelog: https://github.com/prrao87/async-db-fastapi/compare/0.8.0...0.8.1
Published by prrao87 over 1 year ago
Full Changelog: https://github.com/prrao87/async-db-fastapi/compare/0.7.0...0.8.0
Published by prrao87 over 1 year ago
Published by prrao87 over 1 year ago
Added code for Qdrant, a vector database built in Rust. Includes:
- Bulk indexing of both the data and the associated vectors (sentence embeddings), generated with sentence-transformers, into Qdrant so that we can perform similarity search on phrases.
The sentence-transformers model used is multi-qa-distilbert-cos-v1. As per the docs, "This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs."

ONNX does appear to utilize all available CPU cores when processing the text and generating the embeddings (the image below was generated on an AWS EC2 T2 Ubuntu instance with a single 4-core CPU).
On average, the entire wine reviews dataset of 129,971 reviews is vectorized and ingested into Qdrant in 34 minutes via the quantized ONNX model, as opposed to more than 1 hour for the regular sbert model downloaded from the sentence-transformers repo. The quantized ONNX model is also ~33% smaller in size than the original model.

- sbert model: processes roughly 51 items/sec
- onnxruntime model: processes roughly 92 items/sec

This amounts to a roughly 1.8x reduction in indexing time, with a ~26% smaller (quantized) model that loads and processes results faster. To verify that the embeddings from the quantized model are of similar quality, some example cosine similarities are shown below.
The following results are for the sentence-transformers/multi-qa-MiniLM-L6-cos-v1 model, which was built for semantic similarity tasks.
```
---
Loading vanilla sentence transformer model
---
Similarity between 'I'm very happy' and 'I am so glad': [0.74601071]
Similarity between 'I'm very happy' and 'I'm so sad': [0.6456476]
Similarity between 'I'm very happy' and 'My dog is missing': [0.09541589]
Similarity between 'I'm very happy' and 'The universe is so vast!': [0.27607652]
---
Loading quantized ONNX model
---
The ONNX file model_optimized_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
Similarity between 'I'm very happy' and 'I am so glad': [0.74153285]
Similarity between 'I'm very happy' and 'I'm so sad': [0.65299551]
Similarity between 'I'm very happy' and 'My dog is missing': [0.09312761]
Similarity between 'I'm very happy' and 'The universe is so vast!': [0.26112114]
```
As can be seen, the similarity scores are very close to the vanilla model's, while the quantized model is ~26% smaller and processes sentences much faster on the same CPU.
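The scores above are cosine similarities between the two sentences' embedding vectors. As a minimal, dependency-free sketch (pure Python, no sentence-transformers required), cosine similarity itself is just a normalized dot product:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same direction score ~1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ≈ 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

In practice the vectors `a` and `b` would be the model's sentence embeddings, and a vector database like Qdrant performs this comparison at scale over the indexed vectors.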
Published by prrao87 over 1 year ago
srsly is a fast and lightweight JSON serialization library from Explosion. It cuts down on pip install time, and reduces the number of lines of code quite significantly.
- Use srsly to read gzipped JSONL.
- For Meilisearch, the settings specification is moved over to a settings.json file to keep things clean and easy to find, all in one place.
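The gzipped-JSONL reading that srsly wraps in a single call boils down to the following stdlib pattern (a sketch of the equivalent behavior, not the library's actual implementation; the file name and fields are made up):

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_gzip_jsonl(path: Path) -> Iterator[dict]:
    """Lazily yield one JSON object per line from a gzipped JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Round-trip demo with a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "wines.jsonl.gz"
    with gzip.open(path, mode="wt", encoding="utf-8") as f:
        f.write('{"id": 1, "title": "wine one"}\n{"id": 2, "title": "wine two"}\n')
    docs = list(read_gzip_jsonl(path))
    print(len(docs))  # → 2
```

Reading lazily (one record per iteration) keeps memory flat even for large dumps, which matters when the dataset has six figures of records.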
Published by prrao87 over 1 year ago
This release contains updates and enhancements from #15 and #16.
#15 results in a ~4x reduction in indexing time for Meilisearch. The key changes are as follows:
- Process files concurrently (via concurrent.futures), avoiding sequential execution.
- aiofiles was also tried to process files in an async fashion, but the bottleneck seems to be with the validation in Pydantic, not with file I/O.
Published by prrao87 over 1 year ago
Published by prrao87 over 1 year ago
#8 adds Meilisearch, a fast and responsive search engine database written in Rust. Like the other databases in this repo, the async Python client is used to bulk-index the dataset into the db and async queries are used in FastAPI. The following tasks are implemented:
- Environment variables are specified via .env.example.
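A `.env.example` for a Meilisearch-backed service typically looks something like the fragment below; the variable names here are illustrative guesses, and the repo's actual file is the source of truth:

```
MEILI_URL=http://localhost:7700
MEILI_MASTER_KEY=changeme
MEILI_PORT=7700
```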
Published by prrao87 over 1 year ago
Includes updates from #5 and #6.
This release introduces Elasticsearch indexing and API code to the repo.
- Creates the wines alias and its associated index in Elasticsearch.
Published by prrao87 over 1 year ago
This release is for https://github.com/prrao87/async-db-fastapi/pull/4.
Published by prrao87 over 1 year ago
- Use uvloop to speed up the async event loop (the AsyncGraphDatabase driver already uses this).
Published by prrao87 over 1 year ago
This version adds support for ingesting the wine reviews dataset into Neo4j, a graph database, in an async fashion. In addition, it also provides a query API written in FastAPI that allows the user to send queries via available endpoints. As usual in FastAPI, the API is documented via OpenAPI specs.
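A minimal sketch of the async ingestion pattern, assuming the official `neo4j` driver's `AsyncGraphDatabase` and an UNWIND-based batch write; the Cypher query, `Wine` label, URI, and credentials are illustrative, not the repo's actual schema:

```python
# Illustrative Cypher: write a whole batch of records in one round trip
BATCH_QUERY = """
UNWIND $records AS rec
MERGE (w:Wine {id: rec.id})
SET w.title = rec.title
"""


def batched(records: list[dict], size: int) -> list[list[dict]]:
    """Split records into fixed-size batches for UNWIND writes."""
    return [records[i : i + size] for i in range(0, len(records), size)]


async def ingest(records: list[dict], batch_size: int = 1000) -> None:
    # Assumes the official async Neo4j Python driver; URI and credentials
    # below are placeholders.
    from neo4j import AsyncGraphDatabase

    driver = AsyncGraphDatabase.driver(
        "bolt://localhost:7687", auth=("neo4j", "password")
    )
    async with driver.session() as session:
        for batch in batched(records, batch_size):
            # Keyword arguments to run() become Cypher query parameters
            await session.run(BATCH_QUERY, records=batch)
    await driver.close()
```

Batching via UNWIND amortizes network round trips: one query writes a thousand nodes instead of a thousand queries writing one node each.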