Represent, send, store and search multimodal data
APACHE-2.0 License
Published by JoanFM over 1 year ago
Warning
This version of DocArray is a complete rewrite, so its changes go beyond ordinary breaking changes. Be sure to check the documentation to prepare your migration.
If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.
DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.
This gives the following advantages:
You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:
For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
Document
`Document` has been renamed to `BaseDoc`.
`BaseDoc` cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
`BaseDoc` allows for a flexible schema, whereas the `Document` class in v1 only allowed a fixed schema, with one of `tensor`, `text` and `blob`, and additional `chunks` and `matches`.
Some v1 `Document` methods (such as `.load_uri_to_image_tensor()`) are not supported in v2. Instead, we provide some of those methods on the typing level.
v2 provides a `LegacyDocument` class, which extends `BaseDoc` while following the same schema as v1's `Document`. `LegacyDocument` can be useful to start migrating your codebase from v1 to v2. Nevertheless, its API is not fully compatible with the DocArray v1 `Document`: none of the methods associated with `Document` are present. Only the schema of the data is similar.
DocumentArray
The `DocumentArray` class from v1 has been renamed to `DocList`, to be more descriptive of its actual functionality, since it is a list of `BaseDoc`s.
v2 also introduces `DocVec`, which is a column-based representation of `BaseDoc`s. Both `DocVec` and `DocList` extend `AnyDocArray`.
`DocVec` is a container of documents suited to computations that require batches of data (for example matrix multiplication, distance calculation, or a deep learning forward pass).
`DocVec` has an interface similar to `DocList`, but its underlying implementation is column-based instead of row-based. Each field of the `DocVec` schema (the `.doc_type`, which is a `BaseDoc`) is stored in a column. If the field is a tensor, the data from all documents is stored as a single `doc_vec` (Torch/TensorFlow/NumPy) tensor. If the tensor field is `AnyTensor` or a Union of tensor types, the `.tensor_type` is used to determine the type of the `doc_vec` column.
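The difference between a row-based and a column-based layout can be sketched in plain Python. This is not DocArray's implementation: the `Item` dataclass and the `to_columns` helper are hypothetical, and nested lists stand in for a real tensor.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class Item:
    # Hypothetical schema used only for this illustration.
    title: str
    embedding: List[float]


def to_columns(rows: List[Item]) -> Dict[str, List[Any]]:
    """Turn a row-based list of records into one column per schema field."""
    return {f.name: [getattr(row, f.name) for row in rows] for f in fields(Item)}


# Row-based layout (the DocList idea): one object per document.
rows = [Item(title=f'title {i}', embedding=[float(i), float(i) + 0.5]) for i in range(3)]

# Column-based layout (the DocVec idea): one sequence per field.
columns = to_columns(rows)

# All embeddings now sit together as a single 3 x 2 batch, ready for batched math.
batch = columns['embedding']
column_sums = [sum(vec[d] for vec in batch) for d in range(2)]
```

With a real tensor library, the `embedding` column would be one contiguous array, which is what makes batched operations cheap.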
A `DocList` does not necessarily have to be homogeneous.
If you want a homogeneous `DocList`, you can parameterize it at initialization time:
```python
from docarray import DocList
from docarray.documents import ImageDoc

docs = DocList[ImageDoc]()
```
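To illustrate what parameterizing a container at initialization time can mean, here is a plain-Python sketch. `TypedList` is a hypothetical stand-in, not DocArray code, and it assumes homogeneity is enforced with a simple `isinstance` check.

```python
from typing import Any, Iterable, Optional


class TypedList(list):
    """Hypothetical stand-in for a parameterizable document list."""

    doc_type: Optional[type] = None  # None means "unparameterized": anything goes

    def __class_getitem__(cls, item: type) -> type:
        # TypedList[Foo] builds a subclass that remembers Foo as its schema.
        return type(f'{cls.__name__}[{item.__name__}]', (cls,), {'doc_type': item})

    def __init__(self, items: Iterable[Any] = ()) -> None:
        super().__init__()
        for item in items:
            self.append(item)

    def append(self, item: Any) -> None:
        # Only parameterized lists reject items of the wrong type.
        if self.doc_type is not None and not isinstance(item, self.doc_type):
            raise TypeError(f'expected {self.doc_type.__name__}, got {type(item).__name__}')
        super().append(item)


mixed = TypedList([1, 'two', 3.0])  # unparameterized: heterogeneous items are allowed
ints = TypedList[int]([1, 2, 3])    # parameterized: every item must be an int

try:
    ints.append('four')
    rejected = False
except TypeError:
    rejected = True
```

The parameterized class carries its item type with it, so code that needs a known schema can check `doc_type` before doing any work.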
Methods like `.from_csv()` or `.pull()` only work with parameterized `DocList`s.
`AnyDocArray` exposes the same attributes as the `BaseDoc`s it contains. Accessing one returns a list of `type(attribute)`. However, this works only if all the `BaseDoc`s in the `AnyDocArray` have the same schema. Therefore only this works:
```python
from typing import Optional

from docarray import BaseDoc, DocList


class Book(BaseDoc):
    title: str
    author: Optional[str] = None


docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title  # returns a list[str]

# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title
```
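One way to picture why attribute access needs a known schema is the plain-Python sketch below. `Shelf` is a hypothetical container written for this illustration, not DocArray's mechanism; it forwards attribute lookups to its items only when it was given a schema.

```python
from dataclasses import dataclass, fields
from typing import Any, List, Optional


@dataclass
class Book:
    title: str
    author: Optional[str] = None


class Shelf:
    """Hypothetical container: exposes its items' fields only when a schema is given."""

    def __init__(self, items: List[Any], schema: Optional[type] = None) -> None:
        self.items = list(items)
        self.schema = schema

    def __getattr__(self, name: str) -> List[Any]:
        # Called only when normal lookup fails, i.e. for names like 'title'.
        schema = self.__dict__.get('schema')
        if schema is None or name not in {f.name for f in fields(schema)}:
            raise AttributeError(name)
        return [getattr(item, name) for item in self.items]


books = [Book(title=f'title {i}') for i in range(5)]

typed = Shelf(books, schema=Book)
titles = typed.title  # one value per book

untyped = Shelf(books)  # no schema, so field names cannot be resolved
try:
    untyped.title
    failed = False
except AttributeError:
    failed = True
```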
In v2 the `Document Store` has been renamed to `DocIndex` and can be used for fast retrieval using vector similarity. DocArray v2 `DocIndex` supports Weaviate, Qdrant, ElasticSearch, and HNSWLib.
Instead of creating a `DocumentArray` instance and setting the `storage` parameter to a vector database of your choice, in v2 you can initialize a `DocIndex` object of your choice, such as:
```python
from docarray.index import HnswDocumentIndex
db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')  # MyDoc: your own BaseDoc subclass
```
In contrast, `DocStore` in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.
Thank you to all of the contributors to this release: