Represent, send, store and search multimodal data
APACHE-2.0 License
0.40.0
Release time: 2023-12-22 12:12:15
We'd like to thank all contributors for this new release! In particular,
954, Joan Fontanals, Tony Yang, Naymul Islam, Ben Shaver, and Jina Dev Bot.
[ff00b604] - index: add epsilla connector (#1835) (Tony Yang)
[522811f4] - use literal in type hints (#1827) (Ben Shaver)
[1f86e263] - error type hints in Python3.12 (#1147) (#1840) (954)
[21e107bd] - fix issue serializing deserializing complex schemas (#1836) (Joan Fontanals)
[3cfa0b8f] - fix storage issue in torchtensor class (#1833) (Naymul Islam)
[a2421a6a] - epsilla: add epsilla integration guide (#1838) (Tony Yang)
[82918fe7] - fix sign commit command in docs (#1834) (Naymul Islam)
Published by github-actions[bot] 12 months ago
0.39.1
Release time: 2023-10-23 08:56:38
This release contains 2 bug fixes.
A recent update to numpy changed some of its versioning semantics, breaking DocArray's from_dataframe()
method in some cases where the dataframe contains a numpy array. This has now been fixed.
from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    embedding: NdArray
    text: str

da = DocVec[MyDoc](
    [
        MyDoc(embedding=[1, 2, 3, 4], text='hello'),
        MyDoc(embedding=[5, 6, 7, 8], text='world'),
    ],
    tensor_type=NdArray,
)
df_da = da.to_dataframe()
# This broke before and is now fixed
da2 = DocVec[MyDoc].from_dataframe(df_da, tensor_type=NdArray)
Starting with Python 3.9, Optional.__args__
is not always available, leading to some compatibility problems. This has been fixed by using the typing.get_args
helper.
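As a sketch of the idea behind this fix (not DocArray's exact code), typing.get_args retrieves the arguments of an Optional type reliably, where direct __args__ access can be unavailable:

```python
from typing import Optional, get_args

# Optional[int] is shorthand for Union[int, None]
hint = Optional[int]

# get_args works uniformly across Python versions,
# unlike direct access to hint.__args__
args = get_args(hint)
assert args == (int, type(None))

# A plain, non-generic type simply yields an empty tuple
assert get_args(int) == ()
```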
We would like to thank all contributors to this release:
0.39.0
Release time: 2023-10-02 13:06:02
This release contains 4 new features, 8 bug fixes, and 7 documentation improvements.
The biggest feature of this release is full support for Pydantic v2! We are continuing to support Pydantic v1 at the same time.
If you use Pydantic v2, you will need to adapt your DocArray code to the new Pydantic API. Check out their migration guide here.
Pydantic v2 has its core written in Rust and provides significant performance improvements to DocArray: JSON serialization is 240% faster and validation of BaseDoc and DocList with non-native types like TorchTensor
is 20% faster.
A BaseDoc
by default includes an id
field. This can be problematic if you want to build an API that requires a model without this ID field. Therefore, we now provide a BaseDocWithoutId
which is, as its name suggests, BaseDoc without the ID field.
Please use this Document with caution: BaseDoc is still the base class to use unless you specifically need to remove the ID.
⚠️ BaseDocWithoutId
is not compatible with DocIndex
or any feature requiring a vector database. This is because DocIndex needs the id field to store and retrieve documents.
Jina AI Cloud is being discontinued. Therefore, we are removing the push/pull
feature related to Jina AI cloud.
DocList
can be typed from BaseDoc using the following syntax DocList[MyDoc]()
.
In this release, we have fixed a bug that allowed users to specify the type of a DocList
multiple times.
Doing DocList[MyDoc1][MyDoc2]
won't work anymore. (#1800)
We also fixed a bug that caused a silent failure when users passed DocList
the wrong type, for example DocList[doc()]
. (#1794)
We fixed a small bug that incorrectly set the port of the Milvus client.
We would like to thank all contributors to this release:
0.38.0
Release time: 2023-09-07 13:40:16
This release contains 3 bug fixes and 4 documentation improvements, including 1 breaking change.
DocList.to_json()
and DocVec.to_json()
In order to make the to_json
method consistent across different classes, we changed its return type in DocList
and DocVec
to str
.
This means that, if you use this method in your application, make sure to update your codebase to expect str
instead of bytes
.
This release changes the return type of the methods DocList.to_json()
and DocVec.to_json()
in order to be consistent with BaseDoc .to_json()
and other pydantic models. After this release, these methods will return str
type data instead of bytes
.
Since the return type has changed, this is considered a breaking change.
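As a hypothetical migration sketch (the JSON payload here is illustrative, not produced by DocArray), code that previously decoded the returned bytes should now use the string directly:

```python
import json

# Stand-in for the value DocList.to_json() now returns: a str, not bytes
serialized = json.dumps([{"id": "1", "text": "hello"}])
assert isinstance(serialized, str)

# Before: callers often did serialized.decode() -- that would now fail,
# since str has no decode() method. Use the string as-is:
data = json.loads(serialized)
assert data[0]["text"] == "hello"

# If a downstream API still needs bytes, encode explicitly:
as_bytes = serialized.encode("utf-8")
assert isinstance(as_bytes, bytes)
```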
This release introduces type casting internally in the reduce
helper function, casting its inputs before appending them to the final result. This will make it possible to reduce documents whose schemas are compatible but not exactly the same.
__annotations__
but not in __fields__
(#1777)
This release fixes an issue in the create_pure_python_type_model helper function. Starting with this release, only attributes in the class __fields__
will be considered during type creation.
The previous behavior broke applications when users introduced a ClassVar in an input class:
from typing import ClassVar

from docarray import BaseDoc

class MyDoc(BaseDoc):
    endpoint: ClassVar[str] = "my_endpoint"
    input_test: str = ""

# Internally this line raised the error:
field_info = model.__fields__[field_name].field_info
KeyError: 'endpoint'
Kudos to @NarekA for raising the issue and contributing a fix in the Jina project, which was ported in DocArray.
filter_docs
(#1762)
We would like to thank all contributors to this release:
[d5cb02fb] - version: the next version will be 0.37.2 (Jina Dev Bot)
This release contains 4 bug fixes and 1 Documentation improvement.
The previous schema check in the UpdateMixin
was strict and did not allow updating documents whose schemas are similar but do not share the same reference.
For instance, if the schemas are dynamically generated but have the same fields and field types, the check would still evaluate to False
and it would not be possible to update the documents.
This release relaxes the check and instead verifies that the fields of the two schemas are similar.
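The relaxed check can be sketched roughly like this (a simplification, not DocArray's actual implementation): instead of requiring the two schema classes to be the identical object, compare their declared field names and types:

```python
def fields_match(schema_a, schema_b) -> bool:
    # Compare declared field names and types instead of class identity
    return schema_a.__annotations__ == schema_b.__annotations__

# Two dynamically generated classes with identical fields
class DocA:
    text: str
    score: int

class DocB:
    text: str
    score: int

assert DocA is not DocB          # a strict identity check would fail
assert fields_match(DocA, DocB)  # the relaxed field-based check passes
```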
We fixed an issue where non-class type fields used in schemas with QdrantDocumentIndex
result in a TypeError
.
The issue has been resolved by replacing the usage of issubclass
with safe_issubclass
in the QdrantDocumentIndex
implementation.
The following case used to result in a KeyError
:
from docarray import BaseDoc
from docarray.utils.create_dynamic_doc_class import create_base_doc_from_schema
class Nested2(BaseDoc):
value: str
class Nested1(BaseDoc):
nested: Nested2
class RootDoc(BaseDoc):
nested: Nested1
new_my_doc_cls = create_base_doc_from_schema(RootDoc.schema(), 'RootDoc')
We fixed this issue by changing create_base_doc_from_schema
so that global definitions of nested schemas are propagated during recursive calls.
We would like to thank all contributors to this release:
0.37.0
Release time: 2023-08-03 03:11:16
This release contains 6 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.
Leverage the power of Milvus in your DocArray project with this latest integration. Here's a simple usage example:
import numpy as np
from docarray import BaseDoc
from docarray.index import MilvusDocumentIndex
from docarray.typing import NdArray
from pydantic import Field
class MyDoc(BaseDoc):
text: str
embedding: NdArray[10] = Field(is_embedding=True)
docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = MilvusDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)
In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Milvus-backed document index and use it to index our documents. Finally, we perform a search query.
Supported Functionalities
HnswDocumentIndex
(#1718)
With our latest update, you can easily utilize filtering in HnswDocumentIndex
either as an independent function or in conjunction with the query builder to combine it with vector search.
The code below shows how the new feature works:
import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
class SimpleSchema(BaseDoc):
year: int
price: int
embedding: NdArray[128]
# Create dummy documents.
docs = DocList[SimpleSchema](
SimpleSchema(year=2000 - i, price=i, embedding=np.random.rand(128))
for i in range(10)
)
doc_index = HnswDocumentIndex[SimpleSchema](work_dir="./tmp_5")
doc_index.index(docs)
# Independent filtering operation (year == 1995)
filter_query = {"year": {"$eq": 1995}}
results = doc_index.filter(filter_query)
# Filtering combined with vector search
hybrid_query = (
doc_index.build_query() # get empty query object
.filter(filter_query={"year": {"$gt": 1994}}) # pre-filtering (year > 1994)
.find(
query=np.random.rand(128), search_field="embedding"
) # add vector similarity search
.filter(filter_query={"price": {"$lte": 3}}) # post-filtering (price <= 3)
.build()
)
results = doc_index.execute_query(hybrid_query)
First, we create and index some dummy documents. Then, we use the filter function in two ways. One is by itself to find documents from a specific year. The other is mixed with a vector search, where we first filter by year, perform a vector search, and then filter by price.
InMemoryExactNNIndex
(#1713)
You can now add a pre-filter to your queries in InMemoryExactNNIndex
. This lets you create flexible queries where you can set up as many pre- and post-filters as you want. Here's a simple example:
query = (
doc_index.build_query()
.filter(filter_query={'price': {'$lte': 3}}) # Pre-filter: price <= 3
.find(query=np.ones(10), search_field='tensor') # Vector search
.filter(filter_query={'text': {'$eq': 'hello 1'}}) # Post-filter: text == 'hello 1'
.build()
)
In this example, we first set a pre-filter to only include items priced 3 or less. We then do a vector search. Lastly, we add a post-filter to find items with the text 'hello 1'. This way, you can easily filter before and after your search!
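The $eq/$gt/$lte-style filter queries used above follow a MongoDB-like syntax. As a rough illustration of how such a filter can be evaluated against plain dictionaries (a simplified sketch with hypothetical OPS/matches helpers, not DocArray's implementation):

```python
OPS = {
    '$eq': lambda value, target: value == target,
    '$gt': lambda value, target: value > target,
    '$gte': lambda value, target: value >= target,
    '$lt': lambda value, target: value < target,
    '$lte': lambda value, target: value <= target,
}

def matches(doc: dict, filter_query: dict) -> bool:
    # Every field condition must hold for the document to match
    for field, condition in filter_query.items():
        for op, target in condition.items():
            if not OPS[op](doc.get(field), target):
                return False
    return True

docs = [{'price': 1, 'text': 'hello 1'}, {'price': 5, 'text': 'hello 2'}]
cheap = [d for d in docs if matches(d, {'price': {'$lte': 3}})]
print(cheap)  # [{'price': 1, 'text': 'hello 1'}]
```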
InMemoryExactNNIndex
(#1724)
You can now easily update your documents in InMemoryExactNNIndex
. Previously, when you tried to update the same set of documents, it would just add duplicate copies instead of making changes to the existing ones. But not anymore! From now on, if you want to update documents, you just have to re-index them.
DocVec
deserialization (#1679)
Now you can specify the format of your tensor during DocVec
deserialization. You can do this with any method you're using to convert data - like protobuf
, json
, pandas
, bytes
, binary
, or base64
. This means you'll always get your tensors in the format you want, whether it's a Torch tensor, TensorFlow tensor, NDarray, and so on.
id
field of BaseDoc
(#1737)
We added a description and example to the id
field of BaseDoc, so that you get a richer OpenAPI specification when building FastAPI based applications with it.
HnswDocumentIndex
performance (#1727, #1729)
We've implemented two key optimizations to enhance the performance of HnswDocumentIndex
. Firstly, we've avoided serialization of embeddings to SQLite, which is a costly operation and unnecessary as the embeddings can be reconstructed from the hnswlib
index itself. Additionally, we've minimized the frequency of computing num_docs()
, which previously involved a time-consuming full table scan to determine the number of documents in SQLite. As a result, we've seen an approximate speed increase of 10%, enhancing both the indexing and searching processes.
TorchTensor
type comparison (#1739)
We have addressed an exception raised when trying to compare TorchTensor
with the type
keyword in the docarray.typing
module. Previously, this would lead to a TypeError
, but the error has now been resolved, ensuring proper type comparison.
When using the method create_base_doc_from_schema
to dynamically create a BaseDoc class, some information was lost, so we made sure that the new class keeps FieldInfo information from the original class such as description
and examples
.
issubclass
(#1731)
We fixed a bug calling issubclass
by switching to a safer implementation that handles problematic edge-case types.
QdrantDocumentIndex
(#1723)
We've corrected an issue where the collection name was not being updated to match a newly-initialized subindex name in QdrantDocumentIndex
. This ensures consistent naming between collections and their respective subindexes.
We fixed a bug so that documents with TorchTensors can now be deepcopied.
We would like to thank all contributors to this release:
0.36.0
Release time: 2023-07-18 14:43:28
This release contains 2 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.
You can now use JAX with DocArray. We have introduced JaxArray as a new type option for your documents. JaxArray ensures that JAX can natively process any array-like data in your DocArray documents. Here's how you use it:
from docarray import BaseDoc
from docarray.typing import JaxArray
import jax.numpy as jnp
class MyDoc(BaseDoc):
arr: JaxArray
image_arr: JaxArray[3, 224, 224] # For images of shape (3, 224, 224)
square_crop: JaxArray[3, 'x', 'x'] # For any square image, regardless of dimensions
random_image: JaxArray[3, ...] # For any image with 3 color channels, and arbitrary other dimensions
As you can see, the JaxArray typing is extremely flexible and can support a wide range of tensor shapes.
Creating a document with tensors is straightforward. Here is an example:
doc = MyDoc(
arr=jnp.zeros((128,)),
image_arr=jnp.zeros((3, 224, 224)),
square_crop=jnp.zeros((3, 64, 64)),
random_image=jnp.zeros((3, 128, 256)),
)
Leverage the power of Redis in your DocArray project with this latest integration. Here's a simple usage example:
import numpy as np
from docarray import BaseDoc
from docarray.index import RedisDocumentIndex
from docarray.typing import NdArray
class MyDoc(BaseDoc):
text: str
embedding: NdArray[10]
docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = RedisDocumentIndex[MyDoc](host='localhost')
db.index(docs)
results = db.find(query, search_field='embedding', limit=10)
In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Redis-backed document index and use it to index our documents. Finally, we perform a search query.
Find: Vector search for efficient retrieval of similar documents.
Filter: Use Redis syntax to filter based on textual and numeric data.
Text Search: Leverage text search methods, such as BM25, to find relevant documents.
Get/Del: Fetch or delete specific documents from the index.
Hybrid Search: Combine find and filter functionalities for more refined search. Currently, only these two can be combined.
Subindex: Search through nested data.
HnswDocumentIndex
by caching num docs (#1706)
We've optimized the num_docs() operation by caching the document count, addressing previous slowdowns during searches. This change results in a minor increase in indexing time, but significantly accelerates search times.
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time
class MyDoc(BaseDoc):
text: str
embedding: NdArray[128]
docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')
index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start
query = docs[0]
find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start
In the above experiment, we observed a 13x improvement in the speed of the search function, reducing its execution time from 0.0238 to 0.0018 seconds.
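The idea behind this caching optimization can be sketched like this (an illustrative toy index, not DocArray's actual code): keep a running counter that is updated on insert, instead of scanning the store on every num_docs() call:

```python
class ToyIndex:
    """Illustrative index that caches its document count."""

    def __init__(self):
        self._store = {}
        self._num_docs = 0  # cached count, maintained incrementally

    def index(self, docs):
        for doc_id, doc in docs:
            if doc_id not in self._store:
                self._num_docs += 1  # only new ids increase the count
            self._store[doc_id] = doc

    def num_docs(self) -> int:
        # O(1) lookup instead of a full scan over the store
        return self._num_docs

idx = ToyIndex()
idx.index([("a", {"text": "hey"}), ("b", {"text": "ho"})])
print(idx.num_docs())  # 2
```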
We've moved the contains method into the base class. With this refactoring, the responsibility for checking if a document exists is now delegated to individual backend implementations using the new _doc_exists method.
We have implemented a more robust method of detecting existing indices for WeaviateDocumentIndex.
WeaviateDocumentIndex
handles lowercase index names (#1711)
We've addressed an issue in the WeaviateDocumentIndex
where passing a lowercase index name led to mismatches and subsequent errors. This was due to the system automatically capitalizing the index name when creating an index.
QdrantDocumentIndex
unable to see index_name
(#1705)
We've resolved an issue where the QdrantDocumentIndex
was not properly recognizing the index_name
parameter. Previously, the specified index_name
was ignored and the system defaulted to the schema name.
InMemoryExactNNIndex
with AnyEmbedding
(#1696)
From now on, you can perform search operations in InMemoryExactNNIndex
using AnyEmbedding.
safe_issubclass
everywhere (#1691)
We now use safe_issubclass instead of issubclass because it supports non-class inputs, helping us to avoid unexpected errors.
DocLists
in the base index (#1685)
We added an additional check to avoid passing DocLists to a function that converts a list of dictionaries to a DocList.
We would like to thank all contributors to this release:
0.35.0
This release contains 3 new features, 2 bug fixes and 1 documentation improvement.
DocVec
(#1562)
DocVec
now has the same serialization interface as DocList
. This means that the following methods are available for it:
to_protobuf()
/from_protobuf()
to_base64()
/from_base64()
save_binary()
/load_binary()
to_bytes()
/from_bytes()
to_dataframe()
/from_dataframe()
For example, you can now perform Base64 (de)serialization like this:
from docarray import BaseDoc, DocVec
class SimpleDoc(BaseDoc):
text: str
dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])
base64_repr_dv = dv.to_base64(compress=None, protocol='pickle')
dl_from_base64 = DocVec[SimpleDoc].from_base64(
base64_repr_dv, compress=None, protocol='pickle'
)
For further guidance, check out the documentation section on serialization.
We now validate the file formats given in URL types such as AudioURL, TextURL and ImageURL
to check that they correspond to the expected MIME type.
BaseDoc
from schema (#1667)
Sometimes it can be useful to dynamically create a BaseDoc
from a given schema of an original BaseDoc
. Using the methods create_pure_python_type_model
and create_base_doc_from_schema
you can make sure to reconstruct the BaseDoc
.
from docarray.utils.create_dynamic_doc_class import (
create_base_doc_from_schema,
create_pure_python_type_model,
)
from typing import Optional
from docarray import BaseDoc, DocList
from docarray.typing import AnyTensor
from docarray.documents import TextDoc
class MyDoc(BaseDoc):
tensor: Optional[AnyTensor]
texts: DocList[TextDoc]
MyDocPurePython = create_pure_python_type_model(MyDoc) # Due to limitation of DocList as Pydantic List, we need to have the MyDoc `DocList` converted to `List`.
NewMyDoc = create_base_doc_from_schema(
MyDocPurePython.schema(), 'MyDoc', {}
)
new_doc = NewMyDoc(tensor=None, texts=[TextDoc(text='text')])
Due to the breaking change in Pydantic v2
, we have capped the version to avoid problems when installing docarray.
After calling doc_list = doc_vec.to_doc_list()
, doc_vec
ends up in an unusable state since its data has been transferred to doc_list
. This fix gives users a more informative error message when they try to interact with doc_vec
after it has been made unusable.
We would like to thank all contributors to this release:
0.21.1
Release time: 2023-06-21 08:15:43
This release contains 1 bug fix.
These extra headers allow passing authentication keys to connect to a secured Weaviate instance, which WeaviateDocumentArray supports.
We would like to thank all contributors to this release:
0.34.0
Release time: 2023-06-21 08:15:43
This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.
⚠️⚠️ DocArray now requires Python 3.8. We can no longer assure compatibility with Python 3.7.
We decided to drop it for two reasons:
DocVec
Protobuf definition (#1639)
In order to fix a bug in the DocVec
protobuf serialization described in #1561,
we have changed the DocVec
.proto definition.
This means that DocVec
objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray
v.0.34.0 or later, and vice versa.
⚠️⚠️ We strongly recommend that everyone using Protobuf with DocVec
upgrade to DocArray v0.34.0 or
later.
You can now check if a Document has already been indexed by using the in
keyword:
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
text: str
embedding: NdArray[128]
docs = DocList[MyDoc](
[MyDoc(text="Example text", embedding=np.random.rand(128))
for _ in range(2000)])
index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index
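A toy sketch of the mechanism behind this feature (illustrative only, not DocArray's actual implementation): support for the in keyword comes from implementing Python's __contains__ protocol:

```python
class ToyIndex:
    """Minimal index supporting the `in` keyword via __contains__."""

    def __init__(self, docs):
        self._ids = {doc["id"] for doc in docs}

    def __contains__(self, doc) -> bool:
        # `doc in index` dispatches here
        return doc["id"] in self._ids

docs = [{"id": "a"}, {"id": "b"}]
index = ToyIndex(docs)
assert docs[0] in index
assert {"id": "zzz"} not in index
```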
InMemoryExactNNIndex
(#1617)
You can now use the find_subindex
method with the ExactNNSearch DocIndex.
import numpy as np
from pydantic import Field

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl, VideoUrl, AnyTensor

class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)

class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)

class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)
doc_index = InMemoryExactNNIndex[MyDoc]()
...
# find by the `ImageDoc` tensor when index is populated
root_docs, sub_docs, scores = doc_index.find_subindex(
np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)
You can deserialize any DocVec
protobuf message to any tensor type,
by passing the tensor_type
parameter to from_protobuf
.
This means that you can choose at deserialization time if you are working with numpy, PyTorch, or TensorFlow tensors.
from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor

class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here
proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)
assert isinstance(da_after.tensor, TensorFlowTensor)
DBConfig
to InMemoryExactNNSearch
InMemoryExactNNIndex
used to take a single parameter, index_file_path
, as a constructor argument, unlike the rest of
the indexers, which accepted their own DBConfig
. Now index_file_path
is part of the DBConfig
, which allows
initializing from it.
This will allow us to extend this config if more parameters are needed.
The parameters of DBConfig
can be passed at construction time as **kwargs
making this change compatible with old
usage.
These two initializations are equivalent.
from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')
index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')
BaseDoc
with Union
type (#1655)
Serialization of BaseDoc
types that have Union
-typed fields of Python native types is now supported.
from typing import Union

from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    union_field: Union[int, str]

docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2
When these Union
types involve other BaseDoc
types, an exception is thrown.
from docarray.documents import TextDoc, ImageDoc

class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])

# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())
HNSWDocumentIndex
(#1657, #1656)
If you call find
or find_batched
on an HNSWDocumentIndex
, the limit
parameter will automatically be cast to an integer.
default_column_config
from RuntimeConfig
to DBConfig
(#1648)
default_column_config
contains specific configuration information about the columns and tables inside the backend's
contains specific configuration information about the columns and tables inside the backend's
database. This was previously put inside RuntimeConfig
which caused an error because this information is required at
initialization time. This information has been moved inside DBConfig
so you can edit it there.
from docarray.index import HnswDocumentIndex
import numpy as np

db_config = HnswDocumentIndex.DBConfig()
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HnswDocumentIndex[MyDoc](db_config=db_config)
This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the
data essentially inaccessible. This has now been fixed, and DocVec
objects are identical before and after (de)serialization.
find
and filter
combination used in InMemoryExactNNIndex
(#1642)
Hybrid search (find+filter) for InMemoryExactNNIndex
was prioritizing low similarities (lower scores) for returned
matches. This has been fixed by adding an option to sort matches in reverse order based on their scores.
# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')
query = (
db.build_query()
.find(query=q_doc, search_field='embedding')
.filter(filter_query={'text': {'$exists': True}})
.build()
)
results = db.execute_query(query)
# Before: results was sorted from worst to best matches
# Now: It's sorted in the correct order, showing better matches first
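The fix boils down to ranking by descending score. A minimal illustration of the sorting logic (toy data, not DocArray internals):

```python
matches = [("doc_a", 0.31), ("doc_b", 0.92), ("doc_c", 0.57)]

# Sort by similarity score, best (highest) match first
ranked = sorted(matches, key=lambda m: m[1], reverse=True)
print(ranked)  # [('doc_b', 0.92), ('doc_c', 0.57), ('doc_a', 0.31)]
```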
Using QdrantDocumentIndex
to connect to a Qdrant DB initialized outside of docarray
raised a KeyError
.
This has been fixed, and now you can use QdrantDocumentIndex
to connect to externally initialized collections.
DocVec
equality (#1641, #1663)
summary()
called for LegacyDocument (#1637)
DocList
and DocVec
coercion (#1568)
update()
on BaseDoc
with tensor fields (#1628)
We would like to thank all contributors to this release:
0.33.0
Release time: 2023-06-06 14:05:56
This release contains 1 new feature, 1 performance improvement, 9 bug fixes and 4 documentation improvements.
Allow coercing to a TorchTensor
from an NdArray
or TensorFlowTensor
and the other way around.
from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np
class MyTensorsDoc(BaseDoc):
tensor: TorchTensor
doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()
📄 MyTensorsDoc : 0a10f88 ...
╭──────────────────────┬───────────────────────────────────────────────────╮
│ Attribute            │ Value                                             │
├──────────────────────┼───────────────────────────────────────────────────┤
│ tensor: TorchTensor  │ TorchTensor of shape (512,), dtype: torch.float64 │
╰──────────────────────┴───────────────────────────────────────────────────╯
We have made a performance improvement for the find
interface for InMemoryExactNNIndex
that gives a ~2x speedup.
The script used to measure this is as follows:
from torch import rand
from time import perf_counter

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor


class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor


def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )


num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)

index = InMemoryExactNNIndex[MyDocument](data_list)

start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')

print(f"Number of queries: {num_queries} \n"
      f"Number of indexed documents: {num_docs} \n"
      f"Total time: {(perf_counter() - start)/5} seconds")
limit
parameter in filter
for index backends (#1618)
InMemoryExactNNIndex
and HnswDocumentIndex
now respect the limit
parameter in the filter
API.
HnswDocumentIndex
can search with limit
greater than the number of documents (#1611)
HnswDocumentIndex
now allows calling find
with a limit
parameter larger than the number of indexed documents.
HnswDocumentIndex
(#1604)
HnswDocumentIndex
now allows reindexing documents with the same id
, updating the original documents.
HnswDocumentIndex
now allows indexing more than max_elements
, dynamically adapting the index as it grows.
HnswDocumentIndex
(#1596)
from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
text: str
embedding: NdArray[128]
docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')
Previously, this basic usage threw an exception:
TypeError: ModelMetaclass object argument after must be a mapping, not MyDoc
Now, it works as expected.
InMemoryExactNNIndex
index initialization with nested DocList
(#1582)
Instantiating an InMemoryExactNNIndex
with a Document
schema that had a nested DocList
previously threw this error:
from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import HnswDocumentIndex
class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]

index = HnswDocumentIndex[MyDoc]()
TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'
Now it can be successfully instantiated.
Calling summary
on a document with a List
attribute previously showed the wrong type:
from docarray import BaseDoc, DocList
from typing import List
class TestDoc(BaseDoc):
str_list: List[str]
dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()
Previous output:
╭──────── DocList Summary ────────╮
│                                 │
│   Type     DocList[TestDoc]     │
│   Length   2                    │
│                                 │
╰─────────────────────────────────╯
╭─── Document Schema ────╮
│                        │
│   TestDoc              │
│   └── str_list: str    │
│                        │
╰────────────────────────╯
New output:
╭──────── DocList Summary ────────╮
│                                 │
│   Type     DocList[TestDoc]     │
│   Length   2                    │
│                                 │
╰─────────────────────────────────╯
╭────── Document Schema ──────╮
│                             │
│   TestDoc                   │
│   └── str_list: List[str]   │
│                             │
╰─────────────────────────────╯
issubclass
(#1594)
DocArray
relies heavily on calling Python's issubclass
method, which caused multiple issues. We now use a safe version that accounts for edge cases and non-class types.
The example
payload of a given document schema with Tensor
attribute was previously of bytes
type. This has now been changed to str
.
from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
text: str
embedding: NdArray[128]
print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')
n_dim
to dim
(#1610)
We would like to thank all contributors to this release:
0.32.1
Release time: 2023-05-26 14:50:34
This release contains 4 bug fixes, 1 refactoring and 2 documentation improvements.
ElasticDocIndex
logging (#1551)
More debugging logs have been added inside ElasticDocIndex
.
InMemoryExactNNIndex
with Optional
embedding tensors (#1575)
You can now index Documents where the tensor search_field
is Optional
. The index will not consider these None
embeddings when running a search.
import torch
from typing import Optional
from docarray import BaseDoc, DocList
from docarray.typing import TorchTensor
from docarray.index import InMemoryExactNNIndex
class EmbeddingDoc(BaseDoc):
    embedding: Optional[TorchTensor[768]]

docs = DocList[EmbeddingDoc](
    [
        EmbeddingDoc(embedding=(torch.rand(768) if i % 2 else None))
        for i in range(5)
    ]
)
index = InMemoryExactNNIndex[EmbeddingDoc](docs)
index.find(torch.rand((768,)), search_field="embedding", limit=3)
is_subclass
check (#1569)
In DocArray, especially when dealing with indexers, field types are checked, which leads to calls to Python's issubclass
method.
This call fails under some circumstances, for instance when checking a List
or Tuple
. Starting with this release, we use a safe version that does not fail for these cases.
This enables the following usage, which would otherwise fail:
from typing import List

from docarray import BaseDoc
from docarray.index import HnswDocumentIndex

class MyDoc(BaseDoc):
    test: List[str]

index = HnswDocumentIndex[MyDoc]()
AnyDoc
deserialization (#1571)AnyDoc
is a schema-less special Document that adapts to the schema of the data it tries to load. However, in cases where the data contained Dictionaries or Lists, deserialization failed. This is now fixed and you can have this behavior:
from docarray.base_doc import AnyDoc, BaseDoc
from typing import Dict
class ConcreteDoc(BaseDoc):
    text: str
    tags: Dict[str, int]
doc = ConcreteDoc(text='text', tags={'type': 1})
any_doc = AnyDoc.from_protobuf(doc.to_protobuf())
assert any_doc.text == 'text'
assert any_doc.tags == {'type': 1}
dict
method for Document view (#1559)Prior to this fix, doc.dict()
would return an empty dictionary if doc.is_view() == True
:
from docarray import BaseDoc, DocVec

class MyDoc(BaseDoc):
    foo: int

vec = DocVec[MyDoc]([MyDoc(foo=3)])

# before
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {}

# after
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {'id': 'f285db406a949a7e7ab084032800f7d8', 'foo': 3}
DocList
in FastAPI (#1546)We would like to thank all contributors to this release:
Published by github-actions[bot] over 1 year ago
v0.32.0
)This release contains 4 new features, 0 performance improvements, 5 bug fixes and 4 documentation improvements.
The subindex feature allows you to index documents that contain another DocList
by automatically creating a separate collection/index for each such DocList
:
import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray

# create nested document schema
class SimpleDoc(BaseDoc):
    tensor: NdArray[10]
    text: str

class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]

# create some docs
my_docs = [
    MyDoc(
        docs=DocList[SimpleDoc](
            [
                SimpleDoc(
                    tensor=np.ones(10) * (j + 1),
                    text=f"hello {j}",
                )
                for j in range(10)
            ]
        ),
    )
]
# index them into Elasticsearch
index = ElasticDocIndex[MyDoc](index_name="idx")
index.index(my_docs) # index with name 'idx' and 'idx__docs' will be generated
# search on the nested level (subindex)
query = np.random.rand(10)
matches_root, matches_nested, scores = index.find_subindex(
query, search_field="docs__tensor", limit=5
)
We have enabled shaped tensors to be properly represented in OpenAPI/SwaggerUI, both in examples and the schema.
This means that you can now build web APIs using FastAPI where the SwaggerUI properly communicates tensor shapes to your users:
from fastapi import FastAPI
from docarray import BaseDoc
from docarray.base_doc import DocArrayResponse
from docarray.typing import TorchTensor

class Doc(BaseDoc):
    embedding_torch: TorchTensor[3, 4]

app = FastAPI()

@app.post("/foo", response_model=Doc, response_class=DocArrayResponse)
async def foo(doc: Doc) -> Doc:
    return Doc(embedding_torch=doc.embedding_torch)
Generated Swagger UI:
We added a persist
method to the InMemoryExactNNIndex
class to save the index to disk.
# Save your existing index as a binary file
doc_index.persist('docs.bin')
# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
search_field
should be optional in hybrid text search (#1516)We have added a sane default to text_search()
for the search_field
argument that is now Optional.
We have added an internal check to see if index_file_path
exists when passed to InMemoryExactNNIndex
.
We have ensured that empty indices do not fail when find
is called.
Serializing tensors with gradients no longer fails.
DocVec
display (#1522)DocVec
display issues have been resolved.
We would like to thank all contributors to this release:
Published by github-actions[bot] over 1 year ago
0.31.1
)This patch release fixes a small bug that was introduced in the latest minor release (0.31.0
).
json
or dict
on an Optional nested DocList no longer throws an error if the value is set to None
(#1512)We would like to thank all contributors to this release:
Published by github-actions[bot] over 1 year ago
v0.31.0
)This release contains 4 new features, 11 bug fixes, and several documentation improvements.
DocVec
Optional Tensor (#1472)Optional tensor fields in a DocVec
will return None
instead of a list of NaN
if the column does not hold any tensor.
This code snippet shows the breaking change:
from typing import Optional
from docarray import BaseDoc, DocVec
from docarray.typing import NdArray
class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]

docs = DocVec[MyDoc]([MyDoc() for j in range(2)])
print(docs.tensor)
| Version | Return type |
| --- | --- |
| 0.30.0 | `[nan nan]` |
| 0.31.0 | `None` |
Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.
In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name
or collection_name
.
Starting with DocArray v0.31.0, the default index_name
/collection_name
will be derived from the document schema name:
from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc
class MyDoc(BaseDoc):
    pass
# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()
If you create and persist a Document Index with v0.30.0, and try to access it using v0.31.0 without manually specifying an index name, an Exception will occur.
You can fix this by manually specifying the index name to match the old default:
# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')
The below table summarizes the change for all DB backends:
| Document Index | DBConfig argument | Default in v0.30.0 | Default in v0.31.0 |
| --- | --- | --- | --- |
| WeaviateDocumentIndex | `index_name` | `'Document'` | Schema class name |
| QdrantDocumentIndex | `collection_name` | `'documents'` | Schema class name |
| ElasticDocIndex | `index_name` | `'index__'` + a random id | Schema class name |
| ElasticV7DocIndex | `index_name` | `'index__'` + a random id | Schema class name |
| HnswDocumentIndex | n/a | n/a | n/a |
InMemoryExactNNIndex
(#1441)In this version we have introduced the InMemoryExactNNIndex
Document Index which allows you to perform in-memory exact vector search (as opposed to approximate nearest neighbor search in vector databases).
The InMemoryExactNNIndex
can be used for prototyping and is suitable for dealing with small-scale documents (1k-10k), as opposed to a vector database that is suitable for larger scales but comes with a performance overhead at smaller scales.
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
    tensor: NdArray[512]
docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))
doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)
print(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))
FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
DocList
inherits from Python list
(#1457)DocList
is now a subclass of Python's list
. This means that you can now use all the methods that are available to Python lists on DocList
objects. For example, you can now use len
on DocList
objects and tools like Pydantic or FastAPI will be able to work with it more easily.
len
to DocIndex
(#1454)You can now perform len(vector_index)
which is equivalent to vector_index.num_docs()
.
to_json
alias to BaseDoc
(#1494)Document
or DocumentArray
(#1422)Trying to load Document
or DocumentArray
from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.
AnyDoc.from_protobuf
(#1437)AnyDoc
can now read any BaseDoc
protobuf file. The same applies to DocList
.
extend
to DocList
(#1493)dict()
on BaseDoc
(#1481)json()
on BaseDoc
(#1481)pd.concat()
instead of df.append()
in to_dataframe()
to avoid warning (#1478)ndarray
(#1429)hnswlib
DocIndex
URLs (#1433)hnswlib
and elastic
document indexes (#1431)We would like to thank all contributors to this release:
Published by JoanFM over 1 year ago
Warning
This version of DocArray is a complete rewrite, and therefore includes many breaking changes. Be sure to check the documentation to prepare your migration.
If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.
DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.
This gives the following advantages:
You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:
For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
Document
Document
has been renamed to BaseDoc
.BaseDoc
cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.BaseDoc
allows for a flexible schema compared to the Document
class in v1 which only allowed for a fixed schema, with one of tensor
, text
and blob
, and additional chunks
and matches
.Document methods (such as .load_uri_to_image_tensor()
) are not supported in v2. Instead, we provide some of those methods on the typing-level.LegacyDocument
class, which extends BaseDoc
while following the same schema as v1's Document
. The LegacyDocument
can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 Document
. Indeed, none of the methods associated with Document
are present. Only the schema of the data is similar.DocumentArray
DocumentArray
class from v1 has been renamed to DocList
, to be more descriptive of its actual functionality, since it is a list of BaseDoc
s.DocVec
, which is a column-based representation of BaseDoc
s. Both DocVec
and DocList
extend AnyDocArray
.DocVec
is a container of Documents suited to computations that require batches of data (e.g. matrix multiplication, distance calculation, a deep learning forward pass).DocVec
has a similar interface as DocList
but with an underlying implementation that is column-based instead of row-based. Each field of the schema of the DocVec
(the .doc_type
which is a BaseDoc
) will be stored in a column. If the field is a tensor, the data from all Documents will be stored as a single doc_vec
(Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor
or a Union of tensor types, the .tensor_type
will be used to determine the type of the doc_vec
column.DocList
it does not necessarily have to be homogeneous.DocList
you can parameterize it at initialization time:from docarray import DocList
from docarray.documents import ImageDoc
docs = DocList[ImageDoc]()
.from_csv()
or .pull()
only work with parameterized DocList
s.AnyDocArray
will expose the same attributes as the BaseDoc
s it contains. This will return a list of type(attribute)
. However, this only works if (and only if) all the BaseDoc
s in the AnyDocArray
have the same schema. Therefore only this works:from docarray import BaseDoc, DocList
class Book(BaseDoc):
    title: str
    author: str = None
docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title # returns a list[str]
# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title
In v2 the Document Store
has been renamed to DocIndex
and can be used for fast retrieval using vector similarity. DocArray v2 DocIndex
supports:
Instead of creating a DocumentArray
instance and setting the storage
parameter to a vector database of your choice, in v2 you can initialize a DocIndex
object of your choice, such as:
db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')
In contrast, DocStore
in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.
Thank you to all of the contributors to this release:
Published by github-actions[bot] almost 2 years ago
0.21.0
)Release time: 2023-01-17 09:10:50
This release contains 3 new features, 7 bug fixes and 5 documentation improvements.
This version of DocArray adds a new Document Store: OpenSearch!
You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:
from docarray import Document, DocumentArray
import numpy as np
# Connect to OpenSearch instance
n_dim = 3
da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)

# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )
# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)
Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.
Learn more about its usage in the official documentation.
You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor()
:
coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')
doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')]),
)
doc.display()
The Redis Document Store now supports text search in various supported languages. To set a desired language, change the language
parameter in the Redis configuration:
da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)
Whenever the string "\n"
was contained in any Document field, doc.plot()
would result in a rendering error. This fixes those errors by rendering "\n"
as whitespace.
to_pydantic_model
(#949)This bug caused all strings of the form 'Infinity'
to be coerced to the string 'inf'
when calling to_pydantic_model()
or to_dict()
. This is fixed now, leaving such strings unchanged.
In the embed_and_evaluate()
method, the number of relevant Documents per label used to be calculated based on the Documents in self
. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.
When a Document Store has list-like behavior disabled, it no longer creates an offset to id mapping, which improves performance.
Loading audio files from a remote URL would cause FileNotFoundError
, which is now fixed.
$exists
does not work correctly with tags (#911) (#923)Before this fix, $exists
would treat false-y values such as 0
or []
as non-existent. This is now fixed.
When casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with List[...]
. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.
We would like to thank all contributors to this release:
Published by github-actions[bot] almost 2 years ago
0.20.1
)Release time: 2022-12-12 09:32:37
This bug was causing connectivity issues when using multiple DocumentArrays in different threads to connect to the same Milvus instance, e.g. in pytest.
This would produce an error like the following:
E1207 14:59:51.357528591 2279 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.367985469 2279 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.457061884 3934 ev_epoll1_linux.cc:824] assertion failed: gpr_atm_no_barrier_load(&g_active_poller) != (gpr_atm)worker
Fatal Python error: Aborted
This fix creates a separate gRPC connection for each MilvusDocumentArray instance, circumventing the issue.
DocArray v0.20.0 broke (de)serialization backwards compatibility with earlier versions of the library, making it impossible to load DocumentArrays from v0.19.1 or earlier from disk:
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.0
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
AttributeError: 'DocumentArrayInMemory' object has no attribute '_is_subindex'
This fix restores backwards compatibility by not relying on newly introduced private attributes:
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.1
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
<DocumentArray (length=11) at 140683902276416>
Process finished with exit code 0
We would like to thank all contributors to this release:
Published by github-actions[bot] almost 2 years ago
0.20.0
)Release time: 2022-12-07 12:15:30
This release contains 8 new features, 3 bug fixes and 7 documentation improvements.
This release supports the Milvus vector database as a document store.
da = DocumentArray(storage='milvus', config={'n_dim': 3})
When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).
top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)
To allow this we now store the root_id
in the chunks' tags. You can enable this by passing root_id=True
in your document store configuration.
You can now filter based on text keywords for the Qdrant document store.
filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}
results = da.find(np.random.rand(n_dim), filter=filter)
DocArray already supports 3D mesh representation in different formats and this release adds support for RGB-D representation.
doc.load_uris_to_rgbd_tensor()
Multi-page TIFF
images can now be loaded with load_uri_to_image_tensor()
.
d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
โโ chunks
โโ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
โโ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
โโ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>
keyframe_indices
are now stored in a Document's tags when loading a video to tensor. This allows extracting the section of the video between key frames.
d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
[0, 25, 196, ...]
You can now choose which meta field parameters to exclude when calling DocumentArray's plot_embeddings()
method. This makes it easier to plot embeddings for complex and nested data.
docs.plot_embeddings(exclude_fields_metas=['chunks'])
This release adds a max_rel_per_label
parameter to better support metric calculations that require the number of relevant Documents.
metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
DocArray 0.19 added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.
In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage
is called.
We would like to thank all contributors to this release:
Published by github-actions[bot] almost 2 years ago
This release contains 1 hot fix.
This release introduces namespaces when pushing/pulling DocumentArrays to/from Jina AI Cloud.
from docarray import DocumentArray
DocumentArray.pull('<username>/<da-name>')
DocumentArray.push('<username>/<da-name>')
You should now use a namespace when accessing an artifact. This release fixes a bug related to this namespace in DocArray.