docarray

Represent, send, store and search multimodal data

APACHE-2.0 License

Downloads: 246.6K
Stars: 3K
Committers: 76

docarray - 💫 Patch v0.40.0 Latest Release

Published by github-actions[bot] 10 months ago

Release Note (0.40.0)

Release time: 2023-12-22 12:12:15

🙇 We'd like to thank all contributors for this new release! In particular,
954, Joan Fontanals, Tony Yang, Naymul Islam, Ben Shaver, Jina Dev Bot, 🙇

🆕 New Features

  • [ff00b604] - index: add epsilla connector (#1835) (Tony Yang)
  • [522811f4] - use literal in type hints (#1827) (Ben Shaver)

🐞 Bug fixes

  • [1f86e263] - error type hints in Python3.12 (#1147) (#1840) (954)
  • [21e107bd] - fix issue serializing deserializing complex schemas (#1836) (Joan Fontanals)
  • [3cfa0b8f] - fix storage issue in torchtensor class (#1833) (Naymul Islam)

📗 Documentation

  • [a2421a6a] - epsilla: add epsilla integration guide (#1838) (Tony Yang)
  • [82918fe7] - fix sign commit command in docs (#1834) (Naymul Islam)

🍹 Other Improvements

  • [0e183ff0] - upgrade version (#1841) (Joan Fontanals)
  • [8de3e175] - refactor test of the torchtensor (#1837) (Naymul Islam)
  • [d5d928b8] - version: the next version will be 0.39.2 (Jina Dev Bot)

docarray - 💫 Patch v0.39.1

Published by github-actions[bot] 12 months ago

Release Note (0.39.1)

Release time: 2023-10-23 08:56:38

This release contains 2 bug fixes.

🐞 Bug Fixes

from_dataframe with numpy==1.26.1 (#1823)

A recent update to NumPy changed some of its versioning semantics, breaking DocArray's from_dataframe() method in some cases where the dataframe contains a NumPy array. This has now been fixed.

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    embedding: NdArray
    text: str

da = DocVec[MyDoc](
    [
        MyDoc(
            embedding=[1, 2, 3, 4],
            text='hello',
        ),
        MyDoc(
            embedding=[5, 6, 7, 8],
            text='world',
        ),
    ],
    tensor_type=NdArray,
)
df_da = da.to_dataframe()
# This broke before and is now fixed
da2 = DocVec[MyDoc].from_dataframe(df_da, tensor_type=NdArray)

Type handling in Python 3.9 (#1823)

Starting with Python 3.9, Optional.__args__ is not always available, leading to some compatibility problems. This has been fixed by using the typing.get_args helper.
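To illustrate the helper mentioned above, here is a minimal sketch that uses only the standard library. The function name unwrap_optional is hypothetical and is not part of DocArray; it only shows why typing.get_args is a safer way to inspect an annotation than reading __args__ directly.

```python
from typing import Optional, Union, get_args


def unwrap_optional(annotation):
    """Return the members of an Optional/Union annotation, excluding NoneType."""
    # get_args returns () for plain types, so this is safe for any annotation.
    return tuple(arg for arg in get_args(annotation) if arg is not type(None))


optional_members = unwrap_optional(Optional[int])
union_members = unwrap_optional(Union[int, str, None])
plain_members = unwrap_optional(int)
```

Because get_args falls back to an empty tuple, the same code path handles plain types and Optional types without an attribute check.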

🤟 Contributors

We would like to thank all contributors to this release:

  • Johannes Messner (@JohannesMessner)

docarray - 💫 Release v0.39.0

Published by github-actions[bot] about 1 year ago

Release Note (0.39.0)

Release time: 2023-10-02 13:06:02

This release contains 4 new features, 8 bug fixes, and 7 documentation improvements.

🆕 Features

Support for Pydantic v2 🚀 (#1652)

The biggest feature of this release is full support for Pydantic v2! We are continuing to support Pydantic v1 at the same time.

If you use Pydantic v2, you will need to adapt your DocArray code to the new Pydantic API. See the Pydantic migration guide for details.

Pydantic v2 has its core written in Rust and provides significant performance improvements to DocArray: JSON serialization is 240% faster and validation of BaseDoc and DocList with non-native types like TorchTensor is 20% faster.

Add BaseDocWithoutId (#1803)

A BaseDoc includes an id field by default. This can be problematic if you want to build an API that requires a model without this ID field. Therefore, we now provide BaseDocWithoutId, which, as its name suggests, is a BaseDoc without the ID field.

Please use this document class with caution; BaseDoc is still the base class to use unless you specifically need to remove the ID.

⚠️ BaseDocWithoutId is not compatible with DocIndex or any feature requiring a vector database. This is because DocIndex needs the id field to store and retrieve documents.

💣 Breaking change

Remove Jina AI cloud push/pull (#1791)

Jina AI Cloud is being discontinued. Therefore, we are removing the push/pull feature related to Jina AI cloud.

🐞 Bug Fixes

Fix DocList subscription error

A DocList can be typed with a BaseDoc subclass using the syntax DocList[MyDoc]().

In this release, we have fixed a bug that allowed users to specify the type of a DocList multiple times.

Doing DocList[MyDoc1][MyDoc2] won't work anymore. (#1800)

We also fixed a bug that caused a silent failure when users passed DocList the wrong type, for example DocList[doc()]. (#1794)
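The two checks described above can be pictured with a toy container. This is only a sketch under our own assumptions: TypedList and raises_type_error are hypothetical names, and the code is not DocArray's DocList implementation. It shows one way a generic class can refuse a second parametrization and a non-class argument.

```python
class TypedList(list):
    """Toy typed container; a stand-in for illustration, not DocArray's DocList."""

    doc_type = None  # set on the parametrized subclass

    def __class_getitem__(cls, item):
        # Reject a second parametrization such as TypedList[A][B].
        if cls.doc_type is not None:
            raise TypeError(f'{cls.__name__} is already parametrized')
        # Reject instances such as TypedList[A()] instead of failing silently.
        if not isinstance(item, type):
            raise TypeError(f'expected a class, got {item!r}')
        return type(f'{cls.__name__}[{item.__name__}]', (cls,), {'doc_type': item})


def raises_type_error(make):
    """Return True if calling `make` raises TypeError."""
    try:
        make()
    except TypeError:
        return True
    return False


IntList = TypedList[int]
double_rejected = raises_type_error(lambda: TypedList[int][str])
instance_rejected = raises_type_error(lambda: TypedList[int()])
```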

Milvus connection parameter missing (#1802)

We fixed a small bug that incorrectly set the port of the Milvus client.

📗 Documentation Improvements

🤟 Contributors

We would like to thank all contributors to this release:

  • lvzi (@lvzii)
  • Puneeth K (@punndcoder28)
  • Joan Fontanals (@JoanFM)
  • samsja (@samsja)

docarray - 💫 Release v0.38.0

Published by github-actions[bot] about 1 year ago

Release Note (0.38.0)

Release time: 2023-09-07 13:40:16

This release contains 3 bug fixes and 4 documentation improvements, including 1 breaking change.

💥 Breaking Changes

Changes to the return type of DocList.to_json() and DocVec.to_json()

In order to make the to_json method consistent across different classes, we changed its return type in DocList and DocVec to str.
This means that, if you use this method in your application, make sure to update your codebase to expect str instead of bytes.
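If your application has to run against DocArray versions on both sides of this change, a small shim can normalize the result. The sketch below is an assumption on our part: as_json_text is a hypothetical helper name, and the byte string stands in for what an older to_json() call would have returned.

```python
def as_json_text(payload):
    """Accept the old bytes result or the new str result and return str."""
    if isinstance(payload, bytes):
        return payload.decode('utf-8')
    return payload


old_style = as_json_text(b'[{"text": "hello"}]')  # shape of the pre-0.38.0 result
new_style = as_json_text('[{"text": "hello"}]')  # shape of the result from 0.38.0 on
```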

🐞 Bug Fixes

Make DocList.to_json() and DocVec.to_json() return str instead of bytes (#1769)

This release changes the return type of the methods DocList.to_json() and DocVec.to_json() in order to be consistent with BaseDoc.to_json() and other Pydantic models. After this release, these methods will return str type data instead of bytes.
💥 Since the return type is changed, this is considered a breaking change.

Casting in reduce before appending (#1758)

This release introduces type casting internally in the reduce helper function, casting its inputs before appending them to the final result. This will make it possible to reduce documents whose schemas are compatible but not exactly the same.

Skip doc attributes in __annotations__ but not in __fields__ (#1777)

This release fixes an issue in the create_pure_python_type_model helper function. Starting with this release, only attributes in the class __fields__ will be considered during type creation.
The previous behavior broke applications when users introduced a ClassVar in an input class:

from typing import ClassVar
from docarray import BaseDoc

class MyDoc(BaseDoc):
    endpoint: ClassVar[str] = "my_endpoint"
    input_test: str = ""

Resulting traceback (excerpt):
    field_info = model.__fields__[field_name].field_info
KeyError: 'endpoint'

Kudos to @NarekA for raising the issue and contributing a fix in the Jina project, which was ported to DocArray.
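The distinction between annotations and real fields can be reproduced with the standard library alone. The following is a rough sketch, not the code of create_pure_python_type_model: the class Example and the function instance_annotations are hypothetical, and the filter on ClassVar is just one way to skip class-level attributes.

```python
from typing import ClassVar, get_origin, get_type_hints


class Example:
    endpoint: ClassVar[str] = 'my_endpoint'  # class-level constant, not a field
    input_test: str = ''


def instance_annotations(cls):
    """Return annotations that describe per-instance fields only."""
    hints = get_type_hints(cls)
    return {name: tp for name, tp in hints.items() if get_origin(tp) is not ClassVar}


fields = instance_annotations(Example)
```

Iterating over this filtered mapping, rather than over every annotation, avoids looking up a field entry that was never created.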

📗 Documentation Improvements

  • Explain how to set Document config (#1773)
  • Add workaround for torch compile (#1754)
  • Add note about pickling dynamically created Doc class (#1763)
  • Improve the docstring of filter_docs (#1762)

🤟 Contributors

We would like to thank all contributors to this release:

  • Sami Jaghouar (@samsja)
  • Johannes Messner (@JohannesMessner)
  • AlaeddineAbdessalem (@alaeddine-13)
  • Joan Fontanals (@JoanFM)
  • [d5cb02fb] - version: the next version will be 0.37.2 (Jina Dev Bot)

docarray - 💫 Patch v0.37.1

Published by github-actions[bot] about 1 year ago

Release Note (0.37.1)

This release contains 4 bug fixes and 1 documentation improvement.

🐞 Bug Fixes

Relax the schema check in update mixin (#1755)

The previous schema check in the UpdateMixin was strict and did not allow updates when the schemas of both documents were similar but did not share the same reference.
For instance, if the schemas were dynamically generated but had the same fields and field types, the check would still evaluate to False and it would not be possible to update the documents.
This release relaxes the check, which now verifies that the fields of the schemas are similar instead.
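The difference between the strict and the relaxed check can be sketched with plain Python classes. The helper names make_doc_class and same_fields are hypothetical, and comparing type hints is our own simplification of "similar fields"; the UpdateMixin may compare schemas differently.

```python
from typing import get_type_hints


def make_doc_class():
    """Build a fresh class on each call, mimicking dynamically generated schemas."""

    class Doc:
        text: str
        score: float

    return Doc


def same_fields(schema_a, schema_b):
    """Relaxed check: compare field names and types, not class identity."""
    return get_type_hints(schema_a) == get_type_hints(schema_b)


DocA, DocB = make_doc_class(), make_doc_class()
strict_result = DocA is DocB  # identity check: two distinct class objects
relaxed_result = same_fields(DocA, DocB)  # same fields and field types
```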

Fix non-class type fields (#1752)

We fixed an issue where non-class type fields used in schemas with QdrantDocumentIndex resulted in a TypeError.
The issue has been resolved by replacing issubclass with safe_issubclass in the QdrantDocumentIndex implementation.
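For readers unfamiliar with the problem, the built-in issubclass raises a TypeError when its first argument is not a class, for example a parametrized generic such as List[int]. The sketch below shows the general idea of a guarded check; safe_issubclass_sketch is a hypothetical name, and DocArray's own safe_issubclass may be implemented differently.

```python
from typing import List


def safe_issubclass_sketch(candidate, parent):
    """Like issubclass, but return False instead of raising for non-class inputs."""
    return isinstance(candidate, type) and issubclass(candidate, parent)


class_result = safe_issubclass_sketch(bool, int)  # a real class: normal behavior
non_class_result = safe_issubclass_sketch(List[int], list)  # not a class: no error
```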

Fix dynamic class creation with doubly nested schemas (#1747)

The following case used to result in a KeyError:

from docarray import BaseDoc
from docarray.utils.create_dynamic_doc_class import create_base_doc_from_schema

class Nested2(BaseDoc):
    value: str

class Nested1(BaseDoc):
    nested: Nested2

class RootDoc(BaseDoc):
    nested: Nested1

new_my_doc_cls = create_base_doc_from_schema(RootDoc.schema(), 'RootDoc')

We fixed this issue by changing create_base_doc_from_schema such that global definitions of nested schemas are propagated during recursive calls.
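The idea of propagating definitions through recursive calls can be shown on a hand-written, simplified schema. Everything here is an assumption made for illustration: inline_refs is a hypothetical function, the dictionaries only imitate the shape of a JSON schema, and the real create_base_doc_from_schema builds classes rather than dictionaries.

```python
def inline_refs(schema, definitions):
    """Recursively replace '$ref' entries using one shared definitions mapping."""
    if '$ref' in schema:
        name = schema['$ref'].rsplit('/', 1)[-1]
        # Pass the same definitions down so deeper levels can resolve too.
        return inline_refs(definitions[name], definitions)
    properties = schema.get('properties', {})
    if not properties:
        return schema
    resolved = {key: inline_refs(value, definitions) for key, value in properties.items()}
    return {**schema, 'properties': resolved}


definitions = {
    'Nested2': {'properties': {'value': {'type': 'string'}}},
    'Nested1': {'properties': {'nested': {'$ref': '#/definitions/Nested2'}}},
}
root = {'properties': {'nested': {'$ref': '#/definitions/Nested1'}}}
flat = inline_refs(root, definitions)
```

If the recursive call received only the local sub-schema without the shared definitions, the lookup of Nested2 would fail with a KeyError, which mirrors the bug described above.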

Fix readme test (#1746)

📗 Documentation Improvements

  • Update readme (#1744)

🤟 Contributors

We would like to thank all contributors to this release:

  • AlaeddineAbdessalem (@alaeddine-13)
  • Joan Fontanals (@JoanFM)
  • TERBOUCHE Hacene (@TerboucheHacene)
  • samsja (@samsja)

docarray - 💫 Release v0.37.0

Published by github-actions[bot] about 1 year ago

Release Note (0.37.0)

Release time: 2023-08-03 03:11:16

This release contains 6 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.

🆕 Features

Milvus Integration (#1681)

Leverage the power of Milvus in your DocArray project with this latest integration. Here's a simple usage example:

import numpy as np
from docarray import BaseDoc
from docarray.index import MilvusDocumentIndex
from docarray.typing import NdArray
from pydantic import Field


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10] = Field(is_embedding=True)

docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = MilvusDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)

In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Milvus-backed document index and use it to index our documents. Finally, we perform a search query.

Supported Functionalities

  • Find: Vector search for efficient retrieval of similar documents.
  • Filter: Use Milvus syntax to filter based on textual and numeric data.
  • Get/Del: Fetch or delete specific documents from the index.
  • Hybrid Search: Combine find and filter functionalities for more refined search.
  • Subindex: Search through nested data.

Support filtering in HnswDocumentIndex (#1718)

With our latest update, you can easily utilize filtering in HnswDocumentIndex either as an independent function or in conjunction with the query builder to combine it with vector search.

The code below shows how the new feature works:

import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class SimpleSchema(BaseDoc):
    year: int
    price: int
    embedding: NdArray[128]


# Create dummy documents.
docs = DocList[SimpleSchema](
    SimpleSchema(year=2000 - i, price=i, embedding=np.random.rand(128))
    for i in range(10)
)

doc_index = HnswDocumentIndex[SimpleSchema](work_dir="./tmp_5")
doc_index.index(docs)

# Independent filtering operation (year == 1995)
filter_query = {"year": {"$eq": 1995}}
results = doc_index.filter(filter_query)

# Filtering combined with vector search
hybrid_query = (
    doc_index.build_query()  # get empty query object
    .filter(filter_query={"year": {"$gt": 1994}})  # pre-filtering (year > 1994)
    .find(
        query=np.random.rand(128), search_field="embedding"
    )  # add vector similarity search
    .filter(filter_query={"price": {"$lte": 3}})  # post-filtering (price <= 3)
    .build()
)
results = doc_index.execute_query(hybrid_query)

First, we create and index some dummy documents. Then, we use the filter function in two ways. One is by itself to find documents from a specific year. The other is mixed with a vector search, where we first filter by year, perform a vector search, and then filter by price.
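To make the filter semantics concrete without a running index, the sketch below evaluates the same query dictionaries against plain Python dicts. It is an assumption-laden illustration: matches and OPERATORS are hypothetical names, only the three operators used above are covered, and the vector search step between the two filters is left out.

```python
import operator

# Only the operators that appear in the example above; a real backend supports more.
OPERATORS = {'$eq': operator.eq, '$gt': operator.gt, '$lte': operator.le}


def matches(doc, filter_query):
    """Return True if the dict `doc` satisfies every condition in the query."""
    return all(
        OPERATORS[op](doc[field], value)
        for field, conditions in filter_query.items()
        for op, value in conditions.items()
    )


docs = [{'year': 2000 - i, 'price': i} for i in range(10)]
exact_year = [d for d in docs if matches(d, {'year': {'$eq': 1995}})]
pre_filtered = [d for d in docs if matches(d, {'year': {'$gt': 1994}})]
post_filtered = [d for d in pre_filtered if matches(d, {'price': {'$lte': 3}})]
```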

Pre-filtering in InMemoryExactNNIndex (#1713)

You can now add a pre-filter to your queries in InMemoryExactNNIndex. This lets you create flexible queries where you can set up as many pre- and post-filters as you want. Here's a simple example:

query = (
   doc_index.build_query()
   .filter(filter_query={'price': {'$lte': 3}})  # Pre-filter: price <= 3
   .find(query=np.ones(10), search_field='tensor')  # Vector search
   .filter(filter_query={'text': {'$eq': 'hello 1'}})  # Post-filter: text == 'hello 1'
   .build()
)

In this example, we first set a pre-filter to only include items priced 3 or less. We then do a vector search. Lastly, we add a post-filter to find items with the text 'hello 1'. This way, you can easily filter before and after your search!

Support document updates in InMemoryExactNNIndex (#1724)

You can now easily update your documents in InMemoryExactNNIndex. Previously, when you tried to update the same set of documents, it would just add duplicate copies instead of making changes to the existing ones. But not anymore! From now on, if you want to update documents, you just have to re-index them.
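The behavior can be pictured with a toy index keyed by document id. This is a sketch only: TinyIndex is a hypothetical class and does not reflect how InMemoryExactNNIndex stores documents internally.

```python
class TinyIndex:
    """Toy index keyed by document id; not the InMemoryExactNNIndex implementation."""

    def __init__(self):
        self._docs = {}

    def index(self, docs):
        # Writing by id replaces an existing entry instead of appending a copy.
        for doc in docs:
            self._docs[doc['id']] = doc

    def num_docs(self):
        return len(self._docs)


tiny = TinyIndex()
tiny.index([{'id': 'a', 'text': 'old'}, {'id': 'b', 'text': 'other'}])
tiny.index([{'id': 'a', 'text': 'new'}])  # re-index the updated document
```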

Choose tensor format with DocVec deserialization (#1679)

Now you can specify the format of your tensor during DocVec deserialization. You can do this with any method you're using to convert data, such as protobuf, json, pandas, bytes, binary, or base64. This means you'll always get your tensors in the format you want, whether it's a Torch tensor, TensorFlow tensor, NdArray, and so on.

Add description and example to id field of BaseDoc (#1737)

We added a description and example to the id field of BaseDoc, so that you get a richer OpenAPI specification when building FastAPI based applications with it.

🚀 Performance

Improve HnswDocumentIndex performance (#1727, #1729)

We've implemented two key optimizations to enhance the performance of HnswDocumentIndex. First, we no longer serialize embeddings to SQLite, which is a costly operation and unnecessary because the embeddings can be reconstructed from the hnswlib index itself. Second, we've minimized the frequency of computing num_docs(), which previously involved a time-consuming full table scan to determine the number of documents in SQLite. As a result, we've seen an approximate speed increase of 10% in both indexing and searching.

🐞 Bug Fixes

Fix TorchTensor type comparison (#1739)

We have addressed an exception raised when trying to compare TorchTensor with the type keyword in the docarray.typing module. Previously, this would lead to a TypeError, but the error has now been resolved, ensuring proper type comparison.

Add more info from dynamic class (#1733)

When using the method create_base_doc_from_schema to dynamically create a BaseDoc class, some information was lost, so we made sure that the new class keeps FieldInfo information from the original class such as description and examples.

Fix call to unsafe issubclass (#1731)

We fixed a bug in a call to issubclass by switching to a safer implementation that does not fail on some types.

Align collection and index name in QdrantDocumentIndex (#1723)

We've corrected an issue where the collection name was not being updated to match a newly-initialized subindex name in QdrantDocumentIndex. This ensures consistent naming between collections and their respective subindexes.

Fix deepcopy TorchTensor (#1720)

We fixed a bug that prevented deep-copying documents with TorchTensor fields.

📗 Documentation Improvements

  • Make Document Indices self-contained (#1678)

🤟 Contributors

We would like to thank all contributors to this release:

  • Joan Fontanals (@JoanFM)
  • Johannes Messner (@JohannesMessner)
  • Saba Sturua (@jupyterjazz)

docarray - 💫 Release v0.36.0

Published by github-actions[bot] over 1 year ago

Release Note (0.36.0)

Release time: 2023-07-18 14:43:28

This release contains 2 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.

🆕 Features

JAX Integration (#1646)

You can now use JAX with DocArray. We have introduced JaxArray as a new type option for your documents. JaxArray ensures that JAX can natively process any array-like data in your DocArray documents. Here's how to use it:

from docarray import BaseDoc
from docarray.typing import JaxArray
import jax.numpy as jnp


class MyDoc(BaseDoc):
    arr: JaxArray
    image_arr: JaxArray[3, 224, 224] # For images of shape (3, 224, 224)
    square_crop: JaxArray[3, 'x', 'x'] # For any square image, regardless of dimensions
    random_image: JaxArray[3, ...]  # For any image with 3 color channels, and arbitrary other dimensions

As you can see, the JaxArray typing is extremely flexible and can support a wide range of tensor shapes.

Creating a Document with Tensors

Creating a document with tensors is straightforward. Here is an example:

doc = MyDoc(
    arr=jnp.zeros((128,)),
    image_arr=jnp.zeros((3, 224, 224)),
    square_crop=jnp.zeros((3, 64, 64)),
    random_image=jnp.zeros((3, 128, 256)),
)

Redis Integration (#1550)

Leverage the power of Redis in your DocArray project with this latest integration. Here's a simple usage example:

import numpy as np
from docarray import BaseDoc
from docarray.index import RedisDocumentIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10]

docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = RedisDocumentIndex[MyDoc](host='localhost')
db.index(docs)
results = db.find(query, search_field='embedding', limit=10)

In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Redis-backed document index and use it to index our documents. Finally, we perform a search query.

Supported Functionalities

  • Find: Vector search for efficient retrieval of similar documents.
  • Filter: Use Redis syntax to filter based on textual and numeric data.
  • Text Search: Leverage text search methods, such as BM25, to find relevant documents.
  • Get/Del: Fetch or delete specific documents from the index.
  • Hybrid Search: Combine find and filter functionalities for more refined search. Currently, only these two can be combined.
  • Subindex: Search through nested data.

🚀 Performance

Speedup HnswDocumentIndex by caching num docs (#1706)

We've optimized the num_docs() operation by caching the document count, addressing previous slowdowns during searches. This change results in a minor increase in indexing time, but significantly accelerates search times.

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')

index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start

query = docs[0]

find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start

In the above experiment, we observed a 13x improvement in the speed of the search function, reducing its execution time from 0.0238 to 0.0018 seconds.
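The caching idea itself can be sketched with the standard library's sqlite3 module. CountingStore is a hypothetical class written for this note; it is not the HnswDocumentIndex code, and the table layout is an assumption.

```python
import sqlite3


class CountingStore:
    """Toy SQLite-backed store that tracks its row count in memory."""

    def __init__(self):
        self._conn = sqlite3.connect(':memory:')
        self._conn.execute('CREATE TABLE docs (doc_id TEXT PRIMARY KEY, data TEXT)')
        self._num_docs = 0  # cached count, updated on every insert

    def add(self, doc_id, data):
        self._conn.execute('INSERT INTO docs VALUES (?, ?)', (doc_id, data))
        self._num_docs += 1

    def num_docs(self):
        return self._num_docs  # no query needed

    def num_docs_by_scan(self):
        # The slower alternative: ask the database to count the rows.
        return self._conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]


store = CountingStore()
for i in range(100):
    store.add(f'doc-{i}', 'hey')
```

The trade-off matches the note above: each insert does slightly more bookkeeping, while every read of the count avoids a query.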

⚙ Refactoring

Put Contains method in the base class (#1701)

We've moved the contains method into the base class. With this refactoring, the responsibility for checking if a document exists is now delegated to individual backend implementations using the new _doc_exists method.
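The delegation pattern described here could look roughly like the following. BaseIndexSketch and DictBackend are hypothetical classes, and documents are plain dicts for brevity; only the _doc_exists method name is taken from the note above.

```python
from abc import ABC, abstractmethod


class BaseIndexSketch(ABC):
    """Base class owns the `in` operator; backends only answer whether an id exists."""

    def __contains__(self, doc):
        return self._doc_exists(doc['id'])

    @abstractmethod
    def _doc_exists(self, doc_id):
        ...


class DictBackend(BaseIndexSketch):
    """Minimal backend that keeps documents in a dict keyed by id."""

    def __init__(self, docs):
        self._docs = {doc['id']: doc for doc in docs}

    def _doc_exists(self, doc_id):
        return doc_id in self._docs


backend = DictBackend([{'id': 'a'}])
found = {'id': 'a'} in backend
missing = {'id': 'z'} in backend
```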

More robust method to detect duplicate index (#1651)

We have implemented a more robust method of detecting existing indices for WeaviateDocumentIndex.

🐞 Bug Fixes

WeaviateDocumentIndex handles lowercase index names (#1711)

We've addressed an issue in the WeaviateDocumentIndex where passing a lowercase index name led to mismatches and subsequent errors. This was due to the system automatically capitalizing the index name when creating an index.

QdrantDocumentIndex unable to see index_name (#1705)

We've resolved an issue where the QdrantDocumentIndex was not properly recognizing the index_name parameter. Previously, the specified index_name was ignored and the system defaulted to the schema name.

Fix search in InMemoryExactNNIndex with AnyEmbedding (#1696)

From now on, you can perform search operations in InMemoryExactNNIndex using AnyEmbedding.

Use safe_issubclass everywhere (#1691)

We now use safe_issubclass instead of issubclass because it supports non-class inputs, helping us to avoid unexpected errors.

Avoid converting DocLists in the base index (#1685)

We added an additional check to avoid passing DocLists to a function that converts a list of dictionaries to a DocList.

📗 Documentation Improvements

  • Add docs for dict() method (#1643)

🤟 Contributors

We would like to thank all contributors to this release:

  • Puneeth K (@punndcoder28)
  • Joan Fontanals (@JoanFM)
  • Saba Sturua (@jupyterjazz)
  • Aman Agarwal (@agaraman0)
  • samsja (@samsja)
  • Shukri (@hsm207)

docarray - 💫 Release v0.35.0

Published by github-actions[bot] over 1 year ago

Release Note (0.35.0)

This release contains 3 new features, 2 bug fixes and 1 documentation improvement.

🆕 Features

More serialization options for DocVec (#1562)

DocVec now has the same serialization interface as DocList. This means that the following methods are available for it:

  • to_protobuf()/from_protobuf()
  • to_base64()/from_base64()
  • save_binary()/load_binary()
  • to_bytes()/from_bytes()
  • to_dataframe()/from_dataframe()

For example, you can now perform Base64 (de)serialization like this:

from docarray import BaseDoc, DocVec

class SimpleDoc(BaseDoc):
    text: str

dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])
base64_repr_dv = dv.to_base64(compress=None, protocol='pickle')

dl_from_base64 = DocVec[SimpleDoc].from_base64(
    base64_repr_dv, compress=None, protocol='pickle'
)

For further guidance, check out the documentation section on serialization.

Validate file formats in URL (#1606) (#1669)

Validate the file formats given in URL types such as AudioURL, TextURL, ImageURL to check they correspond to the expected mime type.
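As a rough picture of this kind of check, the standard library's mimetypes module can guess a MIME type from a URL's file extension. The function has_expected_type is hypothetical, the URLs are placeholders, and DocArray's URL types may validate differently.

```python
import mimetypes


def has_expected_type(url, expected_prefix):
    """Guess the MIME type from the URL's extension and compare its major type."""
    mime_type, _ = mimetypes.guess_type(url)
    return mime_type is not None and mime_type.split('/')[0] == expected_prefix


image_ok = has_expected_type('https://example.com/cat.png', 'image')
audio_mismatch = has_expected_type('https://example.com/cat.png', 'audio')
unknown_extension = has_expected_type('https://example.com/file.unknownext', 'image')
```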

Add methods to create BaseDoc from schema (#1667)

Sometimes it can be useful to dynamically create a BaseDoc from the schema of an original BaseDoc. With the methods create_pure_python_type_model and create_base_doc_from_schema, you can reconstruct the BaseDoc.

from docarray.utils.create_dynamic_doc_class import (
    create_base_doc_from_schema,
    create_pure_python_type_model,
)

from typing import Optional
from docarray import BaseDoc, DocList
from docarray.typing import AnyTensor
from docarray.documents import TextDoc

class MyDoc(BaseDoc):
    tensor: Optional[AnyTensor]
    texts: DocList[TextDoc]

MyDocPurePython = create_pure_python_type_model(MyDoc) # Due to limitation of DocList as Pydantic List, we need to have the MyDoc `DocList` converted to `List`.
NewMyDoc = create_base_doc_from_schema(
    MyDocPurePython.schema(), 'MyDoc', {}
)

new_doc = NewMyDoc(tensor=None, texts=[TextDoc(text='text')])

🐞 Bug Fixes

Cap Pydantic version (#1682)

Due to the breaking change in Pydantic v2, we have capped the version to avoid problems when installing docarray.

Better error message when DocVec is unusable (#1675)

After calling doc_list = doc_vec.to_doc_list(), doc_vec ends up in an unusable state since its data has been transferred to doc_list. This fix gives users a more informative error message when they try to interact with doc_vec after it has been made unusable.
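The pattern of handing data over and then failing with a clear message can be sketched as follows. ColumnStore is a hypothetical class and the error text is our own wording, not the message DocArray raises.

```python
class ColumnStore:
    """Toy container that hands its data over and then refuses further use."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._usable = True

    def to_list(self):
        rows, self._rows = self._rows, None  # transfer ownership of the data
        self._usable = False
        return rows

    def __len__(self):
        if not self._usable:
            raise RuntimeError(
                'This ColumnStore was converted with to_list() and can no longer '
                'be used; work with the returned list instead.'
            )
        return len(self._rows)


column_store = ColumnStore(['a', 'b'])
rows = column_store.to_list()
try:
    len(column_store)
    message = ''
except RuntimeError as err:
    message = str(err)
```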

📗 Documentation Improvements

  • Fix a reference in README (#1674)

🤟 Contributors

We would like to thank all contributors to this release:

  • Saba Sturua (@jupyterjazz)
  • Joan Fontanals (@JoanFM)
  • Han Xiao (@hanxiao)
  • Johannes Messner (@JohannesMessner)

docarray - 💫 Patch v0.21.1

Published by github-actions[bot] over 1 year ago

Release Note (0.21.1)

Release time: 2023-06-21 08:15:43

This release contains 1 bug fix.

🐞 Bug Fixes

Allow passing extra headers to WeaviateDocumentArray (#1673)

These extra headers let you pass authentication keys to connect to a secured Weaviate instance.

WeaviateDocumentArray supports

🤟 Contributors

We would like to thank all contributors to this release:

  • Girish Chandrashekar (@girishc13)

docarray - 💫 Release v0.34.0

Published by github-actions[bot] over 1 year ago

Release Note (0.34.0)

Release time: 2023-06-21 08:15:43

This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.

💣 Breaking Changes

Terminate Python 3.7 support

⚠️ ⚠️ DocArray will now require Python 3.8. We can no longer assure compatibility with Python 3.7.

We decided to drop it for two reasons:

  • Several dependencies of DocArray require Python 3.8.
  • Python long-term support for 3.7 is ending this week. This means there will no longer
    be security updates for Python 3.7, making this a good time for us to change our requirements.

Changes to DocVec Protobuf definition (#1639)

In order to fix a bug in the DocVec protobuf serialization described in #1561,
we have changed the DocVec .proto definition.

This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray
v0.34.0 or later, and vice versa.

⚠️ ⚠️ We strongly recommend that everyone using Protobuf with DocVec upgrade to DocArray v0.34.0 or
later.

🆕 Features

Allow users to check if a Document is already indexed in a DocIndex (#1633)

You can now check if a Document has already been indexed by using the in keyword:

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
        [MyDoc(text="Example text", embedding=np.random.rand(128))
         for _ in range(2000)])

index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index

Support subindexes in InMemoryExactNNIndex (#1617)

You can now use the find_subindex method with InMemoryExactNNIndex.

import numpy as np
from pydantic import Field

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl, VideoUrl, AnyTensor

class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)


class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)

doc_index = InMemoryExactNNIndex[MyDoc]()
...

# find by the `ImageDoc` tensor when index is populated
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)

Flexible tensor types for protobuf deserialization (#1645)

You can deserialize any DocVec protobuf message to any tensor type,
by passing the tensor_type parameter to from_protobuf.

This means that you can choose at deserialization time if you are working with numpy, PyTorch, or TensorFlow tensors.

from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor

class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here

proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)

assert isinstance(da_after.tensor, TensorFlowTensor)

⚙ Refactoring

Add DBConfig to InMemoryExactNNIndex

InMemoryExactNNIndex used to take a single constructor parameter, index_file_path, unlike the other
indexes, which accept their own DBConfig. Now index_file_path is part of the DBConfig, so the index can be
initialized from it.
This will allow us to extend this config if more parameters are needed.

The parameters of DBConfig can be passed at construction time as **kwargs, making this change compatible with old
usage.

These two initializations are equivalent.

from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')

index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')
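One way such an equivalence can be achieved is to fold keyword arguments into a config object inside the constructor. The sketch below uses hypothetical stand-ins (DBConfigSketch, IndexSketch) and is not the InMemoryExactNNIndex source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DBConfigSketch:
    """Stand-in config with the single parameter mentioned above."""

    index_file_path: Optional[str] = None


class IndexSketch:
    def __init__(self, db_config=None, **kwargs):
        # Keyword arguments are folded into a config, so both call styles agree.
        self.db_config = db_config if db_config is not None else DBConfigSketch(**kwargs)


from_config = IndexSketch(db_config=DBConfigSketch(index_file_path='index.bin'))
from_kwargs = IndexSketch(index_file_path='index.bin')
```

New parameters then only need to be added to the config class, and both call styles pick them up.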

🐞 Bug Fixes

Allow protobuf deserialization of BaseDoc with Union type (#1655)

Serialization of BaseDoc types that have Union fields of Python native types is now supported.

from typing import Union

from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    union_field: Union[int, str]

docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2

When these Union types involve other BaseDoc types, an exception is thrown.

from docarray.documents import ImageDoc, TextDoc

class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])

# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())

Cast limit to integer when passed to HnswDocumentIndex (#1657, #1656)

If you call find or find_batched on an HnswDocumentIndex, the limit parameter will automatically be cast to
integer.

Moved default_column_config from RuntimeConfig to DBConfig (#1648)

default_column_config contains specific configuration information about the columns and tables inside the backend's
database. This was previously put inside RuntimeConfig which caused an error because this information is required at
initialization time. This information has been moved inside DBConfig so you can edit it there.

from docarray.index import HnswDocumentIndex
import numpy as np

db_config = HnswDocumentIndex.DBConfig()
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HnswDocumentIndex[MyDoc](db_config=db_config)

Fix issue with Protobuf (de)serialization for DocVec (#1639)

This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the
data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.

Fix order of returned matches when find and filter combination used in InMemoryExactNNIndex (#1642)

Hybrid search (find+filter) for InMemoryExactNNIndex was prioritizing low similarities (lower scores) for returned
matches. Fixed by adding an option to sort matches in a reverse order based on their scores.

# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')

query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)

results = db.execute_query(query)
# Before: results was sorted from worst to best matches
# Now: It's sorted in the correct order, showing better matches first

Working with external Qdrant collections (#1632)

Using QdrantDocumentIndex to connect to a Qdrant DB initialized outside of DocArray raised a KeyError.
This has been fixed, and now you can use QdrantDocumentIndex to connect to externally initialized collections.

Other bug fixes

  • Update text search to match Weaviate client's new sig (#1654)
  • Fix DocVec equality (#1641, #1663)
  • Fix exception when summary() called for LegacyDocument. (#1637)
  • Fix DocList and DocVec coercion. (#1568)
  • Fix update() on BaseDoc with tensors fields (#1628)

📗 Documentation Improvements

  • Enhance DocVec section (#1658)
  • Qdrant in memory usage (#1634)

🤟 Contributors

We would like to thank all contributors to this release:

  • Johannes Messner (@JohannesMessner)
  • Nikolas Pitsillos (@npitsillos)
  • Shukri (@hsm207)
  • Kacper Łukawski (@kacperlukawski)
  • Aman Agarwal (@agaraman0)
  • maxwelljin (@maxwelljin)
  • samsja (@samsja)
  • Saba Sturua (@jupyterjazz)
  • Joan Fontanals (@JoanFM)
docarray - 💫 Release v0.33.0

Published by github-actions[bot] over 1 year ago

Release Note (0.33.0)

Release time: 2023-06-06 14:05:56

This release contains 1 new feature, 1 performance improvement, 9 bug fixes and 4 documentation improvements.

🆕 Features

Allow coercion between different Tensor types (#1552) (#1588)

Allow coercing to a TorchTensor from an NdArray or TensorFlowTensor and the other way around.

from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np


class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor


doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()
📄 MyTensorsDoc : 0a10f88 ...
╭─────────────────────┬────────────────────────────────────────────────────────╮
│ Attribute           │ Value                                                  │
├─────────────────────┼────────────────────────────────────────────────────────┤
│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64      │
╰─────────────────────┴────────────────────────────────────────────────────────╯

🚀 Performance

Avoid stack embedding for every search (#1586)

We have made a performance improvement for the find interface for InMemoryExactNNIndex that gives a ~2x speedup.

The script used to measure this is as follows:

from torch import rand
from time import perf_counter

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor


class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor


def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )


num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)

index = InMemoryExactNNIndex[MyDocument](data_list)

start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')

print(f"Number of queries: {num_queries} \n"
      f"Number of indexed documents: {num_docs} \n"
      f"Total time: {(perf_counter() - start)/5} seconds")
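
One way to read the title of this change is that the embedding column is gathered once and reused across searches instead of being rebuilt on every call. The sketch below illustrates that caching pattern in plain Python. `CachedColumnIndex` and all of its members are hypothetical and are not DocArray's implementation.

```python
from typing import Dict, List


class CachedColumnIndex:
    """Toy index that gathers an embedding column once and reuses it."""

    def __init__(self, docs: List[Dict[str, List[float]]]) -> None:
        self._docs = list(docs)
        self._columns: Dict[str, List[List[float]]] = {}
        self.stack_calls = 0  # counts how often a column is rebuilt

    def _column(self, field: str) -> List[List[float]]:
        # Build the column on first use only; later searches hit the cache.
        if field not in self._columns:
            self.stack_calls += 1
            self._columns[field] = [doc[field] for doc in self._docs]
        return self._columns[field]

    def index(self, docs: List[Dict[str, List[float]]]) -> None:
        self._docs.extend(docs)
        self._columns.clear()  # new data invalidates the cached columns

    def find(self, query: List[float], field: str, limit: int = 1) -> List[int]:
        column = self._column(field)
        # Dot product as a stand-in similarity measure
        scores = [sum(q * x for q, x in zip(query, row)) for row in column]
        order = sorted(range(len(column)), key=lambda i: scores[i], reverse=True)
        return order[:limit]
```

Repeated calls to `find` on unchanged data reuse the cached column, so the per-query cost no longer includes rebuilding it.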

🐞 Bug Fixes

Respect limit parameter in filter for index backends (#1618)

InMemoryExactNNIndex and HnswDocumentIndex now respect the limit parameter in the filter API.

HnswDocumentIndex can search with limit greater than number of documents (#1611)

HnswDocumentIndex now allows calling find with a limit parameter larger than the number of indexed documents.
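
A common way to support this is to clamp the requested number of matches to the number of stored items before querying the underlying library. The helper below is a hypothetical sketch of that idea and is not taken from DocArray's code.

```python
def effective_k(limit: int, num_indexed: int) -> int:
    """Clamp the requested number of matches to what the index can return."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return min(limit, num_indexed)


# Asking for 10 matches from an index holding 3 documents yields at most 3.
k = effective_k(limit=10, num_indexed=3)
```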

Allow updating HnswDocumentIndex (#1604)

HnswDocumentIndex now allows reindexing documents with the same id, updating the original documents.

Dynamically resize internal index to adapt to increasing number of documents (#1602)

HnswDocumentIndex now allows indexing more than max_elements documents, dynamically resizing the index as it grows.

Fix simple usage of HnswDocumentIndex (#1596)

from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')

Previously, this basic usage threw an exception:

TypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc

Now, it works as expected.

Fix InMemoryExactNNIndex index initialization with nested DocList (#1582)

Instantiating an InMemoryExactNNIndex with a Document schema that had a nested DocList previously threw this error:

from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import HnswDocumentIndex

class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]

index = HnswDocumentIndex[MyDoc]()
TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'

Now it can be successfully instantiated.

Fix summary of document with list (#1595)

Calling summary on a document with a List attribute previously showed the wrong type:

from docarray import BaseDoc, DocList
from typing import List
class TestDoc(BaseDoc):
    str_list: List[str]

dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()

Previous output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭─── Document Schema ───╮
│                       │
│   TestDoc             │
│   └── str_list: str   │
│                       │
╰───────────────────────╯

New output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭────── Document Schema ──────╮
│                             │
│   TestDoc                   │
│   └── str_list: List[str]   │
│                             │
╰─────────────────────────────╯

Solve issues caused by issubclass (#1594)

DocArray relies heavily on Python's issubclass function, which caused multiple issues. We now use a safe version that accounts for edge cases and non-class types.
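
The failure mode is easy to reproduce: the built-in `issubclass` raises a `TypeError` when its first argument is a parametrized generic such as `List[str]`. The following is a minimal sketch of a guarded check, assuming that returning `False` for non-class inputs is acceptable. The function name and body are illustrative and are not DocArray's actual implementation.

```python
from typing import Any, List, Tuple, get_origin


def safe_issubclass(candidate: Any, parent: type) -> bool:
    """Return False instead of raising when `candidate` is not a plain class."""
    # Parametrized generics such as List[str] have an origin and are not classes.
    if get_origin(candidate) is not None:
        return False
    # Anything else that is not a class (TypeVar, string forward reference, ...).
    if not isinstance(candidate, type):
        return False
    return issubclass(candidate, parent)


plain_result = safe_issubclass(bool, int)
generic_result = safe_issubclass(List[str], list)
```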

Make example payload a string rather than bytes (#1587)

The example payload of a document schema with a Tensor attribute was previously of type bytes. It is now of type str.

from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')

📗 Documentation Improvements

  • Add forward declaration steps to example to avoid pickling error (#1615)
  • Fix n_dim to dim (#1610)
  • Add "in memory" to documentation as list of supported vector indexes (#1607)
  • Add a tensor section (#1576)

🤟 Contributors

We would like to thank all contributors to this release:

  • Mohammad Kalim Akram (@makram93)
  • samsja (@samsja)
  • Saba Sturua (@jupyterjazz)
  • Joan Fontanals (@JoanFM)
  • maxwelljin (@maxwelljin)
docarray - 💫 Patch v0.32.1

Published by github-actions[bot] over 1 year ago

Release Note (0.32.1)

Release time: 2023-05-26 14:50:34

This release contains 4 bug fixes, 1 refactoring and 2 documentation improvements.

⚙ Refactoring

Improve ElasticDocIndex logging (#1551)

More debugging logs have been added inside ElasticDocIndex.

🐞 Bug Fixes

Allow InMemoryExactNNIndex with Optional embedding tensors (#1575)

You can now index Documents where the tensor search_field is Optional. Documents whose embedding is None are not considered when running a search.

import torch
from typing import Optional

from docarray import BaseDoc, DocList
from docarray.typing import TorchTensor
from docarray.index import InMemoryExactNNIndex


class EmbeddingDoc(BaseDoc):
    embedding: Optional[TorchTensor[768]]

# every second document has no embedding
docs = DocList[EmbeddingDoc](
    [EmbeddingDoc(embedding=(torch.rand(768) if i % 2 else None)) for i in range(5)]
)
index = InMemoryExactNNIndex[EmbeddingDoc](docs)
index.find(torch.rand((768,)), search_field="embedding", limit=3)
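
The idea of leaving out documents without an embedding can be shown without any dependencies. The sketch below is a plain-Python exact search with cosine similarity. `find_skipping_none` and `cosine` are hypothetical helpers written for illustration and do not reflect how InMemoryExactNNIndex is implemented.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_skipping_none(
    query: List[float], embeddings: List[Optional[List[float]]], limit: int
) -> List[Tuple[int, float]]:
    """Exact search over the documents that actually have an embedding."""
    scored = [
        (position, cosine(query, emb))
        for position, emb in enumerate(embeddings)
        if emb is not None  # documents without an embedding never match
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


stored = [None, [1.0, 0.0], None, [0.0, 1.0], [1.0, 1.0]]
hits = find_skipping_none([1.0, 0.0], stored, limit=3)
```

The positions returned refer to the original list, so results can still be mapped back to their documents even though some entries were skipped.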

Safe is_subclass check (#1569)

In DocArray, especially when dealing with indexers, checking field types leads to calls to Python's issubclass function.
This call fails under some circumstances, for instance when the type being checked is a List or Tuple. Starting with this release, we use a safe version that does not fail in these cases.

This enables the following usage, which would otherwise fail:

from typing import List

from docarray import BaseDoc
from docarray.index import HnswDocumentIndex

class MyDoc(BaseDoc):
    test: List[str]

index = HnswDocumentIndex[MyDoc]()

Fix AnyDoc deserialization (#1571)

AnyDoc is a schema-less special Document that adapts to the schema of the data it tries to load. However, in cases where the data contained Dictionaries or Lists, deserialization failed. This is now fixed, so the following works:

from docarray.base_doc import AnyDoc, BaseDoc
from typing import Dict

class ConcreteDoc(BaseDoc):
    text: str
    tags: Dict[str, int]

doc = ConcreteDoc(text='text', tags={'type': 1})

any_doc = AnyDoc.from_protobuf(doc.to_protobuf())
assert any_doc.text == 'text'
assert any_doc.tags == {'type': 1}

dict method for Document view (#1559)

Prior to this fix, doc.dict() would return an empty Dictionary if doc.is_view() == True:

from docarray import BaseDoc, DocVec


class MyDoc(BaseDoc):
    foo: int

vec = DocVec[MyDoc]([MyDoc(foo=3)])
# before
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {}

# after
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {'id': 'f285db406a949a7e7ab084032800f7d8', 'foo': 3}

📗 Documentation Improvements

  • Update doc building guide (#1566)
  • Explain the state of DocList in FastAPI (#1546)

🤟 Contributors

We would like to thank all contributors to this release:

  • aman-exp-infy (@agaraman0)
  • Johannes Messner (@JohannesMessner)
  • Joan Fontanals (@JoanFM)
  • Saba Sturua (@jupyterjazz)
  • Ge Jin (@maxwelljin)
docarray - 💫 Release v0.32.0

Published by github-actions[bot] over 1 year ago

Release Note (v0.32.0)

This release contains 4 new features, 0 performance improvements, 5 bug fixes and 4 documentation improvements.

🆕 Features

Subindex for document index (#1428)

The subindex feature allows you to index documents that contain another DocList by automatically creating a separate collection/index for each such DocList:

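import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray


# create nested document schema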
class SimpleDoc(BaseDoc):
    tensor: NdArray[10]
    text: str


class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]


# create some docs
my_docs = [
    MyDoc(
        docs=DocList[SimpleDoc](
            [
                SimpleDoc(
                    tensor=np.ones(10) * (j + 1),
                    text=f"hello {j}",
                )
                for j in range(10)
            ]
        ),
    )
]

# index them into Elasticsearch
index = ElasticDocIndex[MyDoc](index_name="idx")
index.index(my_docs)  # creates two indices, named 'idx' and 'idx__docs'

# search on the nested level (subindex)
query = np.random.rand(10)
matches_root, matches_nested, scores = index.find_subindex(
    query, search_field="docs__tensor", limit=5
)

Openapi and FastAPI tensor shapes (#1510)

We have enabled shaped tensors to be properly represented in OpenAPI/SwaggerUI, both in examples and the schema.

This means that you can now build web APIs using FastAPI where the SwaggerUI properly communicates tensor shapes to your users:

class Doc(BaseDoc):
    embedding_torch: TorchTensor[3, 4]


app = FastAPI()


@app.post("/foo", response_model=Doc, response_class=DocArrayResponse)
async def foo(doc: Doc) -> Doc:
    return Doc(embedding_torch=doc.embedding_torch)

Generated Swagger UI:

[images: screenshots of the generated Swagger UI]

Save and load inmemory index (#1534)

We added a persist method to the InMemoryExactNNIndex class to save the index to disk.

# Save your existing index as a binary file
doc_index.persist('docs.bin')
# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')

🐞 Bug Fixes

search_field should be optional in hybrid text search (#1516)

The search_field argument of text_search() is now Optional and has a sane default.

Check if file path exists for in-memory index (#1537)

We have added an internal check to see if index_file_path exists when passed to InMemoryExactNNIndex.

Add empty judgement to index search (#1533)

We have ensured that empty indices do not fail when find is called.

Detach torch tensors (#1526)

Serializing tensors with gradients no longer fails.

Docvec display (#1522)

DocVec display issues have been resolved.

📗 Documentation Improvements

  • Remove erroneous info (#1531)
  • Fix link to documentation in readme (#1525)
  • Flatten structure (#1520)
  • Fix links (#1518)

🤟 Contributors

We would like to thank all contributors to this release:

  • Mohammad Kalim Akram (@makram93)
  • Johannes Messner (@JohannesMessner)
  • Anne Yang (@AnneYang720)
  • Zhaofeng Miao (@mapleeit)
  • Joan Fontanals (@JoanFM)
  • Kacper Łukawski (@kacperlukawski)
  • IyadhKhalfallah (@IyadhKhalfallah)
  • Saba Sturua (@jupyterjazz)
docarray - 💫 Patch v0.31.1

Published by github-actions[bot] over 1 year ago

Release Note (0.31.1)

This patch release fixes a small bug that was introduced in the latest minor release (0.31.0).

🐞 Bug Fixes

  • Calling json or dict on an Optional nested DocList no longer throws an error if the value is set to None (#1512)

🤟 Contributors

We would like to thank all contributors to this release:

  • samsja (@samsja)
docarray - 💫 Release v0.31.0

Published by github-actions[bot] over 1 year ago

Release Note (v0.31.0)

This release contains 4 new features, 11 bug fixes, and several documentation improvements.

💥 Breaking changes

Return type of DocVec Optional Tensor (#1472)

Optional tensor fields in a DocVec now return None instead of a list of NaN if the column does not hold any tensor.

This code snippet shows the breaking change:

from typing import Optional

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]

docs = DocVec[MyDoc]([MyDoc() for j in range(2)])

print(docs.tensor)

Version   Return type
0.30.0    [nan nan]
0.31.0    None

Default index collection names

Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.

In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name or collection_name.

Starting with DocArray v0.31.0, the default index_name/collection_name will be derived from the document schema name:

from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc

class MyDoc(BaseDoc):
    pass

# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()

If you create and persist a Document Index with v0.30.0, and try to access it using v0.31.0 without manually specifying an index name, an Exception will occur.

You can fix this by manually specifying the index name to match the old default:

# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')

The below table summarizes the change for all DB backends:

Backend                 DBConfig argument   Default in v0.30.0        Default in v0.31.0
WeaviateDocumentIndex   index_name          'Document'                Schema class name
QdrantDocumentIndex     collection_name     'documents'               Schema class name
ElasticDocIndex         index_name          'index__' + a random id   Schema class name
ElasticV7DocIndex       index_name          'index__' + a random id   Schema class name
HnswDocumentIndex       n/a                 n/a                       n/a
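
The naming rule described in this section can be sketched in a few lines: an explicitly configured name wins, otherwise the schema class name is used. `resolve_index_name` is a hypothetical helper and `MyDoc` is a plain stand-in class. A real backend may further normalize the name, which is not covered here.

```python
from typing import Optional


class MyDoc:
    """Stand-in for a document schema class."""


def resolve_index_name(schema: type, index_name: Optional[str] = None) -> str:
    """Use the configured name if given, else fall back to the schema class name."""
    return index_name if index_name is not None else schema.__name__


new_default = resolve_index_name(MyDoc)
# Pinning the old default keeps an index created with v0.30.0 reachable.
pinned = resolve_index_name(MyDoc, index_name="Document")
```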

🆕 Features

Add InMemoryExactNNIndex (#1441)

In this version we have introduced the InMemoryExactNNIndex Document Index which allows you to perform in-memory exact vector search (as opposed to approximate nearest neighbor search in vector databases).

The InMemoryExactNNIndex can be used for prototyping and is suitable for small datasets (1k-10k documents), as opposed to a vector database, which suits larger scales but comes with a performance overhead at smaller scales.

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray

import numpy as np

class MyDoc(BaseDoc):
    tensor: NdArray[512]

docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))

doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

print(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))
FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

DocList inherits from Python list (#1457)

DocList is now a subclass of Python's list. This means that you can now use all the methods that are available to Python lists on DocList objects. For example, you can now use len on DocList objects and tools like Pydantic or FastAPI will be able to work with it more easily.

Add len to DocIndex (#1454)

You can now perform len(vector_index) which is equivalent to vector_index.num_docs().
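
The equivalence amounts to `__len__` delegating to `num_docs()`. Below is a toy illustration with a hypothetical `CountingIndex` class. It is not DocArray code.

```python
from typing import Any, Iterable, List


class CountingIndex:
    """Toy index whose len() delegates to num_docs()."""

    def __init__(self) -> None:
        self._docs: List[Any] = []

    def index(self, docs: Iterable[Any]) -> None:
        self._docs.extend(docs)

    def num_docs(self) -> int:
        return len(self._docs)

    def __len__(self) -> int:
        # Keep a single source of truth for the document count.
        return self.num_docs()


vector_index = CountingIndex()
vector_index.index(["doc-a", "doc-b"])
```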

Other minor features

  • Add a to_json alias to BaseDoc (#1494)

🐞 Bug Fixes

Point to older versions when importing Document or DocumentArray (#1422)

Trying to load Document or DocumentArray from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.

Fix AnyDoc.from_protobuf (#1437)

AnyDoc can now read any BaseDoc protobuf file. The same applies to DocList.

Other bug fixes

  • Fix extend to DocList (#1493)
  • Fix bug when calling dict() on BaseDoc (#1481)
  • Fix bug when calling json() on BaseDoc (#1481)
  • Support Pandas 2.0 by using pd.concat() instead of df.append() in to_dataframe() to avoid warning (#1478)
  • Add logs to Elasticsearch index (#1427)
  • Fix a bug in Document Index where Torch tensors that required grad were not able to be converted to ndarray (#1429)
  • Fix a bug with HNSW (#1426)
  • Hubble Binary format version bump (#1414)
  • Save index during creation for hnswlib (#1424)

📗 Documentation Improvements

  • Fix FastAPI docs (#1453)
  • Index predefined Documents (#1434)
  • Clean up data types section (#1412)
  • Remove duplicate API reference section (#1408)
  • Docindex URLs (#1433)
  • Fix Install commands hint (#1421)
  • Add Google Analytics (#1432)
  • Add install instructions for hnswlib and elastic document indexes (#1431)
  • Various fixes (#1436, #1417, #1423, #1418, #1411, #1419)

🤟 Contributors

We would like to thank all contributors to this release:

  • Alex Cureton-Griffiths (@alexcg1)
  • samsja (@samsja)
  • Johannes Messner (@JohannesMessner)
  • Anne Yang (@AnneYang720)
  • Scott Martens (@scott-martens)
  • カレン (@RStar2022)
  • Aman Agarwal (@agaraman0)
  • Yanlong Wang (@nomagick)
  • Charlotte Gerhaher (@anna-charlotte)
docarray - 💫 Release v0.30.0

Published by JoanFM over 1 year ago

💫 Release v0.30.0 (a.k.a. DocArray v2)

Warning
This version of DocArray is a complete rewrite and therefore includes many breaking changes. Be sure to check the documentation to prepare your migration.

Changelog

If you are using a DocArray version below 0.30.0, you will be familiar with its dataclass API.

DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.

This gives the following advantages:

  • Flexibility: No need to conform to a fixed set of fields -- your data defines the schema.
  • Multimodality: Easily store multiple modalities and multiple embeddings in the same Document.
  • Language agnostic: At their core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.

You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:

  • Hybrid search: You can now combine vector search with text search, and even filter by arbitrary fields.
  • Production-ready: The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.
  • Increased flexibility: We strive to support any configuration or setting that you could perform through the DB's first-party client.

For now, Document Indexes support Weaviate, Qdrant, Elasticsearch, and HNSWLib, with more to come.

Changes to Document

  • Document has been renamed to BaseDoc.
  • BaseDoc cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
  • Following from the previous point, extending BaseDoc allows for a flexible schema compared to the Document class in v1 which only allowed for a fixed schema, with one of tensor, text and blob, and additional chunks and matches.
  • Because of this flexibility, DocArray cannot know in advance which fields your document class will provide. Therefore, various methods from v1 (such as .load_uri_to_image_tensor()) are not supported in v2. Instead, we provide some of those methods on the typing-level.
  • In v2 we have the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. The LegacyDocument can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 Document. Indeed, none of the methods associated with Document are present. Only the schema of the data is similar.

Changes to DocumentArray

DocList

  • The DocumentArray class from v1 has been renamed to DocList, to be more descriptive of its actual functionality, since it is a list of BaseDocs.

DocVec

  • Additionally, we introduced the class DocVec, which is a column-based representation of BaseDocs. Both DocVec and DocList extend AnyDocArray.
  • DocVec is a container of Documents suited to computations that require batches of data (e.g. matrix multiplication, distance calculation, deep learning forward pass).
  • A DocVec has an interface similar to DocList's, but its underlying implementation is column-based instead of row-based. Each field of the DocVec's schema (the .doc_type, which is a BaseDoc) is stored in a column. If the field is a tensor, the data from all Documents is stored as a single doc_vec (Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor or a Union of tensor types, the .tensor_type is used to determine the type of the doc_vec column.
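
The difference between row-based and column-based storage can be sketched with plain dictionaries and lists. `to_columns` and `row_view` below are hypothetical helpers for illustration only. DocVec itself stores tensor fields as a single stacked tensor rather than as a Python list.

```python
from typing import Any, Dict, List


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-document dicts into one list per field."""
    if not rows:
        return {}
    fields = rows[0].keys()  # assumes every row shares the same schema
    return {field: [row[field] for row in rows] for field in fields}


def row_view(columns: Dict[str, List[Any]], position: int) -> Dict[str, Any]:
    """Rebuild a single document from the column layout."""
    return {field: values[position] for field, values in columns.items()}


rows = [
    {"text": "hello", "embedding": [1, 2]},
    {"text": "world", "embedding": [3, 4]},
]
columns = to_columns(rows)
second = row_view(columns, 1)
```

With the column layout, a batch operation reads `columns["embedding"]` directly instead of visiting every document.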

Parameterized DocList

  • Because a document schema can now be designed freely, a DocList does not have to be homogeneous when it is initialized.
  • If you want a homogeneous DocList, you can parameterize it at initialization time:
from docarray import DocList
from docarray.documents import ImageDoc

docs = DocList[ImageDoc]()
  • Methods like .from_csv() or .pull() only work with parameterized DocLists.

Access attributes of your DocumentArray

  • In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocArray instance.
  • In v2 you use the document's attribute name directly rather than its plural, since AnyDocArray exposes the same attributes as the BaseDocs it contains. This returns a list of type(attribute). It works if and only if all the BaseDocs in the AnyDocArray have the same schema, so only a parameterized DocList supports it:
from docarray import BaseDoc, DocList


class Book(BaseDoc):
    title: str
    author: str = None


docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title  # returns a list[str]

# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title
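
How a list subclass can expose the attributes of its elements is sketched below using only the standard library. `AttrList` and this dataclass version of `Book` are hypothetical stand-ins and not DocList's implementation. The sketch assumes every element provides the requested attribute.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Book:
    title: str
    author: Optional[str] = None


class AttrList(list):
    """List subclass that forwards unknown attributes to its elements."""

    def __getattr__(self, name: str) -> List[Any]:
        # Only called when normal lookup fails, so list methods still work.
        if name.startswith("__"):
            raise AttributeError(name)
        try:
            return [getattr(item, name) for item in self]
        except AttributeError:
            raise AttributeError(
                f"not every element has attribute {name!r}"
            ) from None


books = AttrList(Book(title=f"title {i}") for i in range(3))
titles = books.title
```

Because `AttrList` is a real `list`, built-ins such as `len(books)` and slicing keep working unchanged.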

Changes to Document Store

In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity. DocArray v2 DocIndex supports Weaviate, Qdrant, Elasticsearch, and HNSWLib.

Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice, such as:

db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')

In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.

Thank you to all of the contributors to this release:

  • @samsja
  • @JohannesMessner
  • @anna-charlotte
  • @AnneYang720
  • @hsm207
  • @kacperlukawski
  • @JoanFM
  • @alexcg1
  • @Jackmin801
  • @nan-wang
  • @jupyterjazz
  • @azayz
  • @agaraman0
  • @hrik2001
  • @srini047
docarray - 💫 Release v0.21.0

Published by github-actions[bot] almost 2 years ago

Release Note (0.21.0)

Release time: 2023-01-17 09:10:50

This release contains 3 new features, 7 bug fixes and 5 documentation improvements.

🆕 Features

OpenSearch Document Store (#853)

This version of DocArray adds a new Document Store: OpenSearch!

You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:

from docarray import Document, DocumentArray
import numpy as np

# Connect to OpenSearch instance
n_dim = 3

da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)

# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )

# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)

Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.

Learn more about its usage in the official documentation.

Add color to point cloud display (#961)

You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor():

coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')

doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')])
)
doc.display()

[image: rendered point cloud with color]

Add language attribute to Redis Document Store (#953)

The Redis Document Store now supports text search in various supported languages. To set a desired language, change the language parameter in the Redis configuration:

da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)

🐞 Bug Fixes

Replace newline with whitespace to fix display in plot embeddings (#963)

Whenever the string "\n" was contained in any Document field, doc.plot() would result in a rendering error. This fix resolves those errors by rendering "\n" as whitespace.

Fix unwanted coercion in to_pydantic_model (#949)

This bug caused all strings of the form 'Infinity' to be coerced to the string 'inf' when calling to_pydantic_model() or to_dict(). This is fixed now, leaving such strings unchanged.

Calculate relevant docs on index instead of queries (#950)

In the embed_and_evaluate() method, the number of relevant Documents per label used to be calculated based on the Documents in self. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.

Remove offset index create on list like false (#936)

When a Document Store has list-like behavior disabled, it no longer creates an offset to id mapping, which improves performance.

Add support for remote audio files (#933)

Loading audio files from a remote URL would cause FileNotFoundError, which is now fixed.

Query operator $exists does not work correctly with tags (#911) (#923)

Before this fix, $exists would treat falsy values such as 0 or [] as nonexistent. This is now fixed.
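
The symptom is typical of a truthiness test used where a membership test is needed. The sketch below contrasts the two. It is an assumed reconstruction of the kind of mistake involved, with hypothetical function names, and is not the code that was changed in DocArray.

```python
from typing import Any, Dict


def exists_by_truthiness(tags: Dict[str, Any], key: str) -> bool:
    # Wrongly reports 0, "" or [] as missing because they are falsy.
    return bool(tags.get(key))


def exists_by_membership(tags: Dict[str, Any], key: str) -> bool:
    # A key exists no matter what value it holds.
    return key in tags


tags = {"price": 0, "sizes": [], "name": "shoe"}
```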

Document from dataclass with singleton list (#1018)

When casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with List[...]. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.

📗 Documentation Improvements

  • Link to Discord (#1010)
  • Have fewer versions to avoid deployment timeout (#977)
  • Fix data management section not appearing in Documentation (#967)
  • Link to OpenSearch docs in sidebar (#960)
  • Multimodal to datatypes (#934)

🤟 Contributors

We would like to thank all contributors to this release:

  • Jay Bhambhani (@jay-bhambhani)
  • Alvin Prayuda (@alphinside)
  • Johannes Messner (@JohannesMessner)
  • samsja (@samsja)
  • Marco Luca Sbodio (@marcosbodio)
  • Anne Yang (@AnneYang720)
  • Michael Günther (@guenthermi)
  • AlaeddineAbdessalem (@alaeddine-13)
  • Han Xiao (@hanxiao)
  • Alex Cureton-Griffiths (@alexcg1)
  • Charlotte Gerhaher (@anna-charlotte)
docarray - 💫 Patch v0.20.1

Published by github-actions[bot] almost 2 years ago

Release Note (0.20.1)

Release time: 2022-12-12 09:32:37

🐞 Bug Fixes

Make Milvus DocumentArray thread safe and suitable for pytest (#904)

This bug was causing connectivity issues when using multiple DocumentArrays in different threads to connect to the same Milvus instance, e.g. in pytest.

This would produce an error like the following:

E1207 14:59:51.357528591    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.367985469    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.457061884    3934 ev_epoll1_linux.cc:824]     assertion failed: gpr_atm_no_barrier_load(&g_active_poller) != (gpr_atm)worker
Fatal Python error: Aborted

This fix creates a separate gRPC connection for each MilvusDocumentArray instance, circumventing the issue.

Restore backwards compatibility for (de)serialization (#903)

DocArray v0.20.0 broke (de)serialization backwards compatibility with earlier versions of the library, making it impossible to load DocumentArrays from v0.19.1 or earlier from disk:

# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.0
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
AttributeError: 'DocumentArrayInMemory' object has no attribute '_is_subindex'

This fix restores backwards compatibility by not relying on newly introduced private attributes:

# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.1
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
<DocumentArray (length=11) at 140683902276416>
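
Not relying on a newly introduced private attribute usually means reading it with a default, so that objects restored from older data still work. The sketch below shows that pattern with a hypothetical `StoredArray` class. Only the attribute name `_is_subindex` comes from the error message above; everything else is assumed for illustration.

```python
from typing import Any, Iterable, List


class StoredArray:
    """Toy container with an attribute that older saved objects lack."""

    def __init__(self, items: Iterable[Any]) -> None:
        self.items: List[Any] = list(items)
        self._is_subindex = False  # attribute introduced in a later version

    @property
    def is_subindex(self) -> bool:
        # getattr with a default tolerates objects restored from older data.
        return getattr(self, "_is_subindex", False)

    def extend(self, new_items: Iterable[Any]) -> None:
        if not self.is_subindex:
            self.items.extend(new_items)


# Simulate an object restored from an older serialization: __init__ is
# bypassed, so the newer private attribute was never set.
old = StoredArray.__new__(StoredArray)
old.__dict__.update({"items": [1, 2, 3]})
old.extend([4])
```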


📗 Documentation Improvements

  • Polish docs throughout (#895)

🤟 Contributors

We would like to thank all contributors to this release:

  • Anne Yang (@AnneYang720)
  • Nan Wang (@nan-wang)
  • anna-charlotte (@anna-charlotte)
  • Alex Cureton-Griffiths (@alexcg1)
docarray - 💫 Release v0.20.0

Published by github-actions[bot] almost 2 years ago

Release Note (0.20.0)

Release time: 2022-12-07 12:15:30

This release contains 8 new features, 3 bug fixes and 7 documentation improvements.

🆕 Features

Milvus document store (#587)

This release supports the Milvus vector database as a document store.

da = DocumentArray(storage='milvus', config={'n_dim': 3})

Root_id for document stores (#808)

When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).

top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)

To allow this we now store the root_id in the chunks' tags. You can enable this by passing root_id=True in your document store configuration.
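
How a stored root ID lets chunk-level matches be mapped back to their root documents can be sketched with plain dictionaries. The tag key `root_id`, the helper `resolve_roots`, and the de-duplication step are assumptions made for illustration; they are not taken from DocArray's code.

```python
# Plain dictionaries stand in for Documents in a document store.
roots_by_id = {
    'root-a': {'id': 'root-a', 'uri': 'a.png'},
    'root-b': {'id': 'root-b', 'uri': 'b.png'},
}

# Chunk-level matches, best match first, each carrying its root's ID in tags.
chunk_matches = [
    {'id': 'chunk-1', 'tags': {'root_id': 'root-b'}},
    {'id': 'chunk-2', 'tags': {'root_id': 'root-a'}},
    {'id': 'chunk-3', 'tags': {'root_id': 'root-b'}},
]


def resolve_roots(matches, roots, tag_key='root_id'):
    """Map chunk-level matches to root documents, keeping rank order."""
    seen = set()
    resolved = []
    for match in matches:
        root_id = match['tags'].get(tag_key)
        if root_id is None or root_id in seen:
            continue  # no stored root, or this root was already returned
        seen.add(root_id)
        resolved.append(roots[root_id])
    return resolved


top_level = resolve_roots(chunk_matches, roots_by_id)
print([doc['id'] for doc in top_level])  # ['root-b', 'root-a']
```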

Filtering based on text keywords for Qdrant (#849)

You can now filter based on text keywords for the Qdrant document store.

filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}

results = da.find(np.random.rand(n_dim), filter=filter)
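
To make the meaning of the filter concrete, the sketch below evaluates the same structure against plain payload dictionaries. It is a simplification: it treats the `text` condition as a substring test, whereas Qdrant's own matching depends on how the field is indexed. The helper `matches_filter` is hypothetical.

```python
def matches_filter(payload, flt):
    """Return True when every `must` condition holds for the payload."""
    for cond in flt.get('must', []):
        value = payload.get(cond['key'])
        needle = cond['match']['text']
        # Simplified text match: the keyword must occur in a string field.
        if not isinstance(value, str) or needle not in value:
            return False
    return True


flt = {'must': [{'key': 'info', 'match': {'text': 'shoes'}}]}
payloads = [
    {'info': 'red running shoes'},
    {'info': 'blue jacket'},
    {'price': 10},
]

kept = [p for p in payloads if matches_filter(p, flt)]
print(kept)  # [{'info': 'red running shoes'}]
```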

RGB-D representation of 3D meshes (#753)

DocArray already supports 3D mesh representation in different formats and this release adds support for RGB-D representation.

doc.load_uris_to_rgbd_tensor()

Load multi-page TIFF files into chunks (#845)

Multi-page TIFF images can now be loaded with load_uri_to_image_tensor(). Each page is loaded into its own chunk.

d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
  └─ chunks
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
     └─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>

Store key frame indices when loading video tensor from uri (#880)

Key frame indices are now stored in a Document's tags, under the keyframe_indices key, when loading a video into a tensor. This allows extracting the section of the video between key frames.

d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
[0, 25, 196, ...]
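
As a sketch of how such indices might be used, the hypothetical helper below splits a frame sequence into sections that each start at a key frame. A list of integers stands in for the video tensor; the same slicing would apply along the first axis of an array.

```python
def split_at_keyframes(frames, keyframe_indices):
    """Split a frame sequence into sections that each start at a key frame."""
    bounds = list(keyframe_indices) + [len(frames)]
    return [frames[start:end] for start, end in zip(bounds, bounds[1:])]


frames = list(range(10))  # stand-in for the frames of a video tensor
sections = split_at_keyframes(frames, [0, 4, 7])
print([len(section) for section in sections])  # [4, 3, 3]
```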

Better plotting of embeddings for nested and complex data (#891)

You can now choose which meta field parameters to exclude when calling DocumentArray's plot_embeddings() method. This makes it easier to plot embeddings for complex and nested data.

docs.plot_embeddings(exclude_fields_metas=['chunks'])

Better support for information retrieval evaluation (#826)

This release adds a max_rel_per_label parameter to better support metric calculations that require the number of relevant Documents.

metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
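
Why recall needs the number of relevant Documents can be shown with a small standalone calculation. This is a sketch of the metric itself, not DocArray's implementation; the function and its arguments are named for illustration only.

```python
def recall_at_k(binary_relevance, k, max_rel):
    """Fraction of all relevant documents that appear in the top-k results."""
    if max_rel <= 0:
        return 0.0
    hits = sum(binary_relevance[:k])
    return hits / max_rel


# 1 marks a relevant match, 0 an irrelevant one, in ranked order.
relevance = [1, 0, 1, 0, 0]

# Four relevant documents exist in total, but only two were retrieved.
print(recall_at_k(relevance, k=3, max_rel=4))  # 0.5

# Inferring the total from the returned matches alone overstates recall.
print(recall_at_k(relevance, k=3, max_rel=sum(relevance)))  # 1.0
```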

๐Ÿž Bug Fixes

Support length calculation independently from list-like behavior (#840)

DocArray 0.19 added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.

Remove cosine similarity field with false assignment (#835)

In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
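
The two quantities are easy to confuse because they are complementary. The sketch below assumes the common definition of cosine distance as one minus cosine similarity, and uses plain Python lists rather than any document store API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance under the assumed definition: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


same = ([1.0, 0.0], [1.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])

print(cosine_similarity(*same), cosine_distance(*same))
print(cosine_similarity(*orthogonal), cosine_distance(*orthogonal))
```

Identical vectors have similarity 1 and distance 0, so storing a distance in a similarity field inverts the meaning of the score.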

Rebuild index after clearing storage (#837)

The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage is called.

📗 Documentation Improvements

  • Correct Document description (#842)
  • Minor correction in Document description (#834)
  • Add username to DocArray pull (#847)
  • Fix broken docs (#805)
  • Fix data management section (#801)
  • Change logic order according to blog (#797)
  • Move cloud support to integrations (#798)

🤟 Contributors

We would like to thank all contributors to this release:

  • Delgermurun (@delgermurun)
  • Anne Yang (@AnneYang720)
  • anna-charlotte (@anna-charlotte)
  • Johannes Messner (@JohannesMessner)
  • Alex Cureton-Griffiths (@alexcg1)
  • AlaeddineAbdessalem (@alaeddine-13)
  • dong xiang (@dongxiang123)
  • coolmian (@coolmian)
  • Joan Fontanals (@JoanFM)
  • Nan Wang (@nan-wang)
  • samsja (@samsja)
  • Michael Günther (@guenthermi)
docarray - 💫 Patch v0.19.1

Published by github-actions[bot] almost 2 years ago

Release note 0.19.1

This release contains 1 hot fix.

๐Ÿž Hot Fix

Support for new Jina AI Cloud namespace format.

This release introduces namespaces when pushing/pulling DocumentArrays to/from Jina AI Cloud.

from docarray import DocumentArray

DocumentArray.pull('<username>/<da-name>')
DocumentArray.push('<username>/<da-name>')

You should now use a namespace when accessing an artifact. This release fixes a bug related to this namespace in DocArray.

🤟 Contributors

  • samsja (@samsja)
  • delgermurun (@delgermurun)