
docarray - 💫 Release v0.30.0

Published by JoanFM over 1 year ago

💫 Release v0.30.0 (a.k.a DocArray v2)

Warning
This version of DocArray is a complete rewrite and therefore includes many breaking changes. Be sure to check the documentation to prepare your migration.

Changelog

If you are using a DocArray version below 0.30.0, you will be familiar with its dataclass API.

DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.

This gives the following advantages:

Flexibility: No need to conform to a fixed set of fields -- your data defines the schema.
Multimodality: Easily store multiple modalities and multiple embeddings in the same Document.
Language agnostic: At their core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.
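
As a rough analogy for the three points above, here is a sketch that uses only the Python standard library. It is not the DocArray API: `ProductDoc` and all of its fields are made up for illustration, and a stdlib dataclass stands in for the Pydantic-backed document class.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ProductDoc:
    # Hypothetical schema: the fields are whatever the data needs.
    title: str
    image_url: Optional[str] = None
    # Two embeddings for two modalities, stored side by side.
    text_embedding: List[float] = field(default_factory=list)
    image_embedding: List[float] = field(default_factory=list)


doc = ProductDoc(
    title="Red running shoe",
    image_url="https://example.com/shoe.png",
    text_embedding=[0.1, 0.2],
    image_embedding=[0.9, 0.8, 0.7],
)

# The document reduces to a plain dictionary, so it serializes to JSON
# that any language can produce or consume.
payload = json.dumps(asdict(doc))
restored = ProductDoc(**json.loads(payload))
```

Here `restored` compares equal to `doc`, which is the property that makes it practical to send documents between services written in different languages.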

You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:

Hybrid search: You can now combine vector search with text search, and even filter by arbitrary fields.
Production-ready: The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.
Increased flexibility: We strive to support any configuration or setting that you could perform through the DB's first-party client.

For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
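
To make "vector search combined with a filter" concrete, the following is a toy, brute-force sketch in plain Python. It does not use DocArray or any vector database; the function names, the `price` field, and the sample data are assumptions made for illustration, and real Document Indexes delegate this work to the underlying database.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    docs: List[Dict], query: List[float], max_price: float, limit: int = 2
) -> List[Dict]:
    """Keep documents passing the field filter, then rank them by similarity."""
    candidates = [d for d in docs if d["price"] <= max_price]
    ranked = sorted(
        candidates, key=lambda d: cosine(d["embedding"], query), reverse=True
    )
    return ranked[:limit]


docs = [
    {"id": "a", "price": 10.0, "embedding": [1.0, 0.0]},
    {"id": "b", "price": 50.0, "embedding": [0.9, 0.1]},
    {"id": "c", "price": 12.0, "embedding": [0.0, 1.0]},
]
hits = filtered_search(docs, query=[1.0, 0.0], max_price=20.0)
```

Document `b` is close to the query but is excluded by the price filter, so the hits are `a` followed by `c`.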

Changes to `Document`

Document has been renamed to BaseDoc.
BaseDoc cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
Extending BaseDoc therefore allows a flexible schema, whereas the Document class in v1 had a fixed schema: one of tensor, text, and blob, plus chunks and matches.
Because of this flexibility, the library cannot know in advance which fields your document class will provide. Various methods from v1 (such as .load_uri_to_image_tensor()) are therefore not supported in v2. Instead, some of those methods are provided at the typing level.
In v2 we have the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. LegacyDocument can be useful when you start migrating your codebase from v1 to v2. However, its API is not fully compatible with the DocArray v1 Document: none of the methods associated with Document are present, and only the data schema is similar.

Changes to `DocumentArray`

DocList

The DocumentArray class from v1 has been renamed to DocList, to be more descriptive of its actual functionality, since it is a list of BaseDocs.

DocVec

Additionally, we introduced the class DocVec, which is a column-based representation of BaseDocs. Both DocVec and DocList extend AnyDocArray.
DocVec is a container of Documents suited to computations that require batches of data (e.g., matrix multiplication, distance calculation, a deep learning forward pass).
A DocVec has an interface similar to DocList, but its underlying implementation is column-based instead of row-based. Each field of the DocVec's schema (the .doc_type, which is a BaseDoc) is stored in a column. If the field is a tensor, the data from all Documents is stored as a single (Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor or a Union of tensor types, .tensor_type is used to determine the type of that column.
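
The difference between the row-based and column-based layouts can be sketched in plain Python. This is a toy illustration, not DocArray's implementation: `Row`, `to_columns`, and `dot_all` are hypothetical names, and nested lists stand in for a real tensor.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Row:
    # Hypothetical schema with one scalar field and one tensor-like field.
    title: str
    embedding: List[float]


def to_columns(rows: List[Row]) -> Dict[str, list]:
    """Turn a row-based list of documents into one column per field."""
    return {
        "title": [r.title for r in rows],
        # All embeddings end up in one 2-D structure (n_docs x dim),
        # which is the shape batched operations expect.
        "embedding": [r.embedding for r in rows],
    }


def dot_all(matrix: List[List[float]], query: List[float]) -> List[float]:
    """Batched dot product of every row in `matrix` with `query`."""
    return [sum(a * b for a, b in zip(row, query)) for row in matrix]


rows = [Row("a", [1.0, 0.0]), Row("b", [0.0, 1.0]), Row("c", [1.0, 1.0])]
columns = to_columns(rows)
scores = dot_all(columns["embedding"], [2.0, 3.0])
```

With the column layout, the whole `embedding` column is handed to one batched operation instead of being gathered from each document first. In a real tensor library that batched call is where the speedup comes from.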

Parameterized DocList

Because document schemas are now flexible, a DocList does not have to be homogeneous when you initialize it.
If you want a homogeneous DocList, you can parameterize it at initialization time:

from docarray import DocList
from docarray.documents import ImageDoc

docs = DocList[ImageDoc]()

Methods like .from_csv() or .pull() only work with parameterized DocLists.

Access attributes of your DocumentArray

In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocumentArray instance.
In v2 you use the attribute's name itself instead of its plural, since AnyDocArray exposes the same attributes as the BaseDocs it contains. This returns a list of type(attribute). However, this works only if all the BaseDocs in the AnyDocArray have the same schema. Therefore only the parameterized form works:

from docarray import BaseDoc, DocList


class Book(BaseDoc):
    title: str
    author: str = None


docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title  # returns a list[str]

# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title

Changes to Document Store

In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity. DocArray v2 DocIndex supports Weaviate, Qdrant, ElasticSearch, and HNSWLib.

Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice, such as:

from docarray.index import HnswDocumentIndex
db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')  # MyDoc is your own BaseDoc subclass

In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.

Thank you to all of the contributors to this release:

@samsja
@JohannesMessner
@anna-charlotte
@AnneYang720
@hsm207
@kacperlukawski
@JoanFM
@alexcg1
@Jackmin801
@nan-wang
@jupyterjazz
@azayz
@agaraman0
@hrik2001
@srini047
