diff --git a/README.md b/README.md index f75e08559c0..17cdd535d42 100644 --- a/README.md +++ b/README.md @@ -393,7 +393,6 @@ query = dl[0] results, scores = index.find(query, limit=10, search_field='embedding') ``` - --- ## Learn DocArray diff --git a/docs/API_reference/doc_index/backends/milvus.md b/docs/API_reference/doc_index/backends/milvus.md new file mode 100644 index 00000000000..38514163cac --- /dev/null +++ b/docs/API_reference/doc_index/backends/milvus.md @@ -0,0 +1,3 @@ +# MilvusDocumentIndex + +::: docarray.index.backends.milvus.MilvusDocumentIndex \ No newline at end of file diff --git a/docs/API_reference/doc_index/backends/redis.md b/docs/API_reference/doc_index/backends/redis.md new file mode 100644 index 00000000000..f9622b23d55 --- /dev/null +++ b/docs/API_reference/doc_index/backends/redis.md @@ -0,0 +1,3 @@ +# RedisDocumentIndex + +::: docarray.index.backends.redis.RedisDocumentIndex \ No newline at end of file diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index af9488a11e3..dee3653e0f9 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -1,6 +1,6 @@ # Introduction -A Document Index lets you store your Documents and search through them using vector similarity. +A Document Index lets you store your documents and search through them using vector similarity. This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to some query that you provide. 
@@ -37,716 +37,90 @@ Currently, DocArray supports the following vector databases:
 - [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md)
 - [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md)
 - [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md)
+- [Redis](https://redis.com/) | [Docs](index_redis.md)
+- [Milvus](https://milvus.io/) | [Docs](index_milvus.md)
 - [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md)
+- InMemoryExactNNIndex | [Docs](index_in_memory.md)
-For this user guide you will use the [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex]
-because it doesn't require you to launch a database server. Instead, it will store your data locally.
-!!! note "Using a different vector database"
-    You can easily use Weaviate, Qdrant, or Elasticsearch instead -- they share the same API!
-    To do so, check their respective documentation sections.
+## Basic usage
-!!! note "Hnswlib-specific settings"
-    The following sections explain the general concept of Document Index by using
-    [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] as an example.
-    For HNSWLib-specific settings, check out the [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] documentation
-    [here](index_hnswlib.md).
+Let's learn the basic capabilities of a Document Index with [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex].
+It doesn't require a database server -- instead, it stores your data locally.
-## Create a Document Index
-!!! note
-    To use [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex], you need to install extra dependencies with the following command:
-
-    ```console
-    pip install "docarray[hnswlib]"
-    ```
+!!! note "Using a different vector database"
+    You can easily use Weaviate, Qdrant, Redis, Milvus, or Elasticsearch instead -- their APIs are largely identical!
+ To do so, check their respective documentation sections. -To create a Document Index, you first need a document that defines the schema of your index: +!!! note "InMemoryExactNNIndex in more detail" + The following section only covers the basics of InMemoryExactNNIndex. + For a deeper understanding, please look into its [documentation](index_in_memory.md). +### Define document schema and create data +The following code snippet defines a document schema using the `BaseDoc` class. Each document consists of a title (a string), +a price (an integer), and an embedding (a 128-dimensional array). It also creates a list of ten documents with dummy titles, +prices ranging from 0 to 9, and randomly generated embeddings. ```python -from docarray import BaseDoc -from docarray.index import HnswDocumentIndex +from docarray import BaseDoc, DocList +from docarray.index import InMemoryExactNNIndex from docarray.typing import NdArray - +import numpy as np class MyDoc(BaseDoc): + title: str + price: int embedding: NdArray[128] - text: str - - -db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db') -``` - -### Schema definition - -In this code snippet, `HnswDocumentIndex` takes a schema of the form of `MyDoc`. -The Document Index then _creates a column for each field in `MyDoc`_. - -The column types in the backend database are determined by the type hints of the document's fields. -Optionally, you can [customize the database types for every field](#customize-configurations). - -Most vector databases need to know the dimensionality of the vectors that will be stored. -Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that -the database will store vectors with 128 dimensions. - -!!! note "PyTorch and TensorFlow support" - Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that - for you. This is supported for all Document Index backends. 
No need to convert your tensors to NumPy arrays manually! - - -### Using a predefined Document as schema - -DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. -If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: -Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. - -The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` -field. But this is crucial information for any vector database to work properly! - -You can work around this problem by subclassing the predefined Document and adding the dimensionality information: - -=== "Using type hint" - ```python - from docarray.documents import TextDoc - from docarray.typing import NdArray - from docarray.index import HnswDocumentIndex - - - class MyDoc(TextDoc): - embedding: NdArray[128] - - - db = HnswDocumentIndex[MyDoc](work_dir='test_db') - ``` - -=== "Using Field()" - ```python - from docarray.documents import TextDoc - from docarray.typing import AnyTensor - from docarray.index import HnswDocumentIndex - from pydantic import Field - - - class MyDoc(TextDoc): - embedding: AnyTensor = Field(dim=128) - - - db = HnswDocumentIndex[MyDoc](work_dir='test_db3') - ``` -Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or your custom Document type. - -The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: - -```python -from docarray import DocList - -# data of type TextDoc -data = DocList[TextDoc]( - [ - TextDoc(text='hello world', embedding=np.random.rand(128)), - TextDoc(text='hello world', embedding=np.random.rand(128)), - TextDoc(text='hello world', embedding=np.random.rand(128)), - ] -) - -# you can index this into Document Index of type MyDoc -db.index(data) -``` - - -**Database location:** - -For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you -usually specify a `host` and a `port` instead. - -In addition to a host and a port, most backends can also take an `index_name`, `table_name`, `collection_name` or similar. -This specifies the name of the index/table/collection that will be created in the database. -You don't have to specify this though: By default, this name will be taken from the name of the Document type that you use as schema. -For example, for `WeaviateDocumentIndex[MyDoc](...)` the data will be stored in a Weaviate Class of name `MyDoc`. - -In any case, if the location does not yet contain any data, we start from a blank slate. -If the location already contains data from a previous session, it will be accessible through the Document Index. - -## Index data - -Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: - -```python -import numpy as np -from docarray import DocList - -# create some random data docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] + MyDoc(title=f"title #{i}", price=i, embedding=np.random.rand(128)) + for i in range(10) ) - -# index the data -db.index(docs) ``` -That call to [index()][docarray.index.backends.hnswlib.HnswDocumentIndex.index] stores all Documents in `docs` into the Document Index, -ready to be retrieved in the next step. 
- -As you can see, `DocList[MyDoc]` and `HnswDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. -This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store -need to have compatible schemas. - -!!! question "When are two schemas compatible?" - The schemas of your Document Index and data need to be compatible with each other. - - Let's say A is the schema of your Document Index and B is the schema of your data. - There are a few rules that determine if schema A is compatible with schema B. - If _any_ of the following are true, then A and B are compatible: - - - A and B are the same class - - A and B have the same field names and field types - - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A - - In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. - -## Vector similarity search - -Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. - -By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find -similar Documents in the Document Index: - -=== "Search by Document" - - ```python - # create a query Document - query = MyDoc(embedding=np.random.rand(128), text='query') - - # find similar Documents - matches, scores = db.find(query, search_field='embedding', limit=5) - - print(f'{matches=}') - print(f'{matches.text=}') - print(f'{scores=}') - ``` - -=== "Search by raw vector" - - ```python - # create a query vector - query = np.random.rand(128) - - # find similar Documents - matches, scores = db.find(query, search_field='embedding', limit=5) - - print(f'{matches=}') - print(f'{matches.text=}') - print(f'{scores=}') - ``` - -To succesfully peform a vector search, you need to specify a `search_field`. 
This is the field that serves as the -basis of comparison between your query and the documents in the Document Index. - -In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. -In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose -which one to use for the search. - -The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest -matching documents and their associated similarity scores. - -When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. - -How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations). - -### Batched search - -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. 
- -=== "Search by Documents" - - ```python - # create some query Documents - queries = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) - ) - - # find similar Documents - matches, scores = db.find_batched(queries, search_field='embedding', limit=5) - - print(f'{matches=}') - print(f'{matches[0].text=}') - print(f'{scores=}') - ``` - -=== "Search by raw vectors" - - ```python - # create some query vectors - query = np.random.rand(3, 128) - - # find similar Documents - matches, scores = db.find_batched(query, search_field='embedding', limit=5) - - print(f'{matches=}') - print(f'{matches[0].text=}') - print(f'{scores=}') - ``` - -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing -a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. - -## Filter search and text search - -In addition to vector similarity search, the Document Index interface offers methods for text search and filtered search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search] and [filter()][docarray.index.abstract.BaseDocIndex.filter], -as well as their batched versions [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched] and [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched]. [filter_subindex()][docarray.index.abstract.BaseDocIndex.filter_subindex] is for filter on subindex level. - -!!! note - The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter - or text search. - - To see how to perform filter or text search, you can check out other backends that offer support. - -## Hybrid search through the query builder - -Document Index supports atomic operations for vector similarity search, text search and filter search. 
- -To combine these operations into a single, hybrid search query, you can use the query builder that is accessible -through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]: - +### Initialize the Document Index and add data +Here we initialize an `InMemoryExactNNIndex` instance with the document schema we defined previously, and add the created documents to this index. ```python -# prepare a query -q_doc = MyDoc(embedding=np.random.rand(128), text='query') - -query = ( - db.build_query() # get empty query object - .find(query=q_doc, search_field='embedding') # add vector similarity search - .filter(filter_query={'text': {'$exists': True}}) # add filter search - .build() # build the query -) - -# execute the combined query and return the results -results = db.execute_query(query) -print(f'{results=}') +doc_index = InMemoryExactNNIndex[MyDoc]() +doc_index.index(docs) ``` -In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search -to obtain a combined set of results. - -The kinds of atomic queries that can be combined in this way depends on the backend. -Some backends can combine text search and vector search, while others can perform filters and vectors search, etc. -To see what backend can do what, check out the [specific docs](#document-index). - -## Access documents by `id` - -To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. - -You can also access data by the `id` that was assigned to each document: - +### Perform a vector similarity search +Now, let's perform a similarity search on the document embeddings. +As a result, we'll retrieve the ten most similar documents and their corresponding similarity scores. 
```python -# prepare some data -data = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) -) - -# remember the Document ids and index the data -ids = data.id -db.index(data) - -# access the Documents by id -doc = db[ids[0]] # get by single id -docs = db[ids] # get by list of ids +query = np.ones(128) +retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) ``` -## Delete Documents - -In the same way you can access Documents by id, you can also delete them: - +### Filter documents +In this snippet, we filter the indexed documents based on their price field, specifically retrieving documents with a price less than 5: ```python -# prepare some data -data = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) -) - -# remember the Document ids and index the data -ids = data.id -db.index(data) - -# access the Documents by id -del db[ids[0]] # del by single id -del db[ids[1:]] # del by list of ids +query = {'price': {'$lt': 5}} +filtered_docs = doc_index.filter(query, limit=10) ``` -## Customize configurations - -DocArray's philosophy is that each Document Index should "just work", meaning that it comes with a sane set of defaults -that get you most of the way there. - -However, there are different configurations that you may want to tweak, including: - -- The [ANN](https://ignite.apache.org/docs/latest/machine-learning/binary-classification/ann) algorithm used, for example [HNSW](https://www.pinecone.io/learn/hnsw/) or [ScaNN](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) -- Hyperparameters of the ANN algorithm, such as `ef_construction` for HNSW -- The distance metric to use, such as cosine or L2 distance -- The data type of each column in the database -- And many more... - -The specific configurations that you can tweak depend on the backend, but the interface to do so is universal. 
- -Document Indexes differentiate between three different kind of configurations: - -### Database configurations - -_Database configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column), -and that you _don't_ dynamically change at runtime. - -This commonly includes: - -- host and port -- index or collection name -- authentication settings -- ... - -For every backend, you can get a full list of configurations and their defaults: - +### Combine different search methods +The final snippet combines the vector similarity search and filtering operations into a single query. +We first perform a similarity search on the document embeddings and then apply a filter to return only those documents with a price greater than or equal to 2: ```python -from docarray.index import HnswDocumentIndex - - -db_config = HnswDocumentIndex.DBConfig() -print(db_config) - -# > HnswDocumentIndex.DBConfig(work_dir='.') -``` - -As you can see, `HnswDocumentIndex.DBConfig` is a dataclass that contains only one possible configuration, `work_dir`, -that defaults to `.`. - -You can customize every field in this configuration: - -=== "Pass individual settings" - - ```python - db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db') - - custom_db_config = db._db_config - print(custom_db_config) - - # > HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') - ``` - -=== "Pass entire configuration" - - ```python - custom_db_config = HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') - - db = HnswDocumentIndex[MyDoc](custom_db_config) - - print(db._db_config) - - # > HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') - ``` - -### Runtime configurations - -_Runtime configurations_ are configurations that relate to the way how an `instance` operates with respect to a specific -database. - - -This commonly includes: -- default batch size for batching operations -- default consistency level for various database operations -- ... 
- -For every backend, you can get the full list of configurations and their defaults: - -```python -from docarray.index import ElasticDocIndex - - -runtime_config = ElasticDocIndex.RuntimeConfig() -print(runtime_config) - -# > ElasticDocIndex.RuntimeConfig(chunk_size=500) -``` - -As you can see, `HnswDocumentIndex.RuntimeConfig` is a dataclass that contains only one configuration: -`default_column_config`, which is a mapping from Python types to database column configurations. - -You can customize every field in this configuration using the [configure()][docarray.index.abstract.BaseDocIndex.configure] method: - -=== "Pass individual settings" - - ```python - db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db') - - db.configure( - default_column_config={ - np.ndarray: { - 'dim': -1, - 'index': True, - 'space': 'ip', - 'max_elements': 2048, - 'ef_construction': 100, - 'ef': 15, - 'M': 8, - 'allow_replace_deleted': True, - 'num_threads': 5, - }, - None: {}, - } - ) - - custom_runtime_config = db._runtime_config - print(custom_runtime_config) - - # > HnswDocumentIndex.RuntimeConfig(default_column_config={: {'dim': -1, 'index': True, 'space': 'ip', 'max_elements': 2048, 'ef_construction': 100, 'ef': 15, 'M': 8, 'allow_replace_deleted': True, 'num_threads': 5}, None: {}}) - ``` - -=== "Pass entire configuration" - - ```python - custom_runtime_config = HnswDocumentIndex.RuntimeConfig( - default_column_config={ - np.ndarray: { - 'dim': -1, - 'index': True, - 'space': 'ip', - 'max_elements': 2048, - 'ef_construction': 100, - 'ef': 15, - 'M': 8, - 'allow_replace_deleted': True, - 'num_threads': 5, - }, - None: {}, - } - ) - - db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db') - - db.configure(custom_runtime_config) - - print(db._runtime_config) - - # > HHnswDocumentIndex.RuntimeConfig(default_column_config={: {'dim': -1, 'index': True, 'space': 'ip', 'max_elements': 2048, 'ef_construction': 100, 'ef': 15, 'M': 8, 'allow_replace_deleted': True, 'num_threads': 5}, None: {}}) 
- ``` - -After this change, the new setting will be applied to _every_ column that corresponds to a `np.ndarray` type. - -### Column configurations - -For many vector databases, individual columns can have different configurations. - -This commonly includes: -- the data type of the column, e.g. `vector` vs `varchar` -- the dimensionality of the vector (if it is a vector column) -- whether an index should be built for a specific column - -The available configurations vary from backend to backend, but in any case you can pass them -directly in the schema of your Document Index, using the `Field()` syntax: - -```python -from pydantic import Field - - -class Schema(BaseDoc): - tens: NdArray[100] = Field(max_elements=12, space='cosine') - tens_two: NdArray[10] = Field(M=4, space='ip') - - -db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db') -``` - -The `HnswDocumentIndex` above contains two columns which are configured differently: -- `tens` has a dimensionality of `100`, can take up to `12` elements, and uses the `cosine` similarity space -- `tens_two` has a dimensionality of `10`, and uses the `ip` similarity space, and an `M` hyperparameter of 4 - -All configurations that are not explicitly set will be taken from the `default_column_config` of the `DBConfig`. 
-You can modify these defaults in the following way: - -```python -import numpy as np -from pydantic import Field - -from docarray import BaseDoc -from docarray.index import HnswDocumentIndex -from docarray.typing import NdArray - - -class Schema(BaseDoc): - tens: NdArray[100] = Field(max_elements=12, space='cosine') - tens_two: NdArray[10] = Field(M=4, space='ip') - - -# create a DBConfig for your Document Index -conf = HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') -# update the default max_elements for np.ndarray columns -conf.default_column_config.get(np.ndarray).update(max_elements=2048) -# create Document Index -# tens has a max_elements of 12, specified in the schema -# tens_two has a max_elements of 2048, specified by the default in the DBConfig -db = HnswDocumentIndex[Schema](conf) -``` - - -For an explanation of the configurations that are tweaked in this example, see the `HnswDocumentIndex` [documentation](index_hnswlib.md). - -## Nested data - -The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. - -**Index nested data:** - -It is, however, also possible to represent nested Documents and store them in a Document Index. - -In the following example you can see a complex schema that contains nested Documents. 
-The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='/tmp2') - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
- -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), +query = ( + doc_index.build_query() # get empty query object + .find(query=np.ones(128), search_field='embedding') # add vector similarity search + .filter(filter_query={'price': {'$gte': 2}}) # add filter search + .build() # build the query ) - -# find by the `youtubevideo` tensor; root level -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) - -# find by the `thumbnail` tensor; nested level -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) - -# find by the `video` tensor; neseted level -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) +retrieved_docs, scores = doc_index.execute_query(query) ``` -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). +## Learn more +The code snippets above just scratch the surface of what a Document Index can do. +To learn more and get the most out of `DocArray`, take a look at the detailed guides for the vector database backends you're interested in: -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). 
- -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. -The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine', dim=128) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = HnswDocumentIndex[MyDoc](work_dir='/tmp3') - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. - -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 -) -``` - -!!! 
note "Subindex not supported with InMemoryExactNNIndex" - Currently, subindex feature is not available for InMemoryExactNNIndex +- [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md) +- [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md) +- [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md) +- [Redis](https://redis.com/) | [Docs](index_redis.md) +- [Milvus](https://milvus.io/) | [Docs](index_milvus.md) +- [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) +- InMemoryExactNNIndex | [Docs](index_in_memory.md) diff --git a/docs/user_guide/storing/first_step.md b/docs/user_guide/storing/first_step.md index e8f7ab80315..836f12646d1 100644 --- a/docs/user_guide/storing/first_step.md +++ b/docs/user_guide/storing/first_step.md @@ -25,7 +25,7 @@ This section covers the following three topics: ## Document Index -A Document Index lets you store your Documents and search through them using vector similarity. +A Document Index lets you store your documents and search through them using vector similarity. This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to a query that you provide. 
@@ -41,5 +41,7 @@ use a vector search library locally (HNSWLib, Exact NN search): - [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md) - [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md) - [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md) +- [Redis](https://redis.com/) | [Docs](index_redis.md) +- [Milvus](https://milvus.io/) | [Docs](index_milvus.md) - [Hnswlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) - InMemoryExactNNSearch | [Docs](index_in_memory.md) diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index b8251e9c88f..062a95c976d 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -33,7 +33,39 @@ DocArray comes with two Document Indexes for [Elasticsearch](https://www.elastic The following example is based on [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex], but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex]. -# Start Elasticsearch + +## Basic usage +This snippet demonstrates the basic usage of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] to index these documents, +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. + +```python +from docarray import BaseDoc, DocList +from docarray.index import ElasticDocIndex # or ElasticV7DocIndex +from docarray.typing import NdArray +import numpy as np + +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] + +# Create dummy documents. 
+docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) + +# Initialize a new ElasticDocIndex instance and add the documents to the index. +doc_index = ElasticDocIndex[MyDoc](index_name='my_index') +doc_index.index(docs) + +# Perform a vector search. +query = np.ones(128) +retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) +``` + + + +## Initialize + You can use docker-compose to create a local Elasticsearch service with the following `docker-compose.yml`. @@ -62,7 +94,7 @@ Run the following command in the folder of the above `docker-compose.yml` to sta docker-compose up ``` -## Construct +### Schema definition To construct an index, you first need to define a schema in the form of a `Document`. @@ -94,263 +126,207 @@ class SimpleDoc(BaseDoc): doc_index = ElasticDocIndex[SimpleDoc](hosts='http://localhost:9200') ``` -## Index documents +### Using a predefined document as schema -Use `.index()` to add documents into the index. -The`.num_docs()` method returns the total number of documents in the index. +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. -```python -index_docs = [SimpleDoc(tensor=np.ones(128)) for _ in range(64)] +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
-doc_index.index(index_docs) +You can work around this problem by subclassing the predefined document and adding the dimensionality information: -print(f'number of docs in the index: {doc_index.num_docs()}') -``` +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import ElasticDocIndex -## Access documents -To access the `Doc`, you need to specify the `id`. You can also pass a list of `id` to access multiple documents. + class MyDoc(TextDoc): + embedding: NdArray[128] -```python -# access a single Doc -doc_index[index_docs[16].id] -# access multiple Docs -doc_index[index_docs[16].id, index_docs[17].id] -``` - -### Persistence + db = ElasticDocIndex[MyDoc](index_name='test_db') + ``` -You can hook into a database index that was persisted during a previous session. -To do so, you need to specify `index_name` and the `hosts`: +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import ElasticDocIndex + from pydantic import Field -```python -doc_index = ElasticDocIndex[SimpleDoc]( - hosts='http://localhost:9200', index_name='previously_stored' -) -doc_index.index(index_docs) -doc_index2 = ElasticDocIndex[SimpleDoc]( - hosts='http://localhost:9200', index_name='previously_stored' -) + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128) -print(f'number of docs in the persisted index: {doc_index2.num_docs()}') -``` + db = ElasticDocIndex[MyDoc](index_name='test_db3') + ``` -## Delete documents +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. -To delete the documents, use the built-in function `del` with the `id` of the Documents that you want to delete. -You can also pass a list of `id`s to delete multiple documents. 
+The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: ```python -# delete a single Doc -del doc_index[index_docs[16].id] +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) -# delete multiple Docs -del doc_index[index_docs[17].id, index_docs[18].id] +# you can index this into Document Index of type MyDoc +db.index(data) ``` -## Find nearest neighbors -The `.find()` method is used to find the nearest neighbors of a vector. +## Index -You need to specify the `search_field` that is used when performing the vector search. -This is the field that serves as the basis of comparison between your query and indexed Documents. - -You can use the `limit` argument to configure how many documents to return. - -!!! note - [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex] uses Elasticsearch v7.10.1, which does not support approximate nearest neighbour algorithms such as HNSW. - This can lead to poor performance when the search involves many vectors. - [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] does not have this limitation. +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method. +The `.num_docs()` method returns the total number of documents in the index. 
```python
-query = SimpleDoc(tensor=np.ones(128))
-
-docs, scores = doc_index.find(query, limit=5, search_field='tensor')
-```
+from docarray import DocList
-## Nested data
+# create some random data
+docs = DocList[SimpleDoc]([SimpleDoc(tensor=np.ones(128)) for _ in range(64)])
-When using the index you can define multiple fields, including nesting documents inside another document.
+doc_index.index(docs)
-Consider the following example:
+print(f'number of docs in the index: {doc_index.num_docs()}')
+```
-- You have `YouTubeVideoDoc` including the `tensor` field calculated based on the description.
-- `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`.
+As you can see, `DocList[SimpleDoc]` and `ElasticDocIndex[SimpleDoc]` both have `SimpleDoc` as a parameter.
+This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas.
-```python
-from docarray.typing import ImageUrl, VideoUrl, AnyTensor
+!!! question "When are two schemas compatible?"
+    The schemas of your Document Index and data need to be compatible with each other.
+
+    Let's say A is the schema of your Document Index and B is the schema of your data.
+    There are a few rules that determine if schema A is compatible with schema B.
+    If _any_ of the following are true, then A and B are compatible:
+
+    - A and B are the same class
+    - A and B have the same field names and field types
+    - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A
-class ImageDoc(BaseDoc):
-    url: ImageUrl
-    tensor: AnyTensor = Field(similarity='cosine', dims=64)
+    In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index.
-class VideoDoc(BaseDoc):
-    url: VideoUrl
-    tensor: AnyTensor = Field(similarity='cosine', dims=128)
+## Vector search
-class YouTubeVideoDoc(BaseDoc):
-    title: str
-    description: str
-    thumbnail: ImageDoc
-    video: VideoDoc
-    tensor: AnyTensor = Field(similarity='cosine', dims=256)
+Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.
+You can use the [`find()`][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `SimpleDoc`
+to find similar documents within the Document Index:
-doc_index = ElasticDocIndex[YouTubeVideoDoc]()
-index_docs = [
-    YouTubeVideoDoc(
-        title=f'video {i+1}',
-        description=f'this is video from author {10*i}',
-        thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)),
-        video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)),
-        tensor=np.ones(256),
-    )
-    for i in range(8)
-]
-doc_index.index(index_docs)
-```
+You can use the `limit` argument to configure how many documents to return.
-**You can perform search on any nesting level** by using the dunder operator to specify the field defined in the nested data.
+!!! note
+    [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex] uses Elasticsearch v7.10.1, which does not support approximate nearest neighbour algorithms such as HNSW.
+    This can lead to poor performance when the search involves many vectors.
+    [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] does not have this limitation.
-In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or the `tensor` field of the `thumbnail` and `video` field:
+=== "Search by Document"
-```python
-# example of find nested and flat index
-query_doc = YouTubeVideoDoc(
-    title=f'video query',
-    description=f'this is a query video',
-    thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)),
-    video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)),
-    tensor=np.ones(256),
-)
+    ```python
+    # create a query document
+    query = SimpleDoc(tensor=np.ones(128))
-# find by the youtubevideo tensor
-docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='tensor', limit=5)
-# find by the thumbnail tensor
-docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)
+    print(f'{matches=}')
+    print(f'{matches.id=}')
+    print(f'{scores=}')
+    ```
-# find by the video tensor
-docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
-```
+=== "Search by raw vector"
-To delete a nested data, you need to specify the `id`.
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
-!!! note
-    You can only delete `Doc` at the top level. Deletion of `Doc`s on lower levels is not yet supported.
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='tensor', limit=5)
-```python
-# example of delete nested and flat index
-del doc_index[index_docs[3].id, index_docs[4].id]
-```
+    print(f'{matches=}')
+    print(f'{matches.id=}')
+    print(f'{scores=}')
+    ```
-### Nested data with subindex
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
-In the following example you can see a complex schema that contains nested Documents with subindex. +In this example you only have one field (`tensor`) that is a vector, so you can trivially choose that one. +In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose +which one to use for the search. -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(dims=64) +The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +matching documents and their associated similarity scores. +When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(dims=128) +How these scores are calculated depends on the backend, and can usually be [configured](#configuration). -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(dims=256) +### Batched search +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. 
-# create a Document Index
-doc_index = ElasticDocIndex[MyDoc](index_name='subindex')
+=== "Search by Documents"
-# create some data
-index_docs = [
-    MyDoc(
-        docs=DocList[VideoDoc](
-            [
-                VideoDoc(
-                    url=f'http://example.ai/videos/{i}-{j}',
-                    images=DocList[ImageDoc](
-                        [
-                            ImageDoc(
-                                url=f'http://example.ai/images/{i}-{j}-{k}',
-                                tensor_image=np.ones(64),
-                            )
-                            for k in range(10)
-                        ]
-                    ),
-                    tensor_video=np.ones(128),
-                )
-                for j in range(10)
-            ]
-        ),
-        tensor=np.ones(256),
+    ```python
+    # create some query Documents
+    queries = DocList[SimpleDoc](
+        SimpleDoc(tensor=np.random.rand(128)) for i in range(3)
+    )
-    for i in range(10)
-]
-# index the Documents
-doc_index.index(index_docs)
-
-# find by the `VideoDoc` tensor
-root_docs, sub_docs, scores = doc_index.find_subindex(
-    np.ones(128), subindex='docs', search_field='tensor_video', limit=3
-)
+    # find similar documents
+    matches, scores = doc_index.find_batched(queries, search_field='tensor', limit=5)
-# find by the `ImageDoc` tensor
-root_docs, sub_docs, scores = doc_index.find_subindex(
-    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
-) # return both root and subindex docs
-
-# filter on subindex level
-query = {'match': {'url': 'http://example.ai/images/0-0-0'}}
-docs = doc_index.filter_subindex(query, subindex='docs__images')
-```
+    print(f'{matches=}')
+    print(f'{matches[0].id=}')
+    print(f'{scores=}')
+    ```
-## Other Elasticsearch queries
+=== "Search by raw vectors"
-Besides vector search, you can also perform other queries supported by Elasticsearch, such as text search, and various filters.
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
-### Text search
+    # find similar documents
+    matches, scores = doc_index.find_batched(query, search_field='tensor', limit=5)
-As in "pure" Elasticsearch, you can use text search directly on the field of type `str`:
+    print(f'{matches=}')
+    print(f'{matches[0].id=}')
+    print(f'{scores=}')
+    ```
-```python
-class NewsDoc(BaseDoc):
-    text: str
+The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
-doc_index = ElasticDocIndex[NewsDoc]()
-index_docs = [
-    NewsDoc(id='0', text='this is a news for sport'),
-    NewsDoc(id='1', text='this is a news for finance'),
-    NewsDoc(id='2', text='this is another news for sport'),
-]
-doc_index.index(index_docs)
-query = 'finance'
-# search with text
-docs, scores = doc_index.text_search(query, search_field='text')
-```
+## Filter
-### Query Filter
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The `filter()` method accepts queries that follow the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html), which consists of leaf and compound clauses
+evaluated in [query or filter context](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html). Using this, you can perform [keyword filters](#keyword-filter), [geolocation filters](#geolocation-filter) and [range filters](#range-filter).
-#### Keyword filter +### Keyword filter To filter documents in your index by keyword, you can use `Field(col_type='keyword')` to enable keyword search for given fields: @@ -373,7 +349,7 @@ query_filter = {'terms': {'category': ['sport']}} docs = doc_index.filter(query_filter) ``` -#### Geolocation filter +### Geolocation filter To filter documents in your index by geolocation, you can use `Field(col_type='geo_point')` on a given field: @@ -408,7 +384,7 @@ query = { docs = doc_index.filter(query) ``` -#### Range filter +### Range filter You can have [range field types](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/range.html) in your document schema and set `Field(col_type='integer_range')`(or also `date_range`, etc.) to filter documents based on the range of the field. @@ -444,11 +420,40 @@ query = { docs = doc_index.filter(query) ``` -### Hybrid serach and query builder -To combine any of the "atomic" search approaches above, you can use the `QueryBuilder` to build your own hybrid query. +## Text search + +In addition to vector similarity search, the Document Index interface offers methods for text search: +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. + +As in "pure" Elasticsearch, you can use text search directly on the field of type `str`: + +```python +class NewsDoc(BaseDoc): + text: str + + +doc_index = ElasticDocIndex[NewsDoc]() +index_docs = [ + NewsDoc(id='0', text='this is a news for sport'), + NewsDoc(id='1', text='this is a news for finance'), + NewsDoc(id='2', text='this is another news for sport'), +] +doc_index.index(index_docs) +query = 'finance' + +# search with text +docs, scores = doc_index.text_search(query, search_field='text') +``` + + +## Hybrid search + +Document Index supports atomic operations for vector similarity search, text search and filter search. 
-For this the `find()`, `filter()` and `text_search()` methods and their combination are supported.
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]:
For example, you can build a hybrid search query that performs range filtering, vector search and text search:
@@ -478,7 +483,34 @@ docs, _ = doc_index.execute_query(q)
You can also manually build a valid ES query and directly pass it to the `execute_query()` method.
-## Configuration options
+
+## Access documents
+
+To access a document, you need to specify its `id`. You can also pass a list of `id`s to access multiple documents.
+
+```python
+# access a single Doc
+doc_index[index_docs[1].id]
+
+# access multiple Docs
+doc_index[index_docs[2].id, index_docs[3].id]
+```
+
+## Delete documents
+
+To delete documents, use the built-in function `del` with the `id` of the documents that you want to delete.
+You can also pass a list of `id`s to delete multiple documents.
+
+```python
+# delete a single Doc
+del doc_index[index_docs[1].id]
+
+# delete multiple Docs
+del doc_index[index_docs[2].id, index_docs[3].id]
+```
+
+
+## Configuration
### DBConfig
The following configs can be set in `DBConfig`:
| Name | Description | Default |
|-------------------|----------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| `hosts` | Hostname of the Elasticsearch server | `http://localhost:9200` |
-| `es_config` | Other ES [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/python-api/8.6/config.html) in a Dict and pass to `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | None |
-| `index_name` | Elasticsearch index name, the name of Elasticsearch index object | None.
Data will be stored in an index named after the Document type used as schema. |
+| `es_config` | Other ES [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/python-api/8.6/config.html), passed as a dict to the `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | `None` |
+| `index_name` | Elasticsearch index name, the name of Elasticsearch index object | `None`. Data will be stored in an index named after the Document type used as schema. |
| `index_settings` | Other [index settings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/index-modules.html#index-modules-settings) in a Dict for creating the index | dict |
| `index_mappings` | Other [index mappings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/mapping.html) in a Dict for creating the index | dict |
| `default_column_config` | The default configurations for every column type. | dict |
@@ -503,7 +535,7 @@ class SimpleDoc(BaseDoc):
    tensor: NdArray[128] = Field(similarity='l2_norm', m=32, num_candidates=5000)
-doc_index = ElasticDocIndex[SimpleDoc]()
+doc_index = ElasticDocIndex[SimpleDoc](index_name='my_index_1')
```
### RuntimeConfig
The `RuntimeConfig` dataclass of `ElasticDocIndex` consists of `chunk_size`. You can change `chunk_size` for batch operations:
```python
-doc_index = ElasticDocIndex[SimpleDoc]()
+doc_index = ElasticDocIndex[SimpleDoc](index_name='my_index_2')
doc_index.configure(ElasticDocIndex.RuntimeConfig(chunk_size=1000))
```
You can pass the above as keyword arguments to the `configure()` method or pass an entire configuration object. See [here](docindex.md#customize-configurations) for more information.
+ + +### Persistence + +You can hook into a database index that was persisted during a previous session by +specifying the `index_name` and `hosts`: + +```python +doc_index = ElasticDocIndex[MyDoc]( + hosts='http://localhost:9200', index_name='previously_stored' +) +doc_index.index(index_docs) + +doc_index2 = ElasticDocIndex[MyDoc]( + hosts='http://localhost:9200', index_name='previously_stored' +) + +print(f'number of docs in the persisted index: {doc_index2.num_docs()}') +``` + + +## Nested data and subindex search + +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. + +Go to the [Nested Data](nested_data.md) section to learn more. \ No newline at end of file diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index d19973389aa..e662cc220ae 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -7,6 +7,7 @@ pip install "docarray[hnswlib]" ``` + [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html). 
@@ -19,11 +20,427 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s - [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] - [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] - [ElasticDocumentIndex][docarray.index.backends.elastic.ElasticDocIndex] + - [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] + - [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] + +## Basic usage +This snippet demonstrates the basic usage of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] to index these documents, +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. + +```python +from docarray import BaseDoc, DocList +from docarray.index import HnswDocumentIndex +from docarray.typing import NdArray +import numpy as np + +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] + +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) + +# Initialize a new HnswDocumentIndex instance and add the documents to the index. +doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp_0') +doc_index.index(docs) + +# Perform a vector search. 
+query = np.ones(128) +retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) +``` + +## Initialize + +To create a Document Index, you first need a document class that defines the schema of your index: + +```python +from docarray import BaseDoc +from docarray.index import HnswDocumentIndex +from docarray.typing import NdArray + + +class MyDoc(BaseDoc): + embedding: NdArray[128] + text: str + + +db = HnswDocumentIndex[MyDoc](work_dir='./tmp_1') +``` + +### Schema definition + +In this code snippet, `HnswDocumentIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. + +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). + +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + + +### Using a predefined document as schema + +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
+ +You can work around this problem by subclassing the predefined document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import HnswDocumentIndex + + + class MyDoc(TextDoc): + embedding: NdArray[128] + + + db = HnswDocumentIndex[MyDoc](work_dir='./tmp_2') + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import HnswDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128) + + + db = HnswDocumentIndex[MyDoc](work_dir='./tmp_3') + ``` + +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. + +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) + +# you can index this into Document Index of type MyDoc +db.index(data) +``` + + +## Index + +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] +) + +# index the data +db.index(docs) +``` + +That call to [`index()`][docarray.index.backends.hnswlib.HnswDocumentIndex.index] stores all Documents in `docs` in the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `HnswDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index. 
+
+
+## Vector search
+
+Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.
+
+You can use the [`find()`][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc`
+to find similar documents within the Document Index:
+
+=== "Search by Document"
+
+    ```python
+    # create a query document
+    query = MyDoc(embedding=np.random.rand(128), text='query')
+
+    # find similar documents
+    matches, scores = db.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar documents
+    matches, scores = db.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+### Batched search
+
+You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method.
+
+=== "Search by Documents"
+
+    ```python
+    # create some query Documents
+    queries = DocList[MyDoc](
+        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+    )
+
+    # find similar documents
+    matches, scores = db.find_batched(queries, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
+
+    # find similar documents
+    matches, scores = db.find_batched(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
+
+
+## Filter
+
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The query should follow the query language of DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function.
+
+In the following example let's filter for all the books that are cheaper than 29 dollars:
+
+```python
+from docarray import BaseDoc, DocList
+
+
+class Book(BaseDoc):
+    title: str
+    price: int
+
+
+books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)])
+book_index = HnswDocumentIndex[Book](work_dir='./tmp_4')
+book_index.index(books)
+
+# filter for books that are cheaper than 29 dollars
+query = {'price': {'$lt': 29}}
+cheap_books = book_index.filter(query)
+
+assert len(cheap_books) == 3
+for doc in cheap_books:
+    doc.summary()
+```
+
+
+
+## Text search
+
+!!!
note + The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not support text search. -## Basic Usage + To see how to perform text search, you can check out other backends that offer support. -To see how to create a [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] instance, add Documents, -perform search, etc. see the [general user guide](./docindex.md). +In addition to vector similarity search, the Document Index interface offers methods for text search: +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. + + +## Hybrid search + +Document Index supports atomic operations for vector similarity search, text search and filter search. + +To combine these operations into a single, hybrid search query, you can use the query builder that is accessible +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]: + +```python +# Define the document schema. +class SimpleSchema(BaseDoc): + year: int + price: int + embedding: NdArray[128] + +# Create dummy documents. +docs = DocList[SimpleSchema](SimpleSchema(year=2000-i, price=i, embedding=np.random.rand(128)) for i in range(10)) + +doc_index = HnswDocumentIndex[SimpleSchema](work_dir='./tmp_5') +doc_index.index(docs) + +query = ( + doc_index.build_query() # get empty query object + .filter(filter_query={'year': {'$gt': 1994}}) # pre-filtering + .find(query=np.random.rand(128), search_field='embedding') # add vector similarity search + .filter(filter_query={'price': {'$lte': 3}}) # post-filtering + .build() +) +# execute the combined query and return the results +results = doc_index.execute_query(query) +print(f'{results=}') +``` + +In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search +to obtain a combined set of results. 
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can perform filter and vector search, etc.
+
+
+## Access documents
+
+To retrieve a document from a Document Index you don't necessarily need to perform a fancy search.
+
+You can also access data by the `id` that was assigned to each document:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# access the Documents by id
+doc = db[ids[0]]  # get by single id
+docs = db[ids]  # get by list of ids
+```
+
+
+## Delete documents
+
+In the same way you can access Documents by `id`, you can also delete them:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# delete the Documents by id
+del db[ids[0]]  # del by single id
+del db[ids[1:]]  # del by list of ids
+```
+
+## Update documents
+In order to update a Document inside the index, you only need to re-index it with the updated attributes.
+ +First, let's create a schema for our Document Index: +```python +import numpy as np +from docarray import BaseDoc, DocList +from docarray.typing import NdArray +from docarray.index import HnswDocumentIndex +class MyDoc(BaseDoc): + text: str + embedding: NdArray[128] +``` + +Now, we can instantiate our Index and add some data: +```python +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] +) +index = HnswDocumentIndex[MyDoc]() +index.index(docs) +assert index.num_docs() == 100 +``` + +Let's retrieve our data and check its content: +```python +res = index.find(query=docs[0], search_field='embedding', limit=100) +assert len(res.documents) == 100 +for doc in res.documents: + assert 'I am the first version' in doc.text +``` + +Then, let's update all of the text of these documents and re-index them: +```python +for i, doc in enumerate(docs): + doc.text = f'I am the second version of Document {i}' + +index.index(docs) +assert index.num_docs() == 100 +``` + +When we retrieve them again we can see that their text attribute has been updated accordingly: +```python +res = index.find(query=docs[0], search_field='embedding', limit=100) +assert len(res.documents) == 100 +for doc in res.documents: + assert 'I am the second version' in doc.text +``` ## Configuration @@ -49,7 +466,7 @@ class MyDoc(BaseDoc): text: str -db = HnswDocumentIndex[MyDoc](work_dir='./path/to/db') +db = HnswDocumentIndex[MyDoc](work_dir='./tmp_6') ``` To load existing data, you can specify a directory that stores data from a previous session. @@ -70,7 +487,7 @@ import numpy as np db = HnswDocumentIndex[MyDoc]( - work_dir='/tmp/my_db', + work_dir='./tmp_7', default_column_config={ np.ndarray: { 'dim': -1, @@ -104,7 +521,7 @@ For more information on these settings, see [below](#field-wise-configurations). Fields that are not vector fields (e.g. of type `str` or `int` etc.) 
do not offer any configuration, as they are simply stored as-is in a SQLite database. -### Field-wise configurations +### Field-wise configuration There are various setting that you can tweak for every vector field that you index into Hnswlib. @@ -119,7 +536,7 @@ class Schema(BaseDoc): tens_two: NdArray[10] = Field(M=4, space='ip') -db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db') +db = HnswDocumentIndex[Schema](work_dir='./tmp_8') ``` In the example above you can see how to configure two different vector fields, with two different sets of settings. @@ -142,125 +559,26 @@ In this way, you can pass [all options that Hnswlib supports](https://github.com You can find more details on the parameters [here](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md). -## Nested Index - -When using the index, you can define multiple fields and their nested structure. In the following example, you have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`. 
- -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) - - -doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2') -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] -doc_index.index(index_docs) -``` - -You can use the `search_field` to specify which field to use when performing the vector search. You can use the dunder operator to specify the field defined in the nested data. In the following code, you can perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the `thumbnail` and `video` field: - -```python -# example of find nested and flat index -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) -# find by the youtubevideo tensor -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) -# find by the thumbnail tensor -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) -# find by the video tensor -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) -``` - -To delete nested data, you need to specify the `id`. - -!!! note - You can only delete `Doc` at the top level. 
Deletion of the `Doc` on lower levels is not yet supported. - -```python -# example of deleting nested and flat index -del doc_index[index_docs[6].id] -``` - -Check [here](../docindex#nested-data-with-subindex) for nested data with subindex. - -### Update elements -In order to update a Document inside the index, you only need to reindex it with the updated attributes. -First lets create a schema for our Index -```python -import numpy as np -from docarray import BaseDoc, DocList -from docarray.typing import NdArray -from docarray.index import HnswDocumentIndex -class MyDoc(BaseDoc): - text: str - embedding: NdArray[128] -``` -Now we can instantiate our Index and index some data. +### Database location -```python -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)] -) -index = HnswDocumentIndex[MyDoc]() -index.index(docs) -assert index.num_docs() == 100 -``` +For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you +usually specify a `host` and a `port` instead. -Now we can find relevant documents +In addition to a host and a port, most backends can also take an `index_name`, `table_name`, `collection_name` or similar. +This specifies the name of the index/table/collection that will be created in the database. +You don't have to specify this though: By default, this name will be taken from the name of the Document type that you use as schema. +For example, for `WeaviateDocumentIndex[MyDoc](...)` the data will be stored in a Weaviate Class of name `MyDoc`. -```python -res = index.find(query=docs[0], search_field='tens', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the first version' in doc.text -``` +In any case, if the location does not yet contain any data, we start from a blank slate. 
+If the location already contains data from a previous session, it will be accessible through the Document Index. -and update all of the text of this documents and reindex them -```python -for i, doc in enumerate(docs): - doc.text = f'I am the second version of Document {i}' -index.index(docs) -assert index.num_docs() == 100 -``` +## Nested data and subindex search -When we retrieve them again we can see that their text attribute has been updated accordingly +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. -```python -res = index.find(query=docs[0], search_field='tens', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the second version' in doc.text -``` +Go to the [Nested Data](nested_data.md) section to learn more. diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index 77a911b4599..9b275b67063 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -1,178 +1,279 @@ # In-Memory Document Index -[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all Documents in DocLists in memory. +[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all documents in memory using DocLists. It is a great starting point for small datasets, where you may not want to launch a database server. -For vector search and filtering the InMemoryExactNNIndex utilizes DocArray's [`find()`][docarray.utils.find.find] and -[`filter_docs()`][docarray.utils.filter.filter_docs] functions. 
+For vector search and filtering [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] +utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][docarray.utils.filter.filter_docs] functions. -## Basic usage +!!! note "Production readiness" + [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] is a great starting point + for small- to medium-sized datasets, but it is not battle tested in production. If scalability, uptime, etc. are + important to you, we recommend you eventually transition to one of our database-backed Document Index implementations: + + - [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] + - [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] + - [ElasticDocumentIndex][docarray.index.backends.elastic.ElasticDocIndex] + - [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] + - [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] -To see how to create a [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] instance, add Documents, -perform search, etc. see the [general user guide](./docindex.md). -You can initialize the index as follows: + +## Basic usage +This snippet demonstrates the basic usage of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] to index these documents, +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. 
```python from docarray import BaseDoc, DocList -from docarray.index.backends.in_memory import InMemoryExactNNIndex +from docarray.index import InMemoryExactNNIndex from docarray.typing import NdArray +import numpy as np - +# Define the document schema. class MyDoc(BaseDoc): - tensor: NdArray = None - + title: str + embedding: NdArray[128] -docs = DocList[MyDoc](MyDoc() for _ in range(10)) +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) +# Initialize a new InMemoryExactNNIndex instance and add the documents to the index. doc_index = InMemoryExactNNIndex[MyDoc]() doc_index.index(docs) -# or in one step, create with inserted docs. -doc_index = InMemoryExactNNIndex[MyDoc](docs) +# Perform a vector search. +query = np.ones(128) +retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) ``` -Alternatively, you can pass an `index_file_path` argument to make sure that the index can be restored if persisted from that specific file. +## Initialize + +To create a Document Index, you first need a document class that defines the schema of your index: + ```python -# Save your existing index as a binary file -docs = DocList[MyDoc](MyDoc() for _ in range(10)) +from docarray import BaseDoc +from docarray.index import InMemoryExactNNIndex +from docarray.typing import NdArray -doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin') -doc_index.index(docs) -# or in one step: -doc_index.persist() +class MyDoc(BaseDoc): + embedding: NdArray[128] + text: str -# Initialize a new document index using the saved binary file -new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin') + +db = InMemoryExactNNIndex[MyDoc]() ``` -## Configuration +### Schema definition -This section lays out the configurations and options that are specific to [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. 
+In this code snippet, `InMemoryExactNNIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. -The `DBConfig` of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] contains two entries: -`index_file_path` and `default_column_mapping`, the default mapping from Python types to column configurations. +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). -You can see in the [section below](#field-wise-configurations) how to override configurations for specific fields. -If you want to set configurations globally, i.e. for all vector fields in your Documents, you can do that using `DBConfig` or passing it at `__init__`:: +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + + +### Using a predefined document as schema + +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
+
+You can work around this problem by subclassing the predefined document and adding the dimensionality information:
+
+=== "Using type hint"
+    ```python
+    from docarray.documents import TextDoc
+    from docarray.typing import NdArray
+    from docarray.index import InMemoryExactNNIndex
+
+
+    class MyDoc(TextDoc):
+        embedding: NdArray[128]
+
+
+    db = InMemoryExactNNIndex[MyDoc]()
+    ```
+
+=== "Using Field()"
+    ```python
+    from docarray.documents import TextDoc
+    from docarray.typing import AnyTensor
+    from docarray.index import InMemoryExactNNIndex
+    from pydantic import Field
+
+
+    class MyDoc(TextDoc):
+        embedding: AnyTensor = Field(dim=128)
+
+
+    db = InMemoryExactNNIndex[MyDoc]()
+    ```
+
+Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type.
+
+The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`:
```python
-from collections import defaultdict
-from docarray.typing import AbstractTensor
-new_doc_index = InMemoryExactNNIndex[MyDoc](
-    default_column_config=defaultdict(
-        dict,
-        {
-            AbstractTensor: {'space': 'cosine_sim'},
-        },
-    )
+from docarray import DocList
+
+# data of type TextDoc
+data = DocList[TextDoc](
+    [
+        TextDoc(text='hello world', embedding=np.random.rand(128)),
+        TextDoc(text='hello world', embedding=np.random.rand(128)),
+        TextDoc(text='hello world', embedding=np.random.rand(128)),
+    ]
)
+
+# you can index this into Document Index of type MyDoc
+db.index(data)
```
-This will set the default configuration for all vector fields to the one specified in the example above.
+## Index
-For more information on these settings, see [below](#field-wise-configurations).
+Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: -Fields that are not vector fields (e.g. of type `str` or `int` etc.) do not offer any configuration. +```python +import numpy as np +from docarray import DocList +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] +) -### Field-wise configurations +# index the data +db.index(docs) +``` -For a vector field you can adjust the `space` parameter. It can be one of: +That call to [`index()`][docarray.index.backends.in_memory.InMemoryExactNNIndex.index] stores all Documents in `docs` in the Document Index, +ready to be retrieved in the next step. -- `'cosine_sim'` (default) -- `'euclidean_dist'` -- `'sqeuclidean_dist'` +As you can see, `DocList[MyDoc]` and `InMemoryExactNNIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. -You pass it using the `field: Type = Field(...)` syntax: +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: -```python -from docarray import BaseDoc -from pydantic import Field + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index. 
-class Schema(BaseDoc): - tensor_1: NdArray[100] = Field(space='euclidean_dist') - tensor_2: NdArray[100] = Field(space='sqeuclidean_dist') -``` -In the example above you can see how to configure two different vector fields, with two different sets of settings. +## Vector search -## Nested index +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. -When using the index, you can define multiple fields and their nested structure. In the following example, you have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`. +You can use the [`find()`][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc` +to find similar documents within the Document Index: -```python -import numpy as np -from docarray import BaseDoc -from docarray.index.backends.in_memory import InMemoryExactNNIndex -from docarray.typing import ImageUrl, VideoUrl, AnyTensor -from pydantic import Field +=== "Search by Document" + + ```python + # create a query document + query = MyDoc(embedding=np.random.rand(128), text='query') -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine_sim') + # find similar documents + matches, scores = db.find(query, search_field='embedding', limit=5) + print(f'{matches=}') + print(f'{matches.text=}') + print(f'{scores=}') + ``` -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine_sim') +=== "Search by raw vector" + ```python + # create a query vector + query = np.random.rand(128) -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine_sim') - - -doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]() -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video 
from author {10*i}',
-        thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)),
-        video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)),
-        tensor=np.ones(256),
+    # find similar documents
+    matches, scores = db.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+### Batched search
+
+You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method.
+ +=== "Search by documents" + + ```python + # create some query documents + queries = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) ) - for i in range(8) -] -doc_index.index(index_docs) -``` -## Search Documents + # find similar documents + matches, scores = db.find_batched(queries, search_field='embedding', limit=5) -To search Documents, the `InMemoryExactNNIndex` uses DocArray's [`find`][docarray.utils.find.find] function. + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` -You can use the `search_field` to specify which field to use when performing the vector search. -You can use the dunder operator to specify the field defined in nested data. -In the following code, you can perform vector search on the `tensor` field of the `YouTubeVideoDoc` -or the `tensor` field of the `thumbnail` and `video` fields: +=== "Search by raw vectors" -```python -# find by the youtubevideo tensor -query = parse_obj_as(NdArray, np.ones(256)) -docs, scores = doc_index.find(query, search_field='tensor', limit=3) + ```python + # create some query vectors + query = np.random.rand(3, 128) -# find by the thumbnail tensor -query = parse_obj_as(NdArray, np.ones(64)) -docs, scores = doc_index.find(query, search_field='thumbnail__tensor', limit=3) + # find similar documents + matches, scores = db.find_batched(query, search_field='embedding', limit=5) -# find by the video tensor -query = parse_obj_as(NdArray, np.ones(128)) -docs, scores = doc_index.find(query, search_field='video__tensor', limit=3) -``` + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` + +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. 
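Conceptually, exact batched search is just a vectorized version of the single-query case. The following standalone NumPy sketch (independent of the Document Index API) shows one way to compute the top-k matches for a whole batch of queries at once:

```python
import numpy as np


def batched_top_k(queries: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    # normalize rows so that the dot product equals cosine similarity
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = q @ v.T  # shape: (n_queries, n_vectors)
    # indices of the k most similar stored vectors per query, best first
    return np.argsort(scores, axis=1)[:, ::-1][:, :k]


vectors = np.random.rand(100, 128)  # stand-in for the indexed embeddings
queries = np.random.rand(3, 128)  # one row per query
top_k = batched_top_k(queries, vectors, k=5)  # shape: (3, 5)
```

Because the whole batch is expressed as a single matrix product, this is typically much faster than looping over the queries one by one.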
-## Filter Documents + +## Filter To filter Documents, the `InMemoryExactNNIndex` uses DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. -The query should follow the query language of the DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. +The query should follow the query language of DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. In the following example let's filter for all the books that are cheaper than 29 dollars: @@ -197,41 +298,218 @@ for doc in cheap_books: doc.summary() ``` -
- Output - ```text - 📄 Book : 1f7da15 ... - ╭──────────────────────┬───────────────╮ - │ Attribute │ Value │ - ├──────────────────────┼───────────────┤ - │ title: str │ title 0 │ - │ price: int │ 0 │ - ╰──────────────────────┴───────────────╯ - 📄 Book : 63fd13a ... - ╭──────────────────────┬───────────────╮ - │ Attribute │ Value │ - ├──────────────────────┼───────────────┤ - │ title: str │ title 1 │ - │ price: int │ 10 │ - ╰──────────────────────┴───────────────╯ - 📄 Book : 49b21de ... - ╭──────────────────────┬───────────────╮ - │ Attribute │ Value │ - ├──────────────────────┼───────────────┤ - │ title: str │ title 2 │ - │ price: int │ 20 │ - ╰──────────────────────┴───────────────╯ - ``` -
+## Text search -## Delete Documents +!!! note + The [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] implementation does not support text search. -To delete nested data, you need to specify the `id`. + To see how to perform text search, you can check out other backends that offer support. -!!! note - You can only delete Documents at the top level. Deletion of Documents on lower levels is not yet supported. +In addition to vector similarity search, the Document Index interface offers methods for text search: +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. + + + +## Hybrid search + +Document Index supports atomic operations for vector similarity search, text search and filter search. + +To combine these operations into a single, hybrid search query, you can use the query builder that is accessible +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]: + +```python +# Define the document schema. +class SimpleSchema(BaseDoc): + year: int + price: int + embedding: NdArray[128] + +# Create dummy documents. +docs = DocList[SimpleSchema](SimpleSchema(year=2000-i, price=i, embedding=np.random.rand(128)) for i in range(10)) + +doc_index = InMemoryExactNNIndex[SimpleSchema](docs) + +query = ( + doc_index.build_query() # get empty query object + .filter(filter_query={'year': {'$gt': 1994}}) # pre-filtering + .find(query=np.random.rand(128), search_field='embedding') # add vector similarity search + .filter(filter_query={'price': {'$lte': 3}}) # post-filtering + .build() +) +# execute the combined query and return the results +results = doc_index.execute_query(query) +print(f'{results=}') +``` + +In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search +to obtain a combined set of results. 
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can perform filter and vector search, etc.
+
+
+## Access documents
+
+To retrieve a document from a Document Index you don't necessarily need to perform a fancy search.
+
+You can also access data by the `id` that was assigned to each document:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# access the Documents by id
+doc = db[ids[0]]  # get by single id
+docs = db[ids]  # get by list of ids
+```
+
+
+## Delete documents
+
+In the same way you can access Documents by `id`, you can also delete them:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# delete the Documents by id
+del db[ids[0]]  # del by single id
+del db[ids[1:]]  # del by list of ids
+```
+
+## Update documents
+In order to update a Document inside the index, you only need to re-index it with the updated attributes.
+
+First, let's create a schema for our Document Index:
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import InMemoryExactNNIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now, we can instantiate our Index and add some data:
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = InMemoryExactNNIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Let's retrieve our data and check its content:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+Then, let's update all of the text of these documents and re-index them:
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again, we can see that their text attribute has been updated accordingly:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
+
+
+## Configuration
+
+This section lays out the configurations and options that are specific to [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex].
+
+The `DBConfig` of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] contains two entries:
+`index_file_path` and `default_column_mapping`, the default mapping from Python types to column configurations.
+
+You can see in the [section below](#field-wise-configuration) how to override configurations for specific fields.
+If you want to set configurations globally, i.e. 
for all vector fields in your documents, you can do that using `DBConfig` or by passing it to `__init__`:
 ```python
-# example of deleting nested and flat index
-del doc_index[index_docs[6].id]
+from collections import defaultdict
+from docarray.typing.tensor.abstract_tensor import AbstractTensor
+new_doc_index = InMemoryExactNNIndex[MyDoc](
+    default_column_config=defaultdict(
+        dict,
+        {
+            AbstractTensor: {'space': 'cosine_sim'},
+        },
+    )
+)
 ```
+
+This will set the default configuration for all vector fields to the one specified in the example above.
+
+For more information on these settings, see [below](#field-wise-configurations).
+
+Fields that are not vector fields (e.g. of type `str` or `int` etc.) do not offer any configuration.
+
+
+### Field-wise configurations
+
+For a vector field you can adjust the `space` parameter. It can be one of:
+
+- `'cosine_sim'` (default)
+- `'euclidean_dist'`
+- `'sqeuclidean_dist'`
+
+You pass it using the `field: Type = Field(...)` syntax:
+
+```python
+from docarray import BaseDoc
+from docarray.typing import NdArray
+from pydantic import Field
+
+
+class Schema(BaseDoc):
+    tensor_1: NdArray[100] = Field(space='euclidean_dist')
+    tensor_2: NdArray[100] = Field(space='sqeuclidean_dist')
+```
+
+In the example above you can see how to configure two different vector fields, each with its own settings.
+
+
+### Persist and Load
+You can pass an `index_file_path` argument to persist the index to, and later restore it from, a specific file.
+```python
+doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
+doc_index.index(docs)
+
+doc_index.persist()
+
+# Initialize a new document index using the saved binary file
+new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
+```
+
+
+## Nested data and subindex search
+
+The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`.
+However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. + +Go to the [Nested Data](nested_data.md) section to learn more. \ No newline at end of file diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md new file mode 100644 index 00000000000..4cf9c91c7d5 --- /dev/null +++ b/docs/user_guide/storing/index_milvus.md @@ -0,0 +1,436 @@ +# Milvus Document Index + +!!! note "Install dependencies" + To use [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex], you need to install extra dependencies with the following command: + + ```console + pip install "docarray[milvus]" + ``` + +This is the user guide for the [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex], +focusing on special features and configurations of Milvus. + + +## Basic usage +This snippet demonstrates the basic usage of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] to index these documents, +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. + +!!! note "Single Search Field Requirement" + In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. + This is due to Milvus' configuration, which permits a single vector for each data object. + +```python +from docarray import BaseDoc, DocList +from docarray.index import MilvusDocumentIndex +from docarray.typing import NdArray +from pydantic import Field +import numpy as np + +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] = Field(is_embedding=True) + +# Create dummy documents. 
+docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) + +# Initialize a new MilvusDocumentIndex instance and add the documents to the index. +doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_1') +doc_index.index(docs) + +# Perform a vector search. +query = np.ones(128) +retrieved_docs = doc_index.find(query, limit=10) +``` + + +## Initialize + +First of all, you need to install and run Milvus. Download `docker-compose.yml` with the following command: + +```shell +wget https://github.com/milvus-io/milvus/releases/download/v2.2.11/milvus-standalone-docker-compose.yml -O docker-compose.yml +``` + +And start Milvus by running: +```shell +sudo docker-compose up -d +``` + +Learn more on [Milvus documentation](https://milvus.io/docs/install_standalone-docker.md). + +Next, you can create a [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] instance using: + +```python +from docarray import BaseDoc +from docarray.index import MilvusDocumentIndex +from docarray.typing import NdArray +from pydantic import Field + + +class MyDoc(BaseDoc): + embedding: NdArray[128] = Field(is_embedding=True) + text: str + + +doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_2') +``` + +### Schema definition +In this code snippet, `MilvusDocumentIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. + +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). + +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! 
note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + +### Using a predefined document as schema + +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! + +You can work around this problem by subclassing the predefined document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import MilvusDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: NdArray[128] = Field(is_embedding=True) + + + doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_3') + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import MilvusDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128, is_embedding=True) + + + doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_4') + ``` + + +## Index + +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] = 
Field(is_embedding=True) + +doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_5') + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), title=f'text {i}') for i in range(100)] +) + +# index the data +doc_index.index(docs) +``` + +That call to [`index()`][docarray.index.backends.milvus.MilvusDocumentIndex.index] stores all Documents in `docs` in the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `MilvusDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. + + +## Vector search + +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
+ +You can perform a similarity search and find relevant documents by passing `MyDoc` or a raw vector to +the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: + +=== "Search by Document" + + ```python + # create a query document + query = MyDoc(embedding=np.random.rand(128), title='query') + + # find similar documents + matches, scores = doc_index.find(query, limit=5) + + print(f'{matches=}') + print(f'{matches.title=}') + print(f'{scores=}') + ``` + +=== "Search by raw vector" + + ```python + # create a query vector + query = np.random.rand(128) + + # find similar documents + matches, scores = doc_index.find(query, limit=5) + + print(f'{matches=}') + print(f'{matches.title=}') + print(f'{scores=}') + ``` + +The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +matching documents and their associated similarity scores. + +When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. + +How these scores are calculated depends on the backend, and can usually be [configured](#configuration). + +### Batched search + +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. 
+
+=== "Search by documents"
+
+    ```python
+    # create some query documents
+    queries = DocList[MyDoc](
+        MyDoc(embedding=np.random.rand(128), title=f'query {i}') for i in range(3)
+    )
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(queries, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].title=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(query, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].title=}')
+    print(f'{scores=}')
+    ```
+
+The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
+
+
+## Filter
+
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The query should follow the [query language of Milvus](https://milvus.io/docs/boolean.md).
+
+In the following example let's filter for all the books that are cheaper than 29 dollars:
+
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.index import MilvusDocumentIndex
+from docarray.typing import NdArray
+from pydantic import Field
+
+
+class Book(BaseDoc):
+    price: int
+    embedding: NdArray[10] = Field(is_embedding=True)
+
+
+books = DocList[Book]([Book(price=i * 10, embedding=np.random.rand(10)) for i in range(10)])
+book_index = MilvusDocumentIndex[Book](index_name='tmp_index_6')
+book_index.index(books)
+
+# filter for books that are cheaper than 29 dollars
+query = 'price < 29'
+cheap_books = book_index.filter(filter_query=query)
+
+assert len(cheap_books) == 3
+for doc in cheap_books:
+    doc.summary()
+```
+
+## Text search
+
+!!! note
+    The [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] implementation does not support text search.
+
+    To see how to perform text search, you can check out other backends that offer support.
+
+For backends that support it, the Document Index interface offers methods for text search:
+[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched].
+
+
+
+## Hybrid search
+
+Document Index supports atomic operations for vector similarity search, text search and filter search.
+
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]:
+
+```python
+# Define the document schema.
+class SimpleSchema(BaseDoc):
+    price: int
+    embedding: NdArray[128] = Field(is_embedding=True)
+
+
+# Create dummy documents.
+docs = DocList[SimpleSchema](SimpleSchema(price=i, embedding=np.random.rand(128)) for i in range(10))
+
+doc_index = MilvusDocumentIndex[SimpleSchema](index_name='tmp_index_7')
+doc_index.index(docs)
+
+query = (
+    doc_index.build_query()  # get empty query object
+    .find(query=np.random.rand(128))  # add vector similarity search
+    .filter(filter_query='price < 3')  # add filter search
+    .build()
+)
+# execute the combined query and return the results
+results = doc_index.execute_query(query)
+print(f'{results=}')
+```
+
+In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
+to obtain a combined set of results.
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can perform filter and vector search, etc.
+
+
+## Access documents
+
+To retrieve a document from a Document Index you don't necessarily need to perform a fancy search.
+
+You can also access data by the `id` that was assigned to each document:
+
+```python
+# prepare some data
+data = DocList[SimpleSchema](
+    SimpleSchema(embedding=np.random.rand(128), price=i) for i in range(3)
+)
+
+# remember the document ids and index the data
+ids = data.id
+doc_index.index(data)
+
+# access the documents by id
+doc = doc_index[ids[0]]  # get by single id
+docs = doc_index[ids]  # get by list of ids
+```
+
+
+## Delete documents
+
+In the same way you can access documents by `id`, you can also delete them:
+
+```python
+# prepare some data
+data = DocList[SimpleSchema](
+    SimpleSchema(embedding=np.random.rand(128), price=i) for i in range(3)
+)
+
+# remember the document ids and index the data
+ids = data.id
+doc_index.index(data)
+
+# delete the documents by id
+del doc_index[ids[0]]  # del by single id
+del doc_index[ids[1:]]  # del by list of ids
+```
+
+
+## Configuration
+
+This section lays out the configurations and options that are specific to [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex].
+
+### DBConfig
+
+The following configs can be set in `DBConfig`:
+
+| Name | Description | Default |
+|---|---|---|
+| `host` | The host address for the Milvus server | `localhost` |
+| `port` | The port number for the Milvus server | `19530` |
+| `index_name` | The name of the index in the Milvus database | `None`. Data will be stored in an index named after the Document type used as schema. |
+| `user` | The username for the Milvus server | `None` |
+| `password` | The password for the Milvus server | `None` |
+| `token` | Token for secure connection | `''` |
+| `collection_description` | Description of the collection in the database | `''` |
+| `default_column_config` | The default configurations for every column type | `dict` |
+
+You can pass any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object.
+
+
+### Field-wise configuration
+
+`default_column_config` holds the default configuration for every column type. Since there are many column types in Milvus, you can also change the column configuration when defining the schema:
+
+```python
+class SimpleDoc(BaseDoc):
+    tensor: NdArray[128] = Field(is_embedding=True, index_type='IVF_FLAT', metric_type='L2')
+
+
+doc_index = MilvusDocumentIndex[SimpleDoc](index_name='tmp_index_10')
+```
+
+
+### RuntimeConfig
+
+The `RuntimeConfig` dataclass of `MilvusDocumentIndex` contains the `batch_size` used for index/get/del operations.
+You can change `batch_size` in the following way:
+
+```python
+doc_index = MilvusDocumentIndex[SimpleDoc]()
+doc_index.configure(MilvusDocumentIndex.RuntimeConfig(batch_size=128))
+```
+
+You can pass the above as a keyword argument to the `configure()` method or pass an entire configuration object.
+
+
+## Nested data and subindex search
+
+The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`.
+However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document
+contains a `DocList` of other documents.
+
+Go to the [Nested Data](nested_data.md) section to learn more.
\ No newline at end of file diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index 266b5695d1e..71770e45982 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -10,138 +10,499 @@ The following is a starter script for using the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex], based on the [Qdrant](https://qdrant.tech/) vector search engine. -For general usage of a Document Index, see the [general user guide](./docindex.md#document-index). -!!! tip "See all configuration options" - To see all configuration options for the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex], - you can do the following: +## Basic usage +This snippet demonstrates the basic usage of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] to index these documents, +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. - ```python - from docarray.index import QdrantDocumentIndex +```python +from docarray import BaseDoc, DocList +from docarray.index import QdrantDocumentIndex +from docarray.typing import NdArray +import numpy as np - # the following can be passed to the __init__() method - db_config = QdrantDocumentIndex.DBConfig() - print(db_config) # shows default values +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] - # the following can be passed to the configure() method - runtime_config = QdrantDocumentIndex.RuntimeConfig() - print(runtime_config) # shows default values - ``` - - Note that the collection_name from the DBConfig is an Optional[str] with None as default value. 
This is because - the QdrantDocumentIndex will take the name the Document type that you use as schema. For example, for QdrantDocumentIndex[MyDoc](...) - the data will be stored in a collection name MyDoc if no specific collection_name is passed in the DBConfig. +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) -```python -import numpy as np +# Initialize a new QdrantDocumentIndex instance and add the documents to the index. +doc_index = QdrantDocumentIndex[MyDoc](host='localhost') +doc_index.index(docs) -from typing import Optional +# Perform a vector search. +query = np.ones(128) +retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) +``` -from docarray import BaseDoc -from docarray.index import QdrantDocumentIndex -from docarray.typing import NdArray +## Initialize -from qdrant_client.http import models +You can initialize [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] in three different ways: -class MyDocument(BaseDoc): - title: str - title_embedding: NdArray[786] - image_path: Optional[str] - image_embedding: NdArray[512] +**Connecting to a local Qdrant instance running as a Docker container** +You can use docker-compose to create a local Qdrant service with the following `docker-compose.yml`. 
-# Creating an in-memory Qdrant document index -qdrant_config = QdrantDocumentIndex.DBConfig(location=":memory:") -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) +```yaml +version: '3.8' + +services: + qdrant: + image: qdrant/qdrant:v1.1.2 + ports: + - "6333:6333" + - "6334:6334" + ulimits: # Only required for tests, as there are a lot of collections created + nofile: + soft: 65535 + hard: 65535 +``` + +Run the following command in the folder of the above `docker-compose.yml` to start the service: + +```bash +docker-compose up +``` + +Next, you can create a [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] instance using: + +```python +qdrant_config = QdrantDocumentIndex.DBConfig('localhost') +doc_index = QdrantDocumentIndex[MyDoc](qdrant_config) + +# or just +doc_index = QdrantDocumentIndex[MyDoc](host='localhost') +``` -# Connecting to a local Qdrant instance running as a Docker container -qdrant_config = QdrantDocumentIndex.DBConfig("http://localhost:6333") -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) -# Connecting to Qdrant Cloud service +**Creating an in-memory Qdrant document index** +```python +qdrant_config = QdrantDocumentIndex.DBConfig(location=":memory:") +doc_index = QdrantDocumentIndex[MyDoc](qdrant_config) +``` + +**Connecting to Qdrant Cloud service** +```python qdrant_config = QdrantDocumentIndex.DBConfig( "https://YOUR-CLUSTER-URL.aws.cloud.qdrant.io", api_key="", ) -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) +doc_index = QdrantDocumentIndex[MyDoc](qdrant_config) +``` + +### Schema definition +In this code snippet, `QdrantDocumentIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. + +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). 
+ +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + +### Using a predefined document as schema + +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
+ +You can work around this problem by subclassing the predefined document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import QdrantDocumentIndex + + + class MyDoc(TextDoc): + embedding: NdArray[128] + + + doc_index = QdrantDocumentIndex[MyDoc](host='localhost') + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import QdrantDocumentIndex + from pydantic import Field -# Indexing the documents -doc_index.index( + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128) + + + doc_index = QdrantDocumentIndex[MyDoc](host='localhost') + ``` + +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. + +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( [ - MyDocument( - title=f"My document {i}", - title_embedding=np.random.random(786), - image_path=None, - image_embedding=np.random.random(512), - ) - for i in range(100) + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), ] ) -# Performing a vector search only -results = doc_index.find( - query=np.random.random(512), - search_field="image_embedding", - limit=3, -) +# you can index this into Document Index of type MyDoc +doc_index.index(data) +``` -# Connecting to a local Qdrant instance with Scalar Quantization enabled, -# and using non-default collection name to store the datapoints -qdrant_config = QdrantDocumentIndex.DBConfig( - "http://localhost:6333", - collection_name="another_collection", - quantization_config=models.ScalarQuantization( - scalar=models.ScalarQuantizationConfig( - type=models.ScalarType.INT8, - quantile=0.99, - always_ram=True, - ), - ), -) -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) -# Indexing the documents -doc_index.index( - [ - MyDocument( - title=f"My document {i}", - title_embedding=np.random.random(786), - image_path=None, - image_embedding=np.random.random(512), - ) - for i in range(100) - ] +## Index + +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] ) -# Text lookup, without vector search. 
Using the Qdrant filtering mechanisms: -# https://qdrant.tech/documentation/filtering/ -results = doc_index.filter( - filter_query=models.Filter( +# index the data +doc_index.index(docs) +``` + +That call to `index()` stores all documents in `docs` in the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `QdrantDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index. + + +## Vector search + +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
+
+You can perform a similarity search and find relevant documents by passing `MyDoc` or a raw vector to
+the [`find()`][docarray.index.abstract.BaseDocIndex.find] method:
+
+=== "Search by Document"
+
+    ```python
+    # create a query document
+    query = MyDoc(embedding=np.random.rand(128), text='query')
+
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+### Batched search
+
+You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method.
+ +=== "Search by documents" + + ```python + # create some query documents + queries = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) + ) + + # find similar documents + matches, scores = doc_index.find_batched(queries, search_field='embedding', limit=5) + + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` + +=== "Search by raw vectors" + + ```python + # create some query vectors + query = np.random.rand(3, 128) + + # find similar documents + matches, scores = doc_index.find_batched(query, search_field='embedding', limit=5) + + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` + +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. + + +## Filter + +You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. +The query should follow the [query language of Qdrant](https://qdrant.tech/documentation/concepts/filtering/). 
+ +In the following example let's filter for all the books that are cheaper than 29 dollars: + +```python +from docarray import BaseDoc, DocList +from qdrant_client.http import models as rest + + +class Book(BaseDoc): + title: str + price: int + + +books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)]) +book_index = QdrantDocumentIndex[Book]() +book_index.index(books) + +# filter for books that are cheaper than 29 dollars +query = rest.Filter( + must=[rest.FieldCondition(key='price', range=rest.Range(lt=29))] + ) +cheap_books = book_index.filter(filter_query=query) + +assert len(cheap_books) == 3 +for doc in cheap_books: + doc.summary() +``` + +## Text search + +In addition to vector similarity search, the Document Index interface offers methods for text search: +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. + +You can use text search directly on the field of type `str`: + +```python +class NewsDoc(BaseDoc): + text: str + + +doc_index = QdrantDocumentIndex[NewsDoc](host='localhost') +index_docs = [ + NewsDoc(id='0', text='this is a news for sport'), + NewsDoc(id='1', text='this is a news for finance'), + NewsDoc(id='2', text='this is another news for sport'), +] +doc_index.index(index_docs) +query = 'finance' + +# search with text +docs, scores = doc_index.text_search(query, search_field='text') +``` + + +## Hybrid search + +Document Index supports atomic operations for vector similarity search, text search and filter search. 
+
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]:
+
+For example, you can build a hybrid search query that performs range filtering, vector search and text search:
+
+```python
+class SimpleDoc(BaseDoc):
+    tens: NdArray[10]
+    num: int
+    text: str
+
+
+doc_index = QdrantDocumentIndex[SimpleDoc](host='localhost')
+index_docs = [
+    SimpleDoc(id=f'{i}', tens=np.ones(10) * i, num=int(i / 2), text=f'Lorem ipsum {int(i/2)}')
+    for i in range(10)
+]
+doc_index.index(index_docs)
+
+find_query = np.ones(10)
+text_search_query = 'ipsum 1'
+filter_query = rest.Filter(
+    must=[
+        rest.FieldCondition(
+            key='num',
+            range=rest.Range(
+                gte=1,
+                lt=5,
+            ),
+        )
+    ]
+)
+
+query = (
+    doc_index.build_query()
+    .find(find_query, search_field='tens')
+    .text_search(text_search_query, search_field='text')
+    .filter(filter_query)
+    .build(limit=5)
+)
+
+docs = doc_index.execute_query(query)
+```
+
+
+## Access documents
+
+To access a document, you need to specify its `id`. You can also pass a list of `id`s to access multiple documents.
+
+```python
+# access a single Doc
+doc_index[index_docs[0].id]
+
+# access multiple Docs
+doc_index[index_docs[0].id, index_docs[1].id]
+```
+
+## Delete documents
+
+To delete documents, use the built-in function `del` with the `id` of the documents that you want to delete.
+You can also pass a list of `id`s to delete multiple documents.
+
+```python
+# delete a single Doc
+del doc_index[index_docs[0].id]
+
+# delete multiple Docs
+del doc_index[index_docs[1].id, index_docs[2].id]
+```
+
+## Update documents
+In order to update a Document inside the index, you only need to re-index it with the updated attributes.
+
+First, let's create a schema for our Document Index:
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import QdrantDocumentIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now, we can instantiate our Index and add some data:
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = QdrantDocumentIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Let's retrieve our data and check its content:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+Then, let's update all of the text of these documents and re-index them:
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again we can see that their text attribute has been updated accordingly:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
+
+
+## Configuration
+
+!!! tip "See all configuration options"
+    To see all configuration options for the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex], you can do the following:
+
+```python
+from docarray.index import QdrantDocumentIndex
+
+# the following can be passed to the __init__() method
+db_config = QdrantDocumentIndex.DBConfig()
+print(db_config)  # shows default values
+
+# the following can be passed to the configure() method
+runtime_config = QdrantDocumentIndex.RuntimeConfig()
+print(runtime_config)  # shows default values
+```
+
+Note that `collection_name` in the `DBConfig` is an `Optional[str]` with `None` as the default value. This is because
+`QdrantDocumentIndex` takes the name of the Document type that you use as schema. For example, for `QdrantDocumentIndex[MyDoc](...)`
+the data will be stored in a collection named `MyDoc` if no specific `collection_name` is passed in the `DBConfig`.
+
+
+## Nested data and subindex search
+
+The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`.
+However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document
+contains a `DocList` of other documents.
+
+Go to the [Nested Data](nested_data.md) section to learn more.
diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md
new file mode 100644
index 00000000000..4e6522d1195
--- /dev/null
+++ b/docs/user_guide/storing/index_redis.md
@@ -0,0 +1,507 @@
+# Redis Document Index
+
+!!! note "Install dependencies"
+    To use [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex], you need to install extra dependencies with the following command:
+
+    ```console
+    pip install "docarray[redis]"
+    ```
+
+This is the user guide for the [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex],
+focusing on special features and configurations of Redis.
+ + +## Basic usage +This snippet demonstrates the basic usage of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] to index these documents, +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. + +```python +from docarray import BaseDoc, DocList +from docarray.index import RedisDocumentIndex +from docarray.typing import NdArray +import numpy as np + +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] + +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) + +# Initialize a new RedisDocumentIndex instance and add the documents to the index. +doc_index = RedisDocumentIndex[MyDoc](host='localhost') +doc_index.index(docs) + +# Perform a vector search. +query = np.ones(128) +retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) +``` + + +## Initialize + +Before initializing [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex], +make sure that you have a Redis service that you can connect to. + +You can create a local Redis service with the following command: + +```shell +docker run --name redis-stack-server -p 6379:6379 -d redis/redis-stack-server:7.2.0-RC2 +``` +Next, you can create [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex]: +```python +from docarray import BaseDoc +from docarray.index import RedisDocumentIndex +from docarray.typing import NdArray + + +class MyDoc(BaseDoc): + embedding: NdArray[128] + text: str + + +doc_index = RedisDocumentIndex[MyDoc](host='localhost') +``` + + +### Schema definition +In this code snippet, `RedisDocumentIndex` takes a schema of the form of `MyDoc`. 
+The Document Index then _creates a column for each field in `MyDoc`_. + +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). + +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + + +### Using a predefined document as schema + +DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
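For instance, the following sketch (using Redis as the example backend) shows the problematic pattern:

```python
from docarray.documents import TextDoc
from docarray.index import RedisDocumentIndex

# TextDoc's `embedding` field carries no dimensionality information,
# so the backend cannot properly set up a vector index for it.
# Depending on the backend, this raises an exception or silently
# skips building the ANN index:
doc_index = RedisDocumentIndex[TextDoc]()  # don't do this
```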
+ +You can work around this problem by subclassing the predefined document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import RedisDocumentIndex + + + class MyDoc(TextDoc): + embedding: NdArray[128] + + + doc_index = RedisDocumentIndex[MyDoc]() + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import RedisDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128) + + + doc_index = RedisDocumentIndex[MyDoc]() + ``` + +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. + +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) + +# you can index this into Document Index of type MyDoc +doc_index.index(data) +``` + +## Index + +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] +) + +# index the data +doc_index.index(docs) +``` + +That call to [`index()`][docarray.index.backends.redis.RedisDocumentIndex.index] stores all Documents in `docs` in the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `RedisDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. 
+
+
+## Vector search
+
+Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.
+
+You can find relevant documents by passing `MyDoc` or a raw vector to
+the [`find()`][docarray.index.abstract.BaseDocIndex.find] method:
+
+=== "Search by Document"
+
+    ```python
+    # create a query document
+    query = MyDoc(embedding=np.random.rand(128), text='query')
+
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
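For intuition, here is a minimal, self-contained NumPy sketch (not any backend's actual implementation) of what an exact nearest-neighbor search with cosine similarity computes:

```python
import numpy as np


def cosine_top_k(query, vectors, k=5):
    """Return indices and scores of the k rows most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q  # cosine similarity of every stored vector to the query
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return top, scores[top]


vectors = np.random.default_rng(42).random((100, 128))
idx, scores = cosine_top_k(vectors[0], vectors, k=5)
# a vector drawn from the index is its own best match, with score ~1.0
assert idx[0] == 0 and np.isclose(scores[0], 1.0)
```

Real backends replace this brute-force scan with approximate indexes such as HNSW, but the returned documents-and-scores pairs follow the same idea.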
+
+### Batched search
+
+You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method.
+
+=== "Search by documents"
+
+    ```python
+    # create some query documents
+    queries = DocList[MyDoc](
+        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+    )
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(queries, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
+
+
+## Filter
+
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The query should follow the [query language of Redis](https://redis.io/docs/interact/search-and-query/query/).
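For numeric fields, a Redis range filter has the form `@field:[min max]`: bounds are inclusive, `-inf` and `+inf` are allowed, and prefixing a bound with `(` makes it exclusive. As a tiny pure-Python sketch of these matching semantics (not Redis's implementation):

```python
def matches_range(value: float, lo: float, hi: float) -> bool:
    # inclusive bounds, mirroring Redis's @field:[lo hi] semantics
    return lo <= value <= hi


prices = [i * 10 for i in range(10)]  # 0, 10, ..., 90
cheap = [p for p in prices if matches_range(p, float('-inf'), 29)]
assert cheap == [0, 10, 20]
```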
+ +In the following example let's filter for all the books that are cheaper than 29 dollars: + +```python +from docarray import BaseDoc, DocList + + +class Book(BaseDoc): + title: str + price: int + + +books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)]) +book_index = RedisDocumentIndex[Book]() +book_index.index(books) + +# filter for books that are cheaper than 29 dollars +query = '@price:[-inf 29]' +cheap_books = book_index.filter(filter_query=query) + +assert len(cheap_books) == 3 +for doc in cheap_books: + doc.summary() +``` + +## Text search + +In addition to vector similarity search, the Document Index interface offers methods for text search: +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. + +You can use text search directly on the field of type `str`: + +```python +class NewsDoc(BaseDoc): + text: str + + +doc_index = RedisDocumentIndex[NewsDoc]() +index_docs = [ + NewsDoc(id='0', text='this is a news for sport'), + NewsDoc(id='1', text='this is a news for finance'), + NewsDoc(id='2', text='this is another news for sport'), +] +doc_index.index(index_docs) +query = 'finance' + +# search with text +docs, scores = doc_index.text_search(query, search_field='text') +``` + +## Hybrid search + +Document Index supports atomic operations for vector similarity search, text search and filter search. + +To combine these operations into a single, hybrid search query, you can use the query builder that is accessible +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]: + +```python +# Define the document schema. +class SimpleSchema(BaseDoc): + price: int + embedding: NdArray[128] + +# Create dummy documents. 
+docs = DocList[SimpleSchema](SimpleSchema(price=i, embedding=np.random.rand(128)) for i in range(10))
+
+doc_index = RedisDocumentIndex[SimpleSchema](host='localhost')
+doc_index.index(docs)
+
+query = (
+    doc_index.build_query()  # get empty query object
+    .find(query=np.random.rand(128), search_field='embedding')  # add vector similarity search
+    .filter(filter_query='@price:[-inf 3]')  # add filter search
+    .build()
+)
+# execute the combined query and return the results
+results = doc_index.execute_query(query)
+print(f'{results=}')
+```
+
+In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
+to obtain a combined set of results.
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can perform filter and vector search, etc.
+
+
+## Access documents
+
+To retrieve a document from a Document Index you don't necessarily need to perform a fancy search.
+
+You can also access data by the `id` that was assigned to each document:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+doc_index.index(data)
+
+# access the Documents by id
+doc = doc_index[ids[0]]  # get by single id
+docs = doc_index[ids]  # get by list of ids
+```
+
+
+## Delete documents
+
+In the same way you can access Documents by `id`, you can also delete them:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+doc_index.index(data)
+
+# delete the Documents by id
+del doc_index[ids[0]]  # del by single id
+del doc_index[ids[1:]]  # del by list of ids
+```
+
+## Update documents
+In order to update a Document inside the index, you only need to re-index it with the updated attributes.
+
+First, let's create a schema for our Document Index:
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import RedisDocumentIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now, we can instantiate our Index and add some data:
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = RedisDocumentIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Let's retrieve our data and check its content:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+Then, let's update all of the text of these documents and re-index them:
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again we can see that their text attribute has been updated accordingly:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
+
+
+## Configuration
+
+This section lays out the configurations and options that are specific to [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex].
+
+### DBConfig
+
+The following configs can be set in `DBConfig`:
+
+| Name | Description | Default |
+|-------------------------|----------------------------------------------------|-------------------------------------------------------------------------------------|
+| `host` | The host address for the Redis server | `localhost` |
+| `port` | The port number for the Redis server | `6379` |
+| `index_name` | The name of the index in the Redis database | `None`. Data will be stored in an index named after the Document type used as schema. |
+| `username` | The username for the Redis server | `None` |
+| `password` | The password for the Redis server | `None` |
+| `text_scorer` | The method for [scoring text](https://redis.io/docs/interact/search-and-query/advanced-concepts/scoring/) during text search | `BM25` |
+| `default_column_config` | The default configuration for every column type | dict |
+
+You can pass any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object.
+
+
+### Field-wise configuration
+
+`default_column_config` holds the default configuration for every column type. Since there are many column types in Redis, you can also consider changing the column config when defining the schema.
+
+```python
+class SimpleDoc(BaseDoc):
+    tensor: NdArray[128] = Field(algorithm='FLAT', distance='COSINE')
+
+
+doc_index = RedisDocumentIndex[SimpleDoc]()
+```
+
+
+### RuntimeConfig
+
+The `RuntimeConfig` dataclass of `RedisDocumentIndex` consists of a single parameter, `batch_size`, which is used to batch index/get/del operations.
+You can change `batch_size` in the following way:
+
+```python
+doc_index = RedisDocumentIndex[SimpleDoc]()
+doc_index.configure(RedisDocumentIndex.RuntimeConfig(batch_size=128))
+```
+
+You can pass the above as keyword arguments to the `configure()` method or pass an entire configuration object.
+
+
+## Nested data and subindex search
+
+The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`.
+However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document
+contains a `DocList` of other documents.
+
+Go to the [Nested Data](nested_data.md) section to learn more.
\ No newline at end of file
diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md
index e4663d53d15..029c86de377 100644
--- a/docs/user_guide/storing/index_weaviate.md
+++ b/docs/user_guide/storing/index_weaviate.md
@@ -1,17 +1,3 @@
----
-jupyter:
-  jupytext:
-    text_representation:
-      extension: .md
-      format_name: markdown
-      format_version: '1.3'
-    jupytext_version: 1.14.5
-  kernelspec:
-    display_name: Python 3 (ipykernel)
-    language: python
-    name: python3
----
-
 # Weaviate Document Index
 
 !!! note "Install dependencies"
@@ -24,15 +10,50 @@ jupyter:
 This is the user guide for the [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex],
 focusing on special features and configurations of Weaviate.
 
-For general usage of a Document Index, see the [general user guide](./docindex.md).
 
-# 1. 
Start Weaviate service +## Basic usage +This snippet demonstrates the basic usage of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] to index these documents, +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. + +!!! note "Single Search Field Requirement" + In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. + This is due to Weaviate's configuration, which permits a single vector for each data object. + +```python +from docarray import BaseDoc, DocList +from docarray.index import WeaviateDocumentIndex +from docarray.typing import NdArray +from pydantic import Field +import numpy as np + +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] = Field(is_embedding=True) + +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) + +# Initialize a new WeaviateDocumentIndex instance and add the documents to the index. +doc_index = WeaviateDocumentIndex[MyDoc]() +doc_index.index(docs) + +# Perform a vector search. +query = np.ones(128) +retrieved_docs = doc_index.find(query, limit=10) +``` + + +## Initialize + + +### Start Weaviate service To use [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex], DocArray needs to hook into a running Weaviate service. There are multiple ways to start a Weaviate instance, depending on your use case. - -## 1.1. 
Options - Overview +**Options - Overview** | Instance type | General use case | Configurability | Notes | | ----- | ----- | ----- | ----- | @@ -41,15 +62,15 @@ There are multiple ways to start a Weaviate instance, depending on your use case | **Docker-Compose** | Development | Yes | **Recommended for development + customizability** | | **Kubernetes** | Production | Yes | | -## 1.2. Instantiation instructions +### Instantiation instructions -### 1.2.1. WCS (managed instance) +**WCS (managed instance)** Go to the [WCS console](https://console.weaviate.cloud) and create an instance using the visual interface, following [this guide](https://weaviate.io/developers/wcs/guides/create-instance). Weaviate instances on WCS come pre-configured, so no further configuration is required. -### 1.2.2. Docker-Compose (self-managed) +**Docker-Compose (self-managed)** Get a configuration file (`docker-compose.yaml`). You can build it using [this interface](https://weaviate.io/developers/weaviate/installation/docker-compose), or download it directly with: @@ -58,14 +79,12 @@ curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/ ``` Where `v` is the actual version, such as `v1.18.3`. - ```bash curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?modules=standalone&runtime=docker-compose&weaviate_version=v1.18.3" ``` - -#### 1.2.2.1 Start up Weaviate with Docker-Compose +**Start up Weaviate with Docker-Compose** Then you can start up Weaviate by running from a shell: @@ -73,7 +92,7 @@ Then you can start up Weaviate by running from a shell: docker-compose up -d ``` -#### 1.2.2.2 Shut down Weaviate +**Shut down Weaviate** Then you can shut down Weaviate by running from a shell: @@ -81,19 +100,17 @@ Then you can shut down Weaviate by running from a shell: docker-compose down ``` -#### Notes +**Notes** Unless data persistence or backups are set up, shutting down the Docker instance will remove all its data. 
See documentation on [Persistent volume](https://weaviate.io/developers/weaviate/installation/docker-compose#persistent-volume) and [Backups](https://weaviate.io/developers/weaviate/configuration/backups) to prevent this if persistence is desired. - ```bash docker-compose up -d ``` - -### 1.2.3. Embedded Weaviate (from the application) +**Embedded Weaviate (from the application)** With Embedded Weaviate, Weaviate database server can be launched from the client, using: @@ -103,7 +120,7 @@ from docarray.index.backends.weaviate import EmbeddedOptions embedded_options = EmbeddedOptions() ``` -## 1.3. Authentication +### Authentication Weaviate offers [multiple authentication options](https://weaviate.io/developers/weaviate/configuration/authentication), as well as [authorization options](https://weaviate.io/developers/weaviate/configuration/authorization). @@ -116,9 +133,8 @@ With DocArray, you can use any of: To access a Weaviate instance. In general, **Weaviate recommends using API-key based authentication** for balance between security and ease of use. You can create, for example, read-only keys to distribute to certain users, while providing read/write keys to administrators. See below for examples of connection to Weaviate for each scenario. - -## 1.4. 
Connect to Weaviate +### Connect to Weaviate ```python from docarray.index.backends.weaviate import WeaviateDocumentIndex @@ -126,7 +142,6 @@ from docarray.index.backends.weaviate import WeaviateDocumentIndex ### Public instance - If using Embedded Weaviate: ```python @@ -136,7 +151,6 @@ dbconfig = WeaviateDocumentIndex.DBConfig(embedded_options=EmbeddedOptions()) ``` For all other options: - ```python dbconfig = WeaviateDocumentIndex.DBConfig( @@ -144,8 +158,7 @@ dbconfig = WeaviateDocumentIndex.DBConfig( ) # Replace with your endpoint) ``` - -### OIDC with username + password +**OIDC with username + password** To authenticate against a Weaviate instance with OIDC username & password: @@ -156,7 +169,6 @@ dbconfig = WeaviateDocumentIndex.DBConfig( host="http://localhost:8080", # Replace with your endpoint ) ``` - ```python # dbconfig = WeaviateDocumentIndex.DBConfig( @@ -166,8 +178,7 @@ dbconfig = WeaviateDocumentIndex.DBConfig( # ) ``` - -### API key-based authentication +**API key-based authentication** To authenticate against a Weaviate instance an API key: @@ -177,113 +188,101 @@ dbconfig = WeaviateDocumentIndex.DBConfig( host="http://localhost:8080", # Replace with your endpoint ) ``` - - - - -# 2. Configure Weaviate - -## 2.1. Overview -**WCS instances come pre-configured**, and as such additional settings are not configurable outside of those chosen at creation, such as whether to enable authentication. - -For other cases, such as **Docker-Compose deployment**, its settings can be modified through the configuration file, such as the `docker-compose.yaml` file. 
+### Create an instance +Let's connect to a local Weaviate service and instantiate a `WeaviateDocumentIndex` instance: +```python +dbconfig = WeaviateDocumentIndex.DBConfig( + host="http://localhost:8080" +) +doc_index = WeaviateDocumentIndex[MyDoc](db_config=dbconfig) +``` -Some of the more commonly used settings include: +### Schema definition +In this code snippet, `WeaviateDocumentIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. -- [Persistent volume](https://weaviate.io/developers/weaviate/installation/docker-compose#persistent-volume): Set up data persistence so that data from inside the Docker container is not lost on shutdown -- [Enabling a multi-node setup](https://weaviate.io/developers/weaviate/installation/docker-compose#multi-node-setup) -- [Backups](https://weaviate.io/developers/weaviate/configuration/backups) -- [Authentication (server-side)](https://weaviate.io/developers/weaviate/configuration/authentication) -- [Modules enabled](https://weaviate.io/developers/weaviate/configuration/modules#enable-modules) +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). -And a list of environment variables is [available on this page](https://weaviate.io/developers/weaviate/config-refs/env-vars). +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. -## 2.2. DocArray instantiation configuration options +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. 
No need to convert your tensors to NumPy arrays manually! -Additionally, you can specify the below settings when you instantiate a configuration object in DocArray. +### Using a predefined document as schema -| name | type | explanation | default | example | -| ---- | ---- | ----------- |------------------------------------------------------------------------| ------- | -| **Category: General** | -| host | str | Weaviate instance url | http://localhost:8080 | -| **Category: Authentication** | -| username | str | Username known to the specified authentication provider (e.g. WCS) | None | `jp@weaviate.io` | -| password | str | Corresponding password | None | `p@ssw0rd` | -| auth_api_key | str | API key known to the Weaviate instance | None | `mys3cretk3y` | -| **Category: Data schema** | -| index_name | str | Class name to use to store the document| The document class name, e.g. `MyDoc` for `WeaviateDocumentIndex[MyDoc]` | `Document` | -| **Category: Embedded Weaviate** | -| embedded_options| EmbeddedOptions | Options for embedded weaviate | None | +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. -The type `EmbeddedOptions` can be specified as described [here](https://weaviate.io/developers/weaviate/installation/embedded#embedded-options) +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! -## 2.3. 
Runtime configuration +You can work around this problem by subclassing the predefined document and adding the dimensionality information: -Weaviate strongly recommends using batches to perform bulk operations such as importing data, as it will significantly impact performance. You can specify a batch configuration as in the below example, and pass it on as runtime configuration. +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import WeaviateDocumentIndex + from pydantic import Field -```python -batch_config = { - "batch_size": 20, - "dynamic": False, - "timeout_retries": 3, - "num_workers": 1, -} -runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config) + class MyDoc(TextDoc): + embedding: NdArray[128] = Field(is_embedding=True) -dbconfig = WeaviateDocumentIndex.DBConfig( - host="http://localhost:8080" -) # Replace with your endpoint and/or auth settings -store = WeaviateDocumentIndex[Document](db_config=dbconfig) -store.configure(runtimeconfig) # Batch settings being passed on -``` -| name | type | explanation | default | -| ---- | ---- | ----------- | ------- | -| batch_config | Dict[str, Any] | dictionary to configure the weaviate client's batching logic | see below | + doc_index = WeaviateDocumentIndex[MyDoc]() + ``` -Read more: +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import WeaviateDocumentIndex + from pydantic import Field -- Weaviate [docs on batching with the Python client](https://weaviate.io/developers/weaviate/client-libraries/python#batching) - - -## 3. Available column types + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128, is_embedding=True) -Python data types are mapped to Weaviate type according to the below conventions. 
-| Python type | Weaviate type | -| ----------- | ------------- | -| docarray.typing.ID | string | -| str | text | -| int | int | -| float | number | -| bool | boolean | -| np.ndarray | number[] | -| AbstractTensor | number[] | -| bytes | blob | + doc_index = WeaviateDocumentIndex[MyDoc]() + ``` -You can override this default mapping by passing a `col_type` to the `Field` of a schema. +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. -For example to map `str` to `string` you can: +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: ```python -class StringDoc(BaseDoc): - text: str = Field(col_type="string") +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) + +# you can index this into Document Index of type MyDoc +doc_index.index(data) ``` -A list of available Weaviate data types [is here](https://weaviate.io/developers/weaviate/config-refs/datatypes). - -## 4. 
Adding example data +## Index Putting it together, we can add data below using Weaviate as the Document Index: ```python import numpy as np from pydantic import Field -from docarray import BaseDoc +from docarray import BaseDoc, DocList from docarray.typing import NdArray from docarray.index.backends.weaviate import WeaviateDocumentIndex @@ -298,23 +297,28 @@ class Document(BaseDoc): # Make a list of 3 docs to index -docs = [ - Document( - text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1" - ), - Document( - text="Hello world, how are you?", - embedding=np.array([3, 4]), - file=np.random.rand(100), - id="2", - ), - Document( - text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut", - embedding=np.array([5, 6]), - file=np.random.rand(100), - id="3", - ), -] +docs = DocList[Document]( + [ + Document( + text="Hello world", + embedding=np.array([1, 2]), + file=np.random.rand(100), + id="1", + ), + Document( + text="Hello world, how are you?", + embedding=np.array([3, 4]), + file=np.random.rand(100), + id="2", + ), + Document( + text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut", + embedding=np.array([5, 6]), + file=np.random.rand(100), + id="3", + ), + ] +) batch_config = { "batch_size": 20, @@ -325,12 +329,12 @@ batch_config = { runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config) -store = WeaviateDocumentIndex[Document](db_config=dbconfig) +store = WeaviateDocumentIndex[Document]() store.configure(runtimeconfig) # Batch settings being passed on store.index(docs) ``` -### 4.1. Notes +### Notes - To use vector search, you need to specify `is_embedding` for exactly one field. - This is because Weaviate is configured to allow one vector per data object. @@ -340,50 +344,192 @@ store.index(docs) - It is possible to create a schema without specifying `is_embedding` for any field. 
- This will however mean that the document will not be vectorized and cannot be searched using vector search. -## 5. Query Builder/Hybrid Search +As you can see, `DocList[Document]` and `WeaviateDocumentIndex[Document]` both have `Document` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index. + + + +## Vector search + +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
+ +You can perform a similarity search and find relevant documents by passing a `Document` or a raw vector to +the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: + +=== "Search by Document" + + ```python + # create a query document + query = Document( + text="Hello world", + embedding=np.array([1, 2]), + file=np.random.rand(100), + ) + + # find similar documents + matches, scores = store.find(query, limit=5) + + print(f"{matches=}") + print(f"{matches.text=}") + print(f"{scores=}") + ``` + +=== "Search by raw vector" + + ```python + # create a query vector + query = np.random.rand(2) + + # find similar documents + matches, scores = store.find(query, limit=5) + + print(f'{matches=}') + print(f'{matches.text=}') + print(f'{scores=}') + ``` + +In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. +In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose +which one to use for the search. + +The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +matching documents and their associated similarity scores. + +When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. + +How these scores are calculated depends on the backend, and can usually be [configured](#configuration). + +### Batched search + +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method.
+ +=== "Search by documents" + + ```python + # create some query documents + queries = DocList[MyDoc]( + Document( + text=f"Hello world {i}", + embedding=np.array([i, i + 1]), + file=np.random.rand(100), + ) + for i in range(3) + ) + + # find similar documents + matches, scores = doc_index.find_batched(queries, limit=5) + + print(f"{matches=}") + print(f"{matches[0].text=}") + print(f"{scores=}") + ``` + +=== "Search by raw vectors" + + ```python + # create some query vectors + query = np.random.rand(3, 2) + + # find similar documents + matches, scores = doc_index.find_batched(query, limit=5) + + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` -### 5.1. Text search +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. -To perform a text search, follow the below syntax. -This will perform a text search for the word "hello" in the field "text" and return the first two results: +## Filter +To perform filtering, follow the below syntax. + +This will perform a filtering on the field `text`: ```python -q = store.build_query().text_search("world", search_field="text").limit(2).build() +docs = store.filter({"path": ["text"], "operator": "Equal", "valueText": "Hello world"}) +``` -docs = store.execute_query(q) -docs +You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. +The query should follow the [query language of the Weaviate](https://weaviate.io/developers/weaviate/search/filters). 
+ +In the following example let's filter for all the books that are cheaper than 29 dollars: + +```python +from docarray import BaseDoc, DocList +from docarray.typing import NdArray +from pydantic import Field +import numpy as np + + +class Book(BaseDoc): + price: int + embedding: NdArray[10] = Field(is_embedding=True) + + +books = DocList[Book]([Book(price=i * 10, embedding=np.random.rand(10)) for i in range(10)]) +book_index = WeaviateDocumentIndex[Book](index_name='tmp_index') +book_index.index(books) + +# filter for books that are cheaper than 29 dollars +query = {"path": ["price"], "operator": "LessThan", "valueInt": 29} +cheap_books = book_index.filter(filter_query=query) + +assert len(cheap_books) == 3 +for doc in cheap_books: + doc.summary() ``` -### 5.2. Vector similarity search -To perform a vector similarity search, follow the below syntax. +## Text search -This will perform a vector similarity search for the vector [1, 2] and return the first two results: +In addition to vector similarity search, the Document Index interface offers methods for text search: +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. -```python -q = store.build_query().find([1, 2]).limit(2).build() +You can use text search directly on the field of type `str`. -docs = store.execute_query(q) -docs +The following line will perform a text search for the word "hello" in the field "text" and return the first two results: + +```python +docs = store.text_search("world", search_field="text", limit=2) ``` -### 5.3. Hybrid search + +## Hybrid search + +Document Index supports atomic operations for vector similarity search, text search and filter search. + +To combine these operations into a single, hybrid search query, you can use the query builder that is accessible +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]. 
To perform a hybrid search, follow the below syntax. This will perform a hybrid search for the word "hello" and the vector [1, 2] and return the first two results: -**Note**: Hybrid search searches through the object vector and all fields. Accordingly, the `search_field` keyword it will have no effect. +**Note**: Hybrid search searches through the object vector and all fields. Accordingly, the `search_field` keyword will have no effect. ```python q = store.build_query().text_search("world").find([1, 2]).limit(2).build() docs = store.execute_query(q) -docs ``` -### 5.4. GraphQL query +### GraphQL query You can also perform a raw GraphQL query using any syntax as you might natively in Weaviate. This allows you to run any of the full range of queries that you might wish to. @@ -409,30 +555,145 @@ Note that running a raw GraphQL query will return Weaviate-type responses, rathe You can find the documentation for [Weaviate's GraphQL API here](https://weaviate.io/developers/weaviate/api/graphql). - -## 6. Other notes +## Access documents -### 6.1. DocArray IDs vs Weaviate IDs +To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. -As you saw earlier, the `id` field is a special field that is used to identify a document in `BaseDoc`. +You can also access data by the `id` that was assigned to each document: ```python -Document( - text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1" -), +# prepare some data +data = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), title=f'query {i}') for i in range(3) +) + +# remember the Document ids and index the data +ids = data.id +doc_index.index(data) + +# access the documents by id +doc = doc_index[ids[0]] # get by single id +docs = doc_index[ids] # get by list of ids ``` -This is not the same as Weaviate's own `id`, which is a reserved keyword and can't be used as a field name. 
-Accordingly, the DocArray document id is stored internally in Weaviate as `docarrayid`. - +## Delete documents -## 7. Shut down Weaviate instance +In the same way you can access documents by `id`, you can also delete them: -```bash -docker-compose down +```python +# prepare some data +data = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), title=f'query {i}') for i in range(3) +) + +# remember the Document ids and index the data +ids = data.id +doc_index.index(data) + +# access the documents by id +del doc_index[ids[0]] # del by single id +del doc_index[ids[1:]] # del by list of ids +``` + +## Configuration + +### Overview + +**WCS instances come pre-configured**, and as such additional settings are not configurable outside of those chosen at creation, such as whether to enable authentication. + +For other cases, such as **Docker-Compose deployment**, its settings can be modified through the configuration file, such as the `docker-compose.yaml` file. + +Some of the more commonly used settings include: + +- [Persistent volume](https://weaviate.io/developers/weaviate/installation/docker-compose#persistent-volume): Set up data persistence so that data from inside the Docker container is not lost on shutdown +- [Enabling a multi-node setup](https://weaviate.io/developers/weaviate/installation/docker-compose#multi-node-setup) +- [Backups](https://weaviate.io/developers/weaviate/configuration/backups) +- [Authentication (server-side)](https://weaviate.io/developers/weaviate/configuration/authentication) +- [Modules enabled](https://weaviate.io/developers/weaviate/configuration/modules#enable-modules) + +And a list of environment variables is [available on this page](https://weaviate.io/developers/weaviate/config-refs/env-vars). + +### DocArray instantiation configuration options + +Additionally, you can specify the below settings when you instantiate a configuration object in DocArray. 
+ +| name | type | explanation | default | example | +| ---- | ---- | ----------- |------------------------------------------------------------------------| ------- | +| **Category: General** | +| host | str | Weaviate instance url | http://localhost:8080 | +| **Category: Authentication** | +| username | str | Username known to the specified authentication provider (e.g. WCS) | `None` | `jp@weaviate.io` | +| password | str | Corresponding password | `None` | `p@ssw0rd` | +| auth_api_key | str | API key known to the Weaviate instance | `None` | `mys3cretk3y` | +| **Category: Data schema** | +| index_name | str | Class name to use to store the document| The document class name, e.g. `MyDoc` for `WeaviateDocumentIndex[MyDoc]` | `Document` | +| **Category: Embedded Weaviate** | +| embedded_options| EmbeddedOptions | Options for embedded weaviate | `None` | + +The type `EmbeddedOptions` can be specified as described [here](https://weaviate.io/developers/weaviate/installation/embedded#embedded-options) + +### Runtime configuration + +Weaviate strongly recommends using batches to perform bulk operations such as importing data, as it will significantly impact performance. You can specify a batch configuration as in the below example, and pass it on as runtime configuration. 
+ +```python +batch_config = { + "batch_size": 20, + "dynamic": False, + "timeout_retries": 3, + "num_workers": 1, +} + +runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config) + +dbconfig = WeaviateDocumentIndex.DBConfig( + host="http://localhost:8080" +) # Replace with your endpoint and/or auth settings +store = WeaviateDocumentIndex[Document](db_config=dbconfig) +store.configure(runtimeconfig) # Batch settings being passed on ``` ------ ------ ------ +| name | type | explanation | default | +| ---- | ---- | ----------- | ------- | +| batch_config | Dict[str, Any] | dictionary to configure the weaviate client's batching logic | see below | + +Read more: + +- Weaviate [docs on batching with the Python client](https://weaviate.io/developers/weaviate/client-libraries/python#batching) + + +### Available column types + +Python data types are mapped to Weaviate type according to the below conventions. + +| Python type | Weaviate type | +| ----------- | ------------- | +| docarray.typing.ID | string | +| str | text | +| int | int | +| float | number | +| bool | boolean | +| np.ndarray | number[] | +| AbstractTensor | number[] | +| bytes | blob | + +You can override this default mapping by passing a `col_type` to the `Field` of a schema. + +For example to map `str` to `string` you can: + +```python +class StringDoc(BaseDoc): + text: str = Field(col_type="string") +``` + +A list of available Weaviate data types [is here](https://weaviate.io/developers/weaviate/config-refs/datatypes). + + +## Nested data and subindex search + +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. + +Go to the [Nested Data](nested_data.md) section to learn more. 
\ No newline at end of file diff --git a/docs/user_guide/storing/nested_data.md b/docs/user_guide/storing/nested_data.md new file mode 100644 index 00000000000..feb7c4ee9b4 --- /dev/null +++ b/docs/user_guide/storing/nested_data.md @@ -0,0 +1,169 @@ +# Nested Data + +Most of the examples you've seen operate on a simple schema: each field corresponds to a "basic" type, such as `str` or `NdArray`. + +It is, however, also possible to represent nested documents and store them in a Document Index. + +!!! note "Using a different vector database" + In the following examples, we will use `InMemoryExactNNIndex` as our Document Index. + You can easily use Weaviate, Qdrant, Redis, Milvus or Elasticsearch instead -- their APIs are largely identical! + To do so, check their respective documentation sections. + +## Create and index +In the following example you can see a complex schema that contains nested documents. +The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: + +```python +import numpy as np +from pydantic import Field + +from docarray import BaseDoc, DocList +from docarray.index import InMemoryExactNNIndex +from docarray.typing import AnyTensor, ImageUrl, VideoUrl + +# define a nested schema +class ImageDoc(BaseDoc): + url: ImageUrl + tensor: AnyTensor = Field(space='cosine_sim', dim=64) + + +class VideoDoc(BaseDoc): + url: VideoUrl + tensor: AnyTensor = Field(space='cosine_sim', dim=128) + + +class YouTubeVideoDoc(BaseDoc): + title: str + description: str + thumbnail: ImageDoc + video: VideoDoc + tensor: AnyTensor = Field(space='cosine_sim', dim=256) + + +# create a Document Index +doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]() + +# create some data +index_docs = [ + YouTubeVideoDoc( + title=f'video {i+1}', + description=f'this is video from author {10*i}', + thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), + video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), + 
tensor=np.ones(256), + ) + for i in range(8) +] + +# index the Documents +doc_index.index(index_docs) +``` + +## Search + +You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. + +In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: + +```python +# create a query document +query_doc = YouTubeVideoDoc( + title='video query', + description='this is a query video', + thumbnail=ImageDoc(url='http://example.ai/images/1024', tensor=np.ones(64)), + video=VideoDoc(url='http://example.ai/videos/1024', tensor=np.ones(128)), + tensor=np.ones(256), +) + +# find by the `youtubevideo` tensor; root level +docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) + +# find by the `thumbnail` tensor; nested level +docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) + +# find by the `video` tensor; nested level +docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) +``` + +## Nested data with subindex search + +Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one above. + +If a document contains a `DocList`, it can still be stored in a Document Index. +In this case, the `DocList` will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). + +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of `DocLists`. + + +### Index + +In the following example, you can see a complex schema that contains nested `DocLists` of documents where we'll utilize subindex search.
+ +The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: + +```python +class ImageDoc(BaseDoc): + url: ImageUrl + tensor_image: AnyTensor = Field(space='cosine_sim', dim=64) + + +class VideoDoc(BaseDoc): + url: VideoUrl + images: DocList[ImageDoc] + tensor_video: AnyTensor = Field(space='cosine_sim', dim=128) + + +class MyDoc(BaseDoc): + docs: DocList[VideoDoc] + tensor: AnyTensor = Field(space='cosine_sim', dim=256) + + +# create a Document Index +doc_index = InMemoryExactNNIndex[MyDoc]() + +# create some data +index_docs = [ + MyDoc( + docs=DocList[VideoDoc]( + [ + VideoDoc( + url=f'http://example.ai/videos/{i}-{j}', + images=DocList[ImageDoc]( + [ + ImageDoc( + url=f'http://example.ai/images/{i}-{j}-{k}', + tensor_image=np.ones(64), + ) + for k in range(10) + ] + ), + tensor_video=np.ones(128), + ) + for j in range(10) + ] + ), + tensor=np.ones(256), + ) + for i in range(10) +] + +# index the Documents +doc_index.index(index_docs) +``` + +### Search + +You can perform search on any level by using [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method +and the dunder operator `'root__subindex'` to specify the index to search on: + +```python +# find by the `VideoDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(128), subindex='docs', search_field='tensor_video', limit=3 +) + +# find by the `ImageDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 +) +``` \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index a7ad4cf8500..457bb0d15ae 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -103,6 +103,9 @@ nav: - user_guide/storing/index_weaviate.md - user_guide/storing/index_elastic.md - user_guide/storing/index_qdrant.md + - user_guide/storing/index_redis.md + - user_guide/storing/index_milvus.md + - user_guide/storing/nested_data.md - DocStore - Bulk 
storage: - user_guide/storing/doc_store/store_file.md - user_guide/storing/doc_store/store_jac.md