From caa00fd1dbcccddd4a10d512016c04d544343582 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Wed, 28 Jun 2023 17:11:52 +0200 Subject: [PATCH 01/23] chore: first pr Signed-off-by: jupyterjazz --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index 7aed194b9b3..a013093d4fc 100644 --- a/README.md +++ b/README.md @@ -393,7 +393,6 @@ query = dl[0] results, scores = index.find(query, limit=10, search_field='embedding') ``` - --- ## Learn DocArray From b45e3a6e8d971dec37ae63f2f8a60823d6b00a3c Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Thu, 6 Jul 2023 17:06:33 +0400 Subject: [PATCH 02/23] docs: modify hnsw Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_hnswlib.md | 486 ++++++++++++++++++++++- 1 file changed, 470 insertions(+), 16 deletions(-) diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index d19973389aa..47a2496ac6c 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -7,6 +7,7 @@ pip install "docarray[hnswlib]" ``` + [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html). @@ -22,8 +23,370 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s ## Basic Usage -To see how to create a [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] instance, add Documents, -perform search, etc. see the [general user guide](./docindex.md). +```python +from docarray import BaseDoc, DocList +from docarray.index import HnswDocumentIndex +from docarray.typing import NdArray +import numpy as np + +# Define the document schema. 
+class MyDoc(BaseDoc):
+    title: str
+    embedding: NdArray[128]
+
+# Create dummy documents.
+docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10))
+
+# Initialize a new HnswDocumentIndex instance and add the documents to the index.
+doc_index = HnswDocumentIndex[MyDoc](work_dir='./my_index')
+doc_index.index(docs)
+
+# Perform a vector search.
+query = np.ones(128)
+retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10)
+```
+
+## Initialize
+
+To create a Document Index, you first need a document that defines the schema of your index:
+
+```python
+from docarray import BaseDoc
+from docarray.index import HnswDocumentIndex
+from docarray.typing import NdArray
+
+
+class MyDoc(BaseDoc):
+    embedding: NdArray[128]
+    text: str
+
+
+db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db')
+```
+
+### Schema definition
+
+In this code snippet, `HnswDocumentIndex` takes a schema of the form of `MyDoc`.
+The Document Index then _creates a column for each field in `MyDoc`_.
+
+The column types in the backend database are determined by the type hints of the document's fields.
+Optionally, you can [customize the database types for every field](#customize-configurations).
+
+Most vector databases need to know the dimensionality of the vectors that will be stored.
+Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that
+the database will store vectors with 128 dimensions.
+
+!!! note "PyTorch and TensorFlow support"
+    Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that
+    for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually!
+
+
+### Using a predefined Document as schema
+
+DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
+If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! + +You can work around this problem by subclassing the predefined Document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import HnswDocumentIndex + + + class MyDoc(TextDoc): + embedding: NdArray[128] + + + db = HnswDocumentIndex[MyDoc](work_dir='test_db') + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import HnswDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128) + + + db = HnswDocumentIndex[MyDoc](work_dir='test_db3') + ``` + +Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the +predefined Document type, or your custom Document type. + +The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) + +# you can index this into Document Index of type MyDoc +db.index(data) +``` + + +**Database location:** + +For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you +usually specify a `host` and a `port` instead. + +In addition to a host and a port, most backends can also take an `index_name`, `table_name`, `collection_name` or similar. +This specifies the name of the index/table/collection that will be created in the database. +You don't have to specify this though: By default, this name will be taken from the name of the Document type that you use as schema. +For example, for `WeaviateDocumentIndex[MyDoc](...)` the data will be stored in a Weaviate Class of name `MyDoc`. + +In any case, if the location does not yet contain any data, we start from a blank slate. +If the location already contains data from a previous session, it will be accessible through the Document Index. + + +## Index + +Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] +) + +# index the data +db.index(docs) +``` + +That call to [index()][docarray.index.backends.hnswlib.HnswDocumentIndex.index] stores all Documents in `docs` into the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `HnswDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. 
+This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store
+need to have compatible schemas.
+
+!!! question "When are two schemas compatible?"
+    The schemas of your Document Index and data need to be compatible with each other.
+
+    Let's say A is the schema of your Document Index and B is the schema of your data.
+    There are a few rules that determine if schema A is compatible with schema B.
+    If _any_ of the following are true, then A and B are compatible:
+
+    - A and B are the same class
+    - A and B have the same field names and field types
+    - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A
+
+    In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index.
+
+
+## Vector Search
+
+Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.
+
+By passing a query document of type `MyDoc` to [find()][docarray.index.abstract.BaseDocIndex.find], you can find
+similar Documents in the Document Index:
+
+=== "Search by Document"
+
+    ```python
+    # create a query Document
+    query = MyDoc(embedding=np.random.rand(128), text='query')
+
+    # find similar Documents
+    matches, scores = db.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar Documents
+    matches, scores = db.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To successfully perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+### Batched Search
+
+You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
+
+=== "Search by Documents"
+
+    ```python
+    # create some query Documents
+    queries = DocList[MyDoc](
+        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+    )
+
+    # find similar Documents
+    matches, scores = db.find_batched(queries, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
+
+    # find similar Documents
+    matches, scores = db.find_batched(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
+
+
+## Filter
+
+In addition to vector similarity search, the Document Index interface offers methods for filtered search:
+[filter()][docarray.index.abstract.BaseDocIndex.filter],
+as well as the batched version [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched]. [filter_subindex()][docarray.index.abstract.BaseDocIndex.filter_subindex] is for filtering on the subindex level.
+
+!!! note
+    The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter search.
+
+    To see how to perform filter search, you can check out other backends that offer support.
+
+
+In addition to vector similarity search, the Document Index interface offers methods for text search and filtered search:
+[text_search()][docarray.index.abstract.BaseDocIndex.text_search] and [filter()][docarray.index.abstract.BaseDocIndex.filter],
+as well as their batched versions [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched] and [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched]. [filter_subindex()][docarray.index.abstract.BaseDocIndex.filter_subindex] is for filter on subindex level.
+
+!!! note
+    The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter
+    or text search.
+
+
+## Text Search
+
+In addition to vector similarity search, the Document Index interface offers methods for text search:
+[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
+
+!!! note
+    The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for text search.
+
+    To see how to perform text search, you can check out other backends that offer support.
+
+
+## Hybrid Search
+
+Document Index supports atomic operations for vector similarity search, text search and filter search.
+
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
+
+```python
+# prepare a query
+q_doc = MyDoc(embedding=np.random.rand(128), text='query')
+
+query = (
+    db.build_query()  # get empty query object
+    .find(query=q_doc, search_field='embedding')  # add vector similarity search
+    .filter(filter_query={'text': {'$exists': True}})  # add filter search
+    .build()  # build the query
+)
+
+# execute the combined query and return the results
+results = db.execute_query(query)
+print(f'{results=}')
+```
+
+In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
+to obtain a combined set of results.
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can combine filters and vector search, etc.
+To see what each backend supports, check out the [specific docs](#document-index).
+
+
+## Access Documents
+
+To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search.
+
+You can also access data by the `id` that was assigned to each document:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# access the Documents by id
+doc = db[ids[0]]  # get by single id
+docs = db[ids]  # get by list of ids
+```
+
+
+## Delete Documents
+
+In the same way you can access Documents by id, you can also delete them:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# delete the Documents by id
+del db[ids[0]]  # del by single id
+del db[ids[1:]]  # del by list of ids
+```
+
 
 ## Configuration
@@ -142,14 +505,24 @@ In this way, you can pass [all options that Hnswlib supports](https://github.com
 You can find more details on the parameters [here](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md).
 
-## Nested Index
-When using the index, you can define multiple fields and their nested structure. In the following example, you have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`.
+
+## Nested data
+
+The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`.
+
+**Index nested data:**
+
+It is, however, also possible to represent nested Documents and store them in a Document Index.
+
+In the following example you can see a complex schema that contains nested Documents.
+The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: ```python from docarray.typing import ImageUrl, VideoUrl, AnyTensor +# define a nested schema class ImageDoc(BaseDoc): url: ImageUrl tensor: AnyTensor = Field(space='cosine', dim=64) @@ -168,7 +541,10 @@ class YouTubeVideoDoc(BaseDoc): tensor: AnyTensor = Field(space='cosine', dim=256) -doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2') +# create a Document Index +doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='/tmp2') + +# create some data index_docs = [ YouTubeVideoDoc( title=f'video {i+1}', @@ -179,13 +555,19 @@ index_docs = [ ) for i in range(8) ] + +# index the Documents doc_index.index(index_docs) ``` -You can use the `search_field` to specify which field to use when performing the vector search. You can use the dunder operator to specify the field defined in the nested data. In the following code, you can perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the `thumbnail` and `video` field: +**Search nested data:** + +You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
+
+In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields:
 
 ```python
-# example of find nested and flat index
+# create a query Document
 query_doc = YouTubeVideoDoc(
     title=f'video query',
     description=f'this is a query video',
     thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)),
     video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)),
     tensor=np.ones(256),
 )
-# find by the youtubevideo tensor
+
+# find by the `youtubevideo` tensor; root level
 docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)
-# find by the thumbnail tensor
+
+# find by the `thumbnail` tensor; nested level
 docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)
-# find by the video tensor
+
+# find by the `video` tensor; nested level
 docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
 ```
 
-To delete nested data, you need to specify the `id`.
+### Nested data with subindex
 
-!!! note
-    You can only delete `Doc` at the top level. Deletion of the `Doc` on lower levels is not yet supported.
+Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data).
+
+If a Document contains a DocList, it can still be stored in a Document Index.
+In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...).
+
+This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList.
+
+
+**Index**
+
+In the following example you can see a complex schema that contains nested Documents with subindex.
+The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: ```python -# example of deleting nested and flat index -del doc_index[index_docs[6].id] +class ImageDoc(BaseDoc): + url: ImageUrl + tensor_image: AnyTensor = Field(space='cosine', dim=64) + + +class VideoDoc(BaseDoc): + url: VideoUrl + images: DocList[ImageDoc] + tensor_video: AnyTensor = Field(space='cosine', dim=128) + + +class MyDoc(BaseDoc): + docs: DocList[VideoDoc] + tensor: AnyTensor = Field(space='cosine', dim=256) + + +# create a Document Index +doc_index = HnswDocumentIndex[MyDoc](work_dir='/tmp3') + +# create some data +index_docs = [ + MyDoc( + docs=DocList[VideoDoc]( + [ + VideoDoc( + url=f'http://example.ai/videos/{i}-{j}', + images=DocList[ImageDoc]( + [ + ImageDoc( + url=f'http://example.ai/images/{i}-{j}-{k}', + tensor_image=np.ones(64), + ) + for k in range(10) + ] + ), + tensor_video=np.ones(128), + ) + for j in range(10) + ] + ), + tensor=np.ones(256), + ) + for i in range(10) +] + +# index the Documents +doc_index.index(index_docs) ``` -Check [here](../docindex#nested-data-with-subindex) for nested data with subindex. +**Search** + +You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. + +```python +# find by the `VideoDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(128), subindex='docs', search_field='tensor_video', limit=3 +) + +# find by the `ImageDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 +) +``` ### Update elements In order to update a Document inside the index, you only need to reindex it with the updated attributes. 
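The update-by-reindexing behavior described above can be pictured with a plain Python mapping (a conceptual sketch only, not the docarray API): because documents are stored under their `id`, re-indexing a document that carries an already-indexed `id` overwrites the stored copy.

```python
# Conceptual sketch: an index keyed by document id.
# Indexing a document whose id already exists overwrites the stored copy,
# which is the "update by reindexing" behavior described above.
index = {}

def index_docs(docs):
    for doc in docs:
        index[doc['id']] = doc  # same id -> overwrite, i.e. update

index_docs([{'id': 'a', 'text': 'old'}, {'id': 'b', 'text': 'other'}])
index_docs([{'id': 'a', 'text': 'new'}])  # re-index with updated attributes

print(index['a']['text'], len(index))  # new 2
```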
From 11bda62be46c3ec046efd131d55c11b391035ef0 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Thu, 6 Jul 2023 17:56:19 +0400 Subject: [PATCH 03/23] docs: rough versions of inmemory and hnsw Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_hnswlib.md | 10 +- docs/user_guide/storing/index_in_memory.md | 597 ++++++++++++++++++--- 2 files changed, 509 insertions(+), 98 deletions(-) diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index 47a2496ac6c..59d587c5486 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -70,7 +70,7 @@ In this code snippet, `HnswDocumentIndex` takes a schema of the form of `MyDoc`. The Document Index then _creates a column for each field in `MyDoc`_. The column types in the backend database are determined by the type hints of the document's fields. -Optionally, you can [customize the database types for every field](#customize-configurations). +Optionally, you can [customize the database types for every field](#configuration). Most vector databases need to know the dimensionality of the vectors that will be stored. Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that @@ -294,14 +294,6 @@ as well as the batched version [filter_batched()][docarray.index.abstract.BaseDo To see how to perform filter search, you can check out other backends that offer support. -In addition to vector similarity search, the Document Index interface offers methods for text search and filtered search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search] and [filter()][docarray.index.abstract.BaseDocIndex.filter], -as well as their batched versions [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched] and [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched]. 
[filter_subindex()][docarray.index.abstract.BaseDocIndex.filter_subindex] is for filter on subindex level. - -!!! note - The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter - or text search. - ## Text Search diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index 77a911b4599..98bdda9fc50 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -7,47 +7,390 @@ It is a great starting point for small datasets, where you may not want to launc For vector search and filtering the InMemoryExactNNIndex utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][docarray.utils.filter.filter_docs] functions. -## Basic usage +!!! note "Production readiness" + [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] is a great starting point + for small- to medium-sized datasets, but it is not battle tested in production. If scalability, uptime, etc. are + important to you, we recommend you eventually transition to one of our database-backed Document Index implementations: + + - [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] + - [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] + - [ElasticDocumentIndex][docarray.index.backends.elastic.ElasticDocIndex] -To see how to create a [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] instance, add Documents, -perform search, etc. see the [general user guide](./docindex.md). -You can initialize the index as follows: +## Basic usage ```python from docarray import BaseDoc, DocList -from docarray.index.backends.in_memory import InMemoryExactNNIndex +from docarray.index import InMemoryExactNNIndex from docarray.typing import NdArray +import numpy as np - +# Define the document schema. 
class MyDoc(BaseDoc):
+    title: str
+    embedding: NdArray[128]
+
+# Create dummy documents.
+docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10))
+
+# Initialize a new InMemoryExactNNIndex instance and add the documents to the index.
+doc_index = InMemoryExactNNIndex[MyDoc]()
+doc_index.index(docs)
+
+# Perform a vector search.
+query = np.ones(128)
+retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10)
+```
+
+## Initialize
+
+To create a Document Index, you first need a document that defines the schema of your index:
+
+```python
+from docarray import BaseDoc
+from docarray.index import InMemoryExactNNIndex
+from docarray.typing import NdArray
+
+
+class MyDoc(BaseDoc):
+    embedding: NdArray[128]
+    text: str
+
+
+db = InMemoryExactNNIndex[MyDoc]()
+```
+
+### Schema definition
+
+In this code snippet, `InMemoryExactNNIndex` takes a schema of the form of `MyDoc`.
+The Document Index then _creates a column for each field in `MyDoc`_.
+
+The column types in the backend database are determined by the type hints of the document's fields.
+Optionally, you can [customize the database types for every field](#configuration).
+
+Most vector databases need to know the dimensionality of the vectors that will be stored.
+Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that
+the database will store vectors with 128 dimensions.
+
+!!! note "PyTorch and TensorFlow support"
+    Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that
+    for you. This is supported for all Document Index backends.
No need to convert your tensors to NumPy arrays manually!
+
+
+### Using a predefined Document as schema
+
+DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
+If you try to use these directly as a schema for a Document Index, you will get unexpected behavior:
+Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built.
+
+The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding`
+field. But this is crucial information for any vector database to work properly!
+
+You can work around this problem by subclassing the predefined Document and adding the dimensionality information:
+
+=== "Using type hint"
+    ```python
+    from docarray.documents import TextDoc
+    from docarray.typing import NdArray
+    from docarray.index import InMemoryExactNNIndex
+
+
+    class MyDoc(TextDoc):
+        embedding: NdArray[128]
+
+
+    db = InMemoryExactNNIndex[MyDoc]()
+    ```
+
+=== "Using Field()"
+    ```python
+    from docarray.documents import TextDoc
+    from docarray.typing import AnyTensor
+    from docarray.index import InMemoryExactNNIndex
+    from pydantic import Field
+
+
+    class MyDoc(TextDoc):
+        embedding: AnyTensor = Field(dim=128)
+
+
+    db = InMemoryExactNNIndex[MyDoc]()
+    ```
+
+Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the
+predefined Document type, or your custom Document type.
+
+The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc.
that you want to index, you _don't_ need to cast them to `MyDoc`: + ```python -# Save your existing index as a binary file -docs = DocList[MyDoc](MyDoc() for _ in range(10)) +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) +# you can index this into Document Index of type MyDoc +db.index(data) +``` + + +**Persist and Load** + +Further, you can pass an `index_file_path` argument to make sure that the index can be restored if persisted from that specific file. +```python doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin') doc_index.index(docs) -# or in one step: doc_index.persist() # Initialize a new document index using the saved binary file new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin') ``` +## Index + +Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] +) + +# index the data +db.index(docs) +``` + +That call to [index()][docarray.index.backends.in_memory.InMemoryExactNNIndex.index] stores all Documents in `docs` into the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `InMemoryExactNNIndex[MyDoc]` are both parameterized with `MyDoc`. +This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store +need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. 
+
+    Let's say A is the schema of your Document Index and B is the schema of your data.
+    There are a few rules that determine if schema A is compatible with schema B.
+    If _any_ of the following are true, then A and B are compatible:
+
+    - A and B are the same class
+    - A and B have the same field names and field types
+    - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A
+
+    In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index.
+
+
+## Vector Search
+
+Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.
+
+By passing a query document of type `MyDoc` to [find()][docarray.index.abstract.BaseDocIndex.find], you can find
+similar Documents in the Document Index:
+
+=== "Search by Document"
+
+    ```python
+    # create a query Document
+    query = MyDoc(embedding=np.random.rand(128), text='query')
+
+    # find similar Documents
+    matches, scores = db.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar Documents
+    matches, scores = db.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To successfully perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+ +The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +matching documents and their associated similarity scores. + +When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. + +How these scores are calculated depends on the backend, and can usually be [configured](#configuration). + +### Batched Search + +You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. + +=== "Search by Documents" + + ```python + # create some query Documents + queries = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) + ) + + # find similar Documents + matches, scores = db.find_batched(queries, search_field='embedding', limit=5) + + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` + +=== "Search by raw vectors" + + ```python + # create some query vectors + query = np.random.rand(3, 128) + + # find similar Documents + matches, scores = db.find_batched(query, search_field='embedding', limit=5) + + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` + +The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. + + +## Filter + +To filter Documents, the `InMemoryExactNNIndex` uses DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. + +You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. +The query should follow the query language of the DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. 
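To build intuition for this query language, here is a tiny pure-Python sketch of how a MongoDB-style filter such as `{'price': {'$lte': 29}}` could be evaluated against plain dictionaries (illustrative only — `matches` is a hypothetical helper, not DocArray's implementation):

```python
def matches(doc, query):
    # evaluate a flat {field: {operator: operand}} filter against a dict
    ops = {
        '$eq': lambda a, b: a == b,
        '$lte': lambda a, b: a <= b,
        '$gte': lambda a, b: a >= b,
        '$exists': lambda a, b: (a is not None) == b,
    }
    for field, condition in query.items():
        for op, operand in condition.items():
            if not ops[op](doc.get(field), operand):
                return False
    return True


books = [{'title': f'title {i}', 'price': i * 10} for i in range(10)]
cheap_books = [b for b in books if matches(b, {'price': {'$lte': 29}})]
print(len(cheap_books))  # 3 books: prices 0, 10 and 20
```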
+ +In the following example let's filter for all the books that are cheaper than 29 dollars: + +```python +from docarray import BaseDoc, DocList + + +class Book(BaseDoc): + title: str + price: int + + +books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)]) +book_index = InMemoryExactNNIndex[Book](books) + +# filter for books that are cheaper than 29 dollars +query = {'price': {'$lte': 29}} +cheap_books = book_index.filter(query) + +assert len(cheap_books) == 3 +for doc in cheap_books: + doc.summary() +``` + +## Text Search + +In addition to vector similarity search, the Document Index interface offers methods for text search: +[text_search()][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. + +!!! note + The [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] implementation does not offer support for text search. + + To see how to perform text search, you can check out other backends that offer support. + + +## Hybrid Search + +Document Index supports atomic operations for vector similarity search, text search and filter search. 
+
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
+
+```python
+# prepare a query
+q_doc = MyDoc(embedding=np.random.rand(128), text='query')
+
+query = (
+    db.build_query()  # get empty query object
+    .find(query=q_doc, search_field='embedding')  # add vector similarity search
+    .filter(filter_query={'text': {'$exists': True}})  # add filter search
+    .build()  # build the query
+)
+
+# execute the combined query and return the results
+results = db.execute_query(query)
+print(f'{results=}')
+```
+
+In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
+to obtain a combined set of results.
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can combine filter and vector search, etc.
+To see which backend supports which operations, check out the [specific docs](#document-index).
+
+
+## Access Documents
+
+To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search.
+
+You can also access data by the `id` that was assigned to each document:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# access the Documents by id
+doc = db[ids[0]]  # get by single id
+docs = db[ids]  # get by list of ids
+```
+
+
+## Delete Documents
+
+In the same way you can access Documents by id, you can also delete them:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+db.index(data)
+
+# delete the Documents by id
+del db[ids[0]]  # del by single id
+del db[ids[1:]]  # del by list of ids
+```
 
 ## Configuration
 
 This section lays out the configurations and options that are specific to [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex].
@@ -100,26 +443,30 @@ class Schema(BaseDoc):
 
 In the example above you can see how to configure two different vector fields, with two different sets of settings.
 
-## Nested index
+## Nested data
+
+The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`.
+
+**Index nested data:**
+
+It is, however, also possible to represent nested Documents and store them in a Document Index.
 
-When using the index, you can define multiple fields and their nested structure. In the following example, you have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`.
+In the following example you can see a complex schema that contains nested Documents.
+The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: ```python -import numpy as np -from docarray import BaseDoc -from docarray.index.backends.in_memory import InMemoryExactNNIndex from docarray.typing import ImageUrl, VideoUrl, AnyTensor -from pydantic import Field +# define a nested schema class ImageDoc(BaseDoc): url: ImageUrl - tensor: AnyTensor = Field(space='cosine_sim') + tensor: AnyTensor = Field(space='cosine', dim=64) class VideoDoc(BaseDoc): url: VideoUrl - tensor: AnyTensor = Field(space='cosine_sim') + tensor: AnyTensor = Field(space='cosine', dim=128) class YouTubeVideoDoc(BaseDoc): @@ -127,10 +474,13 @@ class YouTubeVideoDoc(BaseDoc): description: str thumbnail: ImageDoc video: VideoDoc - tensor: AnyTensor = Field(space='cosine_sim') + tensor: AnyTensor = Field(space='cosine', dim=256) -doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]() +# create a Document Index +doc_index = InMemoryExactNNIndex[YouTubeVideoDoc](work_dir='/tmp2') + +# create some data index_docs = [ YouTubeVideoDoc( title=f'video {i+1}', @@ -141,97 +491,166 @@ index_docs = [ ) for i in range(8) ] + +# index the Documents doc_index.index(index_docs) ``` -## Search Documents +**Search nested data:** -To search Documents, the `InMemoryExactNNIndex` uses DocArray's [`find`][docarray.utils.find.find] function. +You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. -You can use the `search_field` to specify which field to use when performing the vector search. -You can use the dunder operator to specify the field defined in nested data. 
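The `search_field` dunder syntax is essentially a nested attribute path: `'thumbnail__tensor'` means "the `tensor` field of the `thumbnail` field". A rough pure-Python sketch of how such a path could be resolved (illustrative only — `resolve_dunder_path` is a hypothetical helper, not DocArray's implementation):

```python
class ImageDoc:
    def __init__(self, tensor):
        self.tensor = tensor


class YouTubeVideoDoc:
    def __init__(self):
        self.tensor = 'root-tensor'
        self.thumbnail = ImageDoc('thumbnail-tensor')


def resolve_dunder_path(doc, path):
    # walk the '__'-separated path one attribute at a time
    obj = doc
    for part in path.split('__'):
        obj = getattr(obj, part)
    return obj


doc = YouTubeVideoDoc()
print(resolve_dunder_path(doc, 'tensor'))  # root-tensor
print(resolve_dunder_path(doc, 'thumbnail__tensor'))  # thumbnail-tensor
```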
-In the following code, you can perform vector search on the `tensor` field of the `YouTubeVideoDoc`
-or the `tensor` field of the `thumbnail` and `video` fields:
+In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields:
 
 ```python
-# find by the youtubevideo tensor
-query = parse_obj_as(NdArray, np.ones(256))
-docs, scores = doc_index.find(query, search_field='tensor', limit=3)
+# create a query Document
+query_doc = YouTubeVideoDoc(
+    title=f'video query',
+    description=f'this is a query video',
+    thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)),
+    video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)),
+    tensor=np.ones(256),
+)
+
+# find by the `youtubevideo` tensor; root level
+docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)
 
-# find by the thumbnail tensor
-query = parse_obj_as(NdArray, np.ones(64))
-docs, scores = doc_index.find(query, search_field='thumbnail__tensor', limit=3)
+# find by the `thumbnail` tensor; nested level
+docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)
 
-# find by the video tensor
-query = parse_obj_as(NdArray, np.ones(128))
-docs, scores = doc_index.find(query, search_field='video__tensor', limit=3)
+# find by the `video` tensor; nested level
+docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
 ```
 
-## Filter Documents
+### Nested data with subindex
 
-To filter Documents, the `InMemoryExactNNIndex` uses DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function.
+Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data).
 
-You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
-The query should follow the query language of the DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function.
+If a Document contains a DocList, it can still be stored in a Document Index.
+In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...).
 
-In the following example let's filter for all the books that are cheaper than 29 dollars:
+This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes, you can refactor your document schemas to avoid using `DocList`.
+
+
+**Index**
+
+In the following example you can see a complex schema that contains nested Documents with a subindex.
+The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields:
 
 ```python
-from docarray import BaseDoc, DocList
+class ImageDoc(BaseDoc):
+    url: ImageUrl
+    tensor_image: AnyTensor = Field(space='cosine', dim=64)
 
 
-class Book(BaseDoc):
-    title: str
-    price: int
+class VideoDoc(BaseDoc):
+    url: VideoUrl
+    images: DocList[ImageDoc]
+    tensor_video: AnyTensor = Field(space='cosine', dim=128)
 
 
-books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)])
-book_index = InMemoryExactNNIndex[Book](books)
+class MyDoc(BaseDoc):
+    docs: DocList[VideoDoc]
+    tensor: AnyTensor = Field(space='cosine', dim=256)
 
-# filter for books that are cheaper than 29 dollars
-query = {'price': {'$lte': 29}}
-cheap_books = book_index.filter(query)
 
-assert len(cheap_books) == 3
-for doc in cheap_books:
-    doc.summary()
+# create a Document Index
+doc_index = InMemoryExactNNIndex[MyDoc]()
+
+# create some data
+index_docs = [
+    MyDoc(
+        docs=DocList[VideoDoc](
+            [
+                VideoDoc(
+                    url=f'http://example.ai/videos/{i}-{j}',
+                    images=DocList[ImageDoc](
+                        [
+                            ImageDoc(
+                                url=f'http://example.ai/images/{i}-{j}-{k}',
+                                tensor_image=np.ones(64),
+ ) + for k in range(10) + ] + ), + tensor_video=np.ones(128), + ) + for j in range(10) + ] + ), + tensor=np.ones(256), + ) + for i in range(10) +] + +# index the Documents +doc_index.index(index_docs) ``` -
- Output - ```text - 📄 Book : 1f7da15 ... - ╭──────────────────────┬───────────────╮ - │ Attribute │ Value │ - ├──────────────────────┼───────────────┤ - │ title: str │ title 0 │ - │ price: int │ 0 │ - ╰──────────────────────┴───────────────╯ - 📄 Book : 63fd13a ... - ╭──────────────────────┬───────────────╮ - │ Attribute │ Value │ - ├──────────────────────┼───────────────┤ - │ title: str │ title 1 │ - │ price: int │ 10 │ - ╰──────────────────────┴───────────────╯ - 📄 Book : 49b21de ... - ╭──────────────────────┬───────────────╮ - │ Attribute │ Value │ - ├──────────────────────┼───────────────┤ - │ title: str │ title 2 │ - │ price: int │ 20 │ - ╰──────────────────────┴───────────────╯ - ``` -
+**Search**
 
-## Delete Documents
 
-To delete nested data, you need to specify the `id`.
+You can perform search on any subindex level by using the `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on.
 
-!!! note
-    You can only delete Documents at the top level. Deletion of Documents on lower levels is not yet supported.
+```python
+# find by the `VideoDoc` tensor
+root_docs, sub_docs, scores = doc_index.find_subindex(
+    np.ones(128), subindex='docs', search_field='tensor_video', limit=3
+)
 
+# find by the `ImageDoc` tensor
+root_docs, sub_docs, scores = doc_index.find_subindex(
+    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
+)
+```
+
+### Update elements
+
+In order to update a Document inside the index, you only need to reindex it with the updated attributes.
+
+First, let's create a schema for our index:
+
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import InMemoryExactNNIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now we can instantiate our index and add some data:
+
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = InMemoryExactNNIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Now we can find relevant documents:
+
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+and update the text of all these documents and reindex them:
+
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again, we can see that their text attribute has been updated accordingly:
 
 ```python
-# example of deleting nested and flat index
-del doc_index[index_docs[6].id]
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
 ```

From f5825f8a7a7370255b33a90e5b0dc42f01e417c6 Mon Sep 17 00:00:00 2001
From: jupyterjazz
Date: Thu, 6 Jul 2023 19:07:48 +0400
Subject: [PATCH 04/23] docs: weaviate v1

Signed-off-by: jupyterjazz
---
 docs/user_guide/storing/index_weaviate.md | 537 +++++++++++++++-------
 1 file changed, 379 insertions(+), 158 deletions(-)

diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md
index e4663d53d15..75884277d5b 100644
--- a/docs/user_guide/storing/index_weaviate.md
+++ b/docs/user_guide/storing/index_weaviate.md
@@ -1,17 +1,3 @@
----
-jupyter:
-  jupytext:
-    text_representation:
-      extension: .md
-      format_name: markdown
-      format_version: '1.3'
-    jupytext_version: 1.14.5
-  kernelspec:
-    display_name: Python 3 (ipykernel)
-    language: python
-    name: python3
----
-
 # Weaviate Document Index
 
 !!!
note "Install dependencies" @@ -24,15 +10,42 @@ jupyter: This is the user guide for the [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex], focusing on special features and configurations of Weaviate. -For general usage of a Document Index, see the [general user guide](./docindex.md). -# 1. Start Weaviate service +## Basic Usage +```python +from docarray import BaseDoc, DocList +from docarray.index import WeaviateDocumentIndex +from docarray.typing import NdArray +from pydantic import Field +import numpy as np + +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] = Field(is_embedding=True) + +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) + +# Initialize a new WeaviateDocumentIndex instance and add the documents to the index. +doc_index = WeaviateDocumentIndex[MyDoc]() +doc_index.index(docs) + +# Perform a vector search. +query = np.ones(128) +retrieved_docs = doc_index.find(query, limit=10) +``` + + +## Initialize + + +### Start Weaviate service To use [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex], DocArray needs to hook into a running Weaviate service. There are multiple ways to start a Weaviate instance, depending on your use case. - -## 1.1. Options - Overview +#### Options - Overview | Instance type | General use case | Configurability | Notes | | ----- | ----- | ----- | ----- | @@ -41,15 +54,15 @@ There are multiple ways to start a Weaviate instance, depending on your use case | **Docker-Compose** | Development | Yes | **Recommended for development + customizability** | | **Kubernetes** | Production | Yes | | -## 1.2. Instantiation instructions +### Instantiation instructions -### 1.2.1. 
WCS (managed instance) +#### WCS (managed instance) Go to the [WCS console](https://console.weaviate.cloud) and create an instance using the visual interface, following [this guide](https://weaviate.io/developers/wcs/guides/create-instance). Weaviate instances on WCS come pre-configured, so no further configuration is required. -### 1.2.2. Docker-Compose (self-managed) +#### Docker-Compose (self-managed) Get a configuration file (`docker-compose.yaml`). You can build it using [this interface](https://weaviate.io/developers/weaviate/installation/docker-compose), or download it directly with: @@ -58,14 +71,12 @@ curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/ ``` Where `v` is the actual version, such as `v1.18.3`. - ```bash curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?modules=standalone&runtime=docker-compose&weaviate_version=v1.18.3" ``` - -#### 1.2.2.1 Start up Weaviate with Docker-Compose +##### Start up Weaviate with Docker-Compose Then you can start up Weaviate by running from a shell: @@ -73,7 +84,7 @@ Then you can start up Weaviate by running from a shell: docker-compose up -d ``` -#### 1.2.2.2 Shut down Weaviate +##### Shut down Weaviate Then you can shut down Weaviate by running from a shell: @@ -86,14 +97,12 @@ docker-compose down Unless data persistence or backups are set up, shutting down the Docker instance will remove all its data. See documentation on [Persistent volume](https://weaviate.io/developers/weaviate/installation/docker-compose#persistent-volume) and [Backups](https://weaviate.io/developers/weaviate/configuration/backups) to prevent this if persistence is desired. - ```bash docker-compose up -d ``` - -### 1.2.3. 
Embedded Weaviate (from the application) +#### Embedded Weaviate (from the application) With Embedded Weaviate, Weaviate database server can be launched from the client, using: @@ -103,7 +112,7 @@ from docarray.index.backends.weaviate import EmbeddedOptions embedded_options = EmbeddedOptions() ``` -## 1.3. Authentication +### Authentication Weaviate offers [multiple authentication options](https://weaviate.io/developers/weaviate/configuration/authentication), as well as [authorization options](https://weaviate.io/developers/weaviate/configuration/authorization). @@ -116,9 +125,8 @@ With DocArray, you can use any of: To access a Weaviate instance. In general, **Weaviate recommends using API-key based authentication** for balance between security and ease of use. You can create, for example, read-only keys to distribute to certain users, while providing read/write keys to administrators. See below for examples of connection to Weaviate for each scenario. - -## 1.4. Connect to Weaviate +### Connect to Weaviate ```python from docarray.index.backends.weaviate import WeaviateDocumentIndex @@ -126,7 +134,6 @@ from docarray.index.backends.weaviate import WeaviateDocumentIndex ### Public instance - If using Embedded Weaviate: ```python @@ -136,7 +143,6 @@ dbconfig = WeaviateDocumentIndex.DBConfig(embedded_options=EmbeddedOptions()) ``` For all other options: - ```python dbconfig = WeaviateDocumentIndex.DBConfig( @@ -144,8 +150,7 @@ dbconfig = WeaviateDocumentIndex.DBConfig( ) # Replace with your endpoint) ``` - -### OIDC with username + password +#### OIDC with username + password To authenticate against a Weaviate instance with OIDC username & password: @@ -156,7 +161,6 @@ dbconfig = WeaviateDocumentIndex.DBConfig( host="http://localhost:8080", # Replace with your endpoint ) ``` - ```python # dbconfig = WeaviateDocumentIndex.DBConfig( @@ -166,8 +170,7 @@ dbconfig = WeaviateDocumentIndex.DBConfig( # ) ``` - -### API key-based authentication +#### API key-based 
authentication To authenticate against a Weaviate instance an API key: @@ -177,13 +180,187 @@ dbconfig = WeaviateDocumentIndex.DBConfig( host="http://localhost:8080", # Replace with your endpoint ) ``` - +## Index + +Putting it together, we can add data below using Weaviate as the Document Index: + +```python +import numpy as np +from pydantic import Field +from docarray import BaseDoc +from docarray.typing import NdArray +from docarray.index.backends.weaviate import WeaviateDocumentIndex + + +# Define a document schema +class Document(BaseDoc): + text: str + embedding: NdArray[2] = Field( + dims=2, is_embedding=True + ) # Embedding column -> vector representation of the document + file: NdArray[100] = Field(dims=100) + + +# Make a list of 3 docs to index +docs = [ + Document( + text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1" + ), + Document( + text="Hello world, how are you?", + embedding=np.array([3, 4]), + file=np.random.rand(100), + id="2", + ), + Document( + text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut", + embedding=np.array([5, 6]), + file=np.random.rand(100), + id="3", + ), +] + +batch_config = { + "batch_size": 20, + "dynamic": False, + "timeout_retries": 3, + "num_workers": 1, +} + +runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config) + +store = WeaviateDocumentIndex[Document](db_config=dbconfig) +store.configure(runtimeconfig) # Batch settings being passed on +store.index(docs) +``` + +### Notes + +- To use vector search, you need to specify `is_embedding` for exactly one field. + - This is because Weaviate is configured to allow one vector per data object. + - If you would like to see Weaviate support multiple vectors per object, [upvote the issue](https://github.com/weaviate/weaviate/issues/2465) which will help to prioritize it. 
+- For a field to be considered as an embedding, its type needs to be of subclass `np.ndarray` or `AbstractTensor` and `is_embedding` needs to be set to `True`.
+  - If `is_embedding` is set to `False` or not provided, the field will be treated as a `number[]`, and as a result, it will not be added to Weaviate's vector index.
+- It is possible to create a schema without specifying `is_embedding` for any field.
+  - This will however mean that the document will not be vectorized and cannot be searched using vector search.
+
+
+## Vector Search
+
+To perform a vector similarity search, use the syntax below.
+
+This will perform a vector similarity search for the vector [1, 2] and return the first two results:
+
+```python
+docs = store.find([1, 2], limit=2)
+```
+
+## Filter
+
+To perform filtering, use the syntax below.
+
+This will filter on the `text` field, returning documents whose `text` equals "Hello world":
+
+```python
+docs = store.filter({"path": ["text"], "operator": "Equal", "valueText": "Hello world"})
+```
+
+
+## Text search
+
+To perform a text search, use the syntax below.
+
+This will perform a text search for the word "world" in the field "text" and return the first two results:
+
+```python
+docs = store.text_search("world", search_field="text", limit=2)
+```
 
-
-# 2. Configure Weaviate
-## 2.1. Overview
+## Hybrid search
+
+To perform a hybrid search, use the syntax below.
+
+This will perform a hybrid search for the word "world" and the vector [1, 2] and return the first two results:
+
+**Note**: Hybrid search searches through the object vector and all fields. Accordingly, the `search_field` keyword will have no effect.
+
+```python
+q = store.build_query().text_search("world").find([1, 2]).limit(2).build()
+
+docs = store.execute_query(q)
+```
+
+### GraphQL query
+
+You can also run a raw GraphQL query, using any syntax you might use natively with Weaviate. This gives you access to the full range of queries that Weaviate supports.
+ +The below will perform a GraphQL query to obtain the count of `Document` objects. + +```python +graphql_query = """ +{ + Aggregate { + Document { + meta { + count + } + } + } +} +""" + +store.execute_query(graphql_query) +``` + +Note that running a raw GraphQL query will return Weaviate-type responses, rather than a DocArray object type. + +You can find the documentation for [Weaviate's GraphQL API here](https://weaviate.io/developers/weaviate/api/graphql). + +## Access Documents + +To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. + +You can also access data by the `id` that was assigned to each document: + +```python +# prepare some data +data = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) +) + +# remember the Document ids and index the data +ids = data.id +store.index(data) + +# access the Documents by id +doc = store[ids[0]] # get by single id +docs = store[ids] # get by list of ids +``` + + +## Delete Documents + +In the same way you can access Documents by id, you can also delete them: + +```python +# prepare some data +data = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) +) + +# remember the Document ids and index the data +ids = data.id +store.index(data) + +# access the Documents by id +del store[ids[0]] # del by single id +del store[ids[1:]] # del by list of ids +``` + +## Configuration + +### Overview **WCS instances come pre-configured**, and as such additional settings are not configurable outside of those chosen at creation, such as whether to enable authentication. @@ -199,7 +376,7 @@ Some of the more commonly used settings include: And a list of environment variables is [available on this page](https://weaviate.io/developers/weaviate/config-refs/env-vars). -## 2.2. 
DocArray instantiation configuration options +### DocArray instantiation configuration options Additionally, you can specify the below settings when you instantiate a configuration object in DocArray. @@ -218,7 +395,7 @@ Additionally, you can specify the below settings when you instantiate a configur The type `EmbeddedOptions` can be specified as described [here](https://weaviate.io/developers/weaviate/installation/embedded#embedded-options) -## 2.3. Runtime configuration +### Runtime configuration Weaviate strongly recommends using batches to perform bulk operations such as importing data, as it will significantly impact performance. You can specify a batch configuration as in the below example, and pass it on as runtime configuration. @@ -246,10 +423,9 @@ store.configure(runtimeconfig) # Batch settings being passed on Read more: - Weaviate [docs on batching with the Python client](https://weaviate.io/developers/weaviate/client-libraries/python#batching) - - -## 3. Available column types + +### Available column types Python data types are mapped to Weaviate type according to the below conventions. @@ -274,165 +450,210 @@ class StringDoc(BaseDoc): ``` A list of available Weaviate data types [is here](https://weaviate.io/developers/weaviate/config-refs/datatypes). - -## 4. Adding example data -Putting it together, we can add data below using Weaviate as the Document Index: +## Nested data -```python -import numpy as np -from pydantic import Field -from docarray import BaseDoc -from docarray.typing import NdArray -from docarray.index.backends.weaviate import WeaviateDocumentIndex +The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. 
+**Index nested data:** -# Define a document schema -class Document(BaseDoc): - text: str - embedding: NdArray[2] = Field( - dims=2, is_embedding=True - ) # Embedding column -> vector representation of the document - file: NdArray[100] = Field(dims=100) +It is, however, also possible to represent nested Documents and store them in a Document Index. +In the following example you can see a complex schema that contains nested Documents. +The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: -# Make a list of 3 docs to index -docs = [ - Document( - text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1" - ), - Document( - text="Hello world, how are you?", - embedding=np.array([3, 4]), - file=np.random.rand(100), - id="2", - ), - Document( - text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut", - embedding=np.array([5, 6]), - file=np.random.rand(100), - id="3", - ), -] +```python +from docarray.typing import ImageUrl, VideoUrl, AnyTensor -batch_config = { - "batch_size": 20, - "dynamic": False, - "timeout_retries": 3, - "num_workers": 1, -} -runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config) +# define a nested schema +class ImageDoc(BaseDoc): + url: ImageUrl + tensor: AnyTensor = Field(space='cosine', dim=64) -store = WeaviateDocumentIndex[Document](db_config=dbconfig) -store.configure(runtimeconfig) # Batch settings being passed on -store.index(docs) -``` -### 4.1. Notes - -- To use vector search, you need to specify `is_embedding` for exactly one field. - - This is because Weaviate is configured to allow one vector per data object. - - If you would like to see Weaviate support multiple vectors per object, [upvote the issue](https://github.com/weaviate/weaviate/issues/2465) which will help to prioritize it. 
-- For a field to be considered as an embedding, its type needs to be of subclass `np.ndarray` or `AbstractTensor` and `is_embedding` needs to be set to `True`. - - If `is_embedding` is set to `False` or not provided, the field will be treated as a `number[]`, and as a result, it will not be added to Weaviate's vector index. -- It is possible to create a schema without specifying `is_embedding` for any field. - - This will however mean that the document will not be vectorized and cannot be searched using vector search. +class VideoDoc(BaseDoc): + url: VideoUrl + tensor: AnyTensor = Field(space='cosine', dim=128, is_embedding=True) -## 5. Query Builder/Hybrid Search -### 5.1. Text search +class YouTubeVideoDoc(BaseDoc): + title: str + description: str + thumbnail: ImageDoc + video: VideoDoc + tensor: AnyTensor = Field(space='cosine', dim=256) -To perform a text search, follow the below syntax. -This will perform a text search for the word "hello" in the field "text" and return the first two results: +# create a Document Index +doc_index = WeaviateDocumentIndex[YouTubeVideoDoc]() -```python -q = store.build_query().text_search("world", search_field="text").limit(2).build() +# create some data +index_docs = [ + YouTubeVideoDoc( + title=f'video {i+1}', + description=f'this is video from author {10*i}', + thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), + video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), + tensor=np.ones(256), + ) + for i in range(8) +] -docs = store.execute_query(q) -docs +# index the Documents +doc_index.index(index_docs) ``` -### 5.2. Vector similarity search +**Search nested data:** -To perform a vector similarity search, follow the below syntax. +You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
-This will perform a vector similarity search for the vector [1, 2] and return the first two results: +In the following example, you can see how to perform vector search on the `tensor` field of the nested `video` document: ```python -q = store.build_query().find([1, 2]).limit(2).build() +# create a query Document +query_doc = YouTubeVideoDoc( + title=f'video query', + description=f'this is a query video', + thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), + video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), + tensor=np.ones(256), +) -docs = store.execute_query(q) -docs +# find by the `video` tensor; nested level +docs, scores = doc_index.find(query_doc, limit=3) ``` -### 5.3. Hybrid search +### Nested data with subindex -To perform a hybrid search, follow the below syntax. +Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). -This will perform a hybrid search for the word "hello" and the vector [1, 2] and return the first two results: +If a Document contains a DocList, it can still be stored in a Document Index. +In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). + +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. -**Note**: Hybrid search searches through the object vector and all fields. Accordingly, the `search_field` keyword it will have no effect. + +**Index** + +In the following example you can see a complex schema that contains nested Documents with subindex.
+The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: ```python -q = store.build_query().text_search("world").find([1, 2]).limit(2).build() +class ImageDoc(BaseDoc): + url: ImageUrl + tensor_image: AnyTensor = Field(space='cosine', dim=64, is_embedding=True) + + +class VideoDoc(BaseDoc): + url: VideoUrl + images: DocList[ImageDoc] + tensor_video: AnyTensor = Field(space='cosine', dim=128, is_embedding=True) + + +class MyDoc(BaseDoc): + docs: DocList[VideoDoc] + tensor: AnyTensor = Field(space='cosine', dim=256, is_embedding=True) + + +# create a Document Index +doc_index = WeaviateDocumentIndex[MyDoc]() + +# create some data +index_docs = [ + MyDoc( + docs=DocList[VideoDoc]( + [ + VideoDoc( + url=f'http://example.ai/videos/{i}-{j}', + images=DocList[ImageDoc]( + [ + ImageDoc( + url=f'http://example.ai/images/{i}-{j}-{k}', + tensor_image=np.ones(64), + ) + for k in range(10) + ] + ), + tensor_video=np.ones(128), + ) + for j in range(10) + ] + ), + tensor=np.ones(256), + ) + for i in range(10) +] -docs = store.execute_query(q) -docs +# index the Documents +doc_index.index(index_docs) ``` -### 5.4. GraphQL query - -You can also perform a raw GraphQL query using any syntax as you might natively in Weaviate. This allows you to run any of the full range of queries that you might wish to. +**Search** -The below will perform a GraphQL query to obtain the count of `Document` objects. +You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. 
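The dunder-path convention used by `find_subindex()` (and by nested paths such as `docs__images`) can be pictured as splitting the path on `__` and descending one nesting level per segment. Below is a toy sketch of that idea — an illustration only, not DocArray's actual implementation:

```python
def resolve_path(obj, path: str):
    """Descend one nesting level per '__'-separated segment of the path."""
    for segment in path.split('__'):
        obj = obj[segment]
    return obj


# a plain dict standing in for a nested document tree
doc = {'docs': {'images': {'tensor_image': [1.0, 2.0]}}}
assert resolve_path(doc, 'docs__images__tensor_image') == [1.0, 2.0]
```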
```python -graphql_query = """ -{ - Aggregate { - Document { - meta { - count - } - } - } -} -""" +# find by the `VideoDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(128), subindex='docs', limit=3 +) -store.execute_query(graphql_query) +# find by the `ImageDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(64), subindex='docs__images', limit=3 +) ``` -Note that running a raw GraphQL query will return Weaviate-type responses, rather than a DocArray object type. - -You can find the documentation for [Weaviate's GraphQL API here](https://weaviate.io/developers/weaviate/api/graphql). +### Update elements +In order to update a Document inside the index, you only need to reindex it with the updated attributes. - -## 6. Other notes +First, let's create a schema for our index: +```python +import numpy as np +from pydantic import Field +from docarray import BaseDoc, DocList +from docarray.typing import NdArray +from docarray.index import WeaviateDocumentIndex +class MyDoc(BaseDoc): + text: str + embedding: NdArray[128] = Field(is_embedding=True) +``` +Now we can instantiate our index and add some data: -### 6.1. DocArray IDs vs Weaviate IDs +```python +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] +) +index = WeaviateDocumentIndex[MyDoc]() +index.index(docs) +assert index.num_docs() == 100 +``` -As you saw earlier, the `id` field is a special field that is used to identify a document in `BaseDoc`. +Now we can find relevant documents: ```python -Document( - text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1" -), +res = index.find(query=docs[0], limit=100) +assert len(res.documents) == 100 +for doc in res.documents: + assert 'I am the first version' in doc.text ``` -This is not the same as Weaviate's own `id`, which is a reserved keyword and can't be used as a field name.
- -Accordingly, the DocArray document id is stored internally in Weaviate as `docarrayid`. - +Now we can update the text of these documents and reindex them: -## 7. Shut down Weaviate instance +```python +for i, doc in enumerate(docs): + doc.text = f'I am the second version of Document {i}' -```bash -docker-compose down +index.index(docs) +assert index.num_docs() == 100 ``` ------ ------ ------ +When we retrieve them again, we can see that their text attribute has been updated accordingly: + +```python +res = index.find(query=docs[0], limit=100) +assert len(res.documents) == 100 +for doc in res.documents: + assert 'I am the second version' in doc.text +``` From 4a3e25cfa789437ca5e9ef7738d6ec44ede6e446 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 17 Jul 2023 14:32:45 +0200 Subject: [PATCH 05/23] docs: introduction page Signed-off-by: jupyterjazz --- docs/user_guide/storing/docindex.md | 717 ++-------------------------- 1 file changed, 30 insertions(+), 687 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index af9488a11e3..8159ac2ae59 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -38,715 +38,58 @@ Currently, DocArray supports the following vector databases: - [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md) - [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md) - [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) +- InMemoryExactNNIndex | [Docs](index_in_memory.md) -For this user guide you will use the [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] + +## Basic Usage + +For this user guide you will use the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] because it doesn't require you to launch a database server. Instead, it will store your data locally. !!!
note "Using a different vector database" You can easily use Weaviate, Qdrant, or Elasticsearch instead -- they share the same API! To do so, check their respective documentation sections. -!!! note "Hnswlib-specific settings" +!!! note "InMemory-specific settings" The following sections explain the general concept of Document Index by using - [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] as an example. - For HNSWLib-specific settings, check out the [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] documentation - [here](index_hnswlib.md). + [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] as an example. + For InMemory-specific settings, check out the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] documentation + [here](index_in_memory.md). ```python from docarray import BaseDoc, DocList from docarray.index import InMemoryExactNNIndex from docarray.typing import NdArray import numpy as np # Define the document schema. class MyDoc(BaseDoc): title: str price: int embedding: NdArray[128] # Create documents (using dummy/random vectors) docs = DocList[MyDoc](MyDoc(title=f'title #{i}', price=i, embedding=np.random.rand(128)) for i in range(10)) # Initialize a new InMemoryExactNNIndex instance and add the documents to the index. doc_index = InMemoryExactNNIndex[MyDoc]() doc_index.index(docs) # Perform a vector search. query = np.ones(128) retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) # Perform filtering (price < 5) query = {'price': {'$lt': 5}} filtered_docs = doc_index.filter(query, limit=10) # Perform a hybrid search - combining vector search with filtering query = ( doc_index.build_query() # get empty query object .find(np.ones(128), search_field='embedding') # add vector similarity search .filter(filter_query={'price': {'$gte': 2}}) # add filter search .build() # build the query ) results = doc_index.execute_query(query)
- -Most vector databases need to know the dimensionality of the vectors that will be stored. -Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that -the database will store vectors with 128 dimensions. - -!!! note "PyTorch and TensorFlow support" - Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that - for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! - - -### Using a predefined Document as schema - -DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. -If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: -Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. - -The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` -field. But this is crucial information for any vector database to work properly! 
- -You can work around this problem by subclassing the predefined Document and adding the dimensionality information: - -=== "Using type hint" - ```python - from docarray.documents import TextDoc - from docarray.typing import NdArray - from docarray.index import HnswDocumentIndex - - - class MyDoc(TextDoc): - embedding: NdArray[128] - - - db = HnswDocumentIndex[MyDoc](work_dir='test_db') - ``` - -=== "Using Field()" - ```python - from docarray.documents import TextDoc - from docarray.typing import AnyTensor - from docarray.index import HnswDocumentIndex - from pydantic import Field - - - class MyDoc(TextDoc): - embedding: AnyTensor = Field(dim=128) - - - db = HnswDocumentIndex[MyDoc](work_dir='test_db3') - ``` - -Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or your custom Document type. - -The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: - -```python -from docarray import DocList - -# data of type TextDoc -data = DocList[TextDoc]( - [ - TextDoc(text='hello world', embedding=np.random.rand(128)), - TextDoc(text='hello world', embedding=np.random.rand(128)), - TextDoc(text='hello world', embedding=np.random.rand(128)), - ] -) - -# you can index this into Document Index of type MyDoc -db.index(data) -``` - - -**Database location:** - -For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you -usually specify a `host` and a `port` instead. - -In addition to a host and a port, most backends can also take an `index_name`, `table_name`, `collection_name` or similar. -This specifies the name of the index/table/collection that will be created in the database. 
-You don't have to specify this though: By default, this name will be taken from the name of the Document type that you use as schema. -For example, for `WeaviateDocumentIndex[MyDoc](...)` the data will be stored in a Weaviate Class of name `MyDoc`. - -In any case, if the location does not yet contain any data, we start from a blank slate. -If the location already contains data from a previous session, it will be accessible through the Document Index. - -## Index data - -Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: - -```python -import numpy as np -from docarray import DocList - -# create some random data -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] -) - -# index the data -db.index(docs) -``` - -That call to [index()][docarray.index.backends.hnswlib.HnswDocumentIndex.index] stores all Documents in `docs` into the Document Index, -ready to be retrieved in the next step. - -As you can see, `DocList[MyDoc]` and `HnswDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. -This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store -need to have compatible schemas. - -!!! question "When are two schemas compatible?" - The schemas of your Document Index and data need to be compatible with each other. - - Let's say A is the schema of your Document Index and B is the schema of your data. - There are a few rules that determine if schema A is compatible with schema B. - If _any_ of the following are true, then A and B are compatible: - - - A and B are the same class - - A and B have the same field names and field types - - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A - - In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. 
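The compatibility rules above boil down to: identical field names, with each data-side field type equal to — or a subclass of — the index-side type. A toy check illustrating the rule (a simplified sketch, not DocArray's actual logic):

```python
from typing import get_type_hints


def schemas_compatible(index_schema: type, data_schema: type) -> bool:
    """Same field names, and every data field type is a subclass of the index field type."""
    index_fields = get_type_hints(index_schema)
    data_fields = get_type_hints(data_schema)
    return index_fields.keys() == data_fields.keys() and all(
        issubclass(data_fields[name], index_fields[name]) for name in index_fields
    )


class IndexSchema:
    text: str
    score: float


class DataSchema:  # same field names and types -> compatible
    text: str
    score: float


assert schemas_compatible(IndexSchema, DataSchema)
```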
- -## Vector similarity search - -Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. - -By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find -similar Documents in the Document Index: -=== "Search by Document" +# Create documents (using dummy/random vectors) +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', price=i, embedding=np.random.rand(128)) for i in range(10)) - ```python - # create a query Document - query = MyDoc(embedding=np.random.rand(128), text='query') +# Initialize a new HnswDocumentIndex instance and add the documents to the index. +doc_index = HnswDocumentIndex[MyDoc](workdir='./my_index') +doc_index.index(docs) - # find similar Documents - matches, scores = db.find(query, search_field='embedding', limit=5) +# Perform a vector search. +query = np.ones(128) +retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) - print(f'{matches=}') - print(f'{matches.text=}') - print(f'{scores=}') - ``` - -=== "Search by raw vector" - - ```python - # create a query vector - query = np.random.rand(128) - - # find similar Documents - matches, scores = db.find(query, search_field='embedding', limit=5) - - print(f'{matches=}') - print(f'{matches.text=}') - print(f'{scores=}') - ``` - -To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the -basis of comparison between your query and the documents in the Document Index. - -In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. -In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose -which one to use for the search. 
- -The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest -matching documents and their associated similarity scores. - -When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. - -How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations). - -### Batched search - -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. - -=== "Search by Documents" - - ```python - # create some query Documents - queries = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) - ) - - # find similar Documents - matches, scores = db.find_batched(queries, search_field='embedding', limit=5) - - print(f'{matches=}') - print(f'{matches[0].text=}') - print(f'{scores=}') - ``` - -=== "Search by raw vectors" - - ```python - # create some query vectors - query = np.random.rand(3, 128) - - # find similar Documents - matches, scores = db.find_batched(query, search_field='embedding', limit=5) - - print(f'{matches=}') - print(f'{matches[0].text=}') - print(f'{scores=}') - ``` - -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing -a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. 
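Conceptually, what `find()` and `find_batched()` compute is a score for every stored vector against the query, followed by top-k selection. A minimal NumPy sketch of exact cosine-similarity search — an illustration only; real backends typically use approximate (ANN) indexes and configurable metrics:

```python
import numpy as np


def find_topk(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Rank stored vectors by cosine similarity; return top-k indices and scores."""
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    top = np.argsort(-sims)[:k]
    return top, sims[top]


vectors = np.random.rand(100, 128)  # stands in for the indexed embeddings
query = np.random.rand(128)
indices, scores = find_topk(query, vectors, k=5)
assert len(indices) == 5 and all(scores[i] >= scores[i + 1] for i in range(4))
```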
- -## Filter search and text search - -In addition to vector similarity search, the Document Index interface offers methods for text search and filtered search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search] and [filter()][docarray.index.abstract.BaseDocIndex.filter], -as well as their batched versions [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched] and [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched]. [filter_subindex()][docarray.index.abstract.BaseDocIndex.filter_subindex] is for filter on subindex level. - -!!! note - The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter - or text search. - - To see how to perform filter or text search, you can check out other backends that offer support. - -## Hybrid search through the query builder - -Document Index supports atomic operations for vector similarity search, text search and filter search. - -To combine these operations into a single, hybrid search query, you can use the query builder that is accessible -through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]: - -```python -# prepare a query -q_doc = MyDoc(embedding=np.random.rand(128), text='query') +# Perform filtering (price < 5) +query = {'price': {'$lt': 5}} +filtered_docs = doc_index.filter(query, limit=10) +# Perform a hybrid search - combining vector search with filtering query = ( - db.build_query() # get empty query object - .find(query=q_doc, search_field='embedding') # add vector similarity search - .filter(filter_query={'text': {'$exists': True}}) # add filter search + doc_index.build_query() # get empty query object + .find(np.ones(128), search_field='embedding') # add vector similarity search + .filter(filter_query={'price': {'$gte': 2}}) # add filter search .build() # build the query ) - -# execute the combined query and return the results -results = db.execute_query(query) 
-print(f'{results=}') -``` - -In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search -to obtain a combined set of results. - -The kinds of atomic queries that can be combined in this way depends on the backend. -Some backends can combine text search and vector search, while others can perform filters and vectors search, etc. -To see what backend can do what, check out the [specific docs](#document-index). - -## Access documents by `id` - -To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. - -You can also access data by the `id` that was assigned to each document: - -```python -# prepare some data -data = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) -) - -# remember the Document ids and index the data -ids = data.id -db.index(data) - -# access the Documents by id -doc = db[ids[0]] # get by single id -docs = db[ids] # get by list of ids -``` - -## Delete Documents - -In the same way you can access Documents by id, you can also delete them: - -```python -# prepare some data -data = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) -) - -# remember the Document ids and index the data -ids = data.id -db.index(data) - -# access the Documents by id -del db[ids[0]] # del by single id -del db[ids[1:]] # del by list of ids -``` - -## Customize configurations - -DocArray's philosophy is that each Document Index should "just work", meaning that it comes with a sane set of defaults -that get you most of the way there. 
- -However, there are different configurations that you may want to tweak, including: - -- The [ANN](https://ignite.apache.org/docs/latest/machine-learning/binary-classification/ann) algorithm used, for example [HNSW](https://www.pinecone.io/learn/hnsw/) or [ScaNN](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) -- Hyperparameters of the ANN algorithm, such as `ef_construction` for HNSW -- The distance metric to use, such as cosine or L2 distance -- The data type of each column in the database -- And many more... - -The specific configurations that you can tweak depend on the backend, but the interface to do so is universal. - -Document Indexes differentiate between three different kind of configurations: - -### Database configurations - -_Database configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column), -and that you _don't_ dynamically change at runtime. - -This commonly includes: - -- host and port -- index or collection name -- authentication settings -- ... - -For every backend, you can get a full list of configurations and their defaults: - -```python -from docarray.index import HnswDocumentIndex - - -db_config = HnswDocumentIndex.DBConfig() -print(db_config) - -# > HnswDocumentIndex.DBConfig(work_dir='.') -``` - -As you can see, `HnswDocumentIndex.DBConfig` is a dataclass that contains only one possible configuration, `work_dir`, -that defaults to `.`. 
- -You can customize every field in this configuration: - -=== "Pass individual settings" - - ```python - db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db') - - custom_db_config = db._db_config - print(custom_db_config) - - # > HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') - ``` - -=== "Pass entire configuration" - - ```python - custom_db_config = HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') - - db = HnswDocumentIndex[MyDoc](custom_db_config) - - print(db._db_config) - - # > HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') - ``` - -### Runtime configurations - -_Runtime configurations_ are configurations that relate to the way how an `instance` operates with respect to a specific -database. - - -This commonly includes: -- default batch size for batching operations -- default consistency level for various database operations -- ... - -For every backend, you can get the full list of configurations and their defaults: - -```python -from docarray.index import ElasticDocIndex - - -runtime_config = ElasticDocIndex.RuntimeConfig() -print(runtime_config) - -# > ElasticDocIndex.RuntimeConfig(chunk_size=500) -``` - -As you can see, `HnswDocumentIndex.RuntimeConfig` is a dataclass that contains only one configuration: -`default_column_config`, which is a mapping from Python types to database column configurations. 
- -You can customize every field in this configuration using the [configure()][docarray.index.abstract.BaseDocIndex.configure] method: - -=== "Pass individual settings" - - ```python - db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db') - - db.configure( - default_column_config={ - np.ndarray: { - 'dim': -1, - 'index': True, - 'space': 'ip', - 'max_elements': 2048, - 'ef_construction': 100, - 'ef': 15, - 'M': 8, - 'allow_replace_deleted': True, - 'num_threads': 5, - }, - None: {}, - } - ) - - custom_runtime_config = db._runtime_config - print(custom_runtime_config) - - # > HnswDocumentIndex.RuntimeConfig(default_column_config={: {'dim': -1, 'index': True, 'space': 'ip', 'max_elements': 2048, 'ef_construction': 100, 'ef': 15, 'M': 8, 'allow_replace_deleted': True, 'num_threads': 5}, None: {}}) - ``` - -=== "Pass entire configuration" - - ```python - custom_runtime_config = HnswDocumentIndex.RuntimeConfig( - default_column_config={ - np.ndarray: { - 'dim': -1, - 'index': True, - 'space': 'ip', - 'max_elements': 2048, - 'ef_construction': 100, - 'ef': 15, - 'M': 8, - 'allow_replace_deleted': True, - 'num_threads': 5, - }, - None: {}, - } - ) - - db = HnswDocumentIndex[MyDoc](work_dir='/tmp/my_db') - - db.configure(custom_runtime_config) - - print(db._runtime_config) - - # > HHnswDocumentIndex.RuntimeConfig(default_column_config={: {'dim': -1, 'index': True, 'space': 'ip', 'max_elements': 2048, 'ef_construction': 100, 'ef': 15, 'M': 8, 'allow_replace_deleted': True, 'num_threads': 5}, None: {}}) - ``` - -After this change, the new setting will be applied to _every_ column that corresponds to a `np.ndarray` type. - -### Column configurations - -For many vector databases, individual columns can have different configurations. - -This commonly includes: -- the data type of the column, e.g. 
`vector` vs `varchar` -- the dimensionality of the vector (if it is a vector column) -- whether an index should be built for a specific column - -The available configurations vary from backend to backend, but in any case you can pass them -directly in the schema of your Document Index, using the `Field()` syntax: - -```python -from pydantic import Field - - -class Schema(BaseDoc): - tens: NdArray[100] = Field(max_elements=12, space='cosine') - tens_two: NdArray[10] = Field(M=4, space='ip') - - -db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db') -``` - -The `HnswDocumentIndex` above contains two columns which are configured differently: -- `tens` has a dimensionality of `100`, can take up to `12` elements, and uses the `cosine` similarity space -- `tens_two` has a dimensionality of `10`, and uses the `ip` similarity space, and an `M` hyperparameter of 4 - -All configurations that are not explicitly set will be taken from the `default_column_config` of the `DBConfig`. -You can modify these defaults in the following way: - -```python -import numpy as np -from pydantic import Field - -from docarray import BaseDoc -from docarray.index import HnswDocumentIndex -from docarray.typing import NdArray - - -class Schema(BaseDoc): - tens: NdArray[100] = Field(max_elements=12, space='cosine') - tens_two: NdArray[10] = Field(M=4, space='ip') - - -# create a DBConfig for your Document Index -conf = HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') -# update the default max_elements for np.ndarray columns -conf.default_column_config.get(np.ndarray).update(max_elements=2048) -# create Document Index -# tens has a max_elements of 12, specified in the schema -# tens_two has a max_elements of 2048, specified by the default in the DBConfig -db = HnswDocumentIndex[Schema](conf) -``` - - -For an explanation of the configurations that are tweaked in this example, see the `HnswDocumentIndex` [documentation](index_hnswlib.md). 
- -## Nested data - -The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. - -**Index nested data:** - -It is, however, also possible to represent nested Documents and store them in a Document Index. - -In the following example you can see a complex schema that contains nested Documents. -The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='/tmp2') - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
- -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the `youtubevideo` tensor; root level -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) - -# find by the `thumbnail` tensor; nested level -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) - -# find by the `video` tensor; neseted level -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) +results = doc_index.execute_query(query) ``` - -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). - -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). - -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. 
-The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine', dim=128) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = HnswDocumentIndex[MyDoc](work_dir='/tmp3') - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. - -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 -) -``` - -!!! 
note "Subindex not supported with InMemoryExactNNIndex" - Currently, subindex feature is not available for InMemoryExactNNIndex From db77beb3968579261d0f2c43ccf2bb840ec0155a Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 17 Jul 2023 15:17:13 +0200 Subject: [PATCH 06/23] docs: redis v1 Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_redis.md | 618 +++++++++++++++++++++++++ 1 file changed, 618 insertions(+) create mode 100644 docs/user_guide/storing/index_redis.md diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md new file mode 100644 index 00000000000..aac8d0bd1e0 --- /dev/null +++ b/docs/user_guide/storing/index_redis.md @@ -0,0 +1,618 @@ +# Redis Document Index + +!!! note "Install dependencies" + To use [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex], you need to install extra dependencies with the following command: + + ```console + pip install "docarray[redis]" + ``` + +This is the user guide for the [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex], +focusing on special features and configurations of Redis. + + +## Basic Usage +```python +from docarray import BaseDoc, DocList +from docarray.index import RedisDocumentIndex +from docarray.typing import NdArray +from pydantic import Field +import numpy as np + +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] + +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) + +# Initialize a new RedisDocumentIndex instance and add the documents to the index. +doc_index = RedisDocumentIndex[MyDoc](host='localhost') +doc_index.index(docs) + +# Perform a vector search. 
+query = np.ones(128)
+retrieved_docs = doc_index.find(query, search_field='embedding', limit=10)
+```
+
+
+## Initialize
+
+Before initializing [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex],
+make sure that you have a Redis service that you can connect to.
+
+You can create a local Redis service with the following command:
+
+```shell
+docker run --name redis-stack-server -p 6379:6379 -d redis/redis-stack-server:7.2.0-RC2
+```
+Next, you can create [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex]:
+```python
+from docarray import BaseDoc
+from docarray.index import RedisDocumentIndex
+from docarray.typing import NdArray
+
+
+class MyDoc(BaseDoc):
+    embedding: NdArray[128]
+    text: str
+
+
+doc_index = RedisDocumentIndex[MyDoc](host='localhost')
+```
+
+
+### Schema definition
+In this code snippet, `RedisDocumentIndex` takes a schema of the form of `MyDoc`.
+The Document Index then _creates a column for each field in `MyDoc`_.
+
+The column types in the backend database are determined by the type hints of the document's fields.
+Optionally, you can [customize the database types for every field](#configuration).
+
+Most vector databases need to know the dimensionality of the vectors that will be stored.
+Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that
+the database will store vectors with 128 dimensions.
+
+!!! note "PyTorch and TensorFlow support"
+    Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that
+    for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually!
+
+
+### Using a predefined Document as schema
+
+DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
+If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! + +You can work around this problem by subclassing the predefined Document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import RedisDocumentIndex + + + class MyDoc(TextDoc): + embedding: NdArray[128] + + + doc_index = RedisDocumentIndex[MyDoc]() + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import RedisDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128) + + + doc_index = RedisDocumentIndex[MyDoc]() + ``` + +Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the +predefined Document type, or your custom Document type. + +The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) + +# you can index this into Document Index of type MyDoc +doc_index.index(data) +``` + +## Index + +Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] +) + +# index the data +doc_index.index(docs) +``` + +That call to [index()][docarray.index.backends.redis.RedisDocumentIndex.index] stores all Documents in `docs` into the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `RedisDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. +This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store +need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. 
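The compatibility rules above can be sketched in plain Python. The check below is only an illustration of the subclass rule using `__annotations__`, not DocArray's actual implementation:

```python
def schemas_compatible(index_schema: type, data_schema: type) -> bool:
    """Illustrative check: same field names, and every data field type is
    the same as, or a subclass of, the corresponding index field type."""
    a, b = index_schema.__annotations__, data_schema.__annotations__
    if a.keys() != b.keys():
        return False
    return all(
        a[name] is b[name]
        or (
            isinstance(a[name], type)
            and isinstance(b[name], type)
            and issubclass(b[name], a[name])
        )
        for name in a
    )


class IndexSchema:
    text: str
    num: int


class SameFields:  # same names, same types -> compatible
    text: str
    num: int


class MyStr(str):  # a subclass counts as compatible, too
    pass


class SubTypes:
    text: MyStr
    num: int


class ExtraField:  # extra field -> not compatible
    text: str
    num: int
    extra: float
```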
+
+
+## Vector Search
+
+Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.
+
+By passing a document of type `MyDoc` to [find()][docarray.index.abstract.BaseDocIndex.find], you can find
+similar Documents in the Document Index:
+
+=== "Search by Document"
+
+    ```python
+    # create a query Document
+    query = MyDoc(embedding=np.random.rand(128), text='query')
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To successfully perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
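As a rough mental model of what a cosine-based `find()` computes (not Redis's actual HNSW-accelerated implementation), a brute-force version looks like this:

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors: 0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def brute_force_find(query, embeddings, limit=5):
    """Return (index, distance) pairs of the `limit` closest embeddings."""
    scored = [(i, cosine_distance(query, emb)) for i, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]
```

A real vector index avoids this exhaustive scan with an ANN structure, but the ranking it approximates is the same.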
+
+You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
+
+
+## Filter
+
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The query should follow the [Redis query language](https://redis.io/docs/interact/search-and-query/query/).
+
+In the following example let's filter for all the books that are cheaper than 29 dollars:
+
+```python
+from docarray import BaseDoc, DocList
+
+
+class Book(BaseDoc):
+    title: str
+    price: int
+
+
+books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)])
+book_index = RedisDocumentIndex[Book](host='localhost')
+book_index.index(books)
+
+# filter for books that are cheaper than 29 dollars
+query = '@price:[-inf 29]'
+cheap_books = book_index.filter(filter_query=query)
+
+assert len(cheap_books) == 3
+for doc in cheap_books:
+    doc.summary()
+```
+
+## Text Search
+
+In addition to vector similarity search, the Document Index interface offers methods for text search:
+[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
+
+You can use text search directly on the field of type `str`:
+
+```python
+class NewsDoc(BaseDoc):
+    text: str
+
+
+doc_index = RedisDocumentIndex[NewsDoc]()
+index_docs = [
+    NewsDoc(id='0', text='this is a news for sport'),
+    NewsDoc(id='1', text='this is a news for finance'),
+    NewsDoc(id='2', text='this is another news for sport'),
+]
+doc_index.index(index_docs)
+query = 'finance'
+
+# search with text
+docs, scores = doc_index.text_search(query, search_field='text')
+```
+
+## Hybrid Search
+
+Document Index supports atomic operations for vector similarity search, text search and filter search.
+
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
+
+```python
+# prepare a query
+q_doc = MyDoc(embedding=np.random.rand(128), text='query')
+
+query = (
+    doc_index.build_query()  # get empty query object
+    .find(query=q_doc, search_field='embedding')  # add vector similarity search
+    .filter(filter_query='@text:*')  # add filter search
+    .build()
+)
+# execute the combined query and return the results
+results = doc_index.execute_query(query)
+print(f'{results=}')
+```
+
+In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
+to obtain a combined set of results.
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can perform filter and vector search, etc.
+To see which backend can do what, check out the [specific docs](#document-index).
+
+
+## Access Documents
+
+To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search.
+
+You can also access data by the `id` that was assigned to each document:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+doc_index.index(data)
+
+# access the Documents by id
+doc = doc_index[ids[0]]  # get by single id
+docs = doc_index[ids]  # get by list of ids
+```
+
+
+## Delete Documents
+
+In the same way you can access Documents by id, you can also delete them:
+
+```python
+# prepare some data
+data = DocList[MyDoc](
+    MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+)
+
+# remember the Document ids and index the data
+ids = data.id
+doc_index.index(data)
+
+# delete the Documents by id
+del doc_index[ids[0]]  # del by single id
+del doc_index[ids[1:]]  # del by list of ids
+```
+
+## Configuration
+
+This section lays out the configurations and options that are specific to [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex].
+
+### DBConfig
+
+The following configs can be set in `DBConfig`:
+
+| Name | Description | Default |
+|-------------------------|----------------------------------------------------|-------------------------------------------------------------------------------------|
+| `host` | The host address for the Redis server. | `localhost` |
+| `port` | The port number for the Redis server | 6379 |
+| `index_name` | The name of the index in the Redis database | None. Data will be stored in an index named after the Document type used as schema. |
+| `username` | The username for the Redis server | None |
+| `password` | The password for the Redis server | None |
+| `text_scorer` | The method for [scoring text](https://redis.io/docs/interact/search-and-query/advanced-concepts/scoring/) during text search | `BM25` |
+| `default_column_config` | The default configurations for every column type.
| dict |
+
+You can pass any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object.
+
+
+### Field-wise configurations
+
+
+`default_column_config` holds the default configuration for every column type. Since there are many column types in Redis, you can also adjust the configuration of individual columns when defining the schema:
+
+```python
+class SimpleDoc(BaseDoc):
+    tensor: NdArray[128] = Field(algorithm='FLAT', m=32, distance='COSINE')
+
+
+doc_index = RedisDocumentIndex[SimpleDoc]()
+```
+
+
+### RuntimeConfig
+
+The `RuntimeConfig` dataclass of `RedisDocumentIndex` contains a single parameter, `batch_size`, which controls the batch size of index/get/del operations.
+You can change `batch_size` in the following way:
+
+```python
+doc_index = RedisDocumentIndex[SimpleDoc]()
+doc_index.configure(RedisDocumentIndex.RuntimeConfig(batch_size=128))
+```
+
+You can pass the above as keyword arguments to the `configure()` method or pass an entire configuration object.
+
+
+
+## Nested data
+
+The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`.
+
+**Index nested data:**
+
+It is, however, also possible to represent nested Documents and store them in a Document Index.
+
+In the following example you can see a complex schema that contains nested Documents.
+The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: + +```python +from docarray.typing import ImageUrl, VideoUrl, AnyTensor + + +# define a nested schema +class ImageDoc(BaseDoc): + url: ImageUrl + tensor: AnyTensor = Field(space='cosine', dim=64) + + +class VideoDoc(BaseDoc): + url: VideoUrl + tensor: AnyTensor = Field(space='cosine', dim=128) + + +class YouTubeVideoDoc(BaseDoc): + title: str + description: str + thumbnail: ImageDoc + video: VideoDoc + tensor: AnyTensor = Field(space='cosine', dim=256) + + +# create a Document Index +doc_index = RedisDocumentIndex[YouTubeVideoDoc](index_name='tmp2') + +# create some data +index_docs = [ + YouTubeVideoDoc( + title=f'video {i+1}', + description=f'this is video from author {10*i}', + thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), + video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), + tensor=np.ones(256), + ) + for i in range(8) +] + +# index the Documents +doc_index.index(index_docs) +``` + +**Search nested data:** + +You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
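A `search_field` such as `'thumbnail__tensor'` is simply a path through nested attributes, with `__` acting as the separator. A hypothetical resolver illustrating the idea (not the library's internal code):

```python
def resolve_search_field(doc, path: str):
    """Follow a dunder-separated path, e.g. 'thumbnail__tensor' -> doc.thumbnail.tensor."""
    value = doc
    for attr in path.split('__'):
        value = getattr(value, attr)
    return value


# toy stand-ins for nested documents
class Thumbnail:
    def __init__(self, tensor):
        self.tensor = tensor


class Video:
    def __init__(self, thumbnail):
        self.thumbnail = thumbnail
```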
+
+In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields:
+
+```python
+# create a query Document
+query_doc = YouTubeVideoDoc(
+    title='video query',
+    description='this is a query video',
+    thumbnail=ImageDoc(url='http://example.ai/images/1024', tensor=np.ones(64)),
+    video=VideoDoc(url='http://example.ai/videos/1024', tensor=np.ones(128)),
+    tensor=np.ones(256),
+)
+
+# find by the `youtubevideo` tensor; root level
+docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)
+
+# find by the `thumbnail` tensor; nested level
+docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)
+
+# find by the `video` tensor; nested level
+docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
+```
+
+### Nested data with subindex
+
+Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data).
+
+If a Document contains a DocList, it can still be stored in a Document Index.
+In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...).
+
+This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList.
+
+
+**Index**
+
+In the following example you can see a complex schema that contains nested Documents with subindex.
+The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: + +```python +class ImageDoc(BaseDoc): + url: ImageUrl + tensor_image: AnyTensor = Field(space='cosine', dim=64) + + +class VideoDoc(BaseDoc): + url: VideoUrl + images: DocList[ImageDoc] + tensor_video: AnyTensor = Field(space='cosine', dim=128) + + +class MyDoc(BaseDoc): + docs: DocList[VideoDoc] + tensor: AnyTensor = Field(space='cosine', dim=256) + + +# create a Document Index +doc_index = RedisDocumentIndex[MyDoc]() + +# create some data +index_docs = [ + MyDoc( + docs=DocList[VideoDoc]( + [ + VideoDoc( + url=f'http://example.ai/videos/{i}-{j}', + images=DocList[ImageDoc]( + [ + ImageDoc( + url=f'http://example.ai/images/{i}-{j}-{k}', + tensor_image=np.ones(64), + ) + for k in range(10) + ] + ), + tensor_video=np.ones(128), + ) + for j in range(10) + ] + ), + tensor=np.ones(256), + ) + for i in range(10) +] + +# index the Documents +doc_index.index(index_docs) +``` + +**Search** + +You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. + +```python +# find by the `VideoDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(128), subindex='docs', search_field='tensor_video', limit=3 +) + +# find by the `ImageDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 +) +``` + +### Update elements +In order to update a Document inside the index, you only need to reindex it with the updated attributes. + +First lets create a schema for our Index +```python +import numpy as np +from docarray import BaseDoc, DocList +from docarray.typing import NdArray +from docarray.index import RedisDocumentIndex +class MyDoc(BaseDoc): + text: str + embedding: NdArray[128] +``` +Now we can instantiate our Index and index some data. 
+
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = RedisDocumentIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Now we can find relevant documents:
+
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+and update the text of these documents and reindex them:
+
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again, we can see that their text attribute has been updated accordingly:
+
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
+

From 82afb99a11556f2e72d68d2677560186e3a74881 Mon Sep 17 00:00:00 2001
From: jupyterjazz
Date: Mon, 17 Jul 2023 15:56:58 +0200
Subject: [PATCH 07/23] docs: qdrant v1

Signed-off-by: jupyterjazz
---
 docs/user_guide/storing/index_qdrant.md | 680 ++++++++++++++++++++----
 1 file changed, 579 insertions(+), 101 deletions(-)

diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md
index 266b5695d1e..1fb583ea7be 100644
--- a/docs/user_guide/storing/index_qdrant.md
+++ b/docs/user_guide/storing/index_qdrant.md
@@ -10,138 +10,616 @@
 The following is a starter script for using the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex],
 based on the [Qdrant](https://qdrant.tech/) vector search engine.
-For general usage of a Document Index, see the [general user guide](./docindex.md#document-index).
-!!!
tip "See all configuration options" - To see all configuration options for the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex], - you can do the following: +## Basic Usage +```python +from docarray import BaseDoc, DocList +from docarray.index import QdrantDocumentIndex +from docarray.typing import NdArray +import numpy as np - ```python - from docarray.index import QdrantDocumentIndex +# Define the document schema. +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] - # the following can be passed to the __init__() method - db_config = QdrantDocumentIndex.DBConfig() - print(db_config) # shows default values +# Create dummy documents. +docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) - # the following can be passed to the configure() method - runtime_config = QdrantDocumentIndex.RuntimeConfig() - print(runtime_config) # shows default values - ``` - - Note that the collection_name from the DBConfig is an Optional[str] with None as default value. This is because - the QdrantDocumentIndex will take the name the Document type that you use as schema. For example, for QdrantDocumentIndex[MyDoc](...) - the data will be stored in a collection name MyDoc if no specific collection_name is passed in the DBConfig. +# Initialize a new QdrantDocumentIndex instance and add the documents to the index. +doc_index = QdrantDocumentIndex[MyDoc](host='localhost') +doc_index.index(docs) -```python -import numpy as np +# Perform a vector search. 
+query = np.ones(128) +retrieved_docs = doc_index.find(query, limit=10) +``` -from typing import Optional +## Initialize -from docarray import BaseDoc -from docarray.index import QdrantDocumentIndex -from docarray.typing import NdArray +You can initialize [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] in three different ways: -from qdrant_client.http import models +**Connecting to a local Qdrant instance running as a Docker container** -class MyDocument(BaseDoc): - title: str - title_embedding: NdArray[786] - image_path: Optional[str] - image_embedding: NdArray[512] +You can use docker-compose to create a local Qdrant service with the following `docker-compose.yml`. +```yaml +version: '3.8' -# Creating an in-memory Qdrant document index -qdrant_config = QdrantDocumentIndex.DBConfig(location=":memory:") -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) +services: + qdrant: + image: qdrant/qdrant:v1.1.2 + ports: + - "6333:6333" + - "6334:6334" + ulimits: # Only required for tests, as there are a lot of collections created + nofile: + soft: 65535 + hard: 65535 +``` + +Run the following command in the folder of the above `docker-compose.yml` to start the service: + +```bash +docker-compose up +``` -# Connecting to a local Qdrant instance running as a Docker container +Next, you can create a [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] instance using: + +```python qdrant_config = QdrantDocumentIndex.DBConfig("http://localhost:6333") -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) +doc_index = QdrantDocumentIndex[MyDoc](qdrant_config) + +# or just +doc_index = QdrantDocumentIndex[MyDoc](host='localhost') +``` -# Connecting to Qdrant Cloud service + +**Creating an in-memory Qdrant document index** +```python +qdrant_config = QdrantDocumentIndex.DBConfig(location=":memory:") +doc_index = QdrantDocumentIndex[MyDoc](qdrant_config) +``` + +**Connecting to Qdrant Cloud service** +```python 
qdrant_config = QdrantDocumentIndex.DBConfig( "https://YOUR-CLUSTER-URL.aws.cloud.qdrant.io", api_key="", ) -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) +doc_index = QdrantDocumentIndex[MyDoc](qdrant_config) +``` -# Indexing the documents -doc_index.index( +### Schema definition +In this code snippet, `QdrantDocumentIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. + +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). + +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + +### Using a predefined Document as schema + +DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
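To see why `NdArray[128]` works where a bare `.embedding` field doesn't, note that the parametrized type itself carries the dimensionality. A toy illustration of that mechanism (not DocArray's real implementation):

```python
class Vector:
    """Toy parametrizable tensor type: Vector[128] remembers its dimensionality."""

    dim = None  # unparametrized: no dimensionality information

    def __class_getitem__(cls, dim: int):
        # Parametrizing bakes the dimensionality into a new subclass,
        # so an index could read it straight from the schema's type hints.
        return type(f'Vector[{dim}]', (cls,), {'dim': dim})


class GoodSchema:
    embedding: Vector[128]  # dimensionality is recoverable from the annotation


class BadSchema:
    embedding: Vector  # no dimensionality -> an index can't size its ANN structure
```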
+ +You can work around this problem by subclassing the predefined Document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import QdrantDocumentIndex + + + class MyDoc(TextDoc): + embedding: NdArray[128] + + + doc_index = QdrantDocumentIndex[MyDoc](host='localhost') + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import QdrantDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128) + + + doc_index = QdrantDocumentIndex[MyDoc](host='localhost') + ``` + +Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the +predefined Document type, or your custom Document type. + +The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( [ - MyDocument( - title=f"My document {i}", - title_embedding=np.random.random(786), - image_path=None, - image_embedding=np.random.random(512), - ) - for i in range(100) + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), ] ) -# Performing a vector search only -results = doc_index.find( - query=np.random.random(512), - search_field="image_embedding", - limit=3, -) +# you can index this into Document Index of type MyDoc +doc_index.index(data) +``` -# Connecting to a local Qdrant instance with Scalar Quantization enabled, -# and using non-default collection name to store the datapoints -qdrant_config = QdrantDocumentIndex.DBConfig( - "http://localhost:6333", - collection_name="another_collection", - quantization_config=models.ScalarQuantization( - scalar=models.ScalarQuantizationConfig( - type=models.ScalarType.INT8, - quantile=0.99, - always_ram=True, - ), - ), -) -doc_index = QdrantDocumentIndex[MyDocument](qdrant_config) -# Indexing the documents -doc_index.index( - [ - MyDocument( - title=f"My document {i}", - title_embedding=np.random.random(786), - image_path=None, - image_embedding=np.random.random(512), - ) - for i in range(100) - ] +## Index + +Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'text {i}') for i in range(100)] ) -# Text lookup, without vector search. 
Using the Qdrant filtering mechanisms: -# https://qdrant.tech/documentation/filtering/ -results = doc_index.filter( - filter_query=models.Filter( +# index the data +db.index(docs) +``` + +That call to [index()][docarray.index.backends.qdrant.QdrantDocumentIndex.index] stores all Documents in `docs` into the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `QdrantDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. +This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store +need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. + + +## Vector Search + +Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. 
+
+By passing a document of type `MyDoc` to [find()][docarray.index.abstract.BaseDocIndex.find], you can find
+similar Documents in the Document Index:
+
+=== "Search by Document"
+
+    ```python
+    # create a query Document
+    query = MyDoc(embedding=np.random.rand(128), text='query')
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+To successfully perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.

## Filter

You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
The query should follow the [query language of Qdrant](https://qdrant.tech/documentation/concepts/filtering/).

In the following example, let's filter for all the books that are cheaper than 29 dollars:

```python
from docarray import BaseDoc, DocList
from qdrant_client.http import models as rest


class Book(BaseDoc):
    title: str
    price: int


books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)])
book_index = QdrantDocumentIndex[Book](books)

# filter for books that are cheaper than 29 dollars
query = rest.Filter(
    must=[rest.FieldCondition(key='price', range=rest.Range(lt=29))]
)
cheap_books = book_index.filter(filter_query=query)

assert len(cheap_books) == 3
for doc in cheap_books:
    doc.summary()
```

## Text Search

In addition to vector similarity search, the Document Index interface offers methods for text search:
[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].

You can use text search directly on a field of type `str`:

```python
class NewsDoc(BaseDoc):
    text: str


doc_index = QdrantDocumentIndex[NewsDoc](host='localhost')
index_docs = [
    NewsDoc(id='0', text='this is a news for sport'),
    NewsDoc(id='1', text='this is a news for finance'),
    NewsDoc(id='2', text='this is another news for sport'),
]
doc_index.index(index_docs)
query = 'finance'

# search with text
docs, scores = doc_index.text_search(query, search_field='text')
```


## Hybrid Search

Document Index supports atomic operations for vector similarity search, text search and filter search.

To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
through [build_query()][docarray.index.abstract.BaseDocIndex.build_query].

For example, you can build a hybrid search query that performs range filtering, vector search and text search:

```python
class MyDoc(BaseDoc):
    tens: NdArray[10]
    num: int
    text: str


doc_index = QdrantDocumentIndex[MyDoc](host='localhost')
index_docs = [
    MyDoc(id=f'{i}', tens=np.ones(10) * i, num=int(i / 2), text=f'Lorem ipsum {int(i/2)}')
    for i in range(10)
]
doc_index.index(index_docs)

find_query = np.ones(10)
text_search_query = 'ipsum 1'
filter_query = rest.Filter(
    must=[
        rest.FieldCondition(
            key='num',
            range=rest.Range(
                gte=1,
                lt=5,
            ),
        )
    ]
)

query = (
    index.build_query()
    .find(find_query, search_field='embedding')
    .text_search(text_search_query, search_field='text')
    .filter(filter_query)
    .build(limit=5)
)

docs = doc_index.execute_query(query)
```


## Access documents

To access a document, you need to specify its `id`. You can also pass a list of `id`s to access multiple documents.

```python
# access a single Doc
doc_index[index_docs[1].id]

# access multiple Docs
doc_index[index_docs[2].id, index_docs[3].id]
```

## Delete documents

To delete documents, use the built-in `del` statement with the `id` of the Documents that you want to delete.
You can also pass a list of `id`s to delete multiple documents.

```python
# delete a single Doc
del doc_index[index_docs[1].id]

# delete multiple Docs
del doc_index[index_docs[2].id, index_docs[3].id]
```


## Configuration

!!! tip "See all configuration options"
    To see all configuration options for the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex], you can do the following:

```python
from docarray.index import QdrantDocumentIndex

# the following can be passed to the __init__() method
db_config = QdrantDocumentIndex.DBConfig()
print(db_config)  # shows default values

# the following can be passed to the configure() method
runtime_config = QdrantDocumentIndex.RuntimeConfig()
print(runtime_config)  # shows default values
```

Note that `collection_name` in the `DBConfig` is an `Optional[str]` that defaults to `None`. This is because
`QdrantDocumentIndex` will take the name of the Document type that you use as schema. For example, for `QdrantDocumentIndex[MyDoc](...)`
the data will be stored in a collection named `MyDoc` if no specific `collection_name` is passed in the `DBConfig`.


## Nested data

The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`.

**Index nested data:**

It is, however, also possible to represent nested Documents and store them in a Document Index.

In the following example you can see a complex schema that contains nested Documents.
The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields:

```python
from pydantic import Field

from docarray.typing import ImageUrl, VideoUrl, AnyTensor


# define a nested schema
class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    tensor: AnyTensor = Field(space='cosine', dim=128)


class YouTubeVideoDoc(BaseDoc):
    title: str
    description: str
    thumbnail: ImageDoc
    video: VideoDoc
    tensor: AnyTensor = Field(space='cosine', dim=256)


# create a Document Index
doc_index = QdrantDocumentIndex[YouTubeVideoDoc](index_name='tmp2')

# create some data
index_docs = [
    YouTubeVideoDoc(
        title=f'video {i+1}',
        description=f'this is a video from author {10*i}',
        thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)),
        video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)),
        tensor=np.ones(256),
    )
    for i in range(8)
]

# index the Documents
doc_index.index(index_docs)
```

**Search nested data:**

You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data.
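Conceptually, a dunder-separated search field such as `thumbnail__tensor` is just a path of attribute lookups through the nested schema. The sketch below illustrates that resolution; the classes and the `resolve()` helper are invented for illustration and are not DocArray's internal implementation.

```python
from dataclasses import dataclass


@dataclass
class Thumbnail:
    tensor: list


@dataclass
class Video:
    title: str
    thumbnail: Thumbnail


def resolve(doc, path):
    """Walk a dunder-separated attribute path, e.g. 'thumbnail__tensor'."""
    for part in path.split('__'):
        doc = getattr(doc, part)
    return doc


doc = Video(title='demo', thumbnail=Thumbnail(tensor=[1.0, 2.0]))
print(resolve(doc, 'thumbnail__tensor'))  # [1.0, 2.0]
```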

In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields:

```python
# create a query Document
query_doc = YouTubeVideoDoc(
    title='video query',
    description='this is a query video',
    thumbnail=ImageDoc(url='http://example.ai/images/1024', tensor=np.ones(64)),
    video=VideoDoc(url='http://example.ai/videos/1024', tensor=np.ones(128)),
    tensor=np.ones(256),
)

# find by the `youtubevideo` tensor; root level
docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)

# find by the `thumbnail` tensor; nested level
docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)

# find by the `video` tensor; nested level
docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
```

### Nested data with subindex

Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data).

If a Document contains a DocList, it can still be stored in a Document Index.
In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...).

This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList.


**Index**

In the following example you can see a complex schema that contains nested Documents with a subindex.
The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields:

```python
class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)


class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)


# create a Document Index
doc_index = QdrantDocumentIndex[MyDoc]()

# create some data
index_docs = [
    MyDoc(
        docs=DocList[VideoDoc](
            [
                VideoDoc(
                    url=f'http://example.ai/videos/{i}-{j}',
                    images=DocList[ImageDoc](
                        [
                            ImageDoc(
                                url=f'http://example.ai/images/{i}-{j}-{k}',
                                tensor_image=np.ones(64),
                            )
                            for k in range(10)
                        ]
                    ),
                    tensor_video=np.ones(128),
                )
                for j in range(10)
            ]
        ),
        tensor=np.ones(256),
    )
    for i in range(10)
]

# index the Documents
doc_index.index(index_docs)
```

**Search**

You can perform search on any subindex level by using the `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on.

```python
# find by the `VideoDoc` tensor
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(128), subindex='docs', search_field='tensor_video', limit=3
)

# find by the `ImageDoc` tensor
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)
```

### Update elements

In order to update a Document inside the index, you only need to reindex it with the updated attributes.

First, let's create a schema for our Index:

```python
import numpy as np
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from docarray.index import QdrantDocumentIndex


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]
```

Now we can instantiate our Index and index some data.

```python
docs = DocList[MyDoc](
    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
)
index = QdrantDocumentIndex[MyDoc]()
index.index(docs)
assert index.num_docs() == 100
```

Now we can find relevant documents:

```python
res = index.find(query=docs[0], search_field='tens', limit=100)
assert len(res.documents) == 100
for doc in res.documents:
    assert 'I am the first version' in doc.text
```

and update the text of all these documents and reindex them:

```python
for i, doc in enumerate(docs):
    doc.text = f'I am the second version of Document {i}'

index.index(docs)
assert index.num_docs() == 100
```

When we retrieve them again, we can see that their text attribute has been updated accordingly:

```python
res = index.find(query=docs[0], search_field='tens', limit=100)
assert len(res.documents) == 100
for doc in res.documents:
    assert 'I am the second version' in doc.text
```

From befc786c43f5b61a9cb3edf835b5b7c27eb59306 Mon Sep 17 00:00:00 2001
From: jupyterjazz
Date: Mon, 17 Jul 2023 17:42:45 +0200
Subject: [PATCH 08/23] docs: validate intro inmemory and hnsw examples

Signed-off-by: jupyterjazz
---
 docs/user_guide/storing/docindex.md        | 21 +++---
 docs/user_guide/storing/index_hnswlib.md   | 16 ++--
 docs/user_guide/storing/index_in_memory.md | 88 +++++-----------------
 3 files changed, 36 insertions(+), 89 deletions(-)

diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md
index 8159ac2ae59..7118f7af841 100644
--- a/docs/user_guide/storing/docindex.md
+++ b/docs/user_guide/storing/docindex.md
@@ -38,12 +38,12 @@ Currently, DocArray supports the following vector databases:

- [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md)
- [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md)
- [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md)
-
InMemoryExactNNSearch | [Docs](index_in_memory.md) +- InMemoryExactNNIndex | [Docs](index_in_memory.md) ## Basic Usage -For this user guide you will use the [InMemoryExactNNSearch][docarray.index.backends.in_memory.InMemoryExactNNSearch] +For this user guide you will use the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] because it doesn't require you to launch a database server. Instead, it will store your data locally. !!! note "Using a different vector database" @@ -52,14 +52,13 @@ because it doesn't require you to launch a database server. Instead, it will sto !!! note "InMemory-specific settings" The following sections explain the general concept of Document Index by using - [InMemoryExactNNSearch][docarray.index.backends.in_memory.InMemoryExactNNSearch] as an example. - For InMemory-specific settings, check out the [InMemoryExactNNSearch][docarray.index.backends.in_memory.InMemoryExactNNSearch] documentation + `InMemoryExactNNIndex` as an example. + For InMemory-specific settings, check out the `InMemoryExactNNIndex` documentation [here](index_in_memory.md). - ```python from docarray import BaseDoc, DocList -from docarray.index import HnswDocumentIndex +from docarray.index import InMemoryExactNNIndex from docarray.typing import NdArray import numpy as np @@ -72,13 +71,13 @@ class MyDoc(BaseDoc): # Create documents (using dummy/random vectors) docs = DocList[MyDoc](MyDoc(title=f'title #{i}', price=i, embedding=np.random.rand(128)) for i in range(10)) -# Initialize a new HnswDocumentIndex instance and add the documents to the index. -doc_index = HnswDocumentIndex[MyDoc](workdir='./my_index') +# Initialize a new InMemoryExactNNIndex instance and add the documents to the index. +doc_index = InMemoryExactNNIndex[MyDoc]() doc_index.index(docs) # Perform a vector search. 
query = np.ones(128) -retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) +retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) # Perform filtering (price < 5) query = {'price': {'$lt': 5}} @@ -87,9 +86,9 @@ filtered_docs = doc_index.filter(query, limit=10) # Perform a hybrid search - combining vector search with filtering query = ( doc_index.build_query() # get empty query object - .find(np.ones(128), search_field='embedding') # add vector similarity search + .find(query=np.ones(128), search_field='embedding') # add vector similarity search .filter(filter_query={'price': {'$gte': 2}}) # add filter search .build() # build the query ) -results = doc_index.execute_query(query) +retrieved_docs, scores = doc_index.execute_query(query) ``` diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index 59d587c5486..862232a98ed 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -38,7 +38,7 @@ class MyDoc(BaseDoc): docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) # Initialize a new HnswDocumentIndex instance and add the documents to the index. -doc_index = HnswDocumentIndex[MyDoc](workdir='./my_index') +doc_index = HnswDocumentIndex[MyDoc](work_dir='./my_index') doc_index.index(docs) # Perform a vector search. 
@@ -326,8 +326,8 @@ query = ( ) # execute the combined query and return the results -results = db.execute_query(query) -print(f'{results=}') +retrieved_docs, scores = db.execute_query(query) +print(f'{retrieved_docs=}') ``` In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search @@ -534,7 +534,7 @@ class YouTubeVideoDoc(BaseDoc): # create a Document Index -doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='/tmp2') +doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2') # create some data index_docs = [ @@ -611,7 +611,7 @@ class MyDoc(BaseDoc): # create a Document Index -doc_index = HnswDocumentIndex[MyDoc](work_dir='/tmp3') +doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp3') # create some data index_docs = [ @@ -676,7 +676,7 @@ Now we can instantiate our Index and index some data. ```python docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)] + [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] ) index = HnswDocumentIndex[MyDoc]() index.index(docs) @@ -686,7 +686,7 @@ assert index.num_docs() == 100 Now we can find relevant documents ```python -res = index.find(query=docs[0], search_field='tens', limit=100) +res = index.find(query=docs[0], search_field='embedding', limit=100) assert len(res.documents) == 100 for doc in res.documents: assert 'I am the first version' in doc.text @@ -705,7 +705,7 @@ assert index.num_docs() == 100 When we retrieve them again we can see that their text attribute has been updated accordingly ```python -res = index.find(query=docs[0], search_field='tens', limit=100) +res = index.find(query=docs[0], search_field='embedding', limit=100) assert len(res.documents) == 100 for doc in res.documents: assert 'I am the second version' in doc.text diff --git a/docs/user_guide/storing/index_in_memory.md 
b/docs/user_guide/storing/index_in_memory.md index 98bdda9fc50..84bc80441c4 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -39,7 +39,7 @@ doc_index.index(docs) # Perform a vector search. query = np.ones(128) -retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) +retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) ``` ## Initialize @@ -99,7 +99,7 @@ You can work around this problem by subclassing the predefined Document and addi embedding: NdArray[128] - db = InMemoryExactNNIndex[MyDoc](work_dir='test_db') + db = InMemoryExactNNIndex[MyDoc]() ``` === "Using Field()" @@ -114,7 +114,7 @@ You can work around this problem by subclassing the predefined Document and addi embedding: AnyTensor = Field(dim=128) - db = InMemoryExactNNIndex[MyDoc](work_dir='test_db3') + db = InMemoryExactNNIndex[MyDoc]() ``` Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the @@ -126,11 +126,11 @@ The [next section](#index) goes into more detail about data indexing, but note t from docarray import DocList # data of type TextDoc -data = DocList[TextDoc]( +data = DocList[MyDoc]( [ - TextDoc(text='hello world', embedding=np.random.rand(128)), - TextDoc(text='hello world', embedding=np.random.rand(128)), - TextDoc(text='hello world', embedding=np.random.rand(128)), + MyDoc(text='hello world', embedding=np.random.rand(128)), + MyDoc(text='hello world', embedding=np.random.rand(128)), + MyDoc(text='hello world', embedding=np.random.rand(128)), ] ) @@ -338,8 +338,8 @@ query = ( ) # execute the combined query and return the results -results = db.execute_query(query) -print(f'{results=}') +retrieved_docs, scores = db.execute_query(query) +print(f'{retrieved_docs=}') ``` In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search @@ -403,7 +403,7 @@ If you want to set 
configurations globally, i.e. for all vector fields in your D ```python from collections import defaultdict -from docarray.typing import AbstractTensor +from docarray.typing.tensor.abstract_tensor import AbstractTensor new_doc_index = InMemoryExactNNIndex[MyDoc]( default_column_config=defaultdict( dict, @@ -461,12 +461,12 @@ from docarray.typing import ImageUrl, VideoUrl, AnyTensor # define a nested schema class ImageDoc(BaseDoc): url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) + tensor: AnyTensor = Field(space='cosine_sim', dim=64) class VideoDoc(BaseDoc): url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) + tensor: AnyTensor = Field(space='cosine_sim', dim=128) class YouTubeVideoDoc(BaseDoc): @@ -474,11 +474,11 @@ class YouTubeVideoDoc(BaseDoc): description: str thumbnail: ImageDoc video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) + tensor: AnyTensor = Field(space='cosine_sim', dim=256) # create a Document Index -doc_index = InMemoryExactNNIndex[YouTubeVideoDoc](work_dir='/tmp2') +doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]() # create some data index_docs = [ @@ -540,18 +540,18 @@ The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `I ```python class ImageDoc(BaseDoc): url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine', dim=64) + tensor_image: AnyTensor = Field(space='cosine_sim', dim=64) class VideoDoc(BaseDoc): url: VideoUrl images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine', dim=128) + tensor_video: AnyTensor = Field(space='cosine_sim', dim=128) class MyDoc(BaseDoc): docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine', dim=256) + tensor: AnyTensor = Field(space='cosine_sim', dim=256) # create a Document Index @@ -601,56 +601,4 @@ root_docs, sub_docs, scores = doc_index.find_subindex( root_docs, sub_docs, scores = doc_index.find_subindex( np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 ) -``` - -### 
Update elements -In order to update a Document inside the index, you only need to reindex it with the updated attributes. - -First lets create a schema for our Index -```python -import numpy as np -from docarray import BaseDoc, DocList -from docarray.typing import NdArray -from docarray.index import InMemoryExactNNIndex -class MyDoc(BaseDoc): - text: str - embedding: NdArray[128] -``` -Now we can instantiate our Index and index some data. - -```python -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)] -) -index = InMemoryExactNNIndex[MyDoc]() -index.index(docs) -assert index.num_docs() == 100 -``` - -Now we can find relevant documents - -```python -res = index.find(query=docs[0], search_field='tens', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the first version' in doc.text -``` - -and update all of the text of this documents and reindex them - -```python -for i, doc in enumerate(docs): - doc.text = f'I am the second version of Document {i}' - -index.index(docs) -assert index.num_docs() == 100 -``` - -When we retrieve them again we can see that their text attribute has been updated accordingly - -```python -res = index.find(query=docs[0], search_field='tens', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the second version' in doc.text -``` +``` \ No newline at end of file From 9bdb0dc08137e5e0999d5e552c3f8d26575a465d Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 17 Jul 2023 18:23:41 +0200 Subject: [PATCH 09/23] docs: validate elastic and qdrant examples Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_elastic.md | 16 ++++++------ docs/user_guide/storing/index_qdrant.md | 33 ++++++++++++------------ 2 files changed, 25 insertions(+), 24 deletions(-) diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index a8182295c3d..00913e6947d 100644 
--- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -394,10 +394,10 @@ To access the `Doc`, you need to specify the `id`. You can also pass a list of ` ```python # access a single Doc -doc_index[index_docs[16].id] +doc_index[index_docs[1].id] # access multiple Docs -doc_index[index_docs[16].id, index_docs[17].id] +doc_index[index_docs[2].id, index_docs[3].id] ``` ## Delete documents @@ -407,10 +407,10 @@ You can also pass a list of `id`s to delete multiple documents. ```python # delete a single Doc -del doc_index[index_docs[16].id] +del doc_index[index_docs[1].id] # delete multiple Docs -del doc_index[index_docs[17].id, index_docs[18].id] +del doc_index[index_docs[2].id, index_docs[3].id] ``` @@ -439,7 +439,7 @@ class SimpleDoc(BaseDoc): tensor: NdArray[128] = Field(similarity='l2_norm', m=32, num_candidates=5000) -doc_index = ElasticDocIndex[SimpleDoc]() +doc_index = ElasticDocIndex[SimpleDoc](index_name='my_index_1') ``` ### RuntimeConfig @@ -447,7 +447,7 @@ doc_index = ElasticDocIndex[SimpleDoc]() The `RuntimeConfig` dataclass of `ElasticDocIndex` consists of `chunk_size`. You can change `chunk_size` for batch operations: ```python -doc_index = ElasticDocIndex[SimpleDoc]() +doc_index = ElasticDocIndex[SimpleDoc](index_name='my_index_2') doc_index.configure(ElasticDocIndex.RuntimeConfig(chunk_size=1000)) ``` @@ -461,12 +461,12 @@ You can hook into a database index that was persisted during a previous session. 
To do so, you need to specify `index_name` and the `hosts`: ```python -doc_index = ElasticDocIndex[SimpleDoc]( +doc_index = ElasticDocIndex[MyDoc]( hosts='http://localhost:9200', index_name='previously_stored' ) doc_index.index(index_docs) -doc_index2 = ElasticDocIndex[SimpleDoc]( +doc_index2 = ElasticDocIndex[MyDoc]( hosts='http://localhost:9200', index_name='previously_stored' ) diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index 1fb583ea7be..3851388eeed 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -32,7 +32,7 @@ doc_index.index(docs) # Perform a vector search. query = np.ones(128) -retrieved_docs = doc_index.find(query, limit=10) +retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) ``` ## Initialize @@ -68,7 +68,7 @@ docker-compose up Next, you can create a [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] instance using: ```python -qdrant_config = QdrantDocumentIndex.DBConfig("http://localhost:6333") +qdrant_config = QdrantDocumentIndex.DBConfig('localhost') doc_index = QdrantDocumentIndex[MyDoc](qdrant_config) # or just @@ -182,7 +182,7 @@ docs = DocList[MyDoc]( ) # index the data -db.index(docs) +doc_index.index(docs) ``` That call to [index()][docarray.index.backends.qdrant.QdrantDocumentIndex.index] stores all Documents in `docs` into the Document Index, @@ -220,7 +220,7 @@ similar Documents in the Document Index: query = MyDoc(embedding=np.random.rand(128), text='query') # find similar Documents - matches, scores = db.find(query, search_field='embedding', limit=5) + matches, scores = doc_index.find(query, search_field='embedding', limit=5) print(f'{matches=}') print(f'{matches.text=}') @@ -234,7 +234,7 @@ similar Documents in the Document Index: query = np.random.rand(128) # find similar Documents - matches, scores = db.find(query, search_field='embedding', limit=5) + matches, scores = 
doc_index.find(query, search_field='embedding', limit=5) print(f'{matches=}') print(f'{matches.text=}') @@ -275,7 +275,8 @@ class Book(BaseDoc): books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)]) -book_index = QdrantDocumentIndex[Book](books) +book_index = QdrantDocumentIndex[Book]() +book_index.index(books) # filter for books that are cheaper than 29 dollars query = rest.Filter( @@ -325,15 +326,15 @@ through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]: For example, you can build a hybrid serach query that performs range filtering, vector search and text search: ```python -class MyDoc(BaseDoc): +class SimpleDoc(BaseDoc): tens: NdArray[10] num: int text: str -doc_index = QdrantDocumentIndex[MyDoc](host='localhost') +doc_index = QdrantDocumentIndex[SimpleDoc](host='localhost') index_docs = [ - MyDoc(id=f'{i}', tens=np.ones(10) * i, num=int(i / 2), text=f'Lorem ipsum {int(i/2)}') + SimpleDoc(id=f'{i}', tens=np.ones(10) * i, num=int(i / 2), text=f'Lorem ipsum {int(i/2)}') for i in range(10) ] doc_index.index(index_docs) @@ -353,8 +354,8 @@ filter_query = rest.Filter( ) query = ( - index.build_query() - .find(find_query, search_field='embedding') + doc_index.build_query() + .find(find_query, search_field='tens') .text_search(text_search_query, search_field='text') .filter(filter_query) .build(limit=5) @@ -517,17 +518,17 @@ class VideoDoc(BaseDoc): tensor_video: AnyTensor = Field(space='cosine', dim=128) -class MyDoc(BaseDoc): +class MediaDoc(BaseDoc): docs: DocList[VideoDoc] tensor: AnyTensor = Field(space='cosine', dim=256) # create a Document Index -doc_index = QdrantDocumentIndex[MyDoc]() +doc_index = QdrantDocumentIndex[MediaDoc](index_name='tmp3') # create some data index_docs = [ - MyDoc( + MediaDoc( docs=DocList[VideoDoc]( [ VideoDoc( @@ -598,7 +599,7 @@ assert index.num_docs() == 100 Now we can find relevant documents ```python -res = index.find(query=docs[0], search_field='tens', limit=100) +res = 
index.find(query=docs[0], search_field='embedding', limit=100) assert len(res.documents) == 100 for doc in res.documents: assert 'I am the first version' in doc.text @@ -617,7 +618,7 @@ assert index.num_docs() == 100 When we retrieve them again we can see that their text attribute has been updated accordingly ```python -res = index.find(query=docs[0], search_field='tens', limit=100) +res = index.find(query=docs[0], search_field='embedding', limit=100) assert len(res.documents) == 100 for doc in res.documents: assert 'I am the second version' in doc.text From 64f83bf73cf12dc7bbedc168725f5797ef0f82f5 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Tue, 18 Jul 2023 15:08:02 +0200 Subject: [PATCH 10/23] docs: validate code examples for redis and weaviate Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_redis.md | 32 +++++++---- docs/user_guide/storing/index_weaviate.md | 70 +++-------------------- 2 files changed, 29 insertions(+), 73 deletions(-) diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md index aac8d0bd1e0..3409556de30 100644 --- a/docs/user_guide/storing/index_redis.md +++ b/docs/user_guide/storing/index_redis.md @@ -16,7 +16,6 @@ focusing on special features and configurations of Redis. from docarray import BaseDoc, DocList from docarray.index import RedisDocumentIndex from docarray.typing import NdArray -from pydantic import Field import numpy as np # Define the document schema. @@ -33,7 +32,7 @@ doc_index.index(docs) # Perform a vector search. 
query = np.ones(128) -retrieved_docs = doc_index.find(query, limit=10) +retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) ``` @@ -247,7 +246,8 @@ class Book(BaseDoc): books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)]) -book_index = RedisDocumentIndex[Book](books) +book_index = RedisDocumentIndex[Book]() +book_index.index(books) # filter for books that are cheaper than 29 dollars query = '@price:[-inf 29]' @@ -292,17 +292,25 @@ To combine these operations into a single, hybrid search query, you can use the through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]: ```python -# prepare a query -q_doc = MyDoc(embedding=np.random.rand(128), text='query') +# Define the document schema. +class SimpleSchema(BaseDoc): + price: int + embedding: NdArray[128] + +# Create dummy documents. +docs = DocList[SimpleSchema](SimpleSchema(price=i, embedding=np.random.rand(128)) for i in range(10)) + +doc_index = RedisDocumentIndex[SimpleSchema](host='localhost') +doc_index.index(docs) query = ( doc_index.build_query() # get empty query object - .find(query=q_doc, search_field='embedding') # add vector similarity search - .filter(filter_query='@text:*') # add filter search + .find(query=np.random.rand(128), search_field='embedding') # add vector similarity search + .filter(filter_query='@price:[-inf 3]') # add filter search .build() ) # execute the combined query and return the results -results = db.execute_query(query) +results = doc_index.execute_query(query) print(f'{results=}') ``` @@ -383,7 +391,7 @@ You can pass any of the above as keyword arguments to the `__init__()` method or ```python class SimpleDoc(BaseDoc): - tensor: NdArray[128] = Field(algorithm='FLAT', m=32, distance='COSINE') + tensor: NdArray[128] = Field(algorithm='FLAT', distance='COSINE') doc_index = RedisDocumentIndex[SimpleDoc]() @@ -581,7 +589,7 @@ Now we can instantiate our Index and index some data. 
```python docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)] + [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] ) index = RedisDocumentIndex[MyDoc]() index.index(docs) @@ -591,7 +599,7 @@ assert index.num_docs() == 100 Now we can find relevant documents ```python -res = index.find(query=docs[0], search_field='tens', limit=100) +res = index.find(query=docs[0], search_field='embedding', limit=100) assert len(res.documents) == 100 for doc in res.documents: assert 'I am the first version' in doc.text @@ -610,7 +618,7 @@ assert index.num_docs() == 100 When we retrieve them again we can see that their text attribute has been updated accordingly ```python -res = index.find(query=docs[0], search_field='tens', limit=100) +res = index.find(query=docs[0], search_field='embedding', limit=100) assert len(res.documents) == 100 for doc in res.documents: assert 'I am the second version' in doc.text diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index 75884277d5b..e377e501968 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -230,7 +230,7 @@ batch_config = { runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config) -store = WeaviateDocumentIndex[Document](db_config=dbconfig) +store = WeaviateDocumentIndex[Document]() store.configure(runtimeconfig) # Batch settings being passed on store.index(docs) ``` @@ -326,16 +326,16 @@ You can also access data by the `id` that was assigned to each document: ```python # prepare some data data = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) + MyDoc(embedding=np.random.rand(128), title=f'query {i}') for i in range(3) ) # remember the Document ids and index the data ids = data.id -store.index(data) +doc_index.index(data) # access the 
Documents by id -doc = store[ids[0]] # get by single id -docs = store[ids] # get by list of ids +doc = doc_index[ids[0]] # get by single id +docs = doc_index[ids] # get by list of ids ``` @@ -346,16 +346,16 @@ In the same way you can access Documents by id, you can also delete them: ```python # prepare some data data = DocList[MyDoc]( - MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) + MyDoc(embedding=np.random.rand(128), title=f'query {i}') for i in range(3) ) # remember the Document ids and index the data ids = data.id -store.index(data) +doc_index.index(data) # access the Documents by id -del store[ids[0]] # del by single id -del store[ids[1:]] # del by list of ids +del doc_index[ids[0]] # del by single id +del doc_index[ids[1:]] # del by list of ids ``` ## Configuration @@ -605,55 +605,3 @@ root_docs, sub_docs, scores = doc_index.find_subindex( np.ones(64), subindex='docs__images', limit=3 ) ``` - -### Update elements -In order to update a Document inside the index, you only need to reindex it with the updated attributes. - -First lets create a schema for our Index -```python -import numpy as np -from docarray import BaseDoc, DocList -from docarray.typing import NdArray -from docarray.index import WeaviateDocumentIndex -class MyDoc(BaseDoc): - text: str - embedding: NdArray[128] = Field(is_embedding=True) -``` -Now we can instantiate our Index and index some data. 
- -```python -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)] -) -index = WeaviateDocumentIndex[MyDoc]() -index.index(docs) -assert index.num_docs() == 100 -``` - -Now we can find relevant documents - -```python -res = index.find(query=docs[0], limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the first version' in doc.text -``` - -and update all of the text of this documents and reindex them - -```python -for i, doc in enumerate(docs): - doc.text = f'I am the second version of Document {i}' - -index.index(docs) -assert index.num_docs() == 100 -``` - -When we retrieve them again we can see that their text attribute has been updated accordingly - -```python -res = index.find(query=docs[0], limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the second version' in doc.text -``` From ca25feb6f06f4135af43880f48f9f17c2897eacd Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Wed, 19 Jul 2023 16:49:53 +0200 Subject: [PATCH 11/23] docs: milvus v1 Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_milvus.md | 567 +++++++++++++++++++++++- 1 file changed, 565 insertions(+), 2 deletions(-) diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index effba76e3b9..dcb488e8fd7 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -8,7 +8,7 @@ ``` This is the user guide for the [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex], -focusing on special features and configurations of Redis. +focusing on special features and configurations of Milvus. ## Basic Usage @@ -28,7 +28,7 @@ class MyDoc(BaseDoc): docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) # Initialize a new MilvusDocumentIndex instance and add the documents to the index. 
-doc_index = MilvusDocumentIndex[MyDoc](host='localhost') +doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_1') doc_index.index(docs) # Perform a vector search. @@ -37,3 +37,566 @@ retrieved_docs = doc_index.find(query, limit=10) ``` +## Initialize + +First of all, you need to install and run Milvus. Download `docker-compose.yml` with the following command: + +```shell +wget https://github.com/milvus-io/milvus/releases/download/v2.2.11/milvus-standalone-docker-compose.yml -O docker-compose.yml +``` + +And start Milvus by running: +```shell +sudo docker-compose up -d +``` + +Learn more on [Milvus documentation](https://milvus.io/docs/install_standalone-docker.md). + +Next, you can create a [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] instance using: + +```python +from docarray import BaseDoc +from docarray.index import MilvusDocumentIndex +from docarray.typing import NdArray +from pydantic import Field + + +class MyDoc(BaseDoc): + embedding: NdArray[128] = Field(is_embedding=True) + text: str + + +doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_2') +``` + +### Schema definition +In this code snippet, `MilvusDocumentIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. + +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). + +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. 
No need to convert your tensors to NumPy arrays manually! + +### Using a predefined Document as schema + +DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! + +You can work around this problem by subclassing the predefined Document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import MilvusDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: NdArray[128] = Field(is_embedding=True) + + + doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_3') + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import MilvusDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128, is_embedding=True) + + + doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_4') + ``` + + +## Index + +Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: + +```python +import numpy as np +from docarray import DocList + +class MyDoc(BaseDoc): + title: str + embedding: NdArray[128] = Field(is_embedding=True) + +doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp_index_5') + +# create some random data +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), title=f'text {i}') for i in range(100)] +) + +# 
index the data +doc_index.index(docs) +``` + +That call to [index()][docarray.index.backends.milvus.MilvusDocumentIndex.index] stores all Documents in `docs` into the Document Index, +ready to be retrieved in the next step. + +As you can see, `DocList[MyDoc]` and `MilvusDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. +This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store +need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. + + +## Vector Search + +Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. 
+
+Using a document of type `MyDoc`, you can find
+similar Documents in the Document Index through [find()][docarray.index.abstract.BaseDocIndex.find]:
+
+=== "Search by Document"
+
+    ```python
+    # create a query Document
+    query = MyDoc(embedding=np.random.rand(128), title='query')
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.title=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.title=}')
+    print(f'{scores=}')
+    ```
+
+To successfully perform a vector search, the index needs to know which field serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+For `MilvusDocumentIndex`, this is the field marked with `is_embedding=True` in the schema, which is why the
+examples above don't pass a `search_field` to `find()`. Exactly one field per schema can be marked this way.
+
+The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
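Conceptually, every backend implements the same contract as a brute-force nearest-neighbor scan: score each stored vector against the query and return the top `limit` (document, score) pairs — production backends like Milvus use ANN indexes to avoid scanning everything. A minimal, dependency-free sketch of these semantics over hypothetical toy data:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def find(index, query, limit=5):
    """Return the `limit` most similar (doc_id, score) pairs, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# a toy 'index' mapping a document id to its embedding
index = {
    'doc0': [1.0, 0.0, 0.0],
    'doc1': [0.0, 1.0, 0.0],
    'doc2': [0.7, 0.7, 0.0],
}
matches = find(index, query=[1.0, 0.0, 0.0], limit=2)
```

This is only an illustration of the scoring and ranking idea, not the actual Milvus search path.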
+
+
+## Filter
+
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The query should follow the [query language of Milvus](https://milvus.io/docs/boolean.md).
+
+In the following example, let's filter for all books that are cheaper than 29 dollars:
+
+```python
+from docarray import BaseDoc, DocList
+
+
+class Book(BaseDoc):
+    price: int
+    embedding: NdArray[10] = Field(is_embedding=True)
+
+
+books = DocList[Book]([Book(price=i * 10, embedding=np.random.rand(10)) for i in range(10)])
+book_index = MilvusDocumentIndex[Book](index_name='tmp_index_6')
+book_index.index(books)
+
+# filter for books that are cheaper than 29 dollars
+query = 'price < 29'
+cheap_books = book_index.filter(filter_query=query)
+
+assert len(cheap_books) == 3
+for doc in cheap_books:
+    doc.summary()
+```
+
+## Text Search
+
+In addition to vector similarity search, the Document Index interface offers methods for text search:
+[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
+
+!!! note
+    The [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] implementation does not offer support for text search.
+
+    To see how to perform text search, you can check out other backends that offer support for it.
+
+
+## Hybrid Search
+
+Document Index supports atomic operations for vector similarity search, text search and filter search.
+
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
+
+```python
+# Define the document schema.
+class SimpleSchema(BaseDoc):
+    price: int
+    embedding: NdArray[128] = Field(is_embedding=True)
+
+# Create dummy documents.
+docs = DocList[SimpleSchema](SimpleSchema(price=i, embedding=np.random.rand(128)) for i in range(10))
+
+doc_index = MilvusDocumentIndex[SimpleSchema](index_name='tmp_index_7')
+doc_index.index(docs)
+
+query = (
+    doc_index.build_query()  # get empty query object
+    .find(query=np.random.rand(128))  # add vector similarity search
+    .filter(filter_query='price < 3')  # add filter search
+    .build()
+)
+# execute the combined query and return the results
+results = doc_index.execute_query(query)
+print(f'{results=}')
+```
+
+In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
+to obtain a combined set of results.
+
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can combine filter and vector search, etc.
+To see which backend supports which operations, check out the [specific docs](#document-index).
+
+
+## Access Documents
+
+To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search.
+ +You can also access data by the `id` that was assigned to each document: + +```python +# prepare some data +data = DocList[SimpleSchema]( + SimpleSchema(embedding=np.random.rand(128), price=i) for i in range(3) +) + +# remember the Document ids and index the data +ids = data.id +doc_index.index(data) + +# access the Documents by id +doc = doc_index[ids[0]] # get by single id +docs = doc_index[ids] # get by list of ids +``` + + +## Delete Documents + +In the same way you can access Documents by id, you can also delete them: + +```python +# prepare some data +data = DocList[SimpleSchema]( + SimpleSchema(embedding=np.random.rand(128), price=i) for i in range(3) +) + +# remember the Document ids and index the data +ids = data.id +doc_index.index(data) + +# access the Documents by id +del doc_index[ids[0]] # del by single id +del doc_index[ids[1:]] # del by list of ids +``` + + +## Configuration + +This section lays out the configurations and options that are specific to [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]. + +### DBConfig + +The following configs can be set in `DBConfig`: + +| Name | Description | Default | +|-------------------------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------| +| `host` | The host address for the Milvus server. | `localhost` | +| `port` | The port number for the Milvus server | 19530 | +| `index_name` | The name of the index in the Milvus database | None. Data will be stored in an index named after the Document type used as schema. 
|
+| `user` | The username for the Milvus server | None |
+| `password` | The password for the Milvus server | None |
+| `token` | Token for secure connection | '' |
+| `collection_description` | Description of the collection in the database | '' |
+| `default_column_config` | The default configurations for every column type. | dict |
+
+You can pass any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object.
+
+
+### Field-wise configurations
+
+
+`default_column_config` holds the default configuration for every column type. Since there are many column types in Milvus, you can also consider changing the column config when defining the schema.
+
+```python
+class SimpleDoc(BaseDoc):
+    tensor: NdArray[128] = Field(is_embedding=True, index_type='IVF_FLAT', metric_type='L2')
+
+
+doc_index = MilvusDocumentIndex[SimpleDoc](index_name='tmp_index_10')
+```
+
+
+### RuntimeConfig
+
+The `RuntimeConfig` dataclass of `MilvusDocumentIndex` consists of a single `batch_size` parameter, used for index/get/del operations.
+You can change `batch_size` in the following way:
+
+```python
+doc_index = MilvusDocumentIndex[SimpleDoc]()
+doc_index.configure(MilvusDocumentIndex.RuntimeConfig(batch_size=128))
+```
+
+You can pass the above as a keyword argument to the `configure()` method or pass an entire configuration object.
+
+
+## Nested data
+
+The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`.
+
+**Index nested data:**
+
+It is, however, also possible to represent nested Documents and store them in a Document Index.
+
+In the following example you can see a complex schema that contains nested Documents.
+The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: + +```python +from docarray.typing import ImageUrl, VideoUrl, AnyTensor + + +# define a nested schema +class ImageDoc(BaseDoc): + url: ImageUrl + tensor: AnyTensor = Field(space='cosine', dim=64) + + +class VideoDoc(BaseDoc): + url: VideoUrl + tensor: AnyTensor = Field(space='cosine', dim=128) + + +class YouTubeVideoDoc(BaseDoc): + title: str + description: str + thumbnail: ImageDoc + video: VideoDoc + tensor: AnyTensor = Field(is_embedding=True, space='cosine', dim=256) + + +# create a Document Index +doc_index = MilvusDocumentIndex[YouTubeVideoDoc](index_name='tmp2') + +# create some data +index_docs = [ + YouTubeVideoDoc( + title=f'video {i+1}', + description=f'this is video from author {10*i}', + thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), + video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), + tensor=np.ones(256), + ) + for i in range(8) +] + +# index the Documents +doc_index.index(index_docs) +``` + +**Search nested data:** + +You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
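The dunder syntax is essentially a field path: each `__` descends one level into the nested schema. A rough plain-Python illustration of that resolution idea (the helper and data below are hypothetical, not DocArray's internal API):

```python
def resolve_field(doc: dict, path: str):
    """Walk a nested mapping following '__'-separated path segments."""
    current = doc
    for segment in path.split('__'):
        current = current[segment]
    return current


# hypothetical nested document, mirroring the YouTubeVideoDoc structure
video = {
    'title': 'video 1',
    'thumbnail': {'url': 'http://example.ai/images/1', 'tensor': [1.0, 2.0]},
    'video': {'url': 'http://example.ai/videos/1', 'tensor': [3.0, 4.0]},
}

# 'thumbnail__tensor' first descends into 'thumbnail', then reads 'tensor'
thumbnail_tensor = resolve_field(video, 'thumbnail__tensor')
```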
+
+In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields:
+
+```python
+# create a query Document
+query_doc = YouTubeVideoDoc(
+    title=f'video query',
+    description=f'this is a query video',
+    thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)),
+    video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)),
+    tensor=np.ones(256),
+)
+
+# find by the `YouTubeVideoDoc` tensor; root level
+docs, scores = doc_index.find(query_doc, limit=3)
+```
+
+### Nested data with subindex
+
+Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data).
+
+If a Document contains a DocList, it can still be stored in a Document Index.
+In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...).
+
+This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList.
+
+
+**Index**
+
+In the following example you can see a complex schema that contains nested Documents with subindex.
+The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: + +```python +class ImageDoc(BaseDoc): + url: ImageUrl + tensor_image: AnyTensor = Field(is_embedding=True, space='cosine', dim=64) + + +class VideoDoc(BaseDoc): + url: VideoUrl + images: DocList[ImageDoc] + tensor_video: AnyTensor = Field(is_embedding=True, space='cosine', dim=128) + + +class MyDoc(BaseDoc): + docs: DocList[VideoDoc] + tensor: AnyTensor = Field(is_embedding=True, space='cosine', dim=256) + + +# create a Document Index +doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp4') +doc_index.configure(MilvusDocumentIndex.RuntimeConfig(batch_size=10)) + + +# create some data +index_docs = [ + MyDoc( + docs=DocList[VideoDoc]( + [ + VideoDoc( + url=f'http://example.ai/videos/{i}-{j}', + images=DocList[ImageDoc]( + [ + ImageDoc( + url=f'http://example.ai/images/{i}-{j}-{k}', + tensor_image=np.ones(64), + ) + for k in range(10) + ] + ), + tensor_video=np.ones(128), + ) + for j in range(10) + ] + ), + tensor=np.ones(256), + ) + for i in range(10) +] + +# index the Documents +doc_index.index(index_docs) +``` + +**Search** + +You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. + +```python +# find by the `VideoDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(128), subindex='docs', search_field='tensor_video', limit=3 +) + +# find by the `ImageDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 +) +``` + +### Update elements +In order to update a Document inside the index, you only need to reindex it with the updated attributes. 
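This works because documents are keyed by their `id`: indexing a document whose `id` is already present overwrites the stored copy instead of creating a duplicate. A simplified, dict-based sketch of this upsert behavior (assumed semantics, for illustration only):

```python
def index_docs(store: dict, docs):
    """Insert or overwrite documents keyed by their id (upsert)."""
    for doc in docs:
        store[doc['id']] = doc


store = {}
index_docs(store, [{'id': 'a', 'text': 'I am the first version of Document 0'}])
index_docs(store, [{'id': 'a', 'text': 'I am the second version of Document 0'}])
# still exactly one stored document, now holding the updated text
```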
+ +First lets create a schema for our Index +```python +import numpy as np +from docarray import BaseDoc, DocList +from docarray.typing import NdArray +from docarray.index import MilvusDocumentIndex +class MyDoc(BaseDoc): + text: str + embedding: NdArray[128] +``` +Now we can instantiate our Index and index some data. + +```python +docs = DocList[MyDoc]( + [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] +) +index = MilvusDocumentIndex[MyDoc]() +index.index(docs) +assert index.num_docs() == 100 +``` + +Now we can find relevant documents + +```python +res = index.find(query=docs[0], search_field='embedding', limit=100) +assert len(res.documents) == 100 +for doc in res.documents: + assert 'I am the first version' in doc.text +``` + +and update all of the text of this documents and reindex them + +```python +for i, doc in enumerate(docs): + doc.text = f'I am the second version of Document {i}' + +index.index(docs) +assert index.num_docs() == 100 +``` + +When we retrieve them again we can see that their text attribute has been updated accordingly + +```python +res = index.find(query=docs[0], search_field='embedding', limit=100) +assert len(res.documents) == 100 +for doc in res.documents: + assert 'I am the second version' in doc.text +``` + From fe572da95b22857808a7b35a18d20137d198718c Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 24 Jul 2023 13:32:41 +0200 Subject: [PATCH 12/23] docs: validate milvus code Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_milvus.md | 65 +++---------------------- 1 file changed, 6 insertions(+), 59 deletions(-) diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index dcb488e8fd7..fd6ab262bde 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -498,7 +498,7 @@ class MyDoc(BaseDoc): # create a Document Index -doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp4') 
+doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp5') doc_index.configure(MilvusDocumentIndex.RuntimeConfig(batch_size=10)) @@ -515,17 +515,17 @@ index_docs = [ url=f'http://example.ai/images/{i}-{j}-{k}', tensor_image=np.ones(64), ) - for k in range(10) + for k in range(5) ] ), tensor_video=np.ones(128), ) - for j in range(10) + for j in range(5) ] ), tensor=np.ones(256), ) - for i in range(10) + for i in range(5) ] # index the Documents @@ -539,64 +539,11 @@ You can perform search on any subindex level by using `find_subindex()` method a ```python # find by the `VideoDoc` tensor root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 + np.ones(128), subindex='docs', limit=3 ) # find by the `ImageDoc` tensor root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 + np.ones(64), subindex='docs__images', limit=3 ) ``` - -### Update elements -In order to update a Document inside the index, you only need to reindex it with the updated attributes. - -First lets create a schema for our Index -```python -import numpy as np -from docarray import BaseDoc, DocList -from docarray.typing import NdArray -from docarray.index import MilvusDocumentIndex -class MyDoc(BaseDoc): - text: str - embedding: NdArray[128] -``` -Now we can instantiate our Index and index some data. 
- -```python -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] -) -index = MilvusDocumentIndex[MyDoc]() -index.index(docs) -assert index.num_docs() == 100 -``` - -Now we can find relevant documents - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the first version' in doc.text -``` - -and update all of the text of this documents and reindex them - -```python -for i, doc in enumerate(docs): - doc.text = f'I am the second version of Document {i}' - -index.index(docs) -assert index.num_docs() == 100 -``` - -When we retrieve them again we can see that their text attribute has been updated accordingly - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the second version' in doc.text -``` - From 10bc14ba9a2cb2ab24a63f4ee160f6b5b2c49ef4 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 24 Jul 2023 14:54:04 +0200 Subject: [PATCH 13/23] docs: make redis and milvus visible Signed-off-by: jupyterjazz --- mkdocs.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mkdocs.yml b/mkdocs.yml index a7ad4cf8500..1df96540849 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -103,6 +103,8 @@ nav: - user_guide/storing/index_weaviate.md - user_guide/storing/index_elastic.md - user_guide/storing/index_qdrant.md + - user_guide/storing/index_redis.md + - user_guide/storing/index_milvus.md - DocStore - Bulk storage: - user_guide/storing/doc_store/store_file.md - user_guide/storing/doc_store/store_jac.md From 6199a2a3224db62a03353e7083fd1be67d0948d2 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Wed, 26 Jul 2023 11:28:11 +0200 Subject: [PATCH 14/23] docs: refine vol1 Signed-off-by: jupyterjazz --- docs/user_guide/storing/docindex.md | 15 ++ docs/user_guide/storing/first_step.md | 2 + 
docs/user_guide/storing/index_elastic.md | 2 +- docs/user_guide/storing/index_hnswlib.md | 195 ++++++++++++--------- docs/user_guide/storing/index_in_memory.md | 108 +++++++++--- docs/user_guide/storing/index_milvus.md | 3 +- docs/user_guide/storing/index_qdrant.md | 105 ++++++----- docs/user_guide/storing/index_redis.md | 108 ++++++------ docs/user_guide/storing/index_weaviate.md | 18 +- 9 files changed, 325 insertions(+), 231 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index 7118f7af841..7bd8721db1d 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -37,6 +37,8 @@ Currently, DocArray supports the following vector databases: - [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md) - [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md) - [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md) +- [Redis](https://redis.com/) | [Docs](index_redis.md) +- [Milvus](https://milvus.io/) | [Docs](index_milvus.md) - [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) - InMemoryExactNNIndex | [Docs](index_in_memory.md) @@ -56,6 +58,7 @@ because it doesn't require you to launch a database server. Instead, it will sto For InMemory-specific settings, check out the `InMemoryExactNNIndex` documentation [here](index_in_memory.md). +### Define Document Schema and Create Data ```python from docarray import BaseDoc, DocList from docarray.index import InMemoryExactNNIndex @@ -70,19 +73,31 @@ class MyDoc(BaseDoc): # Create documents (using dummy/random vectors) docs = DocList[MyDoc](MyDoc(title=f'title #{i}', price=i, embedding=np.random.rand(128)) for i in range(10)) +``` +### Initialize the Document Index and Add Data +```python # Initialize a new InMemoryExactNNIndex instance and add the documents to the index. 
doc_index = InMemoryExactNNIndex[MyDoc]() doc_index.index(docs) +``` +### Perform a Vector Similarity Search +```python # Perform a vector search. query = np.ones(128) retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) +``` +### Filter Documents +```python # Perform filtering (price < 5) query = {'price': {'$lt': 5}} filtered_docs = doc_index.filter(query, limit=10) +``` +### Combine Different Search Methods +```python # Perform a hybrid search - combining vector search with filtering query = ( doc_index.build_query() # get empty query object diff --git a/docs/user_guide/storing/first_step.md b/docs/user_guide/storing/first_step.md index e8f7ab80315..0da34c4516e 100644 --- a/docs/user_guide/storing/first_step.md +++ b/docs/user_guide/storing/first_step.md @@ -41,5 +41,7 @@ use a vector search library locally (HNSWLib, Exact NN search): - [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md) - [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md) - [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md) +- [Redis](https://redis.com/) | [Docs](index_redis.md) +- [Milvus](https://milvus.io/) | [Docs](index_milvus.md) - [Hnswlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) - InMemoryExactNNSearch | [Docs](index_in_memory.md) diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index 00913e6947d..bcae64f8998 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -166,7 +166,7 @@ You can work around this problem by subclassing the predefined Document and addi Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the predefined Document type, or your custom Document type. -The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: ```python from docarray import DocList diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index 862232a98ed..9ad6a0def8a 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -20,6 +20,8 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s - [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] - [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] - [ElasticDocumentIndex][docarray.index.backends.elastic.ElasticDocIndex] + - [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] + - [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] ## Basic Usage @@ -124,7 +126,7 @@ You can work around this problem by subclassing the predefined Document and addi Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the predefined Document type, or your custom Document type. -The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: ```python from docarray import DocList @@ -143,20 +145,6 @@ db.index(data) ``` -**Database location:** - -For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you -usually specify a `host` and a `port` instead. 
- -In addition to a host and a port, most backends can also take an `index_name`, `table_name`, `collection_name` or similar. -This specifies the name of the index/table/collection that will be created in the database. -You don't have to specify this though: By default, this name will be taken from the name of the Document type that you use as schema. -For example, for `WeaviateDocumentIndex[MyDoc](...)` the data will be stored in a Weaviate Class of name `MyDoc`. - -In any case, if the location does not yet contain any data, we start from a blank slate. -If the location already contains data from a previous session, it will be accessible through the Document Index. - - ## Index Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: @@ -242,7 +230,7 @@ matching documents and their associated similarity scores. When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. -How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations). +How these scores are calculated depends on the backend, and can usually be [configured](#configuration). ### Batched Search @@ -284,14 +272,33 @@ a list of `DocList`s, one for each query, containing the closest matching docume ## Filter -In addition to vector similarity search, the Document Index interface offers methods for filtered search: -[filter()][docarray.index.abstract.BaseDocIndex.filter], -as well as the batched version [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched]. [filter_subindex()][docarray.index.abstract.BaseDocIndex.filter_subindex] is for filter on subindex level. 
+To filter Documents, the `InMemoryExactNNIndex` uses DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function.
 
-!!! note
-    The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter search.
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The query should follow the query language of DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function.
 
-    To see how to perform filter search, you can check out other backends that offer support.
+In the following example, let's filter for all books that are cheaper than 29 dollars:
+
+```python
+from docarray import BaseDoc, DocList
+from docarray.index import HnswDocumentIndex
+
+
+class Book(BaseDoc):
+    title: str
+    price: int
+
+
+books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)])
+book_index = HnswDocumentIndex[Book](work_dir='./tmp_0')
+book_index.index(books)
+
+# filter for books that are cheaper than 29 dollars
+query = {'price': {'$lte': 29}}
+cheap_books = book_index.filter(query)
+
+assert len(cheap_books) == 3
+for doc in cheap_books:
+    doc.summary()
+```
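As an aside, the `$lte` query above uses a MongoDB-style operator syntax. Its semantics can be sketched in plain Python; this is only an illustration of the query language, not DocArray's actual `filter_docs()` implementation:

```python
# Illustrative model of the MongoDB-style comparison operators used in
# filter queries ($lt, $lte, $gt, $gte, $eq). NOT DocArray's implementation.
import operator

OPS = {
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$eq': operator.eq,
}


def matches(doc: dict, query: dict) -> bool:
    """Return True if `doc` satisfies every {field: {op: value}} clause."""
    return all(
        OPS[op](doc[field], value)
        for field, clause in query.items()
        for op, value in clause.items()
    )


books = [{'title': f'title {i}', 'price': i * 10} for i in range(10)]
cheap = [b for b in books if matches(b, {'price': {'$lte': 29}})]
assert len(cheap) == 3  # prices 0, 10, 20
```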
+docs = DocList[SimpleSchema](SimpleSchema(year=2000-i, price=i, embedding=np.random.rand(128)) for i in range(10))
+
+doc_index = HnswDocumentIndex[SimpleSchema](work_dir='./tmp_9')
+doc_index.index(docs)
 
 query = (
-    db.build_query()  # get empty query object
-    .find(query=q_doc, search_field='embedding')  # add vector similarity search
-    .filter(filter_query={'text': {'$exists': True}})  # add filter search
-    .build()  # build the query
+    doc_index.build_query()  # get empty query object
+    .filter(filter_query={'year': {'$gt': 1994}})  # pre-filtering
+    .find(query=np.random.rand(128), search_field='embedding')  # add vector similarity search
+    .filter(filter_query={'price': {'$lte': 3}})  # post-filtering
+    .build()
 )
-
 # execute the combined query and return the results
-retrieved_docs, scores = db.execute_query(query)
-print(f'{retrieved_docs=}')
+results = doc_index.execute_query(query)
+print(f'{results=}')
 ```
 
 In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
@@ -335,7 +351,6 @@ to obtain a combined set of results.
 
 The kinds of atomic queries that can be combined in this way depend on the backend.
 Some backends can combine text search and vector search, while others can perform filters and vector search, etc.
-To see what backend can do what, check out the [specific docs](#document-index).
 
 ## Access Documents
 
@@ -379,6 +394,54 @@ del db[ids[0]]  # del by single id
 del db[ids[1:]]  # del by list of ids
 ```
 
+## Update Documents
+In order to update a Document inside the index, you only need to re-index it with the updated attributes.
+
+First, let's create a schema for our Document Index:
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import HnswDocumentIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now, we can instantiate our Index and add some data:
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = HnswDocumentIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Let's retrieve our data and check its content:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+Then, let's update the text of these documents and re-index them:
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again, we can see that their text attribute has been updated accordingly:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
 
 ## Configuration
 
@@ -459,7 +522,7 @@ For more information on these settings, see [below](#field-wise-configurations).
 Fields that are not vector fields (e.g. of type `str` or `int` etc.) do not offer any configuration, as they are
 simply stored as-is in a SQLite database.
 
-### Field-wise configurations
+### Field-wise Configurations
 
 There are various settings that you can tweak for every vector field that you index into Hnswlib.
 
@@ -498,6 +561,20 @@ In this way, you can pass [all options that Hnswlib supports](https://github.com
 
 You can find more details on the parameters [here](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md).
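Among these per-field settings, `space` selects the distance metric. hnswlib supports three spaces (`'l2'`, `'ip'`, and `'cosine'`), whose distance formulas can be written out in plain Python; this is a sketch of the definitions, not hnswlib's C++ implementation:

```python
# The three distances selectable via hnswlib's `space` parameter:
#   'l2'     -> squared L2 distance: sum((a_i - b_i)^2)
#   'ip'     -> 1 - inner product
#   'cosine' -> 1 - cosine similarity
# Plain-Python sketch of the definitions (hnswlib computes these in C++).
import math


def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip(a, b):
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm


a, b = [1.0, 0.0], [0.0, 1.0]
assert l2(a, b) == 2.0      # orthogonal unit vectors
assert cosine(a, b) == 1.0  # no shared direction
assert cosine(a, a) == 0.0  # identical direction
```

Note that smaller values always mean "closer" here, which is why inner product and cosine similarity are flipped into distances.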
+### Database location + +For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you +usually specify a `host` and a `port` instead. + +In addition to a host and a port, most backends can also take an `index_name`, `table_name`, `collection_name` or similar. +This specifies the name of the index/table/collection that will be created in the database. +You don't have to specify this though: By default, this name will be taken from the name of the Document type that you use as schema. +For example, for `WeaviateDocumentIndex[MyDoc](...)` the data will be stored in a Weaviate Class of name `MyDoc`. + +In any case, if the location does not yet contain any data, we start from a blank slate. +If the location already contains data from a previous session, it will be accessible through the Document Index. + + ## Nested data @@ -658,55 +735,3 @@ root_docs, sub_docs, scores = doc_index.find_subindex( np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 ) ``` - -### Update elements -In order to update a Document inside the index, you only need to reindex it with the updated attributes. - -First lets create a schema for our Index -```python -import numpy as np -from docarray import BaseDoc, DocList -from docarray.typing import NdArray -from docarray.index import HnswDocumentIndex -class MyDoc(BaseDoc): - text: str - embedding: NdArray[128] -``` -Now we can instantiate our Index and index some data. 
- -```python -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] -) -index = HnswDocumentIndex[MyDoc]() -index.index(docs) -assert index.num_docs() == 100 -``` - -Now we can find relevant documents - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the first version' in doc.text -``` - -and update all of the text of this documents and reindex them - -```python -for i, doc in enumerate(docs): - doc.text = f'I am the second version of Document {i}' - -index.index(docs) -assert index.num_docs() == 100 -``` - -When we retrieve them again we can see that their text attribute has been updated accordingly - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the second version' in doc.text -``` diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index 84bc80441c4..a36a29e2695 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -15,6 +15,9 @@ For vector search and filtering the InMemoryExactNNIndex utilizes DocArray's [`f - [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] - [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] - [ElasticDocumentIndex][docarray.index.backends.elastic.ElasticDocIndex] + - [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] + - [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] + ## Basic usage @@ -138,20 +141,6 @@ data = DocList[MyDoc]( db.index(data) ``` - -**Persist and Load** - -Further, you can pass an `index_file_path` argument to make sure that the index can be restored if persisted from that specific file. 
-```python
-doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
-doc_index.index(docs)
-
-doc_index.persist()
-
-# Initialize a new document index using the saved binary file
-new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
-```
-
 ## Index
 
 Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method:
@@ -327,19 +316,27 @@ To combine these operations into a single, hybrid search query, you can use the
 through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
 
 ```python
-# prepare a query
-q_doc = MyDoc(embedding=np.random.rand(128), text='query')
+# Define the document schema.
+class SimpleSchema(BaseDoc):
+    year: int
+    price: int
+    embedding: NdArray[128]
+
+# Create dummy documents.
+docs = DocList[SimpleSchema](SimpleSchema(year=2000-i, price=i, embedding=np.random.rand(128)) for i in range(10))
+
+doc_index = InMemoryExactNNIndex[SimpleSchema](docs)
 
 query = (
-    db.build_query()  # get empty query object
-    .find(query=q_doc, search_field='embedding')  # add vector similarity search
-    .filter(filter_query={'text': {'$exists': True}})  # add filter search
-    .build()  # build the query
+    doc_index.build_query()  # get empty query object
+    .filter(filter_query={'year': {'$gt': 1994}})  # pre-filtering
+    .find(query=np.random.rand(128), search_field='embedding')  # add vector similarity search
+    .filter(filter_query={'price': {'$lte': 3}})  # post-filtering
+    .build()
 )
-
 # execute the combined query and return the results
-retrieved_docs, scores = db.execute_query(query)
-print(f'{retrieved_docs=}')
+results = doc_index.execute_query(query)
+print(f'{results=}')
 ```
 
 In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
@@ -347,7 +344,6 @@ to obtain a combined set of results.
 
 The kinds of atomic queries that can be combined in this way depend on the backend.
Some backends can combine text search and vector search, while others can perform filters and vector search, etc.
-To see what backend can do what, check out the [specific docs](#document-index).
 
 ## Access Documents
 
@@ -391,6 +387,56 @@ del db[ids[0]]  # del by single id
 del db[ids[1:]]  # del by list of ids
 ```
 
+## Update Documents
+In order to update a Document inside the index, you only need to re-index it with the updated attributes.
+
+First, let's create a schema for our Document Index:
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import InMemoryExactNNIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now, we can instantiate our Index and add some data:
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = InMemoryExactNNIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Let's retrieve our data and check its content:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+Then, let's update the text of these documents and re-index them:
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again, we can see that their text attribute has been updated accordingly:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
+
+
 ## Configuration
 
 This section lays out the configurations and options that are specific to [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex].
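The update-by-re-indexing pattern shown above works because documents are keyed by their `id`: indexing a document whose `id` already exists overwrites the stored copy instead of adding a duplicate. A toy model of this upsert behavior (an illustration only, not DocArray's actual code):

```python
# Minimal model of update-by-re-indexing: an index keyed by document id,
# where indexing an existing id overwrites the old entry (upsert).
# Illustration only -- NOT DocArray's implementation.
class ToyIndex:
    def __init__(self):
        self._docs = {}

    def index(self, docs):
        for doc in docs:
            self._docs[doc['id']] = doc  # same id -> overwrite in place

    def num_docs(self):
        return len(self._docs)


docs = [{'id': str(i), 'text': f'first version of Document {i}'} for i in range(100)]
idx = ToyIndex()
idx.index(docs)

for doc in docs:
    doc['text'] = doc['text'].replace('first', 'second')
idx.index(docs)  # re-indexing updates the existing entries

assert idx.num_docs() == 100  # no duplicates were created
```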
@@ -443,6 +489,20 @@ class Schema(BaseDoc):
 
 In the example above you can see how to configure two different vector fields, with two different sets of settings.
 
+
+### Persist and Load
+You can pass an `index_file_path` argument so that the index is persisted to, and can later be restored from, that specific file.
+```python
+doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
+doc_index.index(docs)
+
+doc_index.persist()
+
+# Initialize a new document index using the saved binary file
+new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
+```
+
+
 ## Nested data
 
 The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`.
diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md
index fd6ab262bde..3962ee7865a 100644
--- a/docs/user_guide/storing/index_milvus.md
+++ b/docs/user_guide/storing/index_milvus.md
@@ -217,7 +217,7 @@ matching documents and their associated similarity scores.
 
 When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
 
-How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations).
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
 
 You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
@@ -298,7 +298,6 @@ to obtain a combined set of results.
 
 The kinds of atomic queries that can be combined in this way depend on the backend.
 Some backends can combine text search and vector search, while others can perform filters and vector search, etc.
-To see what backend can do what, check out the [specific docs](#document-index).
## Access Documents diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index 3851388eeed..deaca07df75 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -149,7 +149,7 @@ You can work around this problem by subclassing the predefined Document and addi Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the predefined Document type, or your custom Document type. -The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: ```python from docarray import DocList @@ -253,7 +253,7 @@ matching documents and their associated similarity scores. When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. -How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations). +How these scores are calculated depends on the backend, and can usually be [configured](#configuration). You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. @@ -390,6 +390,55 @@ del doc_index[index_docs[16].id] del doc_index[index_docs[17].id, index_docs[18].id] ``` +## Update elements +In order to update a Document inside the index, you only need to re-index it with the updated attributes. 
+
+First, let's create a schema for our Document Index:
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import QdrantDocumentIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now, we can instantiate our Index and add some data:
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = QdrantDocumentIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Let's retrieve our data and check its content:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+Then, let's update the text of these documents and re-index them:
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again, we can see that their text attribute has been updated accordingly:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
+
 
 ## Configuration
 
@@ -572,55 +621,3 @@ root_docs, sub_docs, scores = doc_index.find_subindex(
 )
 ```
 
-### Update elements
-In order to update a Document inside the index, you only need to reindex it with the updated attributes.
- -```python -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)] -) -index = QdrantDocumentIndex[MyDoc]() -index.index(docs) -assert index.num_docs() == 100 -``` - -Now we can find relevant documents - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the first version' in doc.text -``` - -and update all of the text of this documents and reindex them - -```python -for i, doc in enumerate(docs): - doc.text = f'I am the second version of Document {i}' - -index.index(docs) -assert index.num_docs() == 100 -``` - -When we retrieve them again we can see that their text attribute has been updated accordingly - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the second version' in doc.text -``` - diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md index 3409556de30..373fd85a3cb 100644 --- a/docs/user_guide/storing/index_redis.md +++ b/docs/user_guide/storing/index_redis.md @@ -121,7 +121,7 @@ You can work around this problem by subclassing the predefined Document and addi Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the predefined Document type, or your custom Document type. -The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: ```python from docarray import DocList @@ -224,7 +224,7 @@ matching documents and their associated similarity scores. When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. -How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations). +How these scores are calculated depends on the backend, and can usually be [configured](#configuration). You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. @@ -319,7 +319,6 @@ to obtain a combined set of results. The kinds of atomic queries that can be combined in this way depends on the backend. Some backends can combine text search and vector search, while others can perform filters and vectors search, etc. -To see what backend can do what, check out the [specific docs](#document-index). ## Access Documents @@ -363,6 +362,56 @@ del db[ids[0]] # del by single id del db[ids[1:]] # del by list of ids ``` +## Update elements +In order to update a Document inside the index, you only need to re-index it with the updated attributes. 
+
+First, let's create a schema for our Document Index:
+```python
+import numpy as np
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from docarray.index import RedisDocumentIndex
+
+
+class MyDoc(BaseDoc):
+    text: str
+    embedding: NdArray[128]
+```
+
+Now, we can instantiate our Index and add some data:
+```python
+docs = DocList[MyDoc](
+    [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
+)
+index = RedisDocumentIndex[MyDoc]()
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+Let's retrieve our data and check its content:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the first version' in doc.text
+```
+
+Then, let's update the text of these documents and re-index them:
+```python
+for i, doc in enumerate(docs):
+    doc.text = f'I am the second version of Document {i}'
+
+index.index(docs)
+assert index.num_docs() == 100
+```
+
+When we retrieve them again, we can see that their text attribute has been updated accordingly:
+```python
+res = index.find(query=docs[0], search_field='embedding', limit=100)
+assert len(res.documents) == 100
+for doc in res.documents:
+    assert 'I am the second version' in doc.text
+```
+
+
 ## Configuration
 
 This section lays out the configurations and options that are specific to [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex].
 
@@ -571,56 +620,3 @@ root_docs, sub_docs, scores = doc_index.find_subindex(
     np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
 )
 ```
-
- -First lets create a schema for our Index -```python -import numpy as np -from docarray import BaseDoc, DocList -from docarray.typing import NdArray -from docarray.index import RedisDocumentIndex -class MyDoc(BaseDoc): - text: str - embedding: NdArray[128] -``` -Now we can instantiate our Index and index some data. - -```python -docs = DocList[MyDoc]( - [MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)] -) -index = RedisDocumentIndex[MyDoc]() -index.index(docs) -assert index.num_docs() == 100 -``` - -Now we can find relevant documents - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the first version' in doc.text -``` - -and update all of the text of this documents and reindex them - -```python -for i, doc in enumerate(docs): - doc.text = f'I am the second version of Document {i}' - -index.index(docs) -assert index.num_docs() == 100 -``` - -When we retrieve them again we can see that their text attribute has been updated accordingly - -```python -res = index.find(query=docs[0], search_field='embedding', limit=100) -assert len(res.documents) == 100 -for doc in res.documents: - assert 'I am the second version' in doc.text -``` - diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index e377e501968..cb3f916a8ca 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -45,7 +45,7 @@ retrieved_docs = doc_index.find(query, limit=10) To use [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex], DocArray needs to hook into a running Weaviate service. There are multiple ways to start a Weaviate instance, depending on your use case. 
-#### Options - Overview +**Options - Overview** | Instance type | General use case | Configurability | Notes | | ----- | ----- | ----- | ----- | @@ -56,13 +56,13 @@ There are multiple ways to start a Weaviate instance, depending on your use case ### Instantiation instructions -#### WCS (managed instance) +**WCS (managed instance)** Go to the [WCS console](https://console.weaviate.cloud) and create an instance using the visual interface, following [this guide](https://weaviate.io/developers/wcs/guides/create-instance). Weaviate instances on WCS come pre-configured, so no further configuration is required. -#### Docker-Compose (self-managed) +**Docker-Compose (self-managed)** Get a configuration file (`docker-compose.yaml`). You can build it using [this interface](https://weaviate.io/developers/weaviate/installation/docker-compose), or download it directly with: @@ -76,7 +76,7 @@ Where `v` is the actual version, such as `v1.18.3`. curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?modules=standalone&runtime=docker-compose&weaviate_version=v1.18.3" ``` -##### Start up Weaviate with Docker-Compose +**Start up Weaviate with Docker-Compose** Then you can start up Weaviate by running from a shell: @@ -84,7 +84,7 @@ Then you can start up Weaviate by running from a shell: docker-compose up -d ``` -##### Shut down Weaviate +**Shut down Weaviate** Then you can shut down Weaviate by running from a shell: @@ -92,7 +92,7 @@ Then you can shut down Weaviate by running from a shell: docker-compose down ``` -#### Notes +**Notes** Unless data persistence or backups are set up, shutting down the Docker instance will remove all its data. 
@@ -102,7 +102,7 @@ See documentation on [Persistent volume](https://weaviate.io/developers/weaviate
 docker-compose up -d
 ```
 
-#### Embedded Weaviate (from the application)
+**Embedded Weaviate (from the application)**
 
 With Embedded Weaviate, the Weaviate database server can be launched from the client, using:
 
@@ -150,7 +150,7 @@ dbconfig = WeaviateDocumentIndex.DBConfig(
 )  # Replace with your endpoint)
 
-#### OIDC with username + password
+**OIDC with username + password**
 
 To authenticate against a Weaviate instance with OIDC username & password:
 
@@ -170,7 +170,7 @@ dbconfig = WeaviateDocumentIndex.DBConfig(
 #     )
 
-#### API key-based authentication
+**API key-based authentication**
 
 To authenticate against a Weaviate instance with an API key:
 
From c257a4e672e4d1a086cad8a59de383a4d567bed6 Mon Sep 17 00:00:00 2001
From: jupyterjazz
Date: Wed, 26 Jul 2023 17:34:35 +0200
Subject: [PATCH 15/23] docs: refine vol2

Signed-off-by: jupyterjazz
---
 docs/user_guide/storing/docindex.md        | 14 +++++-----
 docs/user_guide/storing/index_hnswlib.md   | 30 ++++++++++------------
 docs/user_guide/storing/index_in_memory.md | 11 ++++----
 docs/user_guide/storing/index_milvus.md    | 11 +++-----
 docs/user_guide/storing/index_qdrant.md    |  2 +-
 docs/user_guide/storing/index_weaviate.md  |  4 +++
 6 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md
index 7bd8721db1d..abcb4fd7c00 100644
--- a/docs/user_guide/storing/docindex.md
+++ b/docs/user_guide/storing/docindex.md
@@ -45,17 +45,17 @@ Currently, DocArray supports the following vector databases:
 
 ## Basic Usage
 
-For this user guide you will use the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]
-because it doesn't require you to launch a database server. Instead, it will store your data locally.
+It's easy because you don't need a database server, instead it saves your data locally. + !!! note "Using a different vector database" - You can easily use Weaviate, Qdrant, or Elasticsearch instead -- they share the same API! + You can easily use Weaviate, Qdrant, Redis, Milvus or Elasticsearch instead -- they share the same API! To do so, check their respective documentation sections. -!!! note "InMemory-specific settings" - The following sections explain the general concept of Document Index by using - `InMemoryExactNNIndex` as an example. - For InMemory-specific settings, check out the `InMemoryExactNNIndex` documentation +!!! note "InMemoryExactNNIndex in more detail" + The following section only covers the basics of InMemoryExactNNIndex. + For a deeper understanding, please look into its documentation [here](index_in_memory.md). ### Define Document Schema and Create Data diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index 9ad6a0def8a..f8ca5da2845 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -40,7 +40,7 @@ class MyDoc(BaseDoc): docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10)) # Initialize a new HnswDocumentIndex instance and add the documents to the index. -doc_index = HnswDocumentIndex[MyDoc](work_dir='./my_index') +doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp_0') doc_index.index(docs) # Perform a vector search. 
@@ -63,7 +63,7 @@ class MyDoc(BaseDoc): text: str -db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db') +db = HnswDocumentIndex[MyDoc](work_dir='./tmp_1') ``` ### Schema definition @@ -105,7 +105,7 @@ You can work around this problem by subclassing the predefined Document and addi embedding: NdArray[128] - db = HnswDocumentIndex[MyDoc](work_dir='test_db') + db = HnswDocumentIndex[MyDoc](work_dir='./tmp_2') ``` === "Using Field()" @@ -120,7 +120,7 @@ You can work around this problem by subclassing the predefined Document and addi embedding: AnyTensor = Field(dim=128) - db = HnswDocumentIndex[MyDoc](work_dir='test_db3') + db = HnswDocumentIndex[MyDoc](work_dir='./tmp_3') ``` Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the @@ -187,8 +187,8 @@ need to have compatible schemas. Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. -By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find -similar Documents in the Document Index: +You can use the [find()][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc` +to find similar documents within the Document Index: === "Search by Document" @@ -272,8 +272,6 @@ a list of `DocList`s, one for each query, containing the closest matching docume ## Filter -To filter Documents, the `InMemoryExactNNIndex` uses DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. - You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. The query should follow the query language of the DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. 
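The filter query language can be made concrete with a tiny evaluator. The following is a hedged re-implementation of a small subset of the operators, for illustration only, not DocArray's actual `filter_docs()`:

```python
OPERATORS = {
    '$lt': lambda field, value: field < value,
    '$lte': lambda field, value: field <= value,
    '$gt': lambda field, value: field > value,
    '$gte': lambda field, value: field >= value,
    '$eq': lambda field, value: field == value,
}


def matches(doc: dict, query: dict) -> bool:
    """True if the document satisfies every {field: {operator: value}} clause."""
    return all(
        OPERATORS[op](doc[field], value)
        for field, conditions in query.items()
        for op, value in conditions.items()
    )


# Dummy documents: year counts down from 2000 while price counts up.
docs = [{'year': 2000 - i, 'price': i} for i in range(10)]

# All clauses must hold at once (an implicit AND across fields).
recent_and_cheap = [
    d for d in docs if matches(d, {'year': {'$gte': 1995}, 'price': {'$lt': 3}})
]
```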
@@ -289,10 +287,10 @@ class Book(BaseDoc): books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)]) -book_index = HnswDocumentIndex[Book](work_dir='./tmp_0') +book_index = HnswDocumentIndex[Book](work_dir='./tmp_4') # filter for books that are cheaper than 29 dollars -query = {'price': {'$lte': 29}} +query = {'price': {'$lt': 29}} cheap_books = book_index.filter(query) assert len(cheap_books) == 3 @@ -331,7 +329,7 @@ class SimpleSchema(BaseDoc): # Create dummy documents. docs = DocList[SimpleSchema](SimpleSchema(year=2000-i, price=i, embedding=np.random.rand(128)) for i in range(10)) -doc_index = HnswDocumentIndex[SimpleSchema](work_dir='./tmp_9') +doc_index = HnswDocumentIndex[SimpleSchema](work_dir='./tmp_5') doc_index.index(docs) query = ( @@ -467,7 +465,7 @@ class MyDoc(BaseDoc): text: str -db = HnswDocumentIndex[MyDoc](work_dir='./path/to/db') +db = HnswDocumentIndex[MyDoc](work_dir='./tmp_6') ``` To load existing data, you can specify a directory that stores data from a previous session. @@ -488,7 +486,7 @@ import numpy as np db = HnswDocumentIndex[MyDoc]( - work_dir='/tmp/my_db', + work_dir='./tmp_7', default_column_config={ np.ndarray: { 'dim': -1, @@ -537,7 +535,7 @@ class Schema(BaseDoc): tens_two: NdArray[10] = Field(M=4, space='ip') -db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db') +db = HnswDocumentIndex[Schema](work_dir='./tmp_8') ``` In the example above you can see how to configure two different vector fields, with two different sets of settings. 
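The `space` values used here (`'l2'`, `'ip'`, `'cosine'`) come from hnswlib, which reports distances, so lower is better. As a reference, the three metrics can be written out in plain Python; this mirrors hnswlib's documented formulas rather than calling the library itself:

```python
import math


def l2_distance(a, b):
    # 'l2': squared Euclidean distance (hnswlib skips the square root)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    # 'ip': 1.0 minus the inner product
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    # 'cosine': 1.0 minus the cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms
```

With `space='ip'`, two identical unit vectors have distance `0.0` and orthogonal ones `1.0`, which is why `'ip'` and `'cosine'` agree on normalized embeddings.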
@@ -611,7 +609,7 @@ class YouTubeVideoDoc(BaseDoc): # create a Document Index -doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2') +doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp_9') # create some data index_docs = [ @@ -688,7 +686,7 @@ class MyDoc(BaseDoc): # create a Document Index -doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp3') +doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp_10') # create some data index_docs = [ diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index a36a29e2695..e0bc71d23fe 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -1,11 +1,11 @@ # In-Memory Document Index -[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all Documents in DocLists in memory. +[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all Documents in memory using DocLists. It is a great starting point for small datasets, where you may not want to launch a database server. -For vector search and filtering the InMemoryExactNNIndex utilizes DocArray's [`find()`][docarray.utils.find.find] and -[`filter_docs()`][docarray.utils.filter.filter_docs] functions. +For vector search and filtering the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] +utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][docarray.utils.filter.filter_docs] functions. !!! note "Production readiness" [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] is a great starting point @@ -183,8 +183,9 @@ need to have compatible schemas. Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. 
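Conceptually, an exact (brute-force) nearest-neighbor search like the one `InMemoryExactNNIndex` performs scores every stored vector against the query and keeps the best `limit` hits. A minimal sketch using Euclidean distance (the real index also supports other metrics):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def exact_nn(query, vectors, limit=10):
    """Compare the query against *every* vector: exact, not approximate."""
    scored = sorted(enumerate(vectors), key=lambda iv: euclidean(query, iv[1]))
    return [i for i, _ in scored[:limit]]


vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]
nearest = exact_nn([0.0, 0.0], vectors, limit=2)  # indices of the two closest vectors
```

Because every vector is compared, the results are always exact, but the cost grows linearly with the dataset size, which is why this approach suits small and medium datasets best.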
-By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find -similar Documents in the Document Index: +You can use the [find()][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc` +to find similar documents within the Document Index: + === "Search by Document" diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index 3962ee7865a..bd16f531ce7 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -12,6 +12,10 @@ focusing on special features and configurations of Milvus. ## Basic Usage +!!! note "Single Search Field Requirement" + In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. + This is due to Milvus' configuration, which permits a single vector for each data object. + ```python from docarray import BaseDoc, DocList from docarray.index import MilvusDocumentIndex @@ -205,13 +209,6 @@ similar Documents in the Document Index: print(f'{scores=}') ``` -To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the -basis of comparison between your query and the documents in the Document Index. - -In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. -In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose -which one to use for the search. - The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest matching documents and their associated similarity scores. 
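The `(documents, scores)` pair returned by `find()` can be modelled as a named tuple. The sketch below illustrates only that return shape over plain dictionaries; the dot-product scoring is an assumption, not Milvus' actual ranking:

```python
from collections import namedtuple

# The return shape of find(): documents bundled with their scores.
FindResult = namedtuple('FindResult', ['documents', 'scores'])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def find(query, docs, limit=10):
    """Rank documents by score and return them together with the scores."""
    ranked = sorted(docs, key=lambda d: dot(query, d['embedding']), reverse=True)[:limit]
    return FindResult(
        documents=ranked,
        scores=[dot(query, d['embedding']) for d in ranked],
    )


docs = [{'id': i, 'embedding': [float(i), 1.0]} for i in range(3)]
matches, scores = find([1.0, 0.0], docs, limit=2)  # unpacks like the guide's examples
```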
diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index deaca07df75..fee13fcc32c 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -185,7 +185,7 @@ docs = DocList[MyDoc]( doc_index.index(docs) ``` -That call to [index()][docarray.index.backends.qdrant.QdrantDocumentIndex.index] stores all Documents in `docs` into the Document Index, +That call to `index()` stores all Documents in `docs` into the Document Index, ready to be retrieved in the next step. As you can see, `DocList[MyDoc]` and `QdrantDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index cb3f916a8ca..ee1cd71fae7 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -12,6 +12,10 @@ focusing on special features and configurations of Weaviate. ## Basic Usage +!!! note "Single Search Field Requirement" + In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. + This is due to Weaviate's configuration, which permits a single vector for each data object. 
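In other words, a schema may flag exactly one field as the embedding. A hedged sketch of that constraint over a plain mapping of field names to options (the function and error message are illustrative, not DocArray's actual validation code):

```python
def pick_embedding_field(schema: dict) -> str:
    """Return the single field flagged as the embedding, or raise.

    `schema` maps field names to a dict of options, e.g.
    {'embedding': {'is_embedding': True, 'dim': 128}, 'text': {}}.
    """
    flagged = [name for name, opts in schema.items() if opts.get('is_embedding')]
    if len(flagged) != 1:
        raise ValueError(
            f'exactly one field must set is_embedding=True, found {len(flagged)}'
        )
    return flagged[0]
```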
+ ```python from docarray import BaseDoc, DocList from docarray.index import WeaviateDocumentIndex From f3ca77c68bf0899b279801854c6287f38266e173 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Thu, 27 Jul 2023 09:35:35 +0200 Subject: [PATCH 16/23] docs: update api reference Signed-off-by: jupyterjazz --- docs/API_reference/doc_index/backends/milvus.md | 3 +++ docs/API_reference/doc_index/backends/redis.md | 3 +++ 2 files changed, 6 insertions(+) create mode 100644 docs/API_reference/doc_index/backends/milvus.md create mode 100644 docs/API_reference/doc_index/backends/redis.md diff --git a/docs/API_reference/doc_index/backends/milvus.md b/docs/API_reference/doc_index/backends/milvus.md new file mode 100644 index 00000000000..38514163cac --- /dev/null +++ b/docs/API_reference/doc_index/backends/milvus.md @@ -0,0 +1,3 @@ +# MilvusDocumentIndex + +::: docarray.index.backends.milvus.MilvusDocumentIndex \ No newline at end of file diff --git a/docs/API_reference/doc_index/backends/redis.md b/docs/API_reference/doc_index/backends/redis.md new file mode 100644 index 00000000000..f9622b23d55 --- /dev/null +++ b/docs/API_reference/doc_index/backends/redis.md @@ -0,0 +1,3 @@ +# RedisDocumentIndex + +::: docarray.index.backends.redis.RedisDocumentIndex \ No newline at end of file From e6ef9c4019410bfb6d09d89ec39cb7b31d4aa4c8 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 31 Jul 2023 14:38:54 +0200 Subject: [PATCH 17/23] docs: apply suggestions Signed-off-by: jupyterjazz --- docs/user_guide/storing/docindex.md | 26 ++++++++------- docs/user_guide/storing/index_elastic.md | 19 +++++------ docs/user_guide/storing/index_hnswlib.md | 38 ++++++++++------------ docs/user_guide/storing/index_in_memory.md | 38 ++++++++++------------ docs/user_guide/storing/index_milvus.md | 29 ++++++++--------- docs/user_guide/storing/index_qdrant.md | 30 ++++++++--------- docs/user_guide/storing/index_redis.md | 32 +++++++++--------- docs/user_guide/storing/index_weaviate.md | 8 ++--- 8 files 
changed, 106 insertions(+), 114 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index abcb4fd7c00..54b7ede532f 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -43,22 +43,21 @@ Currently, DocArray supports the following vector databases: - InMemoryExactNNIndex | [Docs](index_in_memory.md) -## Basic Usage +## Basic usage -Let's learn basic capabilities of Document Index with [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. -It's easy because you don't need a database server, instead it saves your data locally. +Let's learn the basic capabilities of Document Index with [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. +This doesn't require a database server - rather, it saves your data locally. !!! note "Using a different vector database" - You can easily use Weaviate, Qdrant, Redis, Milvus or Elasticsearch instead -- they share the same API! + You can easily use Weaviate, Qdrant, Redis, Milvus or Elasticsearch instead -- their APIs are largely identical! To do so, check their respective documentation sections. !!! note "InMemoryExactNNIndex in more detail" The following section only covers the basics of InMemoryExactNNIndex. - For a deeper understanding, please look into its documentation - [here](index_in_memory.md). + For a deeper understanding, please look into its [documentation](index_in_memory.md). 
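The vector search and filtering that this guide walks through can also be combined into a hybrid query: filter first, then rank only the surviving candidates. A self-contained sketch over plain dictionaries, not the index's implementation:

```python
import random


def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Dummy documents mirroring the schema in this guide: title, price, 128-dim embedding.
random.seed(0)
docs = [
    {
        'title': f'title #{i}',
        'price': i,
        'embedding': [random.random() for _ in range(128)],
    }
    for i in range(10)
]

query = [1.0] * 128

# 1. Pre-filter: keep only documents with price < 5.
candidates = [d for d in docs if d['price'] < 5]

# 2. Rank only the survivors by distance to the query vector.
ranked = sorted(candidates, key=lambda d: l2(query, d['embedding']))
top = ranked[0]
```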
-### Define Document Schema and Create Data +### Define document schema and create data ```python from docarray import BaseDoc, DocList from docarray.index import InMemoryExactNNIndex @@ -72,31 +71,34 @@ class MyDoc(BaseDoc): embedding: NdArray[128] # Create documents (using dummy/random vectors) -docs = DocList[MyDoc](MyDoc(title=f'title #{i}', price=i, embedding=np.random.rand(128)) for i in range(10)) +docs = DocList[MyDoc]( + MyDoc(title=f"title #{i}", price=i, embedding=np.random.rand(128)) + for i in range(10) +) ``` -### Initialize the Document Index and Add Data +### Initialize the Document Index and add data ```python # Initialize a new InMemoryExactNNIndex instance and add the documents to the index. doc_index = InMemoryExactNNIndex[MyDoc]() doc_index.index(docs) ``` -### Perform a Vector Similarity Search +### Perform a vector similarity search ```python # Perform a vector search. query = np.ones(128) retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) ``` -### Filter Documents +### Filter documents ```python # Perform filtering (price < 5) query = {'price': {'$lt': 5}} filtered_docs = doc_index.filter(query, limit=10) ``` -### Combine Different Search Methods +### Combine different search methods ```python # Perform a hybrid search - combining vector search with filtering query = ( diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index bcae64f8998..df6db3c180a 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -163,8 +163,7 @@ You can work around this problem by subclassing the predefined Document and addi db = ElasticDocIndex[MyDoc](index_name='test_db3') ``` -Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or your custom Document type. 
+Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: @@ -188,7 +187,7 @@ db.index(data) ## Index Use `.index()` to add documents into the index. -The`.num_docs()` method returns the total number of documents in the index. +The `.num_docs()` method returns the total number of documents in the index. ```python index_docs = [SimpleDoc(tensor=np.ones(128)) for _ in range(64)] @@ -224,7 +223,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ ## Filter You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. -The query should follow the [query language of Elastic](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html). +The query should follow [Elastic's query language](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html). The `filter()` method accepts queries that follow the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) and consists of leaf and compound clauses. @@ -390,7 +389,7 @@ You can also manually build a valid ES query and directly pass it to the `execut ## Access documents -To access the `Doc`, you need to specify the `id`. You can also pass a list of `id` to access multiple documents. +To access a document, you need to specify its `id`. You can also pass a list of `id` to access multiple documents. ```python # access a single Doc @@ -402,7 +401,7 @@ doc_index[index_docs[2].id, index_docs[3].id] ## Delete documents -To delete the documents, use the built-in function `del` with the `id` of the Documents that you want to delete. 
+To delete documents, use the built-in function `del` with the `id` of the documents that you want to delete. You can also pass a list of `id`s to delete multiple documents. ```python @@ -457,8 +456,8 @@ See [here](docindex.md#configuration-options#customize-configurations) for more ### Persistence -You can hook into a database index that was persisted during a previous session. -To do so, you need to specify `index_name` and the `hosts`: +You can hook into a database index that was persisted during a previous session by +specifying the `index_name` and `hosts`: ```python doc_index = ElasticDocIndex[MyDoc]( @@ -476,7 +475,7 @@ print(f'number of docs in the persisted index: {doc_index2.num_docs()}') ## Nested data -When using the index you can define multiple fields, including nesting documents inside another document. +When using the index you can define multiple fields, including nesting documents inside a parent document. Consider the following example: @@ -546,7 +545,7 @@ docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) To delete a nested data, you need to specify the `id`. !!! note - You can only delete `Doc` at the top level. Deletion of `Doc`s on lower levels is not yet supported. + You can only delete top-level documents. Deleting nested documents is not yet supported. 
```python # example of delete nested and flat index diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index f8ca5da2845..d232019f605 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -50,7 +50,7 @@ retrieved_docs = doc_index.find(query, search_field='embedding', limit=10) ## Initialize -To create a Document Index, you first need a document that defines the schema of your index: +To create a Document Index, you first need a document class that defines the schema of your index: ```python from docarray import BaseDoc @@ -123,8 +123,7 @@ You can work around this problem by subclassing the predefined Document and addi db = HnswDocumentIndex[MyDoc](work_dir='./tmp_3') ``` -Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or your custom Document type. +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: @@ -147,7 +146,7 @@ db.index(data) ## Index -Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: ```python import numpy as np @@ -162,12 +161,11 @@ docs = DocList[MyDoc]( db.index(docs) ``` -That call to [index()][docarray.index.backends.hnswlib.HnswDocumentIndex.index] stores all Documents in `docs` into the Document Index, +That call to [`index()`][docarray.index.backends.hnswlib.HnswDocumentIndex.index] stores all Documents in `docs` in the Document Index, ready to be retrieved in the next step. -As you can see, `DocList[MyDoc]` and `HnswDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. -This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store -need to have compatible schemas. +As you can see, `DocList[MyDoc]` and `HnswDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. !!! question "When are two schemas compatible?" The schemas of your Document Index and data need to be compatible with each other. @@ -185,9 +183,9 @@ need to have compatible schemas. ## Vector Search -Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
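Searching by document and searching by raw vector are interchangeable, because a query document simply contributes the vector stored in its search field. A hedged sketch of that equivalence (dot-product scoring is an assumption):

```python
def find(query, docs, search_field='embedding', limit=10):
    """Accept either a raw vector or a document that carries the search field."""
    vector = query[search_field] if isinstance(query, dict) else query
    scored = sorted(
        docs,
        key=lambda d: sum(x * y for x, y in zip(vector, d[search_field])),
        reverse=True,
    )
    return scored[:limit]


docs = [
    {'id': 'a', 'embedding': [1.0, 0.0]},
    {'id': 'b', 'embedding': [0.0, 1.0]},
]

by_vector = find([0.0, 1.0], docs, limit=1)                  # search by raw vector
by_document = find({'embedding': [0.0, 1.0]}, docs, limit=1)  # search by document
```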
-You can use the [find()][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc` +You can use the [`find()`][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc` to find similar documents within the Document Index: === "Search by Document" @@ -225,10 +223,10 @@ In this particular example you only have one field (`embedding`) that is a vecto In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. -The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest matching documents and their associated similarity scores. -When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. +When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. How these scores are calculated depends on the backend, and can usually be [configured](#configuration). @@ -273,7 +271,7 @@ a list of `DocList`s, one for each query, containing the closest matching docume ## Filter You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. -The query should follow the query language of the DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. +The query should follow the query language of DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. 
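The batched variants accept several queries at once and return one independent result list per query. A hedged sketch of the `filter_batched()` semantics over a small operator subset, not DocArray's implementation:

```python
def filter_once(docs, query):
    """Evaluate one {field: {'$lt'/'$gt': value}}-style query."""
    ops = {'$lt': lambda f, v: f < v, '$gt': lambda f, v: f > v}
    return [
        d
        for d in docs
        if all(
            ops[op](d[field], v)
            for field, cond in query.items()
            for op, v in cond.items()
        )
    ]


def filter_batched(docs, queries):
    """One independent result list per query, mirroring the *_batched methods."""
    return [filter_once(docs, q) for q in queries]


docs = [{'price': p} for p in (5, 15, 25)]
per_query = filter_batched(docs, [{'price': {'$lt': 10}}, {'price': {'$gt': 10}}])
```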
In the following example let's filter for all the books that are cheaper than 29 dollars: @@ -307,7 +305,7 @@ In addition to vector similarity search, the Document Index interface offers met as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. !!! note - The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for text search. + The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not support text search. To see how to perform text search, you can check out other backends that offer support. @@ -353,7 +351,7 @@ Some backends can combine text search and vector search, while others can perfor ## Access Documents -To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. +To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. You can also access data by the `id` that was assigned to each document: @@ -375,7 +373,7 @@ docs = db[ids] # get by list of ids ## Delete Documents -In the same way you can access Documents by id, you can also delete them: +In the same way you can access Documents by `id`, you can also delete them: ```python # prepare some data @@ -424,7 +422,7 @@ for doc in res.documents: assert 'I am the first version' in doc.text ``` -Then, let's update all of the text of this documents and re-index them: +Then, let's update all of the text of these documents and re-index them: ```python for i, doc in enumerate(docs): doc.text = f'I am the second version of Document {i}' @@ -520,7 +518,7 @@ For more information on these settings, see [below](#field-wise-configurations). Fields that are not vector fields (e.g. of type `str` or `int` etc.) do not offer any configuration, as they are simply stored as-is in a SQLite database. 
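That division of labor, vectors in hnswlib files and every other column in SQLite, can be illustrated with an in-memory SQLite table. The layout below is illustrative, not `HnswDocumentIndex`'s actual internal schema:

```python
import sqlite3

# Non-vector columns live in a plain SQLite table keyed by the document id.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (id TEXT PRIMARY KEY, title TEXT, price INTEGER)')
conn.executemany(
    'INSERT INTO docs VALUES (?, ?, ?)',
    [(f'doc-{i}', f'title #{i}', i * 10) for i in range(10)],
)

# After a vector search returns ids, the payload is fetched back by id.
row = conn.execute('SELECT title, price FROM docs WHERE id = ?', ('doc-3',)).fetchone()
```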
-### Field-wise Configurations +### Field-wise configuration There are various setting that you can tweak for every vector field that you index into Hnswlib. @@ -658,9 +656,9 @@ docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). +In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. 
**Index** diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index e0bc71d23fe..87d7fcf4bfc 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -47,7 +47,7 @@ retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=1 ## Initialize -To create a Document Index, you first need a document that defines the schema of your index: +To create a Document Index, you first need a document class that defines the schema of your index: ```python from docarray import BaseDoc @@ -120,8 +120,7 @@ You can work around this problem by subclassing the predefined Document and addi db = InMemoryExactNNIndex[MyDoc]() ``` -Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or your custom Document type. +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: @@ -143,7 +142,7 @@ db.index(data) ## Index -Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: ```python import numpy as np @@ -158,12 +157,11 @@ docs = DocList[MyDoc]( db.index(docs) ``` -That call to [index()][docarray.index.backends.in_memory.InMemoryExactNNIndex.index] stores all Documents in `docs` into the Document Index, +That call to [`index()`][docarray.index.backends.in_memory.InMemoryExactNNIndex.index] stores all Documents in `docs` in the Document Index, ready to be retrieved in the next step. 
-As you can see, `DocList[MyDoc]` and `InMemoryExactNNIndex[MyDoc]` are both parameterized with `MyDoc`. -This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store -need to have compatible schemas. +As you can see, `DocList[MyDoc]` and `InMemoryExactNNIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. !!! question "When are two schemas compatible?" The schemas of your Document Index and data need to be compatible with each other. @@ -181,9 +179,9 @@ need to have compatible schemas. ## Vector Search -Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. -You can use the [find()][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc` +You can use the [`find()`][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc` to find similar documents within the Document Index: @@ -222,10 +220,10 @@ In this particular example you only have one field (`embedding`) that is a vecto In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. -The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest matching documents and their associated similarity scores. 
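When a schema carries several vector fields, `search_field` decides which field each document is scored on. A hedged sketch of that dispatch over plain dictionaries, with dot-product scoring as an assumption:

```python
def find_in_field(query, docs, search_field, limit=10):
    """Score each document on the one vector field named by `search_field`."""

    def score(doc):
        return sum(x * y for x, y in zip(query, doc[search_field]))

    return sorted(docs, key=score, reverse=True)[:limit]


docs = [
    {'id': 'a', 'embedding': [1.0, 0.0], 'image_vec': [0.0, 1.0]},
    {'id': 'b', 'embedding': [0.0, 1.0], 'image_vec': [1.0, 0.0]},
]

# The same query ranks documents differently depending on the field searched.
by_text = find_in_field([1.0, 0.0], docs, search_field='embedding', limit=1)
by_image = find_in_field([1.0, 0.0], docs, search_field='image_vec', limit=1)
```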
-When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. +When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. How these scores are calculated depends on the backend, and can usually be [configured](#configuration). @@ -272,7 +270,7 @@ a list of `DocList`s, one for each query, containing the closest matching docume To filter Documents, the `InMemoryExactNNIndex` uses DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query. -The query should follow the query language of the DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. +The query should follow the query language of DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function. In the following example let's filter for all the books that are cheaper than 29 dollars: @@ -304,7 +302,7 @@ In addition to vector similarity search, the Document Index interface offers met as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. !!! note - The [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] implementation does not offer support for text search. + The [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] implementation does not support text search. To see how to perform text search, you can check out other backends that offer support. 
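Backends that support text search typically expose a `text_search()` method that matches a query string against one text field. As a rough illustration of the semantics only (a naive case-insensitive substring match, not any backend's actual ranking):

```python
def text_search(query, docs, search_field='text', limit=10):
    """Return documents whose chosen text field contains the query string."""
    needle = query.lower()
    hits = [d for d in docs if needle in d[search_field].lower()]
    return hits[:limit]


docs = [
    {'id': 0, 'text': 'Document Index basics'},
    {'id': 1, 'text': 'vector search with HNSW'},
    {'id': 2, 'text': 'exact nearest-neighbor search'},
]
results = text_search('search', docs, limit=5)
```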
@@ -349,7 +347,7 @@ Some backends can combine text search and vector search, while others can perfor ## Access Documents -To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. +To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. You can also access data by the `id` that was assigned to each document: @@ -371,7 +369,7 @@ docs = db[ids] # get by list of ids ## Delete Documents -In the same way you can access Documents by id, you can also delete them: +In the same way you can access Documents by `id`, you can also delete them: ```python # prepare some data @@ -420,7 +418,7 @@ for doc in res.documents: assert 'I am the first version' in doc.text ``` -Then, let's update all of the text of this documents and re-index them: +Then, let's update all of the text of these documents and re-index them: ```python for i, doc in enumerate(docs): doc.text = f'I am the second version of Document {i}' @@ -468,7 +466,7 @@ For more information on these settings, see [below](#field-wise-configurations). Fields that are not vector fields (e.g. of type `str` or `int` etc.) do not offer any configuration. -### Field-wise configurations +### Field-wise configuration For a vector field you can adjust the `space` parameter. It can be one of: @@ -588,9 +586,9 @@ docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). 
+In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. **Index** diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index bd16f531ce7..b3d19f99f55 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -132,7 +132,7 @@ You can work around this problem by subclassing the predefined Document and addi ## Index -Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: ```python import numpy as np @@ -153,12 +153,11 @@ docs = DocList[MyDoc]( doc_index.index(docs) ``` -That call to [index()][docarray.index.backends.milvus.MilvusDocumentIndex.index] stores all Documents in `docs` into the Document Index, +That call to [`index()`][docarray.index.backends.milvus.MilvusDocumentIndex.index] stores all Documents in `docs` in the Document Index, ready to be retrieved in the next step. -As you can see, `DocList[MyDoc]` and `MilvusDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. -This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store -need to have compatible schemas. +As you can see, `DocList[MyDoc]` and `MilvusDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. 
+This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas.
 
 !!! question "When are two schemas compatible?"
     The schemas of your Document Index and data need to be compatible with each other.
@@ -176,9 +175,9 @@ need to have compatible schemas.
 
 ## Vector Search
 
-Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.
+Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.
 
-By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find
+By passing a document of type `MyDoc` to [`find()`][docarray.index.abstract.BaseDocIndex.find], you can find
 similar Documents in the Document Index:
 
 === "Search by Document"
@@ -209,10 +208,10 @@ similar Documents in the Document Index:
     print(f'{scores=}')
     ```
 
-The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
 matching documents and their associated similarity scores.
 
-When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
 
 How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
@@ -255,7 +254,7 @@ In addition to vector similarity search, the Document Index interface offers met as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. !!! note - The [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] implementation does not offer support for text search. + The [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] implementation does not support text search. To see how to perform text search, you can check out other backends that offer support. @@ -299,7 +298,7 @@ Some backends can combine text search and vector search, while others can perfor ## Access Documents -To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. +To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. You can also access data by the `id` that was assigned to each document: @@ -321,7 +320,7 @@ docs = doc_index[ids] # get by list of ids ## Delete Documents -In the same way you can access Documents by id, you can also delete them: +In the same way you can access Documents by `id`, you can also delete them: ```python # prepare some data @@ -361,7 +360,7 @@ The following configs can be set in `DBConfig`: You can pass any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object. -### Field-wise configurations +### Field-wise configuration `default_column_config` is the default configurations for every column type. Since there are many column types in Milvus, you can also consider changing the column config when defining the schema. @@ -466,9 +465,9 @@ docs, scores = doc_index.find(query_doc, limit=3) Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). If a Document contains a DocList, it can still be stored in a Document Index. 
-In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). +In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. **Index** diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index fee13fcc32c..c1de5f92266 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -146,8 +146,7 @@ You can work around this problem by subclassing the predefined Document and addi doc_index = QdrantDocumentIndex[MyDoc](host='localhost') ``` -Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or your custom Document type. +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: @@ -170,7 +169,7 @@ doc_index.index(data) ## Index -Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: ```python import numpy as np @@ -185,12 +184,11 @@ docs = DocList[MyDoc]( doc_index.index(docs) ``` -That call to `index()` stores all Documents in `docs` into the Document Index, +That call to `index()` stores all Documents in `docs` in the Document Index, ready to be retrieved in the next step. -As you can see, `DocList[MyDoc]` and `QdrantDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. -This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store -need to have compatible schemas. +As you can see, `DocList[MyDoc]` and `QdrantDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. !!! question "When are two schemas compatible?" The schemas of your Document Index and data need to be compatible with each other. @@ -208,9 +206,9 @@ need to have compatible schemas. ## Vector Search -Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
-By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find +By using a document of type `MyDoc`, [`find()`][docarray.index.abstract.BaseDocIndex.find], you can find similar Documents in the Document Index: === "Search by Document" @@ -248,10 +246,10 @@ In this particular example you only have one field (`embedding`) that is a vecto In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. -The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest matching documents and their associated similarity scores. -When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. +When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. How these scores are calculated depends on the backend, and can usually be [configured](#configuration). @@ -367,7 +365,7 @@ docs = doc_index.execute_query(query) ## Access documents -To access the `Doc`, you need to specify the `id`. You can also pass a list of `id` to access multiple documents. +To access a document, you need to specify its `id`. You can also pass a list of `id` to access multiple documents. ```python # access a single Doc @@ -379,7 +377,7 @@ doc_index[index_docs[16].id, index_docs[17].id] ## Delete documents -To delete the documents, use the built-in function `del` with the `id` of the Documents that you want to delete. 
+To delete documents, use the built-in function `del` with the `id` of the documents that you want to delete. You can also pass a list of `id`s to delete multiple documents. ```python @@ -422,7 +420,7 @@ for doc in res.documents: assert 'I am the first version' in doc.text ``` -Then, let's update all of the text of this documents and re-index them: +Then, let's update all of the text of these documents and re-index them: ```python for i, doc in enumerate(docs): doc.text = f'I am the second version of Document {i}' @@ -545,9 +543,9 @@ docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). +In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. 
**Index** diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md index 373fd85a3cb..5608b671886 100644 --- a/docs/user_guide/storing/index_redis.md +++ b/docs/user_guide/storing/index_redis.md @@ -118,8 +118,7 @@ You can work around this problem by subclassing the predefined Document and addi doc_index = RedisDocumentIndex[MyDoc]() ``` -Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or your custom Document type. +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: @@ -141,7 +140,7 @@ doc_index.index(data) ## Index -Now that you have a Document Index, you can add data to it, using the [index()][docarray.index.abstract.BaseDocIndex.index] method: +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method: ```python import numpy as np @@ -156,12 +155,11 @@ docs = DocList[MyDoc]( doc_index.index(docs) ``` -That call to [index()][docarray.index.backends.redis.RedisDocumentIndex.index] stores all Documents in `docs` into the Document Index, +That call to [`index()`][docarray.index.backends.redis.RedisDocumentIndex.index] stores all Documents in `docs` in the Document Index, ready to be retrieved in the next step. -As you can see, `DocList[MyDoc]` and `RedisDocumentIndex[MyDoc]` are both parameterized with `MyDoc`. -This means that they share the same schema, and in general, the schema of a Document Index and the data that you want to store -need to have compatible schemas. 
+As you can see, `DocList[MyDoc]` and `RedisDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. !!! question "When are two schemas compatible?" The schemas of your Document Index and data need to be compatible with each other. @@ -179,9 +177,9 @@ need to have compatible schemas. ## Vector Search -Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. -By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find +By using a document of type `MyDoc`, [`find()`][docarray.index.abstract.BaseDocIndex.find], you can find similar Documents in the Document Index: === "Search by Document" @@ -219,10 +217,10 @@ In this particular example you only have one field (`embedding`) that is a vecto In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. -The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest +The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest matching documents and their associated similarity scores. -When searching on subindex level, you can use [find_subindex()][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. 
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. How these scores are calculated depends on the backend, and can usually be [configured](#configuration). @@ -323,7 +321,7 @@ Some backends can combine text search and vector search, while others can perfor ## Access Documents -To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. +To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. You can also access data by the `id` that was assigned to each document: @@ -345,7 +343,7 @@ docs = db[ids] # get by list of ids ## Delete Documents -In the same way you can access Documents by id, you can also delete them: +In the same way you can access Documents by `id`, you can also delete them: ```python # prepare some data @@ -394,7 +392,7 @@ for doc in res.documents: assert 'I am the first version' in doc.text ``` -Then, let's update all of the text of this documents and re-index them: +Then, let's update all of the text of these documents and re-index them: ```python for i, doc in enumerate(docs): doc.text = f'I am the second version of Document {i}' @@ -433,7 +431,7 @@ The following configs can be set in `DBConfig`: You can pass any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object. -### Field-wise configurations +### Field-wise configuration `default_column_config` is the default configurations for every column type. Since there are many column types in Redis, you can also consider changing the column config when defining the schema. 
@@ -545,9 +543,9 @@ docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). +In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. **Index** diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index ee1cd71fae7..4c866f3511b 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -323,7 +323,7 @@ You can find the documentation for [Weaviate's GraphQL API here](https://weaviat ## Access Documents -To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search. +To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. 
You can also access data by the `id` that was assigned to each document: @@ -345,7 +345,7 @@ docs = doc_index[ids] # get by list of ids ## Delete Documents -In the same way you can access Documents by id, you can also delete them: +In the same way you can access Documents by `id`, you can also delete them: ```python # prepare some data @@ -534,9 +534,9 @@ docs, scores = doc_index.find(query_doc, limit=3) Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, ...). +In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). -This still lets index and search through all of your data, but if you want to avoid the creation of additional indexes you could try to refactor your document schemas without the use of DocList. +This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. 
**Index** From 19045ec4162d57b84600c3990f5f8e3667f912e7 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 31 Jul 2023 15:15:47 +0200 Subject: [PATCH 18/23] docs: separate nested data section Signed-off-by: jupyterjazz --- docs/user_guide/storing/index_elastic.md | 150 +------------------ docs/user_guide/storing/index_hnswlib.md | 162 +------------------- docs/user_guide/storing/index_in_memory.md | 162 +------------------- docs/user_guide/storing/index_milvus.md | 158 +------------------- docs/user_guide/storing/index_qdrant.md | 163 +------------------- docs/user_guide/storing/index_redis.md | 162 +------------------- docs/user_guide/storing/index_weaviate.md | 156 +------------------- docs/user_guide/storing/nested_data.md | 164 +++++++++++++++++++++ mkdocs.yml | 1 + 9 files changed, 200 insertions(+), 1078 deletions(-) create mode 100644 docs/user_guide/storing/nested_data.md diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index df6db3c180a..60332a39020 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -473,150 +473,10 @@ print(f'number of docs in the persisted index: {doc_index2.num_docs()}') ``` -## Nested data +## Nested Data and Subindex Search -When using the index you can define multiple fields, including nesting documents inside a parent document. - -Consider the following example: - -- You have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. -- `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`. 
- -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(similarity='cosine', dims=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(similarity='cosine', dims=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(similarity='cosine', dims=256) - - -doc_index = ElasticDocIndex[YouTubeVideoDoc]() -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] -doc_index.index(index_docs) -``` - -**You can perform search on any nesting level** by using the dunder operator to specify the field defined in the nested data. - -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or the `tensor` field of the `thumbnail` and `video` field: - -```python -# example of find nested and flat index -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the youtubevideo tensor -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) - -# find by the thumbnail tensor -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) - -# find by the video tensor -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) -``` - -To delete a nested data, you need to specify the `id`. - -!!! note - You can only delete top-level documents. Deleting nested documents is not yet supported. 
- -```python -# example of delete nested and flat index -del doc_index[index_docs[3].id, index_docs[4].id] -``` - -### Nested data with subindex - -In the following example you can see a complex schema that contains nested Documents with subindex. - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(dims=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(dims=128) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(dims=256) - - -# create a Document Index -doc_index = ElasticDocIndex[MyDoc](index_name='subindex') - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) - -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 -) # return both root and subindex docs - -# filter on subindex level -query = {'match': {'url': 'http://example.ai/images/0-0-0'}} -docs = doc_index.filter_subindex(query, subindex='docs__images') -``` +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. 
+Go to the [Nested Data](nested_data.md) section to learn more.
\ No newline at end of file
diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md
index d232019f605..f82c39b4166 100644
--- a/docs/user_guide/storing/index_hnswlib.md
+++ b/docs/user_guide/storing/index_hnswlib.md
@@ -572,162 +572,10 @@ If the location already contains data from a previous session, it will be accessed.
 
 
 
-## Nested data
+## Nested Data and Subindex Search
 
-The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`.
+The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`.
+However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document
+contains a `DocList` of other documents.
 
-**Index nested data:**
-
-It is, however, also possible to represent nested Documents and store them in a Document Index.
-
-In the following example you can see a complex schema that contains nested Documents.
-The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp_9') - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
- -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the `youtubevideo` tensor; root level -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) - -# find by the `thumbnail` tensor; nested level -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) - -# find by the `video` tensor; neseted level -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) -``` - -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). - -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). - -This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. 
-The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine', dim=128) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp_10') - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. - -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 -) -``` +Go to [Nested Data](nested_data.md) section to learn more. 
diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index 87d7fcf4bfc..15ad51d7de9 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -502,162 +502,10 @@ new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin') ``` -## Nested data +## Nested Data and Subindex Search -The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. -**Index nested data:** - -It is, however, also possible to represent nested Documents and store them in a Document Index. - -In the following example you can see a complex schema that contains nested Documents. 
-The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine_sim', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine_sim', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine_sim', dim=256) - - -# create a Document Index -doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]() - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
- -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the `youtubevideo` tensor; root level -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) - -# find by the `thumbnail` tensor; nested level -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) - -# find by the `video` tensor; neseted level -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) -``` - -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). - -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). - -This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. 
-The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine_sim', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine_sim', dim=128) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine_sim', dim=256) - - -# create a Document Index -doc_index = InMemoryExactNNIndex[MyDoc]() - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. - -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 -) -``` \ No newline at end of file +Go to [Nested Data](nested_data.md) section to learn more. 
\ No newline at end of file diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index b3d19f99f55..55bfe80563d 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -387,158 +387,10 @@ doc_index.configure(MilvusDocumentIndex.RuntimeConfig(batch_size=128)) You can pass the above as keyword arguments to the `configure()` method or pass an entire configuration object. -## Nested data +## Nested Data and Subindex Search -The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. -**Index nested data:** - -It is, however, also possible to represent nested Documents and store them in a Document Index. - -In the following example you can see a complex schema that contains nested Documents. 
-The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(is_embedding=True, space='cosine', dim=256) - - -# create a Document Index -doc_index = MilvusDocumentIndex[YouTubeVideoDoc](index_name='tmp2') - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
- -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the `youtubevideo` tensor; root level -docs, scores = doc_index.find(query_doc, limit=3) -``` - -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). - -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). - -This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. 
-The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(is_embedding=True, space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(is_embedding=True, space='cosine', dim=128) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(is_embedding=True, space='cosine', dim=256) - - -# create a Document Index -doc_index = MilvusDocumentIndex[MyDoc](index_name='tmp5') -doc_index.configure(MilvusDocumentIndex.RuntimeConfig(batch_size=10)) - - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(5) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(5) - ] - ), - tensor=np.ones(256), - ) - for i in range(5) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. - -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', limit=3 -) -``` +Go to [Nested Data](nested_data.md) section to learn more. 
\ No newline at end of file diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index c1de5f92266..f0835615f20 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -459,163 +459,10 @@ the QdrantDocumentIndex will take the name the Document type that you use as sch the data will be stored in a collection name MyDoc if no specific collection_name is passed in the DBConfig. -## Nested data +## Nested Data and Subindex Search -The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. - -**Index nested data:** - -It is, however, also possible to represent nested Documents and store them in a Document Index. - -In the following example you can see a complex schema that contains nested Documents. -The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = QdrantDocumentIndex[YouTubeVideoDoc](index_name='tmp2') - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder 
operator to specify the field defined in the nested data. - -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the `youtubevideo` tensor; root level -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) - -# find by the `thumbnail` tensor; nested level -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) - -# find by the `video` tensor; neseted level -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) -``` - -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). - -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). - -This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. 
-The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine', dim=128) - - -class MediaDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = QdrantDocumentIndex[MediaDoc](index_name='tmp3') - -# create some data -index_docs = [ - MediaDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. - -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 -) -``` +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. +Go to [Nested Data](nested_data.md) section to learn more. 
diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md index 5608b671886..04e386d4ff0 100644 --- a/docs/user_guide/storing/index_redis.md +++ b/docs/user_guide/storing/index_redis.md @@ -459,162 +459,10 @@ You can pass the above as keyword arguments to the `configure()` method or pass -## Nested data +## Nested Data and Subindex Search -The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. -**Index nested data:** - -It is, however, also possible to represent nested Documents and store them in a Document Index. - -In the following example you can see a complex schema that contains nested Documents. 
-The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = RedisDocumentIndex[YouTubeVideoDoc](index_name='tmp2') - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. 
- -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the `youtubevideo` tensor; root level -docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3) - -# find by the `thumbnail` tensor; nested level -docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3) - -# find by the `video` tensor; neseted level -docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) -``` - -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). - -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). - -This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. 
-The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine', dim=128) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = RedisDocumentIndex[MyDoc]() - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. - -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', search_field='tensor_video', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 -) -``` +Go to [Nested Data](nested_data.md) section to learn more. 
\ No newline at end of file diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index 4c866f3511b..a75263a7590 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -456,156 +456,10 @@ class StringDoc(BaseDoc): A list of available Weaviate data types [is here](https://weaviate.io/developers/weaviate/config-refs/datatypes). -## Nested data +## Nested Data and Subindex Search -The examples above all operate on a simple schema: All fields in `MyDoc` have "basic" types, such as `str` or `NdArray`. +The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. +However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document +contains a `DocList` of other documents. -**Index nested data:** - -It is, however, also possible to represent nested Documents and store them in a Document Index. - -In the following example you can see a complex schema that contains nested Documents. 
-The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: - -```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor - - -# define a nested schema -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: AnyTensor = Field(space='cosine', dim=64) - - -class VideoDoc(BaseDoc): - url: VideoUrl - tensor: AnyTensor = Field(space='cosine', dim=128, is_embedding=True) - - -class YouTubeVideoDoc(BaseDoc): - title: str - description: str - thumbnail: ImageDoc - video: VideoDoc - tensor: AnyTensor = Field(space='cosine', dim=256) - - -# create a Document Index -doc_index = WeaviateDocumentIndex[YouTubeVideoDoc]() - -# create some data -index_docs = [ - YouTubeVideoDoc( - title=f'video {i+1}', - description=f'this is video from author {10*i}', - thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)), - tensor=np.ones(256), - ) - for i in range(8) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search nested data:** - -You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data. - -In the following example, you can see how to perform vector search on the `tensor` field of the `video` field: - -```python -# create a query Document -query_doc = YouTubeVideoDoc( - title=f'video query', - description=f'this is a query video', - thumbnail=ImageDoc(url=f'http://example.ai/images/1024', tensor=np.ones(64)), - video=VideoDoc(url=f'http://example.ai/videos/1024', tensor=np.ones(128)), - tensor=np.ones(256), -) - -# find by the `video` tensor; nested level -docs, scores = doc_index.find(query_doc, limit=3) -``` - -### Nested data with subindex - -Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one [above](#nested-data). 
- -If a Document contains a DocList, it can still be stored in a Document Index. -In this case, the DocList will be represented as a new index (or table, collection, etc., depending on the database backend), that is linked with the parent index (table, collection, etc). - -This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas without the use of DocLists. - - -**Index** - -In the following example you can see a complex schema that contains nested Documents with subindex. -The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields: - -```python -class ImageDoc(BaseDoc): - url: ImageUrl - tensor_image: AnyTensor = Field(space='cosine', dim=64, is_embedding=True) - - -class VideoDoc(BaseDoc): - url: VideoUrl - images: DocList[ImageDoc] - tensor_video: AnyTensor = Field(space='cosine', dim=128, is_embedding=True) - - -class MyDoc(BaseDoc): - docs: DocList[VideoDoc] - tensor: AnyTensor = Field(space='cosine', dim=256, is_embedding=True) - - -# create a Document Index -doc_index = WeaviateDocumentIndex[MyDoc]() - -# create some data -index_docs = [ - MyDoc( - docs=DocList[VideoDoc]( - [ - VideoDoc( - url=f'http://example.ai/videos/{i}-{j}', - images=DocList[ImageDoc]( - [ - ImageDoc( - url=f'http://example.ai/images/{i}-{j}-{k}', - tensor_image=np.ones(64), - ) - for k in range(10) - ] - ), - tensor_video=np.ones(128), - ) - for j in range(10) - ] - ), - tensor=np.ones(256), - ) - for i in range(10) -] - -# index the Documents -doc_index.index(index_docs) -``` - -**Search** - -You can perform search on any subindex level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. 
- -```python -# find by the `VideoDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(128), subindex='docs', limit=3 -) - -# find by the `ImageDoc` tensor -root_docs, sub_docs, scores = doc_index.find_subindex( - np.ones(64), subindex='docs__images', limit=3 -) -``` +Go to [Nested Data](nested_data.md) section to learn more. \ No newline at end of file diff --git a/docs/user_guide/storing/nested_data.md b/docs/user_guide/storing/nested_data.md new file mode 100644 index 00000000000..a45275c75ac --- /dev/null +++ b/docs/user_guide/storing/nested_data.md @@ -0,0 +1,164 @@ +# Nested data + +Most of the examples you've seen operate on a simple schema: each field corresponds to a "basic" type, such as `str` or `NdArray`. + +It is, however, also possible to represent nested documents and store them in a Document Index. + +!!! note "Using a different vector database" + In the following examples, we will use `InMemoryExactNNIndex` as our Document Index. + You can easily use Weaviate, Qdrant, Redis, Milvus or Elasticsearch instead -- their APIs are largely identical! + To do so, check their respective documentation sections. + +## Create and index +In the following example you can see a complex schema that contains nested documents. 
+The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields:
+
+```python
+import numpy as np
+from pydantic import Field
+
+from docarray import BaseDoc
+from docarray.index import InMemoryExactNNIndex
+from docarray.typing import ImageUrl, VideoUrl, AnyTensor
+
+
+# define a nested schema
+class ImageDoc(BaseDoc):
+    url: ImageUrl
+    tensor: AnyTensor = Field(space='cosine_sim', dim=64)
+
+
+class VideoDoc(BaseDoc):
+    url: VideoUrl
+    tensor: AnyTensor = Field(space='cosine_sim', dim=128)
+
+
+class YouTubeVideoDoc(BaseDoc):
+    title: str
+    description: str
+    thumbnail: ImageDoc
+    video: VideoDoc
+    tensor: AnyTensor = Field(space='cosine_sim', dim=256)
+
+
+# create a Document Index
+doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]()
+
+# create some data
+index_docs = [
+    YouTubeVideoDoc(
+        title=f'video {i+1}',
+        description=f'this is video from author {10*i}',
+        thumbnail=ImageDoc(url=f'http://example.ai/images/{i}', tensor=np.ones(64)),
+        video=VideoDoc(url=f'http://example.ai/videos/{i}', tensor=np.ones(128)),
+        tensor=np.ones(256),
+    )
+    for i in range(8)
+]
+
+# index the Documents
+doc_index.index(index_docs)
+```
+
+## Search
+
+You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data.
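+After indexing, you can sanity-check what was stored before running any search. A minimal sketch (assuming the schema and `doc_index` defined above; `num_docs()` is part of the Document Index API):
+
+```python
+# confirm that all eight documents landed in the index
+print(doc_index.num_docs())
+
+# nested fields are preserved on retrieval: fetch a document back by its id
+retrieved = doc_index[index_docs[0].id]
+print(retrieved.thumbnail.url, retrieved.video.url)
+```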
+
+In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields:
+
+```python
+# create a query Document
+query_doc = YouTubeVideoDoc(
+    title='video query',
+    description='this is a query video',
+    thumbnail=ImageDoc(url='http://example.ai/images/1024', tensor=np.ones(64)),
+    video=VideoDoc(url='http://example.ai/videos/1024', tensor=np.ones(128)),
+    tensor=np.ones(256),
+)
+
+# find by the `YouTubeVideoDoc` tensor; root level
+docs, scores = doc_index.find(query_doc, search_field='tensor', limit=3)
+
+# find by the `thumbnail` tensor; nested level
+docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit=3)
+
+# find by the `video` tensor; nested level
+docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3)
+```
+
+## Nested data with subindex search
+
+Documents can be nested by containing a `DocList` of other documents, which is a slightly more complicated scenario than the one above.
+
+If a document contains a `DocList`, it can still be stored in a Document Index.
+In this case, the `DocList` will be represented as a new index (or table, collection, etc., depending on the database backend) that is linked to the parent index (table, collection, etc.).
+
+This still lets you index and search through all of your data, but if you want to avoid the creation of additional indexes you can refactor your document schemas to avoid the use of `DocList`s.
+
+
+### Index
+
+In the following example, you can see a complex schema that contains nested `DocList`s of documents, where we'll use subindex search.
+
+The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `ImageDoc`, alongside some "basic" fields:
+
+```python
+class ImageDoc(BaseDoc):
+    url: ImageUrl
+    tensor_image: AnyTensor = Field(space='cosine_sim', dim=64)
+
+
+class VideoDoc(BaseDoc):
+    url: VideoUrl
+    images: DocList[ImageDoc]
+    tensor_video: AnyTensor = Field(space='cosine_sim', dim=128)
+
+
+class MyDoc(BaseDoc):
+    docs: DocList[VideoDoc]
+    tensor: AnyTensor = Field(space='cosine_sim', dim=256)
+
+
+# create a Document Index
+doc_index = InMemoryExactNNIndex[MyDoc]()
+
+# create some data
+index_docs = [
+    MyDoc(
+        docs=DocList[VideoDoc](
+            [
+                VideoDoc(
+                    url=f'http://example.ai/videos/{i}-{j}',
+                    images=DocList[ImageDoc](
+                        [
+                            ImageDoc(
+                                url=f'http://example.ai/images/{i}-{j}-{k}',
+                                tensor_image=np.ones(64),
+                            )
+                            for k in range(10)
+                        ]
+                    ),
+                    tensor_video=np.ones(128),
+                )
+                for j in range(10)
+            ]
+        ),
+        tensor=np.ones(256),
+    )
+    for i in range(10)
+]
+
+# index the documents
+doc_index.index(index_docs)
+```
+
+### Search
+
+You can perform search on any level by using the `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on.
+ +```python +# find by the `VideoDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(128), subindex='docs', search_field='tensor_video', limit=3 +) + +# find by the `ImageDoc` tensor +root_docs, sub_docs, scores = doc_index.find_subindex( + np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3 +) +``` \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 1df96540849..457bb0d15ae 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -105,6 +105,7 @@ nav: - user_guide/storing/index_qdrant.md - user_guide/storing/index_redis.md - user_guide/storing/index_milvus.md + - user_guide/storing/nested_data.md - DocStore - Bulk storage: - user_guide/storing/doc_store/store_file.md - user_guide/storing/doc_store/store_jac.md From 41c73079154aaadaec193ffed8c28976b6ced622 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 31 Jul 2023 16:28:20 +0200 Subject: [PATCH 19/23] docs: apply suggestions vol2 Signed-off-by: jupyterjazz --- docs/user_guide/storing/docindex.md | 2 +- docs/user_guide/storing/first_step.md | 2 +- docs/user_guide/storing/index_elastic.md | 28 ++++++------ docs/user_guide/storing/index_hnswlib.md | 42 +++++++++--------- docs/user_guide/storing/index_in_memory.md | 51 +++++++++++----------- docs/user_guide/storing/index_milvus.md | 41 ++++++++--------- docs/user_guide/storing/index_qdrant.md | 40 ++++++++--------- docs/user_guide/storing/index_redis.md | 34 +++++++-------- docs/user_guide/storing/index_weaviate.md | 26 +++++------ docs/user_guide/storing/nested_data.md | 2 +- 10 files changed, 135 insertions(+), 133 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index 54b7ede532f..efc546a45c6 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -1,6 +1,6 @@ # Introduction -A Document Index lets you store your Documents and search through them using vector similarity. 
+A Document Index lets you store your documents and search through them using vector similarity. This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to some query that you provide. diff --git a/docs/user_guide/storing/first_step.md b/docs/user_guide/storing/first_step.md index 0da34c4516e..836f12646d1 100644 --- a/docs/user_guide/storing/first_step.md +++ b/docs/user_guide/storing/first_step.md @@ -25,7 +25,7 @@ This section covers the following three topics: ## Document Index -A Document Index lets you store your Documents and search through them using vector similarity. +A Document Index lets you store your documents and search through them using vector similarity. This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to a query that you provide. diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index 60332a39020..cd153a09051 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -34,7 +34,7 @@ The following example is based on [ElasticDocIndex][docarray.index.backends.elas but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex]. -## Basic Usage +## Basic usage ```python from docarray import BaseDoc, DocList @@ -123,16 +123,16 @@ class SimpleDoc(BaseDoc): doc_index = ElasticDocIndex[SimpleDoc](hosts='http://localhost:9200') ``` -### Using a predefined Document as schema +### Using a predefined document as schema -DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. 
If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. -The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` field. But this is crucial information for any vector database to work properly! -You can work around this problem by subclassing the predefined Document and adding the dimensionality information: +You can work around this problem by subclassing the predefined document and adding the dimensionality information: === "Using type hint" ```python @@ -197,12 +197,12 @@ doc_index.index(index_docs) print(f'number of docs in the index: {doc_index.num_docs()}') ``` -## Vector Search +## Vector search The `.find()` method is used to find the nearest neighbors of a vector. You need to specify the `search_field` that is used when performing the vector search. -This is the field that serves as the basis of comparison between your query and indexed Documents. +This is the field that serves as the basis of comparison between your query and indexed documents. You can use the `limit` argument to configure how many documents to return. @@ -324,7 +324,7 @@ docs = doc_index.filter(query) ``` -## Text Search +## Text search In addition to vector similarity search, the Document Index interface offers methods for text search: [text_search()][docarray.index.abstract.BaseDocIndex.text_search], @@ -351,7 +351,7 @@ docs, scores = doc_index.text_search(query, search_field='text') ``` -## Hybrid Search +## Hybrid search Document Index supports atomic operations for vector similarity search, text search and filter search. 
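As a rough mental model of what such a combined query computes, here is a backend-agnostic, pure-Python sketch with toy data (a hand-rolled `cosine` helper, not the Elasticsearch query DSL): filter first, then rank the remaining candidates by vector similarity. Real backends push both steps into the database.

```python
import math

# Toy corpus: each "document" has a filterable field and an embedding.
docs = [
    {'id': 'a', 'price': 2, 'embedding': [1.0, 0.0]},
    {'id': 'b', 'price': 9, 'embedding': [1.0, 0.0]},
    {'id': 'c', 'price': 1, 'embedding': [0.0, 1.0]},
]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


query = [1.0, 0.0]
candidates = [d for d in docs if d['price'] <= 3]  # filter step
ranked = sorted(  # vector-ranking step
    candidates, key=lambda d: cosine(d['embedding'], query), reverse=True
)
print([d['id'] for d in ranked])  # cheap documents, most similar first
```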
@@ -389,7 +389,7 @@ You can also manually build a valid ES query and directly pass it to the `execut ## Access documents -To access a document, you need to specify its `id`. You can also pass a list of `id` to access multiple documents. +To access a document, you need to specify its `id`. You can also pass a list of `id`s to access multiple documents. ```python # access a single Doc @@ -422,8 +422,8 @@ The following configs can be set in `DBConfig`: | Name | Description | Default | |-------------------|----------------------------------------------------------------------------------------------------------------------------------------|-------------------------| | `hosts` | Hostname of the Elasticsearch server | `http://localhost:9200` | -| `es_config` | Other ES [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/python-api/8.6/config.html) in a Dict and pass to `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | None | -| `index_name` | Elasticsearch index name, the name of Elasticsearch index object | None. Data will be stored in an index named after the Document type used as schema. | +| `es_config` | Other ES [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/python-api/8.6/config.html) in a Dict and pass to `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | `None` | +| `index_name` | Elasticsearch index name, the name of Elasticsearch index object | `None`. Data will be stored in an index named after the Document type used as schema. | | `index_settings` | Other [index settings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/index-modules.html#index-modules-settings) in a Dict for creating the index | dict | | `index_mappings` | Other [index mappings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/mapping.html) in a Dict for creating the index | dict | | `default_column_config` | The default configurations for every column type. 
| dict | @@ -473,10 +473,10 @@ print(f'number of docs in the persisted index: {doc_index2.num_docs()}') ``` -## Nested Data and Subindex Search +## Nested data and subindex search The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document contains a `DocList` of other documents. -Go to [Nested Data](nested_data.md) section to learn more. \ No newline at end of file +Go to the [Nested Data](nested_data.md) section to learn more. \ No newline at end of file diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index f82c39b4166..b1b549d43ff 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -23,7 +23,7 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s - [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] - [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] -## Basic Usage +## Basic usage ```python from docarray import BaseDoc, DocList @@ -83,16 +83,16 @@ the database will store vectors with 128 dimensions. for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! -### Using a predefined Document as schema +### Using a predefined document as schema -DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. 
-The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` field. But this is crucial information for any vector database to work properly! -You can work around this problem by subclassing the predefined Document and adding the dimensionality information: +You can work around this problem by subclassing the predefined document and adding the dimensionality information: === "Using type hint" ```python @@ -178,10 +178,10 @@ This means that they share the same schema, and in general, both the Document In - A and B have the same field names and field types - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A - In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. + In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index. -## Vector Search +## Vector search Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
@@ -194,7 +194,7 @@ to find similar documents within the Document Index:
     # create a query Document
     query = MyDoc(embedding=np.random.rand(128), text='query')
 
-    # find similar Documents
+    # find similar documents
     matches, scores = db.find(query, search_field='embedding', limit=5)
 
     print(f'{matches=}')
@@ -208,7 +208,7 @@ to find similar documents within the Document Index:
     # create a query vector
     query = np.random.rand(128)
 
-    # find similar Documents
+    # find similar documents
     matches, scores = db.find(query, search_field='embedding', limit=5)
 
     print(f'{matches=}')
@@ -216,7 +216,7 @@ to find similar documents within the Document Index:
     print(f'{scores=}')
 ```
 
-To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
 basis of comparison between your query and the documents in the Document Index.
 
 In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
@@ -298,19 +298,19 @@ for doc in cheap_books:
 
 
 
-## Text Search
-
-In addition to vector similarity search, the Document Index interface offers methods for text search:
-[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
-as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
+## Text search
 
 !!! note
     The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not support text search.
 
     To see how to perform text search, you can check out other backends that offer support.
 
+In addition to vector similarity search, the Document Index interface offers methods for text search:
+[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
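As a rough mental model of what these methods return on backends that do support text search, here is a toy pure-Python sketch — naive term-overlap scoring over invented example strings, not any real backend's relevance model (real engines use proper scoring such as BM25):

```python
# Toy sketch of text search semantics: return (documents, scores),
# ranked by a naive term-overlap score. Illustration only.
docs = [
    'token now available for sale',
    'announcing the new token sale',
    'completely unrelated text',
]


def text_search(query, corpus, limit=2):
    """Rank `corpus` by how many query terms each document contains."""
    q_terms = set(query.lower().split())
    scored = [(doc, len(q_terms & set(doc.lower().split()))) for doc in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:limit]
    return [d for d, _ in top], [s for _, s in top]


matches, scores = text_search('token sale', docs)
print(matches, scores)
```

Like the real `text_search()`, the sketch hands back the matching documents together with their scores, best match first.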
+ -## Hybrid Search +## Hybrid search Document Index supports atomic operations for vector similarity search, text search and filter search. @@ -349,7 +349,7 @@ The kinds of atomic queries that can be combined in this way depends on the back Some backends can combine text search and vector search, while others can perform filters and vectors search, etc. -## Access Documents +## Access documents To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. @@ -371,7 +371,7 @@ docs = db[ids] # get by list of ids ``` -## Delete Documents +## Delete documents In the same way you can access Documents by `id`, you can also delete them: @@ -390,7 +390,7 @@ del db[ids[0]] # del by single id del db[ids[1:]] # del by list of ids ``` -## Update Documents +## Update documents In order to update a Document inside the index, you only need to re-index it with the updated attributes. First, let's create a schema for our Document Index: @@ -572,10 +572,10 @@ If the location already contains data from a previous session, it will be access -## Nested Data and Subindex Search +## Nested data and subindex search The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document contains a `DocList` of other documents. -Go to [Nested Data](nested_data.md) section to learn more. +Go to the [Nested Data](nested_data.md) section to learn more. diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index 15ad51d7de9..17c9988d0c6 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -1,7 +1,7 @@ # In-Memory Document Index -[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all Documents in memory using DocLists. 
+[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all documents in memory using DocLists. It is a great starting point for small datasets, where you may not want to launch a database server. For vector search and filtering the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] @@ -80,16 +80,16 @@ the database will store vectors with 128 dimensions. for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! -### Using a predefined Document as schema +### Using a predefined document as schema -DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. -The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` field. But this is crucial information for any vector database to work properly! 
-You can work around this problem by subclassing the predefined Document and adding the dimensionality information:
+You can work around this problem by subclassing the predefined document and adding the dimensionality information:
 
 === "Using type hint"
     ```python
@@ -174,10 +174,10 @@ This means that they share the same schema, and in general, both the Document In
     - A and B have the same field names and field types
     - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A
 
-    In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index.
+    In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index.
 
 
-## Vector Search
+## Vector search
 
 Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.
 
@@ -191,7 +191,7 @@ to find similar documents within the Document Index:
     # create a query Document
     query = MyDoc(embedding=np.random.rand(128), text='query')
 
-    # find similar Documents
+    # find similar documents
     matches, scores = db.find(query, search_field='embedding', limit=5)
 
     print(f'{matches=}')
@@ -205,7 +205,7 @@ to find similar documents within the Document Index:
     # create a query vector
     query = np.random.rand(128)
 
-    # find similar Documents
+    # find similar documents
     matches, scores = db.find(query, search_field='embedding', limit=5)
 
     print(f'{matches=}')
@@ -213,7 +213,7 @@ to find similar documents within the Document Index:
     print(f'{scores=}')
 ```
 
-To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
 basis of comparison between your query and the documents in the Document Index.
In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. @@ -231,15 +231,15 @@ How these scores are calculated depends on the backend, and can usually be [conf You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. -=== "Search by Documents" +=== "Search by documents" ```python - # create some query Documents + # create some query documents queries = DocList[MyDoc]( MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) ) - # find similar Documents + # find similar documents matches, scores = db.find_batched(queries, search_field='embedding', limit=5) print(f'{matches=}') @@ -253,7 +253,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ # create some query vectors query = np.random.rand(3, 128) - # find similar Documents + # find similar documents matches, scores = db.find_batched(query, search_field='embedding', limit=5) print(f'{matches=}') @@ -295,19 +295,20 @@ for doc in cheap_books: doc.summary() ``` -## Text Search - -In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +## Text search !!! note The [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] implementation does not support text search. To see how to perform text search, you can check out other backends that offer support. +In addition to vector similarity search, the Document Index interface offers methods for text search: +[text_search()][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. 
+ + -## Hybrid Search +## Hybrid search Document Index supports atomic operations for vector similarity search, text search and filter search. @@ -345,7 +346,7 @@ The kinds of atomic queries that can be combined in this way depends on the back Some backends can combine text search and vector search, while others can perform filters and vectors search, etc. -## Access Documents +## Access documents To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. @@ -367,7 +368,7 @@ docs = db[ids] # get by list of ids ``` -## Delete Documents +## Delete documents In the same way you can access Documents by `id`, you can also delete them: @@ -386,7 +387,7 @@ del db[ids[0]] # del by single id del db[ids[1:]] # del by list of ids ``` -## Update Documents +## Update documents In order to update a Document inside the index, you only need to re-index it with the updated attributes. First, let's create a schema for our Document Index: @@ -502,10 +503,10 @@ new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin') ``` -## Nested Data and Subindex Search +## Nested data and subindex search The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document contains a `DocList` of other documents. -Go to [Nested Data](nested_data.md) section to learn more. \ No newline at end of file +Go to the [Nested Data](nested_data.md) section to learn more. 
\ No newline at end of file diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index 55bfe80563d..995dccb2734 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -11,7 +11,7 @@ This is the user guide for the [MilvusDocumentIndex][docarray.index.backends.mil focusing on special features and configurations of Milvus. -## Basic Usage +## Basic usage !!! note "Single Search Field Requirement" In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. This is due to Milvus' configuration, which permits a single vector for each data object. @@ -88,16 +88,16 @@ the database will store vectors with 128 dimensions. Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! -### Using a predefined Document as schema +### Using a predefined document as schema -DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` field. But this is crucial information for any vector database to work properly! 
-You can work around this problem by subclassing the predefined Document and adding the dimensionality information: +You can work around this problem by subclassing the predefined document and adding the dimensionality information: === "Using type hint" ```python @@ -173,12 +173,12 @@ This means that they share the same schema, and in general, both the Document In In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. -## Vector Search +## Vector search Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. -By using a document of type `MyDoc`, [`find()`][docarray.index.abstract.BaseDocIndex.find], you can find -similar Documents in the Document Index: +You can perform a similarity search and find relevant documents by passing `MyDoc` or a raw vector to +the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: === "Search by Document" @@ -247,19 +247,20 @@ for doc in cheap_books: doc.summary() ``` -## Text Search - -In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +## Text search !!! note The [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] implementation does not support text search. To see how to perform text search, you can check out other backends that offer support. +In addition to vector similarity search, the Document Index interface offers methods for text search: +[text_search()][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. 
+ + -## Hybrid Search +## Hybrid search Document Index supports atomic operations for vector similarity search, text search and filter search. @@ -296,7 +297,7 @@ The kinds of atomic queries that can be combined in this way depends on the back Some backends can combine text search and vector search, while others can perform filters and vectors search, etc. -## Access Documents +## Access documents To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. @@ -318,7 +319,7 @@ docs = doc_index[ids] # get by list of ids ``` -## Delete Documents +## Delete documents In the same way you can access Documents by `id`, you can also delete them: @@ -350,9 +351,9 @@ The following configs can be set in `DBConfig`: |-------------------------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------| | `host` | The host address for the Milvus server. | `localhost` | | `port` | The port number for the Milvus server | 19530 | -| `index_name` | The name of the index in the Milvus database | None. Data will be stored in an index named after the Document type used as schema. | -| `user` | The username for the Milvus server | None | -| `password` | The password for the Milvus server | None | +| `index_name` | The name of the index in the Milvus database | `None`. Data will be stored in an index named after the Document type used as schema. | +| `user` | The username for the Milvus server | `None` | +| `password` | The password for the Milvus server | `None` | | `token` | Token for secure connection | '' | | `collection_description` | Description of the collection in the database | '' | | `default_column_config` | The default configurations for every column type. 
| dict | @@ -387,10 +388,10 @@ doc_index.configure(MilvusDocumentIndex.RuntimeConfig(batch_size=128)) You can pass the above as keyword arguments to the `configure()` method or pass an entire configuration object. -## Nested Data and Subindex Search +## Nested data and subindex search The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document contains a `DocList` of other documents. -Go to [Nested Data](nested_data.md) section to learn more. \ No newline at end of file +Go to the [Nested Data](nested_data.md) section to learn more. \ No newline at end of file diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index f0835615f20..01abad2a54b 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -11,7 +11,7 @@ The following is a starter script for using the [QdrantDocumentIndex][docarray.i based on the [Qdrant](https://qdrant.tech/) vector search engine. -## Basic Usage +## Basic usage ```python from docarray import BaseDoc, DocList from docarray.index import QdrantDocumentIndex @@ -106,16 +106,16 @@ the database will store vectors with 128 dimensions. Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! -### Using a predefined Document as schema +### Using a predefined document as schema -DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. 
If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. -The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` field. But this is crucial information for any vector database to work properly! -You can work around this problem by subclassing the predefined Document and adding the dimensionality information: +You can work around this problem by subclassing the predefined document and adding the dimensionality information: === "Using type hint" ```python @@ -184,7 +184,7 @@ docs = DocList[MyDoc]( doc_index.index(docs) ``` -That call to `index()` stores all Documents in `docs` in the Document Index, +That call to `index()` stores all documents in `docs` in the Document Index, ready to be retrieved in the next step. As you can see, `DocList[MyDoc]` and `QdrantDocumentIndex[MyDoc]` both have `MyDoc` as a parameter. @@ -201,15 +201,15 @@ This means that they share the same schema, and in general, both the Document In - A and B have the same field names and field types - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A - In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. + In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index. -## Vector Search +## Vector search Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
-By using a document of type `MyDoc`, [`find()`][docarray.index.abstract.BaseDocIndex.find], you can find
-similar Documents in the Document Index:
+You can perform a similarity search and find relevant documents by passing `MyDoc` or a raw vector to
+the [`find()`][docarray.index.abstract.BaseDocIndex.find] method:
 
 === "Search by Document"
 
@@ -217,7 +217,7 @@ similar Documents in the Document Index:
         # create a query Document
         query = MyDoc(embedding=np.random.rand(128), text='query')
 
-        # find similar Documents
+        # find similar documents
         matches, scores = doc_index.find(query, search_field='embedding', limit=5)
 
         print(f'{matches=}')
@@ -231,7 +231,7 @@ similar Documents in the Document Index:
         # create a query vector
         query = np.random.rand(128)
 
-        # find similar Documents
+        # find similar documents
         matches, scores = doc_index.find(query, search_field='embedding', limit=5)
 
         print(f'{matches=}')
@@ -239,7 +239,7 @@ similar Documents in the Document Index:
         print(f'{scores=}')
     ```
 
-To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
 basis of comparison between your query and the documents in the Document Index.
 
 In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
@@ -287,7 +287,7 @@ for doc in cheap_books:
     doc.summary()
 ```
 
-## Text Search
+## Text search
 
 In addition to vector similarity search, the Document Index interface offers methods for text search:
 [text_search()][docarray.index.abstract.BaseDocIndex.text_search],
@@ -314,7 +314,7 @@ docs, scores = doc_index.text_search(query, search_field='text')
 ```
 
 
-## Hybrid Search
+## Hybrid search
 
 Document Index supports atomic operations for vector similarity search, text search and filter search.
 
@@ -365,7 +365,7 @@ docs = doc_index.execute_query(query) ## Access documents -To access a document, you need to specify its `id`. You can also pass a list of `id` to access multiple documents. +To access a document, you need to specify its `id`. You can also pass a list of `id`s to access multiple documents. ```python # access a single Doc @@ -388,7 +388,7 @@ del doc_index[index_docs[16].id] del doc_index[index_docs[17].id, index_docs[18].id] ``` -## Update elements +## Update documents In order to update a Document inside the index, you only need to re-index it with the updated attributes. First, let's create a schema for our Document Index: @@ -454,15 +454,15 @@ runtime_config = QdrantDocumentIndex.RuntimeConfig() print(runtime_config) # shows default values ``` -Note that the collection_name from the DBConfig is an Optional[str] with None as default value. This is because +Note that the collection_name from the DBConfig is an Optional[str] with `None` as default value. This is because the QdrantDocumentIndex will take the name the Document type that you use as schema. For example, for QdrantDocumentIndex[MyDoc](...) the data will be stored in a collection name MyDoc if no specific collection_name is passed in the DBConfig. -## Nested Data and Subindex Search +## Nested data and subindex search The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document contains a `DocList` of other documents. -Go to [Nested Data](nested_data.md) section to learn more. +Go to the [Nested Data](nested_data.md) section to learn more. 
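The `find()` examples above return the closest matches together with their similarity scores. As a mental model only — the real backends use optimized, often approximate, vector indexes rather than this brute-force loop, and `cosine`/`top_k` are illustrative names, not DocArray APIs — exact top-k search over an embedding field can be sketched in plain Python:

```python
import math


def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query, docs, k):
    # score every document's 'embedding' field against the query, best first
    ranked = sorted(docs, key=lambda d: cosine(query, d["embedding"]), reverse=True)
    return ranked[:k]


docs = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [1.0, 1.0]},
]
print([d["id"] for d in top_k([1.0, 0.0], docs, k=2)])  # ['a', 'c']
```

This is also why a `search_field` must be specified: it is the field the backend scores against the query.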
diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md index 04e386d4ff0..4df40ff5f74 100644 --- a/docs/user_guide/storing/index_redis.md +++ b/docs/user_guide/storing/index_redis.md @@ -11,7 +11,7 @@ This is the user guide for the [RedisDocumentIndex][docarray.index.backends.redi focusing on special features and configurations of Redis. -## Basic Usage +## Basic usage ```python from docarray import BaseDoc, DocList from docarray.index import RedisDocumentIndex @@ -78,7 +78,7 @@ the database will store vectors with 128 dimensions. for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! -### Using a predefined Document as schema +### Using a predefined document as schema DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: @@ -87,7 +87,7 @@ Depending on the backend, an exception will be raised, or no vector index for AN The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` field. But this is crucial information for any vector database to work properly! -You can work around this problem by subclassing the predefined Document and adding the dimensionality information: +You can work around this problem by subclassing the predefined document and adding the dimensionality information: === "Using type hint" ```python @@ -175,12 +175,12 @@ This means that they share the same schema, and in general, both the Document In In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. 
-## Vector Search
+## Vector search
 
 Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.
 
-By using a document of type `MyDoc`, [`find()`][docarray.index.abstract.BaseDocIndex.find], you can find
-similar Documents in the Document Index:
+You can perform a similarity search and find relevant documents by passing `MyDoc` or a raw vector to
+the [`find()`][docarray.index.abstract.BaseDocIndex.find] method:
 
 === "Search by Document"
 
@@ -210,7 +210,7 @@ similar Documents in the Document Index:
         print(f'{scores=}')
     ```
 
-To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
 basis of comparison between your query and the documents in the Document Index.
 
 In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
@@ -256,7 +256,7 @@ for doc in cheap_books:
     doc.summary()
 ```
 
-## Text Search
+## Text search
 
 In addition to vector similarity search, the Document Index interface offers methods for text search:
 [text_search()][docarray.index.abstract.BaseDocIndex.text_search],
@@ -282,7 +282,7 @@ query = 'finance'
 docs, scores = doc_index.text_search(query, search_field='text')
 ```
 
-## Hybrid Search
+## Hybrid search
 
 Document Index supports atomic operations for vector similarity search, text search and filter search.
 
@@ -319,7 +319,7 @@ The kinds of atomic queries that can be combined in this way depends on the back
 Some backends can combine text search and vector search, while others can perform filters and vectors search, etc.
 
 
-## Access Documents
+## Access documents
 
 To retrieve a document from a Document Index you don't necessarily need to perform a fancy search.
 
@@ -341,7 +341,7 @@ docs = db[ids] # get by list of ids ``` -## Delete Documents +## Delete documents In the same way you can access Documents by `id`, you can also delete them: @@ -360,7 +360,7 @@ del db[ids[0]] # del by single id del db[ids[1:]] # del by list of ids ``` -## Update elements +## Update documents In order to update a Document inside the index, you only need to re-index it with the updated attributes. First, let's create a schema for our Document Index: @@ -422,9 +422,9 @@ The following configs can be set in `DBConfig`: |-------------------------|----------------------------------------------------|-------------------------------------------------------------------------------------| | `host` | The host address for the Redis server. | `localhost` | | `port` | The port number for the Redis server | 6379 | -| `index_name` | The name of the index in the Redis database | None. Data will be stored in an index named after the Document type used as schema. | -| `username` | The username for the Redis server | None | -| `password` | The password for the Redis server | None | +| `index_name` | The name of the index in the Redis database | `None`. Data will be stored in an index named after the Document type used as schema. | +| `username` | The username for the Redis server | `None` | +| `password` | The password for the Redis server | `None` | | `text_scorer` | The method for [scoring text](https://redis.io/docs/interact/search-and-query/advanced-concepts/scoring/) during text search | `BM25` | | `default_column_config` | The default configurations for every column type. | dict | @@ -459,10 +459,10 @@ You can pass the above as keyword arguments to the `configure()` method or pass -## Nested Data and Subindex Search +## Nested data and subindex search The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. 
However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document contains a `DocList` of other documents. -Go to [Nested Data](nested_data.md) section to learn more. \ No newline at end of file +Go to the [Nested Data](nested_data.md) section to learn more. \ No newline at end of file diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index a75263a7590..d65341e2878 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -11,7 +11,7 @@ This is the user guide for the [WeaviateDocumentIndex][docarray.index.backends.w focusing on special features and configurations of Weaviate. -## Basic Usage +## Basic usage !!! note "Single Search Field Requirement" In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. This is due to Weaviate's configuration, which permits a single vector for each data object. @@ -250,7 +250,7 @@ store.index(docs) - This will however mean that the document will not be vectorized and cannot be searched using vector search. -## Vector Search +## Vector search To perform a vector similarity search, follow the below syntax. @@ -321,7 +321,7 @@ Note that running a raw GraphQL query will return Weaviate-type responses, rathe You can find the documentation for [Weaviate's GraphQL API here](https://weaviate.io/developers/weaviate/api/graphql). -## Access Documents +## Access documents To retrieve a document from a Document Index you don't necessarily need to perform a fancy search. 
@@ -337,15 +337,15 @@ data = DocList[MyDoc]( ids = data.id doc_index.index(data) -# access the Documents by id +# access the documents by id doc = doc_index[ids[0]] # get by single id docs = doc_index[ids] # get by list of ids ``` -## Delete Documents +## Delete documents -In the same way you can access Documents by `id`, you can also delete them: +In the same way you can access documents by `id`, you can also delete them: ```python # prepare some data @@ -357,7 +357,7 @@ data = DocList[MyDoc]( ids = data.id doc_index.index(data) -# access the Documents by id +# access the documents by id del doc_index[ids[0]] # del by single id del doc_index[ids[1:]] # del by list of ids ``` @@ -389,13 +389,13 @@ Additionally, you can specify the below settings when you instantiate a configur | **Category: General** | | host | str | Weaviate instance url | http://localhost:8080 | | **Category: Authentication** | -| username | str | Username known to the specified authentication provider (e.g. WCS) | None | `jp@weaviate.io` | -| password | str | Corresponding password | None | `p@ssw0rd` | -| auth_api_key | str | API key known to the Weaviate instance | None | `mys3cretk3y` | +| username | str | Username known to the specified authentication provider (e.g. WCS) | `None` | `jp@weaviate.io` | +| password | str | Corresponding password | `None` | `p@ssw0rd` | +| auth_api_key | str | API key known to the Weaviate instance | `None` | `mys3cretk3y` | | **Category: Data schema** | | index_name | str | Class name to use to store the document| The document class name, e.g. 
`MyDoc` for `WeaviateDocumentIndex[MyDoc]` | `Document` | | **Category: Embedded Weaviate** | -| embedded_options| EmbeddedOptions | Options for embedded weaviate | None | +| embedded_options| EmbeddedOptions | Options for embedded weaviate | `None` | The type `EmbeddedOptions` can be specified as described [here](https://weaviate.io/developers/weaviate/installation/embedded#embedded-options) @@ -456,10 +456,10 @@ class StringDoc(BaseDoc): A list of available Weaviate data types [is here](https://weaviate.io/developers/weaviate/config-refs/datatypes). -## Nested Data and Subindex Search +## Nested data and subindex search The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`. However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document contains a `DocList` of other documents. -Go to [Nested Data](nested_data.md) section to learn more. \ No newline at end of file +Go to the [Nested Data](nested_data.md) section to learn more. \ No newline at end of file diff --git a/docs/user_guide/storing/nested_data.md b/docs/user_guide/storing/nested_data.md index a45275c75ac..aae9edfbcc0 100644 --- a/docs/user_guide/storing/nested_data.md +++ b/docs/user_guide/storing/nested_data.md @@ -1,4 +1,4 @@ -# Nested data +# Nested Data Most of the examples you've seen operate on a simple schema: each field corresponds to a "basic" type, such as `str` or `NdArray`. 
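The filter examples in the pages above use MongoDB-style operators such as `$lt` and `$gte`. Each backend translates these into its own query language; the following toy evaluator (hypothetical helper name `matches_filter`, not DocArray code) only illustrates the intended semantics of such a query dict:

```python
def matches_filter(doc, query):
    # evaluate a tiny subset of MongoDB-style operators against a flat dict
    ops = {
        "$lt": lambda value, bound: value < bound,
        "$lte": lambda value, bound: value <= bound,
        "$gt": lambda value, bound: value > bound,
        "$gte": lambda value, bound: value >= bound,
        "$eq": lambda value, bound: value == bound,
    }
    return all(
        ops[op](doc[field], bound)
        for field, condition in query.items()
        for op, bound in condition.items()
    )


books = [{"title": f"title {i}", "price": i} for i in range(10)]
cheap = [b for b in books if matches_filter(b, {"price": {"$lt": 5}})]
print(len(cheap))  # 5
```

A real Document Index applies the same logic server-side, so only matching documents are transferred back.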
From a32a1e5aa4218eab8619ebbd864244075f124026 Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Mon, 31 Jul 2023 16:42:15 +0200 Subject: [PATCH 20/23] fix: nested data imports Signed-off-by: jupyterjazz --- docs/user_guide/storing/nested_data.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/user_guide/storing/nested_data.md b/docs/user_guide/storing/nested_data.md index aae9edfbcc0..d2928193995 100644 --- a/docs/user_guide/storing/nested_data.md +++ b/docs/user_guide/storing/nested_data.md @@ -14,8 +14,12 @@ In the following example you can see a complex schema that contains nested docum The `YouTubeVideoDoc` contains a `VideoDoc` and an `ImageDoc`, alongside some "basic" fields: ```python -from docarray.typing import ImageUrl, VideoUrl, AnyTensor +import numpy as np +from pydantic import Field +from docarray import BaseDoc, DocList +from docarray.index import InMemoryExactNNIndex +from docarray.typing import AnyTensor, ImageUrl, VideoUrl # define a nested schema class ImageDoc(BaseDoc): From ef0b7ef869cf332b22f27f18f5da51e837ebedfb Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Tue, 1 Aug 2023 11:46:59 +0200 Subject: [PATCH 21/23] docs: apply johannes suggestions Signed-off-by: jupyterjazz --- docs/user_guide/storing/docindex.md | 27 +- docs/user_guide/storing/index_elastic.md | 117 ++++++++- docs/user_guide/storing/index_hnswlib.md | 3 + docs/user_guide/storing/index_in_memory.md | 3 + docs/user_guide/storing/index_milvus.md | 39 +++ docs/user_guide/storing/index_qdrant.md | 40 +++ docs/user_guide/storing/index_redis.md | 39 +++ docs/user_guide/storing/index_weaviate.md | 284 +++++++++++++++++++-- 8 files changed, 511 insertions(+), 41 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index efc546a45c6..9b38dd5f07d 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -58,19 +58,20 @@ This doesn't require a database server - rather, it saves 
your data locally. For a deeper understanding, please look into its [documentation](index_in_memory.md). ### Define document schema and create data +The following code snippet defines a document schema using the `BaseDoc` class. Each document consists of a title (a string), +a price (an integer), and an embedding (a 128-dimensional array). It also creates a list of ten documents with dummy titles, +prices ranging from 0 to 9, and randomly generated embeddings. ```python from docarray import BaseDoc, DocList from docarray.index import InMemoryExactNNIndex from docarray.typing import NdArray import numpy as np -# Define the document schema. class MyDoc(BaseDoc): title: str price: int embedding: NdArray[128] -# Create documents (using dummy/random vectors) docs = DocList[MyDoc]( MyDoc(title=f"title #{i}", price=i, embedding=np.random.rand(128)) for i in range(10) @@ -78,29 +79,31 @@ docs = DocList[MyDoc]( ``` ### Initialize the Document Index and add data +Here we initialize an `InMemoryExactNNIndex` instance with the document schema defined previously, and add the created documents to this index. ```python -# Initialize a new InMemoryExactNNIndex instance and add the documents to the index. doc_index = InMemoryExactNNIndex[MyDoc]() doc_index.index(docs) ``` ### Perform a vector similarity search +Now, let's perform a similarity search on the document embeddings using a query vector of ones. +As a result, we'll retrieve the top 10 most similar documents and their corresponding similarity scores. ```python -# Perform a vector search. query = np.ones(128) retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) ``` ### Filter documents +In this segment, we filter the indexed documents based on their price field, specifically retrieving documents with a price less than 5. 
```python -# Perform filtering (price < 5) query = {'price': {'$lt': 5}} filtered_docs = doc_index.filter(query, limit=10) ``` ### Combine different search methods +The final snippet combines the vector similarity search and filtering operations into a single query. +We first perform a similarity search on the document embeddings and then apply a filter to return only those documents with a price greater than or equal to 2. ```python -# Perform a hybrid search - combining vector search with filtering query = ( doc_index.build_query() # get empty query object .find(query=np.ones(128), search_field='embedding') # add vector similarity search @@ -109,3 +112,15 @@ query = ( ) retrieved_docs, scores = doc_index.execute_query(query) ``` + +## Learn more +The code snippets presented above just scratch the surface of what a Document Index can do. +To learn more and get the most out of `DocArray`, take a look at the detailed guides for the vector database backends you're interested in: + +- [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md) +- [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md) +- [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md) +- [Redis](https://redis.com/) | [Docs](index_redis.md) +- [Milvus](https://milvus.io/) | [Docs](index_milvus.md) +- [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) +- InMemoryExactNNIndex | [Docs](index_in_memory.md) diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index cd153a09051..45dfcab82c2 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -35,6 +35,9 @@ but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.Ela ## Basic usage +This snippet demonstrates the basic usage of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex]. 
It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] to index these documents, +and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList @@ -186,23 +189,44 @@ db.index(data) ## Index -Use `.index()` to add documents into the index. +Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method. The `.num_docs()` method returns the total number of documents in the index. ```python -index_docs = [SimpleDoc(tensor=np.ones(128)) for _ in range(64)] +from docarray import DocList -doc_index.index(index_docs) +# create some random data +docs = DocList[SimpleDoc]([SimpleDoc(tensor=np.ones(128)) for _ in range(64)]) + +doc_index.index(docs) print(f'number of docs in the index: {doc_index.num_docs()}') ``` +As you can see, `DocList[SimpleDoc]` and `ElasticDocIndex[SimpleDoc]` both have `SimpleDoc` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. 
+    If _any_ of the following are true, then A and B are compatible:
+
+    - A and B are the same class
+    - A and B have the same field names and field types
+    - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A
+
+    In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index.
+
+
+
 ## Vector search
 
-The `.find()` method is used to find the nearest neighbors of a vector.
+Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.
 
-You need to specify the `search_field` that is used when performing the vector search.
-This is the field that serves as the basis of comparison between your query and indexed documents.
+You can use the [`find()`][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `SimpleDoc`
+to find similar documents within the Document Index.
 
 You can use the `limit` argument to configure how many documents to return.
 
@@ -211,14 +235,87 @@ You can use the `limit` argument to configure how many documents to return.
     This can lead to poor performance when the search involves many vectors.
     [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] does not have this limitation.
 
-```python
-query = SimpleDoc(tensor=np.ones(128))
-
-docs, scores = doc_index.find(query, limit=5, search_field='tensor')
-```
+=== "Search by Document"
+
+    ```python
+    # create a query Document
+    query = SimpleDoc(tensor=np.ones(128))
+
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='tensor', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.tensor=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(128)
+
+    # find similar documents
+    matches, scores = doc_index.find(query, search_field='tensor', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.tensor=}')
+    print(f'{scores=}')
+    ```
+
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
+basis of comparison between your query and the documents in the Document Index.
+
+In this particular example you only have one field (`tensor`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+
+### Batched search
 
 You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
 
+=== "Search by documents"
+
+    ```python
+    # create some query documents
+    queries = DocList[SimpleDoc](
+        SimpleDoc(tensor=np.random.rand(128)) for i in range(3)
+    )
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(queries, search_field='tensor', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].tensor=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(query, search_field='tensor', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].tensor=}')
+    print(f'{scores=}')
+    ```
+
+The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
+
+
 ## Filter
 
diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md
index b1b549d43ff..fc0da64d298 100644
--- a/docs/user_guide/storing/index_hnswlib.md
+++ b/docs/user_guide/storing/index_hnswlib.md
@@ -24,6 +24,9 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s
 
 - [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]
 
 ## Basic usage
+This snippet demonstrates the basic usage of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex]. It defines a document schema with a title and an embedding,
+creates ten dummy documents with random embeddings, initializes an instance of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] to index these documents,
+and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.
```python from docarray import BaseDoc, DocList diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index 17c9988d0c6..294fd988b9a 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -21,6 +21,9 @@ utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][d ## Basic usage +This snippet demonstrates the basic usage of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] to index these documents, +and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index 995dccb2734..bcca58699b5 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -12,6 +12,10 @@ focusing on special features and configurations of Milvus. ## Basic usage +This snippet demonstrates the basic usage of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] to index these documents, +and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector. + !!! note "Single Search Field Requirement" In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. This is due to Milvus' configuration, which permits a single vector for each data object. 
@@ -215,8 +219,43 @@ When searching on the subindex level, you can use the [`find_subindex()]`[docarr
 
 How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
 
+### Batched search
+
 You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
 
+=== "Search by documents"
+
+    ```python
+    # create some query documents
+    queries = DocList[MyDoc](
+        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+    )
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(queries, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(query, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
+
 
 ## Filter
 
diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md
index 01abad2a54b..7a0c0df768d 100644
--- a/docs/user_guide/storing/index_qdrant.md
+++ b/docs/user_guide/storing/index_qdrant.md
@@ -12,6 +12,10 @@ based on the [Qdrant](https://qdrant.tech/) vector search engine.
 
 ## Basic usage
 
+This snippet demonstrates the basic usage of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex].
It defines a document schema with a title and an embedding,
+creates ten dummy documents with random embeddings, initializes an instance of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] to index these documents,
+and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.
+
 ```python
 from docarray import BaseDoc, DocList
 from docarray.index import QdrantDocumentIndex
@@ -253,8 +257,44 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarr
 
 How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
 
+### Batched search
+
 You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
 
+=== "Search by documents"
+
+    ```python
+    # create some query documents
+    queries = DocList[MyDoc](
+        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
+    )
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(queries, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 128)
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(query, search_field='embedding', limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
+
+
 ## Filter
 
 You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md
index 4df40ff5f74..e511b2ef13b 100644
--- a/docs/user_guide/storing/index_redis.md
+++ b/docs/user_guide/storing/index_redis.md
@@ -12,6 +12,10 @@ focusing on special features and configurations of Redis.
 
 ## Basic usage
 
+This snippet demonstrates the basic usage of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex]. It defines a document schema with a title and an embedding,
+creates ten dummy documents with random embeddings, initializes an instance of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] to index these documents,
+and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.
+
 ```python
 from docarray import BaseDoc, DocList
 from docarray.index import RedisDocumentIndex
@@ -224,8 +228,43 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarr
 
 How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
 
+### Batched search
+
 You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
 
+=== "Search by documents" + + ```python + # create some query documents + queries = DocList[MyDoc]( + MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) + ) + + # find similar documents + matches, scores = doc_index.find_batched(queries, search_field='embedding', limit=5) + + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` + +=== "Search by raw vectors" + + ```python + # create some query vectors + query = np.random.rand(3, 128) + + # find similar documents + matches, scores = doc_index.find_batched(query, search_field='embedding', limit=5) + + print(f'{matches=}') + print(f'{matches[0].text=}') + print(f'{scores=}') + ``` + +The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. + ## Filter diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index d65341e2878..739f6e5ac5f 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -12,6 +12,10 @@ focusing on special features and configurations of Weaviate. ## Basic usage +This snippet demonstrates the basic usage of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex]. It defines a document schema with a title and an embedding, +creates ten dummy documents with random embeddings, initializes an instance of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] to index these documents, +and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector. + !!! note "Single Search Field Requirement" In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. This is due to Weaviate's configuration, which permits a single vector for each data object. 
@@ -185,6 +189,92 @@ dbconfig = WeaviateDocumentIndex.DBConfig( ) ``` +### Create an instance +Let's connect to a local Weaviate service and instantiate a `WeaviateDocumentIndex` instance: +```python +dbconfig = WeaviateDocumentIndex.DBConfig( + host="http://localhost:8080" +) +doc_index = WeaviateDocumentIndex[MyDoc](db_config=dbconfig) +``` + +### Schema definition +In this code snippet, `WeaviateDocumentIndex` takes a schema of the form of `MyDoc`. +The Document Index then _creates a column for each field in `MyDoc`_. + +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#configuration). + +Most vector databases need to know the dimensionality of the vectors that will be stored. +Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that +the database will store vectors with 128 dimensions. + +!!! note "PyTorch and TensorFlow support" + Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + +### Using a predefined document as schema + +DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
+ +You can work around this problem by subclassing the predefined document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import WeaviateDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: NdArray[128] = Field(is_embedding=True) + + + doc_index = WeaviateDocumentIndex[MyDoc]() + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import WeaviateDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(dim=128, is_embedding=True) + + + doc_index = WeaviateDocumentIndex[MyDoc]() + ``` + +Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type. + +The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) + +# you can index this into Document Index of type MyDoc +doc_index.index(data) +``` + + ## Index Putting it together, we can add data below using Weaviate as the Document Index: @@ -192,7 +282,7 @@ Putting it together, we can add data below using Weaviate as the Document Index: ```python import numpy as np from pydantic import Field -from docarray import BaseDoc +from docarray import BaseDoc, DocList from docarray.typing import NdArray from docarray.index.backends.weaviate import WeaviateDocumentIndex @@ -207,23 +297,28 @@ class Document(BaseDoc): # Make a list of 3 docs to index -docs = [ - Document( - text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1" - ), - Document( - text="Hello world, how are you?", - embedding=np.array([3, 4]), - file=np.random.rand(100), - id="2", - ), - Document( - text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut", - embedding=np.array([5, 6]), - file=np.random.rand(100), - id="3", - ), -] +docs = DocList[Document]( + [ + Document( + text="Hello world", + embedding=np.array([1, 2]), + file=np.random.rand(100), + id="1", + ), + Document( + text="Hello world, how are you?", + embedding=np.array([3, 4]), + file=np.random.rand(100), + id="2", + ), + Document( + text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut", + embedding=np.array([5, 6]), + file=np.random.rand(100), + id="3", + ), + ] +) batch_config = { "batch_size": 20, @@ -249,16 +344,116 @@ store.index(docs) - It is possible to create a schema without specifying `is_embedding` for any field. 
- This will however mean that the document will not be vectorized and cannot be searched using vector search. +As you can see, `DocList[Document]` and `WeaviateDocumentIndex[Document]` both have `Document` as a parameter. +This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas. + +!!! question "When are two schemas compatible?" + The schemas of your Document Index and data need to be compatible with each other. + + Let's say A is the schema of your Document Index and B is the schema of your data. + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: + + - A and B are the same class + - A and B have the same field names and field types + - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + + In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index. + + ## Vector search -To perform a vector similarity search, follow the below syntax. +Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method. 
-This will perform a vector similarity search for the vector [1, 2] and return the first two results:
+You can perform a similarity search and find relevant documents by passing `MyDoc` or a raw vector to
+the [`find()`][docarray.index.abstract.BaseDocIndex.find] method:
+
+=== "Search by Document"
+
+    ```python
+    # create a query Document
+    query = Document(
+        text="Hello world",
+        embedding=np.array([1, 2]),
+        file=np.random.rand(100),
+    )
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, limit=5)
+
+    print(f"{matches=}")
+    print(f"{matches.text=}")
+    print(f"{scores=}")
+    ```
+
+=== "Search by raw vector"
+
+    ```python
+    # create a query vector
+    query = np.random.rand(2)
+
+    # find similar Documents
+    matches, scores = doc_index.find(query, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches.text=}')
+    print(f'{scores=}')
+    ```
+
+In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
+which one to use for the search.
+
+The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
+matching documents and their associated similarity scores.
+
+When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.
+
+How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
+
+### Batched Search
+
+You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
+
+=== "Search by documents"
+
+    ```python
+    # create some query documents
+    queries = DocList[Document](
+        Document(
+            text=f"Hello world {i}",
+            embedding=np.array([i, i + 1]),
+            file=np.random.rand(100),
+        )
+        for i in range(3)
+    )
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(queries, limit=5)
+
+    print(f"{matches=}")
+    print(f"{matches[0].text=}")
+    print(f"{scores=}")
+    ```
+
+=== "Search by raw vectors"
+
+    ```python
+    # create some query vectors
+    query = np.random.rand(3, 2)
+
+    # find similar documents
+    matches, scores = doc_index.find_batched(query, limit=5)
+
+    print(f'{matches=}')
+    print(f'{matches[0].text=}')
+    print(f'{scores=}')
+    ```
+
+The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
 
-```python
-docs = store.find([1, 2], limit=2)
-```
 
 ## Filter
 
@@ -269,12 +464,46 @@ This will perform a filtering on the field `text`:
 
 docs = store.filter({"path": ["text"], "operator": "Equal", "valueText": "Hello world"})
 ```
 
+You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
+The query should follow the [query language of Weaviate](https://weaviate.io/developers/weaviate/search/filters).
+
+In the following example, let's filter for all the books that are cheaper than 29 dollars:
+
+```python
+from docarray import BaseDoc, DocList
+from docarray.typing import NdArray
+from pydantic import Field
+import numpy as np
+
+
+class Book(BaseDoc):
+    price: int
+    embedding: NdArray[10] = Field(is_embedding=True)
+
+
+books = DocList[Book]([Book(price=i * 10, embedding=np.random.rand(10)) for i in range(10)])
+book_index = WeaviateDocumentIndex[Book](index_name='tmp_index')
+book_index.index(books)
+
+# filter for books that are cheaper than 29 dollars
+query = {"path": ["price"], "operator": "LessThan", "valueInt": 29}
+cheap_books = book_index.filter(filter_query=query)
+
+assert len(cheap_books) == 3
+for doc in cheap_books:
+    doc.summary()
+```
+
 ## Text search
 
-To perform a text search, follow the below syntax.
+In addition to vector similarity search, the Document Index interface offers methods for text search:
+[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
+
+You can use text search directly on the field of type `str`.
 
-This will perform a text search for the word "hello" in the field "text" and return the first two results:
+The following line will perform a text search for the word "world" in the field "text" and return the first two results:
 
 ```python
 docs = store.text_search("world", search_field="text", limit=2)
@@ -283,6 +512,11 @@ docs = store.text_search("world", search_field="text", limit=2)
 
 ## Hybrid search
 
+Document Index supports atomic operations for vector similarity search, text search and filter search.
+
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
+through [build_query()][docarray.index.abstract.BaseDocIndex.build_query].
+
 To perform a hybrid search, follow the below syntax.
This will perform a hybrid search for the word "hello" and the vector [1, 2] and return the first two results: From 926816110db43fc177ffe15bfe0b4859fadec66f Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Tue, 1 Aug 2023 13:18:06 +0200 Subject: [PATCH 22/23] docs: apply suggestions Signed-off-by: jupyterjazz --- docs/user_guide/storing/docindex.md | 12 ++++++------ docs/user_guide/storing/index_elastic.md | 20 ++++++++++---------- docs/user_guide/storing/index_hnswlib.md | 22 +++++++++++----------- docs/user_guide/storing/index_in_memory.md | 18 +++++++++--------- docs/user_guide/storing/index_milvus.md | 22 +++++++++++----------- docs/user_guide/storing/index_qdrant.md | 18 +++++++++--------- docs/user_guide/storing/index_redis.md | 22 +++++++++++----------- docs/user_guide/storing/index_weaviate.md | 22 +++++++++++----------- docs/user_guide/storing/nested_data.md | 2 +- 9 files changed, 79 insertions(+), 79 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index 9b38dd5f07d..289bb8b4b18 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -79,22 +79,22 @@ docs = DocList[MyDoc]( ``` ### Initialize the Document Index and add data -Here we initialize an `InMemoryExactNNIndex` instance with the document schema defined previously, and add the created documents to this index. +Here we initialize an `InMemoryExactNNIndex` instance with the document schema we defined previously, and add the created documents to this index. ```python doc_index = InMemoryExactNNIndex[MyDoc]() doc_index.index(docs) ``` ### Perform a vector similarity search -Now, let's perform a similarity search on the document embeddings using a query vector of ones. -As a result, we'll retrieve the top 10 most similar documents and their corresponding similarity scores. +Now, let's perform a similarity search on the document embeddings. 
+As a result, we'll retrieve the ten most similar documents and their corresponding similarity scores.
 ```python
 query = np.ones(128)
 retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10)
 ```
 
 ### Filter documents
-In this segment, we filter the indexed documents based on their price field, specifically retrieving documents with a price less than 5.
+In this snippet, we filter the indexed documents based on their price field, specifically retrieving documents with a price less than 5:
 ```python
 query = {'price': {'$lt': 5}}
 filtered_docs = doc_index.filter(query, limit=10)
@@ -102,7 +102,7 @@ filtered_docs = doc_index.filter(query, limit=10)
 
 ### Combine different search methods
 The final snippet combines the vector similarity search and filtering operations into a single query.
-We first perform a similarity search on the document embeddings and then apply a filter to return only those documents with a price greater than or equal to 2.
+We first perform a similarity search on the document embeddings and then apply a filter to return only those documents with a price greater than or equal to 2:
 ```python
 query = (
     doc_index.build_query()  # get empty query object
@@ -114,7 +114,7 @@ retrieved_docs, scores = doc_index.execute_query(query)
 ```
 
 ## Learn more
-The code snippets presented above just scratch the surface of what a Document Index can do.
+The code snippets above just scratch the surface of what a Document Index can do.
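For intuition, the combined query built above (vector search plus a `{'price': {'$gte': 2}}`-style filter) amounts to three steps: filter the candidates, score the rest, truncate to the limit. A minimal plain-Python sketch of that flow (an illustration only, not `InMemoryExactNNIndex`'s actual code; the helper names are made up):

```python
def dot(a, b):
    # inner-product similarity between two vectors
    return sum(x * y for x, y in zip(a, b))


def execute_query(docs, query_vec, predicate, limit):
    """Keep docs passing `predicate`, rank them against `query_vec`, return top `limit`."""
    candidates = [d for d in docs if predicate(d)]
    ranked = sorted(candidates, key=lambda d: dot(d['embedding'], query_vec), reverse=True)
    return ranked[:limit]


# toy documents with a price and a 2-d embedding
docs = [{'price': p, 'embedding': [float(p), 1.0]} for p in range(5)]

# "find similar to [1, 0], but only where price >= 2"
hits = execute_query(docs, [1.0, 0.0], lambda d: d['price'] >= 2, limit=2)
print([d['price'] for d in hits])  # [4, 3]: highest dot products among price >= 2
```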
To learn more and get the most out of `DocArray`, take a look at the detailed guides for the vector database backends you're interested in:
 
 - [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md)
diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md
index 45dfcab82c2..684ea81558c 100644
--- a/docs/user_guide/storing/index_elastic.md
+++ b/docs/user_guide/storing/index_elastic.md
@@ -37,7 +37,7 @@ but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.Ela
 ## Basic usage
 This snippet demonstrates the basic usage of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex]. It defines a document schema with a title and an embedding,
 creates ten dummy documents with random embeddings, initializes an instance of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] to index these documents,
-and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.
+and performs a vector similarity search to retrieve the ten most similar documents to a given query vector.
 
 ```python
 from docarray import BaseDoc, DocList
@@ -238,7 +238,7 @@ You can use the `limit` argument to configure how many documents to return.
 
 === "Search by Document"
 
     ```python
-    # create a query Document
+    # create a query document
     query = SimpleDoc(tensor=np.ones(128))
 
     # find similar documents
@@ -266,7 +266,7 @@ To perform a vector search, you need to specify a `search_field`. This is
 the field that serves as the basis of comparison between your query and the documents in the Document Index.
 
-In this particular example you only have one field (`tensor`) that is a vector, so you can trivially choose that one.
+In this example you only have one field (`tensor`) that is a vector, so you can trivially choose that one.
In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. @@ -280,7 +280,7 @@ How these scores are calculated depends on the backend, and can usually be [conf ### Batched search -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. === "Search by Documents" @@ -290,7 +290,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ SimpleDoc(tensor=np.random.rand(128)) for i in range(3) ) - # find similar Documents + # find similar documents matches, scores = doc_index.find_batched(queries, search_field='tensor', limit=5) print(f'{matches=}') @@ -304,7 +304,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ # create some query vectors query = np.random.rand(3, 128) - # find similar Documents + # find similar documents matches, scores = doc_index.find_batched(query, search_field='tensor', limit=5) print(f'{matches=}') @@ -312,7 +312,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ print(f'{scores=}') ``` -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. 
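Semantically, `find_batched()` behaves like a per-query loop over single-vector search, returning one result list per query, in query order. A toy sketch of that contract (illustrative only, not the actual backend code):

```python
def dot(a, b):
    # inner-product similarity between two vectors
    return sum(x * y for x, y in zip(a, b))


def find(embeddings, query, limit):
    # rank stored vectors against one query, keep the `limit` best ids
    order = sorted(range(len(embeddings)), key=lambda i: dot(embeddings[i], query), reverse=True)
    return order[:limit]


def find_batched(embeddings, queries, limit):
    # one top-`limit` result list per query, in query order
    return [find(embeddings, q, limit) for q in queries]


embeddings = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.1], [0.1, 1.0]]
print(find_batched(embeddings, queries, limit=1))  # [[0], [1]]
```

Backends are free to vectorize or parallelize this internally; the per-query loop is just the simplest way to state the semantics.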
@@ -424,8 +424,8 @@ docs = doc_index.filter(query)
 ## Text search
 
 In addition to vector similarity search, the Document Index interface offers methods for text search:
-[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
-as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
+[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched].
 
 As in "pure" Elasticsearch, you can use text search directly on the field of type `str`:
 
@@ -453,7 +453,7 @@ docs, scores = doc_index.text_search(query, search_field='text')
 Document Index supports atomic operations for vector similarity search, text search and filter search.
 
 To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
-through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
+through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]:
 
 For example, you can build a hybrid search query that performs range filtering, vector search and text search:
 
diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md
index fc0da64d298..f0f41eb008e 100644
--- a/docs/user_guide/storing/index_hnswlib.md
+++ b/docs/user_guide/storing/index_hnswlib.md
@@ -26,7 +26,7 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s
 ## Basic usage
 This snippet demonstrates the basic usage of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex].
It defines a document schema with a title and an embedding,
 creates ten dummy documents with random embeddings, initializes an instance of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] to index these documents,
-and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.
+and performs a vector similarity search to retrieve the ten most similar documents to a given query vector.
 
 ```python
 from docarray import BaseDoc, DocList
@@ -194,7 +194,7 @@ to find similar documents within the Document Index:
 === "Search by Document"
 
     ```python
-    # create a query Document
+    # create a query document
     query = MyDoc(embedding=np.random.rand(128), text='query')
 
     # find similar documents
@@ -222,7 +222,7 @@ To perform a vector search, you need to specify a `search_field`. This is
 the field that serves as the basis of comparison between your query and the documents in the Document Index.
 
-In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
 In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
 which one to use for the search.
 
@@ -233,9 +233,9 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarr
 How these scores are calculated depends on the backend, and can usually be [configured](#configuration).
 
-### Batched Search
+### Batched search
 
-You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
+You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method.
=== "Search by Documents" @@ -245,7 +245,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3) ) - # find similar Documents + # find similar documents matches, scores = db.find_batched(queries, search_field='embedding', limit=5) print(f'{matches=}') @@ -259,7 +259,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ # create some query vectors query = np.random.rand(3, 128) - # find similar Documents + # find similar documents matches, scores = db.find_batched(query, search_field='embedding', limit=5) print(f'{matches=}') @@ -267,7 +267,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ print(f'{scores=}') ``` -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. @@ -309,8 +309,8 @@ for doc in cheap_books: To see how to perform text search, you can check out other backends that offer support. In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. ## Hybrid search @@ -318,7 +318,7 @@ as well as the batched version [text_search_batched()][docarray.index.abstract.B Document Index supports atomic operations for vector similarity search, text search and filter search. 
To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
-through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
+through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]:
 
 ```python
 # Define the document schema.
diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md
index 294fd988b9a..cf2285eb55c 100644
--- a/docs/user_guide/storing/index_in_memory.md
+++ b/docs/user_guide/storing/index_in_memory.md
@@ -23,7 +23,7 @@ utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][d
 ## Basic usage
 This snippet demonstrates the basic usage of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. It defines a document schema with a title and an embedding,
 creates ten dummy documents with random embeddings, initializes an instance of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] to index these documents,
-and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.
+and performs a vector similarity search to retrieve the ten most similar documents to a given query vector.
 
 ```python
 from docarray import BaseDoc, DocList
@@ -191,7 +191,7 @@ to find similar documents within the Document Index:
 === "Search by Document"
 
     ```python
-    # create a query Document
+    # create a query document
     query = MyDoc(embedding=np.random.rand(128), text='query')
 
     # find similar documents
@@ -219,7 +219,7 @@ To perform a vector search, you need to specify a `search_field`. This is
 the field that serves as the basis of comparison between your query and the documents in the Document Index.
 
-In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
+In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. @@ -230,9 +230,9 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarr How these scores are calculated depends on the backend, and can usually be [configured](#configuration). -### Batched Search +### Batched search -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. === "Search by documents" @@ -264,7 +264,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ print(f'{scores=}') ``` -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. @@ -306,8 +306,8 @@ for doc in cheap_books: To see how to perform text search, you can check out other backends that offer support. In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. 
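Conceptually, `text_search()` ranks documents by how well a string field matches the query, for example by counting shared terms. A deliberately naive bag-of-words sketch, for intuition only (real backends use inverted indexes and scoring functions such as BM25):

```python
def text_search(texts, query, limit):
    # score = number of query terms that occur in the document text
    terms = query.lower().split()
    scores = [sum(t in text.lower().split() for t in terms) for text in texts]
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    top = [i for i in order if scores[i] > 0][:limit]
    return top, [scores[i] for i in top]


texts = ['hello world', 'hello there', 'goodbye world']
ids, scores = text_search(texts, 'hello world', limit=2)
print(ids, scores)  # [0, 1] [2, 1]: 'hello world' matches both query terms
```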
@@ -316,7 +316,7 @@ as well as the batched version [text_search_batched()][docarray.index.abstract.B
 Document Index supports atomic operations for vector similarity search, text search and filter search.
 
 To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
-through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:
+through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]:
 
 ```python
 # Define the document schema.
diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md
index bcca58699b5..ca8cd09a518 100644
--- a/docs/user_guide/storing/index_milvus.md
+++ b/docs/user_guide/storing/index_milvus.md
@@ -14,7 +14,7 @@ focusing on special features and configurations of Milvus.
 ## Basic usage
 This snippet demonstrates the basic usage of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]. It defines a document schema with a title and an embedding,
 creates ten dummy documents with random embeddings, initializes an instance of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] to index these documents,
-and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.
+and performs a vector similarity search to retrieve the ten most similar documents to a given query vector.
 
 !!! note "Single Search Field Requirement"
     In order to utilize vector search, it's necessary to define 'is_embedding' for one field only.
@@ -187,10 +187,10 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: === "Search by Document" ```python - # create a query Document + # create a query document query = MyDoc(embedding=np.random.rand(128), title='query') - # find similar Documents + # find similar documents matches, scores = doc_index.find(query, limit=5) print(f'{matches=}') @@ -204,7 +204,7 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: # create a query vector query = np.random.rand(128) - # find similar Documents + # find similar documents matches, scores = doc_index.find(query, limit=5) print(f'{matches=}') @@ -215,13 +215,13 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest matching documents and their associated similarity scores. -When searching on the subindex level, you can use the [`find_subindex()]`[docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. +When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents. How these scores are calculated depends on the backend, and can usually be [configured](#configuration). -### Batched Search +### Batched search -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. 
=== "Search by documents" @@ -253,7 +253,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ print(f'{scores=}') ``` -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. @@ -294,8 +294,8 @@ for doc in cheap_books: To see how to perform text search, you can check out other backends that offer support. In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. @@ -304,7 +304,7 @@ as well as the batched version [text_search_batched()][docarray.index.abstract.B Document Index supports atomic operations for vector similarity search, text search and filter search. To combine these operations into a single, hybrid search query, you can use the query builder that is accessible -through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]: +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]: ```python # Define the document schema. diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index 7a0c0df768d..fd9d57a8922 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -14,7 +14,7 @@ based on the [Qdrant](https://qdrant.tech/) vector search engine. 
## Basic usage This snippet demonstrates the basic usage of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector. +and performs a vector similarity search to retrieve ten most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList @@ -218,7 +218,7 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: === "Search by Document" ```python - # create a query Document + # create a query document query = MyDoc(embedding=np.random.rand(128), text='query') # find similar documents @@ -246,7 +246,7 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: To perform a vector search, you need to specify a `search_field`. This is the field that serves as the basis of comparison between your query and the documents in the Document Index. -In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. +In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. @@ -257,9 +257,9 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarr How these scores are calculated depends on the backend, and can usually be [configured](#configuration). -### Batched Search +### Batched search -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. 
+You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. === "Search by documents" @@ -291,7 +291,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ print(f'{scores=}') ``` -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. @@ -330,8 +330,8 @@ for doc in cheap_books: ## Text search In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. You can use text search directly on the field of type `str`: @@ -359,7 +359,7 @@ docs, scores = doc_index.text_search(query, search_field='text') Document Index supports atomic operations for vector similarity search, text search and filter search. 
To combine these operations into a single, hybrid search query, you can use the query builder that is accessible -through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]: +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]: For example, you can build a hybrid search query that performs range filtering, vector search and text search: diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md index e511b2ef13b..0bccb046224 100644 --- a/docs/user_guide/storing/index_redis.md +++ b/docs/user_guide/storing/index_redis.md @@ -14,7 +14,7 @@ focusing on special features and configurations of Redis. ## Basic usage This snippet demonstrates the basic usage of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector. +and performs a vector similarity search to retrieve ten most similar documents to a given query vector. 
```python from docarray import BaseDoc, DocList @@ -189,10 +189,10 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: === "Search by Document" ```python - # create a query Document + # create a query document query = MyDoc(embedding=np.random.rand(128), text='query') - # find similar Documents + # find similar documents matches, scores = doc_index.find(query, search_field='embedding', limit=5) print(f'{matches=}') @@ -206,7 +206,7 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: # create a query vector query = np.random.rand(128) - # find similar Documents + # find similar documents matches, scores = doc_index.find(query, search_field='embedding', limit=5) print(f'{matches=}') @@ -217,7 +217,7 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: To perform a vector search, you need to specify a `search_field`. This is the field that serves as the basis of comparison between your query and the documents in the Document Index. -In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. +In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. @@ -228,9 +228,9 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarr How these scores are calculated depends on the backend, and can usually be [configured](#configuration). -### Batched Search +### Batched search -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. 
=== "Search by documents" @@ -262,7 +262,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ print(f'{scores=}') ``` -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. @@ -298,8 +298,8 @@ for doc in cheap_books: ## Text search In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. You can use text search directly on the field of type `str`: @@ -326,7 +326,7 @@ docs, scores = doc_index.text_search(query, search_field='text') Document Index supports atomic operations for vector similarity search, text search and filter search. To combine these operations into a single, hybrid search query, you can use the query builder that is accessible -through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]: +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]: ```python # Define the document schema. diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index 739f6e5ac5f..d0f400b67fb 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -14,7 +14,7 @@ focusing on special features and configurations of Weaviate. 
## Basic usage This snippet demonstrates the basic usage of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector. +and performs a vector similarity search to retrieve ten most similar documents to a given query vector. !!! note "Single Search Field Requirement" In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. @@ -372,14 +372,14 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: === "Search by Document" ```python - # create a query Document + # create a query document query = Document( text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), ) - # find similar Documents + # find similar documents matches, scores = doc_index.find(query, limit=5) print(f"{matches=}") @@ -393,7 +393,7 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: # create a query vector query = np.random.rand(2) - # find similar Documents + # find similar documents matches, scores = store.find(query, limit=5) print(f'{matches=}') @@ -401,7 +401,7 @@ the [`find()`][docarray.index.abstract.BaseDocIndex.find] method: print(f'{scores=}') ``` -In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. +In this example you only have one field (`embedding`) that is a vector, so you can trivially choose that one. In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose which one to use for the search. 
@@ -412,9 +412,9 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarr How these scores are calculated depends on the backend, and can usually be [configured](#configuration). -### Batched Search +### Batched search -You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. +You can also search for multiple documents at once, in a batch, using the [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method. === "Search by documents" @@ -451,7 +451,7 @@ You can also search for multiple documents at once, in a batch, using the [find_ print(f'{scores=}') ``` -The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing +The [`find_batched()`][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores. @@ -498,8 +498,8 @@ for doc in cheap_books: ## Text search In addition to vector similarity search, the Document Index interface offers methods for text search: -[text_search()][docarray.index.abstract.BaseDocIndex.text_search], -as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched]. +[`text_search()`][docarray.index.abstract.BaseDocIndex.text_search], +as well as the batched version [`text_search_batched()`][docarray.index.abstract.BaseDocIndex.text_search_batched]. You can use text search directly on the field of type `str`. @@ -515,7 +515,7 @@ docs = store.text_search("world", search_field="text", limit=2) Document Index supports atomic operations for vector similarity search, text search and filter search. 
To combine these operations into a single, hybrid search query, you can use the query builder that is accessible -through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]. +through [`build_query()`][docarray.index.abstract.BaseDocIndex.build_query]. To perform a hybrid search, follow the syntax below. diff --git a/docs/user_guide/storing/nested_data.md b/docs/user_guide/storing/nested_data.md index d2928193995..692dae2e1dc 100644 --- a/docs/user_guide/storing/nested_data.md +++ b/docs/user_guide/storing/nested_data.md @@ -66,7 +66,7 @@ You can perform search on any nesting level by using the dunder operator to spec In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields: ```python -# create a query Document +# create a query document query_doc = YouTubeVideoDoc( title=f'video query', description=f'this is a query video', From b4028024d386704c14df0bc5d2676125ea69a67f Mon Sep 17 00:00:00 2001 From: jupyterjazz Date: Tue, 1 Aug 2023 13:36:02 +0200 Subject: [PATCH 23/23] docs: app sgg Signed-off-by: jupyterjazz --- docs/user_guide/storing/docindex.md | 2 +- docs/user_guide/storing/index_elastic.md | 2 +- docs/user_guide/storing/index_hnswlib.md | 2 +- docs/user_guide/storing/index_in_memory.md | 4 ++-- docs/user_guide/storing/index_milvus.md | 2 +- docs/user_guide/storing/index_qdrant.md | 2 +- docs/user_guide/storing/index_redis.md | 2 +- docs/user_guide/storing/index_weaviate.md | 2 +- docs/user_guide/storing/nested_data.md | 3 ++- 9 files changed, 11 insertions(+), 10 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index 289bb8b4b18..dee3653e0f9 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -87,7 +87,7 @@ doc_index.index(docs) ### Perform a vector similarity search Now, let's perform a similarity search on the document 
embeddings. -As a result, we'll retrieve ten most similar documents and their corresponding similarity scores. +As a result, we'll retrieve the ten most similar documents and their corresponding similarity scores. ```python query = np.ones(128) retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10) diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index 684ea81558c..062a95c976d 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -37,7 +37,7 @@ but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.Ela ## Basic usage This snippet demonstrates the basic usage of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] to index these documents, -and performs a vector similarity search to retrieve ten most similar documents to a given query vector. +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index f0f41eb008e..e662cc220ae 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -26,7 +26,7 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s ## Basic usage This snippet demonstrates the basic usage of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex]. 
It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve ten most similar documents to a given query vector. +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList diff --git a/docs/user_guide/storing/index_in_memory.md b/docs/user_guide/storing/index_in_memory.md index cf2285eb55c..9b275b67063 100644 --- a/docs/user_guide/storing/index_in_memory.md +++ b/docs/user_guide/storing/index_in_memory.md @@ -4,7 +4,7 @@ [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all documents in memory using DocLists. It is a great starting point for small datasets, where you may not want to launch a database server. -For vector search and filtering the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] +For vector search and filtering [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][docarray.utils.filter.filter_docs] functions. !!! note "Production readiness" @@ -23,7 +23,7 @@ utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][d ## Basic usage This snippet demonstrates the basic usage of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] to index these documents, -and performs a vector similarity search to retrieve ten most similar documents to a given query vector. 
+and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList diff --git a/docs/user_guide/storing/index_milvus.md b/docs/user_guide/storing/index_milvus.md index ca8cd09a518..4cf9c91c7d5 100644 --- a/docs/user_guide/storing/index_milvus.md +++ b/docs/user_guide/storing/index_milvus.md @@ -14,7 +14,7 @@ focusing on special features and configurations of Milvus. ## Basic usage This snippet demonstrates the basic usage of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve ten most similar documents to a given query vector. +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. !!! note "Single Search Field Requirement" In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index fd9d57a8922..71770e45982 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -14,7 +14,7 @@ based on the [Qdrant](https://qdrant.tech/) vector search engine. ## Basic usage This snippet demonstrates the basic usage of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve ten most similar documents to a given query vector. 
+and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList diff --git a/docs/user_guide/storing/index_redis.md b/docs/user_guide/storing/index_redis.md index 0bccb046224..4e6522d1195 100644 --- a/docs/user_guide/storing/index_redis.md +++ b/docs/user_guide/storing/index_redis.md @@ -14,7 +14,7 @@ focusing on special features and configurations of Redis. ## Basic usage This snippet demonstrates the basic usage of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve ten most similar documents to a given query vector. +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. ```python from docarray import BaseDoc, DocList diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index d0f400b67fb..029c86de377 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -14,7 +14,7 @@ focusing on special features and configurations of Weaviate. ## Basic usage This snippet demonstrates the basic usage of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex]. It defines a document schema with a title and an embedding, creates ten dummy documents with random embeddings, initializes an instance of [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] to index these documents, -and performs a vector similarity search to retrieve ten most similar documents to a given query vector. +and performs a vector similarity search to retrieve the ten most similar documents to a given query vector. !!! 
note "Single Search Field Requirement" In order to utilize vector search, it's necessary to define 'is_embedding' for one field only. diff --git a/docs/user_guide/storing/nested_data.md b/docs/user_guide/storing/nested_data.md index 692dae2e1dc..feb7c4ee9b4 100644 --- a/docs/user_guide/storing/nested_data.md +++ b/docs/user_guide/storing/nested_data.md @@ -153,7 +153,8 @@ doc_index.index(index_docs) ### Search -You can perform search on any level by using `find_subindex()` method and the dunder operator `'root__subindex'` to specify the index to search on. +You can perform search on any level by using the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method +and the dunder operator `'root__subindex'` to specify the index to search on: ```python # find by the `VideoDoc` tensor