Commit e4665e9

Joan Fontanals authored
docs: move hint about schemas to common docindex section (docarray#1868)
1 parent 7c1e18e commit e4665e9

2 files changed

+60
-61
lines changed


docs/user_guide/storing/docindex.md

Lines changed: 60 additions & 0 deletions
@@ -116,6 +116,66 @@ query = (
 retrieved_docs, scores = doc_index.execute_query(query)
 ```
 
+### Using a predefined document as schema
+
+DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
+If you try to use these directly as a schema for a Document Index, you will get unexpected behavior:
+Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built.
+
+The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding`
+field. But this is crucial information for any vector database to work properly!
+
+You can work around this problem by subclassing the predefined document and adding the dimensionality information:
+
+=== "Using type hint"
+    ```python
+    from docarray.documents import TextDoc
+    from docarray.typing import NdArray
+    from docarray.index import HnswDocumentIndex
+
+
+    class MyDoc(TextDoc):
+        embedding: NdArray[128]
+
+
+    db = HnswDocumentIndex[MyDoc]('test_db')
+    ```
+
+=== "Using Field()"
+    ```python
+    from docarray.documents import TextDoc
+    from docarray.typing import AnyTensor
+    from docarray.index import HnswDocumentIndex
+    from pydantic import Field
+
+
+    class MyDoc(TextDoc):
+        embedding: AnyTensor = Field(dim=128)
+
+
+    db = HnswDocumentIndex[MyDoc]('test_db3')
+    ```
+
+Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type.
+
+The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`:
+
+```python
+from docarray import DocList
+
+# data of type TextDoc
+data = DocList[TextDoc](
+    [
+        TextDoc(text='hello world', embedding=np.random.rand(128)),
+        TextDoc(text='hello world', embedding=np.random.rand(128)),
+        TextDoc(text='hello world', embedding=np.random.rand(128)),
+    ]
+)
+
+# you can index this into Document Index of type MyDoc
+db.index(data)
+```
+
 ## Learn more
 The code snippets above just scratch the surface of what a Document Index can do.
 To learn more and get the most out of `DocArray`, take a look at the detailed guides for the vector database backends you're interested in:
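
The section added in `docindex.md` rests on one idea: a predefined document does not declare the size of its `.embedding` field, and a subclass can supply it. The sketch below is a rough, library-free illustration of that idea using only the standard library. It is not DocArray's implementation, and `Dim`, `PredefinedDoc`, `MyDoc` and `embedding_dim` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Annotated, List, Optional, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class Dim:
    """Hypothetical marker that carries an embedding's dimensionality."""

    size: int


class PredefinedDoc:
    # Stand-in for a predefined document: the embedding has no declared size.
    embedding: Optional[List[float]] = None


class MyDoc(PredefinedDoc):
    # The subclass re-annotates the field and adds the missing size.
    embedding: Annotated[List[float], Dim(128)]


def embedding_dim(schema: type) -> Optional[int]:
    """Return the declared embedding size, or None if the schema has none."""
    hint = get_type_hints(schema, include_extras=True)["embedding"]
    if get_origin(hint) is Annotated:
        for meta in get_args(hint)[1:]:
            if isinstance(meta, Dim):
                return meta.size
    return None


print(embedding_dim(PredefinedDoc))
print(embedding_dim(MyDoc))
```

In this sketch, an index built from `PredefinedDoc` would have no size to allocate its vector structure with, while one built from `MyDoc` would. That is the gap the subclassing workaround closes.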

docs/user_guide/storing/index_elastic.md

Lines changed: 0 additions & 61 deletions
@@ -126,67 +126,6 @@ class SimpleDoc(BaseDoc):
 doc_index = ElasticDocIndex[SimpleDoc](hosts='http://localhost:9200')
 ```
 
-### Using a predefined document as schema
-
-DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
-If you try to use these directly as a schema for a Document Index, you will get unexpected behavior:
-Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built.
-
-The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding`
-field. But this is crucial information for any vector database to work properly!
-
-You can work around this problem by subclassing the predefined document and adding the dimensionality information:
-
-=== "Using type hint"
-    ```python
-    from docarray.documents import TextDoc
-    from docarray.typing import NdArray
-    from docarray.index import ElasticDocIndex
-
-
-    class MyDoc(TextDoc):
-        embedding: NdArray[128]
-
-
-    db = ElasticDocIndex[MyDoc](index_name='test_db')
-    ```
-
-=== "Using Field()"
-    ```python
-    from docarray.documents import TextDoc
-    from docarray.typing import AnyTensor
-    from docarray.index import ElasticDocIndex
-    from pydantic import Field
-
-
-    class MyDoc(TextDoc):
-        embedding: AnyTensor = Field(dim=128)
-
-
-    db = ElasticDocIndex[MyDoc](index_name='test_db3')
-    ```
-
-Once you have defined the schema of your Document Index in this way, the data that you index can be either the predefined Document type or your custom Document type.
-
-The [next section](#index) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`:
-
-```python
-from docarray import DocList
-
-# data of type TextDoc
-data = DocList[TextDoc](
-    [
-        TextDoc(text='hello world', embedding=np.random.rand(128)),
-        TextDoc(text='hello world', embedding=np.random.rand(128)),
-        TextDoc(text='hello world', embedding=np.random.rand(128)),
-    ]
-)
-
-# you can index this into Document Index of type MyDoc
-db.index(data)
-```
-
-
 ## Index
 
 Now that you have a Document Index, you can add data to it, using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method.
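
The passage moved by this commit notes that documents of the predefined type can be indexed without casting them to the schema class. One way to picture this is an index that inspects only the length of each embedding, not the document's class. That is an assumption made for illustration, not a description of DocArray internals. `PlainDoc` and `TinyIndex` are hypothetical names.

```python
import random
from typing import List, Sequence


class PlainDoc:
    """Hypothetical stand-in for a predefined document such as TextDoc."""

    def __init__(self, text: str, embedding: List[float]) -> None:
        self.text = text
        self.embedding = embedding


class TinyIndex:
    """Hypothetical index whose embedding size is fixed by its schema."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.vectors: List[List[float]] = []

    def index(self, docs: Sequence[object]) -> None:
        for doc in docs:
            vector = list(getattr(doc, "embedding"))
            # Only the vector length is checked, not the document's class.
            if len(vector) != self.dim:
                raise ValueError(f"expected {self.dim} values, got {len(vector)}")
            self.vectors.append(vector)


# Three plain documents with 128-dimensional embeddings, indexed as-is.
docs = [
    PlainDoc("hello world", [random.random() for _ in range(128)])
    for _ in range(3)
]
index = TinyIndex(dim=128)
index.index(docs)
```

Under this assumption, what matters at indexing time is that each embedding matches the size declared by the schema. A document with a different embedding length is rejected regardless of its type.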
