Skip to content

Ready-to-go containerized RAG service. Implemented with text-embedding-inference + Qdrant/LanceDB.

Notifications You must be signed in to change notification settings

plaggy/rag-containers

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 

Repository files navigation

Ready-to-use containerized RAG system built on top of your knowledge base. Wrapped as a Gradio app.

Components:

!! The input documents are expected to be clean raw text and chunked as needed, they'll be embedded as-is !!

Usage:

Set DOCS_DIR in compose files to the path of your documents directory. Then

docker compose -f compose.cpu.yaml up
docker compose -f compose.gpu.yaml up

GPU configs are currently set to Turing GPUs(T4). Adjust the containers for the embed and rerank services as necessary (TEI images).

Set it up via modifying .env:

TOP_K_RETRIEVE - number of items to retrieve before ranking, ignored if a cross-encoder is off
TOP_K_RANK - number of items to use as LLM context. If a cross-encoder is not used this is the number of retrieved items
SEMAPHORE_LIMIT - the limit to the number of coroutines simultaneously working on embedding and ingestion of the documents from DOCS_DIR
BATCH_SIZE - size of batches sent to a TEI embedder on the init stage and TEI reranker during inference. With concurrency and a GPU can be safely set to 1

HF_MODEL - name of a HF LLM used
HF_URL - url to a HF LLM used. Can be a model name if a model is hosted via HuggingChat or a url if you host a model yourself
OPENAI_MODEL - name of an OpenAI LLM
EMBED_MODEL - name of an embedding model from the HF Hub, must match the model used for an embedder

# Indexing and search parameters - parameters of vector DBs, refer to their documentation. [LanceDB](https://blog.lancedb.com/benchmarking-lancedb-92b01032874a), [Qdrant](https://qdrant.tech/documentation/tutorials/retrieval-quality/#tweaking-the-hnsw-parameters)
# Generation parameters - mostly standard [generation params](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig).
PROMPT_TOKEN_LIMIT might come in handy if you expect the context length to go beyond the limit. It'll truncate only context and keep the postfix with appropriate special tokens
REP_PENALTY is a repetition penalty, HF models-specific;
FREQ_PENALTY - frequency penalty, OpenAI models-specific;

The secrets HF_TOKEN and OPENAI_API_KEY must be stored separately in the files with the corresponding names.

Note that cross-encoder reranking would be very slow on CPU


Now part of the Nebius Academy repo

About

Ready-to-go containerized RAG service. Implemented with text-embedding-inference + Qdrant/LanceDB.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy