High-throughput, OpenAI-compatible text embedding & reranker powered by Infinity
- 🐳 Pull an image – use the tag shown on the latest GitHub release page (e.g. `runpod/worker-infinity-embedding:<version>`)
- 🔧 Configure – set at least `MODEL_NAMES` (see Endpoint Configuration)
- 🚀 Deploy – create a RunPod Serverless endpoint
- 🧪 Call the API – follow the example in the Usage section
All behaviour is controlled through environment variables:
| Variable | Required | Default | Description |
|---|---|---|---|
| `MODEL_NAMES` | Yes | – | One or more Hugging Face model IDs. Separate multiple IDs with a semicolon. Example: `BAAI/bge-small-en-v1.5` |
| `BATCH_SIZES` | No | `32` | Per-model batch size; semicolon-separated list matching `MODEL_NAMES`. |
| `BACKEND` | No | `torch` | Inference engine for all models: `torch`, `optimum`, or `ctranslate2`. |
| `DTYPES` | No | `auto` | Precision per model (`auto`, `fp16`, `fp8`). Semicolon-separated, must match `MODEL_NAMES`. |
| `INFINITY_QUEUE_SIZE` | No | `48000` | Max items queueable inside the Infinity engine. |
| `RUNPOD_MAX_CONCURRENCY` | No | `300` | Max concurrent requests the RunPod wrapper will accept. |
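Because the per-model variables are semicolon-separated lists, each must contain exactly one entry per model in `MODEL_NAMES`. A minimal Python sketch of that pairing (the values below are illustrative examples, not defaults):

```python
def parse_multi(value: str) -> list[str]:
    """Split a semicolon-separated env var into per-model values."""
    return [v.strip() for v in value.split(";") if v.strip()]

# Example configuration for two models.
env = {
    "MODEL_NAMES": "BAAI/bge-small-en-v1.5;BAAI/bge-reranker-large",
    "BATCH_SIZES": "32;16",
    "DTYPES": "auto;fp16",
}

models = parse_multi(env["MODEL_NAMES"])
for key in ("BATCH_SIZES", "DTYPES"):
    # Each per-model list must line up one-to-one with MODEL_NAMES.
    assert len(parse_multi(env[key])) == len(models), f"{key} must match MODEL_NAMES"
```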
Two flavours, one schema.

- **OpenAI-compatible** – a drop-in replacement for `/v1/models` and `/v1/embeddings`. Use this endpoint in place of OpenAI's API by replacing the base URL with `https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1` and using your RunPod API key instead of an OpenAI key.
- **Standard RunPod** – call `/run` or `/runsync` with a JSON body under the `input` key. Base URL: `https://api.runpod.ai/v2/<ENDPOINT_ID>`
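For the OpenAI-compatible flavour, any HTTP client works once the base URL and key are swapped. A hedged stdlib sketch that builds the request (endpoint ID and key are placeholders you must substitute):

```python
import json
import urllib.request

# Placeholders -- substitute your own endpoint ID and RunPod API key.
ENDPOINT_ID = "<ENDPOINT_ID>"
API_KEY = "<API_KEY>"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1"

def embeddings_request(model: str, texts) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/embeddings route."""
    body = json.dumps({"model": model, "input": texts}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it requires a live endpoint:
#   with urllib.request.urlopen(embeddings_request("BAAI/bge-small-en-v1.5", "Hello world")) as resp:
#       data = json.load(resp)
```

The official `openai` Python client also works: point its `base_url` at the URL above and pass your RunPod key as `api_key`.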
Except for transport (path + wrapper object), the JSON you send and receive is identical. The tables below describe the shared payload.
| Method | Path | Body |
|---|---|---|
| GET | `/openai/v1/models` | – |
| POST | `/runsync` | `{ "input": { "openai_route": "/v1/models" } }` |
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | One of the IDs supplied via `MODEL_NAMES`. |
| `input` | string \| array | Yes | A single text string or a list of texts to embed. |
OpenAI route vs. Standard:

| Flavour | Method | Path | Body |
|---|---|---|---|
| OpenAI | POST | `/v1/embeddings` | `{ "model": "…", "input": "…" }` |
| Standard | POST | `/runsync` | `{ "input": { "model": "…", "input": "…" } }` |
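The two bodies differ only in the wrapper, which a small sketch makes explicit:

```python
def openai_body(model: str, texts) -> dict:
    """Payload for the OpenAI-compatible /v1/embeddings route."""
    return {"model": model, "input": texts}

def runpod_body(model: str, texts) -> dict:
    """Standard flavour: the same payload, wrapped under the "input" key."""
    return {"input": openai_body(model, texts)}
```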
```json
{
  "object": "list",
  "model": "BAAI/bge-small-en-v1.5",
  "data": [
    { "object": "embedding", "embedding": [0.01, -0.02 /* … */], "index": 0 }
  ],
  "usage": { "prompt_tokens": 2, "total_tokens": 2 }
}
```
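To pull the vectors out of a response like the one above, sort by `index` so each embedding lines up with the input it belongs to (a minimal sketch, using the example response):

```python
# Example response, copied from the shape shown above (vector truncated).
response = {
    "object": "list",
    "model": "BAAI/bge-small-en-v1.5",
    "data": [
        {"object": "embedding", "embedding": [0.01, -0.02], "index": 0},
    ],
    "usage": {"prompt_tokens": 2, "total_tokens": 2},
}

# Sort by "index" so vectors align with the order of the submitted texts.
vectors = [
    item["embedding"]
    for item in sorted(response["data"], key=lambda d: d["index"])
]
```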
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Any deployed reranker model. |
| `query` | string | Yes | The search/query text. |
| `docs` | array | Yes | List of documents to rerank. |
| `return_docs` | bool | No | If `true`, return the documents in ranked order (default `false`). |
Call pattern:

```http
POST /runsync
Content-Type: application/json

{
  "input": {
    "model": "BAAI/bge-reranker-large",
    "query": "Which product has warranty coverage?",
    "docs": [
      "Product A comes with a 2-year warranty",
      "Product B is available in red and blue colors",
      "All electronics include a standard 1-year warranty"
    ],
    "return_docs": true
  }
}
```
The response contains either `scores` or the full `docs` list, depending on `return_docs`.
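If you leave `return_docs` off and receive only `scores`, you can pair them back up with the documents you sent. A hedged sketch (the score values are made up for illustration, not real model output):

```python
# The documents sent in the request, in order.
docs = [
    "Product A comes with a 2-year warranty",
    "Product B is available in red and blue colors",
    "All electronics include a standard 1-year warranty",
]

# Hypothetical relevance scores; scores[i] corresponds to docs[i].
result = {"scores": [0.92, 0.11, 0.87]}

# Pair each score with its document and sort best-first.
ranked = sorted(zip(result["scores"], docs), reverse=True)
best_score, best_doc = ranked[0]
```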
Below are minimal `curl` snippets you can copy-paste from any machine. Replace `<ENDPOINT_ID>` with your endpoint ID and `<API_KEY>` with a RunPod API key.
```bash
# List models
curl -H "Authorization: Bearer <API_KEY>" \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/models

# Create embeddings
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"model":"BAAI/bge-small-en-v1.5","input":"Hello world"}' \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/embeddings

# Create embeddings (wait for result)
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"input":{"model":"BAAI/bge-small-en-v1.5","input":"Hello world"}}' \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync

# Rerank
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"input":{"model":"BAAI/bge-reranker-large","query":"Which product has warranty coverage?","docs":["Product A comes with a 2-year warranty","Product B is available in red and blue colors","All electronics include a standard 1-year warranty"],"return_docs":true}}' \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync
```
- Infinity Engine – how the ultra-fast backend works.
- RunPod Docs – serverless concepts, limits, and API reference.
Special thanks to Michael Feil for creating the Infinity engine and for his ongoing support of this project.