Add Llama support to Inference Plugin #130092

Conversation


@Jan-Kazlouski-elastic Jan-Kazlouski-elastic commented Jun 26, 2025

This PR adds a new Llama inference provider integration, allowing text_embedding, completion (both streaming and non-streaming), and chat_completion (streaming only) to be executed as part of the inference API.

Changes were tested locally against the following models:

  • all-MiniLM-L6-v2 (text embedding)
  • llama3.2:3b (completion & chat_completion)

The ollama service was used for testing.
Quickstart for setting up and running the Llama Stack service locally: https://llama-stack.readthedocs.io/en/latest/getting_started/index.html

Setup

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Download and run ollama:

https://ollama.com/download

Clone the llama stack repo: git clone git@github.com:meta-llama/llama-stack.git, then follow the detailed instructions in the docs above.

Running `all-minilm:l6-v2`

Download the model:

ollama pull all-minilm:l6-v2

Then build and run the Llama Stack server, pointing it at the model:

INFERENCE_MODEL=all-minilm:l6-v2 uv run --with llama-stack llama stack build --template starter --image-type venv --run
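
Optionally, sanity-check that the stack is reachable before wiring it into Elasticsearch. A minimal sketch: the URL matches the embeddings endpoint used in the examples below, but the body fields (model_id, contents) are assumptions about the llama-stack native embeddings API and may differ between versions.

# Sanity check against the llama-stack embeddings endpoint;
# the model_id/contents field names are assumptions, adjust to your llama-stack version.
curl -s http://localhost:8321/v1/inference/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model_id": "all-MiniLM-L6-v2", "contents": ["sanity check"]}'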

Examples of request/response (RQ/RS) pairs from local testing:

Create Embedding Endpoint

No URL:

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "api_key": "{{mistral-api-key}}",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "validation_exception",
                "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
            }
        ],
        "type": "validation_exception",
        "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
    },
    "status": 400
}

No API key (success):

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/inference/embeddings",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "inference_id": "llama-text-embedding",
    "task_type": "text_embedding",
    "service": "llama",
    "service_settings": {
        "model_id": "all-MiniLM-L6-v2",
        "url": "http://localhost:8321/v1/inference/embeddings",
        "dimensions": 384,
        "similarity": "cosine",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}

Not Found:

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/inference/embeddings1",
        "api_key": "{{mistral-api-key}}",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [http://localhost:8321/v1/inference/embeddings1] for request from inference entity id [llama-text-embedding] status [404]. Error message: [{\"detail\":\"Not Found\"}]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [http://localhost:8321/v1/inference/embeddings1] for request from inference entity id [llama-text-embedding] status [404]. Error message: [{\"detail\":\"Not Found\"}]"
        }
    },
    "status": 400
}

Success:

RQ:
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/inference/embeddings",
        "api_key": "{{mistral-api-key}}",
        "model_id": "all-MiniLM-L6-v2"
    }
}
RS:
{
    "inference_id": "llama-text-embedding",
    "task_type": "text_embedding",
    "service": "llama",
    "service_settings": {
        "model_id": "all-MiniLM-L6-v2",
        "url": "http://localhost:8321/v1/inference/embeddings",
        "dimensions": 384,
        "similarity": "cosine",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Perform Embedding

Bad Request:

RQ:
POST {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "query": "string",
    "task_settings": {}
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: Field [input] cannot be null;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: Field [input] cannot be null;"
    },
    "status": 400
}

Success:

RQ:
POST {{base-url}}/_inference/text_embedding/llama-text-embedding
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS:
{
    "text_embedding": [
        {
            "embedding": [
                0.055843446,
                0.01615099
            ]
        }
    ]
}
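
For reference, the same embedding call can be made with curl (a sketch; http://localhost:9200 and any authentication are placeholders for your cluster):

curl -s -X POST "http://localhost:9200/_inference/text_embedding/llama-text-embedding" \
  -H "Content-Type: application/json" \
  -d '{"input": "The sky above the port was the color of television tuned to a dead channel."}'
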
Create Completion Endpoint

No URL:

RQ:
PUT {{base-url}}/_inference/completion/llama-completion
{
    "service": "llama",
    "service_settings": {
        "api_key": "{{api-key}}",
        "model_id": "llama3.2:3b"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "validation_exception",
                "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
            }
        ],
        "type": "validation_exception",
        "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
    },
    "status": 400
}

Success:

RQ:
PUT {{base-url}}/_inference/completion/llama-completion
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "model_id": "ollama/llama3.2:3b"
    }
}
RS:
{
    "inference_id": "llama-completion",
    "task_type": "completion",
    "service": "llama",
    "service_settings": {
        "model_id": "llama3.2:3b",
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform Completion

Success (Non-Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS:
{
    "completion": [
        {
            "result": "You're quoting Joseph Heller's classic novel \"Catch-22\". The famous line from Chapter 14 reads:\n\n\"The sky above the port was the color of television set left on at high heat, which caught the sun in its glassy eye like a garnish on a prawns cocktail.\"\n\nThe phrase has since become a metaphor for a sense of desolation and hopelessness, often used to describe the feeling of being stuck or trapped in a situation."
        }
    ]
}

Success (Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion/_stream
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS:
event: message
data: {"completion":[{"delta":"That"}]}

event: message
data: {"completion":[{"delta":"'s"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" great"}]}

event: message
data: {"completion":[{"delta":" quote"}]}

event: message
data: [DONE]
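
The stream can also be consumed from the command line (a sketch; -N disables curl's buffering so each SSE event prints as it arrives, and http://localhost:9200 is a placeholder):

curl -N -X POST "http://localhost:9200/_inference/completion/llama-completion/_stream" \
  -H "Content-Type: application/json" \
  -d '{"input": "The sky above the port was the color of television tuned to a dead channel."}'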

Bad Request (Non-Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion
{
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: Field [input] cannot be null;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: Field [input] cannot be null;"
    },
    "status": 400
}

Bad Request (Streaming):

RQ:
POST {{base-url}}/_inference/completion/llama-completion/_stream
{
}
RS:
event: error
data: {"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: Field [input] cannot be null;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: Field [input] cannot be null;"},"status":400}

Create Chat Completion Endpoint

No URL:

RQ:
PUT {{base-url}}/_inference/chat_completion/llama-chat-completion
{
    "service": "llama",
    "service_settings": {
        "api_key": "{{mistral-api-key}}",
        "model_id": "ollama/llama3.2:3b"
    }
}
RS:
{
    "error": {
        "root_cause": [
            {
                "type": "validation_exception",
                "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
            }
        ],
        "type": "validation_exception",
        "reason": "Validation Failed: 1: [service_settings] does not contain the required setting [url];"
    },
    "status": 400
}

Success:

RQ:
PUT {{base-url}}/_inference/chat_completion/llama-chat-completion
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "api_key": "{{mistral-api-key}}",
        "model_id": "ollama/llama3.2:3b"
    }
}
RS:
{
    "inference_id": "llama-chat-completion",
    "task_type": "chat_completion",
    "service": "llama",
    "service_settings": {
        "model_id": "ollama/llama3.2:3b",
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform Chat Completion

Success (basic):

RQ:
POST {{base-url}}/_inference/chat_completion/llama-chat-completion/_stream
{
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_completion_tokens": 10
}
RS:
event: message
data: {"id":"chatcmpl-bc589b74-a744-418b-a856-fa11abd98c8c","choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":"length","index":0}],"model":"llama3.2:3b","object":"chat.completion.chunk"}

event: message
data: {"id":"chatcmpl-bc589b74-a744-418b-a856-fa11abd98c8c","choices":[],"model":"llama3.2:3b","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":30,"total_tokens":40}}

event: message
data: [DONE]

Success (Complex):

RQ:
POST {{base-url}}/_inference/chat_completion/llama-chat-completion/_stream
{
    "model": "llama3.2:3b",
    "messages": [{
            "role": "user",
            "content": [{
                    "type": "text",
                    "text": "What's the price of a scarf?"
                }
            ]
        }
    ],
    "tools": [{
            "type": "function",
            "function": {
                "name": "get_current_price",
                "description": "Get the current price of a item",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "item": {
                            "id": "123"
                        }
                    }
                }
            }
        }
    ],
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_price"
        }
    }
}

RS:
event: message
data: {"id":"chatcmpl-387a656b-2c0a-4fa7-9929-71e89c999c7e","choices":[{"delta":{"content":"","role":"assistant","tool_calls":[{"index":0,"id":"call_4qiq7n2n","function":{"arguments":"{\"item\":\"scarf\"}","name":"get_current_price"},"type":"function"}]},"index":0}],"model":"llama3.2:3b","object":"chat.completion.chunk"}

event: message
data: {"id":"chatcmpl-387a656b-2c0a-4fa7-9929-71e89c999c7e","choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":"tool_calls","index":0}],"model":"llama3.2:3b","object":"chat.completion.chunk"}

event: message
data: {"id":"chatcmpl-387a656b-2c0a-4fa7-9929-71e89c999c7e","choices":[],"model":"llama3.2:3b","object":"chat.completion.chunk","usage":{"completion_tokens":15,"prompt_tokens":160,"total_tokens":175}}

event: message
data: [DONE]
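To extract just the tool-call arguments from a stream like the one above, the SSE events can be post-processed (a sketch; assumes jq is installed, chat_request.json is a hypothetical file containing the request body shown above, and the chunk shape matches these examples):

# Strip the SSE "data: " prefix, drop the [DONE] sentinel, and print
# any function-call arguments found in the chunks.
curl -sN -X POST "http://localhost:9200/_inference/chat_completion/llama-chat-completion/_stream" \
  -H "Content-Type: application/json" \
  -d @chat_request.json \
  | sed -n 's/^data: //p' \
  | grep -v '^\[DONE\]$' \
  | jq -r '.choices[]?.delta.tool_calls[]?.function.arguments? // empty'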

Invalid Model:

RQ:
POST {{base-url}}/_inference/chat_completion/llama-chat-completion/_stream
{
    "model": "ggg",
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_completion_tokens": 10
}
RS:
event: error
data: {"error":{"code":"stream_error","message":"Received an error response for request from inference entity id [llama-chat-completion]. Error message: [400: Invalid value: Model 'ggg' not found]","type":"llama_error"}}


  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@elasticsearchmachine added the v9.2.0 and external-contributor (Pull request authored by a developer outside the Elasticsearch team) labels Jun 26, 2025
Jan-Kazlouski-elastic and others added 24 commits June 26, 2025 21:14
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic marked this pull request as ready for review July 4, 2025 14:02
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic requested a review from a team as a code owner July 4, 2025 14:02
@Jan-Kazlouski-elastic:

@jonathan-buttner Thank you for your comments. They are addressed and the PR is ready for re-review.

@jonathan-buttner left a comment:


Thanks for the changes, left a few more suggestions.

@@ -212,6 +212,7 @@ static TransportVersion def(int id) {
public static final TransportVersion ESQL_PROFILE_INCLUDE_PLAN_8_19 = def(8_841_0_62);
public static final TransportVersion ESQL_SPLIT_ON_BIG_VALUES_8_19 = def(8_841_0_63);
public static final TransportVersion ESQL_FIXED_INDEX_LIKE_8_19 = def(8_841_0_64);
public static final TransportVersion ML_INFERENCE_LLAMA_ADDED_8_19 = def(8_841_0_65);
@jonathan-buttner:

Sorry, I forgot to mention this in the previous review: we won't be backporting this to 8.x, so we can remove this transport version.

@Jan-Kazlouski-elastic:

Removed.

@@ -116,9 +116,16 @@ public String getWriteableName() {

@Override
public TransportVersion getMinimalSupportedVersion() {
assert false : "should never be called when supportsVersion is used";
@jonathan-buttner:

I believe we can remove this line now because we won't need to backport to 8.x

@Jan-Kazlouski-elastic:

Removed.

return TransportVersions.ML_INFERENCE_LLAMA_ADDED;
}

@Override
public boolean supportsVersion(TransportVersion version) {
@jonathan-buttner:

Let's remove this override.

@Jan-Kazlouski-elastic:

Removed.

@@ -154,9 +154,16 @@ public String getWriteableName() {

@Override
public TransportVersion getMinimalSupportedVersion() {
assert false : "should never be called when supportsVersion is used";
@jonathan-buttner:

Let's remove this.

return TransportVersions.ML_INFERENCE_LLAMA_ADDED;
}

@Override
@jonathan-buttner:

Let's remove this method override.

@Jan-Kazlouski-elastic:

Removed.

@@ -49,7 +47,7 @@ public RateLimitSettings rateLimitSettings() {

@Override
public int rateLimitGroupingHash() {
return 0;
@jonathan-buttner:

Good catch. In the future, let's add these bug fix changes to a separate PR.

@Jan-Kazlouski-elastic:

Sure thing!

@@ -141,7 +140,7 @@ public boolean isEnabled() {
return true;
}

protected abstract CustomModel createEmbeddingModel(@Nullable SimilarityMeasure similarityMeasure);
protected abstract Model createEmbeddingModel(@Nullable SimilarityMeasure similarityMeasure);
@jonathan-buttner:

Thanks for these

@Jan-Kazlouski-elastic:

No problem.

}
}

public void testParseRequestConfig_CreatesChatCompletionsModel() throws IOException {
@jonathan-buttner:

I believe the base class covers this test. Can you check whether it covers anything additional? If not, let's remove it from here.

@Jan-Kazlouski-elastic:

You are correct. Added a check for model_id to the common model assertion, since it was missed before.

@jonathan-buttner:

@Jan-Kazlouski-elastic I did some testing, and things are looking good. I think there's one scenario we should add better validation error handling for.

I was struggling to get llama3.2:3b to be recognized and finally realized that I need to prepend ollama/ in the model_id field. If you use a model string like llama3.2:3bbbb, the inference endpoint will still be created even though the test request our inference plugin makes receives:

data: {"error": {"message": "400: Invalid value: Model 'llama3.2:3b' not found"}}
PUT _inference/chat_completion/chat
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/chat/completions",
        "model_id": "llama3.2:3bbbb"
    }
}

I think a better experience would be for the PUT request to fail and report back the error it received. This is probably a larger change unrelated to this implementation though. I'll create an issue to improve the validation.

@jonathan-buttner:

Could you fix the merge conflicts and then I'll approve and merge on Monday 👍

@Jan-Kazlouski-elastic:

Conflicts are resolved. Adopted the error-handling and service-constructor changes from master.
FYI @jonathan-buttner

@jonathan-buttner jonathan-buttner merged commit beb18a8 into elastic:main Jul 18, 2025
35 checks passed
Labels: >enhancement, external-contributor (Pull request authored by a developer outside the Elasticsearch team), :ml (Machine learning), Team:ML (Meta label for the ML team), v9.2.0