mtmd : add support for Voxtral #14862

Open · wants to merge 9 commits into master

Conversation

@ngxson (Collaborator) commented Jul 24, 2025

Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"

Output:

Here's the transcribed text from the audio:

"I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the South with. With this faith, we will be able to hew out of the mountain of despair a stone of hope."

Pre-quantized model

https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF

llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

How to convert from safetensors

# download https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
cd Voxtral-Mini-3B-2507
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 .
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile mmproj-model.gguf --outtype f16 --mmproj .
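
To sanity-check the converted files, one option (not part of the PR) is the gguf Python package that ships with llama.cpp (e.g. pip install gguf); a minimal sketch, assuming the two output files above:

# Sketch: inspect the converted GGUF files with the gguf Python package.
from gguf import GGUFReader

for path in ("model.gguf", "mmproj-model.gguf"):
    reader = GGUFReader(path)
    # Print the tensor count and the first few metadata keys as a smoke test.
    print(path, "->", len(reader.tensors), "tensors")
    for key in list(reader.fields)[:8]:
        print("  ", key)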

This PR was made possible with help from:

Disclaimer: due to mental health issues, I have a very limited amount of time to finish this PR.
Suggestions and testing are welcome, but keep in mind that I won't have time to address every possible issue.

Comment on lines 2307 to 2311
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
@ngxson (Collaborator, Author) commented Jul 24, 2025

The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a jinja version somewhere inside llama.cpp for compatibility reasons.
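
For illustration only: once a jinja template is stored, applying it is ordinary jinja rendering. The sketch below uses the Python jinja2 package with a toy template standing in for the stored .jinja file; llama.cpp has its own built-in jinja engine, and the real Devstral template is much larger and may need extra helpers (e.g. raise_exception) that plain jinja2 does not define.

# Sketch: rendering a chat template with plain jinja2 (illustrative only).
from jinja2 import Template

# Toy stand-in for the stored template file; [INST] is genuine Mistral style.
template = Template("{{ bos_token }}[INST] {{ messages[0]['content'] }} [/INST]")
messages = [{"role": "user", "content": "Transcribe this audio"}]
prompt = template.render(messages=messages, bos_token="<s>")
print(prompt)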

A contributor commented:

A quick look at the tests in mistral-common suggests that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens that wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd).

It also seems like Mistral adds a special token for transcription tasks: based on the tests, it looks like an optional language tag followed by a [TRANSCRIBE] token.
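
For reference, the layout implied by that description would presumably be something like the sketch below; the exact token order is an assumption drawn from the mistral-common tests, not something verified in this PR:

[BEGIN_AUDIO] <audio embeddings ...> lang:en [TRANSCRIBE]

where lang:en is the optional language part and [TRANSCRIBE] switches the model into pure transcription mode.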

@ngxson (Collaborator, Author) commented:

Hmm, yeah, we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required, though; it seems like the model is also able to summarize the audio, not just transcribe it.

@ngxson (Collaborator, Author) commented Jul 25, 2025

Fun thing I've just found out: without [BEGIN_AUDIO], the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.

A contributor commented:

> it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand of the HF page, it basically has two modes:

> Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.

Without [TRANSCRIBE] it can understand and talk about the audio, as you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
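
If the dedicated [TRANSCRIBE] path isn't wired up here, a rough workaround is to lean on instruction following from the CLI; a sketch reusing the flags from the test at the top of this PR (output quality is not guaranteed to match the dedicated mode):

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio verbatim. Output only the transcription."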

@Kreijstal commented:

How to use this to transcribe audio à la whisper.cpp?

@ngxson (Collaborator, Author) commented Jul 27, 2025

This PR is pretty much ready for review.

Bonus: b828887 fixes the conversion of newer Mistral models that do not ship a tokenizer.json, such as https://huggingface.co/mistralai/Devstral-Small-2507

@ngxson requested a review from ggerganov · Jul 27, 2025 21:52
@ngxson (Collaborator, Author) commented Jul 27, 2025

Kindly asking for some quants from my friends @bartowski1182 @danielhanchen 😄

The github-actions bot added the documentation (Improvements or additions to documentation) label · Jul 27, 2025
@bartowski1182 (Contributor) commented:

I don't usually like quanting from unreleased code, but I can test it out to see if it works :)

@ngxson (Collaborator, Author) commented Jul 27, 2025

@bartowski1182 Aha, yeah, good point! Indeed, the text model is pretty much exactly the same as Devstral, so I don't think it's going to change much. You can pre-quantize the Voxtral 24B right now, since it might take some time, and upload it once this PR is merged.

The only thing that may change is the mmproj, but it's simple to re-generate.
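
For reference, re-quantizing from an f16 conversion follows the usual llama.cpp flow; a sketch with placeholder paths (the mmproj file is typically left at f16 rather than quantized):

llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M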

@bartowski1182 (Contributor) commented:

Awesome, sounds good!

Super excited for the native Mistral support, that's a game changer.

@ngxson (Collaborator, Author) commented Jul 27, 2025

Test results:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M

Labels: documentation, examples, python

4 participants