
mtmd : add support for Voxtral #14862


Status: Open · wants to merge 3 commits into master
Conversation

@ngxson (Collaborator) commented on Jul 24, 2025:

Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"

Output:

Here's the transcribed text from the audio:

"I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the South with. With this faith, we will be able to hew out of the mountain of despair a stone of hope."

Pre-quantized model

https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

How to convert from safetensors

# download https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
cd Voxtral-Mini-3B-2507
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 .
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile mmproj-model.gguf --outtype f16 --mmproj .

This PR was made possible with help from:

Disclaimer: due to mental health issues, I have a very limited amount of time to finish this PR.
Suggestions and testing are welcome, but keep in mind that I won't have time to address all possible issues.

Comment on lines +2307 to +2311
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
@ngxson (Collaborator, Author) commented on Jul 24, 2025:

The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a Jinja version somewhere inside llama.cpp for compatibility reasons.

A contributor commented:

A quick look at the tests in mistral-common seems to show that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens to wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd).

It also seems like Mistral adds a special token for transcription tasks, which seems to be an optional language and then a [TRANSCRIBE] token based on the tests.

@ngxson (Collaborator, Author) replied:

Hmm yeah we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required though, it seems like the model is also able to summarize the audio, not just transcribe it.

@ngxson (Collaborator, Author) commented on Jul 25, 2025:

Fun thing I've just found out: without the [BEGIN_AUDIO] token, the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.

The contributor replied:

it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand from the HF page, it basically has two modes:

Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly

Without the [TRANSCRIBE] it can understand and talk about the audio, like you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
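For illustration only, the two modes described above would correspond to prompts roughly like the following. The exact token layout is an assumption based on the mistral-common tests mentioned in this thread and has not been verified against this PR:

```
# Chat mode: the model can discuss or summarize the audio
[INST][BEGIN_AUDIO]<audio embeddings>What is this recording about?[/INST]

# Dedicated transcription mode: optional language hint, then [TRANSCRIBE]
[INST][BEGIN_AUDIO]<audio embeddings>lang:en[TRANSCRIBE][/INST]
```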

@Kreijstal commented:
How to use this to transcribe audio à la whisper.cpp?
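Based on the commands earlier in this PR (the repo name and prompt are taken from the examples above; the exact output is not guaranteed), a minimal transcription run would presumably look like:

```shell
# Fetch the pre-quantized GGUF from Hugging Face and transcribe a local audio file
llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF \
    --audio input.mp3 \
    -p "Transcribe this audio"
```

Note that, per the discussion above, this is chat-style transcription rather than the dedicated [TRANSCRIBE] mode, so the output may include extra framing text around the transcript.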

Labels: examples, python (python script changes)
3 participants