mtmd : add support for Voxtral #14862

Merged: 11 commits into ggml-org:master on Jul 28, 2025

Conversation

@ngxson (Collaborator) commented Jul 24, 2025

Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"

Output:

Here's the transcribed text from the audio:

"I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the South with. With this faith, we will be able to hew out of the mountain of despair a stone of hope."

Pre-quantized model

https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

How to convert from safetensors

# download https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
cd Voxtral-Mini-3B-2507
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 .
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile mmproj-model.gguf --outtype f16 --mmproj .
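
After both commands finish, a quick sanity check is to read the two files back with the gguf-py package (a minimal sketch, not part of this PR; it assumes gguf-py from the llama.cpp repo is installed, e.g. pip install gguf, and that the output names above were kept):

# Sketch: confirm both converted files load and report an architecture.
from gguf import GGUFReader

for path in ("model.gguf", "mmproj-model.gguf"):
    reader = GGUFReader(path)
    arch = reader.fields["general.architecture"]
    # for string fields, the last part holds the raw UTF-8 bytes of the value
    print(f"{path}: arch={bytes(arch.parts[-1]).decode('utf-8')}, tensors={len(reader.tensors)}")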

This PR is made possible with help from:

Disclaimer: due to mental health issues, I have a very limited amount of time to finish this PR.
Suggestions and testing are welcome, but keep in mind that I won't have time to address all possible issues.

Comment on lines 2307 to 2311
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
@ngxson (Collaborator, Author) commented Jul 24, 2025

The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a Jinja version somewhere inside llama.cpp for compatibility reasons.
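
As a quick way to confirm the template really ends up in the converted file, it can be read back with gguf-py (a minimal sketch, not part of this PR; it assumes gguf-py is installed and that add_chat_template() stores the template under the usual tokenizer.chat_template key):

from gguf import GGUFReader

reader = GGUFReader("model.gguf")
field = reader.fields.get("tokenizer.chat_template")
if field is not None:
    # for string fields, the last part holds the raw UTF-8 bytes of the value
    print(bytes(field.parts[-1]).decode("utf-8")[:300])
else:
    print("no chat template embedded")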

A contributor commented:

A quick look at the tests in mistral-common seems to show that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens to wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd).

It also seems like Mistral adds a special token for transcription tasks: based on the tests, an optional language tag followed by a [TRANSCRIBE] token.

@ngxson (Collaborator, Author) replied:

Hmm, yeah, we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required though; it seems like the model is also able to summarize the audio, not just transcribe it.

@ngxson (Collaborator, Author) commented Jul 25, 2025

Fun thing I've just found out: without the [BEGIN_AUDIO] token, the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.

The contributor replied:

it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand from the HF page, it basically has two modes:

Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly

Without the [TRANSCRIBE] it can understand and talk about the audio, like you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
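
To make the difference concrete, here is a rough sketch of the two prompt shapes in Python string form (not from this PR; the transcription form matches the format quoted later in this thread, while the placement of the user text in the chat form is an assumption):

# number of expected audio frame tokens; depends on clip length
# (the logs in this thread show 187 tokens for a ~30 s clip)
n_audio = 187

# Chat / understanding mode: audio wrapped in [BEGIN_AUDIO], user text inside
# the same [INST] block (assumed placement), normal assistant reply follows.
chat_prompt = (
    "<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * n_audio
    + "What is this recording about?[/INST]"
)

# Dedicated transcription mode: no user text; an optional language hint plus
# [TRANSCRIBE] is appended as an assistant prefix after [/INST].
transcribe_prompt = (
    "<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * n_audio
    + "[/INST]lang:en[TRANSCRIBE]"
)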

@Kreijstal commented:

How do I use this to transcribe audio à la whisper.cpp?

@ngxson (Collaborator, Author) commented Jul 27, 2025

This PR is pretty much ready for review.

Bonus: b828887 fixes the conversion of newer Mistral models which do not have tokenizer.json, such as https://huggingface.co/mistralai/Devstral-Small-2507

@ngxson requested a review from ggerganov on Jul 27, 2025 at 21:52

@ngxson (Collaborator, Author) commented Jul 27, 2025

Kindly asking for some quants from my friends @bartowski1182 @danielhanchen 😄

@github-actions bot added the documentation (Improvements or additions to documentation) label on Jul 27, 2025
@bartowski1182 (Contributor) commented:

I don't usually like quanting from unreleased, but I can test it out to see if it works :)

@ngxson (Collaborator, Author) commented Jul 27, 2025

@bartowski1182 Aha, yeah, good point! Indeed, the text model is pretty much exactly the same as Devstral, so I don't think it's going to change much. You can pre-quantize the Voxtral 24B right now, as it might take some time, and upload it once this PR is merged.

The only thing that may change is the mmproj, but it's simple to re-generate.

@bartowski1182 (Contributor) commented:

Awesome, sounds good!

Super excited for the native Mistral support, that's a game changer.

@ngxson (Collaborator, Author) commented Jul 27, 2025

Test result:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M

@ggerganov (Member) commented:

Downloading the model and will run some tests now.

@ggerganov (Member) commented:

Ran some tests with audio samples that I normally use with Whisper and results look good.

I noticed only one failure using the "micro machines" sample: https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav

Voxtral does not seem to recognize the speech at all. Not sure if this is a problem in the implementation or in the model:

make -j && ./bin/llama-mtmd-cli -m ../models/voxtral-3b/ggml-model-f16.gguf --mmproj ../models/voxtral-3b/mmproj.gguf --audio ./micro-machines.wav -p "Transcribe this audio" --top-k 1

load_hparams: model size:         1267.53 MiB
load_hparams: metadata size:      0.17 MiB
alloc_compute_meta:      Metal compute buffer size =   200.96 MiB
alloc_compute_meta:        CPU compute buffer size =     3.66 MiB
init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
main: loading model: ../models/voxtral-3b/ggml-model-f16.gguf
encoding audio slice...
audio slice encoded in 718 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 129 ms

Sure, please provide the audio file or the transcript of the audio you'd like me to transcribe.

@ggerganov (Member) commented:

If I change the prompt from "Transcribe this audio" to "Transcribe" it works:

make -j && ./bin/llama-mtmd-cli -m ../models/voxtral-3b/ggml-model-f16.gguf --mmproj ../models/voxtral-3b/mmproj.gguf --audio ./micro-machines.wav -p "Transcribe" --top-k 1

...

### Transcription

**Speaker 1:**
"This is the Micro-Machine presenting the most minute motorcade of Micro-Machines. Each has dramatic tails, terrific trim, precision paint, plus incredible Micro-Machine pocket sets. There's police station, fire station, restaurant, service station, and more. Perfect pocket portables to any place. And there are many mini-places to play with each one comes with its own special edition Micro-Machine vehicle and fantastic features that miraculously move. Raise the boat lift at the marina, man the gun turret at the army base, clean your car at the wash, raise the toll bridge. And these places fit into a Micro-Machine world. Micro-Machine pocket sets so tremendously tiny, perfectly precise, dazzlingly detailed, you'll pocket them all. Micro-Machines and Micro-Machine pocket sets sold separately from Goob. The smaller they are, the better they are."

@ngxson (Collaborator, Author) commented Jul 28, 2025

Thanks for testing. I corrected a mistake where the projector's activation function was incorrectly set to relu.

However, this still does not make the output correct. The model is probably not trained for non-transcription tasks.

The main problem is that for the transcription task, we require a specific prompt with an assistant prefix in the chat template:

"<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * num_expected_frames + "[/INST]lang:en[TRANSCRIBE]"

For now, a small hack is to add this line inside eval_message() in mtmd-cli.cpp:

    auto formatted_chat = common_chat_templates_apply(ctx.tmpls.get(), tmpl_inputs);
    LOG_DBG("formatted_chat.prompt: %s\n", formatted_chat.prompt.c_str());

    formatted_chat.prompt += "lang:en[TRANSCRIBE]"; // ADD THIS LINE

    mtmd_input_text text;

Then re-run the example with an empty prompt, -p " " (leave one space inside the quotes), and it should work correctly.

But I think this can easily be fixed with a new flag --assistant-prefix, to be added in a follow-up PR.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@Kreijstal commented Jul 28, 2025

Is it hard to make this work like whisper.cpp? whisper.cpp has this karaoke script and timestamp generation for SRT... I know this is a weird ask for llama.cpp, but this prevents me from using it for transcription, even if it has better recognition.

@Green-Sky (Collaborator) commented:

Is it hard to make this work like whisper.cpp? whisper.cpp has this karaoke script and timestamp generation for SRT... I know this is a weird ask for llama.cpp, but this prevents me from using it for transcription, even if it has better recognition.

I wonder if their training data contained some timestamp annotations. If so, a grammar might be all you need.

@ngxson merged commit 00fa15f into ggml-org:master on Jul 28, 2025
50 checks passed
@LostRuins (Collaborator) commented:

There is only [BEGIN_AUDIO] but no end audio equivalent?

@isaac-mcfadyen (Contributor) commented Jul 28, 2025

There is only [BEGIN_AUDIO] but no end audio equivalent?

Correct, because the audio is one full message so the usual end-of-message tokens apply. See the tests here for an example.

@danielhanchen (Contributor) commented:

Nice work @ngxson ! Will give this a try!

@stduhpf (Contributor) commented Jul 28, 2025

If anyone is interested in trying the 24B model, I did some basic q8 and q4 quants over there, with fp16 mmproj:
https://huggingface.co/stduhpf/Voxtral-Small-24B-2507-GGUF

@LostRuins (Collaborator) commented:

I don't know if I am using it wrong or if it's a model/implementation issue. I am using official llama-b6029-bin-win-cpu-x64 binaries.

I am loading the quants from this repo: https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/tree/main

I have a 10-second audio clip containing some basic words, attached here (since GitHub only allows mp4 files, I have converted it, but be assured it is a valid MP3).

CrayonsCanMelt.mp4

First, I tried using mtmd-cli: llama-mtmd-cli.exe -m Voxtral-Mini-3B-2507-Q4_K_M.gguf --mmproj mmproj-Voxtral-Mini-3B-2507-Q8_0.gguf --audio 'CrayonsCanMelt.mp3' -p "Transcribe the attached audio"

...
encoding audio slice...
audio slice encoded in 4393 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 1056 ms

I'm unable to transcribe audio files directly, but I can guide you on how to transcribe audio using various tools and methods. Here are some steps you can follow:

Then I tried llama-server.exe, with similar results:

[screenshot of llama-server output]

I tried again with different audio files; occasionally the model might reference something in the audio content, but it never really works properly.

@CISC (Collaborator) commented Jul 30, 2025

I don't know if I am using it wrong or if it's a model/implementation issue. I am using official llama-b6029-bin-win-cpu-x64 binaries.

See #14862 (comment)

@LostRuins (Collaborator) commented:

But according to the Voxtral docs, it's capable of "Built-in Q&A and summarization".

So if it's transcribe-only, it's basically just a more accurate Whisper?

@despairTK commented:

But according to the Voxtral docs, it's capable of "Built-in Q&A and summarization".

So if it's transcribe-only, it's basically just a more accurate Whisper?

Friends, is there any documentation about the transcription operation of the GGUF model?

@stduhpf (Contributor) commented Jul 30, 2025

@LostRuins I'm seeing the same behavior: the model claims it's unable to transcribe audio, but at the same time, if I give it instructions through an audio file, it understands them flawlessly. It seems like it's not really able to distinguish between the audio and text modalities.

@LostRuins (Collaborator) commented:

@stduhpf have you been able to get it to do anything besides transcribe audio to text? I'm wondering if that modality was even trained for.

Qwen Omni 3B was able to estimate the age or gender of the speaker, identify different sounds such as people laughing, a dog barking, or a train whistle, and compare cadence, tone, and emotion.

@stduhpf (Contributor) commented Jul 30, 2025

@stduhpf have you been able to get it to do anything besides transcribe audio to text? I'm wondering if that modality was even trained for.

I tried to make it estimate the accent of the speakers, and most of the time it refused, saying it was unable to process audio files directly; when I forced it to give an answer, it was completely wrong (for example, defaulting to a British accent when the language is English, regardless of the actual accent). So I'm guessing it was really only trained to process the spoken words while ignoring any other audio cues. It doesn't feel much more useful than just hooking up a speech-to-text model to an LLM.

Labels: documentation (Improvements or additions to documentation), examples, python (python script changes)