mtmd : add support for Voxtral #14862

Merged: 11 commits into ggml-org:master on Jul 28, 2025

Conversation

@ngxson (Collaborator) commented Jul 24, 2025

Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"

Output:

Here's the transcribed text from the audio:

"I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the South with. With this faith, we will be able to hew out of the mountain of despair a stone of hope."

Pre-quantized model

https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

How to convert from safetensors

# download https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
cd Voxtral-Mini-3B-2507
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 .
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile mmproj-model.gguf --outtype f16 --mmproj .
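
After both commands finish, a quick sanity check is to read the two files back with the gguf-py package (a minimal sketch, not part of this PR; it assumes gguf-py from the llama.cpp repo is installed, e.g. pip install gguf, and that the output names above were kept):

# Sketch: confirm both converted files load and report an architecture.
from gguf import GGUFReader

for path in ("model.gguf", "mmproj-model.gguf"):
    reader = GGUFReader(path)
    arch = reader.fields["general.architecture"]
    # for string fields, the last part holds the raw UTF-8 bytes of the value
    print(f"{path}: arch={bytes(arch.parts[-1]).decode('utf-8')}, tensors={len(reader.tensors)}")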

This PR is made possible with help from:

Disclaimer: due to mental health issues, I have a very limited amount of time to finish this PR.
Suggestions and testing are welcome, but keep in mind that I won't have time to address all possible issues.

Comment on lines 2307 to 2311
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
@ngxson (Collaborator, Author) commented Jul 24, 2025

The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a Jinja version somewhere inside llama.cpp for compatibility reasons.
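
As a quick way to confirm the template really ends up in the converted file, it can be read back with gguf-py (a minimal sketch, not part of this PR; it assumes gguf-py is installed and that add_chat_template() stores the template under the usual tokenizer.chat_template key):

from gguf import GGUFReader

reader = GGUFReader("model.gguf")
field = reader.fields.get("tokenizer.chat_template")
if field is not None:
    # for string fields, the last part holds the raw UTF-8 bytes of the value
    print(bytes(field.parts[-1]).decode("utf-8")[:300])
else:
    print("no chat template embedded")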

A contributor commented:

A quick look at the tests in mistral-common seems to show that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens to wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd).

It also seems like Mistral adds a special token for transcription tasks: based on the tests, an optional language tag followed by a [TRANSCRIBE] token.

@ngxson (Collaborator, Author) replied:

Hmm, yeah, we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required though; it seems like the model is also able to summarize the audio, not just transcribe it.

@ngxson (Collaborator, Author) commented Jul 25, 2025

Fun thing I've just found out: without the [BEGIN_AUDIO] token, the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.

The contributor replied:

it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand from the HF page, it basically has two modes:

Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly

Without the [TRANSCRIBE] it can understand and talk about the audio, like you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
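
To make the difference concrete, here is a rough sketch of the two prompt shapes in Python string form (not from this PR; the transcription form matches the format quoted later in this thread, while the placement of the user text in the chat form is an assumption):

# number of expected audio frame tokens; depends on clip length
# (the logs in this thread show 187 tokens for a ~30 s clip)
n_audio = 187

# Chat / understanding mode: audio wrapped in [BEGIN_AUDIO], user text inside
# the same [INST] block (assumed placement), normal assistant reply follows.
chat_prompt = (
    "<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * n_audio
    + "What is this recording about?[/INST]"
)

# Dedicated transcription mode: no user text; an optional language hint plus
# [TRANSCRIBE] is appended as an assistant prefix after [/INST].
transcribe_prompt = (
    "<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * n_audio
    + "[/INST]lang:en[TRANSCRIBE]"
)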

@Kreijstal commented:

How do I use this to transcribe audio à la whisper.cpp?

@ngxson (Collaborator, Author) commented Jul 27, 2025

This PR is pretty much ready for review.

Bonus: b828887 fixes the conversion of newer Mistral models which do not have tokenizer.json, such as https://huggingface.co/mistralai/Devstral-Small-2507

@ngxson requested a review from ggerganov on Jul 27, 2025 at 21:52

@ngxson (Collaborator, Author) commented Jul 27, 2025

Kindly asking for some quants from my friends @bartowski1182 @danielhanchen 😄

@github-actions bot added the documentation (Improvements or additions to documentation) label on Jul 27, 2025
@bartowski1182 (Contributor) commented:

I don't usually like quanting from unreleased, but I can test it out to see if it works :)

@ngxson (Collaborator, Author) commented Jul 27, 2025

@bartowski1182 Aha, yeah, good point! Indeed, the text model is pretty much exactly the same as Devstral, so I don't think it's going to change much. You can pre-quantize the Voxtral 24B right now, as it might take some time, and upload it once this PR is merged.

The only thing that may change is the mmproj, but it's simple to re-generate.

@bartowski1182 (Contributor) commented:

Awesome, sounds good!

Super excited for the native Mistral support, that's a game changer.

@ngxson (Collaborator, Author) commented Jul 27, 2025

Test result:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M

@ggerganov (Member) commented:

Downloading the model and will run some tests now.

@ggerganov (Member) commented:

Ran some tests with audio samples that I normally use with Whisper and results look good.

I noticed only one failure using the "micro machines" sample: https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav

Voxtral does not seem to recognize the speech at all. Not sure if this is a problem in the implementation or in the model:

make -j && ./bin/llama-mtmd-cli -m ../models/voxtral-3b/ggml-model-f16.gguf --mmproj ../models/voxtral-3b/mmproj.gguf --audio ./micro-machines.wav -p "Transcribe this audio" --top-k 1

load_hparams: model size:         1267.53 MiB
load_hparams: metadata size:      0.17 MiB
alloc_compute_meta:      Metal compute buffer size =   200.96 MiB
alloc_compute_meta:        CPU compute buffer size =     3.66 MiB
init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
main: loading model: ../models/voxtral-3b/ggml-model-f16.gguf
encoding audio slice...
audio slice encoded in 718 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 129 ms

Sure, please provide the audio file or the transcript of the audio you'd like me to transcribe.

@ggerganov (Member) commented:

If I change the prompt from "Transcribe this audio" to "Transcribe" it works:

make -j && ./bin/llama-mtmd-cli -m ../models/voxtral-3b/ggml-model-f16.gguf --mmproj ../models/voxtral-3b/mmproj.gguf --audio ./micro-machines.wav -p "Transcribe" --top-k 1

...

### Transcription

**Speaker 1:**
"This is the Micro-Machine presenting the most minute motorcade of Micro-Machines. Each has dramatic tails, terrific trim, precision paint, plus incredible Micro-Machine pocket sets. There's police station, fire station, restaurant, service station, and more. Perfect pocket portables to any place. And there are many mini-places to play with each one comes with its own special edition Micro-Machine vehicle and fantastic features that miraculously move. Raise the boat lift at the marina, man the gun turret at the army base, clean your car at the wash, raise the toll bridge. And these places fit into a Micro-Machine world. Micro-Machine pocket sets so tremendously tiny, perfectly precise, dazzlingly detailed, you'll pocket them all. Micro-Machines and Micro-Machine pocket sets sold separately from Goob. The smaller they are, the better they are."

@ngxson (Collaborator, Author) commented Jul 28, 2025

Thanks for testing. I corrected a mistake where the projector's activation function was incorrectly set to relu.

However, this still does not make the output correct. The model is probably not trained for non-transcription tasks.

The main problem is that for the transcription task, we require a specific prompt with an assistant prefix in the chat template:

"<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * num_expected_frames + "[/INST]lang:en[TRANSCRIBE]"

For now, a small hack is to add this line inside eval_message() in mtmd-cli.cpp:

    auto formatted_chat = common_chat_templates_apply(ctx.tmpls.get(), tmpl_inputs);
    LOG_DBG("formatted_chat.prompt: %s\n", formatted_chat.prompt.c_str());

    formatted_chat.prompt += "lang:en[TRANSCRIBE]"; // ADD THIS LINE

    mtmd_input_text text;

Then re-run the example with an empty prompt, -p " " (leave one space inside the quotes), and it should work correctly.

But I think this can easily be fixed with a new flag --assistant-prefix, to be added in a follow-up PR.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@Kreijstal commented Jul 28, 2025

Is it hard to make this work like whisper.cpp? whisper.cpp has this karaoke script and timestamp generation for SRT... I know this is a weird ask for llama.cpp, but this prevents me from using it for transcription, even if it has better recognition.

@Green-Sky (Collaborator) commented:

Is it hard to make this work like whisper.cpp? whisper.cpp has this karaoke script and timestamp generation for SRT... I know this is a weird ask for llama.cpp, but this prevents me from using it for transcription, even if it has better recognition.

I wonder if their training data contained some timestamp annotations. If so, a grammar might be all you need.

@ngxson merged commit 00fa15f into ggml-org:master on Jul 28, 2025
50 checks passed
@LostRuins (Collaborator) commented:

There is only [BEGIN_AUDIO] but no end audio equivalent?

@isaac-mcfadyen (Contributor) commented Jul 28, 2025

There is only [BEGIN_AUDIO] but no end audio equivalent?

Correct, because the audio is one full message so the usual end-of-message tokens apply. See the tests here for an example.

@danielhanchen (Contributor) commented:

Nice work @ngxson ! Will give this a try!

@stduhpf (Contributor) commented Jul 28, 2025

If anyone is interested in trying the 24B model, I did some basic q8 and q4 quants over there, with fp16 mmproj:
https://huggingface.co/stduhpf/Voxtral-Small-24B-2507-GGUF

@LostRuins (Collaborator) commented:

I don't know if I am using it wrong or if it's a model/implementation issue. I am using official llama-b6029-bin-win-cpu-x64 binaries.

I am loading the quants from this repo: https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/tree/main

I have a 10-second audio clip containing some basic words, attached here (since GitHub only allows mp4 files, I have converted it, but be assured it is a valid MP3).

CrayonsCanMelt.mp4

First, I tried using mtmd-cli: llama-mtmd-cli.exe -m Voxtral-Mini-3B-2507-Q4_K_M.gguf --mmproj mmproj-Voxtral-Mini-3B-2507-Q8_0.gguf --audio 'CrayonsCanMelt.mp3' -p "Transcribe the attached audio"

...
encoding audio slice...
audio slice encoded in 4393 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 1056 ms

I'm unable to transcribe audio files directly, but I can guide you on how to transcribe audio using various tools and methods. Here are some steps you can follow:

Then I tried llama-server.exe, with similar results:

[screenshot of llama-server output]

I tried again with different audio files; occasionally the model might reference something in the audio content, but it never really works properly.

@CISC (Collaborator) commented Jul 30, 2025

I don't know if I am using it wrong or if it's a model/implementation issue. I am using official llama-b6029-bin-win-cpu-x64 binaries.

See #14862 (comment)

@LostRuins (Collaborator) commented:

But according to the Voxtral docs, it's capable of "Built-in Q&A and summarization".

So if it's transcribe-only, it's basically just a more accurate Whisper?

@despairTK commented:

But according to the Voxtral docs, it's capable of "Built-in Q&A and summarization".

So if it's transcribe-only, it's basically just a more accurate Whisper?

Friends, is there any documentation about the transcription operation of the GGUF model?

@stduhpf (Contributor) commented Jul 30, 2025

@LostRuins I'm seeing the same behavior: the model claims it's unable to transcribe audio, but at the same time, if I give it instructions through an audio file, it understands them flawlessly. It seems like it's not really able to distinguish between the audio and text modalities.

@LostRuins (Collaborator) commented:

@stduhpf have you been able to get it to do anything besides transcribe audio to text? I'm wondering if that modality was even trained for.

Qwen Omni 3B was able to estimate the age or gender of the speaker, identify different sounds such as people laughing, a dog barking, or a train whistle, and compare cadence, tone, and emotion.

@stduhpf (Contributor) commented Jul 30, 2025

@stduhpf have you been able to get it to do anything besides transcribe audio to text? I'm wondering if that modality was even trained for.

I tried to make it estimate the accent of the speakers, and most of the time it refused, saying it was unable to process audio files directly; when I forced it to give an answer, it was completely wrong (for example, defaulting to a British accent when the language is English, regardless of the actual accent). So I'm guessing it was really only trained to process the spoken words while ignoring any other audio cues. It doesn't feel much more useful than just hooking up a speech-to-text model to an LLM.

Labels: documentation (Improvements or additions to documentation), examples, python (python script changes)