mtmd : add support for Voxtral #14862

Open · wants to merge 9 commits into master

Conversation

@ngxson (Collaborator) commented Jul 24, 2025

Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"

Output:

Here's the transcribed text from the audio:

"I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the South with. With this faith, we will be able to hew out of the mountain of despair a stone of hope."

Pre-quantized model

https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF

llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

How to convert from safetensors

# download https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
cd Voxtral-Mini-3B-2507
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 .
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile mmproj-model.gguf --outtype f16 --mmproj .
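
To sanity-check the converted files, one option (not part of the PR) is the gguf Python package that ships with llama.cpp (e.g. pip install gguf); a minimal sketch, assuming the two output files above:

# Sketch: inspect the converted GGUF files with the gguf Python package.
from gguf import GGUFReader

for path in ("model.gguf", "mmproj-model.gguf"):
    reader = GGUFReader(path)
    # Print the tensor count and the first few metadata keys as a smoke test.
    print(path, "->", len(reader.tensors), "tensors")
    for key in list(reader.fields)[:8]:
        print("  ", key)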

This PR was made possible with help from:

Disclaimer: due to mental health issues, I have a very limited amount of time to finish this PR.
Suggestions and testing are welcome, but keep in mind that I won't have time to address every possible issue.

Comment on lines 2307 to 2311
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
@ngxson (Collaborator, Author) commented Jul 24, 2025

The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a jinja version somewhere inside llama.cpp for compatibility reasons.
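
For illustration only: once a jinja template is stored, applying it is ordinary jinja rendering. The sketch below uses the Python jinja2 package with a toy template standing in for the stored .jinja file; llama.cpp has its own built-in jinja engine, and the real Devstral template is much larger and may need extra helpers (e.g. raise_exception) that plain jinja2 does not define.

# Sketch: rendering a chat template with plain jinja2 (illustrative only).
from jinja2 import Template

# Toy stand-in for the stored template file; [INST] is genuine Mistral style.
template = Template("{{ bos_token }}[INST] {{ messages[0]['content'] }} [/INST]")
messages = [{"role": "user", "content": "Transcribe this audio"}]
prompt = template.render(messages=messages, bos_token="<s>")
print(prompt)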

A contributor commented:

A quick look at the tests in mistral-common suggests that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens that wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd).

It also seems like Mistral adds a special token for transcription tasks: based on the tests, it looks like an optional language tag followed by a [TRANSCRIBE] token.
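
For reference, the layout implied by that description would presumably be something like the sketch below; the exact token order is an assumption drawn from the mistral-common tests, not something verified in this PR:

[BEGIN_AUDIO] <audio embeddings ...> lang:en [TRANSCRIBE]

where lang:en is the optional language part and [TRANSCRIBE] switches the model into pure transcription mode.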

@ngxson (Collaborator, Author) commented:

Hmm, yeah, we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required, though; it seems like the model is also able to summarize the audio, not just transcribe it.

@ngxson (Collaborator, Author) commented Jul 25, 2025

Fun thing I've just found out: without [BEGIN_AUDIO], the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.

A contributor commented:

> it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand of the HF page, it basically has two modes:

> Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.

Without [TRANSCRIBE] it can understand and talk about the audio, as you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
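
If the dedicated [TRANSCRIBE] path isn't wired up here, a rough workaround is to lean on instruction following from the CLI; a sketch reusing the flags from the test at the top of this PR (output quality is not guaranteed to match the dedicated mode):

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio verbatim. Output only the transcription."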

@Kreijstal commented:

How to use this to transcribe audio à la whisper.cpp?

@ngxson (Collaborator, Author) commented Jul 27, 2025

This PR is pretty much ready for review.

Bonus: b828887 fixes the conversion of newer Mistral models that do not ship a tokenizer.json, such as https://huggingface.co/mistralai/Devstral-Small-2507

@ngxson requested a review from ggerganov · Jul 27, 2025 21:52
@ngxson (Collaborator, Author) commented Jul 27, 2025

Kindly asking for some quants from my friends @bartowski1182 @danielhanchen 😄

The github-actions bot added the documentation (Improvements or additions to documentation) label · Jul 27, 2025
@bartowski1182 (Contributor) commented:

I don't usually like quanting from unreleased code, but I can test it out to see if it works :)

@ngxson (Collaborator, Author) commented Jul 27, 2025

@bartowski1182 Aha, yeah, good point! Indeed, the text model is pretty much exactly the same as Devstral, so I don't think it's going to change much. You can pre-quantize the Voxtral 24B right now, since it might take some time, and upload it once this PR is merged.

The only thing that may change is the mmproj, but it's simple to re-generate.
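
For reference, re-quantizing from an f16 conversion follows the usual llama.cpp flow; a sketch with placeholder paths (the mmproj file is typically left at f16 rather than quantized):

llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M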

@bartowski1182 (Contributor) commented:

Awesome, sounds good!

Super excited for the native Mistral support, that's a game changer.

@ngxson (Collaborator, Author) commented Jul 27, 2025

Test results:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M

Labels: documentation, examples, python

4 participants