
mtmd : add support for Voxtral #14862


Status: Open · wants to merge 3 commits into master
Conversation

@ngxson (Collaborator) commented on Jul 24, 2025:

Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"

Output:

Here's the transcribed text from the audio:

"I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the South with. With this faith, we will be able to hew out of the mountain of despair a stone of hope."

Pre-quantized model

https://huggingface.co/mistralai/Voxtral-Mini-3B-2507

llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

How to convert from safetensors

# download https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
cd Voxtral-Mini-3B-2507
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 .
python3 ~/llama.cpp/convert_hf_to_gguf.py --outfile mmproj-model.gguf --outtype f16 --mmproj .

This PR was made possible with help from:

Disclaimer: due to mental health issues, I have a very limited amount of time to finish this PR.
Suggestions and testing are welcome, but keep in mind that I won't have time to address all possible issues.

Comment on lines +2307 to +2311
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
@ngxson (Collaborator, Author) commented on Jul 24, 2025:

The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a Jinja version somewhere inside llama.cpp for compatibility reasons.

A contributor commented:

A quick look at the tests in mistral-common seems to show that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens to wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd).

It also seems like Mistral adds a special token for transcription tasks, which seems to be an optional language and then a [TRANSCRIBE] token based on the tests.

@ngxson (Collaborator, Author) replied:

Hmm yeah we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required though, it seems like the model is also able to summarize the audio, not just transcribe it.

@ngxson (Collaborator, Author) commented on Jul 25, 2025:

Fun thing I've just found out: without the [BEGIN_AUDIO] token, the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.

The contributor replied:

it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand from the HF page, it basically has two modes:

Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly

Without the [TRANSCRIBE] it can understand and talk about the audio, like you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
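For illustration only, the two modes described above would correspond to prompts roughly like the following. The exact token layout is an assumption based on the mistral-common tests mentioned in this thread and has not been verified against this PR:

```
# Chat mode: the model can discuss or summarize the audio
[INST][BEGIN_AUDIO]<audio embeddings>What is this recording about?[/INST]

# Dedicated transcription mode: optional language hint, then [TRANSCRIBE]
[INST][BEGIN_AUDIO]<audio embeddings>lang:en[TRANSCRIBE][/INST]
```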

@Kreijstal commented:
How to use this to transcribe audio à la whisper.cpp?
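Based on the commands earlier in this PR (the repo name and prompt are taken from the examples above; the exact output is not guaranteed), a minimal transcription run would presumably look like:

```shell
# Fetch the pre-quantized GGUF from Hugging Face and transcribe a local audio file
llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF \
    --audio input.mp3 \
    -p "Transcribe this audio"
```

Note that, per the discussion above, this is chat-style transcription rather than the dedicated [TRANSCRIBE] mode, so the output may include extra framing text around the transcript.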

Labels: examples, python (python script changes)
3 participants