mtmd : add support for Voxtral #14862
```python
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
```
The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a Jinja version somewhere inside llama.cpp for compatibility reasons.
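For what it's worth, a quick way to confirm the template actually got embedded after conversion is to dump the GGUF metadata: `add_chat_template` writes the standard `tokenizer.chat_template` key, and the gguf Python package ships a `gguf-dump` script. The model path below is a placeholder:

```sh
# Dump GGUF metadata and look for the embedded Jinja chat template
# (gguf-dump comes with `pip install gguf`; model.gguf is a placeholder path)
gguf-dump model.gguf | grep -A 2 "tokenizer.chat_template"
```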
A quick look at the tests in mistral-common seems to show that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens to wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd). It also seems like Mistral adds a special token for transcription tasks: based on the tests, an optional language marker followed by a [TRANSCRIBE] token.
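For concreteness, the two request shapes in those tests appear to be roughly the following; the [AUDIO] placeholder tokens, the spacing, and the lang: prefix are my reading of the tests, not verified output:

```
# Chat / audio-understanding request:
[INST][BEGIN_AUDIO][AUDIO]...[AUDIO]Transcribe this audio[/INST]

# Dedicated transcription request (optional language, then the special token):
[INST][BEGIN_AUDIO][AUDIO]...[AUDIO]lang:en[TRANSCRIBE][/INST]
```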
Hmm yeah, we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required though; it seems like the model is also able to summarize the audio, not just transcribe it.
Fun thing that I've just found out: without the [BEGIN_AUDIO] token, the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.
> it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand from the HF page, it basically has two modes:

> Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.

Without the [TRANSCRIBE] token it can understand and talk about the audio, like you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
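In case someone wants to try transcription mode once the token is wired up, a hypothetical invocation might look like the following; whether llama-mtmd-cli tokenizes [TRANSCRIBE] in -p as a special token is untested here, so treat this as a sketch, not a verified workflow:

```sh
# Hypothetical transcription-style run; assumes special tokens in -p are
# parsed as such, which is not confirmed by this PR
llama-mtmd-cli -m model.gguf --mmproj mmproj-model.gguf \
    --audio input.mp3 -p "lang:en[TRANSCRIBE]"
```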
How to use this to transcribe audio à la whisper.cpp?
Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
```sh
llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"
```
Output:
Pre-quantized model
https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
How to convert from safetensors
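The exact commands are not shown here; as a sketch, the usual convert_hf_to_gguf.py flow for mtmd models would look something like the following, assuming Voxtral follows the same --mmproj convention as other recent mtmd conversions (paths are placeholders):

```sh
# Sketch, not taken verbatim from the PR: text model conversion
python convert_hf_to_gguf.py ../models/Voxtral-Mini-3B-2507 \
    --outfile ../models/Voxtral-Mini-3B-2507/model.gguf

# Audio projector (mmproj) file for mtmd
python convert_hf_to_gguf.py ../models/Voxtral-Mini-3B-2507 --mmproj \
    --outfile ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf
```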
This PR was made possible with help from: