mtmd : add support for Voxtral #14862
Conversation
convert_hf_to_gguf.py (outdated)

```python
# embed a bundled jinja chat template into the GGUF, since newer
# Mistral repos no longer ship one
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
```
The fact that Mistral wants to delegate chat formatting to their Python library mistral-common makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even for transformers models. Therefore, we have no better option than to store a jinja version somewhere inside llama.cpp for compatibility reasons.
A quick look at the tests in mistral-common seems to show that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens to wrap the input audio (not sure if those are being added here or not? Not super familiar with mtmd).

It also seems like Mistral adds a special token for transcription tasks: an optional language hint followed by a [TRANSCRIBE] token, based on the tests.
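To make that format concrete, here is a minimal sketch of such a suffix builder, assuming the `lang:xx` + `[TRANSCRIBE]` layout described above (the helper is hypothetical, not code from this PR):

```cpp
#include <string>

// Hypothetical helper mirroring the suffix format seen in the
// mistral-common tests: an optional "lang:xx" hint, then [TRANSCRIBE].
static std::string transcribe_suffix(const std::string & lang = "") {
    std::string s;
    if (!lang.empty()) {
        s += "lang:" + lang; // e.g. "lang:en"
    }
    s += "[TRANSCRIBE]";
    return s;
}
```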
Hmm yeah, we're missing the [BEGIN_AUDIO] token. I have no idea if [TRANSCRIBE] is required though; it seems like the model is also able to summarize the audio, not just transcribe it.
Fun thing that I've just found out: without the [BEGIN_AUDIO] token, the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.
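As a rough sketch of the missing piece, assuming `[BEGIN_AUDIO]` is registered as a special token (names illustrative; the real fix belongs in mtmd's tokenization path, and this fragment assumes an existing `lctx` context):

```cpp
// Prepend the [BEGIN_AUDIO] marker before the audio embeddings so the
// model can tell that the following chunk is audio rather than text.
std::vector<llama_token> marker = common_tokenize(
    lctx, "[BEGIN_AUDIO]", /*add_special=*/false, /*parse_special=*/true);
llama_decode(lctx, llama_batch_get_one(marker.data(), (int32_t) marker.size()));
// ... then evaluate the audio embedding batch as usual ...
```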
> it seems like the model is also able to summarize the audio, not just transcribe it.

From what I understand from the HF page, it basically has two modes:

> Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.

Without [TRANSCRIBE] it can understand and talk about the audio, like you've shown. In [TRANSCRIBE] mode it only outputs the transcription, similar to Whisper, I believe.
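Putting the two modes side by side, as I understand them from this thread and the HF page (an unverified sketch; the exact token order may differ):

```cpp
// Illustrative prompt shapes only (not verified against mistral-common):
//
//   chat / Q&A / summarize:
//     <s>[INST][BEGIN_AUDIO]<audio embeds> Summarize this audio[/INST]
//
//   dedicated transcription:
//     <s>[INST][BEGIN_AUDIO]<audio embeds>[/INST]lang:en[TRANSCRIBE]
```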
How to use this to transcribe audio à la whisper.cpp?
This PR is pretty much ready for review. Bonus: b828887 fixes the conversion of newer Mistral models which do not have …
Kindly asking for some quants from my friends @bartowski1182 @danielhanchen 😄
I don't usually like quanting from unreleased, but I can test it out to see if it works :)
@bartowski1182 Aha yeah, good point! Indeed, the text model is … The only thing that may change is the mmproj, but it's simple to re-generate.
Awesome, sounds good! Super excited for the native Mistral support, that's a game changer.
Test result:
Downloading the model and will run some tests now.
Ran some tests with audio samples that I normally use with Whisper, and the results look good. I noticed only one failure, using the "micro machines" sample: https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav

Voxtral does not seem to recognize the speech at all. Not sure if this is a problem in the implementation or in the model:

```sh
make -j && ./bin/llama-mtmd-cli -m ../models/voxtral-3b/ggml-model-f16.gguf --mmproj ../models/voxtral-3b/mmproj.gguf --audio ./micro-machines.wav -p "Transcribe this audio" --top-k 1
```

```
load_hparams: model size: 1267.53 MiB
load_hparams: metadata size: 0.17 MiB
alloc_compute_meta: Metal compute buffer size = 200.96 MiB
alloc_compute_meta: CPU compute buffer size = 3.66 MiB
init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
main: loading model: ../models/voxtral-3b/ggml-model-f16.gguf
encoding audio slice...
audio slice encoded in 718 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 129 ms

Sure, please provide the audio file or the transcript of the audio you'd like me to transcribe.
```
If I change the prompt from "Transcribe this audio" to just "Transcribe", the transcription works:

```sh
make -j && ./bin/llama-mtmd-cli -m ../models/voxtral-3b/ggml-model-f16.gguf --mmproj ../models/voxtral-3b/mmproj.gguf --audio ./micro-machines.wav -p "Transcribe" --top-k 1
```

```
...
### Transcription

**Speaker 1:**
"This is the Micro-Machine presenting the most minute motorcade of Micro-Machines. Each has dramatic tails, terrific trim, precision paint, plus incredible Micro-Machine pocket sets. There's police station, fire station, restaurant, service station, and more. Perfect pocket portables to any place. And there are many mini-places to play with each one comes with its own special edition Micro-Machine vehicle and fantastic features that miraculously move. Raise the boat lift at the marina, man the gun turret at the army base, clean your car at the wash, raise the toll bridge. And these places fit into a Micro-Machine world. Micro-Machine pocket sets so tremendously tiny, perfectly precise, dazzlingly detailed, you'll pocket them all. Micro-Machines and Micro-Machine pocket sets sold separately from Goob. The smaller they are, the better they are."
```
Thanks for testing. I corrected a mistake where the projector's activation function was incorrectly set to … However, it still does not make the output correct; probably the model is just not trained for non-transcription prompts. The main problem is that for the transcription task, we require a specific prompt with an assistant prefix in the chat template.
For now, a small hack is to add this line inside the mtmd CLI example:

```cpp
auto formatted_chat = common_chat_templates_apply(ctx.tmpls.get(), tmpl_inputs);
LOG_DBG("formatted_chat.prompt: %s\n", formatted_chat.prompt.c_str());
formatted_chat.prompt += "lang:en[TRANSCRIBE]"; // ADD THIS LINE

mtmd_input_text text;
```

And re-run the example with an empty prompt. But I think this can easily be fixed with a new flag.
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Is it hard to make this work in whisper.cpp? whisper.cpp has the karaoke script and timestamp generation for SRT. I know this is a weird ask for llama.cpp, but the lack of timestamps prevents me from using it for transcription, even if it has better recognition.
I wonder if their training data contained some timestamp annotations. If so, a grammar might be all you need.
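If that turns out to be true, a llama.cpp GBNF grammar could force SRT-shaped output. A minimal sketch, assuming (unverified) that the model can emit meaningful timestamps at all:

```cpp
#include <string>

// GBNF grammar (held as a C++ string) constraining output to SRT blocks:
// index, "HH:MM:SS,mmm --> HH:MM:SS,mmm", then one line of text.
const std::string srt_grammar = R"GBNF(
root  ::= block+
block ::= index "\n" time " --> " time "\n" text "\n\n"
index ::= [0-9]+
time  ::= [0-9] [0-9] ":" [0-9] [0-9] ":" [0-9] [0-9] "," [0-9] [0-9] [0-9]
text  ::= [^\n]+
)GBNF";
```

This could then be passed through the common `--grammar` / `--grammar-file` options, assuming the mtmd CLI exposes them.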
There is only [BEGIN_AUDIO] but no end-audio equivalent?
Correct, because the audio is one full message, so the usual end-of-message tokens apply. See the tests here for an example.
Nice work @ngxson ! Will give this a try!
If anyone is interested in trying the 24B model, I did some basic q8 and q4 quants over there, with fp16 mmproj:
I don't know if I am using it wrong or if it's a model/implementation issue. I am using the official llama-b6029-bin-win-cpu-x64 binaries and loading the quants from this repo: https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/tree/main

I have a 10-second audio clip containing some basic words, attached here (since GitHub only allows mp4 files I have converted it, but be assured it is a valid MP3): CrayonsCanMelt.mp4

First, I tried using mtmd-cli, then I tried llama-server.exe, with similar results. I tried again with different audio files; occasionally the model might reference something in the audio content, but it never really works properly.
See #14862 (comment)
But according to the Voxtral docs, it's capable of "Built-in Q&A and summarization". So if it's transcribe-only, is it basically just a more accurate Whisper?
Friends, is there any documentation on running transcription with the GGUF model?
@LostRuins I'm seeing the same behavior: the model claims it's unable to transcribe audio, but at the same time, if I give it instructions through an audio file, it understands them flawlessly. It seems like it's not really able to distinguish between the audio and text modalities.
@stduhpf have you been able to get it to do anything besides transcribe audio to text? I'm wondering if that modality was even trained for. Qwen Omni 3B was able to tell the estimated age or gender of the speaker, could identify different sounds like people laughing, a dog barking, and a train whistle, and could compare cadence, tone, and emotion.
I tried to make it estimate the accent of the speakers, and most of the time it refused, saying it was unable to process audio files directly; when I forced it to give an answer, it was completely wrong (for example, defaulting to a British accent whenever the language is English, regardless of the actual accent). So I'm guessing it was really only trained to process the spoken words while ignoring any other audio cues. It doesn't feel much more useful than just hooking up a speech-to-text model to an LLM.
Tested on https://huggingface.co/mistralai/Voxtral-Mini-3B-2507:

```sh
llama-mtmd-cli -m ../models/Voxtral-Mini-3B-2507/model.gguf \
    --mmproj ../models/Voxtral-Mini-3B-2507/mmproj-model.gguf \
    --audio ../models/i-have-a-dream-30s.mp3 \
    -p "Transcribe this audio"
```

Output:
Pre-quantized model: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
How to convert from safetensors
This PR was made possible with help from: