Add LLaDA 8b Diffusion model #14771
Conversation
Force-pushed from e4b7346 to 5644f2f
I would like to avoid adding a second diffusion example - we are increasing the maintenance effort for no significant benefit. The diffusion architecture is not yet well established. We can think about extending the existing example instead.
Yeah, agree. I initially wrote them as one example. However, passing arguments via CLI for two separate sets of sampling parameters/algorithms was quite confusing to me and would be even more so for the end-user, so for the sake of clarity I wrote them separately.
@ggerganov would having them in the same example, with extra CLI args per model, be acceptable?
Yes, merging the examples into a single example would be better. |
llama: fix llama-model fixup working
Made everything into a single example; please have another look when you have the time.
I think the example can be improved by not branching between "llada" and "dream" and instead having common logic for any diffusion model. This would make it much easier to scale with more diffusion models in the future. Otherwise, the way you've implemented it now, you have to add new structs, sampling types, generation functions, etc. for each new architecture, and this seems a bit unnecessary.
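For illustration, here is a rough sketch (names are hypothetical, not the PR's code) of the shape this suggests: a single generation routine that every diffusion model goes through, with the per-model differences reduced to parameter values:

```cpp
// Sketch only: one entry point for all diffusion models.
// `diffusion_params` stands for a flat, architecture-agnostic parameter
// struct (see the later comments on merging the per-model structs).
static void diffusion_generate(llama_context * ctx,
                               const llama_token * input_tokens,
                               llama_token * output_tokens,
                               int32_t n_input,
                               const diffusion_params & params) {
    for (int32_t step = 0; step < params.steps; ++step) {
        // 1. decode the current, partially masked sequence
        // 2. sample candidate tokens for the masked positions
        // 3. decide what to remask based on params.algorithm
        //    (confidence, entropy, random, ...), not on the architecture
    }
}
```

A new diffusion model would then only need to map its defaults onto `diffusion_params`, rather than adding new structs and generation functions.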
```cpp
).set_examples({ LLAMA_EXAMPLE_DIFFUSION }));

add_opt(common_arg(
    { "--diffusion--dream-eps" }, "F",
```
{ "--diffusion--dream-eps" }, "F", | |
{ "--diffusion-dream-eps" }, "F", |
```cpp
add_opt(common_arg(
    { "--diffusion-llada-algorithm" }, "N",
    string_format("llada remasking algorithm: 0=LOW_CONFIDENCE, 1=RANDOM (default: %d)", params.diffusion.remasking),
    [](common_params & params, int value) { params.diffusion.remasking = value; }
```
The argument names should not be associated with the models. This should be simply `--diffusion-algorithm`.
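If the flag is made model-agnostic as suggested, the registration could look like this (a sketch mirroring the existing `common_arg` pattern from this file, with the help text kept generic):

```cpp
add_opt(common_arg(
    { "--diffusion-algorithm" }, "N",
    string_format("remasking algorithm: 0=LOW_CONFIDENCE, 1=RANDOM (default: %d)", params.diffusion.remasking),
    [](common_params & params, int value) { params.diffusion.remasking = value; }
).set_examples({ LLAMA_EXAMPLE_DIFFUSION }));
```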
```cpp
    int32_t max_length;
    int32_t block_length;
    float   cfg_scale;
    enum diffusion_algorithm_llada algorithm;
```
Optional keywords `enum`, `struct`, `class` should be omitted in C++ code:

```diff
-    enum diffusion_algorithm_llada algorithm;
+    diffusion_algorithm_llada algorithm;
```
```cpp
// For LLaDA models, forcefully add BOS token at the beginning. TODO: check why
if (arch == "llada") {
    llama_token bos_token = llama_vocab_bos(vocab);
    if (bos_token != LLAMA_TOKEN_NULL && (input_tokens.empty() || input_tokens[0] != bos_token)) {
        input_tokens.insert(input_tokens.begin(), bos_token);
    }
}
```
This should be handled by the metadata in the GGUF model. There is a boolean field indicating whether BOS is needed or not.
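A sketch of what that could look like, using the `llama_vocab_get_add_bos()` accessor, which reflects the `tokenizer.ggml.add_bos_token` GGUF field:

```cpp
// Let the model metadata decide, instead of branching on the architecture.
if (llama_vocab_get_add_bos(vocab)) {
    llama_token bos_token = llama_vocab_bos(vocab);
    if (bos_token != LLAMA_TOKEN_NULL && (input_tokens.empty() || input_tokens[0] != bos_token)) {
        input_tokens.insert(input_tokens.begin(), bos_token);
    }
}
```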
```cpp
char arch_str[128];
GGML_ASSERT(llama_model_meta_val_str(model, "general.architecture", arch_str, 128) >= 0);

std::string arch = std::string(arch_str);

if (arch != "dream" && arch != "llada") {
    LOG_ERR("error: unsupported model architecture '%s' for diffusion. Expected 'dream' or 'llada'\n", arch_str);
    llama_model_free(model);
    return 1;
}
```
Can't we check if the model is diffusion using the new API call?
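Assuming the API call referred to here is `llama_model_is_diffusion()`, the architecture-string comparison could reduce to:

```cpp
// Replaces the string comparison against "dream"/"llada".
if (!llama_model_is_diffusion(model)) {
    LOG_ERR("error: model is not a diffusion model\n");
    llama_model_free(model);
    return 1;
}
```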
```cpp
// Dream remasking algorithms
enum diffusion_algorithm_dream {
    ORIGIN       = 0,
    MASKGIT_PLUS = 1,
    TOPK_MARGIN  = 2,
    ENTROPY      = 3,
};

// LLaDA remasking types
enum diffusion_algorithm_llada {
    LOW_CONFIDENCE = 0,
    RANDOM         = 1,
};
```
Is this separation necessary? For example, can we use "RANDOM" sampling with Dream?
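For illustration, a merged enum (hypothetical naming, not the PR's code) would make every remasking strategy selectable for every model:

```cpp
// One shared algorithm space; whether e.g. RANDOM works well with Dream
// becomes a quality question, not a compile-time restriction.
enum diffusion_algorithm {
    DIFFUSION_ALG_ORIGIN         = 0,
    DIFFUSION_ALG_MASKGIT_PLUS   = 1,
    DIFFUSION_ALG_TOPK_MARGIN    = 2,
    DIFFUSION_ALG_ENTROPY        = 3,
    DIFFUSION_ALG_LOW_CONFIDENCE = 4,
    DIFFUSION_ALG_RANDOM         = 5,
};
```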
```cpp
struct dream_diffusion_params : diffusion_params {
    float   eps;
    float   top_p;
    int32_t top_k;
    enum diffusion_algorithm_dream algorithm;
    float   alg_temp;
};

struct llada_diffusion_params : diffusion_params {
    int32_t max_length;
    int32_t block_length;
```
I don't think this separation of the diffusion parameters per architecture is necessary. It should be a single flat `struct diffusion_params` for all models.
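A sketch of such a flat struct (field names taken from the two existing structs, defaults illustrative; assumes a merged `diffusion_algorithm` enum as discussed above):

```cpp
struct diffusion_params {
    int32_t steps        = 64;
    float   eps          = 1e-3f;  // timestep epsilon (used by Dream)
    float   top_p        = 0.95f;
    int32_t top_k        = 0;
    float   alg_temp     = 0.0f;
    int32_t max_length   = 0;      // generation length (used by LLaDA)
    int32_t block_length = 0;      // block schedule (used by LLaDA)
    float   cfg_scale    = 0.0f;   // classifier-free guidance; 0 => off
    diffusion_algorithm algorithm = DIFFUSION_ALG_LOW_CONFIDENCE;
};
```

Fields that a given model does not use would simply keep their defaults.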
Continuing on #14644, this PR adds another diffusion model, https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct, which has different semantics compared to the dream-7b model and overall seems to perform better.
There are very few similarities between how they generate tokens, so for now I've created two different examples: `llama-diffusion-dream-cli` (for the earlier model) and `llama-diffusion-llada-cli` (for running the new LLaDA model). I've added a README as well, and uploaded a GGUF.
Example command:

```
./build/bin/llama-diffusion-llada-cli -m llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0 -ub 128 --diffusion-visual
```
Also, I would like to add this to the server, but I'm not sure what API would be acceptable, so I'm hoping to have a discussion on that as well.