[bugfix] fix megatron model merger #1774

Open

wants to merge 2 commits into main
Conversation

Contributor

@ShareLer ShareLer commented May 30, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Fix megatron model merger.


Specific Changes

  • Fix the get-rank method to support TP-only (no pipeline parallelism) setups.
  • Fix state_dict key names after conversion.
  • Add MLA/MoE conversion support.
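The TP-only rank handling in the first bullet can be sketched as follows. This is a hypothetical illustration of deriving parallel coordinates from a global rank; the function name and rank layout are assumptions, not verl's actual implementation:

```python
# Hypothetical sketch: derive (tp_rank, pp_rank) from a global rank.
# The actual layout in verl/Megatron may differ; this only illustrates
# why a TP-only (pp_size=1) case must still be handled correctly.
def get_parallel_ranks(global_rank: int, tp_size: int, pp_size: int = 1):
    tp_rank = global_rank % tp_size               # fastest-varying dimension
    pp_rank = (global_rank // tp_size) % pp_size  # 0 for every rank when pp_size == 1
    return tp_rank, pp_rank
```

With pp_size left at its default of 1, pp_rank is always 0, so a checkpoint sharded only by tensor parallelism still maps to valid coordinates.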


Usage Example

Provide usage example(s) for easier usage.

# Merge Megatron checkpoint shards into a single checkpoint:
python scripts/model_merger.py merge --backend megatron

Test

Tested with Qwen3-8B and Qwen2.5-7B.

Additional Info.

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

Signed-off-by: ShareLer <ShareLe@163.com>
@vermouth1992 vermouth1992 requested a review from ETOgaosion May 30, 2025 08:49
Collaborator

ETOgaosion commented May 31, 2025

@ShareLer Thanks a lot for helping us fix this, it helps a lot~

Could you briefly point out what causes the vLLM inference failure? It seems to involve quite a lot of refactoring as well. Is it caused by the missing parameter transfer?

Signed-off-by: ShareLer <ShareLe@163.com>
Contributor Author

ShareLer commented Jun 1, 2025

@ShareLer Thanks a lot for helping us fix this, it helps a lot~

Could you briefly point out what causes the vLLM inference failure? It seems to involve quite a lot of refactoring as well. Is it caused by the missing parameter transfer?

Three main reasons:

  1. The converted layer names were not fixed in _merge_state_dicts().
    The converted checkpoint contains model.decoder.layers.xxx, but the expected names are model.layers.xxx.
    This is also the cause of the failure in the mentioned issue.

  2. The QKV names in the attention layers are wrong after conversion.
    linear_qkv is converted to linear_q/linear_k/linear_v, but it should be q_proj/k_proj/v_proj.

  3. The weight of the final output layer is handled incorrectly.
    When is_value_model=False, the output_layer is a ColumnParallelLinear, but its weights from the different TP ranks were not merged in _merge_state_dicts().
    (My first submission ignored the value model, which caused the CI failure; that has just been fixed.)
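The three fixes can be illustrated with a toy sketch. The function names and key patterns below are assumptions for illustration, not the actual verl code, and plain lists stand in for tensors (the real merge would be torch.cat(shards, dim=0)):

```python
# Toy sketch of the three fixes above; all names are illustrative only.

def rename_key(key: str) -> str:
    # Fix 1: drop the extra "decoder." segment from converted layer names.
    key = key.replace("model.decoder.layers.", "model.layers.")
    # Fix 2: map the split QKV names to the expected q/k/v_proj names.
    for src, dst in (("linear_q.", "q_proj."),
                     ("linear_k.", "k_proj."),
                     ("linear_v.", "v_proj.")):
        key = key.replace(src, dst)
    return key

def merge_column_parallel(tp_shards):
    # Fix 3: a ColumnParallelLinear shards its output (row) dimension
    # across TP ranks, so merging concatenates the shard rows back
    # together (torch.cat(tp_shards, dim=0) on real tensors).
    merged = []
    for shard in tp_shards:
        merged.extend(shard)
    return merged
```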

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

Is it possible to add a test that reproduces the issue?

Contributor Author

ShareLer commented Jun 1, 2025

Is it possible to add a test that reproduces the issue?

You can reproduce this problem very simply with the CI script (e.g. the job e2e_ppo_trainer_megatron-qwen3 in e2e_ppo_trainer_megatron.yml); just change the merger's command option: in python scripts/model_merger.py test --backend megatron, change the test operation to merge.

The previous CI tests passed because the test and merge options use different logic:
Both first obtain the merged weights through the _merge_state_dicts() method. At that point, however, the state_dicts still have the three problems described in my previous reply.
Then, in the merge option, the problematic state_dicts are saved directly as the final ckpt. In the test option used by CI, by contrast, the problematic layer names are corrected (the decoder segment is removed and the QKV names are fixed), which is why the bugs went unnoticed.
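The divergence between the two code paths can be shown with a toy sketch. The function names and the single state-dict key below are made up for illustration and do not correspond to verl's actual code:

```python
# Toy illustration of why CI's `test` operation passed while `merge`
# produced a broken checkpoint. All names here are illustrative only.

def buggy_merged_state_dict():
    # Both operations start from the same _merge_state_dicts() output,
    # which still carries the problematic key names.
    return {"model.decoder.layers.0.self_attention.linear_q.weight": "W_q"}

def merge_operation():
    # `merge` saved the problematic dict directly as the final checkpoint.
    return buggy_merged_state_dict()

def test_operation():
    # `test` corrected the keys before comparing against the reference
    # model, which is why CI never caught the bug.
    fixed = {}
    for key, value in buggy_merged_state_dict().items():
        key = key.replace("model.decoder.layers.", "model.layers.")
        key = key.replace("linear_q.", "q_proj.")
        fixed[key] = value
    return fixed
```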
