
[AutoParallel] GPT support shared parameters #10783


Open · wants to merge 7 commits into develop
Conversation

@waliwali777 (Contributor) commented on Jul 1, 2025

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

Other

PR changes

Other

Description

This PR applies the following optimizations to the gpt-13b dynamic semi-auto-parallel model (a placement sketch follows the list):

  1. Support shared-parameter optimization: under pipeline parallelism (pp), the embedding weight and the lm_head weight are shared (see the shared-parameter implementation in [AutoParallel] addd sync param dynamic Paddle#73733), and their sharding states are made identical, both column-split.
  2. Change the sharding placements of the embedding weight to [dist.Replicate(), dist.Shard(0)] to match the lm_head weight, which reduces the communication introduced by mismatched sharding states when the parameters are shared.
  3. Change the output placements of the auto-parallel embedding layer to [dist.Shard(0), dist.Replicate()], which reduces the allgather communication introduced in the encoder layers under the optimal strategy.
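For illustration only, below is a minimal sketch of what matching placements for the two weights look like with the paddle.distributed semi-auto API. The mesh layout, dim names, and toy sizes are assumptions and this is not the PR code (the PR uses the model's own get_mesh() helper); it is meant to be run under python -m paddle.distributed.launch with 4 ranks.

# Minimal sketch, not the PR code: toy sizes and a hand-built 2-D mesh (assumptions).
import paddle
import paddle.distributed as dist

vocab_size, hidden_size = 1024, 256
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])

embedding = paddle.nn.Embedding(vocab_size, hidden_size)
lm_head_weight = paddle.create_parameter([vocab_size, hidden_size], dtype="float32")

# Give both weights identical placements (replicated over the first mesh dim,
# split along dim 0 over the second) so sharing them under pp needs no reshard.
embedding.weight = dist.shard_tensor(
    embedding.weight, mesh, [dist.Replicate(), dist.Shard(0)]
)
lm_head_weight = dist.shard_tensor(
    lm_head_weight, mesh, [dist.Replicate(), dist.Shard(0)]
)

As the review discussion below notes, the embedding op itself still needs an allgather with this placement; the point here is only that the two shared weights carry the same placements.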


paddle-bot bot commented Jul 1, 2025

Thanks for your contribution!


codecov bot commented Jul 1, 2025

Codecov Report

Attention: Patch coverage is 11.11111% with 64 lines in your changes missing coverage. Please review.

Project coverage is 46.73%. Comparing base (44eff1f) to head (2682061).
Report is 4 commits behind head on develop.

Current head 2682061 differs from pull request most recent head 20c3b84

Please upload reports for the commit 20c3b84 to get more accurate results.

Files with missing lines                         Patch %   Lines
paddlenlp/transformers/gpt/modeling_auto_pp.py   5.26%     36 Missing ⚠️
paddlenlp/trainer/auto_trainer.py                0.00%     16 Missing ⚠️
paddlenlp/trainer/trainer.py                     33.33%    12 Missing ⚠️

❌ Your patch check has failed because the patch coverage (11.11%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (46.73%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10783      +/-   ##
===========================================
- Coverage    46.75%   46.73%   -0.03%     
===========================================
  Files          802      802              
  Lines       133882   133795      -87     
===========================================
- Hits         62603    62529      -74     
+ Misses       71279    71266      -13     


@waliwali777 changed the title from "[AutoParallel] init sync param" to "[AutoParallel] GPT support shared parameters" on Jul 18, 2025
@@ -658,10 +658,10 @@ def __init__(
config.hidden_size,
)
self.word_embeddings.weight = dist.shard_tensor(
-    self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Replicate()]
+    self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Shard(0)]
Contributor

Question: why row-shard here? The embedding op does not support a row-sharded weight, so an allgather is still needed in practice.
If row sharding is required, could this be replaced with c_embedding?

Contributor Author

Because the lm_head layer needs a row-sharded weight to support parallel_cross_entropy, and shared parameters require consistent sharding states, some communication is introduced during parameter synchronization.
After testing different combinations of sharding states for the lm_head weight and the embedding weight, we found that performance is best when both are row-sharded; although the embedding computation then introduces an allgather, the total communication time is the lowest.
Replacing it with c_embedding is feasible and can be supported in a follow-up PR.
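To make the trade-off concrete, here is a single-process numpy sketch (toy sizes and a hypothetical partial_lookup helper; not Paddle's c_embedding implementation) contrasting a plain lookup over an allgathered table with a c_embedding-style masked lookup whose partial results are summed, which would be an allreduce in the real distributed setting.

# Toy sketch (assumptions: vocab split across 2 model-parallel ranks).
import numpy as np

vocab, hidden = 8, 4
table = np.random.default_rng(0).random((vocab, hidden))  # full embedding table
ids = np.array([1, 5, 6])

# Strategy A: each rank holds half the rows; a plain embedding lookup needs the
# whole table, so the shards are gathered first ("allgather" of the weight).
full = np.concatenate([table[0:4], table[4:8]], axis=0)
out_a = full[ids]

# Strategy B (c_embedding style): each rank looks up only the ids in its own
# row range, emits zeros otherwise, and the partial outputs are summed
# (an allreduce over the activations in the real distributed setting).
def partial_lookup(shard, lo, hi, ids):
    local = np.zeros((len(ids), shard.shape[1]))
    mask = (ids >= lo) & (ids < hi)
    local[mask] = shard[ids[mask] - lo]
    return local

out_b = partial_lookup(table[0:4], 0, 4, ids) + partial_lookup(table[4:8], 4, 8, ids)
assert np.allclose(out_a, out_b)

Both paths produce identical results; the difference is whether the weight shards (allgather) or the partial lookup results (allreduce) travel over the network.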

@@ -2578,11 +2578,11 @@ function llm_gpt_dygraph_auto_bs8_fp32_DP2-MP2() {
ips=-1
mem=-1
echo "result: loss=$loss ips=$ips mem=$mem loss_md5=$loss_md5"
-    loss_base=10.5657959 # output of dropout is different after supporting spmd
+    loss_base=10.49585533 # output of dropout is different after supporting spmd
Contributor

Is the loss difference really this large after enabling shared parameters?

Contributor Author

On my side, I found that when the number of hidden layers is small, shared parameters plus the other sharding-state changes have a relatively large impact on the loss.
