[AutoParallel] GPT support shared parameters #10783
base: develop
Conversation
Thanks for your contribution!
Codecov Report

❌ Your patch check has failed because the patch coverage (11.11%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##           develop   #10783      +/-   ##
===========================================
- Coverage    46.75%   46.73%   -0.03%
===========================================
  Files          802      802
  Lines       133882   133795      -87
===========================================
- Hits         62603    62529      -74
+ Misses       71279    71266      -13
Force-pushed from bdc3f5a to 90db0e7
Force-pushed from 694512e to 20c3b84
@@ -658,10 +658,10 @@ def __init__(
             config.hidden_size,
         )
         self.word_embeddings.weight = dist.shard_tensor(
-            self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Replicate()]
+            self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Shard(0)]
Question: why row-shard here? The embedding op does not support a row-sharded weight, so an allgather is still needed in practice.
If row-sharding is required, could this be replaced with c_embedding?
Because the lm_head layer needs its weight row-sharded to support parallel cross-entropy, and shared parameters must keep a consistent sharding state, parameter synchronization introduces some communication.
After testing combinations of different sharding states for the lm_head weight and the embedding weight, we found performance is best when both are row-sharded; although the embedding computation then introduces an allgather, the total communication time is the lowest.
Replacing it with c_embedding is feasible and can be supported in a later PR.
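As context for this thread, below is a minimal sketch of the row-sharded embedding weight discussed above. It is an illustration, not the PR's code: the 2x2 dp/mp mesh stands in for get_mesh(), the sizes are placeholders, and it assumes execution under paddle.distributed.launch with 4 ranks.

```python
import paddle
import paddle.distributed as dist

# Hypothetical 2x2 mesh (the PR uses get_mesh()): axis 0 is data-parallel,
# axis 1 is model-parallel. Run under paddle.distributed.launch with 4 ranks.
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])

vocab_size, hidden_size = 50304, 1024  # placeholder sizes
word_embeddings = paddle.nn.Embedding(vocab_size, hidden_size)

# Row-shard the weight: replicated over dp, split along the vocab dimension
# (dim 0) over mp. This matches the row-sharded lm_head weight, so the tied
# (shared) parameter keeps a single consistent distributed state.
word_embeddings.weight = dist.shard_tensor(
    word_embeddings.weight, mesh, [dist.Replicate(), dist.Shard(0)]
)

# Shard the input batch over dp, replicate over mp.
token_ids = dist.shard_tensor(
    paddle.randint(0, vocab_size, shape=[4, 8]),
    mesh, [dist.Shard(0), dist.Replicate()],
)

# A plain embedding lookup does not consume a vocab-sharded weight directly,
# so the framework reshards (allgathers) the weight before the lookup; per
# the reply above, this extra communication is still the cheapest combination
# measured end to end.
hidden = word_embeddings(token_ids)
```

A c_embedding-style vocab-parallel lookup could consume the same placement without the weight allgather, which is the follow-up the reply defers to a later PR.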
@@ -2578,11 +2578,11 @@ function llm_gpt_dygraph_auto_bs8_fp32_DP2-MP2() {
     ips=-1
     mem=-1
     echo "result: loss=$loss ips=$ips mem=$mem loss_md5=$loss_md5"
-    loss_base=10.5657959 # output of dropout is different after supporting spmd
+    loss_base=10.49585533 # output of dropout is different after supporting spmd
Is the loss difference really this large after enabling shared parameters?
What I found is that when the number of hidden layers is small, shared parameters plus the other sharding-state changes affect the loss noticeably.
Before submitting
Add test cases into the tests folder. If there are codecov issues, please add test cases first.
PR types
Other
PR changes
Other
Description
The following optimizations are applied to the GPT-13B dygraph semi-auto parallel model: