
[Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 by Xia-Weiwen · Pull Request #146756 · pytorch/pytorch


Closed · wants to merge 14 commits

Conversation


@Xia-Weiwen Xia-Weiwen commented Feb 8, 2025

Stack from ghstack (oldest at bottom):

Summary
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. The micro-kernel used in the templates is based on AVX512; it is a copy of the ATen implementation of the same op, with minor changes.

Due to better blocking and loop scheduling, the GEMM-template-based implementation outperformed the ATen implementation in all cases we tested.

Test plan

```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Feb 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146756

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cb28987 with merge base c644f4c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Xia-Weiwen added a commit that referenced this pull request Feb 8, 2025
…th AVX512

ghstack-source-id: 4f008ee
Pull Request resolved: #146756
@Xia-Weiwen Xia-Weiwen marked this pull request as draft February 8, 2025 13:53
@Xia-Weiwen added labels topic: not user facing, ciflow/trunk, and intel (Feb 8, 2025)
@jgong5 jgong5 changed the title [Inductor][CPU] Add GEMM tamplates for _weight_int4pack_mm_for_cpu with AVX512 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 Feb 13, 2025
@Xia-Weiwen Xia-Weiwen requested a review from amjames February 25, 2025 08:06
@Xia-Weiwen
Collaborator Author

Hi @jansel, could you please review this PR? Thanks.

@Xia-Weiwen Xia-Weiwen requested a review from jansel February 27, 2025 05:15
Comment on lines 160 to 165
```python
# define functions to generate example inputs for weight and group size
# otherwise, autotuner generates example inputs of all zeros for them
def get_example_weight(x: torch._inductor.ir.IRNode) -> torch.Tensor:
    shape = x.get_size()
    device = x.get_device()
    return torch.randint(0, 255, shape, dtype=torch.uint8, device=device)
```
Contributor
don't we need to account for strides here? This assumes contiguous.

Can we share some code with the other example input generation?

Collaborator Author
Thanks for your comments.

  • Now I have added a check: if the weight is not contiguous, the GEMM template is not taken as a choice for max-autotune. I have also added an assertion inside get_example_weight.
  • As for the second question, I searched the PyTorch source code and didn't find any code that can be shared. This case is special because the weight's dtype is uint8.

Thanks.
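The resolution discussed above (guarding example-weight generation with a contiguity assertion) can be sketched as follows. This is an illustration in the spirit of the thread, with a hypothetical helper name rather than the PR's actual code; note that `randint`'s upper bound is exclusive, so 256 is used here to cover the full uint8 range.

```python
import torch

def make_example_uint8_weight(shape, device="cpu"):
    """Hypothetical helper: random uint8 example weight for autotuning."""
    w = torch.randint(0, 256, shape, dtype=torch.uint8, device=device)
    # The AVX512 GEMM template assumes a contiguous weight, so anything
    # non-contiguous should have been filtered out before this point.
    assert w.is_contiguous(), "GEMM template assumes a contiguous weight"
    return w

w = make_example_uint8_weight((64, 32))
```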

@Xia-Weiwen Xia-Weiwen requested a review from jansel February 28, 2025 04:35
@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status in the workflow run.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / build

Details for Dev Infra team: raised by workflow job.

@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
…th AVX512 (pytorch#146756)

Pull Request resolved: pytorch#146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
@github-actions github-actions bot deleted the gh/Xia-Weiwen/30/head branch April 2, 2025 02:11
Labels: ciflow/inductor, ciflow/trunk, intel, Merged, module: inductor, open source, topic: not user facing
7 participants