
[Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 by Xia-Weiwen · Pull Request #146756 · pytorch/pytorch


Closed · wants to merge 14 commits

Conversation


@Xia-Weiwen Xia-Weiwen commented Feb 8, 2025

Stack from ghstack (oldest at bottom):

Summary
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. The micro-kernel used in the templates is based on AVX512; it is a copy of the ATen implementation of the same op, with minor changes.

Due to better blocking and loop scheduling, the GEMM-template-based implementation outperformed the ATen implementation in all cases we tested.

Test plan

```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Feb 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146756

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cb28987 with merge base c644f4c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Xia-Weiwen added a commit that referenced this pull request Feb 8, 2025
…th AVX512

ghstack-source-id: 4f008ee
Pull Request resolved: #146756
@Xia-Weiwen Xia-Weiwen marked this pull request as draft February 8, 2025 13:53
@Xia-Weiwen added labels topic: not user facing, ciflow/trunk, and intel (Feb 8, 2025)
@jgong5 jgong5 changed the title [Inductor][CPU] Add GEMM tamplates for _weight_int4pack_mm_for_cpu with AVX512 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 Feb 13, 2025
@Xia-Weiwen Xia-Weiwen requested a review from amjames February 25, 2025 08:06
@Xia-Weiwen
Collaborator Author

Hi @jansel, could you please review this PR? Thanks.

@Xia-Weiwen Xia-Weiwen requested a review from jansel February 27, 2025 05:15
Comment on lines 160 to 165
```python
# define functions to generate example inputs for weight and group size
# otherwise, autotuner generates example inputs of all zeros for them
def get_example_weight(x: torch._inductor.ir.IRNode) -> torch.Tensor:
    shape = x.get_size()
    device = x.get_device()
    return torch.randint(0, 255, shape, dtype=torch.uint8, device=device)
```
Contributor
don't we need to account for strides here? This assumes contiguous.

Can we share some code with the other example input generation?

Collaborator Author
Thanks for your comments.

  • Now I have added a check: if the weight is not contiguous, the GEMM template is not taken as a choice for max-autotune. I have also added an assertion inside get_example_weight.
  • As for the second question, I searched the PyTorch source code and didn't find any code that can be shared. This case is special because the weight's dtype is uint8.

Thanks.
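The resolution discussed above (guarding example-weight generation with a contiguity assertion) can be sketched as follows. This is an illustration in the spirit of the thread, with a hypothetical helper name rather than the PR's actual code; note that `randint`'s upper bound is exclusive, so 256 is used here to cover the full uint8 range.

```python
import torch

def make_example_uint8_weight(shape, device="cpu"):
    """Hypothetical helper: random uint8 example weight for autotuning."""
    w = torch.randint(0, 256, shape, dtype=torch.uint8, device=device)
    # The AVX512 GEMM template assumes a contiguous weight, so anything
    # non-contiguous should have been filtered out before this point.
    assert w.is_contiguous(), "GEMM template assumes a contiguous weight"
    return w

w = make_example_uint8_weight((64, 32))
```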

@Xia-Weiwen Xia-Weiwen requested a review from jansel February 28, 2025 04:35
@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status in the workflow run.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / build

Details for Dev Infra team: raised by workflow job.

@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
…th AVX512 (pytorch#146756)

Pull Request resolved: pytorch#146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
@github-actions github-actions bot deleted the gh/Xia-Weiwen/30/head branch April 2, 2025 02:11
Labels: ciflow/inductor, ciflow/trunk, intel, Merged, module: inductor, open source, topic: not user facing
7 participants