
Add AOTI shim for _weight_int4pack_mm_cpu_tensor #149031


Closed
wants to merge 3 commits

Conversation

@Xia-Weiwen (Collaborator) commented Mar 12, 2025

Stack from ghstack (oldest at bottom):

Summary
The previous implementation of this shim did not align with the design, so it was removed by #148907. This PR adds the shim back in the MKLDNN backend files and re-enables the CPP wrapper UT.
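
For context, the shims in these files follow a common pattern. Below is a hedged sketch of that pattern, not the exact code in this PR: the parameter list and the target ATen op (assumed here to be at::_weight_int4pack_mm_for_cpu) are assumptions.

```cpp
// Hedged sketch of the AOTI shim pattern (illustrative, not copied from the
// PR): a C-ABI entry point unwraps opaque tensor handles, forwards to the
// ATen CPU kernel, and converts any exception into an error code.
#include <ATen/Functions.h>
#include <torch/csrc/inductor/aoti_torch/utils.h>

using torch::aot_inductor::new_tensor_handle;
using torch::aot_inductor::tensor_handle_to_tensor_pointer;

AOTITorchError aoti_torch_cpu__weight_int4pack_mm_cpu_tensor(
    AtenTensorHandle self,
    AtenTensorHandle mat2,
    int64_t qGroupSize, // assumed to be a scalar; may be a tensor handle
    AtenTensorHandle qScaleAndZeros,
    AtenTensorHandle* ret0) {
  AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({
    // Assumed target op; the real shim may dispatch to a different kernel.
    *ret0 = new_tensor_handle(at::_weight_int4pack_mm_for_cpu(
        *tensor_handle_to_tensor_pointer(self),
        *tensor_handle_to_tensor_pointer(mat2),
        qGroupSize,
        *tensor_handle_to_tensor_pointer(qScaleAndZeros)));
  });
}
```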

Test plan

pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

[ghstack-poisoned]

pytorch-bot bot commented Mar 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149031

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 8f28345 with merge base a8b1767 (image):

NEW FAILURE - The following job has failed:

  • linux-binary-manywheel / manywheel-py3_9-cuda12_8-test / test (gh)
    RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 8, 0) but found runtime version (9, 7, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Xia-Weiwen added a commit that referenced this pull request Mar 12, 2025
ghstack-source-id: c72bbbb
Pull Request resolved: #149031
@Xia-Weiwen Xia-Weiwen marked this pull request as draft March 12, 2025 07:29
Contributor

Attention! One of PyTorch's C-stable API files was changed

You MUST NOT change existing function declarations in this file, as the header defines a stable C ABI. If you need to change a function's signature, introduce a new v2 version of the function and modify code generation to target the new version.
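
As a hypothetical illustration of this versioning rule (the function below is invented; only the pattern matters), a header change might look like:

```cpp
// Hypothetical example of the stable-ABI versioning rule; this function does
// not exist in the codebase. The original declaration stays frozen, and a
// signature change ships as a new _v2 symbol that codegen targets instead.
#include <torch/csrc/inductor/aoti_torch/c/shim.h>

AOTITorchError aoti_torch_cpu__example_op(
    AtenTensorHandle self,
    AtenTensorHandle* ret0); // frozen: part of the stable C ABI

AOTITorchError aoti_torch_cpu__example_op_v2(
    AtenTensorHandle self,
    int64_t new_flag, // the added parameter that motivates a v2 entry point
    AtenTensorHandle* ret0);
```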



@Xia-Weiwen added the "topic: not user facing" (topic category) label Mar 12, 2025
[ghstack-poisoned]
Xia-Weiwen added a commit that referenced this pull request Mar 13, 2025
ghstack-source-id: 38166c3
Pull Request resolved: #149031
[ghstack-poisoned]
Xia-Weiwen added a commit that referenced this pull request Mar 13, 2025
@Xia-Weiwen (Collaborator, Author) commented Mar 14, 2025

Hi @EikanWang, could you please take a look? Also, since all these files are named after MKLDNN, is it OK to rename them to cpu? Thanks.

@@ -523,3 +523,19 @@ AOTITorchError aoti_torch_cpu__mkl_linear(
 #endif // AT_MKL_ENABLED

 #endif // AT_MKLDNN_ENABLED()

+AOTITorchError aoti_torch_cpu__weight_int4pack_mm_cpu_tensor(
Collaborator

But are we using mkldnn here?

Collaborator Author

No. It does not require MKLDNN. We are considering renaming these files from mkldnn to cpu. Thanks.

Collaborator

@jgong5, currently the name shim_mkldnn indicates that it should only serve oneDNN. However, Weiwen and I checked, and the motivation is that this shim is CPU-dedicated and cannot be reused across different hardware backends. For XPU, we provide shim_xpu regardless of its implementation being on top of oneDNN. That's why we want to rename the file to CPU. Does that make sense to you?


@@ -3835,7 +3835,7 @@ def matcher_check_fn():
     include_ops = [
         "aoti_torch_cpu__weight_int4pack_mm_cpu_tensor"
         if torch._inductor.config.cpp_wrapper
-        else "extern_kernels.int4mm_packed_weight_cpu"
+        else "torch.ops.quantized.int4mm_packed_weight_cpu.default"
Collaborator

@Xia-Weiwen , as we synced offline, the test cases do not cover aoti_torch_cpu__weight_int4pack_mm_cpu_tensor. Could you help elaborate on the changes? How do the changes test aoti_torch_cpu__weight_int4pack_mm_cpu_tensor?

Collaborator Author

It seems that the UT runs both the fallback kernel and the template-based kernel for max-autotune, so the fallback kernel is still used for codegen, compiled with gcc, and benchmarked.
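
In other words, when the fallback kernel is code-generated under cpp_wrapper, the emitted C++ plausibly calls the shim along these lines (an illustrative fragment; the handle names and the argument list are assumptions, not this PR's actual generated code):

```cpp
// Illustrative fragment of a cpp_wrapper-style call into the C-ABI shim
// (assumed shape, not real generated output). The generated wrapper passes
// opaque tensor handles and checks the returned error code.
AtenTensorHandle buf0 = nullptr; // output of the fallback int4 matmul
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int4pack_mm_cpu_tensor(
    arg0_1,     // activation (hypothetical handle name)
    arg1_1,     // packed int4 weight (hypothetical handle name)
    qGroupSize, // quantization group size (assumed parameter)
    arg2_1,     // scales and zero points (hypothetical handle name)
    &buf0));
```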

@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review March 17, 2025 07:57
@Xia-Weiwen Xia-Weiwen requested a review from desertfire March 17, 2025 07:57
@Xia-Weiwen (Collaborator, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 17, 2025
@Xia-Weiwen (Collaborator, Author)

Hi @desertfire, since it breaks the lowering of this op (compilation error with cpp wrapper), can we cherry-pick this patch to the 2.7 branch? Thanks.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@desertfire (Contributor)

Hi @desertfire, since it breaks the lowering of this op (compilation error with cpp wrapper), can we cherry-pick this patch to the 2.7 branch? Thanks.

You can add it to #149044

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_9-cuda12_8-test / test

Details for Dev Infra team. Raised by workflow job.

@Xia-Weiwen (Collaborator, Author)

Hi @desertfire, since it breaks the lowering of this op (compilation error with cpp wrapper), can we cherry-pick this patch to the 2.7 branch? Thanks.

You can add it to #149044

Thanks

@Xia-Weiwen (Collaborator, Author)

@pytorchbot merge -f "CI failure is unrelated"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@Xia-Weiwen added the "intel" (This tag is for PR from Intel) label Mar 18, 2025
Xia-Weiwen added a commit to Xia-Weiwen/pytorch that referenced this pull request Mar 18, 2025
**Summary**
The previous implementation of this shim did not align with the design, so it was removed by pytorch#148907. This PR adds the shim back in the MKLDNN backend files and re-enables the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: pytorch#149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
atalman pushed a commit that referenced this pull request Mar 20, 2025
**Summary**
The previous implementation of this shim did not align with the design, so it was removed by #148907. This PR adds the shim back in the MKLDNN backend files and re-enables the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: #149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
@github-actions github-actions bot deleted the gh/Xia-Weiwen/31/head branch April 20, 2025 02:21
Labels
ciflow/inductor, ciflow/trunk, intel, Merged, module: inductor, open source, topic: not user facing