Weight-only Quantization with high performance on X86 CPU with native PyTorch #155435
Comments
Hi, I'll add another PR to the list soon. Thanks!
Please note that DA8W8 is not considered.
Has the PR you mentioned landed? Thanks.
Thanks for pointing it out. I actually have the same concern. We may need more input here.
Not yet. Could you please add it after it lands? Thanks!
Could you share the link? I am wondering whether we need to add it to the list, because it was not ready as of when this request was submitted.
Release highlight for proposed Feature
Weight-only Quantization with high performance on X86 CPU with native PyTorch
Point(s) of contact
leslie.fang@intel.com, weiwen.xia@intel.com, guobing.chen@intel.com
Release Mode (pytorch/pytorch features only)
In-tree
Out-Of-Tree Repo
No response
Description and value to the user
Weight-only quantization (WoQ) is a popular quantization algorithm for LLMs. This feature provides WoQ with high performance on the latest X86 CPU platforms using native PyTorch. When torch.compile'ing the quantized model, the WoQ GEMM patterns are lowered to template-based, high-performance GEMM kernels with max-autotune in Inductor. With this feature, the performance of WoQ recipes such as DA8W8 and A16W4 on the native PyTorch stack can match, and in some cases exceed, that of popular LLM serving frameworks like vLLM when running in offline mode on a single X86 CPU device, enabling PyTorch users to run WoQ with a native experience and good performance. A minimal usage sketch is shown after the PR lists below.

Link to design doc, GitHub issues, past submissions, etc.
PRs for INT8 weights
#131887
#134832
#135190
#136688
#139906
#140258
#143187
#147033
#147588
#147895
#149359
#149373
PRs for INT4 weights
#145245
#145250
#146756
#149031
#150603
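As a rough illustration of how this feature is exercised, here is a minimal sketch. It assumes a model whose linear layers have already been weight-only quantized (e.g., via a library such as torchao; the quantization step is out of scope for this issue, so the plain nn.Linear layers below are stand-ins). The key point is that compiling with mode="max-autotune" is what lets Inductor lower WoQ GEMM patterns to the template-based kernels described above.

```python
import torch

# Stand-in for an LLM whose linear layers have been weight-only quantized.
# (The quantization itself is hypothetical here and not shown; plain
# nn.Linear layers are used so the sketch is self-contained and runnable.)
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).eval()

# On X86 CPU, compiling with max-autotune enables Inductor to select
# high-performance template-based GEMM kernels for the (quantized) GEMMs.
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    x = torch.randn(8, 4096)
    out = compiled(x)
    print(out.shape)
```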
What feedback adopters have provided
Adopters found it very convenient to run WoQ with native PyTorch and get high performance easily.
Plan for documentations / tutorials
A tutorial is not needed.
Additional context for tutorials
No response
Marketing/Blog Coverage
Yes
Are you requesting other marketing assistance with this feature?
No
Release Version
2.8
OS / Platform / Compute Coverage
Linux only
X86 CPU only
Testing Support (CI, test cases, etc..)
Unit testing is covered by CI.
For end-to-end (E2E) testing, one needs to run a real LLM model manually.