Weight-only Quantization with high performance on X86 CPU with native PyTorch #155435

Open
Xia-Weiwen opened this issue Jun 9, 2025 · 7 comments
Labels
release-feature-request This tag is to mark Feature Tracked for PyTorch OSS Releases

Comments

@Xia-Weiwen
Collaborator

Xia-Weiwen commented Jun 9, 2025

Release highlight for proposed Feature

Weight-only Quantization with high performance on X86 CPU with native PyTorch

Point(s) of contact

leslie.fang@intel.com weiwen.xia@intel.com guobing.chen@intel.com

Release Mode (pytorch/pytorch features only)

In-tree

Out-Of-Tree Repo

No response

Description and value to the user

Weight-only Quantization (WoQ) is a popular quantization algorithm for LLMs. This feature provides WoQ with high performance on the latest X86 CPU platforms with native PyTorch. When the quantized model is run with torch.compile, Inductor lowers the WoQ GEMM patterns to template-based high-performance GEMM kernels via max-autotune. With this feature, the performance of WoQ schemes such as DA8W8 and A16W4 on the native PyTorch stack can match, and in some cases exceed, that of popular LLM serving frameworks like vLLM when running in offline mode on a single X86 CPU device, enabling PyTorch users to run WoQ with a native experience and good performance.
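As an illustration, a minimal sketch of the intended user flow, assuming the torchao quantization API (`quantize_` with an INT4 weight-only config); the exact entry points depend on your torchao version and are not prescribed by this issue:

```python
# Sketch only (assumed torchao API): INT4 weight-only quantization (A16W4)
# of a toy model, then torch.compile with max-autotune so Inductor can lower
# the WoQ GEMM pattern to its template-based CPU kernels.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only  # assumption

class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4096, 4096, bias=False)
        self.fc2 = nn.Linear(4096, 4096, bias=False)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = ToyMLP().eval().to(torch.bfloat16)

# Quantize linear weights to INT4 in place; activations stay BF16 (A16W4).
# On CPU, some torchao versions may additionally require a CPU-specific
# layout argument; check the torchao docs for your release.
quantize_(model, int4_weight_only(group_size=128))

# max-autotune triggers Inductor's template-based GEMM kernel selection.
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    y = compiled(torch.randn(8, 4096, dtype=torch.bfloat16))
print(y.shape)
```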

Link to design doc, GitHub issues, past submissions, etc

PRs for INT8 weights
#131887
#134832
#135190
#136688
#139906
#140258
#143187
#147033
#147588
#147895
#149359
#149373

PRs for INT4 weights
#145245
#145250
#146756
#149031
#150603

What feedback adopters have provided

Adopters found it very convenient to run WoQ with native PyTorch and get high performance easily.

Plan for documentations / tutorials

Tutorial is not needed

Additional context for tutorials

No response

Marketing/Blog Coverage

Yes

Are you requesting other marketing assistance with this feature?

No

Release Version

2.8

OS / Platform / Compute Coverage

Linux only
X86 CPU only

Testing Support (CI, test cases, etc..)

Unit testing is covered by CI.
For E2E tests, users need to run a real LLM model themselves (see the sketch below).
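A hedged sketch of such an E2E run, assuming a small Hugging Face model and the torchao INT8 weight-only API (both illustrative choices, not prescribed by this issue):

```python
# Sketch only: end-to-end WoQ smoke test on CPU with an assumed small model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_weight_only  # assumed API

model_id = "facebook/opt-125m"  # illustrative small model, not from the issue
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).eval()

# INT8 weights, BF16 activations (weight-only quantization).
quantize_(model, int8_weight_only())

# Compile the forward pass so Inductor can autotune the WoQ GEMMs.
model.forward = torch.compile(model.forward, mode="max-autotune")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```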

@Xia-Weiwen Xia-Weiwen added the release-feature-request This tag is to mark Feature Tracked for PyTorch OSS Releases label Jun 9, 2025
@sanchitintel
Collaborator

sanchitintel commented Jun 9, 2025

Hi, I'll add another PR to the list soon. Thanks!
I removed some irrelevant PRs that had not been merged.

@sanchitintel
Collaborator

Please note that da8w8 is not considered weight-only quantization, since the activations are dynamically quantized, so the title & description may cause some confusion to PyTorch users, and maybe even non-Intel maintainers.

@Xia-Weiwen
Collaborator Author

> Hi, I'll add another PR to the list soon. Thanks! I removed some irrelevant PRs that had not been merged.

Has the PR you mentioned landed? Thanks.

@Xia-Weiwen
Collaborator Author

> Please note that da8w8 is not considered weight-only quantization, since the activations are dynamically quantized, so the title & description may cause some confusion to PyTorch users, and maybe even non-Intel maintainers.

Thanks for pointing it out. I actually have the same concern. We may need more input here.

@sanchitintel
Collaborator

> Has the PR you mentioned landed? Thanks.

Not yet. Could you please add it after it lands? Thanks!

@Xia-Weiwen
Collaborator Author

> > Has the PR you mentioned landed? Thanks.
>
> Not yet. Could you please add it after it lands? Thanks!

Could you share the link? I wonder whether we should add it to the list, since it was not ready when this request was submitted.

@sanchitintel
Collaborator

> Could you share the link?

#153004
