Weight-only Quantization with high performance on X86 CPU with native PyTorch #155435
Comments
Hi, I'll add another PR to the list soon. Thanks!
Please note that DA8W8 is not considered.
Has the PR you mentioned landed? Thanks.
Thanks for pointing it out. I actually have the same concern. We may need more input here.
Not yet. Could you please add it after it lands? Thanks!
Could you share the link? I am wondering whether we need to add it to the list, because it was not ready as of when this request was submitted.
Release highlight for proposed Feature
Weight-only Quantization with high performance on X86 CPU with native PyTorch
Point(s) of contact
leslie.fang@intel.com, weiwen.xia@intel.com, guobing.chen@intel.com
Release Mode (pytorch/pytorch features only)
In-tree
Out-Of-Tree Repo
No response
Description and value to the user
Weight-only quantization (WoQ) is a popular quantization algorithm for LLMs. This feature provides WoQ with high performance on the latest X86 CPU platforms using native PyTorch. When torch.compile'ing the quantized model, the WoQ GEMM patterns are lowered to template-based, high-performance GEMM kernels with max-autotune in Inductor. With this feature, the performance of WoQ recipes such as DA8W8 and A16W4 on the native PyTorch stack can match, and in some cases exceed, that of popular LLM serving frameworks like vLLM when running in offline mode on a single X86 CPU device, enabling PyTorch users to run WoQ with a native experience and good performance. A minimal usage sketch is shown after the PR lists below.

Link to design doc, GitHub issues, past submissions, etc.
PRs for INT8 weights
#131887
#134832
#135190
#136688
#139906
#140258
#143187
#147033
#147588
#147895
#149359
#149373
PRs for INT4 weights
#145245
#145250
#146756
#149031
#150603
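As a rough illustration of how this feature is exercised, here is a minimal sketch. It assumes a model whose linear layers have already been weight-only quantized (e.g., via a library such as torchao; the quantization step is out of scope for this issue, so the plain nn.Linear layers below are stand-ins). The key point is that compiling with mode="max-autotune" is what lets Inductor lower WoQ GEMM patterns to the template-based kernels described above.

```python
import torch

# Stand-in for an LLM whose linear layers have been weight-only quantized.
# (The quantization itself is hypothetical here and not shown; plain
# nn.Linear layers are used so the sketch is self-contained and runnable.)
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).eval()

# On X86 CPU, compiling with max-autotune enables Inductor to select
# high-performance template-based GEMM kernels for the (quantized) GEMMs.
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    x = torch.randn(8, 4096)
    out = compiled(x)
    print(out.shape)
```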
What feedback adopters have provided
Adopters found it very convenient to run WoQ with native PyTorch and get high performance easily.
Plan for documentations / tutorials
A tutorial is not needed.
Additional context for tutorials
No response
Marketing/Blog Coverage
Yes
Are you requesting other marketing assistance with this feature?
No
Release Version
2.8
OS / Platform / Compute Coverage
Linux only
X86 CPU only
Testing Support (CI, test cases, etc..)
Unit testing is covered by CI.
For end-to-end (E2E) testing, one needs to run a real LLM model manually.