
Updated Scaled_mm to support more scaling formats via CuBlas #153555

@drisspg

Description

Summary

In CUDA 12.9, cuBLAS added support for an expanded set of scaling strategies beyond just per-tensor scaling: https://developer.nvidia.com/blog/boosting-matrix-multiplication-speed-and-flexibility-with-nvidia-cublas-12-9/
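
For context, the two recipes _scaled_mm supports today differ only in the shapes of the fp32 scale tensors. A minimal sketch (shapes follow the current torch._scaled_mm schema; note that _scaled_mm currently expects the second operand in column-major layout):

```python
import torch

M, K, N = 128, 256, 64
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major

# Per-Tensor: a single fp32 scale per operand.
scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")
y = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)

# Per-Row: one fp32 scale per row of a and per column of b.
scale_a = torch.ones(M, 1, device="cuda")
scale_b = torch.ones(1, N, device="cuda")
y = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
```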

Currently on CUDA:

SM89

_scaled_mm dispatches to one of two backends on SM89:

  • Per-Tensor scaling -> CublasLT
  • Per-Row scaling -> RowWise Cutlass kernel
  • GroupWise Scaling -> Not supported | some support in AO
  • BlockWise Scaling -> Not supported | some support in AO

H100

_scaled_mm dispatches to one of two backends on H100:

  • Per-Tensor scaling -> CublasLT
  • Per-Row scaling -> RowWise Cutlass kernel
  • GroupWise Scaling -> Not supported | some support in AO
  • BlockWise Scaling -> Not supported | some support in AO

B200

_scaled_mm dispatches to one of two backends on B200:

  • Per-Tensor scaling -> CublasLT
  • Per-Row scaling -> RowWise Cutlass kernel (template is not optimal)
  • GroupWise Scaling -> MXFP8 BlockWise scaling is supported via CublasLT (see the sketch after this list)
  • BlockWise Scaling -> Not supported
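
To make the GroupWise/MXFP8 recipe concrete: each operand carries one power-of-two e8m0 scale per 32-element block along K. A rough sketch of producing such scales (illustrative only; assumes a recent PyTorch build that exposes torch.float8_e8m0fnu, and ignores the blocked/swizzled scale layout cuBLASLt actually requires):

```python
import torch

BLOCK = 32  # MX block size: one shared e8m0 exponent per 32 values along K

def to_mxfp8(x: torch.Tensor):
    # Illustrative only: real MXFP8 use via cuBLASLt also needs the
    # padded/swizzled scale layout the library expects, omitted here.
    M, K = x.shape
    blocks = x.view(M, K // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2**-126)
    # e8m0 scales are pure powers of two; pick the smallest one that
    # brings each block within e4m3's max representable value (448).
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))
    data = (blocks / scale).to(torch.float8_e4m3fn).view(M, K)
    return data, scale.view(M, K // BLOCK).to(torch.float8_e8m0fnu)

x = torch.randn(128, 256, device="cuda")
data, scales = to_mxfp8(x)  # data: (128, 256) e4m3; scales: (128, 8) e8m0
```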

We should add new cuBLAS bindings to enable this more performant code path.
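
Concretely, the expanded dispatch could key off the scale shapes, roughly like the hypothetical helper below (classify_scaling is an illustration, not the actual C++ dispatch in _scaled_mm):

```python
import torch

def classify_scaling(a, b, scale_a, scale_b, block=32):
    """Hypothetical recipe classification from scale shapes; the real
    _scaled_mm dispatch lives in C++ and applies stricter checks."""
    M, K = a.shape
    _, N = b.shape
    if scale_a.numel() == 1 and scale_b.numel() == 1:
        return "per-tensor"        # cuBLASLt today
    if scale_a.shape == (M, 1) and scale_b.shape == (1, N):
        return "per-row"           # row-wise CUTLASS kernel today
    if scale_a.shape == (M, K // block) and scale_b.shape == (K // block, N):
        return "group-wise (MX)"   # cuBLASLt on B200 for MXFP8
    raise ValueError("unrecognized scaling recipe")
```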

Blockers

Ideally we would remove the CUTLASS templates entirely, since cuBLAS appears, per NVIDIA's claims, to be universally more performant. The main blocker is that we would lose support for SM89 hardware.

We don't currently ship a prebuilt version of PyTorch for CUDA 12.9.
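
Any such binding would also need runtime gating, along these lines (a sketch; torch.version.cuda and torch.cuda.get_device_capability are real APIs, the policy is an assumption):

```python
import torch

def can_use_cublas_blockwise() -> bool:
    # Hypothetical gate for a cuBLASLt block-scaled path: needs a
    # CUDA 12.9+ build and SM90+; SM89 would keep the CUTLASS fallback.
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return (major, minor) >= (12, 9) and torch.cuda.get_device_capability() >= (9, 0)
```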

cc @ptrblck @msaroufim @eqy @jerryzh168 @yanbing-j @vkuzo @albanD @kadeng @penguinwu @ngimel, @lw


Labels

  • Blackwell: Specific failures or issues related to sm100 + Cuda arches
  • enhancement: Not as big of a feature, but technically not a bug. Should be easy to fix
  • module: cuda: Related to torch.cuda, and CUDA support in general
  • module: floatx (formerly float8): For torch.float8_e5m2 and torch.float8_e4m3 and other sub 8-bit float types
  • topic: performance: topic category
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
