Replies: 9 comments 21 replies
-
@sayakpaul did you try to run inference just through the UNet (i.e., skipping the VAE), in case it's the VAE that's using that much memory?
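Something like this is what I have in mind — a rough, hypothetical sketch (the dummy input shapes below assume the SDXL UNet; this isn't the benchmark script from this thread):

```python
import torch
from diffusers import UNet2DConditionModel

# Load only the UNet (no VAE, no text encoders) to isolate its memory footprint.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

# Dummy inputs matching SDXL's UNet: 128x128 latents for 1024x1024 images,
# 2048-dim text-encoder states, 1280-dim pooled embeddings, 6 micro-conditioning ids.
sample = torch.randn(1, 4, 128, 128, dtype=torch.float16, device="cuda")
timestep = torch.tensor([10], device="cuda")
encoder_hidden_states = torch.randn(1, 77, 2048, dtype=torch.float16, device="cuda")
added_cond_kwargs = {
    "text_embeds": torch.randn(1, 1280, dtype=torch.float16, device="cuda"),
    "time_ids": torch.randn(1, 6, dtype=torch.float16, device="cuda"),
}

with torch.no_grad():
    _ = unet(sample, timestep, encoder_hidden_states, added_cond_kwargs=added_cond_kwargs).sample

print(f"UNet-only peak VRAM: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GiB")
```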
-
@sayakpaul a couple of comments:
-
Cc: @younesbelkada for feedback as well (as he is our in-house ninja for working with reduced precision).
-
SD with a batch size of 4
-
SDXL with a batch size of 1 (steps: 30)
-
@dacorvo plotted the distribution of the weights of the UNet as well:

```python
from diffusers import UNet2DConditionModel
import matplotlib.pyplot as plt

# Load the SDXL UNet in eval mode.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
).eval()

# Collect the flattened values of every weight parameter.
weights = []
for name, param in unet.named_parameters():
    if "weight" in name:
        weights.append(param.view(-1).cpu().detach().numpy())

# Overlay per-layer histograms of the weight values.
plt.figure(figsize=(10, 6))
for i, weight in enumerate(weights):
    plt.hist(weight, bins=50, alpha=0.5, label=f"Layer {i+1}")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.title("Distribution of Weights in the Neural Network")
plt.savefig("sdxl_unet_weight_dist.png", bbox_inches="tight", dpi=300)
```

[Weight distribution plots: SDXL, SD v1.5]

Weights seem to be concentrated around 0. Does this quite fit the bill for
-
I rebased the branch. I did a refactoring and
-
It should be OK now.
-
@dacorvo I am getting: `RuntimeError: Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Half`
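(For context, this is the kind of failure that shows up when an op mixes a float8 tensor with an fp16 one without an explicit cast — a minimal, hypothetical repro, not the actual call site in quanto:)

```python
import torch

a = torch.randn(4, 4).to(torch.float8_e4m3fn)
b = torch.randn(4, 4, dtype=torch.float16)

# Implicit type promotion between float8 and half is not defined, so this raises
# "Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Half".
try:
    _ = a + b
except RuntimeError as err:
    print(err)

# Upcasting the float8 tensor explicitly avoids the promotion error.
_ = a.to(torch.float16) + b
```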
-
Comfy and A1111 have been supporting Float8 for some time now:
A1111 reports quite nice improvements in VRAM consumption:
Timing takes a hit because of the casting overhead, but that's okay in the interest of the reduced VRAM, IMO.
So, I tried using `quanto` to potentially benefit from FP8 (benchmark run on a 4090). Here are the stats and resultant images (batch size of 1):
As we can see, we're able to obtain a good amount of VRAM reduction here in comparison to FP16. Do we want to achieve that in `diffusers` natively, or is supporting this via `quanto` preferable? I am okay with the latter.
Edit: int8 is even better: #7023 (comment).
See also: huggingface/optimum-quanto#74. Cc: @dacorvo.
Curious to know your thoughts here: @yiyixuxu @DN6.
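For reference, here's a minimal sketch of what quantizing the SDXL UNet to FP8 weights with quanto could look like — assuming the current `optimum.quanto` import path and calibration-free weight quantization; this is not the exact benchmark script used above, and the prompt and filename are placeholders:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Quantize only the UNet weights to FP8 (e4m3); activations stay in FP16.
quantize(pipe.unet, weights=qfloat8)
freeze(pipe.unet)

image = pipe(
    "a photo of an astronaut riding a horse on mars", num_inference_steps=30
).images[0]
image.save("sdxl_fp8_unet.png")  # placeholder filename
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GiB")
```

The VRAM saving comes from storing the UNet weights in 8 bits; casting them back up for the matmuls is where the timing overhead mentioned above comes from.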