Replies: 9 comments 21 replies
-
@sayakpaul did you try to run inference just through the UNet (i.e., skipping the VAE), in case it's the VAE that's using that much memory?
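Something like this is what I have in mind — a rough, hypothetical sketch (the dummy input shapes below assume the SDXL UNet; this isn't the benchmark script from this thread):

```python
import torch
from diffusers import UNet2DConditionModel

# Load only the UNet (no VAE, no text encoders) to isolate its memory footprint.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

# Dummy inputs matching SDXL's UNet: 128x128 latents for 1024x1024 images,
# 2048-dim text-encoder states, 1280-dim pooled embeddings, 6 micro-conditioning ids.
sample = torch.randn(1, 4, 128, 128, dtype=torch.float16, device="cuda")
timestep = torch.tensor([10], device="cuda")
encoder_hidden_states = torch.randn(1, 77, 2048, dtype=torch.float16, device="cuda")
added_cond_kwargs = {
    "text_embeds": torch.randn(1, 1280, dtype=torch.float16, device="cuda"),
    "time_ids": torch.randn(1, 6, dtype=torch.float16, device="cuda"),
}

with torch.no_grad():
    _ = unet(sample, timestep, encoder_hidden_states, added_cond_kwargs=added_cond_kwargs).sample

print(f"UNet-only peak VRAM: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GiB")
```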
-
@sayakpaul a couple of comments:
-
Cc: @younesbelkada for feedback as well (as he is our in-house ninja for working with reduced precision).
-
SD with a batch size of 4
-
SDXL with a batch size of 1 (steps: 30)
-
@dacorvo plotted the distribution of the weights of the UNet as well:

```python
from diffusers import UNet2DConditionModel
import matplotlib.pyplot as plt

# Load the SDXL UNet in eval mode.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
).eval()

# Collect the flattened values of every weight parameter.
weights = []
for name, param in unet.named_parameters():
    if "weight" in name:
        weights.append(param.view(-1).cpu().detach().numpy())

# Overlay per-layer histograms of the weight values.
plt.figure(figsize=(10, 6))
for i, weight in enumerate(weights):
    plt.hist(weight, bins=50, alpha=0.5, label=f"Layer {i+1}")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.title("Distribution of Weights in the Neural Network")
plt.savefig("sdxl_unet_weight_dist.png", bbox_inches="tight", dpi=300)
```

[Weight distribution plots: SDXL, SD v1.5]

Weights seem to be concentrated around 0. Does this quite fit the bill for
-
I rebased the branch. I did a refactoring and
-
It should be OK now.
-
@dacorvo I am getting: `RuntimeError: Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Half`
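(For context, this is the kind of failure that shows up when an op mixes a float8 tensor with an fp16 one without an explicit cast — a minimal, hypothetical repro, not the actual call site in quanto:)

```python
import torch

a = torch.randn(4, 4).to(torch.float8_e4m3fn)
b = torch.randn(4, 4, dtype=torch.float16)

# Implicit type promotion between float8 and half is not defined, so this raises
# "Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Half".
try:
    _ = a + b
except RuntimeError as err:
    print(err)

# Upcasting the float8 tensor explicitly avoids the promotion error.
_ = a.to(torch.float16) + b
```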
-
Comfy and A1111 have been supporting Float8 for some time now:
A1111 reports quite nice improvements in VRAM consumption:
Timing takes a hit because of the casting overhead, but that's okay in the interest of the reduced VRAM, IMO.
So, I tried using `quanto` to potentially benefit from FP8 (benchmark run on a 4090). Here are the stats and resultant images (batch size of 1):
As we can see, we're able to obtain a good amount of VRAM reduction here in comparison to FP16. Do we want to achieve that in `diffusers` natively, or is supporting this via `quanto` preferable? I am okay with the latter.
Edit: int8 is even better: #7023 (comment).
See also: huggingface/optimum-quanto#74. Cc: @dacorvo.
Curious to know your thoughts here: @yiyixuxu @DN6.
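For reference, here's a minimal sketch of what quantizing the SDXL UNet to FP8 weights with quanto could look like — assuming the current `optimum.quanto` import path and calibration-free weight quantization; this is not the exact benchmark script used above, and the prompt and filename are placeholders:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Quantize only the UNet weights to FP8 (e4m3); activations stay in FP16.
quantize(pipe.unet, weights=qfloat8)
freeze(pipe.unet)

image = pipe(
    "a photo of an astronaut riding a horse on mars", num_inference_steps=30
).images[0]
image.save("sdxl_fp8_unet.png")  # placeholder filename
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GiB")
```

The VRAM saving comes from storing the UNet weights in 8 bits; casting them back up for the matmuls is where the timing overhead mentioned above comes from.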