Support for Stable Diffusion 3.5 Large #2574
Comments
See #2578
This is working on an A100 but takes too much memory for an RTX 4000 with 20GB. I see there is a quantized GGUF; is it currently possible to use this quantized model? Thanks.
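(For anyone wanting to check what such a file contains: a minimal sketch using candle's `gguf_file` reader to list the quantized tensors. The filename is a placeholder, and this only inspects metadata; actually running SD 3.5 from quantized weights would still need a quantized loader for the MMDiT, which isn't shown here.)

```rust
use candle::quantized::gguf_file;
use std::fs::File;

fn main() -> anyhow::Result<()> {
    // Placeholder filename: substitute the actual quantized checkpoint.
    let mut file = File::open("sd3.5_large-q4_0.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    // List each tensor with its shape and GGML quantization type.
    for (name, info) in content.tensor_infos.iter() {
        println!("{name}: shape={:?} dtype={:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}
```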
I see in the readme of SD3 there is a benchmark running on an RTX 3090 Ti. How much memory does that card have? It seems like 3.5 takes 40+ GB to run in candle...
Great work.
I've pushed some further changes in #2589 so that the f32 conversion is done on the fly rather than upfront, so that we can benefit from the reduced memory usage while retaining full precision. After this, the memory usage I get from nsys during the text-encoding step is down to ~10.5GB. That said, I still see the memory usage getting to ~20GB while running the mmdit, so it's not that likely to fit on a 20GB GPU.
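(Roughly the idea, as a sketch — the helper name is made up, but it shows the pattern of keeping weights in f16 at rest and upcasting to f32 only at the point of use, so the f32 copy is transient:)

```rust
use candle::{DType, Result, Tensor};

// Hypothetical helper illustrating the on-the-fly upcast: the weight
// lives in f16, and a transient f32 copy exists only for the matmul.
fn matmul_upcast(x: &Tensor, w_f16: &Tensor) -> Result<Tensor> {
    let w_f32 = w_f16.to_dtype(DType::F32)?; // transient f32 copy
    let y = x.matmul(&w_f32)?;
    Ok(y)
    // w_f32 drops here, so peak memory includes the f32 weights only briefly
}
```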
Amazing how much space is saved with F16 on the T5. And it's only using 17GB during sampling!
I tried updating the hf repo to 3.5 Large but it's not working.