-
You can check #386. I did not mention it there, but the flash attention op in ggml only works with f16 K and V, so that is where the conversion is happening.
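For illustration, here is a minimal sketch of casting K and V to f16 before calling ggml's flash attention op. The helper name `attn_with_f16_kv` is hypothetical, and the `ggml_flash_attn_ext` signature (mask, scale, max_bias, logit_softcap) is based on recent ggml versions; older versions differ, so check your ggml header.

```c
#include "ggml.h"

// Hypothetical helper: wraps ggml flash attention with f32 inputs.
// The flash attention op only accepts f16 K and V, so cast them first.
struct ggml_tensor * attn_with_f16_kv(struct ggml_context * ctx,
                                      struct ggml_tensor * q,  // f32 query
                                      struct ggml_tensor * k,  // f32 key
                                      struct ggml_tensor * v,  // f32 value
                                      float scale) {
    struct ggml_tensor * k16 = ggml_cast(ctx, k, GGML_TYPE_F16);
    struct ggml_tensor * v16 = ggml_cast(ctx, v, GGML_TYPE_F16);

    // mask = NULL, max_bias = 0.0f, logit_softcap = 0.0f (assumed defaults
    // for the recent ggml_flash_attn_ext signature)
    return ggml_flash_attn_ext(ctx, q, k16, v16, NULL, scale, 0.0f, 0.0f);
}
```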
-
Has anyone successfully experimented with quantizing the latent image tensor to f16 or q8? I guess this could help a lot with generating high-resolution images on limited memory, assuming f32 precision isn't actually needed, of course.
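As a rough illustration of what the f16 half of that experiment could look like: a minimal sketch using ggml's `ggml_cast`, assuming your backend supports casting at that point in the graph. The helper name `latent_to_f16` and the latent shape comment are assumptions, not anything from this repo.

```c
#include "ggml.h"

// Hypothetical helper: cast an f32 latent to f16 inside a ggml graph.
// For a typical SD-style latent (8x downsampled, 4 channels) this halves
// its memory footprint; downstream ops must accept f16 input.
struct ggml_tensor * latent_to_f16(struct ggml_context * ctx,
                                   struct ggml_tensor * latent /* f32 */) {
    return ggml_cast(ctx, latent, GGML_TYPE_F16);
}
```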