-
You can check #386. I did not mention it there, but the flash attention op in ggml only works with f16 K and V, so that is where the conversion is happening.
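For illustration, here is a minimal sketch of casting K and V to f16 before calling ggml's flash attention op. The helper name `attn_with_f16_kv` is hypothetical, and the `ggml_flash_attn_ext` signature (mask, scale, max_bias, logit_softcap) is based on recent ggml versions; older versions differ, so check your ggml header.

```c
#include "ggml.h"

// Hypothetical helper: wraps ggml flash attention with f32 inputs.
// The flash attention op only accepts f16 K and V, so cast them first.
struct ggml_tensor * attn_with_f16_kv(struct ggml_context * ctx,
                                      struct ggml_tensor * q,  // f32 query
                                      struct ggml_tensor * k,  // f32 key
                                      struct ggml_tensor * v,  // f32 value
                                      float scale) {
    struct ggml_tensor * k16 = ggml_cast(ctx, k, GGML_TYPE_F16);
    struct ggml_tensor * v16 = ggml_cast(ctx, v, GGML_TYPE_F16);

    // mask = NULL, max_bias = 0.0f, logit_softcap = 0.0f (assumed defaults
    // for the recent ggml_flash_attn_ext signature)
    return ggml_flash_attn_ext(ctx, q, k16, v16, NULL, scale, 0.0f, 0.0f);
}
```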
-
Has anyone successfully experimented with quantizing the latent image tensor to f16 or q8? I guess this could help a lot with generating high-resolution images on limited memory, assuming f32 precision isn't actually needed, of course.
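As a rough illustration of what the f16 half of that experiment could look like: a minimal sketch using ggml's `ggml_cast`, assuming your backend supports casting at that point in the graph. The helper name `latent_to_f16` and the latent shape comment are assumptions, not anything from this repo.

```c
#include "ggml.h"

// Hypothetical helper: cast an f32 latent to f16 inside a ggml graph.
// For a typical SD-style latent (8x downsampled, 4 channels) this halves
// its memory footprint; downstream ops must accept f16 input.
struct ggml_tensor * latent_to_f16(struct ggml_context * ctx,
                                   struct ggml_tensor * latent /* f32 */) {
    return ggml_cast(ctx, latent, GGML_TYPE_F16);
}
```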