Currently, `DotProductAttention` caches decisions such as the result of `get_attention_backend` for a single set of `attention_params`, which helps reduce CPU overhead in the `DotProductAttention` call. However, with model architectures that use more than one attention shape (for example, self-attention and cross-attention), this caching breaks down: the cached entry is invalidated every time the params change.

This is a feature request to cache more than one set of `attention_params`. Ideally the cache size would be configurable, since some models use more than two shapes; if it cannot be made configurable, 4 may be a safe default.
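One possible shape for this is a small LRU-style cache keyed by the attention params, so self- and cross-attention each keep their own backend decision instead of evicting each other. The sketch below is hypothetical: `AttentionBackendCache` and its method names are not part of Transformer Engine, and the real `get_attention_backend` would replace the `compute_fn` callback.

```python
from collections import OrderedDict


class AttentionBackendCache:
    """Hypothetical multi-entry cache for attention-backend decisions.

    Keys would be a hashable summary of attention_params (shapes, dtype,
    mask type, ...); values are whatever get_attention_backend returns.
    Entries beyond `maxsize` are evicted least-recently-used first.
    """

    def __init__(self, maxsize=4):  # 4 chosen per the request above
        self.maxsize = maxsize
        self._cache = OrderedDict()

    def get(self, params_key, compute_fn):
        if params_key in self._cache:
            # Cache hit: mark the entry as most recently used.
            self._cache.move_to_end(params_key)
            return self._cache[params_key]
        # Cache miss: run the (expensive) backend selection once.
        backend = compute_fn()
        self._cache[params_key] = backend
        if len(self._cache) > self.maxsize:
            # Evict the least-recently-used entry.
            self._cache.popitem(last=False)
        return backend
```

With a cache like this, alternating self- and cross-attention calls would each hit their own cached entry rather than resetting a single slot on every shape change.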