Hi @rymaju Great question! The nice part about malt is that programming in it should be pretty natural for the most part. You can start by defining functions for the smaller blocks (such as scaled dot-product attention) and then combining them by composing the blocks. There are quite a few examples of blocks from chapter 12 onwards. I believe all the pieces required to implement attention are present in the malt distribution. Let us know if you run into any fundamental difficulties or missing pieces as you try to implement it, and we can help augment malt to support it. The biggest issue, perhaps, is that we still lack GPU support in malt, so larger networks will be more time-consuming. Providing GPU support is on our roadmap, but small networks using attention should be easily implementable, at the very least for learning and demonstration purposes.
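For reference, the scaled dot-product attention mentioned above computes `softmax(Q Kᵀ / √d_k) V`. Here is a minimal sketch of that computation in plain Python/NumPy (not malt code; the names and shapes are illustrative only). The malt version would express the same steps with malt's tensor operations and wrap them as a block to be composed with the rest of the network:

```python
# Sketch of scaled dot-product attention, for illustration only.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V                   # weighted sum of values, (n_queries, d_v)

# Tiny usage example with random tensors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```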
Attention was addressed briefly at the end of the book, but I'm really curious how one might do this. Hoping someone with a better understanding could chime in!