
v0.21.0

@awni released this on 22 Nov 20:18 · bb303c4

Highlights

  • Support 3-bit and 6-bit quantization: benchmarks (see the sketch after this list)
  • Much faster memory-efficient attention for head dimensions 64 and 80: benchmarks
  • Much faster sdpa inference kernel for longer sequences: benchmarks
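
A minimal sketch of using the new bit widths through mx.quantize and mx.quantized_matmul. The shapes and group_size below are illustrative choices, not values from the release notes:

```python
import mlx.core as mx

w = mx.random.normal((4096, 4096))
x = mx.random.normal((1, 4096))

for bits in (3, 6):
    # quantize returns the packed weights plus per-group scales and biases
    wq, scales, biases = mx.quantize(w, group_size=64, bits=bits)
    # matmul directly against the quantized weights
    y = mx.quantized_matmul(
        x, wq, scales, biases, transpose=True, group_size=64, bits=bits
    )
    print(bits, y.shape)
```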

Core

  • contiguous op (C++ only) + primitive
  • BFS width limit to reduce memory consumption during eval
  • Fast CPU quantization (see the sketch after this list)
  • Faster indexing math in several kernels:
    • unary, binary, ternary, copy, compiled, reduce
  • Improve thread dispatch for a few kernels:
    • conv, gemm splitk, custom kernels
  • More buffer donation with no-ops to reduce memory use
  • Use CMAKE_OSX_DEPLOYMENT_TARGET to pick Metal version
  • Dispatch Metal bf16 type at runtime when using the JIT
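
A minimal sketch of exercising the faster CPU quantization path. Targeting the CPU via the mx.stream(mx.cpu) context is an assumption about usage, not something stated in the notes:

```python
import mlx.core as mx

w = mx.random.normal((1024, 1024))

# Run quantization and dequantization on the CPU stream (assumed usage)
with mx.stream(mx.cpu):
    wq, scales, biases = mx.quantize(w, group_size=64, bits=4)
    w_hat = mx.dequantize(wq, scales, biases, group_size=64, bits=4)
    mx.eval(w_hat)
```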

NN

  • nn.AvgPool3d and nn.MaxPool3d
  • Support groups in nn.Conv2d (see the sketch after this list)
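
A minimal sketch of the new pooling modules and grouped convolution, assuming MLX's channels-last layout; the sizes are illustrative:

```python
import mlx.core as mx
import mlx.nn as nn

# 3D pooling over a (batch, depth, height, width, channels) input
x3d = mx.random.normal((2, 8, 16, 16, 4))
print(nn.MaxPool3d(kernel_size=2, stride=2)(x3d).shape)  # (2, 4, 8, 8, 4)
print(nn.AvgPool3d(kernel_size=2, stride=2)(x3d).shape)  # (2, 4, 8, 8, 4)

# Grouped 2D convolution: 16 input and 32 output channels split into 4 groups
x2d = mx.random.normal((2, 32, 32, 16))
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1, groups=4)
print(conv(x2d).shape)  # (2, 32, 32, 32)
```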

Bug fixes

  • Fix per-example mask + docs in sdpa (see the sketch after this list)
  • Fix FFT synchronization bug (use dispatch method everywhere)
  • Throw for invalid *fft{2,n} cases
  • Fix OOB access in qmv
  • Fix donation in sdpa to reduce memory use
  • Allocate safetensors header on the heap to avoid stack overflow
  • Fix sibling memory leak
  • Fix view segfault for scalar inputs
  • Fix concatenate vmap
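
A minimal sketch of passing a per-example (batched) additive mask to mx.fast.scaled_dot_product_attention; the shapes and the random mask are illustrative:

```python
import mlx.core as mx

B, H, L, D = 2, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# One additive mask per batch element, broadcast over heads
mask = mx.zeros((B, 1, L, L))
mask = mx.where(mx.random.uniform(shape=(B, 1, L, L)) < 0.1, -mx.inf, mask)

out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5, mask=mask)
print(out.shape)  # (2, 8, 128, 64)
```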