Matrix multiplication is implemented using a systolic array architecture.
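
To illustrate the general idea of a systolic array, here is a minimal software sketch in Python. It is not a model of this design's hardware: the array size, the dataflow (output-stationary is assumed here), and the function name `systolic_matmul` are all assumptions made for illustration. Each processing element (PE) multiplies the two operands arriving from its left and top neighbours, adds the product to a local accumulator, and passes the operands on to its right and bottom neighbours on the next cycle.

```python
def systolic_matmul(A, B):
    """Simulate a hypothetical output-stationary systolic array computing A x B.

    A is n x k and B is k x m (lists of lists). PE (i, j) accumulates C[i][j].
    Row i of A enters from the left, delayed by i cycles; column j of B
    enters from the top, delayed by j cycles (the usual input skew).
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # per-PE accumulators
    a_reg = [[0] * m for _ in range(n)]  # operand registers moving right
    b_reg = [[0] * m for _ in range(n)]  # operand registers moving down

    # The last PE (n-1, m-1) sees its final operand pair at cycle n + m + k - 3.
    for t in range(n + m + k - 2):
        # Visit PEs from bottom-right to top-left so each PE reads the value
        # its neighbour held on the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    s = t - i  # skewed index into row i of A
                    a_in = A[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    s = t - j  # skewed index into column j of B
                    b_in = B[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate
                a_reg[i][j] = a_in        # forward to the right neighbour
                b_reg[i][j] = b_in        # forward to the bottom neighbour
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this sketch the result stays inside the array until the computation finishes. A real design would then have to shift the accumulators out over its output pins, and the source does not describe how that is done.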
On every clock cycle, feed packed weight data to the Input pins and input data to the Bidirectional pins. Strobe the Enable pin to start receiving the results of the matrix multiplication on the Output pins.
An MCU is required to feed the weights and input data into the accelerator and to fetch the results.