A Deep Neural Network-inference accelerator is created in hardware. The codes for hardware is written in System Verilog. The hardware module is interfaced with NIOS computer system, thus this hardware acts as a peripheral to the computer system. The driver code to interface the hardware is written in C, and can be found in sources_NIOS folder.
A simple Neural Network implementation in software is also done in C. This implementation can be found in the same C file mentioned above.
- Compile the System Verilog code in Intel Quartus or any other software (This step is unnecessary, and can be skipped if you do not want to check for errors).
- Create a custom peripheral using the system Verilog codes provided.
- Add the peripheral created above to your favorite computer system. The top level assumes NIOS computer system with Avalon interface. However, same System Verilog Code can be used to interface with ARM since the compiler will insert additional buses while compiling.
- Use any software to program the computer system that you just used. Program the computer system with the C file provided, and watch the output.
- Compare the results and performance between the hardware implementation and software implementation based on the result seen on terminal.
- For your custom implementation, please train the network and generate weights and biases. Convert them to fixed point representation. You can update the specific locations in system Verilog code taking this implementation as reference.
The problem that is being solved by the implementation is a classification based on a complex mathematical function. (This mathematical function is, however, very simple compared to the real world problems that we face today. This is just an illustration.)
Here, two mathematical functions, y1 = sin(x^2+2*x+3)
and y2 = cos(x^2+2*x+3)
. The domain of x
taken for y1
is x1=[-7.00:7.00:0.01]
and for y2
is x2=[-7.01:6.99:0.01]
. The individual functions look as follows:
and
When we add these two functions, the added function with respect to x1
looks like following:
We can take a threshold of 0.0 and create a classification problem out of this such that value
Class Label 1 if f(x1,x2) = y1+y2 > 0.00
Class Label 0 if f(x1,x2) = y1+y2 <= 0.00
- Activation of ReLU is used because it is simple.
- The weights and inputs are float for training. Training is done in SKLearn using MLP Classifier.
- For hardware implementation, the weight values and biases are converted to Fixed Point Implementation. 9 bit representation with 4 places of fractional part is used because DE2-115 platform has 9-bit multiplication modules.
- Python scripts are used to automatically generate the codes for Single Layer and Neural Network. These codes are not released however. Please let me know and I will release them personally.
- Software Implementation of NN in NIOS is done using floating point representation.
- The implementation in hardware is fully combinational. This might be optimal for resource usage. All the implementation boils down to multiplication routine. There is a sequential multiplication routine as well. Please fee free to modify the code using sequential multiplication routine so that resource utilization is optimized.
Following shows the result obtained by running the implementation:
We can see the speed up achieved is 400 times.