How to port the SIMTight project to PYNQ-Z2 FPGA board? #16

Open
Honourable-A opened this issue Oct 19, 2022 · 9 comments

Comments

@Honourable-A

Hi, as you know, the DE10-Pro FPGA board is very expensive. I have a PYNQ-Z2 FPGA board (details at https://www.tulembedded.com/FPGA/ProductsPYNQ-Z2.html) and want to do some research on SIMTight. Could you tell me how to port the project to this low-end board? I am asking for the exact steps because I am not experienced in porting FPGA designs.

@mn416
Collaborator

mn416 commented Oct 20, 2022

Hi, unfortunately we don't currently have the resources to port or maintain a port to Xilinx devices. Perhaps the ability to run in simulation is still useful to you? I foresee three issues in porting:

  1. Quad-port BRAMs available on Stratix 10 may have like-for-like replacements in modern Xilinx devices but possibly not PYNQ. This is not a synthesis issue (there are pure Verilog versions of these components available for any FPGA) but an efficiency one (the pure Verilog components may map down to registers rather than BRAMs).

  2. The DRAM bandwidth on the PYNQ is lower, so one would probably halve the DRAM bus width and the number of vector lanes in Config.h.

  3. We use an Intel clock-crossing primitive to put the CPU and SIMT cores in different clock domains, but this isn't really necessary and could simply be removed.

@Honourable-A
Author

Hi, thanks for your suggestions. Since you mentioned simulation, I have a question about it. My understanding is that the SIMTight simulator probably simulates only the CPU and not the SIMT cores. Please correct me if I am wrong.

@mn416
Collaborator

mn416 commented Oct 20, 2022

It simulates the entire SoC including CPU, SIMT core, memory subsystem, UART and DRAM.

The drawback is that simulation is (of course) slow compared to FPGA. Therefore, in simulation the benchmarks are run only for small data-set sizes. This can underload the system and cause a dip in IPC. So it may be desirable to increase the data-set sizes slightly in simulation until the benchmarks reach an IPC close to the following figures obtained on FPGA:

Samples/VecAdd (build): ok
Samples/VecAdd (run): ok [IPC=29.26,Instrs=9126880,Cycles=311871,DRAMAccs=189100,Retries=23227,Susps=0]
Samples/Histogram (build): ok
Samples/Histogram (run): ok [IPC=31.14,Instrs=7153216,Cycles=229718,DRAMAccs=32994,Retries=4702,Susps=0]
Samples/Reduce (build): ok
Samples/Reduce (run): ok [IPC=31.56,Instrs=6358334,Cycles=201496,DRAMAccs=64101,Retries=733,Susps=0]
Samples/Scan (build): ok
Samples/Scan (run): ok [IPC=30.33,Instrs=222357876,Cycles=7330080,DRAMAccs=162304,Retries=45776,Susps=0]
Samples/Transpose (build): ok
Samples/Transpose (run): ok [IPC=31.28,Instrs=5648320,Cycles=180567,DRAMAccs=50240,Retries=2481,Susps=0]
Samples/MatVecMul (build): ok
Samples/MatVecMul (run): ok [IPC=28.88,Instrs=10864608,Cycles=376171,DRAMAccs=139968,Retries=5969,Susps=0]
Samples/MatMul (build): ok
Samples/MatMul (run): ok [IPC=31.40,Instrs=144054240,Cycles=4588073,DRAMAccs=89472,Retries=82750,Susps=0]
InHouse/BlockedStencil (build): ok
InHouse/BlockedStencil (run): ok [IPC=27.01,Instrs=48971680,Cycles=1812934,DRAMAccs=212416,Retries=10757,Susps=0]
InHouse/StripedStencil (build): ok
InHouse/StripedStencil (run): ok [IPC=31.45,Instrs=35541920,Cycles=1129937,DRAMAccs=175360,Retries=2345,Susps=0]
InHouse/VecGCD (build): ok
InHouse/VecGCD (run): ok [IPC=4.23,Instrs=10955517,Cycles=2591078,DRAMAccs=20350,Retries=892,Susps=0]
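
The IPC figure in each line is simply Instrs divided by Cycles, so the comparison described above can be scripted. The following is a minimal sketch, assuming the simulator's test output uses the same line format as the FPGA log above; the helper names (`parse_run_line`, `ipc`, `underloaded`) and the 10% tolerance are illustrative choices, not part of SIMTight.

```python
import re

# Matches one "(run)" line of the benchmark log quoted above.
RUN_LINE = re.compile(r"^(?P<name>\S+) \(run\): ok \[(?P<stats>[^\]]*)\]$")

def parse_run_line(line):
    """Return (benchmark name, dict of counters), or None for other lines."""
    m = RUN_LINE.match(line.strip())
    if m is None:
        return None
    stats = {}
    for field in m.group("stats").split(","):
        key, value = field.split("=")
        stats[key] = float(value)
    return m.group("name"), stats

def ipc(stats):
    """Instructions per cycle, recomputed from the raw counters."""
    return stats["Instrs"] / stats["Cycles"]

def underloaded(sim_stats, fpga_stats, tolerance=0.10):
    """True if the simulated IPC is more than `tolerance` below the FPGA IPC.

    The 10% default is an arbitrary threshold chosen for illustration.
    """
    return ipc(sim_stats) < (1.0 - tolerance) * ipc(fpga_stats)

# FPGA reference line copied from the log above.
fpga_line = ("Samples/VecAdd (run): ok [IPC=29.26,Instrs=9126880,"
             "Cycles=311871,DRAMAccs=189100,Retries=23227,Susps=0]")
name, fpga = parse_run_line(fpga_line)
print(name, round(ipc(fpga), 2))
```

Feeding the corresponding line from a simulation run into `underloaded` alongside the FPGA counters gives a quick indication of whether the data-set size is still too small.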

@Honourable-A
Author

Thanks again for the clarification. Is there any document or user guide for this simulator? My intention is to develop an OS for SIMTight, but I am not sure whether, or how, I can use the simulator for that.

@mn416
Collaborator

mn416 commented Oct 20, 2022

These are the only docs at the moment:

The first one does explain how to use the simulator. The second one discusses software interfaces.

@Honourable-A
Author

Can you please tell me what the Mailbox and the ITCM are in the SoC diagram? Also, how are the CPU and the SIMT core connected, and is it possible to run applications on the CPU and the SIMT core at the same time? There is a UART (USB) connection to the CPU. What is the purpose of this connection? Thanks

@mn416
Collaborator

mn416 commented Nov 20, 2022

We hope to improve SIMTight's documentation over the next year. Hopefully, I will be able to address such questions as part of that process.

@Honourable-A
Author

Thanks for your answer. I have another question, about scalarisation. How do you implement dynamic scalarisation in hardware? Is it detected in the simple or host core, or in the SIMT core? Is there any existing literature on dynamic scalarisation that you can direct me to? As per the description, the entire warp is executed on a single execution unit in a single cycle because of scalarisation. What does a single execution unit mean here? Is it a single hardware thread inside a block? Also, according to the description, it operates in parallel with the main vector pipeline. How is this done, given that only synchronous kernel invocation is currently available? That means the host can run only one kernel at a time and waits until it finishes, so a scalar-optimized kernel must finish before any other kernel can run. Also, what is the main vector pipeline? I know I have asked a lot of questions, but it would be very helpful if you could shed some light on them.

@mn416
Collaborator

mn416 commented Nov 21, 2022

Again, I'll try to address these questions in the upcoming documentation process. Briefly:

  • Regarding existing work on scalarisation, there is lots. To mention two: this GPGPU architecture book and this ISCA'13 paper.
  • SIMTight's SIMT core contains a scalar pipeline and a vector pipeline, both independent of the host CPU which is not part of the SIMT core in any way.
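
As a rough illustration of the general idea discussed in the scalarisation literature (and not a description of SIMTight's hardware), the sketch below classifies a warp's per-lane register values as uniform, affine, or general. An instruction whose inputs are all uniform produces the same result in every lane, so in principle it can be computed once rather than per lane. The function names and the three-way classification are assumptions made for this example.

```python
def classify_vector(values):
    """Classify one per-lane register vector of a warp.

    'uniform': every lane holds the same value.
    'affine' : lane i holds base + i * stride for a constant non-zero stride.
    'vector' : anything else, which needs a full per-lane computation.
    """
    if len(values) < 2:
        return "uniform"
    stride = values[1] - values[0]
    if all(values[i] - values[0] == i * stride for i in range(len(values))):
        return "uniform" if stride == 0 else "affine"
    return "vector"

def scalarisable(operands):
    """In this simplified model, an instruction is a scalarisation
    candidate when all of its input vectors are uniform."""
    return all(classify_vector(v) == "uniform" for v in operands)

# Example vectors for a hypothetical 32-lane warp.
loop_bound = [100] * 32                # same value in every lane
thread_addr = list(range(0, 128, 4))   # base 0, stride 4 per lane
loaded_data = [3, 1, 4, 1, 5, 9, 2, 6] * 4

print(classify_vector(loop_bound))     # uniform
print(classify_vector(thread_addr))    # affine
print(classify_vector(loaded_data))    # vector
```

A hardware implementation would make this decision per instruction at run time, which is why it can coexist with a vector pipeline inside one SIMT core regardless of how kernels are launched from the host.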
