[a64] Implement an ARM64 backend #2259

Wunkolo · 2024-05-15T02:01:56Z

Implements a 64-bit ARM backend that emits a64 instructions using oaknut.

Depends on #2258 and xenia-project/FFmpeg#8

Addresses #2002

Tested on a ThinkPad X13s and uses unit tests from #1348 as well. There is currently an ARMv8.1-A requirement due to the use of some of the newer atomic instructions such as `CASAL`.

Wunkolo · 2024-05-23T02:10:52Z

Debugger, instruction-stepping, call-stack unwinding, etc. have been implemented as well.

Wunkolo · 2024-05-28T20:41:04Z

Latest iteration running Beautiful Katamari and Geometry Wars. Still some minor issues but serving gameplay now.

kata.mp4

geo.wars.mp4

Wunkolo · 2024-05-29T17:56:46Z

No longer requires Armv8.1. Instructions are emitted with an Armv8.0-A baseline, and features such as FP16 and LSE are detected before they are used (and exposed in the feature-mask config, similar to x64).

Uses the `CNTFRQ` and `CNTVCT` system registers as a raw clock source. On my ThinkPad X13s, the raw clock source reports a tick frequency of 19,200,000 Hz while the platform clock source (`QueryPerformanceFrequency`) reports 10,000,000 Hz. That is almost double the resolution of the platform clock!
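
To put the two frequencies side by side, here is a minimal sketch of the tick-to-time conversion. The helper name `TicksToNanoseconds` is made up for illustration and is not taken from the PR; reading the counter registers themselves is platform-specific and left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): converts a raw counter value into
// nanoseconds. `frequency_hz` stands in for the value read from CNTFRQ.
// Splitting into whole seconds plus a remainder keeps the intermediate
// product within 64 bits for counter frequencies up to a few GHz.
inline uint64_t TicksToNanoseconds(uint64_t ticks, uint64_t frequency_hz) {
  const uint64_t kNanosPerSecond = 1000000000ull;
  const uint64_t whole_seconds = ticks / frequency_hz;
  const uint64_t remainder = ticks % frequency_hz;
  return whole_seconds * kNanosPerSecond +
         (remainder * kNanosPerSecond) / frequency_hz;
}
```

With the numbers above, one raw tick is roughly 52 ns at 19.2 MHz, against 100 ns per tick at 10 MHz.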

`FMOV` encodes an 8-bit floating-point immediate that can be used to accelerate the loading of certain constant floating-point values with magnitudes between 0.125 and 31.0 (only values of the form ±(n/16)·2^r with 16 ≤ n ≤ 31 and -3 ≤ r ≤ 4 are encodable, so zero is not covered). A lot of immediates such as -1.0, 1.0, 0.5, etc. fall within this set and this code gets lots of hits in my testing. This is much cheaper than loading a 32/64-bit value into W0/X0 and moving it into an FP register.
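
As a rough illustration of the check involved, the sketch below tests whether a single-precision value fits the `FMOV` 8-bit immediate format and computes the `imm8` field. The function names are hypothetical and this is not the code from `a64_util`; it only follows the architectural bit layout (sign, three exponent bits, four fraction bits).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the raw IEEE-754 bits of a float.
inline uint32_t FloatBits(float value) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Hypothetical helper: true if `value` is representable as an FMOV imm8.
// The low 19 fraction bits must be zero, and exponent bits 30..25 must be
// either 100000 or 011111 (the top bit is the inverse of the replicated bit).
inline bool IsFmovImm8(float value) {
  const uint32_t bits = FloatBits(value);
  if (bits & 0x7FFFFu) return false;
  const uint32_t exp_hi = (bits >> 25) & 0x3Fu;
  return exp_hi == 0x20u || exp_hi == 0x1Fu;
}

// Hypothetical helper: packs sign, exponent and fraction bits into imm8.
// Only meaningful when IsFmovImm8(value) is true.
inline uint8_t EncodeFmovImm8(float value) {
  const uint32_t bits = FloatBits(value);
  return static_cast<uint8_t>(((bits >> 24) & 0x80u) | ((bits >> 19) & 0x7Fu));
}
```

For example, `1.0f` (0x3F800000) passes the check and encodes as `0x70`, while `0.0f` and `32.0f` are rejected and need another path.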

`dc civac` causes an illegal-instruction exception on Windows-ARM. This is likely a security measure against cache attacks. On Linux this instruction is trapped into an EL1 kernel function. Windows does not seem to have any user-mode cache-maintenance instructions available for the data cache (only the instruction cache, via `FlushInstructionCache`). The closest thing we can do for now is a full data synchronization barrier with `dsb ish`. Prefetches are implemented using `prfm pldl1keep, ...`.

The emitter doesn't actually hold onto executable code; it just generates the assembled code into a buffer for the currently-resolving function before placing it into a code cache. When code gets pushed into the code cache, it can just be copied from a `std::vector` and the buffer reset. The code cache itself maintains the actual executable memory, the stack-unwinding data, and such. This also fixes a bunch of erroneous relative-addressing glitches where relative addresses were calculated from the address of the unused CodeBlock rather than being position-independent. `MOVP2R` in particular was generating different instructions depending on its distance from the code block, when it should always just use `MOV` and not do any relative-address calculations, since we cannot predict what the program counter will be when the instruction actually runs. Oaknut probably needs a "position independent" policy or mode so that it avoids PC-relative instructions.
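
To show why a plain `MOV` of an absolute pointer is position-independent, here is a small sketch that splits a 64-bit address into the 16-bit chunks a `MOVZ`/`MOVK` sequence would carry. The struct and function names are hypothetical; this is not oaknut's or the PR's actual `MOVP2R` expansion, only the general idea that the result never depends on where the code ends up.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One 16-bit piece of an absolute address and the shift it is placed at.
struct MovChunk {
  uint16_t imm16;
  unsigned shift;
};

// Hypothetical helper: the first chunk would be emitted as MOVZ and the
// rest as MOVK. Zero chunks are skipped because MOVZ clears the register.
inline std::vector<MovChunk> SplitPointerForMov(uint64_t address) {
  std::vector<MovChunk> chunks;
  for (unsigned shift = 0; shift < 64; shift += 16) {
    const uint16_t imm = static_cast<uint16_t>((address >> shift) & 0xFFFFu);
    if (imm != 0) chunks.push_back({imm, shift});
  }
  // An all-zero address still needs one instruction.
  if (chunks.empty()) chunks.push_back({0, 0});
  return chunks;
}
```

The same chunks come out regardless of the buffer the instructions are first written to, which is the property the copy into the code cache relies on.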

talynone · 2024-11-27T02:20:54Z

Any progress on this possible?

Wunkolo · 2024-11-27T02:56:16Z

At this point this is pretty much ready for review and merging, but it depends on #2258 and xenia-project/FFmpeg#8 being merged and the submodules being updated in this repo and maybe some additional testing with more games. Though, this repo is somewhat inactive these days it seems. The last PR was merged several months ago.

Wunkolo added 9 commits April 27, 2024 16:45

[Build] Add Windows ARM64 support

1746177

Separates the `Windows` platform into `Windows-x86_64` and `Windows-ARM64`. Adds `--arch` argument to `build`. Removes x64 backend on non-x64 targets.

[Base] Add Windows-ARM64 exception handling

a6d9113

[CPU] Add Windows ARM64 stack-walker

1874f0c

[ImGui] Stub ARM64 host debug text

b48ec84

Marked as TODO for now

[Base] Disable AVX check on ARM64

f254848

[CPU] Disable x64 backend on ARM64

fe9c98e

[Base] Add Windows-ARM64 bit_count implementation

045441a

Uses intrinsics from https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170

[CPU] Stub ARM64 to Null CPU backend

f2b05ea

Adding the `a64` backend will be a different PR. For now it's stubbed to the null backend to allow the main executable to open without failing initialization.

[UI] Fix divide-by-zero hazard

aa4a3e0

This value is currently returning `0` on ARM machines and throws an exception.

Wunkolo force-pushed the arm64-backend branch from abdfaaa to 7d57ba0 Compare May 20, 2024 16:54

Wunkolo force-pushed the arm64-backend branch from 47d801f to 0766b7a Compare May 28, 2024 23:21

Wunkolo force-pushed the arm64-backend branch 3 times, most recently from 4fa2462 to 54790a4 Compare June 8, 2024 21:34

Wunkolo mentioned this pull request Jun 12, 2024

Xenia for Mac? #596

Open

Wunkolo force-pushed the arm64-backend branch from 2486725 to 40d2d33 Compare June 14, 2024 00:57

Wunkolo added 11 commits June 23, 2024 13:48

[Build] Link SDL2 to xenia-app

a0f6cd7

Addresses a build issue that seems to occur now that xenia-app is not getting SDL2 through one of its submodules.

[CPU] Add ARM64 backend build target

ffc966c

Adds the new `xenia-cpu-backend-a64` build-target with linkage following the x64 backend.

[a64] Integrate oaknut submodule

59bc265

Header-only library for emitting arm64v8 instructions. Enables C++20 only for the a64 backend for now

[Base] Add ARM64 utility functions

2284ed4

Mostly element-accessors

[CPU] Implement ARM64 CPU backend

9960ef9

First pass framework that gets emitted ARM code executing. Based on the x64 backend, implements an ARM64 JIT backend.

[a64] Fix BYTE_SWAP_V128

39429aa

This just reverses the bytes within each 32-bit value; it does not reverse the whole vector.
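
A scalar model of the intended behaviour, as a sketch only (the function name is hypothetical and the backend itself works on vector registers rather than arrays):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical reference model: reverses the bytes inside each of the four
// 32-bit lanes of a 128-bit value and leaves the lane order untouched.
inline std::array<uint8_t, 16> ByteSwapV128(std::array<uint8_t, 16> v) {
  for (std::size_t lane = 0; lane < 16; lane += 4) {
    std::swap(v[lane + 0], v[lane + 3]);
    std::swap(v[lane + 1], v[lane + 2]);
  }
  return v;
}
```

Bytes 0..3 become 3..0, bytes 4..7 become 7..4, and so on; a whole-vector reversal would instead move byte 15 to position 0.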

[a64] Implement OPCODE_EXTRACT

b9571cf

[a64] Implement OPCODE_SPLAT

652b7a1

[a64] Implement OPCODE_INSERT

10310d7

[a64] Implement OPCODE_LOAD_VECTOR_SHL

61feb6a

[a64] Implement OPCODE_LOAD_VECTOR_SHR

1b574be

Wunkolo added 25 commits June 23, 2024 14:00

[a64] Implement OPCODE_PACK(2101010, 4202020, 8-in-16, 16-in-32)

40d908b

[a64] Fix OPCODE_PACK saturation edge-cases

6478623

Passes cpu-ppc-tests

[a64] Implement OPCODE_UNPACK

96d444d

This is a very literal translation from the x64 code into ARM and may not be very optimized. Passes the unit tests save for a couple of off-by-one errors.

[a64] Implement LSE and FP16C detection

06daedf

Adds two new flags for allowing the use of LSE and FP16C

[a64] Optimize OPCODE_{UN}PACK(float16) with F16C

2d72b40

[a64] Fix OPCODE_PACK(short)

4ff43ae

Narrowing-saturation instructions cause off-by-one rounding errors. Using min+max+shuffle instead passes more unit tests.

[a64] Optimize bulk VConst access with relative addressing

fc1a13d

Loads the pointer to the VConst table once and addresses each constant at an offset from this base address, derived from the underlying enum value. Reduces the number of instructions for each VConst memory load.
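
A minimal sketch of the offset calculation, assuming each table entry is one 16-byte vector. The enum and its members are placeholders invented for this example and do not match the real VConst enum.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder enum for illustration only; the real table has its own entries.
enum class ExampleVConst : uint32_t { kZero = 0, kOne, kSignMask, kCount };

// Each entry is assumed to be a 128-bit (16-byte) vector.
constexpr uint32_t kVectorSizeBytes = 16;

// Byte offset of a constant relative to the table base pointer.
constexpr uint32_t VConstOffset(ExampleVConst id) {
  return static_cast<uint32_t>(id) * kVectorSizeBytes;
}

// A 128-bit LDR with an unsigned scaled immediate can reach offsets up to
// 4095 * 16 bytes, so a small table fits in a single load per constant.
static_assert(VConstOffset(ExampleVConst::kCount) <= 4095u * 16u + 16u,
              "table too large for a scaled immediate offset");
```

With the base held in a register, each constant load then needs only the one load instruction with an immediate offset.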

[a64] Optimize constant vector byte-splats

bf12583

Detect when all bytes are repeating and use `MOVI` when applicable
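
The detection itself can be as simple as the following sketch (hypothetical function name, operating on a plain byte array rather than the backend's vector constant type):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when all 16 bytes of a constant are identical,
// in which case a byte-wise MOVI with that value can materialize it.
inline bool IsByteSplat(const std::array<uint8_t, 16>& bytes) {
  return std::all_of(bytes.begin(), bytes.end(),
                     [&bytes](uint8_t b) { return b == bytes[0]; });
}
```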

[a64] Fix OPCODE_SWIZZLE register-aliasing

63f31d5

Indices and non-const tables were using the same scratch-register

[a64] Remove VOne constant in favor of FMOV

cba92a2

[a64] Add arch-agnostic documentation configurations

7b9f791

Missed some during the first pass. Now the config files will mention a64 differences.

[a64] Optimize zero MovMem64

818a773

Read directly from the ZR (zero register) in the case that we are just storing a 64- or 32-bit zero.

[a64] Implement OPCODE_DID_SATURATE

f830f79

This directly maps to the QC bit in the FPSR. Just have to make sure that the saturating instruction is the very last instruction (which is currently the case for stuff like VECTOR_ADD and such).
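
For reference, QC is bit 27 of the FPSR and is sticky, so it has to be cleared before the saturating instruction for the readback to mean anything. A sketch of the bit handling on an already-read register value (helper names are hypothetical; reading and writing the FPSR itself is omitted):

```cpp
#include <cassert>
#include <cstdint>

// Bit position of the cumulative saturation flag (QC) in the FPSR.
constexpr uint32_t kFpsrQcBit = 27;

// Hypothetical helper: true if the given FPSR value has QC set.
constexpr bool DidSaturate(uint64_t fpsr) {
  return ((fpsr >> kFpsrQcBit) & 1u) != 0;
}

// Hypothetical helper: FPSR value with QC cleared, other bits preserved.
constexpr uint64_t ClearSaturation(uint64_t fpsr) {
  return fpsr & ~(uint64_t{1} << kFpsrQcBit);
}
```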

[a64] Detect MOVI utilizations for vector-element splats(u8,u16,u32)

8f6c0ad

The 64-bit case uses a particular replicated 8-bit immediate, so something else will have to handle that. This catches a lot of cases without having to touch memory. Does not catch cases of `1.0` (0x3f800000).
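
For the 32-bit element case, the shifted form of `MOVI` takes one byte placed at a shift of 0, 8, 16 or 24. A sketch of that check follows; the function name is hypothetical and only the plain `LSL` form is modelled, not the `MSL` variants.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: finds an (imm8, shift) pair such that
// value == imm8 << shift with shift in {0, 8, 16, 24}.
// Returns false when more than one byte of the value is non-zero.
inline bool FindMoviShift32(uint32_t value, uint8_t* imm8, unsigned* shift) {
  for (unsigned s = 0; s < 32; s += 8) {
    if ((value & ~(0xFFu << s)) == 0) {
      *imm8 = static_cast<uint8_t>(value >> s);
      *shift = s;
      return true;
    }
  }
  return false;
}
```

`0x3f800000` has two non-zero bytes (`0x3f` and `0x80`), which is why `1.0` falls through this path.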

[a64] Implement armv8.0 atomic operations

151700d

Uses LSE when available, but provides an armv8.0 baseline implementation.

[a64] Remove x64 reference implementations

164f1e4

Removes all comments relating to x64 implementation details

[a64] Fix out-of-bounds OPCODE_VECTOR_SHL(all-same) case

02edbd2

Out-of-bound shift-values are handled as modulo-element-size
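
A scalar sketch of the per-element semantics, with a hypothetical helper name. Masking is needed because the NEON register shifts produce zero for a left-shift amount equal to or larger than the element width instead of wrapping.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference model for one lane: the shift amount wraps at the
// element width (8, 16, 32 or 64 bits) before the shift is applied.
template <typename T>
inline T ShlModuloElementSize(T value, unsigned shift) {
  const unsigned bits = sizeof(T) * 8;
  return static_cast<T>(value << (shift & (bits - 1)));
}
```

So a 32-bit lane shifted by 33 behaves like a shift by 1, and a shift by 32 leaves the lane unchanged.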

[a64] Replace instances of MOV+DUP-splats to MOVI

3acd0a3

These `MOV`->`DUP` splats can just be a singular `MOVI` instruction

[a64] Optimize OPCODE_SPLAT byte-constants

539a03d

Byte-sized constants can utilize the `MOVI` instruction. This makes many cases such as zero-splats much faster, since this encodes as just a register rename (similar to `xor` on x64).

[a64] Optimize OPCODE_SPLAT with MOVI/FMOV

9c8b067

Moves the `FMOV` constant functions into `a64_util` so they are available to other translation units. Optimizes constant-splats with conditional use of `MOVI` and `FMOV`.

[a64] Remove redundant OPCODE_DOT_PRODUCT_{3,4} lane-isolation

9c572c3

The last `FADDP` writes into an `S` register, which automatically masks all the other lanes to zero.

Wunkolo force-pushed the arm64-backend branch from 40d2d33 to 9c572c3 Compare June 23, 2024 21:01

[a64] Implement support for large stack sizes

a8b9cd8

The `SUB` instruction can only encode a 12-bit immediate, optionally shifted left by 12 bits, i.e. values that fit within `0xFFF` or `0xFFF000`. If the stack size is greater than `0xFFF`, just align the stack size up to a multiple of `0x1000` to keep the bottom 12 bits clear.
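
The rounding described here could look like the sketch below. Both function names are hypothetical and the PR's actual code may differ; sizes above `0xFFF000` would still need another strategy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if `value` fits the ADD/SUB immediate form,
// i.e. a 12-bit value either unshifted or shifted left by 12.
constexpr bool IsAddSubImmediate(uint64_t value) {
  return (value & ~0xFFFull) == 0 || (value & ~0xFFF000ull) == 0;
}

// Hypothetical helper: small sizes are kept as-is, larger sizes are rounded
// up to a multiple of 0x1000 so the low 12 bits are clear.
constexpr uint64_t AlignStackSizeForSub(uint64_t size) {
  return size <= 0xFFFull ? size : (size + 0xFFFull) & ~0xFFFull;
}
```

For example, a stack size of `0x1234` is not encodable directly but becomes `0x2000`, which fits the shifted form.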

ArminiusTux mentioned this pull request Jul 27, 2024

Windows ARM support xemu-project/xemu#791

Open
