Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A64: Add support ARMv9.0 instructions #100

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ijan1
Copy link

@ijan1 ijan1 commented Oct 22, 2023

Support for instructions up to ARMv9.0, most notably SVE and SVE2.

However, there are a few issues yet to be fixed.
Firstly, on a system with a Vector Length (VL) of more than 128-bits it doesn't work. This has been tested with QEMU and natively on an AWS Graviton machine(c7g.4xlarge). The dispatcher for some reason takes the wrong path on a 'CBZ' instruction at around basic block 152. This leads to early program termination with a negative return status.

Secondly, it only works with a VL of 128-bits on a dynamically compiled (target) binary. Statically compiled ones just SEGFAULT.

Thirdly, having the instructions for pushing and popping the predicate registers makes it so that nothing works at all. Regardless of VL and how the (target) binary was compiled (dynamically or statically). I was thinking of disabling them temporarily because compiling Mambo with support for SVE and SVE2 instructions doesn't insert any such instructions, but that's not a guarantee with changes in compiler and/or compiler version.

With a VL of 128-bits (using QEMU), Highway[1] and ARMRAL[2] pass all SVE and SVE2 test.

I have attached a spreadsheet[3] with the SVE/SVE2 instructions added for easier visualisation and some notes on which ones I think might need a bit more work.

[1]https://github.com/google/highway
[2] https://developer.arm.com/documentation/102249/2310/?lang=en
[3] https://docs.google.com/spreadsheets/d/1jABn1_99A2nKaIqYgZTg_JrHdmxqhq-r4T9hTIFiKo0/edit?usp=sharing

ijan1 added 5 commits October 7, 2023 19:44
Armv8.3 Load-Acquire RCpc instructions

Instructions added:
- LDAPR
- LDAPRH
- LDAPRB
Armv8.4 Load-Acquire RCpc instructions version 2

Instructions added:
- STLUR
- STLURB
- STLURH
- LDAPURB
- LDAPURSB
- LDAPURSH
- LDAPUR
- LDAPURSW
Armv8.0 Speculation Barrier

Instructions added:
- SB
Additional SVE Features added:
- FEAT_SVE_AES
- FEAT_SVE_PMULL128
- FEAT_SVE_BitPerm
- FEAT_SVE_SHA3
- FEAT_SVE_SM4
- FEAT_SVE_B16B16
- FEAT_SVE2p1

Link to spreadsheet with added instructions:
https://docs.google.com/spreadsheets/d/1YYuc-CM2gdZJpnx3xm_31EOWQ5T0cZVLmgNM6vW0I2M/edit?usp=sharing
Renamed 'pop_neon' and 'push_neon' to
'pop_simd' and 'push_simd' respectively,
with an IFDEF selecting the corresponding
codepath.

However, there are a few issues yet to be ironed
out.

Firstly, on a system with a Vector Length (VL)
of more than 128-bits it doesn't work. This
has been tested with QEMU and natively on an
AWS Graviton machine(c7g.4xlarge). The dispatcher
for some reason takes the wrong path on a
'CBZ' instruction at around basic block 152.
This leads to early program termination with
a negative return status.

Secondly, it only works with a VL of 128-bits
on a dynamically compiled (target) binary.
Statically compiled ones just SEGFAULT.

Thirdly, having the instructions for pushing
and popping the predicate registers makes it so
that nothing works at all. Regardless of VL
and how the (target) binary was compiled
(dynamically or statically).
@IgWod
Copy link
Collaborator

IgWod commented Oct 24, 2023

Hi @ijan1,

Thank you for your contribution.

The code changes look good to me, but before we merge, let's discuss points you've raised.

Firstly, on a system with a Vector Length (VL) of more than 128-bits it doesn't work.

At the start-up, MAMBO should check the vector length supported by the system, and quit if it's not 128-bits. Not having a general support it's okay, but should be signaled. But before fixing that, take a look at my comments regarding the predicates.

Secondly, it only works with a VL of 128-bits on a dynamically compiled (target) binary. Statically compiled ones just SEGFAULT.

Again, we should have a check for that. If SVE is detected, the loader should check if loaded binary is static and quit with a message.

Thirdly, having the instructions for pushing and popping the predicate registers makes it so that nothing works at all.

This is the biggest issue at the moment, as we get into it may work ground or may not ground. And it seems more and more to me, the failure is not SVE specific, but rather relates to push_neon/push_simd using more than 512 bytes. We could do a quick check: let's have a neon application (no SVE) and just push some extra stuff on the stack in push_neon (and pop accordingly), and see if it works. If it fails, then that means the problem is with the stack and some assumptions being baked in MAMBO with respect to push_neon/pop_neon. So, let's talk about it offline and have one more go, before going ahead with the merge, and finding some temporary solution to the issue.

@IgWod
Copy link
Collaborator

IgWod commented Oct 24, 2023

Me again,

I had another look at the aarch64 dispatcher, and found something interesting:

.global syscall_wrapper
.global syscall_wrapper_svc
syscall_wrapper:
  STP X30, X29, [SP, #-16]!
  BL push_x4_x21
  STP X0, X1, [SP, #-32]!
  STP X2, X3, [SP, #16]
  BL push_neon

  MRS X19, NZCV
  MRS X20, FPCR
  MRS X21, FPSR

  MOV X0, X8
  ADD X1, SP, #512
  MOV X2, X29
  LDR X3, disp_thread_data
  LDR X4, syscall_handler_pre_addr

  BLR X4

  CBZ X0, s_w_r

  ADD X9, SP, #512
  LDP X0, X1, [X9, #0]
  LDP X2, X3, [X9, #16]
  LDP X4, X5, [X9, #32]
  LDP X6, X7, [X9, #48]
  LDR X8,     [X9, #64]

  // Balance the stack on rt_sigreturn, which doesn't return here
  CMP X8, #0x8b
  BNE svc
  ADD SP, SP, #(64 + 144 + 512)

svc: SVC 0
syscall_wrapper_svc:
  ADD X1, SP, #512
  STR X0, [X1, #0]
  MOV X0, X8
  MOV X2, X29
  LDR X3, disp_thread_data
  LDR X4, syscall_handler_post_addr
  BLR X4

s_w_r:
  BL pop_neon
  MSR NZCV, X19
  MSR FPCR, X20
  MSR FPSR, X21

  LDP X2, X3, [SP, #16]
  LDP X0, X1, [SP], #32
  BL pop_x4_x21
  LDP X29, X30, [SP, #16]
  STP X0, X1, [SP, #16]
  LDP X0, X1, [SP], #16

  B checked_cc_return

See how 512 repeats multiple times in the context of the stack, and it also calls push_neon/pop_neon... And the problem with SVE keeps appearing when we push more than 512-bytes. Did you already look into it? If not, then this maybe where the problem is.

`syscall_handler_pre` expects its second argument
to be a pointer to the arguments of the system
call, and it used a fixed offset to account for
the NEON register (512), however with the
addition of SVE and SVE2 this magic number no
longer works.

This commit changes it so that the address
is calculated dynamically when using SVE's
VL and PL registers and falls back to the
NEON value in the absence of SVE.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants