-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JIT] Enable EGPRs in JIT by adding REX2 encoding to the backend. #106557
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it. |
@dotnet/avx512-contrib can we reopen this as a PR ready to review? |
@anthonycanino I re-opened it (it wasn't clear to me if your question implied you did not have permission to do so). Either you or @Ruihan-Yin need to update to latest main and resolve the conflicts, then mark it ready-for-review. |
Hi @tannergooding, thanks for the reviews in #104637, it seems like the CPUID changes are just pending merge and there should be no major changes expected, so while waiting, I wonder if we can start the conversion on this PR? |
@Ruihan-Yin, just got #104637 merged. If we could get this PR updated so it contains just the new changes, that should make it a lot simpler to review and get in. |
Thanks! I will work on it soon. |
@@ -647,6 +647,7 @@ class CodeGen final : public CodeGenInterface | |||
|
|||
#if defined(TARGET_AMD64) | |||
void genAmd64EmitterUnitTestsSse2(); | |||
void genAmd64EmitterUnitTestsApx(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR really, but it'd be nice if we had similar tests for other ISAs/encodings (VEX, EVEX, etc). Sse2 itself is, afair, really just SimdLegacyEncoding.
src/coreclr/jit/codegenxarch.cpp
Outdated
genDefineTempLabel(genCreateTempLabel()); | ||
|
||
// This test suite needs REX2 enabled. | ||
assert(theEmitter->emitComp->DoJitStressRex2Encoding()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this rather be ApxIsSupported || StressRex2Encoding
?
theEmitter->emitIns_R_R(INS_add, EA_1BYTE, REG_EAX, REG_ECX); | ||
theEmitter->emitIns_R_R(INS_add, EA_2BYTE, REG_EAX, REG_ECX); | ||
theEmitter->emitIns_R_R(INS_add, EA_4BYTE, REG_EAX, REG_ECX); | ||
theEmitter->emitIns_R_R(INS_add, EA_8BYTE, REG_EAX, REG_ECX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be variants of these that explicitly test the new extended registers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests on EGPRs are planned in the PR for LSRA updates, that PR will introduce EGPR definition, which currently is unavailable in this PR.
// TODO-xarch-apx: not enable these 2 for now. | ||
// theEmitter->emitIns_R_I(INS_rcl_N, EA_4BYTE, REG_ECX, 0x05); | ||
// theEmitter->emitIns_R_I(INS_rcr_N, EA_4BYTE, REG_ECX, 0x05); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for these ones being skipped? Can we open tracking issues and list the issue number as part of the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emitxarch.cpp#L18695
It seems that the latency/tp information is missing for rcl_N/rcr_N
, so I was supposing if these 2 instructions are not used in JIT. I can add those information if it is needed
src/coreclr/jit/codegenxarch.cpp
Outdated
|
||
theEmitter->emitIns_S(INS_pop, EA_PTRSIZE, 1, 2); | ||
theEmitter->emitIns_I(INS_push, EA_PTRSIZE, 50); | ||
// TODO-XArch-apx: figure out a way to test emitIns_A, which will require a GenTreeIndir* input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just create a GenTreeIndir
node on the stack. We do this for hwintrinsiccodegenxarch
in a few places to simplify the emitter. For example: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiccodegenxarch.cpp#L407-L408
src/coreclr/jit/codegenxarch.cpp
Outdated
theEmitter->emitIns_R(INS_not, EA_2BYTE, REG_EAX); | ||
theEmitter->emitIns_S(INS_not, EA_2BYTE, 1, 2); | ||
|
||
// TODO-XArch-apx: xadd does not have RM opcode, made it cannot be encoded with emitIns_R_R. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just because we only emit xadd
as part of an interlocked or similar API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly follow the rules that if the instruction does not have an actual use case or it is blocked by the current emit paths, then I don't enable it.
Like, here, xadd
is labeled with Encoding_REX2
, so it will have access to EGPRs but if the current path will only use it with memory operand, then the situation will hold after my changes.
src/coreclr/jit/codegenxarch.cpp
Outdated
// TODO-XArch-apx: S_R_I path only accepts SEE or VEX instructions, | ||
// so I assuem shld/shrd will not be taking the first argument from stack. | ||
// theEmitter->emitIns_S_R_I(INS_shld, EA_2BYTE, 1, 2, REG_EAX, 5); | ||
// theEmitter->emitIns_S_R_I(INS_shrd, EA_2BYTE, 1, 2, REG_EAX, 5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not ever emit this encoding today? Possibly a more general missing optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to comment above, there are instructions that "theoretically" accept some combination of operands, but if JIT does not use those combination, then this PR does not intend to change this fact.
Is this what we want in this PR, or we want to extend coverage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/coreclr/jit/compiler.cpp
Outdated
@@ -2297,6 +2297,13 @@ void Compiler::compSetProcessor() | |||
codeGen->GetEmitter()->SetUseEvexEncoding(true); | |||
// TODO-XArch-AVX512 : Revisit other flags to be set once avx512 instructions are added. | |||
} | |||
if (canUseRex2Encoding() || DoJitStressRex2Encoding()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the DoJitStressRex2Encoding()
check needed here? We notably don't have the equivalent for the canUseEvexEncoding()
path even though a DoJitStressEvexEncoding()
API exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will eventually turn this part into in the APX-EVEX PR:
if (canUseApxEncoding)
{
codeGen->GetEmitter()->SetUseRex2Encoding(true);
codeGen->GetEmitter()->SetUsePromotedEVEXEncoding(true);
}
the code here in this branch is a bit outdated, I will mirror the changes to this PR.
src/coreclr/jit/compiler.cpp
Outdated
// TODO-Xarch-apx: | ||
// At this stage, since no machine will pass the CPUID check for APX, we need a special stress mode that | ||
// enables REX2 on incompatible platform, `DoJitStressRex2Encoding` is expected to be removed eventually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected to be removed? There's a general benefit to the stress mode even when APX CPUs exist, in that it forces all instructions to emit using the APX/EVEX encoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a mis-statement, we will keep this REX2 stress mode. The difference between REX2 stress and the existing EVEX stress is that now REX2 does not do the CPUID check, we will need to add it when we have the compatible machine. Or, we may make use of the JitBypassAPXCheck
knob to skip the CPUID check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just rely on the AltJit and/or NAOT scenario for "bypassing" the CPUID check, seeing as we can't run the tests anyways.
There are the existing mechanisms for generating code for a CPU that doesn't match the host CPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just rely on the AltJit and/or NAOT scenario for "bypassing" the CPUID check, seeing as we can't run the tests anyways.
Thanks for the explanation! And could you elaborate more on this technique that I can "get" APX on non-APX machine?
If I follow the existing EVEX stress mode, inside DoJitStressEvexEncoding
, it will return true only if Avx512/Avx10 is available, so DoJitStressRex2Encoding
will require APX available, can you give me some instructions on how I can get the ISA check pass on non-APX machine? Thanks!
src/coreclr/jit/compiler.h
Outdated
if (JitConfig.JitBypassAPXCheck()) | ||
{ | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed given the JitStressApxCheck
right?
It can rather be handled via the general AltJit
support for the "normal" path, like we do for other ISAs?
src/coreclr/jit/compiler.h
Outdated
bool DoJitStressRex2Encoding() const | ||
{ | ||
#ifdef DEBUG | ||
if (JitConfig.JitStressRex2Encoding()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to interplay with JitStressEvexEncoding
or similar at all (or does that need to be renamed, to clarify its only SIMD EVEX?)
In particular, I'm considering what the behavior would be on a machine without EVEX but where stress REX2 is enabled (which can't exist for real hardware), so there may be some consideration of ensuring that EVEX is enabled when REX2 is being stressed so everything "lines up".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to interplay with
JitStressEvexEncoding
or similar at all (or does that need to be renamed, to clarify its only SIMD EVEX?)
We will use JitStressPromotedEVEXEncoding
to distinguish between APX-EVEX and pre-APX EVEX.
In particular, I'm considering what the behavior would be on a machine without EVEX but where stress REX2 is enabled (which can't exist for real hardware), so there may be some consideration of ensuring that EVEX is enabled when REX2 is being stressed so everything "lines up".
If we are simply stressing the "encodings", say letting all the compatible instructions to be encoded with the new encodings, since REX2 and APX-EVEX compatibility are tracked separately, I think letting REX2 and APX-EVEX stand alone should be fine.
But if we are to stress the register allocator to use the new registers (which seems to be another scenario.), we might need to ensure both encodings are available.
Is this the case we are considering, or I have some misunderstanding?
// TODO-Xarch-apx: we have special stress mode for REX2 on non-compatible machine, that will | ||
// force UseRex2Encoding return true regardless of the CPUID results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this comment? We don't have a corresponding one for IsEvexEncodableInstruction
(same general question/concept applies throughout; I think we can generally mirror the comments/semantics that EVEX
and stress EVEX
has)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same general question/concept applies throughout; I think we can generally mirror the comments/semantics that
EVEX
andstress EVEX
has)
Yes, the overall concept of the REX2 stress is mirrored from EVEX stress mode, thanks for the inputs above, I will try to do the improvement to make the REX2 stress mode more aligned with the current stress mode design like EVEX stress mode does.
Most of the code looks to be generally correct and in the right/expected shape. But there's a number of remaining todo comments that don't have corresponding issues or that appear unnecessary when viewed in contrast to the existing I think we can clean up and simplify much of it based on that and get it a little bit more streamlined. |
Thanks for describing your testing plan. I'm glad to hear that JitLateDisasm was used (and hopefully helped) with encoding testing. Thanks for adding unit tests. It might be useful to PR any changes you had to make to build a custom coredistool.dll, to https://github.com/dotnet/jitutils/. It was mentioned in a tracking issue dotnet/jitutils#414 that current LLVM builds support APX disassembly, so perhaps the only change you needed was to bump the LLVM version number and build? (My latest attempt to update it was a little more ambitious: dotnet/jitutils#412, but it also bumped the LLVM version number.) |
Hi Bruce. Looks like I had bumped the Do you have plans to bump the LLVM version for .NET 10? It looks like that LLVM 19 will cover APX, but LLVM 20 will be required for AVX10.2 after further discussions internally (which should be released early 2025 I believe). |
#109939 (do note that only covers official builds, distro builds will probably still use older Clang/GCC) |
866253d
to
42c6cfc
Compare
Yes. Also, I expect to re-tool the cordistools build process to make it easier than it is currently to update the dependent LLVM version. @MichalPetryka thanks for that link. That might make it easier to build cordistools with our AzureLinux containers (currently, the libc in Ubuntu 16.04 I think is limiting our ability to build new LLVM). However, note that coredistools version of LLVM and .NET 10 version of LLVM are not currently related, and hopefully can still be chosen independently, in the future. |
Hi @tannergooding, I tried to refactor the stress mode for REX2, can you please check if it is as expected? Plus, there are a few questions regarding to the comments, would appreciate it if you can take a look. Thanks! |
Overview
This PR is the follow-up PR after #104637, which added the initial CPUID and XSAVE updates for APX.
This PR adds REX2 encoding functionality for legacy instructions which enables the use of EGPR for
add
,sub
, etc. Note that this PR focuses on REX2 encoding only: a follow up PR will enable EGPR support via the register allocator.Specification
REX2 is a 2-byte prefix with a leading byte of
0xD5
, detailed format below:Similar to REX prefix, it provides the extended bits for the MODRM.REG field, REX2.R4/R3, and MODRM.R/M field, REX2.B4/B3, and the index register in SIB byte, REX2.X4/X3, those bits will act as the higher 5th/4th bits and combine with the field in MODRM and SIB byte as a 5-bit binary to access up to 32 registers.
REX2 prefix is generally available for legacy-map-0 and legacy-map-1 instructions, say 1-byte opcode or 2-byte opcode with escape byte 0x0F, with some exceptions.
Like VEX/EVEX, REX2 is considered as the last prefix before the main opcode, so it can not co-exist with REX/VEX/EVEX.
Design
The bulk of the changes occur in the backend emitter.
As there is no existing hardware that has APX support yet, we had some hacks to bypass the CPUID checks. In this PR,
DOTNET_JitStressRex2Encoding
will force all the eligible instructions to be encoded in REX2, regardless the presence of EGPRs in the operand. We had another switchDOTNET_JitBypassAPXCheck
, with which will only bypass the APX CPUID check but JIT will encode REX2 only if needed, this is more useful when the LSRA changes come.Testing
We followed a multi-step testing plan to verify the encoding correctness and the semantic correctness.
Testing results will be presented below.
1. Emitter unit tests
In
codgenxarch.cpp
, similar togenAmd64EmitterUnitTestsSse2
, we used theJitLateDisasm
feature to insert instructions to encode as unit tests for emitter, andLateDisasm
will invoke LLVM to disasm the code stream, this gave us the chance to cross validate the disassembly from JIT and LLVM. The output of this step is to verify the emit paths are generating "correct" code that would not trigger #UD or have wrong semantics.Note that we are using a custom
coredistools.dll
which uses a recent LLVM that supports APX decoding.2. SuperPMI
In this step, we would run the SuperPMI pipeline to get the asmdiffs with REX2 on and off, the inputs are all the MCH files. This step will give us the chance to check if there is any assertion failure or internal error within JIT and since the pipeline will invoke
coredistools.dll
as well, so we can verify the encoding correctness in a larger scope.To ensure the new changes will not hit the existing code path in terms of throughput, we ran tpdiff with base JIT to be the main branch where changes are based on, and diff JIT to be the one with all the REX2 changes.
3. JIT unit tests
The 2 steps mentioned above are mainly verifying the encoding correctness of the generated binary code. Then the last will examine the semantic correctness of the generated code, say since we are simply forcing all the compatible instructions to be encoded in REX2, so the original semantics should not change, so we expect exactly the same output with REX2 on/off.
We used the existing CoreCLR unit test set:
JIT
and run it in the Intel SDE emulator.Follow-up plans
This PR is only intended to provide the REX2 encoding functionality to the JIT backend, in terms of how to properly use it, we are preparing another PR that includes the updates on LSRA such that JIT will be able to allocate EGPRs only when needed, and generate optimal code.