
internal/cpu: detect OS support for AVX-512 and CPU support for VAES and VPCLMULQDQ instructions for performance optimization of crypto ciphers #43925

tpaint opened this issue Jan 26, 2021 · 32 comments


@tpaint
Contributor

tpaint commented Jan 26, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.5 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

What did you do?

What did you expect to see?

Detect OS support for the AVX-512 register state, and detect CPU support for the AVX-512 VAES and VPCLMULQDQ instructions.

What did you see instead?

Go does not currently detect OS support for AVX-512 or CPU support for the VAES crypto instructions. We have developed proposed patches against Go 1.15.5 for internal/cpu that check OS support for the AVX-512 register state, check CPUID for the presence of VAES and VPCLMULQDQ, and set feature flags accordingly. The patches will be contributed and submitted to the Go Gerrit for review. A sketch of the kind of detection involved follows.
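
To make the proposed detection concrete, here is a minimal sketch, not the actual CL: cpuid and xgetbv stand in for the thin assembly wrappers that internal/cpu already has on amd64, and are stubbed so the sketch compiles standalone. The bit positions are from the Intel SDM: CPUID.(EAX=7,ECX=0):ECX bit 9 is VAES and bit 10 is VPCLMULQDQ; XCR0 bits 5-7 cover the AVX-512 opmask and ZMM state.

```go
package cpu

// Stand-ins for internal/cpu's assembly wrappers around the CPUID and
// XGETBV instructions; stubbed here so the sketch compiles standalone.
func cpuid(eaxArg, ecxArg uint32) (eax, ebx, ecx, edx uint32) { panic("asm stub") }
func xgetbv() (eax, edx uint32)                               { panic("asm stub") }

// X86 mirrors the shape of internal/cpu's feature struct (illustrative).
var X86 struct {
	HasVAES       bool
	HasVPCLMULQDQ bool
	HasAVX512     bool // CPU and OS support for the full AVX-512 state
}

func doinit() {
	// CPUID.(EAX=1,ECX=0):ECX bit 27 = OSXSAVE. XGETBV is only
	// meaningful (and safe to execute) once the OS has enabled XSAVE.
	_, _, ecx1, _ := cpuid(1, 0)
	if ecx1&(1<<27) == 0 {
		return
	}

	// XCR0 reports which register state the OS saves and restores.
	xcr0, _ := xgetbv()
	osAVX := xcr0&(1<<1|1<<2) == 1<<1|1<<2 // XMM and YMM state
	osAVX512 := osAVX && xcr0&(1<<5|1<<6|1<<7) == 1<<5|1<<6|1<<7

	// CPUID.(EAX=7,ECX=0):ECX bits 9 (VAES) and 10 (VPCLMULQDQ).
	_, _, ecx7, _ := cpuid(7, 0)
	X86.HasVAES = osAVX && ecx7&(1<<9) != 0
	X86.HasVPCLMULQDQ = osAVX && ecx7&(1<<10) != 0
	X86.HasAVX512 = osAVX512 // AVX512F itself is CPUID.(EAX=7,ECX=0):EBX bit 16
}
```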

References:

  1. https://www.tomshardware.com/news/intel-10nm-xeon-ice-lake-sp-sunny-cove-core-architecture
  2. https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf
@gopherbot gopherbot added this to the Proposal milestone Jan 26, 2021
@gopherbot
Contributor

Change https://golang.org/cl/271521 mentions this issue: internal/cpu: Add detection for OS support of AVX-512 and detection of CPU support for

@martisch
Contributor

martisch commented Jan 27, 2021

I don't think the standard library should use AVX-512, as this can cause the whole program and other tenants to regress in performance due to downclocking. I think it's fine for tailored high-performance applications to use AVX-512 routines directly (e.g. by the coder or user explicitly opting in) if that is beneficial. For that, internal/cpu does not need AVX-512 support; x/sys/cpu, with its existing AVX-512 support, can be used.

Happy to be proven otherwise, but AFAIK AVX-512 isn't always a net win for efficiency across all processes of a machine on all supported CPU implementations. Defaulting to AVX-512 whenever it is supported may therefore regress Go performance for some applications and machines.

@bcmills
Contributor

bcmills commented Jan 27, 2021

(Compare #42726.)

@bcmills
Contributor

bcmills commented Jan 27, 2021

@martisch

I don't think the standard library should use AVX-512, as this can cause the whole program and other tenants to regress in performance due to downclocking.

I maintain that in the cases where that matters, it is the responsibility of the OS — not individual language runtimes — to disable AVX512 support for programs that should not use it.

That said, I'm not sure whether we currently have adequate builder coverage to test both AVX-512 and whatever fallback paths would be used when it is disabled.

@martisch
Contributor

The reality is that OSes already resort to quirks, such as pretending they don't support AVX-512, to avoid the overhead of saving and restoring registers:
#43089

I don't think it's feasible for the OS to determine at runtime whether a workload benefits from AVX-512, especially because it's not really possible to disable AVX-512 once a program is running: the program may have set internal state assuming AVX-512 can be used.

@vsivsi

vsivsi commented Feb 1, 2021

@martisch To be fair, Darwin doesn't "pretend" to not support AVX512.

It optimizes by disabling (via the XCR0 control bits) saving AVX512 register state for processes/threads that haven't yet used any instructions requiring that state to be preserved.

Doing it this way necessitates using a different method for discovering AVX512 support (that is, not treating the XCR0 bits as immutable), but that is not at all the same as needing to somehow trick the OS into enabling AVX512. The CPUID bits correctly reflect the AVX512 features of the hardware, and Darwin documents multiple methods for confirming that the OS supports those features.
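
For illustration, here is a minimal sketch of querying one of those Darwin-documented methods, the hw.optional sysctls, from Go (the sysctl key name is taken from Apple's documentation; treat the exact spelling as an assumption):

```go
//go:build darwin

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// darwinHasAVX512F asks the kernel whether AVX-512 Foundation is
// supported, via the hw.optional sysctl family Apple documents.
// On Darwin the XCR0 bits are not a reliable signal, since ZMM
// state is enabled lazily per thread.
func darwinHasAVX512F() bool {
	v, err := unix.SysctlUint32("hw.optional.avx512f")
	return err == nil && v == 1
}

func main() {
	fmt.Println("kernel advertises AVX-512F:", darwinHasAVX512F())
}
```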

On the general performance of AVX512: the new Ice Lake (10th Gen Core) CPUs from Intel have much less severe down-clocking for "heavy" AVX512 instructions than the previous Xeon-based AVX512 implementations, and I expect that trend to continue in the coming years. For more details, see:

https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

BTW, I am supportive of this proposal and, more generally, of all proposals that enable programmers to easily detect and use new instructions and other CPU features at their discretion. When the golang standard library should choose to exploit such features with new code paths is a question for a separate benchmark-driven cost/benefit analysis.

@martisch
Contributor

martisch commented Feb 10, 2021

@vsivsi
Fair. I should have phrased it differently. As far as I understand, Darwin communicates via XCR0 that, in its current state, it doesn't support the ZMM registers. According to the Intel manuals (Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2, 2.6.11.1 "State Dependent #UD"), that means instructions that use ZMM (and some others) will cause an undefined-instruction exception when executed. One first has to trigger a mechanism that makes Darwin change XCR0 and enable ZMM support, such as executing an instruction that causes a #UD (or using some Darwin-specific init functions). (Is it actually documented that Darwin will always catch the #UD and re-execute, and under which conditions?)

BTW, x/sys/cpu supports AVX512, and programmers can choose to make decisions based on that (and your changes will make detection on Darwin work). This proposal doesn't change that.

Historically, until there is a confirmed use case in the standard library, we have not added feature-set detection to internal/cpu, as those feature booleans aren't exported to be read outside the standard library.

When the golang standard library should choose to exploit such features with new code paths is a question for a separate benchmark-driven cost/benefit analysis.

That is why I added: "Happy to be proven otherwise, but AFAIK AVX-512 isn't always a net win for efficiency across all processes of a machine on all supported CPU implementations." So there would need to be some analysis of whether the average Go program on the most used platforms benefits to a degree that outweighs the potential loss of performance for others.

@jhwolf-intel

We, Intel, are working on a paper with data to show the impact of AVX-512 and to address these concerns. The Travis Downs blog did get several things right about the Ice Lake server parts, though there are differences from the client parts. And, as @vsivsi noted, "the new Ice Lake (10th Gen Core) CPUs from Intel have much less severe down-clocking for "heavy" AVX512 instructions than the previous Xeon-based AVX512 implementations, and I expect that trend to continue in the coming years."

This is true for Ice Lake servers as well. We'll show the actual data for the heavy instructions in the white paper. The Travis Downs blog also noted: "The Ice Lake i5-1035 CPU exhibits only 100 MHz of licence-based downclock with 1 active core when running 512-bit instructions."

Our ICX servers will show similar behavior, exhibiting only up to 200 MHz of license-based downclocking with all cores active when running AVX-512 "light" instructions.

The VAES instructions used in our crypto PR are all in this light class, so we do not expect much of an impact. @tpaint will update the PR with the requested benchstat data shortly.

If there are still concerns about our crypto and cpuid changes, how about we push them to the X-repo so others can do their own testing/experiments?

@jhwolf-intel

@tpaint @martisch
Here's the GCM benchstat data; we thought it best to keep it all in this forum rather than in the CL. Again, let us know if we can move the changes to the X-repo to keep iterating...

The following is a microbenchmark run on a single core of an early engineering sample of Ice Lake, our 3rd-gen Intel(r) Xeon(r) Scalable Processor, using Go's benchstat. The benchmark is a block-cipher throughput test for AES-GCM that is part of the Go crypto library. The figure of merit is throughput. The A/B comparison is the AES-GCM benchmark run on Ice Lake A) without VAES and VPCLMULQDQ vs. B) with VAES and VPCLMULQDQ. Lower is better for the first set (time/op); higher is better for the second set (speed, GB/s). Note that we are still working on improvements and will re-post in the coming weeks to include AESGCMSign8K as well.

name             old time/op    new time/op    delta
AESGCMSeal1K-96  205ns ± 0%     186ns ± 0%      -9.08%  (p=0.000 n=9+9)
AESGCMOpen1K-96  194ns ± 0%     186ns ± 0%      -4.04%  (p=0.000 n=9+10)
AESGCMSeal8K-96  1.23µs ± 0%    0.83µs ± 0%    -33.01%  (p=0.000 n=10+9)
AESGCMOpen8K-96  1.21µs ± 1%    0.87µs ± 0%    -28.08%  (p=0.000 n=9+10)

name             old speed      new speed      delta
AESGCMSeal1K-96  5.00GB/s ± 0%  5.50GB/s ± 0%  +10.01%  (p=0.000 n=10+9)
AESGCMOpen1K-96  5.28GB/s ± 0%  5.50GB/s ± 0%   +4.20%  (p=0.000 n=9+10)
AESGCMSeal8K-96  6.64GB/s ± 0%  9.92GB/s ± 0%  +49.26%  (p=0.000 n=10+9)
AESGCMOpen8K-96  6.78GB/s ± 1%  9.43GB/s ± 0%  +39.07%  (p=0.000 n=9+10)

Configuration: 1-node, 2x pre-production 24-core 3rd-gen Intel(r) Xeon(r) Scalable Processor on an Intel reference platform with 256 GB total memory, HT on, Turbo on.

@martisch
Contributor

martisch commented Mar 9, 2021

cc @randall77 @aclements

@martisch
Contributor

martisch commented Jan 19, 2022

As AVX512 came up in https://go-review.googlesource.com/c/go/+/379394 (internal/cpu, internal/bytealg: add SIMD prefix match for Index/amd64):

Before using AVX512 in the runtime/standard library, I think we should:

  • understand and differentiate between AVX512 usage that can and that can't cause downclocking, and thereby performance regressions for other code
  • have builders that specifically run on AVX512 hardware (AMD and newer Intel desktop CPUs don't support it) so we can test it
  • consider whether the added maintenance burden is worth it, especially since OS support doesn't seem well tested and can cause subtle bugs with signals/preemption: runtime: AVX512 register state clobbered by signal on macOS #49233

@dbussink

VAES & VPCLMULQDQ also support operating on the 256-bit YMM registers from AVX2, so AVX512 is not a hard prerequisite for using them.

This shows in the AMD Ryzen Zen 3 CPUs, which don't support AVX512 but do support VAES & VPCLMULQDQ, usable there on 256-bit registers.

Implementing that would, I think, not incur any of the downsides of AVX512 and its potential throttling effects, but would theoretically still provide a significant speedup for AES operations. It also avoids the Darwin issues with ZMM registers.

@mvdan
Member

mvdan commented Jan 24, 2022

Another reason to favor AVX2 over AVX512 is that, in practice, many common environments will start building with GOAMD64=v3, but few will use v4, since it's just not cost-effective given how narrowly it's supported. For example, Arch Linux will build its packages for v3, Fedora is considering similar options, and I imagine other operating systems may already be doing the same.

This is not to say that we cannot detect AVX512 at run time and use it, but it does mean that detecting AVX2 seems like a better tradeoff: it's more widely available, and the check could happen at compile time more often, as in the sketch below.
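
For illustration, a minimal sketch of that compile-time variant using the amd64.vN build constraints that accompany GOAMD64 in Go 1.18; the package name and haveAVX2 constant are hypothetical:

```go
// File simd_v3.go: compiled only when GOAMD64=v3 (or higher), where
// AVX2 is architecturally guaranteed, so no runtime check is needed.
//go:build amd64.v3

package simd

const haveAVX2 = true
```

```go
// File simd_fallback.go: compiled otherwise; fall back to a runtime
// CPUID-based check via x/sys/cpu.
//go:build !amd64.v3

package simd

import "golang.org/x/sys/cpu"

var haveAVX2 = cpu.X86.HasAVX2
```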

@martisch
Contributor

I think using VAES and VPCLMULQDQ without AVX512 is fine, pending our actually being able to test them and no other policy (e.g. the crypto assembly policy) restricting their use. If they are actually used, we need to add the corresponding feature-bit checks to internal/cpu, and the specific code paths will also need to check for any other AVX features they require.

VPCLMULQDQ might have a caveat: I think some CPUs actually implement it as slow microcode. We could cover that with benchmarking on CL submission, to figure out whether it could make things worse and whether that is bad enough to warrant not using it or adding more detection.

@vsivsi

vsivsi commented Jan 24, 2022

I think it's important to note that the AVX-512 "license" down-clocking has been greatly reduced in Intel chips beginning with Ice Lake in 2019. Notably, VAES and VPCLMULQDQ are not supported by any AVX-512 implementations prior to Ice Lake, so if those instructions are used in an AVX-512 optimized code path (which precludes support for earlier AVX-512 implementations), then AVX-512 specific down-clocking is likely a much smaller concern than implied in the discussion above. Actual workload-specific benchmarks should drive such optimization decisions, not perceptions based on potentially dated assumptions.

See: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

And: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

@martisch
Contributor

I don't think performance is the only concern here. As for performance, as mentioned earlier, we should "understand and differentiate between AVX512 usage that can and that can't cause downclocking, and thereby performance regressions for other code".

AMD Zen 4, it seems, will support AVX512. How it performs there is also interesting, to avoid new code regressing performance on some CPUs: https://www.extremetech.com/computing/325888-gigabyte-leaks-amd-zen-4-details-5nm-avx-512-96-cores-12-channel-ddr5

@vsivsi

vsivsi commented Jan 24, 2022

I agree, performance is rarely the only concern. Early implementations of AVX-512 got a lot of bad press over the down-clocking, and to the extent you care about that installed base, those concerns remain relevant. My point is simply to highlight that if the proposed optimizations use instructions unsupported by those older implementations, then the past "bad reputation" of AVX-512 down-clocking is not likely a relevant consideration.

Clearly the benchmark-driven component of any decision should consider all widely available hardware implementations. Presumably AMD won't release an AVX-512 implementation so bad that it makes their CPUs look worse on benchmarks than if they just ran the equivalent AVX2 code path. But I guess we'll find out!

@vsivsi

vsivsi commented Jan 24, 2022

One more point, FWIW: I've now written a fair number of AVX2/AVX-512 functions for a specific (integer, non-crypto) workload, and typically see 4-8x speedups for AVX2 vs. 12-40x for AVX-512 (across multiple HW implementations). The upper end of the latter range is typically driven by the presence of workload-specific instructions, in my case VPOPCNTQ. Older AVX-512 implementations missing that feature still clearly outperform AVX2, but more toward the lower end of the speedup range (including down-clocking impacts). YMMV.

@jbrosenz

If the code checked CPUID to limit the set of CPUs the optimization runs on, so that it is not deployed where it is unanticipated, would that be seen as a welcome step in the right direction?

@martisch
Contributor

I think we first need to get builders up to speed and see which code paths they can actually run, and each CL needs to be judged to be in a submittable state. After that we can start the fine-grained performance evaluations and see whether anything needs to be tuned to avoid regressions, as was done for memcpy:

isIntelBridgeFamily := isIntel &&

But ideally we can avoid such tuning.

On another note, I think https://go-review.googlesource.com/c/go/+/286852, which this would have been used for, is blocked on crypto assembly policies.

@rsc
Contributor

rsc commented Jul 12, 2023

This proposal is about our stance on whether and when to use AVX-512 in the standard library.
What is the current state of the world?
Do chips still get slower when you use AVX-512?

@rsc
Contributor

rsc commented Jul 12, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Jul 12, 2023
@randall77
Contributor

Reading here: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking
Downclocking due to AVX512 is bad on Skylake (released 2015), only very slightly bad on Ice Lake (2019), and not a problem on Rocket Lake (2021).
Not sure what happens for AMD chips.

I feel like we want to wait a while yet before using AVX512 in the stdlib generally. There is still a lot of Skylake out there.

As @vsivsi notes here, the particular instructions wanted for crypto are not on Skylake, so maybe for that it would be ok?

And as @martisch notes, we need a builder story.
And we need some idea of what happens on AMD chips.

@aclements
Member

As @vsivsi notes #43925 (comment), the particular instructions wanted for crypto are not on Skylake, so maybe for that it would be ok?

We can also generally block-list Skylake (and maybe Ice Lake?) even if they report AVX512 support. I think we would want a block-list rather than an allow-list, for both AMD and future compatibility. We do need to figure out how AMD chips behave. A sketch of such a block-list follows.
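
For illustration, a sketch of what such a block-list could look like. The family/model decoding follows the Intel SDM; family 6 model 0x55 (Skylake-SP/X, a model number also shared by Cascade Lake and Cooper Lake) is an illustrative entry rather than a vetted list, and cpuid is stubbed in place of an assembly wrapper:

```go
package cpu

// Stand-in for an assembly wrapper around the CPUID instruction,
// stubbed so the sketch compiles standalone.
func cpuid(eaxArg, ecxArg uint32) (eax, ebx, ecx, edx uint32) { panic("asm stub") }

// displayFamilyModel decodes CPUID.(EAX=1):EAX per the Intel SDM:
// the extended family is added when family == 0xF, and the extended
// model is folded in for families 0x6 and 0xF.
func displayFamilyModel() (family, model uint32) {
	eax, _, _, _ := cpuid(1, 0)
	family = (eax >> 8) & 0xf
	model = (eax >> 4) & 0xf
	if family == 0xf {
		family += (eax >> 20) & 0xff
	}
	if family == 0x6 || family == 0xf {
		model += ((eax >> 16) & 0xf) << 4
	}
	return
}

// blockAVX512 reports whether AVX-512 code paths should be avoided on
// this CPU even if CPUID advertises support. A real implementation
// would also verify the vendor string (e.g. GenuineIntel) and refine
// the model list based on measured downclocking behavior.
func blockAVX512() bool {
	family, model := displayFamilyModel()
	return family == 6 && model == 0x55 // Skylake-SP/X et al.
}
```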

@vsivsi

vsivsi commented Jul 19, 2023

Briefly chiming in here... Anecdotally, I use AVX512 heavily on Ice Lake systems and it is always a net win for my workloads, which include use of "heavy" instructions that are prone to maximizing any down-clocking. But the right answer for any workload is always to be found in proper benchmarking.

@vsivsi

vsivsi commented Jul 19, 2023

I would also be remiss if I didn't take this opportunity to point out that AVX512 remains (unnecessarily) officially disabled in Golang on darwin/amd64. Apple fixed the kernel bug that triggered this long ago, and I've had a Go stdlib fix submitted for a long time with no review activity.
https://go-review.googlesource.com/c/sys/+/382314

@randall77
Contributor

Briefly chiming in here... Anecdotally, I use AVX512 heavily on Ice Lake systems and it is always a net win for my workloads, which include use of "heavy" instructions that are prone to maximizing any down-clocking. But the right answer for any workload is always to be found in proper benchmarking.

I think the worry here is not for heavy users. If you've reached for AVX512, it's probably worth it to you well beyond any down-clocking penalty.
The worry is that very rare uses cause down-clocking for long periods of time. Maybe you have a rare code path with an AVX512 instruction in it. Maybe you don't even know about it, because it is in some library you import. Or perhaps it is in the runtime. Because it is a rare path, the performance benefit is ~0, but the down-clocking shadow may be large.
That's the worry, anyway. I don't think I've seen good numbers on how big the down-clocking shadow is.

I would also be remiss if I didn't take this opportunity to point out that AVX512 remains (unnecessarily) officially disabled in Golang on darwin/amd64. Apple fixed the kernel bug that triggered this long ago

Do you have a reference for this? In particular, we need to be sure that the fix is in all the Darwin versions we support. Reading https://github.com/golang/go/wiki/MinimumRequirements#macos-n%C3%A9e-os-x-aka-darwin , I think we need Catalina and later to be free of the bug.

Reading #49233 and https://community.intel.com/t5/Software-Tuning-Performance/MacOS-Darwin-kernel-bug-clobbers-AVX-512-opmask-register-state/m-p/1327259#M7970 , I don't think we can turn on AVX512 yet. The bug in question was definitely present in Catalina (10.15.6).

and I've had a Go stdlib fix submitted for a long time with no review activity.
https://go-review.googlesource.com/c/sys/+/382314

Sorry about that. It looks orphaned for some reason. I will adopt it.

@vsivsi

vsivsi commented Jul 19, 2023

The worry here is that very rare uses cause down-clocking for large periods of time.

Agreed. The worst case is using an AVX512 "heavy" instruction once per second or something.

On Ice Lake the risk and potential impact of such worst cases is massively reduced.

On the darwin/amd64 fix, the code I submitted does a kernel version check (only on Darwin when AVX512 is present) to ensure that Apple's patch is present.

Here's the spot in the issue thread where I note the properly patched versions of MacOS/darwin and discuss the Golang fix I developed: #49233 (comment)

@vsivsi

vsivsi commented Jul 20, 2023

At the risk of complicating this discussion, here's one more wrinkle to consider...

mmcloughlin/avo#364

The TL;DR of the above is that VAES, VPCLMULQDQ and GFNI are, strictly speaking, not AVX512 extensions.

Intel has clearly signaled its desire to produce (and in one case has shipped) processors lacking AVX512F support that nevertheless support some or all of these extensions with SSE- or VEX-encoded instructions. That is: the SSE4/128-bit and AVX2/256-bit instruction variants are fully implemented, and there's every reason to expect they will ship in the absence of AVX512 support.

So the x/sys/cpu flags:

  • X86.HasAVX512VPCLMULQDQ
  • X86.HasAVX512GFNI
  • X86.HasAVX512VAES

are all misnomers and don't correctly detect support for these extensions (per the Intel architecture documentation).

I keep meaning to file an issue on Golang over this because, as the above Avo issue describes in detail, Golang currently uses both symbols and detection logic that incorrectly tie the use of these extensions to AVX512F, which is not the correct assumption to make.

The correct Golang implementation would be to deprecate the above symbols and instead define:

  • X86.HasVPCLMULQDQ
  • X86.HasGFNI
  • X86.HasVAES

To correctly detect support for the VEX- or EVEX-encoded versions, one of the above should be combined with X86.HasAVX or X86.HasAVX512F (or with X86.HasAVX512VL for EVEX with 128/256-bit registers), respectively.

Detecting VPCLMULQDQ support is further complicated by the separate PCLMULQDQ CPUID bit, which Golang does correctly handle with X86.HasPCLMULQDQ.

For examples see:
https://www.felixcloutier.com/x86/pclmulqdq
https://www.felixcloutier.com/x86/gf2p8affineqb

(VAES examples aren't in the felixcloutier documentation, but can be found in the official Intel documentation: https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4 at Vol. 2A 3-51).
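
To make the intended combinations concrete, here is a small sketch. The proposed X86.HasVAES and X86.HasVPCLMULQDQ flags do not exist yet, so the current AVX512-prefixed flags, which additionally imply AVX512F, stand in purely for illustration:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// Stand-ins: once the proposed X86.HasVAES / X86.HasVPCLMULQDQ
	// exist, they would replace these AVX512-gated flags.
	hasVAES := cpu.X86.HasAVX512VAES
	hasVPCLMULQDQ := cpu.X86.HasAVX512VPCLMULQDQ

	// VEX-encoded 256-bit (YMM) forms need only AVX alongside the
	// feature bit -- the Zen 3 case discussed earlier in this thread.
	fmt.Println("VEX VAES (YMM):", hasVAES && cpu.X86.HasAVX)

	// EVEX-encoded forms on 128/256-bit registers need AVX512VL.
	fmt.Println("EVEX VL VPCLMULQDQ:", hasVPCLMULQDQ && cpu.X86.HasAVX512VL)

	// Full-width 512-bit (ZMM) forms need AVX512F.
	fmt.Println("EVEX ZMM VAES:", hasVAES && cpu.X86.HasAVX512F)
}
```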

@rsc
Contributor

rsc commented Jul 26, 2023

It sounds like the proposal is to add a few more names in the CPU info:

X86.HasVPCLMULQDQ
X86.HasGFNI
X86.HasVAES

I don't think we necessarily need to deprecate the existing names, since they can still be used to check for AVX512 and the specific feature.

To be clear, this proposal is not about using AVX-512, or even these not-quite-AVX-512 things, in any specific package.

@rsc
Contributor

rsc commented Aug 2, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Accept in Proposals Aug 2, 2023
@rsc
Contributor

rsc commented Aug 9, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc moved this from Likely Accept to Accepted in Proposals Aug 9, 2023
@rsc rsc changed the title from "proposal: internal/cpu: detect OS support for AVX-512 and CPU support for VAES and VPCLMULQDQ instructions for performance optimization of crypto ciphers" to "internal/cpu: detect OS support for AVX-512 and CPU support for VAES and VPCLMULQDQ instructions for performance optimization of crypto ciphers" Aug 9, 2023
@rsc rsc modified the milestones: Proposal, Backlog Aug 9, 2023