internal/cpu: detect OS support for AVX-512 and CPU support for VAES and VPCLMULQDQ instructions for performance optimization of crypto ciphers #43925
Change https://golang.org/cl/271521 mentions this issue.
I don't think the standard library should use AVX-512, as it can cause the whole program and other tenants to regress in performance due to downclocking. I think it's fine for tailored high-performance applications to explicitly use AVX-512 routines directly (e.g. by the coder or user explicitly opting in) if that is beneficial. For that, internal/cpu does not need to support AVX-512; x/sys/cpu with its existing AVX-512 support can be used. Happy to be proven otherwise, but AFAIK AVX-512 isn't always a net win for the efficiency across all processes of a machine on all supported CPU implementations. Defaulting to AVX-512 when supported may therefore regress Go performance for some applications and machines.
(Compare #42726.)
I maintain that in the cases where that matters, it is the responsibility of the OS, not individual language runtimes, to disable AVX-512 support for programs that should not use it. That said, I'm not sure whether we currently have adequate builder coverage to test both AVX-512 and whatever fallback paths would be used when it is disabled.
The reality is that OSes already resort to quirks like pretending they don't support AVX-512, to avoid the overhead of storing and restoring registers. I don't think it's feasible for the OS to determine at runtime whether a workload benefits from AVX-512. Especially because it's not really possible to disable usage of AVX-512 once a program is running, as it may have set internal state assuming AVX-512 can be used.
@martisch To be fair, Darwin doesn't "pretend" to not support AVX512. It optimizes by disabling (via the XCR0 control bits) the saving of AVX512 register state for processes/threads that haven't yet used any instructions requiring that state to be preserved. Doing it this way necessitates a different method for discovering AVX512 support (that is, not treating the XCR0 bits as immutable), but that is not at all the same as needing to somehow trick the OS into enabling AVX512. The CPUID bits correctly reflect the AVX512 features of the hardware, and Darwin documents multiple methods for confirming that the OS supports those features.

On the general performance of AVX512, the new Ice Lake (10th Gen Core) CPUs from Intel have much less severe downclocking on "heavy" AVX512 instructions than the previous Xeon-based AVX512 implementations. And I expect that trend to continue in the coming years. For more details, see: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

BTW, I am supportive of this proposal, and more generally of all proposals that enable programmers to easily detect and use new instructions and other CPU features at their discretion. When the golang standard library chooses to exploit such features with new code paths should be determined by a separate benchmark-driven cost/benefit analysis.
@vsivsi BTW, x/sys/cpu supports AVX512 and programmers can make decisions based on that (and with your changes, detection on darwin will work). This proposal doesn't change that. Historically, until there is a confirmed use case in the standard library, it was standard not to add feature detection to internal/cpu, as those feature booleans aren't exported to be read outside the std library.

That is why I added: "Happy to be proven otherwise, but AFAIK AVX-512 isn't always a net win for the efficiency across all processes of a machine on all supported CPU implementations." So there would need to be some analysis of whether the average Go program on the most-used platforms benefits to a degree that outweighs the potential loss of performance for others.
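For illustration, here is a minimal sketch of the kind of explicit, application-level opt-in described above, using the exported golang.org/x/sys/cpu flags. The encrypt functions are hypothetical stand-ins, not real APIs:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// encryptAVX512 and encryptGeneric are illustrative stand-ins for an
// application's own hand-optimized and portable implementations.
func encryptAVX512(dst, src []byte) { /* hand-written AVX-512 path */ }
func encryptGeneric(dst, src []byte) { /* portable fallback */ }

func main() {
	// x/sys/cpu exports feature flags, so applications can opt in to
	// AVX-512 explicitly instead of the runtime deciding for them.
	if cpu.X86.HasAVX512F && cpu.X86.HasAVX512VL {
		fmt.Println("using AVX-512 path")
		// encryptAVX512(...)
	} else {
		fmt.Println("using generic path")
		// encryptGeneric(...)
	}
}
```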
We, Intel, are working on a paper with data to show the impact of AVX-512 and address these concerns. The Travis Downs blog did get several things right about Ice Lake server, though there are differences from the client parts. And, as @vsivsi noted, "On the general performance of AVX512, the new Ice Lake (10th Gen Core) CPUs from Intel have much less severe downclocking on 'heavy' AVX512 instructions than the previous Xeon-based AVX512 implementations. And I expect that trend to continue in the coming years." This is true for Ice Lake servers as well. We'll show the actual data for the heavy instructions in the white paper.

The Travis Downs blog also noted: "The Ice Lake i5-1035 CPU exhibits only 100 MHz of licence-based downclock with 1 active core when running 512-bit instructions." Our ICX servers will show similar behavior and exhibit only up to 200 MHz of license-based downclocking with all active cores when running AVX-512 light instructions. The VAES instructions in use for our crypto PR only use these light instructions, so we do not expect much of an impact. @tpaint will update the PR with the benchstat data as requested shortly.

If there are still concerns about our crypto and cpuid changes, how about we push these to the x repo so others can do their own testing/experiments?
@tpaint @martisch The following is a microbenchmark run on a single core of an early engineering sample of Ice Lake:

[benchstat output elided; columns: name, old time/op, new time/op, delta]

Configuration: 1-node, 2x pre-production 24-core 3rd-gen Intel(R) Xeon(R) Scalable Processor on an Intel reference platform.
As AVX512 came up in https://go-review.googlesource.com/c/go/+/379394 (internal/cpu, internal/bytealg: add SIMD prefix match for Index/amd64): before using AVX512 in runtime/std I think we should

- have an understanding and differentiate between allowing AVX512 that can and that can't cause downclocking and thereby performance regressions for other code, and
- have builders that can actually exercise these code paths.
VAES and VPCLMULQDQ also add support for working with the 256-bit YMM registers from AVX2, so AVX512 is not a hard prerequisite for using them. This shows in the AMD Ryzen Zen 3 CPUs, which don't support AVX512 but do support VAES and VPCLMULQDQ on 256-bit registers. Implementing that would, I think, not incur any of the downsides of AVX512 and its potential throttling effects, but would theoretically still provide a significant speedup of AES operations. It also avoids the issues on Darwin with ZMM registers.
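A sketch of what detecting that combination could look like. Since x/sys/cpu currently gates its VAES/VPCLMULQDQ flags behind AVX-512 (see the discussion further below), this sketch uses the third-party github.com/klauspost/cpuid/v2 package, which exposes the raw CPUID bits; the dispatch itself is illustrative:

```go
package main

import (
	"fmt"

	"github.com/klauspost/cpuid/v2"
)

func main() {
	// VAES/VPCLMULQDQ on 256-bit YMM registers only needs AVX2 plus the
	// VAES/VPCLMULQDQ CPUID bits; Zen 3 has exactly this combination,
	// with no AVX-512 support at all.
	if cpuid.CPU.Supports(cpuid.AVX2, cpuid.VAES, cpuid.VPCLMULQDQ) {
		fmt.Println("256-bit VAES/VPCLMULQDQ path available")
	} else {
		fmt.Println("falling back to AES-NI / generic path")
	}
}
```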
Another reason to favor AVX2 over AVX512 is that, in practice, many common environments will start building with GOAMD64=v3, which guarantees AVX2 at compile time but not AVX512. This is not to say that we cannot detect AVX512 at run time and use it, but it does mean that detecting AVX2 seems like a better tradeoff: it's more widely available, and the check could happen at compile time more often.
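As an aside, if I recall correctly the GOAMD64 support added in Go 1.18 also comes with per-level build tags (amd64.v1 through amd64.v4), which makes the compile-time variant expressible as a file split. A sketch, with a hypothetical package and constant name:

```go
// file: haveavx2_v3.go
// Compiled only when GOAMD64=v3 (or v4), which guarantees AVX2,
// so no runtime feature check is needed on this path.

//go:build amd64.v3

package hypothetical

const haveAVX2 = true
```

```go
// file: haveavx2_other.go
// Compiled for lower GOAMD64 levels; fall back to runtime detection.

//go:build !amd64.v3

package hypothetical

import "golang.org/x/sys/cpu"

var haveAVX2 = cpu.X86.HasAVX2
```

Callers can then branch on haveAVX2 uniformly; under v3 the branch is a compile-time constant and the fallback path can be dead-code eliminated.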
I think using VPCLMULQDQ might have a caveat: some CPUs actually implement it as slow microcode. We could cover this in benchmarking on CL submission, to figure out whether it could make things worse, and whether that is bad enough to warrant not using it or adding more detection.
I think it's important to note that the AVX-512 "license" downclocking has been greatly reduced in Intel chips beginning with Ice Lake in 2019. And notably, VAES and VPCLMULQDQ are not supported by any AVX-512 implementations prior to Ice Lake, so if those instructions are used in an AVX-512 optimized code path (precluding support for earlier AVX-512 implementations), then AVX-512 specific downclocking is likely to be a much more minor concern than implied in the discussion above. Actual workload-specific benchmarks should drive such optimization decisions, not perceptions based on potentially dated assumptions.

See: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
And: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
I don't think performance is the only concern here. As for performance, as mentioned earlier: "have an understanding and differentiate between allowing AVX512 that can and that can't cause downclocking and thereby performance regressions for other code". AMD Zen 4 seems like it will support AVX512. How it performs there will also be interesting, to avoid regressing performance with new code for some CPUs: https://www.extremetech.com/computing/325888-gigabyte-leaks-amd-zen-4-details-5nm-avx-512-96-cores-12-channel-ddr5
I agree, performance is rarely the only concern. Early implementations of AVX-512 got a lot of bad press over the downclocking, and to the extent you care about that installed base, those concerns remain relevant. My point is simply to highlight that if the proposed optimizations use instructions unsupported by those older implementations, then the past "bad reputation" of AVX-512 downclocking is not likely a relevant consideration. Clearly the benchmark-driven component of any decision should consider all widely available hardware implementations. Presumably AMD won't release an AVX-512 implementation so bad that it makes their CPUs look worse on benchmarks than if they just ran the equivalent AVX2 code path. But I guess we'll find out!
One more point, FWIW: I've now written a fair amount of AVX2/AVX-512 functions for a specific (integer/non-crypto) workload, and typically see 4-8x speedups for AVX2 vs 12-40x speedups for AVX-512 (across multiple HW implementations). The upper end of the latter range is typically driven by the presence of workload-specific instructions, in my case VPOPCNTQ. Older AVX-512 implementations missing that feature still clearly outperform AVX2, but more toward the lower end of the speedup range (including downclocking impacts). YMMV.
If the code checked CPUID to limit the set of CPUs that the optimization runs on, so that it is not deployed where it is unanticipated, would that be seen as a welcome step in the right direction?
I think we first need to have builders up to speed and see what code paths they can actually run, as well as each CL being decided to be in a submittable state. Then we can start the fine-grained performance evaluations and see if anything needs to be tuned to avoid regressions, as happened for memcpy (see go/src/runtime/cpuflags_amd64.go, line 17 at commit b850f36).
But ideally we can avoid them. On another note, I think https://go-review.googlesource.com/c/go/+/286852, which this would have been used for, is blocked on crypto assembly policies.
This proposal is about our stance on whether and when to use AVX-512 in the standard library.
This proposal has been added to the active column of the proposals project.
Reading here: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking I feel like we want to wait a while yet before using AVX512 in the stdlib generally. There is still a lot of Skylake out there. As @vsivsi notes here, the particular instructions wanted for crypto are not on Skylake, so maybe for that it would be OK? And as @martisch notes, we need a builder story.
We can also generally blocklist Skylake (and maybe Ice Lake?) even if they report AVX512 support. I think we would want a blocklist and not an allowlist, for both AMD and future compatibility. We do need to figure out how AMD chips behave.
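A sketch of what such a blocklist might look like. The family/model values below are for Skylake-SP (family 6, model 0x55, which Cascade Lake also shares); the helper name and the use of the third-party github.com/klauspost/cpuid/v2 package are illustrative:

```go
package main

import (
	"fmt"

	"github.com/klauspost/cpuid/v2"
)

// avx512Blocklisted reports whether this CPU is on a list of parts known
// to downclock heavily on AVX-512 "heavy" instructions. A blocklist (as
// suggested above) means unknown implementations, e.g. future AMD parts,
// are allowed by default.
func avx512Blocklisted() bool {
	return cpuid.CPU.VendorID == cpuid.Intel &&
		cpuid.CPU.Family == 6 &&
		cpuid.CPU.Model == 0x55 // Skylake-SP / Cascade Lake
}

func main() {
	if cpuid.CPU.Supports(cpuid.AVX512F) && !avx512Blocklisted() {
		fmt.Println("AVX-512 path enabled")
	} else {
		fmt.Println("AVX-512 path disabled")
	}
}
```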
Briefly chiming in here... Anecdotally, I use AVX512 heavily on Ice Lake systems and it is always a net win for my workloads, which include use of "heavy" instructions that are prone to maximizing any downclocking. But the right answer for any workload is always to be found in proper benchmarking.
I would also be remiss if I didn't take this opportunity to point out that AVX512 remains (unnecessarily) officially disabled in Golang on darwin/amd64. Apple fixed the kernel bug that triggered this long ago, and I've had a Go stdlib fix submitted for a long time with no review activity.
I think the worry here is not for heavy users. If you've reached for AVX512, it's probably worth it to you well beyond any down-clocking penalty.
Do you have a reference for this? In particular, we need to be sure that the fix is in all the Darwin versions we support. Reading https://github.com/golang/go/wiki/MinimumRequirements#macos-n%C3%A9e-os-x-aka-darwin , I think we need Catalina and later to be free of the bug. Reading #49233 and https://community.intel.com/t5/Software-Tuning-Performance/MacOS-Darwin-kernel-bug-clobbers-AVX-512-opmask-register-state/m-p/1327259#M7970 , I don't think we can turn on AVX512 yet. The bug in question was definitely present in Catalina (10.15.6).
Sorry about that. It looks orphaned for some reason. I will adopt it.
Agreed. The worst case is using an AVX512 "heavy" instruction once per second or something. On Ice Lake the risk and potential impact of such worst cases is massively reduced.

On the darwin/amd64 fix: the code I submitted does a kernel version check (only on Darwin when AVX512 is present) to ensure that Apple's patch is present. Here's the spot in the issue thread where I note the properly patched versions of macOS/darwin and discuss the Golang fix I developed: #49233 (comment)
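A sketch of that kind of gate. The placement, helper name, and version cutoff below are placeholders (the real patched Darwin versions are the ones listed in #49233); unix.Sysctl and the major-version parse are the only real dependencies:

```go
//go:build darwin && amd64

package cpu // illustrative placement, not the actual stdlib code

import (
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// darwinKernelFixedForAVX512 reports whether the running kernel carries
// Apple's fix for the AVX-512 opmask clobbering bug. The cutoff below is
// a placeholder; the patched versions are documented in issue #49233.
func darwinKernelFixedForAVX512() bool {
	release, err := unix.Sysctl("kern.osrelease") // e.g. "21.6.0"
	if err != nil {
		return false // fail closed: keep AVX-512 disabled
	}
	major, err := strconv.Atoi(strings.SplitN(release, ".", 2)[0])
	if err != nil {
		return false
	}
	return major >= 21 // placeholder: Darwin 21 corresponds to macOS 12
}
```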
At the risk of complicating this discussion, here's one more wrinkle to consider... The TL;DR on the above is that VAES, VPCLMULQDQ, and GFNI are, strictly speaking, not AVX512 extensions. Intel has clearly signaled its desire to produce (and in one case has shipped) processors lacking AVX512F support that nevertheless do support some or all of these extensions with SSE- or VEX-encoded instructions. That is: the SSE4/128-bit and AVX2/256-bit instruction variants are fully implemented, and there's every reason to expect they will be shipped in the absence of AVX512 support. So the existing symbols

HasAVX512VAES
HasAVX512VPCLMULQDQ
HasAVX512GFNI

are all misnomers and don't correctly detect (per the Intel architecture documentation) support for these extensions. I keep meaning to file an issue on Golang over this because, as the above issue on Avo describes in detail, Golang currently uses both symbols and detection logic that tie use of these extensions to AVX512F, when that is not the correct assumption to make. The correct Golang implementation would be to deprecate the above symbols and instead define:

HasVAES
HasVPCLMULQDQ
HasGFNI
To correctly detect support for the VEX- or EVEX-encoded versions, one of the above should be combined with the corresponding encoding check (AVX/AVX2 for the VEX-encoded 128/256-bit forms; AVX512F plus AVX512VL for the EVEX-encoded 128/256-bit forms).

Detecting VPCLMULQDQ support is further complicated by the separate PCLMULQDQ CPUID bit, which Golang does correctly handle with HasPCLMULQDQ.

For examples, see the felixcloutier instruction listings. (VAES examples aren't in the felixcloutier documentation, but can be found in the official Intel documentation: https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4 at Vol. 2A 3-51.)
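To make the bit positions concrete, here is a sketch of the proposed detection in the style of x/sys/cpu. The cpuid helper (backed by assembly) and the X86 struct are assumed scaffolding, not the actual internal/cpu source; the bit positions are from the Intel SDM, CPUID.(EAX=07H, ECX=0):ECX:

```go
package cpu // illustrative sketch, not the actual internal/cpu source

// cpuid is an assumed helper in the style of x/sys/cpu,
// implemented in assembly.
func cpuid(eaxArg, ecxArg uint32) (eax, ebx, ecx, edx uint32)

func isSet(bitpos uint, value uint32) bool {
	return value&(1<<bitpos) != 0
}

var X86 struct {
	HasAVX2, HasAVX512F, HasAVX512VL bool
	HasGFNI, HasVAES, HasVPCLMULQDQ  bool
}

func detect() {
	// These bits live in CPUID leaf 7, sub-leaf 0, register ECX,
	// independent of the AVX-512 feature bits in EBX.
	_, _, ecx7, _ := cpuid(7, 0)
	X86.HasGFNI = isSet(8, ecx7)        // GFNI: ECX bit 8
	X86.HasVAES = isSet(9, ecx7)        // VAES: ECX bit 9
	X86.HasVPCLMULQDQ = isSet(10, ecx7) // VPCLMULQDQ: ECX bit 10
}

// useVAES256 reports whether the VEX-encoded (YMM) VAES forms can be
// used: the VAES bit plus AVX2, per the combination described above.
func useVAES256() bool { return X86.HasAVX2 && X86.HasVAES }
```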
It sounds like the proposal is to add a few more names in the CPU info:

X86.HasVAES
X86.HasVPCLMULQDQ
X86.HasGFNI

I don't think we necessarily need to deprecate the existing names, since they can still be used to check for AVX512 and the specific feature together. To be clear, this proposal is not about using AVX-512, or even these not-quite-AVX-512 things, in any specific package.
Based on the discussion above, this proposal seems like a likely accept.
No change in consensus, so accepted. 🎉
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (`go env`)?

What did you do?
What did you expect to see?
Detect OS support for AVX-512 registers, detect CPU support for AVX-512 VAES and VPCLMULQDQ instructions.
What did you see instead?
AVX-512 OS support and the VAES crypto instructions are not currently detected by Go. We have developed proposed patches against Go 1.15.5 for internal/cpu: check OS support for AVX-512 registers, check CPUID for the presence of VAES and VPCLMULQDQ, and set flags accordingly. The patches will be contributed and submitted to the Go Gerrit for review.
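For reference, the OS-support half of such a check conventionally reads XCR0 via XGETBV and verifies that the OS saves the full AVX-512 register state. A sketch, with xgetbv as an assumed assembly helper in the style of x/sys/cpu (note the Darwin caveat discussed above, where XCR0 cannot be treated as immutable):

```go
package cpu // illustrative sketch

// xgetbv executes XGETBV with ECX=0 and returns XCR0 in eax/edx.
// Assumed to be implemented in assembly, as in x/sys/cpu.
func xgetbv() (eax, edx uint32)

// osSupportsAVX512 reports whether the OS preserves AVX-512 register
// state across context switches.
func osSupportsAVX512() bool {
	eax, _ := xgetbv()
	// XCR0 bits: 1 (SSE), 2 (AVX), 5 (opmask), 6 (ZMM0-15 upper
	// halves), 7 (ZMM16-31). All must be set for full AVX-512 state.
	const mask = 1<<1 | 1<<2 | 1<<5 | 1<<6 | 1<<7 // 0xe6
	return eax&mask == mask
}
```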