FusedMultiplyAdd not using all available instructions #110109

Open
PavelCibulka opened this issue Nov 23, 2024 · 2 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)

PavelCibulka commented Nov 23, 2024

I've been trying to write code that compiles to the vfnmadd213ss instruction, but I haven't managed it in .NET 9. I'm using a Zen 4 CPU (AMD Ryzen 7 7800X3D 8-Core Processor).

The code should compute cos² = 1 - sin * sin.
1st variant:

    public static float M1(float a) {
        float x = MathF.FusedMultiplyAdd(a, -a, 1f);
        return x;
    }
G_M000_IG01:                ;; offset=0x0000
 
G_M000_IG02:                ;; offset=0x0000
       vmovaps  xmm1, xmm0
       vxorps   xmm0, xmm0, xmmword ptr [reloc @RWD00]
       vfmadd213ss xmm1, xmm0, dword ptr [reloc @RWD16]
       vmovaps  xmm0, xmm1
 
G_M000_IG03:                ;; offset=0x0019
       ret      
 
RWD00  	dq	8000000080000000h, 8000000080000000h
RWD16  	dq	3F8000003F800000h, 3F8000003F800000h

2nd variant:

    public static float M2(float a) {
        float x = MathF.FusedMultiplyAdd(-a, a, 1f);
        return x;
    }
G_M000_IG01:                ;; offset=0x0000
 
G_M000_IG02:                ;; offset=0x0000
       vxorps   xmm1, xmm0, xmmword ptr [reloc @RWD00]
       vfmadd213ss xmm1, xmm0, dword ptr [reloc @RWD16]
       vmovaps  xmm0, xmm1
 
G_M000_IG03:                ;; offset=0x0015
       ret      
 
RWD00  	dq	8000000080000000h, 8000000080000000h
RWD16  	dq	3F8000003F800000h, 3F8000003F800000h

3rd variant:

    public static float M3(float a) {
        float x = -MathF.FusedMultiplyAdd(a, a, -1f);
        return x;
    }
G_M000_IG01:                ;; offset=0x0000
 
G_M000_IG02:                ;; offset=0x0000
       vfmadd213ss xmm0, xmm0, dword ptr [reloc @RWD00]
       vxorps   xmm0, xmm0, xmmword ptr [reloc @RWD16]
 
G_M000_IG03:                ;; offset=0x0011
       ret      
 
RWD00  	dq	BF800000BF800000h, BF800000BF800000h
RWD16  	dq	8000000080000000h, 8000000080000000h

All three functions are mathematically equivalent, yet they generate different assembly code. None of them uses a variant of the VFNMADD instruction. I expected just this assembly code:

       vfnmadd213ss xmm0, xmm0, dword ptr [reloc @RWD00]
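
For anyone comparing the three variants, a small harness can check whether they agree bit for bit. This is only an illustrative sketch: the class name `FmaVariants` and the helper `BitsOf` are made up for the example. One thing it should surface is that for a = 1 the first two variants return +0 while the third returns -0, because negating the result also flips the sign of an exact zero. So the third form is probably not a drop-in candidate for a single vfnmadd without accounting for that case.

```csharp
using System;

public static class FmaVariants
{
    // The three variants from the report, unchanged in behavior.
    public static float M1(float a) => MathF.FusedMultiplyAdd(a, -a, 1f);
    public static float M2(float a) => MathF.FusedMultiplyAdd(-a, a, 1f);
    public static float M3(float a) => -MathF.FusedMultiplyAdd(a, a, -1f);

    // Hypothetical helper: raw IEEE 754 bit pattern of a float,
    // so that +0 and -0 can be told apart.
    public static int BitsOf(float value) => BitConverter.SingleToInt32Bits(value);

    // True when M1 and M2 return the same bit pattern for the input.
    public static bool FirstTwoAgree(float a) => BitsOf(M1(a)) == BitsOf(M2(a));

    // True when M3 returns the same bit pattern as M1 for the input.
    public static bool ThirdAgrees(float a) => BitsOf(M1(a)) == BitsOf(M3(a));
}
```

For ordinary inputs such as 0.5f all three return the same value; the difference only shows up where the fused result is an exact zero.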

PavelCibulka added the tenet-performance label Nov 23, 2024
dotnet-issue-labeler bot added the area-CodeGen-coreclr label Nov 23, 2024
dotnet-policy-service bot added the untriaged label Nov 23, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@huoyaoyuan
Member

You can access the instruction directly through System.Runtime.Intrinsics.X86.Fma, using Vector128.CreateScalarUnsafe and ToScalar to move the float value into and out of an xmm register. It would, of course, be less portable.
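
A minimal sketch of that approach, assuming `Fma.MultiplyAddNegatedScalar` (which computes -(a * b) + c in the lowest lane) is the intrinsic that fits here. The class name `Cos2` and method name `OneMinusSquare` are made up for the example, and which encoding the JIT actually emits (132, 213 or 231 form) is not guaranteed.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Cos2
{
    // Computes 1 - a * a with a single rounding step.
    public static float OneMinusSquare(float a)
    {
        if (Fma.IsSupported)
        {
            // Place the scalars in the lowest lane; upper lanes are left undefined.
            Vector128<float> va = Vector128.CreateScalarUnsafe(a);
            Vector128<float> one = Vector128.CreateScalarUnsafe(1f);

            // Lowest lane becomes -(va * va) + one.
            return Fma.MultiplyAddNegatedScalar(va, va, one).ToScalar();
        }

        // Portable fallback for hardware without FMA support.
        return MathF.FusedMultiplyAdd(-a, a, 1f);
    }
}
```

The `Fma.IsSupported` check keeps the method usable on non-x86 or older hardware, at the cost of falling back to the code path the issue is about.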

> This code should do cos2 = 1 - sin * sin

If what you need is both sin and cos at a lower cost, you can also look at Math.SinCos and MathF.SinCos.
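
For illustration, a short sketch of the `MathF.SinCos` route, which returns both values from one call as a tuple. The wrapper class `SinCosExample` and method `Compute` are made up for the example, and whether this is actually cheaper than two separate calls depends on the runtime and hardware.

```csharp
using System;

public static class SinCosExample
{
    // Returns sine and cosine of the angle (in radians) from a single call.
    public static (float Sin, float Cos) Compute(float angle)
    {
        (float sin, float cos) = MathF.SinCos(angle);
        return (sin, cos);
    }
}
```

With both values available, there is no need to derive the squared cosine from the sine at all.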

EgorBo added this to the Future milestone Nov 23, 2024
EgorBo removed the untriaged label Nov 23, 2024