[ROCm] Fix fp32 atomicAdd for non-MI100 GPUs (pytorch#128750)
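The current ROCm implementation of the fp32 `gpuAtomicAddNoReturn` is specific to MI100 (gfx908): it always calls `atomicAddNoRet`.
This causes performance degradation on other GPUs, so `atomicAddNoRet` is now used only on gfx908 and `unsafeAtomicAdd` is used elsewhere.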

Fixes pytorch#128631

Benchmarking on MI300X:
```
Before:  1918.5126953125 ms
After: 0.8285150527954102 ms
```

Co-authored-by: Jeff Daily <[email protected]>
Pull Request resolved: pytorch#128750
Approved by: https://github.com/xw285cornell

(cherry picked from commit 1f0a68b)
jerrymannil authored and jithunnair-amd committed Oct 28, 2024
1 parent 0436e08 commit de3e990
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion aten/src/ATen/cuda/Atomic.cuh
@@ -334,7 +334,13 @@ static inline __device__ void gpuAtomicAddNoReturn(double *address, double val)

 /* Special case fp32 atomic. */
 #if defined(USE_ROCM)
-static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) { atomicAddNoRet(address, val); }
+static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) {
+#if defined(__gfx908__)
+  atomicAddNoRet(address, val);
+#else
+  (void)unsafeAtomicAdd(address, val);
+#endif
+}
 #else
 static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) { gpuAtomicAdd(address, val); }
 #endif
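
The nested guard in the hunk above selects one of three code paths at compile time. The following is a minimal host-side sketch of that selection logic, not the PyTorch implementation: `AtomicPath`, `selected_path`, and `atomic_add_no_return` are hypothetical names introduced for illustration, and the CAS loop on `std::atomic<float>` is only a CPU stand-in for a device atomic add whose return value is discarded.

```cpp
#include <atomic>

// Hypothetical labels for the three branches in the patched guard.
enum class AtomicPath { Gfx908NoRet, RocmUnsafe, Generic };

// Mirrors the preprocessor structure of the patch: USE_ROCM picks the ROCm
// branch, and __gfx908__ (MI100) picks the atomicAddNoRet path inside it.
constexpr AtomicPath selected_path() {
#if defined(USE_ROCM)
#if defined(__gfx908__)
  return AtomicPath::Gfx908NoRet;   // patch: atomicAddNoRet(address, val)
#else
  return AtomicPath::RocmUnsafe;    // patch: (void)unsafeAtomicAdd(address, val)
#endif
#else
  return AtomicPath::Generic;       // patch: gpuAtomicAdd(address, val)
#endif
}

// Host stand-in for a "no return" float atomic add (assumption: illustrative
// only). The previous value is never handed back to the caller.
inline void atomic_add_no_return(std::atomic<float>& target, float val) {
  float expected = target.load(std::memory_order_relaxed);
  // Retry until no other thread changed the value between load and store;
  // on failure, `expected` is refreshed with the current value.
  while (!target.compare_exchange_weak(expected, expected + val,
                                       std::memory_order_relaxed)) {
  }
}
```

In an ordinary host build neither macro is defined, so `selected_path()` evaluates to `AtomicPath::Generic`; the other two branches would only be taken in a ROCm build, with the gfx908 branch limited to device compilation for that target.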