[ROCm] Fix fp32 atomicAdd for non-MI100 GPUs (pytorch#128750)

Current implementation is very specific to MI100. This is causing performance degradation for other GPUs.

Fixes pytorch#128631

Benchmarking on MI300X:
```
Before: 1918.5126953125 ms
After: 0.8285150527954102 ms
```

Co-authored-by: Jeff Daily <[email protected]>
Pull Request resolved: pytorch#128750
Approved by: https://github.com/xw285cornell

(cherry picked from commit 1f0a68b)
ROCm · Oct 28, 2024 · de3e990
1 parent 0436e08
Showing 1 changed file with 7 additions and 1 deletion.
diff --git a/aten/src/ATen/cuda/Atomic.cuh b/aten/src/ATen/cuda/Atomic.cuh
@@ -334,7 +334,13 @@ static inline __device__ void gpuAtomicAddNoReturn(double *address, double val)
 
 /* Special case fp32 atomic. */
 #if defined(USE_ROCM)
-static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) { atomicAddNoRet(address, val); }
+static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) {
+#if defined(__gfx908__)
+  atomicAddNoRet(address, val);
+#else
+  (void)unsafeAtomicAdd(address, val);
+#endif
+}
 #else
 static inline __device__ void gpuAtomicAddNoReturn(float *address, float val) { gpuAtomicAdd(address, val); }
 #endif
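
The patch above selects the fp32 atomic-add path at compile time: the `__gfx908__` (MI100) target keeps `atomicAddNoRet`, and every other ROCm target falls through to `unsafeAtomicAdd` with its return value discarded. A minimal host-side sketch of that dispatch pattern follows. It is an analogue, not the device code: the `sketch` namespace, the `SKETCH_TARGET_GFX908` macro, and both helper functions are hypothetical stand-ins, and `std::atomic<float>` with a compare-exchange loop replaces the GPU intrinsics so the snippet compiles with an ordinary C++ compiler.

```cpp
// Host-side analogue of the patched dispatch. Everything in namespace
// `sketch` is hypothetical; none of it is PyTorch or HIP API.
#include <atomic>
#include <cassert>

namespace sketch {

// Stand-in for the architecture-specific path (atomicAddNoRet in the patch).
// Returns nothing, mirroring the "no return" intrinsic.
inline void add_specialized(std::atomic<float>* address, float val) {
  float old = address->load(std::memory_order_relaxed);
  // On failure compare_exchange_weak refreshes `old` with the current value,
  // so the loop retries with an up-to-date sum.
  while (!address->compare_exchange_weak(old, old + val,
                                         std::memory_order_relaxed)) {
  }
}

// Stand-in for the general path (unsafeAtomicAdd in the patch).
// Returns the value stored before the add.
inline float add_general(std::atomic<float>* address, float val) {
  float old = address->load(std::memory_order_relaxed);
  while (!address->compare_exchange_weak(old, old + val,
                                         std::memory_order_relaxed)) {
  }
  return old;
}

// Same shape as the patched gpuAtomicAddNoReturn(float*, float): one public
// entry point, with the implementation chosen by a preprocessor guard.
inline void atomic_add_no_return(std::atomic<float>* address, float val) {
  assert(address != nullptr);
#if defined(SKETCH_TARGET_GFX908)  // hypothetical analogue of __gfx908__
  add_specialized(address, val);
#else
  (void)add_general(address, val);  // caller does not need the old value
#endif
}

}  // namespace sketch
```

On the host both branches behave identically, so the sketch only illustrates the structure of the guard. Callers always go through `sketch::atomic_add_no_return`, and building with `-DSKETCH_TARGET_GFX908` switches the branch without touching any call site, which is the property the patch relies on to keep the MI100 path unchanged.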