Debug formatter for `Tensor` is confusing with > 1 GPU #2619

zackangelo · 2024-11-15T17:04:38Z

When running a model across several processes using NCCL, the debug formatter prints the same device ID for different GPUs:

GPU 0: (dev=Cuda(CudaDevice(DeviceId(1))), shape=[1, 128256], len=128256)
GPU 1: (dev=Cuda(CudaDevice(DeviceId(1))), shape=[1, 128256], len=128256)

It's confusing when looking at logs and trying to figure out which GPU is doing what.

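This is because candle assigns device IDs from an atomic counter that is local to each process (PID), so the first device created in every process gets the same ID regardless of which GPU it is. The debug formatter prints only that ID: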

candle/candle-core/src/cuda_backend/device.rs, lines 35 to 39 in 00d8a0c

    
    impl std::fmt::Debug for CudaDevice {
        fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
            write!(f, "CudaDevice({:?})", self.id)
        }
    }

Would it be a problem to include the CUDA device ordinal in the debug formatter? If not, I'll open a PR.
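To illustrate why the IDs collide, here is a minimal sketch of a per-process atomic ID counter. It is not candle's actual code: the `DeviceId` stand-in, the `new` constructor, and the starting value of 1 (chosen to match the `DeviceId(1)` in the logs above) are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the per-process device identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DeviceId(usize);

impl DeviceId {
    /// Hands out the next ID from a counter stored in a `static`.
    /// A `static` lives in the memory of a single process, so every
    /// process launched under NCCL has its own independent counter.
    fn new() -> Self {
        // Assumed starting value; picked to match the logged `DeviceId(1)`.
        static COUNTER: AtomicUsize = AtomicUsize::new(1);
        DeviceId(COUNTER.fetch_add(1, Ordering::Relaxed))
    }
}
```

Within one process, successive calls to `DeviceId::new()` return distinct, increasing IDs. Two separate processes that each create a single device both receive the counter's first value, which is consistent with both log lines showing the same ID.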

LaurentMazare · 2024-11-15T17:26:11Z

Yeah, feel free to make a PR that would change it to something like `CudaDevice(ordinal:id)`.
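One possible reading of the suggested `CudaDevice(ordinal:id)` format is sketched below. The structs are simplified stand-ins rather than candle's real types, and the `ordinal` field is an assumption about where the CUDA device ordinal would be available to the formatter.

```rust
use std::fmt;

/// Hypothetical stand-in for the per-process device identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DeviceId(usize);

/// Simplified stand-in for `CudaDevice`. The `ordinal` field is an
/// assumption: it holds the CUDA device ordinal (0 for the first GPU, etc.).
struct CudaDevice {
    id: DeviceId,
    ordinal: usize,
}

impl fmt::Debug for CudaDevice {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print the ordinal first, then the per-process ID, so that log
        // lines from different processes can be told apart.
        write!(f, "CudaDevice({}:{:?})", self.ordinal, self.id)
    }
}
```

With this sketch, two devices that share `DeviceId(1)` but sit on ordinals 0 and 1 format as `CudaDevice(0:DeviceId(1))` and `CudaDevice(1:DeviceId(1))`, so the two GPUs in the logs above would no longer look identical.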
