
Nan loss during training #10

Open
ItsThanhTung opened this issue Mar 21, 2024 · 6 comments

@ItsThanhTung

Hi team, thanks for sharing this great work. I have a problem when training with train.sh on a 40GB A100. I set batch_size=2 and gradient_accumulation_steps=16, and tried LR=5e-5 and 2.5e-5. The training loss becomes NaN for both learning rates.

03/21/2024 18:26:37 - INFO - __main__ - Loaded lora parameters into model
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All model weights loaded successfully                                                                                        
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All optimizer states loaded successfully                                                                                     
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All scheduler states loaded successfully                                                                                     
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All random states loaded successfully                                                                                        
03/21/2024 18:26:37 - INFO - accelerate.accelerator - Loading in 0 custom states                                                                                                     
Steps:   0%|          | 0/1571000 [00:00<?, ?it/s]
03/21/2024 18:26:37 - INFO - __main__ - Running validation...
{'timestep_spacing'} was not found in config. Values will be initialized to default values.
Loaded scheduler as PNDMScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-2-1-base.
Loaded feature_extractor as CLIPImageProcessor from `feature_extractor` subfolder of stabilityai/stable-diffusion-2-1-base.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-2-1-base.
Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00, 20.30it/s]
{'use_karras_sigmas', 'solver_type', 'lambda_min_clipped', 'timestep_spacing', 'sample_max_value', 'dynamic_thresholding_ratio', 'solver_order', 'thresholding', 'variance_type', 'algorithm_type', 'lower_order_final'} was not found in config. Values will be initialized to default values.
100%|██████████| 50/50 [00:09<00:00,  5.04it/s]
writing inference outputs failed module 'ffmpeg' has no attribute 'input'
100%|██████████| 50/50 [00:09<00:00,  5.55it/s]
03/21/2024 18:26:49 - INFO - __main__ - Running training...                                                                                                                          
Steps:   0%|          | 0/1571000 [00:31<?, ?it/s, lr=2.5e-5, step_loss=0.0496]
/lustre/scratch/client/vinai/users/tungdt33/env/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/lustre/scratch/client/vinai/users/tungdt33/env/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
03/21/2024 18:27:11 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
03/21/2024 18:27:11 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
Steps:   0%|          | 8/1571000 [04:03<10584:31:31, 24.25s/it, lr=2.5e-5, step_loss=nan]

Do you have any suggestions?
Thanks!

@ItsThanhTung
Author

Update: I trained the small version, and everything is fine.

@lukasHoel
Contributor

Hi,

I also encountered NaN loss during training (especially when testing fp16 training), but the final configuration (2x 80GB A100 GPUs with the configured learning rate and batch size) worked for me without any NaNs.
It might be that you have to change some things in the training pipeline to make it work on your system:

  • Try adjusting the max_grad_norm to a lower value

  • A fix that always worked for me (but is a bit unsatisfying) is to just set NaN gradients to zero. Add this right before optimizer.step() is called (a sketch of where both changes sit in the loop follows after this list):

# replace NaN gradient entries with 0 in place (nan_to_num_ also clamps
# +/-inf to the largest finite values)
for p in unet.parameters():
    if p.grad is not None:
        p.grad.nan_to_num_()
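
For orientation, here is a minimal sketch of where both suggestions would sit in one Accelerate-style optimization step. The names accelerator, optimizer, lr_scheduler, args.max_grad_norm, and compute_loss are assumptions based on typical diffusers training loops, not necessarily the exact names used in this repo's train.py:

loss = compute_loss(batch)  # placeholder for the actual diffusion loss
accelerator.backward(loss)

if accelerator.sync_gradients:
    # clip to a (possibly lowered) max_grad_norm before stepping
    accelerator.clip_grad_norm_(unet.parameters(), args.max_grad_norm)

    # zero out any NaN gradients right before optimizer.step()
    for p in unet.parameters():
        if p.grad is not None:
            p.grad.nan_to_num_()

optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()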

@ItsThanhTung
Author

ItsThanhTung commented Mar 22, 2024

I am still facing the same issue with train.sh (41GB of GPU memory). Also, is it normal for it to run this slowly? I increased num_workers=32 and set max_grad_norm=5e-4, but I am still getting NaN loss.

03/22/2024 13:58:27 - INFO - __main__ - Running training...
Steps:   0%|          | 0/1572000 [00:45<?, ?it/s, lr=5e-5, step_loss=0.185]
/home/nthanh/miniconda3/envs/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/nthanh/miniconda3/envs/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Steps:   0%|          | 0/1572000 [00:49<?, ?it/s, lr=5e-5, step_loss=0.22]
03/22/2024 13:59:02 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
03/22/2024 13:59:02 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
Steps:   0%|          | 25/1572000 [05:52<5052:48:15, 11.57s/it, lr=5e-5, step_loss=nan]

@lukasHoel
Contributor

I would suggest trying num_workers=0 and also including the other fix from my message above.
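
For illustration only, assuming the dataloader is built with torch.utils.data.DataLoader (train_dataset and collate_fn are placeholders, and the actual argument or flag name in train.sh may differ):

from torch.utils.data import DataLoader

# num_workers=0 keeps all data loading in the main process, which rules out
# worker-related crashes or stalls when debugging NaN losses
train_dataloader = DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    collate_fn=collate_fn,
)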

@ItsThanhTung
Author

We've followed your training suggestions to avoid NaN loss, but now we're encountering exploding gradients after 20k training steps. Will you be releasing the pretrained model? It would greatly assist us in reproducing the results outlined in the paper.

@lukasHoel
Contributor

We do not release weights because of licensing issues. I'd be happy to help with any reproduction issues. How exactly did you detect the exploding gradients? I never encountered them, and I think that's exactly what the max_grad_norm parameter is meant to avoid.
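
One way to verify whether gradients are actually exploding (a sketch only, assuming the trained module is called unet as in the snippet above) is to log the global gradient norm each step before clipping:

import torch

# compute the global L2 norm of all gradients; an explosion shows up as a
# rapidly growing value in the logs
def global_grad_norm(model: torch.nn.Module) -> float:
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    if not grads:
        return 0.0
    return torch.cat(grads).norm().item()

# e.g. inside the training loop, right before clipping / optimizer.step():
# print(f"grad_norm={global_grad_norm(unet):.3f}")

If the script clips gradients explicitly with torch.nn.utils.clip_grad_norm_, its return value is the pre-clipping total norm and can be logged directly instead.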
