When I ran stage-1 pre-training on the COCO dataset (downloaded with the script in LAVIS), the loss quickly became NaN. I tracked the problem down to having mistakenly used vicuna-7B-v1.5 as the LLM instead of the default Vicuna v0 7B.
I wonder why a different Vicuna version causes the loss to become NaN. Could the same problem arise with Llama-3?
Does anyone know the possible cause?
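To help pinpoint where the numbers first blow up, this is a minimal debugging sketch I could attach to the model, assuming a plain PyTorch `nn.Module`; the names `register_nan_hooks` and `model` are just illustrative, not part of the repo:

```python
# Minimal sketch of a NaN/Inf watchdog, assuming a standard PyTorch nn.Module
# named `model` (e.g. the model before it is wrapped in DDP).
import torch
import torch.nn as nn


def register_nan_hooks(model: nn.Module):
    """Attach forward hooks that report the first module emitting NaN/Inf outputs."""

    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"[NaN watch] non-finite values in output of: {name}")
                    break
        return hook

    # One hook per named submodule; keep the handles so they can be removed later.
    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    return handles


# Usage (hypothetical): handles = register_nan_hooks(model) before the training loop,
# then h.remove() for h in handles once the offending layer is found.
```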
Here is the training log:
2024-06-12 16:20:56,611 [INFO] Start training
2024-06-12 16:21:06,082 [INFO] dataset_ratios not specified, datasets will be concatenated (map-style datasets) or chained (webdataset.DataPipeline).
2024-06-12 16:21:06,082 [INFO] Loaded 414113 records for train split from the dataset.
batch sizes [[64]]
module.llama_proj.weight
module.llama_proj.bias
2024-06-12 16:21:06,100 [INFO] number of trainable parameters: 3149824
2024-06-12 16:21:06,101 [INFO] Start training epoch 0, 5000 iters per inner epoch.
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Train: data epoch: [0] [ 0/5000] eta: 6:21:59 lr: 0.000001 loss: 6.8750 time: 4.5839 data: 0.0000 max mem: 54099
Train: data epoch: [0] [ 50/5000] eta: 5:30:41 lr: 0.000002 loss: 6.9350 time: 4.3366 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 100/5000] eta: 5:36:02 lr: 0.000003 loss: nan time: 4.2825 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 150/5000] eta: 5:35:11 lr: 0.000004 loss: 6.9082 time: 4.2402 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 200/5000] eta: 5:32:37 lr: 0.000005 loss: 6.8793 time: 4.1392 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 250/5000] eta: 5:29:55 lr: 0.000006 loss: nan time: 4.0571 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 300/5000] eta: 5:19:55 lr: 0.000007 loss: nan time: 3.5617 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 350/5000] eta: 5:11:46 lr: 0.000008 loss: nan time: 3.4891 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 400/5000] eta: 5:03:33 lr: 0.000009 loss: nan time: 3.5205 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 450/5000] eta: 4:57:14 lr: 0.000010 loss: nan time: 3.5452 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 500/5000] eta: 4:51:30 lr: 0.000011 loss: nan time: 3.4910 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 550/5000] eta: 4:46:23 lr: 0.000012 loss: nan time: 3.4316 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 600/5000] eta: 4:41:38 lr: 0.000013 loss: nan time: 3.5642 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 650/5000] eta: 4:37:05 lr: 0.000014 loss: nan time: 3.5331 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 700/5000] eta: 4:33:05 lr: 0.000015 loss: nan time: 3.4861 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 750/5000] eta: 4:29:03 lr: 0.000016 loss: nan time: 3.4726 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 800/5000] eta: 4:25:02 lr: 0.000017 loss: nan time: 3.6616 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 850/5000] eta: 4:21:01 lr: 0.000018 loss: nan time: 3.5413 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 900/5000] eta: 4:17:19 lr: 0.000019 loss: nan time: 3.6708 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 950/5000] eta: 4:13:42 lr: 0.000020 loss: nan time: 3.6225 data: 0.0000 max mem: 55634
Train: data epoch: [0] [1000/5000] eta: 4:10:01 lr: 0.000021 loss: nan time: 3.5988 data: 0.0000 max mem: 55634
...
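One difference between Vicuna checkpoints that seems worth ruling out is a mismatch between the tokenizer vocabulary and the embedding matrix, since an out-of-range token id can produce garbage embeddings that turn into NaN under fp16. A minimal sketch of that check, assuming Hugging Face transformers and a placeholder checkpoint path:

```python
# Minimal sketch: compare tokenizer vocab size against the input embedding matrix.
# Assumes Hugging Face transformers; the checkpoint path is a placeholder.
from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt = "/path/to/vicuna-7b"  # placeholder, replace with the actual checkpoint dir
tokenizer = AutoTokenizer.from_pretrained(ckpt, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(ckpt)

embed_rows = model.get_input_embeddings().weight.shape[0]
print("tokenizer vocab size:", len(tokenizer))
print("embedding rows      :", embed_rows)
if len(tokenizer) > embed_rows:
    print("Mismatch: some token ids have no embedding row -> possible NaN source")
```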
Can you give me some advice or clues?
Thank you for your assistance.