Loss stays at 0 and grad_norm is nan when LoRA fine-tuning Qwen2-VL-2B #6092

Open
1 task done
Tian-ye1214 opened this issue Nov 20, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

@Tian-ye1214

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.11.9
  • PyTorch version: 2.3.1+cu121 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.11.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4070 Ti SUPER

Reproduction

[INFO|2024-11-20 17:36:10] modeling_utils.py:3934 >> loading weights file C:\Users\PC.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4\model.safetensors.index.json

[INFO|2024-11-20 17:36:10] modeling_utils.py:1670 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.

[INFO|2024-11-20 17:36:10] configuration_utils.py:1096 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 }

[INFO|2024-11-20 17:36:10] modeling_utils.py:1670 >> Instantiating Qwen2VisionTransformerPretrainedModel model under default dtype torch.bfloat16.

[WARNING|2024-11-20 17:36:10] logging.py:168 >> Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46

[INFO|2024-11-20 17:36:14] modeling_utils.py:4800 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.

[INFO|2024-11-20 17:36:14] modeling_utils.py:4808 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at C:\Users\PC.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.

[INFO|2024-11-20 17:36:14] configuration_utils.py:1049 >> loading configuration file C:\Users\PC.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4\generation_config.json

[INFO|2024-11-20 17:36:14] configuration_utils.py:1096 >> Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "temperature": 0.01, "top_k": 1, "top_p": 0.001 }

[INFO|2024-11-20 17:36:14] logging.py:157 >> Gradient checkpointing enabled.

[INFO|2024-11-20 17:36:14] logging.py:157 >> Using FlashAttention-2 for faster training and inference.

[INFO|2024-11-20 17:36:14] logging.py:157 >> Upcasting trainable params to float32.

[INFO|2024-11-20 17:36:14] logging.py:157 >> Fine-tuning method: LoRA

[INFO|2024-11-20 17:36:14] logging.py:157 >> Found linear modules: v_proj,k_proj,q_proj,o_proj,gate_proj,up_proj,down_proj

[INFO|2024-11-20 17:36:14] logging.py:157 >> trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162

[INFO|2024-11-20 17:36:14] trainer.py:698 >> Using auto half precision backend

[INFO|2024-11-20 17:36:14] trainer.py:2313 >> ***** Running training *****

[INFO|2024-11-20 17:36:14] trainer.py:2314 >> Num examples = 15

[INFO|2024-11-20 17:36:14] trainer.py:2315 >> Num Epochs = 100

[INFO|2024-11-20 17:36:14] trainer.py:2316 >> Instantaneous batch size per device = 2

[INFO|2024-11-20 17:36:14] trainer.py:2319 >> Total train batch size (w. parallel, distributed & accumulation) = 16

[INFO|2024-11-20 17:36:14] trainer.py:2320 >> Gradient Accumulation steps = 8

[INFO|2024-11-20 17:36:14] trainer.py:2321 >> Total optimization steps = 100

[INFO|2024-11-20 17:36:14] trainer.py:2322 >> Number of trainable parameters = 9,232,384

[INFO|2024-11-20 17:36:38] logging.py:157 >> {'loss': 0.0000, 'learning_rate': 4.9692e-05, 'epoch': 5.00}

[INFO|2024-11-20 17:37:02] logging.py:157 >> {'loss': 0.0000, 'learning_rate': 4.8776e-05, 'epoch': 10.00}

[INFO|2024-11-20 17:37:27] logging.py:157 >> {'loss': 0.0000, 'learning_rate': 4.7275e-05, 'epoch': 15.00}

Expected behavior

No response

Others

The training command is:
llamafactory-cli train --stage sft
--do_train True --model_name_or_path C:\Users\PC\.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4
--preprocessing_num_workers 16 --finetuning_type lora
--template qwen2_vl --flash_attn fa2
--dataset_dir data --dataset mllm_demo
--cutoff_len 2048 --learning_rate 5e-05
--num_train_epochs 100.0 --max_samples 100000
--per_device_train_batch_size 2 --gradient_accumulation_steps 8
--lr_scheduler_type cosine --max_grad_norm 1.0
--logging_steps 5 --save_steps 100
--warmup_steps 0 --packing False
--report_to none --output_dir saves\Qwen2-VL-2B-Instruct\lora\train_2024-11-20-17-41-13
--bf16 True --plot_loss True
--ddp_timeout 180000000 --optim adamw_torch
--lora_rank 8 --lora_alpha 16
--lora_dropout 0
--lora_target all
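
The command enables --bf16 True together with --flash_attn fa2. A loss that is exactly 0.0000 from the first logging step combined with a nan grad_norm usually points at a numerical/precision issue rather than a data issue, so it may be worth confirming that the Windows environment actually supports this combination. The following is a minimal standalone diagnostic sketch (plain PyTorch and importlib only, not part of LLaMA-Factory; the file name is just a placeholder):

# check_env.py — hypothetical helper; run it in the same Python environment used for training
import importlib.util
import torch

print("CUDA available:       ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:               ", torch.cuda.get_device_name(0))
    print("bf16 supported on GPU:", torch.cuda.is_bf16_supported())
# flash-attn is primarily built and tested on Linux; if this prints None,
# the --flash_attn fa2 setting cannot actually be backed by FlashAttention-2.
print("flash_attn installed: ", importlib.util.find_spec("flash_attn"))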

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 20, 2024
@Tian-ye1214
Author

Some additional information: I ran the same experiment on Linux with the same software dependencies and configuration. The run completed successfully and the loss decreased steadily.
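
Since the identical configuration trains normally on Linux, one way to narrow the Windows failure down is to run a tiny bf16 forward/backward pass outside LLaMA-Factory and see whether gradients are already nan at that level. The sketch below assumes nothing about Qwen2-VL or the trainer; it only exercises bf16 autocast on the local CUDA stack:

import torch

device = "cuda"
torch.manual_seed(0)

# Small throwaway model, just to produce gradients through a bf16 autocast region.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 256),
).to(device)

x = torch.randn(8, 256, device=device)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()

loss.backward()
grad_norm = torch.linalg.vector_norm(
    torch.stack([p.grad.detach().norm() for p in model.parameters()])
)
print("loss:", loss.item(), "grad_norm:", grad_norm.item())

If both values are finite here, the problem is more likely in the attention/precision path selected for Qwen2-VL on Windows (for example, the FlashAttention-2 setting) than in the CUDA/bf16 stack itself.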
