Thanks for open-sourcing Video-ChatGPT; I really like this work.
I am now trying to train Video-ChatGPT myself.
However, I only have a single-node server with 8 RTX 4090 GPUs.
I would like to ask how I can modify the initial training command, which seems to be written for a multi-node setup, so that it runs on my single node. The command is:

torchrun --nproc_per_node=8 --master_port 29001 video_chatgpt/train/train_mem.py \
    --model_name_or_path <path to LLaVA-7B-Lightening-v-1-1 model> \
    --version v1 \
    --data_path <path to the instruction data prepared with the convert_instruction_json_to_training_format.py script> \
    --video_folder <path to the spatio-temporal features generated in step 4 using the save_spatio_temporal_clip_features.py script> \
    --tune_mm_mlp_adapter True \
    --mm_use_vid_start_end \
    --bf16 True \
    --output_dir ./Video-ChatGPT_7B-1.1_Checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 100 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
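For what it's worth, the command above already describes a single-node launch: torchrun spawns 8 local worker processes via --nproc_per_node=8, so it should run as-is on one machine. A multi-node launch would additionally pass --nnodes, --node_rank and --master_addr. A minimal sketch of the two launch modes (MASTER_NODE_IP and the trailing training arguments are placeholders, not values from this repo):

# Single node, 8 GPUs -- what the provided command already does:
torchrun --nproc_per_node=8 --master_port 29001 \
    video_chatgpt/train/train_mem.py <training arguments as above>

# Two nodes, 8 GPUs each -- run once per node, changing --node_rank to 0 or 1;
# MASTER_NODE_IP is a placeholder for the first node's reachable address:
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr=MASTER_NODE_IP --master_port=29001 \
    video_chatgpt/train/train_mem.py <training arguments as above>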
Looking forward to your reply, thank you very much.
Thanks for your reply.
Currently, I'd like to train Video-ChatGPT on my custom dataset, and my server is equipped with 8 RTX 4090 GPUs. When I launch training with torchrun, I run into a CUDA out-of-memory error. Does Video-ChatGPT require 40 GB of memory on each GPU?
Video-ChatGPT uses a 7B LLM, which requires at least 17 GB of memory just to load. Considering the other model components and optimizer states, I believe a 32 GB GPU might work.
However, please note that the code was tested on A100 40 GB GPUs.
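Not an official recommendation, but on 24 GB cards the usual first step is to trade per-device batch size for gradient accumulation while keeping gradient checkpointing and bf16 enabled; whether this actually fits on a 4090 is untested here. A sketch using only flags that already appear in the command above:

# Sketch only (not an official config): drop the per-GPU batch size and raise
# gradient accumulation so the effective batch size stays at 8 GPUs x 1 x 4 = 32,
# matching the original 8 x 4 x 1; checkpointing and bf16 reduce activation memory.
torchrun --nproc_per_node=8 --master_port 29001 video_chatgpt/train/train_mem.py \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing True \
    --bf16 True \
    <remaining arguments unchanged from the command above>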