Single Node Training #111

Open
xiaokj37 opened this issue Jun 17, 2024 · 3 comments
@xiaokj37

Thanks for open-sourcing Video-ChatGPT; I really like this work.
I am now trying to train Video-ChatGPT myself.
However, I only have a single-node server with 8 RTX 4090 GPUs.
I would like to ask how to modify the initial training command below, which appears to be written for multiple nodes.
```bash
torchrun --nproc_per_node=8 --master_port 29001 video_chatgpt/train/train_mem.py \
    --model_name_or_path <path to LLaVA-7B-Lightening-v-1-1 model> \
    --version v1 \
    --data_path <path to the training JSON produced by the convert_instruction_json_to_training_format.py script> \
    --video_folder <path to the spatio-temporal features generated in step 4 using the save_spatio_temporal_clip_features.py script> \
    --tune_mm_mlp_adapter True \
    --mm_use_vid_start_end \
    --bf16 True \
    --output_dir ./Video-ChatGPT_7B-1.1_Checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 100 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```
Looking forward to your reply, thank you very much.

@mmaaz60
Member

mmaaz60 commented Jun 22, 2024

Hi @xiaokj37,

I appreciate your interest in our work. Please note that the Video-ChatGPT code is designed to run on a single node with multiple GPUs.

If you face any issues, please let me know. Good luck!
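
For reference, `torchrun` with `--nproc_per_node=8` and no `--nnodes` flag already launches a single-node, 8-process job, so the command above should work as-is on your server. A minimal sketch that makes the single-node setup explicit (`--standalone` is a standard torchrun option, not something specific to this repo; it uses a local rendezvous, so no `--master_port` coordination is needed):

```bash
# One node, eight local worker processes; all other training
# arguments stay exactly as in the command above.
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    video_chatgpt/train/train_mem.py \
    ...  # same training arguments as above
```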

@xiaokj37
Author

Thanks for your reply.
Currently, I'd like to train Video-ChatGPT on my custom dataset, and my server is equipped with 8 RTX 4090 GPUs. When I run torchrun for training, it fails with CUDA out-of-memory errors. Does Video-ChatGPT need 40 GB of memory on each GPU?

@mmaaz60
Member

mmaaz60 commented Jun 28, 2024

Hi @xiaokj37,

Video-ChatGPT uses a 7B LLM, which requires at least 17 GB of memory just to load. Considering the other model components and optimizer states, I believe a 32 GB GPU might work.

However, please note that the code has only been tested on A100 40 GB GPUs.
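
If you still hit OOM on the 24 GB 4090s, a common mitigation is to reduce the per-device batch size and compensate with gradient accumulation so the effective batch size is unchanged. The sketch below only adjusts flags that already appear in the training command above; it is a suggestion, not a configuration we have tested on 4090s:

```bash
# Effective batch = nproc_per_node * per_device_train_batch_size
#                   * gradient_accumulation_steps.
# Original: 8 * 4 * 1 = 32. Below keeps 8 * 1 * 4 = 32 while cutting
# per-GPU activation memory roughly 4x. Note that in bf16 the 7B weights
# alone take ~7e9 * 2 bytes ≈ 14 GB, so 24 GB cards leave little headroom
# and this may still not fit.
torchrun --nproc_per_node=8 --master_port 29001 video_chatgpt/train/train_mem.py \
    ... \  # other arguments unchanged from the command above
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing True
```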
