Official implementation of the paper:
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
- [4/14] Released video SFT data and script
- [4/3] Released 17k DPO data + training script, plus 900k video + 650k image pre-training data
- [4/2] Set up project page, paper preprint, and test data pipeline
In our Hugging Face repo, we release:
Datasets:
- Test data: ShareGPTVideo/test_video_and_instruction
  - The original videos are released at ShareGPTVideo/test_raw_video_data in case they are needed.
- Train data: ShareGPTVideo/train_video_and_instruction
  - 900k detailed video captions
  - 900k frame data: a 300k subset is used for fine-tuning; all 900k (the 300k plus the remaining 600k) are used for pre-training
  - video QA data: 900k QA pairs, with a 240k subset used in our experiments
  - video instruction data for SFT: we provide image instructions mixed with video captions and QA for SFT; see SFT training for usage
Models:
- Pre-trained checkpoint on large-scale video (and image) captions: ShareGPTVideo/LLaVA-Hound-Pretrain
- Fine-tuned checkpoint on video (and image) instructions: ShareGPTVideo/LLaVA-Hound-SFT
- DPO checkpoint trained with 17k video preference data: ShareGPTVideo/LLaVA-Hound-DPO
- Additionally, ShareGPTVideo/LLaVA-Hound-SFT-Image_only, an SFT checkpoint trained on image instructions only
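To pull any of the datasets or checkpoints above locally ahead of time, here is a minimal sketch using huggingface_hub; the `local_dir` layout is an illustrative choice, not something our scripts require:

```python
from huggingface_hub import snapshot_download

# Download the test video/instruction dataset listed above.
# NOTE: local_dir is an illustrative choice; our scripts do not require this layout.
snapshot_download(
    repo_id="ShareGPTVideo/test_video_and_instruction",
    repo_type="dataset",
    local_dir="data/test_video_and_instruction",
)

# Model checkpoints can also be pre-fetched; they are otherwise downloaded
# automatically by load_pretrained_model in the inference examples below.
snapshot_download(repo_id="ShareGPTVideo/LLaVA-Hound-DPO")
```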
```bash
# set up requirements
source setup/setup_env.sh

# fill in the required paths and API tokens in set_path.sh
cd llava_hound_dpo
sudo apt-get install ffmpeg
```
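For reference, `CACHE_DIR` (read by the inference examples below) is among the variables configured in set_path.sh. A rough sketch of the kind of contents it holds; apart from `CACHE_DIR`, the variable names here are assumptions, not the file's actual contents:

```bash
# Sketch of set_path.sh contents (illustrative; only CACHE_DIR is confirmed
# by the inference examples below, the other names are assumptions).
export CACHE_DIR=/path/to/model_cache     # cache directory for downloaded checkpoints
export OPENAI_API_KEY=<your_api_token>    # assumed name for the evaluation API token
```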
```python
import os

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame

video_path = "examples/sample_msrvtt.mp4"

# options: ["ShareGPTVideo/LLaVA-Hound-DPO", "ShareGPTVideo/LLaVA-Hound-SFT", "ShareGPTVideo/LLaVA-Hound-SFT-Image_only"]
model_path = "ShareGPTVideo/LLaVA-Hound-DPO"
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(model_path, model_base=None, model_name=model_name, cache_dir=os.environ['CACHE_DIR'])
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)

# our pipeline: pre-extract frames with ffmpeg, then run on the frame directory
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)

question = "What is the evident theme in the video?"
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# alternative: decode the video on the fly with the decord backend
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)
```
To generate detailed video captions with our pre-trained checkpoint, use:
```python
import os

import numpy as np

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame, detail_templates

video_path = "examples/sample_msrvtt.mp4"

model_path = "ShareGPTVideo/LLaVA-Hound-Pretrain"
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(model_path, model_base=None, model_name=model_name, cache_dir=os.environ['CACHE_DIR'])
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)

# sample one of the caption prompt templates used during pre-training
question = np.random.choice(detail_templates)

# our pipeline: pre-extract frames with ffmpeg, then run on the frame directory
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)

response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# alternative: decode the video on the fly with the decord backend
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)
```
```bash
# set up the test data
source setup/setup_test_data.sh

# evaluation on the official benchmark (a 5k QA subset)
bash test/pipeline/outdomain_official_test_pipeline.sh \
    $model_output_name \
    $model_name

# evaluation on our in-domain test set
bash test/pipeline/indomain_test_pipeline.sh \
    $model_output_name \
    $model_name

# evaluation on our out-of-domain test set
bash test/pipeline/outdomain_test_pipeline.sh \
    $model_output_name \
    $model_name
```
Example of official testing with the DPO model:
```bash
bash test/pipeline/outdomain_official_test_pipeline.sh \
    videollava_dpo \
    ShareGPTVideo/LLaVA-Hound-DPO
```
For more details, including discussion, testing of other SOTA models, and testing of customized models, refer to the test README.
For DPO training, refer to DPO data setup and training; a minimal sketch of the DPO objective follows below.
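As orientation for the objective being optimized, here is a minimal sketch of the standard DPO loss, the form implemented in trl, which our training builds on; it is a textbook restatement, not a copy of this repo's training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the policy's preference margin for
    chosen over rejected responses relative to a frozen reference model."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy log-ratio - reference log-ratio)), batch mean
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```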
For pre-training + SFT, refer to Pretrain + SFT.
```bibtex
@misc{zhang2024direct,
      title={Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward},
      author={Ruohong Zhang and Liangke Gui and Zhiqing Sun and Yihao Feng and Keyang Xu and Yuanhan Zhang and Di Fu and Chunyuan Li and Alexander Hauptmann and Yonatan Bisk and Yiming Yang},
      year={2024},
      eprint={2404.01258},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
Code is built upon the following projects:
- Video-LLaVA as the LMM architecture
- trl for DPO implementation
Thanks for their great work!