Skip to content

Latest commit

 

History

History
158 lines (118 loc) · 6.69 KB

README.md

File metadata and controls

158 lines (118 loc) · 6.69 KB

🔍 Shotluck Holmes

Large Language Vision Models For Shot-Level Video Understanding (Richard Luo, Austin Peng, Adithya Vasudev, Rishabh Jain)
🎉 Accepted into ACM MMGR '24!

Read the Paper Here »

GitHub License Shotluck Holmes Badge Shotluck Holmes Badge

Table of Contents
  1. Introduction
  2. 🔧 Requirements and Installation
  3. Data Pre-processing
  4. Finetuning
  5. Model
  6. Results

Introduction

Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence where multiple data streams of information (such as visual and auditory data) must be processed simultaneously. Comprehension of the entire video requires not only understanding the visual-audio information of each shot but also requires that the model links the ideas between each shot to generate a larger, all-encompassing story. Despite significant progress in the field, current works often overlook videos’ more granular shot-by-shot semantic information. In this project, we propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning called Shotluck Holmes. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from being able to understand a picture to being able to understand a sequence of frames. Specifically, we show that Shotluck Holmes achieves better performance than state-of-the-art results on the Shot2Story video captioning and summary task with significantly smaller and more computationally efficient models.

🔧 Requirements and Installation

  1. Clone this repository and navigate to the folder
git clone https://github.com/Skyline-9/Shotluck-Holmes.git
cd Shotluck-Holmes
  1. Install packages
conda create -n shotluck python=3.10 -y
conda activate shotluck
cd model
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
cd ..
pip install flash-attn==2.5.8 --no-build-isolation  # upgrade to this version of flash-attn for H100
# pip install flash-attn==1.0.9 --no-build-isolation  # downgrade to flash attention v1 for older GPUs

Alternatively, you can run setup-speedrun.sh from the root directory to execute all the commands above

sh scripts/setup-speedrun.sh

Data Pre-processing

Note: all the following commands should be run from the project root directory

Downloading

Raw annotations should already be downloaded with this repository. If your annotations are missing, download the annotations by running

sh data/scripts/download/download_annotations.sh

If running on Shot2Story dataset, follow bytedance/Shot2Story#5 to download the data and extract the videos into data/raw/videos.

Pre-processing

First, process the videos by running process_videos.py in scripts/data/process, which will run ffmpeg to split the shot videos into different files. Then, convert the annotation data and scan for corrupted videos by running convert_shot2story_to_llava.py

Set processes to a reasonable number depending on how many CPU cores you have available.

python scripts/data/process/process_videos.py --processes=<YOUR_NUM_PROCESSES>
python scripts/data/process/convert_shot2story_to_llava.py

If you plan on running eval, make sure to run convert_shot2story_to_llava.py on the test set as well.

Note: ffmpeg is required for process_videos.py. If this is not installed, download ffmpeg accordingly for your OS or install it locally using the download-ffmpeg.sh script.

Finetuning

Finetuning scripts are in scripts/run/finetune. Run the finetuning script corresponding to which model you want to use.

sh scripts/run/finetune/finetune_1b5.sh  # finetune the 1.5B model
sh scripts/run/finetune/finetune_3b1.sh  # finetune the 3.1B model

Model

Hugging Face Models

Results

Table 1: Performance of best models on single-shot video captioning

Model BLEU METEOR ROUGE CIDER
Shot2Story (7B+) 10.7 16.2 29.6 37.4
Shotluck-Holmes (3.1B) 8.7 25.7 36.2 63.2
Shotluck-Holmes (1.5B) 9.3 25.3 36.3 68.9

Table 2: Performance of best models on multi-shot video summarization

Model BLEU METEOR ROUGE CIDER
Shot2Story (7B+) 11.7 19.7 26.8 8.6
Shotluck-Holmes (3.1B) 7.67 23.2 43 152.3
Shotluck-Holmes (1.5B) 6.48 21.3 40.2 144.3