MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

This is our implementation of the paper: MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Abstract

With the increasing demand for video understanding, video moment and highlight detection (MHD) has emerged as a critical research topic. MHD aims to localize all moments and predict clip-wise saliency scores simultaneously. Despite progress made by existing DETR-based methods, we observe that these methods coarsely fuse features from different modalities, which weakens the temporal intra-modal context and results in insufficient cross-modal interaction. To address this issue, we propose MH-DETR (Moment and Highlight DEtection TRansformer) tailored for MHD. Specifically, we introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context. Moreover, to obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder, seamlessly integrating visual and textual features. Comprehensive experiments on QVHighlights, Charades-STA, Activity-Net, and TVSum datasets show that MH-DETR outperforms existing state-of-the-art methods, demonstrating its effectiveness and superiority. Our code is available at -.

Setup

The released code consists of the following files.

MH-DETR
├── data
│   ├── activitynet
│   │   └── {train,val}.pkl
│   ├── charades
│   │   └── {train,val}.pkl
│   ├── tvsum
│   │   └── tvsum_{train,val}.jsonl
│   └── highlight_{train,val,test}_release.jsonl
├── features
│   ├── activitynet
│   │   └── c3d.hdf5
│   ├── charades
│   │   ├── vgg.hdf5
│   │   └── i3d.hdf5
│   ├── clip_features
│   ├── clip_text_features
│   ├── slowfast_features
│   └── tvsum
├── mh_detr
├── standalone_eval
├── utils
├── results
├── README.md
└── ···
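Before training, it can help to confirm that the data and feature files are where the tree above places them. The following is a minimal sketch, not part of the released code: the path list is copied by hand from the tree, and the helper name `find_missing` is hypothetical.

```python
from pathlib import Path

# Relative paths copied from the directory tree above; brace patterns such as
# {train,val} are expanded by hand.
EXPECTED_PATHS = [
    "data/activitynet/train.pkl",
    "data/activitynet/val.pkl",
    "data/charades/train.pkl",
    "data/charades/val.pkl",
    "data/tvsum/tvsum_train.jsonl",
    "data/tvsum/tvsum_val.jsonl",
    "data/highlight_train_release.jsonl",
    "data/highlight_val_release.jsonl",
    "data/highlight_test_release.jsonl",
    "features/activitynet/c3d.hdf5",
    "features/charades/vgg.hdf5",
    "features/charades/i3d.hdf5",
    "features/clip_features",
    "features/clip_text_features",
    "features/slowfast_features",
    "features/tvsum",
]


def find_missing(root):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the project root (MH-DETR/).
    for path in find_missing("."):
        print("missing:", path)
```

Only the datasets you plan to use need to be present, so a non-empty list is not necessarily an error.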

Dependencies

# create conda env
conda create --name mh_detr python=3.9
# activate env
conda activate mh_detr
# install pytorch 1.13.1
conda install pytorch=1.13.1 torchvision torchaudio -c pytorch
# install other python packages
pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas timm fvcore
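A quick way to check the environment after installation is to test whether each package can be located by the interpreter. This is a sketch under assumptions: the helper `missing_modules` is hypothetical, and the import names are listed by hand because some differ from the package names used above.

```python
import importlib.util

# Import names for the packages installed above. Some differ from the
# pip/conda package name: scikit-learn -> sklearn, ipython -> IPython,
# pytorch -> torch.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "tqdm", "IPython", "easydict",
    "tensorboard", "tabulate", "sklearn", "pandas", "timm", "fvcore",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("all dependencies found" if not missing else f"missing: {missing}")
```

Run it inside the activated mh_detr environment; it only checks that the modules can be found, not their versions.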

Datasets

QVHighlights

Download QVHighlights annotations.

Download moment_detr_features.tar.gz (8GB) from the Moment-DETR repo and extract it under the project root directory:

tar -xf path/to/moment_detr_features.tar.gz
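Once the annotations are in place (data/highlight_{train,val,test}_release.jsonl in the Setup tree), they can be inspected with a few lines of Python. The sketch below assumes only that each file is in JSON Lines format, as the .jsonl extension suggests; `load_jsonl` is a hypothetical helper, and no particular field names are assumed.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Example usage, run from the project root:
#   records = load_jsonl("data/highlight_val_release.jsonl")
#   print(len(records), "annotations")
#   print("fields:", sorted(records[0].keys()))
```

Printing the keys of the first record is a simple way to see which fields the annotations provide before writing any dataset-specific code.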

Charades-STA

Download VGG features for Charades-STA from the official server.

Download I3D features (password: 1234) for Charades-STA.

ActivityNet

Download C3D features for ActivityNet.

TVSum

Download TVSum features from UMT repo.

Usage

Train on QVHighlights

bash mh_detr/scripts/train.sh

Evaluation on QVHighlights

bash mh_detr/scripts/inference.sh ${Your_Path}/MH-DETR/results/qvhighlights/model_best.ckpt val

Checkpoint download link. Replace ${Your_Path} with the path to the directory that contains MH-DETR. The results are as follows:

MR R1@0.5	MR R1@0.7	MR mAP Avg.	HD ($\geq$ VG) mAP	HD ($\geq$ VG) HIT@1	Params	GFLOPs
60.84	44.90	39.26	38.77	61.74	8.2M	0.34
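For readers unfamiliar with the moment retrieval (MR) columns: R1 at a threshold is conventionally the percentage of queries whose top-ranked predicted window overlaps the ground-truth window with a temporal IoU at or above that threshold. The sketch below illustrates that definition on made-up windows. It is a simplification that assumes one ground-truth window per query, the function names are hypothetical, and the code in standalone_eval may differ in detail.

```python
def temporal_iou(pred, gt):
    """IoU of two [start, end] windows given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_thd):
    """Percentage of queries whose top-1 window reaches IoU >= iou_thd."""
    hits = sum(temporal_iou(p, g) >= iou_thd for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy, made-up windows for three queries (not real model output).
preds = [[0.0, 10.0], [5.0, 20.0], [20.0, 30.0]]
gts = [[0.0, 10.0], [8.0, 18.0], [40.0, 50.0]]

for thd in (0.5, 0.7):
    print(f"R1@{thd}: {recall_at_1(preds, gts, thd):.2f}")
```

In the toy data the second query has an IoU of 10/15, so it counts as a hit at threshold 0.5 but not at 0.7, which is why the stricter threshold always gives an equal or lower score.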

Test on QVHighlights

bash mh_detr/scripts/inference.sh ${Your_Path}/MH-DETR/results/qvhighlights/model_best.ckpt test

Train on other datasets

bash mh_detr/scripts/train_charades.sh --dset_name ${Dataset_Name}

Replace ${Dataset_Name} with one of activitynet, charades, or tvsum.

Debug

bash mh_detr/scripts/train.sh --debug

YoucanBaby/MH-DETR
