The official repository for Set-level Guidance Attack (SGA).
ICCV 2023 Oral Paper: Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (https://arxiv.org/abs/2307.14061)
Please feel free to contact [email protected] if you have any questions.
Vision-language pre-training (VLP) models have shown vulnerability to adversarial attacks. However, existing works mainly focus on the adversarial robustness of VLP models in the white-box setting. In this work, we investigate the robustness of VLP models in the black-box setting from the perspective of adversarial transferability. We propose Set-level Guidance Attack (SGA), which can generate highly transferable adversarial examples aimed at VLP models.
See requirements.txt.
Download the Flickr30k and MSCOCO datasets (the annotations are provided in ./data_annotation/). Set the root path of the dataset via image_root in ./configs/Retrieval_flickr.yaml.
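If you prefer to set image_root from a script rather than by hand, the following is a minimal sketch. It assumes image_root is a top-level key with a single-line value in the config file, which may not match the actual layout; the helper name set_image_root and the sample contents are hypothetical.

```python
import re


def set_image_root(yaml_text: str, new_root: str) -> str:
    """Return yaml_text with the value of the top-level `image_root` key replaced.

    Assumption: `image_root` starts at column 0 and its value fits on one line.
    """
    pattern = re.compile(r"^image_root\s*:.*$", flags=re.MULTILINE)
    replacement = f"image_root: '{new_root}'"
    # A callable replacement avoids backslash-escape handling in the path.
    new_text, count = pattern.subn(lambda _match: replacement, yaml_text, count=1)
    if count == 0:
        raise KeyError("image_root not found at the top level of the config")
    return new_text


# Hypothetical config contents, for illustration only.
sample = "image_root: '/path/to/images'\nbatch_size: 32\n"
print(set_image_root(sample, "/data/flickr30k"))
```

Editing the text line directly, rather than loading and re-dumping the YAML, leaves comments and key order in the file unchanged.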
The checkpoints of the fine-tuned VLP models are available from ALBEF, TCL, and CLIP.
From ALBEF to TCL on the Flickr30k dataset:
python eval_albef2tcl_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model TCL --target_ckpt ./checkpoint/tcl_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5
From ALBEF to CLIPViT on the Flickr30k dataset:
python eval_albef2clip-vit_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model ViT-B/16 --original_rank_index ./std_eval_idx/flickr30k/ \
--scales 0.5,0.75,1.25,1.5
From CLIPViT to ALBEF on the Flickr30k dataset:
python eval_clip-vit2albef_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16 --target_model ALBEF \
--target_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5
From CLIPViT to CLIPCNN on the Flickr30k dataset:
python eval_clip-vit2clip-cnn_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16 --target_model RN101 \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5
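The --scales value in the commands above is a comma-separated list of scale factors. The sketch below shows one way such a value could be parsed and turned into target image sizes. The function names and the 384-pixel base size are illustrative assumptions, and the evaluation scripts may handle the argument differently.

```python
from typing import List


def parse_scales(arg: str) -> List[float]:
    """Parse a comma-separated string such as '0.5,0.75,1.25,1.5' into floats."""
    scales = [float(token) for token in arg.split(",") if token.strip()]
    if any(scale <= 0 for scale in scales):
        raise ValueError("scale factors must be positive")
    return scales


def scaled_sizes(base_size: int, scales: List[float]) -> List[int]:
    """Side length of a square image after applying each scale factor."""
    return [int(round(base_size * scale)) for scale in scales]


scales = parse_scales("0.5,0.75,1.25,1.5")
print(scales)                     # [0.5, 0.75, 1.25, 1.5]
print(scaled_sizes(384, scales))  # [192, 288, 480, 576] (384 is an assumed base size)
```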
Existing adversarial attacks for VLP models cannot generate highly transferable adversarial examples.
(Note: Sep-Attack indicates the simple combination of two unimodal adversarial attacks: PGD and BERT-Attack)
Attack | ALBEF* TR R@1 | ALBEF* IR R@1 | TCL TR R@1 | TCL IR R@1 | CLIPViT TR R@1 | CLIPViT IR R@1 | CLIPCNN TR R@1 | CLIPCNN IR R@1 |
---|---|---|---|---|---|---|---|---|
Sep-Attack | 65.69 | 73.95 | 17.60 | 32.95 | 31.17 | 45.23 | 32.82 | 45.49 |
Sep-Attack + MI | 58.81 | 65.25 | 16.02 | 28.19 | 23.07 | 36.98 | 26.56 | 39.31 |
Sep-Attack + DIM | 56.41 | 64.24 | 16.75 | 29.55 | 24.17 | 37.60 | 25.54 | 38.77 |
Sep-Attack + PNA_PO | 40.56 | 53.95 | 18.44 | 30.98 | 22.33 | 37.02 | 26.95 | 38.63 |
Co-Attack | 77.16 | 83.86 | 15.21 | 29.49 | 23.60 | 36.48 | 25.12 | 38.89 |
Co-Attack + MI | 64.86 | 75.26 | 25.40 | 38.69 | 24.91 | 37.11 | 26.31 | 38.97 |
Co-Attack + DIM | 47.03 | 62.28 | 22.23 | 35.45 | 25.64 | 38.50 | 26.95 | 40.58 |
SGA | 97.24 | 97.28 | 45.42 | 55.25 | 33.38 | 44.16 | 34.93 | 46.57 |
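For readers unfamiliar with PGD, the image-side component of Sep-Attack noted above, here is a minimal sketch of an L-infinity PGD loop. It uses a toy linear loss on a plain list of pixel values; the loss, step size, and perturbation budget are illustrative assumptions and are not taken from this repository's implementation.

```python
from typing import Callable, List


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pgd_linf(
    x: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    eps: float,
    alpha: float,
    steps: int,
) -> List[float]:
    """Maximize a loss within an L-infinity ball of radius eps around x."""
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        # Gradient-sign ascent step.
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the eps-ball around the clean input.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
        # Keep values in the valid pixel range [0, 1].
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


# Toy loss L(x) = w . x, whose gradient with respect to x is simply w.
w = [1.0, -2.0, 0.0]
x_clean = [0.5, 0.5, 0.5]
x_adv = pgd_linf(x_clean, lambda _x: w, eps=0.1, alpha=0.05, steps=5)
print(x_adv)
```

Each coordinate moves in the direction that increases the loss until it reaches the edge of the perturbation budget, and a coordinate with zero gradient is left unchanged.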
The performance of SGA on four VLP models (ALBEF, TCL, CLIPViT, and CLIPCNN) on the Flickr30k dataset. An asterisk (*) marks the white-box case, where the target model is the source model.
Source | Attack | ALBEF TR R@1 | ALBEF IR R@1 | TCL TR R@1 | TCL IR R@1 | CLIPViT TR R@1 | CLIPViT IR R@1 | CLIPCNN TR R@1 | CLIPCNN IR R@1 |
---|---|---|---|---|---|---|---|---|---|
ALBEF | PGD | 52.45* | 58.65* | 3.06 | 6.79 | 8.96 | 13.21 | 10.34 | 14.65 |
ALBEF | BERT-Attack | 11.57* | 27.46* | 12.64 | 28.07 | 29.33 | 43.17 | 32.69 | 46.11 |
ALBEF | Sep-Attack | 65.69* | 73.95* | 17.60 | 32.95 | 31.17 | 45.23 | 32.82 | 45.49 |
ALBEF | Co-Attack | 77.16* | 83.86* | 15.21 | 29.49 | 23.60 | 36.48 | 25.12 | 38.89 |
ALBEF | SGA | 97.24±0.22* | 97.28±0.15* | 45.42±0.60 | 55.25±0.06 | 33.38±0.35 | 44.16±0.25 | 34.93±0.99 | 46.57±0.13 |
TCL | PGD | 6.15 | 10.78 | 77.87* | 79.48* | 7.48 | 13.72 | 10.34 | 15.33 |
TCL | BERT-Attack | 11.89 | 26.82 | 14.54* | 29.17* | 29.69 | 44.49 | 33.46 | 46.07 |
TCL | Sep-Attack | 20.13 | 36.48 | 84.72* | 86.07* | 31.29 | 44.65 | 33.33 | 45.80 |
TCL | Co-Attack | 23.15 | 40.04 | 77.94* | 85.59* | 27.85 | 41.19 | 30.74 | 44.11 |
TCL | SGA | 48.91±0.74 | 60.34±0.10 | 98.37±0.08* | 98.81±0.07* | 33.87±0.18 | 44.88±0.54 | 37.74±0.27 | 48.30±0.34 |
CLIPViT | PGD | 2.50 | 4.93 | 4.85 | 8.17 | 70.92* | 78.61* | 5.36 | 8.44 |
CLIPViT | BERT-Attack | 9.59 | 22.64 | 11.80 | 25.07 | 28.34* | 39.08* | 30.40 | 37.43 |
CLIPViT | Sep-Attack | 9.59 | 23.25 | 11.38 | 25.60 | 79.75* | 86.79* | 30.78 | 39.76 |
CLIPViT | Co-Attack | 10.57 | 24.33 | 11.94 | 26.69 | 93.25* | 95.86* | 32.52 | 41.82 |
CLIPViT | SGA | 13.40±0.07 | 27.22±0.06 | 16.23±0.45 | 30.76±0.07 | 99.08±0.08* | 98.94±0.00* | 38.76±0.27 | 47.79±0.58 |
CLIPCNN | PGD | 2.09 | 4.82 | 4.00 | 7.81 | 1.10 | 6.60 | 86.46* | 92.25* |
CLIPCNN | BERT-Attack | 8.86 | 23.27 | 12.33 | 25.48 | 27.12 | 37.44 | 30.40* | 40.10* |
CLIPCNN | Sep-Attack | 8.55 | 23.41 | 12.64 | 26.12 | 28.34 | 39.43 | 91.44* | 95.44* |
CLIPCNN | Co-Attack | 8.79 | 23.74 | 13.10 | 26.07 | 28.79 | 40.03 | 94.76* | 96.89* |
CLIPCNN | SGA | 11.42±0.07 | 24.80±0.28 | 14.91±0.08 | 28.82±0.11 | 31.24±0.42 | 42.12±0.11 | 99.24±0.18* | 99.49±0.05* |
Kindly include a reference to this paper in your publications if it helps your research:
@misc{lu2023setlevel,
title={Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models},
author={Dong Lu and Zhiqiang Wang and Teng Wang and Weili Guan and Hongchang Gao and Feng Zheng},
year={2023},
eprint={2307.14061},
archivePrefix={arXiv},
primaryClass={cs.CV}
}