InternImage

This repository is the official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

By Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

News

Nov 18, 2022: 🚀 InternImage-XL merged into BEVFormer v2 achieves state-of-the-art performance of 63.4 NDS on nuScenes Camera Only.
Nov 10, 2022: 🚀🚀 InternImage-H achieves a new record 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.

Coming soon

Classification, detection, and segmentation code for the InternImage series.
InternImage-T/S/B/L/XL ImageNet-1K pretrained models.
InternImage-L/XL ImageNet-22K pretrained models.
InternImage-T/S/B/L/XL detection and instance segmentation models.
InternImage-T/S/B/L/XL semantic segmentation models.

Introduction

InternImage, initially described on arXiv, can serve as a general backbone for computer vision. It takes deformable convolution as the core operator to obtain large effective receptive fields, and introduces adaptive spatial aggregation to reduce the strict inductive bias. Our model makes it possible to learn stronger and more robust models with large-scale parameters from massive data.
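To make the idea of a deformable operator with adaptive aggregation more concrete, here is a minimal single-channel sketch in pure Python. It is an illustration only, not the operator shipped in this repository: the function names (`bilinear_sample`, `deformable_aggregate`) and the inputs `offsets`, `modulation`, and `weights` are assumptions made for this example. In a real network the offsets and modulation scalars would be predicted from the input features; here they are passed in by hand.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x); points outside count as zero."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

# Regular 3x3 sampling grid: the fixed part of the receptive field.
GRID_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deformable_aggregate(feature, center, offsets, modulation, weights):
    """Sum over k of weights[k] * modulation[k] * feature(center + grid[k] + offsets[k]).

    offsets    -- hypothetical per-point (dy, dx) shifts that deform the grid
    modulation -- hypothetical per-point scalars that reweight each sample
    weights    -- hypothetical kernel weights, one per sampling point
    """
    cy, cx = center
    out = 0.0
    for (gy, gx), (oy, ox), m, w in zip(GRID_3X3, offsets, modulation, weights):
        out += w * m * bilinear_sample(feature, cy + gy + oy, cx + gx + ox)
    return out

# Toy 4x4 feature map with value 4*row + col.
feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
uniform = [1.0 / 9] * 9
ones = [1.0] * 9

# Zero offsets reduce to an ordinary 3x3 average around (1, 1).
plain = deformable_aggregate(feature, (1, 1), [(0.0, 0.0)] * 9, uniform, ones)
# Shifting every sampling point by half a pixel moves the receptive field.
shifted = deformable_aggregate(feature, (1, 1), [(0.5, 0.5)] * 9, uniform, ones)
print(plain, shifted)
```

With all offsets at zero the operator behaves like a regular 3x3 kernel. Non-zero offsets let the sampling locations move away from the fixed grid, and the modulation scalars let each location contribute more or less, which is the sense in which the aggregation adapts to the input.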

Main Results on ImageNet with Pretrained Models

ImageNet-1K and ImageNet-22K Pretrained InternImage Models

name	pretrain	resolution	acc@1	#params	FLOPs
InternImage-T	ImageNet-1K	224x224	83.5	30M	5G
InternImage-S	ImageNet-1K	224x224	84.2	50M	8G
InternImage-B	ImageNet-1K	224x224	84.9	97M	16G
InternImage-L	ImageNet-22K	384x384	87.7	223M	108G
InternImage-XL	ImageNet-22K	384x384	88.0	335M	163G

Main Results on Downstream Tasks

COCO Object Detection

backbone	method	lr schedule	box mAP	mask mAP	#params	FLOPs
InternImage-T	Mask R-CNN	1x	47.2	42.5	49M	270G
InternImage-S	Mask R-CNN	1x	47.8	43.3	69M	340G
InternImage-B	Mask R-CNN	1x	48.8	44.0	115M	501G
InternImage-L	Cascade Mask R-CNN	1x	54.9	47.7	277M	1399G
InternImage-XL	Cascade Mask R-CNN	1x	55.3	48.0	387M	1782G

ADE20K Semantic Segmentation

backbone	resolution	mIoU (single scale)	mIoU (multi scale)	#params	FLOPs
InternImage-T	512x512	47.9	48.1	59M	944G
InternImage-S	512x512	50.1	50.9	80M	1017G
InternImage-B	512x512	50.8	51.3	128M	1185G
InternImage-L	640x640	53.9	54.1	256M	2526G
InternImage-XL	640x640	55.0	55.3	368M	3142G

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}
