Deep learning-based background removal built with U2-Net (U Square Net) by Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane, and Martin Jagersand
Read the original paper »
Removing the background of a picture is an old problem, but traditional CV algorithms such as image thresholding fall short without intensive pre/post processing, and even then, the task is very difficult when the object has colors similar to the background.
With recent advances in the deep learning literature, Salient Object Detection (SOD) has emerged as one of the predominant ways to separate foreground from background. In short, SOD segments the most visually attractive objects in an image, typically by producing a saliency map that distinguishes the important foreground from the background.
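To make the link between a saliency map and background removal concrete, here is a minimal sketch in plain Python. It assumes the saliency map holds per-pixel scores in [0, 1] with the same height and width as the image. The function name `remove_background` and its `threshold` parameter are hypothetical and do not come from the original project.

```python
def remove_background(rgb, saliency, threshold=None):
    """Combine an RGB image with a saliency map into RGBA pixels.

    rgb:       rows of (r, g, b) tuples with channels in 0..255
    saliency:  rows of floats in [0, 1], same height and width as rgb
    threshold: if given, produce a hard cutout (alpha is 0 or 255);
               if None, use the saliency score itself as a soft alpha
    """
    out = []
    for rgb_row, sal_row in zip(rgb, saliency):
        row = []
        for (r, g, b), score in zip(rgb_row, sal_row):
            if threshold is not None:
                score = 1.0 if score >= threshold else 0.0
            # Scale the score to an 8-bit alpha value, rounding half up.
            alpha = int(255 * score + 0.5)
            row.append((r, g, b, alpha))
        out.append(row)
    return out


# Hypothetical 1x2 image: the left pixel is salient, the right one is not.
image = [[(10, 20, 30), (40, 50, 60)]]
saliency_map = [[0.9, 0.1]]
cutout = remove_background(image, saliency_map, threshold=0.5)
```

A hard threshold gives a crisp mask, while the soft alpha keeps smooth edges around hair or fur. Which one looks better depends on the image.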
Most SOD networks rely on features extracted by existing backbones such as AlexNet, VGG, ResNet, ResNeXt, and DenseNet. However, all of these backbones were originally designed for image classification, meaning they "extract features that are representative of semantic meaning rather than local details and global contrast information, which are essential to saliency detection."
In their 2020 paper published in Pattern Recognition, Qin et al. proposed a novel SOD network called U2-Net that can be trained from scratch and achieves comparable or better performance than networks based on existing pre-trained backbones.
The building block of the network is the residual U-block (RSU), which consists of three components:
- An input convolution layer, which transforms the input feature map into an intermediate map
- A U-Net-like symmetric encoder-decoder structure, which takes the intermediate feature map as input and learns to extract and encode multi-scale contextual information
- A residual connection, which fuses the local features with the multi-scale features
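The three components above can be sketched as a toy model on 1-D signals in plain Python. This is only an illustration of the data flow, not the network itself: the fixed 3-tap filter stands in for learned convolutions, averaging stands in for the concatenate-and-convolve fusion, and all function names are hypothetical.

```python
def conv3(x, kernel=(0.25, 0.5, 0.25)):
    """Toy stand-in for a convolution: fixed 3-tap filter, edges replicated."""
    padded = [x[0]] + list(x) + [x[-1]]
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def downsample(x):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x, size):
    """Nearest-neighbour upsampling back to `size` samples."""
    return [x[i * len(x) // size] for i in range(size)]


def u_block(x, height):
    """U-shaped encoder-decoder: go down `height - 1` levels, then back up."""
    feat = conv3(x)
    if height == 1 or len(feat) < 2:
        return feat
    deeper = u_block(downsample(feat), height - 1)   # coarser scale
    up = upsample(deeper, len(feat))                 # back to this scale
    # Fuse the skip connection with the upsampled coarse features.
    return conv3([(a + b) / 2 for a, b in zip(feat, up)])


def rsu(x, height):
    """Toy RSU: input transform, U-shaped block, residual sum."""
    intermediate = conv3(x)                      # input convolution layer
    multi_scale = u_block(intermediate, height)  # multi-scale features
    # Residual connection: local features plus multi-scale features.
    return [a + b for a, b in zip(intermediate, multi_scale)]


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
fused = rsu(signal, height=4)
```

The output has the same resolution as the input, which is what lets RSU blocks be stacked inside a larger U-shaped network.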
In encoder stages En_1, En_2, En_3, and En_4, U2-Net uses the residual U-blocks RSU-7, RSU-6, RSU-5, and RSU-4, respectively. The numbers 7, 6, 5, and 4 denote the height (L) of each RSU block, which is usually configured according to the spatial resolution of the input feature maps.
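The relationship between stage, RSU height, and feature-map resolution can be tabulated with a short sketch. The stage-to-height mapping comes from the text above. The 2x downsampling between stages and the 320-pixel input size are assumptions made for illustration, and `stage_resolutions` is a hypothetical helper.

```python
# Encoder stage -> RSU height L, as listed for En_1 through En_4.
ENCODER_RSU_HEIGHTS = {"En_1": 7, "En_2": 6, "En_3": 5, "En_4": 4}


def stage_resolutions(input_size, heights=ENCODER_RSU_HEIGHTS):
    """Map each stage to (input resolution, RSU height).

    Assumes the feature map is downsampled by 2 between consecutive
    stages; this is an assumption of the sketch, not taken from the text.
    """
    result = {}
    size = input_size
    for stage, height in heights.items():
        result[stage] = (size, height)
        size //= 2
    return result


# Hypothetical 320-pixel input side length.
table = stage_resolutions(320)
for stage, (size, height) in table.items():
    print(f"{stage}: {size}x{size} input, RSU-{height}")
```

Under these assumptions, shallower blocks are paired with smaller feature maps, which have less room for repeated downsampling.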
Comparison of model size and performance of the U2-Net with other state-of-the-art SOD models
Credit: original paper
- Data augmentation
  - The original paper augmented the training set by horizontally flipping the images
- Evaluation metrics
  - Precision-recall curve, F-measure, and mean absolute error (MAE)
- Add support for video
- Set up web version with TensorFlow.js
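The evaluation metrics listed earlier (precision-recall, F-measure, and mean absolute error) can be sketched in plain Python as follows. The sketch assumes flattened saliency maps with scores in [0, 1] and binary ground-truth masks. The weight `beta_sq=0.3` is the value commonly used in the SOD literature and is treated here as an assumption. All function names are hypothetical.

```python
def mean_absolute_error(pred, gt):
    """Average per-pixel absolute difference between prediction and mask."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def precision_recall(pred, gt, threshold):
    """Binarize the prediction at `threshold` and compare with the mask."""
    binary = [p >= threshold for p in pred]
    true_pos = sum(1 for b, g in zip(binary, gt) if b and g >= 0.5)
    predicted_pos = sum(binary)
    actual_pos = sum(1 for g in gt if g >= 0.5)
    # Returning 0.0 for an empty denominator is a convention of this sketch.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def pr_curve(pred, gt, steps=10):
    """Precision-recall pairs at evenly spaced thresholds in (0, 1)."""
    return [precision_recall(pred, gt, t / steps) for t in range(1, steps)]


# Hypothetical flattened 2x2 prediction and ground-truth mask.
prediction = [0.9, 0.8, 0.2, 0.1]
mask = [1.0, 1.0, 0.0, 0.0]
mae = mean_absolute_error(prediction, mask)
precision, recall = precision_recall(prediction, mask, threshold=0.5)
score = f_measure(precision, recall)
```

A lower MAE and a higher F-measure both indicate a saliency map that is closer to the ground truth. The precision-recall curve shows how that trade-off shifts as the binarization threshold changes.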
@InProceedings{Qin_2020_PR,
title = {U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection},
author = {Qin, Xuebin and Zhang, Zichen and Huang, Chenyang and Dehghan, Masood and Zaiane, Osmar and Jagersand, Martin},
journal = {Pattern Recognition},
volume = {106},
pages = {107404},
year = {2020}
}