Searching ROI for Object Detection based on CNN

Several studies have explored the structural design of convolutional neural networks (CNNs) to improve performance, since a well-designed feature extractor benefits convolution-based tasks. Although CNNs can learn important patterns from raw images, those images may contain unpredictable noise that degrades the convolutional stage. Feature extraction based solely on the input image cannot always capture the desired features accurately, but supplying extra information can improve the result. This paper proposes a fusion input design that generates an additional feature giving the CNN extra region-of-interest (ROI) information. Whether a model can make use of this additional information is a determining factor in the performance improvement. The proposed method is tested on two public datasets with different structural designs. Overall, the results indicate that additional ROI information can benefit specific tasks.
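The abstract does not spell out how the fusion input is built, so the following is only a minimal sketch of one plausible reading: a binary ROI mask is stacked onto the image as an extra channel before the first convolution. The function names `roi_mask` and `fuse_input`, the box format `(top, left, bottom, right)`, and the choice of channel concatenation are all assumptions made for illustration, not the paper's stated method.

```python
import numpy as np


def roi_mask(height, width, box):
    """Binary mask: 1 inside box = (top, left, bottom, right), 0 elsewhere.

    The box format is a hypothetical convention chosen for this sketch.
    """
    top, left, bottom, right = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask


def fuse_input(image, box):
    """Stack an ROI mask onto an H x W x C image as one extra channel.

    Channel concatenation is an assumed fusion strategy; the source does
    not specify how the additional feature is combined with the image.
    """
    height, width = image.shape[:2]
    mask = roi_mask(height, width, box)
    return np.concatenate([image, mask[..., None]], axis=-1)


# Example: a random RGB image with a hypothetical ROI box.
image = np.random.rand(224, 224, 3).astype(np.float32)
fused = fuse_input(image, box=(50, 60, 150, 180))
print(fused.shape)  # (224, 224, 4)
```

Under this reading, a network consuming `fused` would need a first convolutional layer that accepts four input channels instead of three; whether the model actually exploits the extra channel is the factor the abstract identifies as decisive.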
