Do we really need more training data for object localization

The key factor for training a good neural network lies in both model capacity and large-scale training data. As more datasets are available nowadays, one may wonder whether the success of deep learning descends from data augmentation only. In this paper, we propose a new dataset, namely, Extended ImageNet Classification (EIC) dataset based on the original ILSVRC CLS 2012 set to investigate if more training data is a crucial step. We address the problem of object localization where given an image, some boxes (also called anchors) are generated to localize multiple instances. Different from previous work to place all anchors at the last layer, we split boxes of different sizes at various resolutions in the network, since small anchors are more prone to be identified at larger spatial location in the shallow layers. Inspired by the hourglass work, we apply a conv-deconv network architecture to generate object proposals. The motivation is to fully leverage high-level summarized semantics and to utilize their up-sampling version to help guide local details in the low-level maps. Experimental results demonstrate the effectiveness of such a design. Based on the newly proposed dataset, we find more data could enhance the average recall, but a more balanced data distribution among categories could obtain better results at the cost of fewer training samples.

[1]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[2]  Bernt Schiele,et al.  What Makes for Effective Detection Proposals? , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Xiaogang Wang,et al.  Multi-Bias Non-linear Activation in Deep Neural Networks , 2016, ICML.

[4]  Jingjing Wang,et al.  Image retrieval and classification on deep convolutional SparkNet , 2016, 2016 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC).

[5]  Xuming He,et al.  Learning to Co-Generate Object Proposals with a Deep Structured Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Yu Liu,et al.  Zoom Out-and-In Network with Recursive Training for Object Proposal , 2017, ArXiv.

[7]  Luc Van Gool,et al.  DeepProposals: Hunting Objects and Actions by Cascading Deep Convolutional Layers , 2016, International Journal of Computer Vision.

[8]  Huchuan Lu,et al.  On the importance of network architecture in training very deep neural networks , 2016, 2016 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC).

[9]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[10]  Huchuan Lu,et al.  Dual Deep Network for Visual Tracking , 2016, IEEE Transactions on Image Processing.

[11]  Vladlen Koltun,et al.  Learning to propose objects , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[13]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[14]  Yu Liu,et al.  Learning Deep Features via Congenerous Cosine Loss for Person Recognition , 2017, ArXiv.

[15]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Vladlen Koltun,et al.  Geodesic Object Proposals , 2014, ECCV.

[19]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[20]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Huchuan Lu,et al.  CNN for saliency detection with low-level feature integration , 2017, Neurocomputing.

[22]  Jitendra Malik,et al.  DeepBox: Learning Objectness with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[23]  Neelima Chavali,et al.  Object-Proposal Evaluation Protocol is ‘Gameable’ , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[25]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[26]  Thomas Deselaers,et al.  Measuring the Objectness of Image Windows , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.