Feature representation learning on multi-scale receptive fields for object recognition

In this paper, we propose a novel feature representation based on multi-scale receptive fields for object recognition. The method builds on a modified convolutional neural network (CNN), named network-in-network (NIN), which has performed well in several computer vision tasks. However, applying NIN to some specific applications raises two problems. First, NIN removes the fully connected layers; although this brings performance benefits, it leaves the network without an efficient feature representation and makes it unsuitable for large-scale face recognition. Second, some lower-layer features, which could make the feature representation more discriminative, are unused: in the purely feed-forward architecture, these features are not visible to the classifier. To solve these two problems, we present a multi-scale receptive fields (MSRF) representation learning scheme. Starting from a well-trained NIN, we add a pathway to the top layer and design a feature vector as the final representation. In our experiments, we compare our multi-scale receptive fields approach with the standard NIN architecture. The results show that our method obtains a more explicit feature representation and improved performance.
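The abstract does not spell out how the added pathway or the final feature vector is built, so the following is only a minimal sketch under stated assumptions: feature maps from a lower layer and the top layer are each reduced by global average pooling, L2-normalized, and concatenated into one descriptor. The function names, the choice of pooling and normalization, and the toy values are all illustrative assumptions, not the authors' implementation.

```python
import math


def global_average_pool(feature_map):
    """Average each channel of a [channels][height][width] map to one number."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]


def multi_scale_descriptor(feature_maps):
    """Concatenate pooled, normalized features from several layers.

    Assumption: each layer contributes one pooled vector, so layers with
    different receptive field sizes all reach the final representation.
    """
    descriptor = []
    for fmap in feature_maps:
        descriptor.extend(l2_normalize(global_average_pool(fmap)))
    return descriptor


# Toy inputs (hypothetical): a lower layer with 2 channels of size 2x2
# and a top layer with 3 channels of size 1x1.
lower = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
top = [[[2.0]], [[2.0]], [[1.0]]]

descriptor = multi_scale_descriptor([lower, top])
print(len(descriptor))  # one entry per channel across both layers
```

Normalizing each layer before concatenation is one way to keep a layer with large activations from dominating the descriptor; whether the original method does this is not stated in the abstract.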
