What Is Holding Back Convnets for Detection?

Convolutional neural networks have recently shown excellent results in general object detection and many other tasks. Albeit very effective, they involve many user-defined design choices. In this paper we want to better understand these choices by inspecting two key aspects “what did the network learn?”, and “what can the network learn?”. We exploit new annotations (Pascal3D+), to enable a new empirical analysis of the R-CNN detector. Despite common belief, our results indicate that existing state-of-the-art convnets are not invariant to various appearance factors. In fact, all considered networks have similar weak points which cannot be mitigated by simply increasing the training data (architectural changes are needed). We show that overall performance can improve when using image renderings as data augmentation. We report the best known results on Pascal3D+ detection and view-point estimation tasks.

[1]  Andrea Vedaldi,et al.  Understanding deep image representations by inverting them , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[4]  Thomas Brox,et al.  Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT , 2014, ArXiv.

[5]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[6]  Derek Hoiem,et al.  Diagnosing Error in Object Detectors , 2012, ECCV.

[7]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[8]  Atsuto Maki,et al.  Persistent Evidence of Local Image Properties in Generic ConvNets , 2015, SCIA.

[9]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yann LeCun,et al.  Computing the stereo matching cost with a convolutional neural network , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Michael Goesele,et al.  Back to the Future: Learning Shape Models from 3D CAD Data , 2010, BMVC.

[12]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[13]  Bernt Schiele,et al.  Articulated people detection and pose estimation: Reshaping the future , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[16]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Charless C. Fowlkes,et al.  Do We Need More Training Data or Better Models for Object Detection? , 2012, BMVC.

[18]  Surya Ganguli,et al.  Identifying and attacking the saddle point problem in high-dimensional non-convex optimization , 2014, NIPS.

[19]  Cordelia Schmid,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[20]  Yoshua Bengio,et al.  On the Expressive Power of Deep Architectures , 2011, ALT.

[21]  Andrea Vedaldi,et al.  Understanding Image Representations by Measuring Their Equivariance and Equivalence , 2014, International Journal of Computer Vision.

[22]  Quoc V. Le,et al.  Measuring Invariances in Deep Networks , 2009, NIPS.

[23]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[24]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[25]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[26]  Kate Saenko,et al.  Exploring Invariances in Deep Convolutional Neural Networks Using Synthetic Images , 2014, ArXiv.

[27]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[28]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[30]  Thomas Brox,et al.  Striving for Simplicity: The All Convolutional Net , 2014, ICLR.

[31]  Peter V. Gehler,et al.  3D object class detection in the wild , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[32]  Yi Li,et al.  Robust Online Visual Tracking with a Single Convolutional Neural Network , 2014, ACCV.

[33]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[34]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[35]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[36]  Dariu Gavrila,et al.  A mixed generative-discriminative framework for pedestrian classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[38]  Mario Fritz,et al.  Image-Based Synthesis and Re-synthesis of Viewpoints Guided by 3D Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Jitendra Malik,et al.  Analyzing the Performance of Multilayer Neural Networks for Object Recognition , 2014, ECCV.

[40]  Peter V. Gehler,et al.  Multi-View and 3D Deformable Part Models , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Jiaolong Xu,et al.  Learning a Part-Based Pedestrian Detector in a Virtual World , 2014, IEEE Transactions on Intelligent Transportation Systems.

[42]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.