Visual Concepts and Compositional Voting

It is very attractive to formulate vision in terms of pattern theory \cite{Mumford2010pattern}, where patterns are defined hierarchically by compositions of elementary building blocks. But applying pattern theory to real-world images is currently less successful than discriminative methods such as deep networks. Deep networks, however, are black boxes that are hard to interpret and can easily be fooled by adding occluding objects. It is natural to wonder whether, by better understanding deep networks, we can extract building blocks that can be used to develop pattern-theoretic models. This motivates us to study the internal representations of a deep network using vehicle images from the PASCAL3D+ dataset. We use clustering algorithms to study the population activities of the features and extract a set of visual concepts, which we show are visually tight and correspond to semantic parts of vehicles. To analyze this, we annotate these vehicles by their semantic parts to create a new dataset, VehicleSemanticParts, and evaluate visual concepts as unsupervised part detectors. We show that visual concepts perform fairly well but are outperformed by supervised discriminative methods such as Support Vector Machines (SVMs). We next give a more detailed analysis of visual concepts and how they relate to semantic parts. Following this, we use the visual concepts as building blocks for a simple pattern-theoretic model, which we call compositional voting. In this model, several visual concepts combine to detect semantic parts. We show that this approach is significantly better than discriminative methods such as SVMs and deep networks trained specifically for semantic part detection. Finally, we return to studying occlusion by creating an annotated dataset with occlusion, called VehicleOcclusion, and show that compositional voting outperforms even deep networks when the amount of occlusion becomes large.
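The clustering step mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that feature vectors taken from one layer of a deep network are L2-normalized and grouped with plain K-means, so that each cluster center plays the role of a visual concept. The abstract states only that clustering algorithms are used; the function names `l2_normalize` and `extract_visual_concepts`, the choice of K-means, and all parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def extract_visual_concepts(features, num_concepts, num_iters=20, seed=0):
    """Cluster feature vectors with plain K-means (an assumed choice).

    features: array of shape (num_vectors, dim), one row per feature vector.
    Returns (centers, labels): centers has shape (num_concepts, dim) and
    labels assigns each feature vector to its nearest center.
    """
    rng = np.random.default_rng(seed)
    x = l2_normalize(np.asarray(features, dtype=float))

    # Initialize the centers from randomly chosen, distinct data rows.
    init_idx = rng.choice(len(x), size=num_concepts, replace=False)
    centers = x[init_idx].copy()

    for _ in range(num_iters):
        # Squared Euclidean distance from every vector to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for k in range(num_concepts):
            members = x[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)

    # Final assignment, consistent with the returned centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    return centers, labels
```

In this sketch, calling `extract_visual_concepts(features, num_concepts=K)` on a matrix of feature vectors returns one center per hypothetical visual concept; image patches whose features fall in the same cluster could then be inspected to judge whether the cluster is visually tight.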

[1] H. B. Barlow, et al. Single units and sensation: a neuron doctrine for perceptual psychology?, 1972, Perception.

[2] B. Schiele, et al. Combined Object Categorization and Segmentation With an Implicit Shape Model, 2004.

[3] George Papandreou, et al. Modeling Image Patches with a Generic Dictionary of Mini-epitomes, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4] Wei Zhang, et al. Maximum likelihood features for generative image models, 2017.

[5] Jitendra Malik, et al. Amodal Completion and Size Constancy in Natural Scenes, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6] A. P. Georgopoulos, et al. Neuronal population coding of movement direction, 1986, Science.

[7] David Mumford, et al. On the computational architecture of the neocortex, 2004, Biological Cybernetics.

[8] Alan Yuille, et al. Unsupervised learning of object semantic parts from internal states of CNNs by population encoding, 2015, arXiv:1511.06855.

[9] Long Zhu, et al. Unsupervised Structure Learning: Hierarchical Recursive Composition, Suspicious Coincidence and Competitive Exclusion, 2008, ECCV.

[10] Alan L. Yuille, et al. DOC: Deep OCclusion Estimation from a Single Image, 2015, ECCV.

[11] Ryuzo Okada, et al. Discriminative generalized hough transform for object detection, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[12] Bolei Zhou, et al. Object Detectors Emerge in Deep Scene CNNs, 2014, ICLR.

[13] Joan Bruna, et al. Intriguing properties of neural networks, 2013, ICLR.

[14] Tai Sing Lee, et al. Hierarchical Bayesian inference in the visual cortex, 2003, Journal of the Optical Society of America. A, Optics, image science, and vision.

[15] Kewei Tu, et al. Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars, 2012, EMNLP.

[16] David Mumford, et al. The 2.1-D sketch, 1990, Proceedings Third International Conference on Computer Vision.

[17] Kewei Tu, et al. Unsupervised Structure Learning of Stochastic And-Or Grammars, 2013, NIPS.

[18] Jian Cheng, et al. NormFace: L2 Hypersphere Embedding for Face Verification, 2017, ACM Multimedia.

[19] Marcel Simon, et al. Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20] H. Barlow, et al. Single Units and Sensation: A Neuron Doctrine for Perceptual Psychology?, 1972, Perception.

[21] Song-Chun Zhu, et al. Integrating Context and Occlusion for Car Detection by Hierarchical And-Or Model, 2014, ECCV.

[22] Alan L. Yuille, et al. Parsing occluded people by flexible compositions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[24] D. Mumford, et al. Pattern Theory: The Stochastic Analysis of Real-World Signals, 2010.

[25] Ann B. Lee. Occlusion Models for Natural Images: A Statistical Study of a Scale-Invariant Dead Leaves Model, 2001.

[26] Alan L. Yuille, et al. Adversarial Examples for Semantic Segmentation and Object Detection, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27] Leslie G. Valiant, et al. A theory of the learnable, 1984, CACM.

[28] Roozbeh Mottaghi, et al. Complexity of Representation and Inference in Compositional Models with Part Sharing, 2013, J. Mach. Learn. Res.

[29] Yuxin Peng, et al. The application of two-level attention models in deep convolutional neural network for fine-grained image classification, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[31] Sanja Fidler, et al. Detect What You Can: Detecting and Representing Objects Using Holistic Models and Body Parts, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32] William Grimson, et al. Object recognition by computer - the role of geometric constraints, 1991.

[33] Liming Chen, et al. von Mises-Fisher Mixture Model-based Deep learning: Application to Face Verification, 2017, ArXiv.

[34] Yali Amit, et al. 2D Object Detection and Recognition: Models, Algorithms, and Networks, 2002.

[35] Jonathon Shlens, et al. Explaining and Harnessing Adversarial Examples, 2014, ICLR.

[36] Yao Li, et al. Mid-level deep pattern mining, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Walter Gerbino, et al. Convexity and Symmetry in Figure-Ground Organization, 1976.

[38] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39] U. Grenander. Elements of Pattern Theory, 1996.

[40] Silvio Savarese, et al. Beyond PASCAL: A benchmark for 3D object detection in the wild, 2014, IEEE Winter Conference on Applications of Computer Vision.

[41] Anders Krogh, et al. Introduction to the theory of neural computation, 1994, The advanced book program.

[42] Jitendra Malik, et al. Object detection using a max-margin Hough transform, 2009, CVPR.

[43] Alan Yuille, et al. Detecting Semantic Parts on Partially Occluded Objects, 2017, BMVC.

[44] Luc Van Gool, et al. The Pascal Visual Object Classes (VOC) Challenge, 2010, International Journal of Computer Vision.

[45] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[46] Renjie Liao, et al. Learning Deep Parsimonious Representations, 2016, NIPS.

[47] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[48] Jitendra Malik, et al. Amodal Instance Segmentation, 2016, ECCV.