Learning Compositional Representations for Few-Shot Recognition

One of the key limitations of modern deep learning approaches lies in the amount of data required to train them. Humans, by contrast, can learn to recognize novel categories from just a few examples. Instrumental to this rapid learning ability is the compositional structure of concept representations in the human brain --- something that deep learning models are lacking. In this work, we make a step towards bridging this gap between human and machine learning by introducing a simple regularization technique that allows the learned representation to be decomposable into parts. Our method uses category-level attribute annotations to disentangle the feature space of a network into subspaces corresponding to the attributes. These attributes can be either purely visual, like object parts, or more abstract, like openness and symmetry. We demonstrate the value of compositional representations on three datasets: CUB-200-2011, SUN397, and ImageNet, and show that they require fewer examples to learn classifiers for novel categories.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[3]  Xiaohan Chen,et al.  Can We Gain More from Orthogonality Regularizations in Training Deep CNNs? , 2018, NeurIPS.

[4]  Andrea Vedaldi,et al.  Understanding deep image representations by inverting them , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Hugo Larochelle,et al.  Optimization as a Model for Few-Shot Learning , 2016, ICLR.

[6]  Sanja Fidler,et al.  Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Bharath Hariharan,et al.  Low-Shot Visual Recognition by Shrinking and Hallucinating Features , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Nikos Komodakis,et al.  Dynamic Few-Shot Visual Learning Without Forgetting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Abhinav Gupta,et al.  Constrained Semi-Supervised Learning Using Attributes and Comparative Attributes , 2012, ECCV.

[10]  Pietro Perona,et al.  Caltech-UCSD Birds 200 , 2010 .

[11]  Sanjoy Dasgupta,et al.  Learning from discriminative feature feedback , 2018, NeurIPS.

[12]  Ernest Valveny,et al.  Word Spotting and Recognition with Embedded Attributes , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Yi Liu,et al.  Teaching Compositionality to CNNs , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Antonio Torralba,et al.  Part and appearance sharing: Recursive Compositional Models for multi-view , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Deva Ramanan,et al.  Tinkering Under the Hood: Interactive Zero-Shot Learning with Net Surgery , 2016, ArXiv.

[16]  Gregory R. Koch,et al.  Siamese Neural Networks for One-Shot Image Recognition , 2015 .

[17]  Zhangyang Wang,et al.  Can We Gain More from Orthogonality Regularizations in Training Deep Networks? , 2018, NeurIPS.

[18]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[19]  R. Weale Vision. A Computational Investigation Into the Human Representation and Processing of Visual Information. David Marr , 1983 .

[20]  Marc'Aurelio Ranzato,et al.  Efficient Lifelong Learning with A-GEM , 2018, ICLR.

[21]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[22]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[23]  Oriol Vinyals,et al.  Matching Networks for One Shot Learning , 2016, NIPS.

[24]  Martial Hebert,et al.  Learning from Small Sample Sets by Combining Unsupervised Meta-Training with CNNs , 2016, NIPS.

[25]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  D. Marr,et al.  Representation and recognition of the spatial organization of three-dimensional shapes , 1978, Proceedings of the Royal Society of London. Series B. Biological Sciences.

[27]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[28]  Cordelia Schmid,et al.  Label-Embedding for Attribute-Based Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[30]  Martial Hebert,et al.  From Red Wine to Red Tomato: Composition with Context , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Joshua B. Tenenbaum,et al.  Human-level concept learning through probabilistic program induction , 2015, Science.

[32]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[33]  Jacob Andreas,et al.  Measuring Compositionality in Representation Learning , 2019, ICLR.

[34]  Joshua B. Tenenbaum,et al.  Building machines that learn and think like people , 2016, Behavioral and Brain Sciences.

[35]  Donald D. Hoffman,et al.  Parts of recognition , 1984, Cognition.

[36]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[37]  D. McDermott LANGUAGE OF THOUGHT , 2012 .

[38]  Mark Everingham,et al.  Shared parts for deformable part-based models , 2011, CVPR 2011.

[39]  Tao Xiang,et al.  Learning to Compare: Relation Network for Few-Shot Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[41]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[42]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Patrick Henry Winston,et al.  Learning structural descriptions from examples , 1970 .

[44]  Sebastian Thrun,et al.  Is Learning The n-th Thing Any Easier Than Learning The First? , 1995, NIPS.

[45]  I. Biederman Recognition-by-components: a theory of human image understanding. , 1987, Psychological review.

[46]  Chen Xu,et al.  The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding , 2014, International Journal of Computer Vision.

[47]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[48]  Devi Parikh,et al.  Attributes for Classifier Feedback , 2012, ECCV.

[49]  Martial Hebert,et al.  Low-Shot Learning from Imaginary Data , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[51]  Yu-Chiang Frank Wang,et al.  A Closer Look at Few-shot Classification , 2019, ICLR.

[52]  Bolei Zhou,et al.  Interpreting Deep Visual Representations via Network Dissection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Martial Hebert,et al.  Learning to Learn: Model Regression Networks for Easy Small Sample Learning , 2016, ECCV.