From Red Wine to Red Tomato: Composition with Context

Compositionality and contextuality are key building blocks of intelligence. They allow us to compose known concepts to generate new, more complex ones. However, traditional learning methods do not model both of these properties and require copious amounts of labeled data to learn new concepts. A large fraction of existing techniques, e.g., those using late fusion, compose concepts but fail to model contextuality: the red in red wine, for example, is very different from the red in red tomatoes. In this paper, we present a simple method that respects contextuality when composing classifiers of known visual concepts. Our method builds on the intuition that classifiers lie in a smooth space in which compositional transforms can be modeled. We show how it generalizes to unseen combinations of concepts. Our results on composing attributes and objects, as well as subject, predicate, and object triplets, demonstrate strong generalization performance compared to baselines. Finally, we present a detailed analysis of our method and highlight its properties.
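The core intuition above, that classifiers occupy a smooth space in which a composition transform can be learned, can be made concrete with a small sketch. The snippet below is an illustrative assumption, not the paper's exact architecture: the feature dimension `D`, the two-layer transform network, and the names `CompositionNet`, `w_red`, and `w_tomato` are all hypothetical placeholders used only to show the idea of mapping two primitive classifiers to one context-aware composed classifier.

```python
# Minimal sketch (assumptions noted above): compose two linear classifier
# weight vectors into a single context-aware composed classifier.
import torch
import torch.nn as nn

D = 4096  # assumed image-feature / classifier weight dimension


class CompositionNet(nn.Module):
    """Maps a pair of primitive classifiers (e.g., 'red', 'tomato')
    to one composed classifier (e.g., 'red tomato')."""

    def __init__(self, dim=D, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, w_attr, w_obj):
        # Concatenate the two classifier weight vectors and predict
        # the weights of the composed classifier.
        return self.net(torch.cat([w_attr, w_obj], dim=-1))


# Usage: score an image feature against the composed classifier.
compose = CompositionNet()
w_red, w_tomato = torch.randn(D), torch.randn(D)  # placeholder primitive classifiers
w_red_tomato = compose(w_red, w_tomato)           # composed 'red tomato' classifier
image_feature = torch.randn(D)                    # e.g., a CNN feature of a test image
score = image_feature @ w_red_tomato              # higher score = more likely 'red tomato'
```

Because the transform sees both inputs jointly, the same attribute classifier can be reshaped differently depending on the object it is paired with, which is what lets the composition remain contextual rather than a fixed late fusion of independent scores.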
