Shaping Visual Representations with Language for Few-Shot Classification

By describing the features and abstractions of our world, language is a crucial tool for human learning and a promising source of supervision for machine learning models. We use language to improve few-shot visual classification in the underexplored scenario where natural language task descriptions are available during training, but unavailable for novel tasks at test time. Existing models for this setting sample new descriptions at test time and use those to classify images. Instead, we propose language-shaped learning (LSL), an end-to-end model that regularizes visual representations to predict language. LSL is conceptually simpler and more data-efficient, and it outperforms baselines in two challenging few-shot domains.
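To make the "regularizes visual representations to predict language" idea concrete, here is a minimal sketch of one training step, assuming a prototypical-network classifier with an auxiliary GRU decoder over task descriptions. All module names, dimensions, and the loss weight `lam` are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of language-shaped learning (LSL) in PyTorch.
# The language decoder is a training-time regularizer only; it is
# discarded at test time, when no descriptions are available.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSL(nn.Module):
    def __init__(self, vocab_size, feat_dim=64, embed_dim=300, hidden_dim=200):
        super().__init__()
        # Visual encoder: any CNN backbone producing a feature vector
        # (a tiny stand-in here; a deeper backbone would be typical).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Auxiliary language decoder: a GRU that generates the task
        # description conditioned on the image features.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def features(self, images):
        return self.encoder(images)

    def language_loss(self, feats, tokens):
        # Teacher-forced next-token prediction of the description.
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)
        emb = self.embed(tokens[:, :-1])
        hidden, _ = self.gru(emb, h0)
        logits = self.out(hidden)
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )

def lsl_step(model, support, support_labels, query, query_labels,
             descriptions, n_way, lam=5.0):
    """One episode: prototypical-network classification loss plus a
    language-prediction regularizer weighted by lam (value assumed)."""
    s_feats = model.features(support)
    q_feats = model.features(query)
    # Class prototypes = mean support feature per class.
    protos = torch.stack(
        [s_feats[support_labels == c].mean(0) for c in range(n_way)])
    # Classify queries by negative squared distance to prototypes.
    logits = -torch.cdist(q_feats, protos) ** 2
    cls_loss = F.cross_entropy(logits, query_labels)
    # Shape the support features so they can decode their descriptions.
    lang_loss = model.language_loss(s_feats, descriptions)
    return cls_loss + lam * lang_loss
```

Because the decoder only appears in the training loss, test-time inference is just the prototype classifier on the shaped features; this is what makes the approach simpler than methods that must sample descriptions for novel tasks.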
