Few-shot Learning with Contextual Cueing for Object Recognition in Complex Scenes

Few-shot learning aims to recognize new concepts from a small number of training examples. Recent work mainly tackles this problem by improving visual features, feature transfer, and meta-training algorithms. In this work, we explore a complementary direction: using scene context semantics to learn and recognize new concepts more easily. Since a few visual examples cannot cover all intra-class variations, contextual cueing offers a complementary signal for classifying instances with unseen features or ambiguous objects. More specifically, we propose a Class-conditioned Context Attention Module (CCAM) that learns to weight the most important context elements while learning a particular concept. We additionally propose a flexible gating mechanism to ground visual class representations in context semantics. We conduct extensive experiments on the Visual Genome dataset and show that, compared to a visual-only baseline, our model improves top-1 accuracy by 20.47% and 9.13% in the 5-way 1-shot and 5-way 5-shot settings, respectively, and by 20.42% and 12.45% in the 20-way 1-shot and 20-way 5-shot settings.
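To make the two components concrete, here is a minimal PyTorch sketch of a class-conditioned context attention module and a gated visual-context fusion. Everything in it — the module names, the linear projections, the scaled dot-product scoring, and the sigmoid gate — is an illustrative assumption in the spirit of the abstract (the gate follows the style of gated multimodal units), not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CCAM(nn.Module):
    """Class-conditioned Context Attention Module (illustrative sketch).

    Scores each context element (e.g., the embedding of a co-occurring
    object or scene label) against a class prototype, then pools the
    context with the resulting attention weights.
    """

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the class prototype
        self.key = nn.Linear(dim, dim)    # projects each context element

    def forward(self, prototype, context):
        # prototype: (batch, dim); context: (batch, n_ctx, dim)
        q = self.query(prototype).unsqueeze(1)        # (batch, 1, dim)
        k = self.key(context)                         # (batch, n_ctx, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5  # (batch, n_ctx)
        attn = F.softmax(scores, dim=-1)
        # Weighted sum of context elements -> one context vector per class.
        return (attn.unsqueeze(-1) * context).sum(1)  # (batch, dim)


class GatedFusion(nn.Module):
    """Gating mechanism grounding visual prototypes in context semantics.

    A learned sigmoid gate decides, per dimension, how much of the final
    class representation comes from the visual prototype versus the
    attended context vector.
    """

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, context):
        # visual, context: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([visual, context], dim=-1)))
        return g * visual + (1 - g) * context
```

In a prototypical-network-style pipeline, the fused vector would serve as the class prototype and query images would be classified by nearest prototype; the gate lets the model fall back on purely visual evidence whenever the available context is uninformative.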
