Compositional Embeddings for Multi-Label One-Shot Learning

We present a compositional embedding framework that infers not just a single class per input image, but a set of classes, in the setting of one-shot learning. Specifically, we propose and evaluate several novel models consisting of (1) an embedding function f trained jointly with a "composition" function g that computes set union operations between the classes encoded in two embedding vectors; and (2) an embedding function f trained jointly with a "query" function h that computes whether the classes encoded in one embedding subsume the classes encoded in another embedding. In contrast to prior work, these models must both perceive the classes associated with the input examples and encode the relationships between different class label sets, and they are trained using only weak one-shot supervision consisting of the label-set relationships among training examples. Experiments on the OmniGlot, Open Images, and COCO datasets show that the proposed compositional embedding models outperform existing embedding methods. Our compositional embedding models have applications to multi-label object recognition for both one-shot and supervised learning.
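To make the roles of f, g, and h concrete, the following is a minimal sketch of their interfaces and of the label-set relationships that could serve as weak supervision. It is an illustration under stated assumptions, not the architecture evaluated in the paper: the layer shapes, the tanh and sigmoid nonlinearities, the unit-length normalization, the symmetric form of g, and the helper names `union_target` and `subsumes_target` are all hypothetical, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)
IN_DIM, EMB_DIM = 6, 4  # hypothetical sizes, chosen only for illustration


def rand_matrix(rows, cols):
    """Random Gaussian matrix stored as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Untrained random weights stand in for parameters that would be learned jointly.
W_f = rand_matrix(IN_DIM, EMB_DIM)
W_g = rand_matrix(EMB_DIM, EMB_DIM)
w_h = [random.gauss(0.0, 1.0) for _ in range(2 * EMB_DIM)]


def matvec(x, W):
    """Multiply row vector x (length = rows of W) by matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def f(x):
    """Embedding function: input features -> unit-length embedding."""
    return normalize([math.tanh(c) for c in matvec(x, W_f)])


def g(e_a, e_b):
    """Composition function: embedding meant to encode the union of both label sets.

    Summing the inputs first makes g symmetric, mirroring the commutativity
    of set union (an assumption of this sketch).
    """
    summed = [a + b for a, b in zip(e_a, e_b)]
    return normalize([math.tanh(c) for c in matvec(summed, W_g)])


def h(e_a, e_b):
    """Query function: score in (0, 1) that e_a's classes subsume e_b's classes."""
    z = sum(w * c for w, c in zip(w_h, e_a + e_b))  # e_a + e_b concatenates the lists
    return 1.0 / (1.0 + math.exp(-z))


def union_target(labels_a, labels_b):
    """Label set that g(f(x_a), f(x_b)) is meant to encode."""
    return set(labels_a) | set(labels_b)


def subsumes_target(labels_a, labels_b):
    """True when labels_a contains every label in labels_b (target for h)."""
    return set(labels_b) <= set(labels_a)
```

In a training loop, one would push g(f(x_a), f(x_b)) toward f(x_c) for examples whose label set equals `union_target(labels_a, labels_b)`, and fit h against `subsumes_target`; neither step requires knowing which specific classes are present, only how the label sets relate.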
