Memory-Based Network for Scene Graph with Unbalanced Relations

The scene graph which can be represented by a set of visual triples is composed of objects and the relations between object pairs. It is vital for image captioning, visual question answering, and many other applications. However, there is a long tail distribution on the scene graph dataset, and the tail relation cannot be accurately identified due to the lack of training samples. The problem of the nonstandard label and feature overlap on the scene graph affects the extraction of discriminative features and exacerbates the effect of data imbalance on the model. For these reasons, we propose a novel scene graph generation model that can effectively improve the detection of low-frequency relations. We use the method of memory features to realize the transfer of high-frequency relation features to low-frequency relation features. Extensive experiments on scene graph datasets show that our model significantly improved the performance of two evaluation metrics R@K and mR@K compared with state-of-the-art baselines.

[1]  Zhi-Hua Zhou,et al.  The Influence of Class Imbalance on Cost-Sensitive Learning: An Empirical Study , 2006, Sixth International Conference on Data Mining (ICDM'06).

[2]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[3]  Gang Wang,et al.  Unpaired Image Captioning by Language Pivoting , 2018, ECCV.

[4]  Stella X. Yu,et al.  Large-Scale Long-Tailed Recognition in an Open World , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Ji Zhang,et al.  Large-Scale Visual Relationship Understanding , 2018, AAAI.

[6]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[7]  Xiao Zhang,et al.  Range Loss for Deep Face Recognition with Long-Tailed Training Data , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Qi Wu,et al.  Image Captioning and Visual Question Answering Based on Attributes and External Knowledge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[12]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Liang Lin,et al.  Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[15]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[16]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[17]  Wei Liu,et al.  Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Gang Wang,et al.  Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Gang Wang,et al.  Stack-Captioning: Coarse-to-Fine Learning for Image Captioning , 2017, AAAI.

[21]  Zsolt Kira,et al.  Learning to cluster in order to Transfer across domains and tasks , 2017, ICLR.

[22]  Richard Socher,et al.  Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.

[23]  Nikos Komodakis,et al.  Dynamic Few-Shot Visual Learning Without Forgetting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[26]  Jianfei Cai,et al.  Auto-Encoding Scene Graphs for Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[28]  Juan-Zi Li,et al.  Explainable and Explicit Visual Reasoning Over Scene Graphs , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Guilin Qi,et al.  Richpedia: A Comprehensive Multi-modal Knowledge Graph , 2019, JIST.

[33]  Shiguang Shan,et al.  Exploring Context and Visual Pattern of Relationship for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Matthew A. Brown,et al.  Low-Shot Learning with Imprinted Weights , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Nenghai Yu,et al.  Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition , 2018, ECCV.

[36]  Gang Wang,et al.  An Empirical Study of Language CNN for Image Captioning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Meng Wang,et al.  Answering why-not questions on semantic multimedia queries , 2017, Multimedia Tools and Applications.

[38]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Quanfu Fan,et al.  Semantically Guided Visual Question Answering , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[40]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[41]  Jun Yu,et al.  On Exploring Undetermined Relationships for Visual Relationship Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[44]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[45]  Larry S. Davis,et al.  Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[46]  Lina Yao,et al.  One-Shot Learning for Long-Tail Visual Relation Detection , 2020, AAAI.

[47]  Jia Deng,et al.  Pixels to Graphs by Associative Embedding , 2017, NIPS.

[48]  Qi Wu,et al.  FVQA: Fact-Based Visual Question Answering , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[50]  Ji Zhang,et al.  Graphical Contrastive Losses for Scene Graph Parsing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.