Multi-modal multi-concept-based deep neural network for automatic image annotation

Automatic Image Annotation (AIA) remains a challenge in computer vision with real-world applications, owing to the semantic gap between high-level semantic concepts and low-level visual appearances. Contextual tags attached to images and the context semantics among concepts can provide additional semantic information to bridge this gap. To capture these semantic correlations effectively, we present a novel approach called Multi-modal Multi-concept-based Deep Neural Network (M2-DNN), which models the correlations among visual images, contextual tags, and multi-concept semantics. Unlike traditional AIA methods, M2-DNN takes into account not only single-concept context semantics but also multi-concept context semantics with abstract scenes. In our model, a multi-concept such as {"plane", "buildings"} is treated as one holistic scene concept for concept learning. Specifically, we first construct a multi-modal Deep Neural Network (DNN) as a concept classifier for visual images and contextual tags, and then employ it to annotate unlabeled images. Second, real-world databases commonly include many concepts that are hard to recognize, such as concepts with similar appearances, concepts with abstract scenes, and rare concepts. To recognize them effectively, we use multi-concept semantics inference and multi-modal correlation learning to refine the semantic annotations. Finally, we estimate the most relevant labels for each unlabeled image through a new label decision strategy. Comprehensive experiments on two publicly available datasets show that our method performs favourably compared with several state-of-the-art methods.
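The idea of treating a label set such as {"plane", "buildings"} as one holistic scene concept can be illustrated with a small sketch. The abstract does not say how multi-concepts are selected, so the co-occurrence counting below, the function name `mine_multi_concepts`, its `size` and `min_support` parameters, and the toy annotations are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter
from itertools import combinations


def mine_multi_concepts(annotations, size=2, min_support=2):
    """Return label subsets of the given size that co-occur in at least
    min_support images. This is a hypothetical selection rule, not the
    one used in the paper."""
    counts = Counter()
    for labels in annotations:
        # Deduplicate and sort so each subset is enumerated once per image.
        for combo in combinations(sorted(set(labels)), size):
            counts[frozenset(combo)] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Toy annotations (hypothetical, not taken from the paper's datasets).
annotations = [
    ["plane", "buildings", "sky"],
    ["plane", "buildings"],
    ["sky", "clouds"],
]

multi_concepts = mine_multi_concepts(annotations)
for combo, support in multi_concepts.items():
    print(sorted(combo), support)
```

Each surviving subset could then be assigned its own class index and learned alongside the single concepts, which is one plausible reading of "one holistic scene concept".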
