A Deep Multi-Modal CNN for Multi-Instance Multi-Label Image Classification

Deep convolutional neural networks (CNNs) have shown superior performance on the task of single-label image classification. However, the applicability of CNNs to multi-label images still remains an open problem, mainly because of two reasons. First, each image is usually treated as an inseparable entity and represented as one instance, which mixes the visual information corresponding to different labels. Second, the correlations amongst labels are often overlooked. To address these limitations, we propose a deep multi-modal CNN for multi-instance multi-label image classification, called MMCNN-MIML. By combining CNNs with multi-instance multi-label (MIML) learning, our model represents each image as a bag of instances for image classification and inherits the merits of both CNNs and MIML. In particular, MMCNN-MIML has three main appealing properties: 1) it can automatically generate instance representations for MIML by exploiting the architecture of CNNs; 2) it takes advantage of the label correlations by grouping labels in its later layers; and 3) it incorporates the textual context of label groups to generate multi-modal instances, which are effective in discriminating visually similar objects belonging to different groups. Empirical studies on several benchmark multi-label image data sets show that MMCNN-MIML significantly outperforms the state-of-the-art baselines on multi-label image classification tasks.

[1]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Sanghoon Lee,et al.  Blind Deep S3D Image Quality Evaluation via Local to Global Feature Aggregation , 2017, IEEE Transactions on Image Processing.

[3]  Dacheng Tao,et al.  Local Rademacher Complexity for Multi-Label Learning , 2014, IEEE Transactions on Image Processing.

[4]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[5]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[6]  Xindong Wu,et al.  Learning Label-Specific Features and Class-Dependent Labels for Multi-Label Classification , 2016, IEEE Transactions on Knowledge and Data Engineering.

[7]  Michael I. Jordan,et al.  Learning Transferable Features with Deep Adaptation Networks , 2015, ICML.

[8]  Zhi-Hua Zhou,et al.  Fast Multi-Instance Multi-Label Learning , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Minglun Gong,et al.  Multi-modal feature fusion for geographic image annotation , 2018, Pattern Recognit..

[10]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Francisco Herrera,et al.  Multiple Instance Learning , 2016 .

[12]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[13]  Huimin Ma,et al.  Edge Preserving and Multi-Scale Contextual Neural Network for Salient Object Detection , 2016, IEEE Transactions on Image Processing.

[14]  Yangqing Jia,et al.  Deep Convolutional Ranking for Multilabel Image Annotation , 2013, ICLR.

[15]  Minnan Luo,et al.  Sparse Relational Topical Coding on multi-modal data , 2017, Pattern Recognit..

[16]  Shuicheng Yan,et al.  Horror Image Recognition Based on Context-Aware Multi-Instance Learning , 2015, IEEE Transactions on Image Processing.

[17]  B. S. Manjunath,et al.  Multi-Label Learning With Fused Multimodal Bi-Relational Graph , 2014, IEEE Transactions on Multimedia.

[18]  David A. Forsyth,et al.  Animals on the Web , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[19]  Oded Maron,et al.  Multiple-Instance Learning for Natural Scene Classification , 1998, ICML.

[20]  Ji Feng,et al.  Deep MIML Network , 2017, AAAI.

[21]  Nasser M. Nasrabadi,et al.  Image coding using vector quantization: a review , 1988, IEEE Trans. Commun..

[22]  Zhiwu Lu,et al.  Semantic Sparse Recoding of Visual Content for Image Applications , 2011, IEEE Transactions on Image Processing.

[23]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[24]  Wei Liu,et al.  Multi-Modal Curriculum Learning for Semi-Supervised Image Classification , 2016, IEEE Transactions on Image Processing.

[25]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yong Luo,et al.  Large Margin Multi-Modal Multi-Task Feature Extraction for Image Classification , 2019, IEEE Transactions on Image Processing.

[27]  Xiaoli Z. Fern,et al.  Rank-loss support instance machines for MIML instance annotation , 2012, KDD.

[28]  Inderjit S. Dhillon,et al.  Large-scale Multi-label Learning with Missing Labels , 2013, ICML.

[29]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Mark J. Huiskes,et al.  The MIR flickr retrieval evaluation , 2008, MIR '08.

[31]  Yong Rui,et al.  Advances in deep learning approaches for image tagging , 2017, APSIPA Transactions on Signal and Information Processing.

[32]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[33]  Chu-Song Chen,et al.  Supervised Learning of Semantics-Preserving Hash via Deep Convolutional Neural Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Yi Yang,et al.  Compact and Discriminative Descriptor Inference Using Multi-Cues , 2015, IEEE Transactions on Image Processing.

[35]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[36]  Shangfei Wang,et al.  Deep multimodal network for multi-label classification , 2017, 2017 IEEE International Conference on Multimedia and Expo (ICME).

[37]  Massimo Tistarelli,et al.  Robust Multi-modal and Multi-unit Feature Level Fusion of Face and Iris Biometrics , 2009, ICB.

[38]  Qinghua Hu,et al.  Heterogeneous Feature Selection With Multi-Modal Deep Neural Networks and Sparse Group LASSO , 2015, IEEE Transactions on Multimedia.

[39]  Marco Loog,et al.  Multiple instance learning with bag dissimilarities , 2013, Pattern Recognit..

[40]  Lei Wu,et al.  Lift: Multi-Label Learning with Label-Specific Features , 2015, IEEE Trans. Pattern Anal. Mach. Intell..

[41]  Jing Huang,et al.  Audio-visual deep learning for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[42]  Jian Dong,et al.  Contextualizing Object Detection and Classification , 2015, IEEE Trans. Pattern Anal. Mach. Intell..

[43]  Bart Thomee,et al.  Overview of the ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task , 2012, CLEF.

[44]  Angeliki Lazaridou,et al.  The red one!: On learning to refer to things based on discriminative properties , 2016, ACL.

[45]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Qi Tian,et al.  Image Annotation by Input–Output Structural Grouping Sparsity , 2012, IEEE Transactions on Image Processing.

[47]  Antonio Torralba,et al.  Learning Aligned Cross-Modal Representations from Weakly Aligned Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  A. Torralba,et al.  The role of context in object recognition , 2007, Trends in Cognitive Sciences.

[49]  Thomas Hofmann,et al.  Multi-Instance Multi-Label Learning with Application to Scene Classification , 2007 .

[50]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[51]  Zhi-Hua Zhou,et al.  Towards Discovering What Patterns Trigger What Labels , 2012, AAAI.

[52]  Stefanie Nowak,et al.  The CLEF 2011 Photo Annotation and Concept-based Retrieval Tasks , 2011, CLEF.

[53]  Zhi-Hua Zhou,et al.  Multi-instance multi-label learning , 2008, Artif. Intell..

[54]  Liang Wang,et al.  Unconstrained Multimodal Multi-Label Learning , 2015, IEEE Transactions on Multimedia.

[55]  Bingbing Ni,et al.  HCP: A Flexible CNN Framework for Multi-Label Image Classification , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Yao Zhao,et al.  Cross-Modal Retrieval With CNN Visual Features: A New Baseline , 2017, IEEE Transactions on Cybernetics.

[57]  Diane Gershon,et al.  In the picture , 1990, Nature.

[58]  Qinghua Zheng,et al.  Sparse Multi-Modal Topical Coding for Image Annotation , 2016, Neurocomputing.