Attend and Imagine: Multi-Label Image Classification With Visual Attention and Recurrent Neural Networks

Real images often have multiple labels, i.e., each image is associated with multiple objects or attributes. Compared to single-label image classification, multilabel classification is much more challenging for several reasons. First, the objects can appear anywhere in the image. Second, different regions of an image differ in importance, and the regions of interest vary considerably from one multilabel image to another. Finally, the labels of an image can depend on one another because of complex image structure. To address these challenges, we propose to predict the labels sequentially using recurrent neural networks (RNNs), which encode the label dependencies. When predicting a specific label, a dynamic attention mechanism lets the model focus only on the regions of interest in the image. We evaluate the method on two benchmark datasets, Pascal VOC and MS-COCO. Moreover, we construct a new dataset in which each image carries many semantically dependent labels, to further verify the effectiveness of our model. Experimental results show that our method outperforms several state-of-the-art methods, especially when predicting semantically related labels.
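The sequential prediction scheme described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a plain tanh RNN cell in place of whatever recurrent unit the paper uses, additive soft attention over a set of precomputed region features, greedy decoding, and a hypothetical STOP class that ends the sequence. All parameter names (`W_r`, `W_h`, `v`, `W_ch`, `W_yh`, `W_hh`, `W_out`) and the helper `random_params` are illustrative, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(regions, h, W_r, W_h, v):
    """Additive soft attention: weight each region given the RNN state h.

    regions: (R, D) region features, h: (H,) hidden state.
    Returns the attended context vector (D,) and the weights (R,).
    """
    scores = np.tanh(regions @ W_r + h @ W_h) @ v
    alpha = softmax(scores)
    return alpha @ regions, alpha


def predict_labels(regions, params, num_labels, max_steps=5):
    """Greedily emit one label per step until the assumed STOP class wins."""
    stop = num_labels                      # hypothetical STOP class index
    h = np.zeros(params["W_hh"].shape[0])  # initial hidden state
    prev = np.zeros(num_labels + 1)        # one-hot of the previous label
    labels, attention = [], []
    for _ in range(max_steps):
        context, alpha = attend(regions, h, params["W_r"], params["W_h"], params["v"])
        # The recurrent state mixes the attended regions, the previous label,
        # and the previous state, so earlier labels influence later ones.
        h = np.tanh(context @ params["W_ch"] + prev @ params["W_yh"] + h @ params["W_hh"])
        logits = h @ params["W_out"]
        for k in labels:                   # never predict the same label twice
            logits[k] = -np.inf
        y = int(np.argmax(logits))
        if y == stop:
            break
        labels.append(y)
        attention.append(alpha)
        prev = np.zeros(num_labels + 1)
        prev[y] = 1.0
    return labels, attention


def random_params(D, H, A, num_labels, seed=0):
    """Untrained random weights, only for exercising the sketch."""
    rng = np.random.default_rng(seed)
    shapes = {
        "W_r": (D, A), "W_h": (H, A), "v": (A,),
        "W_ch": (D, H), "W_yh": (num_labels + 1, H),
        "W_hh": (H, H), "W_out": (H, num_labels + 1),
    }
    return {name: rng.normal(scale=0.5, size=shape) for name, shape in shapes.items()}
```

Calling `predict_labels(regions, random_params(D, H, A, num_labels), num_labels)` on an `(R, D)` array of region features returns the predicted label indices together with one attention distribution over the `R` regions per predicted label, which is the quantity one would inspect to see where the model looked for each label.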
