Improved convolutional neural network based scene classification using long short-term memory and label relations

Convolutional neural network (CNN) is more and more important in pattern recognition. In this work, we adopt label relations and long short-term memory (LSTM) to develop an accurate CNN-based scene classification algorithm. Traditional scene classification algorithms assume that labels are mutually exclusive. However, this is not reasonable when an image has a variety of objects and hence has multiple labels. In this work, we apply two label relations, which are exclusive and hierarchy relations, to improve the accuracy of multiple-label scene classification. For example, it is impossible that an image has both the labels of “factory” and “garden”. If the label “factory” is assigned to an image, the probability that it has the label of “garden” should be lowered. We also use image captioning to construct a scene classification model and propose an LSTM based method to further explore label relations and obtain more accurate results for scenic image labeling.

[1]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[2]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[3]  Jiebo Luo,et al.  Learning multi-label scene classification , 2004, Pattern Recognit..

[4]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[5]  James Ze Wang,et al.  SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[7]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[8]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[9]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[10]  Jürgen Schmidhuber,et al.  Learning to Forget: Continual Prediction with LSTM , 2000, Neural Computation.

[11]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[12]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[13]  Andrew Zisserman,et al.  Scene Classification Via pLSA , 2006, ECCV.

[14]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[15]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[16]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[17]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[18]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[19]  Stéphane Herbin,et al.  Semantic hierarchies for image annotation: A survey , 2012, Pattern Recognit..

[20]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[21]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[22]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[23]  Cyrus Rashtchian,et al.  Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.

[24]  Zhi-Hua Zhou,et al.  Multi-Instance Multi-Label Learning with Application to Scene Classification , 2006, NIPS.

[25]  Cordelia Schmid,et al.  Semantic Hierarchies for Visual Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[27]  Hang Li,et al.  Convolutional Neural Network Architectures for Matching Natural Language Sentences , 2014, NIPS.

[28]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Nicholas Roy,et al.  Indoor scene recognition through object detection , 2010, 2010 IEEE International Conference on Robotics and Automation.