Paper information - Multimodal Recurrent Neural Networks With Information Transfer Layers for Indoor Scene Labeling

This paper proposes multimodal recurrent neural networks (RNNs), a new method for RGB-D scene semantic segmentation. The model classifies image pixels from two input sources: RGB color channels and depth maps. It trains two RNNs simultaneously; the two are cross-connected through information transfer layers, which learn to adaptively extract relevant cross-modality features. Each RNN learns its representations from its own previous hidden states and from patterns transferred from the other RNN's previous hidden states, so both model-specific and cross-modality features are retained. We exploit the structure of quad-directional 2D-RNNs to model short- and long-range contextual information in the 2D input image. We carefully design several baselines to examine the proposed model structure efficiently. We test our multimodal RNN method on popular RGB-D benchmarks and show that it significantly outperforms previous methods and achieves results competitive with other state-of-the-art works.
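The cross-connection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the update equations, so the additive combination, the `tanh` nonlinearity, the linear transfer layers, and every name below (`init_params`, `step`, `run`, `T_d2rgb`, `T_rgb2d`) are assumptions made for illustration. For brevity the recurrence runs in one direction over a 1D sequence rather than in four directions over a 2D image, and plain Python lists stand in for tensors.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def init_params(in_rgb, in_d, hidden, seed=0):
    """Small random weights; all parameter names are hypothetical."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "U_rgb": mat(hidden, in_rgb), "W_rgb": mat(hidden, hidden),
        "U_d": mat(hidden, in_d), "W_d": mat(hidden, hidden),
        "T_d2rgb": mat(hidden, hidden),  # transfer layer: depth -> RGB
        "T_rgb2d": mat(hidden, hidden),  # transfer layer: RGB -> depth
    }


def step(x_rgb, x_d, h_rgb, h_d, p):
    """One recurrent step for both streams (assumed additive update)."""
    # Each transfer layer reads the OTHER stream's previous hidden state.
    from_d = matvec(p["T_d2rgb"], h_d)
    from_rgb = matvec(p["T_rgb2d"], h_rgb)
    # Each stream mixes its input, its own previous state, and transferred features.
    pre_rgb = [a + b + c for a, b, c in zip(
        matvec(p["U_rgb"], x_rgb), matvec(p["W_rgb"], h_rgb), from_d)]
    pre_d = [a + b + c for a, b, c in zip(
        matvec(p["U_d"], x_d), matvec(p["W_d"], h_d), from_rgb)]
    return [math.tanh(v) for v in pre_rgb], [math.tanh(v) for v in pre_d]


def run(seq_rgb, seq_d, p, hidden):
    """Scan both sequences; return concatenated hidden states per position."""
    h_rgb, h_d = [0.0] * hidden, [0.0] * hidden
    states = []
    for x_rgb, x_d in zip(seq_rgb, seq_d):
        h_rgb, h_d = step(x_rgb, x_d, h_rgb, h_d, p)
        states.append(h_rgb + h_d)  # features a per-position classifier could use
    return states
```

In this sketch, zeroing `T_d2rgb` cuts the depth-to-RGB path, so the RGB stream reduces to an ordinary single-modality RNN. This is one simple way to see what the transfer layers contribute.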
