LinkNet: 2D-3D linked multi-modal network for online semantic segmentation of RGB-D videos

Abstract This paper proposes LinkNet, a 2D-3D linked multi-modal network for online semantic segmentation of RGB-D videos, a task essential for real-time applications such as robot navigation. Existing methods for RGB-D semantic segmentation usually work in the regular image domain, which allows efficient processing with convolutional neural networks (CNNs). However, RGB-D videos are captured from a 3D scene, and different frames can contain useful information about the same local region seen from different views. Working solely in the image domain fails to exploit this information. Our approach is based on joint 2D and 3D analysis. Online processing runs simultaneously with 3D scene reconstruction, from which we establish 2D-3D links between consecutive RGB-D frames and the 3D point cloud. We combine image color with view-insensitive geometric features generated from the 3D point cloud for multi-modal semantic feature learning. LinkNet further uses a recurrent neural network (RNN) module to dynamically maintain hidden semantic states during 3D fusion and to refine the voxel-based labeling results. Experimental results on SceneNet [54] and ScanNet [43] demonstrate that the semantic segmentation results of our framework are stable and effective.
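To make the idea of 2D-3D links concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera with known intrinsics, a known camera-to-world pose per frame (as a reconstruction system would provide), and a uniform voxel grid. The function names (`backproject`, `to_world`, `voxel_key`, `link_frame`) and all parameter values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import floor


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth into camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(point, pose):
    """Apply a 4x4 camera-to-world pose (nested row-major lists) to a 3D point."""
    x, y, z = point
    return tuple(
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )


def voxel_key(point, voxel_size):
    """Integer grid index of the voxel containing the point (assumed uniform grid)."""
    return tuple(floor(c / voxel_size) for c in point)


def link_frame(depth_image, pose, intrinsics, voxel_size, frame_id, links):
    """Record, for every valid depth pixel, which voxel it observes.

    `links` maps voxel index -> list of (frame_id, u, v) observations, so
    pixels from different frames that see the same region share one entry.
    """
    fx, fy, cx, cy = intrinsics
    for v, row in enumerate(depth_image):
        for u, d in enumerate(row):
            if d <= 0:  # skip pixels with missing or invalid depth
                continue
            p = to_world(backproject(u, v, d, fx, fy, cx, cy), pose)
            links[voxel_key(p, voxel_size)].append((frame_id, u, v))
    return links


# Toy usage: two one-row frames; the second camera is shifted along x.
intrinsics = (1.0, 1.0, 0.0, 0.0)  # hypothetical fx, fy, cx, cy
identity = [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1.0]]
shifted = [[1, 0, 0, -1.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1.0]]

links = defaultdict(list)
link_frame([[1.0, 1.0]], identity, intrinsics, 0.5, 0, links)
link_frame([[0.0, 1.0]], shifted, intrinsics, 0.5, 1, links)
```

In this toy setup, one voxel collects observations from both frames. A multi-view method could use that kind of association to aggregate per-pixel features on the 3D side; how LinkNet itself builds and uses its links is not specified in the abstract.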

[1] Yann LeCun, et al. Indoor Semantic Segmentation using depth information, 2013, ICLR.

[2] George Loizou, et al. Computer vision and pattern recognition, 2007, Int. J. Comput. Math.

[3] Gang Wang, et al. Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks, 2016, ECCV.

[4] Ralph R. Martin, et al. PCT: Point cloud transformer, 2020, Computational Visual Media.

[5] Yaron Lipman, et al. Point convolutional neural networks by extension operators, 2018, ACM Trans. Graph.

[6] Roberto Cipolla, et al. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7] Luis Herranz, et al. Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better than Transferring from RGB-CNNs, 2017, AAAI.

[8] Seunghoon Hong, et al. Learning Deconvolution Network for Semantic Segmentation, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[10] Shi-Min Hu, et al. Probabilistic Projective Association and Semantic Guided Relocalization for Dense Reconstruction, 2019, 2019 International Conference on Robotics and Automation (ICRA).

[11] Wolfram Burgard, et al. Self-Supervised Model Adaptation for Multimodal Semantic Segmentation, 2018, International Journal of Computer Vision.

[12] Yu Zhang, et al. Discriminative Feature Learning for Video Semantic Segmentation, 2014, 2014 International Conference on Virtual Reality and Visualization.

[13] Stefan Leutenegger, et al. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks, 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[14] Wei Wu, et al. PointCNN: Convolution On X-Transformed Points, 2018, NeurIPS.

[15] Derek Hoiem, et al. Indoor Segmentation and Support Inference from RGBD Images, 2012, ECCV.

[16] Thomas Funkhouser, et al. Virtual Multi-view Fusion for 3D Semantic Segmentation, 2020, ECCV.

[17] Xiaogang Wang, et al. Pyramid Scene Parsing Network, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Trevor Darrell, et al. Clockwork Convnets for Video Semantic Segmentation, 2016, ECCV Workshops.

[19] Lin Gao, et al. A survey on deep geometry learning: From a representation perspective, 2020, Computational Visual Media.

[20] Vladlen Koltun, et al. Multi-Scale Context Aggregation by Dilated Convolutions, 2015, ICLR.

[21] Shi-Min Hu, et al. Semantic Labeling and Instance Segmentation of 3D Point Clouds Using Patch Context Analysis and Multiscale Processing, 2020, IEEE Transactions on Visualization and Computer Graphics.

[22] Yann LeCun, et al. Predicting Deeper into the Future of Semantic Segmentation, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23] Mingmin Zhen, et al. Multi-view based neural network for semantic segmentation on 3D scenes, 2019, Science China Information Sciences.

[24] Eugenio Culurciello, et al. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation, 2016, ArXiv.

[25] Eduardo Romera, et al. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation, 2018, IEEE Transactions on Intelligent Transportation Systems.

[26] Daniel Cremers, et al. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture, 2016, ACCV.

[27] Cheng Wang, et al. Toward better boundary preserved supervoxel segmentation for 3D point clouds, 2018, ISPRS Journal of Photogrammetry and Remote Sensing.

[28] Raja Giryes, et al. PointGMM: A Neural GMM Network for Point Clouds, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Timo Ropinski, et al. Monte Carlo convolution for learning on non-uniformly sampled point clouds, 2018, ACM Trans. Graph.

[30] Fuxin Li, et al. PointConv: Deep Convolutional Networks on 3D Point Clouds, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Seungyong Lee, et al. RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32] Leonidas J. Guibas, et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Dieter Fox, et al. DA-RNN: Semantic Mapping with Data Associated Recurrent Neural Networks, 2017, Robotics: Science and Systems.

[34] Andrew W. Fitzgibbon, et al. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera, 2011, UIST.

[35] Stefan Leutenegger, et al. ElasticFusion: Real-time dense SLAM and light source estimation, 2016, Int. J. Robotics Res.

[36] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[37] Shi-Min Hu, et al. Deep point-based scene labeling with depth mapping and geometric patch feature encoding, 2019, Graph. Model.

[38] George Papandreou, et al. Rethinking Atrous Convolution for Semantic Image Segmentation, 2017, ArXiv.

[39] Jitendra Malik, et al. Learning Rich Features from RGB-D Images for Object Detection and Segmentation, 2014, ECCV.

[40] Wolfram Burgard, et al. Deep learning for human part discovery in images, 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[41] Trevor Darrell, et al. Fully Convolutional Networks for Semantic Segmentation, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42] Lourdes Agapito, et al. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects, 2018, 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR).

[43] Matthias Nießner, et al. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Yue Wang, et al. Dynamic Graph CNN for Learning on Point Clouds, 2018, ACM Trans. Graph.

[45] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[46] Pierre Vandergheynst, et al. Geometric Deep Learning: Going beyond Euclidean data, 2016, IEEE Signal Process. Mag.

[47] Iasonas Kokkinos, et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48] Iasonas Kokkinos, et al. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, 2014, ICLR.

[49] Kai Xu, et al. Fusion-Aware Point Convolution for Online Semantic 3D Scene Segmentation, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Matthias Nießner, et al. BundleFusion, 2016, TOGS.

[51] Shuguang Cui, et al. PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks With Adaptive Sampling, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Xin Zhao, et al. Locality-Sensitive Deconvolution Networks with Gated Fusion for RGB-D Indoor Semantic Segmentation, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53] Yang Zhao, et al. Deep High-Resolution Representation Learning for Visual Recognition, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54] Stefan Leutenegger, et al. SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[56] George Papandreou, et al. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, 2018, ECCV.

[57] Fei Luo, et al. RedNet: Residual Encoder-Decoder Network for indoor RGB-D Semantic Segmentation, 2018, ArXiv.

[58] Leonidas J. Guibas, et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017, NIPS.

[59] Michael W. Vannier, et al. Biomedical image segmentation, 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[60] Vladlen Koltun, et al. MSeg: A Composite Dataset for Multi-Domain Semantic Segmentation, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[61] Qinping Zhao, et al. Semantic part segmentation of single-view point cloud, 2020, Science China Information Sciences.