Motion and Depth Augmented Semantic Segmentation for Autonomous Navigation

Motion and depth provide critical cues in autonomous driving and are commonly used for generic object detection. In this paper, we leverage them to improve semantic segmentation. Depth cues help in detecting the road, as it lies below the horizon line, and static classes such as buildings and trees exhibit strong structural similarity across instances. Motion cues are useful because the scene is highly dynamic, with moving objects such as vehicles and pedestrians. This work utilizes geometric information modelled by depth maps and motion cues represented by optical flow vectors to improve the pixel-wise segmentation task. A CNN architecture is proposed, and variations in the stage at which color, depth, and motion information are fused, e.g. early fusion versus mid fusion, are systematically investigated. Additionally, we implement a multimodal fusion algorithm to maximize the benefit obtained from all three cues. The proposed algorithms are evaluated on the Virtual KITTI and Cityscapes datasets, where results demonstrate improved performance with depth and flow augmentation.
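The distinction between early fusion and mid fusion can be illustrated with a minimal sketch. This is not the paper's network; the `encoder` below is a hypothetical stand-in (average pooling plus a random 1x1 projection) used only to show where the modalities are concatenated: early fusion stacks the raw RGB, depth, and flow channels before a single encoder, while mid fusion encodes each modality separately and concatenates the resulting feature maps.

```python
import numpy as np

# Hypothetical input modalities for one frame (channels-first layout):
rgb   = np.random.rand(3, 64, 64)   # color image
depth = np.random.rand(1, 64, 64)   # depth map
flow  = np.random.rand(2, 64, 64)   # optical flow (u, v components)

def encoder(x, out_channels=8):
    """Stand-in for a CNN encoder branch: 2x2 average pooling followed
    by a random 1x1 'projection' to a fixed number of feature channels."""
    c, h, w = x.shape
    pooled = x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    proj = np.random.rand(out_channels, c)
    return np.tensordot(proj, pooled, axes=([1], [0]))  # (out_channels, h/2, w/2)

# Early fusion: concatenate the raw modalities along the channel axis,
# then pass the stacked 6-channel tensor through one shared encoder.
early_input = np.concatenate([rgb, depth, flow], axis=0)   # (6, 64, 64)
early_features = encoder(early_input)                      # (8, 32, 32)

# Mid fusion: run a separate encoder per modality, then concatenate
# the intermediate feature maps before the decoder stage.
mid_features = np.concatenate(
    [encoder(rgb), encoder(depth), encoder(flow)], axis=0)  # (24, 32, 32)

print(early_input.shape, early_features.shape, mid_features.shape)
```

Early fusion keeps the network small but forces one encoder to handle heterogeneous statistics; mid fusion triples the encoder cost but lets each branch specialize before the features are combined.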
