Joint Semantic Segmentation and Depth Estimation with Deep Convolutional Networks

Multi-scale deep CNNs have been used successfully for problems mapping each pixel to a label, such as depth estimation and semantic segmentation. It has also been shown that such architectures are reusable and can be used for multiple tasks. These networks are typically trained independently for each task by varying the output layer(s) and training objective. In this work we present a new model for simultaneous depth estimation and semantic segmentation from a single RGB image. Our approach demonstrates the feasibility of training parts of the model for each task and then fine tuning the full, combined model on both tasks simultaneously using a single loss function. Furthermore we couple the deep CNN with fully connected CRF, which captures the contextual relationships and interactions between the semantic and depth cues improving the accuracy of the final results. The proposed model is trained and evaluated on NYUDepth V2 dataset [23] outperforming the state of the art methods on semantic segmentation and achieving comparable results on the task of depth estimation.

[1]  Dieter Fox,et al.  RGB-(D) scene labeling: Features and algorithms , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Jana Kosecka,et al.  Semantic segmentation with heterogeneous sensor coverages , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[3]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Longin Jan Latecki,et al.  Semantic Segmentation of RGBD Images with Mutex Constraints , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[6]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[8]  Gregory Shakhnarovich,et al.  Feedforward semantic segmentation with zoom-out features , 2014, CVPR.

[9]  Yann LeCun,et al.  Indoor Semantic Segmentation using depth information , 2013, ICLR.

[10]  Xiaoxiao Li,et al.  Semantic Image Segmentation via Deep Parsing Network , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Sanja Fidler,et al.  Monocular Object Instance Segmentation and Depth Ordering with CNNs , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[13]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[14]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Luca Maria Gambardella,et al.  Fast image scanning with deep max-pooling convolutional neural networks , 2013, 2013 IEEE International Conference on Image Processing.

[16]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[18]  Abhinav Gupta,et al.  Designing deep networks for surface normal estimation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[20]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[21]  Guosheng Lin,et al.  Deep convolutional neural fields for depth estimation from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling , 2015, CVPR 2015.

[23]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[24]  Alan L. Yuille,et al.  Towards unified depth and semantic prediction from a single image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Atsuto Maki,et al.  From generic to specific deep representations for visual recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[27]  Sanja Fidler,et al.  Box in the Box: Joint 3D Layout and Object Reasoning from Single Images , 2013, 2013 IEEE International Conference on Computer Vision.