Joint Task-Recursive Learning for Semantic Segmentation and Depth Estimation

In this paper, we propose a novel joint Task-Recursive Learning (TRL) framework for the closing-loop semantic segmentation and monocular depth estimation tasks. TRL can recursively refine the results of both tasks through serialized task-level interactions. In order to mutually-boost for each other, we encapsulate the interaction into a specific Task-Attentional Module (TAM) to adaptively enhance some counterpart patterns of both tasks. Further, to make the inference more credible, we propagate previous learning experiences on both tasks into the next network evolution by explicitly concatenating previous responses. The sequence of task-level interactions are finally evolved along a coarse-to-fine scale space such that the required details may be reconstructed progressively. Extensive experiments on NYU-Depth v2 and SUN RGB-D datasets demonstrate that our method achieves state-of-the-art results for monocular depth estimation and semantic segmentation.

[1]  Alan L. Yuille,et al.  Towards unified depth and semantic prediction from a single image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Seungyong Lee,et al.  RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[4]  Martial Hebert,et al.  Cross-Stitch Networks for Multi-task Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jian Yang,et al.  Deep hierarchical guidance and regularization learning for end-to-end depth estimation , 2018, Pattern Recognit..

[6]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Xin Li,et al.  FoveaNet: Perspective-Aware Urban Scene Parsing , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Daniel Rueckert,et al.  Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Mario Fritz,et al.  STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[12]  Urs A. Muller,et al.  Learning long-range vision for autonomous off-road driving , 2009 .

[13]  Yunchao Wei,et al.  Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Jian Yang,et al.  Progressive Hard-Mining Network for Monocular Depth Estimation , 2018, IEEE Transactions on Image Processing.

[15]  Massimiliano Pontil,et al.  Regularized multi--task learning , 2004, KDD.

[16]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[17]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[19]  Rich Caruana,et al.  Multitask Learning , 1997, Machine-mediated learning.

[20]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[21]  Guosheng Lin,et al.  Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Xin Zhao,et al.  Locality-Sensitive Deconvolution Networks with Gated Fusion for RGB-D Indoor Semantic Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[24]  Jianxiong Xiao,et al.  SUN RGB-D: A RGB-D scene understanding benchmark suite , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Timnit Gebru,et al.  Fine-Grained Recognition in the Wild: A Multi-task Domain Adaptation Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  한보형,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015 .

[28]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Gang Wang,et al.  Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks , 2016, ECCV.

[31]  Sinisa Todorovic,et al.  Monocular Depth Estimation Using Neural Regression Forest , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Nicu Sebe,et al.  Multi-scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Daniel Cohen-Or,et al.  Cascaded Feature Network for Semantic Segmentation of RGB-D Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[35]  Roberto Cipolla,et al.  Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding , 2015, BMVC.

[36]  Yao Zhao,et al.  Learning to segment with image-level annotations , 2016, Pattern Recognit..

[37]  N. Taatgen,et al.  The problem state: a cognitive bottleneck in multitasking. , 2010, Journal of experimental psychology. Learning, memory, and cognition.

[38]  Sanja Fidler,et al.  3D Graph Neural Networks for RGBD Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Zhen Li,et al.  LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling , 2016, ECCV.

[40]  Ali Jalali,et al.  A Dirty Model for Multi-task Learning , 2010, NIPS.

[41]  Longin Jan Latecki,et al.  Semantic Segmentation of RGBD Images with Mutex Constraints , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42]  Ashutosh Saxena,et al.  High speed obstacle avoidance using monocular vision and reinforcement learning , 2005, ICML.

[43]  Abhinav Gupta,et al.  Designing deep networks for surface normal estimation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Iasonas Kokkinos,et al.  UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Federico Tombari,et al.  CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Stephen Lin,et al.  Unified Depth Prediction and Intrinsic Image Decomposition from a Single Image via Joint Convolutional Neural Fields , 2016, ECCV.

[47]  Xiaogang Wang,et al.  Residual Attention Network for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Shuicheng Yan,et al.  Multi-Path Feedback Recurrent Neural Networks for Scene Parsing , 2016, AAAI.

[49]  Chunhua Shen,et al.  Estimating Depth From Monocular Images as Classification Using Deep Fully Convolutional Residual Networks , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[50]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[51]  Ian D. Reid,et al.  Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Ian D. Reid,et al.  RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Shimon Ullman,et al.  Uncovering shared structures in multiclass classification , 2007, ICML '07.