Real-Time Monocular Human Depth Estimation and Segmentation on Embedded Systems

Estimating scene depth to avoid collisions with moving pedestrians is a fundamental problem in robotics. This paper proposes a novel, low-complexity network architecture for fast and accurate human depth estimation and segmentation in indoor environments, targeting resource-constrained platforms (including battery-powered aerial, micro-aerial, and ground vehicles) that rely on a monocular camera as the primary perception module. Following the encoder-decoder structure, the proposed framework consists of two branches, one for depth prediction and another for semantic segmentation. Moreover, the network structure is optimized to improve forward inference speed. Extensive experiments on three self-generated datasets demonstrate that the pipeline executes in real time, achieving higher frame rates than contemporary state-of-the-art frameworks (114.6 frames per second on an NVIDIA Jetson Nano GPU with TensorRT) while maintaining comparable accuracy.
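
To make the two-branch encoder-decoder idea concrete, the following is a minimal PyTorch sketch, not the architecture proposed in the paper. It assumes a single shared encoder feeding two decoders; the abstract does not state whether the branches share an encoder. The layer widths, the depthwise-separable blocks, the two-class (human/background) segmentation output, and the names `conv_block`, `Decoder`, and `TwoBranchNet` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, stride=1):
    """Depthwise-separable 3x3 convolution (assumed low-cost building block)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class Decoder(nn.Module):
    """Upsamples encoder features by 8x and projects to out_ch channels."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.blocks = nn.ModuleList([
            conv_block(in_ch, in_ch // 2),
            conv_block(in_ch // 2, in_ch // 4),
            conv_block(in_ch // 4, in_ch // 8),
        ])
        self.head = nn.Conv2d(in_ch // 8, out_ch, 1)

    def forward(self, x):
        for block in self.blocks:
            # Convolve, then double the spatial resolution.
            x = F.interpolate(block(x), scale_factor=2, mode="nearest")
        return self.head(x)


class TwoBranchNet(nn.Module):
    """Shared encoder with one depth branch and one segmentation branch."""

    def __init__(self, num_classes=2):  # assumed: human vs. background
        super().__init__()
        # Three stride-2 stages reduce resolution by 8x.
        self.encoder = nn.Sequential(
            conv_block(3, 16, stride=2),
            conv_block(16, 32, stride=2),
            conv_block(32, 64, stride=2),
        )
        self.depth_branch = Decoder(64, 1)
        self.seg_branch = Decoder(64, num_classes)

    def forward(self, image):
        features = self.encoder(image)
        return self.depth_branch(features), self.seg_branch(features)


if __name__ == "__main__":
    model = TwoBranchNet().eval()
    with torch.no_grad():
        depth, seg = model(torch.randn(1, 3, 64, 96))
    print(depth.shape, seg.shape)
```

In this sketch both branches reuse the same encoder features, so the encoder cost is paid once per frame. That is one common way to keep a joint depth and segmentation model cheap on embedded hardware.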
