Towards Accurate 3D Person Detection and Localization from RGB-D in Cluttered Environments

We address the problem of accurately detecting persons and localizing their 3D centroids in RGB-D scenes with frequent heavy occlusions, as often encountered in industrial and service robotics use cases. Although 2D object detection has recently made enormous progress, it is usually evaluated in terms of bounding box overlap in image space, whereas robotic systems often rely on metric 3D world coordinates for applications such as human tracking across sensor boundaries, socially aware motion planning, or safety and collision avoidance. Starting with a state-of-the-art 2D single-stage detector, we examine how its detections can be robustly lifted into 3D in order to outperform the state of the art in RGB-D person detection at 50 frames per second. Evaluation on our Kinect v2 dataset from an intralogistics warehouse suggests that intermediate representations other than 2D bounding boxes, such as instance segmentation masks or keypoint estimates, may be better suited for this purpose. As an alternative strategy, we also compare our method against a recently proposed bottom-up 3D human pose estimation approach. We find that our 2D top-down person detector achieves higher maximum recall, while the bottom-up 3D human pose estimation method can reach higher precision.
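To make the idea of lifting a 2D detection into metric 3D coordinates concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a depth image in meters that is registered to the color image, a pinhole camera model with known intrinsics, and a simple heuristic (median depth over a shrunken central region of the box). The function name `lift_box_to_3d` and the `shrink` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np

def lift_box_to_3d(depth_m, box, fx, fy, cx, cy, shrink=0.5):
    """Back-project a 2D box to a 3D point in the camera frame (meters).

    Illustrative heuristic only: take the median of the valid depth values
    inside a centered sub-window of the box, then apply the pinhole model.
    """
    x1, y1, x2, y2 = box
    # Box center in pixel coordinates.
    u_c, v_c = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # Half-extents of the shrunken window, which reduces background leakage.
    half_w = 0.5 * shrink * (x2 - x1)
    half_h = 0.5 * shrink * (y2 - y1)

    h, w = depth_m.shape
    u0 = max(int(round(u_c - half_w)), 0)
    u1 = min(int(round(u_c + half_w)) + 1, w)
    v0 = max(int(round(v_c - half_h)), 0)
    v1 = min(int(round(v_c + half_h)) + 1, h)

    patch = depth_m[v0:v1, u0:u1]
    # Zero or non-finite values are treated as missing depth measurements.
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None  # no usable depth inside the window

    z = float(np.median(valid))
    # Pinhole back-projection of the box center at depth z.
    x = (u_c - cx) * z / fx
    y = (v_c - cy) * z / fy
    return np.array([x, y, z])
```

Under heavy occlusion, the central region of a bounding box may be dominated by the occluding object rather than the person, which is one way to see why the abstract considers instance masks or keypoints as alternative intermediate representations.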
