Towards Complete Scene Reconstruction from Single-View Depth and Human Motion

Complete scene reconstruction from single-view RGBD is a challenging task, requiring estimation of scene regions occluded from the captured depth surface. We propose that scene-centric analysis of human motion within an indoor scene can reveal fully occluded objects and provide functional cues to enhance scene understanding tasks. Captured skeletal joint positions of humans, utilised as naturally exploring active sensors, are projected into a human-scene motion representation. Inherent body occupancy is leveraged to carve a volumetric scene occupancy map initialised from captured depth, revealing a more complete voxel representation of the scene. To obtain a structured box model representation of the scene, we introduce unique terms to an object detection optimisation that overcome depth occlusions whilst deriving from the same depth data. The method is evaluated on challenging indoor scenes with multiple occluding objects such as tables and chairs. Evaluation shows that human-centric scene analysis can be applied to effectively enhance state-of-the-art scene understanding approaches, resulting in a more complete representation than single-view depth alone.
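
The carving step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a boolean voxel grid in which `True` means "possibly occupied" (for example, everything at or behind the captured depth surface), approximates body occupancy by a sphere of fixed radius around each tracked joint, and uses hypothetical names (`carve_occupancy`, `origin`, `voxel_size`, `radius`) chosen only for illustration.

```python
import numpy as np

def carve_occupancy(occupancy, joints, origin, voxel_size, radius):
    """Return a copy of `occupancy` with voxels near any joint marked free.

    occupancy  : (X, Y, Z) bool array, True = possibly occupied (assumed convention)
    joints     : (N, 3) array of joint positions in scene coordinates (metres)
    origin     : (3,) position of the grid corner in scene coordinates
    voxel_size : edge length of one voxel (metres)
    radius     : assumed body radius around each joint (metres)
    """
    carved = occupancy.copy()
    # Integer index of every voxel, flattened to shape (X*Y*Z, 3).
    idx = np.indices(occupancy.shape).reshape(3, -1).T
    # Scene-space position of each voxel centre.
    centres = np.asarray(origin) + (idx + 0.5) * voxel_size
    for joint in np.asarray(joints, dtype=float):
        dist = np.linalg.norm(centres - joint, axis=1)
        free = idx[dist <= radius]
        # A body was observed here, so the space cannot hold scene geometry.
        carved[free[:, 0], free[:, 1], free[:, 2]] = False
    return carved

# Toy usage: a 1 m cube of unknown space and a single observed joint.
grid = np.ones((10, 10, 10), dtype=bool)
joints = np.array([[0.55, 0.55, 0.55]])
carved = carve_occupancy(grid, joints, origin=np.zeros(3),
                         voxel_size=0.1, radius=0.12)
```

Accumulating this over every frame of a motion sequence would progressively free the space people have walked or sat in. The voxels that remain occupied beside those paths are the candidates for occluded objects.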
