Human-Centric Scene Understanding from Single View 360 Video

In this paper, we propose an approach to indoor scene understanding from the observation of people in single-view spherical video. As input, our approach takes a centrally located spherical video capture of an indoor scene and estimates the 3D locations of the human actions performed throughout the long-term capture. The central contribution of this work is a deep convolutional encoder-decoder network, trained on a synthetic dataset, that reconstructs regions of affordance from the captured human activity. The predicted affordance segmentation is then integrated into 3D space to compose a reconstruction of the complete 3D scene. The mapping learnt between human activity and affordance segmentation demonstrates that omnidirectional observation of human activity can be applied to scene understanding tasks such as 3D reconstruction. We show that our approach, using only the observation of people, performs well against previous approaches, allowing reconstruction of occluded regions and labelling of scene affordances.
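As an illustration of how a detection in spherical video might be placed in 3D, the following is a minimal sketch and not the method of the paper. It assumes an equirectangular frame, a y-up coordinate frame centred on the camera, a known camera height, and a flat floor. The function names `equirect_pixel_to_ray` and `ray_to_floor_point` are hypothetical and introduced only for this example.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit direction (x, y, z).

    Assumed convention (not taken from the paper): y points up, the centre
    column of the image looks along +z, and the top row is straight up.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi   # longitude in [-pi, pi]
    lat = (0.5 - v / height) * math.pi        # latitude in [-pi/2, pi/2]
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def ray_to_floor_point(ray, camera_height):
    """Intersect a ray leaving the camera with the floor plane y = 0.

    The camera is assumed to sit at (0, camera_height, 0). Returns None
    when the ray does not point below the horizon.
    """
    x, y, z = ray
    if y >= 0.0:
        return None
    t = camera_height / -y  # distance along the ray to reach the floor
    return (t * x, 0.0, t * z)


if __name__ == "__main__":
    # Hypothetical example: a foot detection three quarters of the way
    # down the centre column of a 2048x1024 frame, camera 1.5 m high.
    ray = equirect_pixel_to_ray(1024, 768, 2048, 1024)
    print(ray_to_floor_point(ray, 1.5))
```

Under these assumptions, a pixel where a person touches the floor maps to a single floor position. Without the flat-floor and known-height assumptions the ray alone fixes only a direction, not a depth.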
