Scene Semantic Reconstruction from Egocentric RGB-D-Thermal Videos

In this paper we focus on the problem of inferring the geometric and semantic properties of a complex scene, observed from egocentric views, in which humans interact with objects. Unlike most previous work, we leverage a multimodal sensory stream composed of RGB, depth, and thermal (RGB-D-T) signals and use it as input to a new framework for joint 6-DoF camera localization, 3D reconstruction, and semantic segmentation. As our extensive experimental evaluation shows, combining these sensing modalities yields greater robustness when both the observer and the objects in the scene move rapidly, a challenging situation for traditional semantic reconstruction methods. Moreover, we contribute a new dataset comprising a large number of egocentric RGB-D-T videos of humans performing daily real-world activities, as well as a new demonstration hardware platform for acquiring such data.
