Multi-modal RGB–Depth–Thermal Human Body Segmentation

This work addresses human body segmentation from multi-modal visual cues as a first stage of automatic human behavior analysis. We propose a novel RGB–depth–thermal dataset along with a multi-modal segmentation baseline. The modalities are registered using a calibration device and a registration algorithm. Our baseline extracts regions of interest using background subtraction, partitions the foreground regions into cells, computes a set of image features on those cells using several state-of-the-art feature extraction methods, and models the distribution of the descriptors per cell using probabilistic models. A supervised learning algorithm then fuses the output likelihoods over cells in a stacked feature vector representation. The baseline, using Gaussian mixture models for the probabilistic modeling and Random Forest for the stacked learning, outperforms other state-of-the-art methods, obtaining an overlap above 75% with the manually annotated ground-truth human segmentations on the novel dataset.
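The abstract reports segmentation quality as an overlap with the ground truth but does not define the measure. A minimal sketch follows, assuming the overlap is the intersection-over-union (Jaccard index) between the predicted and annotated binary masks, as is common in segmentation benchmarks. The function name `overlap`, the `Mask` alias, and the list-of-lists mask representation are illustrative choices, not taken from the paper.

```python
from typing import Sequence

# Assumption: a mask is a 2-D grid of 0/1 values (1 = human pixel).
Mask = Sequence[Sequence[int]]


def overlap(predicted: Mask, ground_truth: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape.

    This is an assumed definition of "overlap"; the paper may use a variant.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("masks must have the same number of rows")

    intersection = 0
    union = 0
    for row_p, row_g in zip(predicted, ground_truth):
        if len(row_p) != len(row_g):
            raise ValueError("mask rows must have the same length")
        for p, g in zip(row_p, row_g):
            p_on, g_on = bool(p), bool(g)
            intersection += int(p_on and g_on)  # pixel is human in both masks
            union += int(p_on or g_on)          # pixel is human in either mask

    # Two empty masks agree perfectly; treat that case as full overlap.
    return intersection / union if union else 1.0


# Toy masks: 4 pixels are shared, 8 pixels are covered by either mask.
predicted = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
ground_truth = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
score = overlap(predicted, ground_truth)  # 4 / 8 = 0.5
```

Under this assumed definition, an overlap above 75% means the shared area is more than three quarters of the area covered by either mask, so both missed body pixels and false foreground pixels lower the score.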
