Enhanced Computer Vision With Microsoft Kinect Sensor: A Review

With the invention of the low-cost Microsoft Kinect sensor, high-resolution depth and visual (RGB) sensing has become available for widespread use. The complementary nature of the depth and visual information provided by the Kinect sensor opens up new opportunities to solve fundamental problems in computer vision. This paper presents a comprehensive review of recent Kinect-based computer vision algorithms and applications. The reviewed approaches are classified according to the type of vision problem that can be addressed or enhanced by means of the Kinect sensor. The covered topics include preprocessing, object tracking and recognition, human activity analysis, hand gesture analysis, and indoor 3-D mapping. For each category of methods, we outline the main algorithmic contributions and summarize the advantages and differences relative to their RGB counterparts. Finally, we give an overview of the challenges in this field and future research trends. This paper is expected to serve as a tutorial and source of references for Kinect-based computer vision researchers.
