Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms

Purpose: Context-aware systems for the operating room (OR) provide the possibility to significantly improve surgical workflow through various applications such as efficient OR scheduling, context-sensitive user interfaces, and automatic transcription of medical procedures. Being an essential element of such a system, surgical action recognition is thus an important research area. In this paper, we tackle the problem of classifying surgical actions from video clips that capture the activities taking place in the OR.

Methods: We acquire recordings using a multi-view RGBD camera system mounted on the ceiling of a hybrid OR dedicated to X-ray-based procedures and annotate clips of the recordings with the corresponding actions. To recognize the surgical actions from the video clips, we use a classification pipeline based on the bag-of-words (BoW) approach. We propose a novel feature encoding method that extends the classical BoW approach. Instead of using the typical rigid grid layout to divide the space of the feature locations, we propose to learn the layout from the actual 4D spatio-temporal locations of the visual features. This results in a data-driven and non-rigid layout which retains more spatio-temporal information than its rigid counterpart.

Results: We classify multi-view video clips from a new dataset generated from 11-day recordings of real operations. This dataset is composed of 1734 video clips of 15 actions. These include generic actions (e.g., moving patient to the OR bed) and actions specific to the vertebroplasty procedure (e.g., hammering). The experiments show that the proposed non-rigid feature encoding method performs better than the rigid encoding method. The classifier's accuracy increases by over 4 percentage points, from 81.08% to 85.53%.

Conclusion: The combination of both intensity and depth information from the RGBD data provides more discriminative power for the surgical action recognition task than either one alone. Furthermore, the proposed non-rigid spatio-temporal feature encoding scheme provides more discriminative histogram representations than the rigid counterpart. To the best of our knowledge, this is also the first work that presents action recognition results on multi-view RGBD data recorded in the OR.
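The idea of a learned, non-rigid layout can be illustrated with a minimal sketch. The abstract does not state how the layout is learned, so the code below assumes plain k-means over the 4D (x, y, z, t) feature locations, with each cluster center acting as one layout cell; the function names (`nearest`, `learn_layout`, `encode_clip`), the number of cells, and the L1 normalization are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def nearest(points, centers):
    """Index of the closest center (squared Euclidean distance) for each point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def learn_layout(locations, n_cells, n_iter=20, seed=0):
    """Assumed layout learner: Lloyd's k-means on 4D feature locations."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(locations), size=n_cells, replace=False)
    centers = locations[start].astype(float)
    for _ in range(n_iter):
        labels = nearest(locations, centers)
        for j in range(n_cells):
            members = locations[labels == j]
            if len(members):  # keep the old center if a cell is empty
                centers[j] = members.mean(axis=0)
    return centers


def encode_clip(words, locations, layout, n_words):
    """Build one visual-word histogram per layout cell and concatenate them."""
    cells = nearest(locations, layout)
    hist = np.zeros((len(layout), n_words))
    np.add.at(hist, (cells, words), 1.0)  # count word occurrences per cell
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)  # L1 normalization (an assumption)


# Synthetic stand-in data: random 4D locations and random visual-word indices.
rng = np.random.default_rng(0)
n_words, n_cells = 50, 8
layout = learn_layout(rng.random((2000, 4)), n_cells)
clip_hist = encode_clip(rng.integers(0, n_words, size=300),
                        rng.random((300, 4)), layout, n_words)
print(clip_hist.shape)
```

A rigid layout would instead assign each feature to a fixed grid cell regardless of where features actually occur; here the cells follow the density of the training locations, which is the property the abstract attributes to the data-driven layout.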
