An Improved Model for Segmentation and Recognition of Fine-Grained Activities with Application to Surgical Training Tasks

Automated segmentation and recognition of fine-grained activities is important for enabling new applications in industrial automation, human-robot collaboration, and surgical training. Many existing approaches to activity recognition assume that a video has already been segmented and perform classification using an abstract representation based on spatio-temporal features. While some approaches perform joint activity segmentation and recognition, they typically suffer from poor modeling of transitions between actions and from representations that do not incorporate contextual information about the scene. In this paper, we propose a model for action segmentation and recognition that improves upon existing work in two directions. First, we develop a variation of the Skip-Chain Conditional Random Field that captures long-range state transitions between actions by using higher-order temporal relationships. Second, we argue that in constrained environments, where the relevant set of objects is known, it is better to develop features from high-level object relationships that have semantic meaning than to rely on abstract features. We apply our approach to a set of tasks common in robotic surgical training: suturing, knot tying, and needle passing, and show that our method increases micro and macro accuracy by 18.46% and 44.13% relative to the state of the art on a widely used robotic surgery dataset.
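To make the first contribution concrete, the following is a minimal sketch of how a skip-chain CRF scores a label sequence: in addition to the usual per-frame (unary) and adjacent-frame (pairwise) potentials of a linear-chain CRF, a third term links frames a fixed distance apart, capturing longer-range temporal relationships. The potentials and the skip distance `d` here are illustrative assumptions, not the paper's learned parameters.

```python
import numpy as np

def skip_chain_score(unary, pairwise, skip, labels, d=2):
    """Score a label sequence under a simple skip-chain CRF.

    unary:    (T, K) array of per-frame label scores
    pairwise: (K, K) array of transition scores between adjacent frames
    skip:     (K, K) array of transition scores between frames d apart
    labels:   length-T sequence of label indices
    d:        skip distance (illustrative choice)
    """
    T = len(labels)
    # Standard linear-chain terms: per-frame evidence plus adjacent transitions.
    score = sum(unary[t, labels[t]] for t in range(T))
    score += sum(pairwise[labels[t], labels[t + 1]] for t in range(T - 1))
    # Skip-chain term: edges between frames d steps apart add
    # long-range consistency between action labels.
    score += sum(skip[labels[t], labels[t + d]] for t in range(T - d))
    return score
```

Exact maximum-a-posteriori inference in such a model is more involved than Viterbi decoding on a chain, since the skip edges introduce loops in the graph; approximate methods such as loopy belief propagation are a common choice.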
