Automatic Operating Room Surgical Activity Recognition for Robot-Assisted Surgery

Automatic recognition of surgical activities in the operating room (OR) is a key technology for creating next-generation intelligent surgical devices and workflow monitoring/support systems. Such systems can potentially enhance OR efficiency, lowering costs and improving care delivery to patients. In this paper, we investigate automatic surgical activity recognition in robot-assisted operations. We collect the first large-scale dataset of its kind, comprising 400 full-length, multi-perspective videos from a variety of robotic surgery cases captured with Time-of-Flight cameras. We densely annotate the videos with the 10 most recognized and clinically relevant classes of activities. Furthermore, we investigate state-of-the-art computer vision action recognition techniques and adapt them to the OR environment and our dataset. First, we fine-tune the Inflated 3D ConvNet (I3D) for clip-level activity recognition on our dataset and use it to extract features from the videos. These features are then fed to a stack of 3 Temporal Gaussian Mixture layers, which extract context from neighboring clips, and finally pass through a Long Short-Term Memory network that learns the order of activities across full-length videos. We extensively assess the model and reach a peak performance of \(\sim 88\%\) mean Average Precision.
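For concreteness, the following is a minimal PyTorch sketch of the pipeline the abstract describes: pre-extracted I3D clip features pass through a stack of 3 temporal-context layers and then an LSTM that produces per-clip class scores. All names and dimensions here (e.g., `feat_dim=1024`, `hidden=512`, the kernel size) are illustrative assumptions, and the Temporal Gaussian Mixture (TGM) layers are approximated by plain 1-D temporal convolutions; the actual TGM layer learns Gaussian-mixture temporal kernels rather than free-form weights.

```python
import torch
import torch.nn as nn

class ActivityRecognizer(nn.Module):
    """Sketch of the described pipeline: I3D clip features -> temporal
    context layers -> LSTM -> per-clip activity logits (10 classes).

    The stack of Conv1d layers is a stand-in for the 3 TGM layers; a
    faithful TGM layer would parameterize its temporal kernels as
    mixtures of Gaussians instead of learning them freely.
    """
    def __init__(self, feat_dim=1024, hidden=512, n_classes=10, kernel=9):
        super().__init__()
        # Stand-in for the 3 TGM layers: each mixes information from
        # neighboring clips along the time axis.
        self.temporal = nn.Sequential(
            *[nn.Sequential(
                nn.Conv1d(feat_dim, feat_dim, kernel, padding=kernel // 2),
                nn.ReLU(),
            ) for _ in range(3)]
        )
        # LSTM models the order of activities over the full-length video.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, feats):
        # feats: (batch, time, feat_dim), one I3D feature vector per clip.
        x = self.temporal(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return self.head(x)  # (batch, time, n_classes) per-clip logits

# Usage: score 400 clips of 1024-d I3D features from one full-length video.
model = ActivityRecognizer()
logits = model(torch.randn(1, 400, 1024))
```

Training would then apply a per-clip classification loss (e.g., cross-entropy against the dense activity annotations), with mean Average Precision computed over the 10 classes at evaluation time.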
