Future-State Predicting LSTM for Early Surgery Type Recognition

This work presents a novel approach for early recognition of the type of a laparoscopic surgery from its video. Early recognition algorithms can benefit the development of “smart” OR systems that provide automatic context-aware assistance, and can also enable quick database indexing. The task is, however, fraught with challenges specific to laparoscopic videos, such as high visual similarity across surgeries and large variations in video duration. To capture the spatio-temporal dependencies in these videos, we choose as our model a combination of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. We then propose two complementary approaches for improving early recognition performance. The first is a CNN fine-tuning method that encourages surgeries to be distinguished based on the initial frames of laparoscopic videos. The second, referred to as “Future-State Predicting LSTM,” trains an LSTM to predict information related to future frames, which helps in distinguishing between the different types of surgeries. We evaluate our approaches on a large dataset of 425 laparoscopic videos containing nine types of surgeries (Laparo425) and achieve an average accuracy of 75% after observing only the first 10 min of a surgery. These results are promising from a practical standpoint and encouraging for other types of image-guided surgeries.
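The early-recognition setting reported above (accuracy after observing only the first 10 min of each surgery) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `frames_within` and `early_accuracy`, the 1 fps sampling rate, and the toy classifier are all assumptions made for illustration. The sketch only shows how each video might be truncated to a fixed observation window before classification, and how videos shorter than the window are used in full.

```python
from typing import Callable, Sequence


def frames_within(minutes: float, fps: float) -> int:
    """Number of frames available after observing `minutes` of video at `fps`."""
    return int(minutes * 60 * fps)


def early_accuracy(
    videos: Sequence[Sequence[int]],
    labels: Sequence[str],
    classify: Callable[[Sequence[int]], str],
    minutes: float = 10.0,
    fps: float = 1.0,  # assumed sampling rate, not taken from the paper
) -> float:
    """Fraction of videos classified correctly from their first `minutes` only."""
    if len(videos) != len(labels):
        raise ValueError("videos and labels must have the same length")
    if not videos:
        raise ValueError("at least one video is required")
    limit = frames_within(minutes, fps)
    correct = 0
    for frames, label in zip(videos, labels):
        observed = frames[:limit]  # shorter videos are simply used in full
        if classify(observed) == label:
            correct += 1
    return correct / len(videos)


# Toy data: each "frame" is 0 or 1; a 1 stands in for a distinguishing cue.
toy_videos = [
    [0] * 700 + [1] * 100,  # the cue appears only after the 10 min window
    [1] * 300,              # shorter than the window
    [0] * 50,
]
toy_labels = ["B", "B", "A"]


def toy_classify(observed: Sequence[int]) -> str:
    """Hypothetical stand-in for a trained model."""
    return "B" if sum(observed) > 0 else "A"


acc_10 = early_accuracy(toy_videos, toy_labels, toy_classify, minutes=10.0)
acc_15 = early_accuracy(toy_videos, toy_labels, toy_classify, minutes=15.0)
```

In this toy setup the first video is misclassified at 10 min because its distinguishing frames fall outside the observation window, which mirrors why early recognition is harder than classification from the full video.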
