Surgical Gesture Recognition in Laparoscopic Tasks Based on the Transformer Network and Self-Supervised Learning

In this study, we propose a deep learning framework and a self-supervision scheme for video-based surgical gesture recognition. The proposed framework is modular. First, a 3D convolutional network extracts feature vectors from short video clips, encoding spatial and short-term temporal features. Second, these feature vectors are fed into a transformer network that captures long-term temporal dependencies. Two main models are built on this backbone framework: C3DTrans (supervised) and SSC3DTrans (self-supervised). The dataset consisted of 80 videos from two basic laparoscopic tasks: peg transfer (PT) and knot tying (KT). To examine the potential of self-supervision, the models were trained on 60% and 100% of the annotated dataset. In addition, the best-performing model was evaluated on the JIGSAWS robotic surgery dataset. The best model (C3DTrans) achieved accuracies of 88.0% and 95.2% (clip level) and 97.5% and 97.9% (gesture level) for PT and KT, respectively. SSC3DTrans performed similarly to C3DTrans when trained on only 60% of the annotated dataset (about 84% and 93% clip-level accuracy for PT and KT, respectively). On JIGSAWS, C3DTrans achieved close to 76% accuracy, similar to or higher than that of prior techniques that use a single video stream, require no additional video data for training, and support online processing.
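
To make the two-stage design concrete, the following is a minimal PyTorch sketch of the backbone described above: a 3D convolutional clip encoder followed by a transformer encoder over the resulting feature sequence. All layer sizes, the class name C3DTransSketch, and the number of gesture classes are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class C3DTransSketch(nn.Module):
    """Sketch of the modular framework: a 3D CNN encodes each short video
    clip into a feature vector; a transformer encoder then models long-term
    temporal dependencies across the sequence of clip features."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=4, n_gestures=7):
        super().__init__()
        # Stage 1: 3D convolutional clip encoder (stand-in for the C3D-style
        # backbone; the single conv layer here is purely illustrative).
        self.clip_encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # (B*T, 64, 1, 1, 1)
            nn.Flatten(),             # (B*T, 64)
            nn.Linear(64, feat_dim),
        )
        # Stage 2: transformer encoder over the sequence of clip features,
        # capturing long-term temporal dependencies between clips.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.temporal_model = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, n_gestures)

    def forward(self, clips):
        # clips: (B, T, C, D, H, W) -- T clips per video, D frames per clip
        b, t = clips.shape[:2]
        feats = self.clip_encoder(clips.flatten(0, 1))  # (B*T, feat_dim)
        feats = feats.view(b, t, -1)                    # (B, T, feat_dim)
        ctx = self.temporal_model(feats)                # (B, T, feat_dim)
        return self.classifier(ctx)                     # per-clip gesture logits

# Example: 2 videos, 8 clips each, 16 frames of 112x112 RGB per clip.
model = C3DTransSketch()
logits = model(torch.randn(2, 8, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 8, 7])
```

Because each clip is classified from the transformer's contextualized feature at its own position, the sketch is compatible with online processing: restricting attention to past clips (e.g., via a causal mask) would let predictions be emitted as clips arrive.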
