Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery

Multi- and cross-modal learning consolidates information from multiple data sources, which may offer a holistic representation of complex scenarios. Cross-modal learning is particularly interesting because synchronized data streams are immediately useful as self-supervisory signals. The prospect of achieving self-supervised continual learning in surgical robotics is exciting, as it may enable lifelong learning that adapts to different surgeons and cases, ultimately leading to a more general machine understanding of surgical processes. We present a learning paradigm that uses synchronous video and kinematics from robot-mediated surgery. Our approach relies on an encoder–decoder network that maps optical flow to the corresponding kinematics sequence. Clustering the latent representations reveals meaningful groupings by surgeon gesture and skill level. We demonstrate the generalizability of the representations on the JIGSAWS dataset by classifying skill and gestures on tasks not used for training. For tasks seen during training, we report 59% to 70% accuracy in surgical gesture classification; on tasks beyond the training setup, we observe 45% to 65% accuracy. Qualitatively, we find that unseen gestures form clusters in the latent space of novice actions, which may enable the automatic identification of novel interactions in a lifelong learning scenario. By predicting the synchronous kinematics sequence, the model learns optical-flow representations of surgical scenes that separate well even on tasks it has never seen. While the representations are immediately useful for a variety of tasks, the self-supervised learning paradigm may enable research in lifelong and user-specific learning.
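
The pipeline the abstract describes (dense optical flow encoded into a latent vector, which is then decoded into the synchronized kinematics stream) can be made concrete with a minimal PyTorch sketch. The layer sizes, the per-frame-CNN-plus-GRU encoder, the GRU decoder, and the mean-squared-error regression loss below are illustrative assumptions rather than the paper's exact configuration; the 76-dimensional kinematics vector follows the JIGSAWS recording format.

```python
# Minimal sketch of the cross-modal self-supervised setup, under the
# assumptions stated above. Not the authors' exact architecture.
import torch
import torch.nn as nn

class FlowEncoder(nn.Module):
    """Embeds a short optical-flow clip (B, T, 2, H, W) into one latent z."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                 # per-frame spatial features
            nn.Conv2d(2, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.GRU(64, latent_dim, batch_first=True)  # temporal pooling

    def forward(self, flow):                      # flow: (B, T, 2, H, W)
        B, T = flow.shape[:2]
        feats = self.cnn(flow.flatten(0, 1)).view(B, T, -1)
        _, h = self.rnn(feats)                    # h: (1, B, latent_dim)
        return h.squeeze(0)                       # latent representation z

class KinematicsDecoder(nn.Module):
    """Unrolls z into the synchronous kinematics sequence (B, T, D)."""
    def __init__(self, latent_dim=128, kin_dim=76, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, kin_dim)     # 76 JIGSAWS kinematic vars

    def forward(self, z, T):
        z_seq = z.unsqueeze(1).expand(-1, T, -1)  # repeat z at every step
        h, _ = self.rnn(z_seq)
        return self.out(h)

# One self-supervised training step: the video stream supervises itself
# through the synchronized kinematics, with no human labels involved.
encoder, decoder = FlowEncoder(), KinematicsDecoder()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

flow = torch.randn(8, 16, 2, 120, 160)            # stand-in optical-flow clips
kin = torch.randn(8, 16, 76)                      # synchronized kinematics
z = encoder(flow)
loss = nn.functional.mse_loss(decoder(z, T=16), kin)
opt.zero_grad()
loss.backward()
opt.step()

# Downstream, the frozen latent z can be projected (e.g., with UMAP) and
# clustered, or fed to a classifier for gesture and skill recognition.
```

Because the supervisory signal comes from the robot's own kinematics rather than annotations, the same training loop could in principle continue on data from new surgeons and tasks, which is what makes the paradigm a candidate for lifelong, user-specific learning.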
