Rendezvous in Time: An Attention-based Temporal Fusion approach for Surgical Triplet Recognition

PURPOSE One of the recent advances in surgical AI is the recognition of surgical activities as triplets of ⟨instrument, verb, target⟩. Although triplets provide detailed information for computer-assisted intervention, current triplet recognition approaches rely only on single-frame features. Exploiting the temporal cues from earlier frames would improve the recognition of surgical action triplets from videos.

METHODS In this paper, we propose Rendezvous in Time (RiT), a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling. Focusing more on the verbs, RiT explores the connectedness of current and past frames to learn temporal attention-based features for enhanced triplet recognition.

RESULTS We validate our proposal on the challenging surgical triplet dataset, CholecT45, demonstrating improved recognition of the verb and the triplet, along with other interactions involving the verb such as ⟨instrument, verb⟩. Qualitative results show that RiT produces smoother predictions for most triplet instances than state-of-the-art methods.

CONCLUSION We present a novel attention-based approach that leverages the temporal fusion of video frames to model the evolution of surgical actions and exploits the resulting temporal cues for surgical triplet recognition.
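The idea of attending over current and past frames can be illustrated with a minimal sketch. This is not the RiT architecture, whose layers are not described in the abstract. It is generic scaled dot-product attention in which the current frame's feature vector acts as the query and the past and current frames act as keys and values. The function names `softmax` and `temporal_attention_fusion`, the feature sizes, and the absence of learned projections are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_fusion(
    current: Sequence[float],
    past: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse the current frame's features with those of earlier frames.

    Hypothetical helper: the current frame is the query; the past frames
    plus the current frame are the keys and values. Returns the fused
    feature vector and the attention weight given to each frame
    (past frames first, current frame last).
    """
    dim = len(current)
    frames = [list(f) for f in past] + [list(current)]
    # Scaled dot-product similarity between the query and every frame.
    scores = [
        sum(q * k for q, k in zip(current, frame)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    # Weighted average of the frame features.
    fused = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(dim)
    ]
    return fused, weights


# Toy usage: the second past frame resembles the current frame, so it
# receives more attention than the first past frame.
fused, weights = temporal_attention_fusion(
    current=[1.0, 0.0],
    past=[[0.0, 1.0], [1.0, 0.0]],
)
```

In a trained model the query, key, and value vectors would normally come from learned projections of per-frame features; the sketch omits them to keep the weighting step visible.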
