LABRAD-OR: Lightweight Memory Scene Graphs for Accurate Bimodal Reasoning in Dynamic Operating Rooms

Modern surgeries are performed in complex and dynamic settings that involve constantly changing interactions between medical staff, patients, and equipment. Holistic modeling of the operating room (OR) is therefore a challenging but essential task, with the potential to optimize the performance of surgical teams and to aid the development of new surgical technologies that improve patient outcomes. Representing surgical scenes holistically as semantic scene graphs (SGG), where entities are nodes and the relations between them are edges, is a promising direction for fine-grained semantic OR understanding. We propose, for the first time, the use of temporal information for more accurate and consistent holistic OR modeling. Specifically, we introduce memory scene graphs, where the scene graphs of previous time steps act as the temporal representation guiding the current prediction. We design an end-to-end architecture that fuses the temporal information of our lightweight memory scene graphs with the visual information from point clouds and images. We evaluate our method on the 4D-OR dataset and demonstrate that integrating temporality leads to more accurate and consistent results, achieving a +5% increase and a new state of the art (SOTA) of 0.88 in macro F1. This work opens the path toward representing the entire surgery history with memory scene graphs and improves holistic understanding of the OR. Introducing scene graphs as memory representations can offer a valuable tool for many temporal understanding tasks.
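
To make the idea of a memory scene graph more concrete, the following is a minimal Python sketch of the data structure only, not the architecture described in the paper. It assumes a scene graph can be stored as a set of (subject, relation, object) triplets and that the memory is a bounded buffer of the most recent graphs. The names `SceneGraph`, `SceneGraphMemory`, and `max_len`, as well as the entity and relation labels, are hypothetical and chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, FrozenSet, List, Set, Tuple

# A relation between two entities: (subject, relation, object).
Triplet = Tuple[str, str, str]


@dataclass(frozen=True)
class SceneGraph:
    """Scene graph of one time step: entities are nodes, relations are edges."""
    timestep: int
    triplets: FrozenSet[Triplet]

    def nodes(self) -> Set[str]:
        # Every entity that appears as a subject or an object is a node.
        return {entity for s, _, o in self.triplets for entity in (s, o)}


class SceneGraphMemory:
    """Bounded buffer holding the scene graphs of previous time steps.

    Assumption: only the `max_len` most recent graphs are kept; how much
    history the actual method uses is not specified here.
    """

    def __init__(self, max_len: int) -> None:
        self._graphs: Deque[SceneGraph] = deque(maxlen=max_len)

    def add(self, graph: SceneGraph) -> None:
        # Appending beyond max_len silently drops the oldest graph.
        self._graphs.append(graph)

    def history(self) -> List[SceneGraph]:
        # Oldest first, so a model can consume the graphs in temporal order.
        return list(self._graphs)


# Illustrative usage with made-up labels.
memory = SceneGraphMemory(max_len=2)
memory.add(SceneGraph(0, frozenset({("nurse", "preparing", "instrument_table")})))
memory.add(SceneGraph(1, frozenset({("surgeon", "drilling", "patient")})))
memory.add(SceneGraph(2, frozenset({("surgeon", "suturing", "patient")})))
past_timesteps = [g.timestep for g in memory.history()]
```

In this sketch the graphs returned by `history()` would play the role of the temporal context for predicting the graph of the current time step; how that context is encoded and fused with the point-cloud and image features is left out.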
