Content-based retrieval of video segments from minimally invasive surgery videos using deep convolutional video descriptors and iterative query refinement

Despite strong evidence of the clinical and economic benefits of minimally invasive surgery (MIS) for many common surgical procedures, MIS is grossly underutilized in many US hospitals, potentially due to its steep learning curve. Intraoperative videos captured using a camera inserted into the body during MIS procedures are emerging as an invaluable resource for MIS education, skill assessment, and quality assurance. However, these videos often last several hours, and there is a pressing need for automated tools to help surgeons quickly find key semantic segments of interest within MIS videos. In this paper, we present a novel integrated approach for content-based retrieval of video segments that are semantically similar to a query video within a large collection of MIS videos. We use state-of-the-art deep 3D convolutional neural network (CNN) models pre-trained on large public video classification datasets to extract spatiotemporal features from MIS video segments. We then employ an iterative query refinement (IQR) strategy wherein a support vector machine (SVM) classifier, trained online on relevance feedback from the user, iteratively refines the search results. We show that our method outperforms the state of the art on the SurgicalActions160 dataset, which contains 160 video clips of typical surgical actions in gynecologic MIS procedures.
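
The retrieval loop described in the abstract can be illustrated with a minimal sketch of one refinement round. It is not the authors' implementation: the 2-D vectors are hypothetical stand-ins for CNN descriptors, the cosine-similarity initial ranking is an assumption, the user's feedback is simulated from made-up ground-truth labels, and the function names are invented for illustration. The classifier is scikit-learn's `SVC` with a linear kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D stand-ins for the CNN descriptors of eight video segments.
descriptors = np.array([
    [0.9, 0.8], [0.2, 0.9], [0.5, 0.7], [0.8, 0.9],  # relevant to the query
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.0], [0.3, 0.1],  # not relevant
])
# Made-up ground truth, used only to simulate the user's feedback.
is_relevant = np.array([1, 1, 1, 1, 0, 0, 0, 0])
query = np.array([0.9, 0.5])


def rank_by_cosine(query, descriptors):
    """Initial ranking (assumed): segments sorted by cosine similarity to the query."""
    norms = np.linalg.norm(descriptors, axis=1) * np.linalg.norm(query)
    return np.argsort(-(descriptors @ query) / norms)


def refine(descriptors, feedback_idx, feedback_labels):
    """One refinement round: fit an SVM on the labeled segments, then
    re-rank every segment by its signed distance to the decision boundary."""
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(descriptors[feedback_idx], feedback_labels)
    return np.argsort(-clf.decision_function(descriptors))


def precision_at_k(ranking, k):
    """Fraction of the top-k ranked segments that are relevant."""
    return float(is_relevant[ranking[:k]].mean())


initial = rank_by_cosine(query, descriptors)
shown = initial[:4]                      # results the user inspects
feedback = is_relevant[shown]            # simulated relevance feedback
refined = refine(descriptors, shown, feedback)

print(precision_at_k(initial, 4), precision_at_k(refined, 4))
```

On this toy data the initial top four contain two relevant segments (precision 0.5), and after one feedback round all four are relevant (precision 1.0). Repeating the `refine` step with accumulated feedback would give further iterations; each round needs at least one relevant and one non-relevant label, since the SVM cannot be fitted on a single class.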
