PS-DeVCEM: Pathology-sensitive deep learning model for video capsule endoscopy based on weakly labeled data

Abstract: We propose a novel pathology-sensitive deep learning model (PS-DeVCEM) for frame-level anomaly detection and multi-label classification of different colon diseases in video capsule endoscopy (VCE) data. The model copes with a key challenge: the heterogeneous appearance of the colon caused by several types of diseases. It is driven by attention-based deep multiple instance learning and is trained end-to-end on weakly labeled data, using video labels instead of detailed frame-by-frame annotation. This makes it a cost-effective approach for analyzing large VCE repositories. Other advantages include its ability to temporally localize gastrointestinal anomalies within a video, and its generality, in the sense that abnormal frame detection is based on automatically derived image features. Spatial and temporal features are obtained through ResNet50 and residual long short-term memory (residual LSTM) blocks, respectively. Additionally, the learned temporal attention module provides the importance of each frame to the final label prediction. Moreover, we developed a self-supervision method to maximize the distance between classes of pathologies. We demonstrate through qualitative and quantitative experiments that our weakly supervised learning model achieves superior precision and F1-score, reaching 61.6% and 55.1% respectively, compared with three state-of-the-art video analysis methods. We also show the model's ability to temporally localize frames with pathologies without frame annotation information during training. Furthermore, we collected and annotated the first and largest VCE dataset with only video labels. The dataset contains 455 short video segments with 28,304 frames and 14 classes of colorectal diseases and artifacts. Dataset and code supporting this publication will be made available on our home page.
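
The temporal attention idea mentioned in the abstract can be illustrated with a minimal sketch of attention-based multiple instance pooling, in which a video is a bag of frame feature vectors and only a video-level label is available. This is not the authors' implementation: the scoring form (a tanh layer followed by a softmax over frames) is the commonly used attention-MIL formulation and is assumed here, and the names `attention_mil_pool`, `V`, and `w` are hypothetical. In PS-DeVCEM the frame features would come from the ResNet50 and residual LSTM blocks; here they are small hand-written vectors.

```python
import math

def attention_mil_pool(frame_features, V, w):
    """Pool per-frame features into one video embedding using attention.

    frame_features: list of T feature vectors, each of length D
    V: D x H projection matrix (list of D rows); w: vector of length H
    Returns (video_embedding, frame_weights). The weights sum to 1 and
    can be read as per-frame importance for temporal localization.
    """
    scores = []
    for h in frame_features:
        # Hidden attention representation: tanh(h V), length H
        hidden = [math.tanh(sum(h[d] * V[d][j] for d in range(len(h))))
                  for j in range(len(w))]
        # Scalar relevance score for this frame: w . hidden
        scores.append(sum(w[j] * hidden[j] for j in range(len(w))))

    # Numerically stable softmax over the frames of the video
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Video embedding: attention-weighted sum of the frame features
    dim = len(frame_features[0])
    embedding = [sum(weights[t] * frame_features[t][d]
                     for t in range(len(frame_features)))
                 for d in range(dim)]
    return embedding, weights

# Toy video: 3 frames with 2-dimensional features (illustrative values only)
frames = [[0.1, 0.2], [2.0, -1.0], [0.0, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0]]  # assumed parameters, normally learned
w = [1.0, -1.0]               # assumed parameters, normally learned
embedding, weights = attention_mil_pool(frames, V, w)
```

A video-level classifier would then be applied to `embedding`, while `weights` indicates which frames contributed most to the prediction, without any frame-level labels being used in training.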
