Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Pathology visual question answering (PathVQA) aims to answer medical questions posed about pathology images. Despite its great potential in healthcare, it has not been widely adopted because answering requires modeling interactions between the image (vision) and the question (language). Existing methods treat vision and language features independently and therefore cannot capture the high- and low-level interactions required for VQA. Moreover, these methods offer no means of interpreting the retrieved answers, which remain obscure to humans; model interpretability for justifying the retrieved answers has remained largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (image) and language (question) features for interpretable PathVQA. We present an interpretable transformer-based Path-VQA model (TraP-VQA), in which transformer encoder layers are fed vision and language features extracted by a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer then upsamples the encoded features for the final PathVQA prediction. Our experiments showed that TraP-VQA outperformed state-of-the-art comparative methods on the public PathVQA dataset. Further experiments validated the robustness of our model on another medical VQA dataset, and an ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present visualization results for both text and images that explain the reasoning behind a retrieved answer in PathVQA.
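To make the described pipeline concrete, below is a minimal sketch (not the authors' code) of the TraP-VQA idea: vision features from a pre-trained CNN and question features from a domain-specific LM are projected into a shared space, fused by a transformer encoder, and a transformer decoder upsamples the fused sequence before a classifier predicts the answer. The feature dimensions, layer counts, and the single learned answer query are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class TraPVQASketch(nn.Module):
    """Illustrative vision-language transformer for PathVQA-style answer classification."""

    def __init__(self, img_feat_dim=2048, txt_feat_dim=768,
                 d_model=512, n_heads=8, n_enc_layers=4,
                 n_dec_layers=2, num_answers=100):
        super().__init__()
        # Project CNN image features and LM question features to a shared width.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.txt_proj = nn.Linear(txt_feat_dim, d_model)

        # Transformer encoder fuses the concatenated vision-language sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)

        # Transformer decoder attends to the fused sequence via a learned query.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_dec_layers)
        self.answer_query = nn.Parameter(torch.randn(1, 1, d_model))

        # Answer head (PathVQA is typically framed as answer classification).
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_regions, img_feat_dim) from a pre-trained CNN
        # txt_feats: (B, N_tokens, txt_feat_dim) from a biomedical language model
        v = self.img_proj(img_feats)
        t = self.txt_proj(txt_feats)
        fused = self.encoder(torch.cat([v, t], dim=1))
        query = self.answer_query.expand(img_feats.size(0), -1, -1)
        decoded = self.decoder(query, fused)
        return self.classifier(decoded.squeeze(1))


# Toy usage with random tensors standing in for CNN / LM outputs.
model = TraPVQASketch()
img = torch.randn(2, 49, 2048)   # e.g. a flattened 7x7 CNN feature map
txt = torch.randn(2, 20, 768)    # e.g. 20 question-token embeddings
logits = model(img, txt)         # shape (2, num_answers)
```

Because the fused sequence is processed by standard attention layers, the attention weights over image regions and question tokens can be inspected to produce the kind of visual and textual explanations the abstract refers to.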
