Hierarchical Deep Multi-modal Network for Medical Visual Question Answering

Abstract: Visual Question Answering in the medical domain (VQA-Med) plays an important role in providing medical assistance to end-users. These users are expected to raise either a straightforward question with a Yes/No answer or a challenging question that requires a detailed, descriptive answer. Existing VQA-Med techniques fail to distinguish between these question types, which sometimes complicates the simpler problems or oversimplifies the complicated ones. At the same time, maintaining several distinct systems for different question types can cause confusion and discomfort for end-users. To address this issue, we propose a hierarchical deep multi-modal network that analyzes and classifies end-user questions/queries and then applies a query-specific approach for answer prediction. We refer to our proposed approach as Hierarchical Question Segregation based Visual Question Answering, HQS-VQA for short. Our contributions are three-fold: first, we propose a question segregation (QS) technique for VQA-Med; second, we integrate the QS model into the hierarchical deep multi-modal neural network to generate proper answers to queries related to medical images; and third, we study the impact of QS in Medical-VQA by comparing the performance of the proposed model with QS against a model without QS. We evaluate the proposed model on two benchmark datasets, RAD and CLEF18. Experimental results show that the proposed HQS-VQA technique outperforms the baseline models by significant margins. We also conduct a detailed quantitative and qualitative analysis of the obtained results and identify potential causes of errors and their solutions.
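The two-stage idea described in the abstract (first segregate the question by type, then hand it to a type-specific answer model) can be illustrated with a minimal sketch. This is not the authors' implementation: the rule-based `segregate_question` function, the `YES_NO` and `DESCRIPTIVE` labels, the list of leading auxiliary verbs, and the stub branch models are all hypothetical stand-ins, assumed here only to show the routing structure. In the actual system, both the question classifier and the answer branches would be learned models operating on image and text features.

```python
from typing import Any, Callable, Dict

# Hypothetical labels for the two question types mentioned in the abstract.
YES_NO = "yes_no"
DESCRIPTIVE = "descriptive"

# Assumption: a question that opens with an auxiliary verb expects a Yes/No
# answer. This crude rule only stands in for a trained question classifier.
_YES_NO_STARTERS = {
    "is", "are", "was", "were", "do", "does", "did",
    "can", "could", "has", "have", "will", "would", "should",
}

# A branch takes (question, image) and returns an answer string.
AnswerBranch = Callable[[str, Any], str]


def segregate_question(question: str) -> str:
    """Assign a question to one of the two hypothetical types."""
    words = question.strip().lower().split()
    if words and words[0] in _YES_NO_STARTERS:
        return YES_NO
    return DESCRIPTIVE


def answer(question: str, image: Any, branches: Dict[str, AnswerBranch]) -> str:
    """Route the question to the branch registered for its predicted type."""
    question_type = segregate_question(question)
    return branches[question_type](question, image)


# Stub branches; real ones would be separate multi-modal answer models.
stub_branches: Dict[str, AnswerBranch] = {
    YES_NO: lambda question, image: "yes",
    DESCRIPTIVE: lambda question, image: "<descriptive answer>",
}

if __name__ == "__main__":
    for q in ("Is there a fracture?", "What abnormality is seen in the image?"):
        print(segregate_question(q), "->", answer(q, None, stub_branches))
```

Keeping the segregation step separate from the answer branches means a single entry point serves both question types, while each branch can be trained and evaluated on its own.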
