Paper Information - Hierarchical Deep Multi-modal Network for Medical Visual Question Answering

Hierarchical Deep Multi-modal Network for Medical Visual Question Answering

Abstract: Visual Question Answering in the medical domain (VQA-Med) plays an important role in providing medical assistance to end-users. These users are expected to raise either a straightforward question with a Yes/No answer or a challenging question that requires a detailed, descriptive answer. Existing VQA-Med techniques fail to distinguish between these question types, and as a result they sometimes complicate the simpler problems or over-simplify the complicated ones. At the same time, using several distinct systems for different question types can cause confusion and discomfort for end-users. To address this issue, we propose a hierarchical deep multi-modal network that analyzes and classifies end-user questions/queries and then applies a query-specific approach for answer prediction. We refer to our proposed approach as Hierarchical Question Segregation based Visual Question Answering, in short HQS-VQA. Our contributions are three-fold: first, we propose a question segregation (QS) technique for VQA-Med; second, we integrate the QS model into the hierarchical deep multi-modal neural network to generate proper answers to queries related to medical images; and third, we study the impact of QS in Medical-VQA by comparing the performance of the proposed model with QS against a model without QS. We evaluate the performance of our proposed model on two benchmark datasets, viz. RAD and CLEF18. Experimental results show that our proposed HQS-VQA technique outperforms the baseline models by significant margins. We also conduct a detailed quantitative and qualitative analysis of the obtained results and identify potential causes of errors and their solutions.
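The two-stage idea described in the abstract (segregate the question first, then hand it to a question-type-specific answer model) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the first-word heuristic in `segregate_question` is only a stand-in for the learned QS model, and `answer_yes_no`, `answer_descriptive`, `BRANCHES`, and `hqs_vqa` are hypothetical placeholders rather than the authors' implementation.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical first-word cues for Yes/No questions.
# This toy heuristic only stands in for a learned question-segregation model.
YES_NO_STARTERS = {
    "is", "are", "was", "were", "does", "do", "did", "can", "has", "have",
}


def segregate_question(question: str) -> str:
    """Return 'yes_no' or 'descriptive' based on the first word of the question."""
    words = question.strip().lower().split()
    first = words[0] if words else ""
    return "yes_no" if first in YES_NO_STARTERS else "descriptive"


def answer_yes_no(image: Any, question: str) -> str:
    """Placeholder for a binary answer model; always returns a fixed string."""
    return "yes"


def answer_descriptive(image: Any, question: str) -> str:
    """Placeholder for a descriptive answer model; always returns a fixed string."""
    return "<descriptive answer>"


# One answer branch per question type (the hierarchy's second level).
BRANCHES: Dict[str, Callable[[Any, str], str]] = {
    "yes_no": answer_yes_no,
    "descriptive": answer_descriptive,
}


def hqs_vqa(image: Any, question: str) -> Tuple[str, str]:
    """Segregate the question, then route it to the matching answer branch."""
    question_type = segregate_question(question)
    return question_type, BRANCHES[question_type](image, question)


if __name__ == "__main__":
    print(hqs_vqa(None, "Does the CT scan show a mass?"))
    print(hqs_vqa(None, "What abnormality is seen in the image?"))
```

In a real system each branch would be a trained multi-modal model that takes both image and question features; the sketch only shows the routing structure that a with-QS versus without-QS comparison would vary.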
