Leveraging Medical Visual Question Answering with Supporting Facts

In this working notes paper, we describe the IBM Research AI (Almaden) team's participation in the ImageCLEF 2019 VQA-Med competition. The challenge consists of four question-answering tasks over radiology images. The diversity of imaging modalities, organs, and disease types, combined with a small, imbalanced training set, made this a highly complex problem. To overcome these difficulties, we implemented a modular pipeline architecture that leverages transfer learning and multi-task learning. Our findings led to the development of a novel model called the Supporting Facts Network (SFN). The main idea behind SFN is to cross-utilize information from upstream tasks to improve accuracy on harder downstream ones. This approach significantly improved our scores on the validation set (an 18-point gain in F1 score). Finally, we submitted four runs to the competition and placed seventh.
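To make the supporting-facts idea concrete, the sketch below shows one way upstream task predictions can be fed into a harder downstream head. This is a minimal illustrative PyTorch example, not the paper's actual implementation: the class name SupportingFactsNetwork, all layer sizes, and the assumption that the easier tasks are modality, plane, and organ classification while the hard downstream task is abnormality prediction are ours.

```python
import torch
import torch.nn as nn

class SupportingFactsNetwork(nn.Module):
    """Illustrative sketch of the supporting-facts idea: predictions from
    easier upstream tasks are concatenated into the input of the harder
    downstream head. Names and dimensions are assumptions, not the
    authors' implementation."""

    def __init__(self, img_dim=512, txt_dim=300, hid=256,
                 n_modality=8, n_plane=16, n_organ=10, n_abnormality=1500):
        super().__init__()
        # Fuse pre-extracted image features with a pooled question embedding.
        self.fuse = nn.Sequential(nn.Linear(img_dim + txt_dim, hid), nn.ReLU())
        # Upstream (easier) task heads, trained jointly (multi-task learning).
        self.modality_head = nn.Linear(hid, n_modality)
        self.plane_head = nn.Linear(hid, n_plane)
        self.organ_head = nn.Linear(hid, n_organ)
        # The downstream head also consumes the upstream predictions
        # ("supporting facts") alongside the fused representation.
        facts_dim = n_modality + n_plane + n_organ
        self.abnormality_head = nn.Linear(hid + facts_dim, n_abnormality)

    def forward(self, img_feat, txt_feat):
        h = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        modality = self.modality_head(h)
        plane = self.plane_head(h)
        organ = self.organ_head(h)
        # Cross-utilize upstream outputs: softmax each head's logits and
        # append them to the shared representation for the hard task.
        facts = torch.cat([modality.softmax(-1),
                           plane.softmax(-1),
                           organ.softmax(-1)], dim=-1)
        abnormality = self.abnormality_head(torch.cat([h, facts], dim=-1))
        # All four outputs would feed a joint multi-task loss in training.
        return modality, plane, organ, abnormality
```

In a setup like this, cross-entropy losses on all four heads are summed during training, so the easier tasks both regularize the shared encoder and supply explicit evidence to the downstream classifier.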
