Leveraging Medical Visual Question Answering with Supporting Facts

In this working notes paper, we describe the IBM Research AI (Almaden) team's participation in the ImageCLEF 2019 VQA-Med competition. The challenge consists of four question-answering tasks over radiology images. The diversity of imaging modalities, organs, and disease types, combined with a small, imbalanced training set, made this a highly complex problem. To overcome these difficulties, we implemented a modular pipeline architecture that leverages transfer learning and multi-task learning. Our findings led to the development of a novel model called the Supporting Facts Network (SFN). The main idea behind SFN is to cross-utilize information from upstream tasks to improve accuracy on harder downstream ones. This approach significantly improved our scores on the validation set (an 18-point gain in F1 score). Finally, we submitted four runs to the competition and placed seventh.
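To make the supporting-facts idea concrete, the sketch below shows one way upstream task predictions can be fed into a harder downstream head. This is a minimal illustrative PyTorch example, not the paper's actual implementation: the class name SupportingFactsNetwork, all layer sizes, and the assumption that the easier tasks are modality, plane, and organ classification while the hard downstream task is abnormality prediction are ours.

```python
import torch
import torch.nn as nn

class SupportingFactsNetwork(nn.Module):
    """Illustrative sketch of the supporting-facts idea: predictions from
    easier upstream tasks are concatenated into the input of the harder
    downstream head. Names and dimensions are assumptions, not the
    authors' implementation."""

    def __init__(self, img_dim=512, txt_dim=300, hid=256,
                 n_modality=8, n_plane=16, n_organ=10, n_abnormality=1500):
        super().__init__()
        # Fuse pre-extracted image features with a pooled question embedding.
        self.fuse = nn.Sequential(nn.Linear(img_dim + txt_dim, hid), nn.ReLU())
        # Upstream (easier) task heads, trained jointly (multi-task learning).
        self.modality_head = nn.Linear(hid, n_modality)
        self.plane_head = nn.Linear(hid, n_plane)
        self.organ_head = nn.Linear(hid, n_organ)
        # The downstream head also consumes the upstream predictions
        # ("supporting facts") alongside the fused representation.
        facts_dim = n_modality + n_plane + n_organ
        self.abnormality_head = nn.Linear(hid + facts_dim, n_abnormality)

    def forward(self, img_feat, txt_feat):
        h = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        modality = self.modality_head(h)
        plane = self.plane_head(h)
        organ = self.organ_head(h)
        # Cross-utilize upstream outputs: softmax each head's logits and
        # append them to the shared representation for the hard task.
        facts = torch.cat([modality.softmax(-1),
                           plane.softmax(-1),
                           organ.softmax(-1)], dim=-1)
        abnormality = self.abnormality_head(torch.cat([h, facts], dim=-1))
        # All four outputs would feed a joint multi-task loss in training.
        return modality, plane, organ, abnormality
```

In a setup like this, cross-entropy losses on all four heads are summed during training, so the easier tasks both regularize the shared encoder and supply explicit evidence to the downstream classifier.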
