This paper describes the submission by PwC US-Advisory for the Medical Domain Visual Question Answering (VQA-Med) task of ImageCLEF 2019. The goal of the challenge was to build a visual question answering system that uses medical images as context to generate answers. Our pipeline first classifies each question into one of two groups: questions answered from a fixed pool of predefined answer categories, and questions answered by generating a free-text description of the abnormality seen in the image. The first model feeds question embeddings from the Universal Sentence Encoder and image embeddings from ResNet into an attention-based classifier to select an answer. The second model combines the same ResNet image embeddings with word embeddings from a Word2Vec model pre-trained on PubMed data, which serve as input to a sequence-to-sequence model that generates descriptions of abnormalities. This methodology achieved reasonable results, with a strict accuracy of 48% and a BLEU score of 53% on the challenge's test data.
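To make the first branch concrete, the sketch below shows how a sentence-level question embedding and spatial ResNet features can be fused with attention and classified over a fixed answer pool. This is a minimal PyTorch sketch under stated assumptions, not the authors' exact architecture: the embedding dimensions (512 for the Universal Sentence Encoder, 2048 for ResNet), the hidden size, the number of answer categories, and the additive-attention form are all illustrative choices.

```python
import torch
import torch.nn as nn

class AttentionVQAClassifier(nn.Module):
    """Sketch of the category-classification branch: a question embedding
    attends over spatial image features, and the fused vector is classified
    into one of the fixed answer categories. Dimensions are assumptions."""

    def __init__(self, q_dim=512, img_dim=2048, hidden=1024, n_answers=100):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.att = nn.Linear(hidden, 1)  # scalar attention score per image region
        self.classifier = nn.Sequential(
            nn.Linear(hidden + q_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, q_emb, img_feats):
        # q_emb:     (batch, q_dim)            sentence embedding of the question
        # img_feats: (batch, regions, img_dim) spatial ResNet feature map
        q = self.q_proj(q_emb).unsqueeze(1)       # (batch, 1, hidden)
        v = self.img_proj(img_feats)              # (batch, regions, hidden)
        scores = self.att(torch.tanh(q + v))      # additive attention, (batch, regions, 1)
        weights = torch.softmax(scores, dim=1)    # normalize over image regions
        attended = (weights * v).sum(dim=1)       # question-conditioned image summary
        fused = torch.cat([attended, q_emb], dim=-1)
        return self.classifier(fused)             # logits over answer categories
```

The second branch would follow the same pattern of conditioning on image features, but would replace the classifier head with a sequence-to-sequence decoder over PubMed-trained Word2Vec token embeddings.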