Medical Visual Question Answering at ImageCLEF 2019: VQA-Med

This paper describes the submission by PwC US-Advisory to the Medical Domain Visual Question Answering (VQA-Med) task of ImageCLEF 2019. The goal of the challenge was to build a visual question answering system that uses medical images as context to generate answers. Our pipeline first classifies each question into one of two groups: questions answered from a fixed pool of predefined answer categories, and questions that require generating a free-text description of the abnormality seen in the image. For the first group, question embeddings from the Universal Sentence Encoder and image embeddings from ResNet are fed into an attention-based classifier that selects an answer. For the second group, the same ResNet image embeddings, together with word embeddings from a Word2Vec model pre-trained on PubMed data, serve as input to a sequence-to-sequence model that generates descriptions of abnormalities. This methodology achieved reasonable results, with a strict accuracy of 48% and a BLEU score of 53% on the challenge's test data.
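The attention-based fusion of question and image embeddings can be sketched as follows. This is a minimal NumPy illustration, not the submitted model: the dimensions (512-d Universal Sentence Encoder question vectors, a 7×7 grid of 2048-d ResNet region features), the additive attention form, and all weight matrices are assumptions for the sake of the example, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: USE question embedding (512-d),
# ResNet conv features as 49 image regions of 2048-d each.
Q_DIM, IMG_DIM, COMMON, N_CLASSES = 512, 2048, 256, 10

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Randomly initialised projections; a trained model would learn these.
W_q = rng.normal(scale=0.02, size=(Q_DIM, COMMON))
W_v = rng.normal(scale=0.02, size=(IMG_DIM, COMMON))
w_a = rng.normal(scale=0.02, size=(COMMON,))
W_out = rng.normal(scale=0.02, size=(COMMON, N_CLASSES))

def attend_and_classify(question_emb, image_regions):
    """Fuse a question vector with image-region features via additive
    attention, then score the fixed pool of answer categories."""
    q = question_emb @ W_q                  # project question: (COMMON,)
    v = image_regions @ W_v                 # project regions: (49, COMMON)
    scores = np.tanh(v + q) @ w_a           # one attention score per region
    alpha = softmax(scores)                 # attention weights sum to 1
    context = alpha @ v                     # attention-pooled image feature
    logits = (context * q) @ W_out          # elementwise fusion, then classify
    return softmax(logits), alpha

probs, alpha = attend_and_classify(rng.normal(size=Q_DIM),
                                   rng.normal(size=(49, IMG_DIM)))
```

The attention weights `alpha` let the classifier focus on the image regions most relevant to the question before committing to one of the predefined answer categories.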