Ensembling Visual Explanations for VQA

Explanations make AI systems more transparent and help justify their predictions. The top-ranked Visual Question Answering (VQA) systems are ensembles of multiple systems; however, there has been no work on generating explanations for such ensembles. In this paper, we propose several methods for ensembling visual explanations for VQA using the localization maps of the component systems. Our crowd-sourced human evaluation indicates that the ensemble visual explanation is superior to each individual system’s visual explanation, although the results vary with the individual system that the ensemble is compared against and with the number of individual systems that agree with the ensemble model’s answer. Overall, our ensemble explanation is judged better 63% of the time when compared to any individual system’s explanation. Our algorithm is also efficient and scales linearly in the number of component systems in the ensemble.
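As a rough illustration of combining component localization maps, the sketch below averages the maps of the systems whose answer matches the ensemble's answer. It is a minimal sketch under stated assumptions, not necessarily one of the methods proposed here: the function name `ensemble_explanation`, the per-system `weights`, the sum-to-one normalization, and the rule of dropping disagreeing systems are all hypothetical choices made for the example.

```python
import numpy as np

def ensemble_explanation(maps, answers, ensemble_answer, weights=None):
    """Weighted average of the localization maps of agreeing systems.

    maps            : list of 2-D arrays, one localization map per system
    answers         : list of answers, one per system
    ensemble_answer : the answer chosen by the ensemble
    weights         : optional per-system weights (assumed; defaults to 1.0)
    """
    if weights is None:
        weights = [1.0] * len(maps)

    combined = np.zeros_like(np.asarray(maps[0], dtype=float))
    total_weight = 0.0

    # Single pass over the systems, so the cost grows linearly with their number.
    for system_map, answer, weight in zip(maps, answers, weights):
        if answer != ensemble_answer:
            continue  # assumption: disagreeing systems are ignored
        system_map = np.asarray(system_map, dtype=float)
        mass = system_map.sum()
        if mass > 0:
            system_map = system_map / mass  # normalize each map to sum to 1
        combined += weight * system_map
        total_weight += weight

    if total_weight == 0:
        return combined  # no system agreed: return an all-zero map
    return combined / total_weight


# Toy usage: three 2x2 maps, two systems agree with the ensemble answer "cat".
maps = [
    np.array([[1, 0], [0, 0]]),
    np.array([[0, 1], [0, 0]]),
    np.array([[0, 0], [0, 4]]),
]
answers = ["cat", "cat", "dog"]
result = ensemble_explanation(maps, answers, "cat")
```

In this toy case the two agreeing maps each place all of their mass on a different top-row cell, so the combined map splits its mass evenly between those two cells and ignores the third system.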