Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation

In this work, we address the problem of missing modalities in the Visual Question Answer-Difference prediction task and propose a novel method to solve it. The missing modality is the set of ground-truth answers, which is available during training but absent at test time; we handle this through a privileged knowledge distillation scheme. Specifically, we first introduce a model, the "Big" Teacher, that takes the image/question/answer triplet as input and outperforms the baseline, and then use a combination of models to distill knowledge into a target (student) network that takes only the image/question pair as input. We evaluate our models on the VizWiz and VQA-V2 Answer Difference datasets, and through extensive experiments and ablations we demonstrate the performance of our method and point to diverse possibilities for future research.
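The privileged distillation idea described above can be sketched with a standard temperature-scaled distillation loss in the style of Hinton et al.: the teacher (which saw image, question, and answers) produces soft targets that supervise a student seeing only image and question. The function names and the temperature value below are illustrative assumptions, not the paper's exact formulation; a minimal NumPy sketch:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def privileged_kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions,
    scaled by T^2 so gradients keep the same magnitude across temperatures.
    The teacher's logits come from a model that also saw the privileged
    modality (the ground-truth answers); the student saw only image/question."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return (T ** 2) * kl.mean()

# Toy batch of 2 examples over 4 answer-difference classes.
teacher = np.array([[2.0, 0.5, -1.0, 0.0], [0.1, 3.0, 0.2, -0.5]])
student = np.array([[1.5, 0.7, -0.8, 0.1], [0.0, 2.5, 0.4, -0.3]])
loss = privileged_kd_loss(student, teacher)
```

In practice this term would be combined with the task loss on the ground-truth answer-difference labels; the loss is zero exactly when the student matches the teacher's softened distribution.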
