Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation

In this work, we address the problem of missing modalities in the Visual Question Answer-Difference prediction task and propose a novel method to solve it. The missing modality is the set of ground-truth answers, which is available during training but absent at test time; we handle this through a privileged knowledge distillation scheme. Specifically, we first introduce a model, the "Big" Teacher, that takes the image/question/answer triplet as input and outperforms the baseline, and then use a combination of models to distill knowledge into a target (student) network that takes only the image/question pair as input. We evaluate our models on the VizWiz and VQA-V2 Answer Difference datasets, and through extensive experiments and ablations we demonstrate the performance of our method and point to diverse possibilities for future research.
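The privileged distillation idea described above can be sketched with a standard temperature-scaled distillation loss in the style of Hinton et al.: the teacher (which saw image, question, and answers) produces soft targets that supervise a student seeing only image and question. The function names and the temperature value below are illustrative assumptions, not the paper's exact formulation; a minimal NumPy sketch:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def privileged_kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions,
    scaled by T^2 so gradients keep the same magnitude across temperatures.
    The teacher's logits come from a model that also saw the privileged
    modality (the ground-truth answers); the student saw only image/question."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return (T ** 2) * kl.mean()

# Toy batch of 2 examples over 4 answer-difference classes.
teacher = np.array([[2.0, 0.5, -1.0, 0.0], [0.1, 3.0, 0.2, -0.5]])
student = np.array([[1.5, 0.7, -0.8, 0.1], [0.0, 2.5, 0.4, -0.3]])
loss = privileged_kd_loss(student, teacher)
```

In practice this term would be combined with the task loss on the ground-truth answer-difference labels; the loss is zero exactly when the student matches the teacher's softened distribution.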
