Single-Modal Entropy based Active Learning for Visual Question Answering

Constructing large-scale labeled datasets in the real world, especially for high-level tasks such as Visual Question Answering (VQA), is expensive and time-consuming. Moreover, with ever-growing amounts of data and increasing architecture complexity, Active Learning has become an important area of computer vision research. In this work, we address Active Learning in the multi-modal setting of VQA. Given the two input modalities, image and question, we propose a novel method for effective sample acquisition that attaches an ad hoc single-modal branch to each input to leverage its information. Our mutual-information-based acquisition strategy, the Single-Modal Entropic Measure (SMEM), together with a self-distillation technique, enables the acquisition function to exploit all present modalities and select the most informative samples. The method is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks. We validate our approach on several VQA datasets, achieving state-of-the-art performance against existing Active Learning baselines.
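
To make the acquisition idea concrete, below is a minimal PyTorch sketch of entropy-based scoring over single-modal branches. It is an illustration only: the mixing weight alpha, the function names, and the way single-modal and joint entropies are combined are assumptions for this sketch, not the exact SMEM formulation.

    import torch

    def entropy(logits):
        # Shannon entropy of the softmax distribution, computed per sample.
        log_p = torch.log_softmax(logits, dim=-1)
        return -(log_p.exp() * log_p).sum(dim=-1)

    def smem_scores(image_logits, question_logits, joint_logits, alpha=0.5):
        # Assumed combination: weight the entropies of the image-only and
        # question-only branches, then add the joint model's entropy, so
        # that samples uncertain under every view score highest.
        single = alpha * entropy(image_logits) \
                 + (1.0 - alpha) * entropy(question_logits)
        return single + entropy(joint_logits)

    # Usage: score an unlabeled pool and query the top-k samples for labeling.
    pool_size, num_answers, k = 1000, 3129, 64
    image_logits = torch.randn(pool_size, num_answers)     # image-only branch
    question_logits = torch.randn(pool_size, num_answers)  # question-only branch
    joint_logits = torch.randn(pool_size, num_answers)     # full VQA model
    scores = smem_scores(image_logits, question_logits, joint_logits)
    _, query_indices = torch.topk(scores, k)               # samples to annotate

In a full system, the single-modal branches would presumably share the joint model's backbone features and be trained with the self-distillation objective mentioned above; those details are beyond this sketch.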
