Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
Kun Xu | Yin Li | Zhengyuan Yang | Dong Yu | Liwei Wang | Jing Huang
[1] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.
[2] Ajay Divakaran,et al. Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[3] Zhe L. Lin,et al. Top-Down Neural Attention by Excitation Backprop , 2016, International Journal of Computer Vision.
[4] Vicente Ordonez,et al. ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.
[5] Liwei Wang,et al. Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[6] Fang Zhao,et al. Weakly Supervised Phrase Localization with Multi-scale Anchored Transformer Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[7] Leonid Sigal,et al. G3raphGround: Graph-Based Language Grounding , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[8] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.
[9] Nikos Komodakis,et al. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer , 2016, ICLR.
[10] Sergio Guadarrama,et al. Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Markus H. Gross,et al. Neural Sequential Phrase Grounding (SeqGROUND) , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Yoshua Bengio,et al. FitNets: Hints for Thin Deep Nets , 2014, ICLR.
[13] Yoshua Bengio,et al. Learning deep representations by mutual information estimation and maximization , 2018, ICLR.
[14] Robinson Piramuthu,et al. Conditional Image-Text Embedding Networks , 2017, ECCV.
[15] Trevor Darrell,et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.
[16] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[17] Juan Carlos Niebles,et al. Spatio-Temporal Graph for Video Captioning With Knowledge Distillation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Kaiming He,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Phillip Isola,et al. Contrastive Multiview Coding , 2019, ECCV.
[20] Geoffrey E. Hinton,et al. Distilling the Knowledge in a Neural Network , 2015, ArXiv.
[21] Jinjun Xiong,et al. Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts , 2018, NIPS.
[22] Rich Caruana,et al. Model compression , 2006, KDD '06.
[23] Jiebo Luo,et al. Improving One-stage Visual Grounding by Recursive Sub-query Construction , 2020, ECCV.
[24] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[25] Lucia Specia,et al. Phrase Localization Without Paired Training Examples , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[26] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[27] Armand Joulin,et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.
[28] Jan Kautz,et al. Contrastive Learning for Weakly Supervised Phrase Grounding , 2020, ECCV.
[29] Jiebo Luo,et al. A Fast and Accurate One-Stage Approach to Visual Grounding , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[30] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[31] Xinlei Chen,et al. Grounded Video Description , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] George A. Miller,et al. WordNet: A Lexical Database for English , 1995, HLT.
[33] Minh N. Do,et al. Unsupervised Textual Grounding: Linking Words to Image Concepts , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[34] Shu Kong,et al. Modularized Textual Grounding for Counterfactual Resilience , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Jitendra Malik,et al. Cross Modal Distillation for Supervision Transfer , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Bing Li,et al. Object Relational Graph With Teacher-Recommended Learning for Video Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Vittorio Murino,et al. Modality Distillation with Multiple Stream Networks for Action Recognition , 2018, ECCV.
[38] Hanwang Zhang,et al. More Grounded Image Captioning by Distilling Image-Text Matching Model , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Bohyung Han,et al. Learning to Specialize with Knowledge Distillation for Visual Question Answering , 2018, NeurIPS.
[40] Yong Jae Lee,et al. Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Steven Bird,et al. NLTK: The Natural Language Toolkit , 2002, ACL.
[42] Shih-Fu Chang,et al. Grounding Referring Expressions in Images by Variational Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[43] Luowei Zhou,et al. Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction , 2018, BMVC.
[44] Ramakant Nevatia,et al. Knowledge Aided Consistency for Weakly Supervised Phrase Grounding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[45] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[46] Yifan Gong,et al. Large-Scale Domain Adaptation via Teacher-Student Learning , 2017, INTERSPEECH.
[47] Rich Caruana,et al. Do Deep Nets Really Need to be Deep? , 2013, NIPS.
[48] Juan Carlos Niebles,et al. Graph Distillation for Action Detection with Privileged Modalities , 2017, ECCV.
[49] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[50] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.
[51] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[52] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[53] Trevor Darrell,et al. Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[54] Ali Farhadi,et al. YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[55] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[56] Thanh-Toan Do,et al. Compact Trilinear Interaction for Visual Question Answering , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[57] Trevor Darrell,et al. Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.