Xiao Dong | Xunlin Zhan | Yangxin Wu | Yunchao Wei | Xiaoyong Wei | Minlong Lu | Xiaodan Liang
[1] Xuelong Li,et al. Latent Semantic Minimal Hashing for Image Retrieval , 2017, IEEE Transactions on Image Processing.
[2] Svetlana Lazebnik,et al. Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.
[3] Luca Antiga,et al. Automatic differentiation in PyTorch , 2017 .
[4] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[5] Ling Shao,et al. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[7] Yongfeng Huang,et al. Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval , 2017, IEEE Transactions on Multimedia.
[8] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[9] Yang Zhang,et al. Modality-Agnostic Attention Fusion for visual search with text feedback , 2020, ArXiv.
[10] Yale Song,et al. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Lin Su,et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data , 2020, ArXiv.
[12] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[13] Lei Yang,et al. RPC: A Large-Scale Retail Product Checkout Dataset , 2019, ArXiv.
[14] Xin Huang,et al. An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges , 2017, IEEE Transactions on Circuits and Systems for Video Technology.
[15] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[16] Hedi Ben-younes,et al. Leveraging Weakly Annotated Data for Fashion Image Retrieval and Label Prediction , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).
[17] Longbo Huang,et al. What Makes Multimodal Learning Better than Single (Provably) , 2021, NeurIPS.
[18] Javier R. Movellan,et al. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise , 2009, NIPS.
[19] Antonio Torralba,et al. LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.
[20] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[21] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[22] Sanja Fidler,et al. MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Xiao Dong,et al. Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[24] Hao Wang,et al. FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval , 2020, SIGIR.
[25] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[26] Anoop Cherian,et al. Audio Visual Scene-Aware Dialog , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[28] Tat-Seng Chua,et al. NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.
[29] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[30] Licheng Yu,et al. TVQA: Localized, Compositional Video Question Answering , 2018, EMNLP.
[31] Zhe Gan,et al. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training , 2020, EMNLP.
[32] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[33] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[34] Jun Wang,et al. Deep Multi-modal Latent Representation Learning for Automated Dementia Diagnosis , 2019, MICCAI.
[35] Ruimao Zhang,et al. DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.
[37] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[38] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Hung-yi Lee,et al. Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension , 2018, INTERSPEECH.
[40] Xin Wang,et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[41] Frédéric Jurie,et al. Improving web image search results using query-relative classifiers , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[42] Vicente Ordonez,et al. Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.
[43] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[44] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Jack Hessel,et al. Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think! , 2020, EMNLP.
[46] Zhang Yi,et al. Document clustering using locality preserving indexing and support vector machines , 2008, Soft Comput..
[47] Bayya Yegnanarayana,et al. Combining evidence from residual phase and MFCC features for speaker recognition , 2006, IEEE Signal Processing Letters.
[48] Xiaogang Wang,et al. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] To all authors , 1995 .
[50] Antonio Torralba,et al. Spectral Hashing , 2008, NIPS.
[51] Yoav Artzi,et al. A Corpus for Reasoning about Natural Language Grounded in Photographs , 2018, ACL.
[52] Erik Cambria,et al. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph , 2018, ACL.
[53] Furu Wei,et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.