CCMB: A Large-scale Chinese Cross-modal Benchmark
Lin Yao | Jianfei Song | Xiangyang Ji | Henrique Morimitsu | Dawei Leng | Jincheng Li | Chunyu Xie | Xiaoyu Wu | Heng Cai | Fanjing Kong | Dexin Wang | Xiangzheng Zhang | Baochang Zhang | Yafeng Deng