Aftab Alam | Young-Koo Lee | Muhammad Numan Khan | Tariq Habib Afridi | Jawad Khan
[1] Yuandong Tian, et al. Simple Baseline for Visual Question Answering, 2015, ArXiv.
[2] Michael Wiegand, et al. A Survey on Hate Speech Detection using Natural Language Processing, 2017, SocialNLP@EACL.
[3] Hongxia Yang, et al. InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, 2020, ArXiv.
[4] Md. Zakir Hossain, et al. A Comprehensive Survey of Deep Learning for Image Captioning, 2018, ACM Comput. Surv.
[5] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.
[6] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Yu Cheng, et al. Large-Scale Adversarial Training for Vision-and-Language Representation Learning, 2020, NeurIPS.
[8] Noel Crespi, et al. A BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media, 2019, COMPLEX NETWORKS.
[9] Mohamed Atri, et al. A comparative study of CFs, LBP, HOG, SIFT, SURF, and BRIEF techniques for face recognition, 2018, Defense + Security.
[10] Marcus Rohrbach, et al. 12-in-1: Multi-Task Vision and Language Representation Learning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[12] Matti Pietikäinen, et al. Deep Learning for Generic Object Detection: A Survey, 2018, International Journal of Computer Vision.
[13] Louis-Philippe Morency, et al. Multimodal Machine Learning: A Survey and Taxonomy, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[14] Meng Zhang, et al. Captioning Images Taken by People Who Are Blind, 2020, ECCV.
[15] Yi Yang, et al. ActBERT: Learning Global-Local Video-Text Representations, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Dietrich Klakow, et al. Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, 2019, J. Artif. Intell. Res.
[17] Hanwang Zhang, et al. Visual Commonsense R-CNN, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Dhruv Batra, et al. Analyzing the Behavior of Visual Question Answering Models, 2016, EMNLP.
[19] Donald E. Brown, et al. Text Classification Algorithms: A Survey, 2019, Inf.
[20] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[21] Jianlong Fu, et al. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, 2020, ArXiv.
[22] Quoc V. Le, et al. Semi-supervised Sequence Learning, 2015, NIPS.
[23] Nan Duan, et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, 2019, AAAI.
[24] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[25] Cordelia Schmid, et al. VideoBERT: A Joint Model for Video and Language Representation Learning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[26] David Reitter, et al. Fusion of Detected Objects in Text for Visual Question Answering, 2019, EMNLP.
[27] Margaret Mitchell, et al. VQA: Visual Question Answering, 2015, International Journal of Computer Vision.
[28] Lin Su, et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, 2020, ArXiv.
[29] Xilin Chen, et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, 2020, ArXiv.
[30] Yue Wang, et al. VD-BERT: A Unified Vision and Dialog Transformer with BERT, 2020, EMNLP.
[31] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[32] Hao Wang, et al. FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval, 2020, SIGIR.
[33] Yoav Artzi, et al. A Corpus for Reasoning about Natural Language Grounded in Photographs, 2018, ACL.
[34] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[35] Jingren Zhou, et al. InterBERT: An Effective Multi-Modal Pretraining Approach via Vision-and-Language Interaction, 2020.
[36] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.
[37] Trevor Darrell, et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, 2016, EMNLP.
[38] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[40] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[41] Ali Farhadi, et al. From Recognition to Cognition: Visual Commonsense Reasoning, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[43] Jianfeng Gao, et al. Unified Vision-Language Pre-Training for Image Captioning and VQA, 2020, AAAI.
[44] Licheng Yu, et al. UNITER: Learning UNiversal Image-TExt Representations, 2019, ArXiv.
[45] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[46] Sebastian Ruder, et al. Universal Language Model Fine-tuning for Text Classification, 2018, ACL.
[47] Saurabh Gupta, et al. Exploring Nearest Neighbor Approaches for Image Captioning, 2015, ArXiv.
[48] Svetlana Lazebnik, et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[49] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.
[50] Snehasis Mukherjee, et al. Visual Question Answering using Deep Learning: A Survey and Performance Analysis, 2019, ArXiv.
[51] Hyunju Lee, et al. MC-BERT4HATE: Hate Speech Detection using Multi-channel BERT for Different Languages and Translations, 2019, 2019 International Conference on Data Mining Workshops (ICDMW).
[52] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Amanpreet Singh, et al. The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, 2020, NeurIPS.
[54] Christopher Kanan, et al. Challenges and Prospects in Vision and Language Research, 2019, Front. Artif. Intell.
[55] Lluis Gomez, et al. Exploring Hate Speech Detection in Multimodal Publications, 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[56] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[57] Sérgio Nunes, et al. A Survey on Automatic Detection of Hate Speech in Text, 2018, ACM Comput. Surv.
[58] Zhe Gan, et al. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, 2020, EMNLP.
[59] Cho-Jui Hsieh, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language, 2019, ArXiv.
[60] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[61] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[62] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.
[63] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, ArXiv.
[64] Cordelia Schmid, et al. Learning Video Representations using Contrastive Bidirectional Transformer, 2019.
[65] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.