VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Wang | Jiawei Wu | Junkun Chen | Lei Li | Yuan-Fang Wang | William Yang Wang
[1] Joost van de Weijer, et al. LIUM-CVC Submissions for WMT18 Multimodal Translation Task, 2018, WMT.
[2] Juan Carlos Niebles, et al. Dense-Captioning Events in Videos, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[3] Samy Bengio, et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015, NIPS.
[4] David J. Crandall, et al. Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation, 2017, ArXiv.
[5] Juan Carlos Niebles, et al. Title Generation for User Generated Videos, 2016, ECCV.
[6] Bernard Ghanem, et al. ActivityNet: A large-scale video benchmark for human activity understanding, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Larry S. Davis, et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Ali Farhadi, et al. Actions ~ Transformations, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Bernt Schiele, et al. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[10] Margaret Mitchell, et al. VQA: Visual Question Answering, 2015, International Journal of Computer Vision.
[11] Alon Lavie, et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language, 2014, WMT@ACL.
[12] F. Xia, et al. The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0), 2000.
[13] Florian Metze, et al. How2: A Large-scale Dataset for Multimodal Language Understanding, 2018, NIPS 2018.
[14] Bernt Schiele, et al. A Dataset for Movie Description, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Kemal Oflazer, et al. A Human Judgement Corpus and a Metric for Arabic MT Evaluation, 2014, EMNLP.
[16] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, ArXiv.
[17] Kemal Oflazer, et al. Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation, 2016, LREC.
[18] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.
[19] Kuldip K. Paliwal, et al. Bidirectional recurrent neural networks, 1997, IEEE Trans. Signal Process..
[20] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[21] Trevor Darrell, et al. Long-term recurrent convolutional networks for visual recognition and description, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Khalil Sima'an, et al. Multi30K: Multilingual English-German Image Descriptions, 2016, VL@ACL.
[23] Xinlei Chen, et al. nocaps: novel object captioning at scale, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[24] Wei Xu, et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question, 2015, NIPS.
[25] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] G. Youmans, et al. Measuring Lexical Style and Competence: The Type-Token Vocabulary Curve, 1990.
[27] Christopher Joseph Pal, et al. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research, 2015, ArXiv.
[28] Marcus Rohrbach, et al. A Dataset for Telling the Stories of Social Media Videos, 2018, EMNLP.
[29] Khalil Sima'an, et al. A Shared Task on Multimodal Machine Translation and Crosslingual Image Description, 2016, WMT.
[30] Xirong Li, et al. Fluency-Guided Cross-Lingual Image Captioning, 2017, ACM Multimedia.
[31] Dapeng Li, et al. OSU Multimodal Machine Translation System Report, 2017, WMT.
[32] C. Lawrence Zitnick, et al. CIDEr: Consensus-based image description evaluation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Jorma Laaksonen, et al. The MeMAD Submission to the WMT18 Multimodal Translation Task, 2018, WMT.
[34] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.
[36] Nobuyuki Shimizu, et al. Visual Question Answering Dataset for Bilingual Image Understanding: A Study of Cross-Lingual Transfer Using Attention Maps, 2018, COLING.
[37] Desmond Elliott, et al. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description, 2017, WMT.
[38] Ali Farhadi, et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, 2016, ECCV.
[39] Tao Chen, et al. Multilingual Visual Sentiment Concept Matching, 2016, ICMR.
[40] William Yang Wang, et al. Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning, 2018, AAAI.
[41] Jean Oh, et al. Attention-based Multimodal Neural Machine Translation, 2016, WMT.
[42] Salim Roukos, et al. BLEU: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[43] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.
[44] Svetlana Lazebnik, et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[45] Desmond Elliott, et al. Findings of the Third Shared Task on Multimodal Machine Translation, 2018, WMT.
[46] Qun Liu, et al. Incorporating Global Visual Features into Attention-based Neural Machine Translation, 2017, EMNLP.
[47] Samy Bengio, et al. Show and Tell: A Neural Image Caption Generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.
[49] Ruihua Zhang. Sadness Expressions in English and Chinese: Corpus Linguistic Contrastive Semantic Analysis, 2014.
[50] Chenliang Xu, et al. Towards Automatic Learning of Procedures From Web Instructional Videos, 2017, AAAI.
[51] Xin Wang, et al. Video Captioning via Hierarchical Reinforcement Learning, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[52] Trevor Darrell, et al. Sequence to Sequence -- Video to Text, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[53] Fethi Bougares, et al. Multimodal Attention for Neural Machine Translation, 2016, ArXiv.
[54] Bernt Schiele, et al. Coherent Multi-sentence Video Description with Variable Level of Detail, 2014, GCPR.
[55] Xirong Li, et al. COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval, 2018, IEEE Transactions on Multimedia.
[56] Lucia Specia, et al. Probing the Need for Visual Context in Multimodal Machine Translation, 2019, NAACL.
[57] Mubarak Shah, et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012, ArXiv.
[58] Yale Song, et al. TGIF: A New Dataset and Benchmark on Animated GIF Description, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Ramakanth Pasunuru, et al. Reinforced Video Captioning with Entailment Rewards, 2017, EMNLP.
[60] Christopher Joseph Pal, et al. Movie Description, 2016, International Journal of Computer Vision.
[61] Thomas Serre, et al. HMDB: A large video database for human motion recognition, 2011, 2011 International Conference on Computer Vision.
[62] Xin Wang, et al. Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning, 2018, NAACL.
[63] Lucia Specia, et al. Sheffield MultiMT: Using Object Posterior Predictions for Multimodal Machine Translation, 2017, WMT.
[64] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[65] Trevor Darrell, et al. Localizing Moments in Video with Natural Language, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[66] Bernt Schiele, et al. Grounding Action Descriptions in Videos, 2013, TACL.
[67] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[68] Chenliang Xu, et al. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[69] Jindrich Libovický, et al. Attention Strategies for Multi-Source Sequence-to-Sequence Learning, 2017, ACL.
[70] Sanja Fidler, et al. Visual Semantic Search: Retrieving Videos via Complex Textual Queries, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[71] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, ArXiv.