How2: A Large-scale Dataset for Multimodal Language Understanding

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making data and code available for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges and to obtain a deeper understanding of multimodality in language processing.
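To make the baseline setup concrete, below is a minimal sketch of an attention-based encoder-decoder of the kind such sequence-to-sequence baselines typically use, written in PyTorch. It is illustrative only: the class name Seq2SeqBaseline, the GRU layers, the dot-product attention, the dimension choices, and the toy data are assumptions made for this sketch, not the paper's released implementation.

```python
# Minimal sketch (illustrative, not the authors' released code) of an
# attention-based encoder-decoder baseline. Sizes and names are assumptions.
import torch
import torch.nn as nn

class Seq2SeqBaseline(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Bidirectional encoder over source tokens; in the speech/video
        # settings this would consume acoustic or visual features instead.
        self.encoder = nn.GRU(emb_dim, hid_dim // 2, batch_first=True,
                              bidirectional=True)
        self.decoder = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.attn = nn.Linear(hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        enc_out, _ = self.encoder(self.src_emb(src))  # (B, S, hid_dim)
        hidden, logits = None, []
        for t in range(tgt_in.size(1)):  # teacher-forced decoding
            emb = self.tgt_emb(tgt_in[:, t:t + 1])    # (B, 1, emb_dim)
            if hidden is None:
                query = enc_out.new_zeros(src.size(0), 1, enc_out.size(2))
            else:
                query = hidden[-1].unsqueeze(1)       # last decoder state
            # Dot-product attention over encoder states.
            scores = torch.bmm(self.attn(query), enc_out.transpose(1, 2))
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
            dec_out, hidden = self.decoder(
                torch.cat([emb, context], dim=-1), hidden)
            logits.append(self.out(dec_out))
        return torch.cat(logits, dim=1)               # (B, T, tgt_vocab)

# Toy usage with random token ids.
model = Seq2SeqBaseline(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))       # batch of 2, source length 7
tgt_in = torch.randint(0, 1000, (2, 5))    # shifted target inputs, length 5
print(model(src, tgt_in).shape)            # torch.Size([2, 5, 1000])
```

In the multimodal tasks the abstract lists, the source-side embedding lookup would be replaced or augmented with speech or video feature encoders, while the attention and decoding loop stay essentially the same.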
