Multimodal Grounding for Sequence-to-sequence Speech Recognition
Florian Metze | Ramon Sanabria | Ozan Caglayan | Shruti Palaskar | Loïc Barrault
[1] Florian Metze, et al. Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[2] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[3] Florian Metze, et al. Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach, 2016, INTERSPEECH.
[4] Qun Liu, et al. Incorporating Global Visual Features into Attention-based Neural Machine Translation, 2017, EMNLP.
[5] Florian Metze, et al. How2: A Large-scale Dataset for Multimodal Language Understanding, 2018, NIPS 2018.
[6] Joon Son Chung, et al. Lip Reading Sentences in the Wild, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Quoc V. Le, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[8] Dumitru Erhan, et al. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[9] Bolei Zhou, et al. Places: A 10 Million Image Database for Scene Recognition, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[10] Fethi Bougares, et al. NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems, 2017, Prague Bull. Math. Linguistics.
[11] Mark J. F. Gales, et al. Recurrent neural network language model adaptation for multi-genre broadcast speech recognition, 2015, INTERSPEECH.
[12] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Florian Metze, et al. Visual features for context-aware speech recognition, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[14] Yutaka Satoh, et al. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[15] James R. Glass, et al. Look, listen, and decode: Multimodal speech recognition with images, 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).
[16] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.
[17] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[18] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, ArXiv.
[19] Lior Wolf, et al. Using the Output Embedding to Improve Language Models, 2016, EACL.
[20] Joost van de Weijer, et al. LIUM-CVC Submissions for WMT18 Multimodal Translation Task, 2018, WMT.
[21] Gareth J. F. Jones, et al. LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition, 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).
[22] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.
[23] Florian Metze, et al. End-to-end Multimodal Speech Recognition, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[24] Yoshua Bengio, et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014, ArXiv.
[25] Taku Kudo, et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.
[26] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.