Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks

Whereas deep neural networks were first mostly used for classification tasks, they are rapidly expanding in the realm of structured output problems, where the observed target is composed of multiple random variables that have a rich joint distribution, given the input. In this paper we focus on the case where the input also has a rich structure and the input and output structures are somehow related. We describe systems that learn to attend to different places in the input, for each element of the output, for a variety of tasks: machine translation, image caption generation, video clip description, and speech recognition. All these systems are based on a shared set of building blocks: gated recurrent neural networks and convolutional neural networks, along with trained attention mechanisms. We report on experimental results with these systems, showing impressively good performance and the advantage of the attention mechanism.

[1]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[2]  Geoffrey E. Hinton,et al.  Learning and relearning in Boltzmann machines , 1986 .

[3]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[4]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[5]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[6]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[7]  S. C. Kremer,et al.  Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[8]  Yoshua Bengio,et al.  Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[9]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[10]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[11]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[12]  Geoffrey E. Hinton,et al.  Three new graphical models for statistical language modelling , 2007, ICML '07.

[13]  Geoffrey E. Hinton,et al.  Learning to combine foveal glimpses with a third-order Boltzmann machine , 2010, NIPS.

[14]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[15]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  William B. Dolan,et al.  Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.

[17]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[18]  Misha Denil,et al.  Learning Where to Attend with Deep Architectures for Image Tracking , 2011, Neural Computation.

[19]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[20]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[21]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Ruslan Salakhutdinov,et al.  Learning Stochastic Feedforward Neural Networks , 2013, NIPS.

[23]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[24]  Yoshua Bengio,et al.  Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation , 2013, ArXiv.

[25]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[26]  Phil Blunsom,et al.  Recurrent Continuous Translation Models , 2013, EMNLP.

[27]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[28]  Yoshua Bengio,et al.  Audio Chord Recognition with Recurrent Neural Networks , 2013, ISMIR.

[29]  Karol Gregor,et al.  Neural Variational Inference and Learning in Belief Networks , 2014, ICML.

[30]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[31]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[32]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[33]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[34]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[35]  Brian Murphy,et al.  Simultaneously Uncovering the Patterns of Brain Regions Involved in Different Story Reading Subprocesses , 2014, PloS one.

[36]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[37]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[38]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[39]  Hugo Larochelle,et al.  A Neural Autoregressive Approach to Attention-based Recognition , 2015, International Journal of Computer Vision.

[40]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[41]  Razvan Pascanu,et al.  How to Construct Deep Recurrent Neural Networks , 2013, ICLR.

[42]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[43]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[44]  Wei Xu,et al.  Explain Images with Multimodal Recurrent Neural Networks , 2014, ArXiv.

[45]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[46]  László Tóth,et al.  Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[47]  Nadir Durrani,et al.  Edinburgh’s Phrase-based Machine Translation Systems for WMT-14 , 2014, WMT@ACL.

[48]  Matthieu Cord,et al.  Sequentially Generated Instance-Dependent Image Representations for Classification , 2014, ICLR.

[49]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[50]  Alex Graves,et al.  Neural Turing Machines , 2014, ArXiv.

[51]  Misha Denil,et al.  Extraction of Salient Sentences from Labelled Documents , 2014, ArXiv.

[52]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[53]  Thomas Brox,et al.  Striving for Simplicity: The All Convolutional Net , 2014, ICLR.

[54]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Yoshua Bengio,et al.  On Using Very Large Target Vocabulary for Neural Machine Translation , 2014, ACL.

[56]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Geoffrey E. Hinton,et al.  Grammar as a Foreign Language , 2014, NIPS.

[58]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[59]  Jason Weston,et al.  End-To-End Memory Networks , 2015, NIPS.

[60]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[61]  Christopher Joseph Pal,et al.  Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[62]  Yoshua Bengio,et al.  On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.

[63]  Jason Weston,et al.  Memory Networks , 2014, ICLR.

[64]  Koray Kavukcuoglu,et al.  Multiple Object Recognition with Visual Attention , 2014, ICLR.

[65]  Navdeep Jaitly,et al.  Pointer Networks , 2015, NIPS.

[66]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[67]  Jason Weston,et al.  Weakly Supervised Memory Networks , 2015, ArXiv.

[68]  Christopher Joseph Pal,et al.  Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research , 2015, ArXiv.

[69]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[70]  Tapani Raiko,et al.  Techniques for Learning Binary Stochastic Feedforward Neural Networks , 2014, ICLR.

[71]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[74]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Geoffrey Zweig,et al.  Language Models for Image Captioning: The Quirks and What Works , 2015, ACL.

[76]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[77]  U. Austin,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2017 .

[78]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).