Learning online alignments with continuous rewards policy gradient

Sequence-to-sequence models with soft attention have had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require the entire input sequence to be available before inference begins, an assumption that does not hold for instantaneous translation or speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignment model can start producing outputs before it has processed the entire input sequence, which makes a highly accurate online model useful for building, for example, an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs are produced, and it is trained to make these decisions with a standard policy gradient method. In our experiments, the model achieves encouraging performance on the TIMIT and Wall Street Journal (WSJ) speech recognition datasets.
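
The training procedure described above couples hard Bernoulli emit/no-emit decisions with a REINFORCE-style policy gradient estimator. The minimal sketch below illustrates that estimator on a synthetic alignment task; the per-timestep logit parameterization, the toy reward, and all names (sample_alignment, target, baseline) are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of REINFORCE over hard binary "emit" decisions,
# assuming one independent Bernoulli emit logit per input timestep.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_alignment(logits):
    """Draw a hard binary emit decision b_t ~ Bernoulli(sigmoid(logit_t))."""
    probs = sigmoid(logits)
    decisions = (rng.random(probs.shape) < probs).astype(float)
    return decisions, probs

# Toy task (hypothetical): learn to emit exactly at timesteps 2, 5 and 8
# of a 10-step input, using only a scalar reward on the sampled alignment.
T = 10
target = np.zeros(T)
target[[2, 5, 8]] = 1.0

logits = np.zeros(T)   # policy parameters: one emit logit per timestep
baseline = 0.0         # moving-average reward baseline (variance reduction)
for step in range(3000):
    decisions, probs = sample_alignment(logits)
    reward = -np.sum(np.abs(decisions - target))   # continuous reward
    baseline = 0.9 * baseline + 0.1 * reward
    # REINFORCE: grad_logit log Bernoulli(b; sigmoid(logit)) = b - p,
    # scaled by the centered reward (R - baseline); ascend E[R].
    logits += 0.5 * (reward - baseline) * (decisions - probs)

# Emission probabilities should concentrate near the target pattern.
print(np.round(sigmoid(logits), 2))
```

The moving-average baseline plays the role of a learned reward predictor: subtracting it from the reward leaves the gradient estimator unbiased while reducing its variance, which is what makes training on such sparse stochastic decisions practical.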
