Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard-300

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data.

[1]  Louis B. Rall,et al.  Automatic differentiation , 1981 .

[2]  Parcor Coeff,et al.  Comparison of Speaker Recognition Methods Using Statistical Features and Dynamic Features , 1981 .

[3]  Y. Nesterov A method for solving the convex programming problem with convergence rate O(1/k^2) , 1983 .

[4]  Anders Krogh,et al.  A Simple Weight Decay Can Improve Generalization , 1991, NIPS.

[5]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[6]  Hermann Ney,et al.  Improvements in beam search for 10000-word continuous-speech recognition , 1994, IEEE Trans. Speech Audio Process..

[7]  Alan F. Murray,et al.  Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training , 1994, IEEE Trans. Neural Networks.

[8]  Christopher M. Bishop,et al.  Current address: Microsoft Research, , 2022 .

[9]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[10]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[11]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[12]  Martin Karafiát,et al.  Convolutive Bottleneck Network features for LVCSR , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[13]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[14]  Alex Graves,et al.  Practical Variational Inference for Neural Networks , 2011, NIPS.

[15]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[16]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[17]  Hermann Ney,et al.  LSTM Neural Networks for Language Modeling , 2012, INTERSPEECH.

[18]  Yann LeCun,et al.  Regularization of Neural Networks using DropConnect , 2013, ICML.

[19]  Naoyuki Kanda,et al.  Elastic spectral distortion for low resource speech recognition with deep neural networks , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[20]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[21]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[22]  Unfolded recurrent neural networks for speech recognition , 2014, INTERSPEECH.

[23]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[24]  Geoffrey Zweig,et al.  Deep bi-directional recurrent networks over spectral windows , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[25]  Yoshua Bengio,et al.  On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.

[26]  Sanjeev Khudanpur,et al.  Audio augmentation for speech recognition , 2015, INTERSPEECH.

[27]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[28]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[29]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[30]  Zoubin Ghahramani,et al.  A Theoretically Grounded Application of Dropout in Recurrent Neural Networks , 2015, NIPS.

[31]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Yang Liu,et al.  Modeling Coverage for Neural Machine Translation , 2016, ACL.

[35]  Yoshua Bengio,et al.  End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[37]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[38]  Hairong Liu,et al.  Exploring neural transducers for end-to-end speech recognition , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[39]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[40]  Yoshua Bengio,et al.  Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations , 2016, ICLR.

[41]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[42]  Bhuvana Ramabhadran,et al.  Language modeling with highway LSTM , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[43]  Xiaodong Cui,et al.  English Conversational Telephone Speech Recognition by Humans and Machines , 2017, INTERSPEECH.

[44]  Kyu J. Han,et al.  The CAPIO 2017 Conversational Speech Recognition System , 2017, ArXiv.

[45]  Andreas Stolcke,et al.  The Microsoft 2017 Conversational Speech Recognition System , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[46]  Sanjeev Khudanpur,et al.  End-to-end Speech Recognition Using Lattice-free MMI , 2018, INTERSPEECH.

[47]  Hermann Ney,et al.  Investigation on LSTM Recurrent N-gram Language Models for Speech Recognition , 2018, INTERSPEECH.

[48]  Xiaofei Wang,et al.  A Comparative Study on Transformer vs RNN in Speech Applications , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[49]  George Saon,et al.  Advancing Sequence-to-Sequence Based Speech Recognition , 2019, INTERSPEECH.

[50]  Brian Kingsbury,et al.  Sequence Noise Injected Training for End-to-end Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  Hermann Ney,et al.  RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation , 2019, INTERSPEECH.

[52]  Hermann Ney,et al.  Training Language Models for Long-Span Cross-Sentence Evaluation , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[53]  Brian Kingsbury,et al.  Forget a Bit to Learn Better: Soft Forgetting for CTC-Based Automatic Speech Recognition , 2019, INTERSPEECH.

[54]  Quoc V. Le,et al.  SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[55]  Stephen Merity,et al.  Single Headed Attention RNN: Stop Thinking With Your Head , 2019, ArXiv.

[56]  M. Seltzer,et al.  Transformer-Based Acoustic Modeling for Hybrid Speech Recognition , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[57]  Yi Yang,et al.  Random Erasing Data Augmentation , 2017, AAAI.

[58]  A. Waibel,et al.  Improving Sequence-To-Sequence Speech Recognition Training with On-The-Fly Data Augmentation , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).