Forget a Bit to Learn Better: Soft Forgetting for CTC-Based Automatic Speech Recognition

Prior work has shown that connectionist temporal classification (CTC)-based automatic speech recognition systems perform well when using bidirectional long short-term memory (BLSTM) networks unrolled over the whole speech utterance, because whole-utterance BLSTMs better capture long-term context. We hypothesize that this also leads to overfitting and propose soft forgetting as a solution. During training, we unroll the BLSTM network only over small non-overlapping chunks of the input utterance, and we randomly pick a chunk size for each batch instead of using a fixed global chunk size. To retain some utterance-level information, we encourage the hidden states of the chunked BLSTM network to approximate those of a pre-trained whole-utterance BLSTM. Our experiments on the 300-hour English Switchboard dataset show that soft forgetting improves the word error rate (WER) over a competitive whole-utterance phone CTC BLSTM by an average of 7-9% relative. We obtain WERs of 9.1%/17.4% with speaker-independent models and 8.7%/16.8% with speaker-adapted models on the Hub5-2000 Switchboard/CallHome test sets. We also show that soft forgetting improves the WER when the model is used with limited temporal context for streaming recognition. Finally, we present empirical insights into the regularization and data augmentation effects of soft forgetting.
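
The training recipe described above (chunked BLSTM unrolling with a randomly chosen per-batch chunk size, plus a loss that pulls the chunked hidden states toward those of a frozen whole-utterance BLSTM) can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the chunk-size set `chunk_sizes`, the weight `lambda_hid`, and the use of an MSE matching loss are placeholders for whatever the paper actually uses.

```python
# Minimal sketch of soft forgetting for a CTC BLSTM, assuming an MSE
# hidden-state matching loss and illustrative chunk sizes / loss weight.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkedBLSTM(nn.Module):
    """BLSTM acoustic model that can be unrolled over non-overlapping chunks."""
    def __init__(self, input_dim, hidden_dim, num_layers, num_labels):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, feats, chunk_size=None):
        # feats: (batch, time, input_dim)
        if chunk_size is None:
            # Whole-utterance unrolling (teacher mode).
            hidden, _ = self.blstm(feats)
        else:
            # Unroll over non-overlapping chunks; state resets at boundaries.
            chunks = feats.split(chunk_size, dim=1)
            hidden = torch.cat([self.blstm(c)[0] for c in chunks], dim=1)
        return hidden, self.proj(hidden).log_softmax(dim=-1)

def soft_forgetting_step(student, teacher, feats, targets,
                         feat_lens, target_lens, optimizer,
                         chunk_sizes=(32, 48, 64, 96), lambda_hid=1.0):
    """One training step: chunked CTC loss plus hidden-state matching loss."""
    chunk_size = random.choice(chunk_sizes)        # new chunk size per batch
    hidden, log_probs = student(feats, chunk_size=chunk_size)

    with torch.no_grad():                          # frozen whole-utterance teacher
        teacher_hidden, _ = teacher(feats)

    # F.ctc_loss expects (time, batch, labels) log-probabilities.
    ctc = F.ctc_loss(log_probs.transpose(0, 1), targets,
                     feat_lens, target_lens)
    match = F.mse_loss(hidden, teacher_hidden)     # approximate teacher states
    loss = ctc + lambda_hid * match

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the whole-utterance teacher is trained first with plain CTC and then frozen; only the chunked student is updated, so the matching term acts as a regularizer that injects utterance-level information back into the short-context model.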
