Echo State Speech Recognition

We propose automatic speech recognition (ASR) models inspired by the echo state network (ESN) [15], in which a subset of recurrent neural network (RNN) layers is randomly initialized and left untrained. Our study focuses on RNN-T and Conformer models, and we show that model quality does not degrade even when the decoder is fully randomized. Furthermore, such models can be trained more efficiently, since the decoder never needs to be updated. By contrast, randomizing the encoder hurts model quality, indicating that optimizing the encoder to learn proper representations of the acoustic input is more vital for speech recognition. Overall, we challenge the common practice of training all components of an ASR model, and demonstrate that ESN-based models can perform just as well as their fully trainable counterparts while enabling more efficient training and storage.
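To make the setup concrete, here is a minimal sketch (PyTorch; not the paper's code) of an ESN-style frozen decoder: the recurrent layers are randomly initialized, each recurrent weight block is rescaled to a target spectral radius below 1 (the standard ESN recipe for the echo state property), and every parameter is frozen so gradients flow only to the trainable components. The function name, layer sizes, and spectral radius here are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

def make_frozen_esn_lstm(input_size: int, hidden_size: int,
                         num_layers: int = 2,
                         spectral_radius: float = 0.9) -> nn.LSTM:
    """Randomly initialized LSTM whose recurrent weights are ESN-scaled and frozen."""
    lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if name.startswith("weight_hh"):
                # weight_hh_l* stacks the four gate matrices into a (4H x H)
                # tensor; rescale each square H x H block so its largest
                # eigenvalue magnitude equals the target spectral radius.
                H = param.shape[1]
                for i in range(4):
                    block = param[i * H:(i + 1) * H, :]
                    rho = torch.linalg.eigvals(block).abs().max()
                    block.mul_(spectral_radius / rho)
    for param in lstm.parameters():
        param.requires_grad_(False)  # the "reservoir" is never updated
    return lstm

# Hypothetical usage: the frozen LSTM serves as the RNN-T prediction network
# (decoder); only the encoder and joint network receive gradient updates.
decoder = make_frozen_esn_lstm(input_size=640, hidden_size=2048)
```

Since the frozen weights are fully determined by the random seed used to generate them, a checkpoint can in principle store just the seed rather than the decoder weights, which is presumably the storage saving the abstract alludes to.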

[1] Sanjeev Khudanpur, et al. Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Yu Zhang, et al. Conformer: Convolution-augmented Transformer for Speech Recognition. INTERSPEECH, 2020.

[3] Lukasz Kaiser, et al. Attention Is All You Need. NIPS, 2017.

[4] Henry Markram, et al. Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation, 2002.

[5] Tara N. Sainath, et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] John G. Harris, et al. Noise-robust automatic speech recognition using a discriminative echo state network. 2007 IEEE International Symposium on Circuits and Systems.

[7] Claudio Gallicchio, et al. Deep reservoir computing: A critical experimental analysis. Neurocomputing, 2017.

[8] Tara N. Sainath, et al. A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Alex Graves, et al. Sequence Transduction with Recurrent Neural Networks. arXiv, 2012.

[10] Benjamin Schrauwen, et al. An experimental unification of reservoir computing methods. Neural Networks, 2007.

[11] Ammar Belatreche, et al. Neuro-Inspired Speech Recognition Based on Reservoir Computing. 2010.

[12] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units. ACL, 2015.

[13] Claudio Gallicchio, et al. Comparison between DeepESNs and gated RNNs on multivariate time-series prediction. ESANN, 2018.

[14] W. Maass, et al. What makes a dynamical system computationally powerful? 2022.

[15] Herbert Jaeger, et al. The "echo state" approach to analysing and training recurrent neural networks. 2001.

[16] Rohit Prabhavalkar, et al. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[17] Claudio Gallicchio, et al. Richness of Deep Echo State Network Dynamics. IWANN, 2019.

[18] Xiaofeng Liu, et al. RNN-Transducer with Stateless Prediction Network. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Herbert Jaeger, et al. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 2007.

[20] Claudio Gallicchio, et al. Echo State Property of Deep Reservoir Computing Networks. Cognitive Computation, 2017.

[21] Stefan J. Kiebel, et al. Re-visiting the echo state property. Neural Networks, 2012.

[22] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.