论文信息 - Global Normalization for Streaming Speech Recognition in a Modular Framework

Global Normalization for Streaming Speech Recognition in a Modular Framework

We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50\% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.

[1] Jinsong Zhang,et al. Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers , 2021, ArXiv.

[2] Yu Zhang,et al. Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[3] Xiaofeng Liu,et al. Rnn-Transducer with Stateless Prediction Network , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Cyril Allauzen,et al. Hybrid Autoregressive Transducer (HAT) , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Qian Zhang,et al. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] Zhijian Ou,et al. CRF-based Single-stage Acoustic Modeling with CTC Topology , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Chris Dyer,et al. An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search , 2019, NAACL.

[8] Tom Bagby,et al. End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow , 2017, INTERSPEECH.

[9] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[10] Gabriel Synnaeve,et al. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.

[11] Yiming Wang,et al. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[12] Slav Petrov,et al. Globally Normalized Transition-Based Neural Networks , 2016, ACL.

[13] Quoc V. Le,et al. Listen, Attend and Spell , 2015, ArXiv.

[14] Sanjeev Khudanpur,et al. Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[16] Mehryar Mohri,et al. On the Disambiguation of Weighted Automata , 2014, CIAA.

[17] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[18] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[19] Daniel Povey,et al. The Kaldi Speech Recognition Toolkit , 2011 .

[20] Geoffrey Zweig,et al. A segmental CRF approach to large vocabulary continuous speech recognition , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[21] Steve Renals,et al. Speech Recognition Using Augmented Conditional Random Fields , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[22] Mehryar Mohri,et al. Weighted Automata Algorithms , 2009 .

[23] Mehryar Mohri,et al. Speech Recognition with Weighted Finite-State Transducers , 2008 .

[24] Noah A. Smith,et al. Weighted and Probabilistic Context-Free Grammars Are Equally Expressive , 2007, CL.

[25] Johan Schalkwyk,et al. OpenFst: A General and Efficient Weighted Finite-State Transducer Library , 2007, CIAA.

[26] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[27] Alex Acero,et al. Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[28] Mehryar Mohri,et al. Semiring Frameworks and Algorithms for Shortest-Distance Problems , 2002, J. Autom. Lang. Comb..

[29] Andrew McCallum,et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[30] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[31] Roni Rosenfeld,et al. A whole sentence maximum entropy language model , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[32] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[33] Yoshua Bengio,et al. Global training of document processing systems using graph transformer networks , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[34] J. S. Bridle,et al. An Alphanet approach to optimising input transformations for continuous speech recognition , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[35] Hervé Bourlard,et al. Continuous speech recognition using multilayer perceptrons with hidden Markov models , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[36] Peter F. Brown,et al. The acoustic-modeling problem in automatic speech recognition , 1987 .

[37] Lalit R. Bahl,et al. Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.