Who Needs Words? Lexicon-Free Speech Recognition

Lexicon-free speech recognition naturally handles the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LMs) can match word-based LMs in word error rate (WER) for speech recognition, even without restricting decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large character contexts, which is key for good downstream speech recognition performance. In particular, we show that lexicon-free decoding with character-based LMs yields lower WER on utterances containing OOV words than lexicon-based decoding does, with either character- or word-based LMs.
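To make the idea concrete, the sketch below shows what lexicon-free decoding roughly looks like: hypotheses are raw character strings, so any OOV word can be emitted, and each beam extension is scored by an acoustic log-probability plus a weighted character-LM log-probability. This is only an illustrative toy, not the paper's decoder: the add-one-smoothed character bigram LM stands in for the paper's convolutional character LM, and CTC-style blank handling and prefix merging are omitted.

```python
# Minimal sketch of lexicon-free beam search with a character LM.
# Assumptions (not from the paper): a toy bigram character LM, per-frame
# character log-posteriors as input, no blank symbol or prefix collapsing.
import math
from collections import defaultdict

VOCAB = list("abcdefghijklmnopqrstuvwxyz '")  # space marks word boundaries


def char_bigram_lm(text):
    """Train an add-one-smoothed character bigram LM (stand-in for a
    convolutional character LM)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for prev, cur in zip(" " + text, text):
        counts[prev][cur] += 1
        totals[prev] += 1
    vocab_size = len(VOCAB)

    def logp(prev, cur):
        return math.log((counts[prev][cur] + 1) / (totals[prev] + vocab_size))

    return logp


def beam_search(frame_logprobs, lm_logp, beam_size=8, lm_weight=0.8):
    """frame_logprobs: list of {char: log P(char | frame)} dicts.
    No lexicon constrains the hypotheses, so OOV words are reachable."""
    beams = {"": 0.0}  # character prefix -> score
    for frame in frame_logprobs:
        scored = {}
        for prefix, score in beams.items():
            prev = prefix[-1] if prefix else " "
            for ch, am_lp in frame.items():
                new_score = score + am_lp + lm_weight * lm_logp(prev, ch)
                key = prefix + ch
                scored[key] = max(scored.get(key, -math.inf), new_score)
        # keep only the top-scoring prefixes
        beams = dict(sorted(scored.items(), key=lambda kv: -kv[1])[:beam_size])
    return max(beams, key=beams.get)


if __name__ == "__main__":
    lm = char_bigram_lm("the cat sat on the mat")
    frames = [
        {"c": -0.1, "k": -2.3},
        {"a": -0.2, "e": -1.8},
        {"t": -0.1, "d": -2.5},
    ]
    print(beam_search(frames, lm))  # "cat"; no lexicon restricts the output
```

A real system would replace the bigram with a large-context character LM and add the usual CTC/ASG bookkeeping, but the key property is unchanged: the search space is the set of character sequences, not a fixed word vocabulary.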
