Who Needs Words? Lexicon-Free Speech Recognition

Lexicon-free speech recognition naturally handles the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LMs) can match word-based LMs in word error rate (WER) for speech recognition, even without restricting decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large character contexts, which is key for good downstream speech recognition performance. In particular, we show that lexicon-free decoding with character-based LMs yields lower WER on utterances containing OOV words than lexicon-based decoding does, with either character- or word-based LMs.
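To make the idea concrete, the sketch below shows what lexicon-free decoding roughly looks like: hypotheses are raw character strings, so any OOV word can be emitted, and each beam extension is scored by an acoustic log-probability plus a weighted character-LM log-probability. This is only an illustrative toy, not the paper's decoder: the add-one-smoothed character bigram LM stands in for the paper's convolutional character LM, and CTC-style blank handling and prefix merging are omitted.

```python
# Minimal sketch of lexicon-free beam search with a character LM.
# Assumptions (not from the paper): a toy bigram character LM, per-frame
# character log-posteriors as input, no blank symbol or prefix collapsing.
import math
from collections import defaultdict

VOCAB = list("abcdefghijklmnopqrstuvwxyz '")  # space marks word boundaries


def char_bigram_lm(text):
    """Train an add-one-smoothed character bigram LM (stand-in for a
    convolutional character LM)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for prev, cur in zip(" " + text, text):
        counts[prev][cur] += 1
        totals[prev] += 1
    vocab_size = len(VOCAB)

    def logp(prev, cur):
        return math.log((counts[prev][cur] + 1) / (totals[prev] + vocab_size))

    return logp


def beam_search(frame_logprobs, lm_logp, beam_size=8, lm_weight=0.8):
    """frame_logprobs: list of {char: log P(char | frame)} dicts.
    No lexicon constrains the hypotheses, so OOV words are reachable."""
    beams = {"": 0.0}  # character prefix -> score
    for frame in frame_logprobs:
        scored = {}
        for prefix, score in beams.items():
            prev = prefix[-1] if prefix else " "
            for ch, am_lp in frame.items():
                new_score = score + am_lp + lm_weight * lm_logp(prev, ch)
                key = prefix + ch
                scored[key] = max(scored.get(key, -math.inf), new_score)
        # keep only the top-scoring prefixes
        beams = dict(sorted(scored.items(), key=lambda kv: -kv[1])[:beam_size])
    return max(beams, key=beams.get)


if __name__ == "__main__":
    lm = char_bigram_lm("the cat sat on the mat")
    frames = [
        {"c": -0.1, "k": -2.3},
        {"a": -0.2, "e": -1.8},
        {"t": -0.1, "d": -2.5},
    ]
    print(beam_search(frames, lm))  # "cat"; no lexicon restricts the output
```

A real system would replace the bigram with a large-context character LM and add the usual CTC/ASG bookkeeping, but the key property is unchanged: the search space is the set of character sequences, not a fixed word vocabulary.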
