Modeling under-resourced languages for speech recognition

A particular problem in large-vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of every possible word sequence. For Finnish, Estonian, and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in everyday speech. The same problem also exists in other language technology applications, such as machine translation and information retrieval, and to some extent in other morphologically rich languages as well. In this paper we present methods and evaluations for four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.
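To illustrate why subword units help with the explosion of inflected word forms, the toy sketch below segments a few Finnish surface forms into a small shared inventory of stems and suffixes. This is only a greedy longest-match illustration with a hand-picked vocabulary, not the statistical, unigram-likelihood-based segmentation used in the paper's experiments:

```python
def segment(word, subwords):
    """Greedy longest-match segmentation of a word into subword units.

    Falls back to single characters when no known unit matches, so every
    word is always covered (no out-of-vocabulary failures).
    """
    units = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in subwords:
                units.append(word[i:j])
                i = j
                break
        else:
            units.append(word[i])  # unknown span: back off to a character
            i += 1
    return units


# Hypothetical inventory: one stem plus a few Finnish case endings.
subwords = {"talo", "ssa", "sta", "on", "issa"}

# Four distinct word-level vocabulary entries collapse onto reused units.
for w in ["talo", "talossa", "talosta", "taloissa"]:
    print(w, "->", segment(w, subwords))
```

Four inflected forms of "talo" (house) would each need their own entry in a word-level lexicon, but here they share a single stem and three reusable endings, which is the effect that keeps subword language models compact for morphologically rich languages.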
