Studies on Training Text Selection for Conversational Finnish Language Modeling

Current automatic speech recognition (ASR) and machine translation (MT) systems do not operate on conversational Finnish, because training data for colloquial Finnish has not been available. Although speech recognition performance on literary Finnish is already quite good, those systems have very poor baseline performance on conversational speech. Text data for relevant vocabulary and language models can be collected from the Internet, but web data is very noisy and most of it is not helpful for learning good models. The Finnish language is highly agglutinative and is written phonetically. Even phonetic reductions and sandhi are often written down in informal discussions. This increases the vocabulary size dramatically and causes word-based selection methods to fail. Our selection method explicitly optimizes the perplexity of a subword language model on the development data, and requires only a very limited amount of speech transcripts as development data. The language models have been evaluated for speech recognition using a new data set consisting of generic colloquial Finnish.
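The idea of selecting text by its effect on development-set perplexity can be illustrated with a minimal sketch. It is not the implementation evaluated here, and everything in it is an assumption made for illustration: overlapping character trigrams stand in for a learned morph segmentation, an add-alpha smoothed unigram model stands in for the subword n-gram language model, and a single greedy pass keeps a candidate segment only if adding it lowers the perplexity of the development text. The names `subwords`, `perplexity`, and `select_segments` are hypothetical.

```python
import math
from collections import Counter


def subwords(text, n=3):
    """Split text into overlapping character n-grams per word.

    Assumption: a crude stand-in for a real morph segmenter.
    """
    units = []
    for word in text.lower().split():
        padded = f"<{word}>"  # mark word boundaries
        if len(padded) <= n:
            units.append(padded)
        else:
            units.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return units


def perplexity(counts, total, dev_units, vocab_size, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram model on dev_units."""
    if not dev_units:
        raise ValueError("development data must contain at least one unit")
    log_prob = 0.0
    for unit in dev_units:
        p = (counts.get(unit, 0) + alpha) / (total + alpha * vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(dev_units))


def select_segments(candidates, dev_text, n=3):
    """Greedy single pass: keep a segment only if it lowers dev perplexity."""
    dev_units = subwords(dev_text, n)
    cand_units = [subwords(c, n) for c in candidates]

    # Fix the vocabulary up front so perplexities stay comparable.
    vocab = set(dev_units)
    for units in cand_units:
        vocab.update(units)
    vocab_size = len(vocab)

    counts, total = Counter(), 0
    best = perplexity(counts, total, dev_units, vocab_size)
    selected = []
    for text, units in zip(candidates, cand_units):
        if not units:
            continue
        trial_counts = counts + Counter(units)
        trial_total = total + len(units)
        ppl = perplexity(trial_counts, trial_total, dev_units, vocab_size)
        if ppl < best:  # accept only segments that help on the dev data
            counts, total, best = trial_counts, trial_total, ppl
            selected.append(text)
    return selected, best


if __name__ == "__main__":
    dev = "mä oon menos kotiin"  # toy colloquial development transcript
    web = [
        "mä oon tulos kotiin",  # colloquial, shares subwords with dev
        "Hallitus antoi eduskunnalle esityksen talousarviosta",  # formal
    ]
    kept, ppl = select_segments(web, dev)
    print(kept, round(ppl, 2))
```

Because selection is scored on subword units rather than whole words, a colloquial segment can be accepted even when it shares few exact word forms with the development data, which is the situation that makes word-based selection difficult for Finnish. A single greedy pass is order-dependent; it is used here only to keep the sketch short.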
