Conversational telephone speech recognition for Lithuanian

Abstract: The research presented in this paper addresses conversational telephone speech recognition and keyword spotting for the Lithuanian language. Lithuanian can be considered a low e-resourced language: little transcribed audio data and, more generally, only limited linguistic resources are available electronically. Part of this research explores the impact of reducing the amount of linguistic knowledge and manual supervision when developing the transcription system. Since designing a pronunciation dictionary requires language-specific expertise, the need for manual supervision was assessed by comparing phonemic and graphemic units for acoustic modeling. Although Lithuanian is generally described in the linguistic literature as having 56 phonemes, under low-resourced conditions some phonemes may not be observed often enough to be modeled. Different phoneme inventories were therefore explored to assess the effects of explicitly modeling diphthongs, affricates and soft consonants. The impact of using Web data for language modeling and additional untranscribed audio data for semi-supervised training was also measured. Out-of-vocabulary (OOV) keywords are a well-known challenge for keyword search: while word-based keyword search is quite effective for in-vocabulary words, OOV keywords largely go undetected. Morpheme-based subword units are compared with character n-gram-based units for their capacity to detect OOV keywords. Experimental results are reported for two training conditions defined in the IARPA Babel program: the full language pack, with 40 h of transcribed training data, and the very limited language pack, with 3 h. For both conditions, grapheme-based and phoneme-based models are shown to obtain comparable transcription and keyword spotting results. The use of Web texts for language modeling is shown to significantly improve both speech recognition and keyword spotting performance. Combining full-word and subword units leads to the best keyword spotting results.
