Conversational Telephone Speech Recognition for Lithuanian

This paper presents a conversational telephone speech recognition system for the low-resourced Lithuanian language, developed in the context of the IARPA Babel program. Phoneme-based and grapheme-based systems are compared to establish whether a phonemic lexicon is necessary. We explore the impact of using Web data for language modeling and of additional untranscribed data for semi-supervised training. Experimental results are reported for two conditions: Full Language Pack (FLP) and Very Limited Language Pack (VLLP), for which 40 h and 3 h of transcribed training data are available, respectively. Grapheme-based systems are shown to give results comparable to phoneme-based ones. Adding Web texts improves the performance of both the FLP and VLLP systems. The best VLLP results are achieved using both Web texts and semi-supervised training.
