Automatic Speech Recognition for Under-Resourced Languages: Application to Vietnamese Language

This paper presents our work on automatic speech recognition (ASR) for under-resourced languages, with application to Vietnamese. We present several techniques for bootstrapping acoustic models. First, we describe the use of acoustic-phonetic unit distances and assess the potential of crosslingual acoustic modeling for under-resourced languages. Experimental results on Vietnamese showed that, with only a few hours of target-language speech data, crosslingual context-independent modeling worked better than crosslingual context-dependent modeling, whereas context-dependent modeling performed better when more speech data were available. In both cases, the crosslingual systems outperformed the monolingual baseline systems. We also investigate grapheme-based acoustic modeling, which avoids the need to build a phonetic dictionary. Finally, since sub-word units (morphemes, syllables, characters, etc.) can reduce the high out-of-vocabulary rate and mitigate the lack of text resources in statistical language modeling for under-resourced languages, we propose several methods to decompose, normalize and combine word and sub-word lattices generated by different ASR systems. The proposed lattice combination scheme yields a relative syllable error rate reduction of 6.6% over the sentence MAP baseline method on a Vietnamese ASR task.
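To make the reported metric concrete, the following is a minimal sketch of how a syllable error rate and a relative reduction between two systems are commonly computed. It assumes the usual edit-distance definition of error rate over syllable tokens; the function names and the toy syllable sequences are hypothetical illustrations and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists
    (substitutions, insertions and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_error_rate(ref_syllables, hyp_syllables):
    """Edit distance normalized by the number of reference syllables."""
    if not ref_syllables:
        raise ValueError("reference must contain at least one syllable")
    return edit_distance(ref_syllables, hyp_syllables) / len(ref_syllables)


def relative_reduction(baseline_rate, improved_rate):
    """Relative error reduction of an improved system over a baseline."""
    return (baseline_rate - improved_rate) / baseline_rate


# Toy example (made-up data): written Vietnamese separates syllables
# with spaces, so splitting on whitespace gives syllable tokens.
reference = "tôi đi học".split()
hypothesis = "tôi đi hỏi".split()
ser = syllable_error_rate(reference, hypothesis)
```

Under this definition, a relative reduction of 6.6% means that the combined system's syllable error rate is 6.6% lower than the baseline's as a proportion of the baseline rate, not 6.6 absolute percentage points lower.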
