ASR and Translation for Under-Resourced Languages

There are more than 6000 languages in the world, but only a small number possess the resources required to implement human language technologies (HLT). As a result, HLT research mostly concerns languages for which large resources are available or which have suddenly become of interest because of the economic or political situation. By contrast, languages of developing countries or minorities have received less attention in past years. One way of narrowing this "language divide" is to do more research on the portability of HLT for multilingual applications. In this paper, we concentrate on speech-to-speech translation. We present our methodology for the fast development of ASR systems for under-resourced languages or, as they are now called, pi-languages (poorly equipped). We present the resources collected for Vietnamese and the experimental results of our first Vietnamese ASR system. The ongoing validation of our methodology for Khmer is described next. We also discuss some issues related to machine translation and present the first contributions of our laboratory in this context of "pi-languages".
