Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]

This paper investigates the impact of sub-word units on ASR performance for two under-resourced African languages with rich morphology, Amharic and Swahili. Two sub-word units are considered: the syllable and the morpheme, the latter obtained in either a supervised or an unsupervised way. The important issue of reconstructing words from the syllable (or morpheme) ASR output is also discussed. For both languages, the best results are reached with morphemes obtained by the unsupervised approach. This yields a very significant WER reduction for Amharic ASR, for which the LM training data is very small (2.3M words), and a slight WER reduction over a word-LM baseline for Swahili ASR (28M words for LM training). A detailed analysis of OOV word reconstruction is also presented; it shows that a high percentage of OOV words (up to 75% for Amharic) can be recovered with a morph-based language model and an appropriate reconstruction method.

KEYWORDS: Language model, Morpheme, Out-of-vocabulary, Under-resourced languages.
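The abstract does not say how words are rebuilt from the sub-word ASR output or how OOV recovery is counted, so the following is only a minimal sketch under stated assumptions. It assumes every non-final morph of a word carries a trailing `+` marker (a hypothetical convention, not necessarily the paper's), and it counts an OOV word as recovered if it appears anywhere in the reconstructed hypothesis, which is a simplification that ignores alignment. The names `reconstruct_words` and `oov_recovery_rate` are illustrative.

```python
from typing import Iterable, List, Set

# Hypothetical convention: a morph ending in "+" continues into the next morph.
MARKER = "+"


def reconstruct_words(morphs: Iterable[str], marker: str = MARKER) -> List[str]:
    """Glue a morph sequence back into words using the continuation marker."""
    words: List[str] = []
    current = ""
    for morph in morphs:
        if morph.endswith(marker):
            # Non-final morph: strip the marker and keep accumulating.
            current += morph[: -len(marker)]
        else:
            # Word-final morph: close the current word.
            words.append(current + morph)
            current = ""
    if current:
        # Dangling partial word at the end of the output; keep it as is.
        words.append(current)
    return words


def oov_recovery_rate(
    reference: List[str], hypothesis: List[str], vocabulary: Set[str]
) -> float:
    """Fraction of reference OOV tokens that appear in the hypothesis.

    Simplification (assumption): membership test only, no alignment.
    """
    oov_tokens = [w for w in reference if w not in vocabulary]
    if not oov_tokens:
        return 0.0
    hypothesis_set = set(hypothesis)
    recovered = sum(1 for w in oov_tokens if w in hypothesis_set)
    return recovered / len(oov_tokens)


if __name__ == "__main__":
    # Toy Swahili-like example: "ni+ na+ soma" is glued into "ninasoma".
    hyp = reconstruct_words(["ni+", "na+", "soma", "kitabu"])
    print(hyp)
    print(oov_recovery_rate(["ninasoma", "kitabu"], hyp, {"kitabu"}))
```

Marking morphs at segmentation time keeps reconstruction deterministic; an unmarked segmentation would instead need a separate model to decide where word boundaries fall.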
