Morphological analysis and decomposition for Arabic speech-to-text systems

Language modelling for a morphologically complex language such as Arabic is a challenging task. The agglutinative structure of Arabic results in data sparsity and high out-of-vocabulary rates. In this work these problems are tackled by applying the MADA tools to the Arabic text. In addition to morphological decomposition, MADA performs context-dependent stem normalisation. Thus, if word-level system combination or scoring is required, this normalisation must be reversed. To address this, a novel context-sensitive method for morpheme-to-word conversion is introduced. The performance of the MADA-decomposed system was evaluated on an Arabic broadcast transcription task. The MADA-based system outperformed the word-based system, with both the morphological decomposition and the stem normalisation found to be important.
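The abstract does not spell out the conversion algorithm, so the following is only a minimal sketch of what a context-sensitive morpheme-to-word step could look like, not the method used in this work and not part of the MADA tools. It assumes a hypothetical marker convention in which a prefix morpheme ends in `+` and a suffix morpheme begins with `+`, and it uses two tiny hand-written lookup tables (`AFTER_PREFIX`, `BEFORE_SUFFIX`) with toy Buckwalter-style strings chosen purely for illustration. All function and table names are invented for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy tables (assumptions for illustration only).
# Surface form of a morpheme when it follows a particular prefix.
AFTER_PREFIX: Dict[Tuple[str, str], str] = {
    ("l+", "Al+"): "l",
}
# Surface form of a normalised stem when a suffix is attached to it.
BEFORE_SUFFIX: Dict[str, str] = {
    "mdrsp": "mdrst",
}


def group_morphemes(tokens: List[str]) -> List[List[str]]:
    """Group a flat morpheme stream into one list of morphemes per word."""
    words: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if current and (tok.startswith("+") or current[-1].endswith("+")):
            # A suffix, or anything following a prefix, stays in the same word.
            current.append(tok)
        else:
            if current:
                words.append(current)
            current = [tok]
    if current:
        words.append(current)
    return words


def to_surface(morphemes: List[str]) -> str:
    """Join the morphemes of one word, undoing normalisation using context."""
    parts: List[str] = []
    for i, m in enumerate(morphemes):
        prev = morphemes[i - 1] if i > 0 else None
        nxt = morphemes[i + 1] if i + 1 < len(morphemes) else None
        if (prev, m) in AFTER_PREFIX:
            form = AFTER_PREFIX[(prev, m)]
        elif nxt is not None and nxt.startswith("+") and m in BEFORE_SUFFIX:
            form = BEFORE_SUFFIX[m]
        else:
            # No context rule applies: just drop the attachment markers.
            form = m.strip("+")
        parts.append(form)
    return "".join(parts)


def morphemes_to_words(tokens: List[str]) -> List[str]:
    """Convert a morpheme-level hypothesis into a word-level one."""
    return [to_surface(word) for word in group_morphemes(tokens)]


if __name__ == "__main__":
    print(morphemes_to_words(["l+", "Al+", "ktAb", "fy", "mdrsp", "+hm"]))
```

In this sketch the same stem is restored differently depending on its neighbours: `mdrsp` is left unchanged when it stands alone but becomes `mdrst` when a suffix follows. A real system would presumably derive such mappings from the decomposed training text rather than from hand-written tables.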
