Modeling vowels for Arabic BN transcription

This paper describes the LIMSI Arabic Broadcast News system, which produces a vowelized word transcription. The system, which runs in under 10x real time and was evaluated in the NIST RT-04F evaluation, uses a three-pass decoding strategy with gender- and bandwidth-specific acoustic models, a vowelized 65k word class pronunciation lexicon, and a word-class 4-gram language model. In order to explicitly represent the vowelized word forms, each nonvowelized word entry is treated as a word class grouping all of its associated vowelized forms. Since Arabic texts are almost exclusively written without vowels, an important challenge is to use these texts efficiently in a system producing vowelized output. A portion of the acoustic training data was manually transcribed with short vowels, enabling an initial set of acoustic models to be estimated in a supervised manner. The remaining audio data, for which vowels are not annotated, were used for training in an implicit manner, with the recognizer choosing the preferred vowelized form. The system was trained on a total of about 150 hours of audio data and almost 600 million words of Arabic texts, and achieved word error rates of 16.0% and 18.5% on the dev04 and eval04 data, respectively.
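The word-class idea can be illustrated with a minimal sketch. It is not the system's actual lexicon code: it assumes a Buckwalter-style romanization in which short vowels and related diacritics are written as the characters `a i u o ~ F N K`, and the names `DIACRITICS`, `strip_vowels`, and `build_word_classes` are hypothetical. Each nonvowelized form acts as a class key whose members are the vowelized forms that reduce to it.

```python
from collections import defaultdict

# Assumed (simplified) set of Buckwalter-style symbols for short vowels,
# sukun, shadda, and tanwin. The real system's symbol set may differ.
DIACRITICS = set("aiuo~FNK")


def strip_vowels(vowelized: str) -> str:
    """Return the nonvowelized form by dropping the diacritic symbols."""
    return "".join(ch for ch in vowelized if ch not in DIACRITICS)


def build_word_classes(vowelized_forms):
    """Group vowelized forms under their shared nonvowelized form."""
    classes = defaultdict(list)
    for form in vowelized_forms:
        classes[strip_vowels(form)].append(form)
    return dict(classes)


# Illustrative vowelized forms (romanized); long vowel 'A' is kept as a letter.
classes = build_word_classes(["kataba", "kutiba", "kutub", "kitAb"])
for word_class, members in classes.items():
    print(word_class, members)
```

In this sketch the language model would be estimated over the class keys (which can be counted in ordinary nonvowelized text), while the lexicon supplies a pronunciation for each vowelized member of a class.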