Cross-word Arabic pronunciation variation modeling for speech recognition

Cross-word pronunciation variation is one of the problems in speech recognition of Modern Standard Arabic (MSA). Such variations alter the phonetic spelling of words beyond their listed forms in the phonetic dictionary, leading to a number of Out-Of-Vocabulary (OOV) word forms. This paper presents a knowledge-based approach that models cross-word pronunciation variation at both the phonetic dictionary and language model levels by expanding the phonetic dictionary and the corpus transcription. The Baseline system contains a phonetic dictionary of 14,234 words from a 5.4-hour corpus of Arabic broadcast news; the expanded dictionary contains 15,873 words. The corpus transcription is likewise expanded according to the applied Arabic phonological rules. Using the Carnegie Mellon University (CMU) Sphinx speech recognition engine, the Enhanced system achieved a Word Error Rate (WER) of 9.91% on a test set of about 1.1 hours of fully diacritized Arabic broadcast news transcription. This is a WER reduction of 2.3% compared to the Baseline system.
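The expansion idea can be illustrated with a minimal sketch. It assumes a toy lexicon and a single hypothetical rule (word-final /N/ before word-initial /B/ is realized as /M/, and the two words are joined into one compound entry); the abstract does not list the rules the paper actually applies, so the rule, the phone symbols, the example words, and the names `apply_toy_rule` and `expand_cross_word` are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

Lexicon = Dict[str, List[str]]

# Hypothetical toy lexicon: word -> phone sequence (not taken from the paper).
TOY_LEXICON: Lexicon = {
    "MIN": ["M", "I", "N"],
    "BAYT": ["B", "A", "Y", "T"],
    "KABIR": ["K", "A", "B", "I", "R"],
}


def apply_toy_rule(left: List[str], right: List[str]) -> Optional[List[str]]:
    """Return the merged phone sequence if the toy rule fires, else None.

    Toy rule (an assumption for illustration): a word-final N followed by
    a word-initial B is pronounced M across the word boundary.
    """
    if left and right and left[-1] == "N" and right[0] == "B":
        return left[:-1] + ["M"] + right
    return None


def expand_cross_word(
    transcript: List[str], lexicon: Lexicon
) -> Tuple[List[str], Lexicon]:
    """Expand both the transcription and the dictionary.

    Adjacent word pairs that trigger the rule are replaced in the
    transcription by one compound token, and that compound is added to a
    copy of the lexicon with its cross-word pronunciation.
    """
    expanded_lexicon = dict(lexicon)  # leave the baseline lexicon untouched
    expanded_transcript: List[str] = []
    i = 0
    while i < len(transcript):
        word = transcript[i]
        if i + 1 < len(transcript):
            nxt = transcript[i + 1]
            merged = apply_toy_rule(lexicon[word], lexicon[nxt])
            if merged is not None:
                compound = f"{word}_{nxt}"
                expanded_lexicon[compound] = merged
                expanded_transcript.append(compound)
                i += 2  # both words were consumed by the compound
                continue
        expanded_transcript.append(word)
        i += 1
    return expanded_transcript, expanded_lexicon


new_transcript, new_lexicon = expand_cross_word(
    ["MIN", "BAYT", "KABIR"], TOY_LEXICON
)
print(new_transcript)
print(new_lexicon["MIN_BAYT"])
```

Because the compound token appears in the expanded transcription, a language model trained on that text would see it as an ordinary word, which is one plausible way the dictionary-level and language-model-level expansions stay consistent with each other.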
