Arabic Dialectical Speech Recognition in Mobile Communication Services

In this chapter we present a practical approach to building an Arabic automatic speech recognition (ASR) system for mobile telecommunication service applications. We also present a procedure for adapting the acoustic model to better account for pronunciation variation across the Arabic-speaking countries. Modern Standard Arabic (MSA) is the common spoken and written language of all the Arab countries, ranging from Morocco in the west to Syria in the east, and including Egypt and Tunisia. However, pronunciation varies significantly from one country to another, to the degree that two people from different countries may not be able to understand each other. This is because the Arabic-speaking countries are characterized by a large number of dialects that differ to such an extent that they are no longer mutually intelligible and could almost be described as different languages. Arabic dialects are mostly spoken rather than written varieties. MSA is common across the Arab countries, but it is often influenced by the dialect of the speaker. This particularity of the Arab countries poses a practical problem for the development of speech-based applications in the region: if a speech application is built for one country and is influenced by one dialect, what does it take to adapt the system to serve another country in a different dialect region? This is particularly challenging because the resources needed to build an accurate speaker-independent Arabic ASR system for mobile telecommunication service applications are limited for most Arabic dialects and countries. Recent advances in speaker-independent automatic speech recognition (SI-ASR) have demonstrated that highly accurate recognition can be achieved if enough training data is available.
However, the amount of available speech data that takes into account the dialectal variation of each Arab country is limited, making it challenging to build a high-performance SI-ASR system, especially for specific target applications. Another major challenge in building an SI-ASR system is handling speaker variation in spoken language. This variation can be due to age, gender, and educational level, as well as to the dialectal variants of the Arabic language. An ASR system trained on one regional variety usually performs worse when applied to another. Three problems may arise when an SI-ASR system built for one dialect is applied to target users with a different dialect: (1) acoustic model mismatch, (2) pronunciation lexicon mismatch, and (3) language model mismatch.
