Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition

Research in Arabic automatic speech recognition (ASR) is constrained by datasets of limited size and of highly variable content and quality. Arabic-language resources vary in the attributes that affect language resources in other languages (noise, channel, speaker, genre), but they also vary significantly in the dialect and level of formality of the spoken Arabic they capture. Many languages exhibit similar levels of cross-dialect and cross-register acoustic variability, but these effects have been under-studied. This paper presents an experimental analysis of the interaction between classical ASR corpus-compensation methods (feature selection, data selection, gender-dependent acoustic models) and the dialect- and register-dependent variation among Arabic ASR corpora. The first interaction studied is that between acoustic recording quality and discrete pronunciation variation. Discrete pronunciation variation can be compensated for by using grapheme-based rather than phone-based acoustic models and by filtering out speakers with insufficient training data; the latter technique also helps to compensate for poor recording quality, which is further compensated by eliminating delta-delta acoustic features. Together, these three techniques reduce Word Error Rate (WER) by between 3.24% and 5.35%. The second aspect of dialect and register variation considered is variation in the fine-grained acoustic realization of each phoneme in the language. Experimental results show that gender and dialect are the principal components of variation in speech; consequently, building gender- and dialect-specific models leads to substantial decreases in WER. To determine the degree of acoustic difference between the phone models required for each Arabic dialect, cross-dialect experiments measure how far apart the dialects are acoustically, informing the minimal number of recognition systems needed to cover all dialectal Arabic. Finally, the research addresses an important question: how much training data is needed to build an effective speaker-independent ASR system? Learning curves are developed to determine how large the training set must be to achieve acceptable performance.
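To make the feature-selection and data-selection steps concrete, the following is a minimal Python/NumPy sketch, not the paper's implementation: the function names, the regression-delta window, and the 10-utterance speaker threshold are all illustrative assumptions. It builds feature vectors from static MFCCs with deltas but without delta-deltas, and drops speakers with too little training material.

```python
import numpy as np

def delta(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Standard regression-based delta coefficients over a +/-window frame context."""
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, window + 1))
    n = len(feats)
    return sum(
        i * (padded[window + i : n + window + i] - padded[window - i : n + window - i])
        for i in range(1, window + 1)
    ) / denom

def build_features(mfcc: np.ndarray, use_delta_delta: bool = False) -> np.ndarray:
    """Stack static MFCCs with deltas; delta-deltas are optional, reflecting the
    finding that dropping them helps on poor-quality recordings."""
    d = delta(mfcc)
    blocks = [mfcc, d] + ([delta(d)] if use_delta_delta else [])
    return np.hstack(blocks)

def filter_speakers(utts_by_speaker: dict, min_utts: int = 10) -> dict:
    """Data selection: drop speakers with insufficient training data
    (the 10-utterance threshold here is a placeholder, not the paper's value)."""
    return {spk: utts for spk, utts in utts_by_speaker.items() if len(utts) >= min_utts}

# Toy usage: 200 frames of 13-dimensional static MFCCs -> 26-dim static+delta vectors.
mfcc = np.random.default_rng(0).normal(size=(200, 13))
feats = build_features(mfcc)
print(feats.shape)  # (200, 26)
```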
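Gender- and dialect-dependent acoustic modeling, and the cross-dialect acoustic comparison, could be prototyped along the lines below. This is a toy sketch using scikit-learn's GaussianMixture on synthetic features; the (gender, dialect) keys, data, and component counts are invented for illustration and carry no relation to the paper's actual models or corpora.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical toy corpus: feature matrices keyed by (gender, dialect).
rng = np.random.default_rng(0)
corpus = {
    ("female", "MSA"):  rng.normal(0.0, 1.0, size=(500, 13)),
    ("male",   "MSA"):  rng.normal(0.5, 1.2, size=(500, 13)),
    ("female", "Gulf"): rng.normal(-0.3, 0.9, size=(500, 13)),
}

# One model per (gender, dialect) cell, mirroring the idea that these are
# the principal axes of acoustic variation among speakers.
models = {
    key: GaussianMixture(n_components=4, covariance_type="diag").fit(feats)
    for key, feats in corpus.items()
}

# Scoring held-out frames under every model gives a crude measure of how far
# apart the cells are acoustically: if two dialects score each other's data
# nearly as well as their own, a shared recognizer may suffice.
test = rng.normal(0.0, 1.0, size=(100, 13))
scores = {key: m.score(test) for key, m in models.items()}  # avg log-likelihood
print("best-matching model:", max(scores, key=scores.get))
```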
