Evaluating the effect of using different transcription schemes in building a speech recognition system for Arabic

It is well known that the Arabic language poses non-trivial problems for Automatic Speech Recognition (ASR) systems. This paper is concerned with the problems posed by the complex morphology of the language and the absence of diacritics in its written form. Several acoustic and language models are built using different transcription resources, namely a grapheme-based transcription which uses non-diacriticised text materials, phoneme-based transcriptions obtained from automatic diacritisation tools (SAMA or MADAMIRA), and a predefined dictionary. The paper presents a comprehensive assessment of these transcription schemes by employing them in building a collection of Arabic ASR systems using the GALE (phase 3) Arabic broadcast news and broadcast conversational speech datasets (LDC, 2015), which include 260 h of recorded material. Contrary to our expectations, the experimental evidence shows that grapheme-based transcription is superior to phoneme-based transcription. To investigate this further, the MADAMIRA analysis is modified by applying a number of simple phonological rules. These modifications substantially improve the systems’ performance, but it remains inferior to that obtained with a simple grapheme-based transcription. The research also examined the use of a manually diacriticised subset of the data in training the ASR system and compared it with grapheme-based transcription and with phoneme-based transcription obtained from MADAMIRA. The goal of this step is to validate MADAMIRA’s analysis. The results show that using the manually diacriticised text to generate the phonetic transcription significantly decreases the word error rate (WER) compared with using MADAMIRA-diacriticised text or isolated graphemes.
The results obtained strongly indicate that providing the training model with less information about the data (only graphemes) is less damaging than providing it with inaccurate information.
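
To make the idea of a grapheme-based transcription concrete, the following is a minimal Python sketch, not the pipeline used in the paper. It assumes that a grapheme lexicon simply maps each non-diacriticised word to its sequence of letters; the function names and the set of code points treated as diacritics are illustrative assumptions.

```python
# Illustrative sketch only: the names below are hypothetical and the
# paper's actual lexicon-building procedure may differ.

# Assumption: treat U+064B..U+0652 (fathatan through sukun) and
# U+0670 (superscript alef) as the diacritics to remove.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(word):
    """Return the word with the assumed diacritic marks removed."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def grapheme_pronunciation(word):
    """Use each remaining letter as one 'phone' of the pronunciation."""
    return list(strip_diacritics(word))


def build_grapheme_lexicon(words):
    """Map each non-diacriticised word form to its letter sequence."""
    lexicon = {}
    for word in words:
        bare = strip_diacritics(word)
        if bare:  # skip tokens that consisted only of diacritics
            lexicon[bare] = list(bare)
    return lexicon


if __name__ == "__main__":
    # A diacriticised word (kaf, teh, beh, each followed by a fatha)
    # and its bare form collapse to the same lexicon entry.
    diacritised = "\u0643\u064e\u062a\u064e\u0628\u064e"
    bare = "\u0643\u062a\u0628"
    for entry, letters in build_grapheme_lexicon([diacritised, bare]).items():
        print(entry, " ".join(letters))
```

In such a lexicon the acoustic model receives no vowel information at all, which corresponds to the "less information" condition described above; a phoneme-based lexicon would instead depend on the diacritics supplied by a tool or a human annotator.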
