Correct Pronunciation Detection of the Arabic Alphabet Using Deep Learning

Automatic speech recognition for Arabic poses unique challenges, and progress in this domain has been relatively slow; Classical Arabic in particular has received even less research attention. The correct pronunciation of each letter of the Arabic alphabet directly affects the meaning of words, yet classifying letters by pronunciation correctness remains a challenging task for the research community. In this work, we design learning models that classify Arabic letters according to whether they are pronounced correctly. We divide the problem into two steps: first, we train a model to recognize which letter was spoken (Arabic alphabet classification); second, we train a model to judge the quality of its pronunciation (Arabic alphabet pronunciation classification). Because little audio data of this kind is available, we collected recordings from both experts and novices to train our models. We extract pronunciation features from these recordings using mel-spectrograms and classify them with three models: a deep convolutional neural network (DCNN), AlexNet with transfer learning, and a bidirectional long short-term memory (BLSTM) network, a type of recurrent neural network (RNN). For alphabet classification, DCNN, AlexNet, and BLSTM achieve accuracies of 95.95%, 98.41%, and 88.32%, respectively. For Arabic alphabet pronunciation classification, they achieve accuracies of 97.88%, 99.14%, and 77.71%, respectively.
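The feature-extraction step relies on mel-spectrograms. The following is a minimal NumPy sketch of a log-mel spectrogram, included only to illustrate the technique; the sample rate (16 kHz), FFT size (512), hop length (160 samples), number of mel bands (64), Hann window, and HTK-style mel formula are all assumptions, not settings reported in this work.

```python
import numpy as np

# Assumed settings for illustration only; not the paper's reported configuration.

def hz_to_mel(f):
    # HTK-style mel scale (assumption: other mel formulas exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters mapping the n_fft//2 + 1 FFT bins onto n_mels mel bands."""
    n_bins = n_fft // 2 + 1
    # n_mels + 2 points equally spaced on the mel scale, from 0 Hz to Nyquist
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2, n_bins)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle peaking at 1.0 on the center frequency, zero outside [left, right]
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Return an (n_mels, n_frames) log-mel spectrogram of a 1-D audio signal."""
    if len(signal) < n_fft:
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    # Slice overlapping frames and apply the window to each one
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # (n_frames, n_bins)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T          # (n_frames, n_mels)
    # Small offset avoids log(0) in silent frames or empty bands
    return np.log(mel + 1e-10).T
```

For one second of a 440 Hz tone sampled at 16 kHz, this returns a 64 × 97 array whose strongest band sits near 440 Hz. An array of this kind can be treated as an image by a CNN such as AlexNet (typically after resizing to the network's input size) or as a sequence of per-frame mel vectors by a BLSTM.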
