Improved Mispronunciation Detection System Using a Hybrid CTC-ATT Based Approach for L2 English Speakers

This report presents research on mispronunciation detection within the field of Computer Assisted Language Learning (CALL). Mispronunciation detection is a core component of Computer Assisted Pronunciation Training (CAPT) systems, which are a subset of CALL. Studies on automated pronunciation error detection began in the 1990s, but the development of full-fledged CAPT systems has accelerated only in the last decade, driven by increased computing power and the availability of mobile devices for recording the speech needed for pronunciation analysis. Detecting pronunciation errors is a hard problem because there is no formal definition of correct and incorrect pronunciation. As a result, systems typically detect prosodic errors and phoneme errors such as substitution, insertion, and deletion. It is also widely agreed that pronunciation learning should focus on speaker intelligibility rather than on sounding like an L1 English speaker. Early methods relied on a posterior-likelihood measure called Goodness of Pronunciation (GOP), computed with Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and Deep Neural Network-Hidden Markov Model (DNN-HMM) approaches. These systems are complex to implement compared with the recently proposed ASR-based End-to-End (E2E) mispronunciation detection systems. The purpose of this research is to build E2E models using Connectionist Temporal Classification (CTC) and an attention-based sequence decoder. E2E models have recently shown considerable improvement in mispronunciation detection accuracy. This research draws a comparison among the baseline models CNN-RNN-CTC, CNN-RNN-CTC with a character sequence-based attention decoder, and CNN-RNN-CTC with a phoneme-based decoder. The study will help in deciding on a better approach to developing an efficient mispronunciation detection system.
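The phoneme error types named above (substitution, insertion, deletion) can be illustrated by aligning a canonical phoneme sequence with the sequence a recognizer outputs. The sketch below is a minimal illustration using a standard edit-distance alignment; it is not the report's own pipeline. The function name `align_phonemes` and the ARPAbet example sequences are hypothetical and chosen only for demonstration.

```python
from typing import List, Tuple


def align_phonemes(
    canonical: List[str], recognized: List[str]
) -> List[Tuple[str, str, str]]:
    """Label each aligned position as correct, substitution, deletion, or insertion.

    Hypothetical helper for illustration; uses plain Levenshtein alignment
    with unit costs, which is an assumption and not taken from the report.
    """
    n, m = len(canonical), len(recognized)
    # dp[i][j] = edit distance between canonical[:i] and recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,  # match or substitution
                dp[i - 1][j] + 1,         # canonical phoneme was dropped
                dp[i][j - 1] + 1,         # extra phoneme was produced
            )

    # Trace back from the end of both sequences to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                label = "correct" if cost == 0 else "substitution"
                ops.append((label, canonical[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], "-"))
            i -= 1
        else:
            ops.append(("insertion", "-", recognized[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # Illustrative ARPAbet sequences (assumed, not from any dataset).
    for op in align_phonemes(["DH", "AH", "K", "AE", "T"], ["D", "AH", "K", "AE"]):
        print(op)
```

In an E2E setting, `recognized` would come from the decoder output of the trained model, and the non-"correct" labels would be reported to the learner as detected mispronunciations.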
