Development of a Robust Automatic Speech Recognition System for Children's Speech Using the Kaldi Toolkit

In this paper, a Punjabi children's speech recognition system is developed using the subspace Gaussian mixture model (SGMM) acoustic modeling technique. The front end relies on Mel-frequency cepstral coefficient (MFCC) features to capture the time-varying characteristics of the input speech signal. SGMM is integrated with a hidden Markov model (HMM), in which it models the likelihood of each HMM state given a short-windowed frame of speech. To handle the acoustic variability among child speakers, speaker adaptive training (SAT) based on vocal-tract length normalization (VTLN) and feature-space maximum likelihood linear regression (fMLLR) is adopted. Kaldi, an open-source speech recognition toolkit, is used to develop the robust automatic speech recognition (ASR) system for Punjabi children's speech. The SGMM accumulates the frame coefficients and their posterior probabilities and passes these probabilities to the HMM, which systematically aligns the frames to its states and produces the output from those states. As a result, SGMM yields a large performance margin in Punjabi children's speech recognition. A marked reduction in word error rate (WER) was observed with SGMM when the feature dimension was varied. The developed children's ASR system achieved a recognition accuracy of 83.66% when tested with the feature dimension set to 12.
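
To make the role of the SGMM more concrete, the following Python sketch computes the log-likelihood of one feature frame under one HMM state using the basic SGMM formulation: each state j has a low-dimensional state vector v_j, the Gaussian means are obtained as mu_ji = M_i v_j from shared projection matrices M_i, and the mixture weights are a softmax over w_i^T v_j. This is a simplified illustration under stated assumptions, not Kaldi's implementation: it assumes a single substate per state, no speaker subspace, and diagonal covariances (SGMMs normally use shared full covariances). All function names and toy parameter values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a D x S matrix (list of D rows) by an S-dimensional vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_gauss_diag(x: Vector, mean: Vector, var: Vector) -> float:
    """Log-density of a diagonal-covariance Gaussian (a simplifying assumption)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * s) + (xi - mi) ** 2 / s
        for xi, mi, s in zip(x, mean, var)
    )


def sgmm_state_loglik(x: Vector, v_j: Vector, M: List[Matrix],
                      w: List[Vector], variances: List[Vector]) -> float:
    """log p(x | j) = log sum_i w_ji * N(x; M_i v_j, Sigma_i).

    Simplified SGMM: one substate per state, no speaker vectors.
    M, w and variances are shared by all states; only v_j is state-specific.
    """
    # Mixture weights w_ji: softmax over the logits w_i^T v_j.
    logits = [sum(a * b for a, b in zip(w_i, v_j)) for w_i in w]
    log_norm = log_sum_exp(logits)
    terms = []
    for logit, M_i, var_i in zip(logits, M, variances):
        mean_ji = matvec(M_i, v_j)  # state-dependent mean mu_ji = M_i v_j
        terms.append(logit - log_norm + log_gauss_diag(x, mean_ji, var_i))
    return log_sum_exp(terms)


# Hypothetical toy parameters: 2 feature dims, 2 subspace dims, 2 shared Gaussians.
M = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
w = [[0.0, 0.0], [0.0, 0.0]]          # zero weight projections give equal weights
variances = [[1.0, 1.0], [1.0, 1.0]]
state_vectors = {"state_a": [2.0, 2.0], "state_b": [-1.0, 0.0]}

frame = [2.0, 2.0]  # one toy feature frame
scores = {name: sgmm_state_loglik(frame, v, M, w, variances)
          for name, v in state_vectors.items()}
print(scores)
```

The per-state scores produced this way are the emission likelihoods that an HMM decoder combines with transition probabilities; the parameter sharing through M_i and w_i is what lets an SGMM keep the per-state parameter count small when training data, such as children's speech, is limited.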

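The WER and accuracy figures reported above are conventionally computed from a word-level alignment between the reference transcript and the recognizer output: WER = (S + D + I) / N, where S, D and I are the numbers of substitutions, deletions and insertions and N is the number of reference words; Kaldi's `compute-wer` tool reports the same quantity. The Python sketch below computes it with a standard Levenshtein distance over words. The function names are hypothetical, and the accuracy definition 1 - WER (HTK-style "Acc") is an assumption, since the abstract does not state which definition was used.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    # prev[j] holds the distance between the first i-1 ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # deletion of ref[i-1]
                          curr[j - 1] + 1)     # insertion of hyp[j-1]
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: (S + D + I) / N, with N reference words."""
    ref_words = ref.split()
    return word_edit_distance(ref_words, hyp.split()) / len(ref_words)


def word_accuracy(ref: str, hyp: str) -> float:
    """Assumed HTK-style accuracy: (N - S - D - I) / N = 1 - WER."""
    return 1.0 - wer(ref, hyp)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"   # one substitution, one deletion
print(f"WER = {wer(reference, hypothesis):.4f}")
print(f"Accuracy = {word_accuracy(reference, hypothesis):.4f}")
```

If the reported 83.66% recognition accuracy follows this definition, it corresponds to a WER of 16.34%.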