The CUHK-TUDELFT System for the SLT 2021 Children Speech Recognition Challenge

This technical report describes our submission to the 2021 SLT Children Speech Recognition Challenge (CSRC) Track 1. Our approach combines a joint CTC-attention end-to-end (E2E) speech recognition framework with transfer learning, data augmentation, and the development of several language models. We describe the data pre-processing procedures, the background, and the course of system development. We analyze the experimental results in detail and compare the E2E system with a DNN-HMM hybrid system. Our system achieved a character error rate (CER) of 20.1% on our designated test set and 23.6% on the official evaluation set, placing 10th overall.
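
As a rough illustration of the first component, the sketch below shows the multi-task training objective commonly used in joint CTC-attention E2E ASR: an interpolation of a CTC loss on the shared encoder outputs and a cross-entropy loss on the attention decoder outputs. The class name, tensor layouts, and the CTC weight of 0.3 are illustrative assumptions and do not reflect the submitted system's actual configuration.

import torch.nn as nn

class JointCTCAttentionLoss(nn.Module):
    """Multi-task objective L = w * L_ctc + (1 - w) * L_att (hypothetical sketch)."""

    def __init__(self, ctc_weight: float = 0.3, blank_id: int = 0, ignore_id: int = -100):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.att_loss = nn.CrossEntropyLoss(ignore_index=ignore_id)

    def forward(self, ctc_log_probs, input_lengths, ctc_targets, target_lengths,
                att_logits, att_targets):
        # ctc_log_probs: (T, B, V) log-softmax outputs from the shared encoder branch
        # ctc_targets:   (B, S) reference token ids for CTC; true lengths in target_lengths
        # att_logits:    (B, L, V) attention decoder logits
        # att_targets:   (B, L) reference token ids, padded with ignore_id
        l_ctc = self.ctc_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths)
        l_att = self.att_loss(att_logits.reshape(-1, att_logits.size(-1)),
                              att_targets.reshape(-1))
        # Interpolate the CTC and attention objectives with weight w = ctc_weight
        return self.ctc_weight * l_ctc + (1.0 - self.ctc_weight) * l_att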
