Detection of Consonant Errors in Disordered Speech Based on Consonant-vowel Segment Embedding

Speech sound disorder (SSD) is a developmental disorder in which young children have persistent difficulty producing certain speech sounds at the expected age. Consonant errors are the major indicator of SSD in clinical assessment. Previous studies on automatic assessment of SSD found that detection of speech errors involving short and transitory consonants is less satisfactory. This paper investigates a neural-network-based approach to detecting consonant errors in disordered speech using consonant-vowel (CV) diphone segments, in comparison with using consonant monophone segments. The underlying assumption is that the vowel part of a CV segment carries important co-articulation information from the consonant. Speech embeddings are extracted from CV segments by a recurrent neural network model. Similarity scores between the embedding of the test segment and those of the reference segments are computed to determine whether the test segment is the expected consonant. Experimental results show that using CV segments improves the detection of speech errors involving the “difficult” consonants reported in previous studies.
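The similarity-based decision described above can be illustrated with a minimal sketch. The abstract does not state which similarity measure or decision rule is used, so the cosine similarity, the averaging over reference segments, the threshold value, and all function names below are assumptions made for illustration only; the embeddings are toy vectors, not outputs of the recurrent model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_reference_score(test_emb, reference_embs):
    """Average similarity of a test embedding to all reference embeddings."""
    if not reference_embs:
        raise ValueError("at least one reference embedding is required")
    scores = [cosine_similarity(test_emb, ref) for ref in reference_embs]
    return sum(scores) / len(scores)


def is_expected_consonant(test_emb, reference_embs, threshold=0.5):
    """Accept the test segment if its mean score reaches the threshold.

    The threshold of 0.5 is a hypothetical value; in practice it would be
    tuned on held-out data.
    """
    return mean_reference_score(test_emb, reference_embs) >= threshold


if __name__ == "__main__":
    # Toy embeddings standing in for CV-segment embeddings of one target consonant.
    references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
    print(is_expected_consonant([0.95, 0.05, 0.0], references))  # close to references
    print(is_expected_consonant([0.0, 1.0, 0.0], references))    # far from references
```

Averaging over several reference segments is one simple way to reduce sensitivity to any single reference; other aggregation rules (for example, taking the maximum score) would fit the same description.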
