Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition

The automatic recognition of pathological speech, particularly from children with articulatory impairments, is challenging for several reasons. One obstacle is the lack of domain-specific data, which hinders the use of automatic recognition in speech-based applications targeting pathological speakers. To address this challenge, we investigate several data augmentation techniques that simulate training data for improving children's speech recognition, taking cleft lip and palate (CLP) speech as the case study. The techniques explored are vocal tract length perturbation (VTLP), reverberation, speaking rate modification, pitch modification, and speech feature modification using cycle-consistent adversarial networks (CycleGAN). We find that data augmentation significantly improves CLP speech recognition performance, with the improvement most evident for CycleGAN-based feature modification, VTLP, and reverberation. More specifically, the augmented systems achieve a lower phone error rate than the systems trained without data augmentation.
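To make one of the listed techniques concrete, the following is a minimal sketch of the piecewise-linear frequency warping commonly associated with VTLP. It is an illustration only and is not taken from this work: the function names `vtlp_warp` and `sample_alpha`, the 16 kHz sampling rate, the 4800 Hz boundary frequency, and the warp-factor range of 0.9 to 1.1 are all assumptions chosen for the example.

```python
import random


def vtlp_warp(freq_hz, alpha, sample_rate=16000.0, f_hi=4800.0):
    """Map a frequency (Hz) to its warped value for warp factor alpha.

    Assumed piecewise-linear form: frequencies below a boundary are scaled
    by alpha; frequencies above it are mapped linearly so that the Nyquist
    frequency stays fixed. All default values are illustrative assumptions.
    """
    nyquist = sample_rate / 2.0
    # Boundary below which the warp is a plain scaling by alpha.
    boundary = f_hi * min(alpha, 1.0) / alpha
    if freq_hz <= boundary:
        return freq_hz * alpha
    # Upper segment: a straight line from (boundary, f_hi * min(alpha, 1))
    # to (nyquist, nyquist), which keeps the mapping continuous.
    scale = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    return nyquist - scale * (nyquist - freq_hz)


def sample_alpha(rng, low=0.9, high=1.1):
    """Draw one random warp factor per utterance (range is an assumption)."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = sample_alpha(rng)
    for f in (0.0, 1000.0, 4000.0, 8000.0):
        print(f, vtlp_warp(f, alpha))
```

In a typical pipeline of this kind, the warped frequencies would be applied to the centre frequencies of the mel filterbank before feature extraction, so that each training utterance is seen with a slightly different simulated vocal tract length.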
