Non-Uniform Spectral Smoothing for Robust Children's Speech Recognition

Insufficient spectral smoothing during front-end speech parametrization results in pitch-induced distortions in the short-time magnitude spectra. This, in turn, degrades the performance of an automatic speech recognition (ASR) system for high-pitched speakers. Motivated by this fact, this paper proposes a non-uniform spectral smoothing algorithm to mitigate the acoustic mismatch resulting from pitch differences. In the proposed technique, the speech utterance is first segmented into vowel and non-vowel regions. The short-time magnitude spectrum obtained by the discrete Fourier transform is then processed through a single-pole low-pass filter, with different pole values for vowel and non-vowel regions. Sufficiently smoothed spectra are obtained by choosing higher pole values for vowel regions and lower values for non-vowel regions. The Mel-frequency cepstral coefficients computed from the smoothed spectra are observed to be less affected by pitch variations. To validate this claim, an ASR system is trained on speech from adult speakers and evaluated on a test set consisting of children’s speech, which simulates large pitch differences. The experimental evaluations and the signal-domain analyses presented in this paper support the claim.
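
The smoothing step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the filter equation or the pole values, so the recursion y[k] = (1 - a) x[k] + a y[k-1] applied along the frequency bins, the function names, and the pole values 0.8 and 0.4 are all assumptions made for illustration. The vowel/non-vowel decision per frame is taken as given.

```python
def smooth_spectrum(magnitude, pole):
    """Run a single-pole low-pass filter along the frequency bins of one frame.

    Assumed form (not given in the abstract):
        y[k] = (1 - pole) * x[k] + pole * y[k - 1]
    The (1 - pole) factor keeps unit gain for a flat spectrum.
    """
    if not 0.0 <= pole < 1.0:
        raise ValueError("pole must lie in [0, 1)")
    smoothed = []
    # Seed the recursion with the first bin so that bin passes through unchanged.
    prev = magnitude[0] if magnitude else 0.0
    for value in magnitude:
        prev = (1.0 - pole) * value + pole * prev
        smoothed.append(prev)
    return smoothed


def non_uniform_smoothing(frames, is_vowel, vowel_pole=0.8, non_vowel_pole=0.4):
    """Smooth each frame with a pole chosen by its vowel/non-vowel label.

    frames    : list of per-frame magnitude spectra (lists of floats)
    is_vowel  : list of booleans, one per frame (segmentation assumed given)
    The default pole values are hypothetical placeholders.
    """
    if len(frames) != len(is_vowel):
        raise ValueError("frames and is_vowel must have the same length")
    return [
        smooth_spectrum(frame, vowel_pole if vowel else non_vowel_pole)
        for frame, vowel in zip(frames, is_vowel)
    ]


if __name__ == "__main__":
    # Toy spectrum with strong bin-to-bin ripple, standing in for pitch harmonics.
    rippled = [1.0, 0.0] * 8
    vowel_out, non_vowel_out = non_uniform_smoothing(
        [rippled, rippled], [True, False]
    )
    print("vowel frame tail:    ", [round(v, 3) for v in vowel_out[-4:]])
    print("non-vowel frame tail:", [round(v, 3) for v in non_vowel_out[-4:]])
```

In this sketch a larger pole averages over more neighbouring bins, so the frame labelled as a vowel ends up with less residual ripple than the non-vowel frame, which mirrors the choice of higher pole values for vowels described in the abstract.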
