Significance of Pitch-Based Spectral Normalization for Children's Speech Recognition

It is well established in the literature that acoustic mismatch degrades recognition performance when children's speech is decoded with acoustic models trained on adult speech. Differences in pitch and speaking rate are the two major factors causing this mismatch between the two groups of speakers. This work proposes to incorporate pitch information into an automatic speech recognition (ASR) system by exploiting the correlation between pitch and formants. Using a pitch-based spectrum normalization module in the front-end feature extraction process improves the performance of the mismatched ASR system for children of different age groups. Further, fuzzy-based time-scale modification is applied to study the effect of speaking-rate normalization on the proposed feature. The proposed feature yields relative improvements of 30% and 33% on a DLSTM-based ASR system over the MFCC baseline, without and with speaking-rate normalization, respectively.
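To make the reported figures concrete, the following minimal sketch shows how a relative improvement is typically computed. It assumes the metric is word error rate (WER), which the abstract does not state explicitly. The function name and the input values are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_improvement(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative reduction in error rate, in percent.

    Assumption: both arguments are error rates (e.g. WER in percent),
    so a lower value is better.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical error rates, for illustration only (not taken from the paper).
example = relative_improvement(baseline_wer=20.0, proposed_wer=14.0)
print(f"relative improvement: {example:.1f}%")
```

Under this definition, a relative improvement of 30% means the proposed feature removes 30% of the baseline's errors, not that the error rate drops by 30 absolute points.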
