Study of Formant Modification for Children ASR

Automatic speech recognition (ASR) systems are known to perform poorly on children’s speech because of the large variation and mismatch in acoustic and linguistic attributes between children’s and adults’ speech. One identified source of this mismatch is the difference in formant frequencies between adults and children. In this paper, we propose a formant modification method that mitigates this difference and improves ASR performance for children. The proposed technique yields a 27% relative improvement in system performance over a hybrid DNN-HMM baseline. We also compare it with related speaker adaptation methods, namely vocal tract length normalization (VTLN) and speaking rate adaptation (SRA), and find that the proposed method outperforms both. Combining the proposed method with VTLN and SRA further reduces the word error rate (WER). The proposed method also performs well on noisy speech.
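To make the "relative 27%" figure concrete, the sketch below shows how a relative improvement is commonly computed from two word error rates. It assumes that system performance is measured as WER, which the abstract implies but does not state for this figure. The function name and the two WER values are hypothetical, chosen only to illustrate the arithmetic. They are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical WERs (in percent), for illustration only; not taken from the paper.
    baseline = 20.0
    proposed = 14.6
    reduction = relative_wer_reduction(baseline, proposed)
    print(f"Relative WER reduction: {reduction:.1f}%")
```

A relative figure depends on the baseline: the same absolute drop in WER gives a larger relative reduction when the baseline WER is lower.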
