Exploring the role of pitch-adaptive cepstral features in context of children's mismatched ASR

This work explores the role of pitch-adaptive cepstral features in the context of automatic speech recognition (ASR) of children's speech using acoustic models trained on adults' speech. Owing to the large acoustic mismatch between training and test data, recognition rates degrade severely in such cases. Earlier studies have shown that this acoustic mismatch is aggravated by insufficient smoothing of pitch harmonics in mel-frequency cepstral coefficient (MFCC) features for child speakers. Motivated by that, we explore pitch-adaptive cepstral features for reducing the sensitivity to gross pitch variations. For this purpose, a simple technique based on adaptive cepstral truncation is employed for deriving the pitch-adaptive MFCCs. For comparison, we also evaluate the existing STRAIGHT-based MFCCs. Both approaches are found to yield significant and comparable improvements in the mismatched children's ASR case. The effectiveness of the adaptive-truncation-based approach is also demonstrated in the context of deep-neural-network-based acoustic models. Further, it is shown that the effectiveness of existing feature normalization techniques remains intact when the proposed features are used.
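The abstract does not specify the truncation rule, so the following is only a minimal sketch of one plausible reading of adaptive cepstral truncation: the real cepstrum of a frame is low-time liftered with a cutoff tied to the pitch period, which removes the quefrency region where pitch harmonics appear before the smoothed spectrum is passed on to mel filtering. The function name, the `fraction` parameter, the FFT size, and the synthetic test frames are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def truncated_cepstrum_envelope(frame, sample_rate, f0_hz, fraction=0.5, n_fft=1024):
    """Smooth a frame's log spectrum by zeroing cepstral bins at and above
    a cutoff set to `fraction` of the pitch period (illustrative assumption)."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag, n_fft)      # real cepstrum, length n_fft

    pitch_period = sample_rate / f0_hz           # pitch period in samples
    cutoff = max(1, int(fraction * pitch_period))
    cutoff = min(cutoff, n_fft // 2)

    # Low-time lifter: keep bins below the cutoff and their mirrored copies.
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0
    lifter[n_fft - cutoff + 1:] = 1.0

    smoothed_log_mag = np.fft.rfft(cepstrum * lifter).real
    return smoothed_log_mag, cutoff

def harmonic_frame(f0_hz, sample_rate=16000, length=400, n_harmonics=5):
    """Hypothetical test signal: a sum of harmonics of f0_hz."""
    t = np.arange(length) / sample_rate
    return sum(np.sin(2 * np.pi * k * f0_hz * t) for k in range(1, n_harmonics + 1))

# A higher pitch (child-like, 300 Hz) gives a shorter pitch period and hence
# a lower truncation point than a lower pitch (adult-like, 100 Hz).
env_child, cut_child = truncated_cepstrum_envelope(harmonic_frame(300.0), 16000, 300.0)
env_adult, cut_adult = truncated_cepstrum_envelope(harmonic_frame(100.0), 16000, 100.0)
print(cut_child, cut_adult)
```

In this sketch the smoothed log spectrum would still need mel filterbank integration and a DCT to produce MFCC-like features; those steps are omitted because they are standard and unchanged by the truncation.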
