Role of Prosodic Features on Children's Speech Recognition

In this paper, we explore the role of combining prosodic variables with conventional acoustic features for children's speech recognition using acoustic models trained on adults' speech. The acoustic features studied are Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction cepstral coefficients (PLPCC), while the prosodic variables considered are loudness, voice intensity, and voicing probability. The analysis presented in this paper shows that, when the textual content is the same, the considered prosodic variables exhibit very similar contours for adults' and children's speech, whereas the contours differ substantially when the content differs. Consequently, including prosodic information reduces inter-speaker differences and increases class discrimination, which in turn improves recognition performance. Further improvements are obtained by projecting the combined feature vectors onto a lower-dimensional subspace. These observations are experimentally verified in this study for mismatched speech recognition using a deep neural network (DNN) based system. On combining MFCC (PLPCC) and prosodic features, a relative improvement of 16% (14%) is obtained when decoding children's speech with DNN models trained on adults' data.
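As a rough illustration of the feature-combination step described above, the sketch below concatenates frame-level MFCCs with two simple prosodic estimates (an RMS-energy loudness proxy and a voicing probability from the pYIN tracker) and then projects the stacked vectors onto a lower-dimensional subspace. The libraries (librosa, scikit-learn), frame settings, file name, and the use of PCA in place of the paper's learned linear projection are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact pipeline): stack MFCCs with
# frame-level prosodic estimates, then project to a lower-dimensional
# subspace. All parameter values below are illustrative assumptions.
import numpy as np
import librosa
from sklearn.decomposition import PCA

def combined_features(wav_path, sr=16000, n_mfcc=13,
                      frame_length=400, hop_length=160):
    """Return frames of [MFCC | loudness proxy | voicing probability]."""
    y, sr = librosa.load(wav_path, sr=sr)

    # 13-dimensional MFCCs with a 25 ms window / 10 ms shift at 16 kHz.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)

    # Short-time RMS energy as a simple loudness/intensity proxy.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)

    # Per-frame voicing probability from the pYIN pitch tracker.
    _, _, voiced_prob = librosa.pyin(y, fmin=70, fmax=500, sr=sr,
                                     frame_length=frame_length,
                                     hop_length=hop_length)
    voiced_prob = np.nan_to_num(voiced_prob)[np.newaxis, :]

    # Align frame counts (trackers may differ by a frame) and stack.
    n = min(mfcc.shape[1], rms.shape[1], voiced_prob.shape[1])
    return np.vstack([mfcc[:, :n], rms[:, :n], voiced_prob[:, :n]]).T

# Project the concatenated features to a lower-dimensional subspace.
# PCA stands in here for the supervised projection used in the paper.
feats = combined_features("adult_utterance.wav")   # hypothetical file name
proj = PCA(n_components=13).fit(feats)
feats_low = proj.transform(feats)
```

In practice such a projection would be estimated once on the training data and then applied to both adults' and children's feature streams before DNN training and decoding.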
