On the Impact of Children's Emotional Speech on Acoustic and Language Models

The automatic recognition of children's speech is well known to be a challenge, and so is the influence of affect, which is believed to degrade the performance of a speech recogniser. In this contribution, we investigate the combination of both phenomena. Extensive test runs are carried out for 1k-vocabulary continuous speech recognition on spontaneous motherese, emphatic, and angry children's speech as opposed to neutral speech. The experiments address the question of how specific emotions influence word accuracy. In a first scenario, "emotional" speech recognisers are compared to a speech recogniser trained on neutral speech only. For this comparison, equal amounts of training data are used for each emotion-related state. In a second scenario, a "neutral" speech recogniser trained on large amounts of neutral speech is adapted by adding a small amount of emotionally coloured data to the training process. The results show that emphatic and angry speech is recognised best—even better than neutral speech—and that performance can be improved further by adapting the acoustic and language models. In order to show the variability of emotional speech, we visualise the distribution of the four emotion-related states in the MFCC space by applying a Sammon transformation.
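The Sammon transformation mentioned above projects high-dimensional feature vectors (here, MFCCs) onto a low-dimensional plane while preserving pairwise distances as far as possible, weighting small original distances more heavily than plain MDS. As a rough illustration only (not the authors' implementation, which is not given here), the mapping can be sketched in NumPy with simple gradient descent on Sammon's stress; the function name, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

def sammon(X, n_iter=500, lr=0.3, seed=0):
    """Project points X of shape (n, d) to 2-D by minimising Sammon's stress
    E = (1/c) * sum_{i<j} (D_ij - d_ij)^2 / D_ij  via gradient descent,
    where D are original and d are embedded pairwise distances."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # pairwise distances in the original feature space
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D[D == 0] = 1e-12                          # guard against division by zero
    c = D[np.triu_indices(n, 1)].sum()         # normalising constant
    Y = rng.normal(scale=1e-2, size=(n, 2))    # small random 2-D start
    for _ in range(n_iter):
        d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
        np.fill_diagonal(d, 1e-12)
        # per-pair weight of the stress gradient; zero on the diagonal
        coef = (D - d) / (D * d)
        np.fill_diagonal(coef, 0.0)
        # dE/dY_i = -(2/c) * sum_j coef_ij * (Y_i - Y_j)
        grad = -2.0 / c * (coef[:, :, None]
                           * (Y[:, None, :] - Y[None, :, :])).sum(1)
        Y -= lr * grad
    return Y
```

Applied to per-utterance MFCC means labelled with the four emotion-related states, such a mapping yields the kind of 2-D scatter plot used to visualise how the states overlap in feature space.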
