Static and Dynamic Modelling for the Recognition of Non-verbal Vocalisations in Conversational Speech

Non-verbal vocalisations such as laughter, breathing, hesitation, and consent play an important role in the recognition and understanding of human conversational speech and spontaneous affect. In this contribution we discuss two strategies for the robust discrimination of such events: dynamic modelling of a broad selection of diverse acoustic Low-Level Descriptors, and static modelling that projects these descriptors via statistical functionals onto a 0.6k feature space with subsequent de-correlation. As classifiers we employ Hidden Markov Models and Conditional Random Fields for the dynamic approach, and Support Vector Machines for the static approach. For extensive parameter optimisation with respect to features and model topology, 2.9k non-verbal vocalisations are extracted from the spontaneous Audio-Visual Interest Corpus. 80.7% accuracy can be reported with, and 92.6% without, a garbage model for the discrimination of the named classes.
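
To make the static-modelling strategy concrete, the sketch below maps a frame-level Low-Level-Descriptor contour onto a fixed-length vector via statistical functionals, de-correlates the result, and classifies it with an SVM. This is a minimal illustration, not the paper's exact configuration: the LLD dimensionality, the functional set, the use of PCA as the de-correlation step, and the synthetic data are all assumptions.

    # Minimal sketch of static modelling: statistical functionals over
    # LLD contours, de-correlation, then SVM classification.
    # LLD set, functionals, and data below are illustrative assumptions.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA  # stands in for de-correlation

    def functionals(lld: np.ndarray) -> np.ndarray:
        """Map an (n_frames, n_llds) contour to one static feature vector."""
        feats = [
            lld.mean(axis=0), lld.std(axis=0),
            lld.min(axis=0), lld.max(axis=0),
            np.percentile(lld, 25, axis=0), np.percentile(lld, 75, axis=0),
            np.mean(np.diff(lld, axis=0), axis=0),  # mean of delta contour
        ]
        return np.concatenate(feats)

    # Hypothetical segments: one LLD matrix per vocalisation, labels in
    # {laughter, breathing, hesitation, consent, garbage} coded as 0..4.
    rng = np.random.default_rng(0)
    segments = [rng.normal(size=(rng.integers(50, 200), 39)) for _ in range(40)]
    labels = rng.integers(0, 5, size=len(segments))

    X = np.vstack([functionals(s) for s in segments])
    clf = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC())
    clf.fit(X, labels)
    print(clf.predict(X[:5]))

Variable-length segments thus become fixed-length vectors, which is what allows a static classifier such as the SVM to compete with the sequence models (HMMs, CRFs) used in the dynamic approach.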
