Paralinguistic event detection from speech using probabilistic time-series smoothing and masking

Non-verbal speech cues serve multiple functions in human interaction, such as maintaining conversational flow and expressing emotions, personality, and interpersonal attitude. In particular, non-verbal vocalizations such as laughter are associated with affective expression, while vocal fillers are used to hold the floor during a conversation. The Interspeech 2013 Social Signals Sub-Challenge involves detecting these two types of non-verbal signals in telephone speech dialogs. We extend the challenge baseline system with filtering and masking techniques applied to probabilistic time series representing the occurrence of a vocal event. We obtain an improved area under the receiver operating characteristic (ROC) curve of 93.3% (a 10.4% absolute improvement) for laughter and 89.7% (a 6.1% absolute improvement) for fillers on the test set. This improvement suggests the importance of temporal context for detecting these paralinguistic events.
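The smoothing-and-masking idea can be sketched as follows: smooth the frame-level posterior sequence for an event class, then suppress above-threshold runs that are too short to be a plausible event. The filter type, window length, threshold, and minimum-duration values below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def smooth_posteriors(posteriors, window=5):
    """Moving-average filter over a frame-level posterior time series.
    (Illustrative smoother; the paper's actual filter parameters may differ.)"""
    kernel = np.ones(window) / window
    return np.convolve(posteriors, kernel, mode="same")

def mask_short_events(posteriors, threshold=0.5, min_frames=3):
    """Zero out above-threshold runs shorter than min_frames, enforcing a
    minimum event duration (an assumed form of the masking step)."""
    active = posteriors >= threshold
    masked = posteriors.copy()
    start = None
    # Append a sentinel False so a run ending at the last frame is closed.
    for i, a in enumerate(np.append(active, False)):
        if a and start is None:
            start = i          # run begins
        elif not a and start is not None:
            if i - start < min_frames:
                masked[start:i] = 0.0  # run too short: mask it out
            start = None
    return masked
```

Both steps exploit temporal context: smoothing removes isolated spikes in the posterior trajectory, and masking discards detections shorter than a realistic laughter or filler duration.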
