Paralinguistic event detection from speech using probabilistic time-series smoothing and masking

Non-verbal speech cues serve multiple functions in human interaction, such as maintaining conversational flow and expressing emotions, personality, and interpersonal attitude. In particular, non-verbal vocalizations such as laughter are associated with affective expression, while vocal fillers are used to hold the floor during a conversation. The Interspeech 2013 Social Signals Sub-Challenge involves detecting these two types of non-verbal signals in telephone speech dialogs. We extend the challenge baseline system with filtering and masking techniques applied to probabilistic time series representing the occurrence of a vocal event. We obtain an improved area under the receiver operating characteristic (ROC) curve of 93.3% (a 10.4% absolute improvement) for laughter and 89.7% (a 6.1% absolute improvement) for fillers on the test set. This improvement suggests the importance of temporal context for detecting these paralinguistic events.
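The smoothing-and-masking idea can be sketched as follows: smooth the frame-level posterior sequence for an event class, then suppress above-threshold runs that are too short to be a plausible event. The filter type, window length, threshold, and minimum-duration values below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def smooth_posteriors(posteriors, window=5):
    """Moving-average filter over a frame-level posterior time series.
    (Illustrative smoother; the paper's actual filter parameters may differ.)"""
    kernel = np.ones(window) / window
    return np.convolve(posteriors, kernel, mode="same")

def mask_short_events(posteriors, threshold=0.5, min_frames=3):
    """Zero out above-threshold runs shorter than min_frames, enforcing a
    minimum event duration (an assumed form of the masking step)."""
    active = posteriors >= threshold
    masked = posteriors.copy()
    start = None
    # Append a sentinel False so a run ending at the last frame is closed.
    for i, a in enumerate(np.append(active, False)):
        if a and start is None:
            start = i          # run begins
        elif not a and start is not None:
            if i - start < min_frames:
                masked[start:i] = 0.0  # run too short: mask it out
            start = None
    return masked
```

Both steps exploit temporal context: smoothing removes isolated spikes in the posterior trajectory, and masking discards detections shorter than a realistic laughter or filler duration.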
