Optimized Time Series Filters for Detecting Laughter and Filler Events

Social signal detection, that is, the task of identifying vocalizations such as laughter and filler events, is a popular task within computational paralinguistics. Recent studies have shown that, besides applying state-of-the-art machine learning methods, it is worth exploiting contextual information by adjusting each frame-level score based on its local neighbourhood. In this study we apply a weighted-average time series smoothing filter to laughter and filler event identification, and set the filter weights using a state-of-the-art optimization method, namely the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Our results indicate that this is a viable way of improving the Area Under the Curve (AUC) scores: the resulting scores are much better than those of the raw likelihoods produced by Deep Neural Networks trained on three different feature sets, and we also significantly outperform standard time series filters as well as DNNs used for smoothing. Our score on the test set of a public English database of spontaneous mobile phone conversations is the highest published so far among those achieved with feed-forward techniques.
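As a rough illustration of the smoothing step described above, the following Python sketch applies a weighted-average filter to a sequence of frame-level scores and measures the frame-level AUC before and after. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `smooth` and `auc`, the edge handling (repeating the first and last frame), the toy scores, and the hand-picked weights are all hypothetical. In the study the weights are set by CMA-ES; that optimization is not shown here.

```python
def smooth(scores, weights):
    """Replace each frame score by a weighted mean of itself and its neighbours.

    `weights` must have odd length and a positive sum (assumption of this
    sketch); the centre weight applies to the current frame.
    """
    if len(weights) % 2 != 1:
        raise ValueError("weights must have odd length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    half = len(weights) // 2
    n = len(scores)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(weights):
            # Assumed edge handling: clamp the index, repeating the first/last frame.
            idx = min(max(t + k - half, 0), n - 1)
            acc += w * scores[idx]
        out.append(acc / total)
    return out


def auc(scores, labels):
    """AUC as the probability that a positive frame outscores a negative one.

    Ties count as half a win. Quadratic in the number of frames, which is
    fine for a toy example but not for real data.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 marks frames inside an event, 0 marks other frames.
labels = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
raw = [0.1, 0.2, 0.7, 0.6, 0.9, 0.3, 0.8, 0.2, 0.1, 0.2]

# Hand-picked symmetric weights; an optimizer would search over these values.
weights = [1.0, 2.0, 1.0]
smoothed = smooth(raw, weights)

print("AUC raw:     ", auc(raw, labels))
print("AUC smoothed:", auc(smoothed, labels))
```

In such a setup, an optimizer would treat the weight vector as its search variable and the AUC on a development set as the objective to maximize. The filter width and the choice of objective are design decisions that this sketch leaves open.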
