Spotting laughter in natural multiparty conversations: A comparison of automatic online and offline approaches using audiovisual data

To advance human-centered multimodal interfaces, a system must be able to infer the user's current state or the state of the communication. This requires recognizing and interpreting multimodal social signals (i.e., paralinguistic and nonverbal behavior) in real time. Because we believe that laughter is one of the most important and widely understood nonverbal social signals indicating affect and discourse quality, this work focuses on detecting laughter in natural multiparty discourse. The conversations were recorded with unobtrusive devices in a natural environment, without any specific constraints on the discourse. This setup encourages natural and unbiased behavior, which is one of the main foci of this work. Three methods are compared in online and offline detection experiments: Gaussian Mixture Model (GMM) supervectors as input to a Support Vector Machine (SVM), Echo State Networks (ESN), and a Hidden Markov Model (HMM) approach. The SVM approach proves very accurate in the offline classification task, but is outperformed by the ESN and HMM approaches in online detection (F1 scores: GMM-SVM 0.45, ESN 0.63, HMM 0.72). Further, the proposed HMM approach was applied in a cross-corpus experiment without any retraining and showed respectable generalization capability (F1 score: 0.49). The article presents these results and discusses possible reasons for them. The proposed methods may be used directly in practical tasks such as labeling or online detection of laughter in conversational data, and in affect-aware applications.
