Laughter detection based on the fusion of local binary patterns, spectral and prosodic features

Today, great focus has been placed on context-aware human-machine interaction, where systems are aware not only of the surrounding environment, but also of the mental/affective state of the user. Such knowledge can allow the interaction to become more human-like. To this end, automatic discrimination between laughter and speech has emerged as an interesting, yet challenging problem. Typically, audio- or video-based methods have been proposed in the literature; humans, however, are known to integrate both sensory modalities during conversation and/or interaction. As such, this paper explores the fusion of support vector machine classifiers trained on local binary pattern (LBP) video features, as well as speech spectral and prosodic features, as a way of improving laughter detection performance. Experimental results on the publicly available MAHNOB Laughter database show that the proposed audio-visual fusion scheme can achieve a laughter detection accuracy of 93.3%, outperforming systems trained on audio or visual features alone.
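The two building blocks named in the abstract, LBP texture features on video frames and score-level fusion of per-modality classifiers, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the paper's implementation: `lbp_histogram` computes only the basic radius-1, 8-neighbour LBP code map, and `fuse_scores` uses an illustrative 0.5 weight rather than any value reported in the paper. The real audio front-end (spectral and prosodic features) and the SVM classifiers themselves are omitted; their outputs are stood in for by scalar scores.

```python
# Hypothetical sketch of the LBP + score-level fusion idea; pure NumPy,
# no real audio/video pipeline or SVM training.
import numpy as np

def lbp_histogram(gray, bins=256):
    """Basic 8-neighbour LBP (radius 1) over a grayscale image,
    returned as a normalized histogram of the 256 possible codes."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]  # centre pixels (border excluded)
    # 8 neighbours, ordered clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit  # thresholded bit pattern
    hist = np.bincount(code.ravel(), minlength=bins).astype(float)
    return hist / hist.sum()

def fuse_scores(audio_score, video_score, w_audio=0.5):
    """Weighted sum of per-modality classifier scores in [0, 1];
    the equal weighting is illustrative, not taken from the paper."""
    return w_audio * audio_score + (1.0 - w_audio) * video_score

# Toy usage: a flat frame and a textured frame yield different histograms.
rng = np.random.default_rng(0)
flat = np.full((32, 32), 128, dtype=np.uint8)
noisy = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
h_flat, h_noisy = lbp_histogram(flat), lbp_histogram(noisy)
print(h_flat.argmax())        # flat image: every neighbour >= centre -> code 255
print(fuse_scores(0.9, 0.6))  # fused laughter score -> 0.75
```

In practice the histograms would be fed to one SVM per modality, and `fuse_scores` would combine their decision values, mirroring the audio-visual fusion scheme the abstract describes.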
