Audiovisual classification of vocal outbursts in human conversation using Long-Short-Term Memory networks

We investigate the classification of non-linguistic vocalisations with a novel audiovisual approach, using Long Short-Term Memory (LSTM) Recurrent Neural Networks as highly successful dynamic sequence classifiers. As evaluation database serves this year's Paralinguistic Challenge's Audiovisual Interest Corpus of natural human-to-human conversation. For video-based analysis we compare shape-based and appearance-based features, which are fused at the feature level (early fusion) with typical audio descriptors. The results show significant improvements of LSTM networks over a static approach based on Support Vector Machines. More importantly, we show a significant gain in performance when fusing audio and visual shape features.
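The early fusion mentioned above can be illustrated with a minimal sketch: per-frame audio descriptors and visual shape features are concatenated frame by frame into one joint feature vector per time step, which then forms the input sequence for the LSTM. The feature dimensionalities and names below are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical per-frame features: audio descriptors and visual shape
# features (e.g. point-distribution-model coordinates), time-aligned.
T = 50                            # number of synchronised frames
audio = np.random.randn(T, 39)    # assumed 39 audio descriptors per frame
shape = np.random.randn(T, 40)    # assumed 40 shape features per frame

# Early (feature-level) fusion: concatenate the two streams per time step,
# yielding one joint feature vector per frame as LSTM input.
fused = np.concatenate([audio, shape], axis=1)
print(fused.shape)                # (50, 79)
```

In contrast, decision-level (late) fusion would train separate classifiers per modality and combine their outputs; early fusion lets the sequence classifier model cross-modal correlations directly.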
