Audio-visual emotion recognition using Boltzmann Zippers

This paper presents a novel approach to automatic audio-visual emotion recognition. The audio and visual channels provide complementary information for recognizing human emotional states, and we use Boltzmann Zippers as a model-level fusion scheme to learn the intrinsic correlations between the two modalities. We extract effective audio and visual feature streams at different time scales and feed them to two Boltzmann chains, respectively, whose hidden units are interconnected. Second-order methods are applied to the Boltzmann Zipper to speed up learning and the pruning process. Experimental results on audio-visual emotion data collected in Wizard of Oz scenarios show that our approach is promising and outperforms both single-modality HMMs and conventional coupled HMM methods.
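To make the fusion idea concrete, here is a minimal sketch of the energy function of a "zipped" pair of Boltzmann chains: each modality has its own visible-to-hidden weights, and a cross-chain coupling matrix connects the two hidden layers. All dimensions, weight names, and the use of random binary states are hypothetical illustrations, not the paper's actual model or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: audio/video visible units and hidden units per chain.
n_audio, n_video, n_h = 6, 8, 4

W_a = rng.normal(scale=0.1, size=(n_audio, n_h))  # audio visible -> audio hidden
W_v = rng.normal(scale=0.1, size=(n_video, n_h))  # video visible -> video hidden
C = rng.normal(scale=0.1, size=(n_h, n_h))        # cross-chain hidden-hidden coupling

def energy(va, vv, ha, hv):
    """Joint energy: one term per chain plus the cross-modal coupling term.

    Lower energy means the audio hidden state ha and video hidden state hv
    form a more compatible (correlated) joint configuration.
    """
    return -(va @ W_a @ ha + vv @ W_v @ hv + ha @ C @ hv)

# Random binary states, standing in for one time step of each feature stream.
va = rng.integers(0, 2, n_audio).astype(float)
vv = rng.integers(0, 2, n_video).astype(float)
ha = rng.integers(0, 2, n_h).astype(float)
hv = rng.integers(0, 2, n_h).astype(float)

print(energy(va, vv, ha, hv))
```

The coupling term `ha @ C @ hv` is what distinguishes this from two independent chains: during learning, gradients flow through `C`, so each modality's hidden representation is shaped by the other's.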
