Training combination strategy of multi-stream fused hidden Markov model for audio-visual affect recognition

To simulate the human ability to assess affect, an automatic affect recognition system should make use of multi-sensor information. Within the framework of the multi-stream fused hidden Markov model (MFHMM), we present a training combination strategy for audio-visual affect recognition. Unlike the weighting combination scheme, our approach can use a variety of learning methods to obtain a robust multi-stream fusion result. We evaluate our approach on person-independent recognition of 11 affective states from 20 subjects. The experimental results suggest that MFHMM outperforms IHMM, which assumes independence among streams, and that the training combination strategy outperforms the weighting combination under both clean and varying audio channel noise conditions.
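
The contrast between the two fusion schemes can be illustrated with a minimal sketch. It assumes each stream's model has already produced one log-likelihood per affective state; the function names, the toy scores, and the nearest-centroid learner are illustrative assumptions, not the method used in the paper, which states only that a variety of learning methods can be applied.

```python
from typing import Dict, List, Sequence

# scores[stream][cls]: log-likelihood of class `cls` under that stream's model
# (hypothetical layout chosen for this sketch).
Scores = List[List[float]]


def weighted_combination(scores: Scores, weights: Sequence[float]) -> int:
    """Weighting scheme: pick the class with the largest weighted score sum."""
    n_classes = len(scores[0])
    totals = [
        sum(w * stream[c] for w, stream in zip(weights, scores))
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=totals.__getitem__)


def flatten(scores: Scores) -> List[float]:
    """Concatenate all per-stream scores into one feature vector."""
    return [value for stream in scores for value in stream]


def fit_centroids(examples: List[Scores], labels: List[int]) -> Dict[int, List[float]]:
    """Training scheme, stand-in learner: one mean score vector per class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for scores, label in zip(examples, labels):
        vec = flatten(scores)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def trained_combination(scores: Scores, centroids: Dict[int, List[float]]) -> int:
    """Classify by the nearest learned centroid in score space."""
    vec = flatten(scores)

    def sq_dist(label: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=sq_dist)


# Toy data: two streams (audio, visual), two classes.
train_x = [
    [[-1.0, -5.0], [-2.0, -4.0]],
    [[-1.5, -5.0], [-2.0, -4.5]],
    [[-5.0, -1.0], [-4.0, -2.0]],
    [[-5.0, -1.5], [-4.5, -2.0]],
]
train_y = [0, 0, 1, 1]
centroids = fit_centroids(train_x, train_y)
```

In the weighting scheme the only free parameters are the stream weights, whereas the trained combiner learns its decision rule from the component scores, so any classifier could replace the nearest-centroid stand-in.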
