Paper Information - An Iterative Decoding Algorithm for Fusion of Multimodal Information

Human activity analysis in an intelligent space typically relies on multimodal cues. Using multiple modalities offers many advantages, but fusing information from different sources is a problem that must be addressed. In this paper, we propose an iterative algorithm for fusing information from multimodal sources. The algorithm is inspired by the theory of turbo codes: we draw an analogy between the redundant parity bits of a turbo code's constituent codes and the information from different sensors in a multimodal system. A hidden Markov model (HMM) models the observation sequence of each modality. The decoded state likelihoods from one modality serve as additional information when decoding the states of the other modalities, and the procedure repeats until a convergence criterion is met. The resulting iterative algorithm is shown to achieve lower error rates than the individual models alone. The algorithm is then applied to a real-world problem: speech segmentation using audio and visual cues.
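The exchange described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation; the abstract does not say how the decoded likelihoods enter the other decoder, so several choices here are assumptions. There are two modalities that share one hidden state space (0 = silence, 1 = speech). The incoming information is multiplied into the per-frame observation likelihoods. Each decoder passes on only "extrinsic" information (its posterior divided by what it received), a convention borrowed from turbo decoding. All function names, parameter values, and the toy likelihoods are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Hmm = Tuple[List[float], Matrix]  # (initial distribution, transition matrix)
EPS = 1e-12  # floor that guards the division in extrinsic()


def normalize(v: List[float]) -> List[float]:
    s = sum(v)
    return [x / s for x in v]


def forward_backward(pi: List[float], trans: Matrix, like: Matrix) -> Matrix:
    """Per-frame state posteriors of one HMM (scaled forward-backward)."""
    T, K = len(like), len(pi)
    alpha = [normalize([pi[k] * like[0][k] for k in range(K)])]
    for t in range(1, T):
        alpha.append(normalize([
            like[t][k] * sum(alpha[t - 1][j] * trans[j][k] for j in range(K))
            for k in range(K)]))
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = normalize([
            sum(trans[k][j] * like[t + 1][j] * beta[t + 1][j] for j in range(K))
            for k in range(K)])
    return [normalize([a * b for a, b in zip(alpha[t], beta[t])]) for t in range(T)]


def decode(hmm: Hmm, like: Matrix, extra: Matrix) -> Matrix:
    """Decode one modality, treating `extra` as additional per-frame evidence."""
    pi, trans = hmm
    boosted = [[l * e for l, e in zip(lt, et)] for lt, et in zip(like, extra)]
    return forward_backward(pi, trans, boosted)


def extrinsic(post: Matrix, extra: Matrix) -> Matrix:
    """Divide out what the decoder was given, so it is not echoed back to its sender."""
    return [normalize([p / max(e, EPS) for p, e in zip(pt, et)])
            for pt, et in zip(post, extra)]


def iterative_fusion(hmm_a: Hmm, hmm_b: Hmm, like_a: Matrix, like_b: Matrix,
                     max_iter: int = 50, tol: float = 1e-6) -> Tuple[Matrix, int]:
    """Alternate between the two decoders until the posteriors stop changing."""
    T, K = len(like_a), len(like_a[0])
    ext_b = [[1.0 / K] * K for _ in range(T)]  # uninformative starting message
    prev = None
    for n_iter in range(1, max_iter + 1):
        post_a = decode(hmm_a, like_a, ext_b)
        ext_a = extrinsic(post_a, ext_b)
        post_b = decode(hmm_b, like_b, ext_a)
        ext_b = extrinsic(post_b, ext_a)
        # Assumed convergence criterion: largest change in any posterior entry.
        if prev is not None and max(
                abs(x - y) for r, s in zip(post_b, prev) for x, y in zip(r, s)) < tol:
            break
        prev = post_b
    return post_b, n_iter


def labels(post: Matrix) -> List[int]:
    """Most probable state per frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in post]


# Toy data (hypothetical): the true segmentation is 4 silence frames, then 4 speech frames.
hmm = ([0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]])
audio = [[0.9, 0.1]] * 3 + [[0.4, 0.6]] + [[0.1, 0.9]] * 4  # frame 3 is misleading
video = [[0.8, 0.2]] * 4 + [[0.6, 0.4]] + [[0.2, 0.8]] * 3  # frame 4 is misleading
flat = [[0.5, 0.5]] * 8

audio_only = labels(decode(hmm, audio, flat))
video_only = labels(decode(hmm, video, flat))
fused_post, n_iter = iterative_fusion(hmm, hmm, audio, video)
fused = labels(fused_post)
print(audio_only, video_only, fused, n_iter)
```

On this toy input, each single-modality decoder misplaces the silence-to-speech boundary by one frame, while the fused labels recover the intended segmentation `[0, 0, 0, 0, 1, 1, 1, 1]`. Passing extrinsic rather than full posteriors keeps each decoder from counting its own evidence twice, but the loop remains an approximation and its posteriors may be overconfident.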

Mohan M. Trivedi | Bhaskar D. Rao | Shankar T. Shivappa
