Robust Multi-Modal Group Action Recognition in Meetings from Disturbed Videos with the Asynchronous Hidden Markov Model

The asynchronous hidden Markov model (AHMM) models the joint likelihood of two observation sequences even when the streams are not synchronised. We explain this concept and show how the model is trained with the EM algorithm. We then apply the AHMM to the analysis of group action events in meetings, using both clean and disturbed data. On clean data, the AHMM outperforms an early-fusion HMM by 5.7% absolute recognition rate (a relative error reduction of 38.5%). On occluded data, the improvement is 6.5% recognition rate on average (a relative error reduction of 40%). Asynchrony is thus a dominant factor in meeting analysis, even when the data is disturbed; the AHMM exploits it and is therefore considerably more robust against disturbances.
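To make the idea concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the AHMM forward recursion for two discrete observation streams. It assumes, as in Bengio's formulation, that the model emits one symbol of the first stream at every step and, with a per-state probability `eps`, additionally consumes one symbol of the second stream, so the two streams need not be aligned step-for-step. All variable names and the discrete-emission setup are illustrative assumptions.

```python
import numpy as np

def ahmm_forward(x, y, pi, A, Bx, By, eps):
    """Forward pass of an asynchronous HMM over two discrete streams.

    x   : first (longer) stream of symbol indices, length T
    y   : second stream of symbol indices, length S <= T
    pi  : initial state distribution, shape (N,)
    A   : state transition matrix, shape (N, N)
    Bx  : emission probabilities for stream x, shape (N, Vx)
    By  : emission probabilities for stream y, shape (N, Vy)
    eps : per-state probability of also emitting a y symbol, shape (N,)

    Returns log p(x, y), summed over all alignments of y to x.
    """
    T, S, N = len(x), len(y), len(pi)
    # alpha[t, s, i] = p(x[:t+1], y[:s], state i at time t);
    # s counts how many y symbols have been consumed so far.
    alpha = np.zeros((T, S + 1, N))
    alpha[0, 0] = pi * Bx[:, x[0]] * (1 - eps)
    alpha[0, 1] = pi * Bx[:, x[0]] * eps * By[:, y[0]]
    for t in range(1, T):
        for s in range(0, min(S, t + 1) + 1):
            # transition without consuming a y symbol
            stay = (alpha[t - 1, s] @ A) * Bx[:, x[t]] * (1 - eps)
            adv = 0.0
            if s > 0:
                # transition while consuming y[s-1]
                adv = (alpha[t - 1, s - 1] @ A) * Bx[:, x[t]] * eps * By[:, y[s - 1]]
            alpha[t, s] = stay + adv
    # all of y must have been consumed by the final step
    return np.log(alpha[T - 1, S].sum())
```

Setting `eps = 1` with equal-length streams collapses the model to a synchronous HMM with joint emissions, which is a convenient sanity check; EM training would add the corresponding backward pass and re-estimation formulas on top of this recursion.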
