Modeling Individual and Group Actions in Meetings: A Two-Layer HMM Framework

We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring meetings in semantic terms. The patterns we investigate are inherently group-based (defined by the individual activities of meeting participants and their interplay) and multimodal (as captured by cameras and microphones). Given a suitable set of individual actions, group actions can be modeled as a two-layer process: one layer models basic individual activities from low-level audio-visual features, and the other models the interactions among participants. We propose a two-layer Hidden Markov Model (HMM) framework that implements this concept in a principled manner and offers three advantages over previous work. First, because the problem is decomposed hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, the framework is easier to interpret, and thus easier to improve, because both individual and group actions have a clear meaning. Third, a different type of HMM can be used in each layer to better reflect the nature of each subproblem. The framework is general and extensible; we illustrate it with a set of eight group actions on a public five-hour meeting corpus. Experiments and a comparison with a single-layer HMM baseline system demonstrate its validity.
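
To make the two-layer idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes discrete (quantized) observation symbols, hard Viterbi decisions passed from the first layer to the second, and hand-picked toy parameters; the names `viterbi` and `two_layer_decode`, the two participants, and the "silent/speaking" and "monologue/discussion" labels are all hypothetical choices made for illustration.

```python
import numpy as np


def viterbi(log_pi, log_A, log_B, obs):
    """Most likely state path of a discrete-observation HMM (log domain)."""
    n_states = log_pi.shape[0]
    T = len(obs)
    delta = np.empty((T, n_states))             # best log-score ending in each state
    back = np.zeros((T, n_states), dtype=int)   # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: move from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):              # backtrack from the last frame
        path[t] = back[t + 1, path[t + 1]]
    return path


def two_layer_decode(individual_hmm, group_hmm, features_per_person, n_actions):
    """Layer 1 decodes each participant; layer 2 decodes the group action."""
    # Layer 1: one low-dimensional decoding problem per participant.
    individual_paths = [viterbi(*individual_hmm, obs) for obs in features_per_person]
    # Combine the individual actions at each frame into one symbol
    # (mixed-radix encoding; an assumption of this sketch).
    group_obs = np.zeros(len(individual_paths[0]), dtype=int)
    for path in individual_paths:
        group_obs = group_obs * n_actions + path
    # Layer 2: the group HMM only sees individual actions, not raw features.
    return individual_paths, viterbi(*group_hmm, group_obs)


# Toy parameters (assumed). Individual actions: 0 = silent, 1 = speaking.
# Feature symbols: 0 = low audio energy, 1 = high audio energy.
sticky = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
uniform = np.log(np.array([0.5, 0.5]))
individual_hmm = (uniform, sticky, np.log(np.array([[0.9, 0.1], [0.1, 0.9]])))

# Group actions: 0 = monologue, 1 = discussion.
# Group symbols: 0 = nobody speaks, 1 or 2 = one speaker, 3 = both speak.
group_emissions = np.log(np.array([[0.1, 0.4, 0.4, 0.1],
                                   [0.1, 0.1, 0.1, 0.7]]))
group_hmm = (uniform, sticky, group_emissions)

features_per_person = [
    np.array([1, 1, 1, 1, 1, 1]),  # participant 0: high energy throughout
    np.array([0, 0, 0, 1, 1, 1]),  # participant 1: joins halfway through
]
individual_paths, group_path = two_layer_decode(
    individual_hmm, group_hmm, features_per_person, n_actions=2
)
print("individual actions:", [p.tolist() for p in individual_paths])
print("group actions:", group_path.tolist())
```

Each first-layer model here works on a single participant's features, and the second layer works only on the small alphabet of combined individual actions, which reflects the low-dimensional observation spaces mentioned above. A fuller system could pass soft state posteriors rather than hard labels between layers, and could use a different HMM variant in each layer.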
