Automatic Meeting Segmentation Using

Multiparty meetings are a ubiquitous feature of organizations, and there are considerable economic benefits that would arise from their automatic analysis and structuring. In this paper, we are concerned with the segmentation and structuring of meetings (recorded using multiple cameras and microphones) into sequences of group meeting actions such as monologue, discussion and presentation. We outline four families of multimodal features, based on speaker turns, lexical transcription, prosody, and visual motion, that are extracted from the raw audio and video recordings. We relate these low-level features to more complex group behaviors using a multistream modelling framework based on multistream dynamic Bayesian networks (DBNs). This yields an effective approach to the segmentation problem, with an action error rate of 12.2%, compared with 43% for an approach based on hidden Markov models. Moreover, the multistream DBN developed here leaves scope for many further improvements and extensions.
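The abstract does not define the action error rate. As a minimal sketch, assuming it is computed analogously to word error rate in speech recognition (substitutions, deletions, and insertions divided by the number of reference actions), it could be evaluated as below. The function name `action_error_rate` and the example label sequences are hypothetical and are not taken from the paper.

```python
def action_error_rate(reference, hypothesis):
    """Edit-distance error rate between two sequences of action labels.

    Assumption: the metric mirrors word error rate, i.e.
    (substitutions + deletions + insertions) / len(reference).
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one action")

    # dist[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all i reference actions
    for j in range(m + 1):
        dist[0][j] = j  # insert all j hypothesised actions

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 1 if reference[i - 1] != hypothesis[j - 1] else 0
            dist[i][j] = min(
                dist[i - 1][j - 1] + mismatch,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )
    return dist[n][m] / n


# Hypothetical example: one reference action is missed by the recogniser.
reference = ["monologue", "discussion", "presentation", "discussion"]
hypothesis = ["monologue", "discussion", "discussion"]
rate = action_error_rate(reference, hypothesis)
```

Here `rate` is 0.25: one deletion against four reference actions. Under this assumed definition only the order of the recognised labels is scored, not the exact position of the segment boundaries.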
