Towards Computer Understanding of Human Interactions

People meet in order to interact – disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which the relevant information content of a meeting is identified by combining a variety of audio and visual sensor inputs with statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues that arise in the meeting context, and the particular algorithms that we have adopted. We also comment on current developments and future challenges in automatic meeting analysis.
