Towards Computer Understanding of Human Interactions

People meet in order to interact – disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which the relevant information content of a meeting is identified using a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and the particular algorithms that we have adopted. We also comment on current developments and future challenges in automatic meeting analysis.
