Modeling focus of attention for meeting indexing based on multiple cues

A user's focus of attention plays an important role in human-computer interaction applications, such as ubiquitous computing environments and intelligent spaces, where the user's goals and intent must be continuously monitored. We are interested in modeling people's focus of attention in a meeting situation and propose to model participants' focus of attention from multiple cues. We have developed a system that estimates participants' focus of attention from gaze directions and sound sources. An omnidirectional camera simultaneously tracks participants' faces around a meeting table, and neural networks estimate their head poses; in addition, microphones detect who is speaking. The system predicts participants' focus of attention from the acoustic and visual information separately, and then combines the outputs of the audio- and video-based focus of attention predictors. We evaluated the system on data from three recorded meetings. Adding the acoustic information provided an 8% relative error reduction on average compared to using a single modality. The focus of attention model can serve as an index into a multimedia meeting record, and it can also be used for analyzing a meeting.
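The combination step described above can be sketched as a linear opinion pool over the two modality-specific predictors. This is a minimal illustration, not the paper's actual fusion rule: the per-target posteriors, the weight value, and the function name are all hypothetical.

```python
import numpy as np

def combine_focus_predictions(p_video, p_audio, w_video=0.6):
    """Fuse per-target focus-of-attention posteriors from a video-based
    (head-pose) predictor and an audio-based (speaker-detection) predictor.

    A simple weighted linear combination; the weight 0.6 is an assumed
    illustrative value, not a figure from the paper.
    """
    p_video = np.asarray(p_video, dtype=float)
    p_audio = np.asarray(p_audio, dtype=float)
    combined = w_video * p_video + (1.0 - w_video) * p_audio
    # Renormalize so the result is again a probability distribution.
    return combined / combined.sum()

# Example: posteriors over three possible focus targets around the table.
p_v = [0.5, 0.3, 0.2]   # from the head-pose neural network (hypothetical)
p_a = [0.2, 0.7, 0.1]   # from sound-source detection (hypothetical)
fused = combine_focus_predictions(p_v, p_a)
print(fused)  # → [0.38 0.46 0.16]
```

In this sketch the audio cue overrides the weaker visual evidence and shifts the most likely focus target from the first to the second participant, which mirrors how the fused model can outperform either modality alone.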
