Audiovisual recognition of spontaneous interest within conversations

In this work we present an audiovisual approach to the recognition of spontaneous interest in human conversations. To obtain a robust estimate, information from four sources is combined by a synergistic fusion that is tolerant to the failure of individual streams. First, speech is analyzed with respect to acoustic properties using a high-dimensional prosodic, articulatory, and voice-quality feature space, and with respect to linguistic content via large-vocabulary continuous speech recognition (LVCSR) and bag-of-words vector-space modeling, including non-verbal vocalizations. Second, visual analysis captures facial-expression patterns via Active Appearance Models (AAMs) and movement activity via eye tracking. Experiments are based on a database of 10.5 h of spontaneous human-to-human conversation comprising 20 subjects, balanced in gender and age class. Recordings were made with a room microphone, a camera, and close-talk headsets to cover diverse comfort and noise conditions. Three levels of interest were annotated within a rich transcription. We describe each information stream and their fusion on an early level in detail. Our experiments aim at a person-independent system for real-life usage and show the high potential of such a multimodal approach. Benchmark results based on manual transcription versus fully automatic processing are also provided.
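
The abstract names feature-level (early) fusion of the acoustic, linguistic, and visual streams. The following is a minimal sketch of that idea, assuming pre-extracted per-segment feature vectors; all feature names, dimensionalities, and the SVM classifier are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of feature-level ("early") fusion across modalities.
# Dimensionalities, the synthetic data, and the SVM back-end are assumptions
# for illustration only, not the paper's setup.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_segments = 200

# Hypothetical pre-extracted per-segment features for each information stream.
acoustic = rng.normal(size=(n_segments, 60))     # prosodic / articulatory / voice-quality
linguistic = rng.random(size=(n_segments, 300))  # bag-of-words term weights from LVCSR output
visual = rng.normal(size=(n_segments, 40))       # AAM facial expression + eye-tracking activity
labels = rng.integers(0, 3, size=n_segments)     # three annotated levels of interest

# Early fusion: concatenate the streams into one joint feature vector per segment,
# then train a single classifier on the fused representation.
fused = np.hstack([acoustic, linguistic, visual])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(fused[:150], labels[:150])
print("held-out accuracy:", clf.score(fused[150:], labels[150:]))
```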