Workshop Programme "Multimodal Corpora: From Multimodal Behaviour Theories to Usable Models"

At ATR, we are collecting and analysing 'meetings' data using a table-top sensor device consisting of a small 360-degree camera surrounded by an array of high-quality directional microphones. This equipment provides a stream of information about the audio and visual events of the meeting, which is then processed to form a representation of the verbal and non-verbal interpersonal activity, or discourse flow, during the meeting. In this paper we show that simple primitives can provide a rich source of information.

INTRODUCTION

Several laboratories around the world are now collecting and analysing "meetings data" in an effort to automate some of the transcription, search, and information-retrieval processes that are currently very time-consuming, and to produce a technology capable of tracking a meeting in real time and recording and annotating its main events. One key area of this research is devoted to identifying and tracking the active participants in a meeting in order to maximise efficiency in data collection by processing inactive or non-participating members differently [1, 2, 3, 4, 5, 6, 7, 8].

At ATR we are now completing the second year of a three-year SCOPE-funded project to collect and analyse such data. This paper reports an analysis of material collected from one such meeting in terms of speaker overlaps and conflicting speech turns. Our goal is to determine whether it is necessary to track multiple participants, or whether processing can be constrained by identifying the dominant member(s) alone. The results show that in a clear majority of cases only one speaker is active at any time, and that overlapping turns, when two or more participants are actively engaged in speaking at the same time, amount to less than 15% of the meeting (a sketch of such an overlap computation is given below). This encourages us to pursue future research by focussing our resources on identifying the single main speaker at any given time, rather than attempting to monitor all of the speech activity throughout the meeting.

The second part of the paper shows that a change of speaker might be predicted from the amount and types of body movement. These movements are speaker-specific rather than uniform, but they systematically increase in the time immediately before the onset of speech. By observing the bodily movements of the participants, we can form an estimate of who is going to speak next and prepare to focus our attention (i.e., the recording devices) accordingly (see the second sketch below).

Figure 1. The camera's-eye view of a meeting (top), showing the annotated movement data for three participants (D, I, L) using the WaveSurfer video plugin (bottom).

CATEGORIES OF SPEECH ACTIVITY

We have regularly been recording our monthly project meetings, where research results and project planning are discussed, to provide a database of natural (non-acted/no role-playing) speech and interaction information. The number of members attending each monthly project meeting can vary between four and twelve. Participation is voluntary, but since the research is being carried out by three teams at different locations (ATR, NAIST, and
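To make the overlap figures reported above concrete, the following minimal sketch shows how the proportion of meeting time with zero, one, or two-or-more simultaneous speakers can be computed from per-participant turn annotations. This is an illustration only, not the analysis pipeline used in the study; the speaker labels and intervals are invented, and real input would come from the meeting's annotation files.

```python
# A minimal, hypothetical sketch of an overlap computation from turn
# annotations.  Speaker labels and intervals are invented for illustration;
# they are not data from the meeting described in the paper.

from collections import defaultdict

# speaker -> list of (start, end) speaking intervals, in seconds
turns = {
    "D": [(0.0, 12.5), (30.0, 41.2)],
    "I": [(11.8, 29.0), (55.0, 70.0)],
    "L": [(40.5, 56.3)],
}

def overlap_profile(turns, meeting_end):
    """Return seconds of meeting time with 0, 1, and 2+ simultaneous speakers."""
    # Turn every interval into +1/-1 events and sweep through them in time order.
    events = []
    for spans in turns.values():
        for start, end in spans:
            events.append((start, +1))
            events.append((end, -1))
    # Ends sort before starts at the same instant, so a clean hand-over
    # between speakers is not counted as overlap.
    events.sort()

    totals = defaultdict(float)  # active-speaker count (capped at 2) -> duration
    active, prev_t = 0, 0.0
    for t, delta in events:
        totals[min(active, 2)] += t - prev_t
        active += delta
        prev_t = t
    totals[min(active, 2)] += meeting_end - prev_t  # tail after the last event
    return totals

profile = overlap_profile(turns, meeting_end=80.0)
total = sum(profile.values())
for count, label in [(0, "silence"), (1, "single speaker"), (2, "overlap (2+)")]:
    print(f"{label:>15}: {100.0 * profile[count] / total:5.1f}% of the meeting")
```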
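The second observation, that body movement rises just before a participant begins to speak, suggests a simple next-speaker heuristic: compare each participant's recent movement against their own baseline and pick the largest rise. The sketch below is a hypothetical illustration of that idea; the window lengths, participant names, and movement scores are assumptions, not values or parameters from the paper.

```python
# A hypothetical next-speaker heuristic based on the observation that body
# movement increases just before speech onset.  Window lengths, participant
# names, and movement scores are illustrative assumptions only.

def predict_next_speaker(movement, baseline_window=30, onset_window=3):
    """movement: participant -> per-second movement scores, most recent last.
    Returns (participant, rise) for the largest rise above that person's own baseline."""
    best, best_rise = None, float("-inf")
    for person, series in movement.items():
        recent = series[-onset_window:]
        baseline = series[-(baseline_window + onset_window):-onset_window] or series
        rise = sum(recent) / len(recent) - sum(baseline) / len(baseline)
        if rise > best_rise:  # movement is speaker-specific, so compare rises, not raw levels
            best, best_rise = person, rise
    return best, best_rise

# Invented movement traces (arbitrary units): L moves sharply just before speaking.
movement = {
    "D": [0.2] * 30 + [0.3, 0.3, 0.4],
    "I": [0.5] * 30 + [0.5, 0.6, 0.5],
    "L": [0.1] * 30 + [0.6, 0.8, 0.9],
}
person, rise = predict_next_speaker(movement)
print(f"Likely next speaker: {person} (movement rise {rise:.2f})")
```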

[1] Craig Martell. FORM: An Extensible, Kinematically-based Gesture Annotation Scheme, 2002, LREC.

[2] Fabio Pianesi, et al. Annotation of Group Behaviour: a Proposal for a Coding Scheme, 2005.

[3] H. H. Clark, et al. Speaking while monitoring addressees for understanding, 2004.

[4] Jay Hall, et al. The Effects of a Normative Intervention on Group Decision-Making Performance, 1970.

[5] Marjorie Skubic, et al. Spatial language for human-robot dialogs, 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[6] Jean-Claude Martin, et al. Evaluation of Multimodal Behaviour of Embodied Agents, 2004, From Brows to Trust.

[7] Sharon L. Oviatt, et al. Taming recognition errors with a multimodal interface, 2000, CACM.

[8] J. Gabriel Amores, et al. Cooperation and Collaboration in Natural Command Language Dialogues, 2002.

[9] Costanza Navarretta, et al. The MUMIN multimodal coding scheme, 2005.

[10] Emiel Krahmer, et al. Pitch, eyebrows and the perception of focus, 2002, Speech Prosody 2002.

[11] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[12] Clare-Marie Karat, et al. The Beauty of Errors: Patterns of Error Correction in Desktop Speech Systems, 1999, INTERACT.

[13] Sven C. Martin, et al. Statistical Language Modeling Using Leaving-One-Out, 1997.

[14] James A. Landay, et al. SATIN: a toolkit for informal ink-based applications, 2000, UIST '00.

[15] Pilar Manchón Portillo. WOZ experiments in Multimodal Dialogue Systems, 2005.

[16] Jonas Beskow, et al. Data-driven synthesis of expressive visual speech using an MPEG-4 talking head, 2005, INTERSPEECH.

[17] David R. Traum, et al. A "speech acts" approach to grounding in conversation, 1992, ICSLP.

[18] R. Bales, et al. Personality and Interpersonal Behavior, 1971.

[19] F. H. Adler. Cybernetics, or Control and Communication in the Animal and the Machine, 1949.

[20] Herbert H. Clark, et al. Contributing to Discourse, 1989, Cognitive Science.

[21] Michael Kipp, et al. Gesture generation by imitation: from human behavior to computer character animation, 2005.

[22] Joakim Nivre, et al. On the Semantics and Pragmatics of Linguistic Feedback, 1992, Journal of Semantics.

[23] Marie-Luce Bourguet, et al. A Toolkit for Creating and Testing Multimodal Interface Designs, 2002.

[24] Stefan Kopp, et al. Towards integrated microplanning of language and iconic gesture for multimodal output, 2004, ICMI '04.

[25] Rachid Alami, et al. Task planning for human-robot interaction, 2005, sOc-EUSAI '05.

[26] J. R. Landis, et al. The measurement of observer agreement for categorical data, 1977, Biometrics.

[27] A. Kendon. Gesture: Visible Action as Utterance, 2004.

[28] David Salesin, et al. Resynthesizing facial animation through 3D model-based tracking, 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[29] Peter C. Cheeseman, et al. Estimating uncertain spatial relationships in robotics, 1986, Proceedings of the 1987 IEEE International Conference on Robotics and Automation.

[30] Oliver Brdiczka, et al. Automatic detection of interaction groups, 2005, ICMI '05.

[31] Roxane Bertrand, et al. About the relationship between eyebrow movements and F0 variations, 1996, Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96).

[32] Antonella De Angeli, et al. Integration and synchronization of input modes during multimodal human-computer interaction, 1997, CHI.

[33] J. Gabriel Amores, et al. Dialogue moves for natural command languages, 2001, Procesamiento del Lenguaje Natural.

[34] J. Allwood, et al. A study of gestural feedback expressions, 2006.

[35] A. Green, et al. Applying the Wizard-of-Oz framework to cooperative service discovery and configuration, 2004, RO-MAN 2004, 13th IEEE International Workshop on Robot and Human Interactive Communication.

[36] A. Kendon. An Agenda for Gesture Studies, 2007.

[37] Samy Bengio, et al. Towards Computer Understanding of Human Interactions, 2003, EUSAI.

[38] Florian Schiel, et al. User-State Labeling Procedures For The Multimodal Data Collection Of SmartKom, 2002, LREC.

[39] Cynthia Whissell, et al. The Dictionary of Affect in Language, 1989.

[40] Loredana Cerrato. Some characteristics of feedback expressions in Swedish, 2007.

[41] A. Hanks. Canada, 2002.

[42] C. Breazeal. Sociable Machines: Expressive Social Exchange Between Humans and Robots, 2000.

[43] N. Emery, et al. The eyes have it: the neuroethology, function and evolution of social gaze, 2000, Neuroscience & Biobehavioral Reviews.

[44] Kristinn R. Thórisson, et al. The Power of a Nod and a Glance: Envelope Vs. Emotional Feedback in Animated Conversational Agents, 1999, Applied Artificial Intelligence.

[45] E. Hall. The hidden dimension: an anthropologist examines man's use of space in public and private, 1969.

[46] Peter Eisert, et al. Analyzing Facial Expressions for Virtual Conferencing, 1998, IEEE Computer Graphics and Applications.

[47] A. B., et al. Speech Communication, 2001.

[48] Gérard Bailly, et al. Tracking talking faces with shape and appearance models, 2004, Speech Communication.

[49] Michael Kipp, et al. ANVIL - a generic annotation tool for multimodal dialogue, 2001, INTERSPEECH.

[50] Gérard Bailly, et al. MOTHER: a new generation of talking heads providing a flexible articulatory control for video-realistic speech animation, 2000, INTERSPEECH.

[51] Gérard Bailly, et al. Three-dimensional linear articulatory modeling of tongue, lips and face, based on MRI and video images, 2002, Journal of Phonetics.

[52] Timothy F. Cootes, et al. Active Appearance Models, 1998, ECCV.

[53] B. Granström, et al. Natural turn-taking needs no manual: computational theory and model, from perception to action, 2002.

[54] Guillaume Gibert, et al. Capturing data and realistic 3D models for cued speech analysis and audiovisual synthesis, 2005, AVSP.

[55] Gérard Bailly, et al. Audiovisual Speech Synthesis, 2003, International Journal of Speech Technology.

[56] David Lee, et al. The influence of subjects' personality traits on personal spatial zones in a human-robot interaction experiment, 2005, RO-MAN 2005, IEEE International Workshop on Robot and Human Interactive Communication.

[57] P. Ekman, et al. Facial Affect Scoring Technique: A First Validity Study, 1971.

[58] Jens Allwood. The structure of dialog, 1999.

[59] Alexander H. Waibel, et al. Multimodal error correction for speech user interfaces, 2001, TCHI.

[60] Robert E. Kraut, et al. Action as language in a shared visual space, 2004, CSCW.