Estimation of conversational activation level during video chat using turn-taking information.

In this paper, we discuss the feasibility of estimating the activation level of a conversation from phonetic and turn-taking features. First, we recorded conversations of six three-person groups at three different activation levels. We then computed phonetic and turn-taking features and analyzed their correlation with the activation level. The analysis revealed that response latency, overlap rate, and speech rate correlate with the activation level and are relatively insensitive to individual differences. Finally, we formulated multiple regression equations and evaluated the estimation accuracy on the data from the six groups. The results demonstrated the feasibility of estimating the activation level with a root-mean-square error (RMSE) of approximately 18%.
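As a rough sketch of the estimation step described above (not the authors' implementation), the following Python example fits a multiple regression from the three reported features, response latency, overlap rate, and speech rate, to an activation level and reports the RMSE. All feature values, the 0-100 activation scale, and the use of scikit-learn are illustrative assumptions.

```python
# Minimal sketch of activation-level estimation by multiple regression.
# Feature values and the activation scale are hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical per-conversation features:
# [response latency (s), overlap rate, speech rate (mora/s)]
X = np.array([
    [0.80, 0.05, 6.0],   # low activation
    [0.45, 0.12, 7.2],   # medium activation
    [0.20, 0.25, 8.5],   # high activation
    [0.75, 0.06, 6.3],
    [0.40, 0.14, 7.5],
    [0.25, 0.22, 8.1],
])
# Activation-level labels on a 0-100 scale (illustrative).
y = np.array([20, 50, 85, 25, 55, 80])

# Fit the multiple regression and evaluate on the same data.
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)
rmse = np.sqrt(mean_squared_error(y, y_hat))
print(f"RMSE: {rmse:.1f} points on the 0-100 activation scale")
```

In practice, the error would be assessed on held-out conversations (e.g., leave-one-group-out) rather than on the training data as in this toy example.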
