Règles d'associations temporelles de signaux sociaux pour la synthèse d'agents conversationnels animés. Application aux attitudes sociales

To improve interaction between humans and embodied conversational agents (ECAs), one of the major challenges of the field is to generate socially believable agents. In this article, we present a method, called SMART (Social Multimodal Association Rules with Timing), that automatically finds temporal associations between the social signals (head movements, facial expressions, prosody, etc.) produced in videos of interactions between humans expressing different affective states (behaviour, attitude, emotions, etc.). Our system is based on a sequence mining algorithm that finds temporal association rules between social signals extracted automatically from audio-video streams. SMART also analyses the link between these rules and each affective state in order to keep only the relevant ones. Finally, SMART enriches them so that an ECA can easily be animated to express the intended state. In this paper, we formalise the implementation of SMART and demonstrate its interest through several studies. First, we show that the computed rules are consistent with the psychology and sociology literature. We then present the results of perceptual evaluations conducted following corpus studies featuring the expression of marked social stances.

ABSTRACT. In the field of Embodied Conversational Agents (ECA), one of the main challenges is to generate socially believable agents. The long-run objective of the present study is to infer rules for the multimodal generation of agents' socio-emotional behaviour. In this paper, we introduce the Social Multimodal Association Rules with Timing (SMART) algorithm. It proposes to learn the rules from the analysis of a multimodal corpus composed of audio-video recordings of human-human interactions. The proposed methodology consists in applying a Sequence Mining algorithm using automatically extracted Social Signals such as prosody, head movements and facial muscle activations as input. This allows us to infer Temporal Association Rules for behaviour generation. We show that this method can automatically compute Temporal Association Rules coherent with prior results found in the literature, especially in the psychology and sociology fields. The results of a perceptive evaluation confirm the ability of a Temporal Association Rules based agent to express a specific stance.
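The abstract describes a pipeline in which a sequence mining step produces temporal association rules between social signals, which are then filtered by their link to each affective state. As a rough illustration of that idea only (not the actual SMART algorithm, whose mining procedure, thresholds and relevance test are defined in the paper itself), the Python sketch below mines toy rules of the form "signal A is followed by signal B within a short time window" from timestamped event sequences and keeps the rules specific to one stance; the signal names, thresholds and the support/confidence filter are illustrative assumptions.

```python
# Illustrative sketch only (not the SMART implementation): mine toy temporal
# association rules "A is followed by B within max_gap seconds" from
# timestamped social-signal sequences, then keep the rules specific to a stance.
# Signal names, thresholds and the support/confidence filter are assumptions.
from collections import Counter

def mine_rules(sequences, max_gap=1.0, min_support=3, min_confidence=0.5):
    """`sequences`: one list of (time, signal) pairs per interaction, sorted by time,
    e.g. [(0.2, "head_nod"), (0.5, "smile"), (1.4, "pitch_rise")]."""
    antecedents = Counter()  # occurrences of each potential antecedent signal A
    pairs = Counter()        # occurrences of A followed by B within max_gap

    for events in sequences:
        for i, (t_a, sig_a) in enumerate(events):
            antecedents[sig_a] += 1
            followers = set()
            for t_b, sig_b in events[i + 1:]:
                if t_b - t_a > max_gap:
                    break  # events are time-ordered, so no later match is possible
                if sig_b != sig_a and sig_b not in followers:
                    pairs[(sig_a, sig_b)] += 1
                    followers.add(sig_b)

    # A rule A -> B is kept if A -> B is frequent enough and B follows A often enough.
    return {
        (a, b): count / antecedents[a]
        for (a, b), count in pairs.items()
        if count >= min_support and count / antecedents[a] >= min_confidence
    }

def stance_specific_rules(rules_by_stance):
    """Naive relevance filter: for each stance, keep the rules found for no other stance."""
    specific = {}
    for stance, rules in rules_by_stance.items():
        other_rules = set()
        for other, r in rules_by_stance.items():
            if other != stance:
                other_rules.update(r)
        specific[stance] = {rule: conf for rule, conf in rules.items() if rule not in other_rules}
    return specific

# Hypothetical usage on sequences of automatically extracted signals:
# friendly = mine_rules(friendly_sequences)
# hostile = mine_rules(hostile_sequences)
# rules = stance_specific_rules({"friendly": friendly, "hostile": hostile})
```

In the system described by the abstract, the input events would be the automatically extracted signals it mentions (head movements, facial expressions, prosody), and the retained rules would additionally be enriched with timing information so that they can directly drive the animation of an ECA expressing the intended state.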
