A generic framework for the inference of user states in human-computer interaction

The automatic analysis of affective or communicational states in human-human and human-computer interaction (HCI) often suffers either from overly simplistic approaches or from attempts to take very ambitious steps all at once. In this paper, we propose a generic framework that overcomes many of the difficulties associated with real-world user behavior analysis: uncertainty about the ground truth of the current state, subject independence, dynamic real-time analysis of multimodal information, and the processing of incomplete or erroneous inputs, e.g. after sensor failure or in the absence of input. The approach is based on the detection and spotting of behavioral cues that are regarded as basic building blocks of user-state-specific behavior; we motivate it with related work and with the analysis of a large HCI corpus, in which paralinguistic and nonverbal behavior could be significantly associated with user states. We summarize some of our previous work on the detection and classification of behavioral cues and introduce a layered architecture based on hidden Markov models. We believe that this step-by-step approach to understanding human behavior, supported by encouraging preliminary results, outlines a principled path towards the development and evaluation of computational mechanisms for the analysis of multimodal social signals.
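The layered idea can be illustrated with a minimal sketch: a lower layer emits per-frame probabilities for behavioral cues (e.g. a laughter spotter), and an upper-layer hidden Markov model infers the user state from those cue observations via the standard forward recursion. The two-state setup, the cue values, and all probabilities below are hypothetical illustrations, not the models actually used in the paper.

```python
import numpy as np

def forward(pi, A, B_obs):
    """Forward algorithm: filtered posterior P(state_t | obs_1..t).
    pi: (S,) initial distribution, A: (S,S) transition matrix,
    B_obs: (T,S) per-frame observation likelihood of each state."""
    T, S = B_obs.shape
    alpha = np.zeros((T, S))
    alpha[0] = pi * B_obs[0]
    alpha[0] /= alpha[0].sum()          # normalize for numerical stability
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
        alpha[t] /= alpha[t].sum()
    return alpha

# Lower layer: hypothetical per-frame cue posteriors, columns = P(cue | frame)
# for the cues {neutral speech, laughter}, as a cue spotter might produce them.
cue_post = np.array([[0.9, 0.1],
                     [0.8, 0.2],
                     [0.3, 0.7],
                     [0.2, 0.8],
                     [0.6, 0.4]])

# Upper layer: user-state HMM over {engaged, amused}; sticky transitions model
# that user states change more slowly than the frame-level cues.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Illustrative observation model: each user state assigns a likelihood to the
# observed cue vector ('engaged' favors neutral speech, 'amused' laughter).
cue_given_state = np.array([[0.8, 0.2],   # engaged
                            [0.2, 0.8]])  # amused
B = cue_post @ cue_given_state.T          # (T, 2) frame likelihoods per state

post = forward(pi, A, B)
print(post[-1])  # posterior over user states after the last frame
```

The normalization inside the loop keeps the recursion stable for long sequences; because the transition matrix is sticky, isolated laughter frames shift the state posterior only gradually rather than flipping it frame by frame.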
