Recognizing Stress Using Semantics and Modulation of Speech and Gestures

This paper investigates how speech and gestures convey stress, and how they can be used for automatic stress recognition. As a first step, we examine how humans use speech and gestures to convey stress. In particular, for both speech and gestures, we distinguish between stress conveyed by the intended semantic message (e.g. spoken words for speech, symbolic meaning for gestures) and stress conveyed by the modulation of either speech or gestures (e.g. intonation for speech, speed and rhythm for gestures). As a second step, we use this decomposition of stress as an approach to automatic stress prediction. The considered components provide an intermediate representation with intrinsic meaning, which helps bridge the semantic gap between the low-level sensor representation and the high-level, context-sensitive interpretation of behavior. Our experiments are run on an audiovisual dataset of service-desk interactions. The final goal is a surveillance system that notifies staff when the stress level is high and extra assistance is needed. We find that speech modulation is the best-performing intermediate-level variable for automatic stress prediction. Adding gestures increases performance and is most beneficial when speech is lacking. The two-stage approach with intermediate variables outperforms baseline feature-level and decision-level fusion.
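The two-stage approach can be made concrete with a minimal sketch. The Python code below is illustrative only: the logistic-regression components, feature dimensions, and random data are placeholders, not the classifiers or features used in the paper. It trains one model per intermediate variable (speech semantics, speech modulation, gesture semantics, gesture modulation) and then fuses the four intermediate scores into a final stress prediction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1: one component model per intermediate variable. Each maps its own
# low-level features (e.g. prosody for speech modulation, word-affect scores
# for speech semantics) to a stress score in [0, 1]. The choice of model and
# the feature dimensionality here are assumptions for illustration.
components = {
    "speech_semantics":   LogisticRegression(),
    "speech_modulation":  LogisticRegression(),
    "gesture_semantics":  LogisticRegression(),
    "gesture_modulation": LogisticRegression(),
}

# Placeholder data: 200 clips, 10 low-level features per component,
# and a binary stress label per clip.
X = {name: rng.normal(size=(200, 10)) for name in components}
y = rng.integers(0, 2, size=200)

for name, model in components.items():
    model.fit(X[name], y)

# Intermediate representation: the four component stress scores per clip.
# This is the layer with intrinsic meaning that bridges the semantic gap.
Z = np.column_stack(
    [components[name].predict_proba(X[name])[:, 1] for name in components]
)

# Stage 2: fuse the intermediate variables into the final stress prediction.
fusion = LogisticRegression().fit(Z, y)
print(fusion.predict(Z[:5]))
```

The contrast with baseline feature-level fusion is that the baseline would concatenate all raw features into one vector and train a single classifier, skipping the interpretable intermediate layer entirely.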
