Automatic voice emotion recognition of child-parent conversations in natural settings

ABSTRACT While vocal communication of emotion has been researched for decades, the accuracy of automatic voice emotion recognition (AVER) still leaves much room for improvement. Intergenerational communication in particular has been under-researched, as indicated by the lack of an emotion corpus of child–parent conversations. In this paper, we present our work applying Support Vector Machines (SVMs), an established class of machine learning models, to analyse 20 pairs of child–parent dialogues on everyday life scenarios. Among the many issues facing the emerging work on AVER, we explore two critical ones: the methodological issue of optimising performance against computational cost, and the conceptual issue of what constitutes an emotionally neutral state. We built models using the minimalistic and extended acoustic feature sets extracted with openSMILE and small and large sets of annotated utterances, and analysed the prevalence of the class neutral. Results indicate that the larger the combined sets, the better the training outcomes. Nevertheless, the classification models yielded only modest average recall when applied to the child–parent data, indicating low generalisability. Implications for improving AVER and its potential uses are drawn.
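The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: random feature vectors stand in for the openSMILE minimalistic/extended (GeMAPS/eGeMAPS) functionals, the four emotion labels are hypothetical, and the model is scored with unweighted average recall (macro-averaged recall), the usual metric for class-imbalanced emotion data.

```python
# Sketch of SVM-based voice emotion classification, assuming per-utterance
# acoustic feature vectors (e.g. the 88 eGeMAPS functionals openSMILE emits).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n_utterances, n_features = 400, 88          # 88 = eGeMAPS functional count
X = rng.normal(size=(n_utterances, n_features))
y = rng.integers(0, 4, size=n_utterances)   # e.g. neutral/happy/sad/angry

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Standardise features, then fit a linear-kernel SVM.
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="linear", C=1.0).fit(scaler.transform(X_train), y_train)

# Unweighted average recall (UAR) = recall averaged over classes,
# so the majority class "neutral" cannot dominate the score.
uar = recall_score(y_test, clf.predict(scaler.transform(X_test)),
                   average="macro")
print(f"UAR: {uar:.3f}")
```

On random features as here, UAR hovers near chance level (0.25 for four classes); the paper's point is that models trained on one corpus may stay close to that floor when transferred to child–parent data.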
