Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child Multimodal Interaction Dataset

Automatic speech-based affect recognition of individuals in dyadic conversation is a challenging task, in part because of its heavy reliance on manual pre-processing: traditional approaches often require hand-crafted speech features and segmentation of speaker turns. In this work, we design end-to-end deep learning methods to recognize each person's affective expression in an audio stream with two speakers, automatically discovering the features and time regions relevant to the target speaker's affect. We integrate a local attention mechanism into the end-to-end architecture and compare the performance of three attention implementations: one mean-pooling method and two weighted-pooling methods. Our results show that the proposed weighted-pooling attention solutions learn to focus on the regions containing the target speaker's affective information and successfully estimate the individual's valence and arousal intensity. Here we introduce and use the "Dyadic Affect in Multimodal Interaction - Parent to Child" (DAMI-P2C) dataset, collected in a study of 34 families in which a parent and a child (3-7 years old) engage in reading storybooks together. In contrast to existing public datasets for affect recognition, each instance for both speakers in the DAMI-P2C dataset is annotated for perceived affect by three labelers. To encourage further research on the challenging task of multi-speaker affect sensing, we make the annotated DAMI-P2C dataset publicly available, including acoustic features extracted from the dyads' raw audio, affect annotations, and a diverse set of developmental, social, and demographic profiles for each dyad.
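To make the contrast between the pooling strategies concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of mean pooling versus learned weighted pooling over frame-level features; the module names, layer sizes, and the single-linear-layer attention scorer are illustrative assumptions.

```python
# Minimal sketch, assuming a recurrent encoder has already produced
# frame-level features h of shape (batch, time, feat). Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanPool(nn.Module):
    """Attention-free baseline: average frame features uniformly over time."""
    def forward(self, h):            # h: (batch, time, feat)
        return h.mean(dim=1)         # (batch, feat)

class AttentionPool(nn.Module):
    """Weighted pooling: a learned scorer assigns each frame a weight, so the
    model can concentrate on regions carrying the target speaker's affect."""
    def __init__(self, feat_dim):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)        # one scalar score per frame

    def forward(self, h):                           # h: (batch, time, feat)
        scores = self.scorer(h).squeeze(-1)         # (batch, time)
        alpha = F.softmax(scores, dim=1)            # weights sum to 1 over time
        return (alpha.unsqueeze(-1) * h).sum(dim=1) # weighted sum: (batch, feat)

# Hypothetical usage: pool the frame features, then regress the two
# continuous affect targets (valence and arousal intensity).
h = torch.randn(8, 500, 128)                  # e.g., 500 frames, 128-d features
baseline = MeanPool()(h)                      # (8, 128)
pooled = AttentionPool(128)(h)                # (8, 128)
valence_arousal = nn.Linear(128, 2)(pooled)   # (8, 2)
```

In this sketch the only difference between the two variants is whether the temporal weights are uniform or learned, which is what lets the weighted-pooling models attend to the target speaker's segments in a two-speaker stream.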
