Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation based Deep Learning Approach

Effective emotion inference from user queries helps Voice Dialogue Applications (VDAs) give more personalized responses. The enormous number of VDA users brings in diverse emotion expressions, which raises the question: how can we infer emotion accurately from large-scale internet voice data in VDAs? Traditionally, research on speech emotion recognition has relied on acted voice datasets, which have few speakers but strong, clear emotion expressions. Inspired by this, we propose a novel approach that leverages acted voice data with strong emotion expressions to enhance large-scale unlabeled internet voice data with diverse emotion expressions for emotion inference. Specifically, we propose a semi-supervised multimodal curriculum-augmentation deep learning framework. First, to learn more general emotion cues, we adopt a curriculum-learning-based epoch-wise training strategy that first trains the model on strong, balanced emotion samples from acted voice data and subsequently leverages weak, unbalanced emotion samples from internet voice data. Second, to exploit more diverse emotion expressions, we design a Multi-path MixMatch Multimodal Deep Neural Network (MMMD), which learns feature representations for multiple modalities and trains on labeled and unlabeled data with hybrid semi-supervised methods for superior generalization and robustness. Experiments on an internet voice dataset with 500,000 utterances show that our method outperforms several alternative baselines by 10.09% in terms of F1, while an acted corpus with only 2,397 utterances contributes a further 4.35%. To compare our method against state-of-the-art techniques on traditional acted voice datasets, we also conduct experiments on the public IEMOCAP dataset. The results confirm the effectiveness of the proposed approach.
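To make the epoch-wise curriculum concrete, here is a minimal sketch in PyTorch-style Python. It assumes two ordinary data loaders (`acted_loader` and `internet_loader`) and illustrates only the scheduling idea from the abstract: train on the strong, balanced acted samples first, then interleave the weak, unbalanced internet samples. The function name, the warm-up split, and the hyperparameters are illustrative assumptions, not the authors' released code.

```python
import itertools

def curriculum_train(model, acted_loader, internet_loader, optimizer,
                     criterion, warmup_epochs=5, total_epochs=20):
    """Epoch-wise curriculum: an easy stage on acted data, then a hard
    stage that mixes in large-scale internet voice data."""
    for epoch in range(total_epochs):
        if epoch < warmup_epochs:
            # Easy stage: strong, balanced emotion samples (acted corpus).
            batches = acted_loader
        else:
            # Hard stage: alternate acted and internet mini-batches so the
            # clear emotion cues keep anchoring the noisier internet data.
            batches = itertools.chain(*zip(acted_loader, internet_loader))
        model.train()
        for features, labels in batches:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
```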
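The MixMatch component [1] that MMMD builds on can likewise be sketched in a few lines: guess soft labels for unlabeled utterances by averaging predictions over several augmentations and sharpening the result, then combine a supervised cross-entropy term with an L2 consistency term on the unlabeled guesses. The temperature `T` and the unlabeled weight `lambda_u` follow the MixMatch paper; the mixup step and the multi-path multimodal branches are omitted for brevity, so this is an assumption-laden sketch rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def guess_labels(model, unlabeled_augs, T=0.5):
    """Average predictions over K augmentations of the same utterance,
    then sharpen the averaged distribution (MixMatch label guessing)."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(u), dim=1) for u in unlabeled_augs])
        avg = probs.mean(dim=0)
        sharpened = avg ** (1.0 / T)  # temperature sharpening
        return sharpened / sharpened.sum(dim=1, keepdim=True)

def mixmatch_loss(logits_x, targets_x, logits_u, targets_u, lambda_u=75.0):
    """Supervised cross-entropy on labeled batches plus an L2 consistency
    term on unlabeled batches, weighted by lambda_u."""
    loss_x = -(targets_x * F.log_softmax(logits_x, dim=1)).sum(dim=1).mean()
    loss_u = F.mse_loss(F.softmax(logits_u, dim=1), targets_u)
    return loss_x + lambda_u * loss_u
```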

[1] David Berthelot et al. MixMatch: A Holistic Approach to Semi-Supervised Learning. NeurIPS, 2019.

[2] Jie Tang et al. Learning to Infer Public Emotions from Large-Scale Networked Voice Data. MMM, 2014.

[3] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.

[4] Björn W. Schuller et al. Semisupervised Autoencoders for Speech Emotion Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.

[5] Xiaoyan Zhu et al. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. AAAI, 2017.

[6] Qi Wang et al. Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach. AAAI, 2018.

[7] Rohit Kumar et al. Ensemble of SVM trees for multimodal emotion recognition. APSIPA Annual Summit and Conference, 2012.

[8] Xiaoyuan Yi et al. Inferring users' emotions for human-mobile voice dialogue applications. IEEE ICME, 2016.

[9] Seyedmahdad Mirsamadi et al. Automatic speech emotion recognition using recurrent neural networks with local attention. IEEE ICASSP, 2017.

[10] Stefan Scherer et al. Learning representations of emotional speech with deep convolutional generative adversarial networks. IEEE ICASSP, 2017.

[11] David M. W. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv, 2011.

[12] Chao Wang et al. Multimodal and Multi-view Models for Emotion Recognition. ACL, 2019.

[13] Carlos Busso et al. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008.

[14] Björn W. Schuller et al. Universal Onset Detection with Bidirectional Long Short-Term Memory Neural Networks. ISMIR, 2010.

[15] Björn W. Schuller et al. The INTERSPEECH 2010 paralinguistic challenge. INTERSPEECH, 2010.

[16] Ngoc Thang Vu et al. Improving Speech Emotion Recognition with Unsupervised Representation Learning on Unlabeled Speech. IEEE ICASSP, 2019.

[17] Maosong Sun et al. Punctuation as Implicit Annotations for Chinese Word Segmentation. Computational Linguistics, 2009.

[18] P. Ekman et al. The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding. Semiotica, 1969.

[19] Trevor Darrell et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP, 2016.

[20] Kuo Zhang et al. Acoustics, content and geo-information based sentiment prediction from large-scale networked voice data. IEEE ICME, 2014.

[21] Björn W. Schuller et al. Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks. 2018.

[22] Jason Weston et al. Curriculum learning. ICML, 2009.

[23] Hang Li et al. Neural Responding Machine for Short-Text Conversation. ACL, 2015.

[24] Pascal Vincent et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 2010.

[25] Yanfeng Wang et al. Inferring Emotions From Large-Scale Internet Voice Data. IEEE Transactions on Multimedia, 2019.

[26] Emily Mower Provost et al. Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG). IEEE Transactions on Affective Computing, 2019.

[27] Najim Dehak et al. Deep Neural Networks for Emotion Recognition Combining Audio and Transcripts. INTERSPEECH, 2018.

[28] Reza Lotfian et al. Curriculum Learning for Speech Emotion Recognition From Crowdsourced Labels. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.