Inferring Emotions From Large-Scale Internet Voice Data

As voice dialog applications (VDAs, e.g., Siri (http://www.apple.com/ios/siri/), Cortana (http://www.microsoft.com/en-us/mobile/campaign-cortana/), and Google Now (http://www.google.com/landing/now/)) grow in popularity, inferring emotions from the large-scale internet voice data generated by VDAs can help these applications give more reasonable and humane responses. However, the tremendous number of users behind large-scale internet voice data leads to great diversity in users' accents and expression patterns. Traditional speech emotion recognition methods, which mainly target acted corpora, therefore cannot effectively handle such massive and diverse internet voice data. To address this issue, we carry out a series of observations, identify emotion categories suited to large-scale internet voice data, and verify that social attributes (query time, query topic, and user location) serve as indicators for emotion inference. Based on these observations, we employ two different strategies. First, we propose a deep sparse neural network model that takes acoustic information, textual information, and three indicators (a temporal indicator, a descriptive indicator, and a geo-social indicator) as input. Second, to capture contextual information, we propose a hybrid emotion inference model that uses long short-term memory (LSTM) networks to capture acoustic features and latent Dirichlet allocation (LDA) to extract textual features. Experiments on 93,000 utterances collected from the Sogou Voice Assistant (http://yy.sogou.com, a Chinese counterpart of Siri) validate the effectiveness of the proposed methodologies. Furthermore, we compare the two methodologies and discuss their respective advantages and disadvantages.
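The abstract does not specify the full architectures, so the following is only a minimal PyTorch sketch of how the two strategies might be wired up. All layer sizes, the emotion count, the activation choices, and the concatenation-based fusion of modalities are our assumptions for illustration, not the authors' specification.

```python
import torch
import torch.nn as nn

# --- Strategy 1 (sketch): deep sparse neural network over concatenated
# acoustic features, textual features, and the three social-attribute
# indicators (temporal, descriptive, geo-social). Sizes are assumptions.
class DeepSparseNet(nn.Module):
    def __init__(self, n_in, n_hidden=256, n_emotions=6):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.out = nn.Linear(n_hidden, n_emotions)

    def forward(self, x):
        h = self.encoder(x)
        # Sparsity can be encouraged by adding an L1 penalty on h to the
        # training loss, e.g. loss = ce_loss + lambda_sparse * h.abs().mean().
        return self.out(h), h

# --- Strategy 2 (sketch): hybrid model in which an LSTM encodes
# frame-level acoustic features while a precomputed LDA topic vector
# summarizes the query text; the two are fused for classification.
class HybridEmotionModel(nn.Module):
    def __init__(self, n_acoustic=40, n_topics=50, n_hidden=128, n_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(n_acoustic, n_hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(n_hidden + n_topics, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_emotions),
        )

    def forward(self, acoustic_frames, lda_topics):
        # acoustic_frames: (batch, time, n_acoustic); lda_topics: (batch, n_topics)
        _, (h_n, _) = self.lstm(acoustic_frames)
        fused = torch.cat([h_n[-1], lda_topics], dim=1)
        return self.classifier(fused)  # unnormalized emotion scores
```

Under these assumptions, both models would be trained with a standard cross-entropy loss over the chosen emotion categories, with the sparse model additionally penalizing hidden activations as noted in the comment.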
