Inferring users' emotions for human-mobile voice dialogue applications

In this paper, we tackle the problem of inferring users' emotions in real-world Voice Dialogue Applications (VDAs, e.g., Siri, Cortana). We first conduct an investigation which indicates that, besides the text of users' queries, acoustic information and query attributes are very important for inferring emotions in VDAs. To integrate this information, we propose a Hybrid Emotion Inference Model (HEIM), which uses Latent Dirichlet Allocation (LDA) to extract text features and a Long Short-Term Memory (LSTM) network to model acoustic features. To further improve accuracy, HEIM pre-trains the LSTM with a Recurrent Autoencoder Guided by Query Attributes (RAGQA), which incorporates other emotion-related query attributes. On a dataset of 93,000 utterances collected from Sogou Voice Assistant (a Chinese counterpart of Siri), HEIM achieves an accuracy of 75.2%, outperforming state-of-the-art methods by 33.5%-38.5%. In particular, we find that, on average, the acoustic information improves performance by 46.6%, and the query attributes further improve it by 6.5%.
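To make the fusion concrete, the sketch below shows one plausible way (using PyTorch and scikit-learn) to combine LDA topic proportions from query transcripts with an LSTM encoding of the acoustic frame sequence in a single emotion classifier. The layer sizes, feature dimensions, class count, and fusion by simple concatenation are illustrative assumptions, not the authors' implementation, and the RAGQA pre-training step is omitted.

```python
# Minimal sketch of the hybrid text + acoustic idea behind HEIM.
# All dimensions and the concatenation-based fusion are assumptions
# for illustration; this is not the authors' code.

import torch
import torch.nn as nn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

N_TOPICS = 20          # assumed number of LDA topics
ACOUSTIC_DIM = 39      # assumed per-frame acoustic feature size (e.g., MFCCs)
HIDDEN_DIM = 64        # assumed LSTM hidden size
N_EMOTIONS = 6         # assumed number of emotion classes

class HybridEmotionClassifier(nn.Module):
    """Fuse LDA text features with an LSTM summary of the acoustic frames."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(ACOUSTIC_DIM, HIDDEN_DIM, batch_first=True)
        self.classifier = nn.Linear(HIDDEN_DIM + N_TOPICS, N_EMOTIONS)

    def forward(self, acoustic_frames, topic_features):
        # acoustic_frames: (batch, time, ACOUSTIC_DIM)
        # topic_features:  (batch, N_TOPICS) LDA topic proportions
        _, (h_n, _) = self.lstm(acoustic_frames)
        fused = torch.cat([h_n[-1], topic_features], dim=1)
        return self.classifier(fused)

# Text side: fit LDA on query transcripts to obtain topic proportions.
queries = ["play some relaxing music", "why is the weather so bad today"]
counts = CountVectorizer().fit_transform(queries)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
topic_feats = torch.tensor(lda.fit_transform(counts), dtype=torch.float32)

# Acoustic side: dummy frame sequences standing in for real utterances.
frames = torch.randn(len(queries), 120, ACOUSTIC_DIM)

model = HybridEmotionClassifier()
logits = model(frames, topic_feats)   # (2, N_EMOTIONS) emotion scores
```

In this sketch the last LSTM hidden state serves as the utterance-level acoustic representation; the paper's RAGQA component would instead pre-train the acoustic encoder with a reconstruction objective guided by query attributes before the classifier is trained.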
