Speech Intention Classification with Multimodal Deep Learning

We present a novel multimodal deep learning architecture that automatically learns features from combined textual and acoustic data for sentence-level speech intention classification. Textual and acoustic features are first extracted by two independent convolutional neural network (CNN) branches, then combined into a joint representation, and finally passed to a softmax decision layer. We evaluated the proposed model in a real medical setting, using speech recordings and their transcribed logs. The model achieved 83.10% average accuracy in detecting six different intentions. We also found that our model, which classifies intentions from automatically extracted features, outperformed existing models that rely on hand-crafted features.
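To make the described pipeline concrete, the following is a minimal sketch of a two-branch CNN with late fusion in TensorFlow/Keras. The input shapes, filter sizes, and the size of the joint dense layer are illustrative assumptions, not the authors' published configuration; the sketch only mirrors the overall structure described above (independent textual and acoustic CNNs, concatenated into a joint representation, followed by a softmax decision layer over six intention classes).

```python
# Illustrative sketch only: shapes and hyperparameters are assumptions,
# not the authors' exact configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6               # six intention categories, as reported in the abstract
MAX_WORDS, EMB_DIM = 50, 300  # assumed sentence length / word-embedding dimension
MAX_FRAMES, N_MFCC = 200, 13  # assumed number of acoustic frames / MFCC coefficients

# Textual branch: 1-D convolution over pre-computed word embeddings.
text_in = layers.Input(shape=(MAX_WORDS, EMB_DIM), name="text")
t = layers.Conv1D(128, kernel_size=3, activation="relu")(text_in)
t = layers.GlobalMaxPooling1D()(t)

# Acoustic branch: 1-D convolution over frame-level features (e.g., MFCCs).
audio_in = layers.Input(shape=(MAX_FRAMES, N_MFCC), name="audio")
a = layers.Conv1D(128, kernel_size=5, activation="relu")(audio_in)
a = layers.GlobalMaxPooling1D()(a)

# Joint representation: concatenate branch outputs, then classify with softmax.
joint = layers.concatenate([t, a])
joint = layers.Dense(128, activation="relu")(joint)
out = layers.Dense(NUM_CLASSES, activation="softmax")(joint)

model = models.Model(inputs=[text_in, audio_in], outputs=out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In this sketch, each branch reduces its modality to a fixed-length vector via global max pooling before fusion, so sentences and audio clips of different lengths (up to the assumed maxima) map to a single joint representation for classification.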
