Speech Intention Classification with Multimodal Deep Learning

We present a novel multimodal deep learning architecture that automatically learns features from combined textual and acoustic data for sentence-level speech intention classification. Textual and acoustic features are first extracted by two independent convolutional neural network (CNN) branches, then combined into a joint representation, and finally passed to a softmax decision layer. We evaluated the proposed model in a real medical setting, using speech recordings and their transcribed logs. The model achieved 83.10% average accuracy in detecting six different intentions. We also found that our model, which classifies intentions from automatically extracted features, outperformed existing models that rely on hand-crafted features.
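To make the described pipeline concrete, the following is a minimal sketch of a two-branch CNN with late fusion in TensorFlow/Keras. The input shapes, filter sizes, and the size of the joint dense layer are illustrative assumptions, not the authors' published configuration; the sketch only mirrors the overall structure described above (independent textual and acoustic CNNs, concatenated into a joint representation, followed by a softmax decision layer over six intention classes).

```python
# Illustrative sketch only: shapes and hyperparameters are assumptions,
# not the authors' exact configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6               # six intention categories, as reported in the abstract
MAX_WORDS, EMB_DIM = 50, 300  # assumed sentence length / word-embedding dimension
MAX_FRAMES, N_MFCC = 200, 13  # assumed number of acoustic frames / MFCC coefficients

# Textual branch: 1-D convolution over pre-computed word embeddings.
text_in = layers.Input(shape=(MAX_WORDS, EMB_DIM), name="text")
t = layers.Conv1D(128, kernel_size=3, activation="relu")(text_in)
t = layers.GlobalMaxPooling1D()(t)

# Acoustic branch: 1-D convolution over frame-level features (e.g., MFCCs).
audio_in = layers.Input(shape=(MAX_FRAMES, N_MFCC), name="audio")
a = layers.Conv1D(128, kernel_size=5, activation="relu")(audio_in)
a = layers.GlobalMaxPooling1D()(a)

# Joint representation: concatenate branch outputs, then classify with softmax.
joint = layers.concatenate([t, a])
joint = layers.Dense(128, activation="relu")(joint)
out = layers.Dense(NUM_CLASSES, activation="softmax")(joint)

model = models.Model(inputs=[text_in, audio_in], outputs=out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In this sketch, each branch reduces its modality to a fixed-length vector via global max pooling before fusion, so sentences and audio clips of different lengths (up to the assumed maxima) map to a single joint representation for classification.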
