Inferring Emphasis for Real Voice Data: An Attentive Multimodal Neural Network Approach

To understand speakers’ attitudes and intentions in real Voice Dialogue Applications (VDAs), effective emphasis inference from users’ queries can play an important role. However, VDAs serve a tremendous number of speakers with a great diversity of dialects and expression preferences, which challenges traditional emphasis-detection methods. In this paper, to better infer emphasis for real voice data, we propose an attentive multimodal neural network. Specifically, we first apply extensive textual features in addition to the acoustic features in modelling. Then, considering the independence of the features, we model the multimodal features with a multi-path convolutional neural network (MCNN). Furthermore, combining the high-level multimodal features, we train an emphasis classifier that attends to the textual features with an attention-based bidirectional long short-term memory network (ABLSTM), so as to comprehensively learn discriminative features from diverse users. An experimental study on a real-world dataset collected from Sogou Voice Assistant (https://yy.sogou.com/) shows that our method outperforms alternative baselines by 1.0–15.5% in terms of F1-measure.
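To make the described pipeline concrete, below is a minimal PyTorch sketch of such an architecture: one convolutional path per modality, fusion of the high-level features, a bidirectional LSTM, and an attention mechanism driven by the textual path. All dimensions, layer sizes, and names (acoustic_dim, text_dim, hidden, the per-word alignment of acoustic and textual features) are illustrative assumptions, not details taken from the paper.

    # A minimal sketch of an attentive multimodal emphasis network.
    # Assumptions: acoustic and textual features are aligned per word,
    # and emphasis is treated as per-word binary tagging.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiPathCNN(nn.Module):
        # One 1-D convolution per kernel size; each modality gets its own
        # instance, so acoustic and textual features are modelled independently.
        def __init__(self, in_dim, hidden, kernel_sizes=(3, 5)):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(in_dim, hidden, k, padding=k // 2) for k in kernel_sizes
            )

        def forward(self, x):                     # x: (batch, steps, in_dim)
            x = x.transpose(1, 2)                 # Conv1d expects channels first
            out = [F.relu(conv(x)) for conv in self.convs]
            return torch.cat(out, dim=1).transpose(1, 2)  # (batch, steps, 2*hidden)

    class AttentiveEmphasisNet(nn.Module):
        # Assumed sizes: 40-dim acoustic frames, 300-dim word embeddings.
        def __init__(self, acoustic_dim=40, text_dim=300, hidden=128):
            super().__init__()
            self.acoustic_path = MultiPathCNN(acoustic_dim, hidden)
            self.text_path = MultiPathCNN(text_dim, hidden)
            self.blstm = nn.LSTM(4 * hidden, hidden,
                                 batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)        # scores from textual path
            self.classifier = nn.Linear(4 * hidden, 2)  # emphasized vs. not

        def forward(self, acoustic, text):
            a = self.acoustic_path(acoustic)             # (batch, steps, 2*hidden)
            t = self.text_path(text)                     # (batch, steps, 2*hidden)
            h, _ = self.blstm(torch.cat([a, t], dim=-1)) # (batch, steps, 2*hidden)
            # Attention weights computed from the textual features, then used
            # to pool the BLSTM states into a query-level context vector.
            w = torch.softmax(self.attn(t).squeeze(-1), dim=1)    # (batch, steps)
            ctx = torch.bmm(w.unsqueeze(1), h)                    # (batch, 1, 2*hidden)
            ctx = ctx.expand(-1, h.size(1), -1)                   # broadcast per word
            return self.classifier(torch.cat([h, ctx], dim=-1))   # (batch, steps, 2)

A quick usage example with random stand-ins for the real features:

    model = AttentiveEmphasisNet()
    acoustic = torch.randn(4, 20, 40)   # 4 queries, 20 words, 40-dim acoustic features
    text = torch.randn(4, 20, 300)      # aligned 300-dim word embeddings
    logits = model(acoustic, text)      # (4, 20, 2): per-word emphasis logits

Driving the attention scores from the textual path is one plausible reading of "attending on the textual features"; the authors' actual attention formulation may differ.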
