Paper information - Learning cross-lingual knowledge with multilingual BLSTM for emphasis detection with limited training data

The bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) has achieved state-of-the-art performance in many sequence processing problems owing to its ability to capture contextual information. However, for languages with a limited amount of training data, it is still difficult to obtain a high-quality BLSTM model for emphasis detection, which aims to recognize emphasized speech segments in natural speech. To address this problem, we propose a multilingual BLSTM (MTL-BLSTM) model in which the hidden layers are shared across languages while the softmax output layer is language-dependent. The MTL-BLSTM can learn cross-lingual knowledge and transfer it to both languages to improve emphasis detection performance. Experimental results demonstrate that our method outperforms the comparison methods by 2–15.6% on the English corpus and 2.9–15.4% on the Mandarin corpus in terms of relative F1-measure.
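The shared-hidden-layer idea in the abstract can be illustrated with a minimal forward-pass sketch in NumPy. This is not the authors' implementation: the single bidirectional layer, the layer sizes, the random initialization, the two-class output, and every name here (`MultilingualBLSTM`, `lstm_pass`, `init_lstm`, the `"english"`/`"mandarin"` keys) are assumptions chosen for illustration. Training is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, n_in, n_hid):
    # One stacked weight matrix for the four gates (input, forget, cell, output).
    return {"W": rng.normal(0.0, 0.1, size=(n_in + n_hid, 4 * n_hid)),
            "b": np.zeros(4 * n_hid)}

def lstm_pass(params, x, reverse=False):
    """Run one LSTM direction over x of shape (T, n_in); return (T, n_hid)."""
    n_hid = params["b"].size // 4
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    out = np.zeros((x.shape[0], n_hid))
    steps = range(x.shape[0] - 1, -1, -1) if reverse else range(x.shape[0])
    for t in steps:
        z = np.concatenate([x[t], h]) @ params["W"] + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

class MultilingualBLSTM:
    """Hypothetical sketch: shared BLSTM layer, one softmax head per language."""

    def __init__(self, n_in, n_hid, languages, n_classes=2, seed=0):
        rng = np.random.default_rng(seed)
        # Hidden-layer parameters shared by all languages.
        self.fwd = init_lstm(rng, n_in, n_hid)
        self.bwd = init_lstm(rng, n_in, n_hid)
        # Language-dependent softmax output layers.
        self.heads = {
            lang: {"W": rng.normal(0.0, 0.1, size=(2 * n_hid, n_classes)),
                   "b": np.zeros(n_classes)}
            for lang in languages
        }

    def hidden(self, x):
        # Concatenate forward and backward states for every frame.
        return np.concatenate(
            [lstm_pass(self.fwd, x), lstm_pass(self.bwd, x, reverse=True)], axis=1)

    def predict_proba(self, x, lang):
        head = self.heads[lang]
        logits = self.hidden(x) @ head["W"] + head["b"]
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

# Usage with made-up input: 6 frames of 5-dimensional features.
model = MultilingualBLSTM(n_in=5, n_hid=8, languages=["english", "mandarin"])
frames = np.random.default_rng(1).normal(size=(6, 5))
p_en = model.predict_proba(frames, "english")
p_zh = model.predict_proba(frames, "mandarin")
```

Both calls pass through the same `hidden` computation and differ only in the output head. In a multi-task training setup of this shape, data from either language would typically update the shared parameters, while each head is updated only by its own language's data.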
