Learning cross-lingual knowledge with multilingual BLSTM for emphasis detection with limited training data

Bidirectional long short-term memory (BLSTM) recurrent neural networks (RNNs) have achieved state-of-the-art performance on many sequence processing problems, given their capability to capture contextual information. However, for languages with a limited amount of training data, it is still difficult to obtain a high-quality BLSTM model for emphasis detection, the aim of which is to recognize emphasized speech segments in natural speech. To address this problem, in this paper we propose a multilingual BLSTM (MTL-BLSTM) model in which the hidden layers are shared across languages while the softmax output layer is language-dependent. The MTL-BLSTM learns cross-lingual knowledge and transfers this knowledge to both languages to improve emphasis detection performance. Experimental results demonstrate that our method outperforms the comparison methods by 2–15.6% and 2.9–15.4% in terms of relative F1-measure on the English corpus and the Mandarin corpus, respectively.
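The shared-hidden-layer design described above can be sketched minimally as follows. This is a toy pure-Python illustration of the multi-task pattern only, not the authors' implementation: the shared BLSTM stack is stood in for by a single tanh layer, and all names, sizes, and the `en`/`zh` head keys are illustrative assumptions.

```python
# Toy sketch of a multi-task architecture: hidden layers shared across
# languages, one language-dependent softmax output layer per language.
# A single tanh layer stands in for the shared BLSTM stack (assumption).
import math
import random

random.seed(0)

HIDDEN = 4    # size of the shared hidden representation (illustrative)
CLASSES = 2   # emphasized vs. non-emphasized segment

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

class SharedEncoder:
    """Placeholder for the hidden layers shared across all languages."""
    def __init__(self, in_dim):
        self.w = rand_matrix(HIDDEN, in_dim)

    def forward(self, frame):
        # tanh(W x): a one-layer stand-in for the stacked BLSTM
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, frame)))
                for row in self.w]

class SoftmaxHead:
    """Language-dependent softmax output layer."""
    def __init__(self):
        self.w = rand_matrix(CLASSES, HIDDEN)

    def forward(self, hidden):
        logits = [sum(wij * hj for wij, hj in zip(row, hidden))
                  for row in self.w]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

# One shared encoder; one output head per language (keys are assumptions).
encoder = SharedEncoder(in_dim=3)
heads = {"en": SoftmaxHead(), "zh": SoftmaxHead()}

def detect(frame, lang):
    # Every language passes through the same encoder, then its own softmax,
    # so gradients from both languages would update the shared parameters.
    return heads[lang].forward(encoder.forward(frame))

probs_en = detect([0.5, -0.2, 0.1], "en")
probs_zh = detect([0.5, -0.2, 0.1], "zh")
```

Because both language heads back-propagate into the same encoder during training, knowledge learned from one language's data can benefit the other, which is the cross-lingual transfer the abstract refers to.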
