Using tilt for automatic emphasis detection with Bayesian networks

This paper proposes a new framework for detecting emphasis in natural speech, where emphasis refers to a word or part of a word perceived as standing out from its surrounding words. Labeling emphatic words in speech recordings plays a significant role not only in human-computer interaction but also in building speech corpora for expressive speech synthesis. Many previous studies train their models on global features and neglect the effectiveness of local ones. We introduce to this task the tilt parameters, which correspond to the phonetic prominence of an intonation event. In addition, traditional approaches such as emphasis detection with support vector machines (SVMs) neglect the correlations between features, which degrades detection accuracy. We therefore use Bayesian networks (BNs), which model the dependencies between features, as the detector. Experimental results demonstrate that BNs outperform both the baseline and SVMs on this task. Specifically, by combining the tilt feature with the traditional segmental features and semitone, the proposed method yields an 11.6% improvement in emphasis detection accuracy over the baseline and a 2.2%-3.1% improvement over other feature combinations.
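As a rough illustration of the two local features named in the abstract, the sketch below computes tilt parameters for a single intonation event and converts F0 from Hz to semitones. It assumes the commonly cited definition of the tilt model (amplitude tilt and duration tilt averaged into one tilt value) and a hypothetical 100 Hz semitone reference; the function names, the reference frequency, and the handling of flat events are illustrative assumptions, not the paper's actual feature extraction.

```python
import math


def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    """Return (amplitude tilt, duration tilt, tilt) for one intonation event.

    a_rise, a_fall: F0 excursions of the rise and fall parts (e.g. in Hz).
    d_rise, d_fall: durations of the rise and fall parts (e.g. in seconds).
    Assumed convention: +1 is a pure rise, -1 a pure fall, 0 a symmetric
    rise-fall. A completely flat event is mapped to 0 (an assumption).
    """
    amp_sum = abs(a_rise) + abs(a_fall)
    dur_sum = d_rise + d_fall
    tilt_amp = (abs(a_rise) - abs(a_fall)) / amp_sum if amp_sum > 0 else 0.0
    tilt_dur = (d_rise - d_fall) / dur_sum if dur_sum > 0 else 0.0
    return tilt_amp, tilt_dur, 0.5 * (tilt_amp + tilt_dur)


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


if __name__ == "__main__":
    # A rise of 30 Hz over 0.12 s followed by a fall of 10 Hz over 0.08 s.
    print(tilt_parameters(30.0, 10.0, 0.12, 0.08))
    print(hz_to_semitones(200.0))
```

In a detector like the one described, values of this kind would sit alongside the segmental features in each word's or syllable's feature vector; how the paper discretizes or normalizes them is not stated in the abstract.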
