Word Prominence Detection using Robust yet Simple Prosodic Features

Automatic detection of word prominence can provide valuable information for downstream applications such as spoken language understanding. Prior work on automatic word prominence detection exploits a variety of lexical, syntactic, and prosodic features and models the task as a sequence labeling problem (independently or using context). While lexical and syntactic features are highly correlated with the notion of word prominence, the output of speech recognition is typically noisy, and hence these features are less reliable than the acoustic-prosodic feature stream. In this work, we address the automatic detection of word prominence through novel prosodic features that capture changes in the shape and magnitude of the F0 curve in conjunction with duration and energy. We contrast the utility of these features with the aggregate statistics of F0, duration, and energy used in prior work. Our features are simple to compute yet robust to the inherent difficulties associated with identifying salient points (such as F0 peaks) in the F0 contour. Feature analysis demonstrates that these novel features are significantly more predictive than the standard aggregation-based prosodic features. Experimental results on a corpus of spontaneous speech indicate that prominence detection accuracy using only the new prosodic features is better than that obtained using both lexical and syntactic features.
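For orientation, the aggregation-based baseline that the abstract contrasts against can be sketched roughly as follows. This is a minimal illustration and not the authors' feature set: the function name `aggregate_prosodic_features`, the choice of statistics (mean, min, max, range), and the convention that unvoiced frames carry an F0 of 0.0 are all assumptions made for the example. The novel curve-shape features are not specified in the abstract and are not attempted here.

```python
from statistics import mean


def aggregate_prosodic_features(f0, energy, start, end):
    """Compute simple per-word aggregate statistics (hypothetical helper).

    f0     -- per-frame F0 values in Hz; 0.0 marks unvoiced frames (assumed convention)
    energy -- per-frame energy values for the same word
    start, end -- word boundaries in seconds
    """
    # Only voiced frames contribute to the F0 statistics.
    voiced = [v for v in f0 if v > 0]
    feats = {"duration": end - start}
    for name, values in (("f0", voiced), ("energy", list(energy))):
        if values:
            feats[name + "_mean"] = mean(values)
            feats[name + "_min"] = min(values)
            feats[name + "_max"] = max(values)
            feats[name + "_range"] = max(values) - min(values)
        else:
            # No usable frames (e.g. a fully unvoiced word): fall back to zeros.
            for stat in ("mean", "min", "max", "range"):
                feats[name + "_" + stat] = 0.0
    return feats
```

Each word then yields one fixed-length feature vector that a classifier or sequence labeler could consume. Statistics of this kind summarize a word's F0 values without describing how the contour moves over time, which is the aspect the proposed shape and magnitude features are meant to capture.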
