Automatic detection of sentence prominence in speech using predictability of word-level acoustic features

Automatic detection of prominence in speech is an important task for many spoken language applications. However, most previous approaches train classifiers on a corpus annotated with prosodic labels, and therefore do not generalize beyond high-resource languages. In this paper, we propose an algorithm for the automatic detection of sentence prominence that does not require explicit prominence labels for training. The method is based on the finding that human perception of prominence correlates with the (un)predictability of prosodic trajectories. The proposed system takes speech as input and combines information from automatically detected syllabic nuclei and three prosodic features to estimate which words are prominent. Results are reported on a speech corpus with prominence labels manually assigned by twenty annotators, showing that the algorithm's output agrees with the annotators’ prominence responses with 86% accuracy.
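The idea of scoring words by the unpredictability of a prosodic trajectory can be illustrated with a minimal sketch. This is not the system described above: the abstract does not specify the statistical model, the features, or how syllabic nuclei are used, so everything here is an assumption made for illustration. That includes the bigram model with add-one smoothing, the bin edges, the single made-up feature track, and the names `quantize`, `BigramModel`, and `word_scores`. The sketch learns transition statistics from unlabeled symbol sequences, then gives each word the mean surprisal (in bits) of the transitions that fall inside it.

```python
import math
from collections import Counter


def quantize(values, edges):
    """Map each value to a bin index; `edges` are sorted bin boundaries (assumed)."""
    return [sum(v > e for e in edges) for v in values]


class BigramModel:
    """Add-one smoothed bigram model over discrete symbols (illustrative choice)."""

    def __init__(self, n_symbols):
        self.n = n_symbols
        self.pair = Counter()  # counts of (previous, current) symbol pairs
        self.ctx = Counter()   # counts of the previous symbol alone

    def fit(self, sequences):
        # Training uses only unlabeled symbol sequences, no prominence labels.
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.pair[(prev, cur)] += 1
                self.ctx[prev] += 1
        return self

    def surprisal(self, prev, cur):
        # Surprisal in bits of seeing `cur` after `prev`.
        p = (self.pair[(prev, cur)] + 1) / (self.ctx[prev] + self.n)
        return -math.log2(p)


def word_scores(model, symbols, word_spans):
    """Mean surprisal per word; spans are (start, end) frame indices, end exclusive."""
    scores = []
    for start, end in word_spans:
        vals = [model.surprisal(symbols[t - 1], symbols[t])
                for t in range(max(start, 1), end)]
        scores.append(sum(vals) / len(vals) if vals else 0.0)
    return scores


# Hypothetical training data: mostly flat trajectories over 3 symbols.
train = [[0] * 8 for _ in range(5)] + [[0, 0, 1, 0, 0, 0, 0, 0]]
model = BigramModel(n_symbols=3).fit(train)

# Hypothetical feature track for one utterance, with an excursion in the second word.
track = [100, 101, 99, 100, 100, 180, 185, 100, 100, 99]
symbols = quantize(track, edges=[120, 160])
spans = [(0, 4), (4, 8), (8, 10)]  # assumed word boundaries in frames

scores = word_scores(model, symbols, spans)
most_prominent = scores.index(max(scores))  # index of the least predictable word
```

In this toy input the second word contains a transition the model has rarely seen, so it receives the highest score. A real system would need a way to combine several features and a decision rule for turning scores into prominence labels, neither of which is specified in the abstract.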
