Integrating sequence information in the audio-visual detection of word prominence in a human-machine interaction scenario

Raising the prominence of a segment of an utterance by modifying its articulatory parameters (hyperarticulation) is usually accompanied by a reduction of these parameters (hypoarticulation) in the neighboring segments. In this paper we investigate different approaches to the automatic labeling of word prominence, and in particular how the information in the word sequence can be exploited. During the recording of the underlying audio-visual database, the subjects were asked to correct the system's misunderstanding of a single word using prosodic cues only. We extracted an extensive range of features from the audio and visual channels. For the classification of word prominence we compare two algorithms: a Support Vector Machine (SVM), which is a local classifier, and a linear-chain Conditional Random Field (CRF), which is a sequential model. Both were trained on different context regions. For the CRF, the whole sentence is used as a word sequence for training and testing. Overall we show that sequence models such as the CRF, which performs best in our experiment, are well suited for prominence detection and, furthermore, that the neighboring words contain information which further improves the detection.

Index Terms: prosody, prominence, audio-visual, sequence information
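The contrast between a local classifier and a linear-chain sequence model can be illustrated with a small sketch. It is not the authors' implementation: the per-word scores and the transition scores below are hand-set, hypothetical values (a CRF would learn them from data), and the names `local_decode` and `viterbi_decode` are introduced here for illustration only. The transition table assumes that two adjacent prominent words are unlikely, which loosely mirrors the scenario in which a single word is corrected.

```python
# Hypothetical label set: index 0 = non-prominent, index 1 = prominent.
LABELS = ("non-prominent", "prominent")


def local_decode(emissions):
    """Label every word independently, as a local classifier would."""
    return [max(range(len(row)), key=row.__getitem__) for row in emissions]


def viterbi_decode(emissions, transitions):
    """Return the best label sequence under a linear-chain model.

    emissions[t][y]   : score of label y for word t
    transitions[a][b] : score of label a being followed by label b
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for y in range(n_labels):
            best_prev = max(
                range(n_labels), key=lambda p: score[p] + transitions[p][y]
            )
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y] + row[y])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the last word to the first.
    path = [max(range(n_labels), key=score.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hand-set, illustrative scores for a four-word sentence (not real data).
emissions = [
    [0.0, -2.0],   # word 0: clearly non-prominent
    [-0.6, -0.5],  # word 1: ambiguous, marginally prominent when seen alone
    [-2.0, 0.0],   # word 2: clearly prominent
    [0.0, -2.0],   # word 3: clearly non-prominent
]
# Assumption: a prominent word is rarely followed by another prominent word.
transitions = [
    [0.0, 0.0],
    [0.0, -3.0],
]

local_labels = local_decode(emissions)
sequence_labels = viterbi_decode(emissions, transitions)
print([LABELS[y] for y in local_labels])
print([LABELS[y] for y in sequence_labels])
```

With these scores the local decision marks both word 1 and word 2 as prominent, whereas the sequence decoding keeps only word 2, because the ambiguous neighbor is resolved by the label of the adjacent word.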

Martin Heckmann | Andrea Schnall
