Characterization and detection of spontaneous speech in large collections of audio documents

Processing spontaneous speech is one of the many challenges that Automatic Speech Recognition (ASR) systems have to deal with. The main markers of spontaneous speech are disfluencies (filled pauses, repetitions, repairs and false starts), and many studies have focused on detecting and correcting them. In this study we define spontaneous speech as unprepared speech, as opposed to prepared speech, where utterances contain well-formed sentences close to those found in written documents. Disfluencies are very good indicators of unprepared speech, but they are not the only ones: ungrammaticality, language register and prosodic patterns are also important. This paper proposes a set of acoustic and linguistic features that can be used to characterize and detect spontaneous speech segments in large audio databases. To better define this notion of unprepared speech, a set of speech segments from an 11-hour corpus of French Broadcast News (BN) has been manually labelled according to a level of spontaneity. We present an evaluation of our features on this corpus and describe the correlation between the word error rate (WER) obtained by a state-of-the-art ASR decoder on this BN corpus and the level of spontaneity.
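As an illustration of the last point, the relation between WER and spontaneity can be examined by pooling recognition errors per spontaneity label. The following is a minimal sketch, not the scoring procedure used in the study: the function names, the two label values and the toy English segments are all assumptions made for the example, and WER is computed with a plain word-level edit distance.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance: substitutions, deletions, insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_level(segments):
    """Pool errors and reference words per spontaneity label.

    `segments` is an iterable of (label, reference, hypothesis) tuples;
    the label set is hypothetical and chosen only for this example.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for label, reference, hypothesis in segments:
        ref_words = reference.split()
        errors[label] += edit_distance(ref_words, hypothesis.split())
        words[label] += len(ref_words)
    return {label: errors[label] / words[label]
            for label in words if words[label] > 0}


# Toy data (invented): one prepared and one spontaneous segment.
segments = [
    ("prepared", "the minister announced a new plan",
     "the minister announced a new plan"),
    ("spontaneous", "well uh i i think that it is",
     "well i think that is"),
]
print(wer_by_level(segments))
```

Pooling errors over all reference words of a label, rather than averaging per-segment WER, keeps very short segments from dominating the figure for that label.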
