Detection of filled pauses in spontaneous conversational speech

Most automatic speech recognition work has concentrated on read speech, whose acoustic properties differ significantly from those of speech found in actual dialogues. A primary difference is that spontaneous speech has a high rate of disfluencies (e.g., filled pauses, repetitions, repairs, false starts). Filled pauses (e.g., “uh,” “um”), unlike silences, acoustically resemble phones that occur within words in continuous speech. This paper considers the problem of detecting filled pauses in spontaneous speech and how such detection can be useful in automatic speech recognition. The acoustic aspects of filled pauses in the widely used SWITCHBOARD [1] database are examined, with the aim of identifying them acoustically using a combination of duration, fundamental frequency, and spectra.
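As a rough illustration of combining the three cues named above, the following Python sketch flags candidate filled pauses in a frame-level analysis. It is not the paper's method. It assumes that filled pauses tend to show up as long voiced stretches with little fundamental-frequency movement and a steady spectrum. The function names, the 10 ms frame step, and every threshold (`min_dur`, `max_f0_cv`, `max_change`) are hypothetical values chosen for the example. The inputs are assumed to come from an external pitch tracker (F0 in Hz, 0 for unvoiced frames) and any per-frame spectral representation.

```python
from math import sqrt
from statistics import mean, pstdev

FRAME_SEC = 0.010  # assumed frame step of 10 ms (not specified in the source)


def voiced_runs(f0):
    """Yield (start, end) frame indices of maximal runs where f0 > 0."""
    start = None
    for i, value in enumerate(f0):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(f0)


def mean_spectral_change(frames):
    """Average Euclidean distance between consecutive spectral vectors."""
    if len(frames) < 2:
        return 0.0
    dists = [
        sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        for x, y in zip(frames, frames[1:])
    ]
    return mean(dists)


def candidate_filled_pauses(f0, spectra, min_dur=0.20, max_f0_cv=0.05,
                            max_change=1.0):
    """Return (start_sec, end_sec) for voiced runs that are long,
    nearly flat in F0, and spectrally steady.

    All three thresholds are illustrative assumptions, not published values.
    """
    hits = []
    for start, end in voiced_runs(f0):
        duration = (end - start) * FRAME_SEC
        if duration < min_dur:
            continue  # duration cue: too short to be a candidate
        segment = f0[start:end]
        f0_cv = pstdev(segment) / mean(segment)  # relative F0 variation
        if f0_cv > max_f0_cv:
            continue  # F0 cue: pitch moves too much
        if mean_spectral_change(spectra[start:end]) > max_change:
            continue  # spectral cue: spectrum is not steady
        hits.append((start * FRAME_SEC, end * FRAME_SEC))
    return hits


if __name__ == "__main__":
    # Synthetic input: 30 flat voiced frames, a short voiced burst,
    # then a long voiced run with rising pitch.
    f0 = [0.0] * 5 + [110.0] * 30 + [0.0] * 5 + [120.0] * 10 + [0.0] * 5
    f0 += [100.0 + 5.0 * k for k in range(30)]
    spectra = [[1.0, 0.5, 0.2] for _ in f0]
    print(candidate_filled_pauses(f0, spectra))
```

On the synthetic input only the first voiced run passes all three checks: the second run is too short and the third has too much pitch movement. In practice the thresholds would have to be tuned on labeled data such as transcribed filled pauses.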

[1]  John J. Godfrey, et al. Robust automatic time alignment of orthographic transcriptions with unconstrained speech, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  P. Mermelstein. Automatic segmentation of speech into syllabic units, 1975, The Journal of the Acoustical Society of America.

[3]  Elizabeth Shriberg, et al. Phonetic Consequences of Speech Disfluency, 1999.

[4]  Hermann Ney. A dynamic programming technique for nonlinear smoothing, 1981, ICASSP.

[5]  Robin J. Lickley, et al. Juncture cues to disfluency, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[6]  John J. Godfrey, et al. SWITCHBOARD: telephone speech corpus for research and development, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Elizabeth Shriberg. DISFLUENCIES IN SWITCHBOARD, 1996.

[8]  C. H. Nakatani, et al. A corpus-based study of repair cues in spontaneous speech, 1994, The Journal of the Acoustical Society of America.

[9]  Andreas Stolcke, et al. A prosody only decision-tree model for disfluency detection, 1997, EUROSPEECH.

[10]  D. O'Shaughnessy, et al. Recognition of hesitations in spontaneous speech, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11]  Douglas D. O'Shaughnessy. Locating disfluencies in spontaneous speech: an acoustical analysis, 1993, EUROSPEECH.

[12]  George R. Doddington, et al. An integrated pitch tracking algorithm for speech systems, 1983, ICASSP.