Word Fragments Identification Using Acoustic-Prosodic Features in Conversational Speech

Word fragments pose serious problems for speech recognizers. Accurate identification of word fragments will not only improve recognition accuracy but also help disfluency detection algorithms, because the occurrence of a word fragment is a good indicator of a speech disfluency. Unlike previous efforts that include word fragments in the acoustic model, this paper approaches word fragment identification by building classifiers from acoustic-prosodic features. Our experiments show that, by combining a few voice quality measures with prosodic features extracted from forced alignments against the human transcriptions, we obtain a precision of 74.3% and a recall of 70.1% on downsampled spontaneous speech data. The overall accuracy is 72.9%, which is significantly better than the chance performance of 50%.
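
As a quick consistency check on the reported figures, the sketch below derives the overall accuracy implied by the stated precision and recall. It assumes the downsampled data is exactly balanced between fragments and non-fragments, which the abstract suggests (chance performance of 50%) but does not state outright. The function name and the `positive_fraction` parameter are illustrative and do not come from the paper.

```python
def accuracy_from_precision_recall(precision, recall, positive_fraction=0.5):
    """Accuracy implied by precision and recall for a given class prior.

    positive_fraction is the share of word fragments in the evaluation
    data; 0.5 is an assumption based on the 50% chance level.
    """
    pos = positive_fraction
    neg = 1.0 - positive_fraction
    # True positives as a fraction of all samples: recall = TP / pos.
    tp = recall * pos
    # False positives follow from precision = TP / (TP + FP).
    fp = tp * (1.0 - precision) / precision
    # True negatives are the negatives that were not flagged as fragments.
    tn = neg - fp
    return tp + tn


acc = accuracy_from_precision_recall(0.743, 0.701)
print(f"{acc:.3f}")  # prints 0.729
```

Under the balanced-class assumption, the result agrees with the reported 72.9% accuracy, so the three figures are mutually consistent.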
