Improved Features and Models for Detecting Edit Disfluencies in Transcribing Spontaneous Mandarin Speech

Detection of edit disfluencies is key to transcribing spontaneous utterances. In this paper, we present improved features and models that detect edit disfluencies and enhance the transcription of spontaneous Mandarin speech using hypothesized disfluency interruption points (IPs) and edit-word detection. A comprehensive set of prosodic features that takes into account the special characteristics of edit disfluencies in Mandarin is developed, and an improved model combining decision trees with maximum entropy is proposed to detect IPs. This model is further adapted to the desired prosodic conditions by latent prosodic modeling, a probabilistic framework that analyzes speech prosody in terms of a set of latent prosodic states. These techniques yield higher recognition accuracy (by rescoring with the hypothesized IPs) and better edit-word detection (using conditional random fields defined over Chinese characters) in the final transcription, as verified by experiments on a spontaneous Mandarin speech corpus.
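
The two-stage modeling idea named in the abstract, a decision tree over prosodic features whose IP posterior is then combined with lexical evidence in a maximum-entropy classifier, can be sketched as follows. This is a minimal illustration on synthetic data: the feature names, dimensions, and labels below are assumptions, and the paper's actual prosodic feature set, latent prosodic adaptation, and character-level CRF edit-word detector are not reproduced here.

```python
# Minimal sketch of a decision-tree + maximum-entropy combination for
# interruption-point (IP) detection.  All features and data here are
# synthetic placeholders, not the paper's feature set or corpus.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Assumed prosodic features per word boundary (e.g. pause duration,
# pitch reset, duration lengthening) -- 3 illustrative dimensions.
n = 2000
prosodic = rng.normal(size=(n, 3))
# Assumed binary lexical indicator features at the same boundaries.
lexical = rng.integers(0, 2, size=(n, 5)).astype(float)
# Synthetic label: 1 = boundary is an IP, 0 = fluent boundary.
y = (prosodic[:, 0] + 0.5 * lexical[:, 0]
     + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

train, test = slice(0, 1500), slice(1500, n)

# Stage 1: decision tree on prosodic features produces P(IP | prosody),
# playing the role of the prosody decision-tree model.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(prosodic[train], y[train])
tree_post = tree.predict_proba(prosodic)[:, [1]]

# Stage 2: maximum-entropy (logistic regression) classifier combines the
# tree posterior with lexical features into a final IP decision.
maxent = LogisticRegression(max_iter=1000)
maxent.fit(np.hstack([tree_post[train], lexical[train]]), y[train])

acc = maxent.score(np.hstack([tree_post[test], lexical[test]]), y[test])
print(f"IP detection accuracy on held-out synthetic data: {acc:.3f}")
```

In the paper's full pipeline, the hypothesized IPs from such a detector would additionally be used to rescore recognition hypotheses and to constrain edit-word labeling; this sketch covers only the feature-combination step.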
