A study in machine learning from imbalanced data for sentence boundary detection in speech

Abstract: Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system that detects sentence boundaries using both prosodic and textual information. Because the data contain far more non-sentence boundaries than sentence boundaries, the prosody model, which is implemented as a decision tree classifier, must be constructed to learn effectively from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers trained on different downsampled training sets achieves slightly poorer performance but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristic (ROC) curves or the area under the curve (AUC), the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may depend on the task, the classifiers, and the knowledge combination approach.
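The two ideas the abstract compares, downsampling the majority class and combining classifiers trained on different downsampled sets, can be illustrated with a minimal Python sketch. This is not the system described in the paper: the one-feature threshold "stump" is a toy stand-in for the decision tree prosody model, the synthetic "pause duration" data and the roughly 10% boundary rate are made up for illustration, and all function names are hypothetical.

```python
import random


def downsample(examples, rng):
    """Balance the classes by randomly dropping majority-class examples.

    `examples` is a list of (feature, label) pairs; label 1 = sentence boundary.
    """
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def train_stump(examples):
    """Toy classifier (assumption): threshold halfway between the class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def ensemble_score(thresholds, x):
    """Fraction of ensemble members that vote 'boundary' for feature value x.

    Assumes boundaries tend to have larger feature values than non-boundaries.
    """
    return sum(x > t for t in thresholds) / len(thresholds)


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)

# Synthetic, imbalanced data: 50 boundaries versus 450 non-boundaries.
data = [(rng.gauss(1.0, 0.5), 1) for _ in range(50)]
data += [(rng.gauss(0.0, 0.5), 0) for _ in range(450)]

# Train one stump per independently downsampled training set, then average votes.
thresholds = [train_stump(downsample(data, rng)) for _ in range(10)]
scores = [ensemble_score(thresholds, x) for x, _ in data]
labels = [y for _, y in data]
ensemble_auc = auc(scores, labels)
```

Each ensemble member sees only a balanced subset of the data, which is the source of the potential computational savings the abstract mentions. Scoring by AUC rather than error rate evaluates the ranking of the posterior estimates instead of a single decision threshold.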
