Memory-based disfluency chunking

We investigate the feasibility of machine learning for the automatic detection of disfluencies in a large, syntactically annotated corpus of spontaneous spoken Dutch. We define disfluencies as chunks that do not fit under the syntactic tree of a sentence; these include fragmented words, laughter, self-corrections, repetitions, abandoned constituents, hesitations, and filled pauses. We use a memory-based learning algorithm to detect disfluent chunks on the basis of a relatively small set of low-level features that keep track of the local context of the focus word and of potential overlaps between words in this context. We use attenuation to deal with sparse data and show that it leads to a slight improvement in the results and to more efficient experiments. A search for the optimal settings of the learning algorithm yields an accuracy of 97% and an F-score of 80%. This is a significant improvement over the baselines and over the results obtained with the default settings of the learner.
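As a rough illustration of the setup described above, the following Python sketch builds a windowed instance around a focus word, adds two overlap features, attenuates rare words, and classifies with a nearest-neighbour lookup. It is a minimal sketch under stated assumptions, not the configuration used in the experiments: the window size, the frequency threshold, the `<RARE>` placeholder, the two overlap tests, and all function names are hypothetical, and the plain mismatch-count distance stands in for whatever metric and feature weighting the tuned learner used.

```python
from collections import Counter

PAD = "_"  # assumed padding symbol for positions outside the sentence


def attenuate(sentences, min_freq=2, placeholder="<RARE>"):
    """Replace words seen fewer than min_freq times with a placeholder.

    Assumption: one generic placeholder; the threshold is illustrative.
    """
    counts = Counter(w for s in sentences for w in s)
    return [[w if counts[w] >= min_freq else placeholder for w in s]
            for s in sentences]


def window_features(raw, att, i, size=2):
    """Context window over attenuated words plus overlap flags on raw words."""
    pad = [PAD] * size
    r = pad + list(raw) + pad
    a = pad + list(att) + pad
    j = i + size
    context = tuple(a[j - size:j + size + 1])
    # Hypothetical overlap features: does the focus word repeat a neighbour?
    overlaps = (r[j] == r[j - 1], r[j] == r[j + 1])
    return context + overlaps


def knn_classify(instance, memory, k=1):
    """Majority vote over the k stored instances with the fewest mismatches."""
    def distance(stored):
        return sum(x != y for x, y in zip(instance, stored))
    nearest = sorted(memory, key=lambda ex: distance(ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy usage with invented data: "D" marks a disfluent token, "O" a fluent one.
train = [["ik", "ik", "wil", "het", "boek"]]
labels = [["D", "O", "O", "O", "O"]]
train_att = attenuate(train)
memory = [
    (window_features(raw, att, i), lab[i])
    for raw, att, lab in zip(train, train_att, labels)
    for i in range(len(raw))
]
query = window_features(["ik", "ik", "wil"], ["ik", "ik", "<RARE>"], 0)
predicted = knn_classify(query, memory, k=1)
```

The overlap flags are computed on the raw words so that two different rare words, both mapped to the same placeholder, are not mistaken for a repetition.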
