A Tagged Corpus-Based Study for Repeats and Self-repairs Detection in French Transcribed Speech

This paper presents the results of a tagged-corpus study of two kinds of disfluency, repeats and self-repairs, in a corpus of spontaneous spoken French. The work first investigates the linguistic features of both phenomena. It then shows how, starting from corpus output tagged with TreeTagger, repeats and self-repairs can be taken into account using a word N-gram model and rule-based pattern matching. Finally, some results on a test corpus are presented.
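
As an illustration only, the idea of spotting repeats in tagged output can be sketched as follows. This is a minimal sketch and not the method of the study: it assumes the tagger output is available as tab-separated lines (word, part-of-speech tag, lemma), and it flags only word N-grams that are immediately followed by an identical copy. The names `parse_tagged`, `find_repeats`, `SAMPLE` and the parameter `max_n` are hypothetical, and the tags in the sample are illustrative.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, lemma)

# Illustrative sample in an assumed "word<TAB>tag<TAB>lemma" layout.
SAMPLE = "\n".join([
    "il\tPRO:PER\til",
    "y\tPRO:PER\ty",
    "a\tVER:pres\tavoir",
    "il\tPRO:PER\til",
    "y\tPRO:PER\ty",
    "a\tVER:pres\tavoir",
    "un\tDET:ART\tun",
    "problème\tNOM\tproblème",
])


def parse_tagged(text: str) -> List[Token]:
    """Read tab-separated tagged lines, skipping lines without three fields."""
    tokens: List[Token] = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            tokens.append((parts[0], parts[1], parts[2]))
    return tokens


def find_repeats(tokens: List[Token], max_n: int = 3) -> List[Tuple[int, int, List[str]]]:
    """Return (start index, n, words) for each word n-gram that is
    immediately followed by an identical n-gram. Longer n-grams are
    tried first at each position."""
    words = [word.lower() for word, _, _ in tokens]
    hits: List[Tuple[int, int, List[str]]] = []
    i = 0
    while i < len(words):
        matched = False
        for n in range(max_n, 0, -1):
            first = words[i:i + n]
            second = words[i + n:i + 2 * n]
            if i + 2 * n <= len(words) and first == second:
                hits.append((i, n, first))
                i += n  # move on to the second copy
                matched = True
                break
        if not matched:
            i += 1
    return hits


if __name__ == "__main__":
    print(find_repeats(parse_tagged(SAMPLE)))
```

Self-repairs, where the speaker changes a word instead of repeating it verbatim, would need additional rules (for example, over the tag column), which this sketch does not attempt.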