Removing fillers to induce semantic classes for a Chinese dialogue system

In this paper, we introduce an unsupervised method that removes fillers from spoken dialogues semi-automatically, based on their probability distribution, and we examine the effect of filler removal on semantic class induction. We compute the unigram and bigram distributions of fillers in our Chinese voice search data and find that, when words are ranked using these distributions alone, fillers fall within the top 1% of all words. We also measure semantic class induction precision before and after filler removal on both a human-to-computer corpus and a human-to-human corpus. After fillers are removed, precision rises from 81.8% to 86.9% on human-to-computer dialogues and from 58.0% to 61.9% on human-to-human dialogues.
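
As an illustration only, the distribution-based step can be sketched as below. The abstract does not specify the exact scoring, so this sketch assumes that words are ranked by raw unigram frequency and that the top fraction of the vocabulary is proposed as filler candidates for manual review; the bigram distribution is not modelled. The function names `filler_candidates` and `remove_fillers`, the `top_fraction` parameter, and the toy utterances are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def filler_candidates(utterances, top_fraction=0.01):
    """Propose the most frequent words as filler candidates.

    Assumption: ranking is by raw unigram count; at least one
    candidate is always returned so tiny vocabularies still work.
    """
    counts = Counter(word for utt in utterances for word in utt)
    ranked = [word for word, _ in counts.most_common()]
    k = max(1, math.ceil(len(ranked) * top_fraction))
    return ranked[:k]


def remove_fillers(utterances, fillers):
    """Drop every token that appears in the (reviewed) filler list."""
    fillers = set(fillers)
    return [[w for w in utt if w not in fillers] for utt in utterances]


# Toy, pre-segmented utterances, made up for illustration.
utterances = [
    ["嗯", "我", "想", "找", "餐厅"],
    ["嗯", "查", "天气"],
    ["嗯", "导航", "到", "机场"],
]

candidates = filler_candidates(utterances)
cleaned = remove_fillers(utterances, candidates)
```

In a semi-automatic setting, the candidate list would presumably be checked by a person before removal, since frequent content words can also appear near the top of such a ranking.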