Three Complements to Make Better Guideline of Chinese Word Segmentation

This paper proposes three complements to Chinese word segmentation guidelines, each essential for building high-quality segmented Chinese corpora: named entity (person name, location name, and organization name) tagging rules, factoid (date, time, percentage, etc.) tagging rules, and disambiguation rules. They are needed because named entities and factoids are treated as segmentation units in many corpora, and the disambiguation problem is seldom defined in earlier segmentation guidelines. In practice, annotators often have different intuitions about ambiguous strings, so segmentation guidelines need to explain how such strings should be handled. Our practice has shown that specifying explicit segmentation rules helps reduce errors and inconsistencies in annotated corpora.