Verifying a Chinese collection for text categorization
This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and label inconsistencies in this corpus and, for comparison, in Reuters-21578. The method detected certain types of similar and/or duplicated documents that were overlooked by an alternative repetition-based method [3]. Experiments showed that categorization effectiveness was not affected by the confusing documents.
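The abstract does not specify the retrieval model, so the following is only a minimal sketch of what a retrieval-based check could look like, not the authors' implementation. It assumes whitespace tokenization (a real Chinese corpus would need word segmentation or character n-grams), TF-IDF weighting with cosine similarity, and an arbitrary similarity threshold of 0.9. The function names `tfidf_vectors`, `cosine`, and `find_suspects` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalised TF-IDF vectors (assumed weighting, not from the paper)."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive even for terms in every document.
        vec = {t: c * (math.log((n + 1) / (df[t] + 1)) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse, already normalised vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def find_suspects(docs, labels, threshold=0.9):
    """Treat each document as a query against the others.

    Returns (i, j, similarity, labels_differ) for every pair whose
    similarity reaches the (assumed) threshold.
    """
    vectors = tfidf_vectors(docs)
    suspects = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                suspects.append((i, j, round(sim, 3), labels[i] != labels[j]))
    return suspects


# Toy data: documents 0 and 1 are identical but carry different labels.
docs = [
    "oil prices rise on supply fears",
    "oil prices rise on supply fears",
    "central bank cuts interest rates",
]
labels = ["crude", "trade", "interest"]
print(find_suspects(docs, labels))
```

A pair flagged with differing labels is a candidate label inconsistency, while a pair with the same label is a plain duplicate. The all-pairs loop is quadratic and only suits small collections; a larger corpus would use an inverted index to retrieve the top-ranked candidates for each query document.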
[1] Gerard Salton, et al. Length Normalization in Degraded Text Collections, 1995.
[2] Stephen E. Robertson, et al. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, 1994, SIGIR '94.
[3] William John Teahan, et al. A repetition based measure for verification of text collections and for text categorization, 2003, SIGIR.