Paper information - Verifying a Chinese collection for text categorization

Verifying a Chinese collection for text categorization

This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and label inconsistencies in this corpus and, for comparison, in Reuters-21578. The method detected certain types of similar or duplicated documents that an alternative repetition-based method [3] had overlooked. Experiments showed that categorization effectiveness was not affected by the confusing documents.
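The abstract does not spell out the detection procedure. As a rough illustration only, the following sketch shows one way a retrieval-style check for duplicates and label inconsistency might look: each document is compared against the others, and highly similar pairs are flagged. The cosine similarity measure, the whitespace tokenizer, the 0.9 threshold, and the names `term_vector`, `cosine`, and `find_suspects` are all assumptions made for this example and are not taken from the article.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts.

    Whitespace tokenization is an assumption for illustration; Chinese text
    would need a different tokenizer (for example character n-grams).
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_suspects(docs, threshold=0.9):
    """Flag highly similar document pairs.

    docs: list of (doc_id, text, labels) tuples.
    threshold: hypothetical similarity cutoff, not a value from the article.
    Returns (id_a, id_b, kind, score) tuples, where kind is "duplicate" when
    the label sets match and "label-inconsistent" when they differ.
    """
    vectors = [(doc_id, term_vector(text), frozenset(labels))
               for doc_id, text, labels in docs]
    suspects = []
    for i, (id_a, vec_a, labels_a) in enumerate(vectors):
        for id_b, vec_b, labels_b in vectors[i + 1:]:
            score = cosine(vec_a, vec_b)
            if score >= threshold:
                kind = "duplicate" if labels_a == labels_b else "label-inconsistent"
                suspects.append((id_a, id_b, kind, score))
    return suspects


if __name__ == "__main__":
    sample = [
        ("d1", "oil prices rise sharply", {"crude"}),
        ("d2", "oil prices rise sharply", {"trade"}),
        ("d3", "wheat harvest falls this year", {"grain"}),
    ]
    for pair in find_suspects(sample):
        print(pair)
```

The all-pairs loop is quadratic in the collection size; a real retrieval engine would instead index the collection and issue each document as a query, inspecting only the top-ranked results.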

Yuen-Hsien Tseng | William John Teahan

[1] Gerard Salton, et al. Length Normalization in Degraded Text Collections, 1995.

[2] Stephen E. Robertson, et al. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, 1994, SIGIR '94.

[3] William John Teahan, et al. A repetition based measure for verification of text collections and for text categorization, 2003, SIGIR.