A Rough Set-Based Approach to Text Classification
A non-trivial obstacle to effective text classification for information filtering and retrieval (IF/IR) is the high dimensionality of the data. This paper proposes a technique that uses Rough Set Theory to alleviate this problem. Given corpora of documents and a training set of classified example documents, the technique locates a minimal set of co-ordinate keywords that distinguishes between classes of documents, thereby reducing the dimensionality of the keyword vectors. This simplifies the creation of knowledge-based IF/IR systems, speeds up their operation, and allows easy editing of the rule bases they employ. The paper describes the proposed technique, discusses the integration of a keyword acquisition algorithm with a rough set-based dimensionality reduction algorithm, and provides experimental results from a proof-of-concept implementation.
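To make the idea of rough set-based keyword reduction concrete, the following is a minimal sketch and not the paper's own algorithm, which the abstract does not spell out. It assumes a greedy, QuickReduct-style search driven by the rough set dependency degree (the fraction of documents whose class is fully determined by the chosen keywords). The function names, the toy documents, and the keywords `meeting`, `offer`, `free`, and `report` are all hypothetical, and a greedy search is not guaranteed to find a truly minimal keyword set.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Rough set dependency degree: share of rows in the positive region,
    i.e. rows whose equivalence class contains only one class label."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

def greedy_reduct(rows, labels, attrs):
    """Add, one at a time, the keyword that most increases the dependency
    degree, stopping once it matches that of the full keyword set."""
    target = dependency(rows, labels, attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max((a for a in attrs if a not in chosen),
                   key=lambda a: dependency(rows, labels, chosen + [a]))
        chosen.append(best)
    return chosen

# Hypothetical toy data: 1 means the keyword occurs in the document.
keywords = ["meeting", "offer", "free", "report"]
docs = [
    {"meeting": 1, "offer": 0, "free": 0, "report": 1},
    {"meeting": 1, "offer": 0, "free": 1, "report": 0},
    {"meeting": 0, "offer": 1, "free": 1, "report": 0},
    {"meeting": 0, "offer": 1, "free": 0, "report": 1},
    {"meeting": 0, "offer": 0, "free": 1, "report": 0},
    {"meeting": 1, "offer": 1, "free": 0, "report": 0},
]
labels = ["work", "work", "junk", "junk", "junk", "work"]

print(greedy_reduct(docs, labels, keywords))
```

On this toy data the sketch selects the single keyword `meeting`, since that keyword alone separates the two classes, so the four-dimensional keyword vectors shrink to one dimension without losing the ability to distinguish the training examples.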