On redundancy of training corpus for text categorization: a perspective of geometry
Text categorization is an important and extensively studied problem in information retrieval [1, 2, 3]. However, research in the literature has focused mainly on categorization methods, and few researchers have paid much attention to the training corpora themselves. Nevertheless, even for a fixed categorization method, classifiers trained on different training corpora show different classification performance; that is, classifier performance depends on the training corpus. In this paper, we study the redundancy of training corpora in the context of kNN text categorization, aiming to explore how to judge whether a training corpus contains redundancy and, if so, how to reduce it. With the rapid increase in the number of text documents, training corpora are growing in size (Reuters-21578 has 21,578 news articles, while RCV1 contains over 800K). Redundancy is unavoidable when building and using training corpora. Reducing it can compact the training corpus, thereby improving the efficiency of training and classification, and may even improve classification performance. Note that redundancy differs from the duplicate problem [4]: duplication refers to similar text content, whereas redundancy refers to similar semantic content. We define redundant training examples from a geometric point of view and develop a redundancy reduction algorithm. Experiments are conducted to demonstrate the existence of redundancy in training corpora and to validate the proposed algorithm.
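The abstract does not spell out the geometric definition of redundancy or the reduction algorithm, so the following is only an illustrative sketch of the general idea, not the paper's method. It swaps in a different, well-known technique, Hart-style condensed nearest neighbor: a training example is treated as redundant if the examples already kept classify it correctly under 1-NN with cosine similarity. The tokenizer, the function names (`vectorize`, `cosine`, `nearest_label`, `condense`) and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (hypothetical whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def nearest_label(vec, examples):
    """Label of the most similar example (1-NN)."""
    return max(examples, key=lambda ex: cosine(vec, ex[0]))[1]


def condense(examples):
    """Hart-style condensing (an assumption, not the paper's algorithm).

    Returns the sorted indices of examples to keep. An example is added
    only when the currently kept subset misclassifies it; all others are
    treated as redundant.
    """
    kept = [0]
    changed = True
    while changed:
        changed = False
        for i, (vec, label) in enumerate(examples):
            if i in kept:
                continue
            subset = [examples[j] for j in kept]
            if nearest_label(vec, subset) != label:
                kept.append(i)
                changed = True
    return sorted(kept)


# Toy corpus (made up for illustration).
corpus = [
    ("stock market shares rise", "finance"),
    ("stock market shares fall", "finance"),
    ("football match final score", "sport"),
    ("football match final result", "sport"),
]
train = [(vectorize(text), label) for text, label in corpus]
kept = condense(train)
print(kept)  # [0, 2]
```

On this toy corpus one example per category survives, and the reduced set still assigns every original training example its correct label. A real corpus would need proper term weighting (for example TF-IDF) and k greater than 1, neither of which this sketch attempts.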
[1] Yiming Yang, et al. A re-examination of text categorization methods, 1999, SIGIR '99.
[2] Yiming Yang, et al. A scalability analysis of classifiers in text categorization, 2003, SIGIR.
[3] Wai Lam, et al. Using a generalized instance set for automatic text categorization, 1998, SIGIR '98.
[4] William John Teahan, et al. A repetition based measure for verification of text collections and for text categorization, 2003, SIGIR.