Paper information - Effect of term distributions on centroid-based text categorization

Effect of term distributions on centroid-based text categorization

Most traditional text categorization approaches use term frequency (tf) and inverse document frequency (idf) to represent the importance of words and/or terms when classifying a text document. This paper describes an approach that applies term distributions, in addition to tf and idf, to improve the performance of centroid-based text categorization. Three types of term distributions are introduced: inter-class, intra-class and in-collection distributions. These distributions help increase classification accuracy by exploiting information about (1) term distribution among classes, (2) term distribution within a class and (3) term distribution in the whole collection of training data. In addition, this paper investigates how these term distributions contribute to the weight of each term in documents, e.g., whether a high term distribution of a word promotes or demotes the importance or classification power of that word. To this end, several centroid-based classifiers are constructed with different term weightings. Using various data sets, their performance is investigated and compared with a standard centroid-based classifier (TFIDF) and a centroid-based classifier modified with information gain. Moreover, we also compare them with two well-known methods: k-NN and naive Bayes. In addition to a unigram model of document representation, a bigram model is also explored. Finally, the effectiveness of term distributions in improving classification accuracy is explored with regard to the training set size and the number of classes.
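For orientation, the baseline that the abstract compares against, a standard TFIDF centroid-based classifier, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the idf variant `log(N / df)`, unit-length normalization, cosine similarity, the toy documents, and all function names (`tfidf_vectors`, `train_centroids`, `classify`) are choices made here for the example. The inter-class, intra-class and in-collection factors are not defined in the abstract, so they are deliberately left out.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Weight each tokenized document with tf * log(N / df) (assumed idf variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def normalize(vec):
    """Scale a sparse vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(docs, labels):
    """Build one unit-length centroid per class from normalized document vectors."""
    vectors, idf = tfidf_vectors(docs)
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in normalize(vec).items():
            sums[label][term] += weight
    centroids = {label: normalize(total) for label, total in sums.items()}
    return centroids, idf


def classify(doc, centroids, idf):
    """Return the class whose centroid has the highest cosine similarity to doc."""
    counts = Counter(doc)
    # Terms unseen in training get weight 0 (an assumption of this sketch).
    vec = normalize({t: c * idf.get(t, 0.0) for t, c in counts.items()})

    def score(label):
        centroid = centroids[label]
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=score)


# Toy training data, invented for illustration only.
train_docs = [
    ["ball", "goal", "team"],
    ["team", "match", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "price"],
]
train_labels = ["sports", "sports", "finance", "finance"]

centroids, idf = train_centroids(train_docs, train_labels)
print(classify(["goal", "team", "win"], centroids, idf))
print(classify(["price", "stock"], centroids, idf))
```

In a sketch like this, a term-distribution-based weighting would most naturally enter as an extra per-term factor multiplied into the tf-idf weight inside `tfidf_vectors`; whether a given factor should multiply or divide the weight is exactly the question the abstract says the paper investigates.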

Verayuth Lertnattee | Thanaruk Theeramunkong
