A Probabilistic Model for Text Categorization: Based on a Single Random Variable with Multiple Values

Text categorization is the classification of documents with respect to a set of predefined categories. In this paper, we propose a new probabilistic model for text categorization that is based on a Single random Variable with Multiple Values (SVMV). Compared to previous probabilistic models, our model has the following advantages: 1) it considers within-document term frequencies, 2) it considers term weighting for target documents, and 3) it is less affected by insufficient training cases. We verify our model's superiority over these earlier models on the task of categorizing news articles from the "Wall Street Journal".
