Using density estimation to improve text categorization

This paper explores the use of density estimation, a statistical technique, to improve the results of text categorization systems that label documents by computing similarities between documents and categories. In addition to potentially improving a system's overall accuracy, density estimation converts similarity scores to probabilities. These probabilities provide easily interpretable confidence measures for a system's predictions and could help to combine the results of various systems. We discuss the results of three complete experiments on three separate data sets, applying density estimation to the results of a TF*IDF/Rocchio system, and we compare these results to those of many competing approaches.
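To illustrate how similarity scores can be turned into probabilities, here is a minimal sketch. It is not the paper's implementation: the abstract does not say which estimator is used, so the sketch assumes a Gaussian kernel density estimate of the score distribution for each class, combined through Bayes' rule. The function names, the bandwidth, and the toy scores are all hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples`
    using Gaussian kernels of width `bandwidth` (an assumed choice)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def make_score_to_probability(pos_scores, neg_scores, bandwidth=0.05):
    """Build a mapping from a similarity score to P(category | score).

    pos_scores: scores of training documents that belong to the category.
    neg_scores: scores of training documents that do not.
    """
    p_pos = gaussian_kde(pos_scores, bandwidth)
    p_neg = gaussian_kde(neg_scores, bandwidth)
    total = len(pos_scores) + len(neg_scores)
    prior_pos = len(pos_scores) / total
    prior_neg = len(neg_scores) / total

    def probability(score):
        # Bayes' rule: posterior is proportional to density times prior.
        numerator = p_pos(score) * prior_pos
        denominator = numerator + p_neg(score) * prior_neg
        # Far from all training scores both densities underflow to zero;
        # fall back to the prior in that case.
        return numerator / denominator if denominator > 0 else prior_pos

    return probability

# Toy, made-up similarity scores for one category.
pos_scores = [0.60, 0.70, 0.80, 0.75]
neg_scores = [0.10, 0.20, 0.15, 0.30]
to_probability = make_score_to_probability(pos_scores, neg_scores)

high = to_probability(0.70)  # score typical of in-category documents
low = to_probability(0.20)   # score typical of out-of-category documents
far = to_probability(5.0)    # score far outside the training range
```

Under these assumptions, a score near the in-category training scores maps to a probability close to 1 and a score near the out-of-category scores maps to a probability close to 0. Because the outputs lie on a common probability scale, they are the kind of value that could be compared or combined across systems.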
