Representation Quality in Text Classification: An Introduction and Experiment

How text is represented strongly affects the performance of text classification (retrieval and categorization) systems. We describe how text classification systems operate, introduce a theoretical model of how text representation affects their performance, and explain how that performance is evaluated. We then present an experiment on improving text representation quality, analyze its results, and outline the directions they suggest for future research.
