Text Categorisation Using Document Profiling

This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation [1]. Our approach uses Lee's model as a pre-processing filter that generates a dense representation of a given text document (a document profile) and passes it on to an arbitrary standard propositional learning algorithm. As with standard feature selection for text classification, this drastically reduces the dimensionality of instances, which in turn greatly lowers the computational load on the subsequent learning algorithm. The filter itself is also very fast, since it is essentially a variant of Naive Bayes. We present several variations of the filter and evaluate them against the Reuters-21578 collection, showing performance comparable to previously published results on that collection at a lower computational cost.
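The abstract does not spell out how the profiles are built, so the following is only a minimal sketch of the described pipeline, under the assumption that a document profile is a vector of per-class, length-normalised Naive Bayes log-likelihood scores. The function names `train_filter` and `profile` and the toy data are illustrative, not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train_filter(docs, labels, alpha=1.0):
    """Estimate Laplace-smoothed log P(word | class) from tokenised documents."""
    vocab = set()
    counts = defaultdict(Counter)  # class label -> word frequency counts
    totals = Counter()             # class label -> total token count
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
        totals[label] += len(tokens)
        vocab.update(tokens)
    v = len(vocab)
    log_probs = {c: {w: math.log((counts[c][w] + alpha) / (totals[c] + alpha * v))
                     for w in vocab}
                 for c in counts}
    # Fallback log-probability for words never seen in training at all.
    unseen = {c: math.log(alpha / (totals[c] + alpha * v)) for c in counts}
    return log_probs, unseen

def profile(tokens, log_probs, unseen):
    """Map a tokenised document to one length-normalised score per class."""
    return [sum(log_probs[c].get(w, unseen[c]) for w in tokens) / max(len(tokens), 1)
            for c in sorted(log_probs)]

# Usage: profiles become low-dimensional instances for a downstream learner.
train = [(["wheat", "export", "tonnes"], "grain"),
         (["shares", "profit", "dividend"], "earnings")]
log_probs, unseen = train_filter([d for d, _ in train], [l for _, l in train])
print(profile(["wheat", "profit"], log_probs, unseen))
```

Under this assumption, each profile has one dimension per category rather than one per vocabulary word, which is the dimensionality reduction the abstract refers to; the resulting dense instances can then be handed to any off-the-shelf propositional learner.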

[1] Michael D. Lee. Fast Text Classification Using Sequential Sampling Processes, 2001, Australian Joint Conference on Artificial Intelligence.

[2] Susan T. Dumais et al. Inductive learning algorithms and representations for text categorization, 1998, CIKM '98.

[3] Dan Roth et al. Learning to Resolve Natural Language Ambiguities: A Unified Approach, 1998, AAAI/IAAI.

[4] Sholom M. Weiss et al. Automated learning of decision rules for text categorization, 1994, TOIS.

[5] David W. Aha et al. Instance-Based Learning Algorithms, 1991, Machine Learning.

[6] Yiming Yang et al. A Study of Approaches to Hypertext Categorization, 2002, Journal of Intelligent Information Systems.

[7] David D. Lewis. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, 1998, ECML.

[8] S. Sathiya Keerthi et al. Improvements to Platt's SMO Algorithm for SVM Classifier Design, 2001, Neural Computation.

[9] J. Ross Quinlan. C4.5: Programs for Machine Learning, 1992.

[10] Ian H. Witten et al. Text categorization using compression models, 2000, Data Compression Conference (DCC 2000).

[11] Yiming Yang et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[12] Pat Langley et al. Editorial: On Machine Learning, 1986, Machine Learning.

[13] P. Schönemann. On artificial intelligence, 1985, Behavioral and Brain Sciences.

[14] David R. Karger et al. Tackling the Poor Assumptions of Naive Bayes Text Classifiers, 2003, ICML.

[15] J. J. Rocchio. Relevance feedback in information retrieval, 1971.

[17] John C. Platt. Fast training of support vector machines using sequential minimal optimization, 1999, Advances in Kernel Methods.

[18] Andreas S. Weigend et al. A neural network approach to topic spotting, 1995.

[19] Yiming Yang et al. A Linear Least Squares Fit Mapping Method for Information Retrieval From Natural Language Texts, 1992, COLING.

[20] Aiko M. Hormann. Programs for Machine Learning. Part I, 1962, Inf. Control.

[21] Ian H. Witten et al. Data Mining: Practical Machine Learning Tools and Techniques, 1999.

[22] Gerard Salton. The SMART Retrieval System—Experiments in Automatic Document Processing, 1971.

[23] Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[24] Céline Rouveirol et al. Machine Learning: ECML-98, 1998, Lecture Notes in Computer Science.

[25] Stan Matwin et al. Text Classification Using WordNet Hypernyms, 1998, WordNet@ACL/COLING.

[26] Robert C. Holte. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, 1993, Machine Learning.

[27] Andrew McCallum et al. A comparison of event models for naive Bayes text classification, 1998, AAAI-98 Workshop on Learning for Text Categorization.

[28] Ian H. Witten et al. Weka: Practical machine learning tools and techniques with Java implementations, 1999.

[29] Pat Langley et al. Estimating Continuous Distributions in Bayesian Classifiers, 1995, UAI.