Evolutionary learning of document categories

This paper deals with a supervised learning method devoted to producing categorization models of text documents. The goal of the method is to use a suitable numerical measurement of example similarity to find centroids describing different categories of examples. The centroids are not abstract or statistical models, but rather consist of bits of examples. The centroid-learning method is based on a Genetic Algorithm for Texts (GAT). The categorization system using this genetic algorithm infers a model by applying the genetic algorithm to each set of preclassified documents belonging to a category. The models thus obtained are the category centroids that are used to predict the category of a test document. The experimental results validate the utility of this approach for classifying incoming documents.

[1]  Michael M. Richter,et al.  The Knowledge Contained in Similarity Measures , 1995 .

[2]  Yoram Singer,et al.  Context-sensitive learning methods for text categorization , 1996, SIGIR '96.

[3]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[4]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[5]  David A. Hull Improving text retrieval for the routing problem using latent semantic indexing , 1994, SIGIR '94.

[6]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[7]  Pedro M. Domingos,et al.  Learning to Match the Schemas of Data Sources: A Multistrategy Approach , 2003, Machine Learning.

[8]  David E. Johnson,et al.  Maximizing Text-Mining Performance , 1999 .

[9]  Susan T. Dumais,et al.  Inductive learning algorithms and representations for text categorization , 1998, CIKM '98.

[10]  Mario Lenz,et al.  Textual CBR , 1998, Case-Based Reasoning Technology.

[11]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[12]  D. E. Goldberg,et al.  Genetic Algorithms in Search, Optimization & Machine Learning , 1989 .

[13]  Vipin Kumar,et al.  Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification , 2001, PAKDD.

[14]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[15]  Padmini Srinivasan,et al.  Automatic Text Categorization Using Neural Networks , 1997 .

[16]  M. Dolores del Castillo,et al.  A multistrategy approach for digital text categorization from imbalanced documents , 2004, SKDD.

[17]  David D. Lewis,et al.  A comparison of two learning algorithms for text categorization , 1994 .

[18]  Thorsten Joachims,et al.  Text categorization with support vector machines , 1999 .

[19]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[20]  David D. Lewis,et al.  Heterogeneous Uncertainty Sampling for Supervised Learning , 1994, ICML.

[21]  Mark P. Sinka,et al.  A Large Benchmark Dataset for Web Document Clustering , 2002 .

[22]  Analía Amandi,et al.  PersonalSearcher: An Intelligent Agent for Searching Web Pages , 2000, IBERAMIA-SBIA.