A multistrategy approach for digital text categorization from imbalanced documents

The goal of the research described here is to develop a multistrategy classifier system for document categorization. The system automatically discovers classification patterns by applying several empirical learning methods to different representations of preclassified documents drawn from an imbalanced sample. The learners work in parallel: each learner carries out its own feature selection using evolutionary techniques and then builds a classification model. To classify a document, the system combines the learners' predictions, again by applying evolutionary techniques. The system relies on a modular, flexible architecture that makes no assumptions about the design of the learners or the number of learners available, and that guarantees independence from the thematic domain.
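The per-learner feature selection step can be illustrated with a minimal sketch of a genetic algorithm that evolves a bit mask over features. Everything in it is an assumption made for illustration and is not taken from the paper: the nearest-centroid learner, the balanced-accuracy fitness with a small penalty on subset size, the toy data, and all function and parameter names. The evolutionary combination of learners' predictions is not shown.

```python
import random

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall; less misleading than accuracy on imbalanced data."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, label in enumerate(y_true) if label == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def centroid_predict(X, y, mask):
    """Hypothetical toy learner: nearest class centroid, using only the
    features whose bit is set in mask."""
    feats = [j for j, bit in enumerate(mask) if bit]
    cents = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]

    def dist(x, cent):
        return sum((x[j] - m) ** 2 for j, m in zip(feats, cent))

    # sorted() makes tie-breaking between classes deterministic
    return [min(sorted(cents), key=lambda c: dist(x, cents[c])) for x in X]

def select_features(X, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Evolve a 0/1 feature mask (assumes at least two features)."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        # Assumed fitness: balanced accuracy minus a small size penalty
        acc = balanced_accuracy(y, centroid_predict(X, y, mask))
        return acc - 0.01 * sum(mask) / n

    # Seed the population with the all-features mask plus random masks
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best masks over unchanged
        while len(nxt) < pop_size:
            # Tournament selection of two parents
            a, b = (max(rng.sample(pop, 3), key=fitness) for _ in range(2))
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per feature
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fitness)
    return best, fitness(best)

# Toy imbalanced sample: 2 "sports" documents versus 8 "other" documents.
# Columns are term counts; only the first column separates the classes.
X = [[3, 1, 0, 2], [4, 0, 1, 2],
     [0, 1, 0, 2], [1, 0, 1, 2], [0, 1, 1, 2], [1, 1, 0, 2],
     [0, 0, 0, 2], [1, 0, 0, 2], [0, 1, 1, 2], [1, 1, 1, 2]]
y = ["sports"] * 2 + ["other"] * 8

best_mask, best_score = select_features(X, y)
print(best_mask, round(best_score, 3))
```

In a multistrategy setting, each learner would run a search of this kind over its own document representation, so different learners may end up with different feature subsets.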
