Integrating Lexical Knowledge in Learning-Based Text Categorization

Automatic Text Categorization (ATC) is an important task in the field of Information Access. The prevailing approach to ATC uses a collection of prelabeled texts to induce a document classifier through learning methods. The increasing availability of lexical resources in electronic form, including Lexical Databases (LDBs) and Machine Readable Dictionaries, creates an opportunity to integrate them into learning-based ATC. In this paper, we present an approach to integrating lexical knowledge extracted from the LDB WordNet into learning-based ATC. The method interprets the lexical knowledge extracted from the LDB as a classifier in its own right and combines it with a learning-based classifier through Stacked Generalization (SG). We have performed experiments whose results show that the ideas we describe are promising and deserve further investigation.
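The combination scheme can be illustrated with a minimal sketch. The abstract does not specify the level-0 learners, the level-1 combiner, or how WordNet is queried, so everything below is an assumption made for illustration: `FINANCE_TERMS` is a hand-written stand-in for terms that would be extracted from WordNet, the learned classifier is a smoothed log-odds term scorer, and the level-1 combiner is a small logistic regression trained on held-out level-0 scores. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stand-in for terms extracted from WordNet for a "finance"
# category; no WordNet API is called in this sketch.
FINANCE_TERMS = {"bank", "money", "loan", "credit"}


def lexical_score(tokens, category_terms):
    """Level-0 classifier A: fraction of tokens found in the lexical term set."""
    if not tokens:
        return 0.0
    return sum(t in category_terms for t in tokens) / len(tokens)


def train_log_odds(docs, labels):
    """Level-0 classifier B: add-one smoothed per-term log-odds from labeled docs."""
    pos, neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos if y else neg).update(tokens)
    vocab = set(pos) | set(neg)
    n_pos, n_neg, v = sum(pos.values()), sum(neg.values()), len(vocab)
    return {
        t: math.log((pos[t] + 1) / (n_pos + v)) - math.log((neg[t] + 1) / (n_neg + v))
        for t in vocab
    }


def learned_score(tokens, weights):
    """Average log-odds of the tokens; unseen terms contribute zero."""
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)


def held_out_features(docs, labels, category_terms, folds=2):
    """Score each document with a classifier B that was not trained on it."""
    feats = [None] * len(docs)
    for k in range(folds):
        train = [i for i in range(len(docs)) if i % folds != k]
        weights = train_log_odds([docs[i] for i in train], [labels[i] for i in train])
        for i in range(k, len(docs), folds):
            feats[i] = (lexical_score(docs[i], category_terms),
                        learned_score(docs[i], weights))
    return feats


def train_level1(feats, labels, epochs=500, lr=0.5):
    """Level-1 combiner: logistic regression over the two level-0 scores (SGD)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
            g = p - y
            w[0] -= lr * g * x1
            w[1] -= lr * g * x2
            b -= lr * g
    return w, b


def fit_stack(docs, labels, category_terms):
    """Train the combiner on held-out scores, then refit classifier B on all data."""
    w, b = train_level1(held_out_features(docs, labels, category_terms), labels)
    return {"terms": category_terms, "weights": train_log_odds(docs, labels),
            "w": w, "b": b}


def classify(tokens, model):
    """Return True if the stacked model assigns the document to the category."""
    z = (model["w"][0] * lexical_score(tokens, model["terms"])
         + model["w"][1] * learned_score(tokens, model["weights"])
         + model["b"])
    return z > 0


# Toy training set (invented for illustration); 1 = finance, 0 = other.
docs = [
    ["bank", "loan", "rates"], ["money", "market", "credit"],
    ["football", "match", "goal"], ["team", "goal", "season"],
    ["loan", "credit", "bank"], ["stock", "money", "bank"],
    ["match", "season", "coach"], ["coach", "team", "football"],
]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
model = fit_stack(docs, labels, FINANCE_TERMS)
print(classify(["bank", "credit", "rates"], model))
print(classify(["goal", "coach", "match"], model))
```

Training the combiner on held-out scores, rather than on scores the learned classifier produced for its own training documents, is the defining step of stacked generalization: it lets the level-1 model estimate how far each level-0 classifier can be trusted on unseen text.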
