The Influence of Semantics in Text Categorisation: A Comparative Study using the k Nearest Neighbours Method

In this paper we investigate different uses of semantics in text categorisation tasks. To this end, we consider distinct representations of documents which differ in the kind of information incorporated: a) information about terms only, b) semantic information (term senses) and c) a combination of both types of information. Moreover, we study how reducing the vocabulary size affects this task. The k Nearest Neighbours method was used to perform the categorisation, and the vocabulary size was reduced by means of the Information Gain technique. A number of different document codifications were tested. The experimental results showed that, in syntactically and semantically richer corpora, the inclusion of semantic information improves text categorisation provided that vocabularies with a sufficient number of features are considered.
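The pipeline described above (Information Gain to shrink the vocabulary, then k Nearest Neighbours to categorise) can be sketched as follows. This is a minimal illustration and not the implementation used in the paper: the toy documents, the function names, the cosine similarity on raw term counts and the similarity-weighted vote are all assumptions made for the example. Documents are lists of tokens, so a sense-based or combined representation could be obtained simply by using (or appending) sense identifiers as tokens.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of category frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def information_gain(docs, labels, term):
    """Information gain of the presence/absence of `term` for the categories."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    conditional = 0.0
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            conditional += size / n * entropy(part)
    return entropy(Counter(labels)) - conditional


def select_vocabulary(docs, labels, size):
    """Keep the `size` features with the highest information gain."""
    terms = sorted({t for d in docs for t in d})
    ranked = sorted(terms, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:size])


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def knn_classify(train_docs, train_labels, query, vocabulary, k=3):
    """Assign the category with the largest similarity-weighted vote
    among the k training documents most similar to `query`."""
    def vectorise(doc):
        return Counter(t for t in doc if t in vocabulary)

    q = vectorise(query)
    scored = sorted(((cosine(q, vectorise(d)), l)
                     for d, l in zip(train_docs, train_labels)),
                    key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy corpus (hypothetical data, for illustration only).
train_docs = [
    ["bank", "loan", "interest"],
    ["bank", "money", "loan"],
    ["river", "bank", "water"],
    ["river", "water", "fish"],
]
train_labels = ["finance", "finance", "nature", "nature"]

vocabulary = select_vocabulary(train_docs, train_labels, size=3)
predicted = knn_classify(train_docs, train_labels, ["loan", "money"], vocabulary)
print(sorted(vocabulary), predicted)
```

In this toy corpus the ambiguous term "bank" has low information gain and is dropped by the reduction step, which hints at why sense information may matter: a sense-tagged token would separate the two uses instead of discarding the word.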
