NLP-driven IR: Evaluating Performances over a Text Classification task

Although several attempts have been made to introduce Natural Language Processing (NLP) techniques into Information Retrieval (IR), most have failed to demonstrate improved performance. In this paper, Text Classification (TC) is taken as the IR task, and the effect of the linguistic capabilities of the underlying system is studied. A novel model for TC, which extends a well-known statistical model (i.e. Rocchio's formula [Ittner et al., 1995]) and is applied to linguistic features, has been defined and tested experimentally. The proposed model provides an effective feature selection methodology. All the experiments show a significant improvement over other purely statistical methods (e.g. [Yang, 1999]), thus stressing the relevance of the available linguistic information. Moreover, the derived classifier reaches the performance (about 85%) of the best known models (i.e. Support Vector Machines (SVM) and k-Nearest Neighbour (KNN)), which are characterized by a higher computational complexity for training and processing.
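For orientation, the standard Rocchio profile that the model is said to extend can be sketched as follows. This is a minimal illustration of the textbook formula only, not the extended model proposed in the paper. The function names, the dictionary-based document representation, and the default values of `beta` and `gamma` are assumptions made for the example. Features whose profile weight is not positive are dropped, which is one way such a formula can act as a feature selector.

```python
from collections import defaultdict


def rocchio_profile(docs, labels, beta=16.0, gamma=4.0):
    """Build a category profile with the standard Rocchio formula.

    docs   -- list of dicts mapping feature -> weight (e.g. tf-idf)
    labels -- list of bools, True if the document belongs to the category
    beta, gamma -- illustrative defaults (assumed, not taken from the paper)
    """
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    profile = defaultdict(float)
    # Reward features by their average weight in positive documents.
    for d in pos:
        for f, w in d.items():
            profile[f] += beta * w / len(pos)
    # Penalize features by their average weight in negative documents.
    for d in neg:
        for f, w in d.items():
            profile[f] -= gamma * w / len(neg)
    # Keep only features with a positive weight (clipping at zero).
    return {f: w for f, w in profile.items() if w > 0.0}


def score(profile, doc):
    """Dot product between a category profile and a document vector."""
    return sum(w * doc.get(f, 0.0) for f, w in profile.items())


# Toy data (hypothetical): two documents in the category, one outside it.
docs = [
    {"wheat": 1.0, "price": 1.0},
    {"wheat": 1.0},
    {"price": 1.0, "bank": 1.0},
]
labels = [True, True, False]
profile = rocchio_profile(docs, labels)
```

In this toy example, "bank" occurs only in the negative document, so it receives a negative weight and is removed from the profile. A document is then assigned to the category when its `score` exceeds a threshold chosen on held-out data.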

[1]  Gregory Grefenstette.  Short Query Linguistic Expansion Techniques: Palliating One-Word Queries by Providing Intermediate Structure to Text , 1997, SCIE.

[2]  David D. Lewis,et al.  Text categorization of low quality images , 1995 .

[3]  Sholom M. Weiss,et al.  Automated learning of decision rules for text categorization , 1994, TOIS.

[4]  G. Salton,et al.  Developments in Automatic Text Retrieval , 1991, Science.

[5]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[6]  Gerard Salton,et al.  Term-weighting approaches in automatic text retrieval , 1988 .

[7]  James P. Callan,et al.  Training algorithms for linear text classifiers , 1996, SIGIR '96.

[8]  Roberto Basili,et al.  Language sensitive text classification , 2000, RIAO.

[9]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[10]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[11]  Roberto Basili,et al.  An Adaptive and Distributed Framework for Advanced IR , 2000, RIAO.

[12]  Béatrice Daille,et al.  Study and Implementation of Combined Techniques for Automatic Extraction of Terminology , 1994 .

[13]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[14]  Hwee Tou Ng,et al.  Feature selection, perceptron learning, and a usability case study for text categorization , 1997 .

[15]  Yoram Singer,et al.  Context-sensitive learning methods for text categorization , 1996, SIGIR '96.

[16]  Roberto Basili,et al.  Inducing Terminology for Lexical Acquisition , 1997, EMNLP.