A comparison of classifiers and document representations for the routing problem

In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques whose decision rules are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks. Error minimization is difficult in high-dimensional feature spaces because convergence is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no built-in protection against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
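The combination of feature reduction and an error-minimizing classifier described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes NumPy, a made-up six-document toy matrix, latent semantic indexing computed as a plain truncated SVD, and logistic regression fitted by ordinary batch gradient descent. The names `lsi_project`, `fit_logistic`, and `score` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def lsi_project(doc_term, k):
    """Project documents (rows) onto the top-k right singular vectors."""
    # doc_term has shape (n_docs, n_terms); vt has shape (min(n_docs, n_terms), n_terms)
    _, _, vt = np.linalg.svd(doc_term, full_matrices=False)
    basis = vt[:k].T                      # (n_terms, k) LSI basis
    return doc_term @ basis, basis

def fit_logistic(features, labels, lr=0.5, steps=2000):
    """Fit logistic regression by batch gradient descent on the mean log loss."""
    n_docs, n_feats = features.shape
    w = np.zeros(n_feats)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        w -= lr * (features.T @ (p - labels) / n_docs)
        b -= lr * np.mean(p - labels)
    return w, b

def score(features, w, b):
    """Estimated probability of relevance for each document."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

# Toy data (an assumption for illustration): 6 documents, 5 terms.
# The first three documents are labeled relevant to the routing query.
doc_term = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [2, 2, 0, 0, 1],
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 1],
    [0, 0, 2, 1, 2],
], dtype=float)
labels = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Reduce 5 term features to 2 LSI features, then train the classifier.
reduced, basis = lsi_project(doc_term, k=2)
w, b = fit_logistic(reduced, labels)
probs = score(reduced, w, b)

# Routing: rank documents by estimated probability of relevance.
ranking = np.argsort(-probs)
print(ranking)
```

A new document would be routed by projecting its term vector with the same `basis` and passing the result to `score`. Fitting only two weights and a bias instead of one weight per term is the sense in which the reduced representation limits overfitting for a classifier with no other safeguard.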
