Lower dimensional representation of text data in vector space based information retrieval

Dimension reduction is essential in today's vector space based information retrieval systems for improving computational efficiency in handling massive data. In this paper, we propose a mathematical framework for lower dimensional representation of text data in vector space based information retrieval, using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. We then propose a new approach that is more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods over LSI/SVD are discussed in terms of computational efficiency and data representation in the reduced dimensional space. Experimental results are presented to illustrate the effectiveness of our approach in certain classification problems in the reduced dimensional space. These results were computed using an information retrieval test system we are now developing. The results indicate that for a successful lower dimensional representation of data, it is important to incorporate a priori knowledge of the data in dimension reduction.
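To make the idea of a lower dimensional representation concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes a small made-up term-document matrix `A` and made-up cluster labels. `lsi_reduce` computes the standard rank-k truncated SVD used in LSI. `centroid_reduce` is a hypothetical illustration of one simple way to exploit known cluster structure (cluster centroids as a basis, with coordinates found by least squares); the abstract does not specify the proposed method, so this choice is an assumption.

```python
import numpy as np

def lsi_reduce(A, k):
    """Rank-k LSI via truncated SVD.

    Returns an m x k orthonormal basis B and the k x n reduced documents Y,
    so that B @ Y is the best rank-k approximation of A in the Frobenius norm.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = U[:, :k]
    Y = s[:k, None] * Vt[:k, :]  # scale each retained right singular vector
    return B, Y

def centroid_reduce(A, labels):
    """Hypothetical cluster-based reduction (an assumption, not the paper's method).

    Uses one centroid per cluster as a basis vector and solves
    min_Y ||A - C Y||_F by least squares.
    """
    clusters = np.unique(labels)
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in clusters])
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y

# Toy data (made up): 5 terms x 6 documents, two known clusters.
A = np.array([
    [2., 1., 2., 0., 0., 0.],
    [1., 2., 1., 0., 0., 1.],
    [0., 0., 1., 1., 0., 0.],
    [0., 0., 0., 2., 1., 2.],
    [0., 1., 0., 1., 2., 1.],
])
labels = np.array([0, 0, 0, 1, 1, 1])

B, Y_lsi = lsi_reduce(A, k=2)
C, Y_cen = centroid_reduce(A, labels)

# Frobenius-norm approximation errors of the two 2-dimensional representations.
err_lsi = np.linalg.norm(A - B @ Y_lsi)
err_cen = np.linalg.norm(A - C @ Y_cen)
```

In both cases each document is represented by a 2-dimensional column of `Y` instead of a 5-dimensional column of `A`. The truncated SVD always gives the smaller approximation error for a given rank, but the centroid variant avoids computing an SVD and ties the reduced coordinates directly to the known clusters, which is the kind of trade-off the abstract alludes to.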
