Divide and Conquer Strategies for Effective Information Retrieval

The standard application of Latent Semantic Indexing (LSI), a well-known technique for information retrieval, requires the computation of a partial Singular Value Decomposition (SVD) of the term-document matrix. This computation is infeasible for large document collections, since it is very demanding in both arithmetic operations and memory. This paper discusses two divide-and-conquer strategies applied to LSI, with the goal of alleviating these difficulties. These strategies process a data set by dividing it into subsets and combining the LSI results obtained on each subset. Since each sub-problem is smaller than the original, processing large-scale document collections requires far fewer resources. In addition, the sub-problems can be processed independently, so the computation is easily adapted to a parallel computing environment. To reduce the computational cost of the LSI analysis of the subsets, we employ an approximation technique based on the Lanczos algorithm. This technique is far more efficient than the truncated SVD, while its accuracy is comparable. Experimental results confirm that the proposed divide-and-conquer strategies are effective for information retrieval problems.
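The general idea can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the two strategies, so the sketch assumes the simplest possible division (contiguous blocks of document columns) and simply concatenates the per-block scores. It also uses a dense truncated SVD from NumPy in place of the Lanczos-based approximation. The function names `lsi_scores` and `divide_and_conquer_lsi` and the random test matrix are hypothetical.

```python
import numpy as np

def lsi_scores(A, q, k):
    """Cosine similarity between query q and each column of the
    rank-k approximation of the term-document block A."""
    # Assumption: dense SVD stands in for the Lanczos-based approximation.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k approximation of A
    num = q @ Ak
    den = np.linalg.norm(Ak, axis=0) * np.linalg.norm(q)
    return num / np.where(den == 0, 1.0, den)  # avoid division by zero

def divide_and_conquer_lsi(A, q, k, n_parts):
    """Split the documents (columns) into n_parts contiguous blocks,
    run LSI on each block independently, and gather the scores."""
    n_docs = A.shape[1]
    scores = np.empty(n_docs)
    # Assumption: contiguous blocks; a clustering step could be used instead.
    for idx in np.array_split(np.arange(n_docs), n_parts):
        scores[idx] = lsi_scores(A[:, idx], q, k)  # independent sub-problem
    return scores

# Hypothetical data: 50 terms, 12 documents, query equal to document 3.
rng = np.random.default_rng(0)
A = rng.random((50, 12))
q = A[:, 3].copy()
scores = divide_and_conquer_lsi(A, q, k=2, n_parts=3)
ranking = np.argsort(-scores)  # documents ordered by decreasing similarity
```

Each iteration of the loop touches only one block of columns, which is why the sub-problems are cheaper than a single SVD of the full matrix and can be distributed across workers.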
