Matrix computations for query expansion in information retrieval

Information retrieval (IR) is the task of identifying the information items (documents) in a collection that are relevant to a user query. The most popular technique in IR research, the vector space model (VSM), is a word-matching approach: it determines the similarity between a query and a document from the words they have in common. Because the same concept can be expressed with very different vocabularies in the query and the document, synonymy can cause VSM to judge a relevant document as irrelevant. Query-expansion methods address this problem by automatically adding to the query words that are related to those already in it. Latent semantic indexing (LSI) is a statistical method that derives term associations through a reduced-dimensional singular value decomposition (SVD) of a matrix formed from the collection. LSI has equaled or outperformed VSM on many relatively small retrieval collections. With the growing size of modern information repositories, however, LSI has failed to demonstrate an advantage over traditional word-matching methods on some large corpora. In this work, I provide evidence that LSI is not reaching its potential on large collections because existing SVD implementations cannot compute a sufficiently large number of dimensions. I establish a unified framework for vector-based information retrieval called dimension equalization. Within this framework, I present approximate dimension equalization (ADE), a method that “extrapolates” the result of a high-dimensional SVD from a relatively small number of computed dimensions. Experiments indicate that ADE improves retrieval performance over LSI and has great utility in cross-language applications.

I also investigate sampling approaches to reducing LSI computation, which use only a subset of the document collection to build LSI term associations. My focus is on local LSI, a variation of the local feedback approaches that are popular in the IR community. This method computes an SVD on a subset of documents that are related to the query. Experiments show that local LSI outperforms not only the global sampling methods but also the baseline VSM and LSI without sampling. I extend the existing local LSI approach to cross-language retrieval and present its high-quality results.
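
The contrast between word matching, LSI, and local LSI can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the five-term toy collection, the function names, the choice of cosine similarity in the reduced space, and the rule of picking the local subset as the top VSM-ranked documents are all assumptions made for illustration.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix, eps=1e-12):
    """Cosine similarity between query_vec and every column of doc_matrix."""
    dots = query_vec @ doc_matrix
    denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_matrix, axis=0)
    scores = np.zeros(doc_matrix.shape[1])
    ok = denom > eps  # (near-)zero-length vectors keep a score of 0
    scores[ok] = dots[ok] / denom[ok]
    return scores

def lsi_scores(basis_docs, query_vec, all_docs, k):
    """Score all_docs in the top-k left singular subspace of basis_docs."""
    U, _, _ = np.linalg.svd(basis_docs, full_matrices=False)
    U_k = U[:, :k]  # term-space basis of the k retained dimensions
    return cosine_scores(U_k.T @ query_vec, U_k.T @ all_docs)

def local_lsi_scores(term_doc, query_vec, k, n_local):
    """Build the SVD only from the n_local documents ranked highest by VSM.

    Selecting the local subset by VSM rank is an assumption of this sketch.
    """
    vsm = cosine_scores(query_vec, term_doc)
    local = np.argsort(-vsm)[:n_local]
    return lsi_scores(term_doc[:, local], query_vec, term_doc, k)

# Hypothetical toy collection. Rows are terms, columns are documents.
# terms: car, automobile, engine, flower, petal
term_doc = np.array([
    [1, 0, 0, 1],   # car
    [0, 1, 0, 1],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 0],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)
query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single word "car"

vsm = cosine_scores(query, term_doc)
lsi = lsi_scores(term_doc, query, term_doc, k=2)
local = local_lsi_scores(term_doc, query, k=1, n_local=2)

print("VSM      :", np.round(vsm, 3))
print("LSI      :", np.round(lsi, 3))
print("local LSI:", np.round(local, 3))
```

On this toy collection, document 1 ("automobile engine") shares no word with the query, so VSM gives it a score of zero. Both LSI variants score it close to the documents that contain "car", and the unrelated document 2 stays at zero. Real collections are far larger and sparser, so a sparse truncated SVD routine would replace the dense `np.linalg.svd` call.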
