On the Enhancements of a Sparse Matrix Information Retrieval Approach

A novel approach to information retrieval is proposed and evaluated. By representing an inverted index as a sparse matrix, matrix-vector multiplication algorithms can be used to query the index. Because many parallel sparse matrix multiplication algorithms exist, this approach lends itself to parallelism. This lets us attack the problem of parallel information retrieval, for which good scalability has been difficult to achieve. We evaluate the proposed approach on several document collections from the commonly used NIST TREC corpus. Our results indicate that the approach reduces the total storage requirements for the inverted index by approximately 30%. Additionally, to improve accuracy, we develop a novel matrix-based relevance feedback technique as well as a proximity search algorithm.
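As an illustration of the central idea, the sketch below stores a term-by-document matrix in compressed sparse row (CSR) form, so that each row is the postings list of one term, and scores a query by multiplying the transposed matrix with a sparse query vector. This is a minimal sketch under stated assumptions, not the implementation evaluated here: the CSR layout, the raw term-frequency weights, and the names `build_term_doc_csr` and `score_query` are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_term_doc_csr(docs):
    """Build a term-by-document matrix in CSR form (assumed layout).

    Row i holds the postings list of term i: the column indices are
    document ids and the values are raw term frequencies.
    """
    # Collect postings: term -> {doc_id: term frequency}
    postings = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            postings[tok][doc_id] += 1

    vocab = {}                       # term -> row index
    values, col_idx, row_ptr = [], [], [0]
    for term in sorted(postings):
        vocab[term] = len(vocab)
        for doc_id in sorted(postings[term]):
            col_idx.append(doc_id)
            values.append(postings[term][doc_id])
        row_ptr.append(len(values))  # end of this term's row
    return vocab, values, col_idx, row_ptr


def score_query(query_tokens, n_docs, vocab, values, col_idx, row_ptr):
    """Compute scores = A^T q, touching only the rows of query terms."""
    # Sparse query vector: row index -> query term frequency
    q = defaultdict(int)
    for tok in query_tokens:
        if tok in vocab:
            q[vocab[tok]] += 1

    scores = [0] * n_docs
    for row, weight in q.items():
        # Walk the nonzeros of this row (the term's postings list)
        for k in range(row_ptr[row], row_ptr[row + 1]):
            scores[col_idx[k]] += weight * values[k]
    return scores


# Toy collection of three tokenized documents (made up for the example)
docs = [
    ["sparse", "matrix", "retrieval"],
    ["parallel", "retrieval", "retrieval"],
    ["matrix", "multiplication"],
]
vocab, values, col_idx, row_ptr = build_term_doc_csr(docs)
scores = score_query(["matrix", "retrieval"], len(docs),
                     vocab, values, col_idx, row_ptr)
print(scores)
```

Because each query term contributes an independent pass over one row, the rows (or blocks of documents) could be distributed across processors and the partial score vectors summed, which is the property that makes the matrix formulation attractive for parallel retrieval.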
