Performance of query processing implementations in ranking-based text retrieval systems using inverted indices

Similarity calculations and document ranking are the computationally expensive parts of query processing in ranking-based text retrieval. This work presents 11 alternative implementation techniques for these calculations, grouped into four categories, and investigates their asymptotic time and space complexities. To our knowledge, six of these techniques have not been discussed in any previous publication. Furthermore, analytical experiments are carried out on a 30 GB document collection to evaluate the practical performance of the implementations in terms of query processing time and space consumption. The advantages and disadvantages of each technique are illustrated under different querying scenarios, and several experiments investigating the scalability of the implementations are presented.
