Top-k Ranked Document Search in General Text Databases

Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.

[1]  S. Muthukrishnan,et al.  Efficient algorithms for document retrieval problems , 2002, SODA '02.

[2]  Stephen E. Robertson,et al.  Okapi at TREC-3 , 1994, TREC.

[3]  Stephen E. Robertson,et al.  GatfordCentre for Interactive Systems ResearchDepartment of Information , 1996 .

[4]  Michael A. Bender,et al.  The LCA Problem Revisited , 2000, LATIN.

[5]  Roberto Grossi,et al.  High-order entropy-compressed text indexes , 2003, SODA '03.

[6]  Ian H. Witten,et al.  Managing gigabytes , 1994 .

[7]  Veli Mäkinen,et al.  Space-Efficient Algorithms for Document Retrieval , 2007, CPM.

[8]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[9]  Kunihiko Sadakane,et al.  Succinct data structures for flexible text retrieval systems , 2007, J. Discrete Algorithms.

[10]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[11]  Justin Zobel,et al.  Filtered Document Retrieval with Frequency-Sorted Indexes , 1996, J. Am. Soc. Inf. Sci..

[12]  Ian H. Witten,et al.  Managing gigabytes (2nd ed.): compressing and indexing documents and images , 1999 .

[13]  Ron Sacks-Davis,et al.  Filtered document retrieval with frequency-sorted indexes , 1996 .

[14]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.

[15]  Simon J. Puglisi,et al.  Range Quantile Queries: Another Virtue of Wavelet Trees , 2009, SPIRE.

[16]  Rajeev Raman,et al.  Succinct indexable dictionaries with applications to encoding k-ary trees and multisets , 2002, SODA '02.

[17]  Gonzalo Navarro,et al.  Compressed representations of sequences and full-text indexes , 2007, TALG.

[18]  Alistair Moffat,et al.  Pruned query evaluation using pre-computed impacts , 2006, SIGIR.

[19]  Volker Heun,et al.  A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array , 2007, ESCAPE.

[20]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci..

[21]  Tobias Bjerregaard,et al.  A survey of research and practices of Network-on-chip , 2006, CSUR.

[22]  William F. Smyth,et al.  Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory , 2006, SPIRE.

[23]  Wing-Kai Hon,et al.  Space-Efficient Framework for Top-k String Retrieval Problems , 2009, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[24]  JUSTIN ZOBEL,et al.  Inverted files for text search engines , 2006, CSUR.

[25]  Alberto Apostolico,et al.  The Myriad Virtues of Subword Trees , 1985 .

[26]  Giovanni Manzini,et al.  An analysis of the Burrows-Wheeler transform , 2001, SODA '99.