Efficient index for retrieving top-k most frequent documents

In the document retrieval problem (Muthukrishnan, 2002), we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension to the above document retrieval problem. We call this top-k frequent document retrieval, where instead of listing all documents containing P, our focus is to identify the top-k documents having most occurrences of P. This problem forms a basis for search engine tasks of retrieving documents ranked with TFIDF (Term Frequency-Inverse Document Frequency) metric. A related problem was studied by Muthukrishnan (2002) where the emphasis was on retrieving all the documents whose number of occurrences of the pattern P exceeds some frequency threshold f. However, from the information retrieval point of view, it is hard for a user to specify such a threshold value f and have a sense of how many documents will be reported as the output. We develop some additional building blocks which help the user overcome this limitation. These are used to derive an efficient index for top-k frequent document retrieval problem, answering queries in O(|P|+logDloglogD+k) time and taking O(DlogD) space. Our approach is based on a new use of the suffix tree called induced generalized suffix tree (IGST). The practicality of the proposed index is validated by the experimental results.

[1]  Lucas Chi Kwong Hui,et al.  Color Set Size Problem with Application to String Matching , 1992, CPM.

[2]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[3]  Veli Mäkinen,et al.  Space-Efficient Algorithms for Document Retrieval , 2007, CPM.

[4]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[5]  Campbell B. Read,et al.  Zipf's Law , 2004 .

[6]  Dan E. Willard,et al.  Log-logarithmic worst-case range queries are possible in space ⊕(N) , 1983 .

[7]  Richard M. Karp,et al.  Efficient Randomized Pattern-Matching Algorithms , 1987, IBM J. Res. Dev..

[8]  Kunihiko Sadakane,et al.  Succinct representations of lcp information and improvements in the compressed suffix arrays , 2002, SODA '02.

[9]  Donald E. Knuth,et al.  Fast Pattern Matching in Strings , 1977, SIAM J. Comput..

[10]  Kunihiko Sadakane,et al.  Compressed Suffix Trees with Full Functionality , 2007, Theory of Computing Systems.

[11]  Michael A. Bender,et al.  The LCA Problem Revisited , 2000, LATIN.

[12]  Yossi Matias,et al.  Augmenting Suffix Trees, with Applications , 1998, ESA.

[13]  Roberto Grossi,et al.  Rank-Sensitive Data Structures , 2005, SPIRE.

[14]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[15]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[16]  Ian H. Witten,et al.  Managing gigabytes (2nd ed.): compressing and indexing documents and images , 1999 .

[17]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, CACM.

[18]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[19]  S. Muthukrishnan,et al.  Efficient algorithms for document retrieval problems , 2002, SODA '02.

[20]  Jeffrey Scott Vitter,et al.  Ordered Pattern Matching: Towards Full-Text Retrieval , 2006 .

[21]  Gonzalo Navarro,et al.  Position-Restricted Substring Searching , 2006, LATIN.