Durable top-k search in document archives

We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects, each of which has different valid instances over its history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval, and such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from NRA, the classic top-k rank aggregation algorithm. The second is based on a shared execution paradigm and is more efficient than the first. In addition, we propose a special indexing technique for archived data; coupled with a space partitioning technique, the index improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
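
To make the problem definition concrete, the following is a minimal brute-force sketch of a durable top-k query. It is not either of the two techniques proposed above (it uses neither NRA nor shared execution, and no index). It assumes that the time interval is given as a list of discrete snapshots, that query scores are already computed per object and snapshot, and that "consistently" means membership in the top-k at every snapshot. The names `Snapshot`, `top_k`, and `durable_top_k` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Set

# Assumption: each snapshot maps an object id to its query score at one
# discrete time step; only objects with a valid version at that step appear.
Snapshot = Dict[str, float]


def top_k(snapshot: Snapshot, k: int) -> Set[str]:
    """Return the ids of the k highest-scoring objects in one snapshot.

    Ties are broken by object id so the result is deterministic.
    """
    ranked = sorted(snapshot.items(), key=lambda item: (-item[1], item[0]))
    return {obj for obj, _ in ranked[:k]}


def durable_top_k(snapshots: List[Snapshot], k: int) -> Set[str]:
    """Return the objects that are in the top-k at every snapshot."""
    if not snapshots:
        return set()
    durable = top_k(snapshots[0], k)
    for snapshot in snapshots[1:]:
        durable &= top_k(snapshot, k)
        if not durable:
            break  # no object can become durable again once the set is empty
    return durable


if __name__ == "__main__":
    history = [
        {"a": 0.9, "b": 0.8, "c": 0.1},
        {"a": 0.7, "b": 0.2, "c": 0.6},
        {"a": 0.8, "b": 0.4, "c": 0.5},
    ]
    print(sorted(durable_top_k(history, k=2)))  # prints ['a']
```

This baseline recomputes a full top-k ranking at every snapshot, which is the per-time-step cost that a more efficient approach would aim to avoid.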
