Durable top-k search in document archives

We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval. Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
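To make the query semantics concrete, the following is a minimal brute-force sketch of a durable top-k query. It is not either of the two techniques the abstract proposes (the NRA adaptation or the shared execution approach). It assumes the time interval has already been discretized into a list of snapshots, each mapping an object id to its query score at that time. The names `top_k`, `durable_top_k`, and `example_snapshots` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the ids of the k highest-scoring objects in one snapshot.

    Ties are broken by object id so the result is deterministic
    (an assumption of this sketch, not something the abstract specifies).
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {obj for obj, _ in ranked[:k]}


def durable_top_k(snapshots: List[Dict[str, float]], k: int) -> Set[str]:
    """Return the objects that are in the top-k of every snapshot."""
    if not snapshots:
        return set()
    durable = top_k(snapshots[0], k)
    for scores in snapshots[1:]:
        durable &= top_k(scores, k)  # keep only objects that stay in the top-k
        if not durable:
            break  # no object can become durable again once the set is empty
    return durable


# Three snapshots of query scores for documents a, b, and c.
example_snapshots = [
    {"a": 0.9, "b": 0.8, "c": 0.3},  # top-2: a, b
    {"a": 0.7, "b": 0.2, "c": 0.6},  # top-2: a, c
    {"a": 0.8, "b": 0.6, "c": 0.7},  # top-2: a, c
]
print(durable_top_k(example_snapshots, k=2))
```

This baseline recomputes a full top-k ranking for every snapshot, which is the kind of repeated work that sharing computation across time points would be expected to reduce.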

Nikos Mamoulis | Srikanta J. Bedathur | Klaus Berberich | Leong Hou U
