Optimizing positional index structures for versioned document collections

Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, and source code and documents maintained in revision control systems. Versioned document collections can become very large because past versions must be retained, but the versions also share substantial redundancy that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques, and a number of researchers have recently studied how to exploit this redundancy to obtain more succinct full-text index structures. In this paper, we study index organization and compression techniques for such versioned full-text index structures. In particular, we focus on positional index structures, whereas most previous work has focused on the non-positional case. Building on earlier work in [zs:redun], we propose a framework for indexing and querying in versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing. Within this framework, we define and study the problem of minimizing positional index size through optimal substring partitioning. Experiments on Wikipedia and Web archive data show that our techniques achieve significant reductions in index size over previous work while supporting very fast query processing.
