Temporal index sharding for space-time efficiency in archive search

Time-travel queries, which couple temporal constraints with keyword queries, are useful for searching large-scale archives of time-evolving content such as web archives or wikis. Typical approaches to evaluating these queries efficiently slice either the entire collection [20] or individual index lists [10] along the time axis. Neither method is satisfactory: each trades index compactness against processing efficiency, leaving the index either too big or too slow. We present a novel index organization scheme that shards each index list with almost zero increase in index size while still minimizing the cost of reading index entries during query processing. Based on the optimal sharding thus obtained, we develop a practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm merges shards from the optimal solution, accepting a few extra sequential accesses in exchange for a significant reduction in the number of random accesses. We empirically establish the effectiveness of our sharding scheme with experiments over the revision history of the English Wikipedia from 2001 to 2005 (approx. 700 GB) and an archive of U.K. governmental web sites (approx. 400 GB). Our results demonstrate the feasibility of faster time-travel query processing with no space overhead.
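The trade-off between random and sequential accesses can be illustrated with a small sketch. The abstract does not state the sharding criterion or the cost model, so everything below is an assumption made for illustration: postings carry a half-open validity interval [begin, end), a shard is kept "clean" by requiring that end times never decrease when its postings are sorted by begin time, and the cost of a query is one random access per shard plus one sequential read per entry touched. The names `Posting`, `shard_postings`, `read_cost`, and `merge_all` are hypothetical, and the greedy first-fit split is not claimed to be the optimal sharding referred to above.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: int
    begin: int  # first timestamp at which this version is valid
    end: int    # first timestamp at which it is no longer valid


def shard_postings(postings):
    """Greedy first-fit split (an assumption, not the paper's algorithm).

    Within each shard, postings are ordered by begin time and their end
    times never decrease, so the entries valid at any time t are contiguous.
    """
    shards = []
    for p in sorted(postings, key=lambda q: (q.begin, q.end)):
        for shard in shards:
            if shard[-1].end <= p.end:
                shard.append(p)
                break
        else:
            shards.append([p])
    return shards


def query_shard(shard, t):
    """Return doc ids valid at time t; only correct for clean shards."""
    ends = [p.end for p in shard]
    start = bisect_right(ends, t)  # first posting with end > t
    hits = []
    for p in shard[start:]:
        if p.begin > t:  # later postings start after t, stop reading
            break
        hits.append(p.doc_id)
    return hits


def time_travel_query(shards, t):
    return sorted(d for s in shards for d in query_shard(s, t))


def read_cost(shards, t, random_cost=10.0, seq_cost=1.0):
    """Assumed cost model: one seek per shard plus one unit per entry read."""
    entries_read = 0
    for shard in shards:
        # Start at the first entry still valid at t, read until begin > t.
        start = next((i for i, p in enumerate(shard) if p.end > t), len(shard))
        for p in shard[start:]:
            if p.begin > t:
                break
            entries_read += 1
    return len(shards) * random_cost + entries_read * seq_cost


def merge_all(shards):
    """Collapse all shards into one list ordered by begin time."""
    merged = sorted((p for s in shards for p in s), key=lambda q: (q.begin, q.end))
    return [merged]


if __name__ == "__main__":
    postings = [Posting(1, 0, 10), Posting(2, 1, 3), Posting(3, 2, 12),
                Posting(4, 4, 6), Posting(5, 5, 20)]
    shards = shard_postings(postings)
    print(time_travel_query(shards, 5))
    print(read_cost(shards, 5), read_cost(merge_all(shards), 5))
```

In this toy setting, the clean shards read only entries that are valid at the query time but pay one seek per shard, while the merged list pays a single seek and reads some entries that must be filtered out. Which side wins depends entirely on the assumed ratio of `random_cost` to `seq_cost`.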

[1]  Ashwin Machanavajjhala,et al.  P-ring: an efficient and robust P2P range index structure , 2007, SIGMOD '07.

[2]  Srinivasan Seshan,et al.  Mercury: supporting scalable multi-attribute range queries , 2004, SIGCOMM '04.

[3]  Karl Aberer,et al.  P-Grid: a self-organizing structured P2P system , 2003, SIGMOD Rec.

[4]  Erik D. Demaine,et al.  EpiChord: parallelizing the chord lookup algorithm with reactive routing state management , 2004, 12th IEEE International Conference on Networks (ICON 2004).

[5]  Gerhard Weikum,et al.  Tunable Word-Level Index Compression for Versioned Corpora , 2008 .

[6]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.

[7]  Ricardo A. Baeza-Yates,et al.  Challenges on Distributed Web Retrieval , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[8]  François Goasdoué,et al.  WebContent: efficient P2P Warehousing of web data , 2008, Proc. VLDB Endow..

[9]  Divyakant Agrawal,et al.  PRISM: indexing multi-dimensional data in P2P networks using reference vectors , 2005, MULTIMEDIA '05.

[10]  Kin Ying Yu,et al.  Long term preservation of electronic documents , 2004 .

[11]  Kenneth J. Supowit,et al.  Decomposing a Set of Points into Chains, with Applications to Permutation and Circle Graphs , 1985, Inf. Process. Lett..

[12]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM '01.

[13]  David B. Lomet,et al.  Access methods for multiversion data , 1989, SIGMOD '89.

[14]  Richard P. Martin,et al.  Autonomous replication for high availability in unstructured P2P systems , 2003, 22nd International Symposium on Reliable Distributed Systems.

[15]  Andreas Haeberlen,et al.  Glacier: highly durable, decentralized storage despite massive correlated failures , 2005, NSDI.

[16]  Ben Y. Zhao,et al.  OceanStore: an architecture for global-scale persistent storage , 2000, SIGPLAN Notices.

[17]  Ian Clarke,et al.  Freenet: A Distributed Anonymous Information Storage and Retrieval System , 2000, Workshop on Design Issues in Anonymity and Unobservability.

[18]  Shipeng Li,et al.  Distributed Segment Tree: Support of Range Query and Cover Query over DHT , 2006, IPTPS.

[19]  Ramakrishna Kotla,et al.  SafeStore: A Durable and Practical Storage System , 2007, USENIX Annual Technical Conference.

[20]  Samir Khuller,et al.  The Budgeted Maximum Coverage Problem , 1999, Inf. Process. Lett..

[21]  Yuh-Jzer Joung,et al.  Keyword Search in DHT-Based Peer-to-Peer Networks , 2005, 25th IEEE International Conference on Distributed Computing Systems (ICDCS'05).

[22]  David Mazières,et al.  Kademlia: A Peer-to-Peer Information System Based on the XOR Metric , 2002, IPTPS.

[23]  Andreas Haeberlen,et al.  Efficient Replica Maintenance for Distributed Storage Systems , 2006, NSDI.

[24]  Ben Y. Zhao,et al.  Silverback: A Global-Scale Archival System , 2001 .

[25]  Rodrigo Rodrigues,et al.  High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two , 2003, HotOS.

[26]  Marc Najork,et al.  A large‐scale study of the evolution of Web pages , 2004, Softw. Pract. Exp..

[27]  Sandhya Dwarkadas,et al.  Peer-to-peer information retrieval using self-organizing semantic overlay networks , 2003, SIGCOMM '03.

[28]  David B. Lomet,et al.  Transaction time indexing with version compression , 2008, Proc. VLDB Endow..

[29]  Susan T. Dumais,et al.  The web changes everything: understanding the dynamics of web content , 2009, WSDM '09.

[30]  Gerhard Weikum,et al.  The LHAM log-structured history data access method , 2000, The VLDB Journal.

[31]  Kimmo E. E. Raatikainen,et al.  Epidemic Dissemination for Probabilistic Data Storage , 2006 .

[32]  Gerhard Weikum,et al.  MINERVA: Collaborative P2P Search , 2005, VLDB.

[33]  Michael Gertz,et al.  On the value of temporal information in information retrieval , 2007, SIGIR Forum.

[34]  Torsten Suel,et al.  Improved index compression techniques for versioned document collections , 2010, CIKM '10.

[35]  John Kubiatowicz,et al.  Erasure Coding Vs. Replication: A Quantitative Comparison , 2002, IPTPS.

[36]  Christopher Olston,et al.  What's new on the web?: the evolution of the web from a search engine perspective , 2004, WWW '04.

[37]  Martin Bergman,et al.  The deep web: surfacing the hidden value , 2000 .

[38]  Antony I. T. Rowstron,et al.  PAST: a large-scale, persistent peer-to-peer storage utility , 2001, Eighth Workshop on Hot Topics in Operating Systems.

[39]  Karl Aberer,et al.  GridVine: An Infrastructure for Peer Information Management , 2007, IEEE Internet Computing.

[40]  Jun Gao,et al.  An adaptive protocol for efficient support of range queries in DHT-based systems , 2004, 12th IEEE International Conference on Network Protocols (ICNP 2004).

[41]  Cédric du Mouza,et al.  Dynamic storage balancing in a distributed spatial index , 2007, GIS.

[42]  Beng Chin Ooi,et al.  Answering similarity queries in peer-to-peer networks , 2004, WWW Alt. '04.

[43]  Torsten Suel,et al.  Compact full-text indexing of versioned document collections , 2009, CIKM.

[44]  Viswanath Poosala,et al.  Aqua: A Fast Decision Support Systems Using Approximate Query Answers , 1999, VLDB.

[45]  Gerhard Weikum,et al.  Near-optimal dynamic replication in unstructured peer-to-peer networks , 2008, PODS.

[46]  Paolo Toth,et al.  Approximation schemes for the subset-sum problem: Survey and experimental analysis , 1985 .

[47]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[48]  Srikanta J. Bedathur,et al.  Efficient temporal keyword search over versioned text , 2010, CIKM.

[49]  William Yurcik,et al.  A survey of peer-to-peer storage techniques for distributed file systems , 2005, International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II.

[50]  Sridhar Ramaswamy,et al.  Join synopses for approximate query answering , 1999, SIGMOD '99.

[51]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[52]  Bernhard Seeger,et al.  An asymptotically optimal multiversion B-tree , 1996, The VLDB Journal.

[53]  Manolis Koubarakis,et al.  LibraRing: An Architecture for Distributed Digital Libraries Based on DHTs , 2005, ECDL.

[54]  Oren Dobzinski,et al.  Viceroy - on the implementation of a Peer to Peer network , 2003 .

[55]  Eelco Herder.  Characterizations of User Web Revisit Behavior , 2005, LWA.

[56]  Mira Dontcheva,et al.  Zoetrope: interacting with the ephemeral web , 2008, UIST '08.

[57]  Rajiv Gandhi,et al.  Approximation algorithms for partial covering problems , 2004, J. Algorithms.

[58]  Dmitri Loguinov,et al.  IRLbot: scaling to 6 billion pages and beyond , 2008, WWW.

[59]  Gerhard Weikum,et al.  Architectural Alternatives for Information Filtering in Structured Overlays , 2007, IEEE Internet Computing.

[60]  Joel Waldfogel,et al.  Introduction , 2010, Inf. Econ. Policy.

[61]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[62]  Amin Vahdat,et al.  Efficient Peer-to-Peer Keyword Searching , 2003, Middleware.

[63]  Artur Andrzejak,et al.  Scalable, efficient range queries for grid information services , 2002, Second International Conference on Peer-to-Peer Computing.

[64]  J. Parreira,et al.  The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web Search Network , 2007 .

[65]  Sriram Ramabhadran,et al.  Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables , 2004, PODC 2004.

[66]  Geoffrey M. Voelker,et al.  On Object Maintenance in Peer-to-Peer Systems , 2006, IPTPS.

[67]  Justin Zobel,et al.  Inverted files for text search engines , 2006, CSUR.

[68]  Sebastian Michel,et al.  P2P Web Search: Make It Light, Make It Fly (Demo) , 2007, CIDR.

[69]  Ben Y. Zhao,et al.  Pond: The OceanStore Prototype , 2003, FAST.

[70]  Karl Aberer,et al.  ALVIS peers: a scalable full-text peer-to-peer retrieval engine , 2006, P2PIR '06.

[71]  Gerhard Weikum,et al.  FluxCapacitor: Efficient Time-Travel Text Search , 2007, VLDB.

[72]  Emin Gün Sirer,et al.  Beehive: O(1) Lookup Performance for Power-Law Query Distributions in Peer-to-Peer Overlays , 2004, NSDI.

[73]  Gerhard Weikum,et al.  Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems , 2008, DASFAA.

[74]  Divyakant Agrawal,et al.  Approximate Range Selection Queries in Peer-to-Peer Systems , 2003, CIDR.

[75]  Anja Feldmann,et al.  Rate of Change and other Metrics: a Live Study of the World Wide Web , 1997, USENIX Symposium on Internet Technologies and Systems.

[76]  Karl Aberer,et al.  Range queries in trie-structured overlays , 2005, Fifth IEEE International Conference on Peer-to-Peer Computing (P2P'05).

[77]  Gerhard Weikum,et al.  The Juxtaposed approximate PageRank method for robust PageRank approximation in a peer-to-peer web search network , 2008, The VLDB Journal.

[78]  Peter G. Anick,et al.  Versioning a full-text information retrieval system , 1992, SIGIR '92.

[79]  Mary Baker,et al.  The LOCKSS peer-to-peer digital preservation system , 2005, TOCS.

[80]  Gerhard Weikum,et al.  A Time Machine for Text Search , 2007, SIGIR.

[81]  Ben Y. Zhao,et al.  Tapestry: a resilient global-scale overlay for service deployment , 2004, IEEE Journal on Selected Areas in Communications.

[82]  Karl Aberer,et al.  Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[83]  Antony I. T. Rowstron,et al.  Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility , 2001, SOSP.

[84]  Timo Burkard,et al.  Herodotus: A Peer-to-Peer Web Archival System , 2002 .

[85]  Gerhard Weikum,et al.  Efficient Time-Travel on Versioned Text Collections , 2007, BTW.

[86]  Charles L. A. Clarke,et al.  Information Retrieval - Implementing and Evaluating Search Engines , 2010 .

[87]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[88]  Mudhakar Srivatsa,et al.  Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web , 2003, Distributed Multimedia Information Retrieval.

[89]  Matthias Bender,et al.  Advanced methods for query routing in peer-to-peer information retrieval , 2007 .

[90]  Beng Chin Ooi,et al.  BATON: A Balanced Tree Structure for Peer-to-Peer Networks , 2005, VLDB.

[91]  Vassilis J. Tsotras,et al.  Comparison of access methods for time-evolving data , 1999, CSUR.

[92]  Christian S. Jensen,et al.  Join operations in temporal databases , 2005, The VLDB Journal.

[93]  Michael Herscovici,et al.  Efficient Indexing of Versioned Document Sequences , 2007, ECIR.

[94]  Stefan Savage,et al.  Total Recall: System Support for Automated Availability Management , 2004, NSDI.

[95]  Beng Chin Ooi,et al.  Paths to stardom: calibrating the potential of a peer-based data management system , 2008, SIGMOD Conference.

[96]  Hector Garcia-Molina,et al.  Wave-indices: indexing evolving databases , 1997, SIGMOD '97.

[97]  JoAnne Holliday,et al.  Redundancy Management for P2P Storage , 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).

[98]  Johannes Gehrke,et al.  P-tree: a p2p index for resource discovery applications , 2004, WWW Alt. '04.

[99]  David R. Karger,et al.  Wide-area cooperative storage with CFS , 2001, SOSP.

[100]  Torsten Suel,et al.  Design and implementation of a high-performance distributed Web crawler , 2002, 18th International Conference on Data Engineering.

[101]  Robert Morris,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM 2001.

[102]  Bernardo A. Huberman,et al.  Predicting the popularity of online content , 2008, Commun. ACM.

[103]  Werner Vogels,et al.  Dynamo: amazon's highly available key-value store , 2007, SOSP.

[104]  Alexander Zangerl,et al.  Tamper-resistant replicated peer-to-peer storage using hierarchical signatures , 2006, First International Conference on Availability, Reliability and Security (ARES'06).