Persistent Data Sketching

A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) creates a new version, and all versions are kept so that any previous version can still be queried. Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear in the number of updates. In many of today's big data applications, in particular those with high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Streaming algorithms, which use only sublinear space by sacrificing slightly on accuracy, have therefore received a lot of attention in the research community. All streaming algorithms work by maintaining a small data structure in memory, usually called a sketch, summary, or synopsis. The sketch is updated upon the arrival of every element in the stream and is thus ephemeral, meaning that it can only answer queries about the current status of the stream. In this paper, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
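
To make the contrast between an ephemeral and a persistent sketch concrete, the following is a minimal illustration, not the construction proposed in the paper. It assumes a Count-Min-style frequency sketch and makes it persistent in the most naive way, by logging every counter change with a timestamp. The class name `NaivePersistentCountMin`, its parameters, and the hashing scheme are all hypothetical. Note that this log grows linearly with the number of updates, which is exactly the cost a genuinely persistent sketch would need to avoid.

```python
import bisect
import random


class NaivePersistentCountMin:
    """Count-Min-style sketch whose counters keep their full update history.

    Illustration only (hypothetical names): the history grows linearly with
    the number of updates, unlike a space-efficient persistent sketch.
    """

    def __init__(self, width=64, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        # For each counter: parallel lists of update times and resulting values.
        self.times = [[[0] for _ in range(width)] for _ in range(depth)]
        self.counts = [[[0] for _ in range(width)] for _ in range(depth)]
        self.now = 0  # logical time = number of updates seen so far

    def _col(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, delta=1):
        """Process one stream element and return its timestamp."""
        self.now += 1
        for r in range(self.depth):
            c = self._col(r, item)
            self.times[r][c].append(self.now)
            self.counts[r][c].append(self.counts[r][c][-1] + delta)
        return self.now

    def query(self, item, t=None):
        """Estimate the frequency of `item` as of time `t` (default: now)."""
        t = self.now if t is None else max(t, 0)
        estimates = []
        for r in range(self.depth):
            c = self._col(r, item)
            # Last recorded counter value with timestamp <= t.
            i = bisect.bisect_right(self.times[r][c], t) - 1
            estimates.append(self.counts[r][c][i])
        return min(estimates)


sketch = NaivePersistentCountMin()
for _ in range(3):
    sketch.update(7)
print(sketch.query(7, t=2), sketch.query(7))
```

An ephemeral sketch would keep only the latest value of each counter and could answer only `query(7)`. The historical query `query(7, t=2)` is what persistence adds.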
