Forward Decay: A Practical Time Decay Model for Streaming Systems

Temporal data analysis in data warehouses and data streaming systems often uses time decay to reduce the importance of older tuples in the results of the analysis without eliminating their influence. While exponential time decay is commonly used in practice, other decay functions (e.g., polynomial decay) are not, even though they have been identified as useful. We argue that this is because the usual definitions of time decay are "backwards": the decayed weight of a tuple is based on its age, measured backward from the current time. Since this age is constantly changing, such decay is too complex and unwieldy for scalable implementation. In this paper, we propose a new class of "forward" decay functions based on measuring forward from a fixed point in time. We show that this model captures the more practical models already known, such as exponential decay and landmark windows, but also includes a wide class of other types of time decay. We provide efficient algorithms to compute a variety of aggregates and draw samples under forward decay, and show that these are easy to implement scalably. Further, we provide empirical evidence that these can be executed in a production data stream management system (DSMS) with little or no overhead compared to the undecayed computations. Our implementation required no extensions to the query language or the DSMS, demonstrating that forward decay represents a practical model of time decay for systems that deal with time-based data.
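The idea of measuring forward from a fixed point can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not spell out: a fixed landmark time, a non-decreasing function `g`, and an item weight at query time `t` of `g(t_i - landmark) / g(t - landmark)`. The class name `ForwardDecaySum` and its methods are hypothetical and are not taken from the paper's implementation.

```python
import math


class ForwardDecaySum:
    """Illustrative decayed sum/count under an assumed forward-decay weighting.

    Assumption: an item with timestamp t_i has weight
    g(t_i - landmark) / g(t - landmark) at query time t.
    The numerator is fixed once the item arrives, so stored state is never
    rescaled as time advances; only the query divides by g(t - landmark).
    """

    def __init__(self, g, landmark):
        self.g = g                # non-decreasing function of time since the landmark
        self.landmark = landmark  # fixed reference point in time
        self.weighted_sum = 0.0   # running sum of g(t_i - landmark) * v_i
        self.weight_total = 0.0   # running sum of g(t_i - landmark)

    def add(self, timestamp, value):
        w = self.g(timestamp - self.landmark)
        self.weighted_sum += w * value
        self.weight_total += w

    def decayed_sum(self, now):
        return self.weighted_sum / self.g(now - self.landmark)

    def decayed_count(self, now):
        return self.weight_total / self.g(now - self.landmark)

    def decayed_average(self):
        # The normalizing factor g(now - landmark) cancels in the ratio.
        return self.weighted_sum / self.weight_total


def exponential(alpha):
    """g(n) = exp(alpha * n), for exponential decay."""
    return lambda n: math.exp(alpha * n)


def polynomial(beta):
    """g(n) = n ** beta, one example of polynomial decay."""
    return lambda n: n ** beta


# Polynomial decay with landmark 0: two items of value 10 at times 1 and 2.
poly = ForwardDecaySum(polynomial(2), landmark=0.0)
poly.add(1.0, 10.0)
poly.add(2.0, 10.0)
poly_sum_at_2 = poly.decayed_sum(now=2.0)

# Exponential decay: one item at time 1, queried at time 3.
expo = ForwardDecaySum(exponential(0.5), landmark=0.0)
expo.add(1.0, 1.0)
expo_count_at_3 = expo.decayed_count(now=3.0)
backward_weight = math.exp(-0.5 * (3.0 - 1.0))  # usual age-based weight
```

Under these assumptions, the exponential case gives the same weight as the age-based definition, since `exp(alpha * t_i) / exp(alpha * t)` equals `exp(-alpha * (t - t_i))`. For rapidly growing `g`, a real implementation would also need to guard against numeric overflow, which this sketch does not attempt.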
