On joining and caching stochastic streams

We consider the problem of joining data streams using limited cache memory, with the goal of producing as many result tuples as possible from the cache. Many cache replacement heuristics have been proposed in the past. Their performance often relies on implicit assumptions about the input streams, e.g., that the join attribute values follow a relatively stationary distribution. In practice, however, streams often exhibit more complex behaviors, such as increasing trends and random walks, rendering these "hardwired" heuristics inadequate.

In this paper, we propose a framework that exploits known or observed statistical properties of input streams to make cache replacement decisions aimed at maximizing the expected number of result tuples. To illustrate the complexity of the solution space, we show that even an algorithm that considers, at every time step, all possible sequences of future replacement decisions may not be optimal. We then identify a condition between two candidate tuples under which an optimal algorithm would always choose to replace one tuple rather than the other. We develop a heuristic that behaves consistently with an optimal algorithm whenever this condition is satisfied. We show through experiments that our heuristic outperforms previous ones.

As further evidence of the generality of our framework, we show that the classic caching/paging problem for static objects can be reduced to a stream join problem and analyzed under our framework, yielding results that agree with or extend classic ones.
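To make the problem setting concrete, the following is a minimal sketch of a cache-limited join of two streams with a pluggable replacement score. It is not the heuristic developed in the paper. The names `simulate_join` and `stationary_score`, the single cache shared by both streams, and the one-arrival-per-stream-per-step model are all assumptions made for illustration. The score shown assumes stationary, known value distributions, which is exactly the kind of "hardwired" assumption the abstract argues can fail on trends or random walks; the time step `t` is passed to the score so that a time-dependent estimate could be substituted.

```python
def simulate_join(stream_r, stream_s, cache_size, score):
    """Join two equal-length streams of join-attribute values via one shared cache.

    score(side, value, t) -> float; a higher score means the tuple is kept longer.
    Returns the number of result tuples produced from the cache.
    Tuples arriving at the same step are not joined with each other (assumption).
    """
    cache = []  # list of (side, value) pairs, side is "R" or "S"
    results = 0
    for t, (r, s) in enumerate(zip(stream_r, stream_s)):
        # Each arriving tuple joins with cached tuples from the opposite stream.
        results += sum(1 for side, v in cache if side == "S" and v == r)
        results += sum(1 for side, v in cache if side == "R" and v == s)
        # Admit both arrivals, then keep only the cache_size best-scoring tuples.
        cache.extend([("R", r), ("S", s)])
        cache.sort(key=lambda item: score(item[0], item[1], t), reverse=True)
        del cache[cache_size:]
    return results


def stationary_score(prob_r, prob_s):
    """Score a tuple by the probability that the partner stream emits its value.

    prob_r and prob_s map join values to assumed per-step arrival probabilities.
    """
    def score(side, value, t):
        partner = prob_s if side == "R" else prob_r
        return partner.get(value, 0.0)  # t is unused under stationarity
    return score


if __name__ == "__main__":
    r = [1, 1, 1, 1]
    s = [2, 1, 1, 1]
    policy = stationary_score({1: 1.0}, {1: 0.75, 2: 0.25})
    print(simulate_join(r, s, cache_size=1, score=policy))
```

Replacing `stationary_score` with a function that depends on `t` (for example, one that models an increasing trend) is one way to experiment with the non-stationary behaviors mentioned above, under the same simulation assumptions.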
