Propagation of Densities of Streaming Data within Query Graphs

Data Stream Systems (DSSs) use cost models to determine whether a system can cope with a given workload and to optimize query graphs. However, some relevant input parameters of these models are often unknown or highly imprecise. Selectivities in particular are stream-dependent and application-specific. In this paper, we describe a method that supports selectivity estimation by taking the attribute value distributions of the input streams into account. The novelty of our approach is that it propagates the probability distributions through the query graph in order to provide estimates for the inner nodes of the graph. For the most common stream operators, we establish formulas that describe their output distribution as a function of their input distributions. For unknown operators such as User-Defined Operators (UDOs), we introduce a method to measure the influence of these operators on arbitrary probability distributions. This method does most of the computational work before the query is deployed and introduces minimal overhead at runtime. Our evaluation framework facilitates the appropriate combination of both methods and allows almost arbitrary query graphs to be modeled.
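To make the idea of propagating a distribution through a query graph concrete, the following minimal sketch handles a single operator type, a range filter, on a histogram approximation of the input density. It is an illustration under stated assumptions, not the formulas established in the paper: the `Histogram` class, the function `propagate_range_filter`, and the assumption that values are uniformly spread inside each bucket are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Histogram:
    """Hypothetical piecewise-constant density: bucket i spans
    edges[i]..edges[i+1] and holds probability mass[i] (masses sum to 1)."""
    edges: List[float]
    mass: List[float]


def propagate_range_filter(h: Histogram, lo: float, hi: float) -> Tuple[float, Histogram]:
    """Return (selectivity, output histogram) for the predicate lo <= x < hi."""
    kept = []
    for i, m in enumerate(h.mass):
        left, right = h.edges[i], h.edges[i + 1]
        overlap = max(0.0, min(right, hi) - max(left, lo))
        # Assumption: values are uniformly spread inside each bucket.
        kept.append(m * overlap / (right - left))
    selectivity = sum(kept)  # fraction of input tuples that pass the filter
    if selectivity == 0.0:
        return 0.0, Histogram(h.edges, [0.0] * len(h.mass))
    # The output density is the truncated input density, renormalized to 1.
    return selectivity, Histogram(h.edges, [k / selectivity for k in kept])


# Input stream: attribute uniformly distributed over [0, 100) in four buckets.
source = Histogram(edges=[0.0, 25.0, 50.0, 75.0, 100.0],
                   mass=[0.25, 0.25, 0.25, 0.25])

# Two chained filters: x >= 50, then x < 75 on the output of the first.
sel1, after_first = propagate_range_filter(source, 50.0, float("inf"))
sel2, after_second = propagate_range_filter(after_first, float("-inf"), 75.0)

print(sel1, after_first.mass)
print(sel2, after_second.mass)
```

The second filter's selectivity is computed from the propagated distribution `after_first` rather than from the source distribution. Evaluated against the source alone, the predicate `x < 75` would pass 75% of the tuples, whereas at the inner node it passes only half of the tuples that actually arrive there.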
