A skip-list approach for efficiently processing forecasting queries

Time series data is common in many settings, including scientific and financial applications, and the volume of data in these applications is often very large. We seek to support prediction queries over time series data. Prediction relies on model building, which can be too expensive to be practical when it is based on a large number of data points. We propose using statistical hypothesis tests to choose a proper subset of data points for a given prediction query interval. This involves two steps: choosing a proper history length, and choosing the number of data points to use within that history. Further, we use an I/O-conscious skip list data structure to provide samples of the original data set. Based on statistics collected for a query workload, which we model as a probability mass function (PMF) over query intervals, we devise a randomized algorithm that selects a set of pre-built models (PMs) to construct, subject to a maintenance cost constraint that applies when the data is updated. Given this set of PMs, we discuss query processing strategies not only for point queries but also for range, aggregation, and JOIN queries. We conduct a comprehensive empirical study on real-world datasets to verify the effectiveness of our approaches and algorithms.
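To illustrate how a skip list can serve as a source of samples, the following is a minimal in-memory sketch, not the I/O-conscious structure described in the paper. It assumes the standard randomized skip-list construction, in which each point is promoted to the next level with probability `p`, so that level `k` holds roughly a fraction `p^k` of the series. The function names `build_levels` and `sample_for_query`, the `budget` parameter, and the list-of-lists layout are all hypothetical choices made for illustration.

```python
import random


def build_levels(points, p=0.5, max_level=16, seed=0):
    """Assign each point a random skip-list height and return per-level samples.

    levels[k] holds, in time order, every point whose height exceeds k.
    levels[0] is the full series; each higher level is a sparser subset of
    the level below it. (Illustrative layout, not the paper's on-disk format.)
    """
    rng = random.Random(seed)
    levels = [[] for _ in range(max_level)]
    for point in points:
        height = 1
        # Promote the point one level at a time with probability p.
        while height < max_level and rng.random() < p:
            height += 1
        for k in range(height):
            levels[k].append(point)
    # Levels are nested, so only the topmost ones can be empty; drop them.
    return [lvl for lvl in levels if lvl]


def sample_for_query(levels, t_start, t_end, budget):
    """Return the densest level's points in [t_start, t_end] that fit the budget.

    A linear scan is used for brevity; a real skip list would follow forward
    pointers instead of filtering whole levels.
    """
    if not levels:
        return []
    window = []
    for lvl in levels:
        window = [(t, v) for (t, v) in lvl if t_start <= t <= t_end]
        if len(window) <= budget:
            return window
    # Even the sparsest level is too dense: truncate it (an arbitrary fallback).
    return window[:budget]


# Hypothetical usage: 1000 evenly spaced points, at most 100 used for modeling.
series = [(t, float(t)) for t in range(1000)]
levels = build_levels(series)
history_sample = sample_for_query(levels, t_start=200, t_end=799, budget=100)
```

In this sketch the history length corresponds to the interval `[t_start, t_end]` and the number of data points to `budget`; how those two values are chosen (by hypothesis tests, in the paper) is outside what the code shows.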
