Transparent Forecasting Strategies in Database Management Systems

Whereas traditional data warehouse systems assume that data is complete or has been carefully preprocessed, a growing share of data is imprecise, incomplete, and inconsistent. This is especially true in the context of big data, where massive amounts of data arrive continuously and in real time from a vast number of data sources. At the same time, modern data analysis involves sophisticated statistical algorithms that go well beyond traditional business intelligence (BI) and is increasingly performed by non-expert users. Both trends require transparent data mining techniques that efficiently handle missing data and present a complete view of the database to the user. Time series forecasting estimates future, not yet available, values of a time series and is one way of dealing with missing data. Moreover, it enables queries that retrieve a view of the database at any point in time: past, present, and future. This article presents an overview of forecasting techniques in database management systems. After discussing possible application areas for time series forecasting, we give a short mathematical background of the main forecasting concepts. We then outline general strategies for integrating time series forecasting into a database and discuss individual techniques from the database community. We conclude by introducing a novel forecasting-enabled database management architecture that natively and transparently integrates forecast models.
