Adaptive, Hands-Off Stream Mining (CMU-CS-02-205)

Sensor devices and embedded processors are becoming ubiquitous, especially in measurement and monitoring applications. Automatic discovery of patterns and trends in the large volumes of data they produce is of paramount importance. The relatively limited resources of such devices (CPU, memory, and/or communication bandwidth and power) pose some interesting challenges. We need “languages” that are both powerful and concise to represent the important features of the data, which can (a) adapt to and handle arbitrary periodic components, including bursts, and (b) require little memory and a single pass over the data. This allows sensors to automatically (a) discover interesting patterns and trends in the data, and (b) perform outlier detection to alert users. For example, a sensor should be able to discover something like “the hourly phone call volume so far follows a daily and a weekly periodicity, with bursts roughly every year,” which a human might recognize as, e.g., the Mother’s Day surge. When possible and if desired, the user can then issue explicit queries to further investigate the reported patterns. In this work we propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors operating in remote or hostile environments to discover patterns efficiently and effectively, with practically no user intervention. Our algorithms require limited resources and thus can be incorporated in individual sensors, possibly alongside a distributed query processing engine [CCC+02, BGS01, MSHR02]. Updates are performed in constant time, using sub-linear (in fact, logarithmic) space. Existing state-of-the-art forecasting methods (AR, SARIMA, GARCH, etc.) fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first method that has all the above characteristics. Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform automatically and efficiently. In fact, AWSOM outperforms manually set up auto-regressive models, both in long-term pattern detection and modeling and in resource consumption, where it is at least 10× more economical.
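The abstract does not spell out how constant-time updates and logarithmic space are obtained. As a purely illustrative sketch, and not necessarily AWSOM's actual procedure, one standard way to get a single-pass summary with these properties is an incremental Haar wavelet decomposition that keeps one unpaired value per resolution level. The class name `StreamingHaar` and its methods are hypothetical; update cost here is constant only in the amortized sense.

```python
import math


class StreamingHaar:
    """Single-pass orthonormal Haar decomposition (illustrative sketch only).

    State is one pending "smooth" value per level, so memory grows as
    O(log N) after N values. Detail coefficients are emitted as soon as
    they are complete instead of being stored.
    """

    def __init__(self):
        # pending[l] holds the unpaired smooth value at level l, or None.
        self.pending = []
        self.count = 0

    def update(self, x):
        """Consume one value; return the (level, detail) pairs it completes."""
        self.count += 1
        emitted = []
        level = 0
        smooth = float(x)
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet at this level: remember the value and stop.
                self.pending[level] = smooth
                break
            # A partner exists: combine the pair into one detail coefficient
            # and one smooth value, then carry the smooth value upward.
            left = self.pending[level]
            self.pending[level] = None
            detail = (left - smooth) / math.sqrt(2)
            smooth = (left + smooth) / math.sqrt(2)
            emitted.append((level + 1, detail))
            level += 1
        return emitted


if __name__ == "__main__":
    haar = StreamingHaar()
    for value in range(1, 9):
        for level, detail in haar.update(value):
            print(f"t={haar.count} level={level} detail={detail:.4f}")
    print("levels kept in memory:", len(haar.pending))
```

Each arriving value triggers a carry chain much like incrementing a binary counter, so the total work over N values is at most about 2N pair combinations. A per-level model (for example, a small regression on the emitted coefficients) could consume the output of `update`, but that step is left out because the abstract gives no details about it.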

[1] T. Bollerslev, et al. Generalized autoregressive conditional heteroskedasticity, 1986.

[2] Michael Stonebraker, et al. Monitoring Streams - A New Class of Data Management Applications, 2002, VLDB.

[3] J. Griffin, et al. Designing computer systems with MEMS-based storage, 2000, SIGP.

[4] Christos Faloutsos, et al. Data mining meets performance evaluation: fast algorithms for modeling bursty traffic, 2002, Proceedings 18th International Conference on Data Engineering.

[5] L. Richard Carley, et al. MEMS-based integrated-circuit mass-storage systems, 2000, CACM.

[6] Richard A. Davis, et al. Time Series: Theory and Methods, 2013.

[7] R. Gencay, et al. An Introduction to Wavelets and Other Filtering Methods in Finance and Economics, 2001.

[8] Sudipto Guha, et al. Near-optimal sparse fourier representations via sampling, 2002, STOC '02.

[9] A. Walden, et al. Wavelet Methods for Time Series Analysis, 2000.

[10] Minos N. Garofalakis, et al. Wavelet synopses with error guarantees, 2002, SIGMOD '02.

[11] Azer Bestavros, et al. Self-similarity in World Wide Web traffic: evidence and possible causes, 1996, SIGMETRICS '96.

[12] Jennifer Widom, et al. Characterizing memory requirements for queries over continuous data streams, 2002, PODS '02.

[13] Divesh Srivastava, et al. On computing correlated aggregates over continual data streams, 2001, SIGMOD '01.

[14] Richard G. Baraniuk, et al. A Multifractal Wavelet Model with Application to Network Traffic, 1999, IEEE Trans. Inf. Theory.

[15] Jan Beran, et al. Statistics for long-memory processes, 1994.

[16] P. Young, et al. Time series analysis, forecasting and control, 1972, IEEE Transactions on Automatic Control.

[17] Leon Abelmann, et al. Single-chip computers with microelectromechanical systems-based magnetic memory (invited), 2000.

[18] Peter C. Young, et al. Recursive Estimation and Time-Series Analysis: An Introduction, 1984.

[19] Michael R. Chernick, et al. Wavelet Methods for Time Series Analysis, 2001, Technometrics.

[20] Metin Akay, et al. Time frequency and wavelets in biomedical signal processing, 1998.

[21] Robertus A. Zuidwijk, et al. Fast algorithm for directional time-scale analysis using wavelets, 1998, Optics & Photonics.

[22] Samuel Madden, et al. Continuously adaptive continuous queries over streams, 2002, SIGMOD '02.

[23] S. Muthukrishnan, et al. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, 2001, VLDB.

[24] Piotr Indyk, et al. Identifying Representative Trends in Massive Time Series Data Sets Using Sketches, 2000, VLDB.

[25] Robert Szewczyk, et al. System architecture directions for networked sensors, 2000, ASPLOS IX.

[26] Sudipto Guha, et al. Fast, small-space algorithms for approximate histogram maintenance, 2002, STOC '02.

[27] Walter Willinger, et al. On the Self-Similar Nature of Ethernet Traffic (extended version), 1995.

[28] Stephen A. Dyer, et al. Digital signal processing, 2018, 8th International Multitopic Conference, 2004. Proceedings of INMIC 2004.

[29] Piotr Indyk, et al. Maintaining Stream Statistics over Sliding Windows, 2002, SIAM J. Comput.

[30] William H. Press, et al. Numerical recipes in C, 2002.

[31] Yixin Chen, et al. Multi-Dimensional Regression Analysis of Time-Series Data Streams, 2002, VLDB.

[32] Walter Willinger, et al. On the self-similar nature of Ethernet traffic, 1993, SIGCOMM '93.

[33] Rajeev Rastogi, et al. Processing complex aggregate queries over data streams, 2002, SIGMOD '02.

[34] Christos Faloutsos, et al. Data mining on an OLTP system (nearly) for free, 2000, SIGMOD '00.

[35] Andreas S. Weigend, et al. Time Series Prediction: Forecasting the Future and Understanding the Past, 1994.

[36] Philippe Bonnet, et al. Towards Sensor Database Systems, 2001, Mobile Data Management.

[37] Sudipto Guha, et al. Fast algorithms for hierarchical range histogram construction, 2002, PODS '02.

[38] W. B. Johnson, et al. Extensions of Lipschitz mappings into Hilbert space, 1984.

[39] Christos Faloutsos, et al. Online data mining for co-evolving time sequences, 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[40] Christos Faloutsos, et al. Searching Multimedia Databases by Content, 1996, Advances in Database Systems.

[41] Gwilym M. Jenkins, et al. Time series analysis, forecasting and control, 1971.