Mining time-changing data streams

Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old subtree with the new one when the new one becomes more accurate. CVFDT learns a model that is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
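The O(1) versus O(w) contrast can be illustrated with a minimal sketch. It is not the CVFDT implementation: the class `WindowedCounts`, its method names, and the choice of (attribute index, attribute value, class label) counts as the statistics being maintained are assumptions made for illustration. The idea shown is that sliding-window statistics can be kept current by adding the newest example and subtracting the oldest, so the work per arriving example does not depend on the window size w, whereas relearning from the window would touch all w examples each time.

```python
from collections import Counter, deque


class WindowedCounts:
    """Hypothetical helper: counts of (attribute index, value, class label)
    over the most recent w examples. Not taken from the paper."""

    def __init__(self, w):
        self.w = w                # window size
        self.window = deque()     # examples currently inside the window
        self.counts = Counter()   # statistics summarizing the window

    def add(self, attributes, label):
        # Add the new example's contribution to the counts.
        self.window.append((attributes, label))
        for i, value in enumerate(attributes):
            self.counts[(i, value, label)] += 1

        # Once the window overflows, subtract the oldest example.
        # The cost depends on the number of attributes, not on w.
        if len(self.window) > self.w:
            old_attributes, old_label = self.window.popleft()
            for i, value in enumerate(old_attributes):
                key = (i, value, old_label)
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]


# Usage: a window of 3 examples over a stream of 4.
stats = WindowedCounts(w=3)
stream = [(("a", "x"), 0), (("a", "y"), 1), (("b", "x"), 0), (("b", "y"), 1)]
for attributes, label in stream:
    stats.add(attributes, label)
```

After the fourth example arrives, the first one has been forgotten, so the counts describe only the three most recent examples. A learner built on such counts could compare candidate splits against the current window without rescanning it.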
