Learning model trees from evolving data streams

Extracting meaningful patterns in real time from time-changing data streams is increasingly important to the machine learning and data mining communities. Regression on time-changing data streams remains relatively unexplored, despite its evident applications. This paper proposes an efficient, incremental stream mining algorithm that learns regression and model trees from possibly unbounded, high-speed, time-changing data streams. The algorithm is evaluated extensively in a variety of settings involving artificial and real data. To the best of our knowledge, there is no other general-purpose algorithm for incremental learning of regression or model trees that performs explicit change detection and informed adaptation. The algorithm operates online and in real time, observes each example only once at the speed of arrival, and maintains a ready-to-use model tree at any time. The tree leaves contain linear models induced online from the examples assigned to them, a process with low complexity. The algorithm has mechanisms for drift detection and model adaptation, which enable it to maintain accurate and up-to-date regression models at any time. The drift detection mechanism exploits the structure of the tree to detect changes locally. In response to local drift, the algorithm updates only the affected part of the tree structure. This approach improves the anytime performance and greatly reduces the cost of adaptation.
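The two ingredients named in the abstract, a linear model induced online and a drift detector that monitors its error, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not say how the leaf models are trained or which change detection test is used, so the sketch substitutes a plain stochastic gradient (delta rule) update and the Page-Hinkley test, and the class names, parameter values, and synthetic stream are hypothetical.

```python
import random


class OnlineLinearModel:
    """Linear model updated one example at a time (delta rule; an assumed choice)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """Take one gradient step on the squared error.

        Returns the absolute error measured before the step, so each example
        is used first for testing and then for training.
        """
        error = y - self.predict(x)
        self.bias += self.learning_rate * error
        self.weights = [
            w + self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        return abs(error)


class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a monitored signal."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm threshold (illustrative value)
        self.count = 0
        self.mean = 0.0
        self.cumulative = 0.0       # running sum of deviations from the mean
        self.minimum = 0.0          # smallest value the running sum has taken

    def add(self, value):
        """Feed one observation; return True when a change is signalled."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


def first_alarm(n_examples=6000, drift_at=3000, seed=0):
    """Return the index of the first alarm on a synthetic stream, or None."""
    rng = random.Random(seed)
    model = OnlineLinearModel(n_features=1)
    detector = PageHinkley()
    for t in range(n_examples):
        x = [rng.random()]
        # Hypothetical abrupt drift: the target function switches at drift_at.
        y = 2.0 * x[0] + 1.0 if t < drift_at else -3.0 * x[0] + 5.0
        if detector.add(model.update(x, y)):
            return t
    return None


if __name__ == "__main__":
    print("first alarm at example", first_alarm())
```

Each example is seen once and then discarded, which matches the single-pass setting described above. In a tree, one such detector per node would let an alarm point to the affected subtree, which is one way to read the local detection and local adaptation the abstract describes.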
