Evaluating algorithms that learn from data streams

Learning from data streams is a research area of increasing importance. Many stream learning algorithms have been developed. Most of them learn decision models that continuously evolve over time, run in resource-aware environments, and detect and react to changes in the environment that generates the data. One important issue, not yet adequately addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. In this paper we propose a general framework for assessing the quality of stream learning algorithms. We defend the use of Predictive Sequential (prequential) error estimates computed over a sliding window to assess the performance of learning algorithms that learn from open-ended data streams in non-stationary environments. The paper studies the convergence properties of these estimates and methods to comparatively assess the performance of algorithms.
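To illustrate the prequential protocol over a sliding window, the following is a minimal Python sketch. The test-then-train loop and the windowed error estimate follow the idea defended in the paper; the model interface (predict/update methods), the 0-1 loss, and the default window size are assumptions made for the example, not part of the paper's framework.

    from collections import deque

    def prequential_error(stream, model, window_size=1000):
        # Prequential (test-then-train) evaluation with a sliding-window
        # error estimate. `stream` yields (x, y) pairs; `model` is assumed
        # to expose predict(x) and update(x, y) methods (hypothetical API).
        window = deque(maxlen=window_size)   # keeps only the most recent losses
        estimates = []
        for x, y in stream:
            loss = 0.0 if model.predict(x) == y else 1.0  # test first (0-1 loss)
            model.update(x, y)                            # then train on the example
            window.append(loss)
            estimates.append(sum(window) / len(window))   # windowed prequential error
        return estimates

Unlike a holdout estimate, every example is used first for testing and immediately afterwards for training, and restricting the average to the most recent window lets the estimate track performance changes in non-stationary streams.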
