Stop Chasing Trends: Discovering High Order Models in Evolving Data

Many applications are driven by evolving data: patterns in Web traffic, program execution traces, network event logs, and so on are often non-stationary. Building prediction models for such evolving data is an important and challenging task. Most current approaches work by "chasing trends", that is, they keep learning or updating models from the evolving data and use these impromptu models for online prediction. In many cases, this proves both costly and ineffective: much time is wasted re-learning recurring concepts, yet the classifier may constantly remain one step behind the current trend. In this paper, we propose to mine high-order models in evolving data. More often than not, a data stream contains a limited number of concepts, or stable distributions, and the stream switches between these concepts repeatedly. We mine all such concepts offline from a historical stream and build a high-quality model for each of them. At run time, by combining historical concept-change patterns with cues provided by an online training stream, we identify the most likely current concept and use its corresponding model to classify data in an unlabeled stream. The primary advantage of the high-order model approach is its high accuracy: experiments show that on benchmark datasets, the classification error of the high-order model is only a small fraction of that of the current best approaches. Another important benefit is that, unlike state-of-the-art approaches, ours does not require users to tune any parameters to achieve satisfactory results on streams of different characteristics.
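The run-time procedure described above, selecting the most likely current concept from per-concept models and a historical concept-change prior, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the names `ThresholdClf` and `HighOrderModel`, and the accuracy-times-transition-prior scoring rule, are assumptions made for the sake of example.

```python
class ThresholdClf:
    """Toy per-concept classifier: labels a point by comparing it to a
    threshold. One such fitted model stands in for each mined concept."""
    def __init__(self, threshold, pos_above=True):
        self.threshold = threshold
        self.pos_above = pos_above  # if False, the labeling is inverted

    def predict(self, xs):
        return [int((x > self.threshold) == self.pos_above) for x in xs]


class HighOrderModel:
    """Per-concept classifiers plus a concept-transition prior mined
    offline from a historical stream. At run time, pick the concept whose
    classifier best explains the most recent labeled window, weighted by
    the transition prior from the previously active concept."""
    def __init__(self, classifiers, transition, start=0):
        self.classifiers = classifiers  # one fitted model per concept
        self.transition = transition    # transition[i][j] = P(concept j | concept i)
        self.current = start

    def select_concept(self, xs_recent, ys_recent):
        # Score each concept: accuracy of its classifier on the recent
        # labeled window, times the historical transition probability.
        scores = []
        for cid, clf in enumerate(self.classifiers):
            preds = clf.predict(xs_recent)
            acc = sum(p == y for p, y in zip(preds, ys_recent)) / len(ys_recent)
            scores.append(acc * self.transition[self.current][cid])
        self.current = max(range(len(scores)), key=scores.__getitem__)
        return self.current

    def predict(self, xs):
        # Classify unlabeled data with the currently selected concept's model.
        return self.classifiers[self.current].predict(xs)


# Usage: two mined concepts with opposite labelings and a uniform prior.
concepts = [ThresholdClf(0.5, True), ThresholdClf(0.5, False)]
hom = HighOrderModel(concepts, [[0.5, 0.5], [0.5, 0.5]])
# A recent labeled window whose labels match the inverted concept:
hom.select_concept([0.1, 0.9, 0.2, 0.8], [1, 0, 1, 0])  # selects concept 1
```

The scoring rule acts as a simple Bayesian filter: the transition prior keeps the model from flip-flopping on noisy windows, while the accuracy term lets strong evidence in the online training stream override the prior when a genuine concept switch occurs.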
