A framework for on-demand classification of evolving data streams

Current models of the classification problem do not effectively handle bursts of particular classes arriving at different times. Instead, they concentrate on one-pass classification modeling of very large data sets. Our model treats data stream classification as a dynamic problem in which simultaneous training and test streams are used to classify the data. This reflects real-life situations, where it is desirable to classify test streams in real time while both the training and the test stream evolve. The aim is a classification system whose training model adapts quickly to changes in the underlying data stream. To achieve this goal, we propose an on-demand classification process that dynamically selects the appropriate window of past training data with which to build the classifier. The empirical results indicate that the system maintains a high classification accuracy on an evolving data stream while providing an efficient solution to the classification task.
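
The idea of selecting a window of past training data on demand can be illustrated with a minimal sketch. It is not the method of the paper: the nearest-centroid classifier, the candidate horizons, the held-out sample of recent labeled points, and the names `class_centroids`, `predict`, and `select_horizon` are all assumptions made for illustration. The sketch builds one classifier per candidate horizon from the most recent training points and keeps the horizon that scores best on the held-out sample.

```python
from collections import defaultdict
from math import dist


def class_centroids(window):
    """Return {label: centroid} for a list of (features, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for x, y in window:
        if y not in sums:
            sums[y] = list(x)
        else:
            sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, x):
    """Assign x to the label of the nearest class centroid (illustrative classifier)."""
    return min(centroids, key=lambda y: dist(centroids[y], x))


def select_horizon(train, holdout, horizons):
    """Pick the horizon (number of most recent training points) that gives
    the best accuracy on the held-out labeled sample. Ties keep the earlier
    candidate in `horizons`."""
    best_h, best_acc = None, -1.0
    for h in horizons:
        centroids = class_centroids(train[-h:])
        correct = sum(predict(centroids, x) == y for x, y in holdout)
        acc = correct / len(holdout)
        if acc > best_acc:
            best_h, best_acc = h, acc
    return best_h, best_acc


# Hypothetical stream in which the two classes swap locations near the end.
old = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")] * 20
recent = [((5.0, 5.0), "a"), ((0.0, 0.0), "b")] * 5
train = old + recent
holdout = [((5.0, 5.0), "a"), ((0.0, 0.0), "b")] * 3

best_h, best_acc = select_horizon(train, holdout, horizons=[10, 50])
print(best_h, best_acc)
```

In this toy stream the short horizon is selected because the long one mixes in training data from before the change. In a stable stream a longer horizon would typically be preferred, since it supplies more training data.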
