An Overview on Mining Data Streams

The most challenging applications of knowledge discovery involve dynamic environments where data flow continuously at high speed and exhibit non-stationary properties. In this chapter we discuss the main challenges and the most relevant issues in knowledge discovery from data streams: incremental learning, cost-performance management, change detection, and novelty detection. We present illustrative algorithms for these learning tasks and a real-world application that illustrates the advantages of stream processing. The chapter ends with some open issues that emerge from this new research area.
