Change detection in categorical evolving data streams

Detecting change in evolving data streams is central to accurate adaptive learning. Real-world data streams often contain categorical features, yet changes in the distribution of such features have received little attention so far. Previous work on change detection has focused on changes in learner accuracy, without considering changes in the data distribution itself. To address these issues, we propose CDCStream (Change Detection in Categorical Data Streams), a new unsupervised change detection method designed for categorical data streams. The method detects changes in a batch-incremental scenario and rests on two components: (i) a summarization strategy that compresses the current batch into a descriptive summary, and (ii) a new segmentation algorithm that highlights changes and issues warnings for the data stream. To evaluate our proposal, we employ it in a learning task over real-world data and compare its results with state-of-the-art methods. We also report a qualitative evaluation to illustrate the behavior of CDCStream.
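The two components described above can be illustrated with a minimal Python sketch. It is not the CDCStream implementation: the abstract does not specify the summary or the segmentation rule, so the sketch assumes a stand-in summary (mean normalized entropy of the attributes in a batch) and a stand-in rule that compares each new summary with the mean and standard deviation of the summaries seen since the last change. The names `batch_summary`, `detect`, `warn_k`, `change_k`, `min_history`, and `min_std` are hypothetical.

```python
import math
from collections import Counter


def batch_summary(batch):
    """Stand-in summary (an assumption, not the paper's statistic):
    mean normalized entropy over the categorical attributes of one batch."""
    n_rows = len(batch)
    n_attrs = len(batch[0])
    total = 0.0
    for j in range(n_attrs):
        counts = Counter(row[j] for row in batch)
        entropy = -sum((c / n_rows) * math.log2(c / n_rows)
                       for c in counts.values())
        k = len(counts)
        # Normalize by the maximum entropy log2(k); a constant attribute scores 0.
        total += entropy / math.log2(k) if k > 1 else 0.0
    return total / n_attrs


def detect(batches, warn_k=2.0, change_k=3.0, min_history=3, min_std=0.05):
    """Label each batch as 'stable', 'warning', or 'change'.

    Stand-in segmentation rule (an assumption): a batch is flagged when its
    summary deviates from the historical mean by more than warn_k or change_k
    standard deviations. min_std is a floor that avoids a zero threshold.
    """
    history = []
    labels = []
    for batch in batches:
        s = batch_summary(batch)
        label = "stable"
        if len(history) >= min_history:
            mean = sum(history) / len(history)
            var = sum((x - mean) ** 2 for x in history) / len(history)
            std = max(math.sqrt(var), min_std)
            deviation = abs(s - mean)
            if deviation > change_k * std:
                label = "change"
            elif deviation > warn_k * std:
                label = "warning"
        if label == "change":
            history = [s]  # start a new segment from the changed batch
        else:
            history.append(s)
        labels.append(label)
    return labels


# Usage: four balanced batches followed by one batch with constant attributes.
balanced = [("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")]
constant = [("a", "x")] * 4
print(detect([balanced] * 4 + [constant]))
```

The detector needs no class labels, which matches the unsupervised, batch-incremental setting; any other batch summary could be substituted for `batch_summary` without changing the loop in `detect`.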
