Incremental entropy-based clustering on categorical data streams with concept drift

Clustering categorical data streams is a relatively new field that has received less attention than clustering static data or numerical data streams. One of the main difficulties in categorical data analysis is the lack of an appropriate way to define a similarity or dissimilarity measure on the data. In this paper, we propose three dissimilarity measures: a point-cluster dissimilarity measure (based on incremental entropy), a cluster-cluster dissimilarity measure (based on incremental entropy) and a dissimilarity measure between two cluster distributions (based on sample standard deviation). We then propose an integrated framework for clustering categorical data streams with three algorithms: Minimal Dissimilarity Data Labeling (MDDL), Concept Drift Detection (CDD) and Cluster Evolving Analysis (CEA). We also compare the framework with other algorithms on several data streams synthesized from real data sets. The experiments show that the proposed algorithms are more effective at generating clustering results and detecting concept drift.
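To make the idea of an entropy-based point-cluster dissimilarity concrete, the following is a minimal sketch and not the paper's exact definition. It assumes that a cluster is summarized by per-attribute value counts, that the cluster entropy is the sum of the Shannon entropies of its attributes, and that the dissimilarity of a point to a cluster is the increase in size-weighted entropy caused by adding the point. The names `Cluster`, `entropy_increase` and `label_point` are hypothetical and used only for illustration.

```python
from collections import Counter
from math import log2


class Cluster:
    """Hypothetical cluster summary: one value counter per categorical attribute."""

    def __init__(self, n_attrs):
        self.n = 0
        self.counts = [Counter() for _ in range(n_attrs)]

    def add(self, point):
        # Update the summary incrementally; the raw points are not stored.
        self.n += 1
        for counter, value in zip(self.counts, point):
            counter[value] += 1

    def entropy(self):
        # Assumption: attributes are treated independently, so the cluster
        # entropy is the sum of the per-attribute Shannon entropies (in bits).
        if self.n == 0:
            return 0.0
        total = 0.0
        for counter in self.counts:
            for c in counter.values():
                p = c / self.n
                total -= p * log2(p)
        return total


def entropy_increase(cluster, point):
    """Assumed point-cluster dissimilarity: growth of n * H(cluster) after adding point."""
    before = cluster.n * cluster.entropy()
    trial = Cluster(len(cluster.counts))
    trial.n = cluster.n
    trial.counts = [Counter(c) for c in cluster.counts]  # copy, leave original intact
    trial.add(point)
    return trial.n * trial.entropy() - before


def label_point(clusters, point):
    """Return the index of the cluster with minimal dissimilarity to the point."""
    return min(range(len(clusters)), key=lambda i: entropy_increase(clusters[i], point))


if __name__ == "__main__":
    a, b = Cluster(2), Cluster(2)
    for _ in range(2):
        a.add(("red", "small"))
        b.add(("blue", "large"))
    print(label_point([a, b], ("red", "small")))
```

Under these assumptions, a point that matches every value already seen in a cluster adds no entropy to it, so it is labeled with that cluster, which mirrors the minimal-dissimilarity labeling idea described in the abstract.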
