RCRDE: A Method for Reducing the Rate of Re-Clustering, using Replicated Data Eliminate Algorithm

This paper explores a way to reduce the rate of re-clustering and speed up the clustering process on categorical time-evolving data. The method introduces two algorithms, RDE (Replicated Data Elimination) and RCRDE. The RDE algorithm eliminates the repeated examination of replicated data and uses counters to retain that data. This limits the number of windows created by the sliding window technique, which in turn reduces the number of times the clustering algorithm has to be run. The RCRDE algorithm, based on the MARDL (MAximal Resemblance Data Labeling) framework, decides whether to re-cluster or to modify the previous clustering results. The presented method is independent of the type of clustering algorithm, so any categorical clustering algorithm can be used. According to the results obtained on different data sets, the method performs well in practice and facilitates clustering of categorical data. It can also be used to cluster a very large static categorical database with higher quality than previous work.
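The idea behind RDE can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it assumes that "replicated data" means runs of identical consecutive records, that each run is kept as one record plus a counter, and that windows are fixed-size and non-overlapping. The names `eliminate_replicated` and `split_into_windows` are hypothetical and chosen only for this illustration.

```python
from itertools import groupby
from typing import Hashable, List, Sequence, Tuple

Record = Tuple[Hashable, ...]  # one categorical data point, e.g. ("a", "x")


def eliminate_replicated(records: Sequence[Record]) -> List[Tuple[Record, int]]:
    """Collapse each run of identical consecutive records into (record, count).

    Assumption: only consecutive repeats are treated as replicated data.
    """
    return [(rec, sum(1 for _ in run)) for rec, run in groupby(records)]


def split_into_windows(items: Sequence, size: int) -> List[Sequence]:
    """Cut a sequence into fixed-size, non-overlapping windows (an assumption)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


stream = [
    ("a", "x"), ("a", "x"), ("a", "x"),
    ("b", "y"), ("b", "y"),
    ("a", "x"),
    ("c", "z"),
]

compressed = eliminate_replicated(stream)
raw_windows = split_into_windows(stream, 2)
compressed_windows = split_into_windows(compressed, 2)

# Fewer windows means fewer runs of the clustering algorithm.
print(len(raw_windows), len(compressed_windows))
```

In this toy stream the counters preserve how often each record occurred, so a downstream categorical clustering step could still weight records by frequency while processing fewer windows.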
