A Study in Employing Rough Set Based Approach for Clustering on Categorical Time-Evolving Data

As the size of data grows, clustering a very large data set becomes difficult and time-consuming. Sampling is an important technique for scaling down the size of a data set and improving the efficiency of clustering. After sampling, however, allocating the unlabeled data points to the proper clusters is difficult in the categorical domain. Moreover, in real situations data changes over time. Clustering this type of data not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results. Both problems, allocating unlabeled data points to the proper clusters after sampling and finding clustering results when the data changes over time, are difficult in the categorical domain. In this paper, a rough set based method that uses the node importance technique is proposed to label unlabeled data points and to find the next clustering result based on the previous clustering result.
