Scalable Clustering Using Rank Based Pre-processing Technique for Mixed Data Sets Using Enhanced Rock Algorithm

Clustering real-world data sets requires scalability, the ability to handle both categorical and numerical data, and the capability to handle noisy and missing data. Traditional algorithms can cluster either categorical or numerical data, but not both. In general, clustering mixed data types is tedious, but it yields better clusters with more accurate results. Preprocessing techniques are another important factor affecting cluster quality. To meet these requirements, we propose a clustering methodology that enhances the performance of the scalable ROCK clustering algorithm. The approach has two steps: (1) numerical attributes are converted into categorical ones and missing values are filled using a rank-based method; (2) clustering is performed using the ROCK algorithm. The combination of these steps is called the EROCK algorithm. Experimental results obtained with this methodology are compared with those of the EM and CLOPE algorithms. The results show that the new methodology performs well on real-world data sets and is effective.
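
The two steps can be illustrated with a minimal Python sketch; it is not the authors' EROCK implementation. Several things here are assumptions: equal-width binning stands in for the numeric-to-categorical conversion, the rank-based filling of missing values is not reproduced (missing values are simply ignored when comparing records), each record is counted as its own neighbor, and the names `discretize`, `as_items`, `jaccard`, and `rock_links` are hypothetical. The link count, the number of common neighbors under a Jaccard similarity threshold, follows the general idea of ROCK.

```python
from itertools import combinations

def discretize(values, bins=3):
    """Equal-width binning (an assumed scheme); missing values (None) stay None."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    return [None if v is None else f"bin{min(int((v - lo) / width), bins - 1)}"
            for v in values]

def as_items(record):
    """Represent a categorical record as a set of (attribute index, value) pairs."""
    return {(i, v) for i, v in enumerate(record) if v is not None}

def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rock_links(records, theta):
    """Count common neighbors for every pair of records.

    Two records are neighbors when their Jaccard similarity is at least theta.
    Assumption: a non-empty record is its own neighbor (similarity 1.0).
    """
    items = [as_items(r) for r in records]
    n = len(items)
    neighbors = [{j for j in range(n) if jaccard(items[i], items[j]) >= theta}
                 for i in range(n)]
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}

# Toy mixed data (invented for illustration): one numeric and one categorical attribute.
ages = [21, 23, None, 60, 62]
colors = ["red", "red", "red", "blue", "blue"]
age_bins = discretize(ages, bins=3)
records = list(zip(age_bins, colors))
links = rock_links(records, theta=0.5)
```

In this toy data the three "red" records share links with one another and none with the two "blue" records, which is the kind of signal a link-based merge criterion would use to keep the groups apart.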
