A Data Set Oriented Approach for Clustering Algorithm Selection

In recent years, the availability of huge transactional and experimental data sets and the resulting requirements for data mining have created a need for clustering algorithms that scale and can be applied in diverse domains. Consequently, a variety of algorithms have been proposed; they are applied in different fields and may produce different partitionings of the same data set, depending on the specific clustering criterion used. Moreover, since clustering is an unsupervised process, most algorithms rely on assumptions in order to define a partitioning of a data set. In most applications, therefore, the final clustering scheme requires some form of evaluation. In this paper we present a clustering validity procedure that takes into account the inherent features of a data set in order to evaluate the results of different clustering algorithms applied to it. A validity index, S_Dbw, is defined according to well-known clustering criteria so as to enable the selection of the algorithm that provides the best partitioning of a data set. We evaluate the reliability of our approach both theoretically and experimentally, considering three representative clustering algorithms run on synthetic and real data sets. The approach performed favorably in all studies, giving an indication of the algorithm that is suitable for the application considered.
