Cluster Detection with the PYRAMID Algorithm

As databases continue to grow, efficient and effective clustering algorithms play a central role in data mining applications. Practical clustering faces several challenges: identifying clusters of arbitrary shape, sensitivity to input order, determining the number of clusters dynamically, handling outliers, processing massive data sets quickly, handling higher dimensions, and dependence on user-supplied parameters. Many studies have addressed one or more of these challenges. PYRAMID (parallel hybrid clustering using genetic programming and multi-objective fitness with density) is an algorithm that we introduced in previous research to address some of them. PYRAMID combines data parallelism, a form of genetic programming, and a multi-objective density-based fitness function in the context of clustering, while leaving significant challenges, such as handling higher dimensions, for future work. This study extends our previous research by exploring the detection capability of PYRAMID against a challenging dataset and evaluating its independence from user-supplied parameters.
