NOCEA: A rule-based evolutionary algorithm for efficient and effective clustering of massive high-dimensional databases

Clustering is a descriptive data mining task that partitions data into homogeneous groups. This paper presents a novel evolutionary algorithm (NOCEA) that efficiently and effectively clusters massive numerical databases. NOCEA evolves variable-length individuals, each consisting of disjoint, axis-aligned hyper-rectangular rules with homogeneous data distribution. The antecedent part of each rule includes an interval-like condition for each dimension. A novel quantisation algorithm imposes a regular multi-dimensional grid structure onto the data space to reduce the number of search combinations. Because of quantisation, the boundaries of the intervals are encoded as integer values. The evolutionary search is guided by a simple data coverage maximisation function. The enormous data space is explored effectively by task-specific recombination and mutation operators that produce candidate solutions with no overlapping rules. A parsimony generalisation operator shortens the discovered knowledge by replacing adjacent rules with more generic ones. NOCEA also employs a special homogeneity operator that enforces a quasi-uniform data distribution in the space enclosed by the candidate rules. After convergence, the discovered knowledge is simplified to perform subspace clustering and to assemble the clusters. Results on real-world datasets show that NOCEA has several attractive properties for clustering, including: (a) comprehensible output in the form of disjoint and homogeneous rules, (b) the ability to discover clusters of arbitrary shape, density, size, and data coverage, (c) the ability to perform effective subspace clustering, (d) near-linear scalability with the database size and with data and cluster dimensionality, and (e) substantial potential for task parallelism (a speedup of 13.8 on 16 processors). The real-world example is a detailed study of the seismicity along the African-Eurasian-Arabian plate boundaries.
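To make the rule encoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes a uniform-width grid in place of the paper's own quantisation algorithm, and it assumes the coverage fitness is the fraction of data points that fall inside at least one rule. The names `Rule`, `quantise`, `is_disjoint`, and `coverage` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Cell = Tuple[int, ...]


@dataclass(frozen=True)
class Rule:
    """Axis-aligned hyper-rectangle on the grid (hypothetical encoding).

    One inclusive (low, high) pair of integer cell indices per dimension.
    """
    intervals: Tuple[Tuple[int, int], ...]

    def covers(self, cell: Cell) -> bool:
        # A cell is covered when every coordinate lies inside its interval.
        return all(lo <= c <= hi for (lo, hi), c in zip(self.intervals, cell))

    def overlaps(self, other: "Rule") -> bool:
        # Two hyper-rectangles overlap only if their intervals
        # intersect in every dimension.
        return all(
            a_lo <= b_hi and b_lo <= a_hi
            for (a_lo, a_hi), (b_lo, b_hi) in zip(self.intervals, other.intervals)
        )


def quantise(point: Sequence[float], mins: Sequence[float],
             widths: Sequence[float]) -> Cell:
    # Assumption: uniform bin widths; the paper uses its own quantisation scheme.
    return tuple(int((x - m) // w) for x, m, w in zip(point, mins, widths))


def is_disjoint(rules: Sequence[Rule]) -> bool:
    # An individual is valid here only if no two of its rules overlap.
    return not any(
        rules[i].overlaps(rules[j])
        for i in range(len(rules))
        for j in range(i + 1, len(rules))
    )


def coverage(rules: Sequence[Rule], cells: Sequence[Cell]) -> float:
    # Assumed fitness: fraction of data points inside at least one rule.
    if not cells:
        return 0.0
    covered = sum(1 for c in cells if any(r.covers(c) for r in rules))
    return covered / len(cells)


# Toy two-dimensional data, quantised onto a grid with unit-width bins.
points = [(0.5, 0.5), (1.5, 0.2), (3.9, 3.1), (9.0, 9.0)]
cells = [quantise(p, mins=(0.0, 0.0), widths=(1.0, 1.0)) for p in points]

# A variable-length individual is simply a list of rules.
individual = [Rule(((0, 1), (0, 0))), Rule(((3, 4), (3, 4)))]
fitness = coverage(individual, cells)
```

In this toy example the two rules are disjoint and together cover three of the four points, so the assumed fitness is 0.75. Because the interval bounds are integers, mutation and recombination operators of the kind the abstract describes could act on them as small integer shifts, although the sketch does not attempt to reproduce those operators.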
