AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms

Most existing data mining algorithms have been produced manually, that is, they were developed by a human programmer. A prominent Artificial Intelligence research area is automatic programming: the generation of a computer program by another computer program. Clustering is an important data mining task with many useful real-world applications. In particular, the class of clustering algorithms that use data density to identify clusters has many advantages, such as the ability to identify clusters of arbitrary shape. We propose the use of Estimation of Distribution Algorithms (EDAs) for the automatic generation of density-based clustering algorithms. To guarantee that only valid algorithms are generated, we define a directed acyclic graph (DAG) in which each node represents a procedure (building block) and each edge represents a possible execution sequence between two nodes. The building-block DAG specifies the alphabet of the EDA, that is, the set of all algorithms that can possibly be generated. Preliminary experimental results compare the clustering algorithms automatically generated by AutoClustering with DBSCAN, a well-known manually designed algorithm.
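
The interplay between the building-block DAG and the EDA can be illustrated with a minimal Python sketch. It is not the AutoClustering implementation: the node names, the univariate edge-probability model, the update rule with its `rate` parameter, and the `toy_fitness` function are all assumptions made for illustration. The sketch shows only that sampling a path from a start node to an end node always yields a valid sequence of building blocks, and that edge probabilities can be shifted toward the better sampled candidates.

```python
import random

# Hypothetical building-block DAG: node names are illustrative only.
# Each key is a procedure; its list holds the procedures that may run next.
DAG = {
    "start": ["estimate_density_knn", "estimate_density_grid"],
    "estimate_density_knn": ["select_core_points"],
    "estimate_density_grid": ["select_core_points", "merge_dense_cells"],
    "select_core_points": ["expand_clusters"],
    "merge_dense_cells": ["expand_clusters", "end"],
    "expand_clusters": ["assign_noise", "end"],
    "assign_noise": ["end"],
    "end": [],
}

def init_model(dag):
    """Start with a uniform distribution over each node's outgoing edges."""
    return {n: {s: 1.0 / len(succ) for s in succ}
            for n, succ in dag.items() if succ}

def sample_algorithm(model, rng):
    """Walk from 'start' to 'end'; any such path follows only DAG edges."""
    path, node = [], "start"
    while node != "end":
        successors = list(model[node])
        weights = [model[node][s] for s in successors]
        node = rng.choices(successors, weights=weights)[0]
        if node != "end":
            path.append(node)
    return path

def update_model(model, selected, rate=0.5):
    """Move edge probabilities toward edge frequencies in the selected paths."""
    counts = {n: {s: 0 for s in succ} for n, succ in model.items()}
    for path in selected:
        full = ["start"] + path + ["end"]
        for a, b in zip(full, full[1:]):
            counts[a][b] += 1
    for n, succ in model.items():
        total = sum(counts[n].values())
        if total == 0:
            continue  # node unused by the selected paths: keep its distribution
        for s in succ:
            succ[s] = (1 - rate) * succ[s] + rate * counts[n][s] / total
    return model

def toy_fitness(path):
    """Placeholder score (assumption): a real system would run the generated
    algorithm on data and measure clustering quality instead."""
    return -len(path)

rng = random.Random(0)
model = init_model(DAG)
for _ in range(10):
    population = [sample_algorithm(model, rng) for _ in range(20)]
    population.sort(key=toy_fitness, reverse=True)
    model = update_model(model, population[:5])  # keep the 5 best candidates
```

Because candidates are produced only by following edges of the DAG, no repair step for invalid individuals is needed in this sketch; the probabilistic model simply biases which valid paths are sampled more often.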
