A Multi-objective Sequential Ensemble for Cluster Structure Analysis and Visualization and Application to Gene Expression

With very large, high-dimensional datasets, it is important to investigate and visualize how patterns are connected within large clusters of arbitrary shape. Density-based and distance-relatedness-based clustering algorithms efficiently discover clusters of arbitrary shapes and densities, whereas classical (yet less efficient) clustering algorithms can be used to analyze and visualize the internal structure of those clusters. This work proposes a sequential ensemble that applies an efficient distance-relatedness-based clustering algorithm, “Mitosis”, followed by the centre-based K-means algorithm. K-means segments each cluster obtained by Mitosis into a number of subclusters. Applied to gene expression datasets, the ensemble reveals the gradual change of patterns.
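The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: Mitosis is not reproduced here, and stage one is replaced by a much simpler stand-in that links points lying within a fixed radius of each other (single-linkage connectivity). The function names, the `radius` and `k` parameters, and the toy data are all assumptions made for illustration; only the overall sequence (arbitrary-shape clustering first, then K-means inside each cluster) follows the description above.

```python
import math
import random


def connected_clusters(points, radius):
    """Stage 1 stand-in (NOT Mitosis): label points that are linked by
    chains of neighbours no farther apart than `radius`."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def nearest_centre(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda c: math.dist(point, centres[c]))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one subcluster index per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [nearest_centre(p, centres) for p in points]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centre if a subcluster is empty
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def sequential_ensemble(points, radius, k):
    """Stage 1 finds clusters; stage 2 splits each into at most k subclusters.
    Returns a (cluster, subcluster) pair for every input point."""
    labels = connected_clusters(points, radius)
    result = [None] * len(points)
    for cluster in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        sub = kmeans([points[i] for i in idx], min(k, len(idx)))
        for i, s in zip(idx, sub):
            result[i] = (cluster, s)
    return result


# Toy data (hypothetical): two elongated, well-separated clusters.
points = [(float(x), 0.0) for x in range(10)] + [(float(x), 100.0) for x in range(10)]
result = sequential_ensemble(points, radius=1.5, k=2)
```

Each point ends up with a pair of labels, so the subclusters of one cluster can be ordered or plotted to inspect how patterns change along it. In practice the stage-one stand-in would be replaced by Mitosis or another arbitrary-shape clustering algorithm.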
