Link-based cluster ensembles for heterogeneous biological data analysis

Traditional cancer prognosis has relied primarily on clinical data. However, this classic approach may be ineffective for analyzing morphologically indistinguishable tumor subtypes, and microarray technology has emerged as a promising alternative. Despite a large number of microarray studies, the clinical application of gene expression data analysis remains limited by the complexity and noise level of the generated data. Recently, integrative cluster analysis of both clinical and gene expression data has been shown to be an effective way to overcome these problems. This paper presents a novel cluster ensemble method for accurately analyzing heterogeneous biological data. It overcomes the problem of selecting an appropriate clustering algorithm, or a parameter setting for any candidate algorithm, which is especially difficult with a new dataset. Evaluation on real biological and benchmark datasets suggests that the proposed model yields higher-quality results than many state-of-the-art cluster ensemble techniques and standard clustering algorithms. Its performance is also robust to parameter perturbation, making it a reliable and useful tool for data analysts and bioinformaticians. Online supplementary material is available at http://users.aber.ac.uk/nii07/bibm2010.
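To illustrate the general idea of a cluster ensemble, the following minimal sketch combines several base partitions through a pairwise co-association matrix and a simple thresholded consensus. It is a generic illustration only and is not the link-based method proposed in the paper; the function names, the six-sample base partitions, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    co = [[0.0] * n for _ in range(n)]
    for i in range(n):
        co[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in partitions if labels[i] == labels[j])
        co[i][j] = co[j][i] = agree / m
    return co


def consensus(co, threshold=0.5):
    """Link samples whose co-association exceeds the threshold and
    return the connected components as consensus cluster labels."""
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical base partitions of six samples, standing in for the output of
# different clustering algorithms or parameter settings (assumed, not from the paper).
base_partitions = [
    [0, 0, 0, 1, 1, 1],  # two clusters
    [0, 0, 1, 1, 2, 2],  # three clusters, split differently
    [1, 1, 1, 0, 0, 0],  # same grouping as the first, labels permuted
]

co = co_association(base_partitions)
final_labels = consensus(co)
print(final_labels)  # [0, 0, 0, 1, 1, 1]
```

Because the consensus depends only on how often pairs of samples are grouped together, it is unaffected by label permutations between base partitions and does not require committing to a single algorithm or number of clusters in advance.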
