MicroClAn: Microarray clustering analysis

Evaluating clustering results is a fundamental task in microarray data analysis, because there is not enough biological knowledge to know the true partition of genes in advance. Many quality indexes for gene clustering evaluation have been proposed. A critical issue in this domain is how to compare and aggregate quality indexes in order to select the best clustering algorithm and the optimal parameter setting for a dataset. A second critical issue is execution time: microarray experiments generate huge amounts of data, and computing biological indexes requires external resources such as ontologies, so performance degrades. Distributing the computation of algorithms and quality indexes therefore becomes essential. To address these issues, this paper presents the MicroClAn framework, a distributed system to evaluate and compare clustering algorithms using the most widely used quality indexes. The best solution is selected through a two-step ranking aggregation of the ranks produced by quality indexes. A new index oriented to the biological validation of microarray clustering results is also introduced. Several scheduling strategies integrated in the framework distribute tasks in the grid environment to optimize the completion time. Experimental results show the effectiveness of our aggregation strategy in identifying the best ranking among different clustering algorithms. Moreover, our framework achieves good performance in terms of completion time with few computational resources.
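The abstract does not specify how the two-step ranking aggregation works, so the following is only a minimal sketch of one plausible reading, not the MicroClAn method. It assumes that quality indexes are grouped into categories (the names `internal` and `biological` are hypothetical), that each index ranks the same set of clustering algorithms, and that a simple Borda count (sum of positions) is used at both steps. The algorithm names and rankings are illustrative, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def borda_aggregate(rankings: List[List[str]]) -> List[str]:
    """Merge several rankings (best item first) by summing positions.

    Assumption: every ranking contains the same items.
    """
    totals: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            totals[item] += position
    # Lower total position is better; ties are broken alphabetically
    # so that the result is deterministic.
    return sorted(totals, key=lambda item: (totals[item], item))


def two_step_aggregate(groups: Dict[str, List[List[str]]]) -> List[str]:
    """Step 1: aggregate rankings inside each (hypothetical) index category.
    Step 2: aggregate the per-category rankings into one final ranking."""
    per_category = [borda_aggregate(rankings) for rankings in groups.values()]
    return borda_aggregate(per_category)


# Illustrative input: each inner list is the ranking produced by one index.
groups = {
    "internal": [
        ["kmeans", "hierarchical", "som"],
        ["kmeans", "som", "hierarchical"],
        ["hierarchical", "kmeans", "som"],
    ],
    "biological": [
        ["kmeans", "hierarchical", "som"],
        ["hierarchical", "kmeans", "som"],
        ["kmeans", "som", "hierarchical"],
    ],
}

final_ranking = two_step_aggregate(groups)
print(final_ranking)
```

Aggregating within categories first keeps a category with many indexes from dominating the final ranking, which is one reason a two-step scheme might be preferred over pooling all indexes at once; whether the paper is motivated the same way is not stated in the abstract.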
