A Biologically Inspired Validity Measure for Comparison of Clustering Methods over Metabolic Data Sets

In the biological domain, clustering rests on the assumption that genes or metabolites involved in a common biological process are coexpressed or coaccumulated under the control of the same regulatory network. A detailed inspection of the grouped patterns to verify their membership in well-known metabolic pathways can therefore be very useful for evaluating clusters from a biological perspective. The aim of this work is to propose a novel approach for comparing clustering methods over metabolic data sets that incorporates prior biological knowledge about the relations among the elements that constitute the clusters. A way of measuring the biological significance of clustering solutions is proposed, addressed from the perspective of how useful the clusters are for identifying patterns that change in coordination and belong to common pathways of metabolic regulation. The measure summarizes in a compact way the objective analysis of clustering methods with respect to cluster coherence and distribution. It also evaluates the internal biological connections of the clusters by considering common pathways. The proposed measure was tested on two biological databases using three clustering methods.
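
The idea of checking whether elements grouped together also share a metabolic pathway can be illustrated with a minimal sketch. This is not the measure proposed in the paper, whose definition is not given in this abstract; it is a simple stand-in that computes, for each cluster, the fraction of member pairs annotated with at least one common pathway. The function name `pathway_coherence`, the toy metabolites, and the pathway labels are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pathway_coherence(clusters, pathways):
    """Illustrative score only (an assumption, not the paper's measure).

    clusters: dict mapping a cluster id to a list of element names.
    pathways: dict mapping an element name to a set of pathway labels.
    Returns a dict mapping each cluster id to the fraction of its member
    pairs that share at least one pathway (0.0 when a cluster has no pairs).
    """
    scores = {}
    for cluster_id, members in clusters.items():
        pairs = list(combinations(sorted(set(members)), 2))
        if not pairs:
            scores[cluster_id] = 0.0
            continue
        # Count pairs whose pathway annotations intersect.
        shared = sum(
            1
            for a, b in pairs
            if pathways.get(a, set()) & pathways.get(b, set())
        )
        scores[cluster_id] = shared / len(pairs)
    return scores


# Hypothetical toy data: two clusters of metabolites with made-up annotations.
clusters = {
    "c1": ["citrate", "malate", "fumarate"],
    "c2": ["glucose", "proline"],
}
pathways = {
    "citrate": {"TCA cycle"},
    "malate": {"TCA cycle"},
    "fumarate": {"TCA cycle"},
    "glucose": {"glycolysis"},
    "proline": {"amino acid metabolism"},
}

scores = pathway_coherence(clusters, pathways)
```

Under these assumptions, a cluster whose members all share a pathway scores at the top of the range, while a cluster of unrelated elements scores at the bottom. A comparison of clustering methods along these lines would compute such scores for each method's solution on the same data set, alongside conventional coherence and distribution criteria.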
