Improving clustering with metabolic pathway

Background: It is a common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a-priori biological knowledge. This procedure helps finding functionally related patterns to propose hypotheses for their behavior and the biological processes involved. Therefore, this knowledge is used only as a second step, after data are just clustered according to their expression patterns. Thus, it could be very useful to be able to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters. Results: A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances among data points and neurons centroids includes a new term based on information from well-known metabolic pathways. The standard self-organizing map (SOM) training versus the biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from Solanumlycopersicum and Arabidopsisthaliana species. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results of bSOM show important improvements in the convergence and performance for the proposed clustering method in comparison to standard SOM training, in particular, from the application point of view. Conclusions: Analyses of the clusters obtained with bSOM indicate that including biological information during training can certainly increase the biological value of the clusters found with the proposed method. It is worth to highlight that this fact has effectively improved the results, which can simplify their further analysis. The algorithm is available as a web-demo at http://fich.unl.edu.ar/sinc/web-demo/bsom-lite/. The source code and thedatasetssupportingtheresultsofthisarticleareavailableathttp://sourceforge.net/projects/sourcesinc/files/bsom.

[1]  Staffan Persson,et al.  Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. , 2009, Plant, cell & environment.

[2]  Lyle H. Ungar,et al.  The CRASSS plug-in for integrating annotation data with hierarchical clustering results , 2004, Bioinform..

[3]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Douglas B. Kell,et al.  Computational cluster validation in post-genomic data analysis , 2005, Bioinform..

[5]  M. Orešič,et al.  Pathways to the analysis of microarray data. , 2005, Trends in biotechnology.

[6]  Nello Cristianini,et al.  A statistical framework for genomic data fusion , 2004, Bioinform..

[7]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[8]  Jason C. Mills,et al.  GOurmet: A tool for quantitative comparison and visualization of gene expression profiles based on gene ontology (GO) distributions , 2006, BMC Bioinformatics.

[9]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[10]  Andreas Zell,et al.  A memetic co-clustering algorithm for gene expression profiles and biological annotation , 2004, Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat. No.04TH8753).

[11]  Takayuki Tohge,et al.  Combining genetic diversity, informatics and metabolomics to facilitate annotation of plant gene function , 2010, Nature Protocols.

[12]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[13]  Hiroyuki Ogata,et al.  KEGG: Kyoto Encyclopedia of Genes and Genomes , 1999, Nucleic Acids Res..

[14]  M. Hirai,et al.  Elucidation of Gene-to-Gene and Metabolite-to-Gene Networks in Arabidopsis by Integration of Metabolomics and Transcriptomics* , 2005, Journal of Biological Chemistry.

[15]  J. Mesirov,et al.  Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Georgina Stegmayer,et al.  *omeSOM: a software for clustering and visualization of transcriptional and metabolite data mined from interspecific crosses of crop plants , 2010, BMC Bioinformatics.

[17]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[18]  Georgina Stegmayer,et al.  A Biologically Inspired Validity Measure for Comparison of Clustering Methods over Metabolic Data Sets , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[19]  Timothy M. D. Ebbels,et al.  Correlation Network Analysis reveals a sequential reorganization of metabolic and transcriptional states during germination and gene-metabolite relationships in developing seedlings of Arabidopsis , 2010, BMC Systems Biology.

[20]  Michael A. Siani-Rose,et al.  A Knowledge-Based Clustering Algorithm Driven by Gene Ontology , 2004, Journal of biopharmaceutical statistics.

[21]  Daniel Hanisch,et al.  Co-clustering of biological networks and gene expression data , 2002, ISMB.

[22]  Olivier Bodenreider,et al.  Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships , 2004, 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology.

[23]  Georgina Stegmayer,et al.  Neural network model for integration and visualization of introgressed genome and metabolite data , 2009, 2009 International Joint Conference on Neural Networks.

[24]  Raj Acharya,et al.  Clustering of diverse genomic data using information fusion , 2005, Bioinform..

[25]  O. Rubel,et al.  Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[26]  Simon Kasif,et al.  Hierarchical tree snipping: clustering guided by prior knowledge , 2007, Bioinform..

[27]  V. Lacroix,et al.  An Introduction to Metabolic Networks and Their Structural Analysis , 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[28]  Teuvo Kohonen,et al.  Essentials of the self-organizing map , 2013, Neural Networks.

[29]  Atul J. Butte,et al.  Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks , 2005, BMC Bioinformatics.

[30]  Petri Törönen,et al.  Selection of informative clusters from hierarchical cluster tree with gene classes , 2004, BMC Bioinformatics.