Data Mining Over Biological Datasets: An Integrated Approach Based on Computational Intelligence

Biology is in the middle of a data explosion. Recent technical advances in genomics, metabolomics, transcriptomics and proteomics have greatly increased the amount of data available to biologists for analyzing different aspects of an organism. However, *omics data sets pose additional problems: they are inherently biologically complex and may contain significant amounts of noise as well as measurement artifacts. Extracting information from such databases has therefore once again become a challenge. It requires novel computational techniques and models that automatically perform data mining tasks such as integration of different data types, clustering and knowledge discovery, among others. In this article, we present a novel integrated computational intelligence approach for biological data mining that involves neural networks and evolutionary computation. We propose the use of self-organizing maps for the identification of coordinated pattern variations; a new training algorithm that can include a priori biological information to obtain more biologically meaningful clusters; a validation measure that can assess the biological significance of the clusters found; and finally, an evolutionary algorithm for the inference of unknown metabolic pathways involving the selected clusters.
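As a rough illustration of the clustering step only, the following is a minimal sketch of a plain online self-organizing map written in pure Python. It is not the training algorithm proposed in this article, since it uses no a priori biological information. The function names, the toy profiles (gene_A, metab_X and so on) and all parameter values are hypothetical choices made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the unit whose weights are closest to x."""
    return min(
        weights,
        key=lambda u: sum((wi - xi) ** 2 for wi, xi in zip(weights[u], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise every unit with a copy of a randomly chosen sample.
    weights = {
        (r, c): list(rng.choice(data))
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood shrinks, floored at 0.5
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Hypothetical, already-normalised profiles over three conditions:
# transcripts and metabolites are clustered together on the same map.
profiles = {
    "gene_A": [0.1, 0.5, 0.9],
    "gene_B": [0.0, 0.4, 1.0],
    "metab_X": [0.2, 0.5, 0.8],
    "gene_C": [0.9, 0.5, 0.1],
    "metab_Y": [1.0, 0.6, 0.0],
}

weights = train_som(list(profiles.values()), rows=1, cols=2)
clusters = {name: best_matching_unit(weights, v) for name, v in profiles.items()}
for name, unit in clusters.items():
    print(name, "->", unit)
```

In this toy setting, profiles that rise across the conditions should map to one unit and those that fall to the other, which is the sense in which a map unit groups genes and metabolites with coordinated variation. A real analysis would use a larger map and properly normalised measurements.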
