Spice: discovery of phenotype-determining component interplays

Background

The latent behavior of a biological cell is complex. Deriving the underlying simplicity, that is, the fundamental rules governing this behavior, has been the Holy Grail of systems biology. Data-driven prediction of the system components, and of the interplays among them, that are responsible for the target system’s phenotype is a key and challenging step in this endeavor.

Results

The proposed approach, which we call System Phenotype-related Interplaying Components Enumerator (Spice), iteratively enumerates statistically significant system components that are hypothesized (1) to play an important role in defining the specificity of the target system’s phenotype(s); (2) to exhibit functionally coherent behavior, namely, to act in a coordinated manner to perform the phenotype-specific function; and (3) to improve the predictive skill for the system’s phenotype(s) when used collectively in an ensemble of predictive models. Spice can be applied to both instance-based data and network-based data. When validated, Spice effectively identified system components related to three target phenotypes: biohydrogen production, motility, and cancer. Manual curation of the results agreed with the known phenotype-related system components reported in the literature. Additionally, using the identified system components as discriminatory features improved the prediction accuracy by 10% on the phenotype-classification task when compared to a number of state-of-the-art methods applied to eight benchmark microarray data sets.

Conclusion

We formulate a problem, the enumeration of phenotype-determining system component interplays, and propose an effective methodology (Spice) to address it. Spice improved the identification of cancer-related groups of genes from various microarray data sets and detected groups of genes associated with microbial biohydrogen production and motility, many of which were reported in the literature. Spice also improved the predictive skill of the system’s phenotype determination compared to individual classifiers and/or other ensemble methods, such as bagging, boosting, random forest, nearest shrunken centroid, and the random forest variable selection method.
