Microbial Forensics: Predicting Phenotypic Characteristics and Environmental Conditions from Large-Scale Gene Expression Profiles

A tantalizing question in cellular physiology is whether the cellular state and environmental conditions can be inferred from the expression signature of an organism. To investigate this relationship, we created an extensive normalized gene expression compendium for the bacterium Escherichia coli that was further enriched with meta-information through an iterative learning procedure. We then constructed an ensemble method to predict environmental and cellular state, including strain, growth phase, medium, oxygen level, and the presence of antibiotics and carbon sources. Results show that gene expression is an excellent predictor of environmental structure, with multi-class ensemble models achieving balanced accuracy between 70.0% (±3.5%) and 98.3% (±2.3%) for the various characteristics. Interestingly, this performance can be significantly boosted when environmental and strain characteristics are considered simultaneously: a composite classifier that captures the inter-dependencies of three characteristics (medium, phase and strain) achieved 10.6% (±1.0%) higher performance than any individual model. Contrary to expectations, only 59% of the top informative genes were also identified as differentially expressed under the respective conditions. Functional analysis of the respective genetic signatures implicates a wide spectrum of Gene Ontology terms and KEGG pathways with condition-specific information content, including iron transport, transferases, and enterobactin synthesis. Further experimental phenotypic-to-genotypic mapping that we conducted for knock-out mutants argues for the information content of top-ranked genes. This work demonstrates the degree to which genome-scale transcriptional information can be predictive of latent, heterogeneous and seemingly disparate phenotypic and environmental characteristics, with far-reaching applications.
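
The abstract reports balanced accuracy but does not say how it is computed. The sketch below assumes the common definition, the mean of per-class recall, which stops large classes from dominating the score. The function name, the growth-phase labels, and the predictions are hypothetical illustrations, not data or code from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall (assumed definition)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # all samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Average recall over classes so each class counts equally.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical growth-phase labels for six expression profiles.
y_true = ["exponential"] * 4 + ["stationary"] * 2
y_pred = ["exponential", "exponential", "exponential", "stationary",
          "stationary", "exponential"]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {score:.3f}, plain accuracy = {plain:.3f}")
```

In this toy case the per-class recalls are 3/4 and 1/2, so the balanced accuracy is 0.625, while plain accuracy is 4/6 because the larger class carries more weight.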
