Predicting essential genes in fungal genomes.

Essential genes are required for an organism's viability, and the ability to identify these genes in pathogens is crucial to directed drug development. Predicting essential genes computationally is appealing because it circumvents expensive and difficult experimental screens. Most such predictions are based on homology mapping to experimentally verified essential genes in model organisms. We present here a different approach, one that relies exclusively on sequence features of a gene to estimate essentiality and offers a promising way to identify essential genes in unstudied or uncultured organisms. We identified 14 characteristic sequence features potentially associated with essentiality, such as localization signals, codon adaptation, GC content, and overall hydrophobicity. Using the well-characterized baker's yeast Saccharomyces cerevisiae, we employed a simple Bayesian framework to measure the correlation of each of these features with essentiality. We then used the 14 features to train a machine learning classifier capable of predicting essential genes. We trained the classifier on known essential genes in S. cerevisiae and applied it to the closely related and relatively unstudied yeast Saccharomyces mikatae. We assessed predictive success in two ways. First, we compared all of our predictions with those generated by homology mapping between the two species. Second, we verified a subset of our predictions with eight in vivo knockouts in S. mikatae, and we present here the first experimentally confirmed essential genes in this species.
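To make the idea of sequence-derived features and a simple Bayesian association measure concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used in the study: the abstract does not say how each feature was computed or how the Bayesian framework was set up. The function names (`gc_content`, `mean_hydropathy`, `posterior_essential`) and the single-threshold binning are hypothetical choices. Hydrophobicity is approximated here as the mean Kyte-Doolittle hydropathy of the protein sequence.

```python
# Kyte-Doolittle hydropathy values per amino acid (standard published scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gc_content(dna):
    """Fraction of G and C bases in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty DNA sequence")
    return sum(1 for base in dna if base in "GC") / len(dna)


def mean_hydropathy(protein):
    """Mean Kyte-Doolittle hydropathy over residues with a known score."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not scores:
        raise ValueError("no scorable residues in protein sequence")
    return sum(scores) / len(scores)


def posterior_essential(values, labels, threshold):
    """Estimate P(essential | feature > threshold) with Bayes' rule.

    values: one feature value per gene.
    labels: True for genes known to be essential, False otherwise.
    Returns (posterior, prior). A posterior well above the prior suggests
    that high feature values are associated with essentiality.
    Assumption: one hard threshold splits the feature into two bins.
    """
    n = len(values)
    n_essential = sum(1 for e in labels if e)
    is_high = [v > threshold for v in values]
    n_high = sum(is_high)
    if n == 0 or n_essential == 0 or n_high == 0:
        raise ValueError("need genes, essential genes, and genes above threshold")
    n_high_essential = sum(1 for h, e in zip(is_high, labels) if h and e)

    prior = n_essential / n                      # P(essential)
    likelihood = n_high_essential / n_essential  # P(high | essential)
    evidence = n_high / n                        # P(high)
    return likelihood * prior / evidence, prior
```

In use, each feature would be computed for every S. cerevisiae gene, and `posterior_essential` would be called with the known essentiality labels to see whether the posterior departs from the prior. A real analysis would likely use several bins per feature and guard against small counts.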
