Predicting essential genes based on network and sequence analysis.

Essential genes are indispensable to the viability of an organism. Identifying and analyzing them is key to understanding the systems-level organization of living cells, and the ability to predict them in pathogens is of great importance for directed drug development. Global analysis of protein interaction networks provides an effective way to elucidate the relationships between genes. Essential genes have been found to be highly connected, generally having more interactions than nonessential ones. Using recent large-scale identifications of essential genes and protein-protein interactions in Saccharomyces cerevisiae and Escherichia coli, we have systematically investigated the topological properties of essential and nonessential genes in the protein-protein interaction networks. Essential genes tend to play topologically more important roles in these networks, and many topological features were found to discriminate between essential and nonessential genes with statistical significance. We have also examined sequence properties such as open reading frame length, strand, and phyletic retention for their association with gene essentiality. Employing the topological features of the protein interaction network together with the sequence properties, we have built a machine learning classifier capable of predicting essential genes. Computational prediction of essential genes circumvents expensive and difficult experimental screens and will help antimicrobial drug development.
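
As a rough illustration of the simplest topological feature mentioned above, the number of interactions per gene, the following sketch computes node degree in a toy interaction network and compares the mean degree of essential and nonessential genes. The edge list, gene names, and essentiality labels are hypothetical placeholders, not data from this study, and degree is only one of the features such an analysis might consider.

```python
from collections import defaultdict

# Hypothetical toy interaction list; gene names are placeholders, not real data.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("E", "F")]
# Assumed essentiality labels, for illustration only.
essential = {"A", "B"}


def degrees(edge_list):
    """Return the number of distinct interaction partners for each gene."""
    neighbors = defaultdict(set)
    for u, v in edge_list:
        if u != v:  # ignore self-interactions
            neighbors[u].add(v)
            neighbors[v].add(u)
    return {gene: len(partners) for gene, partners in neighbors.items()}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


deg = degrees(edges)
mean_essential = mean(d for g, d in deg.items() if g in essential)
mean_nonessential = mean(d for g, d in deg.items() if g not in essential)

print(f"mean degree, essential:    {mean_essential:.2f}")
print(f"mean degree, nonessential: {mean_nonessential:.2f}")
```

On real data, a per-gene feature such as this degree would typically be compared between the two groups with a rank-based test and then passed, alongside the sequence properties, to a classifier.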
