Identification of prokaryotic small proteins using a comparative genomic approach

MOTIVATION Accurate prediction of genes encoding small proteins (on the order of 50 amino acids or fewer) remains an open problem in bioinformatics. Some of the best gene prediction methods rely on either sequence composition analysis or sequence similarity to a known protein-coding sequence. These methods often fail for small proteins, however, either because few small protein-coding genes have been experimentally verified or because statistics computed on short sequences have limited significance. Our approach is based on the hypothesis that true small proteins are under selective pressure to encode a particular amino acid sequence, to be translated easily by the ribosome and to be structurally stable. This stability can be achieved either independently or as part of a larger protein complex. Given this assumption, it follows that small proteins should display conserved local protein structure properties, much like larger proteins. Our method incorporates neural-net predictions for three local structure alphabets within a comparative genomic approach, using a genomic alignment of 22 closely related bacterial genomes to predict whether or not a given open reading frame (ORF) encodes a small protein.

RESULTS We applied this method to the complete genome of Escherichia coli strain K12 and evaluated its performance on a set of 60 experimentally verified small proteins from this organism. Out of a total of 11 407 possible ORFs, 6 of the top 10 and 27 of the top 100 predictions belonged to the set of 60 experimentally verified small proteins, and 35 of the true small proteins fell within the top 200 predictions. We compared our method to Glimmer, using a default Glimmer protocol and a modified small ORF Glimmer protocol with a lower minimum size cutoff. The default Glimmer protocol identified 16 of the true small proteins (all in the top 200 predictions), but made no prediction for 34 because of its size cutoffs. The small ORF Glimmer protocol made predictions for all the experimentally verified small proteins, but only 9 of the 60 true small proteins were within its top 200 predictions.

CONTACT jsamayoa@jhu.edu
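
The results above are reported as the number of experimentally verified small proteins found among the top-ranked predictions (top 10, top 100, top 200). A minimal sketch of that kind of tally follows, assuming each ORF has a single numeric score where higher means more likely to be a small protein. The function name `hits_in_top_k`, the ORF identifiers and the scores are hypothetical and for illustration only; they are not taken from the authors' method or data.

```python
from typing import Dict, Iterable, Set


def hits_in_top_k(
    scores: Dict[str, float],
    verified: Set[str],
    cutoffs: Iterable[int],
) -> Dict[int, int]:
    """Count verified ORFs among the k highest-scoring ORFs, for each cutoff k.

    Assumption: a higher score means a more confident small-protein prediction.
    """
    # Rank ORF identifiers from highest to lowest score.
    ranked = sorted(scores, key=lambda orf: scores[orf], reverse=True)
    result = {}
    for k in cutoffs:
        # Count how many of the first k ranked ORFs are in the verified set.
        result[k] = sum(1 for orf in ranked[:k] if orf in verified)
    return result


# Toy data (hypothetical): five ORFs, three of which are "verified".
toy_scores = {
    "orf_a": 0.91,
    "orf_b": 0.85,
    "orf_c": 0.40,
    "orf_d": 0.77,
    "orf_e": 0.10,
}
toy_verified = {"orf_a", "orf_d", "orf_e"}

print(hits_in_top_k(toy_scores, toy_verified, [2, 3, 5]))
# → {2: 1, 3: 2, 5: 3}
```

With real data, `scores` would hold one entry per candidate ORF and `verified` the set of experimentally confirmed small proteins, so the same tally could be used to compare rankings produced by different gene finders.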
