Comparative gene prediction in human and mouse.

The completion of the mouse genome sequence promises to help predict human genes with greater accuracy. Current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), but their specificity is often low: they predict a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide gene predictions that are both sensitive and specific. The accuracy of SGP2, when used to predict genes by comparing the human and mouse genomes, is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire-genome predictions from ENSEMBL. The results indicate that SGP2 outperforms purely ab initio gene prediction methods, and that it works about as well with 3x shotgun data as it does with fully assembled genomes. The specificity of SGP2 is high enough that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions for both the human and mouse genomes by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.
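The distinction between sensitivity and specificity drawn above can be made concrete with a small sketch. It assumes the convention common in the gene-finding literature, where at the nucleotide level sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the abstract itself does not spell out these formulas. The function names and the toy coordinates are hypothetical and are not taken from SGP2.

```python
def coding_positions(exons):
    """Expand inclusive (start, end) exon intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(predicted_exons, annotated_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed convention (not stated in the abstract):
      sensitivity = TP / (TP + FN)  fraction of annotated coding bases predicted
      specificity = TP / (TP + FP)  fraction of predicted coding bases annotated
    """
    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(annotated_exons)
    tp = len(predicted & annotated)   # coding bases found by the predictor
    fp = len(predicted - annotated)   # predicted bases with no annotation
    fn = len(annotated - predicted)   # annotated bases the predictor missed
    sensitivity = tp / (tp + fn) if annotated else 0.0
    specificity = tp / (tp + fp) if predicted else 0.0
    return sensitivity, specificity


# Toy example: the predictor recovers both annotated exons but also
# calls an extra, unsupported exon of the same total length.
annotated = [(1, 10), (21, 30)]
predicted = [(1, 10), (21, 30), (41, 60)]
sn, sp = nucleotide_accuracy(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this toy case every annotated base is recovered, yet half of the predicted bases are unsupported, which is the pattern the abstract attributes to purely ab initio programs.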
