EGASP: the human ENCODE Genome Annotation Assessment Project

Background: We present the results of EGASP, a community experiment to assess the state of the art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: to assess the accuracy of computational methods for predicting protein-coding genes, and to assess the overall completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.

Results: The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, multiple-transcript accuracy, which takes alternative splicing into account, reached only approximately 40% to 50%. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.

Conclusion: This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
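As an illustration of the nucleotide-level measures mentioned in the results, the following is a minimal sketch, not the evaluation code used in EGASP. It assumes the convention common in the gene-prediction literature, where sensitivity is the fraction of annotated coding nucleotides that are predicted and specificity is the fraction of predicted coding nucleotides that are annotated. The function name `nucleotide_accuracy` and the interval coordinates are hypothetical.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the coding nucleotide level.

    Both arguments are lists of (start, end) coding intervals with
    1-based inclusive coordinates on the same sequence and strand.
    Assumed definitions (not taken from the abstract):
      sensitivity = TP / (TP + FN), specificity = TP / (TP + FP).
    """
    ann = set()
    for start, end in annotated:
        ann.update(range(start, end + 1))
    pred = set()
    for start, end in predicted:
        pred.update(range(start, end + 1))

    true_positives = len(ann & pred)  # nucleotides coding in both sets
    sensitivity = true_positives / len(ann) if ann else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


# Hypothetical toy example: two annotated exons, two predicted exons.
annotated = [(100, 199), (300, 399)]   # 200 annotated coding nucleotides
predicted = [(110, 199), (300, 409)]   # 200 predicted coding nucleotides
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Expanding intervals into sets of positions keeps the sketch short; for regions of realistic size an interval-overlap computation would be more economical.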
