The Ensembl gene annotation system

The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. It also generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system aligns biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately lead to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail.

Database URL: http://www.ensembl.org/index.html


[106]  Sean R. Eddy,et al.  A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure , 2002, BMC Bioinformatics.