GENCODE reference annotation for the human and mouse genomes

Abstract The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high-quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference-quality gene annotations to provide this foundational resource. The consortium includes both experimental and computational biology groups, which work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, the manual and computational annotation workflows use all publicly available data and analysis, along with the research literature, to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl BioMart, the Ensembl Perl and REST APIs, and https://www.gencodegenes.org.
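As an illustration of working with the downloadable annotation, the sketch below parses a single record in GTF format. It assumes the annotation has been obtained as a GTF file, which the abstract does not state. The function names and the example record, including its identifiers and tags, are hypothetical and chosen for demonstration only.

```python
from collections import defaultdict


def parse_gtf_attributes(field):
    """Parse the ninth GTF column into a dict mapping key -> list of values.

    Keys such as "tag" may repeat, so every key maps to a list.
    Assumption: attribute values do not themselves contain semicolons.
    """
    attrs = defaultdict(list)
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final semicolon
        key, _, value = item.partition(" ")
        attrs[key].append(value.strip().strip('"'))
    return dict(attrs)


def parse_gtf_line(line):
    """Split one tab-separated GTF record into named fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    return {
        "seqname": cols[0],
        "source": cols[1],
        "feature": cols[2],
        "start": int(cols[3]),  # GTF coordinates are 1-based and inclusive
        "end": int(cols[4]),
        "strand": cols[6],
        "attributes": parse_gtf_attributes(cols[8]),
    }


# Hypothetical record for demonstration; not taken from a real release.
EXAMPLE_LINE = (
    "chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\t"
    'gene_id "ENSG_EXAMPLE.1"; gene_type "lncRNA"; '
    'gene_name "EXAMPLE1"; level 2; tag "tagA"; tag "tagB";'
)

if __name__ == "__main__":
    record = parse_gtf_line(EXAMPLE_LINE)
    print(record["feature"], record["attributes"]["gene_name"])
```

Storing every attribute as a list keeps repeated keys intact at the cost of an extra index when reading single-valued keys such as `gene_id`.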

Mark Gerstein | Tomás Di Domenico | Jane Loveland | Julien Lagarde | Jyoti Choudhary | Manolis Kellis | Daniel R. Zerbino | Joel Armstrong | Benedict Paten | Yan Zhang | Paul Flicek | Fiona Cunningham | Tim J. P. Hubbard | Matthew Hardy | Mark Diekhans | James C. Wright | Jonathan M. Mudge | Tiago Grego | Andrew D. Yates | Magali Ruffier | Michael L. Tress | Roderic Guigó | Anne Parker | Baikang Pei | Ian T. Fiddes | Bronwen L. Aken | Thibaut Hourlier | Fergal J. Martin | Adam Frankish | Jose M. Gonzalez | Paul Muir | Alexandre Reymond | Shamika Mohanan | Sarah M. Donaldson | Irwin Jungreis | Anne-Maud Ferreira | Rory Johnson | Marie-Marthe Suner | Toby Hunt | Osagie G. Izuogu | Laura Martínez | Cristina Sisu | If H. A. Barnes | Andrew E. Berry | Alexandra Bignell | Silvia Carbonell Sala | Jacqueline Chrast | Carlos García-Girón | Fabio C. P. Navarro | Fernando Pozo | Bianca M. Schmitt | Eloise Stapleton | Irina Sycheva | Barbara Uszczynska-Ratajczak | Jinrui Xu
