Thousands of large-scale RNA sequencing experiments yield a comprehensive new human gene list and reveal extensive transcriptional noise

We assembled the sequences from 9,795 RNA sequencing experiments, collected from 31 human tissues and hundreds of subjects as part of the GTEx project, to create a new, comprehensive catalog of human genes and transcripts. The new human gene database contains 43,162 genes, of which 21,306 are protein-coding and 21,856 are noncoding, and a total of 323,824 transcripts, for an average of 7.5 transcripts per gene. Our expanded gene list includes 4,998 novel genes (1,178 coding and 3,819 noncoding) and 97,511 novel splice variants of protein-coding genes as compared to the most recent human gene catalogs. We detected over 30 million additional transcripts at more than 650,000 sites, nearly all of which are likely to be nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells.
