Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling

Accurate annotation of protein-coding regions is essential for understanding how genetic information is translated into biological functions. Here we describe riboHMM, a new method that uses ribosome footprint data along with gene expression and sequence information to accurately infer translated sequences. We applied our method to human lymphoblastoid cell lines and identified 7,273 previously unannotated coding sequences, including 2,442 translated upstream open reading frames. We observed an enrichment of harringtonine-treated ribosome footprints at the inferred initiation sites, validating many of the novel coding sequences. The novel sequences exhibit significant signatures of selective constraint in the reading frames of the inferred proteins, suggesting that many of these are functional. Nearly 40% of bicistronic transcripts showed significant negative correlation in the levels of translation of their two coding sequences, suggesting a key regulatory role for these novel translated sequences. Our work significantly expands the set of known coding regions in humans.
