Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow

Complete annotation of the human genome is indispensable for medical research. The GENCODE consortium strives to provide this, augmenting computational and experimental evidence with manual annotation. The rapidly developing field of proteogenomics provides evidence for the translation of genes into proteins and can be used to discover and refine gene models. However, both proteomics and annotation groups lack guidelines for integrating these data. Here we report a stringent workflow for interpreting proteogenomic data that the annotation community could use to evaluate novel proteogenomic evidence. Based on reprocessing of three large-scale, publicly available human data sets, we show that a conservative approach using stringent filtering is required to generate valid identifications. We found evidence supporting the addition of 16 novel protein-coding genes to GENCODE. Despite this, many peptide identifications in pseudogenes cannot be annotated because orthogonal supporting evidence is absent.
