Flexible Data Analysis Pipeline for High-Confidence Proteogenomics

Proteogenomics leverages information derived from proteomic data to improve genome annotations. Of particular interest are “novel” peptides that provide direct evidence of protein expression for genomic regions not previously annotated as protein-coding. We present a modular, automated data analysis pipeline aimed at detecting such “novel” peptides in proteomic data sets. This pipeline implements criteria developed by proteomics and genome annotation experts for high-stringency peptide identification and filtering. Our pipeline is based on the OpenMS computational framework; it incorporates multiple database search engines for peptide identification and applies a machine-learning approach (Percolator) to post-process search results. We describe several new and improved software tools that we developed to facilitate proteogenomic analyses and that extend the wealth of tools provided by OpenMS. We demonstrate the application of our pipeline to a human testis tissue data set previously acquired for the Chromosome-Centric Human Proteome Project; this analysis led to the addition of five new gene annotations to the human reference genome.
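
To make the notion of a “novel” peptide concrete, the following minimal sketch classifies identified peptide sequences by whether they occur in any known protein sequence. It is an illustration under stated assumptions, not the pipeline's actual implementation: the function names, the `min_length` cutoff, and the plain substring search are all hypothetical choices made here. Isoleucine and leucine are treated as equivalent because they have identical masses and generally cannot be told apart by mass spectrometry.

```python
from typing import Dict, Iterable, List


def normalize(seq: str) -> str:
    """Collapse isoleucine (I) onto leucine (L); the two residues are isobaric."""
    return seq.upper().replace("I", "L")


def find_novel_peptides(
    peptides: Iterable[str],
    known_proteins: Dict[str, str],
    min_length: int = 7,  # hypothetical cutoff: very short peptides are rarely unique
) -> List[str]:
    """Return peptides that match no known protein sequence (I/L-insensitive)."""
    normalized_proteins = [normalize(p) for p in known_proteins.values()]
    novel = []
    for peptide in peptides:
        if len(peptide) < min_length:
            continue  # skip peptides too short to be informative
        key = normalize(peptide)
        # A peptide is a "novel" candidate only if it is absent from every protein.
        if not any(key in protein for protein in normalized_proteins):
            novel.append(peptide)
    return novel


# Toy data, invented for illustration only.
known = {"P1": "MKTAYIAKQRQISFVKSHFSRQ"}
identified = ["TAYIAKQR", "TAYLAKQR", "GGSPWWVK", "AKQ"]
novel_candidates = find_novel_peptides(identified, known)
```

In this toy example only `GGSPWWVK` is reported: the first two peptides map to the known protein (the second via I/L equivalence), and `AKQ` falls below the length cutoff. A real analysis would also require stringent statistical filtering of the identifications before any peptide is considered evidence for a new annotation.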
