ProteoAnnotator – Open source proteogenomics annotation software supporting PSI standards

The recent massive increase in genome sequencing capability is producing enormous advances in our understanding of biological systems. However, genome annotation, that is, determining the structure of all transcribed genes, remains a bottleneck. Experimental data from MS studies can play a major role in confirming and correcting gene structures, an approach known as proteogenomics. There are technical and practical challenges to overcome, since proteogenomics requires pipelines comprising a complex set of interconnected modules as well as bespoke routines, for example in protein inference and statistics. We introduce a complete, open source pipeline for proteogenomics, called ProteoAnnotator, which incorporates a graphical user interface and implements the Proteomics Standards Initiative mzIdentML standard for each analysis stage. All steps are included as standalone modules with the mzIdentML library, allowing other groups to re-use the whole pipeline or its constituent parts within other tools. We have developed new modules for pre-processing and combining multiple search databases; for performing peptide-level statistics on mzIdentML files; for scoring grouped protein identifications matched to a given genomic locus, to validate that updates to the official gene models are statistically sound; and for mapping end results back onto the genome. ProteoAnnotator is available from http://www.proteoannotator.org/. All MS data have been deposited in ProteomeXchange with identifiers PXD001042 and PXD001390 (http://proteomecentral.proteomexchange.org/dataset/PXD001042; http://proteomecentral.proteomexchange.org/dataset/PXD001390).
