Tailor: non-parametric and rapid score calibration method for database search-based peptide identification in shotgun proteomics

Peptide-spectrum-match (PSM) scores used in database searching are calibrated to spectrum- or spectrum-peptide-specific null distributions. Some calibration methods rely on specific assumptions and use analytical models (e.g. binomial distributions), whereas other methods utilize exact empirical null distributions. The former may be inaccurate because of unjustified assumptions, while the latter are accurate, albeit computationally expensive. Here, we introduce a novel, non-parametric, heuristic PSM score calibration method, called Tailor, which calibrates PSM scores by dividing them by the top 100-quantile of the empirical, spectrum-specific null distribution (i.e. the score with an associated p-value of 0.01 at the tail, hence the name) observed during database searching. Tailor does not require any optimization steps or long calculations, it does not rely on any assumptions about the form of the score distribution, and it works with any score function that uses high- or low-resolution information. In our benchmark, we re-calibrated the XCorr match scores from Crux, the HyperScore scores from X!Tandem, and the p-values from OMSSA with the Tailor method, and obtained more spectrum annotations than with raw scores at any false discovery rate level. Moreover, Tailor provided slightly more annotations than the E-values of X!Tandem and OMSSA, and it approached the performance of the computationally expensive exact p-value method for XCorr on spectrum datasets containing low-resolution fragmentation information (MS2) while running around 20-150 times faster. On high-resolution MS2 datasets, the Tailor method with XCorr achieved state-of-the-art performance and produced more annotations than the well-calibrated Res-ev score, around 50-80 times faster.
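The calibration step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `tail_quantile_score` and `tailor_calibrate` are hypothetical, and the sketch assumes that higher scores are better, that the empirical null for a spectrum is simply the list of scores of all its candidate peptides, and that the reference score is the one at rank ceil(0.01 · n) in that list. How ties and very small candidate sets are handled is also an assumption.

```python
import math


def tail_quantile_score(candidate_scores, tail=0.01):
    """Return the score at the top `tail` fraction of the candidate scores.

    Assumes higher scores are better. With tail=0.01 this is the score
    whose empirical p-value is roughly 0.01 (the top 100-quantile).
    """
    if not candidate_scores:
        raise ValueError("need at least one candidate score")
    ranked = sorted(candidate_scores, reverse=True)
    # Rank of the score that cuts off the top `tail` fraction;
    # small candidate sets fall back to the best score (index 0).
    idx = max(0, math.ceil(tail * len(ranked)) - 1)
    return ranked[idx]


def tailor_calibrate(score, candidate_scores, tail=0.01):
    """Divide a PSM score by the spectrum-specific tail quantile score."""
    reference = tail_quantile_score(candidate_scores, tail)
    if reference <= 0:
        # Assumption: a non-positive reference cannot be used as a divisor.
        raise ValueError("reference score must be positive")
    return score / reference


if __name__ == "__main__":
    # Toy null distribution: 200 candidate scores for one spectrum.
    null_scores = [float(s) for s in range(1, 201)]
    best = max(null_scores)
    print(tailor_calibrate(best, null_scores))
```

Because the reference score is read off the candidate scores already computed during the search, the sketch needs only one sort per spectrum, which is consistent with the claim that no optimization step or long calculation is required.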
