On the estimation of false positives in peptide identifications using decoy search strategy

False positive control/estimate in peptide identifications by MS is of critical importance for reliable inference at the protein level and downstream bioinformatics analysis. Approaches based on search against decoy databases have become popular for its conceptual simplicity and easy implementation. Although various decoy search strategies have been proposed, few studies have investigated their difference in performance. With datasets collected on a mixture of model proteins, we demonstrate that a single search against the target database coupled with its reversed version offers a good balance between performance and simplicity. In particular, both the accuracy of the estimate of the number of false positives and sensitivity is at least comparable to other procedures examined in this study. It is also shown that scrambling while preserving frequency of amino acid words can potentially improve the accuracy of false positive estimate, though more studies are needed to investigate the optimal scrambling procedure for specific condition and the variation of the estimate across repeated scrambling.

[1]  Fuchu He,et al.  Protein probabilities in shotgun proteomics: Evaluating different estimation methods using a semi‐random sampling model , 2006, Proteomics.

[2]  Mikhail S. Gelfand,et al.  Pro-Frame: similarity-based gene recognition in eukaryotic DNA sequences with errors , 2001, Bioinform..

[3]  R. Aebersold,et al.  ProbID: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data , 2002, Proteomics.

[4]  R. Beavis,et al.  Using annotated peptide mass spectrum libraries for protein identification. , 2006, Journal of proteome research.

[5]  J. Yates,et al.  A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. , 2003, Analytical chemistry.

[6]  William Stafford Noble,et al.  Posterior error probabilities and false discovery rates: two sides of the same coin. , 2008, Journal of proteome research.

[7]  D. Naiman,et al.  Probability model for assessing proteins assembled from peptide sequences inferred from tandem mass spectrometry data. , 2007, Analytical chemistry.

[8]  D. Berry,et al.  Symmetrized Percent Change for Treatment Comparisons , 2006 .

[9]  D. N. Perkins,et al.  Probability‐based protein identification by searching sequence databases using mass spectrometry data , 1999, Electrophoresis.

[10]  D. Tabb,et al.  MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. , 2007, Journal of proteome research.

[11]  Vineet Bafna,et al.  SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database , 2001, ISMB.

[12]  Adam Godzik,et al.  Clustering of highly homologous sequences to reduce the size of large protein databases , 2001, Bioinform..

[13]  S. Bryant,et al.  Open mass spectrometry search algorithm. , 2004, Journal of proteome research.

[14]  Z. Smilansky,et al.  Intensity-based statistical scorer for tandem mass spectrometry. , 2003, Analytical chemistry.

[15]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[16]  Eivind Coward,et al.  Shufflet: shuffling sequences while conserving the k-let counts , 1999, Bioinform..

[17]  William Stafford Noble,et al.  Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. , 2006, Analytical chemistry.

[18]  R. Beavis,et al.  A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. , 2003, Analytical chemistry.

[19]  R. Aebersold,et al.  A statistical model for identifying proteins by tandem mass spectrometry. , 2003, Analytical chemistry.

[20]  Lang Li,et al.  A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry , 2008, Bioinform..

[21]  Steven P Gygi,et al.  Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry , 2007, Nature Methods.

[22]  John D. Storey,et al.  Empirical Bayes Analysis of a Microarray Experiment , 2001 .

[23]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[24]  John D. Storey A direct approach to false discovery rates , 2002 .

[25]  Alexey I Nesvizhskii,et al.  Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. , 2002, Analytical chemistry.

[26]  Hyungwon Choi,et al.  False discovery rates and related statistical concepts in mass spectrometry-based proteomics. , 2008, Journal of proteome research.

[27]  Nichole L. King,et al.  Development and validation of a spectral library searching method for peptide identification from MS/MS , 2007, Proteomics.