RAId_DbS: Peptide Identification using Database Searches with Realistic Statistics

Background: The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.

Results: Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the collected score statistics. The t-tests may be used to measure the reliability of the reported statistics. When combined with the P-value reported for a peptide hit under a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.
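As a rough illustration of how a P-value and an E-value relate in a database search, the sketch below uses the common convention that the E-value is the per-peptide P-value multiplied by the number of candidate peptides scored against the spectrum. This convention, the function name `e_value`, and the example numbers are assumptions made for illustration; the abstract does not give the formula RAId_DbS itself uses.

```python
def e_value(p_value: float, num_candidates: int) -> float:
    """Expected number of random candidates scoring at least as well as the hit.

    Assumes the common convention E = N * P, where P is the per-candidate
    P-value from a score distribution model and N is the number of database
    peptides scored against the spectrum (both names are hypothetical).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if num_candidates < 0:
        raise ValueError("num_candidates must be non-negative")
    return p_value * num_candidates


# Illustrative numbers only: a hit with P = 1e-6 among 50,000 scored candidates.
example = e_value(1e-6, 50_000)
```

Under this convention an E-value well below 1 means that few or no random candidates are expected to score as well as the reported hit, which is why an exaggerated P-value translates directly into an overconfident E-value.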
