Filtering strategies for improving protein identification in high‐throughput MS/MS studies

Despite recent advances in streamlining high‐throughput proteomic pipelines based on tandem mass spectrometry (MS/MS), reliable large‐scale identification of peptides and proteins remains a challenging task that still involves a considerable degree of user interaction. Recently, a number of papers have proposed computational strategies both for detecting poor‐quality MS/MS spectra prior to database search (pre‐filtering) and for verifying the peptide identifications made by the search programs (post‐filtering). Both filtering approaches can benefit the overall protein identification pipeline, since they can eliminate a substantial part of the time‐consuming manual validation work and help convert large sets of MS/MS spectra into more reliable and interpretable proteome information. The choice of filtering method depends both on the properties of the data and on the goals of the experiment. This review discusses the different pre‐ and post‐filtering strategies available to researchers, together with their relative merits and potential pitfalls. We also highlight some additional research topics, such as spectral denoising and statistical assessment of the identification results, which aim to further improve the coverage and accuracy of high‐throughput protein identification studies.
