Quality classification of tandem mass spectrometry data

MOTIVATION: Peptide identification by tandem mass spectrometry is an important tool in proteomic research. Powerful identification programs such as SEQUEST, ProICAT and Mascot can match experimental spectra to theoretical spectra derived from protein databases, which removes much of the manual input needed in the identification process. However, the time-consuming validation of the peptide identifications is still the bottleneck of many proteomic studies. One way to streamline this process further is to remove those spectra that are unlikely to yield a confident or valid peptide identification, and in this way to reduce the labour of the validation phase.

RESULTS: We propose a prefiltering scheme for evaluating the quality of spectra before the database search. The spectra are classified into two classes: spectra that contain valuable information for peptide identification, and spectra that are not derived from peptides or that contain insufficient information for interpretation. The spectral features developed for the classification are tested on real-life material originating from human lymphoblast samples and on a standard mixture of 9 proteins, both labelled with the ICAT reagent. The results show that the prefiltering scheme efficiently separates the two classes of spectra.
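To make the idea of prefiltering concrete, the following is a minimal sketch of a two-class quality filter applied before any database search. It is not the authors' method: the abstract does not list the spectral features or the classifier, so the two features used here (peak count, and the number of peak pairs whose m/z difference matches an amino acid residue mass), the function names, the tolerance and the thresholds are all assumptions chosen for illustration. Only the monoisotopic residue masses are standard values.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (standard values) for a subset of
# amino acids: G, A, S, P, V, T, L/I, N, D, K, E, F, R.
RESIDUE_MASSES = (
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    113.08406, 114.04293, 115.02694, 128.09496, 129.04259,
    147.06841, 156.10111,
)


def count_residue_pairs(mz_values, tol=0.02):
    """Count peak pairs whose m/z difference matches one residue mass.

    Assumes singly charged fragments; `tol` (in Da) is a hypothetical
    tolerance, not a value taken from the paper.
    """
    count = 0
    for low, high in combinations(sorted(mz_values), 2):
        diff = high - low
        if any(abs(diff - mass) <= tol for mass in RESIDUE_MASSES):
            count += 1
    return count


def spectrum_features(peaks):
    """Compute two illustrative features from (m/z, intensity) pairs."""
    mz_values = [mz for mz, _intensity in peaks]
    return {
        "n_peaks": len(peaks),
        "residue_pairs": count_residue_pairs(mz_values),
    }


def is_informative(peaks, min_peaks=5, min_pairs=3):
    """Hand-set threshold rule standing in for a trained classifier.

    Both thresholds are hypothetical; in practice they would be learned
    from manually validated spectra.
    """
    features = spectrum_features(peaks)
    return (features["n_peaks"] >= min_peaks
            and features["residue_pairs"] >= min_pairs)


# A synthetic fragment ladder: each peak is one residue heavier than the last.
ladder = [(175.11895, 100.0)]
for mass in (57.02146, 71.03711, 87.03203, 97.05276, 99.06841):
    ladder.append((ladder[-1][0] + mass, 100.0))

# A sparse spectrum with no residue-spaced peaks.
noise = [(100.0, 5.0), (150.3, 4.0), (233.7, 6.0)]

print(spectrum_features(ladder), is_informative(ladder))
print(spectrum_features(noise), is_informative(noise))
```

In a real pipeline the threshold rule would be replaced by a classifier trained on spectra whose identifications were validated manually, and only the spectra predicted to be informative would be passed on to the search engine.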
