Standardization and denoising algorithms for mass spectra to classify whole-organism bacterial specimens

MOTIVATION: Application of mass spectrometry in proteomics is a breakthrough in high-throughput analyses. Early applications have focused on protein expression profiles to differentiate among various types of tissue samples (e.g. normal versus tumor). Here our goal is to use mass spectra to differentiate bacterial species using whole-organism samples. The raw spectra are similar to spectra of tissue samples, raising some of the same statistical issues (e.g. non-uniform baselines and higher noise associated with higher baseline), but are substantially noisier. As a result, new preprocessing procedures are required before these spectra can be used for statistical classification.

RESULTS: In this study, we introduce novel preprocessing steps that can be used with any mass spectra. These comprise a standardization step and a denoising step. The noise level for each spectrum is determined using only data from that spectrum. Only spectral features that exceed a threshold defined by the noise level are subsequently used for classification. Using this approach, we trained the Random Forest program to classify 240 mass spectra into four bacterial types. The method resulted in zero prediction errors in the training samples and in two test datasets having 240 and 300 spectra, respectively.
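The two preprocessing steps can be illustrated with a minimal sketch. The abstract does not say how the standardization is done or how the noise level is estimated, so everything specific here is an assumption made for illustration: scaling each spectrum to unit total intensity, estimating noise with the median absolute deviation (MAD) of that spectrum alone, and the multiplier `k`. The function names are hypothetical and this is not the authors' algorithm.

```python
from statistics import median


def standardize(intensities):
    """Scale a spectrum to unit total intensity (assumed standardization)."""
    total = sum(intensities)
    if total <= 0:
        return list(intensities)
    return [v / total for v in intensities]


def noise_level(intensities):
    """Estimate noise from this spectrum only, via the scaled MAD (assumption)."""
    med = median(intensities)
    deviations = [abs(v - med) for v in intensities]
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian noise.
    return 1.4826 * median(deviations)


def denoise(intensities, k=3.0):
    """Keep only features above median + k * noise; zero out the rest.

    The multiplier k is a hypothetical tuning parameter.
    """
    x = standardize(intensities)
    threshold = median(x) + k * noise_level(x)
    return [v if v > threshold else 0.0 for v in x]


# Toy spectrum: a flat noisy background with one clear peak at index 4.
spectrum = [1.0, 1.1, 0.9, 1.0, 10.0, 1.05, 0.95, 1.0]
features = denoise(spectrum)
```

In this toy case only the peak at index 4 survives the threshold. The resulting feature vectors, one per spectrum, are the kind of input that could then be passed to a classifier such as Random Forest.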
