Profiling MS proteomics data using smoothed non‐linear energy operator and Bayesian additive regression trees

This paper proposes a profiling method for SELDI‐TOF and MALDI‐TOF MS data that integrates a novel peak detection method based on a modified smoothed non‐linear energy operator, correlation‐based peak selection, and Bayesian additive regression trees. The peak detection and classification performance of the proposed approach is validated on two publicly available MS data sets: MALDI‐TOF simulation data and high‐resolution SELDI‐TOF ovarian cancer data. The results compare favorably with those of three state‐of‐the‐art peak detection algorithms and four machine‐learning algorithms. For the high‐resolution ovarian cancer data set, our method found seven biomarkers (m/z windows) and achieved accuracies of 97.30% and 99.10% at the 25th and 75th percentiles, respectively, over 50 independent cross‐validation samples, significantly better than other profiling and dimensionality reduction methods. The results show that the method can find parsimonious sets of biologically meaningful biomarkers with better accuracy than existing methods. Supporting Information and MATLAB/R scripts implementing the methods described in the article are available at: http://www.cs.bham.ac.uk/szh/SourceCode‐for‐Proteomics.zip
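
To make the peak detection step more concrete, the following is a minimal sketch of the standard smoothed non‐linear energy operator (SNEO), not the modified variant proposed in the paper, whose details are not given in this abstract. It computes the discrete energy operator ψ[n] = x[n]² − x[n−1]·x[n+1], smooths it with a Bartlett window, and keeps local maxima above a threshold. The function names, the window length, and the threshold rule (a multiple of the mean SNEO value) are assumptions made for illustration only.

```python
import numpy as np

def neo(x):
    """Discrete non-linear energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)          # end points are left at zero
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def sneo(x, window_len=5):
    """Smoothed NEO: NEO output convolved with a normalized Bartlett window.

    window_len is assumed odd so the window is centered on each sample.
    """
    w = np.bartlett(window_len + 2)[1:-1]   # drop the two zero end points
    w = w / w.sum()
    return np.convolve(neo(x), w, mode="same")

def detect_peaks(x, window_len=5, scale=3.0):
    """Indices of local maxima of the SNEO that exceed scale * mean(SNEO).

    The threshold rule is an illustrative assumption, not the paper's.
    """
    s = sneo(x, window_len)
    threshold = scale * s.mean()
    is_local_max = (s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:])
    return np.flatnonzero(is_local_max & (s[1:-1] > threshold)) + 1

if __name__ == "__main__":
    # Synthetic noise-free "spectrum": two Gaussian peaks on a flat baseline.
    n = np.arange(400)
    spectrum = (1.0 * np.exp(-((n - 100) ** 2) / (2 * 3.0 ** 2))
                + 0.6 * np.exp(-((n - 300) ** 2) / (2 * 3.0 ** 2)))
    print(detect_peaks(spectrum))
```

On this synthetic input the two peaks placed at indices 100 and 300 should be recovered. Real spectra would additionally need baseline correction and denoising before such an operator is applied, and the threshold would have to be tuned to the noise level.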
