Computational protein biomarker prediction: a case study for prostate cancer

Background: Recent technological advances in mass spectrometry pose challenges in computational mathematics and statistics to process the mass spectral data into predictive models with clinical and biological significance. We discuss several classification-based approaches to finding protein biomarker candidates using protein profiles obtained via mass spectrometry, and we assess their statistical significance. Our overall goal is to implicate peaks that have a high likelihood of being biologically linked to a given disease state, and thus to narrow the search for biomarker candidates.

Results: Thorough cross-validation studies and randomization tests are performed on a prostate cancer dataset with over 300 patients, obtained at the Eastern Virginia Medical School using SELDI-TOF mass spectrometry. We obtain average classification accuracies of 87% on a four-group classification problem using a two-stage linear SVM-based procedure and just 13 peaks, with other methods performing comparably.

Conclusions: Modern feature selection and classification methods are powerful techniques for both the identification of biomarker candidates and the related problem of building predictive models from protein mass spectrometric profiles. Cross-validation and randomization are essential tools that must be performed carefully in order not to bias the results unfairly. However, only a biological validation and identification of the underlying proteins will ultimately confirm the actual value and power of any computational predictions.
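
The warning about bias in cross-validation and randomization can be illustrated with a minimal sketch. It is not the paper's procedure: a nearest-centroid classifier stands in for the two-stage linear SVM, the data are synthetic two-class "spectra" rather than the SELDI-TOF profiles, and every function name and parameter (select_peaks, cv_accuracy, randomization_p_value, make_synthetic) is hypothetical. The point it tries to show is that peak selection happens inside each training fold, and that the same pipeline is re-run on label-shuffled data to judge whether the observed accuracy could arise by chance.

```python
import random
import statistics


def select_peaks(X, y, k):
    """Rank peaks by a simple class-separation score and return the top-k indices."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, lab in zip(X, y) if lab == c] for c in classes]
        means = [statistics.fmean(g) for g in groups]
        spread = sum(statistics.pstdev(g) for g in groups) + 1e-9
        scores.append((max(means) - min(means)) / spread)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def fit_centroids(X, y, peaks):
    """Per-class mean intensity on the selected peaks (stand-in for the SVM)."""
    cents = {}
    for c in set(y):
        rows = [row for row, lab in zip(X, y) if lab == c]
        cents[c] = [statistics.fmean(r[j] for r in rows) for j in peaks]
    return cents


def predict(cents, row, peaks):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(peaks, cents[c]))
    return min(cents, key=dist)


def cv_accuracy(X, y, k_peaks=5, folds=5, seed=0):
    """k-fold cross-validation with peak selection redone on each training fold."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    correct = 0
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Selecting peaks on the full dataset here would leak test information.
        peaks = select_peaks(X_tr, y_tr, k_peaks)
        cents = fit_centroids(X_tr, y_tr, peaks)
        correct += sum(predict(cents, X[i], peaks) == y[i] for i in test)
    return correct / len(X)


def randomization_p_value(X, y, n_perm=50, seed=1):
    """Compare the observed CV accuracy with accuracies under shuffled labels."""
    rng = random.Random(seed)
    observed = cv_accuracy(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if cv_accuracy(X, y_perm) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def make_synthetic(n_per_class=30, n_peaks=40, n_informative=3, seed=42):
    """Synthetic data (an assumption): Gaussian noise plus a shift on a few peaks."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_peaks)]
            for j in range(n_informative):
                row[j] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y


X, y = make_synthetic()
acc, p = randomization_p_value(X, y)
print(f"CV accuracy: {acc:.2f}, randomization p-value: {p:.3f}")
```

If select_peaks were instead called once on all samples before the folds were formed, the shuffled-label runs would also tend to score above chance, which is one way the results can be biased unfairly.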
