Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data

MOTIVATION High-throughput, high-resolution mass spectrometry instruments are increasingly used for disease classification and therapeutic guidance. However, analysing the immense amount of data they produce poses considerable challenges. We have therefore developed a novel method for dimensionality reduction and tested it on a published high-resolution SELDI-TOF ovarian cancer dataset.

RESULTS We have developed a four-step strategy for data preprocessing based on (1) binning, (2) the Kolmogorov-Smirnov test, (3) restriction of the coefficient of variation and (4) wavelet analysis. Support vector machines were then used for classification. The method achieves an average sensitivity of 97.38% (sd = 0.0125) and an average specificity of 93.30% (sd = 0.0174) in 1000 independent k-fold cross-validations, where k = 2, ..., 10.

AVAILABILITY The software is available for academic and non-commercial institutions.
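
The four preprocessing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' software: the function names, the thresholds (`ks_min`, `cv_max`), the bin size, the choice of a Haar wavelet and the decision to apply the coefficient-of-variation limit within each class are all assumptions made for illustration, since the abstract does not specify them. The reduced feature vectors would then be passed to a support vector machine, which is omitted here.

```python
import bisect
import math
import statistics


def bin_spectrum(intensities, bin_size):
    """Step 1 (binning): average consecutive intensities in fixed-size bins."""
    return [statistics.fmean(intensities[i:i + bin_size])
            for i in range(0, len(intensities), bin_size)]


def ks_statistic(a, b):
    """Step 2: two-sample Kolmogorov-Smirnov statistic, i.e. the largest
    gap between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def coefficient_of_variation(values):
    """Step 3: sample standard deviation divided by |mean| (needs >= 2 values)."""
    mean = statistics.fmean(values)
    return statistics.stdev(values) / abs(mean) if mean else math.inf


def haar_approximation(signal, levels=1):
    """Step 4 (assumed Haar wavelet): keep only approximation coefficients,
    halving the length at each level."""
    out = list(signal)
    for _ in range(levels):
        if len(out) < 2:
            break
        if len(out) % 2:          # pad odd lengths by repeating the last value
            out.append(out[-1])
        out = [(out[i] + out[i + 1]) / math.sqrt(2)
               for i in range(0, len(out), 2)]
    return out


def reduce_dimensions(cancer, control, bin_size=2, ks_min=0.5,
                      cv_max=0.5, levels=1):
    """Apply the four steps; thresholds are hypothetical, not from the paper."""
    cancer_b = [bin_spectrum(s, bin_size) for s in cancer]
    control_b = [bin_spectrum(s, bin_size) for s in control]
    keep = []
    for j in range(len(cancer_b[0])):
        ca = [s[j] for s in cancer_b]
        co = [s[j] for s in control_b]
        if ks_statistic(ca, co) < ks_min:
            continue              # bin does not separate the two classes
        if max(coefficient_of_variation(ca),
               coefficient_of_variation(co)) > cv_max:
            continue              # bin is too variable within a class
        keep.append(j)

    def transform(binned):
        return haar_approximation([binned[j] for j in keep], levels)

    return (keep,
            [transform(s) for s in cancer_b],
            [transform(s) for s in control_b])
```

Each step shrinks the feature space: binning merges neighbouring m/z points, the Kolmogorov-Smirnov filter keeps bins whose intensity distributions differ between classes, the coefficient-of-variation limit discards unstable bins, and the wavelet step compresses what remains.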
