Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data

Background: Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data.

Results: We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination, or SVM-RFE), paying special attention to the ability to recover the true informative genes/biomarkers and to robustness to outliers in the data. Simulation experiments show that R-SVM can achieve a 5% to 20% improvement over SVM-RFE with regard to these properties. The SVM-based methods are also compared with a conventional univariate method, and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-MS proteomics data, one from a human breast cancer study and the other from a study on rat liver cirrhosis. Important biomarkers found by the algorithm were validated by follow-up biological experiments.

Conclusion: The proposed R-SVM method is suitable for analyzing noisy high-throughput proteomics and microarray data, and it outperforms SVM-RFE in robustness to noise and in the ability to recover informative features. The multivariate SVM-based method outperforms the univariate method in classification performance, but univariate methods can reveal more of the differentially expressed features, especially when there are correlations between the features.
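The recursive selection idea that R-SVM and SVM-RFE share can be illustrated with a small sketch. The code below is not the authors' R-SVM implementation: it follows the SVM-RFE-style loop (train a linear SVM, rank features by the squared weight, drop the lowest-ranked feature, retrain), and the abstract does not state the ranking criterion R-SVM uses instead. The function names, the bias-free subgradient-descent trainer, the hyperparameters, and the toy data are all assumptions made for illustration.

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit the weights of a bias-free linear SVM by full-batch subgradient
    descent on the L2-regularised hinge loss (illustrative trainer only)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]          # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:                   # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n  # hinge-loss subgradient
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def recursive_elimination(X, y, n_keep):
    """Retrain on the surviving features and drop the one with the smallest
    squared weight (the SVM-RFE criterion) until n_keep features remain."""
    active = list(range(len(X[0])))            # original feature indices
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_sub, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active

def make_toy_data(n=40, d=6, seed=0):
    """Hypothetical data: feature 0 carries the class signal, the remaining
    d - 1 features are pure Gaussian noise."""
    rng = random.Random(seed)
    y = [1 if i % 2 == 0 else -1 for i in range(n)]
    X = [[4.0 * yi + rng.gauss(0, 0.3)] + [rng.gauss(0, 1) for _ in range(d - 1)]
         for yi in y]
    return X, y

X, y = make_toy_data()
print(recursive_elimination(X, y, n_keep=1))   # indices of surviving features
```

Removing one feature per round keeps the sketch short; on real proteomics or microarray data with thousands of features, a schedule that removes a fraction of the features at each step would be the more practical choice. As the references on selection bias note, any such selection should be repeated inside each cross-validation fold rather than run once on the full data set.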
