Classifying osteosarcoma patients using machine learning approaches

Metabolomic data analysis presents a unique opportunity to advance our understanding of osteosarcoma, a common bone malignancy for which genomic and proteomic studies have had limited success. A major goal of metabolomic studies is to classify osteosarcoma at an early stage, which is required for metastasectomy treatment. In this paper, we apply three classification methods to metabolomic data on osteosarcoma patients collected by the SJTU team: logistic regression, support vector machine (SVM), and random forest (RF). Their performance is evaluated and compared using receiver operating characteristic (ROC) curves. All three classifiers successfully distinguish healthy controls from tumor cases, with random forest outperforming the other two under cross-validation on the training set (accuracy of 88%, 90%, and 97% for logistic regression, SVM, and random forest, respectively). On the testing set, random forest achieved an overall accuracy of 95% with an area under the ROC curve (AUC) of 0.99.
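The two evaluation measures named above, accuracy and the area under the ROC curve, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`roc_points`, `auc`, `accuracy`), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and none of the numbers come from the osteosarcoma data.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points of the ROC curve.

    labels: 1 for tumor, 0 for healthy control (hypothetical encoding).
    scores: classifier scores, higher meaning "more likely tumor".
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples classified correctly at an assumed score threshold."""
    correct = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1))
    return correct / len(labels)


# Toy data, invented for illustration only.
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]

toy_auc = auc(roc_points(toy_labels, toy_scores))
toy_acc = accuracy(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.4f}, accuracy = {toy_acc:.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two are often reported together when classifiers are compared.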
