Comprehensive comparison of classifiers for metabolic profiling analysis

Metabonomics is an emerging field that provides insight into physiological processes and differences. Besides the conventional PCA, PLS, and OPLS approaches, machine learning classifiers are increasingly likely to supplement metabolic profiling data analysis. A comprehensive comparison of PLS, support vector machine (SVM, with linear and quadratic kernels), linear discriminant analysis (LDA), and random forest (RF) is reported, applied to clinical metabonomics data. The accuracy of these classifiers was tested by 7-fold and holdout cross-validation. Their stability and tendency to overfit were evaluated by holdout cross-validation and permutation (repeated 100 times). Their predictive ability was investigated with ROC curves, and their sensitivity to irrelevant variables was studied by combining variable ranking with stepwise variable selection. The overall performance of RF and SVM (linear kernel) was superior to that of the other classifiers. Some of the selected variables are of significance for further research on metabolic differences.
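The evaluation scheme described above (7-fold cross-validation followed by a 100-repeat permutation test) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the data are synthetic, a nearest-centroid rule stands in for the PLS, SVM, LDA, and RF classifiers actually compared, and the function names (`make_data`, `kfold_accuracy`, `permutation_test`) are hypothetical rather than taken from the study.

```python
import random
from statistics import mean


def make_data(n_per_class=35, n_features=5, shift=3.0, seed=0):
    """Synthetic two-class data (assumption): only feature 0 carries signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] += shift * label  # class 1 is shifted along feature 0
            X.append(row)
            y.append(label)
    return X, y


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier: assign each sample to the closest class mean."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X_test]


def kfold_accuracy(X, y, k=7, seed=0):
    """Mean accuracy over k folds built from a shuffled index list."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        pred = nearest_centroid_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        scores.append(mean(int(p == y[i]) for p, i in zip(pred, fold)))
    return mean(scores)


def permutation_test(X, y, n_perm=100, k=7, seed=0):
    """Compare the observed CV accuracy with accuracies on shuffled labels."""
    rng = random.Random(seed)
    observed = kfold_accuracy(X, y, k=k, seed=seed)
    null_scores = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break the link between profiles and labels
        null_scores.append(kfold_accuracy(X, y_perm, k=k, seed=seed))
    # Empirical p-value with the usual +1 correction
    p_value = (1 + sum(s >= observed for s in null_scores)) / (n_perm + 1)
    return observed, null_scores, p_value


if __name__ == "__main__":
    X, y = make_data()
    observed, null_scores, p_value = permutation_test(X, y)
    print(f"7-fold accuracy: {observed:.3f}")
    print(f"mean accuracy on permuted labels: {mean(null_scores):.3f}")
    print(f"permutation p-value: {p_value:.4f}")
```

A classifier that is not overfitting should score near chance on the permuted labels and clearly above that on the true labels; a small gap between the two would suggest that the apparent accuracy owes more to model flexibility than to real class differences.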
