A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection

Metabolic markers are the core of metabonomic surveys, so the selection of differential metabolites is of great importance for both biological and clinical purposes. Here, a feature selection method was developed for complex metabonomic data sets. Support vector machine (SVM), an effective tool for metabonomics data analysis, was employed as the basic classifier. To find meaningful features efficiently, support vector machine recursive feature elimination (SVM-RFE) was applied first. Then genetic algorithm (GA) and random forest (RF), which consider the interaction among the metabolites and the independent performance of each metabolite in all samples, respectively, were used to obtain more informative metabolic differences and to avoid the risk of false positives. A data set from a plasma metabonomics study of rat liver diseases progressing from hepatitis through cirrhosis to hepatocellular carcinoma was used to validate the method. Besides good classification results for the three kinds of liver disease, 31 important metabolites, including lysophosphatidylethanolamine (LPE) C16:0, palmitoylcarnitine and lysophosphatidylethanolamine (LPE) C18:0, were selected for further study. The results show that the three feature selection methods complement one another well. The combined method also identified more differential metabolites and provided more metabolic information for a “global” understanding of diseases than any single method. Furthermore, the method is also suitable for other complex biological data sets.
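The abstract gives no implementation details, so the following is only a minimal sketch of the first stage, SVM-RFE, under stated assumptions: a two-class problem, a linear SVM fitted by plain subgradient descent, and the commonly used squared-weight ranking criterion. The toy data and the names `make_toy_data`, `train_linear_svm` and `svm_rfe` are hypothetical and are not taken from the study; the GA and RF stages are not shown.

```python
import random


def make_toy_data(n_samples=60, n_features=8, shift=3.0, seed=0):
    """Two-class toy data (assumption): features 0 and 1 carry signal, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = 1 if i % 2 == 0 else -1
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += shift * label
        row[1] += shift * label
        X.append(row)
        y.append(label)
    return X, y


def train_linear_svm(X, y, features, epochs=200, lr=0.01, C=1.0):
    """Fit a linear SVM on a feature subset by full-batch subgradient descent.

    Minimises 0.5*||w||^2 + C * mean(hinge loss). Returns (weights dict, bias).
    """
    w = {j: 0.0 for j in features}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = dict(w)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(w[j] * xi[j] for j in features) + b
            if yi * score < 1.0:  # sample violates the margin
                for j in features:
                    grad_w[j] -= C * yi * xi[j] / n
                grad_b -= C * yi / n
        for j in features:
            w[j] -= lr * grad_w[j]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, n_keep):
    """Recursively drop the feature with the smallest squared SVM weight.

    Returns (kept feature indices, eliminated indices in order of removal).
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        w, _ = train_linear_svm(X, y, remaining)
        worst = min(remaining, key=lambda j: w[j] ** 2)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated


if __name__ == "__main__":
    X, y = make_toy_data()
    kept, eliminated = svm_rfe(X, y, n_keep=2)
    print("kept features:", kept)
    print("elimination order:", eliminated)
```

In a real metabonomics workflow the features would typically be scaled first, the three-class problem would need a multi-class scheme, and the number of features to keep would be chosen by cross-validation rather than fixed in advance.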
