Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery.

Biomarker discovery is one important goal in metabolomics, which is typically modeled as selecting the most discriminating metabolites for classification and often referred to as variable importance analysis or variable selection. Until now, a number of variable importance analysis methods to discover biomarkers in the metabolomics studies have been proposed. However, different methods are mostly likely to generate different variable ranking results due to their different principles. Each method generates a variable ranking list just as an expert presents an opinion. The problem of inconsistency between different variable ranking methods is often ignored. To address this problem, a simple and ideal solution is that every ranking should be taken into account. In this study, a strategy, called rank aggregation, was employed. It is an indispensable tool for merging individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population. This "super"-list is regarded as the final ranking for biomarker discovery. Finally, it was used for biomarkers discovery and selecting the best variable subset with the highest predictive classification accuracy. Nine methods were used, including three univariate filtering and six multivariate methods. When applied to two metabolic datasets (Childhood overweight dataset and Tubulointerstitial lesions dataset), the results show that the performance of rank aggregation has improved greatly with higher prediction accuracy compared with using all variables. Moreover, it is also better than penalized method, least absolute shrinkage and selectionator operator (LASSO), with higher prediction accuracy or less number of selected variables which are more interpretable.

[1]  Yizeng Liang,et al.  Erratum to: Informative metabolites identification by variable importance analysis based on random variable combination , 2015, Metabolomics.

[2]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[3]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[4]  S. Wold,et al.  PLS-regression: a basic tool of chemometrics , 2001 .

[5]  O. Kvalheim,et al.  Biomarker discovery in mass spectral profiles by means of selectivity ratio plot , 2009 .

[6]  Qing-Song Xu,et al.  Using variable combination population analysis for variable selection in multivariate calibration. , 2015, Analytica chimica acta.

[7]  Dong-Sheng Cao,et al.  A simple idea on applying large regression coefficient to improve the genetic algorithm-PLS for variable selection in multivariate calibration , 2014 .

[8]  P. Green Reversible jump Markov chain Monte Carlo computation and Bayesian model determination , 1995 .

[9]  Dong-Sheng Cao,et al.  Model population analysis for variable selection , 2010 .

[10]  Inge S. Helland,et al.  Relevant components in regression , 1993 .

[11]  Vasyl Pihur,et al.  RankAggreg, an R package for weighted rank aggregation , 2009, BMC Bioinformatics.

[12]  Dong-Sheng Cao,et al.  A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration. , 2014, Analytica chimica acta.

[13]  Dong-Sheng Cao,et al.  Recipe for uncovering predictive genes using support vector machines based on model population analysis , 2011, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[14]  Tahir Mehmood,et al.  A review of variable selection methods in Partial Least Squares Regression , 2012 .

[15]  Stefania Favilla,et al.  Assessing feature relevance in NPLS models by VIP , 2013 .

[16]  Lutgarde M. C. Buydens,et al.  Interpretation of variable importance in Partial Least Squares with Significance Multivariate Correlation (sMC) , 2014 .

[17]  Huan Liu,et al.  Feature Selection for Classification , 1997, Intell. Data Anal..

[18]  Ryan Gosselin,et al.  A Bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications , 2010 .

[19]  Dong-Sheng Cao,et al.  Recipe for revealing informative metabolites based on model population analysis , 2010, Metabolomics.

[20]  Olav M. Kvalheim,et al.  Interpretation of latent-variable regression models , 1989 .

[21]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[22]  Larry A. Rendell,et al.  The Feature Selection Problem: Traditional Methods and a New Algorithm , 1992, AAAI.

[23]  G. Shulman,et al.  Skeletal muscle lipid metabolism with obesity. , 2003, American journal of physiology. Endocrinology and metabolism.

[24]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[25]  Andreu Palou,et al.  Blood amino acid compartmentation in men and women with different degrees of obesity , 1998 .

[26]  Vasyl Pihur,et al.  Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach , 2007, Bioinform..

[27]  K. Siamopoulos,et al.  Evaluation of tubulointerstitial lesions' severity in patients with glomerulonephritides: an NMR-based metabonomic study. , 2007, Journal of proteome research.

[28]  Yong-Huan Yun,et al.  A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals. , 2015, The Analyst.

[29]  W. Cai,et al.  A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra , 2008 .

[30]  Dongsheng Cao,et al.  Plasma metabolic fingerprinting of childhood obesity by GC/MS in conjunction with multivariate statistical analysis. , 2010, Journal of pharmaceutical and biomedical analysis.

[31]  C Lawrence Kien,et al.  Increasing dietary palmitic acid decreases fat oxidation and daily energy expenditure. , 2005, The American journal of clinical nutrition.

[32]  Qing-Song Xu,et al.  Random frog: an efficient reversible jump Markov Chain Monte Carlo-like approach for variable selection with applications to gene selection and disease classification. , 2012, Analytica chimica acta.

[33]  Olav M. Kvalheim,et al.  Interpretation of partial least squares regression models by means of target projection and selectivity ratio plots , 2010 .

[34]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[35]  Lunzhao Yi,et al.  A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling. , 2014, The Analyst.

[36]  Jaques Reifman,et al.  Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions , 2002, Bioinform..

[37]  D. Massart,et al.  Elimination of uninformative variables for multivariate calibration. , 1996, Analytical chemistry.

[38]  Dong-Sheng Cao,et al.  A new strategy to prevent over-fitting in partial least squares models based on model population analysis. , 2015, Analytica chimica acta.

[39]  Melanie Hilario,et al.  Approaches to dimensionality reduction in proteomic biomarker studies , 2007, Briefings Bioinform..

[40]  Dong-Sheng Cao,et al.  An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration. , 2013, Spectrochimica acta. Part A, Molecular and biomolecular spectroscopy.