Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data

Untargeted metabolomics is a powerful phenotyping tool for understanding the biological mechanisms involved in the development of human pathologies and for identifying early predictive biomarkers. This approach, which relies on multiple analytical platforms such as mass spectrometry (MS) together with chemometrics and bioinformatics, generates massive and complex data that require appropriate analyses to extract biologically meaningful information. Despite the various tools available, handling such large and noisy datasets from a limited number of individuals without risking overfitting remains a challenge. Moreover, when the objective is to identify early predictive markers of a clinical outcome a few years before its occurrence, it becomes essential to use appropriate algorithms and workflows to detect subtle effects within this large amount of data. In this context, this work studies a workflow describing the general feature selection process, using knowledge discovery and data mining methodologies to propose advanced solutions for predictive biomarker discovery. The strategy focused on evaluating a combination of numeric-symbolic approaches for feature selection, with the objective of obtaining the best combination of metabolites producing an effective and accurate predictive model. Relying first on numerical approaches, in particular machine learning methods (support vector machine with recursive feature elimination, SVM-RFE; random forest, RF; and RF-RFE) and univariate statistical analyses (ANOVA), a comparative study was performed on an original metabolomic dataset and on reduced subsets. Leave-one-out cross-validation (LOOCV) was applied as the resampling method to minimize the risk of overfitting. The best k features obtained with the different importance scores from the combination of these approaches were compared, and the stability of the variables was determined using Formal Concept Analysis.
The results showed the value of combining RF-Gini with ANOVA for feature selection, as these two complementary methods selected the 48 best candidates for prediction. Applying linear logistic regression to this reduced dataset gave the best performance in terms of prediction accuracy and number of false positives, with a model including the 5 top variables. These results therefore highlight the value of feature selection methods and the importance of working on reduced datasets for identifying predictive biomarkers from untargeted metabolomics data.
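
As a rough illustration of one step of such a workflow, the sketch below ranks features by a one-way ANOVA F statistic inside each leave-one-out fold and records how often each feature reaches the top k. It is not the authors' implementation: the function names and the toy data are hypothetical, the random forest and SVM-RFE rankings are omitted, and selection frequency across folds is used here as a simple stand-in for the Formal Concept Analysis stability step described above.

```python
from collections import Counter
from statistics import fmean


def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature across class labels."""
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(label, []).append(v)
    n, k = len(values), len(groups)
    grand = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        # Degenerate case: no within-group variance.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_features(X, y, k):
    """Indices of the k features with the largest F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def loocv_selection_frequency(X, y, k):
    """Fraction of leave-one-out folds in which each feature is in the top k."""
    counts = Counter()
    n = len(X)
    for i in range(n):
        # Hold out sample i and rank features on the remaining samples only.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        counts.update(top_k_features(X_train, y_train, k))
    return {j: counts[j] / n for j in range(len(X[0]))}


# Hypothetical toy data: 6 samples x 3 features; only feature 0 separates
# the two classes.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.1],
    [0.9, 4.0, 1.9],
    [3.0, 4.5, 2.0],
    [3.1, 3.5, 2.2],
    [2.9, 4.2, 1.8],
]
y = [0, 0, 0, 1, 1, 1]

freq = loocv_selection_frequency(X, y, k=1)
print(freq)
```

Ranking inside each fold, rather than once on the full dataset, keeps the held-out sample from influencing which features are selected. A feature that is selected in nearly every fold can be treated as a more stable candidate than one that appears only occasionally.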
