A Hybrid Data Mining Approach for the Identification of Biomarkers in Metabolomic Data

In this paper, we introduce an approach for analyzing complex biological data obtained from metabolomic analytical platforms. Such platforms generate massive and complex data, and extracting meaningful biological information from them requires appropriate methods. The datasets to analyze consist of a small number of individuals described by a large number of attributes (variables). In this study, we mine metabolomic data to identify predictive biomarkers of metabolic diseases such as type 2 diabetes. Our experiments show that numerical methods, e.g. Support Vector Machines (SVM), Random Forests (RF), and ANOVA, can be successfully combined with a symbolic method, Formal Concept Analysis (FCA), to discover the best combination of predictive features. In our results, RF and ANOVA appear to be the methods best suited to feature selection and discovery. We then use FCA to visualize the selected markers in an interpretable concept lattice. The output of our experiments is a short list of the 10 best potential predictive biomarkers.
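
To make the FCA step concrete, the following is a minimal sketch of how formal concepts can be derived from a binary context that relates candidate markers to the selection methods that retained them. The feature names (`m1` to `m4`), their assignments to methods, and the function names are hypothetical and chosen only for illustration. The brute-force enumeration over attribute subsets is a simple stand-in and is not claimed to be the algorithm used in the paper.

```python
from itertools import combinations

# Hypothetical binary context: candidate feature -> methods that retained it.
# These values are illustrative only, not results from the paper.
context = {
    "m1": {"RF", "ANOVA", "SVM"},
    "m2": {"RF", "ANOVA"},
    "m3": {"RF"},
    "m4": {"SVM"},
}


def extent(attrs, context):
    """Objects (features) that have every attribute (method) in attrs."""
    return frozenset(obj for obj, has in context.items() if attrs <= has)


def intent(objs, context, all_attrs):
    """Attributes (methods) shared by every object (feature) in objs."""
    shared = set(all_attrs)
    for obj in objs:
        shared &= context[obj]
    return frozenset(shared)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every attribute subset.

    Exponential in the number of attributes, so only suitable for a
    small context such as a handful of selection methods.
    """
    all_attrs = frozenset().union(*context.values())
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), size):
            ext = extent(frozenset(subset), context)
            concepts.add((ext, intent(ext, context, all_attrs)))
    return concepts


concepts = formal_concepts(context)
# Print concepts from the most general (largest extent) to the most specific.
for ext, itt in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(ext), sorted(itt))
```

Each resulting pair groups the features that share exactly a given set of methods; ordering these pairs by inclusion of their extents yields the concept lattice. In this toy context, a feature retained by all three methods sits at the bottom of the lattice, which is one way such a lattice can highlight markers on which the methods agree.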
