A hybrid and exploratory approach to knowledge discovery in metabolomic data

In this paper, we propose a hybrid and exploratory knowledge discovery approach for analyzing complex metabolomic data, based on a combination of supervised classifiers, pattern mining, and Formal Concept Analysis (FCA). The approach relies on three main operations: preprocessing, classification, and postprocessing. Classifiers are applied to datasets of the form individuals × features and produce sets of ranked features, which are then analyzed further. Pattern mining and FCA provide a complementary analysis and support visualization. A practical application of this framework is presented in the context of metabolomic data, where two interrelated problems are considered: discrimination and prediction of class membership. The dataset is characterized by a small set of individuals and a large set of features, among which predictive biomarkers of clinical outcomes should be identified. The problems of combining numerical and symbolic data mining methods, and of handling both discrimination and prediction, are detailed and discussed. Moreover, it appears that visualization based on FCA can be used both for guiding knowledge discovery and for interpretation by domain analysts.
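As an illustration of how ranked features could be handed to FCA, the following minimal sketch builds a binary context from per-classifier rankings and enumerates its formal concepts. It is not the procedure of the paper: the classifier names, feature names, rankings, cut-off `k`, and helper functions are all hypothetical, and the concepts are enumerated by brute force, which is only practical for very small contexts.

```python
from itertools import combinations

# Hypothetical rankings: each classifier lists features from best to worst.
rankings = {
    "RF": ["m1", "m2", "m3", "m4"],
    "SVM": ["m2", "m1", "m4", "m3"],
    "ANOVA": ["m2", "m3", "m5", "m1"],
}


def build_context(rankings, k):
    """Binary context: feature -> set of classifiers ranking it in their top k."""
    context = {}
    for clf, ranking in rankings.items():
        for feat in ranking[:k]:
            context.setdefault(feat, set()).add(clf)
    return context


def formal_concepts(context, attributes):
    """Enumerate (extent, intent) pairs by closing every subset of attributes."""
    attrs = sorted(attributes)
    concepts = set()
    for r in range(len(attrs) + 1):
        for subset in combinations(attrs, r):
            wanted = set(subset)
            # Extent: objects (features) having all the wanted attributes.
            extent = frozenset(o for o, a in context.items() if wanted <= a)
            # Intent: attributes shared by every object of the extent.
            if extent:
                intent = frozenset(set.intersection(*(context[o] for o in extent)))
            else:
                intent = frozenset(attrs)
            concepts.add((extent, intent))
    return concepts


context = build_context(rankings, k=2)
concepts = formal_concepts(context, rankings.keys())

# Print concepts from the largest extent to the smallest.
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))
```

In this toy context, a concept whose intent contains several classifiers groups the features that those classifiers all rank highly, which is the kind of agreement a domain analyst might inspect in a concept lattice.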
