Understanding and Classifying Metabolite Space and Metabolite-Likeness

While the entirety of ‘chemical space’ is huge (it is assumed to contain between 10⁶³ and 10²⁰⁰ ‘small molecules’), distinct subsets of this space can nonetheless be defined according to certain structural parameters. One such subspace is the chemical space spanned by endogenous metabolites, defined as ‘naturally occurring’ products of an organism's metabolism. To understand this part of chemical space in more detail, we analyzed the chemical space populated by human metabolites in two ways. First, we performed Principal Component Analysis (PCA), hierarchical clustering and scaffold analysis of metabolites and non-metabolites to identify which chemical features are characteristic of each class of compounds. We found that heteroatom content (both oxygen and nitrogen), as well as the presence of particular ring systems, distinguished the two groups. Second, we established which molecular descriptors and classifiers are capable of distinguishing metabolites from non-metabolites by assigning a ‘metabolite-likeness’ score. The combination of MDL Public Keys and Random Forest exhibited the best overall classification performance, with an AUC value of 99.13%, a specificity of 99.84% and a selectivity of 88.79%. This performance is slightly better than that of previous classifiers. Interestingly, drugs occupy two distinct areas of metabolite-likeness, one being more ‘synthetic’ and the other more ‘metabolite-like’. On a truly prospective dataset of 457 compounds, 95.84% of compounds were classified correctly. Overall, we are confident that this work contributes both to the classification of metabolites and to a better understanding of metabolite chemical space. This knowledge can now be used in the development of new drugs that need to resemble metabolites and, in our work in particular, for assessing the metabolite-likeness of candidate molecules during metabolite identification in the metabolomics field.
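
The performance figures quoted above (AUC, specificity, selectivity) can be made concrete with a small worked example. The sketch below is not the pipeline used in this work: the labels and ‘metabolite-likeness’ scores are invented toy values, the function names are hypothetical, and the 0.5 decision threshold is an assumption. It also assumes that ‘selectivity’ denotes the true positive rate (often called sensitivity); that reading is not confirmed by the text.

```python
# Toy illustration of the metrics reported in the abstract.
# All data are invented; 1 = metabolite, 0 = non-metabolite.

def confusion_counts(labels, scores, threshold=0.5):
    """Return (tp, fn, tn, fp) when score >= threshold predicts 'metabolite'."""
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted_metabolite = score >= threshold
        if label == 1 and predicted_metabolite:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_metabolite:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def true_positive_rate(tp, fn):
    """Fraction of metabolites that are recognised as metabolites."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-metabolites that are recognised as non-metabolites."""
    return tn / (tn + fp)


def roc_auc(labels, scores):
    """AUC as the probability that a random metabolite outscores a random
    non-metabolite (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores from some classifier (not real results).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.20, 0.15, 0.10, 0.05, 0.40]

tp, fn, tn, fp = confusion_counts(labels, scores)
print("true positive rate:", true_positive_rate(tp, fn))
print("specificity:", specificity(tn, fp))
print("AUC:", roc_auc(labels, scores))
```

Unlike the two threshold-dependent rates, the AUC is computed from the ranking of the scores alone, so a classifier can combine a very high AUC with a noticeably lower true positive rate at one particular cut-off.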
