Models for Identification of Erroneous Atom-to-Atom Mapping of Reactions Performed by Automated Algorithms

Machine learning (SVM and JRip rule learner) methods have been used in conjunction with the Condensed Graph of Reaction (CGR) approach to identify errors in the atom-to-atom mapping of chemical reactions produced by an automated mapping tool by ChemAxon. The modeling has been performed on the three first enzymatic classes of metabolic reactions from the KEGG database. Each reaction has been converted into a CGR representing a pseudomolecule with conventional (single, double, aromatic, etc.) bonds and dynamic bonds characterizing chemical transformations. The ChemAxon tool was used to automatically detect the matching atom pairs in reagents and products. These automated mappings were analyzed by the human expert and classified as "correct" or "wrong". ISIDA fragment descriptors generated for CGRs for both correct and wrong mappings were used as attributes in machine learning. The learned models have been validated in n-fold cross-validation on the training set followed by a challenge to detect correct and wrong mappings within an external test set of reactions, never used for learning. Results show that both SVM and JRip models detect most of the wrongly mapped reactions. We believe that this approach could be used to identify erroneous atom-to-atom mapping performed by any automated algorithm.

[1]  J. Gasteiger,et al.  The Principle of Minimum Chemical Distance (PMCD) , 1980 .

[2]  Johann Gasteiger,et al.  THE PRINCIPLE OF MINIMUM CHEMICAL DISTANCE (PMCD) , 1980 .

[3]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[4]  M. Kanehisa,et al.  Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. , 2003, Journal of the American Chemical Society.

[5]  M. Kanehisa,et al.  Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. , 2004, Journal of the American Chemical Society.

[6]  Alexandre Varnek,et al.  Substructural fragments: an universal language to encode reactions, molecular and supramolecular structures , 2005, J. Comput. Aided Mol. Des..

[7]  Alexandre Varnek,et al.  Stochastic versus Stepwise Strategies for Quantitative Structure-Activity Relationship GenerationHow Much Effort May the Mining for Successful QSAR Models Take? , 2007, J. Chem. Inf. Model..

[8]  Joannis Apostolakis,et al.  Automatic Determination of Reaction Mappings and Reaction Center Information. 1. The Imaginary Transition State Energy Approach , 2008, J. Chem. Inf. Model..

[9]  I. Tetko,et al.  ISIDA - Platform for Virtual Screening Based on Fragment and Pharmacophoric Descriptors , 2008 .

[10]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[11]  D. Horvath,et al.  ISIDA Property‐Labelled Fragment Descriptors , 2010, Molecular informatics.

[12]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[13]  Nicolas Lachiche,et al.  A Representation to Apply Usual Data Mining Techniques to Chemical reactions - Illustration on the Rate Constant of SN2 reactions in water , 2011, Int. J. Artif. Intell. Tools.

[14]  Susumu Goto,et al.  KEGG for integration and interpretation of large-scale molecular data sets , 2011, Nucleic Acids Res..

[15]  Gilles Marcou,et al.  Mining Chemical Reactions Using Neighborhood Behavior and Condensed Graphs of Reactions Approaches , 2012, J. Chem. Inf. Model..