Building a Generic Graph-based Descriptor Set for use in Drug Discovery

Predicting drug activity from molecular structure is an important field of research in both academia and the pharmaceutical industry. Raw 3D structure data is not in a form suitable for identifying properties with machine learning, so it must be converted into descriptor sets that still capture the important structural properties of the molecule. In this study, a large number of small-molecule structures, obtained from publicly available databases, was used to generate a set of molecular descriptors that can be used with machine learning to predict drug activity. The descriptors were, for the most part, simple graph strings representing chains of connected atoms. With an average of seventy atoms per molecule across a dataset of just over one million molecules, this produced a very large set of simple graph strings two to twelve atoms long. Elimination of duplicates and reverse strings, together with feature reduction techniques, reduced the path count to about three thousand, a size viable for machine learning. Training data from twenty-six data sets was used to build decision tree classifiers with J48 and Random Forest. Forty-three thousand molecules from the NCI HIV dataset were used with the descriptor set to generate decision tree models with good accuracy. A similar algorithm was used to extract ring structures from the molecules. Including thirteen ring structure descriptors increased the accuracy of prediction.
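The path descriptors described above can be illustrated with a minimal sketch. It is not the study's implementation: the function name `path_descriptors`, the adjacency-list input, the `-` separator, and the choice of the lexicographically smaller direction as the canonical form are all assumptions made for illustration. The sketch enumerates every simple path of two to twelve atoms by depth-first search and stores each path once, regardless of the direction in which it was walked.

```python
from typing import Dict, List, Set


def path_descriptors(
    atoms: List[str],
    bonds: Dict[int, List[int]],
    min_len: int = 2,
    max_len: int = 12,
) -> Set[str]:
    """Return canonical strings for all simple paths of min_len..max_len atoms.

    atoms: element symbol of each atom, indexed by position (hypothetical input format).
    bonds: adjacency list mapping an atom index to its bonded neighbours.
    """
    found: Set[str] = set()

    def extend(path: List[int]) -> None:
        if len(path) >= min_len:
            forward = "-".join(atoms[i] for i in path)
            backward = "-".join(atoms[i] for i in reversed(path))
            # Keep one direction only, so a path and its reverse collapse
            # into the same descriptor (assumed canonicalisation rule).
            found.add(min(forward, backward))
        if len(path) == max_len:
            return
        for nxt in bonds[path[-1]]:
            if nxt not in path:  # simple path: never revisit an atom
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found


# Heavy atoms of ethanol as a toy molecule: C-C-O
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = {0: [1], 1: [0, 2], 2: [1]}
print(sorted(path_descriptors(ethanol_atoms, ethanol_bonds)))
```

Using a set removes duplicates within a molecule. Merging the sets from many molecules and then applying a feature selection step would correspond to the reduction stage the abstract mentions, which is not shown here.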
