Classification of Highly Unbalanced CYP450 Data of Drugs Using Cost Sensitive Machine Learning Techniques

In this paper, we study the classification of unbalanced data sets of drugs. As an example, we chose a data set of inhibitors of the cytochrome P450 2D6 isoform. Human cytochrome P450 2D6 plays a key role in the metabolism of many drugs, which makes it an important consideration in preclinical drug discovery. We collected a data set from annotated public data and calculated physicochemical properties with chemoinformatics methods. On this data, we built classifiers based on machine learning methods. When class distributions are unbalanced, conventional machine learning methods are biased toward the larger class. To overcome this problem and obtain classifiers that are both sensitive and accurate, we combine machine learning and feature selection methods with techniques that address unbalanced classification, such as oversampling and threshold moving. We used our own implementation of a support vector machine algorithm as well as the maximum entropy method. Our feature selection is based on the unsupervised McCabe method. We compare the classification results for our test set structurally with compounds from the training set. We show that the applied algorithms enable effective high-throughput in silico classification of potential drug candidates.
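
The abstract names oversampling and threshold moving without giving details, so the following is only a minimal sketch of the two general ideas, not the authors' implementation. It assumes binary labels with 1 marking the minority class (for example, inhibitors), classifier scores that can be read as calibrated probabilities of the minority class, and illustrative misclassification costs. The function names `oversample_minority` and `move_threshold` are hypothetical.

```python
import random


def oversample_minority(features, labels, minority_label=1, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Nothing to do if there is no minority row or the classes are already balanced.
    if not minority or len(minority) >= len(majority):
        return list(features), list(labels)
    # Draw duplicates with replacement from the minority rows.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(labels))) + extra
    return [features[i] for i in indices], [labels[i] for i in indices]


def move_threshold(scores, false_negative_cost, false_positive_cost):
    """Label scores with a cost-derived cutoff instead of the default 0.5.

    Assumes each score is a calibrated probability p of the minority class.
    Predicting 1 is cheaper in expectation when
    p * false_negative_cost >= (1 - p) * false_positive_cost.
    """
    threshold = false_positive_cost / (false_positive_cost + false_negative_cost)
    predictions = [1 if p >= threshold else 0 for p in scores]
    return predictions, threshold


if __name__ == "__main__":
    # Toy data: four majority rows and one minority row (values are illustrative).
    X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
    y = [0, 0, 0, 0, 1]
    X_balanced, y_balanced = oversample_minority(X, y)
    print(y_balanced.count(0), y_balanced.count(1))

    # Missing a minority compound is assumed to cost four times a false alarm.
    print(move_threshold([0.1, 0.3, 0.6], false_negative_cost=4, false_positive_cost=1))
```

Oversampling changes the training data before a classifier is fit, whereas threshold moving leaves the trained classifier unchanged and only shifts the decision cutoff, so the two can be applied separately or together.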
