Combination of kernel PCA and linear support vector machine for modeling a nonlinear relationship between bioactivity and molecular descriptors

In this paper, a two‐step nonlinear classification algorithm, KPCA + LSVM, is proposed to model the structure–activity relationship (SAR) between the bioactivities and molecular descriptors of compounds. It combines kernel principal component analysis (KPCA) with a linear support vector machine (LSVM). KPCA first removes uninformative variation such as noise and captures the latent structure of the training dataset with a set of new variables, the principal components in the kernel‐defined feature space. LSVM then uses the maximal margin hyperplane to obtain good generalization performance in the KPCA‐transformed space. The combination of KPCA and LSVM can improve prediction performance compared with the linear SVM alone and with two nonlinear methods. Three datasets covering different categorical bioactivities of compounds are used to evaluate KPCA + LSVM, and the results show that the algorithm is competitive. Copyright © 2011 John Wiley & Sons, Ltd.
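
The two-step idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the degree-2 polynomial kernel, the toy two-ring data standing in for molecular descriptors, the names `SimpleKPCA`, `train_linear_svm` and `two_rings`, and all hyperparameters are assumptions chosen only to make the example small and self-contained. Step 1 projects the data onto kernel principal components; step 2 fits a soft-margin linear SVM (here by plain subgradient descent) on those scores.

```python
import numpy as np


def poly_kernel(A, B, degree=2):
    """Polynomial kernel (a . b + 1)^degree between rows of A and rows of B."""
    return (A @ B.T + 1.0) ** degree


class SimpleKPCA:
    """Kernel PCA via eigendecomposition of the centered training kernel matrix."""

    def __init__(self, n_components, kernel=poly_kernel):
        self.n_components = n_components
        self.kernel = kernel

    def fit(self, X):
        self.X_train = X
        K = self.kernel(X, X)
        self.col_mean = K.mean(axis=0)
        self.total_mean = K.mean()
        # Center the kernel matrix, i.e. center the data in feature space.
        Kc = K - self.col_mean[None, :] - self.col_mean[:, None] + self.total_mean
        eigvals, eigvecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
        top = np.argsort(eigvals)[::-1][: self.n_components]
        # Scale eigenvectors so that Kc @ alphas gives principal-component scores.
        self.alphas = eigvecs[:, top] / np.sqrt(eigvals[top])
        return self

    def transform(self, X):
        K = self.kernel(X, self.X_train)
        # Center new kernel rows with the statistics of the training kernel.
        Kc = K - K.mean(axis=1, keepdims=True) - self.col_mean[None, :] + self.total_mean
        return Kc @ self.alphas


def train_linear_svm(Z, y, reg=0.01, lr=0.1, epochs=500):
    """Soft-margin linear SVM fitted by full-batch subgradient descent on the hinge loss."""
    w, b = np.zeros(Z.shape[1]), 0.0
    n = len(y)
    for _ in range(epochs):
        violators = y * (Z @ w + b) < 1.0  # points inside or beyond the margin
        grad_w = reg * w - (y[violators] @ Z[violators]) / n
        grad_b = -y[violators].sum() / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def accuracy(w, b, Z, y):
    predictions = np.where(Z @ w + b >= 0.0, 1.0, -1.0)
    return float(np.mean(predictions == y))


def two_rings(n_per_ring, offset=0.0):
    """Toy data: inner ring (radius 1) labeled +1, outer ring (radius 2) labeled -1."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_per_ring, endpoint=False) + offset
    unit = np.column_stack([np.cos(theta), np.sin(theta)])
    X = np.vstack([1.0 * unit, 2.0 * unit])
    y = np.concatenate([np.ones(n_per_ring), -np.ones(n_per_ring)])
    return X, y


X_train, y_train = two_rings(20)
X_test, y_test = two_rings(15, offset=0.1)

# Step 1: KPCA scores. A degree-2 kernel on 2-D inputs spans 5 centered feature directions.
kpca = SimpleKPCA(n_components=5).fit(X_train)
Z_train, Z_test = kpca.transform(X_train), kpca.transform(X_test)

# Step 2: linear SVM in the KPCA-transformed space.
w, b = train_linear_svm(Z_train, y_train)
kpca_lsvm_acc = accuracy(w, b, Z_test, y_test)

# Baseline: the same linear SVM on the raw inputs.
w_raw, b_raw = train_linear_svm(X_train, y_train)
linear_acc = accuracy(w_raw, b_raw, X_test, y_test)

print(f"KPCA + LSVM test accuracy: {kpca_lsvm_acc:.2f}")
print(f"Linear SVM test accuracy:  {linear_acc:.2f}")
```

On this toy problem no straight line in the input space separates the two rings, whereas the squared radius is a linear function of the degree-2 kernel features, so a hyperplane in the KPCA scores can separate them. On real descriptor data the kernel, its parameters, and the number of retained components would have to be tuned, for example by cross-validation.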