The One‐Class Classification Approach to Data Description and to Models Applicability Domain

In this paper, we associate the applicability domain (AD) of QSAR/QSPR models with the area of the input (descriptor) space in which the density of training data points exceeds a certain threshold. It can be proved that the predictive performance of models built on the training set is better for test compounds inside the high‐density area than for those outside it. Instead of searching for a decision surface that separates high‐ and low‐density areas in the input space, the one‐class classification 1‐SVM approach looks for a hyperplane in the associated feature space. Unlike other AD definitions reported in the literature, this approach: (i) is purely “data‐based”, i.e., it assigns the same AD to all models built on the same training set; (ii) provides results that depend only on the initial pool of descriptors generated for the training set; (iii) can be used with a very large number of descriptors, as well as in the framework of structured kernel‐based approaches, e.g., chemical graph kernels. The approach has been applied to improve the performance of QSPR models for the stability constants of complexes of organic ligands with alkaline‐earth metals in water.
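
For orientation, the hyperplane search mentioned above can be written out in the standard one‐class SVM form of Schölkopf et al. This is a sketch in notation that the abstract itself does not define, so every symbol is an assumption: $x_1,\dots,x_n$ are the training compounds in descriptor space, $\Phi$ is the feature map induced by a kernel $k$, $w$ and $\rho$ are the normal vector and offset of the hyperplane, $\xi_i$ are slack variables, and $\nu \in (0,1]$ is an upper bound on the fraction of training points left outside the domain.

```latex
% Primal problem: separate the mapped training points from the origin
\min_{w,\;\xi,\;\rho}\quad
  \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i
  - \rho
\qquad\text{s.t.}\quad
  \langle w,\Phi(x_i)\rangle \ge \rho - \xi_i,\quad
  \xi_i \ge 0,\quad i=1,\dots,n .

% Dual problem, expressed through the kernel k(x,x') = <Phi(x),Phi(x')>
\min_{\alpha}\quad
  \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,k(x_i,x_j)
\qquad\text{s.t.}\quad
  0 \le \alpha_i \le \frac{1}{\nu n},\quad
  \sum_{i=1}^{n}\alpha_i = 1 .

% Decision function: a compound x is taken to be inside the domain when f(x) = +1
f(x) = \operatorname{sgn}\!\Bigl(\sum_{i=1}^{n}\alpha_i\,k(x_i,x) - \rho\Bigr).
```

Because the dual and the decision function involve the data only through $k$, the same formulation applies whether $k$ is computed from a descriptor vector or from a structured kernel such as a chemical graph kernel, which is consistent with point (iii) above. How the paper chooses $\nu$ and the kernel is not stated in the abstract.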
