A Multi-Objective Genetic Algorithm for Outlier Removal

Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models correlate the activities or properties of a set of compounds with their structure-derived descriptors. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other outlier removal methods. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Filtered data sets that were equally good by these metrics were obtained when an additional objective function, termed "preservation", was added to the algorithm so that designated compounds are removed only with low probability. This option is useful when specific compounds should preferably be kept in the final data set, either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.
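
To make the idea of scoring a candidate filtered data set against several objectives more concrete, the following is a minimal Python sketch. It is not the algorithm described in this work: the two objectives (number of compounds removed and leave-one-out kNN prediction error), the function names, and the default `k` are illustrative assumptions, and the genetic search itself (selection, crossover, mutation) is omitted.

```python
import numpy as np

def knn_loo_error(X, y, keep, k=3):
    """Mean squared leave-one-out kNN error over the retained compounds.

    Assumption: activity is predicted as the plain mean of the k nearest
    retained neighbors in Euclidean descriptor space.
    """
    Xk, yk = X[keep], y[keep]
    # Pairwise Euclidean distances between retained compounds.
    d = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a compound is never its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    pred = yk[nn].mean(axis=1)
    return float(np.mean((yk - pred) ** 2))

def objectives(X, y, keep, k=3):
    """Hypothetical objective pair, both to be minimized."""
    return (int((~keep).sum()), knn_loo_error(X, y, keep, k))

def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))

def pareto_front(scores):
    """Indices of the non-dominated score tuples."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]

# Toy data: activity equals the single descriptor, except one outlier.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = X[:, 0].copy()
y[3] = 50.0
keep_all = np.ones(6, dtype=bool)
keep_filtered = keep_all.copy()
keep_filtered[3] = False  # candidate solution that drops the outlier

scores = [objectives(X, y, keep_all, k=2), objectives(X, y, keep_filtered, k=2)]
front = pareto_front(scores)
```

In this toy case neither candidate dominates the other: keeping every compound removes nothing but has a large kNN error, while dropping the outlier costs one removal and lowers the error, so both remain on the non-dominated front. A "preservation" objective of the kind mentioned above could, under the same assumptions, be added as a third tuple entry that penalizes the removal of designated compounds.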
