Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction

BackgroundWe present a novel feature selection algorithm, Winnowing Artificial Ant Colony (WAAC), that performs simultaneous feature selection and model parameter optimisation for the development of predictive quantitative structure-property relationship (QSPR) models. The WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf Model 2005, 45: 1024–1029). We test the ability of the algorithm to develop a predictive partial least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581–590) of melting point values. We also test its ability to perform feature selection on a support vector machine model for the same dataset.ResultsStarting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model with 68 descriptors which has an RMSE on an external test set of 46.6°C and R2 of 0.51. The number of components chosen for the model was 49, which was close to optimal for this feature selection. The selected SVM model has 28 descriptors (cost of 5, ε of 0.21) and an RMSE of 45.1°C and R2 of 0.54. This model outperforms a kNN model (RMSE of 48.3°C, R2 of 0.47) for the same data and has similar performance to a Random Forest model (RMSE of 44.5°C, R2 of 0.55). However it is much less prone to bias at the extremes of the range of melting points as shown by the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest.ConclusionWith a careful choice of objective function, the WAAC algorithm can be used to optimise machine learning and regression models that suffer from overfitting. Where model parameters also need to be tuned, as is the case with support vector machine and partial least squares models, it can optimise these simultaneously. The moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and the winnowing procedure promotes the removal of irrelevant descriptors.

[1]  R. M. Muir,et al.  Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients , 1962, Nature.

[2]  Thomas Marill,et al.  On the effectiveness of receptors in recognition systems , 1963, IEEE Trans. Inf. Theory.

[3]  A. Wayne Whitney,et al.  A Direct Method of Nonparametric Measurement Selection , 1971, IEEE Transactions on Computers.

[4]  A. Balaban Highly discriminating distance-based topological index , 1982 .

[5]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[6]  D. Livingstone,et al.  Structure-activity relationships of antifilarial antimycin analogues: a multivariate pattern recognition study. , 1990, Journal of medicinal chemistry.

[7]  Anton J. Hopfinger,et al.  Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships , 1994, J. Chem. Inf. Comput. Sci..

[8]  Josef Kittler,et al.  Floating search methods in feature selection , 1994, Pattern Recognit. Lett..

[9]  Ron Kohavi,et al.  Irrelevant Features and the Subset Selection Problem , 1994, ICML.

[10]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[11]  Kimito Funatsu,et al.  GA Strategy for Variable Selection in QSAR Studies: GA-Based PLS Analysis of Calcium Channel Antagonists , 1997, J. Chem. Inf. Comput. Sci..

[12]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[13]  Luca Maria Gambardella,et al.  Ant Algorithms for Discrete Optimization , 1999, Artificial Life.

[14]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[15]  D. Agrafiotis,et al.  Feature selection for structure-activity correlation using binary particle swarms. , 2002, Journal of medicinal chemistry.

[16]  D. Agrafiotis,et al.  Variable selection for QSAR by artificial ant colony systems , 2002, SAR and QSAR in environmental research.

[17]  Andreas Zell,et al.  Prediction of Aqueous Solubility and Partition Coefficient Optimized by a Genetic Algorithm Based Descriptor Selection Method , 2003, J. Chem. Inf. Comput. Sci..

[18]  Johann Gasteiger,et al.  Handbook of Chemoinformatics , 2003 .

[19]  Egon L. Willighagen,et al.  The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo-and Bioinformatics , 2003, J. Chem. Inf. Comput. Sci..

[20]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[21]  Douglas M. Hawkins,et al.  The Problem of Overfitting , 2004, J. Chem. Inf. Model..

[22]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[23]  Rajarshi Guha,et al.  Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors , 2004, J. Chem. Inf. Model..

[24]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[25]  Andreas Zell,et al.  Towards Optimal Descriptor Subset Selection with Support Vector Machines in Classification and Regression , 2004 .

[26]  J. Deneubourg,et al.  Self-organized shortcuts in the Argentine ant , 1989, Naturwissenschaften.

[27]  Ying Liu,et al.  A Comparative Study on Feature Selection Methods for Drug Discovery , 2004, J. Chem. Inf. Model..

[28]  Muthukumarasamy Karthikeyan,et al.  General Melting Point Prediction Based on a Diverse Compound Data Set and Artificial Neural Networks , 2005, J. Chem. Inf. Model..

[29]  Jian-Hui Jiang,et al.  Optimized Block-wise Variable Combination by Particle Swarm Optimization for Partial Least Squares Modeling in Quantitative Structure-Activity Relationship Studies , 2005, J. Chem. Inf. Model..

[30]  Jian-Hui Jiang,et al.  Modified Ant Colony Optimization Algorithm for Variable Selection in QSAR Modeling: QSAR Studies of Cyclooxygenase Inhibitors , 2005, J. Chem. Inf. Model..

[31]  Akash Khandelwal,et al.  In silico ADME modelling 2: computational models to predict human serum albumin binding affinity using ant colony systems. , 2006, Bioorganic & medicinal chemistry.

[32]  Andreas Bender,et al.  Melting Point Prediction Employing k-Nearest Neighbor Algorithms and Genetic Parameter Optimization , 2006, J. Chem. Inf. Model..

[33]  Tomasz Arodz,et al.  Computational methods in developing quantitative structure-activity relationships (QSAR): a review. , 2006, Combinatorial chemistry & high throughput screening.

[34]  Alexander von Homeyer,et al.  Evolutionary Algorithms and Their Applications in Chemistry , 2008 .

[35]  Ashutosh Kumar Singh,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2010 .