Robust QSAR model development in high-throughput catalyst discovery based on genetic parameter optimisation

Abstract: High-throughput strategies are gaining importance in catalyst formulation and discovery. The increased experimental capacity produces valuable data from which quantitative structure–activity relationship (QSAR) models can be developed to link catalyst composition and structure with final performance. Various QSAR modelling algorithms are available; however, they are generally configurable, and their performance depends strongly on the choice of parameters. With the proliferation and increasing sophistication of integrated data-mining tools, there is a need for systematic, robust, and generic parameter optimisation methods. This paper investigates a genetic algorithm (GA) for parameter optimisation of several QSAR methods for classification and regression, including feed-forward neural networks, decision tree generators, and support vector machines, with cross-validation providing the performance estimate. The methods were applied to four datasets: three from recent reports of high-throughput studies and one from our own laboratory. The results confirm that parameter optimisation is a critical step in QSAR modelling and demonstrate the effectiveness of the GA approach. The best results were shared among the modelling methods, emphasising the importance of considering more than one type of model.
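The idea of using a GA to tune model parameters, with cross-validation supplying the fitness, can be illustrated with a minimal sketch. This is not the paper's implementation: the model here is a distance-weighted k-nearest-neighbour regressor chosen only because it fits in a few lines (the paper itself uses neural networks, decision trees, and support vector machines), the data are synthetic, and every name (`make_data`, `knn_predict`, `cv_error`, `ga_optimise`) and every GA setting (population size, mutation steps, parameter ranges) is an assumption made for illustration.

```python
import math
import random


def make_data(n=40, seed=1):
    """Synthetic 1-D regression data (assumption: stands in for a real dataset)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.1)) for x in xs]


def knn_predict(train, x, k, power):
    """Distance-weighted k-NN regression; k and power are the tunable parameters."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    weights = [1.0 / (abs(px - x) + 1e-9) ** power for px, _ in nearest]
    return sum(w * py for w, (_, py) in zip(weights, nearest)) / sum(weights)


def cv_error(data, k, power, folds=5):
    """Mean squared error estimated by interleaved k-fold cross-validation."""
    total = 0.0
    for f in range(folds):
        test = data[f::folds]
        train = [pt for i, pt in enumerate(data) if i % folds != f]
        total += sum((knn_predict(train, x, k, power) - y) ** 2 for x, y in test)
    return total / len(data)


def ga_optimise(data, pop_size=12, generations=10, seed=0):
    """Elitist GA over (k, power), minimising the cross-validated error."""
    rng = random.Random(seed)

    def random_individual():
        return (rng.randint(1, 15), rng.uniform(0.0, 3.0))

    def crossover(a, b):
        # Take one parameter from each parent.
        return (a[0], b[1]) if rng.random() < 0.5 else (b[0], a[1])

    def mutate(ind):
        # Perturb one parameter, clamped to its allowed range.
        k, power = ind
        if rng.random() < 0.5:
            k = min(15, max(1, k + rng.choice([-2, -1, 1, 2])))
        else:
            power = min(3.0, max(0.0, power + rng.gauss(0.0, 0.3)))
        return (k, power)

    population = [random_individual() for _ in range(pop_size)]
    history = []  # best cross-validated error per generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: cv_error(data, *ind))
        history.append(cv_error(data, *ranked[0]))
        elite = ranked[: pop_size // 2]  # the better half survives unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    best = min(population, key=lambda ind: cv_error(data, *ind))
    return best, cv_error(data, *best), history
```

Calling `ga_optimise(make_data())` returns the best `(k, power)` pair found, its cross-validated error, and the per-generation best error. Because the better half of each generation is carried over unchanged, that error can only stay the same or decrease. Swapping in another model only requires replacing `knn_predict` and the parameter encoding; the cross-validated fitness and the GA loop stay the same.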
