Machine-learning models for combinatorial catalyst discovery

Several machine-learning algorithms (hierarchical clustering, decision trees, k-nearest neighbours, support vector machines and bagging) were used to build models that predict the molecular weight of the polymers produced by a set of 96 homogeneous catalysts. The goal of the study was to develop models that can screen large virtual libraries of catalysts and suggest candidates for further synthesis and screening. The descriptors used to represent the catalysts require no detailed information about the catalysts themselves; they can be calculated from the topology of the ligands alone. With an initial set of five descriptors, every learning algorithm gave models that were about 70% accurate. A larger set of ten descriptors allowed bag classifiers that were 80% accurate to be built. All models were carefully evaluated to detect overfitting (memorization of the training data), and one example of the effects of overfitting is provided. Because the descriptors can be calculated very rapidly and the models themselves are very efficient, these bag classifiers are well suited to screening very large virtual libraries.
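As a rough illustration of what a bag classifier and an overfitting check can look like, the sketch below bags simple decision stumps (one-split decision trees) and compares accuracy on the training data with accuracy on held-out data and with an out-of-bag estimate. Everything in it is an assumption made for illustration: the data are synthetic random numbers shaped like the study (96 samples, five descriptors, a binary high/low label), the labelling rule is hypothetical, and the stump learner is a stand-in rather than the models or software used in the study.

```python
import random
from collections import Counter

def make_synthetic(n=96, n_desc=5, seed=0):
    """Synthetic stand-in data: n samples, n_desc descriptors, binary label."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_desc)] for _ in range(n)]
    # Hypothetical rule: the label depends on two descriptors plus noise.
    y = [int(x[0] + 0.5 * x[1] + rng.gauss(0, 0.5) > 0) for x in X]
    return X, y

def fit_stump(X, y):
    """Exhaustively pick the single (descriptor, threshold, polarity) split
    that classifies the most training rows correctly."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for hi in (0, 1):  # label assigned when row[j] > t
                correct = sum((hi if row[j] > t else 1 - hi) == lab
                              for row, lab in zip(X, y))
                if best is None or correct > best[0]:
                    best = (correct, j, t, hi)
    return best[1:]

def predict_stump(stump, row):
    j, t, hi = stump
    return hi if row[j] > t else 1 - hi

def bag(X, y, n_models=25, seed=1):
    """Train each stump on a bootstrap resample; remember which rows it saw."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        models.append((stump, set(idx)))
    return models

def vote(models, row):
    """Majority vote over all stumps in the bag."""
    votes = Counter(predict_stump(s, row) for s, _ in models)
    return votes.most_common(1)[0][0]

def accuracy(models, X, y):
    return sum(vote(models, row) == lab for row, lab in zip(X, y)) / len(X)

def oob_accuracy(models, X, y):
    """Out-of-bag estimate: each row is scored only by stumps that never saw it."""
    hits = total = 0
    for i, (row, lab) in enumerate(zip(X, y)):
        votes = Counter(predict_stump(s, row)
                        for s, seen in models if i not in seen)
        if votes:
            total += 1
            hits += votes.most_common(1)[0][0] == lab
    return hits / total if total else float("nan")

# Hold out a quarter of the synthetic samples to check for overfitting.
X, y = make_synthetic()
order = list(range(len(X)))
random.Random(2).shuffle(order)
train, test = order[:72], order[72:]
X_train, y_train = [X[i] for i in train], [y[i] for i in train]
X_test, y_test = [X[i] for i in test], [y[i] for i in test]

models = bag(X_train, y_train)
train_acc = accuracy(models, X_train, y_train)
test_acc = accuracy(models, X_test, y_test)
oob_acc = oob_accuracy(models, X_train, y_train)
print(f"training accuracy:    {train_acc:.2f}")
print(f"held-out accuracy:    {test_acc:.2f}")
print(f"out-of-bag estimate:  {oob_acc:.2f}")
```

A training accuracy that is much higher than the held-out or out-of-bag figures is the usual symptom of memorizing the training data. With only a few dozen held-out samples these estimates are noisy, so in practice one would repeat the split several times before drawing conclusions.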
