Machine Learning Methods to Predict Density Functional Theory B3LYP Energies of HOMO and LUMO Orbitals

Machine learning algorithms were explored for the fast estimation of HOMO and LUMO orbital energies calculated by DFT B3LYP, using molecular descriptors based exclusively on connectivity. The project involved the retrieval and generation of molecular structures, quantum chemical calculations for a database of >111 000 structures, development of new molecular descriptors, and training and validation of machine learning models. Several machine learning algorithms were screened, and an applicability domain was defined based on Euclidean distances to the training set. Random forest models predicted an external test set of 9989 compounds with mean absolute errors (MAE) as low as 0.15 and 0.16 eV for the HOMO and LUMO orbitals, respectively. The impact of the quantum chemical calculation protocol was assessed with a subset of compounds. Including the orbital energy calculated by PM7 as an additional descriptor significantly improved the quality of the estimations, reducing the MAE by more than 30%.
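As an illustration of two ideas mentioned above, the following minimal sketch shows one way an applicability domain based on Euclidean distances to the training set could be checked, and how an MAE is computed. It is not the authors' implementation. The abstract does not say how the distance is aggregated or what cutoff was used, so the nearest-neighbour rule, the `threshold` value, the function names, and the toy descriptor vectors and energies are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_distance(query, training_set):
    """Distance from a query compound to its closest training compound.
    Assumption: nearest-neighbour distance is used; the abstract only
    says 'Euclidean distances to the training set'."""
    return min(euclidean(query, t) for t in training_set)

def in_domain(query, training_set, threshold):
    """True if the query lies within the (hypothetical) distance cutoff."""
    return nearest_distance(query, training_set) <= threshold

def mean_absolute_error(y_true, y_pred):
    """MAE between reference and predicted orbital energies (same units)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy descriptor vectors (hypothetical, two descriptors per compound).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
queries = [(0.5, 0.5), (5.0, 5.0)]
threshold = 1.0  # hypothetical cutoff, not taken from the paper

flags = [in_domain(q, train, threshold) for q in queries]

# Toy reference and predicted HOMO energies in eV (made-up numbers).
reference = [-6.10, -5.80, -6.40]
predicted = [-6.00, -6.00, -6.30]
mae = mean_absolute_error(reference, predicted)
```

In this toy case the first query is close to the training compounds and is flagged as inside the domain, while the second is far from all of them and is flagged as outside. In practice the descriptors would normally be scaled before distances are computed, and the cutoff would be chosen from the distribution of distances within the training set.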
