Genetic Optimization of Training Sets for Improved Machine Learning Models of Molecular Properties.

Training statistical machine learning models of quantum mechanical molecular properties requires large data sets that exemplify the map from chemical structure to molecular property. Intelligent a priori selection of training examples is often difficult or impossible, as prior knowledge may be unavailable. Ordinarily, a representative selection of training molecules from such data sets is obtained through random sampling. We use genetic algorithms for the optimization of training set composition consisting of tens of thousands of small organic molecules. The resulting machine learning models are considerably more accurate: in the limit of small training sets, mean absolute errors for out-of-sample predictions are reduced by up to ∼75%. We present and discuss optimized training sets consisting of 10 molecular classes for all molecular properties studied. We show that these classes can be used to design improved training sets for the generation of machine learning models of the same properties in similar but unrelated molecular sets.
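The idea of optimizing training set composition with a genetic algorithm can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional descriptor, the function `toy_property`, the nearest-neighbor predictor, and all parameter values are hypothetical stand-ins chosen so that the example runs with the Python standard library alone. Each individual is a fixed-size subset of a candidate pool, fitness is the mean absolute error on a held-out validation set, and the best half of each generation survives unchanged.

```python
import random

def toy_property(x):
    # Hypothetical stand-in for a molecular property of a 1-D descriptor x.
    return x ** 3 - x

def mean_abs_error(train_idx, eval_idx, xs, ys):
    # Stand-in model: predict each evaluation point from its nearest training point.
    total = 0.0
    for i in eval_idx:
        nearest = min(train_idx, key=lambda j: abs(xs[j] - xs[i]))
        total += abs(ys[nearest] - ys[i])
    return total / len(eval_idx)

def crossover(parent_a, parent_b, size, rng):
    # Child draws its members from the union of both parents (no duplicates).
    merged = sorted(set(parent_a) | set(parent_b))
    return rng.sample(merged, size)

def mutate(child, pool_idx, rate, rng):
    # With probability `rate`, swap a member for an unused pool candidate.
    child = list(child)
    for k in range(len(child)):
        if rng.random() < rate:
            unused = [c for c in pool_idx if c not in child]
            child[k] = rng.choice(unused)
    return child

def optimize_training_set(xs, ys, pool_idx, valid_idx, size,
                          pop_size=20, generations=25, rate=0.1, seed=0):
    rng = random.Random(seed)
    fitness = lambda ind: mean_abs_error(ind, valid_idx, xs, ys)
    # Generation zero: randomly sampled training sets (the usual baseline).
    population = [rng.sample(pool_idx, size) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]  # elitism: best half is kept
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, size, rng), pool_idx, rate, rng))
        population = survivors + children
    population.sort(key=fitness)
    history.append(fitness(population[0]))
    return population[0], history

# Synthetic data: 60 pool candidates, 30 validation points, 30 test points.
data_rng = random.Random(1)
xs = [data_rng.uniform(-2.0, 2.0) for _ in range(120)]
ys = [toy_property(x) for x in xs]
pool_idx = list(range(0, 60))
valid_idx = list(range(60, 90))
test_idx = list(range(90, 120))
size = 8

best, history = optimize_training_set(xs, ys, pool_idx, valid_idx, size)
random_set = random.Random(2).sample(pool_idx, size)
print("validation MAE, first vs last generation:", history[0], history[-1])
print("test MAE, random set:", mean_abs_error(random_set, test_idx, xs, ys))
print("test MAE, optimized set:", mean_abs_error(best, test_idx, xs, ys))
```

Because the best half of each generation is carried over unchanged, the best validation error cannot increase from one generation to the next. Whether the optimized set also improves the error on the separate test set depends on the data, which is why the sketch reports that number separately.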
