Systematic Investigation of Error Distribution in Machine Learning Algorithms Applied to the Quantum-Chemistry QM9 Data Set Using the Bias and Variance Decomposition

Most machine learning applications to quantum-chemistry (QC) data sets rely on a single statistical error parameter, such as the mean squared error (MSE), to evaluate their performance. However, this approach has limitations and can even lead to incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and the variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body functions), (ii) the many-body tensor representation (MBTR, with two- and three-body terms), and (iii) the smooth overlap of atomic positions (SOAP), to evaluate prediction performance for different numbers of molecules in the training samples and to assess the effect of bias and variance on the final MSE. Overall, small training sets are associated with higher MSE, and the larger MSEs are strongly influenced by the bias component. Furthermore, there is little agreement across descriptors on which molecules have the highest errors (outliers). However, the molecules in the intersection of the outlier sets show a strong association with the volume of the convex hull of their geometric coordinates (VCH). Based on the distribution of the MSE (and of its bias and variance components) and on the occurrence of outliers, we suggest using ensembles of low-bias models to minimize the MSE, particularly when the training set contains a small number of molecules.
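To make the decomposition concrete, the following is a minimal sketch of how the MSE at each test point can be split into a squared-bias term and a variance term, MSE = bias^2 + variance, by averaging over models trained on resampled training sets. It is not the procedure used in this work: the synthetic one-dimensional target, the linear fit with numpy.polyfit, the noise level, and the function name bias_variance_decomposition are all assumptions chosen for illustration, and the targets are treated as noise-free, so no irreducible-noise term appears.

```python
import numpy as np


def bias_variance_decomposition(y_true, predictions):
    """Split the per-point MSE into squared bias and variance.

    y_true: array of shape (n_samples,), assumed noise-free targets.
    predictions: array of shape (n_models, n_samples), one row per model
        trained on a different resampled training set.
    """
    y_true = np.asarray(y_true, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mean_pred = predictions.mean(axis=0)
    # Squared bias: gap between the average prediction and the target.
    bias_sq = (mean_pred - y_true) ** 2
    # Variance: spread of the individual models around their average
    # (population variance, ddof=0, so the identity below is exact).
    variance = predictions.var(axis=0)
    # MSE: squared error averaged over the models.
    mse = ((predictions - y_true) ** 2).mean(axis=0)
    return mse, bias_sq, variance


# Hypothetical toy problem: a deliberately too-simple (high-bias) linear model
# fitted to small, noisy samples of a nonlinear target.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x)

preds = []
for _ in range(200):
    idx = rng.integers(0, x.size, size=20)           # small training sample
    y_noisy = y[idx] + rng.normal(0.0, 0.1, size=20)  # noisy training labels
    coeffs = np.polyfit(x[idx], y_noisy, deg=1)       # linear fit
    preds.append(np.polyval(coeffs, x))

mse, bias_sq, variance = bias_variance_decomposition(y, preds)
print("mean MSE:", mse.mean())
print("mean squared bias:", bias_sq.mean())
print("mean variance:", variance.mean())
```

In this toy setting the linear model cannot represent the target, so the squared bias accounts for most of the MSE. Averaging more such models would reduce only the variance term, which is why an ensemble helps most when its members already have low bias.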
