Structure-Property Maps with Kernel Principal Covariates Regression

Data analyses based on linear methods constitute the simplest, most robust, and most transparent approaches to the automatic processing of large amounts of data for building supervised or unsupervised machine learning models. Principal covariates regression (PCovR) is an underappreciated method that interpolates between principal component analysis and linear regression, and can be used to reveal structure-property relations in terms of simple-to-interpret, low-dimensional maps. Here we provide a pedagogic overview of these data analysis schemes, including the use of the kernel trick to introduce an element of non-linearity while maintaining most of the convenience and simplicity of linear approaches. We then introduce a kernelized version of PCovR and a sparsified extension, and demonstrate the performance of this approach in revealing and predicting structure-property relations in chemistry and materials science, with examples including elemental carbon, porous silicate frameworks, organic molecules, amino acid conformers, and molecular materials.
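
To make the interpolation between principal component analysis and linear regression concrete, the following is a minimal sketch of linear PCovR in its sample-space form. It is an illustration under stated assumptions, not the implementation used in the paper: the function name `pcovr_embed`, the mixing parameter `alpha`, the small `ridge` regularizer, and the scaling of the features and targets to unit Frobenius norm are all choices made here for the example. With `alpha=1` the map reduces to the PCA projection, and with `alpha=0` it is determined by the regression prediction alone.

```python
import numpy as np

def pcovr_embed(X, Y, alpha=0.5, n_components=2, ridge=1e-10):
    """Hypothetical helper: low-dimensional PCovR map of the rows of X."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]

    # Centre both blocks and scale them to unit Frobenius norm so that
    # the two terms mixed by alpha are on a comparable scale (assumption).
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)

    # Ridge-regularised least-squares approximation of Y from X.
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    Y_hat = X @ W

    # Modified Gram matrix: alpha weights the PCA-like term,
    # (1 - alpha) weights the regression-like term.
    K = alpha * (X @ X.T) + (1.0 - alpha) * (Y_hat @ Y_hat.T)

    # Keep the leading eigenpairs; clip tiny negative eigenvalues
    # that can arise from round-off.
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:n_components]
    evals = np.clip(evals[order], 0.0, None)
    return evecs[:, order] * np.sqrt(evals)

# Illustrative usage on synthetic data (not data from the paper).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
y_demo = X_demo @ rng.normal(size=6) + 0.1 * rng.normal(size=50)
T_demo = pcovr_embed(X_demo, y_demo, alpha=0.5, n_components=2)
```

Plotting the two columns of `T_demo` and colouring the points by `y_demo` gives the kind of structure-property map described above; varying `alpha` shows the trade-off between preserving the variance of the features and correlating with the target.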
