Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search

Machine learning (ML) plays a growing role in the design and discovery of chemicals, where it aims to reduce the need for expensive experiments and simulations. ML for such applications is promising but difficult: models must generalize to vast chemical spaces from small training sets, and they need reliable uncertainty quantification metrics to identify and prioritize unexplored regions. Ab initio computational chemistry and chemical intuition alike often rely on differences between chemical conditions, rather than on their absolute structure or state, to generate more reliable results. We have developed an analogous comparison-based approach for ML regression, called pairwise difference regression (PADRE), which is applicable to arbitrary underlying learning models and operates on pairs of input data points. During training, the model learns to predict the differences in target value between all possible pairs of training points. During prediction, each test point is paired with every training set point, giving rise to a set of predictions that can be treated as a distribution: its mean serves as the final prediction and its dispersion as an uncertainty measure. We show that pairwise difference regression reliably improves the performance of the random forest algorithm across five chemical ML tasks. Additionally, the pair-derived dispersion is well correlated with model error and performs well in active learning. We also show that this method is competitive with state-of-the-art neural network techniques. Thus, pairwise difference regression is a promising tool for candidate selection algorithms used in chemical discovery.
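The training and prediction steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the pair encoding (both feature vectors plus their element-wise difference) is an assumed choice, and a tiny nearest-neighbour regressor stands in for the random forest so that the sketch needs only the standard library. Any regressor exposing `fit` and `predict` could presumably be substituted as the base model.

```python
import statistics
from itertools import product


def pair_features(a, b):
    """Encode an ordered pair (a, b): both vectors plus a - b (assumed encoding)."""
    return list(a) + list(b) + [ai - bi for ai, bi in zip(a, b)]


class KNNRegressor:
    """Stand-in base learner (hypothetical); averages the k nearest targets."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict(self, X):
        out = []
        for q in X:
            # Squared Euclidean distance to every stored row, nearest first.
            dists = sorted(
                (sum((qi - xi) ** 2 for qi, xi in zip(q, x)), i)
                for i, x in enumerate(self.X)
            )
            nearest = [self.y[i] for _, i in dists[: self.k]]
            out.append(statistics.fmean(nearest))
        return out


class PairwiseDifferenceRegressor:
    """Wraps a base regressor so that it learns target differences between pairs."""

    def __init__(self, base_model):
        self.base_model = base_model

    def fit(self, X, y):
        self.X_train = [list(row) for row in X]
        self.y_train = list(y)
        idx = range(len(self.X_train))
        pairs = list(product(idx, idx))  # all ordered pairs, so n**2 rows
        P = [pair_features(self.X_train[i], self.X_train[j]) for i, j in pairs]
        d = [self.y_train[i] - self.y_train[j] for i, j in pairs]
        self.base_model.fit(P, d)
        return self

    def predict(self, X):
        """Return (means, spreads): one prediction and one dispersion per test point."""
        means, spreads = [], []
        for x in X:
            # Pair the test point with every training point.
            P = [pair_features(x, xt) for xt in self.X_train]
            diffs = self.base_model.predict(P)
            # Each training target plus its predicted difference is one estimate.
            estimates = [yt + dv for yt, dv in zip(self.y_train, diffs)]
            means.append(statistics.fmean(estimates))
            spreads.append(statistics.pstdev(estimates))
        return means, spreads
```

Calling `PairwiseDifferenceRegressor(KNNRegressor(k=1)).fit(X, y).predict(X_test)` returns one mean and one standard deviation per test point. Because training uses all ordered pairs, the number of training rows grows quadratically with the training set size, which suits the small-data setting described above but may call for subsampling pairs on larger sets.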
