On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 1—From Data Collection to Model Construction: Understanding of the Methods and Their Effects

In the present work, a multi-angle approach is adopted to develop two ML-QSPR models for predicting the enthalpy of formation and the entropy of molecules in their ideal gas state. The molecules are represented by high-dimensional vectors of structural and physico-chemical characteristics (i.e., descriptors). An overview is provided of the methods that can be employed at each step of the ML-QSPR procedure (i.e., data preprocessing, dimensionality reduction and model construction), and an attempt is made to clarify how a given choice of method affects model performance, interpretability and applicability domain. The well-known OECD principles for the validation of (Q)SAR models are also considered and addressed. The employed data set is representative of two common problems in ML-QSPR modeling, namely the high-dimensional descriptor-based representation and the high chemical diversity of the molecules. This diversity directly affects the applicability of the developed models to a new molecule. The data set complexity is addressed through customized data preprocessing techniques and genetic algorithms. The former improve the data quality while limiting the loss of information, and the latter allow for the automatic identification of the most important descriptors, in accordance with a physical interpretation. The best performance is obtained with Lasso linear models (test MAE = 25.2 kJ/mol for the enthalpy and 17.9 J/mol/K for the entropy). Finally, the overall procedure is also tested on various enthalpy- and entropy-related data sets from the literature to check its applicability to other problems. Competitive performance is obtained, highlighting that different methods and molecular representations can lead to good performance.
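
As an illustration of the model-construction step, the following minimal sketch fits a Lasso linear model to a descriptor matrix and reports the mean absolute error (MAE) on a held-out test set. It is not the authors' implementation: the data are synthetic, the coordinate-descent solver is a plain-Python stand-in for a library routine, and every name and setting (`fit_lasso`, `lam`, the 80/20 split, the six hypothetical descriptors) is an assumption made for the example. The resulting numbers have no relation to the values reported above.

```python
import random


def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def predict(X, w, b):
    """Linear prediction b + X.w for a list of descriptor rows."""
    return [b + sum(wk * xk for wk, xk in zip(w, row)) for row in X]


def fit_lasso(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - b - Xw||^2 + lam * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for row, target in zip(X, y):
                # Residual with descriptor j left out of the prediction.
                partial = target - b - sum(w[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * partial
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # The intercept is not penalised: set it to the mean residual.
        b = sum(t - q for t, q in zip(y, predict(X, w, 0.0))) / n
    return w, b


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - q) for t, q in zip(y_true, y_pred)) / len(y_true)


# Synthetic stand-in for a descriptor table (assumption): only the first two
# of six descriptors actually drive the target property.
random.seed(0)
n_samples, n_descriptors = 100, 6
X = [[random.gauss(0.0, 1.0) for _ in range(n_descriptors)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0.0, 0.1) for row in X]

# Simple hold-out split: the first 80 rows train the model, the last 20 test it.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

w, b = fit_lasso(X_train, y_train, lam=0.1)
mae_test = mean_absolute_error(y_test, predict(X_test, w, b))
selected = [j for j, wj in enumerate(w) if wj != 0.0]

print("selected descriptor indices:", selected)
print("test MAE:", round(mae_test, 3))
```

The L1 penalty shrinks the coefficients of uninformative descriptors toward zero, so the fitted model also acts as a descriptor selector. This is one reason a Lasso model can remain interpretable on a high-dimensional descriptor set.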
