Quantitative structure-property relationships for the calculation of the soil adsorption coefficient using machine learning algorithms with calculated chemical properties from open-source software.

The soil adsorption coefficient (Koc) is an environmental fate parameter that is essential for environmental risk assessment. However, obtaining Koc is time-consuming and expensive, so it needs to be estimated efficiently in the early stages of a chemical's development. In this study, we developed a quantitative structure-property relationship (QSPR) model on the largest available Koc dataset, using physicochemical properties calculated with the OPEn structure-activity/property Relationship App (OPERA) and molecular descriptors calculated with Mordred. Specifically, we compared the accuracy of a model built with the light gradient boosted machine (LightGBM), a gradient boosting decision tree (GBDT) algorithm, with that of previous models. The results suggest that a QSPR model using molecular descriptors and physicochemical properties can produce highly accurate Koc values. Unlike previous studies, the combination of LightGBM, OPERA and Mordred enables the prediction of Koc for many chemicals with high accuracy, and the wide range of chemicals covered by OPERA and Mordred enables the analysis of diverse chemical compounds. We also report a method for tuning LightGBM; because LightGBM is fast, the parameter tuning required to obtain the best performance is practical. Our research is one of the few studies in environmental chemistry to use LightGBM. By using physicochemical properties as well as molecular descriptors, we developed Koc prediction models that are highly accurate compared with those of prior studies. In addition, our QSPR models may be useful for preliminary environmental risk assessment in the early stage of chemical development without incurring significant costs.
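As a rough illustration of the GBDT idea and the parameter tuning mentioned above, the following sketch fits a small gradient-boosted ensemble of decision stumps and selects the number of trees and the learning rate on a validation split. It is not the authors' code and does not use LightGBM, OPERA or Mordred: the booster is a simplified pure-Python stand-in, the two input columns are synthetic placeholders for calculated properties and descriptors, the target is a synthetic placeholder for log Koc, and all function names, grid values and data are assumptions made for this example.

```python
import random


def fit_stump(rows, residuals):
    """Return (feature, threshold, left_mean, right_mean) with the lowest squared error."""
    mean = sum(residuals) / len(residuals)
    # Fallback: a "split" at +inf sends every row left and predicts the mean.
    best = (float("inf"), 0, float("inf"), mean, mean)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[j] <= t]
            right = [r for row, r in zip(rows, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]


def fit_gbdt(rows, targets, n_trees, learning_rate):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(targets) / len(targets)
    preds = [base] * len(targets)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(targets, preds)]
        j, t, lm, rm = fit_stump(rows, residuals)
        stumps.append((j, t, lm, rm))
        preds = [p + learning_rate * (lm if row[j] <= t else rm)
                 for row, p in zip(rows, preds)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if row[j] <= t else rm for j, t, lm, rm in stumps)


def rmse(preds, targets):
    return (sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)) ** 0.5


def make_data(n):
    """Synthetic stand-in data: two placeholder features and a noisy linear target."""
    rows, targets = [], []
    for _ in range(n):
        x0, x1 = random.uniform(0, 5), random.uniform(0, 5)
        rows.append((x0, x1))
        targets.append(0.8 * x0 + 0.1 * x1 + random.gauss(0, 0.1))
    return rows, targets


random.seed(0)
train_x, train_y = make_data(60)
valid_x, valid_y = make_data(30)

# Hypothetical tuning grid: (number of trees, learning rate).
grid = [(n, lr) for n in (10, 40) for lr in (0.05, 0.3)]
scores = {}
for n_trees, lr in grid:
    model = fit_gbdt(train_x, train_y, n_trees, lr)
    scores[(n_trees, lr)] = rmse([predict(model, r) for r in valid_x], valid_y)

best_params = min(scores, key=scores.get)
best_rmse = scores[best_params]

# Baseline: always predict the training mean.
train_mean = sum(train_y) / len(train_y)
baseline_rmse = rmse([train_mean] * len(valid_y), valid_y)

print("validation RMSE per setting:", scores)
print("selected (n_trees, learning_rate):", best_params)
print("baseline RMSE:", baseline_rmse)
```

In practice, a library implementation such as LightGBM would replace the hand-written booster, and the tuning grid would cover that library's own parameters; the selection step on held-out data stays the same in spirit.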
