Paper information - Comparison of Tree Based Ensemble Machine Learning Methods for Prediction of Rate Constant of Diels-Alder Reaction


Abstract: The design of molecular solvents has attracted significant interest because solvents have been shown to influence the rate at which a reaction generates its products. Understanding quantitatively how solvent structure affects the reaction rate requires models that capture this influence, together with that of the reactants' structure, on the rate constant. A quantitative structure-property relationship (QSPR) for the Diels-Alder reaction was recently developed using a hybrid genetic algorithm-decision tree (GA-DT) approach. However, the performance of that QSPR can still be improved. To this end, we assessed several tree-based ensemble machine learning regression methods for predicting the rate constant of the Diels-Alder reaction, modeled using connectivity indices. The assessed methods are random forest regression, gradient boosted regression trees, regularized random forest regression, and extremely randomized trees. The methods were evaluated in terms of their R² and Q² values. Extremely randomized trees provided the highest R² value, 0.91, while random forests provided the highest Q² value, 0.76.
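The abstract does not define the two evaluation metrics, so the following is only a minimal sketch under stated assumptions: R² is taken as the usual coefficient of determination on the fitted data, and Q² is assumed to be the leave-one-out cross-validated analogue (1 - PRESS / SS_tot), a common convention in QSPR work that may differ from the definition the authors used. The straight-line model and the data points are hypothetical stand-ins chosen to keep the example dependency-free; they are not the tree ensembles or the Diels-Alder data set from the paper.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Least-squares line y = a + b*x (stand-in model, NOT a tree ensemble)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def predict_line(model, x):
    """Evaluate the fitted line at a single descriptor value x."""
    a, b = model
    return a + b * x


def q_squared_loo(xs, ys, fit, predict):
    """Leave-one-out Q2 = 1 - PRESS / SS_tot (assumed definition)."""
    y_bar = mean(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without sample i, then predict the held-out sample.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        press += (ys[i] - predict(model, xs[i])) ** 2
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Hypothetical descriptor values and responses, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]

model = fit_line(xs, ys)
r2 = r_squared(ys, [predict_line(model, x) for x in xs])
q2 = q_squared_loo(xs, ys, fit_line, predict_line)
print(f"R2 = {r2:.3f}, Q2 (leave-one-out) = {q2:.3f}")
```

Because Q² scores each sample with a model that never saw it, it is typically lower than R² for the same method, which is consistent with the gap between the reported values of 0.91 and 0.76. Any regressor exposing comparable fit and predict functions could be passed to `q_squared_loo` in place of the line model.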

Mario R. Eden | Nishanth G. Chemmangattuvalappil | Shounak Datta | Vikrant A. Dev

[1] Mario R. Eden, et al. Hybrid genetic algorithm-decision tree approach for rate constant prediction using structures of reactants and solvent for Diels-Alder reaction, 2017, Comput. Chem. Eng.

[2] J. Friedman. Greedy function approximation: A gradient boosting machine, 2001.

[3] George C. Runger, et al. Gene selection with guided regularized random forest, 2012, Pattern Recognit.

[4] Marko Robnik-Sikonja, et al. An adaptation of Relief for attribute estimation in regression, 1997, ICML.

[5] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[6] John B O Mitchell, et al. Greedy and Linear Ensembles of Machine Learning Methods Outperform Single Approaches for QSPR Regression Problems, 2015, Molecular informatics.

[7] Pierre Geurts, et al. Extremely randomized trees, 2006, Machine Learning.