Comparison of Tree-Based Ensemble Machine Learning Methods for Prediction of the Rate Constant of the Diels-Alder Reaction

Abstract The design of molecular solvents has attracted significant interest because solvents have been shown to influence the rate at which a reaction generates its product. To understand quantitatively how solvent structure influences the reaction rate, models are needed that capture this influence, together with that of the reactants’ structure, on the rate constant. A quantitative structure-property relationship (QSPR) for the Diels-Alder reaction was recently developed using a hybrid genetic algorithm-decision tree (GA-DT) approach, but its performance leaves room for improvement. To improve on that QSPR, we assessed several tree-based ensemble machine learning regression methods for predicting the rate constant (modeled using connectivity indices) of the Diels-Alder reaction. The methods assessed are random forest regression, gradient boosted regression trees, regularized random forest regression, and extremely randomized trees. Performance was evaluated in terms of the R² and Q² values. Extremely randomized trees gave the highest R² value (0.91), while random forests gave the highest Q² value (0.76).
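As a rough illustration of the two evaluation metrics, the sketch below computes R² on the fitted data and a cross-validated Q² for a generic model. It rests on several assumptions that are not taken from the abstract: Q² is computed here by leave-one-out cross-validation (the abstract does not state the scheme used), the model is a simple least-squares line standing in for a tree ensemble, and the data are made-up toy values. The names `r_squared`, `q_squared_loo`, and `fit_line` are hypothetical.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def q_squared_loo(xs, ys, fit):
    """Leave-one-out Q2 (assumed scheme): each point is predicted by a
    model fitted on all the other points, then scored like R2."""
    held_out_preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        held_out_preds.append(model(xs[i]))
    return r_squared(ys, held_out_preds)


def fit_line(xs, ys):
    """Stand-in model (NOT a tree ensemble): least-squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


# Made-up toy data: one descriptor value per sample and a response value.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]

model = fit_line(xs, ys)
r2 = r_squared(ys, [model(x) for x in xs])  # fit quality on the training data
q2 = q_squared_loo(xs, ys, fit_line)        # predictive quality on held-out points

print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

Because Q² scores only predictions for points the model has not seen, it typically falls below R². That is why one method can give the best R² while another gives the best Q², as reported above for extremely randomized trees and random forests.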