Training ACD/LogP with Experimental Data

The commercial physical property calculation software ACD/Labs Physico-Chemical Laboratory can accept experimental logP and pKa values, which it uses to "train" its model to better predict unrepresented structural classes. An attempt was made to produce a training set, called a "user database" by the software, from Merck in-house data, which could be used to train the ACD/LogP model to achieve better predictivity on molecules of interest to Merck researchers. A user database consisting of a randomly selected 10% subset of the available shake-flask logP measurements was constructed and used to predict both itself and the remaining 90% of the data. The training produced a modest increase in the accuracy of the model, with the R² value of the prediction on the test set improving from 0.316 to 0.527. Narrowing the selection to a project-based, targeted subset of the in-house data, in the hope of decreasing the diversity of the set, enhanced the coverage of the model but produced only a similar improvement in the R² value, from 0.350 to 0.537. Finally, training on a single, small representative of a structural class produced a sizable reduction in the bias of the prediction for a congeneric series of compounds, essentially confirming the original claim of the software developers. These improvements came at the cost of increased calculation time and machine load, which grew with the size of the training set.
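The evaluation protocol described above (a random 10% training subset, with the remaining 90% held out as a test set, scored by R² and bias) can be illustrated with a minimal sketch. This is not ACD/Labs code and does not call the ACD/LogP software. The function names are hypothetical, R² is assumed here to be the squared Pearson correlation between measured and predicted logP, and bias is assumed to be the mean signed error; the text does not specify either definition.

```python
import random


def split_train_test(records, train_fraction=0.10, seed=0):
    """Randomly assign train_fraction of the records to the training set.

    Returns (train, test); the two lists are disjoint and together
    contain every input record exactly once.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def r_squared(measured, predicted):
    """Squared Pearson correlation between measured and predicted values.

    Assumption: the text does not say which R^2 definition was used.
    """
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return (cov * cov) / (var_m * var_p)


def mean_signed_error(measured, predicted):
    """Average of (predicted - measured); a simple measure of bias."""
    return sum(p - m for m, p in zip(measured, predicted)) / len(measured)
```

In use, the training records would be loaded into the user database, the held-out records would be predicted before and after training, and both sets of predictions would be scored against the measured values with `r_squared` and `mean_signed_error`.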