Model Selection Based on Structural Similarity: Method Description and Application to Water Solubility Prediction

A method is introduced that selects, for a given property and compound, the presumably best-performing scheme among several prediction methods, based on the prediction errors these methods produce for structurally similar compounds. The similar compounds are identified through analysis of atom-centered fragments (ACFs) using a k-nearest-neighbor procedure in two-dimensional structural space. The approach is illustrated with seven estimation methods for the water solubility of organic compounds and a reference set of 1876 compounds with validated experimental values. The discussion includes a comparison with similarity-based error correction as an alternative approach to improving the performance of prediction methods, and an extension that enables an ad hoc specification of the application domain.
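The selection step described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: each compound is represented as a plain set of fragment labels standing in for ACFs, similarity is measured with the Tanimoto (Jaccard) coefficient, and the per-method score is the mean absolute error over the k nearest neighbors. The names `tanimoto`, `select_method`, `method_A`, `method_B`, and all numerical values are hypothetical and serve only to show the idea.

```python
from typing import Dict, FrozenSet, List, Tuple

# A reference entry pairs a compound's fragment set with the absolute
# prediction error of each method for that compound (hypothetical layout).
ReferenceEntry = Tuple[FrozenSet[str], Dict[str, float]]


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity of two fragment sets (assumed measure)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def select_method(
    query: FrozenSet[str],
    reference: List[ReferenceEntry],
    k: int = 3,
) -> Tuple[str, Dict[str, float]]:
    """Pick the method with the lowest mean absolute error among the
    k reference compounds most similar to the query compound."""
    if not reference:
        raise ValueError("reference set must not be empty")
    # Rank reference compounds by decreasing similarity to the query.
    ranked = sorted(reference, key=lambda e: tanimoto(query, e[0]), reverse=True)
    neighbors = ranked[:k]
    # Average each method's error over the selected neighbors.
    methods = neighbors[0][1].keys()
    mean_error = {
        m: sum(errors[m] for _, errors in neighbors) / len(neighbors)
        for m in methods
    }
    best = min(mean_error, key=mean_error.get)
    return best, mean_error


# Toy data, invented for illustration only.
reference_set: List[ReferenceEntry] = [
    (frozenset({"C-ar", "O-H", "C-C"}), {"method_A": 0.2, "method_B": 0.9}),
    (frozenset({"C-ar", "O-H"}), {"method_A": 0.4, "method_B": 0.7}),
    (frozenset({"N-H", "C=O"}), {"method_A": 1.5, "method_B": 0.1}),
]
query_compound = frozenset({"C-ar", "O-H", "C-Cl"})

best_method, mean_errors = select_method(query_compound, reference_set, k=2)
print(best_method, mean_errors)
```

In this toy case the two nearest neighbors share aromatic and hydroxyl fragments with the query, so the method that performs better on those neighbors is chosen even though the other method has the lower error on the dissimilar third compound. An actual implementation would replace the fragment sets and the similarity measure with the ACF definition used in the paper.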