
Consensus QSAR modeling and domain of applicability: an integrated approach

Consensus modelling is a term used in many scientific disciplines for methods by which a group of individuals can come to an agreement. The QSAR community has used this term for methodologies that aggregate the predictions of several QSAR models to arrive at a single prediction. Literature reports on the validity of consensus modelling approaches are conflicting. Many publications report advantages of consensus models: more accurate QSAR models, greater confidence in predictions, regulatory significance, and improved robustness. Several other references, however, have criticized consensus modelling for its complexity, its lack of portability, transparency and mechanistic interpretation, and for not showing significant improvements over single QSAR models. Many consensus QSAR models that have appeared in the literature use a naive approach that averages all the individual model predictions. More sophisticated methods consider only the models whose domain of applicability contains the compound to be predicted. Alternative consensus modelling methods treat the individual model predictions as attributes in an overall multiple linear regression model, where the model coefficients play the role of weights. In this way, the contribution of each individual model to the overall prediction is weighted. In this work, we present a new approach that integrates three basic components of building a QSAR model: variable selection, regression/classification, and domain of applicability. In particular, the proposed method requires a single wrapper variable selection method, a single method for defining the domain of applicability, and several regression/classification algorithms, depending on the type of the problem. The wrapper variable selection method is applied separately to each QSAR algorithm and produces a QSAR model that uses a certain subset of features.
In general, the various QSAR models that are generated select different sets of features. Thus, a different domain of applicability is defined for each QSAR model by applying the domain of applicability method to the respective set of descriptors. For a new compound, the proposed method first calculates the individual QSAR models' predictions. It then checks, for each model, whether the compound falls within that model's domain of applicability. If it does not, the model is excluded from the calculation of the aggregated prediction. If it does, a weight is produced that depends on the location of the compound inside the domain of applicability: the weight decreases as the compound approaches the boundary of the domain. The weights are finally normalized so that they sum to 1, and the normalized weights are used to produce the final aggregated prediction. Results from applying the method to QSAR problems illustrate its advantages and limitations.
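The prediction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the domain of applicability method or the weighting rule, so the sketch assumes a centroid-distance domain (radius equal to the largest training-set distance from the centroid), a weight that falls linearly from 1 at the centroid to 0 at the boundary, and a weighted average as the aggregation. The names `ad_weight` and `consensus_predict` are hypothetical.

```python
import numpy as np

def ad_weight(x_sub, train_sub):
    """Assumed rule: 1 at the training centroid, falling linearly to 0 at the boundary."""
    centroid = train_sub.mean(axis=0)
    # Assumed domain boundary: the largest centroid distance seen in training.
    radius = np.linalg.norm(train_sub - centroid, axis=1).max()
    d = np.linalg.norm(x_sub - centroid)
    if d >= radius:
        return 0.0  # on or beyond the boundary: treated as outside the domain
    return 1.0 - d / radius

def consensus_predict(x, models):
    """models: list of (predict_fn, feature_idx, train_sub) tuples, one per QSAR model.

    Each model sees only its own selected descriptors, so each has its own domain.
    """
    preds, weights = [], []
    for predict, idx, train_sub in models:
        w = ad_weight(x[idx], train_sub)
        if w > 0.0:  # skip models whose domain does not contain the compound
            preds.append(predict(x[idx]))
            weights.append(w)
    if not weights:
        return None  # compound lies outside every model's domain
    weights = np.array(weights) / np.sum(weights)  # normalize so weights sum to 1
    return float(np.dot(weights, preds))

# Toy usage with two made-up "models" that return constant predictions.
train_a = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
train_b = np.array([[0.0], [4.0]])
models = [
    (lambda z: 3.0, [0, 1], train_a),  # model A uses both descriptors
    (lambda z: 6.0, [0], train_b),     # model B uses only the first descriptor
]
x_new = np.array([1.0, 1.0])
# Raw weights are 1.0 (A) and 0.5 (B), i.e. 2/3 and 1/3 after normalization,
# so the aggregated prediction is approximately 4.0.
result = consensus_predict(x_new, models)
```

Returning `None` when no domain contains the compound is one possible design choice; the abstract does not say how that case is handled.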

Pantelis Sopasakis | Haralambos Sarimveis | Georgia Melagraki | Antreas Afantitis
