Biclustering as Strategy for Improving Feature Selection in Consensus QSAR Modeling

Abstract. Feature selection for QSAR (Quantitative Structure-Activity Relationship) modeling is a challenging combinatorial optimization problem, owing to the high dimensionality of the chemical space associated with molecules and the complexity of the physicochemical properties usually studied in Cheminformatics. This commonly results in classification models with a large number of variables, which reduces the generalization ability and interpretability of these classifiers. In this paper, a novel strategy based on biclustering analysis is proposed to address this problem. The new method is applied as a post-processing step to the outputs of consensus feature selection methods. The approach was evaluated on datasets for predicting the ready biodegradation of chemical compounds. The preliminary results show that biclustering can help identify features with low class-discrimination power, which is useful for reducing the complexity of QSAR models without losing prediction accuracy.