Feature selection method based on fuzzy entropy for regression in QSAR studies

Feature selection and feature extraction are among the most important steps in classification and regression systems. Feature selection is commonly used to reduce the dimensionality of datasets with tens or hundreds of thousands of features, which would otherwise be impractical to process. A recent example is a quantitative structure–activity relationship (QSAR) dataset with 1226 features. A major problem in QSAR is the high dimensionality of the feature space; feature selection is therefore the most important step in this study. This paper presents a novel feature selection algorithm based on fuzzy entropy. The performance of the proposed algorithm is compared with that of a genetic algorithm and of stepwise regression. Using multiple linear regression models, the root mean square errors of prediction for the training set and the test set were 0.3433 and 0.3591 for the entropy-based method, 0.5500 and 0.4326 for the genetic algorithm, and 0.6373 and 0.6672 for stepwise regression, respectively.
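The evaluation protocol described above (fit a multiple linear regression on a selected descriptor subset, then report the root mean square error on the training and test sets) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the data are synthetic, the two candidate subsets are hard-coded stand-ins for the output of any selection method, and all function names (`solve_linear`, `fit_mlr`, `predict`, `rmse`, `evaluate`) are hypothetical.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(rows, y, features):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + [row[j] for j in features] for row in rows]
    p = len(features) + 1
    xtx = [[sum(d[i] * d[k] for d in design) for k in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve_linear(xtx, xty)


def predict(rows, features, coef):
    """Apply fitted coefficients to the chosen feature columns."""
    return [coef[0] + sum(c * row[j] for c, j in zip(coef[1:], features))
            for row in rows]


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Synthetic stand-in for a descriptor matrix (assumption): 60 samples,
# 6 descriptors, with the response depending only on descriptors 0 and 3.
random.seed(0)
rows = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(60)]
y = [2.0 * r[0] - 1.5 * r[3] + random.gauss(0, 0.1) for r in rows]
train_rows, test_rows = rows[:40], rows[40:]
train_y, test_y = y[:40], y[40:]


def evaluate(features):
    """Return (training RMSE, test RMSE) for an MLR model on `features`."""
    coef = fit_mlr(train_rows, train_y, features)
    return (rmse(train_y, predict(train_rows, features, coef)),
            rmse(test_y, predict(test_rows, features, coef)))


# Two hard-coded subsets standing in for the output of two selection methods.
subsets = {"informative": [0, 3], "uninformative": [1, 2]}
results = {name: evaluate(features) for name, features in subsets.items()}
for name, (train_err, test_err) in results.items():
    print(f"{name}: train RMSE = {train_err:.4f}, test RMSE = {test_err:.4f}")
```

A subset that captures the descriptors actually driving the response should give low errors on both sets, which is the basis on which the three selection methods are compared here.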
