A Comparative Study on Feature Selection Methods for Drug Discovery

Feature selection is frequently used as a preprocessing step for machine learning, since removing irrelevant and redundant information often improves the performance of learning algorithms. This paper presents a comparative study of feature selection in drug discovery, with a focus on aggressive dimensionality reduction. Five methods were evaluated: information gain, mutual information, the χ²-test, odds ratio, and the GSS coefficient. Two well-known classification algorithms, Naïve Bayes and the Support Vector Machine (SVM), were used to classify the chemical compounds. The results showed that Naïve Bayes benefited significantly from feature selection, while SVM performed better when all features were used. In this experiment, information gain and the χ²-test were the most effective feature selection methods. Using information gain with a Naïve Bayes classifier, removing up to 96% of the features improved classification accuracy as measured by sensitivity. When information gain was used to select the features, SVM was much less sensitive to the reduction of the feature space: the feature set was reduced by 99% while losing only a few percentage points of sensitivity (from 58.7% to 52.5%) and specificity (from 98.4% to 97.2%). In contrast to information gain and the χ²-test, mutual information performed relatively poorly, owing to its bias toward rare features and its sensitivity to probability-estimation errors.
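As a sketch of the scoring scheme the study found most effective: information gain ranks each feature by how much knowing its value reduces the entropy of the class label, and only the top-ranked features are kept. The helper names (`entropy`, `information_gain`, `select_top_k`) and the toy binary data below are illustrative assumptions, not the paper's actual compound descriptors or code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_col, labels):
    """IG(C; F) = H(C) - H(C | F) for one discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_col):
        subset = [lab for f, lab in zip(feature_col, labels) if f == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

def select_top_k(X, y, k):
    """Rank feature columns by information gain; return the top-k indices."""
    scores = [(information_gain([row[j] for row in X], y), j)
              for j in range(len(X[0]))]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy data: feature 0 tracks the class perfectly, feature 3 is constant noise.
X = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 0]]
y = [1, 1, 0, 0, 1, 0]

print(select_top_k(X, y, 2))  # feature 0 ranks first
```

Aggressive reduction as described in the abstract then amounts to choosing a small k (e.g. keeping 1–4% of the columns) and training the classifier on the selected indices only.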
