A comparative evaluation of feature ranking methods for high dimensional bioinformatics data

Feature selection is an important component of data mining analysis with high-dimensional data. Reducing the number of features in a dataset can have several benefits, such as eliminating redundant or irrelevant features, decreasing development time, and improving the performance of classification models. In this work, four filter-based feature selection techniques are compared using a wide variety of bioinformatics datasets. The first three filters, χ², Relief-F, and Information Gain, are widely used techniques that are well known to researchers and practitioners. The fourth filter, recently proposed by our research group and denoted TBFS-AUC (Threshold-Based Feature Selection with the AUC metric), is compared to these three commonly used techniques using three different classification performance metrics. The empirical results demonstrate the strong performance of our technique.
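To make the idea of a filter-based ranker concrete, the following is a minimal sketch of ranking features by a per-feature AUC score. It is not the TBFS-AUC procedure from this work, whose details are not given in the abstract. It assumes binary labels (0/1), treats each feature's raw values as classifier scores, and computes AUC with the pairwise (Mann-Whitney) formulation. The function names `auc_single_feature` and `rank_features_by_auc`, and the choice to score a feature by `max(auc, 1 - auc)` so that either direction of association counts, are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def auc_single_feature(values: Sequence[float], labels: Sequence[int]) -> float:
    """AUC obtained when one feature's values are used directly as scores.

    Assumption: labels are binary, with 1 = positive and 0 = negative.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    # Pairwise formulation: fraction of (positive, negative) pairs in which
    # the positive instance has the larger value; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    auc = wins / (len(pos) * len(neg))

    # Assumption: a feature is equally useful if low values indicate the
    # positive class, so take the better of the two directions.
    return max(auc, 1.0 - auc)


def rank_features_by_auc(
    X: Sequence[Sequence[float]], y: Sequence[int]
) -> List[Tuple[int, float]]:
    """Return (feature_index, score) pairs sorted from best to worst."""
    n_features = len(X[0])
    scores = [
        auc_single_feature([row[j] for row in X], y) for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return [(j, scores[j]) for j in order]


if __name__ == "__main__":
    # Toy data: 4 instances, 3 features (hypothetical values).
    X = [
        [0.1, 5.0, 1.0],
        [0.2, 3.0, 1.0],
        [0.8, 4.0, 1.0],
        [0.9, 2.0, 1.0],
    ]
    y = [0, 0, 1, 1]
    for index, score in rank_features_by_auc(X, y):
        print(f"feature {index}: score {score:.2f}")
```

A filter of this kind scores each feature independently of any downstream classifier, which is what makes it cheap enough for datasets with thousands of features; the top-ranked subset would then be passed to the learner of choice.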
