Feature Selection with High-Dimensional Imbalanced Data

Feature selection is an important topic in data mining, especially for high dimensional datasets. Filtering techniques in particular have received much attention, but detailed comparisons of their performance is lacking. This work considers three filters using classifier performance metrics and six commonly-used filters. All nine filtering techniques are compared and contrasted using five different microarray expression datasets. In addition, given that these datasets exhibit an imbalance between the number of positive and negative examples, the utilization of sampling techniques in the context of feature selection is examined.

[1]  Pedro Larrañaga,et al.  Filter versus wrapper gene selection approaches in DNA microarray domains , 2004, Artif. Intell. Medicine.

[2]  J. Stuart Aitken,et al.  Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes , 2005, BMC Bioinformatics.

[3]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[4]  Fabian Model,et al.  Feature selection for DNA methylation based cancer classification , 2001, ISMB.

[5]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[6]  Keun Ho Ryu,et al.  Classification of Enzyme Function from Protein Sequence based on Feature Representation , 2007, 2007 IEEE 7th International Symposium on BioInformatics and BioEngineering.

[7]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[8]  Nitesh V. Chawla,et al.  Classification and knowledge discovery in protein databases , 2004, J. Biomed. Informatics.

[9]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..

[10]  Tom Fawcett,et al.  Robust Classification for Imprecise Environments , 2000, Machine Learning.

[11]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[12]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Hong Tang,et al.  Data mining techniques for cancer detection using serum proteomic profiling , 2004, Artif. Intell. Medicine.

[14]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[15]  Xue-wen Chen,et al.  FAST: a roc-based feature selection metric for small samples and imbalanced data classification problems , 2008, KDD.

[16]  Larry A. Rendell,et al.  The Feature Selection Problem: Traditional Methods and a New Algorithm , 1992, AAAI.

[17]  Neil Davey,et al.  Using Feature Selection Filtering Methods for Binding Site Predictions , 2006, 2006 5th IEEE International Conference on Cognitive Informatics.

[18]  Yang Wang,et al.  Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data , 2005, IEEE ACM Trans. Comput. Biol. Bioinform..

[19]  Taghi M. Khoshgoftaar,et al.  Experimental perspectives on learning from imbalanced data , 2007, ICML '07.

[20]  Shyamal D. Peddada,et al.  Gene Selection and Clustering for Time-course and Dose-response Microarray Experiments Using Order-restricted Inference , 2003, Bioinform..

[21]  Gene H. Golub,et al.  Matrix computations , 1983 .

[22]  LarrañagaPedro,et al.  A review of feature selection techniques in bioinformatics , 2007 .

[23]  Taghi M. Khoshgoftaar,et al.  Hybrid sampling for imbalanced data , 2008, 2008 IEEE International Conference on Information Reuse and Integration.

[24]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[25]  Rosa Maria Valdovinos,et al.  The Imbalanced Training Sample Problem: Under or over Sampling? , 2004, SSPR/SPR.

[26]  M. F. Fuller,et al.  Practical Nonparametric Statistics; Nonparametric Statistical Inference , 1973 .

[27]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .