Impact of Noise and Data Sampling on Stability of Feature Selection

High dimensionality is one of the major problems in data mining, occurring when a dataset contains a very large number of attributes. One common technique for alleviating high dimensionality is feature selection: the process of selecting the most relevant attributes and removing irrelevant and redundant ones. Much research has evaluated classifier performance before and after feature selection, but little work has examined how sensitive the selected feature subsets are to changes (additions/deletions) in the dataset. In this study we evaluate the robustness of six commonly used feature selection techniques, investigating the impact of data sampling and class noise on the stability of feature selection. All experiments are carried out with six commonly used feature rankers on four groups of datasets from the biology domain. We employ three sampling techniques and inject artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, Gain Ratio shows the least stability on average. Additional tests using the feature rankers to build classification models show that a feature ranker's stability is not an indicator of its classification performance.
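The stability evaluation described above can be made concrete. A minimal sketch follows, using the Kuncheva consistency index, a standard measure of agreement between equal-size feature subsets in this literature; it is offered as an illustration of the general approach, not as the exact measure used in this study. Stability is estimated by selecting features on several perturbed (e.g. resampled or noise-injected) copies of a dataset and averaging the pairwise agreement of the resulting subsets.

```python
from itertools import combinations

def kuncheva_index(a, b, n_features):
    """Kuncheva consistency index between two equal-size feature subsets.

    Ranges from -1 to 1; 1 means identical subsets, values near 0 indicate
    agreement no better than chance given the subset size.
    """
    a, b = set(a), set(b)
    k = len(a)
    assert len(b) == k and 0 < k < n_features, "subsets must be equal-size and proper"
    r = len(a & b)  # number of shared features
    return (r * n_features - k * k) / (k * (n_features - k))

def average_stability(subsets, n_features):
    """Mean pairwise consistency over subsets selected from perturbed datasets."""
    pairs = list(combinations(subsets, 2))
    return sum(kuncheva_index(a, b, n_features) for a, b in pairs) / len(pairs)

# Example: three top-3 subsets selected from perturbed copies of a
# hypothetical 10-feature dataset.
subsets = [[0, 1, 2], [0, 1, 3], [0, 2, 3]]
print(average_stability(subsets, n_features=10))
```

A ranker whose average index stays close to 1 under sampling and class noise would be considered stable; an index drifting toward 0 signals that the selected subsets vary largely by chance.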
