First Order Statistics Based Feature Selection: A Diverse and Powerful Family of Feature Selection Techniques

Dimensionality reduction has become a required step when working with bioinformatics datasets. Feature selection in particular is known not only to reduce computation time but also to improve experimental results by removing redundant and irrelevant features (genes) from subsequent analysis. Univariate feature selection techniques are especially well suited to the high dimensionality inherent in bioinformatics datasets such as DNA microarray data, because of their intuitive output (a ranked list of features or genes) and their low computational cost relative to other techniques. This paper presents seven univariate feature selection techniques and collects them into a single family, First Order Statistics (FOS) based feature selection. All seven rank features using first order statistical measures such as the mean and standard deviation; this is the first work to relate them to one another and to compare their performance directly. To examine the properties of these seven techniques, we performed a series of similarity and classification experiments on eleven DNA microarray datasets. Our results show that, in general, each technique produces feature subsets that differ substantially from those of the other family members. In terms of classification, however, the techniques (with one exception) yield good and mutually similar results. We therefore recommend the Signal-to-Noise and SAM rankers for the best classification results, and we recommend avoiding Fold Change Ratio, which is consistently the worst performer of the seven.
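As a concrete illustration of how a first order statistics based ranker operates, the sketch below computes the classic Signal-to-Noise ratio for each gene from per-class means and standard deviations. It is a minimal sketch only: the function name, the epsilon guard against zero-variance genes, and the use of the standard Golub-style formulation are assumptions for illustration, and the exact definitions used for the seven rankers in the paper may differ.

```python
import numpy as np

def signal_to_noise_scores(X, y):
    """Score features by the two-class signal-to-noise ratio.

    X : (n_samples, n_features) expression matrix
    y : binary class labels (0 or 1)
    Returns |S2N| per feature; larger values indicate more discriminative genes.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    # First order statistics: per-class mean and standard deviation.
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0, ddof=1), neg.std(axis=0, ddof=1)
    # Signal-to-noise ratio: difference of class means divided by the
    # sum of class standard deviations (small epsilon avoids division by zero).
    eps = 1e-12
    return np.abs(mu_p - mu_n) / (sd_p + sd_n + eps)

# Example usage (hypothetical data): rank all genes and keep the top 50.
# scores = signal_to_noise_scores(X_train, y_train)
# top_genes = np.argsort(scores)[::-1][:50]
```

The other FOS rankers follow the same pattern, differing only in how the per-class means and standard deviations are combined into a score.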
