Threshold-based feature selection techniques for high-dimensional bioinformatics data

Bioinformatics analyses often require feature selection to handle datasets with very high dimensionality. We propose 11 new threshold-based feature selection techniques and compare their performance with that of six standard filter-based feature selection procedures. Unlike other comparisons of feature selection techniques, we directly compare the feature rankings produced by each technique using Kendall’s Tau rank correlation, showing that the newly proposed techniques behave substantially differently from the standard filter-based feature selection methods. Our experiments consider 17 different bioinformatics datasets, and the similarities of the feature selection techniques are analyzed using the Frobenius norm. The feature selection techniques are also compared by using Naive Bayes and Support Vector Machine algorithms to learn from the training datasets. The experimental results show that the new procedures perform very well compared to the standard filters and are therefore useful feature selection methodologies for the analysis of bioinformatics data.
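
As a rough illustration of the ranking comparison described above, the following sketch computes Kendall's Tau between the feature scores produced by different rankers and then measures how far apart two correlation matrices are using the Frobenius norm. It is not the paper's implementation: the function names, the tie-free tau-a variant, the toy scores, and the choice to apply the Frobenius norm to the difference of two pairwise-tau matrices are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same features.

    Assumption: scores contain no ties; tied pairs count as neither
    concordant nor discordant here.
    """
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 features")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def tau_matrix(score_lists):
    """Pairwise Kendall's tau between every pair of rankers."""
    k = len(score_lists)
    return [[kendall_tau(score_lists[r], score_lists[c]) for c in range(k)]
            for r in range(k)]


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return sqrt(sum(v * v for row in matrix for v in row))


def frobenius_distance(m1, m2):
    """Frobenius norm of the element-wise difference of two matrices."""
    diff = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
    return frobenius_norm(diff)


# Hypothetical scores for five features from three rankers (toy data).
ranker_a = [0.9, 0.7, 0.5, 0.3, 0.1]
ranker_b = [0.8, 0.9, 0.4, 0.2, 0.1]   # top two features swapped
ranker_c = [0.1, 0.3, 0.5, 0.7, 0.9]   # fully reversed order

# One tau matrix per (hypothetical) dataset; here the second dataset
# simply uses a different set of rankings for the same three rankers.
dataset_1 = tau_matrix([ranker_a, ranker_b, ranker_c])
dataset_2 = tau_matrix([ranker_a, ranker_a, ranker_c])

print(dataset_1)
print(frobenius_distance(dataset_1, dataset_2))
```

A tau near 1 means two techniques rank features almost identically, a tau near -1 means they rank them in opposite order, and a small Frobenius distance between two tau matrices suggests the techniques relate to one another similarly on both datasets.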
