Evolving Feature Selection

Data preprocessing is an indispensable step in effective data analysis. It prepares data for data mining and machine learning, which aim to turn data into business intelligence or knowledge. Feature selection is a preprocessing technique commonly applied to high-dimensional data: it studies how to select a subset of the attributes, or variables, used to construct models that describe the data. Its purposes include reducing dimensionality, removing irrelevant and redundant features, reducing the amount of data needed for learning, improving predictive accuracy, and making the constructed models easier to understand. This article considers feature-selection overfitting in small-sample classifier design; feature selection for unlabeled data; variable selection using ensemble methods; minimum redundancy-maximum relevance feature selection; and biological relevance in feature selection for microarray data.

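To make the minimum redundancy-maximum relevance (mRMR) idea concrete, the sketch below greedily picks features that score high on relevance (mutual information with the class label) while penalizing redundancy (average mutual information with features already chosen). It is a minimal illustration built on scikit-learn's mutual-information estimators; the function name mrmr_select, the synthetic data, and the relevance-minus-mean-redundancy score are assumptions made for the example, not the exact criterion or code from the works discussed here.

```python
# Illustrative mRMR-style greedy filter (sketch, not the article's procedure).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def mrmr_select(X, y, k):
    """Greedily pick k features, maximizing relevance minus mean redundancy."""
    n_features = X.shape[1]
    # Relevance: mutual information between each feature and the class label.
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]          # start with the most relevant feature
    candidates = set(range(n_features)) - set(selected)

    while len(selected) < k and candidates:
        best_score, best_f = -np.inf, None
        for f in candidates:
            # Redundancy: average MI between candidate f and already-selected features.
            redundancy = np.mean([
                mutual_info_regression(X[:, [s]], X[:, f], random_state=0)[0]
                for s in selected
            ])
            score = relevance[f] - redundancy
            if score > best_score:
                best_score, best_f = score, f
        selected.append(best_f)
        candidates.remove(best_f)
    return selected


if __name__ == "__main__":
    # Synthetic high-dimensional data: 200 samples, 50 features, 5 informative.
    X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                               n_redundant=10, random_state=0)
    print("Selected feature indices:", mrmr_select(X, y, k=5))
```

Filter criteria of this kind are attractive for microarray-scale data because they avoid repeatedly training a classifier inside the selection loop, which is where wrapper methods are most prone to overfitting when samples are scarce.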