A Heuristic Method for Selecting Support Features from Large Datasets

Set covering (SC) is particularly well suited to feature selection in machine learning because it selects support features for the data under analysis based on both the individual and the collective roles of the candidate features. However, SC-based feature selection requires complete pairwise comparison of the members of the different classes in a dataset, which makes the otherwise attractive SC principle impractical for selecting support features from large datasets. This paper introduces the notion of implicit SC-based feature selection and presents a procedure that is equivalent to the standard SC-based feature selection procedure in supervised learning but requires multiple orders of magnitude less memory. Experiments on six large machine learning datasets demonstrate the usefulness of the proposed implicit SC-based feature selection scheme in large-scale supervised data analysis.
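To make the memory issue concrete, the following is a minimal sketch of the standard (explicit) SC-based selection that the abstract contrasts against, not the implicit procedure proposed in the paper. It assumes binary features and two classes, and it uses the well-known greedy set-covering heuristic as a stand-in solver; the function names `build_sc_rows` and `greedy_cover` are hypothetical and chosen only for illustration.

```python
from itertools import product


def build_sc_rows(pos, neg):
    """Build one set-covering row per (positive, negative) pair.

    Each row is the set of feature indices on which the two examples differ.
    The number of rows is len(pos) * len(neg), which is what becomes
    prohibitive in memory for large datasets.
    """
    return [
        {j for j, (a, b) in enumerate(zip(p, q)) if a != b}
        for p, q in product(pos, neg)
    ]


def greedy_cover(rows, n_features):
    """Greedy heuristic: repeatedly pick the feature that separates
    the largest number of still-unseparated pairs."""
    if any(not row for row in rows):
        raise ValueError("some pair of opposite-class examples is identical")
    uncovered = list(rows)
    selected = []
    while uncovered:
        best = max(range(n_features),
                   key=lambda j: sum(j in row for row in uncovered))
        selected.append(best)
        uncovered = [row for row in uncovered if best not in row]
    return selected


# Toy data (illustrative only): two positive and two negative examples.
pos = [(1, 0, 1), (1, 1, 0)]
neg = [(0, 0, 1), (0, 1, 0)]
rows = build_sc_rows(pos, neg)
support = greedy_cover(rows, n_features=3)
print(len(rows), support)
```

In this toy case the explicit formulation already stores one row per cross-class pair; with classes of tens of thousands of examples each, the row count reaches the hundreds of millions, which is the cost an implicit formulation aims to avoid.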
