Combinatorial feature selection problems

Motivated by frequently recurring themes in information retrieval and related disciplines, we define a genre of problems called combinatorial feature selection problems. Given a set S of multidimensional objects, the goal is to select a subset K of relevant dimensions (or features) such that some desired property Π holds for the set S restricted to K. Depending on Π, the goal could be to either maximize or minimize the size of the subset K. Several well-studied feature selection problems can be cast in this form. We study the problems in this class derived from several natural and interesting properties Π, including variants of the classical p-center problem as well as problems akin to determining the VC-dimension of a set system. Our main contribution is a theoretical framework for studying combinatorial feature selection, providing (in most cases essentially tight) approximation algorithms and hardness results for several instances of these problems.
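To make the problem statement concrete, the following is a minimal brute-force sketch of the minimization variant. It is an illustration only and is not one of the approximation algorithms from the paper. The function names (`restrict`, `all_distinct`, `smallest_feature_subset`) and the sample property "all restricted objects remain pairwise distinct" are assumptions chosen for this example; the abstract does not specify them.

```python
from itertools import combinations


def restrict(objects, dims):
    """Project each object in S onto the chosen dimensions K."""
    return [tuple(obj[d] for d in dims) for obj in objects]


def all_distinct(projected):
    """Hypothetical property: no two restricted objects coincide."""
    return len(set(projected)) == len(projected)


def smallest_feature_subset(objects, prop):
    """Exhaustively search for a minimum-size K with prop(S restricted to K).

    Runs in time exponential in the number of dimensions, so it is only
    usable on tiny instances.
    """
    n_dims = len(objects[0])
    for size in range(n_dims + 1):
        for dims in combinations(range(n_dims), size):
            if prop(restrict(objects, dims)):
                return dims
    return None  # the property fails even with every dimension kept


# Four objects over three dimensions; the last dimension is constant.
S = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
K = smallest_feature_subset(S, all_distinct)
print(K)
```

Here the search settles on the first two dimensions, since no single dimension separates all four objects and the constant third dimension contributes nothing. The exhaustive enumeration is the naive baseline that approximation algorithms and hardness results for such problems are measured against.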
