Feature Selection in Web Applications Using ROC Inflections and Power Set Pruning

A basic problem in information processing is selecting enough features to represent events accurately for classification, while simultaneously minimizing the storage and processing devoted to irrelevant or marginally important features. To address this problem, feature selection procedures search the feature power set for the smallest subset that meets performance requirements. Major restrictions of existing procedures are that they typically assume, explicitly or implicitly, a fixed operating point, and that they make limited use of the statistical structure of the feature power set. We present a method that combines the Neyman-Pearson design procedure on finite data with the directed-set structure of the Receiver Operating Curves on the feature subsets to determine the maximal size of the feature subsets that can be ranked in a given problem. The search can then be restricted to the smaller subsets, yielding significant reductions in computational complexity. Optimizing the overall Receiver Operating Curve also allows end users to select different operating points and cost functions to optimize. The algorithm additionally produces a natural Boolean representation of the minimal feature combinations that best describe the data near a given operating point. These representations are especially appropriate for data described by common text-related features used on the Web, such as thresholded TFIDF values. We show how to use these results to perform automatic Boolean query modification generation for distributed databases, such as niche meta search engines.
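The core idea of the abstract can be illustrated with a minimal sketch: rank feature subsets by the quality of their ROC curve (summarized here by AUC) while pruning the power-set search to subsets of at most a given size. This is not the authors' algorithm; the subset score (a plain sum of feature values standing in for a trained discriminant), the AUC summary, and the `best_subsets` and `max_size` names are all illustrative assumptions.

```python
from itertools import combinations


def roc_auc(scores, labels):
    # AUC via the rank-sum (Mann-Whitney) formulation: the fraction of
    # positive/negative pairs ranked correctly, counting ties as 1/2.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_subsets(X, y, max_size):
    """Rank all feature subsets of size <= max_size by AUC.

    Restricting the search to small subsets is the pruning step the
    abstract motivates: instead of all 2^n subsets, only those up to
    max_size features are scored.  The per-subset score here is the
    sum of the selected feature values -- an illustrative stand-in
    for a properly trained classifier's discriminant.
    """
    n_features = len(X[0])
    results = []
    for size in range(1, max_size + 1):
        for subset in combinations(range(n_features), size):
            scores = [sum(row[i] for i in subset) for row in X]
            results.append((roc_auc(scores, y), subset))
    results.sort(reverse=True)
    return results
```

On a toy data set where feature 0 separates the classes and feature 1 is noisy, the ranking surfaces the subsets containing feature 0 first, and the caller can read off the smallest subset reaching the required AUC.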
