Learning Boolean Concepts in the Presence of Many Irrelevant Features

Abstract: In many domains, an appropriate inductive bias is the MIN-FEATURES bias, which prefers consistent hypotheses definable over as few features as possible. This paper defines and studies this bias in Boolean domains. First, it is shown that any learning algorithm implementing the MIN-FEATURES bias requires Θ((ln(1/δ) + [2^p + p ln n]) / ε) training examples to guarantee PAC-learning of a concept having p relevant features out of n available features. This bound is only logarithmic in the number of irrelevant features. To implement the MIN-FEATURES bias, the paper presents five algorithms that identify a subset of features sufficient to construct a hypothesis consistent with the training examples. FOCUS-1 is a straightforward algorithm that returns a minimal and sufficient subset of features in quasi-polynomial time. FOCUS-2 performs the same task but is empirically shown to be substantially faster than FOCUS-1. Finally, the Simple-Greedy, Mutual-Information-Greedy, and Weighted-Greedy algorithms are three greedy heuristics that trade optimality for computational efficiency. Experimental studies are presented that compare these exact and approximate algorithms to two well-known algorithms, ID3 and FRINGE, in learning situations where many irrelevant features are present. These experiments show that, contrary to expectations, the ID3 and FRINGE algorithms do not implement good approximations of MIN-FEATURES. The sample complexity and generalization performance of the FOCUS algorithms are substantially better than those of either ID3 or FRINGE on learning problems where the MIN-FEATURES bias is appropriate. These experiments also show that, among our three heuristics, the Weighted-Greedy algorithm provides an excellent approximation to the FOCUS algorithms.
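
To make the notion of a "sufficient" feature subset concrete, the following minimal Python sketch shows the kind of breadth-first search over feature subsets that a FOCUS-1-style algorithm performs. It is our own illustration rather than the paper's implementation; the function names (`sufficient`, `min_features`) and the toy XOR dataset are hypothetical.

```python
from itertools import combinations

def sufficient(examples, subset):
    """A subset of feature indices is sufficient iff no two examples
    with different labels agree on every feature in the subset."""
    seen = {}
    for features, label in examples:
        key = tuple(features[i] for i in subset)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

def min_features(examples, n_features):
    """Enumerate subsets in order of increasing size; the first
    sufficient subset found therefore has minimal size."""
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if sufficient(examples, subset):
                return subset
    return None  # only reached if the examples are inconsistently labeled

# Toy data: the label is x0 XOR x1; features x2..x4 are irrelevant.
examples = [
    ((0, 0, 1, 0, 0), 0),
    ((0, 1, 1, 1, 0), 1),
    ((1, 0, 0, 0, 1), 1),
    ((1, 1, 0, 1, 1), 0),
]
print(min_features(examples, 5))   # -> (0, 1)
```

This exhaustive enumeration examines on the order of n^p subsets before it can stop when the minimal sufficient subset has size p, which is what motivates the faster FOCUS-2 search and the greedy heuristics described in the abstract.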
