Toward Provably Correct Feature Selection in Arbitrary Domains

In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions are unclear. We address this by introducing the Markov Boundary Theorem, which precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary capable of capturing complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one, as well as a more practical randomized anytime version, and show that they perform well on artificial, benchmark, and real-world data sets. Throughout the paper we make only minimal assumptions: a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability.
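For concreteness, the standard definition underlying the abstract can be stated as follows; this is a sketch, and the symbols V (the full feature set) and T (the target variable) are our notational choices, not taken from the paper.

```latex
% V: the set of all features; T: the target variable (notation ours).
% A set M \subseteq V \setminus \{T\} is a Markov blanket of T when
% conditioning on M renders T independent of every remaining feature:
\[
  P\bigl(T \mid \mathbf{M},\, \mathbf{V} \setminus (\mathbf{M} \cup \{T\})\bigr)
  = P\bigl(T \mid \mathbf{M}\bigr),
\]
% and M is a Markov boundary when, in addition, it is minimal:
% no proper subset of M is itself a Markov blanket of T.
```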

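The abstract does not specify the paper's exact or randomized algorithms. As a point of reference only, the sketch below illustrates the classic grow-shrink style of Markov blanket discovery built on a conditional-independence oracle; the oracle `ci_test` is a hypothetical function supplied by the caller, and the routine is an illustration of the general approach, not the paper's method.

```python
def markov_blanket(target, features, ci_test):
    """Grow-shrink sketch of Markov blanket discovery around `target`.

    `features` is the set of candidate variables (excluding `target`);
    `ci_test(x, t, z)` is a hypothetical oracle that returns True iff
    variable x is conditionally independent of t given the set z.
    """
    blanket = set()

    # Grow phase: repeatedly admit any feature that is still
    # dependent on the target given the current blanket.
    changed = True
    while changed:
        changed = False
        for x in features - blanket:
            if not ci_test(x, target, frozenset(blanket)):
                blanket.add(x)
                changed = True

    # Shrink phase: discard false positives, i.e. features that the
    # rest of the blanket renders independent of the target.
    for x in list(blanket):
        if ci_test(x, target, frozenset(blanket - {x})):
            blanket.remove(x)

    return blanket
```

With a perfect independence oracle and suitable distributional assumptions this heuristic recovers a Markov blanket; the pairwise tests it relies on are exactly what can miss the multi-feature interactions the abstract says its more general boundary is designed to capture.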