A novel Markov boundary based feature subset selection algorithm

We aim to identify the minimal subset of random variables that is relevant for probabilistic classification in data sets with many variables but few instances. A principled solution to this problem is to determine the Markov boundary of the class variable. In this paper, we propose a novel constraint-based Markov boundary discovery algorithm, called MBOR, that aims to improve accuracy while remaining scalable to very high-dimensional data sets and theoretically correct under the so-called faithfulness condition. We report extensive empirical experiments on synthetic data sets scaling up to tens of thousands of variables.
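To make the notion of a Markov boundary concrete, the sketch below reads it off a Bayesian network whose structure is already known. It relies on the standard result that, under faithfulness, the Markov boundary of a node consists of its parents, its children, and the other parents of its children (its spouses). This is not the MBOR algorithm, which must recover the set from data through conditional independence tests; the function name `markov_boundary` and the toy network are hypothetical and serve only as an illustration.

```python
def markov_boundary(parents, target):
    """Return the parents, children, and spouses of `target` in a known DAG.

    `parents` maps every node to the set of its direct parents.
    """
    pa = set(parents.get(target, ()))
    # Children are the nodes that list `target` among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Spouses are the other parents of those children.
    spouses = set()
    for child in children:
        spouses |= set(parents[child])
    spouses.discard(target)
    return pa | children | spouses


# Hypothetical toy network: A -> T, T -> C, S -> C, C -> D, and X isolated.
toy_dag = {
    "A": set(),
    "S": set(),
    "X": set(),
    "T": {"A"},
    "C": {"T", "S"},
    "D": {"C"},
}

print(sorted(markov_boundary(toy_dag, "T")))  # ['A', 'C', 'S']
```

In this toy network, D is a descendant of the class variable T but lies outside its Markov boundary, and the isolated variable X is irrelevant; only A, C, and S would be retained as features.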
