Feature selection for Bayesian network classifiers using the MDL-FS score

When constructing a Bayesian network classifier from data, more or less redundant features included in a dataset may bias the classifier and, as a consequence, result in relatively poor classification accuracy. In this paper, we study the problem of selecting appropriate subsets of features for such classifiers. To this end, we propose a new definition of the concept of redundancy in noisy data. For comparing alternative classifiers, we use the Minimum Description Length for Feature Selection (MDL-FS) function that we introduced previously. Our function differs from the well-known MDL function in that it captures a classifier's conditional log-likelihood. We show that the MDL-FS function serves to identify redundancy at different levels and is able to eliminate redundant features from different types of classifiers. We support our theoretical findings by comparing the feature-selection behaviour of the various functions in a practical setting. Our results indicate that the MDL-FS function is better suited to the task of feature selection than MDL, as it often yields classifiers of equal or better performance with significantly fewer attributes.
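The distinction the abstract draws between MDL and MDL-FS can be illustrated with a minimal sketch: both scores combine a parameter-count penalty with a fit term, but MDL fits the joint log-likelihood log P(x, c) while a conditional score of the MDL-FS kind fits the conditional log-likelihood log P(c | x). The sketch below assumes a naive Bayes classifier over binary features with Laplace smoothing; all function names and the toy dataset are illustrative, not taken from the paper.

```python
import math
from collections import Counter

# Toy binary dataset: rows are (x1, x2, class). x2 duplicates x1,
# so it is redundant given x1.
data = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1), (0, 0, 1), (1, 1, 0)]

def nb_params(rows):
    """Naive Bayes parameters with Laplace smoothing (binary features)."""
    class_counts = Counter(r[-1] for r in rows)
    n = len(rows)
    prior = {c: class_counts[c] / n for c in class_counts}
    n_feats = len(rows[0]) - 1
    # cond[c][i][v] = P(X_i = v | C = c)
    cond = {}
    for c in class_counts:
        cond[c] = []
        for i in range(n_feats):
            cnt = Counter(r[i] for r in rows if r[-1] == c)
            total = class_counts[c] + 2  # Laplace smoothing, binary feature
            cond[c].append({v: (cnt[v] + 1) / total for v in (0, 1)})
    return prior, cond

def log_lik(rows, prior, cond):
    """Joint log-likelihood sum over log P(x, c) -- the MDL fit term."""
    ll = 0.0
    for *x, c in rows:
        ll += math.log(prior[c])
        for i, v in enumerate(x):
            ll += math.log(cond[c][i][v])
    return ll

def cond_log_lik(rows, prior, cond):
    """Conditional log-likelihood sum over log P(c | x) -- the MDL-FS fit term."""
    cll = 0.0
    for *x, c in rows:
        joint = {k: prior[k] * math.prod(cond[k][i][v] for i, v in enumerate(x))
                 for k in prior}
        cll += math.log(joint[c] / sum(joint.values()))
    return cll

def score(rows, conditional=False):
    """Penalty (in bits) minus fit (in bits); lower is better."""
    prior, cond = nb_params(rows)
    n = len(rows)
    n_feats = len(rows[0]) - 1
    k = 1 + 2 * n_feats  # free parameters: 1 prior + 1 per (feature, class)
    penalty = 0.5 * k * math.log2(n)
    fit = cond_log_lik(rows, prior, cond) if conditional else log_lik(rows, prior, cond)
    return penalty - fit / math.log(2)

mdl = score(data, conditional=False)
mdl_fs = score(data, conditional=True)
print(f"MDL-style score    = {mdl:.2f} bits")
print(f"MDL-FS-style score = {mdl_fs:.2f} bits")
```

Because log P(x, c) = log P(c | x) + log P(x) and log P(x) is never positive, the conditional fit term is always at least as large as the joint one; the conditional score therefore spends its "budget" only on how well the network discriminates the class, which is why it can flag features that merely model the feature distribution as redundant.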
