Robust Feature Selection by Mutual Information Distributions

Mutual information is widely used in artificial intelligence, in a descriptive way, to measure the stochastic dependence of discrete random variables. To address questions such as the reliability of its empirical value, one must adopt sample-to-population inferential approaches. This paper deals with the distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean and an analytical approximation of the variance are reported, and asymptotic approximations of the distribution are proposed. The results are applied to feature selection for incremental learning and classification with the naive Bayes classifier. A fast, newly defined method is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets. Finally, a theoretical development is reported that extends the above methods to incomplete samples in an efficient and effective way.
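As background for the abstract's contrast between the empirical value of mutual information and its Bayesian distribution, the following is a minimal sketch of the traditional point estimate: the empirical mutual information of two discrete variables computed from a contingency table of counts. The function name `empirical_mi` and the plain-Python table representation are illustrative choices, not anything defined in the paper; the paper's own contribution is the distribution (mean, variance, asymptotics) around this kind of point estimate.

```python
import math

def empirical_mi(counts):
    """Empirical mutual information (in nats) of two discrete variables,
    given their joint contingency table as a list of rows of counts.

    Computes sum_ij p_ij * log(p_ij / (p_i * p_j)), where p_ij = n_ij / n
    and p_i, p_j are the marginal frequencies. Zero cells contribute 0.
    """
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]            # marginal counts of X
    col_tot = [sum(col) for col in zip(*counts)]      # marginal counts of Y
    mi = 0.0
    for i, row in enumerate(counts):
        for j, nij in enumerate(row):
            if nij > 0:
                # p_ij * log(p_ij / (p_i * p_j)) rewritten in raw counts:
                mi += (nij / n) * math.log(nij * n / (row_tot[i] * col_tot[j]))
    return mi

# Independent variables give MI = 0; a deterministic relation between two
# binary variables gives MI = log 2.
print(empirical_mi([[1, 1], [1, 1]]))   # → 0.0
print(empirical_mi([[2, 0], [0, 2]]))   # → 0.6931... (= log 2)
```

Because this estimate is computed from a finite sample, its value fluctuates from sample to sample, which is exactly the unreliability the paper's inferential (Dirichlet-posterior) treatment is designed to quantify.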
