Multivariate Extension of Matrix-based Rényi's α-order Entropy Functional

The matrix-based Rényi's α-order entropy functional was recently introduced using the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). However, the current theory of this functional only defines the entropy of a single variable and the mutual information between two random variables. In information theory and machine learning, one is also frequently interested in multivariate information quantities, such as the joint entropy of several variables and the various interaction quantities among them. In this paper, we first define the matrix-based Rényi's α-order joint entropy among multiple variables. We then show how this definition eases the estimation of information quantities that measure interactions among multiple variables, such as interaction information and total correlation. We finally present an application to feature selection, showing that our definition provides a simple yet powerful way to estimate, directly from data, a quantity widely acknowledged to be intractable. A real-world example on hyperspectral image (HSI) band selection is also provided.
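To make the estimator concrete, below is a minimal sketch in Python, not the authors' reference implementation, of the single-variable matrix-based entropy and the Hadamard-product joint entropy the paper proposes. It assumes a Gaussian (RBF) kernel with bandwidth sigma and the trace-one normalization of the Gram matrix used in the matrix-based formulation; the function names and the α = 1.01 default are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    # Pairwise squared Euclidean distances between samples (rows of X).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    n = K.shape[0]
    # Normalize so that tr(A) = 1: A_ij = K_ij / (n * sqrt(K_ii * K_jj)).
    d = np.sqrt(np.diag(K))
    return K / (n * np.outer(d, d))

def renyi_entropy(A, alpha=1.01):
    # S_alpha(A) = 1 / (1 - alpha) * log2(sum_i lambda_i(A)^alpha),
    # computed from the eigenspectrum of the normalized Gram matrix A.
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)  # guard tiny negatives
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(mats, alpha=1.01):
    # Multivariate joint entropy: Hadamard (element-wise) product of the
    # per-variable Gram matrices, renormalized to unit trace.
    H = mats[0].copy()
    for A in mats[1:]:
        H = H * A
    return renyi_entropy(H / np.trace(H), alpha)

def total_correlation(mats, alpha=1.01):
    # Total correlation: sum of marginal entropies minus the joint entropy.
    return sum(renyi_entropy(A, alpha) for A in mats) - joint_entropy(mats, alpha)

# Example: mutual information between two dependent variables follows from
# the usual identity I(A; B) = S(A) + S(B) - S(A, B).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = x + 0.25 * rng.normal(size=(200, 1))
A, B = gram_matrix(x), gram_matrix(y)
mi = renyi_entropy(A) + renyi_entropy(B) - joint_entropy([A, B])
```

Because every quantity is obtained from eigenvalues of trace-normalized Gram matrices, no density estimation is required; interaction quantities among many variables reduce to Hadamard products of the same per-variable matrices.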
