Computationally Efficient Approximations for Matrix-based Rényi's Entropy

The recently developed matrix-based Rényi's α-order entropy enables measurement of information in data directly from the eigenspectrum of a symmetric positive semi-definite (PSD) matrix in reproducing kernel Hilbert space, without estimating the underlying data distribution. This intriguing property has led to the wide adoption of this information measure in multiple statistical inference and learning tasks. However, computing this quantity involves the trace of a PSD matrix G raised to the power α (i.e., tr(G^α)), which normally requires O(n^3) operations and severely hampers practical usage when the number of samples n is large. In this work, we present computationally efficient approximations to this entropy functional that reduce its complexity to significantly less than O(n^2). To this end, we first develop randomized approximations to tr(G^α) that transform trace estimation into a matrix-vector multiplication problem, and we extend this strategy to arbitrary values of α (integer or non-integer). We then establish a connection between matrix-based Rényi's entropy and PSD matrix approximation, which enables us to exploit both the clustering and the block low-rank structure of G to further reduce the computational cost. We provide theoretical guarantees on approximation accuracy and illustrate the properties of the different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy.
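As a concrete illustration of the quantities in the abstract, here is a minimal NumPy sketch of (i) the exact matrix-based Rényi entropy S_α(G) = (1/(1-α)) log₂ tr(G^α), computed from the eigenspectrum at O(n^3) cost, and (ii) a Hutchinson-type randomized trace estimator for integer α that replaces the eigendecomposition with matrix-vector products. The function names and the choice of Rademacher probe vectors are illustrative assumptions for exposition, not the paper's released implementation.

```python
import numpy as np

def renyi_entropy_exact(K, alpha=2.0):
    """Exact matrix-based Renyi alpha-order entropy via the eigenspectrum.
    K: symmetric PSD kernel (Gram) matrix. Cost: O(n^3) eigendecomposition."""
    G = K / np.trace(K)            # normalize so that tr(G) = 1
    lam = np.linalg.eigvalsh(G)    # eigenvalues of the normalized matrix
    lam = lam[lam > 0]             # guard against tiny negative round-off
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def renyi_entropy_hutchinson(K, alpha=2, num_probes=50, seed=None):
    """Randomized approximation for integer alpha:
    tr(G^alpha) ~ (1/m) * sum_i v_i^T G^alpha v_i with Rademacher probes v_i,
    so each probe costs only alpha matrix-vector products
    (O(m * alpha * n^2) for dense G, less for sparse or structured G)."""
    rng = np.random.default_rng(seed)
    G = K / np.trace(K)
    n = G.shape[0]
    est = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        w = v.copy()
        for _ in range(alpha):                # compute G^alpha v by repeated matvecs
            w = G @ w
        est += v @ w                          # v^T G^alpha v
    return np.log2(est / num_probes) / (1.0 - alpha)
```

For α = 2, each probe needs only two matrix-vector products, and the standard Hutchinson analysis gives a relative error shrinking roughly as 1/√m in the number of probes m; this trade of an eigendecomposition for a small number of matvecs is the basic mechanism behind the sub-O(n^3) complexity discussed above.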
